
Master Thesis

Authorship Disambiguation and Alias Resolution in Email Data

F.P.E. Maes

Master Thesis DKE 12-16

Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science of Artificial Intelligence at the Department of Knowledge Engineering of Maastricht University

Thesis Committee:

Prof. Dr. Ir. Johannes C. Scholtes
Dr. Ir. Ing. Nico Roos

Maastricht University
Faculty of Humanities and Sciences
Department of Knowledge Engineering
Master Artificial Intelligence

June 26, 2012

Abstract

This thesis deals with authorship disambiguation and alias resolution in email data sets. Given a set of emails, it is investigated how to resolve aliases and disambiguate authors, even if their names are misspelled, if they use completely different email addresses, or if they deliberately use aliases. A review of string metrics, as well as the relevant literature from authorship attribution and link analysis, is given. It is proposed that a combination of techniques from different domains can perform better than each of these techniques individually. A subset of the ENRON email data set is selected and artificial aliases are created in order to test this hypothesis. Four different individual approaches are evaluated: (1) Jaro-Winkler similarity on email addresses, (2) Support Vector Machines on email content, (3) Jaccard similarity of the link network, and (4) Connected Path similarity of the link network. Moreover, two combinations of these techniques are created and evaluated. The results show that a combination of Jaro-Winkler email address similarity, a Support Vector Machine on writing style attributes, and Jaccard similarity of the link network performs best on two different test sets.

Acknowledgments

I would like to thank several people that have guided me in the process of writing this thesis, without whom I would not have been able to finish it. First, I would like to thank Dr. Johannes Scholtes for his supervision of my thesis and internship, and for his time to help me with questions and challenges along the way. Without his continuing enthusiasm and optimism, it would have been a much harder, if not impossible, task to write this thesis. I would also like to thank Dr. Nico Roos for evaluating this thesis as a second assessor. Finally, I would like to thank Eva van den Hurk for her constant support and encouragement during the past year.

Contents

List of Figures

List of Tables

1 Introduction
1.1 Structure of the thesis

2 Literature Review
2.1 String metrics
2.1.1 Techniques
2.2 Authorship Attribution
2.2.1 Instance vs. profile-based
2.2.2 Features
2.2.3 Feature Selection
2.2.4 Techniques
2.3 Link analysis
2.3.1 Techniques
2.4 Combining Approaches
2.5 Evaluation measures
2.6 Conclusion

3 Methods
3.1 ENRON Corpus
3.2 Individual Techniques
3.3 Combinations of Techniques

4 Results

5 Discussion
5.1 Conclusion
5.2 Future Recommendations

6 Bibliography

Appendix

List of Figures

2.1 The structure of a supervised authorship attribution system
2.2 Example of a decision tree
2.3 Linear Separation using Support Vector Machines
2.4 Mapping of feature space for SVM using RBF-kernel
2.5 Example of an Artificial Neural Network
2.6 Connected Triples in a link network
2.7 Connected Path similarity in a link network
3.1 Information extracted from an email in the ENRON data set
3.2 Evaluation of different kernels and training sizes for SVM
3.3 Distribution of email messages per author
3.4 Distribution of total number of words per author
3.5 Network graph of the authors in the ENRON subset
3.6 Structure of the combined approach
4.1 Performance of individual techniques on the mixed test set
4.2 Performance of combined techniques on the mixed test set
4.3 Performance of individual techniques on the hard test set
4.4 Performance of combined techniques on the hard test set
4.5 Best performance of different techniques on the mixed test set
4.6 Best performance of different techniques on the hard test set

List of Tables

2.1 Soundex algorithm rules
2.2 Contingency table for evaluation
3.1 Preprocessing steps applied to the ENRON corpus
3.2 Artificial Aliases in the ENRON data set by type
3.3 Distribution of alias-types in two different test sets
3.4 Feature set for the authorship SVM

Chapter 1

Introduction

Authorship disambiguation and alias resolution are increasingly important concepts in domains such as intelligence and law, where email collections may contain authors that use one or more aliases. Aliases occur when a person uses multiple email addresses, for either intentional or unintentional reasons. For example, people can try to hide their identity by intentionally adopting several different email addresses, something that is common in intelligence data sets such as terrorist networks. On the other hand, the use of different email addresses (home, office, etc.) is becoming common nowadays. Hence, there also exist many unintentional aliases, where only the domain of the email address is different or where a simple misspelling of a name has occurred.

Various approaches have been applied successfully to resolve aliases in email data sets, although each has its own shortcomings. Unintentional aliases can be resolved by employing metrics that indicate how much two email addresses look alike. However, these metrics are easily fooled by persons using completely different email addresses. Another approach focuses on the content of the email by creating a profile of an author's writing style. By comparing the writing styles of different authors and finding those that employ similar writing styles, aliases that are more complex can be detected. This approach has been applied successfully to attribute authorship of disputed literary works. However, it encounters scalability issues when the number of authors grows large or the length of the texts grows small, as is the case in email data sets. A third approach makes use of the fact that, even if an author uses a completely different email address and writing style, the people with whom he corresponds via email might remain stable. The similarity between different authors' email contacts can be determined using link analysis techniques. These techniques achieve reasonable results and sometimes manage to find aliases that other techniques do not find.

The three approaches mentioned above operate on different domains, namely the email address, the content of the email, and the email network. Finding a way to combine these approaches and utilize their combined strengths might enable us to overcome their individual weaknesses. In order to guide the research that has been conducted for this thesis, three research questions have been formulated:

1. Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

2. How can techniques from different domains be combined?

3. Can a combination of techniques from different domains increase performance over individual techniques?

1.1 Structure of the thesis

The structure of the remaining parts of this thesis is as follows:

• Chapter 2 introduces multiple techniques from the fields of Authorship Disambiguation and Alias Resolution. Specifically, string metrics will be explained in section 2.1, authorship attribution systems in section 2.2, and link analysis techniques in section 2.3. Several ways of combining these techniques, as well as different measures for performance evaluation, will be discussed in sections 2.4 and 2.5.

• Chapter 3 outlines the methodology that has been used in order to conduct the experiments. The email corpus that has been used will be described, as well as the preprocessing that has been applied to it. Furthermore, the techniques that have been chosen for evaluation in the experiments will be explained.

• Chapter 4 will present in detail the results of the experiments that have been conducted.

• Finally, Chapter 5 provides a summary and discussion of the obtained results, as well as recommendations for the future.


Chapter 2

Literature Review

In this chapter, a review of relevant literature from the fields of Authorship Disambiguation and Alias Resolution will be given. The first section will explain different string metrics that have successfully been applied to resolve superficial aliases and authorship problems. In the second section, authorship attribution techniques that can be used to resolve the question of authorship in general will be discussed. Moreover, the various design choices that have to be made when creating an authorship attribution system will be explained. The third section will deal with techniques from Link Analysis that use the network in which emails reside to discover aliases. In the fourth section, several ways of combining these techniques will be discussed. The last section will introduce several measures that can be used for evaluating the performance of different techniques.

2.1 String metrics

String similarity metrics are a class of functions that map two strings to a real number, where the higher the value of this number, the greater the similarity between the two strings. Many string metrics use the number of operations that are required to transform one string into another in order to calculate the similarity between the two. Possible operations include insertion, deletion, substitution, and transposition. A different class of string metrics is the phonetic encodings, in which strings are converted into codes according to how they are pronounced. However, these encodings are language-dependent and are not available for many languages.

String metrics do not take into account information regarding the context in which the strings occur. As such, they can be considered rather simple approaches to resolving aliases or settling authorship disputes. However, string metrics can be very useful for detecting misspellings of email aliases resulting from the use of different email domains or naming conventions. For example, they can easily detect the similarity between "johndoe@domain.com" and "jhondoe@domain.com". They are less useful when people deliberately try to hide their identity by using completely different email addresses.

2.1.1 Techniques

In this section, the most commonly used string metrics will be discussed.

The Levenshtein distance [52], often referred to as edit distance, is one of the earliest and most used string distances. It is defined as the minimum number of operations required to transform string s into string t. Each operation has a cost of 1, and the allowed operations are insertion, deletion, and substitution of a character. The Levenshtein distance can be transformed into a similarity metric by using

    similarity(s, t) = 1 / (Levenshtein(s, t) + 1)    (2.1)
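As a concrete illustration, the distance and the similarity transform of equation 2.1 can be sketched in Python as follows (an illustrative sketch, not an optimized implementation):

    def levenshtein(s, t):
        # d[i][j] holds the distance between the first i characters of s
        # and the first j characters of t.
        d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
        for i in range(len(s) + 1):
            d[i][0] = i
        for j in range(len(t) + 1):
            d[0][j] = j
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(s)][len(t)]

    def levenshtein_similarity(s, t):
        # Equation 2.1: map the distance into a (0, 1] similarity.
        return 1.0 / (levenshtein(s, t) + 1)

    print(levenshtein_similarity("johndoe", "jhondoe"))  # 2 substitutions -> 1/3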

The Jaro similarity [32] algorithm uses the number of transpositions T and the number of matching characters m in order to determine the similarity between two strings. Two characters are matching only if they are no farther apart than half the length of the longest string. The number of transpositions is defined as the number of matching characters in different sequence orders, divided by two. The similarity is then calculated as follows:

    Jaro(s, t) = (1/3) · (m/|s| + m/|t| + (m − T)/m)    (2.2)

where |s| denotes the length of string s.

The Jaro-Winkler similarity [67] is an extension of the Jaro algorithm, using the empirical finding by Winkler that fewer errors tend to occur at the start of strings. The similarity is calculated as follows, where p is the length of the prefix that the two strings share:

    Jaro-Winkler(s, t) = Jaro(s, t) + (p/10) · (1 − Jaro(s, t))    (2.3)
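The following Python sketch illustrates both calculations, following the definitions above (illustrative only; the cap of the shared prefix at 4 characters is a common convention and an assumption here, not stated in the text):

    def jaro(s, t):
        if s == t:
            return 1.0
        # Characters match if no farther apart than half the longest string.
        window = max(len(s), len(t)) // 2 - 1
        s_used = [False] * len(s)
        t_used = [False] * len(t)
        m = 0
        for i, c in enumerate(s):
            lo, hi = max(0, i - window), min(len(t), i + window + 1)
            for j in range(lo, hi):
                if not t_used[j] and t[j] == c:
                    s_used[i] = t_used[j] = True
                    m += 1
                    break
        if m == 0:
            return 0.0
        # Transpositions: matching characters in a different order, halved.
        s_seq = [s[i] for i in range(len(s)) if s_used[i]]
        t_seq = [t[j] for j in range(len(t)) if t_used[j]]
        T = sum(a != b for a, b in zip(s_seq, t_seq)) / 2
        return (m / len(s) + m / len(t) + (m - T) / m) / 3  # eq. 2.2

    def jaro_winkler(s, t, max_prefix=4):
        # Equation 2.3: boost by the length p of the shared prefix.
        p = 0
        for a, b in zip(s[:max_prefix], t[:max_prefix]):
            if a != b:
                break
            p += 1
        j = jaro(s, t)
        return j + (p / 10.0) * (1 - j)

    print(jaro_winkler("johndoe", "jhondoe"))  # approx. 0.957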

The Soundex algorithm [53] is the most well known and at the same time the oldest phonetic encoding that is used for string matching. Strings are first converted into phonetic codes, after which strings with similar codes are assumed to be highly similar. In order to convert a string into a Soundex code, the first letter of the string is retained, after which the following letters are converted to numbers according to the set of rules shown in table 2.1. In the resulting code, all zeros are removed, as well as multiple sequential occurrences of the same digit. The code is then cut off or extended with zeros such that it has exactly 3 digits. The first letter of the string together with the 3 digits forms the Soundex code. The Soundex algorithm makes use of the fact that strings that are pronounced in a similar fashion tend to have the same Soundex code. For example, "Maid" and "Made" both result in the Soundex code "M300".

Letter                    Digit

A, E, I, O, U, H, W, Y    0
B, F, P, V                1
C, G, J, K, Q, S, X, Z    2
D, T                      3
L                         4
M, N                      5
R                         6

Table 2.1: The rules for converting letters into digits as they are used in the Soundex algorithm.

The longest common substring [23] method iteratively finds and removes the longest substring of minimum length l that two strings have in common, until no more substrings can be found. The final similarity can then be calculated by taking the length of all the common substrings divided by either the maximum, minimum, or average length of the original strings.
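A minimal sketch of the Soundex encoding described above, with the digit mapping taken directly from Table 2.1 (illustrative only; real implementations differ in edge cases such as non-alphabetic characters):

    SOUNDEX_DIGITS = {**dict.fromkeys("AEIOUHWY", "0"),
                      **dict.fromkeys("BFPV", "1"),
                      **dict.fromkeys("CGJKQSXZ", "2"),
                      **dict.fromkeys("DT", "3"),
                      "L": "4",
                      **dict.fromkeys("MN", "5"),
                      "R": "6"}

    def soundex(name):
        name = name.upper()
        # Convert every letter to its digit (Table 2.1).
        digits = [SOUNDEX_DIGITS[c] for c in name if c in SOUNDEX_DIGITS]
        # Collapse runs of the same digit, then drop the zeros.
        collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
        code = [d for d in collapsed[1:] if d != "0"]
        # Keep the first letter and pad or cut to exactly three digits.
        return name[0] + "".join(code + ["0", "0", "0"])[:3]

    print(soundex("Maid"), soundex("Made"))  # both M300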

A slightly different approach by Monge and Elkan [50] uses a string metric, such as any of the ones discussed above, in a recursive matching scheme in order to determine the similarity between strings. Strings s and t are first broken into sub-strings s = s1 … sK and t = t1 … tK, after which the similarity is defined as

    Monge-Elkan(s, t) = (1/K) · Σ_{i=1}^{K} max_{j=1,…,K} sim′(si, tj)    (2.4)

where sim′(si, tj) denotes the similarity score between sub-strings si and tj, as assigned by a secondary string metric.
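Equation 2.4 translates directly into code. The sketch below assumes whitespace tokenization for the sub-strings and reuses the jaro_winkler function from the earlier sketch as the secondary metric sim′ (both are illustrative assumptions):

    def monge_elkan(s, t, sim=jaro_winkler):
        # Break both strings into sub-strings (here: whitespace tokens).
        s_parts, t_parts = s.split(), t.split()
        # For every s_i take the best-matching t_j, then average (eq. 2.4,
        # with K taken as the number of sub-strings of s).
        total = sum(max(sim(si, tj) for tj in t_parts) for si in s_parts)
        return total / len(s_parts)

    print(monge_elkan("john q doe", "doe john"))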

Christen [11] provides an extensive comparison of these and other string metrics on 4 different test sets of given names, surnames, and full names. He found that it is important to know beforehand the structure of the names to be matched, and whether they have been parsed and standardized. He also found that Jaro-Winkler similarity performed best in a comparison of 27 different string metrics. Furthermore, he reached the following conclusions: (1) phonetic encodings should not be used, since they are outperformed by all other techniques; (2) Jaro and Jaro-Winkler similarity perform well for given names and surnames if the names are parsed into separate fields; (3) longest common substring is useful when the names might contain swapped words; (4) the Winkler modification can be used with every technique to improve the quality of the matching; (5) the selection of a proper threshold is the biggest problem for most matching techniques; and (6) the fastest techniques are the ones that have a time complexity linear in the length of the strings. Cohen and Fienberg [13] evaluated several string metrics on 13 different test sets, concluding that the Monge-Elkan distance achieved the best performance of all the string metrics. The Jaro-Winkler metric proved to be a fast heuristic scheme, achieving almost the same performance as Monge-Elkan whilst being considerably less complex in nature.

2.2 Authorship Attribution

Other approaches to resolving aliases and disambiguating authors can be found in the field of Authorship Attribution. The authorship attribution task can be described as follows: given a set of candidate authors and a set of documents written by each of these authors, try to determine which of these candidates wrote a given anonymous document. In the traditional authorship attribution problem, the number of candidate authors is typically small (2-10), the number of documents per author is large, and the length of these documents is large. Moreover, it is assumed that the author of the anonymous document is actually in the candidate set, i.e., there is a closed candidate set. A good example of a traditional authorship attribution problem is to determine the author of a disputed literary work, such as some of Shakespeare's plays.

Authorship attribution techniques can be very useful in resolving aliases and determining authorship. An authorship attribution system can be trained to distinguish between different authors in an email data set. For a given author, it is possible to determine if an alias is being used by letting the authorship attribution system predict which author's writing style most closely resembles the given author's writing style.

In the remainder of this section, the different techniques that have been employed in authorship attribution problems will be explained, as well as important design choices that have to be made. These include the choice of a feature set, a feature selection technique, the actual attribution technique, and whether to treat the problem from an instance-based perspective or a profile-based perspective.

2.2.1 Instance vs. profile-based

A general distinction can be made between techniques that treat each email individually (instance-based) and techniques that accumulate all the emails per author (profile-based). The first approach treats each email from a given author as a single training instance, and thereby retains differences in texts from the same author. The second approach accumulates all the texts from a given author into one big training file, creating a profile of one author and disregarding differences in each individual text. The choice is mostly philosophical: whether to model the general style of each author or the individual style of each document [63].

2.2.2 Features

An important design choice in authorship attribution systems is the choice of feature set. Features are the specific writing-style attributes, predefined by the researcher, that are extracted from a piece of text in order to capture stylistic information that is characteristic of a particular author. Since the choice of feature set can affect the performance of the authorship attribution in various ways, it is important to consider which features to include or exclude. In general, a distinction can be made between lexical, syntactic, structural, semantic, and content-specific features. These features will be discussed in that order in the following sections.

Lexical features

Lexical features are the features that are derived at the character and word level of the text, and are the most commonly used features. These features are considered language-independent, since they do not need any prior language-dependent processing before they can be applied to a text. Character frequencies, word-length distributions, frequencies of digits and non-alphanumeric characters, and the total number of words are all examples of lexical features that provide useful information.

An easy-to-use lexical feature that is also computationally simple is character n-grams. For example, the character 4-grams that can be extracted from the phrase "The dog" are "the ", "he d", "e do" and " dog". Character n-grams can capture various writing style markers from a text, such as capitalization or UK/US variants of certain words. Even if a word is incorrectly spelled, most of the n-grams extracted from the correct and incorrect variants will match, although a misspelling can also be considered a style marker for a particular author. An advantage of character n-grams is that they do not need tokenization before they can be applied to a text, which is very useful in Asian or Arabic languages, where tokenization is difficult.
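Extracting character n-gram frequencies takes only a few lines; the sketch below lower-cases the text first, matching the example above (illustrative):

    from collections import Counter

    def char_ngrams(text, n=4):
        # Slide a window of n characters over the lower-cased text.
        text = text.lower()
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    print(char_ngrams("The dog"))
    # Counter({'the ': 1, 'he d': 1, 'e do': 1, ' dog': 1})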

More complicated lexical features require the detection of word and sentence boundaries in the text. By using common Natural Language Processing (NLP) tools, a text can be broken up into its constituent parts during a process called tokenization. A token can be a single word, phrase, symbol, or other meaningful element. After counting the occurrences of each distinct token, the n most frequently occurring tokens can be used as features, since the tokens that occur most frequently are considered to contain the most useful discriminatory information.

Another set of features that can be derived from tokenization is the frequency of different word lengths. These features provide information on how often a particular author uses words of different lengths. Vocabulary richness measures are a subset of lexical features that are derived from word frequencies. Hapax legomena is the number of words that occur once in a text, whereas hapax dislegomena is the number of words that occur twice. The number of hapax legomena and hapax dislegomena gives an indication of how rich the vocabulary used by a certain author is. Authors that have a larger vocabulary will have a higher count of once- or twice-occurring words than authors with a small vocabulary. The type-token ratio V/N is the number of unique tokens V divided by the total number of tokens in a text N, and gives another indication of vocabulary richness. Numerous vocabulary richness measures have been created based on word frequencies, of which the most well known are:

• Yule's K [69]:

    K = 10^4 · [ −1/N + Σ_i V(i, N) · (i/N)^2 ]    (2.5)

  where V(i, N) is the number of words occurring i times in the text of length N.

• Sichel's S [59]:

    S = V(2, N) / V(N)    (2.6)

  where V(N) is the vocabulary size and V(2, N) the number of twice-occurring words.

• Brunet's W [7]:

    W = N^(V(N)^−a)    (2.7)

  where N is the number of words and a is usually set to 0.172.

• Honoré's R [26]:

    R = 100 · log N / (1 − V(1, N)/V(N))    (2.8)

  where V(1, N) is the number of once-occurring words (hapax legomena).

Furthermore, smileys [64], abbreviations [62], slang words [36], and even spelling errors can be used as stylometric features to distinguish between authors. For example, Koppel [37] tries to simulate the way a human expert discriminates between authors by using stylistic idiosyncrasies, such as misspellings, to fingerprint a particular author. A disadvantage of that approach is that discriminating between authors is hard when there are few idiosyncrasies present in the text.

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing the syntactic constructions that an author uses. The underlying idea is that every author unconsciously uses more or less the same syntactical patterns in each text. Syntactic features rely on the accuracy and availability of NLP tools in different languages in order to tokenize and apply part-of-speech (POS) tags.

A common method is to analyze the short sequences of POS-tags that occur most frequently throughout an author's work. For example, the sentence "The man walked in the park" can be represented as

| DETERMINER | NOUN | VERB PAST TENSE | PREPOSITION | DETERMINER | NOUN |

where token starts and ends are delimited by a |. The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS-tags can be found in Solorio and Pillay [62], who use POS-tag uni-grams, bi-grams, and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactical parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64], or the distance covered by the dependency links in the parse tree.

Function words were introduced by Mosteller and Wallace [51] to distinguish between different authors. Function words are words that have almost no lexical meaning, but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change in short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong", the function words are "The", "of", and "the". They signify the grammatical structure of the sentence without containing any meaning; they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features.

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structures. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks, and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al. [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks. However, they experienced difficulties with extracting these features. Another example of content-specific structural features is the use of HTML tags by de Vel et al. [17]. They found that some email programs used HTML formatting for their emails, and included the frequency of different HTML tags in their feature set.

Semantic features

The most complex set of features are the semantic features. These features require sophisticated NLP techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al. [64], who use UK/US-spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives, and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity types, such as date, location, money, number, ordinal, organization, percent, person, and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word. Function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators, and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural, and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features. In such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection has to deal with an additional problem, in the sense that the final feature set might be over-fitting the training data. Therefore, the use of feature selection methods is ambiguous and can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author, and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].

Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

    Entropy = − Σ_{x∈X} P(x) · log P(x)    (2.9)

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.

Bi-Normal Separation is a method that measures the difference in z-scores of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "The metric measures the horizontal separation between two standard Normal curves, where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".
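Written out, this is BNS(f) = |F⁻¹(tpr) − F⁻¹(fpr)|, with F⁻¹ the inverse cumulative distribution function of the standard Normal. A sketch using SciPy (illustrative; the clipping bounds are a practical assumption, since F⁻¹ is unbounded at 0 and 1):

    from scipy.stats import norm

    def bns(tp, fp, pos, neg):
        # True/false positive rates of the feature, clipped away from 0 and 1
        # because the inverse Normal CDF diverges at the extremes.
        tpr = min(max(tp / pos, 0.0005), 0.9995)
        fpr = min(max(fp / neg, 0.0005), 0.9995)
        # Bi-Normal Separation: horizontal separation of the two z-scores.
        return abs(norm.ppf(tpr) - norm.ppf(fpr))

    # A feature occurring in 80 of 100 positive and 10 of 100 negative documents:
    print(bns(80, 10, 100, 100))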

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the feature set whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data. Hence, they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set, he still manages to assert with confidence that "The Royal Book of Oz" has been written by Ruth Plumly Thompson. Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.
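With scikit-learn, the variance-retaining variant used by Tearle et al. could be sketched as follows (hypothetical data; passing a fraction to n_components keeps as many components as needed to explain that share of the variance):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(200, 50)   # 200 documents, 50 stylometric features

    # Keep enough principal components to explain 95% of the variance.
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape, pca.explained_variance_ratio_.sum())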

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations V(s) and V(t), the cosine similarity is defined as

    Cosine(s, t) = (V(s) · V(t)) / (|V(s)| · |V(t)|)    (2.10)

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10,000. In a later research by Koppel et al. [40], they report that 46% of 1000 blog extracts are classified correctly using only the cosine similarity.
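A direct NumPy sketch of equation 2.10 (illustrative):

    import numpy as np

    def cosine(v_s, v_t):
        # Equation 2.10: dot product normalized by the vector lengths.
        return np.dot(v_s, v_t) / (np.linalg.norm(v_s) * np.linalg.norm(v_t))

    print(cosine(np.array([1.0, 2.0, 0.0]), np.array([2.0, 1.0, 1.0])))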

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style, using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural, and domain-specific features from the texts of candidate authors and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1 [0, 0.25], A2 [0.25, 0.50], A3 [0.50, 0.75], and A4 [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.

Figure 2.1: The structure of a supervised authorship attribution system.

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature, and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe, and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author. His most important finding was that word-length distributions tend to remain the same across different works for a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejects the claim of some literary scholars that Shakespeare and Bacon are the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers: a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being one of the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naïve Bayes probabilistic model to the frequency of these function words, and found that all documents were written by Madison. The Naïve Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x1, …, xn and a set of authors A, where Ai denotes an individual author, the probability that a given author Ai is the real author of the original document can be expressed by

    P(Ai | x1, …, xn) = P(x1, …, xn | Ai) · P(Ai)    (2.11)

The real author is then calculated using

    A* = arg max_{Ai∈A} P(Ai | x1, …, xn)    (2.12)
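Assuming, as is usual for Naïve Bayes, that the features are conditionally independent given the author, the decision rule of equations 2.11 and 2.12 can be sketched as follows (illustrative; the priors, likelihoods, and the smoothing constant are invented for the example):

    import math

    def naive_bayes_author(doc_features, authors):
        # authors: {name: (prior, {feature: P(feature | author)})}
        best, best_score = None, float("-inf")
        for name, (prior, likelihoods) in authors.items():
            # log P(A_i) + sum of log P(x_j | A_i): eq. 2.11 in log-space,
            # with a tiny floor standing in for proper smoothing.
            score = math.log(prior) + sum(math.log(likelihoods.get(f, 1e-9))
                                          for f in doc_features)
            if score > best_score:
                best, best_score = name, score
        return best  # the arg max of eq. 2.12

    authors = {"Hamilton": (0.5, {"upon": 0.02, "while": 0.001}),
               "Madison":  (0.5, {"upon": 0.001, "whilst": 0.02})}
    print(naive_bayes_author(["upon", "upon"], authors))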

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure, called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text".

Burrows computes the z-scores of the 30 most frequent words from a text against a reference corpus (a large contemporary corpus of which the mean and standard deviation of these 30 words are computed), and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
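A sketch of the Delta computation itself (illustrative; it assumes the relative frequencies of the chosen word-variables and the reference-corpus statistics have already been extracted):

    import numpy as np

    def delta(known_freqs, unknown_freqs, corpus_mean, corpus_std):
        # z-scores of the word frequencies against the reference corpus
        z_known = (known_freqs - corpus_mean) / corpus_std
        z_unknown = (unknown_freqs - corpus_mean) / corpus_std
        # Delta: mean absolute difference between the two z-score profiles
        return np.mean(np.abs(z_known - z_unknown))

    # Toy example with 3 word-variables; the candidate with the lowest Delta wins.
    mean, std = np.array([0.02, 0.01, 0.005]), np.array([0.01, 0.004, 0.002])
    candidate = np.array([0.025, 0.012, 0.004])
    unknown = np.array([0.022, 0.011, 0.0045])
    print(delta(candidate, unknown, mean, std))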

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e., the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2. By testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task, according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor), and the C4.5 decision tree classifier have been tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest potential margin of error. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in very fast classification. An example of how SVM maximizes the margin of error of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin of error. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and can separate the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel, and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].

Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations of binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-one using voting, and error-correcting codes [18]. Numerous authorship attribution studies exist that utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71], and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.
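With scikit-learn, training an RBF-kernel SVM on author-labeled feature vectors might be sketched as follows (random data stands in for real stylometric features; scikit-learn's SVC handles multi-class input with built-in one-vs-one voting):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.random((120, 20))    # 120 emails, 20 style features each
    y_train = rng.integers(0, 4, 120)  # 4 candidate authors

    # RBF kernel; C trades off margin width against training errors.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X_train, y_train)
    print(clf.predict(rng.random((1, 20))))  # predicted author of a new email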

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain, and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes, and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training and validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author. Which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e., to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.
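The iteration can be sketched as follows with scikit-learn's LinearSVC (illustrative; random data stands in for the document vectors, and the iteration count and k are arbitrary choices):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    def unmasking_curve(X, y, iterations=8, k=3):
        X = X.copy()
        accuracies = []
        for _ in range(iterations):
            accuracies.append(cross_val_score(LinearSVC(dual=False), X, y, cv=5).mean())
            svm = LinearSVC(dual=False).fit(X, y)
            w = svm.coef_[0]
            # Remove (zero out) the k strongest positive and k strongest
            # negative weighted features, then retrain in the next round.
            strongest = np.concatenate([np.argsort(w)[-k:], np.argsort(w)[:k]])
            X[:, strongest] = 0.0
        return accuracies  # a steep drop suggests the same author

    rng = np.random.default_rng(1)
    X = rng.random((80, 60))
    y = rng.integers(0, 2, 80)
    print(unmasking_curve(X, y))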

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let vi, vj ∈ V; then an edge e(vi, vj) ∈ W if a message has been sent from author vi to author vj. If there exists an edge e(vi, vj) ∈ W, then vi and vj are considered to be neighbors. The neighborhood N(vi) is the set of all neighbors of the vertex vi. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.
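Such a network can be sketched with a plain dictionary of neighbor sets built from (sender, recipient) pairs (illustrative; the author names are invented):

    from collections import defaultdict

    def build_network(messages):
        # messages: iterable of (sender, recipient) pairs; undirected edges.
        N = defaultdict(set)
        for vi, vj in messages:
            N[vi].add(vj)
            N[vj].add(vi)
        return N

    N = build_network([("alice", "bob"), ("alice", "carol"), ("dave", "bob")])
    print(N["alice"])  # the neighborhood N(alice): {'bob', 'carol'}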

2.3.1 Techniques

Co-citation, or bibliographic coupling, is when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together in their papers, and can be expressed as follows:

    Co-citation(vi, vj) = |N(vi) ∩ N(vj)|    (2.13)

In Graph Theory, this is often known as the shared neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices vi, vj, vk and two edges w(i,k), w(j,k), such that vi and vj are connected via the third vertex vk. Figure 2.6 provides an example of a trivial network where email addresses vi and vj are considered to be aliases, because they are connected by three different Connected Triples (red, yellow, and blue).

Figure 2.6: An example of two vertices vi and vj being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of vi and vj is defined as follows:

    Jaccard(vi, vj) = |N(vi) ∩ N(vj)| / |N(vi) ∪ N(vj)|    (2.14)

where N(vi) again designates the set of neighbors of vi. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random in predicting where new links will form. However, it sometimes performs on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively. An individual neighbor is denoted as Ii(v) or Oi(v). The similarity between vertices vi and vj can be calculated using the following recursive equation:

    SimRank(vi, vj) = C / (|I(vi)| · |I(vj)|) · Σ_{x=1}^{|I(vi)|} Σ_{y=1}^{|I(vj)|} SimRank(Ix(vi), Iy(vj))    (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting SimRank(vi, vj) = 1 if vi = vj and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
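A fixed-point iteration sketch of equation 2.15 on a toy directed graph (illustrative; C = 0.8 and the iteration count are arbitrary choices):

    from itertools import product

    def simrank(in_neighbors, C=0.8, iterations=10):
        nodes = list(in_neighbors)
        # Initialize: 1 on the diagonal, 0 elsewhere.
        sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
        for _ in range(iterations):
            new = {}
            for a, b in product(nodes, nodes):
                if a == b:
                    new[(a, b)] = 1.0
                elif in_neighbors[a] and in_neighbors[b]:
                    # Average similarity of all in-neighbor pairs (eq. 2.15)
                    total = sum(sim[(x, y)] for x in in_neighbors[a]
                                            for y in in_neighbors[b])
                    new[(a, b)] = C * total / (len(in_neighbors[a]) * len(in_neighbors[b]))
                else:
                    new[(a, b)] = 0.0
            sim = new
        return sim

    # Toy graph: both 'a1' and 'a2' receive mail from 'x' and 'y'.
    in_neighbors = {"a1": {"x", "y"}, "a2": {"x", "y"}, "x": set(), "y": set()}
    print(simrank(in_neighbors)[("a1", "a2")])  # 0.4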

PageSim [42] is another extension of the co-citation algorithm that assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices vi and vj is calculated using

    ConnectedPath(vi, vj) = Σ_{p ∈ PATH(vi,vj,r)} U(p) / length(p)    (2.16)

where PATH(vi, vj, r) is the collection of all paths between vi and vj of length r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

    U(p) = Σ_{vx ∈ path(vi,vj), vx ∉ {vi,vj}} UQ(vx)    (2.17)

UQ(vx) denotes the uniqueness of a single vertex vx in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

    UQ(vx) = (|w(x,x−1)| + |w(x,x+1)|) / Σ_{∀vg∈V} |w(x,g)|    (2.18)

where w(x,g) denotes an edge between vx ∈ path(vi, vj) and any other vertex vg ∈ V, and w(x,x+1) and w(x,x−1) denote edges from vx to its adjacent vertices in the path. Figure 2.7 provides an example of vertices vi and vj having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim, and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.

Figure 2.7: An example of three different paths between the vertices vi and vj. The most direct path (px) is the most informative path. Image courtesy of Boongoen et al. [6].
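A possible sketch of equations 2.16-2.18 on the undirected network from the earlier example, treating the graph as unweighted (so every |w| equals 1) and enumerating simple paths with a depth-limited search; this reading of the path enumeration is an assumption, not taken from the source:

    def connected_path(N, vi, vj, r=3):
        # Enumerate all simple paths from vi to vj of length at most r edges.
        paths, stack = [], [[vi]]
        while stack:
            path = stack.pop()
            for nxt in N[path[-1]]:
                if nxt == vj:
                    paths.append(path + [nxt])
                elif nxt not in path and len(path) < r:
                    stack.append(path + [nxt])
        score = 0.0
        for p in paths:
            inner = p[1:-1]  # vertices on the path other than vi and vj
            # UQ(vx): the 2 path edges of vx divided by its total degree (eq. 2.18)
            U = sum(2.0 / len(N[vx]) for vx in inner)
            if inner:
                score += U / (len(p) - 1)  # eq. 2.16, length in edges
        return score

    print(connected_path(N, "alice", "dave"))  # one path via 'bob': 0.5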

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = α·si + β·sj + γ·sk, where si, sj, and sk denote the scores assigned by techniques i, j, and k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of using four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct for 88.2% of such cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.
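Since no published implementation exists, the following Python sketch is purely illustrative of the sifting idea; the threshold values and the two score functions are hypothetical assumptions.

def sift(pairs, cheap_score, expensive_score, accept=0.95, reject=0.05):
    """Settle clear-cut cases with the cheap metric; only ambiguous pairs
    reach the expensive technique (illustrative thresholds)."""
    aliases = []
    for a, b in pairs:
        s = cheap_score(a, b)
        if s >= accept:                       # obvious alias: accept immediately
            aliases.append((a, b))
        elif s > reject:                      # ambiguous: defer to the costly method
            if expensive_score(a, b) >= accept:
                aliases.append((a, b))
        # s <= reject: obvious non-alias, discarded without further work
    return aliases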


                 correct alias            false alias
retrieved        true positives (tp)      false positives (fp)
not retrieved    false negatives (fn)     true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table such as the one shown in Table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = correct classifications / total number of classifications = (tp + tn) / (tp + fp + fn + tn)    (2.19)

Although it looks like a good measure of performance, it is not hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all examples as negative. In most cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. It can be defined as:

P = | retrieved aliases ∩ correct aliases | / | retrieved aliases | = tp / (tp + fp)    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. It can be defined as:

R = | retrieved aliases ∩ correct aliases | / | total correct aliases | = tp / (tp + fn)    (2.21)

These two measures are not as dependent on the class distribution as the accuracy measure. Therefore, they are a more sensible choice in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user that wants to automate the complete process and rely greatly on the classification given by the system will favor precision over recall. Since the


preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as:

F = 1 / (α · (1/P) + (1 − α) · (1/R))    (2.22)

Often, the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can then simply be written as:

F1 = (2 · precision · recall) / (precision + recall)    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum of precision and recall than the arithmetic mean when the two values differ greatly [46].
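These measures translate directly into code. The short Python sketch below is illustrative (not from the thesis); it also checks that choosing α = 0.5 in the weighted harmonic mean of equation (2.22) reproduces the F1-measure of equation (2.23).

def precision(tp, fp):
    """Proportion of retrieved aliases that are actually correct (eq. 2.20)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Proportion of correct aliases that have been retrieved (eq. 2.21)."""
    return tp / (tp + fn)

def f_measure(p, r, alpha=0.5):
    """Weighted harmonic mean of precision and recall (eq. 2.22)."""
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

# Example: 14 true positives, 2 false positives and 6 false negatives give
# P = 0.875, R = 0.7 and F1 = 2*P*R / (P + R) ≈ 0.78.
p, r = precision(14, 2), recall(14, 6)
assert abs(f_measure(p, r) - 2 * p * r / (p + r)) < 1e-12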

Averaging the precision and recall scores of different test runs can be done in two ways. One is micro-averaging, where a contingency table for all the problems together is constructed and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46] and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.
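The difference between the two averaging schemes can be made concrete with a small Python sketch (an illustrative assumption, not thesis code), where each problem contributes a (tp, fp, fn) triple; guards against zero denominators are omitted for brevity.

def micro_average(tables):
    """Pool all contingency counts first, then compute precision and recall once."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    return tp / (tp + fp), tp / (tp + fn)

def macro_average(tables):
    """Compute precision and recall per problem, then take the arithmetic mean."""
    precisions = [tp / (tp + fp) for tp, fp, fn in tables]
    recalls = [tp / (tp + fn) for tp, fp, fn in tables]
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)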

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan


distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of overfitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in Section 3.2, whereas the different combinations of techniques are dealt with in Section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in Figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


Subject: SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08
Thank you very much. We will give it a try.

Extracted information:
Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well-known version of the data set was made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com" and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.



2. Forward and reply parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained 10 or fewer words were removed, since they contained too little useful information.

5. Authors that had written a total number of 100 or fewer words were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number as well as the cumulative percentage of records that have been removed per step.

Step    Records affected    Percentage removed (cum.)
1       17052                6.70
3       13681               12.00
4       26223               22.50
5        4001               24.00
6       25990               34.00
7        3700               35.80
8       52163               56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus (step 2 truncated messages rather than removing whole records; steps 7 and 8 are the author-level filters described below).
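As an illustration of the duplicate criterion in step 6, the following minimal Python sketch keeps the first copy of every (sender, receiver, body, send date, subject) combination; the dictionary-based message representation is an assumption for illustration.

def deduplicate(messages):
    """Remove duplicates as in step 6: two messages are duplicates when
    sender, receiver, body, send date and subject all match."""
    seen = set()
    unique = []
    for m in messages:
        key = (m["sender"], m["receiver"], m["body"], m["send_date"], m["subject"])
        if key not in seen:          # keep only the first copy of each message
            seen.add(key)
            unique.append(m)
    return unique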

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total of 80 or fewer emails were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5000 words per author. In the final data set, the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44912 messages by 246 different senders. For each message, the sender, receiver, subject, body and send date have been stored.

[Figure 3.2: Averages of 10 × 10-fold cross-validation accuracy for different training set sizes and kernels (Linear, RBF) for the authorship SVM.]

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10000 and 100000.

[Figure 3.3: The distribution of email messages per author.]

[Figure 3.4: The distribution of the total number of words per author.]

Next to these statistical measures, a network graph has been created of all the authors in the final data set. Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it corresponds to the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

[Figure 3.5: A network graph of the authors in the subset of the ENRON data set.]

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of at least 200 emails were selected from the data set and their emails were split up over several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:





• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. johndoe@enron.comA & johndoe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Type of Alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No Alias                             193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

Test set               Mixed    Hard
High Jaro-Winkler      6        2
Low Jaro-Winkler       8        16
No alias               6        2

Table 3.3: Distribution of alias types in the two different test sets.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
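A minimal Python sketch of this thresholding step is given below. It assumes the third-party jellyfish library, whose recent versions expose jaro_winkler_similarity(); any Jaro-Winkler implementation could be substituted, and the example threshold is illustrative.

import itertools
import jellyfish

def jw_alias_pairs(addresses, threshold=0.94):
    """Return all address pairs whose Jaro-Winkler similarity reaches the threshold."""
    return [
        (a, b)
        for a, b in itertools.combinations(addresses, 2)
        if jellyfish.jaro_winkler_similarity(a, b) >= threshold
    ]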

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores fall in the range [0, 1], the score for a particular author-author pair was normalized as follows:

ConnectedPath_norm(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in Section 3.1 no longer occur in the neighborhood of their correspondents, and hence do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
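The Jaccard score over direct neighbors is straightforward to compute; the following Python sketch (illustrative, not thesis code) assumes the neighbor sets are sets of correspondent identifiers taken from the link network.

def jaccard(neighbours_a, neighbours_b):
    """Jaccard similarity of two sets of direct correspondents."""
    union = neighbours_a | neighbours_b
    if not union:
        return 0.0
    return len(neighbours_a & neighbours_b) / len(union)

# e.g. jaccard({"kim", "mark", "jeff"}, {"kim", "jeff", "sara"}) == 0.5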

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences between individual email messages. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Feiguina and Hirst [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function (RBF) kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.


Features     Description

Lexical
1            Total number of characters (C)
2            Total number of alphabetic characters / C
3            Total number of upper-case characters / C
4            Total number of digit characters / C
5            Total number of white-space characters / C
6            Total number of tab spaces / C
7-32         Frequency of letters A-Z
33-53        Frequency of special characters ~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |
54           Total number of words (M)
55           Total number of short words (less than four characters) / M
56           Total number of characters in words / C
57           Average word length
58           Average sentence length (in characters)
59           Average sentence length (in words)
60           Total different words / M
61           Hapax legomena: frequency of once-occurring words
62           Hapax dislegomena: frequency of twice-occurring words
63-82        Word length frequency distribution / M
83-333       TF-IDF of 250 most frequent 3-grams

Syntactic
334-341      Frequency of punctuation marks , . ? ! : ; ' "
342-491      Frequency of function words

Structural
492          Total number of sentences

Table 3.4: Feature set for the authorship SVM.


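The following Python sketch illustrates the grid search described above, using scikit-learn as a stand-in for the C#-based stack actually used in this thesis; the synthetic data merely stands in for the stylometric feature vectors of Table 3.4.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

# synthetic stand-in for the stylometric feature vectors and author labels
X_train, y_train = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {
    "C":     [2.0 ** e for e in range(-5, 16, 2)],   # 2^-5, 2^-3, ..., 2^15
    "gamma": [2.0 ** e for e in range(-15, 4, 2)],   # 2^-15, 2^-13, ..., 2^3
}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5)  # 5 x 5-fold cross-validation
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="accuracy", cv=cv)
search.fit(X_train, y_train)
print(search.best_params_)  # highest-scoring (C, gamma) combination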

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM classifies whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of emails from other authors is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35], a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems using different kernels and parameters.
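A condensed Python sketch of this one-versus-all setup, including the class balancing described above, is given below; the extract_features() helper and the email representation are illustrative assumptions, and scikit-learn stands in for the C# software actually used.

import random
from sklearn.svm import SVC

def train_one_vs_all(emails_by_author, extract_features):
    """Train one binary RBF-SVM per author: the author's emails as positives
    versus an equally large random sample of emails from all other authors."""
    models = {}
    for author, own in emails_by_author.items():
        others = [e for a, es in emails_by_author.items() if a != author for e in es]
        negatives = random.sample(others, len(own))    # balance the classes
        X = [extract_features(e) for e in own + negatives]
        y = [1] * len(own) + [0] * len(negatives)
        model = SVC(kernel="rbf", probability=True)    # probabilities for ranking authors
        model.fit(X, y)
        models[author] = model
    return models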

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the


results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

[Figure 3.6: The structure of the combined approach.]

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used for testing are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
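The following toy Python sketch illustrates the shape of the voting SVM's input and output; the score vectors, labels and candidate values are hypothetical and serve only to illustrate the data layout, with scikit-learn again standing in for the software actually used.

from sklearn.svm import SVC

# one row per candidate pair: [Jaro-Winkler, Jaccard (or Connected Path), authorship-SVM score]
X_train = [
    [0.97, 0.41, 0.88],   # labeled as a real alias
    [0.95, 0.02, 0.15],   # high address similarity alone: labeled negative
    [0.41, 0.55, 0.91],   # found via content and links: labeled positive
    [0.30, 0.03, 0.22],   # labeled negative
]
y_train = [1, 0, 1, 0]

voter = SVC(kernel="rbf")
voter.fit(X_train, y_train)

candidate = [[0.61, 0.48, 0.83]]
print(voter.predict(candidate))            # 1 = alias, 0 = no alias
print(voter.decision_function(candidate))  # signed distance usable as a ranking score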

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments are discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set are given. Second, the results of the individual and combined techniques on the hard test set are given. Finally, an overview of the best results achieved by each individual and combined technique is given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 at decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 at a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores of all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the mixed test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.]

[Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the mixed test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]


[Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the hard test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.]

[Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the hard test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]


[Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.]


[Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.]


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond direct neighbors, since it takes more complicated link connections into account. Second, because of time constraints the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2; it is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative,


thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes direct neighbors into account, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results on both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.

47

How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is scarce, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize it can be concluded that a combination of techniques thatoperate on different domains is more effective in disambiguating authors andresolving aliases than each of these techniques individually

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results of the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.


[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining – Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.


[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics – COLING '04, pages 611–es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis (2005), volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.


[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management – CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING '08, pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.


[49] Miller (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Evangelos, S., Jiawei, H., and Usama, F., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.


[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science, 57(3):378–393.


Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



List of Figures

2.1 The structure of a supervised authorship attribution system
2.2 Example of a decision tree
2.3 Linear separation using Support Vector Machines
2.4 Mapping of feature space for SVM using RBF-kernel
2.5 Example of an Artificial Neural Network
2.6 Connected Triples in a link network
2.7 Connected Path similarity in a link network

3.1 Information extracted from an email in the ENRON data set
3.2 Evaluation of different kernels and training sizes for SVM
3.3 Distribution of email messages per author
3.4 Distribution of total number of words per author
3.5 Network graph of the authors in the ENRON subset
3.6 Structure of the combined approach

4.1 Performance of individual techniques on the mixed test set
4.2 Performance of combined techniques on the mixed test set
4.3 Performance of individual techniques on the hard test set
4.4 Performance of combined techniques on the hard test set
4.5 Best performance of different techniques on the mixed test set
4.6 Best performance of different techniques on the hard test set


List of Tables

2.1 Soundex algorithm rules
2.2 Contingency table for evaluation

3.1 Preprocessing steps applied to the ENRON corpus
3.2 Artificial aliases in the ENRON data set by type
3.3 Distribution of alias types in two different test sets
3.4 Feature set for the authorship SVM


Chapter 1

Introduction

Authorship disambiguation and alias resolution are increasingly important concepts in domains such as intelligence and law, where email collections may contain authors that use one or more aliases. Aliases occur when a person uses multiple email addresses, for either intentional or unintentional reasons. For example, people can try to hide their identity by intentionally adopting several different email addresses, something that is common in intelligence data sets such as terrorist networks. On the other hand, the use of different email addresses (home, office, etc.) is becoming common nowadays. Hence, there also exist many unintentional aliases, where only the domain of the email address is different or where a simple misspelling of a name has occurred.

Various approaches have been applied successfully to resolve aliases in email data sets, although each has its own shortcomings. Unintentional aliases can be resolved by employing metrics that indicate how much two email addresses look alike. However, these metrics are easily fooled by persons using completely different email addresses. Another approach focuses on the content of the email by creating a profile of an author's writing style. By comparing the writing styles of different authors and finding those that employ similar writing styles, aliases that are more complex can be detected. This approach has been applied successfully to attribute authorship of disputed literary works. However, it encounters scalability issues when the number of authors grows large or the length of the texts grows small, as is the case in email data sets. A third approach makes use of the fact that even if an author uses a completely different email address and writing style, the people with whom he corresponds via email might remain stable. The similarity between different authors' email contacts can be determined using link analysis techniques. These techniques achieve reasonable results and sometimes manage to find aliases that other techniques do not find.


The three approaches mentioned above operate on different domains, namely the email address, the content of the email, and the email network. Finding a way to combine these approaches and utilize their combined strengths might enable us to overcome their individual weaknesses. In order to guide the research that has been conducted for this thesis, three research questions have been formulated:

1. Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

2. How can techniques from different domains be combined?

3. Can a combination of techniques from different domains increase performance over individual techniques?

1.1 Structure of the thesis

The structure of the remaining parts of this thesis is as follows:

• Chapter 2 introduces multiple techniques from the fields of Authorship Disambiguation and Alias Resolution. Specifically, string metrics will be explained in section 2.1, authorship attribution systems in section 2.2, and link analysis techniques in section 2.3. Several ways of combining these techniques, as well as different measures for performance evaluation, will be discussed in sections 2.4 and 2.5.

• Chapter 3 outlines the methodology that has been used in order to conduct the experiments. The email corpus that has been used will be described, as well as the preprocessing that has been applied to it. Furthermore, the techniques that have been chosen for evaluation in the experiments will be explained.

• Chapter 4 will present in detail the results of the experiments that have been conducted.

• Finally, Chapter 5 provides a summary and discussion of the obtained results, as well as recommendations for the future.


Chapter 2

Literature Review

In this chapter a review of relevant literature from the fields of Authorship Disambiguation and Alias Resolution will be given. The first section will explain different string metrics that have successfully been applied to resolve superficial aliases and authorship problems. In the second section, authorship attribution techniques that can be used to resolve the question of authorship in general will be discussed. Moreover, the various design choices that have to be made when creating an authorship attribution system will be explained. The third section will deal with techniques from Link Analysis that use the network in which emails reside to discover aliases. In the fourth section, several ways of combining these techniques will be discussed. The last section will introduce several measures that can be used for evaluating the performance of different techniques.

2.1 String metrics

String similarity metrics are a class of functions that map two strings to a real number, where the higher the value of this number, the greater the similarity between the two strings. Many string metrics use the number of operations that are required to transform one string into another in order to calculate the similarity between the two. Possible operations include insertion, deletion, substitution, and transposition. A different class of string metrics is the phonetic encodings, in which strings are converted into codes according to how they are pronounced. However, these encodings are language-dependent and are not available for many languages.

String metrics do not take into account information regarding the context in which the strings occur. As such, they can be considered rather simple approaches to resolving aliases or settling authorship disputes. However, string metrics can be very useful for detecting misspellings of email aliases resulting from the use of different email domains or naming conventions. For example, they can easily detect the similarity between "johndoe@domain.com" and "jhondoe@domain.com". They are less useful when people deliberately try to hide their identity by using completely different email addresses.

2.1.1 Techniques

In this section the most commonly used string metrics will be discussed.

The Levenshtein distance [52], often referred to as edit distance, is one of the earliest and most used string distances. It is defined as the minimum required number of operations to transform string s into string t. Each operation has a cost of 1, and the allowed operations are insertion, deletion, and substitution of a character. The Levenshtein distance can be transformed into a similarity metric by using

\[ \mathrm{similarity}(s, t) = \frac{1}{\mathrm{Levenshtein}(s, t) + 1} \tag{2.1} \]
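As a concrete illustration (not part of the original text), a minimal Python sketch of the Levenshtein distance and the similarity of equation 2.1; the function names are illustrative:

    def levenshtein(s: str, t: str) -> int:
        """Minimum number of insertions, deletions and substitutions
        needed to transform s into t (dynamic programming)."""
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, start=1):
            curr = [i]
            for j, ct in enumerate(t, start=1):
                cost = 0 if cs == ct else 1
                curr.append(min(prev[j] + 1,          # delete cs
                                curr[j - 1] + 1,      # insert ct
                                prev[j - 1] + cost))  # substitute cs -> ct
            prev = curr
        return prev[-1]

    def levenshtein_similarity(s: str, t: str) -> float:
        """Similarity in (0, 1] as defined in equation 2.1."""
        return 1.0 / (levenshtein(s, t) + 1)

    print(levenshtein_similarity("johndoe", "jhondoe"))  # 2 edits -> 1/3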

The Jaro similarity [32] algorithm uses the number of transpositions T and the number of matching characters m in order to determine the similarity between two strings. Two characters are matching only if they are no farther apart than half the length of the longest string. The number of transpositions is defined as the number of matching characters in different sequence orders, divided by two. The similarity is then calculated as follows:

\[ \mathrm{Jaro}(s, t) = \frac{1}{3} \left( \frac{m}{|s|} + \frac{m}{|t|} + \frac{m - T}{m} \right) \tag{2.2} \]

where |s| denotes the length of string s.

The Jaro-Winkler similarity [67] is an extension of the Jaro algorithm, using the empirical finding by Winkler that fewer errors tend to occur at the start of strings. The similarity is calculated as follows, where p is the length of the prefix that the two strings share:

\[ \mathrm{JaroWinkler}(s, t) = \mathrm{Jaro}(s, t) + \frac{p}{10} \left( 1.0 - \mathrm{Jaro}(s, t) \right) \tag{2.3} \]
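For illustration, a sketch of the standard formulation of both metrics, assuming the usual matching window of floor(max(|s|, |t|) / 2) - 1, a shared prefix capped at 4 characters, and a scaling factor of 0.1:

    def jaro(s: str, t: str) -> float:
        """Jaro similarity of equation 2.2."""
        if s == t:
            return 1.0
        window = max(len(s), len(t)) // 2 - 1
        s_match = [False] * len(s)
        t_match = [False] * len(t)
        m = 0
        for i, c in enumerate(s):
            lo, hi = max(0, i - window), min(len(t), i + window + 1)
            for j in range(lo, hi):
                if not t_match[j] and t[j] == c:
                    s_match[i] = t_match[j] = True
                    m += 1
                    break
        if m == 0:
            return 0.0
        # Transpositions: matched characters appearing in a different order.
        s_seq = [s[i] for i in range(len(s)) if s_match[i]]
        t_seq = [t[j] for j in range(len(t)) if t_match[j]]
        T = sum(a != b for a, b in zip(s_seq, t_seq)) / 2
        return (m / len(s) + m / len(t) + (m - T) / m) / 3

    def jaro_winkler(s: str, t: str, scaling: float = 0.1) -> float:
        """Equation 2.3: boost the Jaro score for a shared prefix."""
        j = jaro(s, t)
        p = 0
        for a, b in zip(s, t):
            if a != b or p == 4:
                break
            p += 1
        return j + p * scaling * (1.0 - j)

    print(round(jaro_winkler("johndoe", "jhondoe"), 3))  # about 0.957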

The Soundex algorithm [53] is the most well-known and at the same time the oldest phonetic encoding that is used for string matching. Strings are first converted into phonetic codes, after which strings with similar codes are assumed to be highly similar. In order to convert a string into a Soundex code, the first letter of the string is retained, after which the following letters are converted to numbers according to the set of rules shown in table 2.1. In the resulting code, all zeros are removed, as well as multiple sequential occurrences of the same digit. The code is then cut off or extended with zeros such that it has exactly 3 digits. The first letter of the string together with the 3 digits forms the Soundex code. The Soundex algorithm makes use of the fact that strings that are pronounced in a similar fashion tend to have the same Soundex code. For example, "Maid" and "Made" both result in the Soundex code "M300".

    Letter                    Digit
    A, E, I, O, U, H, W, Y    0
    B, F, P, V                1
    C, G, J, K, Q, S, X, Z    2
    D, T                      3
    L                         4
    M, N                      5
    R                         6

Table 2.1: The rules for converting letters into digits as they are used in the Soundex algorithm.

The longest common substring [23] method iteratively finds and removes the longest substring of minimum length l that two strings have in common, until no more substrings can be found. The final similarity can then be calculated by taking the length of all the common substrings divided by either the maximum, minimum, or average length of the original strings.
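To make the encoding concrete, a sketch of the simplified Soundex variant described above, using the letter-to-digit rules of table 2.1 (real-world Soundex implementations differ in small details, such as the treatment of H and W):

    SOUNDEX_DIGITS = {
        **dict.fromkeys("AEIOUHWY", "0"), **dict.fromkeys("BFPV", "1"),
        **dict.fromkeys("CGJKQSXZ", "2"), **dict.fromkeys("DT", "3"),
        "L": "4", **dict.fromkeys("MN", "5"), "R": "6",
    }

    def soundex(name: str) -> str:
        """Keep the first letter, encode the rest via table 2.1, collapse
        runs of the same digit, drop zeros, pad or cut to 3 digits."""
        name = name.upper()
        prev = SOUNDEX_DIGITS.get(name[0], "")
        collapsed = []
        for c in name[1:]:
            d = SOUNDEX_DIGITS.get(c, "")
            if d and d != prev:
                collapsed.append(d)
            prev = d
        digits = "".join(d for d in collapsed if d != "0")
        return name[0] + (digits + "000")[:3]

    print(soundex("Maid"), soundex("Made"), soundex("Robert"))  # M300 M300 R163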

A slightly different approach by Monge and Elkan [50] uses a string metric, such as any of the ones discussed above, in a recursive matching scheme in order to determine the similarity between strings. Strings s and t are first broken into substrings s = s_1 ... s_K and t = t_1 ... t_K, after which the similarity is defined as

\[ \mathrm{MongeElkan}(s, t) = \frac{1}{K} \sum_{i=1}^{K} \max_{j=1}^{K} \mathrm{sim}'(s_i, t_j) \tag{2.4} \]

where sim'(s_i, t_j) denotes the similarity score between substrings s_i and t_j as assigned by a secondary string metric.
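A sketch of equation 2.4, splitting on whitespace and reusing the jaro_winkler function sketched earlier as the secondary metric (both choices are assumptions for illustration):

    def monge_elkan(s: str, t: str, sim) -> float:
        """For each token of s, keep its best match in t under the
        secondary metric sim, then average (equation 2.4)."""
        s_tokens, t_tokens = s.split(), t.split()
        return sum(max(sim(si, tj) for tj in t_tokens)
                   for si in s_tokens) / len(s_tokens)

    print(round(monge_elkan("john p doe", "doe jhon", jaro_winkler), 3))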

Christen [11] provides an extensive comparison of these and other string metrics on 4 different test sets of given names, surnames, and full names. He found that it is important to know beforehand the structure of the names to be matched, and whether they have been parsed and standardized. He also found that Jaro-Winkler similarity performed best in a comparison of 27 different string metrics. Furthermore, he reached the following conclusions: (1) phonetic encodings should not be used, since they are outperformed by all other techniques; (2) Jaro and Jaro-Winkler similarity perform well for given names and surnames if the names are parsed into separate fields; (3) longest common substring is useful when the names might contain swapped words; (4) the Winkler modification can be used with every technique to improve the quality of the matching; (5) the selection of a proper threshold is the biggest problem for most matching techniques; and (6) the fastest techniques are the ones that have a time complexity linear in the length of the strings. Cohen and Fienberg [13] evaluated several string metrics on 13 different test sets, concluding that the Monge-Elkan distance achieved the best performance of all the string metrics. The Jaro-Winkler metric proved to be a fast heuristic scheme, achieving almost the same performance as Monge-Elkan whilst being considerably less complex in nature.


2.2 Authorship Attribution

Other approaches to resolving aliases and disambiguating authors can be found in the field of Authorship Attribution. The authorship attribution task can be described as follows: given a set of candidate authors and a set of documents written by each of these authors, try to determine which of these candidates wrote a given anonymous document. In the traditional authorship attribution problem, the number of candidate authors is typically small (2-10), the number of documents per author is large, and the length of these documents is large. Moreover, it is assumed that the author of the anonymous document is actually in the candidate set, i.e., there is a closed candidate set. A good example of a traditional authorship attribution problem is to determine the author of a disputed literary work, such as some of Shakespeare's plays.

Authorship attribution techniques can be very useful in resolving aliases and determining authorship. An authorship attribution system can be trained to distinguish between different authors in an email data set. For a given author, it is possible to determine if an alias is being used by letting the authorship attribution system predict which author's writing style most closely resembles the given author's writing style.

In the remainder of this section, the different techniques that have been employed in authorship attribution problems will be explained, as well as the important design choices that have to be made. These include the choice of a feature set, a feature selection technique, the actual attribution technique, and whether to treat the problem from an instance-based or a profile-based perspective.

2.2.1 Instance vs. profile-based

A general distinction can be made between techniques that treat each email individually (instance-based) and techniques that accumulate all the emails per author (profile-based). The first approach treats each email from a given author as a single training instance and thereby retains differences in texts from the same author. The second approach accumulates all the texts from a given author into one big training file, creating a profile of one author and disregarding differences in each individual text. The choice is mostly philosophical: whether to model the general style of each author or the individual style of each document [63].

2.2.2 Features

An important design choice in authorship attribution systems is the choice of feature set. Features are the specific writing-style attributes, predefined by the researcher, that are extracted from a piece of text in order to capture stylistic information that is characteristic of a particular author. Since the choice of feature set can affect the performance of the authorship attribution in various ways, it is important to consider which features to include or exclude. In general, a distinction can be made between lexical, syntactic, structural, semantic, and content-specific features. These features will be discussed in that order in the following sections.

Lexical features

Lexical features are the features that are derived at the character and word level of the text, and they are the most commonly used features. These features are considered language-independent, since they do not need any prior language-dependent processing before they can be applied to a text. Character frequencies, word-length distributions, the frequency of digits and non-alphanumeric characters, and the total number of words are all examples of lexical features that provide useful information.

An easy-to-use lexical feature that is also computationally simple is character n-grams. For example, the character 4-grams that can be extracted from the phrase "the dog" are "the ", "he d", "e do", and " dog". Character n-grams can capture various writing-style markers from a text, such as capitalization or UK/US variants of certain words. Even if a word is incorrectly spelled, most of the n-grams extracted from the correct and incorrect variants will match, although a misspelling can also be considered as a style marker for a particular author. An advantage of character n-grams is that they do not need tokenization before they can be applied to a text, which is very useful in Asian or Arabic languages where tokenization is difficult.
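For illustration, extracting these features is a one-liner:

    def char_ngrams(text: str, n: int = 4) -> list:
        """All overlapping character n-grams of a text."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("the dog"))  # ['the ', 'he d', 'e do', ' dog']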

More complicated lexical features require the detection of word and sentence boundaries in the text. By using common Natural Language Processing (NLP) tools, a text can be broken up into its constituent parts during a process called tokenization. A token can be a single word, phrase, symbol, or other meaningful element. After counting the occurrences of each distinct token, the n most frequently occurring tokens can be used as features, since the tokens that occur most frequently are considered to contain the most useful discriminatory information.

Another set of features that can be derived from tokenization is the frequency of different word lengths. These features provide information on how often a particular author uses words of different lengths. Vocabulary richness measures are a subset of lexical features that are derived from these word frequencies. Hapax legomena is the number of words that occur once in a text, whereas hapax dislegomena is the number of words that occur twice. The number of hapax legomena and hapax dislegomena gives an indication of how rich the vocabulary used by a certain author is. Authors that have a larger vocabulary will have a higher count of once- or twice-occurring words than authors with a small vocabulary. The type-token ratio V/N is the number of unique tokens V divided by the total number of tokens in a text N, and gives another indication of vocabulary richness. Numerous vocabulary richness measures have been created based on word frequencies, of which the most well known are:


• Yule's K [69]:

\[ K = 10^{4} \cdot \left[ -\frac{1}{N} + \sum_{i} V(i, N) \left( \frac{i}{N} \right)^{2} \right] \tag{2.5} \]

where V(i, N) is the number of words occurring i times in the text of length N.

• Sichel's S [59]:

\[ S = \frac{V(2, N)}{V(N)} \tag{2.6} \]

where V(N) is the vocabulary size and V(2, N) the number of twice-occurring words.

• Brunet's W [7]:

\[ W = N^{V(N)^{-a}} \tag{2.7} \]

where N is the number of words and a is usually set to 0.172.

• Honoré's R [26]:

\[ H = 100 \cdot \frac{\log N}{1 - \frac{V(1, N)}{V(N)}} \tag{2.8} \]
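A sketch computing the four measures from a token list (illustrative only; it assumes the text contains hapax legomena, since Honoré's R otherwise divides by zero, and uses a = 0.172 for Brunet's W):

    import math
    from collections import Counter

    def richness(tokens: list) -> dict:
        """Vocabulary-richness measures of equations 2.5-2.8."""
        N = len(tokens)
        freq = Counter(tokens)             # word -> nr of occurrences
        spectrum = Counter(freq.values())  # i -> V(i, N)
        V = len(freq)                      # vocabulary size V(N)
        return {
            "yule_k":   1e4 * (-1 / N + sum(v * (i / N) ** 2
                                            for i, v in spectrum.items())),
            "sichel_s": spectrum[2] / V,
            "brunet_w": N ** (V ** -0.172),
            "honore_r": 100 * math.log(N) / (1 - spectrum[1] / V),
        }

    print(richness("the cat sat on the mat with the cat".split()))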

Furthermore, smileys [64], abbreviations [62], slang words [36], and even spelling errors can be used as stylometric features to distinguish between authors. For example, Koppel [37] tries to simulate the way a human expert discriminates between authors by using stylistic idiosyncrasies, such as misspellings, to fingerprint a particular author. A disadvantage of that approach is that discriminating between authors is hard when there are few idiosyncrasies present in the text.

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing the syntactic constructions that an author uses. The underlying idea is that every author unconsciously uses more or less the same syntactical patterns in each text. Syntactic features rely on the accuracy and availability of NLP tools in different languages in order to tokenize and apply part-of-speech (POS) tags.

A common method is to analyze the short sequences of POS-tags that occur most frequently throughout an author's work. For example, the sentence "The man walked in the park" can be represented as

| DETERMINER | NOUN | VERB PAST TENSE | PREPOSITION | DETERMINER | NOUN |

where token starts and ends are delimited by a |. The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS-tags can be found in Solorio and Pillay [62], who use POS-tag uni-grams, bi-grams, and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactical parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64], or the distance covered by the dependency links in the parse tree.

Function words were introduced by Mosteller and Wallace [51] to distinguish between different authors. Function words are words that have almost no lexical meaning but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change over short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong", the function words are "The", "of", and "the". They signify the grammatical structure of the sentence without containing any meaning; they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features.
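A sketch of how such counts can be turned into a feature vector, using the first few entries of the function word list given in the appendix:

    import re

    # First few entries of the function word list in the appendix.
    FUNCTION_WORDS = ["a", "about", "above", "after", "all", "although",
                      "am", "among", "an", "and"]

    def function_word_vector(text: str) -> list:
        """Relative frequency of each function word in the text."""
        tokens = re.findall(r"[a-z]+", text.lower())
        total = max(len(tokens), 1)
        return [tokens.count(w) / total for w in FUNCTION_WORDS]

    print(function_word_vector("The leader of the team was very strong"))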

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structures. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks, and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al. [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks. However, they experienced difficulties with extracting these features. Another example of content-specific structural features is the use of HTML-tags by de Vel et al. [17]. They found that some email programs used HTML formatting for their emails, and included the frequency of different HTML-tags in their feature set.

Semantic features

The most complex set of features are the semantic features. These features require sophisticated NLP techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al. [64], who use UK/US spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives, and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity types, such as date, location, money, number, ordinal, organization, percent, person, and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word. Function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural, and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features. In such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection has to deal with an additional problem, in the sense that the final feature set might be over-fitting the training data. Therefore, the use of feature selection methods is ambiguous and can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

\[ \mathrm{Entropy} = -\sum_{x \in X} P(x) \log P(x) \tag{2.9} \]

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.
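For illustration, a small sketch of entropy and information gain on toy data (the feature and labels are invented):

    import math
    from collections import Counter

    def entropy(labels: list) -> float:
        """Equation 2.9 over an empirical class distribution."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels: list, feature: list) -> float:
        """Entropy reduction from splitting the data on a discrete feature."""
        n = len(labels)
        groups = {}
        for lab, val in zip(labels, feature):
            groups.setdefault(val, []).append(lab)
        remainder = sum(len(g) / n * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    authors = ["A", "A", "B", "B"]
    has_greeting = [1, 1, 0, 1]        # toy binary feature per email
    print(information_gain(authors, has_greeting))  # about 0.311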

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "The metric measures the horizontal separation between two standard Normal curves where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the feature set whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data. Hence, they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set, he still manages to assert with confidence that "The Royal Book of Oz" was written by Ruth Plumly Thompson. Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction that is made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations \vec{V}(s) and \vec{V}(t), the cosine similarity is defined as

\[ \mathrm{Cosine}(s, t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)| \, |\vec{V}(t)|} \tag{2.10} \]

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10,000. In later research, Koppel et al. [40] report that 46% of 1,000 blog extracts are classified correctly using only the cosine similarity.
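A sketch of equation 2.10 on sparse term-count vectors (tf-idf weighting is omitted for brevity):

    import math
    from collections import Counter

    def cosine(u: Counter, v: Counter) -> float:
        """Cosine similarity between sparse term-weight vectors (eq. 2.10)."""
        dot = sum(w * v[t] for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) \
             * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    doc_s = Counter("the quick brown fox".split())
    doc_t = Counter("the lazy brown dog".split())
    print(cosine(doc_s, doc_t))  # 2 shared terms out of 4 each -> 0.5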

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors increases or the number of messages per author decreases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural, and domain-specific features from the texts of candidate authors and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1: [0, 0.25], A2: [0.25, 0.50], A3: [0.50, 0.75], and A4: [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown document's writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint, to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.

Figure 2.1: The structure of a supervised authorship attribution system.

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered as the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe, and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author. His most important finding was that word-length distributions tend to remain the same across different works for a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejected the claim of some literary scholars that Shakespeare and Bacon were the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers, of which a set of 12 political essays were of disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller and Wallace applied a Naïve Bayes probabilistic model to the frequency of these function words and found that all disputed documents were written by Madison. The Naïve Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, ..., x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the document can be expressed (up to a normalizing constant) by

\[ P(A_i \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid A_i) \, P(A_i) \tag{2.11} \]

The real author is then calculated using

\[ A^{*} = \underset{A_i \in A}{\arg\max} \; P(A_i \mid x_1, \ldots, x_n) \tag{2.12} \]

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words of a text against a reference corpus (a large contemporary corpus over which the mean and standard deviation of these 30 words are computed) and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1,500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
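A sketch of Delta under the assumption that the means and standard deviations of the chosen word set over a reference corpus are already available; all numbers are invented:

    from statistics import mean

    def delta(text_freqs: dict, author_freqs: dict, reference: dict) -> float:
        """Burrows' Delta: mean absolute difference of z-scores over a
        set of frequent words. reference[word] = (mu, sigma) from a large
        reference corpus; the other arguments map words to relative
        frequencies in the target text and the candidate's known texts."""
        def z(freqs, word, mu, sigma):
            return (freqs.get(word, 0.0) - mu) / sigma
        return mean(abs(z(text_freqs, w, mu, s) - z(author_freqs, w, mu, s))
                    for w, (mu, s) in reference.items())

    reference = {"the": (0.060, 0.010), "of": (0.030, 0.008),
                 "and": (0.028, 0.009)}
    unknown = {"the": 0.071, "of": 0.025, "and": 0.031}
    candidate = {"the": 0.069, "of": 0.027, "and": 0.033}
    print(delta(unknown, candidate, reference))  # low value -> likely a match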

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e., the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but they do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor), and the C4.5 decision tree classifier were tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest possible margin. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in very fast classification. An example of how SVM maximizes the margin of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin of error. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and on what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and separates the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF-kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: Original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF-kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations on binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-one using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71], and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.
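For illustration, a minimal authorship classifier along these lines can be built with scikit-learn (assuming that library; the emails and authors are toy data, and character n-grams stand in for a full stylometric feature set):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    emails = ["hi team, please find the report attached",
              "per my last email, the numbers look fine to me",
              "yo, wanna grab lunch later??",
              "lol sure, see u at noon"]
    authors = ["alice", "alice", "bob", "bob"]

    # Character n-grams as features and an RBF kernel as in figure 2.4;
    # SVC handles multiple authors via combinations of binary classifiers.
    model = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(3, 4)),
                          SVC(kernel="rbf", gamma="scale", C=10))
    model.fit(emails, authors)
    print(model.predict(["please see the attached numbers"]))  # likely 'alice'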

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes, and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training and validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author. Which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text was written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e., to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let vi, vj ∈ V; then there is an edge e(vi, vj) ∈ W if a message has been sent from author vi to author vj. If there exists an edge e(vi, vj) ∈ W, then vi and vj are considered to be neighbors. The neighborhood N(vi) is the set of all neighbors of the vertex vi. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, is when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. The co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

\[ \mathrm{Cocitation}(v_i, v_j) = |N(v_i) \cap N(v_j)| \tag{2.13} \]

In Graph Theory this is often known as the shared-neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices vi, vj, vk and two edges w(i, k), w(j, k), such that vi and vj are connected via the third vertex vk. Figure 2.6 provides an example of a trivial network where email addresses vi and vj are considered to be aliases, because they are connected by three different Connected Triples (red, yellow, and blue).


Figure 2.6: An example of two vertices vi and vj being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of vi and vj is defined as follows:

\[ \mathrm{Jaccard}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|} \tag{2.14} \]

where N(vi) again designates the set of neighbors of vi. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random guessing in predicting where new links will form. However, it sometimes performs on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
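A sketch of equation 2.14 applied to the contact sets of two email addresses (the addresses and contacts are invented):

    def jaccard(neighbors: dict, vi: str, vj: str) -> float:
        """Neighborhood overlap of two addresses (equation 2.14)."""
        a, b = set(neighbors[vi]), set(neighbors[vj])
        return len(a & b) / len(a | b) if a | b else 0.0

    contacts = {"j.doe@enron.com":  {"bob", "carol", "dave"},
                "jdoe@hotmail.com": {"bob", "carol", "erin"}}
    print(jaccard(contacts, "j.doe@enron.com", "jdoe@hotmail.com"))  # 0.5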

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively. An individual neighbor is denoted as I_x(v) or O_x(v). The similarity between vertices vi and vj can be calculated using the following recursive equation:

\[ \mathrm{SimRank}(v_i, v_j) = \frac{C}{|I(v_i)| \, |I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} \mathrm{SimRank}(I_x(v_i), I_y(v_j)) \tag{2.15} \]

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting SimRank(vi, vj) = 1 if vi = vj and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
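A sketch of the fixed-point iteration for equation 2.15 on a toy directed graph (C = 0.8 is an arbitrary choice):

    from itertools import product

    def simrank(in_nbrs: dict, C: float = 0.8, iterations: int = 10) -> dict:
        """Iterate equation 2.15 to a fixed point. in_nbrs maps each
        node to the list of its in-going neighbors I(v)."""
        nodes = list(in_nbrs)
        sim = {(a, b): float(a == b) for a, b in product(nodes, nodes)}
        for _ in range(iterations):
            new = {}
            for a, b in product(nodes, nodes):
                Ia, Ib = in_nbrs[a], in_nbrs[b]
                if a == b:
                    new[a, b] = 1.0
                elif Ia and Ib:
                    total = sum(sim[x, y] for x in Ia for y in Ib)
                    new[a, b] = C * total / (len(Ia) * len(Ib))
                else:
                    new[a, b] = 0.0
            sim = new
        return sim

    graph = {"vi": ["a", "b"], "vj": ["a", "c"],
             "a": [], "b": ["a"], "c": ["a"]}
    print(simrank(graph)["vi", "vj"])  # converges to 0.36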

PageSim [42] is another extension of the co-citation algorithm that assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices vi and vj is calculated using

\[ \mathrm{ConnectedPath}(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{\mathrm{length}(p)} \tag{2.16} \]

where PATH(vi, vj, r) is the collection of all paths between vi and vj of length at most r. U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

\[ U(p) = \sum_{v_x \in p, \; v_x \notin \{v_i, v_j\}} UQ(v_x) \tag{2.17} \]

UQ(vx) denotes the uniqueness of a single vertex vx in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

\[ UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|} \tag{2.18} \]

where w(x, g) denotes an edge between vx ∈ p and any other vertex vg ∈ V, and w(x, x+1) and w(x, x-1) denote the edges from vx to its adjacent vertices in the path. Figure 2.7 provides an example of vertices vi and vj having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim, and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.

Figure 2.7: An example of three different paths between the vertices vi and vj. The most direct path (px) is the most informative path. Image courtesy of Boongoen et al. [6].
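A sketch of equations 2.16-2.18 on a small weighted graph, assuming that edge weights count the messages exchanged and that path length is measured in edges:

    def connected_path(adj: dict, vi: str, vj: str, r: int = 3) -> float:
        """Sum U(p) / length(p) over all simple paths between vi and vj
        of at most r edges (equations 2.16-2.18)."""
        def paths(path):
            last = path[-1]
            if last == vj and len(path) > 1:
                yield path
                return
            if len(path) - 1 == r:
                return
            for nxt in adj[last]:
                if nxt not in path:
                    yield from paths(path + [nxt])

        score = 0.0
        for p in paths([vi]):
            uniqueness = 0.0
            for k in range(1, len(p) - 1):       # intermediate vertices only
                vx = p[k]
                local = adj[vx][p[k - 1]] + adj[vx][p[k + 1]]
                uniqueness += local / sum(adj[vx].values())   # UQ(vx)
            score += uniqueness / (len(p) - 1)   # path length in edges
        return score

    adj = {"vi": {"a": 2, "b": 1}, "vj": {"a": 3, "c": 1},
           "a": {"vi": 2, "vj": 3}, "b": {"vi": 1, "c": 1},
           "c": {"b": 1, "vj": 1}}
    print(connected_path(adj, "vi", "vj"))  # 1/2 + 2/3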

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = αs_i + βs_j + γs_k, where s_i, s_j, and s_k denote the scores assigned by techniques i, j, and k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct for 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.
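A toy sketch of such a voting SVM over the normalized scores of three techniques, assuming scikit-learn; the training pairs and scores are invented for illustration:

    import numpy as np
    from sklearn.svm import SVC

    # Rows: candidate pairs; columns: email-address similarity,
    # writing-style similarity, link-network similarity.
    X = np.array([[0.95, 0.80, 0.70],   # obvious alias
                  [0.20, 0.85, 0.75],   # new address, same style and contacts
                  [0.90, 0.10, 0.05],   # similar address, different person
                  [0.10, 0.15, 0.20]])  # unrelated pair
    y = np.array([1, 1, 0, 0])          # 1 = alias, 0 = not an alias

    voter = SVC(kernel="linear").fit(X, y)
    print(voter.predict([[0.30, 0.80, 0.60]]))  # likely classified as alias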

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets. One data set was manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e., the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g., a neural network. Unfortunately, no previous research could be found that employs this approach.


                     correct alias           false alias
    retrieved        true positives (tp)     false positives (fp)
    not retrieved    false negatives (fn)    true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques it is essentialto use good evaluation measures In authorship attribution and alias resolutionit is common to construct a contingency table such as the one that can be seenin table 22 Based on this contingency table several evaluation measures canbe derived

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

\[ \text{Accuracy} = \frac{\text{correct classifications}}{\text{total number of classifications}} = \frac{tp + tn}{tp + fp + fn + tn} \tag{2.19} \]

Although it looks like a good measure of performance, it is not that hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can also be defined as

\[ P = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{retrieved aliases}|} = \frac{tp}{tp + fp} \tag{2.20} \]

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

\[ R = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{total correct aliases}|} = \frac{tp}{tp + fn} \tag{2.21} \]

These two measures are not as dependent on the class distributions as the accuracy measure. Therefore, they are a more sensible choice to use in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user that wants to automate the complete process, and be able to rely greatly on the classification given by the system, will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

\[ F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} \tag{2.22} \]

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as

\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{2.23} \]

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].
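As a small worked illustration, the following sketch computes these measures directly from the contingency counts; the example counts are made up.

def precision_recall_f(tp, fp, fn, alpha=0.5):
    # Equations 2.20-2.23; alpha = 0.5 yields the F1-measure.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 1.0 / (alpha / p + (1.0 - alpha) / r) if p > 0 and r > 0 else 0.0
    return p, r, f

# Example: 14 retrieved aliases, 10 of them correct, 5 correct aliases missed.
print(precision_recall_f(tp=10, fp=4, fn=5))  # (0.714..., 0.666..., 0.689...)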

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed, and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46] and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of effectiveness on the smaller classes.
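A minimal sketch of the difference, using made-up per-problem counts and showing precision only (recall is computed analogously):

def micro_macro_precision(tables):
    # tables: one (tp, fp) pair per test problem. Micro-averaging pools the
    # counts into one global table; macro-averaging averages the per-problem
    # scores, weighting every problem equally.
    tp = sum(t for t, _ in tables)
    fp = sum(f for _, f in tables)
    micro = tp / (tp + fp)
    macro = sum(t / (t + f) if t + f else 0.0 for t, f in tables) / len(tables)
    return micro, macro

# One large and one small problem: micro-averaging is dominated by the large
# problem, macro-averaging weights both equally.
print(micro_macro_precision([(90, 10), (1, 9)]))  # (0.827..., 0.5)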

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques, and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that, when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of over-fitting and perform automated feature selection. Therefore, SVM has been chosen as a classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research, but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. This chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in Section 3.2, whereas the different combinations of techniques are dealt with in Section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in Figure 3.1. Since the sender, receiver, and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply-parts. Concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


Subject: SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well-known version of the data set was made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen-version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages, and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "noaddress@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step    Records affected    Percentage removed (cum.)

1       17,052               6.70
3       13,681              12.00
4       26,223              22.50
5        4,001              24.00
6       25,990              34.00
7        3,700              35.80
8       52,163              56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply-parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply-parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date, and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances that is needed for classification using Support Vector Machines. The results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set, the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the database contains 44,912 messages by 246 different senders. For each message, the sender, receiver, subject, body, and send-date have been stored.

Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels (Linear and RBF) for the authorship SVM. The plot shows 10-fold cross-validation accuracy against the number of training instances per class.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it corresponds to the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set, and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, as listed below.


Figure 3.3: The distribution of email messages per author.

Figure 3.4: The distribution of the total number of words per author.


Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of alias                          Number of authors

High Jaro-Winkler with 1 alias         26
High Jaro-Winkler with 2 aliases       15
Low Jaro-Winkler with 1 alias          11
Low Jaro-Winkler with 2 aliases         1
No alias                              193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set               Mixed    Hard

High Jaro-Winkler      6        2
Low Jaro-Winkler       8        16
No alias               6        2

Table 3.3: Distribution of alias-types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.comA and john.doe@enron.comB).

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden and abu_abdallah).

• Authors without an alias.

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.
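A sketch of this splitting procedure, with the suffix scheme and the message identifiers as invented placeholders:

import random

def split_into_aliases(author, message_ids, k=2, seed=42):
    # Randomly reassign an author's messages to one of k artificial alias
    # identities (the A/B suffix scheme mirrors the examples above).
    rng = random.Random(seed)
    aliases = [author + chr(ord("A") + i) for i in range(k)]
    return {m: rng.choice(aliases) for m in message_ids}

print(split_into_aliases("john.doe@enron.com", range(5)))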

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
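As an illustration, a sketch of this pairwise thresholding is given below; it assumes the third-party jellyfish package for the Jaro-Winkler computation, and the addresses and threshold are invented.

import itertools
import jellyfish

addresses = ["john.doe@enron.com", "john.doe@enron.comB", "jane.roe@enron.com"]
THRESHOLD = 0.94  # cf. the best-scoring threshold on the mixed test set

alias_pairs = [
    (a, b)
    for a, b in itertools.combinations(addresses, 2)
    if jellyfish.jaro_winkler_similarity(a, b) >= THRESHOLD
]
print(alias_pairs)  # [('john.doe@enron.com', 'john.doe@enron.comB')]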

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores are in the range [0, 1], the score for a particular author-author pair was calculated as follows:

\[ \mathrm{ConnectedPath}(v_i, v_j) = \frac{\mathrm{ConnectedPath}(v_i, v_j)}{\mathrm{ConnectedPath}_{\max}} \tag{3.1} \]

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in Section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
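A minimal sketch of this neighbor-set comparison, with invented link data:

def jaccard(neighbours_a, neighbours_b):
    # Jaccard similarity of two authors' sets of direct correspondents.
    a, b = set(neighbours_a), set(neighbours_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Illustrative link data: author -> the people they exchanged email with.
network = {
    "alias1@enron.com": {"x@enron.com", "y@enron.com", "z@enron.com"},
    "alias2@enron.com": {"x@enron.com", "y@enron.com", "w@enron.com"},
}
print(jaccard(network["alias1@enron.com"], network["alias2@enron.com"]))  # 0.5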

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences between individual email messages. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic, and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4; the list of function words that has been used in the feature set can be found in the appendix.
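To make this concrete, the sketch below computes a handful of the lexical features of Table 3.4 for a single message; it is an illustration, not the feature extractor used in the experiments.

import re
from collections import Counter

def style_features(text):
    # A handful of the lexical features from Table 3.4; the real feature
    # vector is 492-dimensional and also has syntactic/structural parts.
    chars = len(text)                                    # feature 1 (C)
    words = re.findall(r"[A-Za-z']+", text)
    counts = Counter(w.lower() for w in words)
    return {
        "total_chars": chars,
        "upper_ratio": sum(c.isupper() for c in text) / chars if chars else 0,
        "total_words": len(words),                       # feature 54 (M)
        "short_words": sum(len(w) < 4 for w in words),   # feature 55
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0,
        "hapax_legomena": sum(v == 1 for v in counts.values()),    # feature 61
        "hapax_dislegomena": sum(v == 2 for v in counts.values()), # feature 62
    }

print(style_features("Thank you very much. We will give it a try."))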

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest scoring combination of parameters is then chosen to train the actual SVM model.

Features     Description

Lexical
1            Total number of characters (C)
2            Total number of alphabetic characters / C
3            Total number of upper-case characters / C
4            Total number of digit characters / C
5            Total number of white-space characters / C
6            Total number of tab spaces / C
7-32         Frequency of letters A-Z
33-53        Frequency of special characters, e.g. ~ $ ^ & - _ = + > < [ ] |
54           Total number of words (M)
55           Total number of short words (less than four characters) / M
56           Total number of characters in words / C
57           Average word length
58           Average sentence length (in characters)
59           Average sentence length (in words)
60           Total different words / M
61           Hapax legomena: frequency of once-occurring words
62           Hapax dislegomena: frequency of twice-occurring words
63-82        Word length frequency distribution / M
83-333       TF-IDF of 250 most frequent 3-grams

Syntactic
334-341      Frequency of punctuation marks, e.g. ' and "
342-491      Frequency of function words

Structural
492          Total number of sentences

Table 3.4: Feature set for the authorship SVM.
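A sketch of this grid search is given below; it uses scikit-learn purely for illustration (the experiments themselves used SVM.NET), with stand-in data.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 2)],      # 2^-5, 2^-3, ..., 2^15
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],  # 2^-15, 2^-13, ..., 2^3
}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="accuracy")

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # stand-in data
search.fit(X, y)
print(search.best_params_)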

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion, by Matthew A. Johnson, of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression, and distribution estimation for single and multi-class problems, using different kernels and parameters.
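A compact sketch of this one-versus-all scheme with balanced classes (scikit-learn assumed; again, the experiments used SVM.NET) could look as follows:

import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(emails_by_author):
    # One RBF-SVM per author: that author's feature vectors form the positive
    # class, an equally large random sample of everyone else's the negative
    # class (emails_by_author: dict author -> 2-D ndarray of feature vectors).
    rng = np.random.default_rng(0)
    models = {}
    for author, pos in emails_by_author.items():
        pool = np.vstack([f for a, f in emails_by_author.items() if a != author])
        neg = pool[rng.choice(len(pool), size=len(pos), replace=False)]
        X = np.vstack([pos, neg])
        y = np.array([1] * len(pos) + [0] * len(neg))
        models[author] = SVC(kernel="rbf", probability=True).fit(X, y)
    return models

def attribute(models, x):
    # The author whose SVM assigns the highest probability wins.
    return max(models, key=lambda a: models[a].predict_proba([x])[0, 1])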

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3), and authorship SVM on email content.

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network, and authorship SVM on email content.

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

Figure 3.6: The structure of the combined approach.
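A minimal sketch of the voting SVM, with invented score vectors and labels standing in for the manually labeled training instances (scikit-learn assumed):

import numpy as np
from sklearn.svm import SVC

# Each candidate pair is described by the three technique scores:
# [jaro_winkler, jaccard (or connected_path), authorship_svm_probability].
X_train = np.array([
    [0.97, 0.60, 0.85],  # real alias: similar address, network and style
    [0.90, 0.02, 0.10],  # false alias despite a similar address
    [0.45, 0.55, 0.80],  # real alias despite a dissimilar address
    [0.35, 0.05, 0.20],  # false alias
    [0.96, 0.58, 0.90],  # real alias
    [0.50, 0.10, 0.30],  # false alias
])
y_train = np.array([1, 0, 1, 0, 1, 0])

voting_svm = SVC(kernel="rbf").fit(X_train, y_train)
print(voting_svm.predict([[0.40, 0.50, 0.75]]))  # 1 = alias, 0 = no alias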

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard, and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88, using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard, and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89, using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the mixed test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the mixed test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the hard test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the hard test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names, derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2; it is expected that the same behavior of Connected Path can be observed on this data set if the search were performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If there was no Connected Path score returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. Because of the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard, and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content, and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques, and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. Intelligent Systems, IEEE, 20(5):67-75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J. and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, volume 6, pages 48-57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9-17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. 2009 IEEE International Conference on Fuzzy Systems, pages 288-293.

[6] Boongoen, T., Shen, Q. and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77-102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267-287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27-47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1-39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), September, pages 290-294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73-78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273-297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265-292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining.

[17] de Vel, O., Anderson, A., Corney, M. and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278-285.

[19] Fan, R.-E., Chen, P.-H. and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889-1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3-6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289-1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486-509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611-es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172-177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453-476.

[28] Hsiung, P., Moore, A., Neill, D. and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis (2005), volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C. and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56-64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C. and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C. and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42-S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491-498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1-11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137-142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M. and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69-72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R. and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659-660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J. and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83-94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019-1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019-1020, Edinburgh.

[43] Lin, Z., Lyu, M. R. and King, I. (2009). MatchSim. In Proceeding of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A. and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING '08 (2008), August, pages 513-520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P. and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237-46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97-105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39-41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Evangelos, S., Jiawei, H. and Usama, F., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267-270. The AAAI Press, Menlo Park, California.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R. and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. Int. J. Metadata Semant. Ontologies, 1(2):89-99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101-141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, July, pages 482-491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45-72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249-252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265-269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538-556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N. and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K. and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425-442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17-24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1-15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology Proceedings, 3689:174-189.

[71] Zheng, R., Li, J., Chen, H. and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378-393.


Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 3: Thesis Freek Maes - Final Version

Acknowledgments

I would like to thank several people that have guided me in the process ofwriting this thesis without whom I would not have been able to finish it FirstI would like to thank Dr Johannes Scholtes for his supervision of my thesisand internship and for his time to help me with questions and challenges alongthe way Without his continuing enthusiasm and optimism it would have been amuch harder if not impossible task to write this thesis I would also like to thankDr Nico Roos for evaluating this thesis as a second assessor Finally I wouldlike to thank Eva van den Hurk for her constant support and encouragementduring the past year

Contents

List of Figures 1

List of Tables 2

1 Introduction 311 Structure of the thesis 4

2 Literature Review 521 String metrics 5

211 Techniques 622 Authorship Attribution 8

221 Instance vs profile-based 8222 Features 8223 Feature Selection 12224 Techniques 13

23 Link analysis 21231 Techniques 21

24 Combining Approaches 2425 Evaluation measures 2626 Conclusion 27

3 Methods 2931 ENRON Corpus 2932 Individual Techniques 3533 Combinations of Techniques 38

4 Results 40

5 Discussion 4651 Conclusion 4752 Future Recommendations 48

6 Bibliography 50

Appendix 56

List of Figures

21 The structure of a supervised authorship attribution system 1522 Example of a decision tree 1723 Linear Separation using Support Vector Machines 1824 Mapping of feature space for SVM using RBF-kernel 1925 Example of an Artificial Neural Network 2026 Connected Triples in a link network 2227 Connected Path similarity in a link network 24

31 Information extracted from an email in the ENRON data set 3032 Evaluation of different kernels and training sizes for SVM 3233 Distribution of email messages per author 3334 Distribution of total number of words per author 3335 Network graph of the authors in the ENRON subset 3436 Structure of the combined approach 39

41 Performance of individual techniques on the mixed test set 4242 Performance of combined techniques on the mixed test set 4243 Performance of individual techniques on the hard test set 4344 Performance of combined techniques on the hard test set 4345 Best performance of different techniques on the mixed test set 4446 Best performance of different techniques on the hard test set 45

1

List of Tables

21 Soundex algorithm rules 722 Contingency table for evaluation 26

31 Preprocessing steps applied to the ENRON corpus 3132 Artificial Aliases in the ENRON data set by type 3533 Distribution of alias-types in two different test sets 3534 Feature set for the authorship SVM 37

2

Chapter 1

Introduction

Authorship disambiguation and alias resolution are increasingly important con-cepts in domains such as intelligence and law where email collections may con-tain authors that use one or more aliases Aliases occur when a person usesmultiple email addresses for either intentional or unintentional reasons For ex-ample people can try to hide their identity by intentionally adopting severaldifferent email addresses something that is common in intelligence data setssuch as terrorist networks On the other hand the use of different email ad-dresses (home office etc) is becoming common nowadays Hence there alsoexist many unintentional aliases where only the domain of the email address isdifferent or where a simple misspelling of a name has occurred

Various approaches have been applied successfully to resolve aliases in email data sets, although each has its own shortcomings. Unintentional aliases can be resolved by employing metrics that indicate how much two email addresses look alike. However, these metrics are easily fooled by persons using completely different email addresses. Another approach focuses on the content of the email by creating a profile of an author's writing style. By comparing the writing styles of different authors and finding those that employ similar writing styles, aliases that are more complex can be detected. This approach has been applied successfully to attribute authorship of disputed literary works. However, it encounters scalability issues when the number of authors grows large or the length of the texts grows small, as is the case in email data sets. A third approach makes use of the fact that even if an author uses a completely different email address and writing style, the people with whom he corresponds via email might remain stable. The similarity between different authors' email contacts can be determined using link analysis techniques. These techniques achieve reasonable results and sometimes manage to find aliases that other techniques do not find.


The three approaches mentioned above operate on different domains, namely the email address, the content of the email, and the email network. Finding a way to combine these approaches and utilize their combined strengths might enable us to overcome their individual weaknesses. In order to guide the research that has been conducted for this thesis, three research questions have been formulated:

1. Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

2. How can techniques from different domains be combined?

3. Can a combination of techniques from different domains increase performance over individual techniques?

1.1 Structure of the thesis

The structure of the remaining parts of this thesis is as follows:

• Chapter 2 introduces multiple techniques from the fields of Authorship Disambiguation and Alias Resolution. Specifically, string metrics will be explained in section 2.1, authorship attribution systems in section 2.2, and link analysis techniques in section 2.3. Several ways of combining these techniques, as well as different measures for performance evaluation, will be discussed in sections 2.4 and 2.5.

• Chapter 3 outlines the methodology that has been used in order to conduct the experiments. The email corpus that has been used will be described, as well as the preprocessing that has been applied to it. Furthermore, the techniques that have been chosen for evaluation in the experiments will be explained.

• Chapter 4 will present in detail the results of the experiments that have been conducted.

• Finally, Chapter 5 provides a summary and discussion of the obtained results, as well as recommendations for the future.


Chapter 2

Literature Review

In this chapter, a review of relevant literature from the fields of Authorship Disambiguation and Alias Resolution will be given. The first section will explain different string metrics that have successfully been applied to resolve superficial aliases and authorship problems. In the second section, authorship attribution techniques that can be used to resolve the question of authorship in general will be discussed. Moreover, the various design choices that have to be made when creating an authorship attribution system will be explained. The third section will deal with techniques from Link Analysis that use the network in which emails reside to discover aliases. In the fourth section, several ways of combining these techniques will be discussed. The last section will introduce several measures that can be used for evaluating the performance of different techniques.

2.1 String metrics

String similarity metrics are a class of functions that map two strings to a real number, where the higher the value of this number, the greater the similarity between the two strings. Many string metrics use the number of operations that are required to transform one string into another in order to calculate the similarity between the two. Possible operations include insertion, deletion, substitution and transposition. A different class of string metrics is the phonetic encodings, in which strings are converted into codes according to how they are pronounced. However, these encodings are language-dependent and are not available for many languages.

String metrics do not take into account information regarding the context in which the strings occur. As such, they can be considered rather simple approaches to resolving aliases or settling authorship disputes. However, string metrics can be very useful for detecting misspellings of email aliases resulting from the use of different email domains or naming conventions. For example, they can easily detect the similarity between "johndoe@domain.com" and "jhondoe@domain.com". They are less useful when people deliberately try to hide their identity by using completely different email addresses.

2.1.1 Techniques

In this section, the most commonly used string metrics will be discussed.

The Levenshtein distance [52], often referred to as edit distance, is one of the earliest and most used string distances. It is defined as the minimum required number of operations to transform string s into string t. Each operation has a cost of 1, and the allowed operations are insertion, deletion and substitution of a character. The Levenshtein distance can be transformed into a similarity metric by using

similarity(s, t) = \frac{1}{\text{Levenshtein}(s, t) + 1}    (2.1)
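As an illustration, a minimal Python sketch of equation (2.1); the dynamic-programming formulation and variable names are ours, not the thesis's:

def levenshtein(s, t):
    # classic dynamic programming: d[i][j] = edits needed to turn s[:i] into t[:j]
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(s)][len(t)]

def levenshtein_similarity(s, t):
    return 1.0 / (levenshtein(s, t) + 1)  # equation (2.1)

print(levenshtein_similarity("johndoe@domain.com", "jhondoe@domain.com"))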

The Jaro similarity [32] algorithm uses the number of transpositions T and the number of matching characters m in order to determine the similarity between two strings. Two characters are matching only if they are no farther apart than half the length of the longest string. The number of transpositions is defined as the number of matching characters in different sequence orders, divided by two. The similarity is then calculated as follows:

\text{Jaro}(s, t) = \frac{1}{3} \left( \frac{m}{|s|} + \frac{m}{|t|} + \frac{m - T}{m} \right)    (2.2)

where |s| denotes the length of string s.

The Jaro-Winkler similarity [67] is an extension of the Jaro algorithm, using the empirical finding by Winkler that fewer errors tend to occur at the start of strings. The similarity is calculated as follows, where p is the length of the prefix that the two strings share:

\text{Jaro-Winkler}(s, t) = \text{Jaro}(s, t) + \frac{p}{10} (1.0 - \text{Jaro}(s, t))    (2.3)
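A compact sketch of equations (2.2) and (2.3); the matching window and the customary four-character prefix cap are assumptions taken from the standard formulation of the algorithm, not from this thesis:

def jaro(s, t):
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    window = max(0, max(ls, lt) // 2 - 1)  # characters may match within this distance
    s_match, t_match = [False] * ls, [False] * lt
    m = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(lt, i + window + 1)):
            if not t_match[j] and t[j] == c:
                s_match[i] = t_match[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # transpositions: matched characters appearing in a different order, halved
    k, T = 0, 0
    for i in range(ls):
        if s_match[i]:
            while not t_match[k]:
                k += 1
            if s[i] != t[k]:
                T += 1
            k += 1
    T /= 2
    return (m / ls + m / lt + (m - T) / m) / 3  # equation (2.2)

def jaro_winkler(s, t, max_prefix=4):
    j = jaro(s, t)
    p = 0
    for a, b in zip(s, t):
        if a != b or p == max_prefix:
            break
        p += 1
    return j + (p / 10.0) * (1.0 - j)  # equation (2.3)

print(jaro_winkler("johndoe@domain.com", "jhondoe@domain.com"))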

The Soundex algorithm [53] is the most well-known and at the same time the oldest phonetic encoding that is used for string matching. Strings are first converted into phonetic codes, after which strings with similar codes are assumed to be highly similar. In order to convert a string into a Soundex code, the first letter of the string is retained, after which the following letters are converted to numbers according to the set of rules shown in table 2.1. In the resulting code, all zeros are removed, as well as multiple sequential occurrences of the same digit. The code is then cut off or extended with zeros such that it has exactly 3 digits. The first letter of the string together with the 3 digits forms the Soundex code. The Soundex algorithm makes use of the fact that strings that are pronounced in a similar fashion tend to have the same Soundex code. For example, "Maid" and "Made" both result in the Soundex code "M300".
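A minimal sketch following the conversion rules of table 2.1 below and the simplified procedure described above; note that production Soundex implementations differ in small details, such as the treatment of H and W between consonants:

def soundex(name):
    mapping = {}
    for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                           ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            mapping[ch] = digit
    name = "".join(ch for ch in name.upper() if ch in mapping)
    if not name:
        return ""
    digits = [mapping[ch] for ch in name[1:]]
    collapsed = []
    for d in digits:                       # drop repeated sequential digits
        if not collapsed or d != collapsed[-1]:
            collapsed.append(d)
    code = "".join(d for d in collapsed if d != "0")  # remove the zeros
    return (name[0] + code + "000")[:4]    # pad or cut to first letter + 3 digits

print(soundex("Maid"), soundex("Made"))  # both yield M300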

Letter                    Digit
A, E, I, O, U, H, W, Y    0
B, F, P, V                1
C, G, J, K, Q, S, X, Z    2
D, T                      3
L                         4
M, N                      5
R                         6

Table 2.1: The rules for converting letters into digits as they are used in the Soundex algorithm.

The longest common substring [23] method iteratively finds and removes the longest substring of minimum length l that two strings have in common, until no more substrings can be found. The final similarity can then be calculated by taking the length of all the common substrings divided by either the maximum, minimum or average length of the original strings.

A slightly different approach by Monge and Elkan [50] uses a string metric, such as any of the ones discussed above, in a recursive matching scheme in order to determine the similarity between strings. Strings s and t are first broken into substrings s = s_1 ... s_K and t = t_1 ... t_K, after which the similarity is defined as

\text{Monge-Elkan}(s, t) = \frac{1}{K} \sum_{i=1}^{K} \max_{j=1}^{K} sim'(s_i, t_j)    (2.4)

where sim'(s_i, t_j) denotes the similarity score between substrings s_i and t_j as assigned by a secondary string metric.
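A sketch of equation (2.4), assuming whitespace tokenization into substrings and reusing the jaro() function from the sketch above as the secondary metric:

def monge_elkan(s, t, sim=jaro):
    s_subs, t_subs = s.split(), t.split()
    # for every substring of s, keep the score of its best match in t, then average
    return sum(max(sim(si, tj) for tj in t_subs) for si in s_subs) / len(s_subs)

print(monge_elkan("john d doe", "doe john"))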

Christen [11] provides an extensive comparison of these and other string metrics on 4 different test sets of given names, surnames and full names. He found that it is important to know beforehand the structure of the names to be matched, and whether they have been parsed and standardized. He also found that Jaro-Winkler similarity performed best in a comparison of 27 different string metrics. Furthermore, he reached the following conclusions: (1) phonetic encodings should not be used, since they are outperformed by all other techniques; (2) Jaro and Jaro-Winkler similarity perform well for given names and surnames if the names are parsed into separate fields; (3) longest common substring is useful when the names might contain swapped words; (4) the Winkler modification can be used with every technique to improve the quality of the matching; (5) the selection of a proper threshold is the biggest problem for most matching techniques; and (6) the fastest techniques are the ones that have a time complexity linear in the length of the strings. Cohen and Fienberg [13] evaluated several string metrics on 13 different test sets, concluding that the Monge-Elkan distance achieved the best performance of all the string metrics. The Jaro-Winkler metric proved to be a fast heuristic scheme, achieving almost the same performance as Monge-Elkan whilst being considerably less complex in nature.


2.2 Authorship Attribution

Other approaches to resolving aliases and disambiguating authors can be found in the field of Authorship Attribution. The authorship attribution task can be described as follows: given a set of candidate authors and a set of documents written by each of these authors, try to determine which of these candidates wrote a given anonymous document. In the traditional authorship attribution problem, the number of candidate authors is typically small (2-10), the number of documents per author is large, and the length of these documents is large. Moreover, it is assumed that the author of the anonymous document is actually in the candidate set, i.e., there is a closed candidate set. A good example of a traditional authorship attribution problem is to determine the author of a disputed literary work, such as some of Shakespeare's plays.

Authorship attribution techniques can be very useful in resolving aliases and determining authorship. An authorship attribution system can be trained to distinguish between different authors in an email data set. For a given author, it is possible to determine if an alias is being used by letting the authorship attribution system predict which author's writing style most closely resembles the given author's writing style.

In the remainder of this section, the different techniques that have been employed in authorship attribution problems will be explained, as well as important design choices that have to be made. These include the choice of a feature set, a feature selection technique, the actual attribution technique, and whether to treat the problem from an instance-based or a profile-based perspective.

2.2.1 Instance vs. profile-based

A general distinction can be made between techniques that treat each email individually (instance-based) and techniques that accumulate all the emails per author (profile-based). The first approach treats each email from a given author as a single training instance, and thereby retains differences in texts from the same author. The second approach accumulates all the texts from a given author into one big training file, creating a profile of one author and disregarding differences in each individual text. The choice whether to model the general style of each author or the individual style of each document is mostly philosophical [63].

2.2.2 Features

An important design choice in authorship attribution systems is the choice of feature set. Features are the specific writing-style attributes, predefined by the researcher, that are extracted from a piece of text in order to capture stylistic information that is characteristic for a particular author. Since the choice of feature set can affect the performance of the authorship attribution in various ways, it is important to consider which features to include or exclude. In general, a distinction can be made between lexical, syntactic, structural, semantic and content-specific features. These features will be discussed in that order in the following sections.

Lexical features

Lexical features are the features that are derived at the character and word level of the text, and are the most commonly used features. These features are considered language-independent, since they do not need any prior language-dependent processing before they can be applied to a text. Character frequencies, word-length distributions, frequency of digits and non-alphanumeric characters, and the total number of words are all examples of lexical features that provide useful information.

An easy-to-use lexical feature that is also computationally simple is character n-grams. For example, the character 4-grams that can be extracted from the phrase "The dog" are "the ", "he d", "e do" and " dog". Character n-grams can capture various writing-style markers from a text, such as capitalization or UK/US variants of certain words. Even if a word is incorrectly spelled, most of the n-grams extracted from the correct and incorrect variant will match, although a misspelling can also be considered as a style marker for a particular author. An advantage of character n-grams is that they do not need tokenization before they can be applied to a text, which is very useful in Asian or Arabic languages where tokenization is difficult.
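A one-line sketch of character n-gram extraction (lowercasing first, which matches the example above):

def char_ngrams(text, n=4):
    # all overlapping character sequences of length n
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("The dog".lower()))  # ['the ', 'he d', 'e do', ' dog']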

More complicated lexical features require the detection of word and sentence boundaries in the text. By using common Natural Language Processing (NLP) tools, a text can be broken up into its constituent parts during a process called tokenization. A token can be a single word, phrase, symbol or other meaningful element. After counting the occurrences of each distinct token, the n most frequently occurring tokens can be used as features, since the tokens that occur most frequently are considered to contain the most useful discriminatory information.

Another set of features that can be derived from tokenization is the frequency of different word lengths. These features provide information on how often a particular author uses words of different lengths. Vocabulary richness measures are a subset of lexical features that are derived from word frequencies. Hapax legomena is the number of words that occur once in a text, whereas hapax dislegomena is the number of words that occur twice. The number of hapax legomena and hapax dislegomena gives an indication of how rich the vocabulary of a certain author is. Authors that have a larger vocabulary will have a higher count of once- or twice-occurring words than authors with a small vocabulary. The type-token ratio V/N is the number of unique tokens V divided by the total number of tokens in a text N, and gives another indication of vocabulary richness. Numerous vocabulary richness measures have been created based on word frequencies, of which the most well known are:


• Yule's K [69]:

K = 10^4 \cdot \left[ -\frac{1}{N} + \sum_{i} V(i, N) \left( \frac{i}{N} \right)^2 \right]    (2.5)

where V(i, N) is the number of words occurring i times in the text of length N.

• Sichel's S [59]:

S = \frac{V(2, N)}{V(N)}    (2.6)

where V(N) is the vocabulary size and V(2, N) the number of twice-occurring words.

• Brunet's W [7]:

W = N^{V(N)^{-a}}    (2.7)

where N is the number of words and a is usually set to 0.172.

• Honoré's R [26]:

H = 100 \cdot \frac{\log N}{1 - V(1, N)/V(N)}    (2.8)
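A sketch computing these four measures from a token list (equations (2.5)-(2.8)); note that Honoré's measure is undefined when every word is a hapax legomenon:

from collections import Counter
import math

def vocabulary_richness(tokens):
    N = len(tokens)
    freq = Counter(tokens)               # word -> number of occurrences
    V = len(freq)                        # vocabulary size V(N)
    spectrum = Counter(freq.values())    # i -> V(i, N)
    yule_k = 1e4 * (-1.0 / N + sum(v * (i / N) ** 2 for i, v in spectrum.items()))
    sichel_s = spectrum.get(2, 0) / V
    brunet_w = N ** (V ** -0.172)
    honore_r = 100 * math.log(N) / (1 - spectrum.get(1, 0) / V)  # fails if all words occur once
    return yule_k, sichel_s, brunet_w, honore_r

print(vocabulary_richness("the dog saw the cat and the cat saw it".split()))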

Furthermore, smileys [64], abbreviations [62], slang words [36] and even spelling errors can be used as stylometric features to distinguish between authors. For example, Koppel [37] tries to simulate the way a human expert discriminates between authors by using stylistic idiosyncrasies, such as misspellings, to fingerprint a particular author. A disadvantage of that approach is that discriminating between authors is hard when there are few idiosyncrasies present in the text.

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing the syntactic constructions that an author uses. The underlying idea is that every author unconsciously uses more or less the same syntactical patterns in each text. Syntactic features rely on the accuracy and availability of NLP tools in different languages in order to tokenize and apply part-of-speech (POS) tags.

A common method is to analyze the short sequences of POS tags that occur most frequently throughout an author's work. For example, the sentence "The man walked in the park" can be represented as

| DETERMINER | NOUN | VERB PAST TENSE | PREPOSITION | DETERMINER | NOUN |

where token starts and ends are delimited by a |. The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS tags can be found in Solorio and Pillay [62], who use POS-tag uni-grams, bi-grams and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactical parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64], or the distance covered by the dependency links in the parse tree.

Function words were introduced by Mosteller and Wallace [51] to distinguish between different authors. Function words are words that have almost no lexical meaning, but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change in short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong", the function words are "The", "of" and "the". They signify the grammatical structure of the sentence without containing any meaning; they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features.

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structures. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks, and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al. [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks. However, they experienced difficulties with extracting these features. Another example of content-specific structural features is the use of HTML tags by de Vel et al. [17]. They found that some email programs used HTML formatting for their emails, and included the frequency of different HTML tags in their feature set.

Semantic features

The most complex set of features are the semantic features. These features require sophisticated NLP techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al. [64], who use UK/US spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity types, such as date, location, money, number, ordinal, organization, percent, person and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word. Function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features. In such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection has to deal with an additional problem, in the sense that the final feature set might over-fit the training data. The value of feature selection methods is therefore ambiguous, and they can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author, and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

\text{Entropy} = -\sum_{x \in X} P(x) \log P(x)    (2.9)

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best, and can be used instead of the full feature set.
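A small sketch of equation (2.9) and of information gain for a single discrete feature; the split-by-feature-value formulation is an assumption on our part, as the thesis itself only defines the entropy term:

import math
from collections import Counter

def entropy(labels):
    N = len(labels)
    return -sum((c / N) * math.log2(c / N) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    # expected reduction in class entropy when partitioning on the feature
    N = len(labels)
    groups = {}
    for y, x in zip(labels, feature_values):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(ys) / N * entropy(ys) for ys in groups.values())
    return entropy(labels) - conditional

print(information_gain(["A", "A", "B", "B"], [1, 1, 0, 0]))  # 1.0: a perfectly separating feature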

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "the metric measures the horizontal separation between two standard Normal curves, where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components, called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the data whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data. Hence, they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set, he still manages to assert with confidence that "The Royal Book of Oz" was written by Ruth Plumly Thompson; Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.
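A sketch of the variance-retaining variant with scikit-learn (assumed to be available); the random matrix stands in for a real stylometric feature matrix:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((200, 50))           # 200 documents x 50 stylometric features (toy data)
pca = PCA(n_components=0.95)        # a float selects enough components for 95% of the variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)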

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents, or between two author profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts, and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations \vec{V}(s) and \vec{V}(t), the cosine similarity is defined as

\text{Cosine}(s, t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)| \, |\vec{V}(t)|}    (2.10)

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10,000. In a later study, Koppel et al. [40] report that 46% of 1000 blog extracts are classified correctly using only the cosine similarity.
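A sketch of equation (2.10) over simple bag-of-words vectors; any feature vector (tf-idf, style markers) plugs in the same way:

import math
from collections import Counter

def cosine(doc_s, doc_t):
    vs, vt = Counter(doc_s.split()), Counter(doc_t.split())
    dot = sum(vs[w] * vt[w] for w in vs.keys() & vt.keys())
    norm = math.sqrt(sum(v * v for v in vs.values())) * math.sqrt(sum(v * v for v in vt.values()))
    return dot / norm if norm else 0.0

print(round(cosine("the dog barks", "the dog sleeps"), 3))  # 0.667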

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural and domain-specific features from the texts of candidate authors, and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1 = [0, 0.25], A2 = [0.25, 0.50], A3 = [0.50, 0.75] and A4 = [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns, and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint, to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.

Figure 2.1: The structure of a supervised authorship attribution system.

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature, and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered as the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author.


His most important finding was that word-length distributions tend to remain the same across different works for a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejects the claim of some literary scholars that Shakespeare and Bacon are the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers: a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naive Bayes probabilistic model to the frequency of these function words, and found that all documents were written by Madison. The Naive Bayes model quantifies the idea by Mendenhall, by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, ..., x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the document can be expressed by

P(A_i \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid A_i) \, P(A_i)    (2.11)

The real author is then calculated using

A^* = \arg\max_{A_i} P(A_i \mid x_1, \ldots, x_n)    (2.12)

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words from a text against a reference corpus (a large contemporary corpus on which the mean and standard deviation of these 30 words are computed), and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution for large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
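A sketch of Burrows' Delta with NumPy; corpus_mean and corpus_std are assumed to be precomputed over the reference corpus for the same n most frequent words:

import numpy as np

def burrows_delta(known_freqs, unknown_freqs, corpus_mean, corpus_std):
    # all arrays hold relative frequencies of the same n most-frequent words
    z_known = (known_freqs - corpus_mean) / corpus_std
    z_unknown = (unknown_freqs - corpus_mean) / corpus_std
    return float(np.mean(np.abs(z_known - z_unknown)))  # lower Delta = more likely same author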

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e., the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier were tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest possible margin. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in a very fast classification. An example of how SVM maximizes the margin of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and on what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and separates the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel, and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations of the binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-one using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.
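A brief sketch of training a multi-class SVM on stylometric feature vectors with scikit-learn (assumed available); the random matrices are stand-ins for real features such as those in table 3.4:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_train = rng.random((80, 20))       # 80 emails x 20 writing-style features (toy data)
y_train = rng.integers(0, 4, 80)     # labels for 4 candidate authors
clf = SVC(kernel="rbf")              # RBF kernel; scikit-learn handles multi-class input
clf.fit(X_train, y_train)            # internally via combinations of binary classifiers
print(clf.predict(rng.random((1, 20))))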

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain, and can be used to predict authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes, and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training & validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e., to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then there is an edge e_{v_i v_j} ∈ W if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i v_j} ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, is when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. The co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

\text{Co-citation}(v_i, v_j) = |N(v_i) \cap N(v_j)|    (2.13)

In Graph Theory, this is often known as the shared-neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases, because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

\text{Jaccard}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}    (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random guessing in predicting where new links will form. However, it sometimes performs on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
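A sketch of equation (2.14) on plain Python sets, with the neighborhoods given as sets of correspondent addresses:

def jaccard(neighbors_i, neighbors_j):
    # neighborhoods: the sets of addresses each candidate corresponds with
    union = neighbors_i | neighbors_j
    return len(neighbors_i & neighbors_j) / len(union) if union else 0.0

print(jaccard({"a@x.com", "b@x.com", "c@x.com"}, {"b@x.com", "c@x.com", "d@x.com"}))  # 0.5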

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively. An individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

\text{SimRank}(v_i, v_j) = \frac{C}{|I(v_i)| \, |I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} \text{SimRank}(I_x(v_i), I_y(v_j))    (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iterating to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
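A naive sketch of the fixed-point iteration for equation (2.15), assuming the graph is given as a dict mapping each node to its set of in-going neighbors; C = 0.8 is a commonly used value, not one prescribed by the thesis:

import itertools

def simrank(in_neighbors, C=0.8, iterations=10):
    nodes = list(in_neighbors)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u, v in itertools.product(nodes, nodes):
            Iu, Iv = in_neighbors[u], in_neighbors[v]
            if u == v:
                new[(u, v)] = 1.0
            elif not Iu or not Iv:
                new[(u, v)] = 0.0  # no in-neighbors means no evidence of similarity
            else:
                total = sum(sim[(x, y)] for x in Iu for y in Iv)
                new[(u, v)] = C / (len(Iu) * len(Iv)) * total
        sim = new
    return sim

graph = {"A": {"C"}, "B": {"C"}, "C": set()}  # A and B share in-neighbor C
print(simrank(graph)[("A", "B")])             # 0.8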

PageSim [42] is another extension of the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

\text{ConnectedPath}(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{\text{length}(p)}    (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length at most r. U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

U(p) = \sum_{v_x \in path(v_i, v_j), \, v_x \notin \{v_i, v_j\}} UQ(v_x)    (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|}    (2.18)

where w_{x,g} denotes an edge between v_x ∈ path(v_i, v_j) and any other vertex v_g ∈ V, and w_{x,x+1} and w_{x,x-1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = αs_i + βs_j + γs_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j and k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm described in section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.
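The linear combination itself is a one-liner; the weights below are purely illustrative placeholders, not values used in this thesis:

def combined_score(s_string, s_author, s_link, alpha=0.3, beta=0.4, gamma=0.3):
    # each input score is assumed to be normalized to [0, 1]
    return alpha * s_string + beta * s_author + gamma * s_link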

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets. One data set was manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e., the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g., a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias           false alias
retrieved        true positives (tp)     false positives (fp)
not retrieved    false negatives (fn)    true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table, such as the one shown in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

\text{Accuracy} = \frac{\text{correct classifications}}{\text{total number of classifications}} = \frac{tp + tn}{tp + fp + fn + tn}    (2.19)

Although it looks like a good measure of performance, it is not hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can also be defined as

P = \frac{| \text{retrieved aliases} \cap \text{correct aliases} |}{| \text{retrieved aliases} |} = \frac{tp}{tp + fp}    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

R = \frac{| \text{retrieved aliases} \cap \text{correct aliases} |}{| \text{total correct aliases} |} = \frac{tp}{tp + fn}    (2.21)

These two measures are not as dependent on the class distributions as the accuracy measure. Therefore, they are a more sensible choice to use in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user that wants to automate the complete process and rely greatly on the classification given by the system will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as

F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}    (2.23)
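A small sketch of equations (2.20)-(2.23) with α = 0.5, computed straight from the contingency counts of table 2.2:

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0       # equation (2.20)
    r = tp / (tp + fn) if tp + fn else 0.0       # equation (2.21)
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # equation (2.23)
    return p, r, f1

print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.667, 0.727)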

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].

Averaging the precision and recall scores of different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed and the global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46], and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that, when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of overfitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques.

Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in Section 3.2, whereas the different combinations of techniques are dealt with in Section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in Figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender has written it, except for the forward and reply parts; for text stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


Subject: SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer when organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "noaddress@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step   Records affected   Percentage removed (cum.)
1            17,052                 6.70%
3            13,681                12.00%
4            26,223                22.50%
5             4,001                24.00%
6            25,990                34.00%
7             3,700                35.80%
8            52,163                56.50%

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.
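Purely as an illustration of steps 4 to 6, the following Python/pandas sketch applies the same filters to a toy table; the actual pipeline ran against the Microsoft SQL database, and the column names used here are assumptions:

import pandas as pd

# Toy stand-in for the messages table; column names are assumptions.
emails = pd.DataFrame({
    "sender":    ["a@enron.com", "a@enron.com", "b@enron.com"],
    "receiver":  ["b@enron.com", "b@enron.com", "a@enron.com"],
    "subject":   ["hi", "hi", "re: hi"],
    "body":      ["thanks", "thanks", "a much longer reply " * 30],
    "send_date": ["2000-12-12", "2000-12-12", "2000-12-13"],
})

# Step 4: remove messages containing ten words or fewer.
emails = emails[emails["body"].str.split().str.len() > 10]

# Step 5: remove authors whose total word count is 100 or less.
total_words = emails.groupby("sender")["body"].apply(
    lambda b: b.str.split().str.len().sum())
emails = emails[emails["sender"].map(total_words) > 100]

# Step 6: treat messages agreeing on all five fields as duplicates.
emails = emails.drop_duplicates(
    subset=["sender", "receiver", "body", "send_date", "subject"])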

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set, the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44,912 messages by 246 different senders; for each message, the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors.


[Figure: 10-fold cross-validation accuracy (0.5–1.0) plotted against the number of training instances per class (20–200), for linear and RBF kernels.]

Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels for the authorship SVM.

Figure 3.4 provides an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it reflects the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set and their emails were split up over several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


[Figure: histogram; x-axis: number of emails (90–230), y-axis: number of authors.]

Figure 3.3: The distribution of email messages per author.

[Figure: histogram; x-axis: total number of words (logarithmic scale), y-axis: number of authors.]

Figure 3.4: The distribution of the total number of words per author.


[Figure: network graph of sender nodes, colored by degree; node labels are the (partly artificial) author addresses.]

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of alias                       Number of authors
High Jaro-Winkler with 1 alias                     26
High Jaro-Winkler with 2 aliases                   15
Low Jaro-Winkler with 1 alias                      11
Low Jaro-Winkler with 2 aliases                     1
No alias                                          193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed   Hard
High Jaro-Winkler        6      2
Low Jaro-Winkler         8     16
No alias                 6      2

Table 3.3: Distribution of alias types in the two test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. johndoe@enron.comA & johndoe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.
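A minimal sketch of how such an artificial split can be generated; the alias-naming scheme shown here (suffixing the address) is an illustrative assumption for the high-similarity case, not the exact procedure used:

import random

def split_author(address, message_ids, n_aliases):
    # Distribute an author's messages at random over the original
    # identity plus n_aliases artificial aliases.
    identities = [address] + [address + "." + chr(ord("A") + i)
                              for i in range(n_aliases)]
    return {m: random.choice(identities) for m in message_ids}

# Hypothetical usage: one author with a single high-similarity alias.
print(split_author("john.doe@enron.com", range(1, 6), 1))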

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
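A minimal sketch of this pairwise screening, assuming the third-party jellyfish package for the Jaro-Winkler implementation:

from itertools import combinations
from jellyfish import jaro_winkler_similarity  # assumed available

def jw_alias_candidates(addresses, threshold=0.94):
    # Flag every author pair whose addresses score above the threshold.
    return [(a, b) for a, b in combinations(addresses, 2)
            if jaro_winkler_similarity(a, b) >= threshold]

print(jw_alias_candidates(["john.doe@enron.com", "jhon.doe@enron.com"]))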

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores are in the range [0, 1], the score for a particular author-author pair was calculated as follows:

\[
\text{ConnectedPath}_{\text{norm}}(v_i, v_j) = \frac{\text{ConnectedPath}(v_i, v_j)}{\text{ConnectedPath}_{\max}} \tag{3.1}
\]

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.
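In code, this normalization amounts to a single division; the sketch below assumes the raw Connected Path scores have already been computed into a dictionary keyed by author pair:

def normalize_connected_path(raw_scores):
    # Scale raw Connected Path scores into [0, 1] per equation (3.1).
    max_score = max(raw_scores.values())
    return {pair: score / max_score for pair, score in raw_scores.items()}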

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in Section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
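The Jaccard computation itself is a one-liner over the sets of direct correspondents; a sketch:

def jaccard(neighbors_a, neighbors_b):
    # Overlap of two authors' sets of direct correspondents.
    union = neighbors_a | neighbors_b
    return len(neighbors_a & neighbors_b) / len(union) if union else 0.0

print(jaccard({"x@enron.com", "y@enron.com"}, {"y@enron.com", "z@enron.com"}))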

The last individual technique that has been evaluated is the use of an SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences between individual email messages. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4. The list of function words used in the feature set can be found in the appendix.
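As an illustration only, the sketch below computes a handful of the lexical features from Table 3.4 for one message body; the tokenization shown is an assumption, since the exact rules are not specified here:

import re

def lexical_features(text):
    # A small subset of the lexical features of Table 3.4.
    c = len(text)                                  # feature 1: characters (C)
    words = re.findall(r"[A-Za-z']+", text)
    m = len(words)                                 # feature 54: words (M)
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return {
        "upper_ratio": sum(ch.isupper() for ch in text) / c if c else 0.0,
        "digit_ratio": sum(ch.isdigit() for ch in text) / c if c else 0.0,
        "short_word_ratio": sum(len(w) < 4 for w in words) / m if m else 0.0,
        "avg_word_length": sum(map(len, words)) / m if m else 0.0,
        "hapax_legomena": sum(1 for n in counts.values() if n == 1),
    }

print(lexical_features("Thank you very much. We will give it a try."))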

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function (RBF) kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^{-5}, 2^{-3}, ..., 2^{15} and γ = 2^{-15}, 2^{-13}, ..., 2^{3} is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.
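A sketch of this grid search, with scikit-learn standing in for the software actually used; the data here is synthetic and only the parameter grids follow the text:

import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((40, 492))          # stand-in for 492-dim feature vectors
y_train = np.array([0, 1] * 20)          # author / not-author labels

param_grid = {"C": 2.0 ** np.arange(-5, 17, 2),      # 2^-5, 2^-3, ..., 2^15
              "gamma": 2.0 ** np.arange(-15, 5, 2)}  # 2^-15, ..., 2^3
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)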


Features     Description

Lexical
1            Total number of characters (C)
2            Total number of alphabetic characters / C
3            Total number of upper-case characters / C
4            Total number of digit characters / C
5            Total number of white-space characters / C
6            Total number of tab spaces / C
7-32         Frequency of letters A-Z
33-53        Frequency of special characters: ~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |
54           Total number of words (M)
55           Total number of short words (less than four characters) / M
56           Total number of characters in words / C
57           Average word length
58           Average sentence length (in characters)
59           Average sentence length (in words)
60           Total different words / M
61           Hapax legomena: frequency of once-occurring words
62           Hapax dislegomena: frequency of twice-occurring words
63-82        Word length frequency distribution / M
83-333       TF-IDF of the 250 most frequent 3-grams

Syntactic
334-341      Frequency of punctuation: , . ? ! : ; ' "
342-491      Frequency of function words

Structural
492          Total number of sentences

Table 3.4: Feature set for the authorship SVM.



The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from the other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.
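The one-versus-all training scheme, including the balanced negative sampling described above, can be sketched as follows (scikit-learn again as a stand-in for SVM.NET; "vectors_by_author" is assumed to map each author to the feature vectors of his emails):

import random
from sklearn.svm import SVC

def train_one_vs_all(vectors_by_author):
    # One binary RBF-SVM per author: the author's emails as positives,
    # an equally sized random sample from all other authors as negatives.
    models = {}
    for author, own in vectors_by_author.items():
        others = [v for a, vs in vectors_by_author.items()
                  if a != author for v in vs]
        negatives = random.sample(others, min(len(own), len(others)))
        X = own + negatives
        y = [1] * len(own) + [0] * len(negatives)
        models[author] = SVC(kernel="rbf", probability=True).fit(X, y)
    return models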

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction of whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

[Figure: diagram of the combined approach.]

Figure 3.6: The structure of the combined approach.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used for testing are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
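Condensed to code, the combination step reduces to training an SVM on three-dimensional score vectors; all values below are hypothetical:

import numpy as np
from sklearn.svm import SVC

# One row per manually labelled candidate: [JW, link, authorship-SVM] scores.
X_vote = np.array([[0.96, 0.61, 0.83],   # labelled as alias
                   [0.91, 0.40, 0.77],   # labelled as alias
                   [0.41, 0.05, 0.22],   # labelled as non-alias
                   [0.35, 0.12, 0.30],
                   [0.55, 0.02, 0.18],
                   [0.20, 0.07, 0.25]])
y_vote = np.array([1, 1, 0, 0, 0, 0])

voting_svm = SVC(kernel="rbf").fit(X_vote, y_vote)
# A candidate is flagged as an alias when its decision value exceeds the
# chosen decision threshold.
print(voting_svm.decision_function([[0.90, 0.50, 0.70]]))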

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Figure: precision, recall and F1 plotted against decision thresholds from 0 to 1; panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set.

[Figure: precision, recall and F1 plotted against decision thresholds from 0 to 1; panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set.

[Figure: precision, recall and F1 plotted against decision thresholds from 0 to 1; panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set.

[Figure: precision, recall and F1 plotted against decision thresholds from 0 to 1; panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set.

Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.

Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler. The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path. It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2; it is expected that the same behavior of Connected Path could have been observed on this data set, had the search been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard. Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM. The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques. The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases; no such collection could be found for use in this research, and should one indeed not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results of the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who wrote the 15th Book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, É. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the way through: Testing for authorship in different frequency strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A comparison of personal name matching: Techniques and practical issues. In Sixth IEEE International Conference on Data Mining – Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which is the best multiclass SVM method? An empirical study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International SIGIR Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp

[22] Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), pages 611–es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias detection in link data sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: A measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto meta-classifier for authorship identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship attribution with thousands of candidate authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: A novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution / Schaalbaarheid bij Auteursherkenning. PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship attribution and verification with many authors and limited data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The characteristic curves of composition. Science, 9(214S):237–246.

[48] Mendenhall, T. C. (1901). A mechanical solution of a literary problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A lexical database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron email dataset: Database schema and brief statistical report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word frequency distributions and type-token characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An empirical study of category skew on feature selection for text categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence: Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship identification with modality specific meta features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The state of record linkage and current research problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and scalable authorship attribution using function words. Information Retrieval Technology, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, I, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


Page 4: Thesis Freek Maes - Final Version

Contents

List of Figures 1

List of Tables 2

1 Introduction 311 Structure of the thesis 4

2 Literature Review 521 String metrics 5

211 Techniques 622 Authorship Attribution 8

221 Instance vs profile-based 8222 Features 8223 Feature Selection 12224 Techniques 13

23 Link analysis 21231 Techniques 21

24 Combining Approaches 2425 Evaluation measures 2626 Conclusion 27

3 Methods 2931 ENRON Corpus 2932 Individual Techniques 3533 Combinations of Techniques 38

4 Results 40

5 Discussion 4651 Conclusion 4752 Future Recommendations 48

6 Bibliography 50

Appendix 56

List of Figures

21 The structure of a supervised authorship attribution system 1522 Example of a decision tree 1723 Linear Separation using Support Vector Machines 1824 Mapping of feature space for SVM using RBF-kernel 1925 Example of an Artificial Neural Network 2026 Connected Triples in a link network 2227 Connected Path similarity in a link network 24

31 Information extracted from an email in the ENRON data set 3032 Evaluation of different kernels and training sizes for SVM 3233 Distribution of email messages per author 3334 Distribution of total number of words per author 3335 Network graph of the authors in the ENRON subset 3436 Structure of the combined approach 39

41 Performance of individual techniques on the mixed test set 4242 Performance of combined techniques on the mixed test set 4243 Performance of individual techniques on the hard test set 4344 Performance of combined techniques on the hard test set 4345 Best performance of different techniques on the mixed test set 4446 Best performance of different techniques on the hard test set 45

1

List of Tables

21 Soundex algorithm rules 722 Contingency table for evaluation 26

31 Preprocessing steps applied to the ENRON corpus 3132 Artificial Aliases in the ENRON data set by type 3533 Distribution of alias-types in two different test sets 3534 Feature set for the authorship SVM 37

2

Chapter 1

Introduction

Authorship disambiguation and alias resolution are increasingly important con-cepts in domains such as intelligence and law where email collections may con-tain authors that use one or more aliases Aliases occur when a person usesmultiple email addresses for either intentional or unintentional reasons For ex-ample people can try to hide their identity by intentionally adopting severaldifferent email addresses something that is common in intelligence data setssuch as terrorist networks On the other hand the use of different email ad-dresses (home office etc) is becoming common nowadays Hence there alsoexist many unintentional aliases where only the domain of the email address isdifferent or where a simple misspelling of a name has occurred

Various approaches have been applied successfully to resolve aliases in emaildata sets although each has its own shortcomings Unintentional aliases canbe resolved by employing metrics that indicate how much two email addresseslook alike However these metrics are easily fooled by persons using completelydifferent email addresses Another approach focuses on the content of the emailby creating a profile of an authorrsquos writing style By comparing the writingstyle of different authors and finding those that employ similar writing stylesaliases that are more complex can be detected This approach has been appliedsuccessfully to attribute authorship of disputed literary works However itencounters scalability issues when the number of authors grows large or thelength of the texts grows small as is the case in email data sets A thirdapproach makes use of the fact that even if an author use a completely differentemail address and writing style the people with whom he corresponds via emailmight remain stable The similarity between different authorsrsquo email contactscan be determined using link analysis techniques These techniques achievereasonable results and sometimes manage to find aliases that other techniquesdo not find

3

The three approaches mentioned above operate on different domains namelythe email address the content of the email and the email network Finding away to combine these approaches and utilize their combined strengths mightenable us to overcome their individual weaknesses In order to guide the re-search that has been conducted for this thesis three research questions havebeen formulated

1 Which authorship disambiguation and alias resolution techniques existthat can be used on email data

2 How can techniques from different domains be combined

3 Can a combination of techniques from different domains increase perfor-mance over individual techniques

11 Structure of the thesis

The structure of the remaining parts of this thesis is as follows

bull Chapter 2 introduces multiple techniques from the fields of AuthorshipDisambiguation and Alias Resolution Specifically string metrics will beexplained in section 21 authorship attribution systems in section 22 andlink analysis techniques in section 23 Several ways of combining thesetechniques as well as different measures for performance evaluation willbe discussed in sections 24 and 25

bull Chapter 3 outlines the methodology that has been used in order to conductthe experiments The email corpus that has been used will be describedas well as the preprocessing that has been applied to it Furthermore thetechniques that have been chosen for evaluation in the experiments willbe explained

bull Chapter 4 will present in detail the results of the experiments that havebeen conducted

bull Finally Chapter 5 provides a summary and discussion of the obtainedresults as well as recommendations for the future

4

Chapter 2

Literature Review

In this chapter a review of relevant literature from the fields of AuthorshipDisambiguation and Alias Resolution will be given The first section will explaindifferent string metrics that have successfully been applied to resolve superficialaliases and authorship problems In the second section authorship attributiontechniques that can be used to resolve the question of authorship in generalwill be discussed Moreover the various design choices that have to be madewhen creating an authorship attribution system will be explained The thirdsection will deal with techniques from Link Analysis that use the network inwhich emails reside to discover aliases In the fourth section several ways ofcombining these techniques will be discussed The last section will introduceseveral measures that can be used for evaluating the performance of differenttechniques

2.1 String metrics

String similarity metrics are a class of functions that map two strings to a real number, where the higher the value of this number, the greater the similarity between the two strings. Many string metrics use the number of operations that are required to transform one string into another in order to calculate the similarity between the two. Possible operations include insertion, deletion, substitution, and transposition. A different class of string metrics is the phonetic encodings, in which strings are converted into codes according to how they are pronounced. However, these encodings are language dependent and are not available for many languages.

String metrics do not take into account information regarding the context in which the strings occur. As such, they can be considered rather simple approaches to resolving aliases or settling authorship disputes. However, string metrics can be very useful for detecting misspellings of email aliases resulting from the use of different email domains or naming conventions. For example, they can easily detect the similarity between "john.doe@domain.com" and "jhon.doe@domain.com". They are less useful when people deliberately try to hide their identity by using completely different email addresses.

2.1.1 Techniques

In this section the most commonly used string metrics will be discussed.

The Levenshtein distance [52], often referred to as edit distance, is one of the earliest and most used string distances. It is defined as the minimum number of operations required to transform string s into string t. Each operation has a cost of 1, and the allowed operations are insertion, deletion, and substitution of a character. The Levenshtein distance can be transformed into a similarity metric by using

\mathrm{similarity}(s, t) = \frac{1}{\mathrm{Levenshtein}(s, t) + 1} \qquad (2.1)
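To make the computation concrete, the following is a minimal sketch in Python (an illustration, not code from this thesis) of the Levenshtein distance via dynamic programming, together with the similarity transform of equation (2.1):

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string s into string t."""
    # previous[j] holds the edit distance between the first i-1
    # characters of s and the first j characters of t.
    previous = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        current = [i]
        for j, tc in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (sc != tc),  # substitution (cost 0 if equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(s, t):
    """The similarity transform of equation (2.1)."""
    return 1.0 / (levenshtein(s, t) + 1)


print(levenshtein("kitten", "sitting"))             # 3
print(levenshtein_similarity("kitten", "sitting"))  # 0.25
```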

The Jaro similarity [32] algorithm uses the number of transpositions T and the number of matching characters m in order to determine the similarity between two strings. Two characters are matching only if they are no farther apart than half the length of the longest string. The number of transpositions is defined as the number of matching characters in different sequence orders, divided by two. The similarity is then calculated as follows:

\mathrm{Jaro}(s, t) = \frac{1}{3}\left(\frac{m}{|s|} + \frac{m}{|t|} + \frac{m - T}{m}\right) \qquad (2.2)

where |s| denotes the length of string s.

The Jaro-Winkler similarity [67] is an extension of the Jaro algorithm, using the empirical finding by Winkler that fewer errors tend to occur at the start of strings. The similarity is calculated as follows, where p is the length of the prefix that the two strings share:

\mathrm{Jaro\text{-}Winkler}(s, t) = \mathrm{Jaro}(s, t) + \frac{p}{10}\left(1.0 - \mathrm{Jaro}(s, t)\right) \qquad (2.3)
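The sketch below (illustrative, under the definitions above) computes the Jaro and Jaro-Winkler similarities; the shared-prefix length p is capped at 4 characters, as in Winkler's original formulation:

```python
def jaro(s, t):
    """Jaro similarity as in equation (2.2)."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match if equal and no farther apart than the window.
    window = max(len(s), len(t)) // 2 - 1
    s_flags, t_flags = [False] * len(s), [False] * len(t)
    m = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_flags[j] and t[j] == ch:
                s_flags[i] = t_flags[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # T = matched characters that are out of sequence, divided by two.
    s_matched = [s[i] for i in range(len(s)) if s_flags[i]]
    t_matched = [t[j] for j in range(len(t)) if t_flags[j]]
    T = sum(a != b for a, b in zip(s_matched, t_matched)) / 2
    return (m / len(s) + m / len(t) + (m - T) / m) / 3


def jaro_winkler(s, t):
    """Jaro-Winkler similarity as in equation (2.3); the shared
    prefix length p is conventionally capped at 4 characters."""
    j = jaro(s, t)
    p = 0
    for a, b in zip(s, t):
        if a != b or p == 4:
            break
        p += 1
    return j + (p / 10.0) * (1.0 - j)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```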

The Soundex algorithm [53] is the most well known and at the same time the oldest phonetic encoding that is used for string matching. Strings are first converted into phonetic codes, after which strings with similar codes are assumed to be highly similar. In order to convert a string into a Soundex code, the first letter of the string is retained, after which the following letters are converted to numbers according to the set of rules shown in table 2.1. In the resulting code, all zeros are removed, as well as multiple sequential occurrences of the same digit. The code is then cut off or extended with zeros such that it has exactly 3 digits. The first letter of the string together with the 3 digits forms the Soundex code. The Soundex algorithm makes use of the fact that strings that are pronounced in a similar fashion tend to have the same Soundex code. For example, "Maid" and "Made" both result in the Soundex code "M300".
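A minimal sketch of the Soundex encoding, assuming the conversion rules of table 2.1 below; refinements in the official specification (such as the special treatment of H and W as separators) are ignored here:

```python
# Conversion rules of table 2.1: letter -> digit.
SOUNDEX_CODES = {
    **dict.fromkeys("AEIOUHWY", "0"),
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name):
    """First letter plus 3 digits; zeros and repeated digits removed."""
    name = name.upper()
    digits = [SOUNDEX_CODES[c] for c in name if c in SOUNDEX_CODES]
    # Collapse runs of the same digit, then drop the zeros.
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    code = [d for d in collapsed[1:] if d != "0"]  # skip the first letter's digit
    return name[0] + "".join(code[:3]).ljust(3, "0")


print(soundex("Maid"), soundex("Made"))  # M300 M300
```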

Letter                      Digit
A, E, I, O, U, H, W, Y      0
B, F, P, V                  1
C, G, J, K, Q, S, X, Z      2
D, T                        3
L                           4
M, N                        5
R                           6

Table 2.1: The rules for converting letters into digits as they are used in the Soundex algorithm.

The longest common substring [23] method iteratively finds and removes the longest substring of minimum length l that two strings have in common, until no more substrings can be found. The final similarity can then be calculated by taking the length of all the common substrings divided by either the maximum, minimum, or average length of the original strings.

A slightly different approach by Monge and Elkan [50] uses a string metric, such as any of the ones discussed above, in a recursive matching scheme in order to determine similarity between strings. Strings s and t are first broken into substrings s = s_1 \ldots s_K and t = t_1 \ldots t_K, after which the similarity is defined as

\mathrm{Monge\text{-}Elkan}(s, t) = \frac{1}{K} \sum_{i=1}^{K} \max_{j=1,\ldots,K} \mathrm{sim}'(s_i, t_j) \qquad (2.4)

where sim'(s_i, t_j) denotes the similarity score between substrings s_i and t_j, as assigned by a secondary string metric.
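As an illustration, the following sketch implements equation (2.4), assuming whitespace tokenization into substrings and reusing the jaro_winkler function from the earlier snippet as the secondary metric sim':

```python
def monge_elkan(s, t, inner_sim=jaro_winkler):
    """Monge-Elkan similarity of equation (2.4): every substring of s
    is scored against its best-matching substring of t."""
    s_parts, t_parts = s.split(), t.split()
    return sum(max(inner_sim(sp, tp) for tp in t_parts)
               for sp in s_parts) / len(s_parts)


# Tolerates swapped and misspelled name parts better than a plain
# character-level comparison of the whole strings.
print(round(monge_elkan("john doe", "doe jhon"), 3))
```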

Christen [11] provides an extensive comparison of these and other string metrics on 4 different test sets of given-, sur-, and full names. He found that it is important to know beforehand the structure of the names to be matched, and whether they have been parsed and standardized. He also found that Jaro-Winkler similarity performed best in a comparison of 27 different string metrics. Furthermore, he reached the following conclusions: (1) phonetic encodings should not be used, since they are outperformed by all other techniques; (2) Jaro and Jaro-Winkler similarity perform well for given- and surnames if the names are parsed into separate fields; (3) longest common substring is useful when the names might contain swapped words; (4) the Winkler modification can be used with every technique to improve the quality of the matching; (5) the selection of a proper threshold is the biggest problem for most matching techniques; and (6) the fastest techniques are the ones that have a time complexity linear in the length of the strings. Cohen and Fienberg [13] evaluated several string metrics on 13 different test sets, concluding that the Monge-Elkan distance achieved the best performance of all the string metrics. The Jaro-Winkler metric proved to be a fast heuristic scheme, achieving almost the same performance as Monge-Elkan whilst being considerably less complex in nature.


2.2 Authorship Attribution

Other approaches to resolving aliases and disambiguating authors can be found in the field of Authorship Attribution. The authorship attribution task can be described as follows: given a set of candidate authors and a set of documents written by each of these authors, try to determine which of these candidates wrote a given anonymous document. In the traditional authorship attribution problem, the number of candidate authors is typically small (2-10), the number of documents per author is large, and the length of these documents is large. Moreover, it is assumed that the author of the anonymous document is actually in the candidate set, i.e., there is a closed candidate set. A good example of a traditional authorship attribution problem is to determine the author of a disputed literary work, such as some of Shakespeare's plays.

Authorship attribution techniques can be very useful in resolving aliases and determining authorship. An authorship attribution system can be trained to distinguish between different authors in an email data set. For a given author, it is possible to determine if an alias is being used by letting the authorship attribution system predict which author's writing style most closely resembles the given author's writing style.

In the remainder of this section, the different techniques that have been employed in authorship attribution problems will be explained, as well as important design choices that have to be made. These include the choice of a feature set, a feature selection technique, the actual attribution technique, and whether to treat the problem from an instance-based perspective or a profile-based perspective.

2.2.1 Instance- vs. profile-based

A general distinction can be made between techniques that treat each email individually (instance-based) and techniques that accumulate all the emails per author (profile-based). The first approach treats each email from a given author as a single training instance and thereby retains differences in texts from the same author. The second approach accumulates all the texts from a given author into one big training file, creating a profile of one author and disregarding differences in each individual text. The choice is mostly philosophical: whether to model the general style of each author or the individual style of each document [63].

2.2.2 Features

An important design choice in authorship attribution systems is the choice of feature set. Features are the specific writing-style attributes, predefined by the researcher, that are extracted from a piece of text in order to capture stylistic information that is characteristic of a particular author. Since the choice of feature set can affect the performance of the authorship attribution in various ways, it is important to consider which features to include or exclude. In general, a distinction can be made between lexical, syntactic, structural, semantic, and content-specific features. These features will be discussed in that order in the following sections.

Lexical features

Lexical features are the features that are derived at the character and word level of the text, and are the most commonly used features. These features are considered language-independent, since they do not need any prior language-dependent processing before they can be applied to a text. Character frequencies, word-length distributions, frequencies of digits and non-alphanumeric characters, and the total number of words are all examples of lexical features that provide useful information.

An easy-to-use lexical feature that is also computationally simple is character n-grams. For example, the character 4-grams that can be extracted from the phrase "The dog" are "the ", "he d", "e do", and " dog". Character n-grams can capture various writing style markers from a text, such as capitalization or UK/US-variants of certain words. Even if a word is incorrectly spelled, most of the n-grams extracted from the correct and incorrect variant will match, although a misspelling can also be considered as a style marker for a particular author. An advantage of character n-grams is that they do not need tokenization before they can be applied to a text, which is very useful in Asian or Arabic languages where tokenization is difficult.
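A one-function sketch of the extraction (illustrative; lowercasing follows the example in the text):

```python
def char_ngrams(text, n=4):
    """All overlapping character n-grams of a (lowercased) text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("The dog"))  # ['the ', 'he d', 'e do', ' dog']
```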

More complicated lexical features require the detection of word and sentence boundaries in the text. By using common Natural Language Processing (NLP) tools, a text can be broken up into its constituent parts during a process called tokenization. A token can be a single word, phrase, symbol, or other meaningful element. After counting the occurrences of each distinct token, the n most frequently occurring tokens can be used as features, since the tokens that occur most frequently are considered to contain the most useful discriminatory information.

Another set of features that can be derived from tokenization is the frequency of different word lengths. These features provide information on how often a particular author uses words of different lengths. Vocabulary richness measures are a subset of lexical features that are derived from these word length frequencies. Hapax legomena is the number of words that occur once in a text, whereas hapax dislegomena is the number of words that occur twice. The number of hapax legomena and hapax dislegomena gives an indication of how rich the vocabulary is that is used by a certain author. Authors that have a larger vocabulary will have a higher count of once- or twice-occurring words than authors with a small vocabulary. The type-token ratio V/N is the number of unique tokens V divided by the total number of tokens in a text N, and gives another indication of vocabulary richness. Numerous vocabulary richness measures have been created based on word frequencies, of which the most well known are listed below (a code sketch computing several of them follows the list):


• Yule's K [69]:

K = 10^4 \cdot \left[ -\frac{1}{N} + \sum_i V(i, N) \left( \frac{i}{N} \right)^2 \right] \qquad (2.5)

where V(i, N) is the number of words occurring i times in the text of length N.

• Sichel's S [59]:

S = \frac{V(2, N)}{V(N)} \qquad (2.6)

where V(N) is the vocabulary size and V(2, N) the number of twice-occurring words.

• Brunet's W [7]:

W = N^{V(N)^{-a}} \qquad (2.7)

where N is the number of words and a is usually set to 0.172.

• Honoré's R [26]:

R = 100 \cdot \frac{\log N}{1 - V(1, N)/V(N)} \qquad (2.8)
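The sketch below (illustrative only; the constant a = 0.172 for Brunet's W follows the text above) computes these measures from a plain list of tokens:

```python
import math
from collections import Counter


def vocabulary_richness(tokens):
    """Hapax counts, type-token ratio, and equations (2.5)-(2.8)."""
    N = len(tokens)                     # total number of tokens
    freqs = Counter(tokens)             # word -> number of occurrences
    V = len(freqs)                      # vocabulary size V(N)
    spectrum = Counter(freqs.values())  # i -> V(i, N), words occurring i times
    V1, V2 = spectrum.get(1, 0), spectrum.get(2, 0)
    return {
        "hapax_legomena": V1,
        "hapax_dislegomena": V2,
        "type_token_ratio": V / N,
        "yule_k": 1e4 * (-1 / N + sum(Vi * (i / N) ** 2
                                      for i, Vi in spectrum.items())),
        "sichel_s": V2 / V,
        "brunet_w": N ** (V ** -0.172),
        # Honoré's R is undefined when every word is a hapax (V1 == V).
        "honore_r": (100 * math.log(N) / (1 - V1 / V)) if V1 < V
                    else float("inf"),
    }


tokens = "the cat sat on the mat and the dog sat too".split()
print(vocabulary_richness(tokens))
```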

Furthermore, smileys [64], abbreviations [62], slang words [36], and even spelling errors can be used as stylometric features to distinguish between authors. For example, Koppel [37] tries to simulate the way a human expert discriminates between authors by using stylistic idiosyncrasies such as misspellings to fingerprint a particular author. A disadvantage of that approach is that discriminating between authors is hard when there are few idiosyncrasies present in the text.

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing the syntactic constructions that an author uses. The underlying idea is that every author unconsciously uses more or less the same syntactical patterns in each text. Syntactic features rely on the accuracy and availability of NLP tools in different languages in order to tokenize and apply part-of-speech (POS) tags.

A common method is to analyze the short sequences of POS tags that occur most frequently throughout an author's work. For example, the sentence "The man walked in the park" can be represented as

| DETERMINER | NOUN | VERB PAST TENSE | PREPOSITION | DETERMINER | NOUN |

where token starts and ends are delimited by a |. The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS tags can be found in Solorio and Pillay [62], who use POS-tag uni-grams, bi-grams, and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactical parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64] or the distance covered by the dependency links in the parse tree.

Function words were introduced by Mosteller and Wallace [51] to distinguish between different authors. Function words are words that have almost no lexical meaning, but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change in short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong", the function words are "The", "of", and "the". They signify the grammatical structure of the sentence without containing any meaning; they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features.

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structures. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks, and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al. [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks. However, they experienced difficulties with extracting these features. Another example of content-specific structural features is the use of HTML tags by de Vel et al. [17]. They found that some email programs used HTML formatting for their emails, and included the frequency of different HTML tags in their feature set.

Semantic features

The most complex set of features are the semantic features. These features require sophisticated NLP techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al. [64], who use UK/US-spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives, and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity types such as date, location, money, number, ordinal, organization, percent, person, and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word. Function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural, and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features. In such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection has to deal with an additional problem, in the sense that the final feature set might be over-fitting the training data. The benefit of feature selection methods is therefore ambiguous, and they can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

\mathrm{Entropy} = -\sum_{x \in X} P(x) \log P(x) \qquad (2.9)

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.
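A sketch of the computation for a single discrete feature, following equation (2.9); the toy labels and feature values are invented for illustration:

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a class distribution, as in equation (2.9)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after
    splitting on a (discrete) feature."""
    n = len(labels)
    split_entropy = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        split_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - split_entropy


# Toy example: does "uses smileys" help to separate two authors?
labels  = ["A", "A", "A", "B", "B", "B"]
smileys = [1, 1, 1, 0, 0, 1]
print(round(information_gain(smileys, labels), 3))  # 0.459
```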

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "The metric measures the horizontal separation between two standard Normal curves where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the feature set whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data. Hence, they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set, he still manages to assert with confidence that "The Royal Book of Oz" has been written by Ruth Plumly Thompson; Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.
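As an illustration (using scikit-learn, an assumed dependency here rather than a tool prescribed by the thesis), PCA can be asked directly for the number of components needed to explain 95% of the variance, as in Tearle et al. [65]; the feature matrix below is randomly generated:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 200 documents, 50 stylometric features
X[:, :5] *= 10                  # a few features carry most of the variance

# n_components=0.95 keeps just enough components for 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```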

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction that is made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations \vec{V}(s) and \vec{V}(t), the cosine similarity is defined as

\mathrm{Cosine}(s, t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)| \, |\vec{V}(t)|} \qquad (2.10)

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10000. In later research, Koppel et al. [40] report that 46% of 1000 blog extracts are classified correctly using only the cosine similarity.
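A sketch of equation (2.10) on simple bag-of-words count vectors (Koppel et al. use tf-idf and stylistic vectors; the whitespace tokenization here is purely illustrative):

```python
import math
from collections import Counter


def cosine(s, t):
    """Cosine similarity of equation (2.10) on bag-of-words vectors."""
    vs, vt = Counter(s.lower().split()), Counter(t.lower().split())
    dot = sum(vs[w] * vt[w] for w in vs.keys() & vt.keys())
    norm = math.sqrt(sum(c * c for c in vs.values())) * \
           math.sqrt(sum(c * c for c in vt.values()))
    return dot / norm if norm else 0.0


print(round(cosine("we will give it a try", "we will try it"), 3))
```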

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural, and domain-specific features from the texts of candidate authors and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals A1: [0, 0.25], A2: [0.25, 0.50], A3: [0.50, 0.75], and A4: [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns, and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint, to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.

Figure 2.1: The structure of a supervised authorship attribution system.

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely to that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered as the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe, and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author. His most important finding was that word length distributions tend to remain the same across different works for a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejects the claim of some literary scholars that Shakespeare and Bacon are the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers: a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naïve Bayes probabilistic model to the frequency of these function words and found that all documents were written by Madison. The Naïve Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, \ldots, x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the original document can be expressed by

P(A_i \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid A_i) \, P(A_i) \qquad (2.11)

The real author is then calculated using

A^* = \arg\max_{A_i \in A} P(A_i \mid x_1, \ldots, x_n) \qquad (2.12)

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure, called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words from a text against a reference corpus (a large contemporary corpus of which the mean and standard deviation of these 30 words are computed) and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
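A sketch of the Delta computation under simplifying assumptions: the reference-corpus mean and standard deviation of each frequent word's relative frequency are assumed to be given, and texts are plain token lists (all numbers in the toy usage are invented):

```python
from collections import Counter


def relative_freqs(tokens, words):
    """Relative frequency of each target word in a token list."""
    counts = Counter(tokens)
    return {w: counts[w] / len(tokens) for w in words}


def delta(known_tokens, unknown_tokens, ref_mean, ref_std):
    """Burrows' Delta: mean absolute difference of z-scores for a fixed
    set of frequent words; ref_mean/ref_std come from a large reference
    corpus (assumed given here)."""
    words = list(ref_mean)
    f_known = relative_freqs(known_tokens, words)
    f_unknown = relative_freqs(unknown_tokens, words)
    z = lambda f, w: (f[w] - ref_mean[w]) / ref_std[w]
    return sum(abs(z(f_known, w) - z(f_unknown, w)) for w in words) / len(words)


# Toy usage: the candidate whose known text yields the lowest Delta
# against the unknown text is considered its most likely author.
ref_mean = {"the": 0.060, "of": 0.030}
ref_std = {"the": 0.010, "of": 0.008}
known = "the cat of the house sat on the mat".split()
unknown = "the dog of the yard slept in the sun".split()
print(round(delta(known, unknown, ref_mean, ref_std), 3))
```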

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e., the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2. By testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task, according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor), and the C4.5 decision tree classifier have been tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest potential margin of error. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in a very fast classification. An example of how SVM maximizes the margin of error of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin of error. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and can separate the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations to binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-one using voting, and error-correcting codes [18]. Numerous authorship attribution studies exist that utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71], and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.
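To make this concrete, here is a sketch of an SVM-based attribution step using scikit-learn (an assumed toolchain; the feature set is reduced to character 3-grams for brevity, and the emails and author labels are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

emails = ["Thank you very much, we will give it a try.",
          "Please find the report attached herewith.",
          "thanks a lot!! will try it asap",
          "Attached please find the quarterly report."]
authors = ["A", "B", "A", "B"]

# Character 3-grams as features; LinearSVC handles the one-vs-rest
# decomposition into binary classifiers internally.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 3)),
    LinearSVC(),
)
model.fit(emails, authors)
print(model.predict(["many thanks, we will give it a go"]))
```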

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes, and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training & validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e., to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j \in V; then an edge e_{v_i v_j} \in W exists if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i v_j} \in W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Bibliographic coupling is when two scientific documents share one or more bibliographical references, whereas co-citation is when two documents are cited together: if papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together in their papers, and can be expressed as follows:

\mathrm{Co\text{-}citation}(v_i, v_j) = |N(v_i) \cap N(v_j)| \qquad (2.13)

In Graph Theory, this is often known as the shared neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases, because they are connected by three different Connected Triples (red, yellow, and blue).


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

\mathrm{Jaccard}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|} \qquad (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random in predicting where new links will form. However, it sometimes performs on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
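A sketch of equation (2.14) over a toy email network, where the neighborhood of each address is the set of its correspondents (all names invented):

```python
def jaccard(neighbors, vi, vj):
    """Jaccard similarity of two vertices' neighborhoods, equation (2.14)."""
    Ni, Nj = neighbors[vi], neighbors[vj]
    return len(Ni & Nj) / len(Ni | Nj) if Ni | Nj else 0.0


# Toy email network: author -> set of correspondents.
neighbors = {
    "j.doe@enron.com":  {"a@enron.com", "b@enron.com", "c@enron.com"},
    "jdoe77@gmail.com": {"a@enron.com", "b@enron.com", "d@enron.com"},
    "k.lay@enron.com":  {"d@enron.com"},
}
print(jaccard(neighbors, "j.doe@enron.com", "jdoe77@gmail.com"))  # 0.5
```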

SimRank [33] is an iterative extension to co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively. An individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

\mathrm{SimRank}(v_i, v_j) = \frac{C}{|I(v_i)| \, |I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} \mathrm{SimRank}(I_x(v_i), I_y(v_j)) \qquad (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.

PageSim [42] is another extension to the co-citation algorithm that assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagation of the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

\mathrm{ConnectedPath}(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{\mathrm{length}(p)} \qquad (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length r. U(p) is the uniqueness of a particular path p \in PATH, which is calculated as follows:

U(p) = \sum_{v_x \in path(v_i, v_j),\; v_x \notin \{v_i, v_j\}} UQ(v_x) \qquad (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|} \qquad (2.18)

where w_{x,g} denotes an edge between v_x \in path(v_i, v_j) and any other vertex v_g \in V, and w_{x,x+1} and w_{x,x-1} denote edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim, and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].
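The sketch below illustrates equations (2.16)-(2.18) under explicit assumptions: PATH(v_i, v_j, r) is taken to mean all simple paths with at most r edges, edge weights |w| are message counts, and the toy graph is invented; Boongoen et al.'s exact conventions may differ:

```python
def uq(graph, vx, prev, nxt):
    """Uniqueness UQ(vx) of equation (2.18): the weight of the two
    path edges at vx relative to the total weight of all its edges."""
    return (graph[vx][prev] + graph[vx][nxt]) / sum(graph[vx].values())


def connected_path(graph, vi, vj, r=3):
    """Connected Path score (equations (2.16)-(2.17)), summing
    U(p) / length(p) over simple paths of at most r edges."""
    score = 0.0
    stack = [[vi]]
    while stack:
        path = stack.pop()
        for nb in graph[path[-1]]:
            if nb == vj:
                if len(path) >= 2:  # at least one intermediate vertex
                    full = path + [vj]
                    u = sum(uq(graph, full[k], full[k - 1], full[k + 1])
                            for k in range(1, len(full) - 1))
                    score += u / (len(full) - 1)  # length(p) = number of edges
            elif nb not in path and len(path) < r:
                stack.append(path + [nb])
    return score


# Toy symmetric email graph: graph[a][b] = number of messages between a and b.
graph = {
    "vi": {"x": 3, "y": 1},
    "vj": {"x": 2, "y": 1},
    "x":  {"vi": 3, "vj": 2},
    "y":  {"vi": 1, "vj": 1, "z": 5},
    "z":  {"y": 5},
}
print(round(connected_path(graph, "vi", "vj"), 3))  # 0.643
```

The high-traffic vertex x contributes more to the score than y, whose edges mostly lead elsewhere, matching notion (1) above.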

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = \alpha s_i + \beta s_j + \gamma s_k, where s_i, s_j, and s_k denote the scores assigned by techniques i, j, k respectively, each normalized such that they fall in the range [0, 1]. The weights \alpha, \beta, \gamma determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in section 2.3 to generate normalized pair-wise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct for 88.2% of such cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets. One data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, one that can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e., the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g., a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias           false alias
retrieved        true positives (tp)     false positives (fp)
not retrieved    false negatives (fn)    true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table, such as the one that can be seen in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

\mathrm{Accuracy} = \frac{\text{correct classifications}}{\text{total number of classifications}} = \frac{tp + tn}{tp + fp + fn + tn} \qquad (2.19)

Although it looks like a good measure of performance, it is not that hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can also be defined as

P = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{retrieved aliases}|} = \frac{tp}{tp + fp} \qquad (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

R = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{total correct aliases}|} = \frac{tp}{tp + fn} \qquad (2.21)

These two measures are not as dependent on the class distributions as the accuracy measure. Therefore, they are a more sensible choice to use in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to manually evaluate them anyway. On the other hand, a user that wants to automate the complete process, and be able to rely greatly on the classification given by the system, will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} \qquad (2.22)

Often the importance of precision and recall is balanced by choosing \alpha = 0.5. This results in the so-called F1-measure, which can now simply be written as

F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].
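A sketch computing the measures of equations (2.20), (2.21), and (2.23) from contingency-table counts (the counts in the example are invented):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from contingency-table counts
    (equations (2.20), (2.21), and (2.23))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 8 of 10 retrieved aliases are correct; 8 of 16 correct aliases retrieved.
print(precision_recall_f1(tp=8, fp=2, fn=8))  # (0.8, 0.5, 0.615...)
```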

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed, and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46], and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of effectiveness on the smaller classes.

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques, and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that, when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the ones most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of over-fitting and perform automated feature selection. Therefore, SVM has been chosen as a classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure, in order to aid the comparison of different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. This chapter will start with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented will be discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver, and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of the email messages, it can be assumed that the sender of the email has written it, except for the forward and reply-parts. Concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent: 12/12/2000 at 16:08
"Thank you very much. We will give it a try."

Extracted record:
Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "noaddress@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step    Records affected    Percentage removed (cum.)
1       17,052              6.70
3       13,681              12.00
4       26,223              22.50
5       4,001               24.00
6       25,990              34.00
7       3,700               35.80
8       52,163              56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus. Step 2 altered message bodies without removing complete records; steps 7 and 8 refer to the removal of authors with ≤ 80 or > 600 emails, described below.

2. Forward and reply parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.
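As an illustration, steps 4 and 6 could be expressed as follows with pandas, assuming a hypothetical local copy of the message table with columns named sender, receiver, body, send_date and subject (the actual preprocessing was carried out on the SQL database itself):

    import sqlite3
    import pandas as pd

    connection = sqlite3.connect("enron.db")  # hypothetical local copy of the corpus
    messages = pd.read_sql("SELECT * FROM messages", connection)

    # Step 4: remove messages containing ten or fewer words.
    messages = messages[messages["body"].str.split().str.len() > 10]

    # Step 6: messages with identical sender, receiver, body, send date and
    # subject are duplicates; keep only the first copy of each.
    messages = messages.drop_duplicates(
        subset=["sender", "receiver", "body", "send_date", "subject"])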

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set, the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consisted of 44,912 emails by 246 different senders; for each message, the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors.


Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels (linear, RBF) for the authorship SVM.

Figure 3.4 provides an overview of the total number of words per author. The x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors have written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it represents the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set, and their emails were split up over several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


Figure 3.3: The distribution of email messages per author.

Figure 3.4: The distribution of the total number of words per author.


Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of alias                       Number of authors
High Jaro-Winkler with 1 alias      26
High Jaro-Winkler with 2 aliases    15
Low Jaro-Winkler with 1 alias       11
Low Jaro-Winkler with 2 aliases     1
No alias                            193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed    Hard
High Jaro-Winkler    6        2
Low Jaro-Winkler     8        16
No alias             6        2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. johndoe@enron.comA & johndoe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.
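The splitting itself is straightforward; a minimal sketch (the function name and the suffix scheme are illustrative, not the code used in this thesis) could look as follows:

    import random

    def split_into_aliases(address, messages, n_identities=2):
        """Randomly distribute one author's messages over artificial alias identities."""
        identities = [address + suffix for suffix in ("A", "B", "C")[:n_identities]]
        assignment = {identity: [] for identity in identities}
        for message in messages:
            assignment[random.choice(identities)].append(message)
        return assignment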

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of the aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
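For illustration, the following is a minimal sketch of such a pairwise comparison (not the implementation used in this thesis); the 0.94 threshold is only an example, taken from the best-performing value reported in Chapter 4:

    def jaro(s, t):
        """Jaro similarity based on matching characters and transpositions."""
        if s == t:
            return 1.0
        window = max(len(s), len(t)) // 2 - 1
        s_match, t_match = [False] * len(s), [False] * len(t)
        matches = 0
        for i, c in enumerate(s):
            for j in range(max(0, i - window), min(len(t), i + window + 1)):
                if not t_match[j] and t[j] == c:
                    s_match[i] = t_match[j] = True
                    matches += 1
                    break
        if matches == 0:
            return 0.0
        s_seq = [c for c, m in zip(s, s_match) if m]
        t_seq = [c for c, m in zip(t, t_match) if m]
        transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) / 2
        return (matches / len(s) + matches / len(t)
                + (matches - transpositions) / matches) / 3

    def jaro_winkler(s, t, boost=0.1, max_prefix=4):
        """Winkler modification: boost the score for pairs sharing a common prefix."""
        score = jaro(s, t)
        prefix = 0
        for a, b in zip(s[:max_prefix], t[:max_prefix]):
            if a != b:
                break
            prefix += 1
        return score + prefix * boost * (1.0 - score)

    def candidate_aliases(addresses, threshold=0.94):
        """Flag all address pairs whose similarity exceeds the decision threshold."""
        return [(a, b) for i, a in enumerate(addresses)
                for b in addresses[i + 1:] if jaro_winkler(a, b) >= threshold]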

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was normalized as follows:

    ConnectedPath_norm(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
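As a sketch, the Jaccard score between two authors reduces to a single set operation over their sets of direct correspondents (the addresses in the example are hypothetical):

    def jaccard(neighbors_a, neighbors_b):
        """Jaccard similarity of two authors' sets of direct correspondents."""
        union = neighbors_a | neighbors_b
        return len(neighbors_a & neighbors_b) / len(union) if union else 0.0

    # Hypothetical example: two identities sharing two of their four contacts.
    print(jaccard({"a@x.com", "b@x.com", "c@x.com"},
                  {"a@x.com", "b@x.com", "d@x.com"}))  # 0.5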

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function (RBF) kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters (~ $ ^ & - _ = + > < [ ] | etc.)
54          Total number of words (M)
55          Total number of short words (less than four characters) / M
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of the 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation marks (e.g. , . ? ! ; ' ")
342-491     Frequency of function words (see appendix)

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.
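To make the feature extraction concrete, the following sketch computes a small, illustrative subset of the lexical features from Table 3.4 (the thesis used the full 492-dimensional vector; the function name and word tokenization are assumptions):

    import re
    from collections import Counter

    def style_features(text):
        """A small subset of the lexical features from Table 3.4."""
        chars = max(len(text), 1)                    # feature 1: C
        words = re.findall(r"[A-Za-z']+", text)
        m = max(len(words), 1)                       # feature 54: M
        counts = Counter(w.lower() for w in words)
        return {
            "alpha_ratio": sum(c.isalpha() for c in text) / chars,      # feature 2
            "upper_ratio": sum(c.isupper() for c in text) / chars,      # feature 3
            "digit_ratio": sum(c.isdigit() for c in text) / chars,      # feature 4
            "short_words": sum(len(w) < 4 for w in words) / m,          # feature 55
            "avg_word_len": sum(map(len, words)) / m,                   # feature 57
            "distinct_words": len(counts) / m,                          # feature 60
            "hapax_legomena": sum(v == 1 for v in counts.values()),     # feature 61
            "hapax_dislegomena": sum(v == 2 for v in counts.values()),  # feature 62
        }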


5 × 5-fold cross-validation for each authorship SVM. The highest scoring combination of parameters is then chosen to train the actual SVM model.
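The thesis carried out this search with SVM.NET (see below); with scikit-learn, an equivalent grid search over the same exponential ranges might look like the following sketch, where X_train and y_train are assumed to hold the feature vectors and author labels:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {
        "C": [2.0 ** e for e in range(-5, 16, 2)],      # 2^-5, 2^-3, ..., 2^15
        "gamma": [2.0 ** e for e in range(-15, 4, 2)],  # 2^-15, 2^-13, ..., 2^3
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)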

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems using different kernels and parameters.
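A sketch of this one-versus-all setup with balanced classes is given below; the function names, the to_features feature extractor, and the particular C and γ values are illustrative assumptions, not the thesis implementation:

    import random
    from sklearn.svm import SVC

    def train_one_versus_all(emails_by_author, to_features, C=8.0, gamma=2.0 ** -7):
        """One binary RBF-SVM per author: own emails vs. a random balanced sample."""
        models = {}
        for author, own in emails_by_author.items():
            others = [e for a, msgs in emails_by_author.items()
                      if a != author for e in msgs]
            negatives = random.sample(others, len(own))   # equal class sizes
            X = [to_features(e) for e in own + negatives]
            y = [1] * len(own) + [0] * len(negatives)
            models[author] = SVC(kernel="rbf", C=C, gamma=gamma,
                                 probability=True).fit(X, y)
        return models

    def author_probabilities(email, models, to_features):
        """Per-author probability that this email was written by that author."""
        x = [to_features(email)]
        return {a: m.predict_proba(x)[0][1] for a, m in models.items()}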

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction of whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

Figure 3.6: The structure of the combined approach.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
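Schematically, the voting SVM is a stacked binary classifier over the three technique scores. A minimal sketch follows, where X_votes and y_votes are assumed to hold the manually labeled training vectors described above, and the 0.78 threshold is only an example value:

    from sklearn.svm import SVC

    # Each row of X_votes: [jw_score, link_score, authorship_svm_prob] for one
    # (author, candidate alias) pair; y_votes: 1 if labeled a true alias, else 0.
    voting_svm = SVC(kernel="rbf", probability=True).fit(X_votes, y_votes)

    def is_alias(jw_score, link_score, svm_prob, threshold=0.78):
        """Decide alias/non-alias for one candidate at a given decision threshold."""
        p = voting_svm.predict_proba([[jw_score, link_score, svm_prob]])[0][1]
        return p >= threshold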

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0 (such a threshold sweep is sketched below). Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.
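For clarity, the evaluation loop behind these curves can be sketched as follows, where scores is assumed to hold one similarity score per candidate pair and labels the corresponding ground truth:

    def precision_recall_f1(scores, labels, threshold):
        """Precision, recall and F1 over candidate pairs for one threshold."""
        predictions = [s >= threshold for s in scores]
        tp = sum(p and l for p, l in zip(predictions, labels))
        fp = sum(p and not l for p, l in zip(predictions, labels))
        fn = sum(not p and l for p, l in zip(predictions, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    for threshold in [i * 0.05 for i in range(21)]:  # 0.0, 0.05, ..., 1.0
        print(threshold, precision_recall_f1(scores, labels, threshold))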

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88, using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89, using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2; it is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.

47

How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases; no such collection could be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. Intelligent Systems, IEEE, 20(5):67-75.
[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).
[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, pages 48-57, Morristown, NJ, USA. Association for Computational Linguistics.
[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9-17.
[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. 2009 IEEE International Conference on Fuzzy Systems, pages 288-293.
[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77-102.
[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.
[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267-287.
[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27-47.
[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1-39.
[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290-294. IEEE.
[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.
[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73-78.
[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273-297.
[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265-292.
[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining.
[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.
[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278-285.
[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889-1918.
[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3-6.
[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.
[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289-1305.
[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486-509.
[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.
[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, Morristown, NJ, USA. Association for Computational Linguistics.
[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172-177.
[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453-476.
[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.
[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56-64.
[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.
[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42-S51.
[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491-498.
[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1-11.
[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137-142.
[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.
[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.
[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69-72, Acapulco, Mexico.
[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.
[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659-660, Seattle, Washington, USA. ACM.
[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83-94.
[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019-1031.
[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019-1020, Edinburgh.
[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.
[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.
[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING '08), pages 513-520. Association for Computational Linguistics.
[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.
[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237-46.
[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97-105.
[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39-41.
[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267-270, Menlo Park, California. The AAAI Press.
[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.
[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88.
[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.
[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.
[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. Int. J. Metadata Semant. Ontologies, 1(2):89-99.
[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101-141.
[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482-491. Association for Computational Linguistics.
[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.
[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45-72.
[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249-252. Springer-Verlag, Berlin.
[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265-269.
[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.
[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538-556.
[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.
[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425-442.
[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17-24.
[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1-15.
[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.
[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.
[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174-189.
[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378-393.

Appendix

List of function words used in the authorship SVM:

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 5: Thesis Freek Maes - Final Version

List of Figures

21 The structure of a supervised authorship attribution system 1522 Example of a decision tree 1723 Linear Separation using Support Vector Machines 1824 Mapping of feature space for SVM using RBF-kernel 1925 Example of an Artificial Neural Network 2026 Connected Triples in a link network 2227 Connected Path similarity in a link network 24

31 Information extracted from an email in the ENRON data set 3032 Evaluation of different kernels and training sizes for SVM 3233 Distribution of email messages per author 3334 Distribution of total number of words per author 3335 Network graph of the authors in the ENRON subset 3436 Structure of the combined approach 39

41 Performance of individual techniques on the mixed test set 4242 Performance of combined techniques on the mixed test set 4243 Performance of individual techniques on the hard test set 4344 Performance of combined techniques on the hard test set 4345 Best performance of different techniques on the mixed test set 4446 Best performance of different techniques on the hard test set 45

1

List of Tables

21 Soundex algorithm rules 722 Contingency table for evaluation 26

31 Preprocessing steps applied to the ENRON corpus 3132 Artificial Aliases in the ENRON data set by type 3533 Distribution of alias-types in two different test sets 3534 Feature set for the authorship SVM 37

2

Chapter 1

Introduction

Authorship disambiguation and alias resolution are increasingly important con-cepts in domains such as intelligence and law where email collections may con-tain authors that use one or more aliases Aliases occur when a person usesmultiple email addresses for either intentional or unintentional reasons For ex-ample people can try to hide their identity by intentionally adopting severaldifferent email addresses something that is common in intelligence data setssuch as terrorist networks On the other hand the use of different email ad-dresses (home office etc) is becoming common nowadays Hence there alsoexist many unintentional aliases where only the domain of the email address isdifferent or where a simple misspelling of a name has occurred

Various approaches have been applied successfully to resolve aliases in emaildata sets although each has its own shortcomings Unintentional aliases canbe resolved by employing metrics that indicate how much two email addresseslook alike However these metrics are easily fooled by persons using completelydifferent email addresses Another approach focuses on the content of the emailby creating a profile of an authorrsquos writing style By comparing the writingstyle of different authors and finding those that employ similar writing stylesaliases that are more complex can be detected This approach has been appliedsuccessfully to attribute authorship of disputed literary works However itencounters scalability issues when the number of authors grows large or thelength of the texts grows small as is the case in email data sets A thirdapproach makes use of the fact that even if an author use a completely differentemail address and writing style the people with whom he corresponds via emailmight remain stable The similarity between different authorsrsquo email contactscan be determined using link analysis techniques These techniques achievereasonable results and sometimes manage to find aliases that other techniquesdo not find

3

The three approaches mentioned above operate on different domains namelythe email address the content of the email and the email network Finding away to combine these approaches and utilize their combined strengths mightenable us to overcome their individual weaknesses In order to guide the re-search that has been conducted for this thesis three research questions havebeen formulated

1 Which authorship disambiguation and alias resolution techniques existthat can be used on email data

2 How can techniques from different domains be combined

3 Can a combination of techniques from different domains increase perfor-mance over individual techniques

11 Structure of the thesis

The structure of the remaining parts of this thesis is as follows

bull Chapter 2 introduces multiple techniques from the fields of AuthorshipDisambiguation and Alias Resolution Specifically string metrics will beexplained in section 21 authorship attribution systems in section 22 andlink analysis techniques in section 23 Several ways of combining thesetechniques as well as different measures for performance evaluation willbe discussed in sections 24 and 25

bull Chapter 3 outlines the methodology that has been used in order to conductthe experiments The email corpus that has been used will be describedas well as the preprocessing that has been applied to it Furthermore thetechniques that have been chosen for evaluation in the experiments willbe explained

bull Chapter 4 will present in detail the results of the experiments that havebeen conducted

bull Finally Chapter 5 provides a summary and discussion of the obtainedresults as well as recommendations for the future

4

Chapter 2

Literature Review

In this chapter a review of relevant literature from the fields of AuthorshipDisambiguation and Alias Resolution will be given The first section will explaindifferent string metrics that have successfully been applied to resolve superficialaliases and authorship problems In the second section authorship attributiontechniques that can be used to resolve the question of authorship in generalwill be discussed Moreover the various design choices that have to be madewhen creating an authorship attribution system will be explained The thirdsection will deal with techniques from Link Analysis that use the network inwhich emails reside to discover aliases In the fourth section several ways ofcombining these techniques will be discussed The last section will introduceseveral measures that can be used for evaluating the performance of differenttechniques

21 String metrics

String similarity metrics are a class of functions that map two strings to a realnumber where the higher the value of this number the greater the similaritybetween the two strings Many string metrics use the number of operationsthat are required to transform one string into another in order to calculatethe similarity between the two Possible operations include insertion deletionsubstitution and transposition A different class of string metrics is the phoneticencodings in which strings are converted into codes according to how they arepronounced However these encodings are language dependent and are notavailable for many languages

String metrics do not take into account information regarding the contextin which the strings occur As such they can be considered rather simple ap-proaches to resolving aliases or settling authorship disputes However stringmetrics can be very useful for detecting misspellings of email aliases result-ing from the using different email domains or naming conventions For exam-ple they can easily detect the similarity between rdquojohndoedomaincomrdquo and

5

rdquojhondoedomaincomrdquo They are less useful when people deliberately try tohide their identity by using completely different email addresses

211 Techniques

In this section the most commonly used string metrics will be discussedThe Levenshtein distance [52] often referred to as edit distance is of one the

earliest and most used string distances It is defined as the minimum requiredamount of operations between string s and t to transform one string into theother Each operation has a cost of 1 and the allowed operations are inser-tion deletion and substitution of a character The Levenshtein distance can betransformed into a similarity metric by using

similarity(s t) =1

Levenshtein(s t) + 1(21)

The Jaro similarity [32] algorithm uses the number of transpositions T andthe number of matching characters m in order to determine the similarity be-tween two strings Two characters are matching only if they are no fartherapart than half the length of the longest string The number of transpositionsis defined as the number of matching characters in different sequence ordersdivided by two The similarity is then calculated as follows

Jaro(s t) =1

3

(m

|s|+m

|t|+mminus Tm

)(22)

where |s| denotes the length of string sThe Jaro-Winkler similarity [67] is an extension of the Jaro-algorithm using

the empirical finding by Winkler that less errors tend to occur at the start ofstrings The similarity is calculated as follows where p is the length of the prefixthat the two strings share

Jaro-Winkler(s t) = Jaro(s t) +p

10(10minus Jaro(s t)) (23)

The Soundex algorithm [53] is the most well known and at the same timethe oldest phonetic encoding that is used for string matching Strings are firstconverted into phonetic codes after which strings with similar codes are assumedto be highly similar In order to convert a string into a Soundex-code the firstletter of the string is retained after which the following letters are convertedto numbers according to the set of rules shown in table 21 In the resultingcode all zeros are removed as well as multiple sequential occurrences of thesame digit The code is then cut-off or extended with zeros such that is hasexactly 3 digits The first letter of the string together with the 3 digits formsthe Soundex-code The Soundex algorithm makes use of the fact that stringsthat are pronounced in a similar fashion tend to have the same Soundex codeFor example rdquoMaidrdquo and rdquoMaderdquo both results in the Soundex code rdquoM300rdquo

The longest common substring [23] method iteratively finds and removes thelongest substring of minimum length l that two strings have in common until

6

Letter Digit

A E I O U H W Y 0B F P V 1C G J K Q S X Z 2D T 3L 4M N 5R 6

Table 21 The rules for converting letters into digits as they are used in theSoundex algorithm

no more substrings can be found The final similarity can then be calculated bytaking the length of all the common substrings divided by either the maximumminimum or average length of the original strings

A slightly different approach by Monge and Elkan [50] uses a string metric, such as any of the ones discussed above, in a recursive matching scheme in order to determine the similarity between strings. Strings s and t are first broken into sub-strings s = s_1 ... s_K and t = t_1 ... t_K, after which the similarity is defined as

Monge-Elkan(s, t) = (1/K) · Σ_{i=1..K} max_{j=1..K} sim'(s_i, t_j)    (2.4)

where sim'(s_i, t_j) denotes the similarity score between sub-strings s_i and t_j as assigned by a secondary string metric.
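A sketch of this recursive matching scheme, with a deliberately simple character-overlap function standing in for the secondary metric sim' (in practice one would plug in, for example, Jaro or Levenshtein similarity):

def char_overlap(a, b):
    # Toy secondary metric: Jaccard overlap of the two character sets.
    return len(set(a) & set(b)) / len(set(a) | set(b))

def monge_elkan(s, t, sim=char_overlap):
    # Equation (2.4): average, over the sub-strings of s, of the best
    # match among the sub-strings of t.
    s_parts, t_parts = s.split(), t.split()
    return sum(max(sim(si, tj) for tj in t_parts)
               for si in s_parts) / len(s_parts)

print(monge_elkan("john r doe", "doe john"))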

Christen [11] provides an extensive comparison of these and other string metrics on 4 different test sets of given names, surnames and full names. He found that it is important to know beforehand the structure of the names to be matched, and whether they have been parsed and standardized. He also found that Jaro-Winkler similarity performed best in a comparison of 27 different string metrics. Furthermore, he reached the following conclusions: (1) phonetic encodings should not be used, since they are outperformed by all other techniques; (2) Jaro and Jaro-Winkler similarity perform well for given names and surnames if the names are parsed into separate fields; (3) longest common substring is useful when the names might contain swapped words; (4) the Winkler modification can be used with every technique to improve the quality of the matching; (5) the selection of a proper threshold is the biggest problem for most matching techniques; and (6) the fastest techniques are the ones that have a time complexity linear in the length of the strings. Cohen and Fienberg [13] evaluated several string metrics on 13 different test sets, concluding that the Monge-Elkan distance achieved the best performance of all the string metrics. The Jaro-Winkler metric proved to be a fast heuristic scheme, achieving almost the same performance as Monge-Elkan whilst being considerably less complex in nature.


2.2 Authorship Attribution

Other approaches to resolving aliases and disambiguating authors can be found in the field of Authorship Attribution. The authorship attribution task can be described as follows: given a set of candidate authors and a set of documents written by each of these authors, try to determine which of these candidates wrote a given anonymous document. In the traditional authorship attribution problem the number of candidate authors is typically small (2-10), the number of documents per author is large, and the length of these documents is large. Moreover, it is assumed that the author of the anonymous document is actually in the candidate set, i.e. there is a closed candidate set. A good example of a traditional authorship attribution problem is determining the author of a disputed literary work, such as some of Shakespeare's plays.

Authorship attribution techniques can be very useful in resolving aliases and determining authorship. An authorship attribution system can be trained to distinguish between different authors in an email data set. For a given author, it is then possible to determine whether an alias is being used by letting the authorship attribution system predict which author's writing style most closely resembles the given author's writing style.

In the remainder of this section the different techniques that have been employed in authorship attribution problems will be explained, as well as the important design choices that have to be made. These include the choice of a feature set, a feature selection technique, the actual attribution technique, and whether to treat the problem from an instance-based or a profile-based perspective.

2.2.1 Instance-based vs. profile-based

A general distinction can be made between techniques that treat each email individually (instance-based) and techniques that accumulate all the emails per author (profile-based). The first approach treats each email from a given author as a single training instance, and thereby retains differences between texts from the same author. The second approach accumulates all the texts from a given author into one big training file, creating a profile of one author and disregarding differences between the individual texts. The choice is mostly philosophical: whether to model the general style of each author or the individual style of each document [63].

2.2.2 Features

An important design choice in authorship attribution systems is the choice of feature set. Features are the specific writing-style attributes, predefined by the researcher, that are extracted from a piece of text in order to capture stylistic information that is characteristic of a particular author. Since the choice of feature set can affect the performance of the authorship attribution in various ways, it is important to consider carefully which features to include or exclude. In general, a distinction can be made between lexical, syntactic, structural, semantic and content-specific features. These features will be discussed in that order in the following sections.

Lexical features

Lexical features are derived at the character and word level of the text and are the most commonly used features. These features are considered language-independent, since they do not need any prior language-dependent processing before they can be applied to a text. Character frequencies, word-length distributions, frequencies of digits and non-alphanumeric characters, and the total number of words are all examples of lexical features that provide useful information.

An easy-to-use lexical feature that is also computationally simple is character n-grams. For example, the character 4-grams that can be extracted from the phrase "The dog" are "the ", "he d", "e do" and " dog". Character n-grams can capture various writing-style markers from a text, such as capitalization or UK/US variants of certain words. Even if a word is incorrectly spelled, most of the n-grams extracted from the correct and incorrect variants will match, although a misspelling can also be considered a style marker for a particular author. An advantage of character n-grams is that they do not need tokenization before they can be applied to a text, which is very useful in Asian or Arabic languages where tokenization is difficult.
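Extracting character n-gram frequencies takes only a few lines; the sketch below reproduces the 4-gram example above:

from collections import Counter

def char_ngrams(text, n=4):
    # Overlapping character n-grams, lower-cased.
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

print(char_ngrams("The dog"))
# Counter({'the ': 1, 'he d': 1, 'e do': 1, ' dog': 1})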

More complicated lexical features require the detection of word and sentence boundaries in the text. By using common Natural Language Processing (NLP) tools, a text can be broken up into its constituent parts during a process called tokenization. A token can be a single word, phrase, symbol or other meaningful element. After counting the occurrences of each distinct token, the n most frequently occurring tokens can be used as features, since the tokens that occur most frequently are considered to contain the most useful discriminatory information.

Another set of features that can be derived from tokenization is the frequency of different word lengths. These features provide information on how often a particular author uses words of a given length. Vocabulary richness measures are a subset of lexical features that are derived from these word frequencies. Hapax legomena is the number of words that occur once in a text, whereas hapax dislegomena is the number of words that occur twice. The numbers of hapax legomena and hapax dislegomena give an indication of how rich the vocabulary of a certain author is: authors that have a larger vocabulary will have a higher count of once- or twice-occurring words than authors with a small vocabulary. The type-token ratio V/N is the number of unique tokens V divided by the total number of tokens in the text N, and gives another indication of vocabulary richness. Numerous vocabulary richness measures have been created based on word frequencies, of which the most well known are:

• Yule's K [69]:

  K = 10^4 · [ −1/N + Σ_i V(i, N) · (i/N)^2 ]    (2.5)

  where V(i, N) is the number of words occurring i times in the text of length N.

• Sichel's S [59]:

  S = V(2, N) / V(N)    (2.6)

  where V(N) is the vocabulary size and V(2, N) the number of twice-occurring words.

• Brunet's W [7]:

  W = N^(V(N)^(−a))    (2.7)

  where N is the number of words and a is usually set to 0.172.

• Honoré's R [26]:

  R = 100 · log N / (1 − V(1, N)/V(N))    (2.8)

  where V(1, N) is the number of once-occurring words (hapax legomena).
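All four measures can be computed from the frequency spectrum V(i, N) of a token list, as the following sketch illustrates (a reconstruction of equations (2.5)-(2.8); note that Honoré's measure is undefined when every word is a hapax legomenon):

import math
from collections import Counter

def richness_measures(tokens):
    N = len(tokens)
    freqs = Counter(tokens)              # word -> number of occurrences
    spectrum = Counter(freqs.values())   # i -> V(i, N)
    V = len(freqs)                       # V(N), the vocabulary size
    K = 1e4 * (-1 / N + sum(v * (i / N) ** 2 for i, v in spectrum.items()))
    S = spectrum.get(2, 0) / V                             # Sichel's S
    W = N ** (V ** -0.172)                                 # Brunet's W, a = 0.172
    R = 100 * math.log(N) / (1 - spectrum.get(1, 0) / V)   # Honore's R
    return {"K": K, "S": S, "W": W, "R": R}

print(richness_measures("the cat sat on the mat the end".split()))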

Furthermore, smileys [64], abbreviations [62], slang words [36] and even spelling errors can be used as stylometric features to distinguish between authors. For example, Koppel [37] tries to simulate the way a human expert discriminates between authors, by using stylistic idiosyncrasies such as misspellings to fingerprint a particular author. A disadvantage of that approach is that discriminating between authors is hard when there are few idiosyncrasies present in the text.

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing the syntactic constructions that an author uses. The underlying idea is that every author unconsciously uses more or less the same syntactic patterns in each text. Syntactic features rely on the accuracy and availability of NLP tools in different languages in order to tokenize the text and apply Part-of-Speech (POS) tags.

A common method is to analyze the short sequences of POS tags that occur most frequently throughout an author's work. For example, the sentence "The man walked in the park" can be represented as

| DETERMINER | NOUN | VERB PAST TENSE | PREPOSITION | DETERMINER | NOUN |

where token starts and ends are delimited by a "|". The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS tags can be found in Solorio and Pillay [62], who use POS-tag uni-grams, bi-grams and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactic parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64] or the distance covered by the dependency links in the parse tree.

Function words were introduced by Mosteller and Wallace [51] to distinguish between different authors. Function words are words that have almost no lexical meaning but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change over short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong" the function words are "The", "of" and "the". They signify the grammatical structure of the sentence without containing any meaning; they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features.
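A function-word feature vector is then simply the relative frequency of each listed word; the word list below is a tiny illustrative subset, not an authoritative one:

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "was", "for"]

def function_word_vector(text):
    # Relative frequency of each function word in the text.
    tokens = text.lower().split()
    n = max(len(tokens), 1)
    return [tokens.count(w) / n for w in FUNCTION_WORDS]

print(function_word_vector("The leader of the team was very strong"))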

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structure. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks; however, they experienced difficulties extracting these features. Another example of content-specific structural features is the use of HTML tags by de Vel et al [17], who found that some email programs used HTML formatting for their emails, and included the frequency of different HTML tags in their feature set.

Semantic features

The most complex set of features are the semantic features. These features require sophisticated NLP techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al [64], who use UK/US spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity types such as date, location, money, number, ordinal, organization, percent, person and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called the meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word; function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features; in such cases feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection also has to deal with the additional problem that the final feature set might over-fit the training data. The benefit of feature selection methods is therefore ambiguous, and they can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

Entropy = − Σ_{x∈X} P(x) log P(x)    (2.9)

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.
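For a binary feature, the information gain can be computed as the drop in class entropy after splitting the documents on the presence of that feature; a minimal sketch:

import math
from collections import Counter

def entropy(labels):
    # Entropy of a class distribution (equation 2.9), in bits.
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_present):
    # feature_present[i] says whether the feature occurs in document i.
    subsets = {}
    for y, f in zip(labels, feature_present):
        subsets.setdefault(f, []).append(y)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

y = ["A", "A", "B", "B"]
f = [True, True, False, False]
print(information_gain(y, f))  # 1.0 bit: the feature separates the two authors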

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "the metric measures the horizontal separation between two standard Normal curves, where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance whilst reducing the dimensionality. For example, Tearle et al [65] use PCA to create linear combinations of features that explain 95% of the variation in the data, and hence reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set: he uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional one. With this 2-dimensional feature set he still manages to assert with confidence that "The Royal Book of Oz" was written by Ruth Plumly Thompson, while Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.

2.2.4 Techniques

A major design choice in every authorship attribution system is the actual attribution technique that will be used. A very common distinction made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author profiles. For a given anonymous document, its similarity to the documents of the other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations V(s) and V(t), the cosine similarity is defined as

Cosine(s, t) = ( V(s) · V(t) ) / ( |V(s)| |V(t)| )    (2.10)

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering that the number of candidate authors is 10,000. In a later study, Koppel et al [40] report that 46% of 1000 blog extracts are classified correctly using only the cosine similarity.
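A sketch of the cosine similarity of equation (2.10), here applied to raw term-frequency vectors (a real system would use tf-idf or stylistic feature vectors instead):

import math
from collections import Counter

def cosine(doc_s, doc_t):
    # Cosine similarity between the term-frequency vectors of two texts.
    vs, vt = Counter(doc_s.lower().split()), Counter(doc_t.lower().split())
    dot = sum(vs[w] * vt[w] for w in vs.keys() & vt.keys())
    norm = math.sqrt(sum(c * c for c in vs.values())) * \
           math.sqrt(sum(c * c for c in vt.values()))
    return dot / norm if norm else 0.0

print(cosine("thank you very much", "thank you kindly"))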

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al [29] use a similar approach to cluster emails by their writing style using k-means clustering, in which emails are iteratively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al [30, 31]. They extract a combination of lexical, syntactic, structural and domain-specific features from the texts of candidate authors and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1 = [0, 0.25], A2 = [0.25, 0.50], A3 = [0.50, 0.75] and A4 = [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique to that particular author. By counting the number of frequent patterns that occur in both the unknown writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.

Figure 2.1: The structure of a supervised authorship attribution system.

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in Figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature and compares the distribution derived from an anonymous piece of text to those derived from the different authors. The author whose distribution matches most closely that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author. His most important finding was that word-length distributions tend to remain the same across different works by a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tended to use four-letter words most often whereas Bacon used three-letter words most often. He thereby rejected the claim of some literary scholars that Shakespeare and Bacon were the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers, a set of 12 political essays whose authorship was disputed between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller and Wallace applied a Naive Bayes probabilistic model to the frequencies of these function words and found that all documents were written by Madison. The Naive Bayes model quantifies the idea by Mendenhall, by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, ..., x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the anonymous document is proportional to the likelihood times the prior:

P(A_i | x_1, ..., x_n) ∝ P(x_1, ..., x_n | A_i) · P(A_i)    (2.11)

The real author is then calculated using

A* = arg max_{A_i} P(A_i | x_1, ..., x_n)    (2.12)

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words of a text against a reference corpus (a large contemporary corpus over which the mean and standard deviation of these 30 words are computed), and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1500 words, but it can also be used to generate candidate authors for texts as short as 100 words. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
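A sketch of Burrows' Delta, assuming the reference-corpus statistics (mean and standard deviation of each frequent word's relative frequency) are given; the dictionaries in the example are hypothetical:

import statistics

def delta(known, unknown, reference):
    # reference maps each frequent word to (mean, stdev) of its relative
    # frequency in a large reference corpus; known/unknown map words to
    # their relative frequencies in the two texts being compared.
    diffs = []
    for word, (mu, sigma) in reference.items():
        z_known = (known.get(word, 0.0) - mu) / sigma
        z_unknown = (unknown.get(word, 0.0) - mu) / sigma
        diffs.append(abs(z_known - z_unknown))
    return statistics.mean(diffs)  # lower Delta = more similar style

reference = {"the": (0.060, 0.010), "of": (0.030, 0.008)}
print(delta({"the": 0.070, "of": 0.025},
            {"the": 0.055, "of": 0.033}, reference))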

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in Figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but they do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier were tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is the Support Vector Machine (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the largest possible margin. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in very fast classification. An example of how SVM maximizes the margin of the hyperplane that separates two linearly separable classes can be seen in Figure 2.3.

Figure 2.3: An example of linear separation with a maximized margin. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand which underlying trend is being modeled. The kernel function is a mapping that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and separates the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent the positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations of the binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-one using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al [71] and Luyckx et al [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.
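As an illustration, a minimal instance-based attribution pipeline with character n-gram features and a linear SVM, assuming the scikit-learn library is available (the emails and author labels here are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

emails = ["thank you very much we will give it a try",
          "please find attached the revised schedule",
          "let me know if the numbers look right",
          "the meeting has been moved to friday"]
authors = ["monika", "monika", "kevin", "kevin"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 4)),  # character n-grams
    LinearSVC())                                           # one-vs-rest linear SVM
model.fit(emails, authors)
print(model.predict(["thank you we will try"]))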

Another machine learning technique that can be used for authorship attribution is the Artificial Neural Network (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer, equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in Figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al [65] use an automated algorithm to select input metrics and training and validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset that is may vary among authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, once a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved with as little as 100 words per text.

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then there is an edge e(v_i, v_j) ∈ W if a message has been sent from author v_i to author v_j. If there exists an edge e(v_i, v_j) ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

Co-citation(v_i, v_j) = |N(v_i) ∩ N(v_j)|    (2.13)

In graph theory this is often known as the shared-neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases, because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

Jaccard(v_i, v_j) = |N(v_i) ∩ N(v_j)| / |N(v_i) ∪ N(v_j)|    (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random guessing in predicting where new links will form. However, it sometimes performs only on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
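Applied to an email network, equation (2.14) needs only the neighbor sets of the two candidate addresses; a sketch (the addresses and neighbor sets are hypothetical):

def jaccard(neighbors, vi, vj):
    # Jaccard similarity of two vertices' neighborhoods (equation 2.14).
    ni, nj = neighbors[vi], neighbors[vj]
    return len(ni & nj) / len(ni | nj) if ni | nj else 0.0

neighbors = {"a@enron.com": {"c", "d", "e"},
             "b@enron.com": {"c", "d", "f"}}
print(jaccard(neighbors, "a@enron.com", "b@enron.com"))  # 2 / 4 = 0.5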

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors and are denoted by |I(v)| and |O(v)| respectively. An individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

SimRank(v_i, v_j) = ( C / (|I(v_i)| · |I(v_j)|) ) · Σ_{x=1..|I(v_i)|} Σ_{y=1..|I(v_j)|} SimRank(I_x(v_i), I_y(v_j))    (2.15)

where C is a constant between 0 and 1. In practice the equation can be solved by iterating to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.

PageSim [42] is another extension of the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performed better than Connected Triples and SimRank on two out of three data sets. On one data set Connected Triples found more aliases than PageSim, whereas SimRank performed poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al [6] that takes into account not only information from direct neighbors, but also from neighbors of the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the more strongly it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

ConnectedPath(v_i, v_j) = Σ_{p∈PATH(v_i, v_j, r)} U(p) / length(p)    (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

U(p) = Σ_{v_x∈p, v_x∉{v_i, v_j}} UQ(v_x)    (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = ( |w_{x,x−1}| + |w_{x,x+1}| ) / Σ_{∀v_g∈V} |w_{x,g}|    (2.18)

where w_{x,g} denotes an edge between v_x ∈ p and any other vertex v_g ∈ V, and w_{x,x+1} and w_{x,x−1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al [6].
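A sketch of equations (2.16)-(2.18) on an undirected weighted email graph, assuming the networkx package; edge weights count the messages exchanged between two addresses, and r bounds the length of the paths considered:

import networkx as nx

def uniqueness(G, path, x):
    # UQ(v_x): weight of v_x's two on-path edges over its total edge weight.
    i = path.index(x)
    on_path = sum(G[x][n]["weight"] for n in (path[i - 1], path[i + 1]))
    total = sum(d["weight"] for _, _, d in G.edges(x, data=True))
    return on_path / total

def connected_path(G, vi, vj, r=3):
    score = 0.0
    for p in nx.all_simple_paths(G, vi, vj, cutoff=r):
        if len(p) < 3:                 # direct edges have no intermediate vertices
            continue
        u = sum(uniqueness(G, p, vx) for vx in p[1:-1])
        score += u / (len(p) - 1)      # path length measured in edges
    return score

G = nx.Graph()
G.add_weighted_edges_from([("a", "c", 3), ("b", "c", 2), ("a", "d", 1),
                           ("b", "d", 1), ("c", "e", 5)])
print(connected_path(G, "a", "b"))  # "d" is the more unique connector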

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = αs_i + βs_j + γs_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j and k respectively, each normalized such that it falls in the range [0, 1]. The weights α, β and γ determine the relative importance of each of the techniques. For example, Baroni et al [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al [6] use a related approach to combine link analysis results and string metrics. They use the Connected Path algorithm described in section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.
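The combination itself is a one-liner once each technique's scores have been normalized to [0, 1]; the weights below are arbitrary placeholders:

def combine(scores, weights):
    # f(x) = alpha*s_i + beta*s_j + gamma*s_k over normalized scores.
    return sum(weights[name] * s for name, s in scores.items())

print(combine({"jaro_winkler": 0.92, "svm": 0.40, "jaccard": 0.75},
              {"jaro_winkler": 0.5, "svm": 0.3, "jaccard": 0.2}))  # 0.73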

Another approach is to create a feature vector consisting of the scores assigned by the different techniques. A weighted voting mechanism such as a Support Vector Machine can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set was manually extracted and labeled from public web pages, whereas the other two consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach the least complex method is used first, to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias          false alias
retrieved        true positives (tp)    false positives (fp)
not retrieved    false negatives (fn)   true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution it is common to construct a contingency table such as the one shown in Table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = (correct classifications) / (total number of classifications) = (tp + tn) / (tp + fp + fn + tn)    (2.19)

Although it looks like a good measure of performance, it is not hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all examples as negative. In most cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct:

P = |retrieved aliases ∩ correct aliases| / |retrieved aliases| = tp / (tp + fp)    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved:

R = |retrieved aliases ∩ correct aliases| / |total correct aliases| = tp / (tp + fn)    (2.21)

These two measures are not as dependent on the class distribution as the accuracy measure, and are therefore a more sensible choice in this situation. Moreover, by having these two measures of performance it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user who wants to automate the complete process, and to rely greatly on the classification given by the system, will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = 1 / ( α · (1/P) + (1 − α) · (1/R) )    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can then simply be written as

F1 = (2 · precision · recall) / (precision + recall)    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46], and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.
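The difference between the two averaging schemes is easy to see in code; `tables` below holds one hypothetical (tp, fp, fn) triple per author:

def prf(tp, fp, fn):
    # Precision, recall and F1 from one contingency table.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_micro(tables):
    per_author = [prf(*t) for t in tables]
    macro = tuple(sum(vals) / len(vals) for vals in zip(*per_author))
    micro = prf(*(sum(col) for col in zip(*tables)))
    return macro, micro

tables = [(8, 2, 1), (1, 0, 4)]  # a large and a small author
print(macro_micro(tables))       # macro weights both authors equally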

2.6 Conclusion

In this chapter several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the ones most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of over-fitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. The design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in Figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender has written it, except for the forward and reply parts; for text stored in attachments this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


Subject: SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well known version of the data set was made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer when organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "noaddress@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty and Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.



2. Forward and reply parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number of records affected per step, as well as the cumulative percentage of records removed.

Step    Records affected    Percentage removed (cum.)
1       17,052              6.70
3       13,681              12.00
4       26,223              22.50
5       4,001               24.00
6       25,990              34.00
7       3,700               35.80
8       52,163              56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set contains 44,912 messages by 246 different senders. For each message the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors.


[Plot: 10-fold cross-validation accuracy (0.5-1.0, y-axis) against the number of training instances per class (20-200, x-axis), with one curve for the linear kernel and one for the RBF kernel.]

Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels for the authorship SVM

an overview of the total number of words per author. The x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set. Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it represents the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


[Histogram: number of authors (0-35, y-axis) against the number of emails sent (90-230, x-axis).]

Figure 3.3: The distribution of email messages per author

[Histogram: number of authors (0-180, y-axis) against the total number of words, on a logarithmic x-axis from 10,000 to 100,000,000.]

Figure 3.4: The distribution of the total number of words per author



Figure 3.5: A network graph of the authors in the subset of the ENRON data set (node labels are sender addresses; node color represents degree)


Type of Alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No alias                             193

Table 3.2: Artificial aliases in the ENRON data set, by type

Test set             Mixed   Hard
High Jaro-Winkler    6       2
Low Jaro-Winkler     8       16
No alias             6       2

Table 3.3: Distribution of alias types in the two different test sets

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.com.A & john.doe@enron.com.B)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of the aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
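As a sketch of this procedure, the pairwise scores can be computed with any Jaro-Winkler implementation; the example below assumes the third-party jellyfish library, which is not prescribed by the thesis itself, and the addresses are hypothetical.

    import itertools

    import jellyfish  # assumed third-party Jaro-Winkler implementation

    def jw_candidates(addresses, threshold):
        """Return all address pairs whose Jaro-Winkler similarity is at
        least the decision threshold; such pairs are alias candidates."""
        pairs = []
        for a, b in itertools.combinations(addresses, 2):
            score = jellyfish.jaro_winkler_similarity(a, b)
            if score >= threshold:
                pairs.append((a, b, score))
        return pairs

    # Hypothetical example; 0.94 was among the best thresholds found on
    # the mixed test set (see Chapter 4).
    print(jw_candidates(["john.doe@enron.com", "jhon.doe@enron.com",
                         "jeff.skilling@enron.com"], 0.94))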

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was calculated as follows:

$$\mathrm{ConnectedPath}(v_i, v_j) = \frac{\mathrm{ConnectedPath}(v_i, v_j)}{\mathrm{ConnectedPath}_{\max}} \qquad (3.1)$$

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.
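The Connected Path similarity itself is defined in Chapter 2; the sketch below is only a simplified stand-in, namely a length-weighted count of simple paths up to depth 3 using the networkx library, and not Boongoen et al.'s exact weighting. It also approximates the normalization of Equation 3.1 by taking the maximum over one author's candidates.

    import networkx as nx

    def path_score(G, u, v, max_depth=3):
        # Count simple paths between u and v of bounded length,
        # discounting longer paths; a simplified proxy for the
        # Connected Path score, not the exact formula.
        score = 0.0
        for path in nx.all_simple_paths(G, u, v, cutoff=max_depth):
            score += 1.0 / (len(path) - 1)  # len(path) - 1 = edge count
        return score

    def normalized_cp(G, author, max_depth=3):
        # Equation 3.1: divide every raw score by the maximum score
        # found (here, the maximum over this author's candidates).
        raw = {v: path_score(G, author, v, max_depth)
               for v in G.nodes if v != author}
        maximum = max(raw.values()) or 1.0
        return {v: s / maximum for v, s in raw.items()}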

The third technique that has been tested is the Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in Section 3.1 do not occur in the neighborhood of their correspondents anymore, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
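A minimal sketch of this computation over a dictionary that maps each author to the set of his or her direct correspondents (a data structure assumed for illustration):

    def jaccard(neighbors, u, v):
        """Jaccard similarity of the direct neighborhoods of two authors."""
        nu, nv = neighbors.get(u, set()), neighbors.get(v, set())
        union = nu | nv
        if not union:
            return 0.0
        return len(nu & nv) / len(union)

    # Hypothetical toy network: a1 and a2 share two of three correspondents.
    contacts = {"a1": {"x", "y", "z"}, "a2": {"x", "y"}, "a3": {"q"}}
    print(jaccard(contacts, "a1", "a2"))  # 2 / 3 ~ 0.67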

The last individual technique that has been evaluated is the use of an SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features, to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4. The list of function words that has been used in the feature set can be found in the appendix.
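To illustrate how features of this kind are computed, the sketch below derives a small subset of the Table 3.4 features from a raw message body. The naive tokenization is an assumption of the sketch, not the exact procedure used in the thesis.

    import re
    from collections import Counter

    def stylometric_features(text):
        """Compute a handful of the Table 3.4 features (illustrative subset)."""
        chars = len(text)
        words = re.findall(r"[A-Za-z']+", text)
        counts = Counter(w.lower() for w in words)
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        return {
            "total_chars": chars,                                          # feature 1
            "upper_ratio": sum(c.isupper() for c in text) / max(chars, 1), # feature 3
            "total_words": len(words),                                     # feature 54
            "short_word_ratio":
                sum(len(w) < 4 for w in words) / max(len(words), 1),       # feature 55
            "avg_word_len": sum(map(len, words)) / max(len(words), 1),     # feature 57
            "hapax_legomena": sum(1 for c in counts.values() if c == 1),   # feature 61
            "hapax_dislegomena": sum(1 for c in counts.values() if c == 2),# feature 62
            "num_sentences": len(sentences),                               # feature 492
        }

    print(stylometric_features("The dog barked. The cat did not."))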

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function (RBF) kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM.


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters (~ $ ^ & - _ = + > < [ ] | etc.)
54          Total number of words (M)
55          Total number of short words (less than four characters) / M
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation marks (including ' and ")
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM


The highest scoring combination of parameters is then chosen to train the actual SVM model.
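A sketch of this parameter search using scikit-learn (the thesis used SVM.NET, so the classes below are illustrative substitutes); the grids follow the exponential sequences given above, and the 5 × 5-fold scheme is expressed as a repeated stratified split:

    import numpy as np
    from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
    from sklearn.svm import SVC

    param_grid = {
        "C": 2.0 ** np.arange(-5, 17, 2),      # 2^-5, 2^-3, ..., 2^15
        "gamma": 2.0 ** np.arange(-15, 5, 2),  # 2^-15, 2^-13, ..., 2^3
    }

    def tune_authorship_svm(X, y):
        """Grid search over (C, gamma) for an RBF-kernel SVM, scored by
        5 x 5-fold cross-validation accuracy; refits on the full data."""
        cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
        search = GridSearchCV(SVC(kernel="rbf", probability=True),
                              param_grid, cv=cv, scoring="accuracy")
        search.fit(X, y)
        return search.best_estimator_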

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVMs are sensitive to class imbalance, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19], and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.
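Put together, the one-versus-all training with balanced negative sampling can be sketched as follows; featurize and make_svm are hypothetical helpers (e.g. the feature extraction and the tuned RBF SVM from the previous sketches), and scikit-learn-style estimators again stand in for the SVM.NET classes actually used:

    import random

    def train_one_vs_all(emails_by_author, featurize, make_svm):
        """One binary SVM per author: the author's own emails as positives,
        an equal number sampled from all other authors as negatives."""
        models = {}
        for author, own in emails_by_author.items():
            others = [e for a, es in emails_by_author.items()
                      if a != author for e in es]
            negatives = random.sample(others, len(own))  # assumes enough others
            X = [featurize(e) for e in own + negatives]
            y = [1] * len(own) + [0] * len(negatives)
            model = make_svm()  # probability outputs must be enabled
            model.fit(X, y)
            models[author] = model
        return models

    def most_likely_author(models, featurize, text):
        """Attribute a text to the author whose SVM gives it the highest
        probability of the positive class."""
        x = [featurize(text)]
        return max(models, key=lambda a: models[a].predict_proba(x)[0][1])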

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the


results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

Figure 3.6: The structure of the combined approach

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
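A minimal sketch of the voting SVM, again with scikit-learn's SVC standing in for SVM.NET; each training row is the three-score vector for one manually labeled candidate alias:

    from sklearn.svm import SVC

    def train_voting_svm(score_vectors, labels):
        """score_vectors: one (jw, link, svm) triple per labeled candidate,
        where `link` is either the Jaccard or the Connected Path score."""
        voter = SVC(probability=True)
        voter.fit(score_vectors, labels)
        return voter

    def alias_probability(voter, jw, link, svm):
        """Probability that the candidate is an alias, which can then be
        compared against a decision threshold."""
        return voter.predict_proba([[jw, link, svm]])[0][1]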

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.
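The curves were obtained by sweeping the decision threshold over the candidate scores; a sketch of such a sweep, with hypothetical data structures for the scores and the ground truth, is given below:

    def sweep(scores, true_aliases, thresholds):
        """scores: {(author, candidate): similarity}; true_aliases: the set
        of pairs that really are aliases. Returns (t, P, R, F1) rows."""
        rows = []
        for t in thresholds:
            predicted = {pair for pair, s in scores.items() if s >= t}
            tp = len(predicted & true_aliases)
            p = tp / len(predicted) if predicted else 0.0
            r = tp / len(true_aliases) if true_aliases else 0.0
            f1 = 2 * p * r / (p + r) if p + r else 0.0
            rows.append((t, p, r, f1))
        return rows

    thresholds = [i * 0.05 for i in range(21)]  # 0.0, 0.05, ..., 1.0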

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88, using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best results of all the techniques on this test set, namely an F1-score of 0.89, using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Four plots of precision, recall and F1 against decision thresholds from 0 to 1: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the mixed test set

[Two plots of precision, recall and F1 against decision thresholds from 0 to 1: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the mixed test set


[Four plots of precision, recall and F1 against decision thresholds from 0 to 1: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the hard test set

[Two plots of precision, recall and F1 against decision thresholds from 0 to 1: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the hard test set


Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores


Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that using a simple string metric can detect many aliases resulting from spelling errors, or from the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If there was no Connected Path score returned for a particular author-alias pair, the alias had to be counted as a false negative,


thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author, and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. Because of the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content, and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary, and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research is done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. Intelligent Systems, IEEE, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining – Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval, Workshop on Plagiarism Analysis, Authorship Identification and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics – COLING '04, pages 611–es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management – CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING '08, pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science (New York, NY), 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Evangelos, S., Jiawei, H., and Usama, F., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1261167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. Int. J. Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science, 57(3):378–393.


Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 6: Thesis Freek Maes - Final Version

List of Tables

21 Soundex algorithm rules 722 Contingency table for evaluation 26

31 Preprocessing steps applied to the ENRON corpus 3132 Artificial Aliases in the ENRON data set by type 3533 Distribution of alias-types in two different test sets 3534 Feature set for the authorship SVM 37

2

Chapter 1

Introduction

Authorship disambiguation and alias resolution are increasingly important con-cepts in domains such as intelligence and law where email collections may con-tain authors that use one or more aliases Aliases occur when a person usesmultiple email addresses for either intentional or unintentional reasons For ex-ample people can try to hide their identity by intentionally adopting severaldifferent email addresses something that is common in intelligence data setssuch as terrorist networks On the other hand the use of different email ad-dresses (home office etc) is becoming common nowadays Hence there alsoexist many unintentional aliases where only the domain of the email address isdifferent or where a simple misspelling of a name has occurred

Various approaches have been applied successfully to resolve aliases in emaildata sets although each has its own shortcomings Unintentional aliases canbe resolved by employing metrics that indicate how much two email addresseslook alike However these metrics are easily fooled by persons using completelydifferent email addresses Another approach focuses on the content of the emailby creating a profile of an authorrsquos writing style By comparing the writingstyle of different authors and finding those that employ similar writing stylesaliases that are more complex can be detected This approach has been appliedsuccessfully to attribute authorship of disputed literary works However itencounters scalability issues when the number of authors grows large or thelength of the texts grows small as is the case in email data sets A thirdapproach makes use of the fact that even if an author use a completely differentemail address and writing style the people with whom he corresponds via emailmight remain stable The similarity between different authorsrsquo email contactscan be determined using link analysis techniques These techniques achievereasonable results and sometimes manage to find aliases that other techniquesdo not find

3

The three approaches mentioned above operate on different domains namelythe email address the content of the email and the email network Finding away to combine these approaches and utilize their combined strengths mightenable us to overcome their individual weaknesses In order to guide the re-search that has been conducted for this thesis three research questions havebeen formulated

1 Which authorship disambiguation and alias resolution techniques existthat can be used on email data

2 How can techniques from different domains be combined

3 Can a combination of techniques from different domains increase perfor-mance over individual techniques

11 Structure of the thesis

The structure of the remaining parts of this thesis is as follows

bull Chapter 2 introduces multiple techniques from the fields of AuthorshipDisambiguation and Alias Resolution Specifically string metrics will beexplained in section 21 authorship attribution systems in section 22 andlink analysis techniques in section 23 Several ways of combining thesetechniques as well as different measures for performance evaluation willbe discussed in sections 24 and 25

bull Chapter 3 outlines the methodology that has been used in order to conductthe experiments The email corpus that has been used will be describedas well as the preprocessing that has been applied to it Furthermore thetechniques that have been chosen for evaluation in the experiments willbe explained

bull Chapter 4 will present in detail the results of the experiments that havebeen conducted

bull Finally Chapter 5 provides a summary and discussion of the obtainedresults as well as recommendations for the future

4

Chapter 2

Literature Review

In this chapter a review of relevant literature from the fields of AuthorshipDisambiguation and Alias Resolution will be given The first section will explaindifferent string metrics that have successfully been applied to resolve superficialaliases and authorship problems In the second section authorship attributiontechniques that can be used to resolve the question of authorship in generalwill be discussed Moreover the various design choices that have to be madewhen creating an authorship attribution system will be explained The thirdsection will deal with techniques from Link Analysis that use the network inwhich emails reside to discover aliases In the fourth section several ways ofcombining these techniques will be discussed The last section will introduceseveral measures that can be used for evaluating the performance of differenttechniques

21 String metrics

String similarity metrics are a class of functions that map two strings to a realnumber where the higher the value of this number the greater the similaritybetween the two strings Many string metrics use the number of operationsthat are required to transform one string into another in order to calculatethe similarity between the two Possible operations include insertion deletionsubstitution and transposition A different class of string metrics is the phoneticencodings in which strings are converted into codes according to how they arepronounced However these encodings are language dependent and are notavailable for many languages

String metrics do not take into account information regarding the contextin which the strings occur As such they can be considered rather simple ap-proaches to resolving aliases or settling authorship disputes However stringmetrics can be very useful for detecting misspellings of email aliases result-ing from the using different email domains or naming conventions For exam-ple they can easily detect the similarity between rdquojohndoedomaincomrdquo and

5

rdquojhondoedomaincomrdquo They are less useful when people deliberately try tohide their identity by using completely different email addresses

211 Techniques

In this section the most commonly used string metrics will be discussedThe Levenshtein distance [52] often referred to as edit distance is of one the

earliest and most used string distances It is defined as the minimum requiredamount of operations between string s and t to transform one string into theother Each operation has a cost of 1 and the allowed operations are inser-tion deletion and substitution of a character The Levenshtein distance can betransformed into a similarity metric by using

similarity(s t) =1

Levenshtein(s t) + 1(21)

The Jaro similarity [32] algorithm uses the number of transpositions T andthe number of matching characters m in order to determine the similarity be-tween two strings Two characters are matching only if they are no fartherapart than half the length of the longest string The number of transpositionsis defined as the number of matching characters in different sequence ordersdivided by two The similarity is then calculated as follows

Jaro(s t) =1

3

(m

|s|+m

|t|+mminus Tm

)(22)

where |s| denotes the length of string sThe Jaro-Winkler similarity [67] is an extension of the Jaro-algorithm using

the empirical finding by Winkler that less errors tend to occur at the start ofstrings The similarity is calculated as follows where p is the length of the prefixthat the two strings share

Jaro-Winkler(s t) = Jaro(s t) +p

10(10minus Jaro(s t)) (23)

The Soundex algorithm [53] is the most well known and at the same timethe oldest phonetic encoding that is used for string matching Strings are firstconverted into phonetic codes after which strings with similar codes are assumedto be highly similar In order to convert a string into a Soundex-code the firstletter of the string is retained after which the following letters are convertedto numbers according to the set of rules shown in table 21 In the resultingcode all zeros are removed as well as multiple sequential occurrences of thesame digit The code is then cut-off or extended with zeros such that is hasexactly 3 digits The first letter of the string together with the 3 digits formsthe Soundex-code The Soundex algorithm makes use of the fact that stringsthat are pronounced in a similar fashion tend to have the same Soundex codeFor example rdquoMaidrdquo and rdquoMaderdquo both results in the Soundex code rdquoM300rdquo

The longest common substring [23] method iteratively finds and removes thelongest substring of minimum length l that two strings have in common until

6

Letter Digit

A E I O U H W Y 0B F P V 1C G J K Q S X Z 2D T 3L 4M N 5R 6

Table 21 The rules for converting letters into digits as they are used in theSoundex algorithm

no more substrings can be found The final similarity can then be calculated bytaking the length of all the common substrings divided by either the maximumminimum or average length of the original strings

A slightly different approach by Monge and Elkan [50] uses a string metricsuch as any of the ones discussed above in a recursive matching scheme in orderto determine similarity between strings String s and t are first broken into sub-strings s = s1 sK and t = t1 tK after which the similarity is definedas

Monge-Elkan(s t) =1

K

Ksumi=1

max

Ksumj=1

simprime(si tj) (24)

where simprime(si tj) denotes the similarity score between sub-strings si and tj asassigned by a secondary string metric

Christen [11] provides an extensive comparison of these and other stringmetrics on 4 different test sets of given- sur- and full names He found thatit is important to know beforehand the structure of the names to be matchedand whether they have been parsed and standardized He also found that Jaro-Winkler similarity performed best in a comparison of 27 different string met-rics Furthermore he reached the following conclusions (1) Phonetic encodingshould not be used since they are outperformed by all other techniques (2) Jaroand Jaro-Winkler similarity performs well for given- and surnames if the namesare parsed into separate fields (3) longest common substring is useful when thenames might contain swapped words (4) the Winkler modification can be usedwith every technique to improve the quality of the matching (5) the selection ofa proper threshold is the biggest problem for most matching techniques and (6)the fastest techniques are the ones that that have a time complexity linear to thelength of the strings Cohen and Fienberg [13] evaluated several strings metricson 13 different test sets concluding that the Monge-Elkan distance achieved thebest performance of all the string metrics The Jaro-Winkler metric proved to bea fast heuristic scheme achieving almost the same performance as Monge-Elkanwhilst being considerably less complex in nature

7

22 Authorship Attribution

Other approaches to resolving aliases and disambiguating authors can be foundin the field of Authorship Attribution The authorship attribution task can bedescribed as follows given a set of candidate authors and a set of documentswritten by each of these authors try to determine which of these candidateswrote a given anonymous document In the traditional authorship attributionproblem the number of candidate authors is typically small (2 - 10) the numberof documents per author is large and the length of these documents is largeMoreover it is assumed that the author of the anonymous document is actuallyin the candidate set ie there is a closed candidate set A good example ofa traditional authorship attribution problem is to determine the author of adisputed literary work such as some of Shakespearersquos plays

Authorship attribution techniques can be very useful in resolving aliases anddetermining authorship An authorship attribution system can be trained todistinguish between different authors in an email data set For a given authorit is possible to determine if an alias is being used by letting the authorshipattribution system predict which authorrsquos writing style most closely resemblesthe given authorrsquos writing style

In the remainder of this section the different techniques that have been em-ployed in authorship attribution problems will be explained as well as importantdesign choices that have to be made These include the choice of a feature seta feature selection technique the actual attribution technique and whether totreat the problem from an instance-based perspective or a profile-based perspec-tive

221 Instance vs profile-based

A general distinction can be made between techniques that treat each emailindividually (instance-based) and techniques that accumulate all the emails perauthor (profile-based) The first approach treats each email from a given authoras a single training instance and thereby retains differences in texts from thesame author The second approach accumulates all the texts from a givenauthor into one big training file creating a profile of one author and disregardingdifferences in each individual text The choice is mostly philosophical whetherto model the general style of each author or the individual style of each document[63]

222 Features

An important design choice in authorship attribution systems is the choice offeature set Features are the specific writing-style attributes predefined by theresearcher that are extracted from a piece of text in order to capture stylisticinformation that is characteristic for a particular author Since the choice offeature set can affect the performance of the authorship attribution in variousways it is important to consider which features to include or exclude In general

8

a distinction can be made between lexical syntactic structural semantic andcontent-specific features These features will be discussed in that order in thefollowing sections

Lexical features

Lexical features are the features that are derived at the character and word-levelof the text and are the most commonly used features These features are consid-ered language-independent since they do not need any prior language-dependentprocessing before they can be applied to a text Character-frequencies word-length distributions frequency of digits and non-alphanumeric characters andtotal number of words are all examples of lexical features that provide usefulinformation

An easy-to-use lexical feature that is also computationally simple is charactern-grams For example the character 4-grams that can be extracted from thephrase rdquoThe dogrdquo are rdquothe rdquo rdquohe drdquo rdquoe dordquo and rdquo dogrdquo Character n-gramscan capture various writing style markers from a text such as capitalization orUKUS-variants of certain words Even if a word is incorrectly spelled mostof the n-grams extracted from the correct and incorrect variant will matchalthough a misspelling can also be considered as a style marker for a particularauthor An advantage of character n-grams is that they do not need tokenizationbefore they can be applied to a text which is very useful in Asian or Arabiclanguages where tokenization is difficult

More complicated lexical features require the detection of word and sentence boundaries in the text. By using common Natural Language Processing (NLP) tools, a text can be broken up into its constituent parts during a process called tokenization. A token can be a single word, phrase, symbol or other meaningful element. After counting the occurrence of each distinct token, the n most frequently occurring tokens can be used as features, since the tokens that occur most frequently are considered to contain the most useful discriminatory information.

Another set of features that can be derived from tokenization is the frequency of different word lengths. These features provide information on how often a particular author uses words of different lengths. Vocabulary richness measures are a subset of lexical features that are derived from these word length frequencies. Hapax legomena is the number of words that occur once in a text, whereas hapax dislegomena is the number of words that occur twice. The number of hapax legomena and hapax dislegomena gives an indication of how rich the vocabulary is that is used by a certain author. Authors that have a larger vocabulary will have a higher count of once- or twice-occurring words than authors with a small vocabulary. The type-token ratio V/N is the number of unique tokens V divided by the total number of tokens in a text N, and gives another indication of vocabulary richness. Numerous vocabulary richness measures have been created based on word frequencies, of which the most well known are:


• Yule's K [69]:

\[ K = 10^4 \cdot \left[ -\frac{1}{N} + \sum_i V(i,N) \left( \frac{i}{N} \right)^2 \right] \tag{2.5} \]

where V(i,N) is the number of words occurring i times in the text of length N.

• Sichel's S [59]:

\[ S = \frac{V(2,N)}{V(N)} \tag{2.6} \]

where V(N) is the vocabulary size and V(2,N) the number of twice-occurring words.

• Brunet's W [7]:

\[ W = N^{V(N)^{-a}} \tag{2.7} \]

where N is the number of words and a is usually set to 0.172.

• Honoré's R [26]:

\[ R = 100 \cdot \frac{\log N}{1 - \frac{V(1,N)}{V(N)}} \tag{2.8} \]

where V(1,N) is the number of once-occurring words (hapax legomena).

Furthermore, smileys [64], abbreviations [62], slang words [36] and even spelling errors can be used as stylometric features to distinguish between authors. For example, Koppel [37] tries to simulate the way a human expert discriminates between authors by using stylistic idiosyncrasies, such as misspellings, to fingerprint a particular author. A disadvantage of that approach is that discriminating between authors is hard when there are few idiosyncrasies present in the text.

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing the syntactic constructions that an author uses. The underlying idea is that every author unconsciously uses more or less the same syntactical patterns in each text. Syntactic features rely on the accuracy and availability of NLP tools in different languages in order to tokenize and apply part-of-speech (POS) tags.

A common method is to analyze the short sequences of POS tags that occur most frequently throughout an author's work. For example, the sentence "The man walked in the park" can be represented as

| DETERMINER | NOUN | VERB PAST TENSE | PREPOSITION | DETERMINER | NOUN |


where token starts and ends are delimited by a |. The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS tags can be found in Solorio and Pillay [62], who use POS-tag uni-grams, bi-grams and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactical parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64], or the distance covered by the dependency links in the parse tree.
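For illustration, POS-tag bi-grams of the example sentence could be extracted with NLTK along the following lines (a sketch that assumes NLTK with its tokenizer and tagger data packages installed; the tag names follow the Penn Treebank set rather than the verbose labels above):

    import nltk  # assumes nltk plus its 'punkt' and POS-tagger data packages

    def pos_ngrams(sentence, n=2):
        """Return overlapping n-grams of part-of-speech tags for one sentence."""
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
        return list(zip(*[tags[i:] for i in range(n)]))

    print(pos_ngrams("The man walked in the park"))
    # e.g. [('DT', 'NN'), ('NN', 'VBD'), ('VBD', 'IN'), ('IN', 'DT'), ('DT', 'NN')]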

Function words were introduced by Mosteller and Wallace [51] to distinguish between different authors. Function words are words that have almost no lexical meaning, but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change in short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong" the function words are "The", "of" and "the". They signify the grammatical structure of the sentence without containing any meaning; they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features.

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structures. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks, and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al. [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks. However, they experienced difficulties with extracting these features. Another example of content-specific structural features is the use of HTML-tags by de Vel et al. [17]. They found that some email programs used HTML formatting for their emails, and included the frequency of different HTML-tags in their feature set.

Semantic features

The most complex set of features are the semantic features. These features require sophisticated NLP techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al. [64], who use UK/US-spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity types, such as date, location, money, number, ordinal, organization, percent, person and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word. Function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators, and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features. In such cases feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection has to deal with an additional problem, in the sense that the final feature set might over-fit the training data. The use of feature selection methods is therefore a double-edged sword that can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author, and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

\[ \text{Entropy} = -\sum_{x \in X} P(x) \log P(x) \tag{2.9} \]

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best, and can be used instead of the full feature set.
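As a small illustration (a Python sketch, not the thesis code), entropy and the information gain of a single feature can be computed as follows:

    import math
    from collections import Counter

    def entropy(labels):
        """H(X) = -sum over classes of P(x) * log2 P(x)."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, feature_values):
        """Entropy of the class labels minus the entropy after splitting on the feature."""
        n = len(labels)
        gain = entropy(labels)
        for v in set(feature_values):
            subset = [l for l, f in zip(labels, feature_values) if f == v]
            gain -= len(subset) / n * entropy(subset)
        return gain

    print(information_gain(["a", "a", "b", "b"], [1, 1, 0, 0]))  # 1.0 (a perfect split)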

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "The metric measures the horizontal separation between two standard Normal curves where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the feature set whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data. Hence they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set he still manages to assert with confidence that "The Royal Book of Oz" was written by Ruth Plumly Thompson; Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.
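A minimal sketch of the variance-based variant, assuming scikit-learn; the random matrix only stands in for a real (documents x features) style matrix:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(200, 50)      # stand-in for 200 documents with 50 style features
    pca = PCA(n_components=0.95)     # a float in (0, 1) keeps components up to 95% variance
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape, pca.explained_variance_ratio_.sum())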

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction that is made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents, or between two author profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the


real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts, and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations \(\vec{V}(s)\) and \(\vec{V}(t)\), the cosine similarity is defined as

\[ \text{Cosine}(s, t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)| \, |\vec{V}(t)|} \tag{2.10} \]
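In Python this measure can be sketched over simple bag-of-words vectors (an illustration only; the cited study uses tf-idf and stylistic vectors):

    import math
    from collections import Counter

    def cosine(tokens_s, tokens_t):
        """Cosine similarity between the bag-of-words vectors of two documents."""
        vs, vt = Counter(tokens_s), Counter(tokens_t)
        dot = sum(vs[w] * vt[w] for w in vs.keys() & vt.keys())
        norm = math.sqrt(sum(c * c for c in vs.values())) \
             * math.sqrt(sum(c * c for c in vt.values()))
        return dot / norm if norm else 0.0

    print(cosine("the dog barks".split(), "the dog sleeps".split()))  # 0.666...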

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10,000. In later research by Koppel et al. [40], they report that 46% of 1,000 blog extracts are classified correctly using only the cosine similarity.

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural and domain-specific features from the texts of candidate authors, and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1 = [0, 0.25], A2 = [0.25, 0.50], A3 = [0.50, 0.75] and A4 = [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns, and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.
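The discretization step can be illustrated in a few lines of Python (a sketch of the interval encoding only, not of the frequent-pattern mining itself):

    def discretize(value, intervals=4):
        """One-hot encode a feature value in [0, 1] into equal-width intervals,
        e.g. 0.6 with four intervals maps to (0, 0, 1, 0)."""
        idx = min(int(value * intervals), intervals - 1)  # clamp 1.0 into the last interval
        return tuple(int(i == idx) for i in range(intervals))

    print(discretize(0.6))  # (0, 0, 1, 0)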


Figure 2.1: The structure of a supervised authorship attribution system.


Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature, and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely to that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered as the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author.


His most important finding was that word length distributions tend to remain the same across different works for a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejects the claim of some literary scholars that Shakespeare and Bacon are the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers: a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naïve Bayes probabilistic model to the frequency of these function words, and found that all documents were written by Madison. The Naïve Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, ..., x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the original document can be expressed by

\[ P(A_i \mid x_1, \ldots, x_n) = P(x_1, \ldots, x_n \mid A_i) \, P(A_i) \tag{2.11} \]

The real author is then calculated using

\[ A^{*} = \arg\max_{A_i} P(A_i \mid x_1, \ldots, x_n) \tag{2.12} \]
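A compact sketch of this idea, assuming scikit-learn; the training texts, author labels and the short function-word list below are placeholders for illustration:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    FUNCTION_WORDS = ["the", "of", "and", "to", "in", "a", "that", "upon"]  # illustrative subset

    train_texts = ["the powers of the union ...", "it would be to the states ..."]
    train_authors = ["Hamilton", "Madison"]

    vec = CountVectorizer(vocabulary=FUNCTION_WORDS)   # count only function words
    model = MultinomialNB().fit(vec.transform(train_texts), train_authors)
    print(model.predict(vec.transform(["upon the union of the states ..."])))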

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text".

Burrows computes the z-scores of the 30 most frequent words from a text against a reference corpus (a large contemporary corpus of which the mean and standard deviation of these 30 words is computed), and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1,500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
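Given frequency vectors aligned over the same most-frequent-word list, Delta reduces to a few lines (a NumPy sketch under exactly those assumptions):

    import numpy as np

    def burrows_delta(known, unknown, corpus_mean, corpus_std):
        """Mean absolute difference of z-scores; all arrays hold relative
        frequencies of the same word list, e.g. the 30 most frequent words."""
        z_known = (known - corpus_mean) / corpus_std
        z_unknown = (unknown - corpus_mean) / corpus_std
        return float(np.mean(np.abs(z_known - z_unknown)))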

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training


Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2. By testing the value of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task, according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier have been tested using 2-5 authors and 20-800 documents per author. Across different experiments C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest potential margin of error. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in a very fast classification. An example of how SVM maximizes the margin of error of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin of error. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the Linear Kernel, which assumes that the training data is linearly separable and can separate the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel, and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations to binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-all using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.
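A minimal sketch, assuming scikit-learn, of a one-vs-rest SVM with an RBF kernel; the random matrix stands in for real style-feature vectors:

    import numpy as np
    from sklearn.svm import SVC

    X = np.random.rand(160, 30)                   # stand-in: 160 emails x 30 style features
    y = np.repeat(["author_a", "author_b"], 80)   # known senders of the training emails

    clf = SVC(kernel="rbf", C=1.0, gamma="scale", decision_function_shape="ovr")
    clf.fit(X, y)
    print(clf.predict(np.random.rand(1, 30)))     # predicted author for an unseen email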

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain, and can be used to predict authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes, and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training & validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author. Which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then an edge e_{v_i v_j} ∈ W exists if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i v_j} ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If paper A and B are both cited by a third paper C, it is possible that paper A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together in their papers, and can be expressed as follows:

\[ \text{Co-citation}(v_i, v_j) = |N(v_i) \cap N(v_j)| \tag{2.13} \]

In Graph Theory this is often known as the shared neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases, because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

\[ \text{Jaccard}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|} \tag{2.14} \]

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random in predicting where new links will form. However, it sometimes performs on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
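Over plain Python sets of neighbors this measure is a one-liner (a sketch):

    def jaccard(n_i, n_j):
        """Jaccard similarity between two neighborhood sets N(v_i) and N(v_j)."""
        union = n_i | n_j
        return len(n_i & n_j) / len(union) if union else 0.0

    print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 2 shared / 4 total = 0.5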

SimRank [33] is an iterative extension to co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively. An individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

\[ \text{SimRank}(v_i, v_j) = \frac{C}{|I(v_i)| \, |I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} \text{SimRank}(I_x(v_i), I_y(v_j)) \tag{2.15} \]

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
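The fixed-point iteration can be sketched naively in Python (quadratic in the number of node pairs; the graph is assumed to be a dict mapping each node to its set of in-going neighbors):

    def simrank(in_nbrs, C=0.8, iterations=10):
        """Naive SimRank iteration; in_nbrs maps node -> set of in-going neighbors."""
        nodes = list(in_nbrs)
        sim = {(a, b): float(a == b) for a in nodes for b in nodes}
        for _ in range(iterations):
            new = {}
            for a in nodes:
                for b in nodes:
                    if a == b:
                        new[(a, b)] = 1.0
                    elif in_nbrs[a] and in_nbrs[b]:
                        total = sum(sim[(x, y)] for x in in_nbrs[a] for y in in_nbrs[b])
                        new[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
                    else:
                        new[(a, b)] = 0.0
            sim = new
        return sim

    print(simrank({"a": {"c"}, "b": {"c"}, "c": set()})[("a", "b")])  # 0.8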

PageSim [42] is another extension to the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

\[ \text{ConnectedPath}(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{length(p)} \tag{2.16} \]

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

\[ U(p) = \sum_{v_x \in path(v_i, v_j),\; v_x \notin \{v_i, v_j\}} UQ(v_x) \tag{2.17} \]

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

\[ UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|} \tag{2.18} \]

where w_{x,g} denotes an edge between v_x ∈ path(v_i, v_j) and any other vertex v_g ∈ V, and w_{x,x+1} and w_{x,x-1} denote edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets Connected Path is able to find the most aliases.
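Equations 2.16-2.18 can be sketched with a depth-limited search in Python, assuming the network is an adjacency dict of dicts where graph[u][v] holds the (message-count) weight of edge u-v:

    def uq(graph, x, path):
        """Uniqueness of intermediate vertex x: weight of its two path edges
        divided by the total weight of all edges incident to x (eq. 2.18)."""
        i = path.index(x)
        local = graph[x][path[i - 1]] + graph[x][path[i + 1]]
        return local / sum(graph[x].values())

    def connected_path(graph, vi, vj, r=3):
        """Sum U(p)/length(p) over simple paths from vi to vj of at most r edges."""
        score, stack = 0.0, [[vi]]
        while stack:
            path = stack.pop()
            if path[-1] == vj and len(path) > 1:
                score += sum(uq(graph, x, path) for x in path[1:-1]) / (len(path) - 1)
                continue
            if len(path) <= r:  # path has len(path) - 1 edges so far
                stack.extend(path + [n] for n in graph[path[-1]] if n not in path)
        return score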


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = αs_i + βs_j + γs_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j, k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.
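The combination itself is trivial to express (a sketch; the weights and scores below are made up for illustration):

    def combined_score(scores, weights):
        """Linear combination f(x) = alpha*s_i + beta*s_j + gamma*s_k of
        normalized similarity scores in [0, 1]."""
        return sum(w * s for w, s in zip(weights, scores))

    # e.g. email-address, writing-style and link-network scores for one author pair
    print(combined_score([0.92, 0.40, 0.75], [0.5, 0.2, 0.3]))  # 0.765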

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author, and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach they could assign a class in 31.3% of all cases, and this classification was correct for 88.2% of such cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets. One data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias           false alias
retrieved        true positives (tp)     false positives (fp)
not retrieved    false negatives (fn)    true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution it is common to construct a contingency table, such as the one that can be seen in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

\[ \text{Accuracy} = \frac{\text{correct classifications}}{\text{total number of classifications}} = \frac{tp + tn}{tp + fp + fn + tn} \tag{2.19} \]

Although it looks like a good measure of performance, it is not that hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can also be defined as

\[ P = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{retrieved aliases}|} = \frac{tp}{tp + fp} \tag{2.20} \]

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

\[ R = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{total correct aliases}|} = \frac{tp}{tp + fn} \tag{2.21} \]

These two measures are not as dependent on the class distributions as the accuracy measure. Therefore, they are a more sensible choice to use in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to manually evaluate them anyway. On the other hand, a user that wants to automate the complete process, and be able to rely greatly on the classification given by the system, will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

\[ F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} \tag{2.22} \]

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as

\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{2.23} \]

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].
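From the contingency counts of table 2.2 these measures take only a few lines (a sketch; the counts in the example are made up):

    def precision_recall_f1(tp, fp, fn):
        """Precision, recall and F1 from contingency-table counts."""
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.667, 0.727)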

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed, and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46], and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of effectiveness on the smaller classes.

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques, and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen (2006) found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of over-fitting and perform automated feature selection. Therefore SVM has been chosen as a classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. This chapter will start with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented will be discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of the email messages it can be assumed that the sender of the email has written it, except for the forward and reply parts. Concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set, and the information that is extracted from it.

The most well known version of the data set was made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages, and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step    Records affected    Percentage removed (cum.)
1       17,052              6.70
3       13,681              12.00
4       26,223              22.50
5       4,001               24.00
6       25,990              34.00
7       3,700               35.80
8       52,163              56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply parts of messages have been removed.

3. Empty messages, resulting from the removal of forward or reply parts in step 2, were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances that were needed for classification using Support Vector Machines. The results, which can be found in figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set contains 44,912 messages by 246 different senders. For each message the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author. The x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides


[Figure: 10-fold cross-validation accuracy (y-axis, 0.5-1.0) plotted against the number of training instances per class (x-axis, 20-200), for the Linear and RBF kernels.]

Figure 3.2: Averages of 10 times 10-fold cross-validation using different training set sizes and kernels for the authorship SVM.

an overview of the total number of words per author. The x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set. Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links. It represents the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set, and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created (see the list following the figures and tables below), namely:


[Figure: histogram of the number of authors (y-axis, 0-35) against the number of emails sent (x-axis, 90-230).]

Figure 3.3: The distribution of email messages per author.

[Figure: histogram of the number of authors (y-axis, 0-180) against the total number of words written (x-axis, logarithmic scale starting at 10,000).]

Figure 3.4: The distribution of the total number of words per author.


[Figure: network graph of all senders in the final data set; node labels are email-address prefixes, and node color represents the degree of the node.]

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of Alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No Alias                             193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed    Hard
High Jaro-Winkler    6        2
Low Jaro-Winkler     8        16
No alias             6        2

Table 3.3: Distribution of alias types in the two different test sets

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.comA & john.doe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.

The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
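To make the procedure concrete, the sketch below implements the Jaro and Jaro-Winkler similarities as defined in the literature review and applies a decision threshold to a list of addresses. It is a minimal Python illustration, not the code used in the experiments; the address strings and the threshold value are merely examples:

    def jaro(s, t):
        if s == t:
            return 1.0
        if not s or not t:
            return 0.0
        # Characters match only within half the length of the longer string.
        match_dist = max(len(s), len(t)) // 2 - 1
        s_matched = [False] * len(s)
        t_matched = [False] * len(t)
        m = 0
        for i, c in enumerate(s):
            for j in range(max(0, i - match_dist), min(len(t), i + match_dist + 1)):
                if not t_matched[j] and t[j] == c:
                    s_matched[i] = t_matched[j] = True
                    m += 1
                    break
        if m == 0:
            return 0.0
        # Transpositions: matching characters in a different sequence order, halved.
        k = transpositions = 0
        for i in range(len(s)):
            if s_matched[i]:
                while not t_matched[k]:
                    k += 1
                if s[i] != t[k]:
                    transpositions += 1
                k += 1
        transpositions //= 2
        return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

    def jaro_winkler(s, t):
        # Boost by the shared prefix p (capped at 4): Jaro + (p / 10) * (1 - Jaro).
        j = jaro(s, t)
        p = 0
        for a, b in zip(s, t):
            if a != b or p == 4:
                break
            p += 1
        return j + (p / 10.0) * (1.0 - j)

    def candidate_aliases(address, others, threshold=0.94):
        # All other addresses scoring at or above the decision threshold.
        return [o for o in others if jaro_winkler(address, o) >= threshold]

For instance, two addresses that differ only in a transposed character pair, such as "john.doe@enron.com" and "jhon.doe@enron.com", score close to 1 and are flagged at any reasonable threshold.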

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was normalized as follows:

$$\mathrm{ConnectedPath}(v_i, v_j) = \frac{\mathrm{ConnectedPath}(v_i, v_j)}{\mathrm{ConnectedPath}_{max}} \tag{3.1}$$

where $\mathrm{ConnectedPath}_{max}$ is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in Section 3.1 no longer occur in the neighborhood of their correspondents, and therefore do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
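Concretely, the computation is just a ratio of set sizes over the adjacency sets of the link network. A minimal sketch, with a toy graph invented purely for illustration:

    def jaccard(neighbors_a, neighbors_b):
        # Jaccard similarity of two authors' sets of direct correspondents.
        union = neighbors_a | neighbors_b
        if not union:
            return 0.0
        return len(neighbors_a & neighbors_b) / len(union)

    # Hypothetical link network: author -> set of direct correspondents.
    graph = {
        "alice": {"bob", "carol"},
        "a1ice": {"bob", "carol", "dave"},   # a suspected alias of alice
    }
    print(jaccard(graph["alice"], graph["a1ice"]))   # 2 / 3, roughly 0.67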

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Features   Description

Lexical
1          Total number of characters (C)
2          Total number of alphabetic characters / C
3          Total number of upper-case characters / C
4          Total number of digit characters / C
5          Total number of white-space characters / C
6          Total number of tab spaces / C
7-32       Frequency of letters A-Z
33-53      Frequency of special characters: ~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |
54         Total number of words (M)
55         Total number of short words / M (less than four characters)
56         Total number of characters in words / C
57         Average word length
58         Average sentence length (in characters)
59         Average sentence length (in words)
60         Total different words / M
61         Hapax legomena: frequency of once-occurring words
62         Hapax dislegomena: frequency of twice-occurring words
63-82      Word length frequency distribution / M
83-333     TF-IDF of 250 most frequent 3-grams

Syntactic
334-341    Frequency of punctuation: , . ? ! : ; ' "
342-491    Frequency of function words

Structural
492        Total number of sentences

Table 3.4: Feature set for the authorship SVM

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.
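This parameter search is easy to sketch. The fragment below uses scikit-learn in Python as a stand-in for the SVM.NET setup actually used, with synthetic data in place of the real 492-dimensional stylometric vectors; it is an illustration under those assumptions, not the experimental code:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
    from sklearn.svm import SVC

    # Stand-in for one author's balanced training set of feature vectors.
    X, y = make_classification(n_samples=200, n_features=492, random_state=0)

    # Exponentially growing parameter sequences, as described above.
    param_grid = {
        "C":     [2.0 ** k for k in range(-5, 16, 2)],    # 2^-5, 2^-3, ..., 2^15
        "gamma": [2.0 ** k for k in range(-15, 4, 2)],    # 2^-15, 2^-13, ..., 2^3
    }

    # 5 x 5-fold cross-validation; the best (C, gamma) is refit on all data.
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="accuracy", cv=cv)
    search.fit(X, y)
    model = search.best_estimator_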

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM classifies whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all of that author's emails are selected as positive examples, and an equal amount of randomly sampled emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion, by Matthew A. Johnson, of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single- and multi-class problems using different kernels and parameters.
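A compact sketch of this one-versus-all scheme with balanced negative sampling might look as follows (again Python/scikit-learn rather than SVM.NET; emails_by_author, a mapping from each author to a list of extracted feature vectors, is an assumed input):

    import random
    from sklearn.svm import SVC

    def train_one_vs_all(emails_by_author, seed=0):
        """One binary SVM per author: the author's own emails form the positive
        class, an equally sized random sample from all other authors the negative."""
        rng = random.Random(seed)
        models = {}
        for author, feats in emails_by_author.items():
            negatives = [f for other, fs in emails_by_author.items()
                         if other != author for f in fs]
            negatives = rng.sample(negatives, min(len(feats), len(negatives)))
            X = feats + negatives
            y = [1] * len(feats) + [0] * len(negatives)
            models[author] = SVC(kernel="rbf", probability=True).fit(X, y)
        return models

    def most_likely_author(models, email_vector):
        """Let every author's SVM assign a probability; the highest wins."""
        return max(models,
                   key=lambda a: models[a].predict_proba([email_vector])[0][1])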

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction of whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

Figure 3.6: The structure of the combined approach

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used for testing are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
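Schematically, the voting SVM is an ordinary binary SVM over three-dimensional score vectors. In the sketch below the numbers are invented purely to show the shape of the data, and the threshold value is illustrative:

    from sklearn.svm import SVC

    # One row per candidate pair: [Jaro-Winkler, Jaccard, authorship-SVM score].
    X_train = [[0.96, 0.40, 0.81],
               [0.31, 0.05, 0.22],
               [0.88, 0.55, 0.77],
               [0.15, 0.10, 0.35]]
    y_train = [1, 0, 1, 0]   # 1 = manually labeled as a true alias

    voting_svm = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

    # At test time the alias probability is compared against a decision threshold.
    p_alias = voting_svm.predict_proba([[0.90, 0.48, 0.70]])[0][1]
    is_alias = p_alias >= 0.78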

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets of Table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.
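The reported curves follow from a simple sweep: at each threshold, every candidate scoring at or above it is predicted to be an alias, and precision, recall and F1 are computed against the labels. A self-contained sketch of this evaluation (illustrative, not the evaluation code itself):

    def sweep_thresholds(scores, labels):
        """Precision, recall and F1 at decision thresholds 0.0, 0.05, ..., 1.0."""
        results = []
        for i in range(21):
            th = i * 0.05
            tp = sum(s >= th and l for s, l in zip(scores, labels))
            fp = sum(s >= th and not l for s, l in zip(scores, labels))
            fn = sum(s < th and l for s, l in zip(scores, labels))
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            f1 = (2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)
            results.append((th, precision, recall, f1))
        return results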

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.

[Four line plots, (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard and (d) Authorship SVM, each showing precision, recall and F1 against the decision threshold (0 to 1).]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set

[Two line plots, (a) JW-CP-SVM and (b) JW-Jaccard-SVM, each showing precision, recall and F1 against the decision threshold (0 to 1).]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set

[Four line plots, (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard and (d) Authorship SVM, each showing precision, recall and F1 against the decision threshold (0 to 1).]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set

[Two line plots, (a) JW-CP-SVM and (b) JW-Jaccard-SVM, each showing precision, recall and F1 against the decision threshold (0 to 1).]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set

Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.

Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that using a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If there was no Connected Path score returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. Intelligent Systems, IEEE, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron/.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, page 611, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING '08), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. Int. J. Metadata Semant. Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.


Appendix

List of function words used in the authorship SVM:

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



Chapter 2

Literature Review

In this chapter a review of relevant literature from the fields of AuthorshipDisambiguation and Alias Resolution will be given The first section will explaindifferent string metrics that have successfully been applied to resolve superficialaliases and authorship problems In the second section authorship attributiontechniques that can be used to resolve the question of authorship in generalwill be discussed Moreover the various design choices that have to be madewhen creating an authorship attribution system will be explained The thirdsection will deal with techniques from Link Analysis that use the network inwhich emails reside to discover aliases In the fourth section several ways ofcombining these techniques will be discussed The last section will introduceseveral measures that can be used for evaluating the performance of differenttechniques

21 String metrics

String similarity metrics are a class of functions that map two strings to a realnumber where the higher the value of this number the greater the similaritybetween the two strings Many string metrics use the number of operationsthat are required to transform one string into another in order to calculatethe similarity between the two Possible operations include insertion deletionsubstitution and transposition A different class of string metrics is the phoneticencodings in which strings are converted into codes according to how they arepronounced However these encodings are language dependent and are notavailable for many languages

String metrics do not take into account information regarding the contextin which the strings occur As such they can be considered rather simple ap-proaches to resolving aliases or settling authorship disputes However stringmetrics can be very useful for detecting misspellings of email aliases result-ing from the using different email domains or naming conventions For exam-ple they can easily detect the similarity between rdquojohndoedomaincomrdquo and

5

rdquojhondoedomaincomrdquo They are less useful when people deliberately try tohide their identity by using completely different email addresses

211 Techniques

In this section the most commonly used string metrics will be discussedThe Levenshtein distance [52] often referred to as edit distance is of one the

earliest and most used string distances It is defined as the minimum requiredamount of operations between string s and t to transform one string into theother Each operation has a cost of 1 and the allowed operations are inser-tion deletion and substitution of a character The Levenshtein distance can betransformed into a similarity metric by using

similarity(s t) =1

Levenshtein(s t) + 1(21)

The Jaro similarity [32] algorithm uses the number of transpositions T andthe number of matching characters m in order to determine the similarity be-tween two strings Two characters are matching only if they are no fartherapart than half the length of the longest string The number of transpositionsis defined as the number of matching characters in different sequence ordersdivided by two The similarity is then calculated as follows

Jaro(s t) =1

3

(m

|s|+m

|t|+mminus Tm

)(22)

where |s| denotes the length of string sThe Jaro-Winkler similarity [67] is an extension of the Jaro-algorithm using

the empirical finding by Winkler that less errors tend to occur at the start ofstrings The similarity is calculated as follows where p is the length of the prefixthat the two strings share

Jaro-Winkler(s t) = Jaro(s t) +p

10(10minus Jaro(s t)) (23)

The Soundex algorithm [53] is the most well known and at the same timethe oldest phonetic encoding that is used for string matching Strings are firstconverted into phonetic codes after which strings with similar codes are assumedto be highly similar In order to convert a string into a Soundex-code the firstletter of the string is retained after which the following letters are convertedto numbers according to the set of rules shown in table 21 In the resultingcode all zeros are removed as well as multiple sequential occurrences of thesame digit The code is then cut-off or extended with zeros such that is hasexactly 3 digits The first letter of the string together with the 3 digits formsthe Soundex-code The Soundex algorithm makes use of the fact that stringsthat are pronounced in a similar fashion tend to have the same Soundex codeFor example rdquoMaidrdquo and rdquoMaderdquo both results in the Soundex code rdquoM300rdquo

The longest common substring [23] method iteratively finds and removes thelongest substring of minimum length l that two strings have in common until

6

Letter Digit

A E I O U H W Y 0B F P V 1C G J K Q S X Z 2D T 3L 4M N 5R 6

Table 21 The rules for converting letters into digits as they are used in theSoundex algorithm

no more substrings can be found The final similarity can then be calculated bytaking the length of all the common substrings divided by either the maximumminimum or average length of the original strings

A slightly different approach by Monge and Elkan [50] uses a string metricsuch as any of the ones discussed above in a recursive matching scheme in orderto determine similarity between strings String s and t are first broken into sub-strings s = s1 sK and t = t1 tK after which the similarity is definedas

Monge-Elkan(s t) =1

K

Ksumi=1

max

Ksumj=1

simprime(si tj) (24)

where simprime(si tj) denotes the similarity score between sub-strings si and tj asassigned by a secondary string metric

Christen [11] provides an extensive comparison of these and other stringmetrics on 4 different test sets of given- sur- and full names He found thatit is important to know beforehand the structure of the names to be matchedand whether they have been parsed and standardized He also found that Jaro-Winkler similarity performed best in a comparison of 27 different string met-rics Furthermore he reached the following conclusions (1) Phonetic encodingshould not be used since they are outperformed by all other techniques (2) Jaroand Jaro-Winkler similarity performs well for given- and surnames if the namesare parsed into separate fields (3) longest common substring is useful when thenames might contain swapped words (4) the Winkler modification can be usedwith every technique to improve the quality of the matching (5) the selection ofa proper threshold is the biggest problem for most matching techniques and (6)the fastest techniques are the ones that that have a time complexity linear to thelength of the strings Cohen and Fienberg [13] evaluated several strings metricson 13 different test sets concluding that the Monge-Elkan distance achieved thebest performance of all the string metrics The Jaro-Winkler metric proved to bea fast heuristic scheme achieving almost the same performance as Monge-Elkanwhilst being considerably less complex in nature

7

22 Authorship Attribution

Other approaches to resolving aliases and disambiguating authors can be foundin the field of Authorship Attribution The authorship attribution task can bedescribed as follows given a set of candidate authors and a set of documentswritten by each of these authors try to determine which of these candidateswrote a given anonymous document In the traditional authorship attributionproblem the number of candidate authors is typically small (2 - 10) the numberof documents per author is large and the length of these documents is largeMoreover it is assumed that the author of the anonymous document is actuallyin the candidate set ie there is a closed candidate set A good example ofa traditional authorship attribution problem is to determine the author of adisputed literary work such as some of Shakespearersquos plays

Authorship attribution techniques can be very useful in resolving aliases anddetermining authorship An authorship attribution system can be trained todistinguish between different authors in an email data set For a given authorit is possible to determine if an alias is being used by letting the authorshipattribution system predict which authorrsquos writing style most closely resemblesthe given authorrsquos writing style

In the remainder of this section the different techniques that have been em-ployed in authorship attribution problems will be explained as well as importantdesign choices that have to be made These include the choice of a feature seta feature selection technique the actual attribution technique and whether totreat the problem from an instance-based perspective or a profile-based perspec-tive

221 Instance vs profile-based

A general distinction can be made between techniques that treat each emailindividually (instance-based) and techniques that accumulate all the emails perauthor (profile-based) The first approach treats each email from a given authoras a single training instance and thereby retains differences in texts from thesame author The second approach accumulates all the texts from a givenauthor into one big training file creating a profile of one author and disregardingdifferences in each individual text The choice is mostly philosophical whetherto model the general style of each author or the individual style of each document[63]

222 Features

An important design choice in authorship attribution systems is the choice offeature set Features are the specific writing-style attributes predefined by theresearcher that are extracted from a piece of text in order to capture stylisticinformation that is characteristic for a particular author Since the choice offeature set can affect the performance of the authorship attribution in variousways it is important to consider which features to include or exclude In general

8

a distinction can be made between lexical syntactic structural semantic andcontent-specific features These features will be discussed in that order in thefollowing sections

Lexical features

Lexical features are the features that are derived at the character and word-levelof the text and are the most commonly used features These features are consid-ered language-independent since they do not need any prior language-dependentprocessing before they can be applied to a text Character-frequencies word-length distributions frequency of digits and non-alphanumeric characters andtotal number of words are all examples of lexical features that provide usefulinformation

An easy-to-use lexical feature that is also computationally simple is charactern-grams For example the character 4-grams that can be extracted from thephrase rdquoThe dogrdquo are rdquothe rdquo rdquohe drdquo rdquoe dordquo and rdquo dogrdquo Character n-gramscan capture various writing style markers from a text such as capitalization orUKUS-variants of certain words Even if a word is incorrectly spelled mostof the n-grams extracted from the correct and incorrect variant will matchalthough a misspelling can also be considered as a style marker for a particularauthor An advantage of character n-grams is that they do not need tokenizationbefore they can be applied to a text which is very useful in Asian or Arabiclanguages where tokenization is difficult

More complicated lexical features require the detection of word and sen-tence boundaries in the text By using common Natural Language Processing(NLP) tools a text can be broken up into its constituent parts during a pro-cess called tokenization A token can be a single word phrase symbol or othermeaningful element After counting the occurrence of each distinct token the nmost frequently occurring tokens can be used as features since the tokens thatoccur most frequent are considered to contain the most useful discriminatoryinformation

Another set of features that can be derived from tokenization is the fre-quency of different word lengths These features provide information on howoften a particular author uses words of different lengths Vocabulary richnessmeasures are a subset of lexical features that are derived from these word lengthfrequencies Hapax Legomena is the number of words that occur once in a textwhereas Hapax Dislegomena is the number of words that occur twice Thenumber of hapax legomena and hapax dislegomena gives an indication of howrich the vocabulary is that is used by a certain author Authors that have alarger vocabulary will have a higher count of once- or twice-occurring wordsthan authors with a small vocabulary The type-token ratio VN is the numberof unique tokens V divided by the total number of tokens in a text N andgives another indication of vocabulary richness Numerous vocabulary richnessmeasures have been created based on word frequencies of which the most wellknown are

9

bull Yulersquos K [69]

K = 104 middot

[minus1

N+sumi

V (iN)

(i

N

)2]

(25)

where V (iN) is the number of words occurring i times in the text oflength N

bull Sichelrsquos S [59]

S =V (2 N)

V (N)(26)

where V (N) is the vocabulary size and V (2 N) the number of twice-occurring words

bull Brunetrsquos W [7]W = NV (N)a (27)

where N is the number of words and a is usually set to 0172

bull Honorersquos R [26]

H = 100 middot

logN

1minus N(1N)V (N)

(28)

Furthermore smileys [64] abbreviations [62] slang words [36] and evenspelling errors can be used as stylometric features to distinguish between au-thors For example Koppel [37] tries to simulate the way a human expert dis-criminates between authors by using stylistic idiosyncrasies such as misspellingsto fingerprint a particular author A disadvantage of that approach is that dis-criminating between authors is hard when there are little idiosyncrasies presentin the text

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing thesyntactic constructions that an author uses The underlying idea is that everyauthor unconsciously uses more or less the same syntactical patterns in eachtext Syntactic features rely on the accuracy and availability of NLP-tools indifferent languages in order to tokenize and apply Part-of-speech (POS)-tags

A common method is to analyze the short sequences of POS-tags that occurmost frequently throughout an authorrsquos work For example the sentence rdquoTheman walked in the parkrdquo can be represented as

| DETERMINER | NOUN | VERB PAST TENSE| PREPOSITION | DETERMINER | NOUN |

10

where token start and ends are delimited by a | The complete sentence can beseen as one syntactic pattern but a subset of the sentence can also be takenas a syntactic pattern An example of the use of POS-tags can be found inSolorio and Pillay [62] which use POS-tag uni-grams bi-grams and tri-grams incombination with common lexical features to successfully attribute authorshipOther features that can be derived from syntactical parsing are the depth ofthe resulting parse tree (a measure of sentence complexity) [64] or the distancecovered by the dependency links in the parse tree

Function words are introduced by Mosteller and Wallace [51] to distinguishbetween different authors Function words are words that have almost no lexicalmeaning but signify grammatical relationship between other words in a sentenceThe set of functions words in a language usually does not change in short periodsof time as opposed to content words For example in the sentence rdquoThe leaderof the team was very strongrdquo the function words are rdquoTherdquo rdquoofrdquo and rdquotherdquoThey signify the grammatical structure of the sentence without containing anymeaning they are topic-independent Lists of function words are available formany languages and the frequency counts of the different function words can beused as a features

Structural features

Structural features represent the way an authorrsquos writing is organized includinglayout and paragraph structures Examples of structural features are the num-ber of sentences the number of lines the use of indentation or the number ofwords per paragraph Structural features also include content-specific featuresthat can only be used in particular domains de Vel [16] proposed a set of struc-tural features specifically designed for email such as the presence or absence ofgreetings and salutation blocks and the reply status (whether the body of theemail contains reply or forwarded text) Zheng et al [71] extended these emailspecific features with features such as the presence of telephone numbers andURLs in the salutation blocks However they experienced difficulties with ex-tracting these features Another example of content specific structural featuresis the use of HTML-tags by de Vel et al [17] They found that some emailprograms used HTML formatting for their emails and included the frequency ofdifferent HTML-tags in their feature set

Semantic features

The most complex set of features are the semantic features. These features require sophisticated NLP-techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP-techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al. [64], who use UK/US-spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity-types such as date, location, money, number, ordinal, organization, percent, person and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011-conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word. Function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features. In such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection also has to deal with the additional problem that the final feature set might over-fit the training data. Therefore, the use of feature selection methods is debatable and can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011-conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

Entropy = -\sum_{x \in X} P(x) \log P(x)    (2.9)

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.
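To make the computation concrete, the following minimal Python sketch (illustrative only; the author labels and the binary feature are hypothetical) computes the entropy of equation 2.9 and the resulting information gain of a single discrete feature:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a class distribution (equation 2.9)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, feature_values):
    # Reduction in class entropy after splitting on a discrete feature
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for l, v in zip(labels, feature_values) if v == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Example: a binary feature that separates two authors fairly well
authors = ["A", "A", "A", "B", "B", "B"]
feature = [1, 1, 0, 0, 0, 0]
print(information_gain(authors, feature))  # > 0: the feature is informative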

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "The metric measures the horizontal separation between two standard Normal curves where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the feature set whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data. Hence, they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set he still manages to assert with confidence that "The Royal Book of Oz" has been written by Ruth Plumly Thompson; Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.
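Both uses of PCA can be sketched in a few lines, assuming scikit-learn and random placeholder data in place of real stylometric features:

import numpy as np
from sklearn.decomposition import PCA

# X: one row per document, one column per stylometric feature (placeholder data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Keep as many components as needed to explain 95% of the variance,
# as in Tearle et al. [65]
pca_var = PCA(n_components=0.95)
X_var = pca_var.fit_transform(X)

# Or keep a fixed number of components, as in Binongo [4]
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X)
print(X_var.shape, X_2d.shape)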

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author-profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the


real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations \vec{V}(s) and \vec{V}(t), the cosine similarity is defined as

Cosine(s, t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)| |\vec{V}(t)|}    (2.10)

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10,000. In later research by Koppel et al. [40], they report that 46% of 1000 blog extracts are classified correctly using only the cosine similarity.
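As an illustration of equation 2.10, the following sketch (with made-up documents) computes pairwise cosine similarities between tf-idf representations using scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["the quick brown fox jumps over the lazy dog",
             "a quick brown dog outpaces a lazy fox",
             "completely unrelated text about energy markets"]

vectors = TfidfVectorizer().fit_transform(documents)  # V(s), V(t) as tf-idf rows
scores = cosine_similarity(vectors)                   # equation 2.10 for all pairs
print(scores.round(2))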

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural and domain-specific features from the texts of candidate authors and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1: [0, 0.25], A2: [0.25, 0.50], A3: [0.50, 0.75] and A4: [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns, and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint, to make some patterns more important than others.
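The discretization step can be illustrated with a short sketch (the feature value is hypothetical; interval boundaries as in the example above):

import numpy as np

def discretize(feature_vector, intervals=4):
    # Map each normalized feature in [0, 1] to a one-hot interval indicator,
    # as in the Writeprint example: 0.6 with 4 intervals -> (0, 0, 1, 0)
    encoded = []
    for value in feature_vector:
        index = min(int(value * intervals), intervals - 1)  # 1.0 falls in last bin
        one_hot = [0] * intervals
        one_hot[index] = 1
        encoded.extend(one_hot)
    return np.array(encoded)

print(discretize([0.6]))  # -> [0 0 1 0]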


Figure 2.1: The structure of a supervised authorship attribution system.

The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author.


His most important finding was that word length distributions tend to remain the same across different works of a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejects the claim of some literary scholars that Shakespeare and Bacon are the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers: a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naïve Bayes probabilistic model to the frequency of these function words and found that all disputed documents were written by Madison. The Naïve Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, ..., x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the document can be expressed by

P(A_i | x_1, \ldots, x_n) = P(x_1, \ldots, x_n | A_i) P(A_i)    (2.11)

The real author is then calculated using

A^* = \arg\max_{A_i} P(A_i | x_1, \ldots, x_n)    (2.12)
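A minimal sketch of this idea, using scikit-learn's multinomial Naive Bayes on hypothetical function-word counts rather than the original Federalist data:

from sklearn.naive_bayes import MultinomialNB

# Rows: documents, columns: counts of a small set of function words
# (made-up counts for illustration)
X_train = [[12, 3, 7], [10, 4, 6], [2, 9, 1], [3, 8, 2]]
y_train = ["Hamilton", "Hamilton", "Madison", "Madison"]

model = MultinomialNB().fit(X_train, y_train)
disputed = [[4, 7, 2]]
print(model.predict(disputed))        # argmax_i P(A_i | x_1, ..., x_n)
print(model.predict_proba(disputed))  # per-author posterior probabilities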

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure, called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words from a text against a reference corpus (a large contemporary corpus of which the mean and standard deviation of these 30 words are computed) and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta-score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
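A sketch of the Delta computation under these definitions, with hypothetical word frequencies standing in for a real reference corpus:

import numpy as np

def delta(known_freqs, unknown_freqs, corpus_mean, corpus_std):
    # Burrows' Delta: mean absolute difference between the z-scores of the
    # most frequent words in a known text-group and in the target text
    z_known = (known_freqs - corpus_mean) / corpus_std
    z_unknown = (unknown_freqs - corpus_mean) / corpus_std
    return np.mean(np.abs(z_known - z_unknown))

# Made-up relative frequencies of the 5 most frequent reference words
corpus_mean = np.array([0.050, 0.030, 0.025, 0.020, 0.015])
corpus_std  = np.array([0.010, 0.008, 0.006, 0.005, 0.004])
author_a    = np.array([0.055, 0.028, 0.030, 0.018, 0.016])
unknown     = np.array([0.052, 0.029, 0.028, 0.019, 0.015])
print(delta(author_a, unknown, corpus_mean, corpus_std))  # lower = more similar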

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training


Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier have been tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest possible margin. This means that the generalization error of the model


Figure 2.3: An example of linear separation using a maximum margin of error. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in a very fast classification. An example of how SVM maximizes the margin of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and on what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the Linear kernel, which assumes that the training data is linearly separable and can separate the training data using hyperplanes. A Polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low to high dimensional space using an RBF-kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF-kernel into higher dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations to binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-all using voting, and error-correcting codes [18]. Numerous authorship attribution studies exist that utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a


Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training & validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe-debate and the Federalist Papers.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author. Which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking


technique was able to attribute a 500-word snippet to one of 1000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then an edge e_{v_i v_j} ∈ W if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i v_j} ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.
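A minimal sketch of this graph construction, using the networkx library and fictitious addresses:

import networkx as nx

# Build a directed link network: one vertex per author, one edge per message
G = nx.DiGraph()
emails = [("alice@enron.com", "bob@enron.com"),
          ("alice@enron.com", "carol@enron.com"),
          ("bob@enron.com", "carol@enron.com")]
G.add_edges_from(emails)

# N(v): the neighborhood of a vertex (in-going and out-going neighbors combined)
v = "alice@enron.com"
neighborhood = set(G.successors(v)) | set(G.predecessors(v))
print(neighborhood)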

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If paper A and B are both cited by a third paper C, it is possible that paper A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together in their papers, and can be expressed as follows:

Co-citation(v_i, v_j) = |N(v_i) \cap N(v_j)|    (2.13)

In Graph Theory, this is often known as the shared neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases, because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices vi and vj being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

Jaccard(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}    (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as good as random in predicting where new links will form. However, it sometimes performs only on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
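Equation 2.14 translates directly into code; the neighbor sets below are fictitious:

def jaccard(neighbors_i, neighbors_j):
    # Jaccard similarity of two neighborhoods (equation 2.14)
    if not neighbors_i and not neighbors_j:
        return 0.0
    return len(neighbors_i & neighbors_j) / len(neighbors_i | neighbors_j)

print(jaccard({"bob", "carol", "dave"}, {"bob", "carol", "erin"}))  # 0.5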

SimRank [33] is an iterative extension to co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively. An individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

SimRank(v_i, v_j) = \frac{C}{|I(v_i)||I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} SimRank(I_x(v_i), I_y(v_j))    (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed-point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the Co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
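A minimal fixed-point iteration of equation 2.15 (a sketch; the predecessors of a node are taken as its in-going neighbors I(v), and the toy graph is hypothetical):

import networkx as nx

def simrank(G, C=0.8, iterations=10):
    nodes = list(G)
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {u: {v: 0.0 for v in nodes} for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[u][v] = 1.0
                    continue
                Iu, Iv = list(G.predecessors(u)), list(G.predecessors(v))
                if Iu and Iv:
                    total = sum(sim[x][y] for x in Iu for y in Iv)
                    new[u][v] = C * total / (len(Iu) * len(Iv))
        sim = new
    return sim

G = nx.DiGraph([("a", "c"), ("b", "c"), ("a", "d"), ("b", "d")])
print(simrank(G)["c"]["d"])  # c and d are cited by exactly the same vertices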

PageSim [42] is another extension to the co-citation algorithm that assigns a weight to each link, depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagation of the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

ConnectedPath(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{length(p)}    (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

U(p) = \sum_{v_x \in path(v_i, v_j), v_x \notin \{v_i, v_j\}} UQ(v_x)    (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|}    (2.18)

where w_{x,g} denotes an edge between v_x ∈ path(v_i, v_j) and any other vertex v_g ∈ V, and w_{x,x+1} and w_{x,x-1} denote edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.


Figure 2.7: An example of three different paths between the vertices vi and vj. The most direct path (px) is the most informative path. Image courtesy of Boongoen et al. [6].
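A sketch of equations 2.16-2.18 for unweighted graphs, where each edge weight |w| equals 1, so that UQ(v_x) reduces to 2 divided by the degree of v_x; the toy graph is hypothetical:

import networkx as nx

def uniqueness(G, path):
    # U(p): summed uniqueness UQ(v_x) of the intermediate vertices (eq. 2.17/2.18)
    total = 0.0
    for i in range(1, len(path) - 1):
        v = path[i]
        total += 2.0 / G.degree(v)  # two path edges relative to all edges of v
    return total

def connected_path(G, vi, vj, r=3):
    # Equation 2.16: sum U(p)/length(p) over all paths up to length r
    score = 0.0
    for path in nx.all_simple_paths(G, vi, vj, cutoff=r):
        score += uniqueness(G, path) / (len(path) - 1)  # length(p) = number of edges
    return score

G = nx.Graph([("vi", "a"), ("a", "vj"), ("vi", "b"), ("b", "c"), ("c", "vj")])
print(connected_path(G, "vi", "vj"))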

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = αs_i + βs_j + γs_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j, k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in section 2.3 to generate normalized pair-wise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.
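Such a linear combination is trivial to implement; the weights below are hypothetical and would in practice be tuned on training data:

def combined_score(scores, weights=(0.4, 0.3, 0.3)):
    # f(x) = alpha*s_i + beta*s_j + gamma*s_k over normalized scores in [0, 1]
    return sum(w * s for w, s in zip(weights, scores))

# e.g. Jaro-Winkler, Connected Path and SVM scores for one candidate pair
print(combined_score((0.95, 0.40, 0.70)))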

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism, such as a Support


Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach they could assign a class in 31.3% of all cases, and this classification was correct for 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets. One data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias           false alias
retrieved        true positives (tp)     false positives (fp)
not retrieved    false negatives (fn)    true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table, such as the one that can be seen in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = \frac{\text{correct classifications}}{\text{total number of classifications}} = \frac{tp + tn}{tp + fp + fn + tn}    (2.19)

Although it looks like a good measure of performance, it is not that hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can also be defined as:

P = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{retrieved aliases}|} = \frac{tp}{tp + fp}    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as:

R = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{total correct aliases}|} = \frac{tp}{tp + fn}    (2.21)

These two measures are not as dependent on the class distributions as the accuracy measure. Therefore, they are a more sensible choice to use in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user who wants to automate the complete process and be able to rely greatly on the classification given by the system will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as:

F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as:

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed, and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46] and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.
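The difference between the two averaging schemes can be made explicit in a short sketch (the per-author contingency counts are hypothetical):

def prf(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# contingency counts (tp, fp, fn) per test author
per_author = [(3, 1, 0), (1, 0, 2), (2, 2, 1)]

# macro-averaging: average the per-author scores
macro = [sum(m) / len(per_author) for m in zip(*(prf(*c) for c in per_author))]

# micro-averaging: pool the counts into one global table first
micro = prf(*(sum(c) for c in zip(*per_author)))
print(macro, micro)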

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen (2006) found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the ones most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of over-fitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure, in order to aid the comparison of the different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of the email messages, it can be assumed that the sender of the email has written it, except for the forward and reply-parts. Concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step    Records affected    Percentage removed (cum.)
1       17052               6.70
3       13681               12.00
4       26223               22.50
5       4001                24.00
6       25990               34.00
7       3700                35.80
8       52163               56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply-parts of messages have been removed.

3. Empty messages, resulting from the removal of forward or reply-parts in step 2, were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set, the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44,912 emails by 246 different senders; for each message, the sender, receiver, subject, body and send-date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides


[Figure omitted: plot of 10-fold cross-validation accuracy against the number of training instances per class, for the linear and RBF kernels.]

Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels for the authorship SVM.

an overview of the total number of words per author. The x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set. Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it represents the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set, and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


Figure 3.3: The distribution of email messages per author. [x-axis: number of emails; y-axis: number of authors]

Figure 3.4: The distribution of the total number of words per author. [x-axis: total number of words; y-axis: number of authors]


[Figure omitted: node-link diagram of the sender network; node labels are the senders' email addresses.]

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of Alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No Alias                             193

Table 3.2: Artificial aliases in the ENRON data set by type.

Test set             Mixed    Hard
High Jaro-Winkler    6        2
Low Jaro-Winkler     8        16
No alias             6        2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.comA & john.doe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
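A sketch of this thresholding step, assuming the jellyfish library for the Jaro-Winkler metric (the addresses and the threshold of 0.9 are hypothetical; any Jaro-Winkler implementation can be substituted):

import jellyfish

def find_aliases(address, candidates, threshold=0.9):
    # Flag candidate addresses whose Jaro-Winkler similarity exceeds a threshold
    return [c for c in candidates
            if jellyfish.jaro_winkler_similarity(address, c) >= threshold]

candidates = ["john.doe@enron.com", "jon.doe@enron.com", "mary.hain@enron.com"]
print(find_aliases("john.doe@enron.com", candidates))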

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range of [0, 1], the score for a particular author-author pair was normalized as follows:

ConnectedPath(v_i, v_j) = \frac{ConnectedPath(v_i, v_j)}{ConnectedPath_{max}}    (3.1)

where ConnectedPath_{max} is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^{-5}, 2^{-3}, 2^{-1}, ..., 2^{15} and γ = 2^{-15}, 2^{-13}, 2^{-11}, ..., 2^{3} is calculated using


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters ~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |
54          Total number of words (M)
55          Total number of short words / M (less than four characters)
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation , . ? ! : ; ' "
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.

37

Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest scoring combination of parameters is then chosen to train the actual SVM model.
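For readers who want to reproduce this procedure, the sketch below shows an equivalent grid search in Python with scikit-learn. The thesis itself used SVM.NET, so the class names, the placeholder data and the random seed are assumptions, not the original code:

import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

param_grid = {
    "C":     [2.0 ** e for e in range(-5, 16, 2)],   # 2^-5, 2^-3, ..., 2^15
    "gamma": [2.0 ** e for e in range(-15, 4, 2)],   # 2^-15, 2^-13, ..., 2^3
}

# 5 x 5-fold cross-validation, scored on accuracy.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="accuracy", cv=cv)

X, y = np.random.rand(40, 492), np.repeat([0, 1], 20)  # placeholder data
search.fit(X, y)
best_model = search.best_estimator_  # RBF-SVM trained with the best (C, gamma)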

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author, and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion, by Matthew A. Johnson, of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19], and is able to handle classification, regression and distribution estimation for single- and multi-class problems, using different kernels and parameters.
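The one-versus-all training scheme with balanced classes described above can be sketched as follows. Here, extract_features is a hypothetical helper standing in for the feature extraction of Table 3.4, and scikit-learn is used in place of SVM.NET:

import random
from sklearn.svm import SVC

def train_author_svms(emails_by_author, best_params):
    """Train one probability-outputting SVM per author (one-versus-all)."""
    models = {}
    for author, own_emails in emails_by_author.items():
        # Negative class: emails sampled at random from all other authors,
        # as many as the author has emails (assumes enough are available).
        others = [e for a, es in emails_by_author.items() if a != author for e in es]
        negatives = random.sample(others, len(own_emails))
        X = [extract_features(e) for e in own_emails + negatives]  # hypothetical helper
        y = [1] * len(own_emails) + [0] * len(negatives)
        model = SVC(kernel="rbf", probability=True, **best_params)
        model.fit(X, y)
        models[author] = model
    return models

def attribute(models, email):
    """Return the author whose SVM assigns the highest probability."""
    x = [extract_features(email)]
    return max(models, key=lambda a: models[a].predict_proba(x)[0][1])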

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM".


Figure 3.6: The structure of the combined approach.

The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in figure 3.6.
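The resulting contract of the voting SVM is small: a three-dimensional input vector, one dimension per technique, and a decision score as output. A minimal sketch with invented example scores (the real training data comes from the labeled pairs described below):

from sklearn.svm import SVC

voting_svm = SVC(kernel="rbf")

# Each instance: [Jaro-Winkler score, link-network score, authorship-SVM
# probability] for one (author, candidate alias) pair; label 1 = true alias.
X_train = [[0.95, 0.61, 0.80],   # positive example (true alias)
           [0.42, 0.05, 0.33]]   # negative example
y_train = [1, 0]
voting_svm.fit(X_train, y_train)

candidate = [[0.91, 0.48, 0.77]]
score = voting_svm.decision_function(candidate)[0]
# Varying the cut-off applied to this score yields the precision/recall
# curves for the different decision thresholds reported in Chapter 4.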

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from figure 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88, using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best results of all the techniques on this test set, namely an F1-score of 0.89, using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set. Panels (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard and (d) authorship SVM each plot precision, recall and F1 against the decision threshold.]

[Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set. Panels (a) JW-CP-SVM and (b) JW-Jaccard-SVM each plot precision, recall and F1 against the decision threshold.]


[Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set. Panels (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard and (d) authorship SVM each plot precision, recall and F1 against the decision threshold.]

[Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set. Panels (a) JW-CP-SVM and (b) JW-Jaccard-SVM each plot precision, recall and F1 against the decision threshold.]


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If there was no Connected Path score returned for a particular author-alias pair, the alias had to be counted as a false negative,


thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author, and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed no better than authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. Because of the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary, and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases; no such collection could be found for this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67-75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J. and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48-57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9-17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. 2009 IEEE International Conference on Fuzzy Systems, pages 288-293.

[6] Boongoen, T., Shen, Q. and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77-102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267-287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27-47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1-39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290-294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73-78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273-297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265-292.

[16] de Vel, O. (2000). Mining e-mail authorship. Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M. and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278-285.

[19] Fan, R.-E., Chen, P.-H. and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889-1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval: Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3-6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289-1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486-509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, page 611, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172-177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453-476.

[28] Hsiung, P., Moore, A., Neill, D. and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis (2005), volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C. and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56-64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C. and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C. and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42-S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491-498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1-11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137-142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M. and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69-72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R. and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659-660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J. and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83-94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019-1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019-1020, Edinburgh.

[43] Lin, Z., Lyu, M. R. and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A. and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING '08, pages 513-520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P. and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237-246.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97-105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39-41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J. and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267-270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. U.S. Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R. and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89-99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101-141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482-491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45-72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence: Advances in Artificial Intelligence, pages 249-252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265-269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538-556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N. and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K. and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425-442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17-24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, U.S. Census Bureau, pages 1-15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174-189.

[71] Zheng, R., Li, J., Chen, H. and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378-393.

Appendix

List of function words used in the authorship SVM:

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



The three approaches mentioned above operate on different domains, namely the email address, the content of the email and the email network. Finding a way to combine these approaches and utilize their combined strengths might enable us to overcome their individual weaknesses. In order to guide the research that has been conducted for this thesis, three research questions have been formulated:

1. Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

2. How can techniques from different domains be combined?

3. Can a combination of techniques from different domains increase performance over individual techniques?

1.1 Structure of the thesis

The structure of the remaining parts of this thesis is as follows:

• Chapter 2 introduces multiple techniques from the fields of Authorship Disambiguation and Alias Resolution. Specifically, string metrics will be explained in section 2.1, authorship attribution systems in section 2.2 and link analysis techniques in section 2.3. Several ways of combining these techniques, as well as different measures for performance evaluation, will be discussed in sections 2.4 and 2.5.

• Chapter 3 outlines the methodology that has been used in order to conduct the experiments. The email corpus that has been used will be described, as well as the preprocessing that has been applied to it. Furthermore, the techniques that have been chosen for evaluation in the experiments will be explained.

• Chapter 4 will present in detail the results of the experiments that have been conducted.

• Finally, Chapter 5 provides a summary and discussion of the obtained results, as well as recommendations for the future.


Chapter 2

Literature Review

In this chapter, a review of relevant literature from the fields of Authorship Disambiguation and Alias Resolution will be given. The first section will explain different string metrics that have successfully been applied to resolve superficial aliases and authorship problems. In the second section, authorship attribution techniques that can be used to resolve the question of authorship in general will be discussed. Moreover, the various design choices that have to be made when creating an authorship attribution system will be explained. The third section will deal with techniques from Link Analysis that use the network in which emails reside to discover aliases. In the fourth section, several ways of combining these techniques will be discussed. The last section will introduce several measures that can be used for evaluating the performance of different techniques.

2.1 String metrics

String similarity metrics are a class of functions that map two strings to a real number, where a higher value indicates a greater similarity between the two strings. Many string metrics use the number of operations that are required to transform one string into the other in order to calculate the similarity between the two. Possible operations include insertion, deletion, substitution and transposition. A different class of string metrics is the phonetic encodings, in which strings are converted into codes according to how they are pronounced. However, these encodings are language-dependent and are not available for many languages.

String metrics do not take into account information regarding the context in which the strings occur. As such, they can be considered rather simple approaches to resolving aliases or settling authorship disputes. However, string metrics can be very useful for detecting misspellings of email aliases resulting from the use of different email domains or naming conventions. For example, they can easily detect the similarity between "johndoe@domain.com" and "jhondoe@domain.com". They are less useful when people deliberately try to hide their identity by using completely different email addresses.

2.1.1 Techniques

In this section, the most commonly used string metrics will be discussed.

The Levenshtein distance [52], often referred to as edit distance, is one of the earliest and most used string distances. It is defined as the minimum required number of operations to transform string s into string t. Each operation has a cost of 1, and the allowed operations are insertion, deletion and substitution of a character. The Levenshtein distance can be transformed into a similarity metric by using

\[ similarity(s, t) = \frac{1}{Levenshtein(s, t) + 1} \tag{2.1} \]
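As an illustration, the distance and the similarity of equation (2.1) can be computed with the classic dynamic-programming recurrence (a sketch, not code from the thesis):

def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[len(t)]

def similarity(s, t):
    return 1.0 / (levenshtein(s, t) + 1)  # equation (2.1)

print(levenshtein("jhondoe", "johndoe"))  # 2 (a transposition costs 2 edits)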

The Jaro similarity [32] algorithm uses the number of transpositions T and the number of matching characters m in order to determine the similarity between two strings. Two characters are matching only if they are no farther apart than half the length of the longest string. The number of transpositions is defined as the number of matching characters in different sequence orders, divided by two. The similarity is then calculated as follows:

\[ Jaro(s, t) = \frac{1}{3} \left( \frac{m}{|s|} + \frac{m}{|t|} + \frac{m - T}{m} \right) \tag{2.2} \]

where |s| denotes the length of string s.

The Jaro-Winkler similarity [67] is an extension of the Jaro algorithm, using the empirical finding by Winkler that fewer errors tend to occur at the start of strings. The similarity is calculated as follows, where p is the length of the prefix that the two strings share:

\[ JaroWinkler(s, t) = Jaro(s, t) + \frac{p}{10} \, (1.0 - Jaro(s, t)) \tag{2.3} \]
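A compact Python sketch of equations (2.2) and (2.3) follows; capping the shared prefix p at four characters is an assumption following Winkler's common convention, as the text does not state a cap:

def jaro(s, t):
    """Jaro similarity from matching characters m and transpositions T."""
    if s == t:
        return 1.0
    window = max(len(s), len(t)) // 2 - 1
    s_match, t_match = [False] * len(s), [False] * len(t)
    m = 0
    for i, c in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_match[j] and t[j] == c:
                s_match[i] = t_match[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_chars = [s[i] for i in range(len(s)) if s_match[i]]
    t_chars = [t[j] for j in range(len(t)) if t_match[j]]
    T = sum(a != b for a, b in zip(s_chars, t_chars)) / 2
    return (m / len(s) + m / len(t) + (m - T) / m) / 3

def jaro_winkler(s, t, max_prefix=4):
    """Boost the Jaro score by the length p of the shared prefix (eq. 2.3)."""
    j = jaro(s, t)
    p = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        p += 1
    return j + (p / 10.0) * (1.0 - j)

print(round(jaro_winkler("johndoe", "jhondoe"), 3))  # about 0.957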

The Soundex algorithm [53] is the most well known and at the same time the oldest phonetic encoding that is used for string matching. Strings are first converted into phonetic codes, after which strings with similar codes are assumed to be highly similar. In order to convert a string into a Soundex code, the first letter of the string is retained, after which the following letters are converted to numbers according to the set of rules shown in Table 2.1. In the resulting code, all zeros are removed, as well as multiple sequential occurrences of the same digit. The code is then cut off or extended with zeros such that it has exactly 3 digits. The first letter of the string, together with the 3 digits, forms the Soundex code. The Soundex algorithm makes use of the fact that strings that are pronounced in a similar fashion tend to have the same Soundex code. For example, "Maid" and "Made" both result in the Soundex code "M300".


Letter                          Digit

A, E, I, O, U, H, W, Y          0
B, F, P, V                      1
C, G, J, K, Q, S, X, Z          2
D, T                            3
L                               4
M, N                            5
R                               6

Table 2.1: The rules for converting letters into digits as they are used in the Soundex algorithm.
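A sketch implementing exactly the simple encoding rules described above (standard Soundex variants differ in some edge cases, e.g. the treatment of H and W):

# Table 2.1 as a letter-to-digit lookup.
CODES = {c: d for d, group in enumerate(
    ["AEIOUHWY", "BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]) for c in group}

def soundex(name):
    """Encode a name using the conversion rules of Table 2.1."""
    letters = [c for c in name.upper() if c.isalpha()]
    digits = [str(CODES[c]) for c in letters[1:]]
    out = []
    for d in digits:
        if d != "0" and (not out or out[-1] != d):
            out.append(d)  # drop zeros and sequential duplicate digits
    # Retain the first letter; pad with zeros or cut to exactly 3 digits.
    return letters[0] + ("".join(out) + "000")[:3]

print(soundex("Maid"), soundex("Made"))  # M300 M300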

The longest common substring [23] method iteratively finds and removes the longest substring of minimum length l that two strings have in common, until no more substrings can be found. The final similarity can then be calculated by taking the length of all the common substrings, divided by either the maximum, minimum or average length of the original strings.

A slightly different approach by Monge and Elkan [50] uses a string metric, such as any of the ones discussed above, in a recursive matching scheme in order to determine similarity between strings. Strings s and t are first broken into substrings s = s_1 ... s_K and t = t_1 ... t_K, after which the similarity is defined as

\[ MongeElkan(s, t) = \frac{1}{K} \sum_{i=1}^{K} \max_{j=1}^{K} \, sim'(s_i, t_j) \tag{2.4} \]

where sim'(s_i, t_j) denotes the similarity score between substrings s_i and t_j, as assigned by a secondary string metric.

Christen [11] provides an extensive comparison of these and other string metrics on 4 different test sets of given names, surnames and full names. He found that it is important to know beforehand the structure of the names to be matched, and whether they have been parsed and standardized. He also found that Jaro-Winkler similarity performed best in a comparison of 27 different string metrics. Furthermore, he reached the following conclusions: (1) phonetic encodings should not be used, since they are outperformed by all other techniques; (2) Jaro and Jaro-Winkler similarity perform well for given names and surnames if the names are parsed into separate fields; (3) longest common substring is useful when the names might contain swapped words; (4) the Winkler modification can be used with every technique to improve the quality of the matching; (5) the selection of a proper threshold is the biggest problem for most matching techniques; and (6) the fastest techniques are the ones that have a time complexity linear in the length of the strings. Cohen and Fienberg [13] evaluated several string metrics on 13 different test sets, concluding that the Monge-Elkan distance achieved the best performance of all the string metrics. The Jaro-Winkler metric proved to be a fast heuristic scheme, achieving almost the same performance as Monge-Elkan whilst being considerably less complex in nature.


2.2 Authorship Attribution

Other approaches to resolving aliases and disambiguating authors can be found in the field of Authorship Attribution. The authorship attribution task can be described as follows: given a set of candidate authors and a set of documents written by each of these authors, try to determine which of these candidates wrote a given anonymous document. In the traditional authorship attribution problem, the number of candidate authors is typically small (2-10), the number of documents per author is large, and the length of these documents is large. Moreover, it is assumed that the author of the anonymous document is actually in the candidate set, i.e., there is a closed candidate set. A good example of a traditional authorship attribution problem is to determine the author of a disputed literary work, such as some of Shakespeare's plays.

Authorship attribution techniques can be very useful in resolving aliases and determining authorship. An authorship attribution system can be trained to distinguish between different authors in an email data set. For a given author, it is possible to determine if an alias is being used by letting the authorship attribution system predict which author's writing style most closely resembles the given author's writing style.

In the remainder of this section, the different techniques that have been employed in authorship attribution problems will be explained, as well as important design choices that have to be made. These include the choice of a feature set, a feature selection technique, the actual attribution technique, and whether to treat the problem from an instance-based perspective or a profile-based perspective.

2.2.1 Instance vs. profile-based

A general distinction can be made between techniques that treat each email individually (instance-based) and techniques that accumulate all the emails per author (profile-based). The first approach treats each email from a given author as a single training instance, and thereby retains differences in texts from the same author. The second approach accumulates all the texts from a given author into one big training file, creating a profile of one author and disregarding differences in each individual text. The choice is mostly philosophical: whether to model the general style of each author, or the individual style of each document [63].

2.2.2 Features

An important design choice in authorship attribution systems is the choice of feature set. Features are the specific writing-style attributes, predefined by the researcher, that are extracted from a piece of text in order to capture stylistic information that is characteristic for a particular author. Since the choice of feature set can affect the performance of the authorship attribution in various ways, it is important to consider which features to include or exclude. In general, a distinction can be made between lexical, syntactic, structural, semantic and content-specific features. These features will be discussed in that order in the following sections.

Lexical features

Lexical features are the features that are derived at the character and word level of the text, and are the most commonly used features. These features are considered language-independent, since they do not need any prior language-dependent processing before they can be applied to a text. Character frequencies, word-length distributions, frequency of digits and non-alphanumeric characters, and total number of words are all examples of lexical features that provide useful information.

An easy-to-use lexical feature that is also computationally simple is character n-grams. For example, the character 4-grams that can be extracted from the phrase "The dog" are "the ", "he d", "e do" and " dog". Character n-grams can capture various writing style markers from a text, such as capitalization or UK/US variants of certain words. Even if a word is incorrectly spelled, most of the n-grams extracted from the correct and incorrect variant will match, although a misspelling can also be considered as a style marker for a particular author. An advantage of character n-grams is that they do not need tokenization before they can be applied to a text, which is very useful in Asian or Arabic languages where tokenization is difficult.
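Extracting such n-grams requires only a sliding window over the text; a short Python sketch (with lowercasing as an assumed normalization step, matching the example above):

def char_ngrams(text, n=4):
    """All overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("The dog".lower()))  # ['the ', 'he d', 'e do', ' dog']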

More complicated lexical features require the detection of word and sentence boundaries in the text. By using common Natural Language Processing (NLP) tools, a text can be broken up into its constituent parts during a process called tokenization. A token can be a single word, phrase, symbol or other meaningful element. After counting the occurrence of each distinct token, the n most frequently occurring tokens can be used as features, since the tokens that occur most frequently are considered to contain the most useful discriminatory information.

Another set of features that can be derived from tokenization is the frequency of different word lengths. These features provide information on how often a particular author uses words of different lengths. Vocabulary richness measures are a subset of lexical features that are derived from these word length frequencies. Hapax legomena is the number of words that occur once in a text, whereas hapax dislegomena is the number of words that occur twice. The number of hapax legomena and hapax dislegomena gives an indication of how rich the vocabulary is that is used by a certain author: authors that have a larger vocabulary will have a higher count of once- or twice-occurring words than authors with a small vocabulary. The type-token ratio V/N is the number of unique tokens V divided by the total number of tokens in a text N, and gives another indication of vocabulary richness. Numerous vocabulary richness measures have been created based on word frequencies, of which the most well known are:


• Yule's K [69]:

\[ K = 10^4 \cdot \left[ -\frac{1}{N} + \sum_i V(i, N) \left( \frac{i}{N} \right)^2 \right] \tag{2.5} \]

where V(i, N) is the number of words occurring i times in the text of length N.

• Sichel's S [59]:

\[ S = \frac{V(2, N)}{V(N)} \tag{2.6} \]

where V(N) is the vocabulary size and V(2, N) the number of twice-occurring words.

• Brunet's W [7]:

\[ W = N^{V(N)^{-a}} \tag{2.7} \]

where N is the number of words and a is usually set to 0.172.

• Honoré's R [26]:

\[ R = 100 \cdot \frac{\log N}{1 - V(1, N) / V(N)} \tag{2.8} \]
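For illustration, all four measures can be computed from a word-frequency spectrum in a few lines. The sketch below assumes tokenization has already been done, and that at least one word occurs more than once (otherwise Honoré's R divides by zero):

import math
from collections import Counter

def richness(tokens, a=0.172):
    N = len(tokens)
    freqs = Counter(tokens)                 # word -> number of occurrences
    V = len(freqs)                          # vocabulary size V(N)
    spectrum = Counter(freqs.values())      # i -> V(i, N)
    K = 1e4 * (-1 / N + sum(V_i * (i / N) ** 2 for i, V_i in spectrum.items()))
    S = spectrum.get(2, 0) / V              # Sichel's S (eq. 2.6)
    W = N ** (V ** -a)                      # Brunet's W (eq. 2.7)
    R = 100 * math.log(N) / (1 - spectrum.get(1, 0) / V)  # Honore's R (eq. 2.8)
    return {"yule_K": K, "sichel_S": S, "brunet_W": W, "honore_R": R}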

Furthermore, smileys [64], abbreviations [62], slang words [36] and even spelling errors can be used as stylometric features to distinguish between authors. For example, Koppel [37] tries to simulate the way a human expert discriminates between authors by using stylistic idiosyncrasies, such as misspellings, to fingerprint a particular author. A disadvantage of that approach is that discriminating between authors is hard when there are few idiosyncrasies present in the text.

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing the syntactic constructions that an author uses. The underlying idea is that every author unconsciously uses more or less the same syntactical patterns in each text. Syntactic features rely on the accuracy and availability of NLP tools in different languages in order to tokenize and apply part-of-speech (POS) tags.

A common method is to analyze the short sequences of POS tags that occur most frequently throughout an author's work. For example, the sentence "The man walked in the park" can be represented as

| DETERMINER | NOUN | VERB PAST TENSE | PREPOSITION | DETERMINER | NOUN |


where token starts and ends are delimited by a "|". The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS tags can be found in Solorio and Pillay [62], who use POS tag uni-grams, bi-grams and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactical parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64], or the distance covered by the dependency links in the parse tree.

Function words were introduced by Mosteller and Wallace [51] to distinguish between different authors. Function words are words that have almost no lexical meaning, but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change in short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong", the function words are "The", "of" and "the". They signify the grammatical structure of the sentence without containing any meaning; they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features.

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structures. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks, and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al. [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks. However, they experienced difficulties with extracting these features. Another example of content-specific structural features is the use of HTML tags by de Vel et al. [17]. They found that some email programs used HTML formatting for their emails, and included the frequency of different HTML tags in their feature set.

Semantic features

The most complex set of features are the semantic features. These features require sophisticated NLP techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al. [64], who use UK/US spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity types, such as date, location, money, number, ordinal, organization, percent, person and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word. Function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators, and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features; in such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection has to deal with an additional problem, in the sense that the final feature set might be over-fitting the training data. Therefore, the use of feature selection methods is ambiguous, and can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author, and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

\[ Entropy = -\sum_{x \in X} P(x) \log P(x) \tag{2.9} \]

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best, and can be used instead of the full feature set.

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "the metric measures the horizontal separation between two standard Normal curves, where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the feature set whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data; hence, they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set, he still manages to assert with confidence that "The Royal Book of Oz" has been written by Ruth Plumly Thompson; Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction that is made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents, or between two author profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author.

For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts, and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations \vec{V}(s) and \vec{V}(t), the cosine similarity is defined as

\[ Cosine(s, t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)| \, |\vec{V}(t)|} \tag{2.10} \]

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10,000. In a later study, Koppel et al. [40] report that 46% of 1,000 blog extracts are classified correctly using only the cosine similarity.
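Equation (2.10) in code form, as an illustrative sketch over plain Python lists:

import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors (eq. 2.10)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine([1, 2, 0], [2, 4, 1]))  # about 0.976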

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural and domain-specific features from the texts of candidate authors and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals A1 [0, 0.25], A2 [0.25, 0.50], A3 [0.50, 0.75] and A4 [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns, and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown document's writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint, to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented by 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.
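The discretization and the support counting can be illustrated with the following toy sketch; it is a strong simplification of Iqbal et al.'s mining algorithm (only patterns of one or two items, and no per-author filtering):

    from itertools import combinations

    def discretize(value, bins=4):
        # A feature value in [0, 1] becomes a one-hot vector over `bins`
        # equal-width intervals, e.g. 0.6 -> (0, 0, 1, 0) for four intervals
        index = min(int(value * bins), bins - 1)
        return tuple(1 if i == index else 0 for i in range(bins))

    def frequent_patterns(emails, min_support=0.5):
        # emails: list of sets of active (feature, interval) items
        items = set().union(*emails)
        patterns = []
        for size in (1, 2):
            for combo in combinations(sorted(items), size):
                support = sum(set(combo) <= email for email in emails) / len(emails)
                if support >= min_support:
                    patterns.append((combo, support))
        return patterns

    emails = [{("A", 2), ("B", 0)}, {("A", 2), ("B", 1)}, {("A", 2), ("B", 0)}]
    print(discretize(0.6))            # (0, 0, 1, 0)
    print(frequent_patterns(emails))  # e.g. (("A", 2),) with support 1.0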

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

Figure 2.1: The structure of a supervised authorship attribution system.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature, and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare used words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author. His most important finding was that word-length distributions tend to remain the same across different works for a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tended to use four-letter words most often, whereas Bacon used three-letter words most often. He thereby rejected the claim of some literary scholars that Shakespeare and Bacon were the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers, a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller and Wallace applied a Naïve Bayes probabilistic model to the frequency of these function words and found that all disputed documents were written by Madison. The Naïve Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, ..., x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the document is proportional to

P(A_i \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid A_i) \, P(A_i)    (2.11)

The predicted author is then calculated using

A^* = \arg\max_i P(A_i \mid x_1, \ldots, x_n)    (2.12)
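A hedged sketch of this idea, using a multinomial Naive Bayes classifier over function-word counts (the texts, authors and word list below are toy stand-ins, not the Federalist data):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    function_words = ["upon", "while", "whilst", "by", "to", "also"]
    train_texts = ["upon the whole by and by", "whilst we go to the house also"]
    train_authors = ["Hamilton", "Madison"]

    # Count only the chosen function words in each training text
    vectorizer = CountVectorizer(vocabulary=function_words)
    X = vectorizer.fit_transform(train_texts)

    model = MultinomialNB()
    model.fit(X, train_authors)

    disputed = vectorizer.transform(["upon this point by all means"])
    print(model.predict(disputed))    # most probable author for the disputed text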

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words of a text against a reference corpus (a large contemporary corpus from which the mean and standard deviation of these 30 words are computed), and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1,500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
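A compact sketch of Delta under simplifying assumptions (synthetic relative frequencies instead of real word counts; 30 words, as in Burrows' original setup):

    import numpy as np

    def delta(profile_a, profile_b, mean, std):
        # Mean absolute difference between the z-scores of two texts,
        # where z-scores are taken relative to the reference corpus
        return np.mean(np.abs((profile_a - mean) / std - (profile_b - mean) / std))

    rng = np.random.default_rng(2)
    corpus = rng.random((20, 30))          # 20 reference texts, 30 frequent words
    mean, std = corpus.mean(axis=0), corpus.std(axis=0)

    candidates = {"author_a": rng.random(30), "author_b": rng.random(30)}
    unknown = rng.random(30)

    scores = {a: delta(p, unknown, mean, std) for a, p in candidates.items()}
    print(min(scores, key=scores.get))     # lowest Delta = most likely author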

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier were tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest possible margin. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in very fast classification. An example of how SVM maximizes the margin of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and on what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and separates the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel, and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations of binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-all using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain, and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training and validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of features that is may vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially well suited to preventing false positives when the author is not in the candidate set. Reasonable results can be achieved with as little as 100 words per text.
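A simplified sketch of the unmasking loop (synthetic data; a real application would use actual document-feature vectors):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(3)
    X = rng.random((60, 200))                    # 60 text chunks, 200 features
    y = np.array([1] * 30 + [0] * 30)            # author A vs. questioned text

    active = np.arange(X.shape[1])
    curve = []
    for _ in range(8):                           # eight unmasking iterations
        clf = LinearSVC(dual=False).fit(X[:, active], y)
        curve.append(cross_val_score(clf, X[:, active], y, cv=5).mean())
        weights = clf.coef_.ravel()
        # Drop the k most strongly weighted features in both directions
        k = 3
        drop = np.concatenate([np.argsort(weights)[:k], np.argsort(weights)[-k:]])
        active = np.delete(active, drop)

    print(curve)   # a fast accuracy drop suggests the texts share an author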

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then an edge e_{v_i v_j} ∈ W if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i v_j} ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation or bibliographic coupling occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

Co\text{-}citation(v_i, v_j) = |N(v_i) \cap N(v_j)|    (2.13)

In Graph Theory, this is often known as the shared-neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

Jaccard(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}    (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random guessing in predicting where new links will form. However, it sometimes performs on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
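On a link network stored as an adjacency mapping, the measure is only a few lines (the neighborhoods below are hypothetical):

    def jaccard(neighbors, a, b):
        na, nb = neighbors[a], neighbors[b]
        union = na | nb
        return len(na & nb) / len(union) if union else 0.0

    neighbors = {
        "vi": {"x", "y", "z"},
        "vj": {"w", "y", "z"},
    }
    print(jaccard(neighbors, "vi", "vj"))   # 2 shared / 4 total = 0.5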

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively. An individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

SimRank(v_i, v_j) = \frac{C}{|I(v_i)| \, |I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} SimRank(I_x(v_i), I_y(v_j))    (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
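A sketch of the fixed-point iteration of equation (2.15) on a small hypothetical directed graph:

    from itertools import product

    def simrank(in_neighbors, C=0.8, iterations=10):
        nodes = list(in_neighbors)
        sim = {(a, b): float(a == b) for a, b in product(nodes, nodes)}
        for _ in range(iterations):
            new = {}
            for a, b in product(nodes, nodes):
                Ia, Ib = in_neighbors[a], in_neighbors[b]
                if a == b:
                    new[(a, b)] = 1.0
                elif not Ia or not Ib:
                    new[(a, b)] = 0.0
                else:
                    total = sum(sim[(x, y)] for x in Ia for y in Ib)
                    new[(a, b)] = C * total / (len(Ia) * len(Ib))
            sim = new
        return sim

    in_neighbors = {"A": ["C"], "B": ["C"], "C": ["A", "B"]}
    print(simrank(in_neighbors)[("A", "B")])   # ~0.8: both cited only by C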

PageSim [42] is another extension of the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

ConnectedPath(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{length(p)}    (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length at most r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

U(p) = \sum_{v_x \in path(v_i, v_j), \; v_x \notin \{v_i, v_j\}} UQ(v_x)    (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|}    (2.18)

where w_{x,g} denotes an edge between v_x ∈ path(v_i, v_j) and any other vertex v_g ∈ V, and w_{x,x+1} and w_{x,x-1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
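The following sketch computes equations (2.16)-(2.18) on an unweighted toy graph; with unit edge weights, a vertex's uniqueness reduces to the two path edges divided by its degree:

    def all_paths(graph, src, dst, max_edges):
        # Depth-first enumeration of loop-free paths with at most max_edges edges
        stack = [(src, [src])]
        while stack:
            node, path = stack.pop()
            if node == dst and len(path) > 1:
                yield path
                continue
            if len(path) > max_edges:
                continue
            for nxt in graph[node]:
                if nxt not in path:
                    stack.append((nxt, path + [nxt]))

    def connected_path(graph, vi, vj, r=3):
        score = 0.0
        for path in all_paths(graph, vi, vj, r):
            # UQ(vx) = 2 / degree(vx) for each intermediate vertex (unit weights)
            uniqueness = sum(2 / len(graph[vx]) for vx in path[1:-1])
            score += uniqueness / (len(path) - 1)   # divide by the path length
        return score

    graph = {"vi": {"a", "b"}, "vj": {"a", "b"},
             "a": {"vi", "vj"}, "b": {"vi", "vj", "c"}, "c": {"b"}}
    print(connected_path(graph, "vi", "vj"))        # 1/2 + 1/3 ≈ 0.83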


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = αs_i + βs_j + γs_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j, k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm described in section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.

Another approach is to create a feature vector consisting of the scores assigned by the different techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias          false alias
retrieved        true positives (tp)    false positives (fp)
not retrieved    false negatives (fn)   true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table such as the one shown in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = \frac{correct\ classifications}{total\ number\ of\ classifications} = \frac{tp + tn}{tp + fp + fn + tn}    (2.19)

Although it looks like a good measure of performance, it is not hard to obtain a high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all examples as negative. In most cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as

P = \frac{|\, retrieved\ aliases \cap correct\ aliases \,|}{|\, retrieved\ aliases \,|} = \frac{tp}{tp + fp}    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

R = \frac{|\, retrieved\ aliases \cap correct\ aliases \,|}{|\, total\ correct\ aliases \,|} = \frac{tp}{tp + fn}    (2.21)

These two measures are not as dependent on the class distribution as the accuracy measure. Therefore, they are a more sensible choice in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases at a lower precision if he is going to evaluate them manually anyway. On the other hand, a user that wants to automate the complete process, and wants to rely on the classification given by the system, will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can be written as

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum of precision and recall than the arithmetic mean when the two values differ greatly [46].
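For concreteness, the measures above can be computed from the contingency counts as follows (the counts are hypothetical):

    def precision(tp, fp):
        return tp / (tp + fp) if tp + fp else 0.0

    def recall(tp, fn):
        return tp / (tp + fn) if tp + fn else 0.0

    def f1(p, r):
        # Harmonic mean: close to min(p, r) when the two values differ greatly
        return 2 * p * r / (p + r) if p + r else 0.0

    tp, fp, fn = 8, 2, 4
    p, r = precision(tp, fp), recall(tp, fn)
    print(round(p, 2), round(r, 2), round(f1(p, r), 2))   # 0.8 0.67 0.73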

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed, and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46], and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [2006] found that, when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of overfitting, and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices as well as important implementation details will be elaborated upon. This chapter starts with an introduction of the corpus and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; for the text stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08
Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set, and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty and Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.



2. Forward and reply parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained 10 or fewer words were removed, since they contained too little useful information.

5. Authors that had written a total number of words of 100 or fewer were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.

Step   Records affected   Percentage removed (cum.)
1      17,052              6.70
3      13,681             12.00
4      26,223             22.50
5       4,001             24.00
6      25,990             34.00
7       3,700             35.80
8      52,163             56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in figure 3.2, show that the highest cross-validation accuracy was achieved using 80 emails per author. Therefore, authors that had sent a total number of 80 or fewer emails were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set, the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44,912 messages by 246 different senders. For each message, the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages sent, whereas the y-axis represents the number of authors.

[Plot: 10-fold cross-validation accuracy against the number of training instances per class, for the linear and RBF kernels.]

Figure 3.2: Averages of 10 × 10-fold cross-validation using different training-set sizes and kernels for the authorship SVM.

Figure 3.4 provides an overview of the total number of words per author: the x-axis represents the total number of words one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it reflects the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 21.6. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contains any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of 200 or more emails were selected from the data set, and their emails were split up over several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.comA and john.doe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden and abu_abdallah)

• Authors without an alias



Figure 3.3: The distribution of email messages per author.


Figure 3.4: The distribution of the total number of words per author.


Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of Alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases       1
No alias                            193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed   Hard
High Jaro-Winkler    6       2
Low Jaro-Winkler     8       16
No alias             6       2

Table 3.3: Distribution of alias types in the two test sets.


The distribution of authors and aliases in the final data set can be seen in more detail in table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
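By way of illustration, the pairwise comparison can be sketched with an off-the-shelf Jaro-Winkler implementation; the jellyfish library used below is an assumption standing in for whatever implementation is available, and the pairs and threshold are illustrative:

    import jellyfish   # assumed third-party library with a Jaro-Winkler function

    pairs = [
        ("john.doe@enron.com", "johndoe@enron.com"),
        ("bin_laden", "abu_abdallah"),
    ]
    threshold = 0.94   # an example decision threshold

    for a, b in pairs:
        score = jellyfish.jaro_winkler_similarity(a, b)
        print(a, b, round(score, 3), "alias" if score >= threshold else "no alias")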

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was normalized as follows:

ConnectedPath_{norm}(v_i, v_j) = \frac{ConnectedPath(v_i, v_j)}{ConnectedPath_{max}}    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences between individual email messages. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features, to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^{-5}, 2^{-3}, ..., 2^{15} and γ = 2^{-15}, 2^{-13}, ..., 2^{3} is calculated using 5 × 5-fold cross-validation for each authorship SVM.


Features   Description

Lexical
1          Total number of characters (C)
2          Total number of alphabetic characters / C
3          Total number of upper-case characters / C
4          Total number of digit characters / C
5          Total number of white-space characters / C
6          Total number of tab spaces / C
7-32       Frequency of letters A-Z
33-53      Frequency of special characters, e.g. ~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |
54         Total number of words (M)
55         Total number of short words (less than four characters) / M
56         Total number of characters in words / C
57         Average word length
58         Average sentence length (in characters)
59         Average sentence length (in words)
60         Total different words / M
61         Hapax legomena: frequency of once-occurring words
62         Hapax dislegomena: frequency of twice-occurring words
63-82      Word-length frequency distribution / M
83-333     TF-IDF of the 250 most frequent 3-grams

Syntactic
334-341    Frequency of punctuation , . ? ! : ; ' "
342-491    Frequency of function words

Structural
492        Total number of sentences

Table 3.4: Feature set for the authorship SVM.


The highest-scoring combination of parameters is then chosen to train the actual SVM model.
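The grid search can be sketched as follows; this uses scikit-learn rather than the SVM.NET implementation employed in the thesis, and the data is synthetic:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    rng = np.random.default_rng(4)
    X = rng.random((160, 492))                  # 160 emails, 492 style features
    y = rng.integers(0, 2, size=160)            # 1 = target author, 0 = others

    param_grid = {
        "C": [2.0 ** k for k in range(-5, 16, 2)],        # 2^-5, 2^-3, ..., 2^15
        "gamma": [2.0 ** k for k in range(-15, 4, 2)],    # 2^-15, 2^-13, ..., 2^3
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_)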

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, each authorship SVM classifies whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, the negative emails have been selected at random from the other authors. For each author, all of that author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35], a C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19], and is able to handle classification, regression and distribution estimation for single- and multi-class problems, using different kernels and parameters.

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3), and authorship SVM on email content.

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network, and authorship SVM on email content.

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction of whether the input author is an alias or not. An overview of the combined approach can be found in figure 3.6.

Figure 3.6: The structure of the combined approach.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used for testing are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and five times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
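A hedged sketch of the voting step: every candidate pair is represented by its three normalized technique scores, and a binary SVM with probability outputs acts as the voter. The scores and labels below are synthetic stand-ins for the manually labeled training data:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(5)
    scores = rng.random((90, 3))        # columns: Jaro-Winkler, link, authorship
    labels = (scores.mean(axis=1) > 0.7).astype(int)   # toy alias labels

    voter = SVC(kernel="rbf", probability=True).fit(scores, labels)

    candidate = np.array([[0.95, 0.40, 0.85]])
    print(voter.predict_proba(candidate)[0, 1])        # probability of an alias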

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments are discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set are given. Second, the results of the individual and combined techniques on the hard test set are given. Finally, an overview of the best results achieved by each individual and combined technique is given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM respectively. The results are based on the decision thresholds 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 at decision thresholds ranging from 0.80 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 at a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88, using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM respectively. Again, the results are based on the decision thresholds 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89, using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Four panels of precision/recall curves: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM. Each panel plots precision, recall and F1 against decision thresholds from 0 to 1.]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds, for the individual techniques on the mixed test set.

[Two panels of precision/recall curves: (a) JW-CP-SVM, (b) JW-Jaccard-SVM. Each panel plots precision, recall and F1 against decision thresholds from 0 to 1.]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds, for the combined techniques on the mixed test set.


[Four panels of precision/recall curves: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM. Each panel plots precision, recall and F1 against decision thresholds from 0 to 1.]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds, for the individual techniques on the hard test set.

[Two panels of precision/recall curves: (a) JW-CP-SVM, (b) JW-Jaccard-SVM. Each panel plots precision, recall and F1 against decision thresholds from 0 to 1.]

Figure 44 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the hard test set


Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler. The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, in which aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path. It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the structure of the link network might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes more complicated link connections into account. Second, because of time constraints the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard. Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes direct neighbors into account, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.
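To make the measure concrete, a minimal sketch of the Jaccard similarity between two authors follows, assuming each author is represented by the set of direct correspondents in the link network (the names in the usage comment are purely illustrative):

    def jaccard_similarity(neighbors_a, neighbors_b):
        # neighbors_x: set of direct correspondents of author x in the link network
        union = neighbors_a | neighbors_b
        return len(neighbors_a & neighbors_b) / len(union) if union else 0.0

    # e.g. jaccard_similarity({"kay.mann", "tana.jones"},
    #                         {"kay.mann", "sara.shackleton"})  ->  1/3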

Authorship SVM. The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques. The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results on both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If that is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of the different techniques. Since the weights are often constructed manually, the results are not that good and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results of the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There are myriad decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67-75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48-57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9-17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288-293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77-102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267-287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27-47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1-39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290-294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron/.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73-78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273-297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265-292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. In Multiple Classifier Systems, LNCS 3541, pages 278-285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889-1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval, Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3-6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289-1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486-509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172-177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453-476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56-64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42-S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491-498.

[33] Jeh, G. and Widom, J. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1-11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Machine Learning: ECML-98, Lecture Notes in Computer Science 1398, pages 137-142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69-72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659-660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83-94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019-1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019-1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), page 1613, New York, NY, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513-520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237-246.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97-105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39-41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267-270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89-99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101-141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482-491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45-72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence (Advances in Artificial Intelligence), pages 249-252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265-269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538-556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425-442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17-24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1-15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. In Information Retrieval Technology, Proceedings, LNCS 3689, pages 174-189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378-393.

Appendix

List of function words used in the authorship SVM:

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


Chapter 2

Literature Review

In this chapter, a review of relevant literature from the fields of Authorship Disambiguation and Alias Resolution will be given. The first section will explain different string metrics that have successfully been applied to resolve superficial aliases and authorship problems. In the second section, authorship attribution techniques that can be used to resolve the question of authorship in general will be discussed. Moreover, the various design choices that have to be made when creating an authorship attribution system will be explained. The third section will deal with techniques from Link Analysis that use the network in which emails reside to discover aliases. In the fourth section, several ways of combining these techniques will be discussed. The last section will introduce several measures that can be used for evaluating the performance of the different techniques.

2.1 String metrics

String similarity metrics are a class of functions that map two strings to a real number, where the higher the value of this number, the greater the similarity between the two strings. Many string metrics use the number of operations that are required to transform one string into another in order to calculate the similarity between the two. Possible operations include insertion, deletion, substitution and transposition. A different class of string metrics is the phonetic encodings, in which strings are converted into codes according to how they are pronounced. However, these encodings are language-dependent and are not available for many languages.

String metrics do not take into account information regarding the context in which the strings occur. As such, they can be considered rather simple approaches to resolving aliases or settling authorship disputes. However, string metrics can be very useful for detecting misspellings of email aliases resulting from the use of different email domains or naming conventions. For example, they can easily detect the similarity between "johndoe@domain.com" and "jhondoe@domain.com". They are less useful when people deliberately try to hide their identity by using completely different email addresses.

2.1.1 Techniques

In this section, the most commonly used string metrics will be discussed.

The Levenshtein distance [52], often referred to as edit distance, is one of the earliest and most widely used string distances. It is defined as the minimum number of operations required to transform string s into string t. Each operation has a cost of 1, and the allowed operations are insertion, deletion and substitution of a character. The Levenshtein distance can be transformed into a similarity metric by using

\[ \mathrm{similarity}(s,t) = \frac{1}{\mathrm{Levenshtein}(s,t) + 1} \tag{2.1} \]
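A minimal sketch of equation (2.1), built on the standard dynamic-programming formulation of the edit distance:

    def levenshtein(s, t):
        # classic dynamic-programming edit distance with unit costs
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            curr = [i]
            for j, ct in enumerate(t, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (cs != ct)))   # substitution
            prev = curr
        return prev[-1]

    def levenshtein_similarity(s, t):
        # equation (2.1): map the distance to a similarity in (0, 1]
        return 1.0 / (levenshtein(s, t) + 1)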

The Jaro similarity [32] algorithm uses the number of transpositions T and the number of matching characters m to determine the similarity between two strings. Two characters are matching only if they are no farther apart than half the length of the longest string. The number of transpositions is defined as the number of matching characters in different sequence orders, divided by two. The similarity is then calculated as follows:

\[ \mathrm{Jaro}(s,t) = \frac{1}{3}\left(\frac{m}{|s|} + \frac{m}{|t|} + \frac{m - T}{m}\right) \tag{2.2} \]

where |s| denotes the length of string s.

The Jaro-Winkler similarity [67] is an extension of the Jaro algorithm, using the empirical finding by Winkler that fewer errors tend to occur at the start of strings. The similarity is calculated as follows, where p is the length of the prefix that the two strings share:

\[ \mathrm{JaroWinkler}(s,t) = \mathrm{Jaro}(s,t) + \frac{p}{10}\bigl(1.0 - \mathrm{Jaro}(s,t)\bigr) \tag{2.3} \]
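A minimal sketch of equations (2.2) and (2.3). Two implementation details below are assumptions relative to the text: the matching window uses the common floor(max(|s|, |t|)/2) - 1 convention, and the shared prefix p is capped at four characters, as in Winkler's original formulation:

    def jaro(s, t):
        if not s or not t:
            return 0.0
        window = max(len(s), len(t)) // 2 - 1
        s_match, t_match = [False] * len(s), [False] * len(t)
        m = 0
        for i, ch in enumerate(s):                   # find matching characters
            for j in range(max(0, i - window), min(len(t), i + window + 1)):
                if not t_match[j] and t[j] == ch:
                    s_match[i] = t_match[j] = True
                    m += 1
                    break
        if m == 0:
            return 0.0
        k = transpositions = 0
        for i in range(len(s)):                      # matched chars out of order
            if s_match[i]:
                while not t_match[k]:
                    k += 1
                if s[i] != t[k]:
                    transpositions += 1
                k += 1
        T = transpositions // 2
        return (m / len(s) + m / len(t) + (m - T) / m) / 3

    def jaro_winkler(s, t, max_prefix=4):
        j = jaro(s, t)
        p = 0
        for a, b in zip(s, t):
            if a != b or p == max_prefix:
                break
            p += 1
        return j + (p / 10) * (1.0 - j)

    # jaro_winkler("MARTHA", "MARHTA") is approximately 0.961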

The Soundex algorithm [53] is the most well-known and at the same time the oldest phonetic encoding that is used for string matching. Strings are first converted into phonetic codes, after which strings with similar codes are assumed to be highly similar. In order to convert a string into a Soundex code, the first letter of the string is retained, after which the following letters are converted to numbers according to the set of rules shown in Table 2.1. In the resulting code, all zeros are removed, as well as multiple sequential occurrences of the same digit. The code is then cut off or extended with zeros such that it has exactly 3 digits. The first letter of the string, together with the 3 digits, forms the Soundex code. The Soundex algorithm makes use of the fact that strings that are pronounced in a similar fashion tend to have the same Soundex code. For example, "Maid" and "Made" both result in the Soundex code "M300".
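A minimal sketch of this encoding, using the conversion rules of Table 2.1 below; the exact order of collapsing repeated digits and removing zeros is an assumption, chosen so that the "Maid"/"Made" example yields "M300":

    def soundex(name):
        table = {}
        for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                               ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
            for ch in letters:
                table[ch] = digit
        name = name.upper()
        digits = [table[c] for c in name[1:] if c in table]
        collapsed = []
        for d in digits:                       # collapse runs of the same digit
            if not collapsed or collapsed[-1] != d:
                collapsed.append(d)
        code = "".join(d for d in collapsed if d != "0")   # then drop the zeros
        return name[0] + (code + "000")[:3]    # pad or cut to exactly 3 digits

    # soundex("Maid") == soundex("Made") == "M300"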

Letter                      Digit
A, E, I, O, U, H, W, Y      0
B, F, P, V                  1
C, G, J, K, Q, S, X, Z      2
D, T                        3
L                           4
M, N                        5
R                           6

Table 2.1: The rules for converting letters into digits as they are used in the Soundex algorithm.

The longest common substring [23] method iteratively finds and removes the longest substring of minimum length l that two strings have in common, until no more substrings can be found. The final similarity can then be calculated by taking the length of all the common substrings, divided by either the maximum, minimum or average length of the original strings.

A slightly different approach by Monge and Elkan [50] uses a string metric, such as any of the ones discussed above, in a recursive matching scheme in order to determine the similarity between strings. Strings s and t are first broken into substrings s = s1 ... sK and t = t1 ... tK, after which the similarity is defined as

\[ \mathrm{MongeElkan}(s,t) = \frac{1}{K} \sum_{i=1}^{K} \max_{j=1}^{K} \ \mathrm{sim}'(s_i, t_j) \tag{2.4} \]

where sim'(si, tj) denotes the similarity score between substrings si and tj as assigned by a secondary string metric.
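A minimal sketch of equation (2.4); splitting on whitespace is an assumption, since the text leaves the substring decomposition open, and any secondary metric (for example the jaro_winkler function above) can be passed in:

    def monge_elkan(s, t, sim):
        # sim: the secondary string metric sim'(si, tj)
        s_parts, t_parts = s.split(), t.split()
        return sum(max(sim(si, tj) for tj in t_parts) for si in s_parts) / len(s_parts)

    # e.g. monge_elkan("john h doe", "doe john", jaro_winkler)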

Christen [11] provides an extensive comparison of these and other string metrics on 4 different test sets of given names, surnames and full names. He found that it is important to know beforehand the structure of the names to be matched, and whether they have been parsed and standardized. He also found that Jaro-Winkler similarity performed best in a comparison of 27 different string metrics. Furthermore, he reached the following conclusions: (1) phonetic encodings should not be used, since they are outperformed by all other techniques; (2) Jaro and Jaro-Winkler similarity perform well for given names and surnames if the names are parsed into separate fields; (3) longest common substring is useful when the names might contain swapped words; (4) the Winkler modification can be used with every technique to improve the quality of the matching; (5) the selection of a proper threshold is the biggest problem for most matching techniques; and (6) the fastest techniques are the ones that have a time complexity linear in the length of the strings. Cohen and Fienberg [13] evaluated several string metrics on 13 different test sets, concluding that the Monge-Elkan distance achieved the best performance of all the string metrics. The Jaro-Winkler metric proved to be a fast heuristic scheme, achieving almost the same performance as Monge-Elkan whilst being considerably less complex in nature.

7

2.2 Authorship Attribution

Other approaches to resolving aliases and disambiguating authors can be found in the field of Authorship Attribution. The authorship attribution task can be described as follows: given a set of candidate authors and a set of documents written by each of these authors, try to determine which of these candidates wrote a given anonymous document. In the traditional authorship attribution problem, the number of candidate authors is typically small (2-10), the number of documents per author is large and the length of these documents is large. Moreover, it is assumed that the author of the anonymous document is actually in the candidate set, i.e. there is a closed candidate set. A good example of a traditional authorship attribution problem is to determine the author of a disputed literary work, such as some of Shakespeare's plays.

Authorship attribution techniques can be very useful in resolving aliases and determining authorship. An authorship attribution system can be trained to distinguish between the different authors in an email data set. For a given author, it is possible to determine whether an alias is being used by letting the authorship attribution system predict which author's writing style most closely resembles the given author's writing style.

In the remainder of this section, the different techniques that have been employed in authorship attribution problems will be explained, as well as the important design choices that have to be made. These include the choice of a feature set, a feature selection technique, the actual attribution technique, and whether to treat the problem from an instance-based or a profile-based perspective.

2.2.1 Instance- vs. profile-based

A general distinction can be made between techniques that treat each email individually (instance-based) and techniques that accumulate all the emails per author (profile-based). The first approach treats each email from a given author as a single training instance, and thereby retains differences in texts from the same author. The second approach accumulates all the texts from a given author into one big training file, creating a profile of one author and disregarding differences in each individual text. The choice is mostly philosophical: whether to model the general style of each author or the individual style of each document [63].

2.2.2 Features

An important design choice in authorship attribution systems is the choice of feature set. Features are the specific writing-style attributes, predefined by the researcher, that are extracted from a piece of text in order to capture stylistic information that is characteristic of a particular author. Since the choice of feature set can affect the performance of the authorship attribution in various ways, it is important to consider which features to include or exclude. In general, a distinction can be made between lexical, syntactic, structural, semantic and content-specific features. These features will be discussed in that order in the following sections.

Lexical features

Lexical features are the features that are derived at the character and word level of the text, and are the most commonly used features. These features are considered language-independent, since they do not need any prior language-dependent processing before they can be applied to a text. Character frequencies, word-length distributions, the frequency of digits and non-alphanumeric characters, and the total number of words are all examples of lexical features that provide useful information.

An easy-to-use lexical feature that is also computationally simple is character n-grams. For example, the character 4-grams that can be extracted from the phrase "The dog" are "the ", "he d", "e do" and " dog". Character n-grams can capture various writing-style markers from a text, such as capitalization or UK/US variants of certain words. Even if a word is incorrectly spelled, most of the n-grams extracted from the correct and incorrect variants will match, although a misspelling can also be considered a style marker for a particular author. An advantage of character n-grams is that they do not need tokenization before they can be applied to a text, which is very useful in Asian or Arabic languages where tokenization is difficult.
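A one-function sketch reproducing the example above (lowercasing the text first is an assumption):

    def char_ngrams(text, n=4):
        # all overlapping character n-grams of the lowercased text
        text = text.lower()
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("The dog"))   # ['the ', 'he d', 'e do', ' dog']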

More complicated lexical features require the detection of word and sentence boundaries in the text. By using common Natural Language Processing (NLP) tools, a text can be broken up into its constituent parts during a process called tokenization. A token can be a single word, phrase, symbol or other meaningful element. After counting the occurrences of each distinct token, the n most frequently occurring tokens can be used as features, since the tokens that occur most frequently are considered to contain the most useful discriminatory information.

Another set of features that can be derived from tokenization is the frequency of different word lengths. These features provide information on how often a particular author uses words of different lengths. Vocabulary richness measures are a subset of lexical features that are derived from word frequencies. Hapax legomena is the number of words that occur once in a text, whereas hapax dislegomena is the number of words that occur twice. The number of hapax legomena and hapax dislegomena gives an indication of how rich the vocabulary of a certain author is. Authors that have a larger vocabulary will have a higher count of once- or twice-occurring words than authors with a small vocabulary. The type-token ratio V/N is the number of unique tokens V divided by the total number of tokens in a text N, and gives another indication of vocabulary richness. Numerous vocabulary richness measures have been created based on word frequencies, of which the most well known are:


- Yule's K [69]:

\[ K = 10^4 \cdot \left[ -\frac{1}{N} + \sum_{i} V(i,N) \left( \frac{i}{N} \right)^2 \right] \tag{2.5} \]

where V(i,N) is the number of words occurring i times in the text of length N.

- Sichel's S [59]:

\[ S = \frac{V(2,N)}{V(N)} \tag{2.6} \]

where V(N) is the vocabulary size and V(2,N) the number of twice-occurring words.

- Brunet's W [7]:

\[ W = N^{V(N)^{-a}} \tag{2.7} \]

where N is the number of words and a is usually set to 0.172.

- Honoré's R [26]:

\[ R = 100 \cdot \frac{\log N}{1 - \frac{V(1,N)}{V(N)}} \tag{2.8} \]
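A minimal sketch computing these measures from a list of tokens; it assumes the tokens are lowercased words, and note that Honoré's R is undefined for a text in which every word is a hapax legomenon:

    from collections import Counter
    import math

    def richness_measures(tokens, a=0.172):
        N = len(tokens)
        freqs = Counter(tokens)                 # word -> frequency
        V = len(freqs)                          # vocabulary size V(N)
        spectrum = Counter(freqs.values())      # i -> V(i, N)
        v1, v2 = spectrum.get(1, 0), spectrum.get(2, 0)   # hapax (dis)legomena
        yule_k = 1e4 * (-1 / N + sum(v * (i / N) ** 2 for i, v in spectrum.items()))
        sichel_s = v2 / V
        brunet_w = N ** (V ** -a)
        honore_r = 100 * math.log(N) / (1 - v1 / V)       # requires v1 < V
        return {"K": yule_k, "S": sichel_s, "W": brunet_w, "R": honore_r, "V/N": V / N}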

Furthermore, smileys [64], abbreviations [62], slang words [36] and even spelling errors can be used as stylometric features to distinguish between authors. For example, Koppel [37] tries to simulate the way a human expert discriminates between authors by using stylistic idiosyncrasies, such as misspellings, to fingerprint a particular author. A disadvantage of that approach is that discriminating between authors is hard when there are few idiosyncrasies present in the text.

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing the syntactic constructions that an author uses. The underlying idea is that every author unconsciously uses more or less the same syntactical patterns in each text. Syntactic features rely on the accuracy and availability of NLP tools in different languages in order to tokenize and apply part-of-speech (POS) tags.

A common method is to analyze the short sequences of POS-tags that occur most frequently throughout an author's work. For example, the sentence "The man walked in the park" can be represented as

| DETERMINER | NOUN | VERB PAST TENSE | PREPOSITION | DETERMINER | NOUN |

where token starts and ends are delimited by a |. The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS-tags can be found in Solorio and Pillay [62], who use POS-tag uni-grams, bi-grams and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactic parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64], or the distance covered by the dependency links in the parse tree.

Function words were introduced by Mosteller and Wallace [51] as a means to distinguish between different authors. Function words are words that have almost no lexical meaning, but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change over short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong", the function words are "The", "of" and "the". They signify the grammatical structure of the sentence without containing any meaning; they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features.

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structures. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks, and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al. [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks. However, they experienced difficulties with extracting these features. Another example of content-specific structural features is the use of HTML tags by de Vel et al. [17]. They found that some email programs used HTML formatting for their emails, and included the frequency of different HTML tags in their feature set.

Semantic features

The most complex set of features is the set of semantic features. These features require sophisticated NLP techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al. [64], who use UK/US spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity types, such as date, location, money, number, ordinal, organization, percent, person and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word. Function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features. In such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection also has to deal with the additional problem that the final feature set might overfit the training data. The use of feature selection methods is therefore not clear-cut, and can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and by Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

\[ \mathrm{Entropy} = -\sum_{x \in X} P(x) \log P(x) \tag{2.9} \]

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.
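A minimal sketch for a single binary feature, where labels are the author classes and feature_present flags which instances contain the feature (how real-valued features would be binarized is left open):

    from collections import Counter
    import math

    def entropy(labels):
        # equation (2.9) over the empirical class distribution
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, feature_present):
        with_f = [y for y, f in zip(labels, feature_present) if f]
        without_f = [y for y, f in zip(labels, feature_present) if not f]
        n = len(labels)
        remainder = sum(len(part) / n * entropy(part)
                        for part in (with_f, without_f) if part)
        return entropy(labels) - remainder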

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "the metric measures the horizontal separation between two standard Normal curves, where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the feature set whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data. Hence, they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set he still manages to assert with confidence that "The Royal Book of Oz" was written by Ruth Plumly Thompson, although Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.
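As an illustration, variance-based component selection can be sketched with scikit-learn; the feature matrix X is assumed to be given, and the 0.95 threshold mirrors the choice of Tearle et al. [65]:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=0.95)      # keep enough components for 95% of the variance
    X_reduced = pca.fit_transform(X)  # X: the original high-dimensional feature matrix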

2.2.4 Techniques

A major design choice in every authorship attribution system is the attribution technique that will be used. A common distinction made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship attribution.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations \(\vec{V}(s)\) and \(\vec{V}(t)\), the cosine similarity is defined as

\[ \mathrm{Cosine}(s,t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)|\,|\vec{V}(t)|} \tag{2.10} \]
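A minimal sketch of equation (2.10) over sparse vectors stored as dictionaries; computing the tf-idf or stylistic weights themselves is assumed to have happened beforehand:

    import math

    def cosine(u, v):
        # u, v: feature -> weight mappings for the two documents
        dot = sum(w * v.get(f, 0.0) for f, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values())) *
                math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0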

Over 20 of the extracts was found to be most similar to its actual author whichis quite promising considering the fact that the number of candidate authors is10000 In a later research by Koppel et al [40] they report that 46 of 1000blog extracts are classified correctly using only the cosine similarity

Instead of comparing one document against all other documents it is alsopossible to use these measures to create clusters of similar documents Usingsimple k-nearest-neighbor clustering each document is assigned to the majorityvote of its k metrically nearest neighbors This results in several clusters ofdocuments that have high inter-cluster similarity and low intra-cluster simi-larity Based on these clusters it can be concluded that the documents in onecluster have been written by a single author possibly under different aliasesIqbal et al [29] use a similar approach to cluster emails by their writing styleusing k-means clustering in which emails are recursively assigned to the clusterwith the nearest mean based on a given set of k initial means The similaritymeasure that is used to compare two documents is called Writeprint as will beexplained below They achieve an F-score of 090 for a data set with 5 authorsand 40 messages per author However the performance decreases significantlywhen the number of authors or the number of messages per author increasesindicating that the technique has scalability problems

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural and domain-specific features from the texts of candidate authors and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1 = [0, 0.25], A2 = [0.25, 0.50], A3 = [0.50, 0.75] and A4 = [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown document's writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint, to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.

Figure 2.1: The structure of a supervised authorship attribution system.

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in Figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature, and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author. His most important finding was that word-length distributions tend to remain the same across different works of a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejected the claim of some literary scholars that Shakespeare and Bacon were the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers: a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naive Bayes probabilistic model to the frequency of these function words and found that all documents were written by Madison. The Naive Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x1, ..., xn and a set of authors A, where Ai denotes an individual author, the probability that a given author Ai is the real author of the original document can be expressed by

\[ P(A_i \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid A_i)\, P(A_i) \tag{2.11} \]

The real author is then calculated using

\[ A^{*} = \arg\max_{A_i \in A} P(A_i \mid x_1, \ldots, x_n) \tag{2.12} \]
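A minimal sketch of this model over function-word counts, using a multinomial Naive Bayes classifier from scikit-learn; the texts, labels and word list are illustrative placeholders, not the Federalist data:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    function_words = ["the", "of", "and", "to", "upon", "while", "by"]
    train_texts = ["upon the whole and by the powers of the union",
                   "while the people of the states and of the union"]
    train_authors = ["Hamilton", "Madison"]

    vectorizer = CountVectorizer(vocabulary=function_words)  # count function words only
    X = vectorizer.fit_transform(train_texts)
    model = MultinomialNB().fit(X, train_authors)

    disputed = vectorizer.transform(["to the people of the state while the union holds"])
    print(model.predict(disputed))   # the arg max of equation (2.12)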

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words of a text against a reference corpus (a large contemporary corpus on which the mean and standard deviation of these 30 words are computed), and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
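A minimal sketch of the Delta score between a candidate's text and the target text; the profiles map each frequent word to its relative frequency, and the reference-corpus means and standard deviations are assumed to be precomputed:

    def burrows_delta(candidate, target, ref_mean, ref_std, words):
        # mean absolute difference of the per-word z-scores (Burrows [8])
        total = 0.0
        for w in words:
            z_cand = (candidate.get(w, 0.0) - ref_mean[w]) / ref_std[w]
            z_targ = (target.get(w, 0.0) - ref_mean[w]) / ref_std[w]
            total += abs(z_cand - z_targ)
        return total / len(words)

    # the candidate author with the lowest score is attributed the target text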

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in Section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in Figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier were tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest potential margin of error. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in very fast classification. An example of how SVM maximizes the margin of error of the hyperplane that separates two linearly separable classes can be seen in Figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin of error. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and separates the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations to binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-one using max-wins voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.

Another machine learning technique that can be used for authorship attribution is the Artificial Neural Network (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a


Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training and validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset that is may vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking


technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially good at preventing false positive results when the author is not in the candidate set. Reasonable results can be achieved with as little as 100 words per text.
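
To make the procedure concrete, the following Python sketch outlines one plausible way to compute an unmasking curve with scikit-learn; the function name, the choice of k and the 5-fold cross-validation are illustrative assumptions, not Koppel et al.'s exact configuration.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    def unmasking_curve(X, y, iterations=10, k=3):
        # Cross-validation accuracy per iteration while repeatedly removing
        # the k strongest positive and k strongest negative SVM features.
        active = np.arange(X.shape[1])  # indices of the features still in play
        curve = []
        for _ in range(iterations):
            Xa = X[:, active]
            curve.append(cross_val_score(LinearSVC(), Xa, y, cv=5).mean())
            w = LinearSVC().fit(Xa, y).coef_[0]  # weights of a linear SVM
            order = np.argsort(w)
            drop = np.concatenate([order[:k], order[-k:]])
            active = np.delete(active, drop)
        # A curve that degrades quickly suggests the two texts share an author.
        return curve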

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely link analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then an edge e_{v_i v_j} ∈ W exists if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i v_j} ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation occurs when two scientific documents are cited together by the same third document (the closely related bibliographic coupling occurs when two documents share one or more bibliographical references). If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. The co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

Co\text{-}citation(v_i, v_j) = |N(v_i) \cap N(v_j)|    (2.13)

In graph theory this is often known as the shared neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

Jaccard(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}    (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times better than random in predicting where new links will form. However, it sometimes performs merely on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
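
As an illustration, a minimal Python implementation of equation 2.14 on neighbor sets could look as follows (the example addresses are fabricated):

    def jaccard(neighbors_a, neighbors_b):
        # Jaccard similarity between the neighborhoods of two vertices (eq. 2.14).
        union = neighbors_a | neighbors_b
        if not union:
            return 0.0
        return len(neighbors_a & neighbors_b) / len(union)

    # Two addresses sharing two of their four distinct correspondents:
    print(jaccard({"a@x.com", "b@x.com", "c@x.com"},
                  {"a@x.com", "b@x.com", "d@x.com"}))  # prints 0.5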

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors and are denoted by |I(v)| and |O(v)|, respectively. An individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

SimRank(v_i, v_j) = \frac{C}{|I(v_i)||I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} SimRank(I_x(v_i), I_y(v_j))    (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range


of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
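
The fixed-point iteration can be sketched in Python as follows; the representation of the graph as a dictionary of in-neighbor sets and the decay constant C = 0.8 are assumptions made for illustration:

    def simrank(in_neighbors, C=0.8, iterations=10):
        # Fixed-point iteration of eq. 2.15; in_neighbors maps every vertex
        # to the set of vertices that have an edge pointing towards it.
        nodes = list(in_neighbors)
        sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
        for _ in range(iterations):
            new = {}
            for a in nodes:
                for b in nodes:
                    if a == b:
                        new[(a, b)] = 1.0
                    elif in_neighbors[a] and in_neighbors[b]:
                        total = sum(sim[(x, y)]
                                    for x in in_neighbors[a]
                                    for y in in_neighbors[b])
                        new[(a, b)] = C * total / (len(in_neighbors[a]) * len(in_neighbors[b]))
                    else:
                        new[(a, b)] = 0.0  # no in-links, no similarity evidence
            sim = new
        return sim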

PageSim [42] is another extension of the co-citation algorithm that assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using:

ConnectedPath(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{length(p)}    (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

U(p) = \sum_{v_x \in path(v_i, v_j),\ v_x \notin \{v_i, v_j\}} UQ(v_x)    (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|}    (2.18)

where w_{x,g} denotes an edge between v_x ∈ path(v_i, v_j) and any other vertex v_g ∈ V, and w_{x,x+1} and w_{x,x-1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
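
A rough Python sketch of equations 2.16-2.18 is given below; it assumes symmetric edge weights stored per undirected edge and enumerates simple paths up to length r by depth-first search (the thesis's actual implementation is not shown and may differ):

    def connected_path(graph, weight, vi, vj, r=3):
        # graph maps a vertex to its set of neighbors; weight maps a
        # frozenset({u, v}) edge to its weight (e.g. the number of messages).
        def uq(path, i):
            # Uniqueness of the intermediate vertex path[i] (eq. 2.18).
            x = path[i]
            local = weight[frozenset((path[i - 1], x))] + weight[frozenset((x, path[i + 1]))]
            total = sum(weight[frozenset((x, g))] for g in graph[x])
            return local / total

        score = 0.0
        stack = [(vi, [vi])]
        while stack:  # depth-first enumeration of simple paths from vi
            node, path = stack.pop()
            if node == vj and len(path) > 1:
                # U(p) summed over intermediate vertices, divided by length(p).
                score += sum(uq(path, i) for i in range(1, len(path) - 1)) / (len(path) - 1)
                continue
            if len(path) - 1 < r:
                stack.extend((n, path + [n]) for n in graph[node] if n not in path)
        return score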


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

Several ways exist in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = αs_i + βs_j + γs_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j, k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm described in section 2.3 to generate normalized pair-wise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2 and that the aggregation methods were rather simple.
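
As a minimal sketch, such a linear combination could be computed as follows (the weights shown are arbitrary illustrative values, not tuned ones):

    def combine_scores(scores, weights):
        # f(x) = alpha*s_i + beta*s_j + gamma*s_k over [0, 1]-normalized scores.
        assert len(scores) == len(weights)
        return sum(w * s for w, s in zip(weights, scores))

    # String metric, link analysis and content scores with hypothetical weights:
    alias_score = combine_scores([0.91, 0.40, 0.65], [0.5, 0.2, 0.3])
    print(alias_score)  # 0.73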

Another approach is to create a feature vector consisting of the scores assigned by the different techniques. A weighted voting mechanism such as a Support


Vector Machine can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of such cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets. One data set was manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, one that can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                   correct alias            false alias
    retrieved      true positives (tp)      false positives (fp)
    not retrieved  false negatives (fn)     true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table such as the one shown in Table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = \frac{correct\ classifications}{total\ number\ of\ classifications} = \frac{tp + tn}{tp + fp + fn + tn}    (2.19)

Although it looks like a good measure of performance, it is not that hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can also be defined as:

P = \frac{|\ retrieved\ aliases \cap correct\ aliases\ |}{|\ retrieved\ aliases\ |} = \frac{tp}{tp + fp}    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as:

R = \frac{|\ retrieved\ aliases \cap correct\ aliases\ |}{|\ total\ correct\ aliases\ |} = \frac{tp}{tp + fn}    (2.21)

These two measures are not as dependent on the class distributions as the accuracy measure; therefore, they are a more sensible choice in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user who wants to automate the complete process and be able to rely greatly on the classification given by the system will favor precision over recall. Since the


preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as:

F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as:

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable because it will be closer to the minimum than the arithmetic mean of precision and recall when the two values differ greatly [46].
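
For reference, equations 2.20-2.23 translate directly into a small Python helper (the zero-division guards are an implementation choice, not part of the definitions):

    def precision_recall_f1(tp, fp, fn):
        # Precision, recall and F1 (eqs. 2.20-2.23) from a contingency table.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
        return precision, recall, f1

    print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)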

Averaging the precision and recall scores of different test runs can be done in two ways. One is micro-averaging, where a contingency table for all the problems together is constructed and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46] and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of effectiveness on the smaller classes.

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that, when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan


distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of overfitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts. Concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


Subject: SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well-known version of the data set was made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer when organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com" and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


    Step    Records affected    Percentage removed (cum.)
    1       17,052              6.70
    3       13,681              12.00
    4       26,223              22.50
    5       4,001               24.00
    6       25,990              34.00
    7       3,700               35.80
    8       52,163              56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained 10 words or fewer were removed, since they contained too little useful information.

5. Authors that had written a total number of words of 100 or fewer were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.
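
By way of illustration, steps 4 and 6 could be expressed with pandas as follows; the column names and the toy records are hypothetical, since the actual SQL schema is not reproduced here:

    import pandas as pd

    emails = pd.DataFrame({
        "sender":    ["a@enron.com", "a@enron.com", "b@enron.com"],
        "receiver":  ["b@enron.com", "b@enron.com", "a@enron.com"],
        "subject":   ["re: deal", "re: deal", "deal"],
        "send_date": ["2000-12-12", "2000-12-12", "2000-12-11"],
        "body":      ["thanks, this looks good to me and the whole team today",
                      "thanks, this looks good to me and the whole team today",
                      "please review the attached gas deal summary and send comments by friday"],
    })

    # Step 4: drop messages containing 10 words or fewer.
    emails = emails[emails["body"].str.split().str.len() > 10]

    # Step 6: messages agreeing on sender, receiver, body, send date and
    # subject are duplicates; keep a single copy.
    emails = emails.drop_duplicates(subset=["sender", "receiver", "body", "send_date", "subject"])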

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent 80 emails or fewer in total were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set, the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44,912 messages by 246 different senders. For each message, the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides


Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels for the authorship SVM (x-axis: number of training instances per class, 20-200; y-axis: 10-fold cross-validation accuracy; series: Linear and RBF kernels).

an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

In addition to these statistical measures, a network graph has been created of all the authors in the final data set; figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it reflects the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contains any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of 200 or more emails were selected from the data set, and their emails were split up over several aliases. To be more precise, messages from and to the original author were randomly assigned to one of a number of aliases for that author. Three different categories of authors have thus been created, namely:


Figure 3.3: The distribution of email messages per author (x-axis: number of emails, 90-230; y-axis: number of authors).

Figure 3.4: The distribution of the total number of words per author (x-axis: total number of words, logarithmic scale; y-axis: number of authors).


Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


    Type of alias                       Number of authors
    High Jaro-Winkler with 1 alias      26
    High Jaro-Winkler with 2 aliases    15
    Low Jaro-Winkler with 1 alias       11
    Low Jaro-Winkler with 2 aliases     1
    No alias                            193

Table 3.2: Artificial aliases in the ENRON data set, by type.

    Test set             Mixed    Hard
    High Jaro-Winkler    6        2
    Low Jaro-Winkler     8        16
    No alias             6        2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. johndoe@enron.comA & johndoe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
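
A small sketch of this thresholding step is shown below; it uses the third-party jellyfish library as one possible Jaro-Winkler implementation (the thesis does not name the implementation it used), and the addresses are fabricated:

    import jellyfish  # one possible Jaro-Winkler implementation

    def jw_alias_candidates(author, candidates, threshold=0.94):
        # Candidate addresses whose Jaro-Winkler similarity to `author`
        # reaches the decision threshold are flagged as potential aliases.
        scores = ((c, jellyfish.jaro_winkler_similarity(author, c)) for c in candidates)
        return [(c, s) for c, s in scores if s >= threshold]

    print(jw_alias_candidates("john.doe@enron.com",
                              ["johndoe@enron.com", "jane.roe@enron.com"]))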

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was normalized as follows:

ConnectedPath_{norm}(v_i, v_j) = \frac{ConnectedPath(v_i, v_j)}{ConnectedPath_{max}}    (3.1)

where ConnectedPath_{max} is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences between individual email messages. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using


    Features    Description

    Lexical
    1           Total number of characters (C)
    2           Total number of alphabetic characters / C
    3           Total number of upper-case characters / C
    4           Total number of digit characters / C
    5           Total number of white-space characters / C
    6           Total number of tab spaces / C
    7-32        Frequency of letters A-Z
    33-53       Frequency of special characters (~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |)
    54          Total number of words (M)
    55          Total number of short words (less than four characters) / M
    56          Total number of characters in words / C
    57          Average word length
    58          Average sentence length (in characters)
    59          Average sentence length (in words)
    60          Total different words / M
    61          Hapax legomena: frequency of once-occurring words
    62          Hapax dislegomena: frequency of twice-occurring words
    63-82       Word length frequency distribution / M
    83-333      TF-IDF of the 250 most frequent 3-grams

    Syntactic
    334-341     Frequency of punctuation marks (, . ? ! : ; ' ")
    342-491     Frequency of function words

    Structural
    492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.


5 × 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.
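
The grid search can be reproduced along the following lines with scikit-learn rather than SVM.NET; X_train and y_train stand for the stylometric feature vectors and author labels and are assumed to exist:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

    param_grid = {
        "C":     2.0 ** np.arange(-5, 17, 2),   # 2^-5, 2^-3, ..., 2^15
        "gamma": 2.0 ** np.arange(-15, 5, 2),   # 2^-15, 2^-13, ..., 2^3
    }
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5)  # 5 x 5-fold
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv)
    search.fit(X_train, y_train)      # assumed feature matrix and labels
    model = search.best_estimator_    # trained with the best (C, gamma)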

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.
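
A condensed sketch of such a one-versus-all setup, including the balanced negative sampling described in the next paragraph, might look as follows (the vectorize function, which would turn emails into the feature vectors of Table 3.4, is assumed to exist):

    import random
    from sklearn.svm import SVC

    def train_one_vs_all(emails_by_author, vectorize):
        # One binary RBF-SVM per author: the author's emails form the positive
        # class; an equally sized random sample of everyone else's, the negative.
        models = {}
        for author, texts in emails_by_author.items():
            others = [t for a, ts in emails_by_author.items() if a != author for t in ts]
            negatives = random.sample(others, min(len(texts), len(others)))
            X = vectorize(texts + negatives)
            y = [1] * len(texts) + [0] * len(negatives)
            models[author] = SVC(kernel="rbf", probability=True).fit(X, y)
        return models

    def most_likely_author(text, models, vectorize):
        # The author whose SVM assigns the text the highest probability.
        x = vectorize([text])
        return max(models, key=lambda a: models[a].predict_proba(x)[0][1])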

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from the other authors. For each author, all the author's emails are selected as positive examples, and an equal number of emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion, by Matthew A. Johnson, of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the


Figure 3.6: The structure of the combined approach.

results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in figure 3.6.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used for testing are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After the two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.
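
The voting step itself reduces to a small binary classifier over three-dimensional score vectors. The following sketch uses made-up training scores purely to show the mechanics; the real training instances come from the manually labeled aliases described above:

    import numpy as np
    from sklearn.svm import SVC

    # Each row holds the three normalized scores for one candidate pair:
    # [jaro_winkler, link_score (Jaccard or Connected Path), svm_probability].
    X_train = np.array([[0.95, 0.60, 0.88], [0.85, 0.70, 0.75],
                        [0.40, 0.55, 0.90], [0.97, 0.10, 0.20],
                        [0.30, 0.05, 0.12], [0.55, 0.10, 0.30],
                        [0.20, 0.15, 0.45], [0.60, 0.02, 0.05]])
    y_train = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = real alias

    voter = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

    def is_alias(jw, link, svm_prob, threshold=0.5):
        # Predict an alias when the voting SVM's probability clears the threshold.
        return voter.predict_proba([[jw, link, svm_prob]])[0][1] >= threshold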


Chapter 4

Results

The results that have been obtained from the various experiments are discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set are given. Second, the results of the individual and combined techniques on the hard test set are given. Finally, an overview of the best results achieved by each individual and combined technique is given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88, using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-


Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89, using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the mixed test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM (x-axis: decision threshold, 0-1; y-axis: precision, recall and F1).

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the mixed test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM (x-axis: decision threshold, 0-1; y-axis: precision, recall and F1).


Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the hard test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM (x-axis: decision threshold, 0-1; y-axis: precision, recall and F1).

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the hard test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM (x-axis: decision threshold, 0-1; y-axis: precision, recall and F1).


Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that using a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases, and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2; it is expected that the same behavior of Connected Path would be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative,


thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results on both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.

47

How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less


sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results of the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who wrote the 15th Book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the way through: Testing for authorship in different frequency strata. Literary and Linguistic Computing, 22(1):27–47.

50

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval, Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), pages 611-es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), page 1613, New York, NY, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



"jhon.doe@domain.com". They are less useful when people deliberately try to hide their identity by using completely different email addresses.

2.1.1 Techniques

In this section the most commonly used string metrics will be discussed.

The Levenshtein distance [52], often referred to as edit distance, is one of the earliest and most used string distances. It is defined as the minimum number of operations required to transform string s into string t. Each operation has a cost of 1, and the allowed operations are insertion, deletion, and substitution of a character. The Levenshtein distance can be transformed into a similarity metric by using

\mathrm{similarity}(s, t) = \frac{1}{\mathrm{Levenshtein}(s, t) + 1} \quad (2.1)
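For illustration, a minimal Python sketch of this distance and the similarity transform of Equation 2.1 (function names are illustrative, not from the thesis):

    def levenshtein(s: str, t: str) -> int:
        # Dynamic-programming edit distance: insertion, deletion and
        # substitution each cost 1.
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, start=1):
            curr = [i]
            for j, ct in enumerate(t, start=1):
                cost = 0 if cs == ct else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def levenshtein_similarity(s: str, t: str) -> float:
        # Similarity transform of Equation 2.1.
        return 1.0 / (levenshtein(s, t) + 1)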

The Jaro similarity [32] algorithm uses the number of transpositions T and the number of matching characters m in order to determine the similarity between two strings. Two characters are matching only if they are no farther apart than half the length of the longest string. The number of transpositions is defined as the number of matching characters that occur in a different sequence order, divided by two. The similarity is then calculated as follows:

\mathrm{Jaro}(s, t) = \frac{1}{3} \left( \frac{m}{|s|} + \frac{m}{|t|} + \frac{m - T}{m} \right) \quad (2.2)

where |s| denotes the length of string s.

The Jaro-Winkler similarity [67] is an extension of the Jaro algorithm that uses the empirical finding by Winkler that fewer errors tend to occur at the start of strings. The similarity is calculated as follows, where p is the length of the prefix that the two strings share:

\mathrm{Jaro\text{-}Winkler}(s, t) = \mathrm{Jaro}(s, t) + \frac{p}{10} \left( 1.0 - \mathrm{Jaro}(s, t) \right) \quad (2.3)
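A compact Python sketch of both measures, following the description above (capping the prefix length p at the customary 4 characters is an assumption of this sketch):

    def jaro(s: str, t: str) -> float:
        if s == t:
            return 1.0
        window = max(len(s), len(t)) // 2 - 1   # matching window
        s_match = [False] * len(s)
        t_match = [False] * len(t)
        m = 0
        for i, c in enumerate(s):
            lo, hi = max(0, i - window), min(len(t), i + window + 1)
            for j in range(lo, hi):
                if not t_match[j] and t[j] == c:
                    s_match[i] = t_match[j] = True
                    m += 1
                    break
        if m == 0:
            return 0.0
        # T: half the number of matching characters that are out of order.
        s_chars = [c for c, f in zip(s, s_match) if f]
        t_chars = [c for c, f in zip(t, t_match) if f]
        T = sum(a != b for a, b in zip(s_chars, t_chars)) / 2
        return (m / len(s) + m / len(t) + (m - T) / m) / 3

    def jaro_winkler(s: str, t: str, max_prefix: int = 4) -> float:
        # Equation 2.3: boost the Jaro score by the shared prefix length p.
        j = jaro(s, t)
        p = 0
        for a, b in zip(s, t):
            if a != b or p == max_prefix:
                break
            p += 1
        return j + (p / 10.0) * (1.0 - j)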

The Soundex algorithm [53] is the most well-known and at the same time the oldest phonetic encoding that is used for string matching. Strings are first converted into phonetic codes, after which strings with similar codes are assumed to be highly similar. In order to convert a string into a Soundex code, the first letter of the string is retained, after which the following letters are converted to numbers according to the set of rules shown in Table 2.1. In the resulting code all zeros are removed, as well as multiple sequential occurrences of the same digit. The code is then cut off or extended with zeros such that it has exactly 3 digits. The first letter of the string together with the 3 digits forms the Soundex code. The Soundex algorithm makes use of the fact that strings that are pronounced in a similar fashion tend to have the same Soundex code. For example, "Maid" and "Made" both result in the Soundex code "M300".

Letter                    Digit

A, E, I, O, U, H, W, Y    0
B, F, P, V                1
C, G, J, K, Q, S, X, Z    2
D, T                      3
L                         4
M, N                      5
R                         6

Table 2.1: The rules for converting letters into digits as they are used in the Soundex algorithm.
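A Python sketch of the encoding procedure described above, using the rules of Table 2.1 (a simple variant; real implementations differ in edge cases):

    SOUNDEX_DIGITS = {**dict.fromkeys("AEIOUHWY", "0"),
                      **dict.fromkeys("BFPV", "1"),
                      **dict.fromkeys("CGJKQSXZ", "2"),
                      **dict.fromkeys("DT", "3"), "L": "4",
                      **dict.fromkeys("MN", "5"), "R": "6"}

    def soundex(name: str) -> str:
        # Keep the first letter, map the rest to digits, collapse runs of
        # the same digit, drop zeros, and pad or cut to exactly 3 digits.
        name = "".join(c for c in name.upper() if c.isalpha())
        if not name:
            return ""
        digits = [SOUNDEX_DIGITS[c] for c in name]
        code = [digits[i] for i in range(1, len(digits))
                if digits[i] != digits[i - 1]]
        code = [d for d in code if d != "0"]
        return name[0] + "".join(code + ["0", "0", "0"])[:3]

    # soundex("Maid") == soundex("Made") == "M300"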

The longest common substring [23] method iteratively finds and removes the longest substring of minimum length l that two strings have in common, until no more substrings can be found. The final similarity can then be calculated by taking the total length of all the common substrings divided by either the maximum, minimum, or average length of the original strings.
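An illustrative Python sketch of this procedure, here normalizing by the average original length (one of the three options mentioned above):

    def longest_common_substring(s: str, t: str) -> str:
        # Classic dynamic program for the single longest common substring.
        best, end = 0, 0
        prev = [0] * (len(t) + 1)
        for i in range(1, len(s) + 1):
            curr = [0] * (len(t) + 1)
            for j in range(1, len(t) + 1):
                if s[i - 1] == t[j - 1]:
                    curr[j] = prev[j - 1] + 1
                    if curr[j] > best:
                        best, end = curr[j], i
            prev = curr
        return s[end - best:end]

    def lcs_similarity(s: str, t: str, min_len: int = 2) -> float:
        # Iteratively remove common substrings of at least min_len
        # characters and relate their total length to the original inputs.
        avg = (len(s) + len(t)) / 2
        if avg == 0:
            return 0.0
        total = 0
        while True:
            sub = longest_common_substring(s, t)
            if len(sub) < min_len:
                break
            total += len(sub)
            s = s.replace(sub, "", 1)
            t = t.replace(sub, "", 1)
        return total / avg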

A slightly different approach by Monge and Elkan [50] uses a string metric, such as any of the ones discussed above, in a recursive matching scheme in order to determine the similarity between strings. Strings s and t are first broken into substrings s = s_1 \ldots s_K and t = t_1 \ldots t_K, after which the similarity is defined as

\mathrm{Monge\text{-}Elkan}(s, t) = \frac{1}{K} \sum_{i=1}^{K} \max_{j=1}^{K} \, sim'(s_i, t_j) \quad (2.4)

where sim'(s_i, t_j) denotes the similarity score between substrings s_i and t_j as assigned by a secondary string metric.
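A short Python sketch of Equation 2.4, assuming whitespace tokenization and reusing the jaro_winkler function sketched earlier as the secondary metric sim':

    def monge_elkan(s: str, t: str, sim=jaro_winkler) -> float:
        # Average, over the tokens of s, of the best-matching token of t.
        s_tok, t_tok = s.split(), t.split()
        if not s_tok or not t_tok:
            return 0.0
        return sum(max(sim(a, b) for b in t_tok) for a in s_tok) / len(s_tok)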

Christen [11] provides an extensive comparison of these and other string metrics on 4 different test sets of given names, surnames, and full names. He found that it is important to know beforehand the structure of the names to be matched and whether they have been parsed and standardized. He also found that Jaro-Winkler similarity performed best in a comparison of 27 different string metrics. Furthermore, he reached the following conclusions: (1) phonetic encodings should not be used, since they are outperformed by all other techniques; (2) Jaro and Jaro-Winkler similarity perform well for given names and surnames if the names are parsed into separate fields; (3) longest common substring is useful when the names might contain swapped words; (4) the Winkler modification can be used with every technique to improve the quality of the matching; (5) the selection of a proper threshold is the biggest problem for most matching techniques; and (6) the fastest techniques are the ones that have a time complexity linear in the length of the strings. Cohen and Fienberg [13] evaluated several string metrics on 13 different test sets, concluding that the Monge-Elkan distance achieved the best performance of all the string metrics. The Jaro-Winkler metric proved to be a fast heuristic scheme, achieving almost the same performance as Monge-Elkan whilst being considerably less complex in nature.


2.2 Authorship Attribution

Other approaches to resolving aliases and disambiguating authors can be found in the field of Authorship Attribution. The authorship attribution task can be described as follows: given a set of candidate authors and a set of documents written by each of these authors, try to determine which of these candidates wrote a given anonymous document. In the traditional authorship attribution problem the number of candidate authors is typically small (2-10), the number of documents per author is large, and the length of these documents is large. Moreover, it is assumed that the author of the anonymous document is actually in the candidate set, i.e., there is a closed candidate set. A good example of a traditional authorship attribution problem is to determine the author of a disputed literary work, such as some of Shakespeare's plays.

Authorship attribution techniques can be very useful in resolving aliases and determining authorship. An authorship attribution system can be trained to distinguish between the different authors in an email data set. For a given author, it is possible to determine if an alias is being used by letting the authorship attribution system predict which author's writing style most closely resembles the given author's writing style.

In the remainder of this section the different techniques that have been employed in authorship attribution problems will be explained, as well as the important design choices that have to be made. These include the choice of a feature set, a feature selection technique, the actual attribution technique, and whether to treat the problem from an instance-based or a profile-based perspective.

2.2.1 Instance vs. profile-based

A general distinction can be made between techniques that treat each email individually (instance-based) and techniques that accumulate all the emails per author (profile-based). The first approach treats each email from a given author as a single training instance and thereby retains differences in texts from the same author. The second approach accumulates all the texts from a given author into one big training file, creating a profile of an author and disregarding differences between individual texts. The choice is mostly philosophical: whether to model the general style of each author or the individual style of each document [63].

2.2.2 Features

An important design choice in authorship attribution systems is the choice of feature set. Features are the specific writing-style attributes, predefined by the researcher, that are extracted from a piece of text in order to capture stylistic information that is characteristic of a particular author. Since the choice of feature set can affect the performance of the authorship attribution in various ways, it is important to consider which features to include or exclude. In general, a distinction can be made between lexical, syntactic, structural, semantic, and content-specific features. These features will be discussed in that order in the following sections.

Lexical features

Lexical features are derived at the character and word level of the text and are the most commonly used features. These features are considered language-independent, since they do not need any prior language-dependent processing before they can be applied to a text. Character frequencies, word-length distributions, the frequency of digits and non-alphanumeric characters, and the total number of words are all examples of lexical features that provide useful information.

An easy-to-use lexical feature that is also computationally simple is character n-grams. For example, the character 4-grams that can be extracted from the phrase "The dog" are "The ", "he d", "e do", and " dog". Character n-grams can capture various writing-style markers from a text, such as capitalization or UK/US variants of certain words. Even if a word is incorrectly spelled, most of the n-grams extracted from the correct and incorrect variants will match, although a misspelling can also be considered a style marker for a particular author. An advantage of character n-grams is that they do not need tokenization before they can be applied to a text, which is very useful in Asian or Arabic languages where tokenization is difficult.
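Extracting character n-grams takes only a few lines; a sketch with illustrative names:

    from collections import Counter

    def char_ngrams(text: str, n: int = 4) -> Counter:
        # Slide a window of n characters over the raw text; no tokenization
        # is needed, which keeps the feature language-independent.
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    # char_ngrams("The dog", 4) yields "The ", "he d", "e do" and " dog".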

More complicated lexical features require the detection of word and sentence boundaries in the text. By using common Natural Language Processing (NLP) tools, a text can be broken up into its constituent parts during a process called tokenization. A token can be a single word, phrase, symbol, or other meaningful element. After counting the occurrences of each distinct token, the n most frequently occurring tokens can be used as features, since the tokens that occur most frequently are considered to contain the most useful discriminatory information.

Another set of features that can be derived from tokenization is the frequency of different word lengths. These features provide information on how often a particular author uses words of different lengths. Vocabulary richness measures are a subset of lexical features that are derived from word frequencies. Hapax legomena is the number of words that occur once in a text, whereas hapax dislegomena is the number of words that occur twice. The number of hapax legomena and hapax dislegomena gives an indication of how rich the vocabulary of a certain author is. Authors that have a larger vocabulary will have a higher count of once- or twice-occurring words than authors with a small vocabulary. The type-token ratio V/N is the number of unique tokens V divided by the total number of tokens in a text N, and gives another indication of vocabulary richness. Numerous vocabulary richness measures have been created based on word frequencies, of which the most well known are:

• Yule's K [69]:

  K = 10^4 \cdot \left[ -\frac{1}{N} + \sum_{i} V(i, N) \left( \frac{i}{N} \right)^2 \right] \quad (2.5)

  where V(i, N) is the number of words occurring i times in the text of length N.

• Sichel's S [59]:

  S = \frac{V(2, N)}{V(N)} \quad (2.6)

  where V(N) is the vocabulary size and V(2, N) the number of twice-occurring words.

• Brunet's W [7]:

  W = N^{V(N)^{-a}} \quad (2.7)

  where N is the number of words and a is usually set to 0.172.

• Honoré's R [26]:

  R = 100 \cdot \frac{\log N}{1 - V(1, N)/V(N)} \quad (2.8)

Furthermore, smileys [64], abbreviations [62], slang words [36], and even spelling errors can be used as stylometric features to distinguish between authors. For example, Koppel [37] tries to simulate the way a human expert discriminates between authors by using stylistic idiosyncrasies, such as misspellings, to fingerprint a particular author. A disadvantage of that approach is that discriminating between authors is hard when there are few idiosyncrasies present in the text.

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing the syntactic constructions that an author uses. The underlying idea is that every author unconsciously uses more or less the same syntactical patterns in each text. Syntactic features rely on the accuracy and availability of NLP tools in different languages in order to tokenize and apply part-of-speech (POS) tags.

A common method is to analyze the short sequences of POS tags that occur most frequently throughout an author's work. For example, the sentence "The man walked in the park" can be represented as

| DETERMINER | NOUN | VERB PAST TENSE | PREPOSITION | DETERMINER | NOUN |


where token starts and ends are delimited by a |. The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS tags can be found in Solorio and Pillay [62], who use POS-tag uni-grams, bi-grams, and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactical parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64] or the distance covered by the dependency links in the parse tree.

Function words were introduced by Mosteller and Wallace [51] to distinguish between different authors. Function words are words that have almost no lexical meaning but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change over short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong", the function words are "The", "of", and "the". They signify the grammatical structure of the sentence without containing any meaning; they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features.
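Turning a function-word list into a feature vector is straightforward; a sketch (the toy word list stands in for the full list in the Appendix):

    from collections import Counter

    FUNCTION_WORDS = ["the", "of", "and", "a", "in", "to", "is", "that"]

    def function_word_vector(tokens: list) -> list:
        # Relative frequencies, normalized by text length so that long
        # and short texts remain comparable.
        n = len(tokens) or 1
        counts = Counter(w.lower() for w in tokens)
        return [counts[w] / n for w in FUNCTION_WORDS]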

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structure. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al. [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks. However, they experienced difficulties with extracting these features. Another example of content-specific structural features is the use of HTML tags by de Vel et al. [17]. They found that some email programs used HTML formatting for their emails, and included the frequency of different HTML tags in their feature set.

Semantic features

The most complex set of features are the semantic features. These features require sophisticated NLP techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al. [64], who use UK/US spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives, and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity types, such as date, location, money, number, ordinal, organization, percent, person, and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word or another type of linguistic construct, such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word. Function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural, and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features. In such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection also has to deal with an additional problem, in the sense that the final feature set might over-fit the training data. Therefore, the value of feature selection methods is debatable, and they can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

\mathrm{Entropy} = -\sum_{x \in X} P(x) \log P(x) \quad (2.9)

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.
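A sketch of information gain for a single binary feature, built directly on Equation 2.9 (names are illustrative):

    import math
    from collections import Counter

    def entropy(labels: list) -> float:
        # Equation 2.9 over the empirical class distribution.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def information_gain(feature: list, labels: list) -> float:
        # Entropy before the split minus the weighted entropy of the
        # subsets induced by the presence/absence of the feature.
        n = len(labels)
        present = [y for x, y in zip(feature, labels) if x]
        absent = [y for x, y in zip(feature, labels) if not x]
        remainder = sum(len(part) / n * entropy(part)
                        for part in (present, absent) if part)
        return entropy(labels) - remainder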

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "the metric measures the horizontal separation between two standard Normal curves, where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the feature set whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data. Hence, they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set he still manages to assert with confidence that "The Royal Book of Oz" was written by Ruth Plumly Thompson. Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction that is made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations \vec{V}(s) and \vec{V}(t), the cosine similarity is defined as

\mathrm{Cosine}(s, t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)| \, |\vec{V}(t)|} \quad (2.10)

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10,000. In later research, Koppel et al. [40] report that 46% of 1,000 blog extracts are classified correctly using only the cosine similarity.
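A sketch of Equation 2.10 on raw term-frequency vectors (Koppel et al. use tf-idf weighting, which would replace the plain counts below):

    import math
    from collections import Counter

    def cosine_similarity(doc_s: list, doc_t: list) -> float:
        vs, vt = Counter(doc_s), Counter(doc_t)
        dot = sum(vs[w] * vt[w] for w in vs.keys() & vt.keys())
        norm = (math.sqrt(sum(c * c for c in vs.values()))
                * math.sqrt(sum(c * c for c in vt.values())))
        return dot / norm if norm else 0.0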

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural, and domain-specific features from the texts of candidate authors and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1 = [0, 0.25], A2 = [0.25, 0.50], A3 = [0.50, 0.75], and A4 = [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint, to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.
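The discretization step can be illustrated with a small Python sketch that reproduces the interval encoding of the example above:

    def discretize(vector: list, intervals: int = 4) -> tuple:
        # Map each normalized feature in [0, 1] to a block of interval
        # indicators, e.g. 0.6 with 4 intervals becomes (0, 0, 1, 0).
        encoded = []
        for value in vector:
            idx = min(int(value * intervals), intervals - 1)
            encoded.extend(1 if i == idx else 0 for i in range(intervals))
        return tuple(encoded)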

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in Figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

Figure 2.1: The structure of a supervised authorship attribution system.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered as the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe, and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author.


His most important finding was that word-length distributions tend to remain the same across different works of a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejected the claim of some literary scholars that Shakespeare and Bacon were the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers, a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller and Wallace applied a Naïve Bayes probabilistic model to the frequency of these function words and found that all documents were written by Madison. The Naïve Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, \ldots, x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the original document can be expressed by

P(A_i \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid A_i) \, P(A_i) \quad (2.11)

The real author is then calculated using

A^{*} = \arg\max_{A_i \in A} P(A_i \mid x_1, \ldots, x_n) \quad (2.12)
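A minimal sketch of such a model, assuming a multinomial distribution over a fixed function-word vocabulary with Laplace smoothing and a uniform prior P(A_i) (all assumptions of this illustration):

    import math
    from collections import Counter

    def train_naive_bayes(docs_by_author: dict, vocab: set) -> dict:
        model = {}
        for author, docs in docs_by_author.items():
            counts = Counter(w for doc in docs for w in doc if w in vocab)
            total = sum(counts.values())
            model[author] = {w: (counts[w] + 1) / (total + len(vocab))
                             for w in vocab}   # Laplace smoothing
        return model

    def attribute(model: dict, doc: list, vocab: set) -> str:
        # Equation 2.12 with a uniform prior: maximize the sum of log P(x|A).
        return max(model, key=lambda a: sum(math.log(model[a][w])
                                            for w in doc if w in vocab))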

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words from a text against a reference corpus (a large contemporary corpus over which the mean and standard deviation of these 30 words are computed) and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1,500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
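A sketch of the Delta computation, where corpus_stats maps each of the chosen words to its (mean, standard deviation) relative frequency in the reference corpus (data structures are illustrative):

    def delta(known: dict, unknown: dict, corpus_stats: dict) -> float:
        # Mean absolute difference of z-scores over the chosen word set.
        diffs = []
        for word, (mu, sigma) in corpus_stats.items():
            z_known = (known.get(word, 0.0) - mu) / sigma
            z_unknown = (unknown.get(word, 0.0) - mu) / sigma
            diffs.append(abs(z_known - z_unknown))
        return sum(diffs) / len(diffs)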

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e., the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in Section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in Figure 2.2. By testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor), and the C4.5 decision tree classifier have been tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the largest possible margin. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in a very fast classification. An example of how SVM maximizes the margin of the hyperplane that separates two linearly separable classes can be seen in Figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and separates the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations on binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-one using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71], and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.
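In practice such a classifier is a few lines with an off-the-shelf library; a sketch assuming scikit-learn and pre-computed stylometric feature matrices X_train/X_test with author labels y_train (all assumed to exist):

    from sklearn.svm import SVC

    # RBF kernel with one-vs-rest decomposition for the multi-class case;
    # C and gamma would be tuned by cross-validation in practice.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale",
              decision_function_shape="ovr")
    clf.fit(X_train, y_train)
    predicted_authors = clf.predict(X_test)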

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes, and 1 output node can be seen in Figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training and validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e., to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in Section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.
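A rough sketch of one unmasking run, assuming scikit-learn and a feature matrix X with binary labels y separating the two text sets (illustrative names); a steep drop in the returned curve suggests a single author:

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    def unmasking_curve(X, y, iterations: int = 10, k: int = 3) -> list:
        X = np.asarray(X, dtype=float)
        active = np.arange(X.shape[1])      # indices of remaining features
        curve = []
        for _ in range(iterations):
            svm = LinearSVC().fit(X[:, active], y)
            curve.append(cross_val_score(svm, X[:, active], y, cv=5).mean())
            w = svm.coef_[0]
            # Drop the k strongest positive and k strongest negative weights.
            drop = np.concatenate([np.argsort(w)[-k:], np.argsort(w)[:k]])
            active = np.delete(active, drop)
        return curve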

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j \in V; then there is an edge e_{v_i v_j} \in W if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i v_j} \in W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.
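Building such a network from a set of (sender, recipient) pairs is a one-pass operation; a sketch with illustrative names:

    from collections import defaultdict

    def build_link_network(messages: list) -> dict:
        # messages: iterable of (sender, recipient) address pairs.
        # Returns a mapping from each vertex to its neighborhood N(v).
        neighbors = defaultdict(set)
        for sender, recipient in messages:
            neighbors[sender].add(recipient)
            neighbors[recipient].add(sender)
        return neighbors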

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

\mathrm{Co\text{-}citation}(v_i, v_j) = |N(v_i) \cap N(v_j)| \quad (2.13)

In Graph Theory this is often known as the shared-neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases, because they are connected by three different Connected Triples (red, yellow, and blue).


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

\mathrm{Jaccard}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|} \quad (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random in predicting where new links will form. However, it sometimes performs on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
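Both neighborhood measures reduce to set operations on the network built earlier; a sketch:

    def co_citation(neighbors: dict, vi: str, vj: str) -> int:
        # Shared-neighbor frequency (Equation 2.13).
        return len(neighbors[vi] & neighbors[vj])

    def jaccard(neighbors: dict, vi: str, vj: str) -> float:
        # Equation 2.14: neighborhood overlap relative to the union.
        union = neighbors[vi] | neighbors[vj]
        inter = neighbors[vi] & neighbors[vj]
        return len(inter) / len(union) if union else 0.0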

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors and are denoted by |I(v)| and |O(v)|, respectively. An individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

\mathrm{SimRank}(v_i, v_j) = \frac{C}{|I(v_i)| \, |I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} \mathrm{SimRank}(I_x(v_i), I_y(v_j)) \quad (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
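A naive fixed-point iteration of Equation 2.15 over all vertex pairs (quadratic in the number of vertices, so purely illustrative):

    def simrank(in_neighbors: dict, C: float = 0.8, iters: int = 10) -> dict:
        nodes = list(in_neighbors)
        sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
        for _ in range(iters):
            new = {}
            for a in nodes:
                for b in nodes:
                    Ia, Ib = in_neighbors[a], in_neighbors[b]
                    if a == b:
                        new[(a, b)] = 1.0
                    elif Ia and Ib:
                        total = sum(sim[(x, y)] for x in Ia for y in Ib)
                        new[(a, b)] = C * total / (len(Ia) * len(Ib))
                    else:
                        new[(a, b)] = 0.0
            sim = new
        return sim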

PageSim [42] is another extension of the co-citation algorithm that assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

\mathrm{ConnectedPath}(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{\mathrm{length}(p)} \quad (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length r. U(p) is the uniqueness of a particular path p \in PATH, which is calculated as follows:

U(p) = \sum_{v_x \in path(v_i, v_j), \, v_x \notin \{v_i, v_j\}} UQ(v_x) \quad (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|} \quad (2.18)

where w_{x,g} denotes an edge between v_x \in path(v_i, v_j) and any other vertex v_g \in V, and w_{x,x+1} and w_{x,x-1} denote edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim, and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
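A sketch of the algorithm on an unweighted network, where every edge weight |w| is 1, so that Equation 2.18 reduces to 2 divided by the degree of the intermediate vertex (an assumption of this illustration):

    def connected_path(neighbors: dict, vi: str, vj: str, r: int = 3) -> float:
        def uq(vx):
            # Equation 2.18 with unit edge weights.
            return 2.0 / len(neighbors[vx]) if neighbors[vx] else 0.0

        def paths(current, visited, length):
            # Enumerate simple paths from vi to vj of at most r edges.
            if length > r:
                return
            if current == vj and length > 0:
                yield list(visited)
                return
            for nxt in neighbors[current]:
                if nxt not in visited:
                    yield from paths(nxt, visited + [nxt], length + 1)

        score = 0.0
        for p in paths(vi, [vi], 0):
            u = sum(uq(v) for v in p[1:-1])   # Equation 2.17
            score += u / (len(p) - 1)         # Equation 2.16
        return score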


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = \alpha s_i + \beta s_j + \gamma s_k, where s_i, s_j, and s_k denote the scores assigned by techniques i, j, and k respectively, each normalized such that they fall in the range [0, 1]. The weights \alpha, \beta, \gamma determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in Section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.
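Such a weighted combination is trivial to express in code; a sketch with made-up technique names and weights:

    def combine_scores(scores: dict, weights: dict) -> float:
        # f(x) = alpha*s_i + beta*s_j + gamma*s_k, scores already in [0, 1].
        return sum(weights[name] * s for name, s in scores.items())

    # combine_scores({"jaro_winkler": 0.9, "svm": 0.4, "jaccard": 0.7},
    #                {"jaro_winkler": 0.5, "svm": 0.3, "jaccard": 0.2})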

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism such as a Support Vector Machine can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct for 88.2% of such cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets. One data set was manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, one that can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias          false alias
retrieved        true positives (tp)    false positives (fp)
not retrieved    false negatives (fn)   true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution it is common to construct a contingency table such as the one shown in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = correct classifications / total number of classifications = (tp + tn) / (tp + fp + fn + tn)    (2.19)

Although it looks like a good measure of performance, it is not that hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can also be defined as

P = |retrieved aliases ∩ correct aliases| / |retrieved aliases| = tp / (tp + fp)    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

R = |retrieved aliases ∩ correct aliases| / |total correct aliases| = tp / (tp + fn)    (2.21)

These two measures are not as dependent on the class distributions as the accuracy measure. Therefore, they are a more sensible choice to use in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user that wants to automate the complete process, and to rely greatly on the classification given by the system, will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = 1 / (α · (1/P) + (1 − α) · (1/R))    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as

F1 = 2 · precision · recall / (precision + recall)    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].
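The following small sketch (with made-up contingency counts) computes these measures exactly as defined in equations 2.20-2.23:

def precision(tp, fp):
    return tp / (tp + fp)   # eq. 2.20

def recall(tp, fn):
    return tp / (tp + fn)   # eq. 2.21

def f_measure(p, r, alpha=0.5):
    # Weighted harmonic mean of eq. 2.22; alpha = 0.5 yields the F1-measure.
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

tp, fp, fn = 14, 6, 4       # invented counts
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f_measure(p, r))  # the last value equals 2*p*r / (p + r)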

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46], and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of effectiveness on the smaller classes.
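As a small illustration of the difference between the two averaging schemes (per-problem counts invented):

# (tp, fp) counts for three problems of very different sizes:
tables = [(30, 10), (5, 5), (2, 8)]
micro = sum(tp for tp, _ in tables) / sum(tp + fp for tp, fp in tables)
macro = sum(tp / (tp + fp) for tp, fp in tables) / len(tables)
print(micro, macro)  # micro-averaging favors the large problem; macro weighs all equally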

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques, and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that, when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the ones most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of over-fitting, and perform automated feature selection. Therefore, SVM has been chosen as a classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. This chapter will start with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented will be discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large, real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of the email messages it can be assumed that the sender of the email has written it, except for the forward and reply-parts. Concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set, and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages, and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no_address@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step    Records affected    Percentage removed (cum.)
1       17,052              6.70
3       13,681              12.00
4       26,223              22.50
5       4,001               24.00
6       25,990              34.00
7       3,700               35.80
8       52,163              56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply-parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply-parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consisted of 44,912 emails by 246 different senders; for each message, the sender, receiver, subject, body and send-date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors.

[Figure: 10-fold cross-validation accuracy plotted against the number of training instances per class (20-200), for linear and RBF kernels.]

Figure 3.2: Averages of 10 times 10-fold cross-validation using different training set sizes and kernels for the authorship SVM.

Figure 3.4 provides an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors have written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set. Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it reflects the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set and their emails were split up over several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. johndoe@enron.comA & johndoe@enron.comB);

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah);

• Authors without an alias.

[Figure: histogram of the number of authors against the number of emails sent.]

Figure 3.3: The distribution of email messages per author.

[Figure: histogram of the number of authors against the total number of words written (logarithmic x-axis).]

Figure 3.4: The distribution of the total number of words per author.

[Figure: network graph of the senders in the final data set; node labels are abbreviated email addresses and artificial alias names, and node color encodes degree.]

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.

Type of Alias                       Number of authors
High Jaro-Winkler with 1 alias      26
High Jaro-Winkler with 2 aliases    15
Low Jaro-Winkler with 1 alias       11
Low Jaro-Winkler with 2 aliases     1
No Alias                            193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed    Hard
High Jaro-Winkler    6        2
Low Jaro-Winkler     8        16
No alias             6        2

Table 3.3: Distribution of alias types in the two different test sets.

The distribution of authors and aliases in the final data set can be seen in more detail in table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of the aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
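A sketch of this thresholding step is given below. It relies on the third-party jellyfish library for the Jaro-Winkler computation, which is an assumption made here for brevity (it is not the implementation used in this thesis), and the addresses and the threshold are examples only:

import itertools
import jellyfish  # assumed dependency; any Jaro-Winkler implementation works

def jw_alias_pairs(addresses, threshold=0.94):
    # Return all address pairs whose Jaro-Winkler similarity reaches the threshold.
    pairs = []
    for a, b in itertools.combinations(addresses, 2):
        score = jellyfish.jaro_winkler_similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs

print(jw_alias_pairs(["john.doe@enron.com", "johndoe@enron.com", "jane.roe@enron.com"]))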

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was calculated as follows:

ConnectedPath_norm(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
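A minimal sketch of this neighbor-set comparison (the data is a toy example, not the thesis implementation):

def jaccard(neighbors_a, neighbors_b):
    # |A ∩ B| / |A ∪ B| for two sets of direct neighbors; 0 if both are empty.
    union = neighbors_a | neighbors_b
    return len(neighbors_a & neighbors_b) / len(union) if union else 0.0

a = {"kim.ward@enron.com", "greg.piper@enron.com", "mark.palmer@enron.com"}
b = {"kim.ward@enron.com", "greg.piper@enron.com", "jan.smith@enron.com"}
print(jaccard(a, b))  # 2 shared neighbors out of 4 distinct ones -> 0.5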

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features, to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in table 3.4. The list of function words that has been used in the feature set can be found in the appendix.
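To give a flavor of the feature set of table 3.4 (shown below), the following sketch computes a handful of the lexical features (numbers 1, 54, 57, 61 and 62); the actual feature vector has 492 entries, and the details here are illustrative assumptions:

import re
from collections import Counter

def some_lexical_features(text):
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    return {
        "total_characters": len(text),                                   # feature 1
        "total_words": len(words),                                       # feature 54
        "avg_word_length": sum(map(len, words)) / max(len(words), 1),    # feature 57
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),     # feature 61
        "hapax_dislegomena": sum(1 for c in counts.values() if c == 2),  # feature 62
    }

print(some_lexical_features("Thank you very much. We will give it a try."))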

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM, and the highest-scoring combination of parameters is then chosen to train the actual SVM model.
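A hedged sketch of this grid search, written with scikit-learn as a stand-in for the SVM.NET software actually used, and with synthetic data in place of the 492-dimensional feature matrix:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=20, random_state=0)  # stand-in data

param_grid = {
    "C": [2.0 ** e for e in range(-5, 16, 2)],      # 2^-5, 2^-3, ..., 2^15
    "gamma": [2.0 ** e for e in range(-15, 4, 2)],  # 2^-15, 2^-13, ..., 2^3
}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)  # 5 x 5-fold CV
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)  # the combination used to train the final model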


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters ~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |
54          Total number of words (M)
55          Total number of short words / M (less than four characters)
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation , . ? ! : ; ' "
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.



The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes, such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19], and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3), and authorship SVM on email content;

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network, and authorship SVM on email content.

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction whether that author is an alias or not. An overview of the combined approach can be found in figure 3.6.


Figure 3.6: The structure of the combined approach.


In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from table 3.3 are used to determine the precision and recall for various decision thresholds.
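The following sketch shows the voting step, again with scikit-learn as a stand-in for SVM.NET; the three-score feature vectors and their labels are invented for illustration:

import numpy as np
from sklearn.svm import SVC

# One row per candidate pair: [Jaro-Winkler, link-network, authorship-SVM scores].
X_train = np.array([
    [0.96, 0.55, 0.81], [0.88, 0.60, 0.74], [0.93, 0.20, 0.66],
    [0.35, 0.45, 0.71], [0.97, 0.50, 0.30],                       # labeled aliases
    [0.41, 0.05, 0.22], [0.52, 0.10, 0.35], [0.48, 0.02, 0.15],
    [0.60, 0.08, 0.28], [0.44, 0.12, 0.19],                       # labeled non-aliases
])
y_train = np.array([1] * 5 + [0] * 5)

voter = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
alias_prob = voter.predict_proba([[0.90, 0.48, 0.70]])[0, 1]
print(alias_prob >= 0.78)  # compare against a decision threshold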


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88, using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89, using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Figure: precision, recall and F1 plotted against decision thresholds from 0 to 1 for (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard and (d) authorship SVM.]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set.

[Figure: precision, recall and F1 plotted against decision thresholds from 0 to 1 for (a) JW-CP-SVM and (b) JW-Jaccard-SVM.]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set.


[Figure: precision, recall and F1 plotted against decision thresholds from 0 to 1 for (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard and (d) authorship SVM.]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set.

[Figure: precision, recall and F1 plotted against decision thresholds from 0 to 1 for (a) JW-CP-SVM and (b) JW-Jaccard-SVM.]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set.


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names, derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that using a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases, and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If there was no Connected Path score returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase compared with the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.

47

How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content, and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it would be worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron/.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honore, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. and Widom, J. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–246.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.


Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


Page 11: Thesis Freek Maes - Final Version

Letter                      Digit
A, E, I, O, U, H, W, Y      0
B, F, P, V                  1
C, G, J, K, Q, S, X, Z      2
D, T                        3
L                           4
M, N                        5
R                           6

Table 2.1: The rules for converting letters into digits as they are used in the Soundex algorithm.

no more substrings can be found. The final similarity can then be calculated by taking the length of all the common substrings, divided by either the maximum, minimum or average length of the original strings.

A slightly different approach by Monge and Elkan [50] uses a string metric, such as any of the ones discussed above, in a recursive matching scheme in order to determine the similarity between strings. Strings s and t are first broken into substrings s = s_1 ... s_K and t = t_1 ... t_K, after which the similarity is defined as

$$\mathrm{MongeElkan}(s,t) = \frac{1}{K}\sum_{i=1}^{K}\max_{j=1}^{K}\ \mathrm{sim}'(s_i, t_j) \qquad (2.4)$$

where sim'(s_i, t_j) denotes the similarity score between substrings s_i and t_j, as assigned by a secondary string metric.
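Expressed in code, the scheme looks roughly as follows. This is a minimal sketch in Python, using the standard library's difflib ratio as a stand-in for the secondary metric sim'; any string similarity in [0, 1], such as Jaro-Winkler, could be substituted.

from difflib import SequenceMatcher

def sim_prime(a: str, b: str) -> float:
    # Secondary metric: any string similarity in [0, 1] works here.
    return SequenceMatcher(None, a, b).ratio()

def monge_elkan(s: str, t: str) -> float:
    s_subs, t_subs = s.split(), t.split()   # break strings into substrings
    # For every substring of s, keep its best match among substrings of t.
    total = sum(max(sim_prime(si, tj) for tj in t_subs) for si in s_subs)
    return total / len(s_subs)

print(monge_elkan("freek p maes", "maes freek"))  # order-insensitive match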

Christen [11] provides an extensive comparison of these and other string metrics on 4 different test sets of given-, sur- and full names. He found that it is important to know beforehand the structure of the names to be matched, and whether they have been parsed and standardized. He also found that Jaro-Winkler similarity performed best in a comparison of 27 different string metrics. Furthermore, he reached the following conclusions: (1) phonetic encodings should not be used, since they are outperformed by all other techniques; (2) Jaro and Jaro-Winkler similarity perform well for given- and surnames if the names are parsed into separate fields; (3) longest common substring is useful when the names might contain swapped words; (4) the Winkler modification can be used with every technique to improve the quality of the matching; (5) the selection of a proper threshold is the biggest problem for most matching techniques; and (6) the fastest techniques are the ones that have a time complexity linear in the length of the strings. Cohen and Fienberg [13] evaluated several string metrics on 13 different test sets, concluding that the Monge-Elkan distance achieved the best performance of all the string metrics. The Jaro-Winkler metric proved to be a fast heuristic scheme, achieving almost the same performance as Monge-Elkan whilst being considerably less complex in nature.

2.2 Authorship Attribution

Other approaches to resolving aliases and disambiguating authors can be found in the field of Authorship Attribution. The authorship attribution task can be described as follows: given a set of candidate authors and a set of documents written by each of these authors, try to determine which of these candidates wrote a given anonymous document. In the traditional authorship attribution problem the number of candidate authors is typically small (2-10), the number of documents per author is large, and the length of these documents is large. Moreover, it is assumed that the author of the anonymous document is actually in the candidate set, i.e. there is a closed candidate set. A good example of a traditional authorship attribution problem is to determine the author of a disputed literary work, such as some of Shakespeare's plays.

Authorship attribution techniques can be very useful in resolving aliases and determining authorship. An authorship attribution system can be trained to distinguish between different authors in an email data set. For a given author, it is possible to determine if an alias is being used by letting the authorship attribution system predict which author's writing style most closely resembles the given author's writing style.

In the remainder of this section, the different techniques that have been employed in authorship attribution problems will be explained, as well as important design choices that have to be made. These include the choice of a feature set, a feature selection technique, the actual attribution technique, and whether to treat the problem from an instance-based or a profile-based perspective.

2.2.1 Instance- vs. profile-based

A general distinction can be made between techniques that treat each email individually (instance-based) and techniques that accumulate all the emails per author (profile-based). The first approach treats each email from a given author as a single training instance, and thereby retains differences in texts from the same author. The second approach accumulates all the texts from a given author into one big training file, creating a profile of one author and disregarding differences between individual texts. The choice is mostly philosophical: whether to model the general style of each author or the individual style of each document [63].

2.2.2 Features

An important design choice in authorship attribution systems is the choice of feature set. Features are the specific writing-style attributes, predefined by the researcher, that are extracted from a piece of text in order to capture stylistic information that is characteristic for a particular author. Since the choice of feature set can affect the performance of the authorship attribution in various ways, it is important to consider which features to include or exclude. In general, a distinction can be made between lexical, syntactic, structural, semantic and content-specific features. These features will be discussed in that order in the following sections.

Lexical features

Lexical features are the features that are derived at the character and word level of the text, and are the most commonly used features. These features are considered language-independent, since they do not need any prior language-dependent processing before they can be applied to a text. Character frequencies, word-length distributions, the frequency of digits and non-alphanumeric characters, and the total number of words are all examples of lexical features that provide useful information.

An easy-to-use lexical feature that is also computationally simple is character n-grams. For example, the character 4-grams that can be extracted from the phrase "The dog" are "the ", "he d", "e do" and " dog". Character n-grams can capture various writing style markers from a text, such as capitalization or UK/US-variants of certain words. Even if a word is incorrectly spelled, most of the n-grams extracted from the correct and incorrect variant will match, although a misspelling can also be considered a style marker for a particular author. An advantage of character n-grams is that they do not need tokenization before they can be applied to a text, which is very useful in Asian or Arabic languages where tokenization is difficult.
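A minimal sketch of character n-gram extraction, assuming simple lowercasing and no further normalization:

from collections import Counter

def char_ngrams(text: str, n: int = 4) -> Counter:
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

print(char_ngrams("The dog"))  # counts "the ", "he d", "e do" and " dog" once each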

More complicated lexical features require the detection of word and sentence boundaries in the text. By using common Natural Language Processing (NLP) tools, a text can be broken up into its constituent parts during a process called tokenization. A token can be a single word, phrase, symbol or other meaningful element. After counting the occurrences of each distinct token, the n most frequently occurring tokens can be used as features, since the tokens that occur most frequently are considered to contain the most useful discriminatory information.

Another set of features that can be derived from tokenization is the frequency of different word lengths. These features provide information on how often a particular author uses words of different lengths. Vocabulary richness measures are a subset of lexical features that are derived from these word frequencies. Hapax legomena is the number of words that occur once in a text, whereas hapax dislegomena is the number of words that occur twice. The number of hapax legomena and hapax dislegomena gives an indication of how rich the vocabulary of a certain author is: authors that have a larger vocabulary will have a higher count of once- or twice-occurring words than authors with a small vocabulary. The type-token ratio V/N is the number of unique tokens V divided by the total number of tokens in a text N, and gives another indication of vocabulary richness. Numerous vocabulary richness measures have been created based on word frequencies, of which the most well-known are listed below (a computational sketch follows the list):

• Yule's K [69]:

$$K = 10^4 \cdot \left[ -\frac{1}{N} + \sum_{i} V(i,N) \left( \frac{i}{N} \right)^2 \right] \qquad (2.5)$$

where V(i,N) is the number of words occurring i times in the text of length N.

• Sichel's S [59]:

$$S = \frac{V(2,N)}{V(N)} \qquad (2.6)$$

where V(N) is the vocabulary size and V(2,N) the number of twice-occurring words.

• Brunet's W [7]:

$$W = N^{V(N)^{-a}} \qquad (2.7)$$

where N is the number of words and a is usually set to 0.172.

• Honoré's R [26]:

$$R = \frac{100 \cdot \log N}{1 - \frac{V(1,N)}{V(N)}} \qquad (2.8)$$

where V(1,N) is the number of once-occurring words.
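The following sketch computes the four measures above from a plain list of tokens; the tokenization is assumed to have been done beforehand.

import math
from collections import Counter

def richness(words):
    N = len(words)                      # total number of tokens
    freq = Counter(words)               # word -> number of occurrences
    V = len(freq)                       # vocabulary size V(N)
    spectrum = Counter(freq.values())   # i -> V(i, N), words occurring i times
    V1, V2 = spectrum.get(1, 0), spectrum.get(2, 0)
    K = 1e4 * (-1.0 / N + sum(V_i * (i / N) ** 2 for i, V_i in spectrum.items()))
    S = V2 / V                          # Sichel's S
    W = N ** (V ** -0.172)              # Brunet's W, a = 0.172
    R = 100 * math.log(N) / (1 - V1 / V) if V1 < V else float("inf")
    return K, S, W, R

print(richness("the cat sat on the mat with the other cat".split()))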

Furthermore, smileys [64], abbreviations [62], slang words [36] and even spelling errors can be used as stylometric features to distinguish between authors. For example, Koppel [37] tries to simulate the way a human expert discriminates between authors, by using stylistic idiosyncrasies such as misspellings to fingerprint a particular author. A disadvantage of that approach is that discriminating between authors is hard when there are few idiosyncrasies present in the text.

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing the syntactic constructions that an author uses. The underlying idea is that every author unconsciously uses more or less the same syntactical patterns in each text. Syntactic features rely on the accuracy and availability of NLP-tools in different languages in order to tokenize and apply part-of-speech (POS) tags.

A common method is to analyze the short sequences of POS-tags that occur most frequently throughout an author's work. For example, the sentence "The man walked in the park" can be represented as

| DETERMINER | NOUN | VERB PAST TENSE | PREPOSITION | DETERMINER | NOUN |

where token starts and ends are delimited by a |. The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS-tags can be found in Solorio and Pillay [62], who use POS-tag uni-grams, bi-grams and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactical parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64], or the distance covered by the dependency links in the parse tree.

Function words were introduced by Mosteller and Wallace [51] to distinguish between different authors. Function words are words that have almost no lexical meaning, but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change over short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong", the function words are "The", "of" and "the". They signify the grammatical structure of the sentence without containing any meaning; they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features, as the sketch below illustrates.
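A minimal sketch of such a feature vector, assuming FUNCTION_WORDS holds (an excerpt of) the list from the appendix and that simple whitespace tokenization suffices:

from collections import Counter

FUNCTION_WORDS = ["a", "about", "above", "after", "all", "although", "am",
                  "among", "an", "and", "the", "of", "to", "we", "you"]  # excerpt

def function_word_vector(text: str) -> list[float]:
    tokens = text.lower().split()
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [counts[w] / n for w in FUNCTION_WORDS]

print(function_word_vector("The leader of the team was very strong"))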

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structures. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks, and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks; however, they experienced difficulties with extracting these features. Another example of content-specific structural features is the use of HTML-tags by de Vel et al [17], who found that some email programs used HTML formatting for their emails, and included the frequency of different HTML-tags in their feature set.

Semantic features

The most complex set of features are the semantic features. These features require sophisticated NLP-techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP-techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al [64], who use UK/US-spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity-types, such as date, location, money, number, ordinal, organization, percent, person and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word; function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features; in such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection has to deal with an additional problem, in the sense that the final feature set might over-fit the training data. The benefit of feature selection methods is therefore ambiguous, and they can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].

Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

$$\mathrm{Entropy} = -\sum_{x \in X} P(x) \log P(x) \qquad (2.9)$$

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "the metric measures the horizontal separation between two standard Normal curves where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the data whilst reducing the dimensionality. For example, Tearle et al [65] use PCA to create linear combinations of features that explain 95% of the variation in the data; hence they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set: he uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set he still manages to assert with confidence that "The Royal Book of Oz" was written by Ruth Plumly Thompson; Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.
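A minimal sketch, assuming scikit-learn is available, of retaining enough principal components to explain 95% of the variance, in the spirit of Tearle et al; the feature matrix here is random placeholder data.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # 200 documents, 50 style features

pca = PCA(n_components=0.95)     # a fraction selects components by explained variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())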

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction that is made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author-profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts, and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations $\vec{V}(s)$ and $\vec{V}(t)$, the cosine similarity is defined as

$$\mathrm{Cosine}(s,t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)|\,|\vec{V}(t)|} \qquad (2.10)$$

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10,000. In later research, Koppel et al [40] report that 46% of 1000 blog extracts are classified correctly using only the cosine similarity.
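A minimal sketch of the cosine similarity between two raw term-frequency vectors (tf-idf weighting is omitted for brevity):

import math
from collections import Counter

def cosine(s: str, t: str) -> float:
    vs, vt = Counter(s.lower().split()), Counter(t.lower().split())
    dot = sum(vs[w] * vt[w] for w in vs.keys() & vt.keys())
    norm = math.sqrt(sum(c * c for c in vs.values())) * \
           math.sqrt(sum(c * c for c in vt.values()))
    return dot / norm if norm else 0.0

print(cosine("the dog barks", "the dog sleeps"))  # ≈ 0.67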

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al [29] use a similar approach to cluster emails by their writing style, using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al [30, 31]. They extract a combination of lexical, syntactic, structural and domain-specific features from the texts of candidate authors, and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1 [0, 0.25], A2 [0.25, 0.50], A3 [0.50, 0.75] and A4 [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns, and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint, to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.

Figure 2.1: The structure of a supervised authorship attribution system.
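A minimal sketch of the discretization step, assuming features have already been normalized to [0, 1]:

def discretize(value: float, intervals: int = 4) -> list[int]:
    # Interval index for a value in [0, 1]; the value 1.0 falls in the last bin.
    idx = min(int(value * intervals), intervals - 1)
    return [1 if i == idx else 0 for i in range(intervals)]

print(discretize(0.6))  # [0, 0, 1, 0], i.e. interval A3: [0.50, 0.75]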

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature, and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author. His most important finding was that word length distributions tend to remain the same across different works of a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejects the claim of some literary scholars that Shakespeare and Bacon are the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers: a set of 12 political essays whose authorship was disputed between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naive Bayes probabilistic model to the frequency of these function words, and found that all documents were written by Madison. The Naive Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, ..., x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the document is proportional to

$$P(A_i \mid x_1, \dots, x_n) \propto P(x_1, \dots, x_n \mid A_i)\, P(A_i) \qquad (2.11)$$

The real author is then calculated using

$$A^{*} = \arg\max_{A_i \in A} P(A_i \mid x_1, \dots, x_n) \qquad (2.12)$$
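A minimal sketch, assuming scikit-learn is available, of a Naive Bayes model over function-word counts; the texts, authors and vocabulary are toy placeholders:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["upon the whilst of the", "while on of the", "upon whilst upon"]
train_authors = ["Hamilton", "Madison", "Hamilton"]

vec = CountVectorizer(vocabulary=["upon", "whilst", "while", "on", "of", "the"])
clf = MultinomialNB().fit(vec.transform(train_texts), train_authors)

print(clf.predict(vec.transform(["whilst upon the"])))  # argmax_A P(A | x)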

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most-frequent words from a text against a reference corpus (a large contemporary corpus on which the mean and standard deviation of these 30 words are computed) and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta-score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
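A minimal sketch of the Delta computation over a shared set of frequent words; the reference-corpus means and standard deviations are assumed to be precomputed, and the numbers are toy placeholders:

import numpy as np

corpus_mean = np.array([0.060, 0.030, 0.020])   # e.g. "the", "of", "and"
corpus_std = np.array([0.010, 0.008, 0.005])

def delta(known_freqs, unknown_freqs):
    z_known = (known_freqs - corpus_mean) / corpus_std
    z_unknown = (unknown_freqs - corpus_mean) / corpus_std
    return np.mean(np.abs(z_known - z_unknown))

known = np.array([0.062, 0.028, 0.021])     # author profile
unknown = np.array([0.058, 0.033, 0.018])   # disputed text
print(delta(known, unknown))  # lower Delta = more likely the same author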

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier have been tested using 2-5 authors and 20-800 documents per author. Across different experiments C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest possible margin. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in very fast classification. An example of how SVM maximizes the margin of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin of error. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and can separate the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel, and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF-kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].
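A minimal sketch, assuming scikit-learn is available, comparing linear and RBF kernels by cross-validation on synthetic two-author data:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (40, 10)),    # author A
               rng.normal(0.8, 1.0, (40, 10))])   # author B
y = np.array(["A"] * 40 + ["B"] * 40)

for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=10)
    print(kernel, scores.mean())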


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF-kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations on binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-one using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines: for example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al [71] and Luyckx et al [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al [65] use an automated algorithm to select input metrics and training and validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe-debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations; in other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved with as little as 100 words per text.

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then an edge e_{v_i v_j} ∈ W if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i v_j} ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network; the most important ones will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

$$\mathrm{Cocitation}(v_i, v_j) = |N(v_i) \cap N(v_j)| \qquad (2.13)$$

In Graph Theory this is often known as the shared-neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases, because they are connected by three different Connected Triples (red, yellow and blue).

Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

$$\mathrm{Jaccard}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|} \qquad (2.14)$$

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random guessing in predicting where new links will form. However, it sometimes performs only on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
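A minimal sketch of Jaccard neighborhood similarity on a link network stored as an adjacency dictionary; the addresses are hypothetical:

network = {
    "alice@x.com": {"bob@x.com", "carol@x.com", "dave@x.com"},
    "a.smith@y.com": {"bob@x.com", "carol@x.com"},
}

def jaccard(vi: str, vj: str) -> float:
    ni, nj = network[vi], network[vj]
    return len(ni & nj) / len(ni | nj) if ni | nj else 0.0

print(jaccard("alice@x.com", "a.smith@y.com"))  # 2/3 ≈ 0.67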

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively; an individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

$$\mathrm{SimRank}(v_i, v_j) = \frac{C}{|I(v_i)|\,|I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} \mathrm{SimRank}(I_x(v_i), I_y(v_j)) \qquad (2.15)$$

where C is a constant between 0 and 1. In practice the equation can be solved by iterating to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
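A minimal sketch of SimRank by fixed-point iteration on a small directed graph, with an assumed decay constant C = 0.8:

import itertools

in_links = {           # vertex -> set of in-going neighbors I(v)
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": set(),
    "d": set(),
}

def simrank(C: float = 0.8, iterations: int = 10):
    nodes = list(in_links)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new_sim = {}
        for u, v in itertools.product(nodes, nodes):
            if u == v:
                new_sim[(u, v)] = 1.0
            elif in_links[u] and in_links[v]:
                total = sum(sim[(x, y)] for x in in_links[u] for y in in_links[v])
                new_sim[(u, v)] = C * total / (len(in_links[u]) * len(in_links[v]))
            else:
                new_sim[(u, v)] = 0.0
        sim = new_sim
    return sim

print(simrank()[("a", "b")])  # a and b share all in-neighbors -> high score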

PageSim [42] is another extension of the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets; on one data set Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

$$\mathrm{ConnectedPath}(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{\mathrm{length}(p)} \qquad (2.16)$$

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length at most r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

$$U(p) = \sum_{v_x \in p,\ v_x \notin \{v_i, v_j\}} UQ(v_x) \qquad (2.17)$$

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

$$UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|} \qquad (2.18)$$

where w_{x,g} denotes an edge between v_x ∈ p and any other vertex v_g ∈ V, and w_{x,x+1} and w_{x,x-1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets Connected Path is able to find the most aliases.

Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al [6].
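A minimal sketch of the Connected Path score on a small undirected, unweighted graph; for unweighted edges the numerator of UQ reduces to 2, and the path enumeration is a simple depth-first search bounded by r:

graph = {
    "vi": {"a", "b"},
    "vj": {"a", "b"},
    "a": {"vi", "vj"},
    "b": {"vi", "vj", "c"},
    "c": {"b"},
}

def paths(u, v, r, path=None):
    path = path or [u]
    if u == v:
        yield path
    elif len(path) <= r:  # limits enumerated paths to at most r edges
        for n in graph[u]:
            if n not in path:
                yield from paths(n, v, r, path + [n])

def uq(vx):  # uniqueness: the two path edges over the total degree of vx
    return 2 / len(graph[vx])

def connected_path(vi, vj, r=3):
    score = 0.0
    for p in paths(vi, vj, r):
        inner = p[1:-1]                     # vertices excluding vi and vj
        if inner:
            u_p = sum(uq(vx) for vx in inner)
            score += u_p / (len(p) - 1)     # path length measured in edges
    return score

print(connected_path("vi", "vj"))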

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = αs_i + βs_j + γs_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j, k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al [6] use a related approach in combining link analysis results and string metrics: they use the Connected Path algorithm that is described in section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.
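A minimal sketch of such a linear combination; the weights are arbitrary placeholders:

def combine(s_string: float, s_author: float, s_link: float,
            alpha: float = 0.4, beta: float = 0.4, gamma: float = 0.2) -> float:
    # All input scores are assumed to be normalized to [0, 1].
    return alpha * s_string + beta * s_author + gamma * s_link

print(combine(0.9, 0.7, 0.5))  # 0.74: combined alias score for a candidate pair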

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content; a voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision; the remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.

                 correct alias           false alias
retrieved        true positives (tp)     false positives (fp)
not retrieved    false negatives (fn)    true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution it is common to construct a contingency table, such as the one shown in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

$$\mathrm{Accuracy} = \frac{\text{correct classifications}}{\text{total number of classifications}} = \frac{tp + tn}{tp + fp + fn + tn} \qquad (2.19)$$

Although it looks like a good measure of performance, it is not hard to obtain a high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all examples as negative. In most cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as

$$P = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{retrieved aliases}|} = \frac{tp}{tp + fp} \qquad (2.20)$$

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

$$R = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{total correct aliases}|} = \frac{tp}{tp + fn} \qquad (2.21)$$

These two measures are not as dependent on the class distributions as the accuracy measure; therefore, they are a more sensible choice in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user who wants to automate the complete process, and who must be able to rely greatly on the classification given by the system, will favor precision over recall. Since the

preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha)\frac{1}{R}} \qquad (2.22)$$

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (2.23)$$

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50%, simply by classifying all instances as positive: since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].

Averaging the precision and recall scores of different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed, and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46], and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.
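A minimal sketch contrasting macro- and micro-averaged precision over per-author (tp, fp) counts, with toy numbers:

counts = {"author1": (8, 2), "author2": (1, 4)}   # author -> (tp, fp)

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

macro = sum(precision(tp, fp) for tp, fp in counts.values()) / len(counts)
micro = precision(sum(tp for tp, _ in counts.values()),
                  sum(fp for _, fp in counts.values()))
print(macro, micro)  # 0.5 vs. 0.6: micro is dominated by the larger class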

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of over-fitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure, in order to aid the comparison of the different techniques to each other.

Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in U.S. history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large, real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; for the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.

SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well-known version of the data set was made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen-version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed; the same goes for messages that were created by the computer when organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "noaddress@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.

Step    Records affected    Percentage removed (cum.)
1       17,052              6.70
3       13,681              12.00
4       26,223              22.50
5       4,001               24.00
6       25,990              34.00
7       3,700               35.80
8       52,163              56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply parts of messages have been removed.

3. Empty messages, resulting from the removal of forward or reply parts in step 2, were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number of records that have been removed per step, as well as the cumulative percentage of records removed.
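As an illustration, steps 4 and 6 could be expressed as follows with pandas; the file name and column names are hypothetical:

import pandas as pd

emails = pd.read_csv("emails.csv")   # sender, receiver, body, send_date, subject
emails = emails.drop_duplicates(     # step 6: keep one copy of each duplicate
    subset=["sender", "receiver", "body", "send_date", "subject"], keep="first")
emails = emails[emails["body"].str.split().str.len() > 10]   # step 4
print(len(emails))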

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances that were needed for classification using Support Vector Machines. The results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails <= 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consisted of 44912 messages by 246 different senders. For each message the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides


[Line chart, "Cross-validation accuracy for different training set sizes". X-axis: number of training instances per class (20-200); y-axis: 10-fold cross-validation accuracy (0.5-1.0); one curve per kernel (Linear, RBF).]

Figure 3.2: Averages of 10 x 10-fold cross-validation using different training set sizes and kernels for the authorship SVM.

an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10000 and 100000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it corresponds to the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 21.6. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails >= 200 were selected from the data set, and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:

[Histogram. X-axis: number of emails (90-230); y-axis: number of authors (0-35).]

Figure 3.3: The distribution of email messages per author.

[Histogram with logarithmic x-axis. X-axis: total number of words (10^4 to 10^8); y-axis: number of authors (0-180).]

Figure 3.4: The distribution of the total number of words per author.


[Network graph; the node labels (truncated author email prefixes, including the artificial alias names) are omitted here.]

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of Alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No alias                             193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed   Hard
High Jaro-Winkler    6       2
Low Jaro-Winkler     8       16
No alias             6       2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. johndoe@enron.comA & johndoe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin laden & abu abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
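As an illustration, the following is a minimal plain-Python sketch of the Jaro-Winkler metric reviewed in Section 2.1, applied to a pair of email addresses; the 0.1 prefix weight is the commonly used default, and the threshold in the example is illustrative rather than the tuned value:

    def jaro(s, t):
        # Jaro similarity of two strings.
        if s == t:
            return 1.0
        max_dist = max(len(s), len(t)) // 2 - 1
        s_match = [False] * len(s)
        t_match = [False] * len(t)
        matches = 0
        for i, c in enumerate(s):
            lo, hi = max(0, i - max_dist), min(i + max_dist + 1, len(t))
            for j in range(lo, hi):
                if not t_match[j] and t[j] == c:
                    s_match[i] = t_match[j] = True
                    matches += 1
                    break
        if matches == 0:
            return 0.0
        # Transpositions: matched characters that appear in a different order.
        s_chars = [c for c, m in zip(s, s_match) if m]
        t_chars = [c for c, m in zip(t, t_match) if m]
        transpositions = sum(a != b for a, b in zip(s_chars, t_chars)) / 2
        return (matches / len(s) + matches / len(t)
                + (matches - transpositions) / matches) / 3

    def jaro_winkler(s, t, p=0.1):
        # Boost the Jaro score for a shared prefix of at most 4 characters.
        j = jaro(s, t)
        prefix = 0
        for a, b in zip(s, t):
            if a != b or prefix == 4:
                break
            prefix += 1
        return j + prefix * p * (1 - j)

    # Example: flag a candidate pair as aliases above an illustrative threshold.
    if jaro_winkler("johndoe@enron.comA", "johndoe@enron.comB") > 0.9:
        print("candidate alias pair")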

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was normalized as follows:

    ConnectedPath_norm(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in Section 3.1 no longer occur in the neighborhood of their correspondents, and therefore do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
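A sketch of this neighborhood comparison, assuming the link network is available as a dict that maps each author to the set of correspondents (senders and receivers) observed in the data set:

    def jaccard(neighbors, a, b):
        # Jaccard similarity of the direct neighborhoods of authors a and b.
        na, nb = neighbors[a], neighbors[b]
        union = na | nb
        return len(na & nb) / len(union) if union else 0.0

    # neighbors: dict[str, set[str]] built from the sender/receiver fields, e.g.
    # score = jaccard(neighbors, "author1@enron.com", "author2@enron.com")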

The last individual technique that has been evaluated is the use of SVMs on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences between email messages from the same author. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4; the list of function words that has been used in the feature set can be found in the appendix.
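A sketch of how a few of the lexical features of Table 3.4 can be extracted from a single message (instance-based, so one vector per email); only a handful of the 492 features are shown, and the tokenization is deliberately simplified:

    import re
    from collections import Counter

    def lexical_features(text):
        # A few of the lexical features of Table 3.4, computed for one email.
        chars = max(len(text), 1)                      # feature 1: C
        words = re.findall(r"[A-Za-z']+", text)        # simplified tokenization
        n_words = max(len(words), 1)                   # feature 54: M
        counts = Counter(w.lower() for w in words)
        return {
            "upper_ratio": sum(c.isupper() for c in text) / chars,      # 3
            "digit_ratio": sum(c.isdigit() for c in text) / chars,      # 4
            "short_words": sum(len(w) < 4 for w in words) / n_words,    # 55
            "avg_word_len": sum(map(len, words)) / n_words,             # 57
            "hapax_legomena": sum(v == 1 for v in counts.values()),     # 61
            "hapax_dislegomena": sum(v == 2 for v in counts.values()),  # 62
        }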

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using


Features   Description

Lexical
1          Total number of characters (C)
2          Total number of alphabetic characters / C
3          Total number of upper-case characters / C
4          Total number of digit characters / C
5          Total number of white-space characters / C
6          Total number of tab spaces / C
7-32       Frequency of letters A-Z
33-53      Frequency of special characters (~ $ ^ & - _ = + > < [ ] | etc.)
54         Total number of words (M)
55         Total number of short words (less than four characters) / M
56         Total number of characters in words / C
57         Average word length
58         Average sentence length (in characters)
59         Average sentence length (in words)
60         Total different words / M
61         Hapax legomena: frequency of once-occurring words
62         Hapax dislegomena: frequency of twice-occurring words
63-82      Word length frequency distribution / M
83-333     TF-IDF of the 250 most frequent 3-grams

Syntactic
334-341    Frequency of punctuation marks (, . ? ! : ; ' ")
342-491    Frequency of function words

Structural
492        Total number of sentences

Table 3.4: Feature set for the authorship SVM.


5 x 5-fold cross-validation for each authorship SVM. The highest scoring combination of parameters is then chosen to train the actual SVM model.
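A sketch of this grid search in scikit-learn terms (the thesis itself used SVM.NET, described below); the parameter grid follows the exponential sequences given above, and cv=5 approximates one round of the 5 x 5-fold procedure:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {
        "C": [2.0 ** e for e in range(-5, 16, 2)],      # 2^-5, 2^-3, ..., 2^15
        "gamma": [2.0 ** e for e in range(-15, 4, 2)],  # 2^-15, 2^-13, ..., 2^3
    }

    # X: feature vectors of Table 3.4; y: 1 for the target author, 0 otherwise.
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
    # search.fit(X, y); best_model = search.best_estimator_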

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.
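A sketch of the one-versus-all scheme, again in scikit-learn terms rather than SVM.NET; sample_negatives is a hypothetical helper that draws the random, equally sized negative sample described in the next paragraph:

    from sklearn.svm import SVC

    def train_one_vs_all(emails_by_author, vectorize):
        # One probabilistic RBF-SVM per author: that author vs. all others.
        models = {}
        for author, own_emails in emails_by_author.items():
            X_pos = [vectorize(e) for e in own_emails]
            X_neg = sample_negatives(emails_by_author, author, n=len(X_pos))
            X = X_pos + X_neg
            y = [1] * len(X_pos) + [0] * len(X_neg)
            models[author] = SVC(kernel="rbf", probability=True).fit(X, y)
        return models

    def attribute(models, vectorize, text):
        # Attribute a text to the author whose SVM assigns it the highest probability.
        x = [vectorize(text)]
        return max(models, key=lambda a: models[a].predict_proba(x)[0][1])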

Since SVMs are sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of emails from other authors is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C#-conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single- and multi-class problems using different kernels and parameters.

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the


Figure 3.6: The structure of the combined approach.

results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
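A sketch of how such a voting SVM can be trained, assuming each candidate author-alias pair has been reduced to a three-dimensional score vector (Jaro-Winkler, link-network similarity, authorship-SVM probability) together with a manual alias/non-alias label; the RBF kernel is an assumption, and the 1:5 class ratio follows the procedure described above:

    import random
    from sklearn.svm import SVC

    def train_voting_svm(labeled_pairs):
        # labeled_pairs: list of ([jw, link_sim, svm_prob], is_alias) tuples.
        positives = [v for v, is_alias in labeled_pairs if is_alias]
        negatives = [v for v, is_alias in labeled_pairs if not is_alias]
        negatives = random.sample(negatives, min(len(negatives), 5 * len(positives)))
        X = positives + negatives
        y = [1] * len(positives) + [0] * len(negatives)
        return SVC(kernel="rbf", probability=True).fit(X, y)  # kernel assumed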

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.
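The evaluation sweeps the decision threshold over 0.0, 0.05, ..., 1.0 and computes precision, recall and F1 at each step; a minimal sketch, with scores and labels standing for the per-pair alias scores and the true alias flags:

    def precision_recall_f1(scores, labels, threshold):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and l for p, l in zip(predicted, labels))
        fp = sum(p and not l for p, l in zip(predicted, labels))
        fn = sum(l and not p for p, l in zip(predicted, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    for t in [i / 20 for i in range(21)]:   # 0.0, 0.05, ..., 1.0
        p, r, f = precision_recall_f1(scores, labels, t)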


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the decision thresholds 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the decision thresholds 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas


JW-Jaccard-SVM achieves the best results of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Four line charts, each with x-axis: decision threshold (0-1) and y-axis: precision, recall and F1 (0-1.2). Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set.

[Two line charts, each with x-axis: decision threshold (0-1) and y-axis: precision, recall and F1 (0-1.2). Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set.


[Four line charts, each with x-axis: decision threshold (0-1) and y-axis: precision, recall and F1 (0-1.2). Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set.

[Two line charts, each with x-axis: decision threshold (0-1) and y-axis: precision, recall and F1 (0-1.2). Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set.


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better on these tasks than individual techniques. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2; it is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative,


thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that were formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, combined using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it would be worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less


sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results of the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the way through: Testing for authorship in different frequency strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A comparison of personal name matching: Techniques and practical issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which is the best multiclass SVM method? An empirical study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International SIGIR Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp

[22] Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), pages 611–617, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias detection in link data sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: A measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98, volume 1398, pages 137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto meta-classifier for authorship identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conference on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship attribution with thousands of candidate authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: A novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution. PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship attribution and verification with many authors and limited data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The characteristic curves of composition. Science, 9(214S):237–246.

[48] Mendenhall, T. C. (1901). A mechanical solution of a literary problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A lexical database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron email dataset: Database schema and brief statistical report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word frequency distributions and type-token characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An empirical study of category skew on feature selection for text categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship identification with modality specific meta features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The state of record linkage and current research problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and scalable authorship attribution using function words. Information Retrieval Technology, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.


Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, I, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



2.2 Authorship Attribution

Other approaches to resolving aliases and disambiguating authors can be found in the field of authorship attribution. The authorship attribution task can be described as follows: given a set of candidate authors and a set of documents written by each of these authors, try to determine which of these candidates wrote a given anonymous document. In the traditional authorship attribution problem, the number of candidate authors is typically small (2-10), the number of documents per author is large, and the length of these documents is large. Moreover, it is assumed that the author of the anonymous document is actually in the candidate set, i.e. there is a closed candidate set. A good example of a traditional authorship attribution problem is to determine the author of a disputed literary work, such as some of Shakespeare's plays.

Authorship attribution techniques can be very useful in resolving aliases and determining authorship. An authorship attribution system can be trained to distinguish between the different authors in an email data set. For a given author, it is then possible to determine whether an alias is being used by letting the authorship attribution system predict which author's writing style most closely resembles the given author's writing style.

In the remainder of this section, the different techniques that have been employed for authorship attribution will be explained, as well as the important design choices that have to be made. These include the choice of a feature set, a feature selection technique, the actual attribution technique, and whether to treat the problem from an instance-based or a profile-based perspective.

2.2.1 Instance vs. profile-based

A general distinction can be made between techniques that treat each email individually (instance-based) and techniques that accumulate all the emails per author (profile-based). The first approach treats each email from a given author as a single training instance and thereby retains the differences between texts from the same author. The second approach accumulates all the texts from a given author into one big training file, creating a profile of the author and disregarding differences between individual texts. The choice is mostly philosophical: whether to model the general style of each author or the individual style of each document [63].

2.2.2 Features

An important design choice in authorship attribution systems is the choice of feature set. Features are the specific writing-style attributes, predefined by the researcher, that are extracted from a piece of text in order to capture stylistic information that is characteristic of a particular author. Since the choice of feature set can affect the performance of the authorship attribution in various ways, it is important to consider which features to include or exclude. In general,


a distinction can be made between lexical, syntactic, structural, semantic and content-specific features. These features will be discussed in that order in the following sections.

Lexical features

Lexical features are features that are derived at the character and word level of the text, and they are the most commonly used features. These features are considered language-independent, since they do not need any prior language-dependent processing before they can be applied to a text. Character frequencies, word-length distributions, the frequency of digits and non-alphanumeric characters, and the total number of words are all examples of lexical features that provide useful information.

An easy-to-use lexical feature that is also computationally simple is character n-grams. For example, the character 4-grams that can be extracted from the phrase "The dog" are "the ", "he d", "e do" and " dog". Character n-grams can capture various writing-style markers from a text, such as capitalization or UK/US variants of certain words. Even if a word is incorrectly spelled, most of the n-grams extracted from the correct and incorrect variants will match, although a misspelling can also be considered a style marker for a particular author in its own right. An advantage of character n-grams is that they do not need tokenization before they can be applied to a text, which is very useful in Asian or Arabic languages, where tokenization is difficult.
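A minimal sketch of this extraction, reproducing the 4-gram example above:

    def char_ngrams(text, n=4):
        # All overlapping character n-grams of a text, lower-cased.
        text = text.lower()
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("The dog"))   # ['the ', 'he d', 'e do', ' dog']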

More complicated lexical features require the detection of word and sentence boundaries in the text. By using common Natural Language Processing (NLP) tools, a text can be broken up into its constituent parts during a process called tokenization. A token can be a single word, phrase, symbol or other meaningful element. After counting the occurrences of each distinct token, the n most frequently occurring tokens can be used as features, since the tokens that occur most frequently are considered to contain the most useful discriminatory information.

Another set of features that can be derived from tokenization is the frequency of different word lengths. These features provide information on how often a particular author uses words of different lengths. Vocabulary richness measures are a subset of lexical features that are derived from these word frequencies. Hapax legomena is the number of words that occur once in a text, whereas hapax dislegomena is the number of words that occur twice. The number of hapax legomena and hapax dislegomena gives an indication of how rich the vocabulary of a certain author is: authors that have a larger vocabulary will have a higher count of once- or twice-occurring words than authors with a small vocabulary. The type-token ratio V/N is the number of unique tokens V divided by the total number of tokens in a text N, and gives another indication of vocabulary richness. Numerous vocabulary richness measures have been created based on word frequencies, of which the most well known are:


• Yule's K [69]:

    K = 10^4 * [ -1/N + \sum_i V(i,N) * (i/N)^2 ]    (2.5)

where V(i,N) is the number of words occurring i times in the text of length N.

• Sichel's S [59]:

    S = V(2,N) / V(N)    (2.6)

where V(N) is the vocabulary size and V(2,N) the number of twice-occurring words.

• Brunet's W [7]:

    W = N^(V(N)^(-a))    (2.7)

where N is the number of words and a is usually set to 0.172.

• Honoré's R [26]:

    R = 100 * log(N) / (1 - V(1,N)/V(N))    (2.8)

where V(1,N) is the number of once-occurring words.
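These measures are straightforward to compute from a word-frequency table; a minimal sketch (the natural logarithm is assumed for Honoré's R):

    import math
    from collections import Counter

    def richness_measures(tokens):
        # Vocabulary richness measures (2.5)-(2.8) from a list of word tokens.
        N = len(tokens)
        freqs = Counter(tokens)               # word -> number of occurrences
        spectrum = Counter(freqs.values())    # i -> V(i, N)
        V = len(freqs)                        # vocabulary size V(N)
        K = 1e4 * (-1 / N + sum(v * (i / N) ** 2 for i, v in spectrum.items()))
        S = spectrum[2] / V                                 # Sichel's S
        W = N ** (V ** -0.172)                              # Brunet's W
        R = 100 * math.log(N) / (1 - spectrum[1] / V)       # Honore's R
        return K, S, W, R    # note: R is undefined if every word is a hapax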

Furthermore, smileys [64], abbreviations [62], slang words [36] and even spelling errors can be used as stylometric features to distinguish between authors. For example, Koppel [37] tries to simulate the way a human expert discriminates between authors by using stylistic idiosyncrasies, such as misspellings, to fingerprint a particular author. A disadvantage of that approach is that discriminating between authors is hard when there are few idiosyncrasies present in the text.

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing the syntactic constructions that an author uses. The underlying idea is that every author unconsciously uses more or less the same syntactical patterns in each text. Syntactic features rely on the accuracy and availability of NLP tools in different languages in order to tokenize and apply part-of-speech (POS) tags.

A common method is to analyze the short sequences of POS-tags that occur most frequently throughout an author's work. For example, the sentence "The man walked in the park" can be represented as

| DETERMINER | NOUN | VERB PAST TENSE | PREPOSITION | DETERMINER | NOUN |


where token starts and ends are delimited by a |. The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS-tags can be found in Solorio and Pillay [62], who use POS-tag uni-grams, bi-grams and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactical parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64] or the distance covered by the dependency links in the parse tree.

Function words were introduced by Mosteller and Wallace [51] to distinguish between different authors. Function words are words that have almost no lexical meaning but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change over short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong", the function words are "The", "of" and "the". They signify the grammatical structure of the sentence without containing any meaning; they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features.

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structure. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al. [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks; however, they experienced difficulties with extracting these features. Another example of content-specific structural features is the use of HTML-tags by de Vel et al. [17], who found that some email programs used HTML formatting for their emails and included the frequency of different HTML-tags in their feature set.

Semantic features

The most complex set of features are the semantic features. These features require sophisticated NLP techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP techniques are often noisy and inaccurate, there is a chance that such features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al. [64], who use UK/US spelling variants and 12 features based on WordNet [49].


WordNet is a large semantic database of English nouns, verbs, adjectives and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity types, such as date, location, money, number, ordinal, organization, percent, person and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word; function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features; in such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection has to deal with an additional problem, in the sense that the final feature set might over-fit the training data. The use of feature selection methods is therefore ambiguous and can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

Entropy = -\sum_{x \in X} P(x) \log P(x)    (2.9)

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.
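As a minimal sketch of this criterion, the fragment below computes equation (2.9) and the resulting gain for a single binary feature; the feature values and author labels are invented for illustration.

```python
import numpy as np

def entropy(labels):
    # Entropy = -sum_x P(x) log2 P(x) over the class distribution.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    # Difference in entropy before and after splitting on the feature.
    remainder = 0.0
    for value in np.unique(feature):
        subset = labels[feature == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

# Invented toy feature: does an email contain a particular function word?
contains_word = np.array([1, 1, 0, 0, 1, 0])
authors       = np.array(["A", "A", "B", "B", "A", "B"])
print(information_gain(contains_word, authors))  # 1.0 bit: perfect separation
```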

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "The metric measures the horizontal separation between two standard Normal curves where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the feature set whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data; hence they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set, he still manages to assert with confidence that "The Royal Book of Oz" has been written by Ruth Plumly Thompson; Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction that is made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations \vec{V}(s) and \vec{V}(t), the cosine similarity is defined as

Cosine(s, t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)| \, |\vec{V}(t)|}    (2.10)

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10,000. In later research, Koppel et al. [40] report that 46% of 1,000 blog extracts are classified correctly using only the cosine similarity.
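A minimal sketch of this similarity ranking, using scikit-learn only to build the tf-idf vectors; the mini-corpus is invented and stands in for the known and anonymous texts.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

known = ["the quarterly report is attached for your review",
         "please send me the draft before the meeting tomorrow"]
anonymous = "could you review the attached report"

V = TfidfVectorizer().fit_transform(known + [anonymous]).toarray()

def cosine(a, b):
    # Equation (2.10): dot product divided by the product of vector norms.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(V[i], V[-1]) for i in range(len(known))]
print(scores)  # the candidate text with the highest score wins the attribution
```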

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural and domain-specific features from the texts of candidate authors and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1 = [0, 0.25], A2 = [0.25, 0.50], A3 = [0.50, 0.75] and A4 = [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint, to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.
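A sketch of the discretization step only; the frequent-pattern mining that follows it in Writeprint Mining is omitted here.

```python
import numpy as np

def discretize(value, intervals=4):
    # Map a feature value in [0, 1] onto a one-hot interval vector,
    # e.g. 0.6 with four intervals -> (0, 0, 1, 0).
    index = min(int(value * intervals), intervals - 1)
    onehot = np.zeros(intervals, dtype=int)
    onehot[index] = 1
    return onehot

print(discretize(0.6))  # [0 0 1 0]
print(discretize(1.0))  # [0 0 0 1]: the value 1 is clamped into A4
```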

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

Figure 2.1: The structure of a supervised authorship attribution system.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature, and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author. His most important finding was that word length distributions tend to remain the same across different works for a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejects the claim of some literary scholars that Shakespeare and Bacon are the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers, a set of 12 political essays whose authorship was disputed between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller and Wallace applied a Naïve Bayes probabilistic model to the frequency of these function words and found that all documents were written by Madison. The Naïve Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, ..., x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the original document can be expressed by

P(A_i | x_1, ..., x_n) \propto P(x_1, ..., x_n | A_i) P(A_i)    (2.11)

The real author is then calculated using

A^* = \arg\max_{A_i} P(A_i | x_1, ..., x_n)    (2.12)
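A sketch of the function-word idea with scikit-learn's multinomial Naïve Bayes; the word counts and author labels below are invented, not Mosteller and Wallace's actual data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Columns: counts of a few function words (e.g. "upon", "while", "whilst").
X_train = np.array([[3, 0, 2], [4, 1, 2],   # documents by author A
                    [0, 3, 0], [1, 4, 1]])  # documents by author B
y_train = ["A", "A", "B", "B"]

model = MultinomialNB().fit(X_train, y_train)

disputed = np.array([[0, 2, 1]])
print(model.predict(disputed))        # A* = arg max P(A_i | x), eq. (2.12)
print(model.predict_proba(disputed))  # posterior per author, eq. (2.11)
```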

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words of a text against a reference corpus (a large contemporary corpus from which the mean and standard deviation of these 30 words are computed), and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1,500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
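A sketch of the Delta computation with three word-variables instead of thirty; the reference-corpus statistics and frequencies are invented for illustration.

```python
import numpy as np

ref_mean = np.array([0.050, 0.030, 0.020])  # corpus mean relative frequency
ref_std  = np.array([0.010, 0.008, 0.005])  # corpus standard deviation

def burrows_delta(known, unknown):
    # Mean absolute difference between the z-scores of the two texts.
    z = lambda f: (f - ref_mean) / ref_std
    return np.mean(np.abs(z(known) - z(unknown)))

unknown = np.array([0.055, 0.028, 0.022])
candidates = {"author1": np.array([0.052, 0.031, 0.021]),
              "author2": np.array([0.030, 0.045, 0.010])}
deltas = {a: burrows_delta(f, unknown) for a, f in candidates.items()}
print(min(deltas, key=deltas.get))  # the lowest Delta wins the attribution
```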

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but they do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier were tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is the Support Vector Machine (SVM). SVM is a supervised classification method that deals very well with high-dimensional data. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by constructing a hyperplane in high-dimensional space that separates the two classes with the largest possible margin. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in very fast classification. An example of how SVM maximizes the margin of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and on what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the Linear Kernel, which assumes that the training data is linearly separable and can separate the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations of the binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-versus-all using a winner-takes-all strategy, one-versus-one using max-wins voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.

Another machine learning technique that can be used for authorship attribution is the Artificial Neural Network (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training and validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.
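A rough sketch of the unmasking loop on synthetic data, using scikit-learn's LinearSVC as the linear SVM; in a real setting the rows would be feature vectors of text chunks from the two candidate sets.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unmasking_curve(X, y, iterations=10, k=3):
    # Record cross-validation accuracy while repeatedly removing the k
    # strongest positive and k strongest negative features.
    active = np.arange(X.shape[1])
    curve = []
    for _ in range(iterations):
        curve.append(cross_val_score(LinearSVC(), X[:, active], y, cv=5).mean())
        clf = LinearSVC().fit(X[:, active], y)
        order = np.argsort(clf.coef_[0])           # ascending by weight
        drop = np.concatenate([order[:k], order[-k:]])
        active = np.delete(active, drop)
    return curve

# Synthetic demo: five features carry most of the "author" signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :5] += 2.0
print(unmasking_curve(X, y))  # accuracy degrades once the signal is removed
```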

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then there is an edge e_{v_i v_j} ∈ W if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i v_j} ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

Co-citation(v_i, v_j) = |N(v_i) \cap N(v_j)|    (2.13)

In Graph Theory this is often known as the shared-neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases, because they are connected by three different Connected Triples (red, yellow and blue).

21

Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

Jaccard(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}    (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random in predicting where new links will form. However, it sometimes performs on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
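Equation (2.14) transcribed directly; the toy neighborhood map below (who exchanged mail with whom) uses invented addresses.

```python
def jaccard(na, nb):
    # |N(v_i) n N(v_j)| / |N(v_i) u N(v_j)|; 0 if both neighborhoods are empty.
    return len(na & nb) / len(na | nb) if (na or nb) else 0.0

network = {
    "john.doe@enron.com": {"a@enron.com", "b@enron.com", "c@enron.com"},
    "jdoe@enron.com":     {"a@enron.com", "b@enron.com", "d@enron.com"},
    "jane.roe@enron.com": {"x@enron.com"},
}
print(jaccard(network["john.doe@enron.com"], network["jdoe@enron.com"]))  # 0.5
```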

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors and are denoted by |I(v)| and |O(v)| respectively; an individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

SimRank(v_i, v_j) = \frac{C}{|I(v_i)| \, |I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} SimRank(I_x(v_i), I_y(v_j))    (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
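A small fixed-point iteration of equation (2.15); the directed toy graph and the constant C = 0.8 are illustrative choices.

```python
from itertools import product

def simrank(in_neighbors, C=0.8, iterations=10):
    nodes = list(in_neighbors)
    # Base case: a vertex is maximally similar to itself.
    sim = {(a, b): float(a == b) for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            Ia, Ib = in_neighbors[a], in_neighbors[b]
            if a == b:
                new[(a, b)] = 1.0
            elif not Ia or not Ib:
                new[(a, b)] = 0.0
            else:
                total = sum(sim[(x, y)] for x in Ia for y in Ib)
                new[(a, b)] = C * total / (len(Ia) * len(Ib))
        sim = new
    return sim

# u and v receive mail from the same two senders, so they look alike.
graph = {"u": {"s1", "s2"}, "v": {"s1", "s2"}, "s1": set(), "s2": set()}
print(simrank(graph)[("u", "v")])  # 0.4 after convergence
```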

PageSim [42] is another extension of the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

ConnectedPath(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{length(p)}    (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length at most r. U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

U(p) = \sum_{v_x \in p, \, v_x \notin \{v_i, v_j\}} UQ(v_x)    (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|}    (2.18)

where w_{x,g} denotes an edge between v_x ∈ p and any other vertex v_g ∈ V, and w_{x,x-1} and w_{x,x+1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
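A rough sketch of equations (2.16) to (2.18), under the simplifying assumptions of an unweighted, undirected network (so each |w| is 1) and simple paths of length at most r.

```python
def connected_path(graph, vi, vj, r=3):
    def uq(vx):
        # Eq. (2.18) for unit weights: 2 path edges / total degree of vx.
        return 2.0 / len(graph[vx])

    score = 0.0
    def dfs(node, path):
        nonlocal score
        if len(path) - 1 > r:
            return
        if node == vj:
            inner = path[1:-1]  # intermediate vertices only, eq. (2.17)
            if inner:
                score += sum(uq(v) for v in inner) / (len(path) - 1)
            return
        for nxt in graph[node]:
            if nxt not in path:
                dfs(nxt, path + [nxt])

    dfs(vi, [vi])
    return score

# vi and vj share two correspondents; "a" is more unique than the busy "b".
g = {"vi": {"a", "b"}, "vj": {"a", "b"}, "a": {"vi", "vj"},
     "b": {"vi", "vj", "c", "d"}, "c": {"b"}, "d": {"b"}}
print(connected_path(g, "vi", "vj"))  # 0.75: 0.5 via "a" plus 0.25 via "b"
```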

23

Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = αs_i + βs_j + γs_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j, k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set has been manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision; the remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias           false alias
retrieved        true positives (tp)     false positives (fp)
not retrieved    false negatives (fn)    true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table such as the one that can be seen in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = \frac{correct\ classifications}{total\ number\ of\ classifications} = \frac{tp + tn}{tp + fp + fn + tn}    (2.19)

Although it looks like a good measure of performance, it is not that hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as

P = \frac{|retrieved\ aliases \cap correct\ aliases|}{|retrieved\ aliases|} = \frac{tp}{tp + fp}    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

R = \frac{|retrieved\ aliases \cap correct\ aliases|}{|total\ correct\ aliases|} = \frac{tp}{tp + fn}    (2.21)

These two measures are not as dependent on the class distributions as the accuracy measure. Therefore, they are a more sensible choice to use in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user who wants to automate the complete process and rely greatly on the classification given by the system will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum of precision and recall than the arithmetic mean when the two values differ greatly [46].
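The three measures in code, with a contrived contingency table that shows why the harmonic mean punishes the classify-everything-as-positive strategy.

```python
def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp)                 # equation (2.20)
    r = tp / (tp + fn)                 # equation (2.21)
    return p, r, 2 * p * r / (p + r)   # equation (2.23)

# Retrieving every candidate as an alias: recall is 1.0 and the arithmetic
# mean of P and R is above 0.5, but F1 stays close to the poor precision.
print(precision_recall_f1(tp=10, fp=90, fn=0))  # (0.1, 1.0, 0.18...)
```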

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46] and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of effectiveness on the smaller classes.

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen (2006) found that, when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the ones most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of over-fitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure, in order to aid the comparison of the different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; for the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set, and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus's appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "noaddress@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty and Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.



2. Forward and reply parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.

Step    Records affected    Percentage removed (cum.)
1       17,052              6.70%
3       13,681              12.00%
4       26,223              22.50%
5       4,001               24.00%
6       25,990              34.00%
7       3,700               35.80%
8       52,163              56.50%

Table 3.1: Preprocessing steps applied to the ENRON corpus.
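As a small illustration of steps 4 and 6, a pandas sketch on a stand-in message table; the column names are assumptions, not the actual schema of the database used here.

```python
import pandas as pd

# Tiny stand-in for the message table.
emails = pd.DataFrame([
    {"sender": "a@enron.com", "receiver": "b@enron.com", "subject": "update",
     "body": "thank you very much for the update we will give it a try tomorrow",
     "sent": "2000-12-12"},
    {"sender": "a@enron.com", "receiver": "b@enron.com", "subject": "update",
     "body": "thank you very much for the update we will give it a try tomorrow",
     "sent": "2000-12-12"},
    {"sender": "a@enron.com", "receiver": "c@enron.com", "subject": "fyi",
     "body": "short note", "sent": "2001-01-05"},
])

# Step 6: identical sender, receiver, body, send date and subject = duplicate.
emails = emails.drop_duplicates(subset=["sender", "receiver", "body", "sent", "subject"])

# Step 4: remove messages that contain ten words or fewer.
emails = emails[emails["body"].str.split().str.len() > 10]
print(len(emails))  # 1 message survives both steps
```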

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances that were needed for classification using Support Vector Machines. The results, which can be found in figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set, the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consisted of 44,912 emails by 246 different senders; for each message, the sender, receiver, subject, body and send date have been stored.

Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels (linear and RBF) for the authorship SVM.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it represents the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 21.6. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data available to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set, and their emails were split up over several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.comA and john.doe@enron.comB).

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden and abu_abdallah).

• Authors without an alias.

The distribution of authors and aliases in the final data set can be seen in more detail in table 3.2; the number of authors, including aliases, in the final data set equaled 315.

Figure 3.3: The distribution of email messages per author.

Figure 3.4: The distribution of the total number of words per author.

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of alias                         Number of authors
High Jaro-Winkler with 1 alias        26
High Jaro-Winkler with 2 aliases      15
Low Jaro-Winkler with 1 alias         11
Low Jaro-Winkler with 2 aliases       1
No alias                              193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set              Mixed    Hard
High Jaro-Winkler     6        2
Low Jaro-Winkler      8        16
No alias              6        2

Table 3.3: Distribution of alias types in the two different test sets.


Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
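A minimal sketch of this pairwise comparison, assuming the third-party jellyfish package for the Jaro-Winkler similarity itself; the addresses and the 0.94 threshold (one of the better-performing values in chapter 4) are illustrative.

```python
import itertools
import jellyfish

addresses = ["john.doe@enron.com", "john.doe@enron.comA", "jane.roe@enron.com"]
THRESHOLD = 0.94

for a, b in itertools.combinations(addresses, 2):
    score = jellyfish.jaro_winkler_similarity(a, b)
    if score > THRESHOLD:
        print(f"{a} <-> {b}: {score:.3f} (candidate alias pair)")
```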

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores are in the range [0, 1], the score for a particular author-author pair was normalized as follows:

ConnectedPath_{norm}(v_i, v_j) = \frac{ConnectedPath(v_i, v_j)}{ConnectedPath_{max}}    (3.1)

where ConnectedPath_{max} is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences between individual email messages. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features, to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in table 3.4; the list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^{-5}, 2^{-3}, 2^{-1}, ..., 2^{15} and γ = 2^{-15}, 2^{-13}, 2^{-11}, ..., 2^{3} is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters ~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |
54          Total number of words (M)
55          Total number of short words (less than four characters) / M
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena (frequency of once-occurring words)
62          Hapax dislegomena (frequency of twice-occurring words)
63-82       Word length frequency distribution / M
83-333      TF-IDF of the 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation marks , . ? ! : ; ' "
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.
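For illustration, a sketch computing a handful of the lexical features above; the tokenization is deliberately crude, and only a small subset of the 492 features is shown.

```python
import re

def lexical_features(text):
    # A few of the lexical features from table 3.4, not the full vector.
    chars = len(text)                           # feature 1: C
    words = re.findall(r"[A-Za-z']+", text)     # crude tokenizer
    m = len(words)                              # feature 54: M
    return {
        "alpha_ratio": sum(c.isalpha() for c in text) / chars,   # feature 2
        "upper_ratio": sum(c.isupper() for c in text) / chars,   # feature 3
        "digit_ratio": sum(c.isdigit() for c in text) / chars,   # feature 4
        "short_word_ratio": sum(len(w) < 4 for w in words) / m,  # feature 55
        "avg_word_length": sum(len(w) for w in words) / m,       # feature 57
        "hapax_legomena": sum(words.count(w) == 1 for w in set(words)),  # 61
    }

print(lexical_features("Thank you very much. We will give it a try."))
```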


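A sketch of the same exponential grid with scikit-learn standing in for the SVM.NET implementation actually used; the repeated 5 × 5-fold scheme is simplified to a single 5-fold split, and the feature matrix is random stand-in data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical stand-ins for the 492-dimensional feature vectors and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(160, 492))
y = rng.integers(0, 2, size=160)  # 1 = target author, 0 = others

param_grid = {
    "C": 2.0 ** np.arange(-5, 16, 2),      # 2^-5, 2^-3, ..., 2^15
    "gamma": 2.0 ** np.arange(-15, 4, 2),  # 2^-15, 2^-13, ..., 2^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      cv=StratifiedKFold(n_splits=5))
search.fit(X, y)
print(search.best_params_)  # parameters used to train the final model
```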

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification of whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.
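A minimal sketch of the one-versus-all scheme, again with scikit-learn standing in for SVM.NET and invented toy features.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Toy feature matrix: rows are emails, labels are their (known) authors.
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 10))
y = np.repeat(["alice", "bob", "carol"], 30)
X[y == "alice"] += 1.0  # give each author a crude stylistic offset
X[y == "carol"] -= 1.0

# One binary RBF-SVM per author; probability=True enables per-author scores.
ova = OneVsRestClassifier(SVC(kernel="rbf", probability=True)).fit(X, y)

unknown = rng.normal(size=(1, 10)) + 1.0
print(ova.classes_, ova.predict_proba(unknown))  # highest probability wins
```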

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C#-conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) and authorship SVM on email content.

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network and authorship SVM on email content.

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction of whether the input author is an alias or not. An overview of the combined approach can be found in figure 3.6.

Figure 3.6: The structure of the combined approach.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are tested on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
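A minimal sketch of the voting step; each row of invented scores stands for one candidate author pair, with the columns produced by the three underlying techniques.

```python
import numpy as np
from sklearn.svm import SVC

# Columns: Jaro-Winkler, Jaccard (or Connected Path), authorship-SVM scores.
X_train = np.array([
    [0.96, 0.55, 0.80],   # real alias: high agreement across techniques
    [0.41, 0.62, 0.75],   # real alias: dissimilar address, similar style
    [0.88, 0.60, 0.85],   # real alias
    [0.95, 0.05, 0.10],   # false alias: similar address only
    [0.30, 0.10, 0.20],   # false alias
    [0.20, 0.15, 0.45],   # false alias
])
y_train = np.array([1, 1, 1, 0, 0, 0])

voter = SVC(kernel="rbf").fit(X_train, y_train)

candidate = np.array([[0.92, 0.50, 0.70]])
print(voter.predict(candidate), voter.decision_function(candidate))
```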

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from figure 3.3 are used to determine the precision and recall for various decision thresholds.
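As a sketch of this construction (with made-up score triples rather than the thesis's actual training data), the voting SVM can be set up roughly as follows:

```python
# Sketch of the "voting SVM": a second-level SVM trained on the outputs of
# the three first-level techniques. All score triples below are hypothetical.
from sklearn.svm import SVC

# Each row: [Jaro-Winkler score, link similarity, authorship-SVM score]
X_train = [[0.95, 0.40, 0.80],   # labeled alias pairs (positive)
           [0.91, 0.55, 0.75],
           [0.30, 0.05, 0.20],   # labeled non-alias pairs (negative,
           [0.55, 0.10, 0.35],   #  sampled at 5x the positive count)
           [0.20, 0.02, 0.15]]
y_train = [1, 1, 0, 0, 0]

voter = SVC(kernel="rbf")
voter.fit(X_train, y_train)

# Varying a threshold on the decision value yields the precision/recall
# curves reported in chapter 4.
candidate = [[0.88, 0.35, 0.70]]
print(voter.decision_function(candidate))  # above threshold -> predicted alias
```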


Chapter 4

Results

The results that have been obtained from the various experiments are discussed in this section. First, the results of the individual and combined techniques on the mixed test set are given. Second, the results of the individual and combined techniques on the hard test set are given. Finally, an overview of the best results achieved by each individual and combined technique is presented.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM; each panel plots precision, recall and F1 against decision thresholds from 0 to 1.

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM; each panel plots precision, recall and F1 against decision thresholds from 0 to 1.


Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM; each panel plots precision, recall and F1 against decision thresholds from 0 to 1.

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM; each panel plots precision, recall and F1 against decision thresholds from 0 to 1.


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results on both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of section 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, combined using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it would be worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research is done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who wrote the 15th Book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the way through: Testing for authorship in different frequency strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A comparison of personal name matching: Techniques and practical issues. In Sixth IEEE International Conference on Data Mining Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron/.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which is the best multiclass SVM method? An empirical study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval (SIGIR) Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias detection in link data sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: A measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto meta-classifier for authorship identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship attribution with thousands of candidate authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: A novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship attribution and verification with many authors and limited data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING '08), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The characteristic curves of composition. Science, 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A mechanical solution of a literary problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A lexical database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. U.S. Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron email dataset: Database schema and brief statistical report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word frequency distributions and type-token characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An empirical study of category skew on feature selection for text categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship identification with modality specific meta features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The state of record linkage and current research problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and scalable authorship attribution using function words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



A distinction can be made between lexical, syntactic, structural, semantic and content-specific features. These features will be discussed in that order in the following sections.

Lexical features

Lexical features are the features that are derived at the character and word level of the text, and are the most commonly used features. These features are considered language-independent, since they do not need any prior language-dependent processing before they can be applied to a text. Character frequencies, word-length distributions, the frequency of digits and non-alphanumeric characters, and the total number of words are all examples of lexical features that provide useful information.

An easy-to-use lexical feature that is also computationally simple is character n-grams. For example, the character 4-grams that can be extracted from the phrase "The dog" are "the ", "he d", "e do" and " dog". Character n-grams can capture various writing style markers from a text, such as capitalization or UK/US variants of certain words. Even if a word is incorrectly spelled, most of the n-grams extracted from the correct and incorrect variants will match, although a misspelling can also be considered a style marker for a particular author. An advantage of character n-grams is that they do not need tokenization before they can be applied to a text, which is very useful in Asian or Arabic languages where tokenization is difficult.
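A minimal illustration of character n-gram extraction, reproducing the "The dog" example above:

```python
# Character n-gram extraction; the lowercasing is one common convention.
from collections import Counter

def char_ngrams(text, n=4):
    """Return the character n-grams of a text, lowercased."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("The dog"))
# ['the ', 'he d', 'e do', ' dog']

# The frequencies of the n-grams can then serve directly as feature values.
print(Counter(char_ngrams("the dog and the cat")))
```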

More complicated lexical features require the detection of word and sentence boundaries in the text. By using common Natural Language Processing (NLP) tools, a text can be broken up into its constituent parts during a process called tokenization. A token can be a single word, phrase, symbol or other meaningful element. After counting the occurrences of each distinct token, the n most frequently occurring tokens can be used as features, since the tokens that occur most frequently are considered to contain the most useful discriminatory information.

Another set of features that can be derived from tokenization is the frequency of different word lengths. These features provide information on how often a particular author uses words of different lengths. Vocabulary richness measures are a subset of lexical features that are derived from word frequencies. Hapax legomena is the number of words that occur once in a text, whereas hapax dislegomena is the number of words that occur twice. The number of hapax legomena and hapax dislegomena gives an indication of how rich the vocabulary of a certain author is: authors that have a larger vocabulary will have a higher count of once- or twice-occurring words than authors with a small vocabulary. The type-token ratio V/N is the number of unique tokens V divided by the total number of tokens in a text N, and gives another indication of vocabulary richness. Numerous vocabulary richness measures have been created based on word frequencies, of which the most well known are:


• Yule's K [69]:

$$K = 10^4 \left[ -\frac{1}{N} + \sum_i V(i,N) \left( \frac{i}{N} \right)^2 \right] \quad (2.5)$$

where V(i,N) is the number of words occurring i times in the text of length N.

• Sichel's S [59]:

$$S = \frac{V(2,N)}{V(N)} \quad (2.6)$$

where V(N) is the vocabulary size and V(2,N) the number of twice-occurring words.

• Brunet's W [7]:

$$W = N^{V(N)^{-a}} \quad (2.7)$$

where N is the number of words and a is usually set to 0.172.

• Honoré's R [26]:

$$R = 100 \cdot \frac{\log N}{1 - \frac{V(1,N)}{V(N)}} \quad (2.8)$$

where V(1,N) is the number of hapax legomena.
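The four measures above can be computed directly from a token list; the following is a small sketch of equations 2.5-2.8, which assumes that not every word is a hapax legomenon (otherwise Honoré's R divides by zero):

```python
# Sketch of the vocabulary richness measures in equations 2.5-2.8.
import math
from collections import Counter

def richness_measures(tokens, a=0.172):
    N = len(tokens)                      # total number of tokens
    freqs = Counter(tokens)              # word -> occurrence count
    V = len(freqs)                       # vocabulary size V(N)
    spectrum = Counter(freqs.values())   # i -> V(i, N), words occurring i times

    yule_k = 1e4 * (-1.0 / N + sum(v * (i / N) ** 2
                                   for i, v in spectrum.items()))
    sichel_s = spectrum[2] / V           # twice-occurring words / vocabulary
    brunet_w = N ** (V ** -a)
    honore_r = 100 * math.log(N) / (1 - spectrum[1] / V)  # hapax-based
    return yule_k, sichel_s, brunet_w, honore_r

tokens = "the cat sat on the mat and the dog sat too".split()
print(richness_measures(tokens))
```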

Furthermore, smileys [64], abbreviations [62], slang words [36] and even spelling errors can be used as stylometric features to distinguish between authors. For example, Koppel [37] tries to simulate the way a human expert discriminates between authors by using stylistic idiosyncrasies, such as misspellings, to fingerprint a particular author. A disadvantage of that approach is that discriminating between authors is hard when there are few idiosyncrasies present in the text.

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing the syntactic constructions that an author uses. The underlying idea is that every author unconsciously uses more or less the same syntactical patterns in each text. Syntactic features rely on the accuracy and availability of NLP tools in different languages, in order to tokenize and apply part-of-speech (POS) tags.

A common method is to analyze the short sequences of POS-tags that occur most frequently throughout an author's work. For example, the sentence "The man walked in the park" can be represented as

| DETERMINER | NOUN | VERB PAST TENSE | PREPOSITION | DETERMINER | NOUN |


where token starts and ends are delimited by a |. The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS-tags can be found in Solorio and Pillay [62], who use POS-tag uni-grams, bi-grams and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactic parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64], or the distance covered by the dependency links in the parse tree.

Function words were introduced by Mosteller and Wallace [51] to distinguish between different authors. Function words are words that have almost no lexical meaning, but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change over short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong", the function words are "The", "of" and "the". They signify the grammatical structure of the sentence without containing any meaning; they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features.
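A sketch of how a text can be turned into a function-word feature vector; the short word list here is only illustrative (the appendix lists the full set used in this thesis):

```python
# Relative frequencies of function words as style features; the list below
# is a small illustrative subset, not the full list from the appendix.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "a", "for", "was", "very"]

def function_word_vector(text):
    tokens = text.lower().split()
    n = len(tokens) or 1
    # One relative frequency per function word, in a fixed order.
    return [tokens.count(w) / n for w in FUNCTION_WORDS]

print(function_word_vector("The leader of the team was very strong"))
```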

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structure. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks, and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al. [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks. However, they experienced difficulties with extracting these features. Another example of content-specific structural features is the use of HTML-tags by de Vel et al. [17]. They found that some email programs used HTML formatting for their emails, and included the frequency of different HTML-tags in their feature set.

Semantic features

The most complex set of features are the semantic features. These features require sophisticated NLP techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al. [64], who use UK/US spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different named entity types, such as date, location, money, number, ordinal, organization, percent, person and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word. Function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features. In such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection has to deal with an additional problem, in the sense that the final feature set might be over-fitting the training data. Therefore, the use of feature selection methods is ambiguous and can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author, and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

$$\text{Entropy} = -\sum_{x \in X} P(x) \log P(x) \quad (2.9)$$

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.
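As a concrete illustration, the information gain of a single binary feature can be sketched as follows (using equation 2.9); the labels and feature indicators below are hypothetical:

```python
# Information gain of one binary feature over labeled documents.
import math
from collections import Counter

def entropy(labels):
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_present):
    """Entropy before splitting minus the weighted entropy after splitting
    on whether the feature is present in each document."""
    n = len(labels)
    with_f = [l for l, p in zip(labels, feature_present) if p]
    without_f = [l for l, p in zip(labels, feature_present) if not p]
    after = (len(with_f) / n) * entropy(with_f) + \
            (len(without_f) / n) * entropy(without_f)
    return entropy(labels) - after

labels = ["alice", "alice", "bob", "bob"]
feature_present = [True, True, False, True]
print(information_gain(labels, feature_present))
```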

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "The metric measures the horizontal separation between two standard Normal curves, where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".
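Following Forman's description, Bi-Normal Separation can be sketched with SciPy's inverse Normal CDF; the clamping constant below is an assumption to keep the inverse CDF finite at rates of exactly 0 or 1:

```python
# Bi-Normal Separation of a feature's true and false positive rates.
from scipy.stats import norm

def bns(tpr, fpr, eps=1e-4):
    # Clamp rates away from 0 and 1, where the inverse CDF diverges.
    tpr = min(max(tpr, eps), 1 - eps)
    fpr = min(max(fpr, eps), 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))

print(bns(0.8, 0.1))
```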

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the data whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data. Hence, they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set, he still manages to assert with confidence that "The Royal Book of Oz" has been written by Ruth Plumly Thompson; Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.
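A brief sketch of this use of PCA with scikit-learn, where passing a fraction as n_components keeps enough components to explain that share of the variance; the random feature matrix is a placeholder:

```python
# PCA-based dimensionality reduction: keep enough principal components to
# explain 95% of the variance, as Tearle et al. do.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))   # hypothetical 50-dimensional feature set

pca = PCA(n_components=0.95)     # fraction -> keep 95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)           # fewer columns than 50, data permitting
```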

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction that is made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations $\vec{V}(s)$ and $\vec{V}(t)$, the cosine similarity is defined as

$$\text{Cosine}(s,t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)|\,|\vec{V}(t)|} \quad (2.10)$$

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10,000. In later research, Koppel et al. [40] report that 46% of 1,000 blog extracts are classified correctly using only the cosine similarity.
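Equation 2.10 translates directly into code; a small self-contained sketch:

```python
# Cosine similarity between two term-frequency vectors (equation 2.10).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1, 2, 0, 3], [2, 1, 1, 3]))
```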

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural and domain-specific features from the texts of candidate authors, and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1 [0, 0.25], A2 [0.25, 0.50], A3 [0.50, 0.75] and A4 [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.

Figure 2.1: The structure of a supervised authorship attribution system

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature, and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely to that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered the study that originated the field of authorship attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author.


His most important finding was that word length distributions tend to remain the same across different works for a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejected the claim of some literary scholars that Shakespeare and Bacon were the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers: a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naïve Bayes probabilistic model to the frequency of these function words and found that all documents were written by Madison. The Naïve Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features $x_1, \ldots, x_n$ and a set of authors A, where $A_i$ denotes an individual author, the probability that a given author $A_i$ is the real author of the original document can be expressed by

$$P(A_i \mid x_1, \ldots, x_n) = P(x_1, \ldots, x_n \mid A_i) P(A_i) \quad (2.11)$$

The real author is then calculated using

$$A^* = \arg\max_{A_i} P(A_i \mid x_1, \ldots, x_n) \quad (2.12)$$

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words from a text against a reference corpus (a large contemporary corpus from which the mean and standard deviation of these 30 words are computed), and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta-score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
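A hedged sketch of Delta over a handful of word frequencies; the reference-corpus statistics and frequency vectors below are invented for illustration:

```python
# Burrows's Delta: mean absolute difference between the z-scores of a
# known and an unknown text for the same set of frequent words.
import numpy as np

def delta(known_freqs, unknown_freqs, corpus_mean, corpus_std):
    z_known = (known_freqs - corpus_mean) / corpus_std
    z_unknown = (unknown_freqs - corpus_mean) / corpus_std
    return np.mean(np.abs(z_known - z_unknown))

# Hypothetical relative frequencies for, say, the 5 most frequent words.
corpus_mean = np.array([0.050, 0.030, 0.025, 0.020, 0.015])
corpus_std  = np.array([0.010, 0.008, 0.006, 0.005, 0.004])
candidate   = np.array([0.055, 0.028, 0.030, 0.018, 0.016])
disputed    = np.array([0.052, 0.029, 0.027, 0.019, 0.015])
print(delta(candidate, disputed, corpus_mean, corpus_std))
# The candidate author with the lowest Delta is predicted as the author.
```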

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task, according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier have been tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest potential margin of error. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in a very fast classification. An example of how SVM maximizes the margin of error of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin of error. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46]

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and can separate the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68]

Variations of binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-all using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training & validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved with as little as 100 words per text.
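A rough sketch of one unmasking run, using a linear SVM and synthetic data in place of real text chunks; the iteration and feature-removal counts are arbitrary choices, not the settings of Koppel et al.:

```python
# One unmasking run: repeatedly train a linear SVM, record its accuracy,
# and drop the k most strongly weighted features on each side.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unmasking_curve(X, y, iterations=10, k=3):
    X = X.copy()
    accuracies = []
    for _ in range(iterations):
        accuracies.append(cross_val_score(LinearSVC(), X, y, cv=5).mean())
        w = LinearSVC().fit(X, y).coef_[0]
        # Zero out the k most positive and k most negative weighted features.
        for idx in np.concatenate((np.argsort(w)[:k], np.argsort(w)[-k:])):
            X[:, idx] = 0
    return accuracies  # a fast drop suggests the same author

# Hypothetical feature matrix: rows are text chunks, columns are features.
rng = np.random.default_rng(0)
X = rng.random((40, 50))
y = np.array([0] * 20 + [1] * 20)
print(unmasking_curve(X, y))
```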

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely link analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let $v_i, v_j \in V$; then an edge $e_{v_i v_j} \in W$ if a message has been sent from author $v_i$ to author $v_j$. If there exists an edge $e_{v_i v_j} \in W$, then $v_i$ and $v_j$ are considered to be neighbors. The neighborhood $N(v_i)$ is the set of all neighbors of the vertex $v_i$. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. The co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

Co-citation(v_i, v_j) = |N(v_i) ∩ N(v_j)|    (2.13)

In graph theory this is often known as the shared neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases, because they are connected by three different Connected Triples (red, yellow and blue).
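As a minimal illustration of this definition, the following sketch counts the Connected Triples (shared neighbors) between two vertices in a toy graph represented as a dict of neighbor sets; the graph itself is an assumed example, not data from the thesis.

    def co_citation(graph, vi, vj):
        """Number of Connected Triples joining vi and vj (shared neighbors)."""
        return len(graph[vi] & graph[vj])

    graph = {
        "vi": {"a", "b", "c"},
        "vj": {"a", "b", "c", "d"},
        "a": {"vi", "vj"}, "b": {"vi", "vj"},
        "c": {"vi", "vj"}, "d": {"vj"},
    }
    print(co_citation(graph, "vi", "vj"))   # 3 triples, as in Figure 2.6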


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

Jaccard(v_i, v_j) = |N(v_i) ∩ N(v_j)| / |N(v_i) ∪ N(v_j)|    (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random in predicting where new links will form. However, it sometimes performs only on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
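Equation 2.14 translates directly into code; the following sketch operates on plain neighbor sets and is illustrative only.

    def jaccard(neighbors_i, neighbors_j):
        """Jaccard similarity of two neighborhoods (equation 2.14)."""
        union = neighbors_i | neighbors_j
        return len(neighbors_i & neighbors_j) / len(union) if union else 0.0

    print(jaccard({"a", "b", "c"}, {"a", "b", "c", "d"}))   # 3 / 4 = 0.75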

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively. An individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

SimRank(v_i, v_j) = C / (|I(v_i)| · |I(v_j)|) · Σ_{x=1}^{|I(v_i)|} Σ_{y=1}^{|I(v_j)|} SimRank(I_x(v_i), I_y(v_j))    (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iterating to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
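Equation 2.15 can be solved by fixed-point iteration, as the following sketch illustrates on a small assumed directed graph given as in-neighbor sets; C = 0.8 is a customary decay value, not one taken from the experiments above.

    def simrank(in_neighbors, C=0.8, iterations=10):
        """Fixed-point iteration of equation 2.15 on a directed graph."""
        nodes = list(in_neighbors)
        sim = {(a, b): float(a == b) for a in nodes for b in nodes}
        for _ in range(iterations):
            new = {}
            for a in nodes:
                for b in nodes:
                    if a == b:
                        new[(a, b)] = 1.0
                        continue
                    Ia, Ib = in_neighbors[a], in_neighbors[b]
                    if not Ia or not Ib:
                        new[(a, b)] = 0.0   # no in-links, no evidence
                        continue
                    total = sum(sim[(x, y)] for x in Ia for y in Ib)
                    new[(a, b)] = C * total / (len(Ia) * len(Ib))
            sim = new
        return sim

    g = {"u": set(), "v": set(), "a": {"u"}, "b": {"u", "v"}}
    print(simrank(g)[("a", "b")])   # 0.4: similarity induced by shared source u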

PageSim [42] is another extension of the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors of the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the more strongly it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using:

ConnectedPath(v_i, v_j) = Σ_{p ∈ PATH(v_i, v_j, r)} U(p) / length(p)    (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

U(p) = Σ_{v_x ∈ path(v_i, v_j), v_x ∉ {v_i, v_j}} UQ(v_x)    (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = (|w_{x,x−1}| + |w_{x,x+1}|) / Σ_{∀ v_g ∈ V} |w_{x,g}|    (2.18)

where w_{x,g} denotes an edge between v_x ∈ path(v_i, v_j) and any other vertex v_g ∈ V, and w_{x,x+1} and w_{x,x−1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path with Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
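The following sketch implements equations 2.16 to 2.18 on a small weighted graph, given as a dict mapping each vertex to a dict of neighbor → edge weight; the graph, the edge weights and the depth bound r are illustrative assumptions.

    def find_paths(w, src, dst, r):
        """All simple paths from src to dst using at most r edges."""
        out, stack = [], [[src]]
        while stack:
            path = stack.pop()
            node = path[-1]
            if node == dst and len(path) > 1:
                out.append(path)
                continue
            if len(path) - 1 >= r:      # r edges already used
                continue
            for nxt in w[node]:
                if nxt not in path:
                    stack.append(path + [nxt])
        return out

    def connected_path(w, vi, vj, r=3):
        """Connected Path similarity (equations 2.16-2.18)."""
        score = 0.0
        for p in find_paths(w, vi, vj, r):
            # UQ of every intermediate vertex (equation 2.18)
            uq = [(w[p[i]][p[i - 1]] + w[p[i]][p[i + 1]]) / sum(w[p[i]].values())
                  for i in range(1, len(p) - 1)]
            score += sum(uq) / (len(p) - 1)     # U(p) / length(p)
        return score

    w = {"vi": {"a": 2}, "vj": {"a": 1},
         "a": {"vi": 2, "vj": 1, "x": 5}, "x": {"a": 5}}
    print(connected_path(w, "vi", "vj"))        # one path vi-a-vj -> 0.1875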


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results of different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = α·s_i + β·s_j + γ·s_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j and k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β and γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance with semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in Section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.
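A sketch of such a linear combination; the weights are arbitrary illustrative values, not ones reported in the literature.

    def combine(s_string, s_content, s_link, alpha=0.4, beta=0.4, gamma=0.2):
        """Linear combination f(x) = alpha*si + beta*sj + gamma*sk."""
        return alpha * s_string + beta * s_content + gamma * s_link

    print(combine(0.95, 0.70, 0.30))    # 0.72, to be compared with a threshold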

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve on the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilize information from more than one domain, as has been done in this thesis. Logistic regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous cases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision; the remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.
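Such a sifting cascade could be sketched as follows; the thresholds and the two stage functions are hypothetical placeholders, since no published instance of this design exists.

    def sift(pair, cheap_score, expensive_score, hi=0.95, lo=0.40):
        """Cheap technique first; the expensive one only for the middle band."""
        s = cheap_score(pair)           # e.g. a string metric
        if s >= hi:
            return True                 # obvious alias, stop early
        if s <= lo:
            return False                # obvious non-alias, stop early
        return expensive_score(pair) >= 0.5   # e.g. a neural network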


                 correct alias           false alias
retrieved        true positives (tp)     false positives (fp)
not retrieved    false negatives (fn)    true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table such as the one shown in Table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy, defined as the percentage of classifications that are correct:

Accuracy = correct classifications / total number of classifications = (tp + tn) / (tp + fp + fn + tn)    (2.19)

Although it looks like a good measure of performance, it is not hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all examples as negative. In most cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as:

P = |retrieved aliases ∩ correct aliases| / |retrieved aliases| = tp / (tp + fp)    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as:

R = |retrieved aliases ∩ correct aliases| / |total correct aliases| = tp / (tp + fn)    (2.21)

These two measures are not as dependent on the class distribution as the accuracy measure. Therefore, they are a more sensible choice in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user who wants to automate the complete process, and who must be able to rely greatly on the classification given by the system, will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as:

F = 1 / (α · (1/P) + (1 − α) · (1/R))    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as:

F1 = (2 · precision · recall) / (precision + recall)    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable because it will be closer to the minimum of precision and recall than the arithmetic mean when the two values differ greatly [46].

Averaging the precision and recall scores of different test runs can be done in two ways. One is micro-averaging, where a contingency table for all the problems together is constructed and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46] and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.
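The difference between the two averaging schemes can be made concrete with a small sketch; the per-problem (tp, fp) counts are invented for illustration.

    def micro_precision(counts):
        tp = sum(t for t, f in counts)
        fp = sum(f for t, f in counts)
        return tp / (tp + fp)

    def macro_precision(counts):
        return sum(t / (t + f) for t, f in counts) / len(counts)

    counts = [(90, 10), (1, 4)]          # (tp, fp): one large, one small problem
    print(micro_precision(counts))       # ~0.867, dominated by the large class
    print(macro_precision(counts))       # 0.55, equal weight per class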

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques, and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that, when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of overfitting, and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in Section 3.2, whereas the different combinations of techniques are dealt with in Section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large, real-life email data set that is available for research. The information that is recorded for a single email message can be seen in Figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


SSN requirement

From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set, and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer when organizing and storing messages into folders such as "all documents". Empty messages, system messages, and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "noaddress@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step    Records affected    Percentage removed (cum.)
1       17052               6.70
3       13681               12.00
4       26223               22.50
5       4001                24.00
6       25990               34.00
7       3700                35.80
8       52163               56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number of records that have been removed per step, as well as the cumulative percentage.

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5000 words per author. In the final data set, the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44912 emails by 246 different authors. For each message, the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author; the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors.


Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels (linear and RBF) for the authorship SVM.

Figure 3.4 provides an overview of the total number of words per author; the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10000 and 100000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it corresponds to the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set, and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


Figure 3.3: The distribution of email messages per author.

Figure 3.4: The distribution of the total number of words per author.


Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No alias                             193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed    Hard
High Jaro-Winkler    6        2
Low Jaro-Winkler     8        16
No alias             6        2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. johndoe@enron.comA & johndoe@enron.comB);

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah);

• Authors without an alias.

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
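A sketch of this first technique, assuming the third-party jellyfish library for the Jaro-Winkler metric; the email addresses and the threshold below are illustrative only.

    from jellyfish import jaro_winkler_similarity

    def jw_aliases(address, candidates, threshold=0.94):
        """Candidates whose Jaro-Winkler similarity to address exceeds the threshold."""
        return [c for c in candidates
                if jaro_winkler_similarity(address, c) >= threshold]

    print(jw_aliases("john.doe@enron.com",
                     ["john.doe2@enron.com", "jane.roe@enron.com"]))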

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores are in the range [0, 1], the score for a particular author-author pair was calculated as follows:

ConnectedPath(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in Section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.

The last individual technique that has been evaluated is the use of an SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences between individual email messages. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features, in order to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4; a sketch of how a few of these features can be computed is given below. The list of function words that has been used in the feature set can be found in the appendix.
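The following sketch computes a handful of the lexical features of Table 3.4 for a single email body; the full 492-dimensional vector is built analogously. It is an illustration in Python, not the implementation used in the thesis.

    import re
    from collections import Counter

    def lexical_features(text):
        chars = len(text)
        words = re.findall(r"[A-Za-z']+", text)
        freq = Counter(w.lower() for w in words)
        return {
            "total_chars": chars,                                       # feature 1
            "upper_ratio": sum(c.isupper() for c in text) / chars,      # feature 3
            "total_words": len(words),                                  # feature 54
            "avg_word_length": sum(map(len, words)) / len(words),       # feature 57
            "hapax_legomena": sum(1 for n in freq.values() if n == 1),  # feature 61
        }

    print(lexical_features("Thank you very much. We will give it a try."))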

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^−5, 2^−3, ..., 2^15 and γ = 2^−15, 2^−13, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM.


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters ~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |
54          Total number of words (M)
55          Total number of short words / M (less than four characters)
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of the 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation marks , . ? ! : ; ' "
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.


The highest-scoring combination of parameters is then chosen to train the actual SVM model.
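The grid search described above can be sketched with scikit-learn in place of the thesis's own implementation; the grid follows the exponential sequences given earlier, while the plain 5-fold cross-validation (rather than 5 × 5-fold) and the names X and y are simplifying assumptions.

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "C": [2.0 ** e for e in range(-5, 16, 2)],       # 2^-5, 2^-3, ..., 2^15
        "gamma": [2.0 ** e for e in range(-15, 4, 2)],   # 2^-15, ..., 2^3
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5,
                          scoring="accuracy")
    # search.fit(X, y) evaluates every (C, gamma) pair by cross-validation;
    # search.best_params_ is then used to train the final model.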

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM classifies whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all of the author's emails are selected as positive examples, and an equal amount of emails from other authors is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19], and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.
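A sketch of the one-versus-all scheme with balanced classes, using scikit-learn instead of SVM.NET; emails_by_author (a dict mapping each author to a list of feature vectors) and the C and γ defaults are assumptions for illustration.

    import random
    from sklearn.svm import SVC

    def train_one_vs_all(emails_by_author, C=1.0, gamma=0.01, seed=0):
        """One balanced RBF-SVM per author (positives: own mail, negatives: sampled)."""
        rng = random.Random(seed)
        models = {}
        for author, own in emails_by_author.items():
            others = [e for a, es in emails_by_author.items()
                      if a != author for e in es]
            negatives = rng.sample(others, min(len(own), len(others)))
            X = own + negatives
            y = [1] * len(own) + [0] * len(negatives)
            models[author] = SVC(kernel="rbf", C=C, gamma=gamma,
                                 probability=True).fit(X, y)
        return models   # models[a].predict_proba(v) scores authorship by a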

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3), and authorship SVM on email content.

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network, and authorship SVM on email content.

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction of whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

Figure 3.6: The structure of the combined approach.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used for testing are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
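A sketch of the voting SVM; the three-dimensional training vectors and labels below are invented stand-ins for the manually labeled pairs described above.

    from sklearn.svm import SVC

    # Each row: [Jaro-Winkler score, link-analysis score, authorship-SVM score]
    X_train = [[0.97, 0.60, 0.85],      # labeled alias pair
               [0.41, 0.05, 0.30],      # labeled non-alias pair
               [0.88, 0.40, 0.75],
               [0.35, 0.10, 0.20]]
    y_train = [1, 0, 1, 0]

    voting_svm = SVC(kernel="rbf").fit(X_train, y_train)
    print(voting_svm.predict([[0.90, 0.55, 0.80]]))     # 1 = predicted alias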

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets of Table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments are discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set are given. Second, the results of the individual and combined techniques on the hard test set are given. Finally, an overview of the best results achieved by each individual and combined technique is given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set by the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the decision thresholds 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its best F1-score of 0.80 at the decision thresholds 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.80 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 at a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88, using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set by the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the decision thresholds 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89, using a threshold of 0.92.

In addition to these graphs, the best F1-scores of all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the mixed test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the mixed test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the hard test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the hard test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques can perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler. The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with a low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path. It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond direct neighbors, since it takes more complicated link connections into account. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2; it is expected that the same behavior of Connected Path could be observed on this data set if the search were performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard. Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes direct neighbors into account, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM. The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques. The combination JW-CP-SVM did not perform very well: on the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results on both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often constructed manually, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content, and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for this research. Should such a collection not exist, it would be worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results of the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the way through: Testing for authorship in different frequency strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A comparison of personal name matching: Techniques and practical issues. In Sixth IEEE International Conference on Data Mining – Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which is the best multiclass SVM method? An empirical study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval, Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp

[22] Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), pages 611–es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias detection in link data sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto meta-classifier for authorship identification. In Notebook for PAN at CLEF 2011.

[37] Koppel M (2003) Exploiting stylistic idiosyncrasies for authorship attri-bution In Proceedings of the 2003 International Joint Conferences on Artifi-cial Intelligence Workshop on Computational Approaches to Style Analysisand Synthesis number 2000 pages 69ndash72 Acapulco Mexico

[38] Koppel M and Akiva N (2003) A corpus-independent feature set forstyle-based text categorization In Proceedings of IJCAIrsquo03 Workshop onComputational Approaches to Style Analysis and Synthesis Acapulco Mex-ico

[39] Koppel M Gan R and Messeri E (2006) Authorship Attribution withThousands of Candidate Authors In Proceedings of the 29th annual interna-tional ACM SIGIR conference on Research and Development in InformationRetrieval pages 659ndash660 Seattle Washington USA ACM

[40] Koppel M Schler J and Argamon S (2010) Authorship attribution inthe wild Language Resources and Evaluation 45(1)83ndash94

[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship attribution and verification with many authors and limited data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING '08), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The characteristic curves of composition. Science, 9(214S):237–246.

[48] Mendenhall, T. C. (1901). A mechanical solution of a literary problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A lexical database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron email dataset: Database schema and brief statistical report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word frequency distributions and type-token characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An empirical study of category skew on feature selection for text categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship identification with modality specific meta features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The state of record linkage and current research problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and scalable authorship attribution using function words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.


Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



• Yule's K [69]:

K = 10^4 \cdot \left[ -\frac{1}{N} + \sum_i V(i, N) \left( \frac{i}{N} \right)^2 \right]    (2.5)

where V(i, N) is the number of words occurring i times in the text of length N.

• Sichel's S [59]:

S = \frac{V(2, N)}{V(N)}    (2.6)

where V(N) is the vocabulary size and V(2, N) the number of twice-occurring words.

• Brunet's W [7]:

W = N^{V(N)^{-a}}    (2.7)

where N is the number of words and a is usually set to 0.172.

• Honoré's R [26]:

R = \frac{100 \cdot \log N}{1 - V(1, N)/V(N)}    (2.8)
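All four measures can be computed directly from a word-frequency spectrum. The following is a minimal sketch in Python; the regular-expression tokenizer is an assumption, and degenerate cases (an empty text, or a text consisting entirely of hapax legomena, which makes Honoré's R undefined) are not handled:

import math
import re
from collections import Counter

def vocabulary_richness(text, a=0.172):
    # Compute Yule's K, Sichel's S, Brunet's W and Honore's R
    # (equations 2.5-2.8) from a plain-text string.
    words = re.findall(r"[a-z']+", text.lower())   # naive tokenizer (assumption)
    N = len(words)                                 # text length in tokens
    freqs = Counter(words)
    V = len(freqs)                                 # vocabulary size V(N)
    spectrum = Counter(freqs.values())             # V(i, N): words occurring i times
    yule_k = 1e4 * (-1.0 / N + sum(v * (i / N) ** 2 for i, v in spectrum.items()))
    sichel_s = spectrum[2] / V                     # proportion of twice-occurring words
    brunet_w = N ** (V ** -a)
    honore_r = 100.0 * math.log(N) / (1.0 - spectrum[1] / V)
    return {"K": yule_k, "S": sichel_s, "W": brunet_w, "R": honore_r}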

Furthermore, smileys [64], abbreviations [62], slang words [36], and even spelling errors can be used as stylometric features to distinguish between authors. For example, Koppel [37] tries to simulate the way a human expert discriminates between authors by using stylistic idiosyncrasies, such as misspellings, to fingerprint a particular author. A disadvantage of that approach is that discriminating between authors is hard when few idiosyncrasies are present in the text.

Syntactic features

Syntactic features capture authorial style at the sentence level by analyzing the syntactic constructions that an author uses. The underlying idea is that every author unconsciously uses more or less the same syntactical patterns in each text. Syntactic features rely on the accuracy and availability of NLP-tools in different languages in order to tokenize and apply part-of-speech (POS) tags.

A common method is to analyze the short sequences of POS-tags that occur most frequently throughout an author's work. For example, the sentence "The man walked in the park" can be represented as

| DETERMINER | NOUN | VERB PAST TENSE | PREPOSITION | DETERMINER | NOUN |


where token starts and ends are delimited by a "|". The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS-tags can be found in Solorio and Pillay [62], who use POS-tag uni-grams, bi-grams and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactical parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64], or the distance covered by the dependency links in the parse tree.

Function words were introduced by Mosteller and Wallace [51] to distinguish between different authors. Function words are words that have almost no lexical meaning, but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change in short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong", the function words are "The", "of" and "the". They signify the grammatical structure of the sentence without containing any meaning: they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features.

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structures. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks, and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al. [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks; however, they experienced difficulties with extracting these features. Another example of content-specific structural features is the use of HTML-tags by de Vel et al. [17], who found that some email programs used HTML formatting for their emails, and included the frequency of different HTML-tags in their feature set.

Semantic features

The most complex set of features is the set of semantic features. These features require sophisticated NLP-techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP-techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al. [64], who use UK/US-spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity-types, such as date, location, money, number, ordinal, organization, percent, person and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word; function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features; in such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection has to deal with an additional problem, in the sense that the final feature set might be over-fitting the training data. The merit of feature selection is therefore debatable, and it can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author, and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

Entropy = -\sum_{x \in X} P(x) \log P(x)    (2.9)

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.
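As a concrete illustration, the following minimal sketch computes the entropy of a class distribution and the Information Gain obtained by splitting documents on one discrete feature; the function names and list-based representation are assumptions for illustration only:

import math
from collections import Counter

def entropy(labels):
    # Entropy of a class distribution (equation 2.9).
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    # Reduction in entropy obtained by splitting the documents on one
    # discrete feature; feature_values[i] is the value for document i.
    n = len(labels)
    split = {}
    for label, value in zip(labels, feature_values):
        split.setdefault(value, []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

# A feature that separates the classes perfectly yields maximal gain:
# information_gain(["a", "a", "b", "b"], [1, 1, 0, 0]) -> 1.0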

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "The metric measures the horizontal separation between two standard Normal curves, where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the feature set, whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data; hence, they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set: he uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set, he still manages to assert with confidence that "The Royal Book of Oz" has been written by Ruth Plumly Thompson. Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.
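A sketch of the 95%-variance criterion of Tearle et al. [65], expressed here with scikit-learn (the library choice and the dummy data are assumptions; passing a float as n_components keeps just enough components to explain that fraction of the variance):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 50)        # 200 documents, 50 stylometric features (dummy data)
pca = PCA(n_components=0.95)       # keep components explaining 95% of the variance
X_reduced = pca.fit_transform(X)   # usually far fewer than 50 columns
print(X_reduced.shape, pca.explained_variance_ratio_.sum())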

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction that is made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author-profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations \vec{V}(s) and \vec{V}(t), the cosine similarity is defined as

Cosine(s, t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)| \, |\vec{V}(t)|}    (2.10)
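Equation 2.10 translates directly into code. The sketch below also shows the most-similar-author ranking used by Koppel et al. [39]; the dictionary of author profiles and the toy vectors are illustrative assumptions:

import numpy as np

def cosine(v_s, v_t):
    # Cosine similarity of two document vectors (equation 2.10).
    v_s, v_t = np.asarray(v_s, float), np.asarray(v_t, float)
    return float(v_s @ v_t / (np.linalg.norm(v_s) * np.linalg.norm(v_t)))

def most_similar_author(anon_vec, author_profiles):
    # Rank candidate authors by cosine similarity to the anonymous
    # document and return the best match.
    return max(author_profiles, key=lambda a: cosine(anon_vec, author_profiles[a]))

profiles = {"alice": [0.9, 0.1, 0.3], "bob": [0.1, 0.8, 0.5]}  # toy tf-idf profiles
print(most_similar_author([0.8, 0.2, 0.4], profiles))          # -> "alice"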

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10,000. In later research, Koppel et al. [40] report that 46% of 1,000 blog extracts are classified correctly using only the cosine similarity.

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style, using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural and domain-specific features from the texts of candidate authors, and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1 [0, 0.25], A2 [0.25, 0.50], A3 [0.50, 0.75] and A4 [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns, and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint, to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.

Figure 2.1: The structure of a supervised authorship attribution system.
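Returning to the interval discretization described above, a minimal sketch (the interval boundaries and the first-match tie-breaking at boundaries are assumptions):

def discretize(value, intervals):
    # One-hot encode a feature value into the interval that contains it,
    # e.g. 0.6 with four equal-width intervals on [0, 1] -> (0, 0, 1, 0).
    vec = [0] * len(intervals)
    for k, (lo, hi) in enumerate(intervals):
        if lo <= value <= hi:
            vec[k] = 1
            break
    return tuple(vec)

quarters = [(0.0, 0.25), (0.25, 0.50), (0.50, 0.75), (0.75, 1.0)]
assert discretize(0.6, quarters) == (0, 0, 1, 0)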

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in Figure 2.1: first, a set of training texts with known authorship is converted to a set of feature vectors; based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature, and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely to that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered as the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author. His most important finding was that word length distributions tend to remain the same across different works for a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejected the claim of some literary scholars that Shakespeare and Bacon were the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers: a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naïve Bayes probabilistic model to the frequency of these function words, and found that all documents were written by Madison. The Naïve Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, \ldots, x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the original document can be expressed by

P(A_i \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid A_i) \, P(A_i)    (2.11)

The real author is then calculated using

A^* = \arg\max_{A_i} P(A_i \mid x_1, \ldots, x_n)    (2.12)

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words of a text against a reference corpus (a large contemporary corpus of which the mean and standard deviation of these 30 words are computed), and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta-score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1,500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets, using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3: at each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in Figure 2.2; by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task, according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier have been tested, using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest possible margin. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in very fast classification. An example of how SVM maximizes the margin of the hyperplane that separates two linearly separable classes can be seen in Figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and on what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the Linear kernel, which assumes that the training data is linearly separable and separates the training data using hyperplanes. A Polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel, and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF-kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF-kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations to binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-one using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines; for example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.
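To illustrate, the following sketch trains a one-vs-rest linear SVM on function-word frequencies using scikit-learn; the library, the toy data, and the truncated function-word list (the full list is given in the appendix) are assumptions rather than the exact setup of these studies:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# A few entries from the function-word list in the appendix (assumption:
# in practice the full list would be used).
FUNCTION_WORDS = ["a", "about", "and", "of", "the", "to", "we", "you"]

# Toy training data; in the thesis each author has at least 80 emails.
emails = ["thank you very much we will give it a try",
          "please send me a copy of the report",
          "the meeting is moved to friday",
          "we should talk about the draft today"]
authors = ["alice", "alice", "bob", "bob"]

vectorizer = TfidfVectorizer(vocabulary=FUNCTION_WORDS)
X = vectorizer.fit_transform(emails)

clf = LinearSVC()      # multiclass input is handled one-vs-rest internally
clf.fit(X, authors)
print(clf.predict(vectorizer.transform(["we give you the draft"])))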

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain, and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes, and 1 output node can be seen in Figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training & validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of the features that is may vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations; in other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.
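A hedged sketch of the core Unmasking loop, assuming a binary labeled feature matrix X (chunks by the known author versus chunks of the anonymous text) and using scikit-learn; the parameter values are illustrative:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def unmasking_curve(X, y, iterations=10, k=3):
    # Iteratively drop the k strongest positive and k strongest negative
    # features of a linear SVM and record the cross-validation accuracy.
    # A fast drop suggests the two text sets share an author.
    X = np.asarray(X, dtype=float)
    active = np.arange(X.shape[1])        # indices of features still in play
    curve = []
    for _ in range(iterations):
        curve.append(cross_val_score(LinearSVC(), X[:, active], y, cv=5).mean())
        clf = LinearSVC().fit(X[:, active], y)
        w = clf.coef_[0]
        drop = np.r_[np.argsort(w)[:k], np.argsort(w)[-k:]]
        active = np.delete(active, drop)
    return curve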

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j \in V; then there is an edge e_{v_i v_j} \in W if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i v_j} \in W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, is when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

Co\text{-}citation(v_i, v_j) = |N(v_i) \cap N(v_j)|    (2.13)

In Graph Theory, this is often known as the shared neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases, because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

Jaccard(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}    (2.14)
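Equation 2.14 amounts to a few lines of code over neighborhood sets (a minimal sketch; the empty-union guard is an assumption):

def jaccard(neighbors_i, neighbors_j):
    # Jaccard similarity of two neighborhoods (equation 2.14).
    union = neighbors_i | neighbors_j
    return len(neighbors_i & neighbors_j) / len(union) if union else 0.0

# Example: N(v_i) = {a, b, c} and N(v_j) = {b, c, d} gives 2/4 = 0.5.
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))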

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random guessing in predicting where new links will form. However, it sometimes performs only on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.

SimRank [33] is an iterative extension to co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively; an individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

SimRank(v_i, v_j) = \frac{C}{|I(v_i)| \, |I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} SimRank(I_x(v_i), I_y(v_j))    (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
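A minimal fixed-point iteration for equation 2.15, assuming the graph is given as a mapping from each vertex to its set of in-going neighbors; the decay constant and iteration count are illustrative:

def simrank(in_neighbors, C=0.8, iterations=10):
    # Naive O(n^2) SimRank over all vertex pairs (equation 2.15).
    nodes = list(in_neighbors)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                elif in_neighbors[u] and in_neighbors[v]:
                    total = sum(sim[(a, b)]
                                for a in in_neighbors[u]
                                for b in in_neighbors[v])
                    new[(u, v)] = C * total / (len(in_neighbors[u]) * len(in_neighbors[v]))
                else:
                    new[(u, v)] = 0.0
        sim = new
    return sim

# Vertices "b" and "c" share their only in-neighbor, so their score is C:
graph = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(round(simrank(graph)[("b", "c")], 3))   # -> 0.8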

PageSim [42] is another extension to the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets; on one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

ConnectedPath(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{length(p)}    (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length r, and U(p) is the uniqueness of a particular path p \in PATH, which is calculated as follows:

U(p) = \sum_{v_x \in path(v_i, v_j), \; v_x \notin \{v_i, v_j\}} UQ(v_x)    (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|}    (2.18)

where w_{x,g} denotes an edge between v_x \in path(v_i, v_j) and any other vertex v_g \in V, and w_{x,x+1} and w_{x,x-1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths; the figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
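A sketch of Connected Path for an unweighted, undirected network, where every edge weight |w| is 1, so that UQ(v_x) reduces to 2/degree(v_x); the depth-first path enumeration is exponential in r, and is only meant to illustrate equations 2.16-2.18, not to be efficient:

def connected_path(adj, vi, vj, r=3):
    # adj[v] is the set of neighbors of vertex v.
    score = 0.0
    stack = [(vi, [vi])]                  # depth-first enumeration of simple paths
    while stack:
        node, path = stack.pop()
        if len(path) - 1 > r:             # path length measured in edges
            continue
        if node == vj and len(path) > 1:
            inner = path[1:-1]            # intermediate vertices only
            u = sum(2.0 / len(adj[vx]) for vx in inner)   # U(p), equation 2.17
            score += u / (len(path) - 1)  # U(p) / length(p), equation 2.16
            continue
        for nxt in adj[node]:
            if nxt not in path:           # keep paths simple
                stack.append((nxt, path + [nxt]))
    return score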


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = \alpha s_i + \beta s_j + \gamma s_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j, k respectively, each normalized such that they fall in the range [0, 1]. The weights \alpha, \beta, \gamma determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics: they use the Connected Path algorithm that is described in section 2.3 to generate normalized pairwise similarity scores for all combinations of authors; after selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content; a voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of such cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.
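In the spirit of Hsiung et al. [28], the sketch below combines two per-pair similarity scores with scikit-learn's LogisticRegression; the toy score matrix and labels are assumptions, not their data:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each candidate pair is described by the scores the individual techniques
# assign to it (here: one string metric and one link metric, both in [0, 1]);
# y says whether the pair is a known alias.
X = np.array([[0.95, 0.80], [0.20, 0.10], [0.90, 0.15], [0.30, 0.85]])
y = np.array([1, 0, 0, 0])

combiner = LogisticRegression().fit(X, y)
print(combiner.predict_proba([[0.85, 0.70]])[:, 1])   # P(alias) for a new pair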

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to detect the most obvious aliases in the data set with high precision; the remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias            false alias
retrieved        true positives (tp)      false positives (fp)
not retrieved    false negatives (fn)     true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table, such as the one that can be seen in Table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = \frac{\text{correct classifications}}{\text{total number of classifications}} = \frac{tp + tn}{tp + fp + fn + tn}    (2.19)

Although it looks like a good measure of performance, it is not that hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives: in most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as

P = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{retrieved aliases}|} = \frac{tp}{tp + fp}    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

R = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{total correct aliases}|} = \frac{tp}{tp + fn}    (2.21)

These two measures are not as dependent on the class distributions as the accuracy measure; therefore, they are a more sensible choice in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway; on the other hand, a user who wants to automate the complete process, and to be able to rely greatly on the classification given by the system, will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's needs, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}    (2.22)

Often, the importance of precision and recall is balanced by choosing \alpha = 0.5. This results in the so-called F1-measure, which can now simply be written as

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive: since all the correct aliases are retrieved, the recall will be 100%, and the arithmetic mean will be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed, and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46], and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis, in order to get a good view of the effectiveness on the smaller classes.
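The difference between the two averaging schemes is easy to state in code. A minimal sketch, assuming one (tp, fp, fn) contingency triple per test problem:

def prf(tp, fp, fn):
    # Precision, recall and F1 (equations 2.20, 2.21 and 2.23).
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_micro(tables):
    # tables: list of (tp, fp, fn) counts, one per problem.
    per_problem = [prf(*t) for t in tables]
    macro = tuple(sum(m[i] for m in per_problem) / len(per_problem) for i in range(3))
    micro = prf(*(sum(t[i] for t in tables) for i in range(3)))
    return macro, micro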

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques, and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen (2006) found that, when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best; however, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the ones most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of over-fitting, and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research, but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure, in order to aid the comparison of the different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. This chapter starts with an introduction of the corpus that has been used, and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented will be discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in Figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message, it can be assumed that the sender of the email has written it, except for the forward and reply-parts; concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set, and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed; the same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages, and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "noaddress@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


2. Forward and reply-parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply-parts in step 2 were removed.

4. Messages that contained <= 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words <= 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step:

Step    Records affected    Percentage removed (cum.)
1       17,052              6.70%
3       13,681              12.00%
4       26,223              22.50%
5       4,001               24.00%
6       25,990              34.00%
7       3,700               35.80%
8       52,163              56.50%

Table 3.1: Preprocessing steps applied to the ENRON corpus.

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances that are needed for classification using Support Vector Machines. The results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails <= 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set, the average number of words per email equals 209, and with at least 80 emails per author, it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44,912 emails by 246 different senders. For each message, the sender, receiver, subject, body and send-date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides an overview of the total number of words per author: there, the x-axis represents the total number of words that one author has written, whereas the y-axis again represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.


Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels for the authorship SVM. [Plot: 10-fold cross-validation accuracy (y-axis, 0.5-1.0) against the number of training instances per class (x-axis, 20-200), for the linear and RBF kernels.]


Next to these statistical measures, a network graph has been created of all the authors in the final data set; Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e., the number of in-going and out-going links; it corresponds to the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of ≥ 200 emails were selected from the data set, and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


Figure 3.3: The distribution of email messages per author. [Histogram: number of emails (x-axis, 90-230) against number of authors (y-axis, 0-35).]

Figure 3.4: The distribution of the total number of words per author. [Histogram with logarithmic x-axis: total number of words (10^4-10^8) against number of authors (y-axis, 0-180).]

Figure 3.5: A network graph of the authors in the subset of the ENRON data set. [Graph rendering omitted; nodes are labeled with truncated email user names, and node color represents the node's degree.]

Type of Alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No alias                             193

Table 3.2: Artificial aliases in the ENRON data set, by type

Test set             Mixed    Hard
High Jaro-Winkler    6        2
Low Jaro-Winkler     8        16
No alias             6        2

Table 3.3: Distribution of alias types in the two different test sets

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g., john.doe@enron.comA & john.doe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g., bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
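As an illustration, the sketch below implements this thresholding step. The Jaro-Winkler function follows the standard definition from the literature review (prefix scaling factor p = 0.1, common prefix capped at four characters); the addresses and the threshold of 0.94 are hypothetical examples, not values taken from the experiments.

    from itertools import combinations

    def jaro(s1, s2):
        # Jaro similarity: matching characters within a sliding window,
        # discounted by the number of transpositions among them.
        if s1 == s2:
            return 1.0
        len1, len2 = len(s1), len(s2)
        if len1 == 0 or len2 == 0:
            return 0.0
        window = max(max(len1, len2) // 2 - 1, 0)
        match1, match2 = [False] * len1, [False] * len2
        matches = 0
        for i, c in enumerate(s1):
            for j in range(max(0, i - window), min(i + window + 1, len2)):
                if not match2[j] and s2[j] == c:
                    match1[i] = match2[j] = True
                    matches += 1
                    break
        if matches == 0:
            return 0.0
        # count transpositions between the matched characters
        transpositions, k = 0, 0
        for i in range(len1):
            if match1[i]:
                while not match2[k]:
                    k += 1
                if s1[i] != s2[k]:
                    transpositions += 1
                k += 1
        t = transpositions // 2
        return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

    def jaro_winkler(s1, s2, p=0.1):
        # boost the Jaro score for strings sharing a common prefix (max 4 chars)
        j = jaro(s1, s2)
        prefix = 0
        for a, b in zip(s1[:4], s2[:4]):
            if a != b:
                break
            prefix += 1
        return j + prefix * p * (1 - j)

    # hypothetical addresses and decision threshold
    addresses = ["john.doe@enron.com", "john.doe2@enron.com", "jane.roe@enron.com"]
    alias_pairs = [(a, b) for a, b in combinations(addresses, 2)
                   if jaro_winkler(a, b) >= 0.94]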

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was normalized as follows:

    ConnectedPath_norm(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max        (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is the Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in Section 3.1 no longer occur in the neighborhood of their correspondents, and hence do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
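A minimal sketch of this computation, assuming the link network is given as a set of (sender, receiver) pairs (the function names are illustrative):

    from collections import defaultdict

    def build_neighbors(messages):
        # messages: iterable of (sender, receiver) pairs from the data set
        neighbors = defaultdict(set)
        for sender, receiver in messages:
            neighbors[sender].add(receiver)
            neighbors[receiver].add(sender)
        return neighbors

    def jaccard(neigh_a, neigh_b):
        # |intersection| / |union| of the two neighbor sets
        union = neigh_a | neigh_b
        return len(neigh_a & neigh_b) / len(union) if union else 0.0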

The last individual technique that has been evaluated is the use of an SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences between individual email messages. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic, and structural features has been adapted from [71] and extended with a number of additional features, to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4; the list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function (RBF) kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest scoring combination of parameters is then chosen to train the actual SVM model.
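The thesis performs this grid search with SVM.NET; the sketch below reproduces the same search with scikit-learn purely as an illustration, using a single 5-fold cross-validation instead of the repeated 5 × 5-fold procedure, and placeholder data standing in for the real feature vectors of Table 3.4:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # exponentially growing sequences, as described above
    param_grid = {
        "C": [2.0 ** e for e in range(-5, 16, 2)],      # 2^-5, 2^-3, ..., 2^15
        "gamma": [2.0 ** e for e in range(-15, 4, 2)],  # 2^-15, ..., 2^3
    }

    X = np.random.rand(40, 492)        # placeholder feature vectors
    y = np.array([0] * 20 + [1] * 20)  # placeholder binary author labels

    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    model = search.best_estimator_     # trained with the best (C, gamma) pair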

Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters (~ $ ^ & - _ = + > < [ ] |, among others)
54          Total number of words (M)
55          Total number of short words (less than four characters) / M
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of the 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation marks (including ' and ")
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM
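To make the feature set more concrete, the sketch below computes a handful of the lexical features from Table 3.4 using a deliberately crude tokenization; it is an illustration, not the extraction code used in the thesis:

    from collections import Counter

    def lexical_features(text):
        chars = len(text)            # feature 1: total number of characters (C)
        words = text.split()         # crude whitespace tokenization
        m = len(words)               # feature 54: total number of words (M)
        counts = Counter(w.lower() for w in words)
        return {
            "upper_ratio": sum(c.isupper() for c in text) / chars if chars else 0,
            "short_word_ratio": sum(len(w) < 4 for w in words) / m if m else 0,
            "avg_word_length": sum(len(w) for w in words) / m if m else 0,
            "hapax_legomena": sum(1 for c in counts.values() if c == 1),
            "hapax_dislegomena": sum(1 for c in counts.values() if c == 2),
        }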



The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used; therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.
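A sketch of this one-versus-all scheme (scikit-learn again standing in for SVM.NET; X, labels, and authors are placeholders, and the class-balancing step described below is omitted for brevity):

    from sklearn.svm import SVC

    def train_one_vs_all(X, labels, authors):
        # one binary SVM per author: that author's emails vs. all others
        models = {}
        for author in authors:
            y = [1 if l == author else 0 for l in labels]
            models[author] = SVC(kernel="rbf", probability=True).fit(X, y)
        return models

    def score_text(models, x):
        # each per-author SVM assigns a probability that x was written by it
        return {a: m.predict_proba([x])[0, 1] for a, m in models.items()}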

Since SVMs are sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all of that author's emails are selected as positive examples, and an equal amount of emails from other authors is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion, by Matthew A. Johnson, of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression, and distribution estimation for single- and multi-class problems, using different kernels and parameters.

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3), and authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network, and authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

Figure 3.6: The structure of the combined approach.

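Under these assumptions, the voting step can be sketched as follows; the three-dimensional input vectors simply concatenate the scores of the individual techniques, and the training rows shown are invented placeholders:

    from sklearn.svm import SVC

    # each instance: [jaro_winkler, link_similarity, authorship_svm_probability]
    train_vectors = [[0.95, 0.40, 0.80],   # labeled alias (positive)
                     [0.91, 0.55, 0.75],   # labeled alias (positive)
                     [0.30, 0.05, 0.10],   # labeled non-alias (negative)
                     [0.45, 0.10, 0.20]]   # labeled non-alias (negative)
    train_labels = [1, 1, 0, 0]

    voting_svm = SVC(kernel="rbf").fit(train_vectors, train_labels)

    def is_alias(jw, link, svm_prob):
        return voting_svm.predict([[jw, link, svm_prob]])[0] == 1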

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used for testing are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and five times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments are discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set are given. Second, the results of the individual and combined techniques on the hard test set are given. Finally, an overview of the best results achieved by each individual and combined technique is given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard, and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard, and authorship SVM, respectively. Again, the results are based on the decision thresholds 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.

Figure 4.1: Precision, recall, and F1 calculated using various decision thresholds for the individual techniques on the mixed test set. [Four panels, each plotting precision, recall, and F1 against decision thresholds from 0 to 1: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.]

Figure 4.2: Precision, recall, and F1 calculated using various decision thresholds for the combined techniques on the mixed test set. [Two panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.3: Precision, recall, and F1 calculated using various decision thresholds for the individual techniques on the hard test set. [Four panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.]

Figure 4.4: Precision, recall, and F1 calculated using various decision thresholds for the combined techniques on the hard test set. [Two panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2; the same behavior of Connected Path could be expected on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using the Jaccard similarity yielded better results than the Connected Path algorithm. Since the Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM, or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard, and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase compared with the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that were formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content, and Jaccard similarity of the link network, combined using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases; such a collection could not be found for this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There is a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques, and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67-75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, Volume 6, pages 48-57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9-17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288-293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77-102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267-287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27-47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1-39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290-294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73-78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273-297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265-292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD).

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278-285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889-1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval, Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3-6.

[21] FERC (2012). Information released in the Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289-1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486-509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), page 611, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172-177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453-476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56-64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42-S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491-498.

[33] Jeh, G. and Widom, J. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1-11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137-142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69-72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659-660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83-94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019-1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019-1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513-520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237-246.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97-105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39-41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267-270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88.

[53] Odell, M. and Russell, R. (1918). The soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89-99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101-141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482-491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45-72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249-252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265-269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538-556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425-442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17-24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1-15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174-189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378-393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


where token starts and ends are delimited by a '|'. The complete sentence can be seen as one syntactic pattern, but a subset of the sentence can also be taken as a syntactic pattern. An example of the use of POS tags can be found in Solorio and Pillay [62], who use POS tag uni-grams, bi-grams, and tri-grams in combination with common lexical features to successfully attribute authorship. Other features that can be derived from syntactic parsing are the depth of the resulting parse tree (a measure of sentence complexity) [64], or the distance covered by the dependency links in the parse tree.

Function words were introduced by Mosteller and Wallace [51] to distinguish between different authors. Function words are words that have almost no lexical meaning, but signify grammatical relationships between other words in a sentence. The set of function words in a language usually does not change in short periods of time, as opposed to content words. For example, in the sentence "The leader of the team was very strong", the function words are "The", "of", and "the". They signify the grammatical structure of the sentence without containing any meaning; they are topic-independent. Lists of function words are available for many languages, and the frequency counts of the different function words can be used as features.

Structural features

Structural features represent the way an author's writing is organized, including layout and paragraph structure. Examples of structural features are the number of sentences, the number of lines, the use of indentation, or the number of words per paragraph. Structural features also include content-specific features that can only be used in particular domains. de Vel [16] proposed a set of structural features specifically designed for email, such as the presence or absence of greetings and salutation blocks, and the reply status (whether the body of the email contains reply or forwarded text). Zheng et al. [71] extended these email-specific features with features such as the presence of telephone numbers and URLs in the salutation blocks; however, they experienced difficulties with extracting these features. Another example of content-specific structural features is the use of HTML tags by de Vel et al. [17], who found that some email programs used HTML formatting for their emails, and included the frequency of different HTML tags in their feature set.

Semantic features

The most complex set of features are the semantic features. These features require sophisticated NLP techniques, such as full parsing or semantic analysis, in order to capture the meaning of words and/or sentences in a text. Since these NLP techniques are often noisy and inaccurate, there is a chance that these features will only harm the attribution process, because they might not reflect the actual writing style of the author very well.

An example of the use of semantic features comes from Tanguy et al. [64], who use UK/US spelling variants and 12 features based on WordNet [49]. WordNet is a large semantic database of English nouns, verbs, adjectives, and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different named-entity types, such as date, location, money, number, ordinal, organization, percent, person, and time. The results obtained by Tanguy et al. are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word; function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural, and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features; in such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection also has to deal with the additional problem that the final feature set might over-fit the training data. The use of feature selection methods is therefore ambiguous, and can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and by Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

    Entropy = -∑_{x∈X} P(x) log P(x)        (2.9)

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.
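For concreteness, a small sketch of entropy and of the information gain of one discrete feature (an illustration of the definition, not code from the thesis):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, feature_values):
        # entropy of the class labels minus the expected entropy
        # after partitioning the instances by the feature's value
        n = len(labels)
        remainder = 0.0
        for v in set(feature_values):
            subset = [l for l, f in zip(labels, feature_values) if f == v]
            remainder += len(subset) / n * entropy(subset)
        return entropy(labels) - remainder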

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "the metric measures the horizontal separation between two standard Normal curves, where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data; hence, they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set, he still manages to assert with confidence that "The Royal Book of Oz" was written by Ruth Plumly Thompson; Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations V(s) and V(t), the cosine similarity is defined as

    Cosine(s, t) = (V(s) · V(t)) / (|V(s)| |V(t)|)        (2.10)
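Equation 2.10 translates directly into the following sketch:

    import math

    def cosine(u, v):
        # dot product divided by the product of the Euclidean norms
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0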

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering that the number of candidate authors is 10,000. In a later study, Koppel et al. [40] report that 46% of 1,000 blog extracts are classified correctly using only the cosine similarity.

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style using k-means clustering, in which emails are iteratively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.
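A minimal sketch of the k-nearest-neighbor assignment described above, with the document representation and similarity function left abstract:

    from collections import Counter

    def knn_author(doc, labeled_docs, k, similarity):
        # labeled_docs: list of (document, author) pairs; the document is
        # assigned to the majority author among its k most similar neighbors
        nearest = sorted(labeled_docs, key=lambda d: similarity(doc, d[0]),
                         reverse=True)[:k]
        votes = Counter(author for _, author in nearest)
        return votes.most_common(1)[0][0]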

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural, and domain-specific features from the texts of candidate authors and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1 = [0, 0.25], A2 = [0.25, 0.50], A3 = [0.50, 0.75], and A4 = [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown document's writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint, to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.
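The discretization step can be sketched as follows, using the four equal-width intervals of the example above:

    def discretize(value, k=4):
        # map a normalized feature value in [0, 1] to a one-hot interval
        # indicator, e.g. 0.6 with k = 4 -> (0, 0, 1, 0)
        index = min(int(value * k), k - 1)
        return tuple(1 if i == index else 0 for i in range(k))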

Figure 2.1: The structure of a supervised authorship attribution system.

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in Figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered the study that originated the field of authorship attribution. Mendenhall examined how often authors such as Bacon, Marlowe, and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author. His most important finding was that word-length distributions tend to remain the same across different works by a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejected the claim of some literary scholars that Shakespeare and Bacon were the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers, a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naive Bayes probabilistic model to the frequency of these function words and found that all documents were written by Madison. The Naive Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, ..., x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the original document can be expressed by

P(A_i \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid A_i) \, P(A_i)    (2.11)

The real author is then calculated using

A^* = \arg\max_{A_i} P(A_i \mid x_1, \ldots, x_n)    (2.12)
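A minimal sketch of this function-word model using scikit-learn's multinomial Naive Bayes; the word list, training texts and labels are illustrative stand-ins for the actual Federalist data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

function_words = ["the", "of", "to", "and", "in", "that",
                  "upon", "by", "while", "whilst"]
vectorizer = CountVectorizer(vocabulary=function_words)

train_texts = ["the powers of the union and of the states",
               "upon the whole it is by that standard that we judge"]
train_labels = ["Hamilton", "Madison"]

X = vectorizer.transform(train_texts)        # counts of x_1, ..., x_n
clf = MultinomialNB().fit(X, train_labels)   # estimates P(x | A_i) and P(A_i)

disputed = vectorizer.transform(["while the people of the states consent"])
print(clf.predict(disputed))                 # argmax_i P(A_i | x_1, ..., x_n)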

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text."

Burrows computes the z-scores of the 30 most frequent words from a text against a reference corpus (a large contemporary corpus from which the mean and standard deviation of these 30 words are computed) and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1,500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method for authorship attribution in large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
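The computation behind Delta is compact; the sketch below uses made-up frequencies for three word variables instead of the 30 most frequent words:

import numpy as np

def delta(freqs_known, freqs_unknown, ref_mean, ref_std):
    """Mean absolute difference of z-scores for a set of word variables."""
    z_known = (freqs_known - ref_mean) / ref_std
    z_unknown = (freqs_unknown - ref_mean) / ref_std
    return float(np.mean(np.abs(z_known - z_unknown)))

ref_mean = np.array([0.061, 0.032, 0.027])  # reference-corpus means
ref_std  = np.array([0.010, 0.006, 0.005])  # and standard deviations
known    = np.array([0.070, 0.030, 0.020])  # candidate author's text
unknown  = np.array([0.068, 0.029, 0.022])  # disputed text

# the candidate with the lowest Delta is attributed the disputed text
print(delta(known, unknown, ref_mean, ref_std))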

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier were tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest possible margin. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in very fast classification. An example of how SVM maximizes the margin of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and on what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and separates the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].
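As a purely illustrative companion to figure 2.4, the following scikit-learn sketch fits the three kernels mentioned above on a toy two-circles problem, where only the non-linear kernels can separate the classes:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# concentric circles: not linearly separable in the original space
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, round(clf.score(X, y), 2))   # rbf scores (near-)perfectly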

Variations of the binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-all using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines: for example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training & validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest-weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, once a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved with as little as 100 words per text.
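A sketch of the unmasking loop is given below; it assumes a documents-by-features matrix X and binary labels y (author A versus the text in question) are already available, and the values of k and the iteration count are illustrative:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def unmasking_curve(X, y, k=3, iterations=10):
    active = np.arange(X.shape[1])        # indices of the remaining features
    curve = []
    for _ in range(iterations):
        clf = LinearSVC(dual=False).fit(X[:, active], y)
        curve.append(cross_val_score(clf, X[:, active], y, cv=5).mean())
        weights = np.abs(clf.coef_[0])
        active = active[np.argsort(weights)[:-k]]  # drop the k strongest
    # a steep drop in this curve suggests the two "authors" are the same
    return curve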

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely link analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then there is an edge e_{v_i v_j} ∈ W if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i v_j} ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network; the most important measures will be discussed in the following section.
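Translated to code, building such a network from a list of (sender, receiver) pairs and reading off neighborhoods takes only a few lines (the pairs below are placeholders):

from collections import defaultdict

emails = [("alice", "bob"), ("bob", "carol"),
          ("alice", "carol"), ("dave", "bob")]

neighbors = defaultdict(set)       # N(v) for every vertex v
for sender, receiver in emails:
    neighbors[sender].add(receiver)
    neighbors[receiver].add(sender)

print(neighbors["bob"])            # {'alice', 'carol', 'dave'}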

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

\mathrm{Cocitation}(v_i, v_j) = |N(v_i) \cap N(v_j)|    (2.13)

In graph theory this is often known as the shared-neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

\mathrm{Jaccard}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}    (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random guessing in predicting where new links will form. However, it sometimes performs only on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
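In code, the measure reduces to set operations on the two neighborhoods (toy neighborhoods shown):

def jaccard(n_i, n_j):
    union = n_i | n_j
    return len(n_i & n_j) / len(union) if union else 0.0

neighbors = {"alice": {"bob", "carol"}, "dave": {"bob"}}
# alice and dave share bob: |{bob}| / |{bob, carol}| = 0.5
print(jaccard(neighbors["alice"], neighbors["dave"]))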

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors and are denoted by |I(v)| and |O(v)| respectively; an individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

\mathrm{SimRank}(v_i, v_j) = \frac{C}{|I(v_i)|\,|I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} \mathrm{SimRank}(I_x(v_i), I_y(v_j))    (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
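A direct fixed-point iteration of equation 2.15 on a small directed graph might look as follows (graph, C and iteration count are illustrative):

from itertools import product

def simrank(in_nbrs, C=0.8, iterations=10):
    nodes = list(in_nbrs)
    sim = {(a, b): float(a == b) for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
            elif in_nbrs[a] and in_nbrs[b]:
                total = sum(sim[(x, y)]
                            for x in in_nbrs[a] for y in in_nbrs[b])
                new[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
            else:
                new[(a, b)] = 0.0      # no in-neighbors: similarity 0
        sim = new
    return sim

in_nbrs = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(simrank(in_nbrs)[("b", "c")])    # b and c are cited by the same vertex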

PageSim [42] is another extension of the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors of the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the more strongly it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

\mathrm{ConnectedPath}(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{\mathrm{length}(p)}    (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length at most r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

U(p) = \sum_{v_x \in p,\; v_x \notin \{v_i, v_j\}} UQ(v_x)    (2.17)

where UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{v_g \in V} |w_{x,g}|}    (2.18)

where w_{x,g} denotes an edge between v_x ∈ p and any other vertex v_g ∈ V, and w_{x,x+1} and w_{x,x-1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
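The following sketch implements equations 2.16-2.18 on a toy weighted network; the weight matrix and maximum path length are illustrative, and length(p) is taken to be the number of edges in p:

def simple_paths(w, vi, vj, r):
    """All simple paths from vi to vj with at most r edges."""
    stack = [(vi, [vi])]
    while stack:
        node, path = stack.pop()
        if node == vj and len(path) > 1:
            yield path
            continue
        if len(path) > r:                     # path already has r edges
            continue
        for nxt in w.get(node, {}):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))

def connected_path(w, vi, vj, r=3):
    score = 0.0
    for p in simple_paths(w, vi, vj, r):
        # uniqueness of each intermediate vertex: weight of its two path
        # edges over its total edge weight (equation 2.18)
        uq = sum((w[x][p[k - 1]] + w[x][p[k + 1]]) / sum(w[x].values())
                 for k, x in enumerate(p) if 0 < k < len(p) - 1)
        score += uq / (len(p) - 1)            # U(p) / length(p)
    return score

# symmetric message counts between authors
w = {"vi": {"a": 2, "b": 1}, "a": {"vi": 2, "vj": 3},
     "b": {"vi": 1, "vj": 1}, "vj": {"a": 3, "b": 1}}
print(connected_path(w, "vi", "vj"))          # two connecting paths -> 1.0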


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = αs_i + βs_j + γs_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j, k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2 and that the aggregation methods were rather simple.

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism such as a Support Vector Machine can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision; the remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias           false alias
retrieved        true positives (tp)     false positives (fp)
not retrieved    false negatives (fn)    true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution it is common to construct a contingency table such as the one shown in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy, defined as the percentage of classifications that are correct:

\mathrm{Accuracy} = \frac{\text{correct classifications}}{\text{total number of classifications}} = \frac{tp + tn}{tp + fp + fn + tn}    (2.19)

Although it looks like a good measure of performance, it is not hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all examples as negative. In most cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct:

P = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{retrieved aliases}|} = \frac{tp}{tp + fp}    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved:

R = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{total correct aliases}|} = \frac{tp}{tp + fn}    (2.21)

These two measures are not as dependent on the class distributions as the accuracy measure, and are therefore a more sensible choice in this situation. Moreover, by having these two measures of performance it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user who wants to automate the complete process and be able to rely greatly on the classification given by the system will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can be written as

F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive: since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].

Averaging the precision and recall scores of different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46] and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.
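The sketch below computes these measures from contingency counts and contrasts macro- with micro-averaging on two toy test problems:

def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

runs = [(8, 2, 4), (1, 0, 3)]        # (tp, fp, fn) per test problem

# macro: average the per-problem scores (equal weight per problem)
macro = [sum(vals) / len(runs) for vals in zip(*(prf(*run) for run in runs))]
# micro: pool the contingency tables first (equal weight per document)
micro = prf(*(sum(col) for col in zip(*runs)))

print("macro P/R/F1:", macro)
print("micro P/R/F1:", micro)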

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen (2006) found that, when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of overfitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques with each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; for text stored in attachments this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000 at 16:08
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well-known version of the data set was made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus's appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "noaddress@enron.com" and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step    Records affected    Percentage removed (cum.)
1       17,052              6.70%
3       13,681              12.00%
4       26,223              22.50%
5       4,001               24.00%
6       25,990              34.00%
7       3,700               35.80%
8       52,163              56.50%

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained 10 or fewer words were removed, since they contained too little useful information.

5. Authors that had written a total number of 100 words or fewer were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number as well as the cumulative percentage of records that have been removed per step. Note that step 2 strips text from messages rather than removing whole records, whereas steps 7 and 8 correspond to the author-level filtering described below.

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of 80 emails or fewer were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44,912 emails by 246 different senders; for each message the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis again represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Figure 3.2: Averages of 10 times 10-fold cross-validation accuracy using different training set sizes and kernels (linear and RBF) for the authorship SVM.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links, which corresponds to the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.



Figure 3.3: The distribution of email messages per author.

Figure 3.4: The distribution of the total number of words per author.


Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Since there was no data to verify whether the ENRON data set actually contains any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of 200 emails or more were selected from the data set, and their emails were split up over several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. johndoe@enron.comA & johndoe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin laden & abu abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Type of alias                       Number of authors
High Jaro-Winkler with 1 alias      26
High Jaro-Winkler with 2 aliases    15
Low Jaro-Winkler with 1 alias       11
Low Jaro-Winkler with 2 aliases     1
No alias                            193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed    Hard
High Jaro-Winkler    6        2
Low Jaro-Winkler     8        16
No alias             6        2

Table 3.3: Distribution of alias types in the two different test sets.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
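A sketch of this pass over author pairs is shown below; it assumes the third-party jellyfish library for the Jaro-Winkler metric itself (function name as in recent versions), and the addresses and the single threshold are illustrative, whereas the experiments sweep thresholds from 0.0 to 1.0:

import jellyfish

authors = ["john.doe@enron.com", "johndoe@enron.com", "jane.roe@enron.com"]
THRESHOLD = 0.94                       # one point on the evaluated sweep

for i, a in enumerate(authors):
    for b in authors[i + 1:]:
        score = jellyfish.jaro_winkler_similarity(a, b)
        if score >= THRESHOLD:
            print(f"candidate alias pair: {a} / {b} ({score:.3f})")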

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores are in the range [0, 1], the score for a particular author-author pair has been normalized as follows:

\mathrm{ConnectedPath}_{norm}(v_i, v_j) = \frac{\mathrm{ConnectedPath}(v_i, v_j)}{\mathrm{ConnectedPath}_{max}}    (3.1)

where ConnectedPath_{max} is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences between individual email messages. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in table 3.4; the list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3 is calculated using 5 times 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of 21 special characters (e.g. ~ $ ^ & - _ = + < > [ ] |)
54          Total number of words (M)
55          Total number of short words (less than four characters) / M
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word-length frequency distribution / M
83-333      TF-IDF of the 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation marks
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.


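A minimal version of this grid search, using scikit-learn in place of SVM.NET and a single 5-fold cross-validation per grid point, is sketched below; X and y are assumed to hold the feature vectors of table 3.4 and the author labels:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def grid_search(X, y):
    best = (None, None, 0.0)
    for log_c in range(-5, 17, 2):        # C = 2^-5, 2^-3, ..., 2^15
        for log_g in range(-15, 5, 2):    # gamma = 2^-15, ..., 2^3
            clf = SVC(kernel="rbf", C=2.0 ** log_c, gamma=2.0 ** log_g)
            acc = cross_val_score(clf, X, y, cv=5).mean()
            if acc > best[2]:
                best = (2.0 ** log_c, 2.0 ** log_g, acc)
    return best     # (C, gamma, accuracy) for training the final model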

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM classifies whether or not a given text has been written by one particular author. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used; a simple scheme such as one-versus-all is therefore preferable over more complex schemes such as error-correcting codes.
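A sketch of the one-versus-all setup (again with scikit-learn standing in for SVM.NET): one probability-producing binary SVM per author, trained on that author's emails against an equally sized random sample from the other authors; X_by_author is an assumed mapping from author to feature vectors:

import random
import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(X_by_author):
    models = {}
    for author, pos in X_by_author.items():
        others = [x for a, xs in X_by_author.items() if a != author for x in xs]
        neg = random.sample(others, len(pos))          # balanced classes
        X = np.vstack([pos, neg])
        y = [1] * len(pos) + [0] * len(neg)
        models[author] = SVC(kernel="rbf", probability=True).fit(X, y)
    return models

def most_likely_author(models, x):
    # each binary SVM assigns a probability that its author wrote x
    scores = {a: m.predict_proba([x])[0][1] for a, m in models.items()}
    return max(scores, key=scores.get)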

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal number of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, the negative emails have been selected at random from the other authors. For each author, all of that author's emails are selected as positive examples, and an equal number of emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35], a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems using different kernels and parameters.

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content.

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content.

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction of whether the input author is an alias or not. An overview of the combined approach can be found in figure 3.6.

Figure 3.6: The structure of the combined approach.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used for testing are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets of table 3.3 are used to determine the precision and recall for various decision thresholds.
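A sketch of the voting SVM: each candidate author pair is reduced to a three-dimensional vector of normalized technique scores, and a binary SVM trained on labeled pairs produces the alias decision; all scores and labels below are made-up placeholders:

import numpy as np
from sklearn.svm import SVC

# columns: [jaro_winkler, jaccard, authorship_svm] per candidate pair
X_train = np.array([[0.96, 0.40, 0.80], [0.95, 0.35, 0.75],
                    [0.30, 0.45, 0.85], [0.97, 0.30, 0.70],
                    [0.50, 0.05, 0.30], [0.40, 0.10, 0.20],
                    [0.60, 0.02, 0.10], [0.55, 0.08, 0.25],
                    [0.45, 0.03, 0.15], [0.35, 0.06, 0.05]])
y_train = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]      # 1 = real alias pair

voter = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

# a pair with a low Jaro-Winkler score but strong link and style evidence
candidate = np.array([[0.30, 0.45, 0.85]])
print(voter.predict_proba(candidate)[0][1])   # compared to a decision threshold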


Chapter 4

Results

The results that have been obtained from the various experiments are discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set are given. Second, the results of the individual and combined techniques on the hard test set are given. Finally, an overview of the best results achieved by each individual and combined technique is given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM respectively. The results are based on the decision thresholds 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.80 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 at a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM respectively. Again, the results are based on the decision thresholds 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


Figure 4.1: Precision, recall and F1 for various decision thresholds for the individual techniques on the mixed test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.2: Precision, recall and F1 for various decision thresholds for the combined techniques on the mixed test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.3: Precision, recall and F1 for various decision thresholds for the individual techniques on the hard test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.4: Precision, recall and F1 for various decision thresholds for the combined techniques on the hard test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler. The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or from the use of different email addresses for work, home, etc.

Connected Path. It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 than with a search to depth 2; it is expected that the same behavior of Connected Path would be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard. Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM. The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results on both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There is a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the way through: Testing for authorship in different frequency strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A comparison of personal name matching: Techniques and practical issues. In Sixth IEEE International Conference on Data Mining – Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proceedings of the Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which is the best multiclass SVM method? An empirical study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval (SIGIR) Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp

[22] Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), page 611-es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias detection in link data sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto meta-classifier for authorship identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship attribution with thousands of candidate authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), page 1613, New York, NY, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship attribution and verification with many authors and limited data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING '08), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.

[47] Mendenhall, T. C. (1887). The characteristic curves of composition. Science, 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A mechanical solution of a literary problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A lexical database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Evangelos, S., Jiawei, H., and Usama, F., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron email dataset: Database schema and brief statistical report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word frequency distributions and type-token characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An empirical study of category skew on feature selection for text categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship identification with modality specific meta features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The state of record linkage and current research problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and scalable authorship attribution using function words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


WordNet is a large semantic database of English nouns, verbs, adjectives and adverbs, grouped into sets of cognitive synonyms. The features that have been derived from WordNet represent polysemy, availability in the database, and the frequency of different Named Entity types such as date, location, money, number, ordinal, organization, percent, person and time. The results obtained by Tanguy are among the best in a recent International Authorship Identification Competition at the CLEF 2011 conference [2].

An interesting semantic feature that has been used by Koppel and Akiva [38] is called meaning-preserving stability of linguistic elements. It measures the degree to which a word, or another type of linguistic construct such as a part-of-speech tag, can be replaced by a synonym in a given sentence. If a word can be replaced by many different synonyms in a sentence, it is considered an unstable word. Function words are therefore excellent examples of unstable features. Koppel and Akiva found that frequent and unstable words are good stylometric indicators and can therefore be used for authorship attribution. These results are intuitive, since the particular choice of a synonym is likely to reflect an author's writing style.

2.2.3 Feature Selection

Most studies use a combination of lexical, syntactic, structural and/or semantic features. Combining these features can create a very high-dimensional feature set, especially when character and word frequencies are used. To combat the increase in computational complexity, a feature selection technique can be used to reduce the dimensionality of the feature set. The general idea is to retain only those features that have high discriminatory power. Note that some features that are not very informative on their own might work better in combination with other features. In such cases, feature selection can harm the attribution process by eliminating useful combinations of features. Feature selection also has to deal with the problem that the final feature set might over-fit the training data. The merit of feature selection methods is therefore ambiguous, and they can potentially decrease the performance of the authorship attribution. For example, in the aforementioned International Authorship Identification Competition at the CLEF 2011 conference [2], the best performing technique was the one that used the largest and most diverse feature set.

The simplest feature selection method uses the frequency count of each feature and dismisses the features that have a low frequency. The idea is that frequently occurring features contain more information about the author and are therefore more useful in discriminating between different authors. Additionally, more complex measures can be used to assess the discriminatory power of individual features. Forman [22] provides a comparison of feature selection metrics for binary text classification using Support Vector Machines, tested on a large number of data sets. The two feature selection methods that performed best are Information Gain and Bi-Normal Separation, a result that is confirmed by Simeon and Hilderman [60] and Gabrilovich and Markovitch [24].


Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

$\text{Entropy} = -\sum_{x \in X} P(x) \log P(x)$ (2.9)

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.
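
To make this concrete, a minimal sketch of equation 2.9 and the resulting information gain of a single binary feature; the class distributions below are toy values, not taken from the thesis experiments.

import math

def entropy(probs):
    # Entropy of a class distribution (equation 2.9).
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(prior, with_feature, without_feature, p_feature):
    # Difference in entropy before and after splitting on a feature.
    return entropy(prior) - (p_feature * entropy(with_feature)
                             + (1 - p_feature) * entropy(without_feature))

# A feature that splits two balanced classes into 90/10 subsets is informative.
print(information_gain(prior=[0.5, 0.5],
                       with_feature=[0.9, 0.1],
                       without_feature=[0.1, 0.9],
                       p_feature=0.5))  # approximately 0.53 bits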

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "the metric measures the horizontal separation between two standard Normal curves, where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".
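
A minimal sketch of Bi-Normal Separation using SciPy's inverse Normal CDF; clipping the rates away from 0 and 1 is an implementation detail assumed here to avoid infinite z-scores, not something prescribed by Forman [22].

from scipy.stats import norm

def bns(tp, fp, pos, neg, eps=0.0005):
    # Bi-Normal Separation: |F^-1(tpr) - F^-1(fpr)|,
    # where F is the standard Normal CDF.
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))

# Example: a feature occurring in 80 of 100 positive and 5 of 200 negative texts.
print(bns(tp=80, fp=5, pos=100, neg=200))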

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the feature set whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data. Hence, they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set, he still manages to assert with confidence that "The Royal Book of Oz" was written by Ruth Plumly Thompson. Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.
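
With scikit-learn, retaining enough components to explain 95% of the variance (as Tearle et al. [65] do) is a one-liner; the feature matrix X below is a random placeholder, not thesis data.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 50)      # placeholder: 200 texts, 50 style features
pca = PCA(n_components=0.95)     # keep components explaining 95% of variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)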

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction that is made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations $\vec{V}(s)$ and $\vec{V}(t)$, the cosine similarity is defined as

$\text{Cosine}(s, t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)|\,|\vec{V}(t)|}$ (2.10)

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10,000. In later research, Koppel et al. [40] report that 46% of 1000 blog extracts are classified correctly using only the cosine similarity.
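
A minimal sketch of the ranking step behind equation 2.10; the toy count vectors stand in for the tf-idf representations used by Koppel et al. [39].

import numpy as np

def cosine(v, w):
    # Cosine similarity between two feature vectors (equation 2.10).
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

unknown = np.array([3.0, 0.0, 1.0, 2.0])
candidates = {"author_a": np.array([2.0, 0.0, 1.0, 3.0]),
              "author_b": np.array([0.0, 4.0, 0.0, 1.0])}
best = max(candidates, key=lambda a: cosine(unknown, candidates[a]))
print(best)  # the candidate whose profile is most similar to the unknown text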

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural and domain-specific features from the texts of candidate authors and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1: [0, 0.25], A2: [0.25, 0.50], A3: [0.50, 0.75] and A4: [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint to make some patterns more important than others. The method achieves an accuracy


Figure 2.1: The structure of a supervised authorship attribution system.

of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.
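
A sketch of the discretization step only (the frequent-pattern mining itself is more involved): each normalized feature is mapped to one of four equal-width intervals and one-hot encoded, reproducing the (0, 0, 1, 0) example above.

import numpy as np

def discretize(features, intervals=4):
    # Map each feature in [0, 1] to a one-hot vector over equal-width
    # intervals, e.g. 0.6 -> interval A3 -> (0, 0, 1, 0).
    bins = np.linspace(0.0, 1.0, intervals + 1)[1:-1]   # inner boundaries
    out = np.zeros((len(features), intervals), dtype=int)
    for i, v in enumerate(features):
        out[i, np.digitize(v, bins)] = 1
    return out.flatten()

print(discretize([0.6, 0.1]))  # -> [0 0 1 0 1 0 0 0]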

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in Figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution most closely matches that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author.


His most important finding was that word-length distributions tend to remain the same across different works by a particular author. In similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tended to use four-letter words most often, whereas Bacon used three-letter words most often. He thereby rejected the claim of some literary scholars that Shakespeare and Bacon were the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers: a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller and Wallace applied a Naïve Bayes probabilistic model to the frequency of these function words and found that all documents were written by Madison. The Naïve Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features $x_1, \ldots, x_n$ and a set of authors A, where $A_i$ denotes an individual author, the probability that a given author $A_i$ is the real author of the original document can be expressed by

$P(A_i \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid A_i)\, P(A_i)$ (2.11)

The real author is then calculated using

$A^* = \arg\max_{A_i \in A} P(A_i \mid x_1, \ldots, x_n)$ (2.12)
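
A minimal sketch of the decision rule of equations 2.11 and 2.12 over function-word counts, using log-probabilities for numerical stability. The word probabilities below are fabricated toy values, not the estimates of Mosteller and Wallace [51].

import math

def naive_bayes(author_models, priors, counts):
    # arg max_i  log P(A_i) + sum_w counts[w] * log P(w | A_i)
    best, best_score = None, float("-inf")
    for author, word_probs in author_models.items():
        score = math.log(priors[author])
        score += sum(c * math.log(word_probs[w]) for w, c in counts.items())
        if score > best_score:
            best, best_score = author, score
    return best

models = {"hamilton": {"while": 0.04,  "whilst": 0.001, "upon": 0.06},
          "madison":  {"while": 0.001, "whilst": 0.03,  "upon": 0.002}}
print(naive_bayes(models, {"hamilton": 0.5, "madison": 0.5},
                  {"whilst": 3, "upon": 1}))  # -> madison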

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text."

Burrows computes the z-scores of the 30 most frequent words from a text against a reference corpus (a large contemporary corpus from which the mean and standard deviation of these 30 words are computed) and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
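
A sketch of the Delta computation: z-scores relative to a reference corpus and the mean absolute difference between the unknown text and each candidate. The word frequencies are placeholder values for three frequent words, not real corpus statistics.

import numpy as np

def delta(known_freqs, unknown_freqs, ref_mean, ref_std):
    # Burrows's Delta: mean absolute difference between the z-scores of
    # the most frequent words in the known and unknown texts.
    z_known = (known_freqs - ref_mean) / ref_std
    z_unknown = (unknown_freqs - ref_mean) / ref_std
    return float(np.mean(np.abs(z_known - z_unknown)))

ref_mean = np.array([0.050, 0.030, 0.020])   # reference corpus means
ref_std  = np.array([0.010, 0.008, 0.005])   # reference corpus std devs
unknown  = np.array([0.055, 0.026, 0.022])
for author, freqs in {"a": np.array([0.057, 0.027, 0.021]),
                      "b": np.array([0.035, 0.045, 0.012])}.items():
    print(author, delta(freqs, unknown, ref_mean, ref_std))
# The candidate with the lowest Delta is attributed authorship.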

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training


Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in Section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in Figure 2.2. By testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier were tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest possible margin. This means that the generalization error of the model


Figure 2.3: An example of linear separation using a maximum margin. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in very fast classification. An example of how SVM maximizes the margin of the hyperplane that separates two linearly separable classes can be seen in Figure 2.3.

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and can separate the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations of the binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-all using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.
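
A minimal sketch of a one-vs-all linear SVM over style feature vectors using scikit-learn; the data is randomly generated purely to make the snippet runnable and does not reflect the thesis experiments.

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((120, 30))                 # 120 emails, 30 style features
y = rng.integers(0, 4, 120)               # 4 candidate authors

clf = OneVsRestClassifier(LinearSVC())    # one binary SVM per author
clf.fit(X, y)
print(clf.predict(X[:5]))                 # predicted authors for 5 emails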

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in Figure 2.5. The network is trained by adjusting the weights of each node such that it results in a


Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training and validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author. Which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in Section 2.2.4, the Unmasking


technique was able to attribute a 500-word snippet to one of 1000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved with as little as 100 words per text.
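
A compact sketch of the core Unmasking loop, assuming scikit-learn: repeatedly drop the k most strongly weighted positive and negative features of a linear SVM and record how fast cross-validation accuracy degrades. The data here is random filler; with real texts, a fast drop in the curve suggests a shared author.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def unmasking_curve(X, y, iterations=10, k=3):
    # X: feature matrix for texts of author A vs. the anonymous text's
    # snippets; y: binary labels. Returns accuracy per elimination round.
    active = np.arange(X.shape[1])
    curve = []
    for _ in range(iterations):
        clf = LinearSVC().fit(X[:, active], y)
        curve.append(cross_val_score(LinearSVC(), X[:, active], y, cv=5).mean())
        w = clf.coef_[0]
        # Remove the k strongest positive and k strongest negative weights.
        drop = np.concatenate((np.argsort(w)[-k:], np.argsort(w)[:k]))
        active = np.delete(active, drop)
    return curve

rng = np.random.default_rng(1)
X = rng.random((40, 100))
y = np.array([0] * 20 + [1] * 20)
print(unmasking_curve(X, y))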

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let $v_i, v_j \in V$; then an edge $e_{v_i v_j} \in W$ exists if a message has been sent from author $v_i$ to author $v_j$. If there exists an edge $e_{v_i v_j} \in W$, then $v_i$ and $v_j$ are considered to be neighbors. The neighborhood $N(v_i)$ is the set of all neighbors of the vertex $v_i$. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and it can be expressed as follows:

$\text{Co-citation}(v_i, v_j) = |N(v_i) \cap N(v_j)|$ (2.13)

In Graph Theory, this is often known as the shared-neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both authors A and B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices $v_i, v_j, v_k$ and two edges $w_{ik}, w_{jk}$, such that $v_i$ and $v_j$ are connected via the third vertex $v_k$. Figure 2.6 provides an example of a trivial network where email addresses $v_i$ and $v_j$ are considered to be aliases because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices $v_i$ and $v_j$ being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of $v_i$ and $v_j$ is defined as follows:

$\text{Jaccard}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}$ (2.14)

where $N(v_i)$ again designates the set of neighbors of $v_i$. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random guessing in predicting where new links will form. However, it sometimes performs merely on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
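
A minimal sketch of equation 2.14 on neighbor sets; the tiny email network below is hypothetical.

def jaccard(neighbors, a, b):
    # |N(a) & N(b)| / |N(a) | N(b)|   (equation 2.14)
    na, nb = neighbors[a], neighbors[b]
    return len(na & nb) / len(na | nb) if na | nb else 0.0

neighbors = {"alice@x.com":   {"bob@x.com", "carol@x.com", "dave@x.com"},
             "a.smith@y.org": {"bob@x.com", "carol@x.com"},
             "eve@x.com":     {"dave@x.com"}}
print(jaccard(neighbors, "alice@x.com", "a.smith@y.org"))  # 2/3: alias candidate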

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let $I(v)$ be the set of in-going neighbors of vertex v, and $O(v)$ the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors and are denoted by $|I(v)|$ and $|O(v)|$, respectively. An individual neighbor is denoted as $I_i(v)$ or $O_i(v)$. The similarity between vertices $v_i$ and $v_j$ can be calculated using the following recursive equation:

$\text{SimRank}(v_i, v_j) = \frac{C}{|I(v_i)||I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} \text{SimRank}(I_x(v_i), I_y(v_j))$ (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting $\text{SimRank}(v_i, v_j) = 1$ if $v_i = v_j$ and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range


of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
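
An iterative fixed-point sketch of equation 2.15 on a small directed graph represented by in-neighbor sets; C = 0.8 is an arbitrary choice here, and the three-node graph is a toy example.

def simrank(in_neighbors, C=0.8, iterations=10):
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                Ia, Ib = in_neighbors[a], in_neighbors[b]
                if not Ia or not Ib:
                    new[(a, b)] = 0.0
                    continue
                total = sum(sim[(x, y)] for x in Ia for y in Ib)
                new[(a, b)] = C * total / (len(Ia) * len(Ib))
        sim = new
    return sim

in_neighbors = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}}
print(simrank(in_neighbors)[("a", "b")])  # similar: both receive links from c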

PageSim [42] is another extension of the co-citation algorithm that assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices $v_i$ and $v_j$ is calculated using

$\text{ConnectedPath}(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{length(p)}$ (2.16)

where $PATH(v_i, v_j, r)$ is the collection of all paths between $v_i$ and $v_j$ of length r, and $U(p)$ is the uniqueness of a particular path $p \in PATH$, which is calculated as follows:

$U(p) = \sum_{v_x \in path(v_i, v_j),\, v_x \notin \{v_i, v_j\}} UQ(v_x)$ (2.17)

$UQ(v_x)$ denotes the uniqueness of a single vertex $v_x$ in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

$UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|}$ (2.18)

where $w_{x,g}$ denotes an edge between $v_x \in path(v_i, v_j)$ and any other vertex $v_g \in V$, and $w_{x,x+1}$ and $w_{x,x-1}$ denote the edges from $v_x$ to its adjacent vertices in the path. Figure 2.7 provides an example of vertices $v_i$ and $v_j$ having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
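
A sketch of equations 2.16-2.18 using networkx, under the simplifying assumption of an unweighted, undirected email graph: with all edge weights equal to 1, the uniqueness $UQ(v_x)$ of an inner vertex reduces to 2 divided by its degree. This reduction is a derivation for illustration, not the exact weighted formulation of Boongoen et al. [6].

import networkx as nx

def connected_path(G, vi, vj, r=3):
    # Sum over all simple paths of at most r edges of U(p) / length(p).
    score = 0.0
    for p in nx.all_simple_paths(G, vi, vj, cutoff=r):
        inner = p[1:-1]               # vertices other than vi and vj
        if not inner:
            continue                  # a direct edge has no inner vertex
        # U(p): sum of UQ(vx) over inner vertices, with UQ(vx) = 2/degree(vx)
        # for an unweighted graph.
        U = sum(2.0 / G.degree(v) for v in inner)
        score += U / (len(p) - 1)     # path length = number of edges
    return score

G = nx.Graph([("vi", "a"), ("a", "vj"), ("vi", "b"), ("b", "c"), ("c", "vj")])
print(connected_path(G, "vi", "vj"))  # the short path contributes the most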


Figure 2.7: An example of three different paths between the vertices $v_i$ and $v_j$. The most direct path ($p_x$) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

Several ways exist in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form $f(x) = \alpha s_i + \beta s_j + \gamma s_k$, where $s_i$, $s_j$ and $s_k$ denote the scores assigned by techniques i, j, k respectively, each normalized such that they fall in the range [0, 1]. The weights $\alpha$, $\beta$, $\gamma$ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in Section 2.3 to generate normalized pair-wise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2 and that the aggregation methods were rather simple.

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of such cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.
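
A sketch of this meta-classification idea: the normalized scores of the individual techniques become a three-dimensional feature vector, and an SVM learns which score combinations indicate a true alias. The training pairs below are fabricated for illustration.

import numpy as np
from sklearn.svm import SVC

# Each row: [jaro_winkler, authorship_svm_score, jaccard]; label 1 = real alias.
X = np.array([[0.95, 0.80, 0.40],
              [0.30, 0.85, 0.50],
              [0.90, 0.10, 0.05],
              [0.20, 0.15, 0.10]])
y = np.array([1, 1, 0, 0])

meta = SVC(kernel="linear").fit(X, y)
print(meta.predict([[0.40, 0.75, 0.45]]))  # content + link similarity -> alias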

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets. One data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias          false alias
retrieved        true positives (tp)    false positives (fp)
not retrieved    false negatives (fn)   true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table such as the one that can be seen in Table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

$\text{Accuracy} = \frac{\text{correct classifications}}{\text{total number of classifications}} = \frac{tp + tn}{tp + fp + fn + tn}$ (2.19)

Although it looks like a good measure of performance, it is not that hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as

$P = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{retrieved aliases}|} = \frac{tp}{tp + fp}$ (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

$R = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{total correct aliases}|} = \frac{tp}{tp + fn}$ (2.21)

These two measures are not as dependent on the class distributions as the accuracy measure. Therefore, they are a more sensible choice in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user that wants to automate the complete process and rely greatly on the classification given by the system will favor precision over recall. Since the


preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$ (2.22)

Often, the importance of precision and recall is balanced by choosing $\alpha = 0.5$. This results in the so-called F1-measure, which can now simply be written as

$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46] and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.
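
A sketch contrasting the two averaging schemes on per-problem contingency counts; the counts are toy values.

def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

problems = [(8, 2, 4), (1, 3, 1)]          # (tp, fp, fn) per author problem

# Macro: average the per-problem precisions and recalls.
macro_p = sum(prf(*c)[0] for c in problems) / len(problems)
macro_r = sum(prf(*c)[1] for c in problems) / len(problems)

# Micro: pool the counts into one global contingency table first.
tp, fp, fn = (sum(c[i] for c in problems) for i in range(3))
micro_p, micro_r = prf(tp, fp, fn)
print(macro_p, macro_r, micro_p, micro_r)  # macro weights the small problem more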

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the ones most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of over-fitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. This chapter will start with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented will be discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in Figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of the email messages it can be assumed that the sender of the email has written it, except for the forward and reply-parts. Concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


Subject: SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus's appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com" and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step    Records affected    Percentage removed (cum.)
1       17052               6.70
3       13681               12.00
4       26223               22.50
5       4001                24.00
6       25990               34.00
7       3700                35.80
8       52163               56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply-parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply-parts in step 2 were removed.

4. Messages that contained <= 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words <= 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number as well as the cumulative percentage of records that have been removed per step. Note that step 2 altered messages rather than removing whole records, which is why it does not appear as a separate row in the table.

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances that were needed for classification using Support Vector Machines. These results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails <= 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed (these two filters correspond to steps 7 and 8 in Table 3.1). According to Burrows [9], 10000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consisted of 44912 messages by 246 different senders. For each message the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides


[Figure: plot of 10-fold cross-validation accuracy (y-axis, 0.5-1.0) against the number of training instances per class (x-axis, 20-200), for the linear and RBF kernels.]

Figure 3.2: Averages of 10 times 10-fold cross-validation using different training set sizes and kernels for the authorship SVM.

an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10000 and 100000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set. Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it corresponds to the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails >= 200 were selected from the data set and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


[Figure: histogram of the number of authors (y-axis, 0-35) against the number of emails sent (x-axis, 90-230).]

Figure 3.3: The distribution of email messages per author.

[Figure: histogram of the number of authors (y-axis, 0-180) against the total number of words written, on a logarithmic x-axis (10000 to 100000000).]

Figure 3.4: The distribution of the total number of words per author.


[Figure: network graph of the senders in the final data set; nodes are labeled with truncated email addresses and artificial alias names, and node color represents node degree.]

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No alias                             193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed    Hard
High Jaro-Winkler    6        2
Low Jaro-Winkler     8        16
No alias             6        2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.com.A & john.doe@enron.com.B)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
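The pairing step can be sketched as follows, assuming the third-party jellyfish library for the Jaro-Winkler metric (the thesis implementation is not specified here; the function name, threshold and example addresses are illustrative):

    import itertools
    import jellyfish  # third-party: pip install jellyfish

    def jw_alias_pairs(addresses, threshold=0.94):
        """Return address pairs whose Jaro-Winkler similarity meets the threshold."""
        pairs = []
        for a, b in itertools.combinations(addresses, 2):
            if jellyfish.jaro_winkler_similarity(a, b) >= threshold:
                pairs.append((a, b))
        return pairs

    print(jw_alias_pairs(["john.doe@enron.com.A", "john.doe@enron.com.B"]))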

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range of [0, 1], the score for a particular author-author pair was calculated as follows:

ConnectedPath_norm(vi, vj) = ConnectedPath(vi, vj) / ConnectedPath_max    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 do not occur in the neighborhood of their correspondents anymore and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
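A minimal sketch of the neighborhood comparison, with the link network represented as a plain mapping from each author to the set of direct correspondents (illustrative data, not the thesis code):

    def jaccard(network, author_a, author_b):
        """Jaccard similarity of the direct-neighbor sets of two authors."""
        na, nb = network[author_a], network[author_b]
        if not na and not nb:
            return 0.0
        return len(na & nb) / len(na | nb)

    network = {
        "alice@enron.com": {"bob@enron.com", "carol@enron.com"},
        "a.lice@enron.com": {"bob@enron.com", "dave@enron.com"},
    }
    print(jaccard(network, "alice@enron.com", "a.lice@enron.com"))  # 1/3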

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of writing to subordinates. Moreover, Feiguina and Hirst [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3 is calculated using


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters, e.g. ~ $ ^ & - _ = + > < [ ] |
54          Total number of words (M)
55          Total number of short words (less than four characters) / M
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of the 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation marks (e.g. ' ")
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.


5 times 5-fold cross-validation for each authorship SVM. The highest scoring combination of parameters is then chosen to train the actual SVM model.
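In scikit-learn terms, the described search could be sketched roughly as follows (the thesis itself used SVM.NET; X_train and y_train below are placeholder data for illustration):

    import numpy as np
    from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(40, 5))          # placeholder feature vectors
    y_train = np.repeat([0, 1], 20)             # placeholder class labels

    param_grid = {
        "C": [2.0 ** e for e in range(-5, 16, 2)],      # 2^-5, 2^-3, ..., 2^15
        "gamma": [2.0 ** e for e in range(-15, 4, 2)],  # 2^-15, 2^-13, ..., 2^3
    }

    # 5 times 5-fold cross-validation, as described above.
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv)
    search.fit(X_train, y_train)
    print(search.best_params_)   # best (C, gamma) combination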

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C#-conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.
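The balanced one-versus-all training described in the last two paragraphs could be sketched as follows (again in Python rather than the C# of SVM.NET; emails_by_author and make_features are assumed inputs, and the tuned C and γ values are left abstract):

    import random
    from sklearn.svm import SVC

    def train_author_svms(emails_by_author, make_features):
        """Train one binary SVM per author: the author's own emails form the
        positive class, and an equal number of randomly drawn emails from all
        other authors form the negative class."""
        models = {}
        for author, own_emails in emails_by_author.items():
            pool = [e for a, es in emails_by_author.items() if a != author for e in es]
            negatives = random.sample(pool, len(own_emails))
            X = [make_features(e) for e in own_emails + negatives]
            y = [1] * len(own_emails) + [0] * len(negatives)
            # probability=True so each SVM can assign a probability to a new text
            models[author] = SVC(kernel="rbf", probability=True).fit(X, y)
        return models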

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

Figure 3.6: The structure of the combined approach.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
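The voting stage can be sketched as a small binary classifier over three-dimensional score vectors (all values below are invented for illustration; the actual feature order and scores are not specified here):

    import random
    from sklearn.svm import SVC

    # Each instance: [jaro_winkler, link_similarity, authorship_svm_score]
    positives = [[0.95, 0.40, 0.81], [0.31, 0.55, 0.77]]   # labeled aliases
    negatives = [[0.42, 0.05, 0.10], [0.50, 0.02, 0.33], [0.12, 0.00, 0.08],
                 [0.60, 0.10, 0.21], [0.25, 0.07, 0.15], [0.33, 0.01, 0.05],
                 [0.48, 0.03, 0.28], [0.20, 0.09, 0.12], [0.15, 0.04, 0.19],
                 [0.37, 0.06, 0.24]]                        # labeled non-aliases

    # Keep the 1:5 ratio of positive to negative instances described above.
    sampled = random.sample(negatives, 5 * len(positives))
    X = positives + sampled
    y = [1] * len(positives) + [0] * len(sampled)

    voting_svm = SVC(kernel="rbf").fit(X, y)
    print(voting_svm.decision_function([[0.91, 0.35, 0.70]]))  # score to threshold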

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this section. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at a decision threshold of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on a decision threshold ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best results of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Figure: four panels, each plotting precision, recall and F1 (y-axis, 0-1.2) against the decision threshold (x-axis, 0-1): (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set.

[Figure: two panels, each plotting precision, recall and F1 against the decision threshold: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set.


[Figure: four panels, each plotting precision, recall and F1 against the decision threshold: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set.

[Figure: two panels, each plotting precision, recall and F1 against the decision threshold: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set.


[Figure: overview of the best F1-scores per technique on the mixed test set.]

Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


[Figure: overview of the best F1-scores per technique on the hard test set.]

Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that using a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If there was no Connected Path score returned for a particular author-alias pair, the alias had to be counted as a false negative,


thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. Because of the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.

47

How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases; such a data set could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67-75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48-57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9-17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288-293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77-102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et evolution. Statistique et informatique appliquees a l'etude des textes a partir des donnees du Tresor de la langue francaise. Le Vocabulaire des grands ecrivains francais, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267-287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27-47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1-39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290-294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73-78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273-297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265-292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278-285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889-1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3-6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289-1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486-509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, page 611, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honore, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172-177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453-476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56-64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42-S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491-498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1-11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137-142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69-72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659-660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83-94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019-1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019-1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513-520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schutze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237-246.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97-105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39-41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267-270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89-99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101-141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482-491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45-72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249-252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265-269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538-556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425-442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17-24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1-15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology Proceedings, 3689:174-189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378-393.


Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, I, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



Information Gain is a traditional feature selection method that is defined as the difference in entropy before and after a particular feature is removed. Entropy is defined as

Entropy = − Σ_{x∈X} P(x) log P(x)    (2.9)

where X represents the set of classes that are being predicted. The subset of features with the highest information gain explains the data best and can be used instead of the full feature set.
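A small sketch of the computation, for a binary feature over toy data (illustrative, plain Python):

    from collections import Counter
    from math import log2

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def information_gain(labels, feature_values):
        """Entropy of the class labels minus the weighted entropy after
        splitting on a feature."""
        total = len(labels)
        after = 0.0
        for v in set(feature_values):
            subset = [l for l, f in zip(labels, feature_values) if f == v]
            after += (len(subset) / total) * entropy(subset)
        return entropy(labels) - after

    # Toy data: the feature perfectly separates the two authors.
    print(information_gain(["A", "A", "B", "B"], [1, 1, 0, 0]))  # 1.0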

Bi-Normal Separation is a method that measures the difference in z-score of the true positive and false positive rates, where the z-score indicates how many standard deviations a particular value deviates from the mean. Forman [22] states that "The metric measures the horizontal separation between two standard Normal curves where their relative position is uniquely prescribed by tpr [true positive rate] and fpr [false positive rate]".

Instead of removing features, features can also be combined in order to reduce the dimensionality of the feature set. A good example of this is Principal Component Analysis (PCA), which transforms the feature set into a set of linearly uncorrelated components, called principal components. The number of principal components is always less than the original number of features. Using PCA, a combination of the original feature set is created that accounts for a maximum amount of variance in the feature set, whilst reducing the dimensionality. For example, Tearle et al. [65] use PCA to create linear combinations of features that explain 95% of the variation in the data. Hence, they reduce the computational complexity of the training stage of their authorship attribution system. In contrast, Binongo [4] uses a fixed number of principal components to represent the feature set. He uses PCA to reduce a 50-dimensional feature set to a mere 2-dimensional feature set. With this 2-dimensional feature set he still manages to assert with confidence that "The Royal Book of Oz" has been written by Ruth Plumly Thompson; Lyman Frank Baum was usually credited as the author, since he had written the previous 14 books in the series.
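The 95%-of-variance variant that Tearle et al. apply could be sketched in scikit-learn as follows (an illustration, not their implementation; the data is a random placeholder):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))   # placeholder: 100 texts, 50 style features

    # Keep as many principal components as needed to explain 95% of the variance.
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape, pca.explained_variance_ratio_.sum())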

2.2.4 Techniques

A major design choice in every authorship attribution system is the authorship attribution technique that will be used. A very common distinction that is made in Machine Learning is the one between supervised and unsupervised techniques. The same distinction will also be used here to discuss the different Machine Learning techniques that are used for authorship determination.

Unsupervised techniques

Several approaches to authorship attribution use a predefined similarity measure between two documents or between two author profiles. For a given anonymous document, its similarity to the documents of other authors is calculated, and the author with the largest similarity to the anonymous document is considered the real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations V(s) and V(t), the cosine similarity is defined as

Cosine(s, t) = (V(s) · V(t)) / (|V(s)| |V(t)|)    (2.10)

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering the fact that the number of candidate authors is 10000. In later research by Koppel et al. [40], they report that 46% of 1000 blog extracts are classified correctly using only the cosine similarity.
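The similarity itself is a one-liner over the two document vectors; a numpy sketch of equation 2.10 (illustrative vectors):

    import numpy as np

    def cosine(v_s, v_t):
        """Cosine similarity of two document vectors (equation 2.10)."""
        return np.dot(v_s, v_t) / (np.linalg.norm(v_s) * np.linalg.norm(v_t))

    print(cosine(np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 0.0])))  # 1.0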

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style, using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural and domain-specific features from the texts of candidate authors and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1: [0, 0.25], A2: [0.25, 0.50], A3: [0.50, 0.75] and A4: [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint, to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.
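The discretization step described above can be sketched as follows (illustrative; four equal-width intervals per feature, matching the A1-A4 example):

    def discretize(value, intervals=4):
        """Map a feature value in [0, 1] to a one-hot interval indicator."""
        index = min(int(value * intervals), intervals - 1)
        return tuple(1 if i == index else 0 for i in range(intervals))

    print(discretize(0.6))   # (0, 0, 1, 0): the value falls in A3 = [0.50, 0.75]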

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in Figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

Figure 2.1: The structure of a supervised authorship attribution system.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely to that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered the study that originated the field of Authorship Attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author. His most important finding was that word length distributions tend to remain the same across different works for a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejects the claim of some literary scholars that Shakespeare and Bacon are the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers, a series of political essays of which twelve were of disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naïve Bayes probabilistic model to the frequency of these function words and found that all disputed documents were written by Madison. The Naïve Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, ..., x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the original document can be expressed by

P(A_i | x_1, ..., x_n) ∝ P(x_1, ..., x_n | A_i) P(A_i)    (2.11)

The real author is then calculated using

A* = arg max_{A_i} P(A_i | x_1, ..., x_n)    (2.12)
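As a hedged illustration of equations 2.11 and 2.12, the following sketch applies scikit-learn's multinomial Naïve Bayes to toy function-word counts; the counts and labels are invented for the example and are not the Federalist data:

```python
from sklearn.naive_bayes import MultinomialNB

# Toy function-word counts per training text (one row per text).
X_train = [[12, 3, 7], [10, 4, 6],   # texts by Hamilton
           [4, 11, 2], [5, 9, 3]]    # texts by Madison
y_train = ["Hamilton", "Hamilton", "Madison", "Madison"]

model = MultinomialNB().fit(X_train, y_train)
disputed = [[5, 10, 2]]
print(model.predict(disputed))        # A* = arg max P(A_i | x_1..x_n), eq. 2.12
print(model.predict_proba(disputed))  # posterior probability per candidate
```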

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words from a text against a reference corpus (a large contemporary corpus of which the mean and standard deviation of these 30 words are computed) and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
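A minimal sketch of Burrows' Delta, assuming relative word frequencies and reference-corpus statistics for the same set of frequent words (all numbers below are invented):

```python
import numpy as np

def burrows_delta(known, unknown, corpus_mean, corpus_std):
    """Mean absolute difference of z-scores over a set of frequent words."""
    z_known = (known - corpus_mean) / corpus_std
    z_unknown = (unknown - corpus_mean) / corpus_std
    return float(np.mean(np.abs(z_known - z_unknown)))

corpus_mean = np.array([0.050, 0.030, 0.020])  # reference-corpus means
corpus_std = np.array([0.010, 0.008, 0.005])   # reference-corpus deviations
candidate = np.array([0.055, 0.028, 0.022])    # known text of one candidate
disputed = np.array([0.057, 0.027, 0.023])     # text of unknown authorship
print(burrows_delta(candidate, disputed, corpus_mean, corpus_std))
```

The candidate with the lowest Delta over all compared authors is proposed as the author of the disputed text.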

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the leaf nodes.

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier were tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.
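A small sketch of such a classifier using scikit-learn, with invented function-word frequencies; setting criterion="entropy" makes the tree split on information gain, as described above:

```python
from sklearn.tree import DecisionTreeClassifier

# One row of function-word frequencies per text, labeled with its author.
X_train = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7], [0.1, 0.9]]
y_train = ["author A", "author A", "author B", "author B"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)
print(tree.predict([[0.85, 0.15]]))  # -> ['author A']
```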

Figure 2.3: An example of linear separation using a maximum margin. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the largest possible margin. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in a very fast classification. An example of how SVM maximizes the margin of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and separates the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].
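The kernel choice can be illustrated with a short scikit-learn sketch on invented two-dimensional data; each kernel produces a different decision boundary:

```python
from sklearn.svm import SVC

X = [[0.2, 0.1], [0.3, 0.2], [0.8, 0.9], [0.7, 0.8]]  # toy style vectors
y = [0, 0, 1, 1]                                       # two classes
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.predict([[0.25, 0.15]]),
          "support vectors:", len(clf.support_vectors_))
```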


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations of binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-versus-all using a winner-takes-all strategy, one-versus-one using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training and validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

A fairly recent and novel technique for attributing authorship is Unmasking by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially good at preventing false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.
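The iterative feature-elimination loop at the heart of Unmasking can be sketched as follows; this is a simplified reading of the procedure, not Koppel et al.'s original code:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def unmasking_curve(X, y, iterations=8, k=3, folds=5):
    """Cross-validation accuracy after repeatedly removing the k strongest
    positive and k strongest negative features; a steep drop in the curve
    suggests that the two text sets share an author."""
    active = np.arange(X.shape[1])
    curve = []
    for _ in range(iterations):
        accuracy = cross_val_score(LinearSVC(), X[:, active], y, cv=folds).mean()
        curve.append(accuracy)
        weights = LinearSVC().fit(X[:, active], y).coef_[0]
        order = np.argsort(weights)
        drop = np.concatenate([order[:k], order[-k:]])  # strongest +/- weights
        active = np.delete(active, drop)
    return curve
```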

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then an edge e_{v_i v_j} ∈ W if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i v_j} ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that A and B are somehow related. The co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

Co-citation(v_i, v_j) = |N(v_i) ∩ N(v_j)|    (2.13)

In Graph Theory this is often known as the shared-neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

Jaccard(v_i, v_j) = |N(v_i) ∩ N(v_j)| / |N(v_i) ∪ N(v_j)|    (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random in predicting where new links will form. However, it sometimes performs only on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
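Equation 2.14 translates directly into code; a minimal sketch over sets of correspondents:

```python
def jaccard(neighbors_i, neighbors_j):
    """Jaccard similarity of two neighborhoods N(v_i) and N(v_j) (eq. 2.14)."""
    union = neighbors_i | neighbors_j
    return len(neighbors_i & neighbors_j) / len(union) if union else 0.0

# Two addresses sharing two of four distinct correspondents (toy data).
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 2 / 4 = 0.5
```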

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors and are denoted by |I(v)| and |O(v)| respectively. An individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

SimRank(v_i, v_j) = C / (|I(v_i)| |I(v_j)|) · Σ_{x=1..|I(v_i)|} Σ_{y=1..|I(v_j)|} SimRank(I_x(v_i), I_y(v_j))    (2.15)

where C is a constant between 0 and 1. In practice the equation can be solved by iteration to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
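A compact fixed-point iteration of equation 2.15 on a toy directed graph (plain Python, illustrative only):

```python
def simrank(in_neighbors, C=0.8, iterations=10):
    """Iterate eq. 2.15 to a fixed point; in_neighbors maps each vertex
    to the set of its in-going neighbors I(v)."""
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        updated = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    updated[(a, b)] = 1.0
                elif in_neighbors[a] and in_neighbors[b]:
                    total = sum(sim[(x, y)] for x in in_neighbors[a]
                                            for y in in_neighbors[b])
                    updated[(a, b)] = (C * total /
                                       (len(in_neighbors[a]) * len(in_neighbors[b])))
                else:
                    updated[(a, b)] = 0.0
        sim = updated
    return sim

graph = {"u": set(), "a": {"u"}, "b": {"u"}}  # a and b share in-neighbor u
print(simrank(graph)[("a", "b")])             # 0.8
```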

PageSim [42] is another extension of the co-citation algorithm; it assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

ConnectedPath(v_i, v_j) = Σ_{p ∈ PATH(v_i, v_j, r)} U(p) / length(p)    (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length at most r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

U(p) = Σ_{v_x ∈ path(v_i, v_j), v_x ∉ {v_i, v_j}} UQ(v_x)    (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = (|w_{x,x-1}| + |w_{x,x+1}|) / Σ_{∀v_g ∈ V} |w_{x,g}|    (2.18)

where w_{x,g} denotes an edge between v_x ∈ path(v_i, v_j) and any other vertex v_g ∈ V, and w_{x,x-1} and w_{x,x+1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
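The following sketch implements equations 2.16-2.18 for an unweighted graph, where every edge weight is 1 and UQ(v_x) therefore reduces to 2/degree(v_x); paths are enumerated up to length r, and a direct edge (with no intermediate vertices) contributes nothing:

```python
def connected_path(adj, vi, vj, r=3):
    """Sum U(p)/length(p) over all simple paths of length <= r (eq. 2.16)."""
    score = 0.0

    def extend(path):
        nonlocal score
        last = path[-1]
        if last == vj:
            inner = path[1:-1]  # intermediate vertices of the path
            if inner:
                uniqueness = sum(2.0 / len(adj[v]) for v in inner)  # eqs. 2.17, 2.18
                score += uniqueness / (len(path) - 1)
            return
        if len(path) - 1 >= r:
            return
        for nxt in adj[last] - set(path):
            extend(path + [nxt])

    extend([vi])
    return score

# v_i and v_j connected via two intermediaries x and y (toy network).
adj = {"vi": {"x", "y"}, "x": {"vi", "vj"}, "y": {"vi", "vj"}, "vj": {"x", "y"}}
print(connected_path(adj, "vi", "vj"))  # 0.5 + 0.5 = 1.0
```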


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = α·s_i + β·s_j + γ·s_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j, k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in section 2.3 to generate normalized pair-wise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.
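The linear combination itself is a one-liner; the weights below are illustrative only, not values taken from the literature:

```python
def combined_score(s_string, s_content, s_link, alpha=0.5, beta=0.3, gamma=0.2):
    """f(x) = alpha*s_i + beta*s_j + gamma*s_k over normalized scores."""
    return alpha * s_string + beta * s_content + gamma * s_link

print(combined_score(0.9, 0.6, 0.4))  # 0.71
```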

Another approach is to create a feature vector consisting of the scores assigned by the different techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, one that can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                correct alias           false alias
retrieved       true positives (tp)     false positives (fp)
not retrieved   false negatives (fn)    true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution it is common to construct a contingency table such as the one shown in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = correct classifications / total number of classifications = (tp + tn) / (tp + fp + fn + tn)    (2.19)

Although it looks like a good measure of performance, it is not that hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as

P = |retrieved aliases ∩ correct aliases| / |retrieved aliases| = tp / (tp + fp)    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

R = |retrieved aliases ∩ correct aliases| / |total correct aliases| = tp / (tp + fn)    (2.21)

These two measures are not as dependent on the class distributions as the accuracy measure; therefore, they are a more sensible choice in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user who wants to automate the complete process and rely greatly on the classification given by the system will favor precision over recall.


Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = 1 / (α·(1/P) + (1 - α)·(1/R))    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as

F1 = 2 · precision · recall / (precision + recall)    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive: since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].
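Equations 2.19 to 2.23 translate directly into code; a short sketch with invented counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and balanced F1 from the contingency table."""
    precision = tp / (tp + fp)                          # eq. 2.20
    recall = tp / (tp + fn)                             # eq. 2.21
    f1 = 2 * precision * recall / (precision + recall)  # eq. 2.23
    return precision, recall, f1

print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.667, 0.727)
```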

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46] and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen (2006a) found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not very sensitive to the problem of overfitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. This chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; for the text stored in attachments this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08
Subject: SSN requirement

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well-known version of the data set was made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com" and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step   Records affected   Percentage removed (cum.)

1      17,052             6.70
3      13,681             12.00
4      26,223             22.50
5      4,001              24.00
6      25,990             34.00
7      3,700              35.80
8      52,163             56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained 10 or fewer words were removed, since they contained too little useful information.

5. Authors that had written a total number of 100 or fewer words were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.
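A pandas sketch of steps 4-6, assuming a DataFrame with one row per message and the (hypothetical) column names sender, receiver, subject, body and send_date:

```python
import pandas as pd

def preprocess(emails: pd.DataFrame) -> pd.DataFrame:
    emails = emails.assign(n_words=emails["body"].str.split().str.len())
    emails = emails[emails["n_words"] > 10]                    # step 4
    totals = emails.groupby("sender")["n_words"].transform("sum")
    emails = emails[totals > 100]                              # step 5
    return emails.drop_duplicates(                             # step 6
        subset=["sender", "receiver", "body", "send_date", "subject"])
```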

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of 80 or fewer emails were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consisted of 44,912 emails by 246 different authors. For each message the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors.


Figure 3.2: Averages of 10×10-fold cross-validation accuracy for different training set sizes, using linear and RBF kernels for the authorship SVM (x-axis: number of training instances per class; y-axis: cross-validation accuracy).

Figure 3.4 provides an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors have written a total number of words between 10,000 and 100,000.

In addition to these statistical measures, a network graph has been created of all the authors in the final data set; figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it represents the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of 200 or more emails were selected from the data set, and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


Figure 3.3: The distribution of email messages per author (x-axis: number of emails; y-axis: number of authors).

Figure 3.4: The distribution of the total number of words per author (x-axis: total number of words; y-axis: number of authors).

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of alias                       Number of authors

High Jaro-Winkler with 1 alias      26
High Jaro-Winkler with 2 aliases    15
Low Jaro-Winkler with 1 alias       11
Low Jaro-Winkler with 2 aliases     1
No alias                            193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed   Hard

High Jaro-Winkler    6       2
Low Jaro-Winkler     8       16
No alias             6       2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.comA & john.doe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
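For reference, a self-contained implementation of the Jaro and Jaro-Winkler similarities (a standard formulation, not necessarily identical to the library used in the experiments):

```python
def jaro(s1, s2):
    """Jaro similarity of two strings."""
    if s1 == s2:
        return 1.0
    window = max(0, max(len(s1), len(s2)) // 2 - 1)
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):  # count characters matching within the window
        for j in range(max(0, i - window), min(i + window + 1, len(s2))):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len(s1)):    # count out-of-order matched characters
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    m = matches
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3.0

def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)

print(jaro_winkler("john.doe@enron.com", "johndoe@enron.com"))
```

A pair of addresses is then flagged as a candidate alias whenever this score exceeds the chosen decision threshold.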

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range of [0, 1], the normalized score for a particular author-author pair was calculated as follows:

ConnectedPath_norm(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using


Features   Description

Lexical
1          Total number of characters (C)
2          Total number of alphabetic characters / C
3          Total number of upper-case characters / C
4          Total number of digit characters / C
5          Total number of white-space characters / C
6          Total number of tab spaces / C
7-32       Frequency of letters A-Z
33-53      Frequency of special characters ~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |
54         Total number of words (M)
55         Total number of short words (less than four characters) / M
56         Total number of characters in words / C
57         Average word length
58         Average sentence length (in characters)
59         Average sentence length (in words)
60         Total different words / M
61         Hapax legomena: frequency of once-occurring words
62         Hapax dislegomena: frequency of twice-occurring words
63-82      Word length frequency distribution / M
83-333     TF-IDF of the 250 most frequent 3-grams

Syntactic
334-341    Frequency of punctuation , . ? ! : ; ' "
342-491    Frequency of function words

Structural
492        Total number of sentences

Table 3.4: Feature set for the authorship SVM.


5×5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.
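A scikit-learn analogue of this grid search (the names features and labels are placeholders for the per-author training data):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": 2.0 ** np.arange(-5, 16, 2),       # 2^-5 .. 2^15
              "gamma": 2.0 ** np.arange(-15, 4, 2)}   # 2^-15 .. 2^3
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(features, labels); best_model = search.best_estimator_
```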

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.
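In scikit-learn terms (as an analogue of the SVM.NET setup used here, with placeholder names style_vectors and author_labels), the one-versus-all scheme looks as follows:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# One binary RBF-SVM per author; probability estimates are later used
# as inputs to the voting SVM of section 3.3.
one_vs_all = OneVsRestClassifier(SVC(kernel="rbf", probability=True))
# one_vs_all.fit(style_vectors, author_labels)
# one_vs_all.predict_proba(unknown_vectors)   # one probability per author
```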

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single- and multi-class problems using different kernels and parameters.

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content.

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content.

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction of whether the input author is an alias or not. An overview of the combined approach can be found in figure 3.6.

Figure 3.6: The structure of the combined approach.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and five times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from table 3.3 are used to determine the precision and recall for various decision thresholds.
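A minimal sketch of such a voting SVM, with invented score vectors of the form [Jaro-Winkler, Jaccard (or Connected Path), authorship-SVM probability]:

```python
from sklearn.svm import SVC

train_scores = [[0.96, 0.55, 0.81],   # labeled alias
                [0.41, 0.02, 0.22],   # labeled non-alias
                [0.52, 0.63, 0.77],   # labeled alias
                [0.88, 0.04, 0.18],   # labeled non-alias
                [0.61, 0.70, 0.85],   # labeled alias
                [0.35, 0.01, 0.15]]   # labeled non-alias
train_labels = [1, 0, 1, 0, 1, 0]

voting_svm = SVC().fit(train_scores, train_labels)
candidate = [[0.90, 0.50, 0.75]]
print(voting_svm.predict(candidate))            # 1 = predicted alias
print(voting_svm.decision_function(candidate))  # score to threshold on
```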


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this section. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas


JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the mixed test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the mixed test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the hard test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the hard test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler. The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that using a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path. It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2; it is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard. Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM. The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques. The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often constructed manually, the results are not that good and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less


sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the way through: Testing for authorship in different frequency strata. Literary and Linguistic Computing, 22(1):27–47.


[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A comparison of personal name matching: Techniques and practical issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron/

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which is the best multiclass SVM method? An empirical study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp

[22] Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.


[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611–617, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias detection in link data sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: A measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto meta-classifier for authorship identification. In Notebook for PAN at CLEF 2011.


[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conference on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship attribution with thousands of candidate authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: A novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, NY, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship attribution and verification with many authors and limited data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The characteristic curves of composition. Science, 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A mechanical solution of a literary problem. Popular Science Monthly, 60(2):97–105.


[49] Miller, G. A. (1995). WordNet: A lexical database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. U.S. Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron email dataset: Database schema and brief statistical report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word frequency distributions and type-token characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An empirical study of category skew on feature selection for text categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, pages 265–269.


[62] Solorio, T. and Pillay, S. (2011). Authorship identification with modality specific meta features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The state of record linkage and current research problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and scalable authorship attribution using function words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.


Appendix

List of function words used in the authorship SVM:

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



… real author. For example, Koppel et al. [39] construct a tf-idf and a stylistic vector representation of 10,000 short blog extracts and compare them to each other using the well-known cosine similarity. For two documents s and t and their vector representations $\vec{V}(s)$ and $\vec{V}(t)$, the cosine similarity is defined as

$$\mathrm{Cosine}(s, t) = \frac{\vec{V}(s) \cdot \vec{V}(t)}{|\vec{V}(s)|\,|\vec{V}(t)|} \tag{2.10}$$

Over 20% of the extracts were found to be most similar to their actual author, which is quite promising considering that the number of candidate authors is 10,000. In later research, Koppel et al. [40] report that 46% of 1,000 blog extracts are classified correctly using only the cosine similarity.
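To make equation 2.10 concrete, the following minimal Python sketch computes the cosine similarity between two sparse term-weight vectors; the toy documents, and the use of raw term frequencies in place of tf-idf weights, are illustrative assumptions rather than Koppel et al.'s actual setup.

```python
import math
from collections import Counter

def cosine(v_s, v_t):
    """Cosine similarity between two sparse term-weight vectors (eq. 2.10)."""
    # Dot product over the terms the two vectors have in common
    dot = sum(w * v_t[term] for term, w in v_s.items() if term in v_t)
    norm_s = math.sqrt(sum(w * w for w in v_s.values()))
    norm_t = math.sqrt(sum(w * w for w in v_t.values()))
    if norm_s == 0 or norm_t == 0:
        return 0.0
    return dot / (norm_s * norm_t)

# Toy example: raw term frequencies standing in for tf-idf weights
doc_s = Counter("the market moved and the price moved with it".split())
doc_t = Counter("the price of the market".split())
print(round(cosine(doc_s, doc_t), 3))
```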

Instead of comparing one document against all other documents, it is also possible to use these measures to create clusters of similar documents. Using simple k-nearest-neighbor clustering, each document is assigned to the majority vote of its k metrically nearest neighbors. This results in several clusters of documents that have high intra-cluster similarity and low inter-cluster similarity. Based on these clusters, it can be concluded that the documents in one cluster have been written by a single author, possibly under different aliases. Iqbal et al. [29] use a similar approach to cluster emails by their writing style using k-means clustering, in which emails are recursively assigned to the cluster with the nearest mean, based on a given set of k initial means. The similarity measure that is used to compare two documents is called Writeprint, as will be explained below. They achieve an F-score of 0.90 for a data set with 5 authors and 40 messages per author. However, the performance decreases significantly when the number of authors or the number of messages per author increases, indicating that the technique has scalability problems.

Figure 2.1: The structure of a supervised authorship attribution system.

Writeprint Mining is a recent technique developed by Iqbal et al. [30, 31]. They extract a combination of lexical, syntactic, structural and domain-specific features from the texts of candidate authors and discretize each feature into a set of intervals. For example, a single feature having a range of [0, 1] can be discretized into four intervals: A1 [0, 0.25], A2 [0.25, 0.50], A3 [0.50, 0.75] and A4 [0.75, 1]. An email with a value of 0.6 for feature A will then be represented as (0, 0, 1, 0). After converting a text into a feature vector, they mine the feature set for frequent patterns. A pattern is defined by a subset of features, and a frequent pattern is one that exceeds a certain minimum support threshold. By recursively combining features into patterns and by deleting the ones that do not meet minimum support, they generate a set of frequent patterns for each author. Next, they remove all the frequent patterns that occur for more than one author, in order to derive a so-called "writeprint" for each author, containing combinations of features that are unique for that particular author. By counting the number of frequent patterns that occur in both the unknown document's writeprint and the author's writeprint, they can attribute the anonymous document to the author with the highest count of common frequent patterns. Moreover, it is possible to assign a weight to each frequent pattern in the writeprint to make some patterns more important than others. The method achieves an accuracy of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.
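The discretization step described above is easy to illustrate. The sketch below (Python, with equal-width intervals as an assumption) maps a normalized feature value onto the one-hot interval representation that serves as input for the frequent-pattern mining.

```python
def discretize(value, bins=4):
    """Map a feature value in [0, 1] onto a one-hot vector over equal-width intervals."""
    index = min(int(value * bins), bins - 1)  # a value of exactly 1.0 falls in the last interval
    return [1 if i == index else 0 for i in range(bins)]

# A feature value of 0.6 falls in interval A3 = [0.50, 0.75]
print(discretize(0.6))  # -> [0, 0, 1, 0]
```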

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in Figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution matches most closely that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered the study that originated the field of authorship attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author. His most important finding was that word length distributions tend to remain the same across different works for a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejected the claim of some literary scholars that Shakespeare and Bacon were the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers: a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naïve Bayes probabilistic model to the frequency of these function words and found that all documents were written by Madison. The Naïve Bayes model quantifies the idea by Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, ..., x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the original document can be expressed by

$$P(A_i \mid x_1, \ldots, x_n) \propto P(x_1, \ldots, x_n \mid A_i)\, P(A_i) \tag{2.11}$$

The real author is then calculated using

$$A^{*} = \arg\max_{A_i} P(A_i \mid x_1, \ldots, x_n) \tag{2.12}$$
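A minimal sketch of equations 2.11 and 2.12 in Python, computed in log space with Laplace smoothing; the function-word counts and the independence assumption over individual word occurrences are illustrative simplifications, not Mosteller and Wallace's exact model.

```python
import math
from collections import Counter

def most_likely_author(words, counts_per_author, prior):
    """argmax over authors of log P(A_i) + sum_i log P(x_i | A_i) (eqs. 2.11-2.12)."""
    best_author, best_score = None, float("-inf")
    for author, counts in counts_per_author.items():
        total, vocab = sum(counts.values()), len(counts)
        score = math.log(prior[author])
        for word in words:
            # Laplace smoothing keeps unseen words from zeroing the product
            score += math.log((counts[word] + 1) / (total + vocab))
        if score > best_score:
            best_author, best_score = author, score
    return best_author

# Hypothetical function-word counts in the known writings of two authors
counts = {"Hamilton": Counter({"upon": 30, "whilst": 1}),
          "Madison": Counter({"upon": 2, "whilst": 14})}
prior = {"Hamilton": 0.5, "Madison": 0.5}
print(most_likely_author(["upon", "upon", "whilst"], counts, prior))
```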

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words from a text against a reference corpus (a large contemporary corpus over which the mean and standard deviation of these 30 words are computed) and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
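The Delta computation itself is straightforward, as the following sketch shows; the z-scores are assumed to have been computed against a reference corpus beforehand, and the toy reference statistics are hypothetical values for illustration.

```python
def z_scores(rel_freqs, ref_mean, ref_std):
    """z-score of each word's relative frequency against a reference corpus."""
    return {w: (rel_freqs.get(w, 0.0) - ref_mean[w]) / ref_std[w] for w in ref_mean}

def delta(z_target, z_candidate):
    """Burrows' Delta: mean absolute difference of z-scores; lower means more similar."""
    words = z_target.keys() & z_candidate.keys()
    return sum(abs(z_target[w] - z_candidate[w]) for w in words) / len(words)

# Toy reference statistics for two frequent words
ref_mean, ref_std = {"the": 0.060, "of": 0.030}, {"the": 0.010, "of": 0.008}
z_unknown = z_scores({"the": 0.055, "of": 0.041}, ref_mean, ref_std)
z_author = z_scores({"the": 0.052, "of": 0.043}, ref_mean, ref_std)
print(round(delta(z_unknown, z_author), 3))
```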

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e., the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in Figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier have been tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

A commonly used method for supervised classification is the Support Vector Machine (SVM). The SVM is a supervised classification method that deals with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest potential margin of error. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in very fast classification. An example of how an SVM maximizes the margin of error of the hyperplane that separates two linearly separable classes can be seen in Figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin of error. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and on what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and separates the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations of the binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-all using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines: for example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVMs for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.
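As an illustration of the one-vs-all strategy, the following sketch trains one linear SVM per author on tf-idf features using scikit-learn; the toy corpus is hypothetical, and the library choice is an assumption made for illustration rather than the implementation used in this thesis (which relies on LIBSVM [10] and SVM.NET [35]).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: email bodies labeled with their authors
emails = ["please review the attached draft agreement",
          "the meeting is moved to friday morning",
          "the draft looks fine please send it out today",
          "can we move the call to monday afternoon"]
authors = ["alice", "alice", "bob", "bob"]

# One linear SVM per author, each trained author-vs-rest on tf-idf features
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(emails, authors)
print(model.predict(["please review the draft before the meeting"]))
```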

Another machine learning technique that can be used for authorship attribution is the Artificial Neural Network (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in Figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training & validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of the features that is may vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest-weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e., to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved with as little as 100 words per text.
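A sketch of the unmasking loop, under the assumption of a dense feature matrix X (rows are text chunks, columns are features) and binary labels y; the choice of scikit-learn, the value of k and the number of rounds are all illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def unmasking_curve(X, y, rounds=10, k=3):
    """Record cross-validation accuracy while repeatedly dropping the k
    strongest positive and k strongest negative features; a steep drop
    suggests both text sets come from the same author."""
    X = X.copy()
    accuracies = []
    for _ in range(rounds):
        accuracies.append(cross_val_score(LinearSVC(), X, y, cv=5).mean())
        weights = LinearSVC().fit(X, y).coef_[0]
        for idx in np.argsort(weights)[-k:]:  # most positive weights
            X[:, idx] = 0
        for idx in np.argsort(weights)[:k]:   # most negative weights
            X[:, idx] = 0
    return accuracies
```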

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely link analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then an edge e_{v_i v_j} ∈ W if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i v_j} ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.
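Building the neighborhood structure N(v) from a table of messages takes only a few lines; a sketch with a hypothetical list of (sender, recipient) pairs:

```python
from collections import defaultdict

# Hypothetical (sender, recipient) pairs extracted from the email table
messages = [("alice@enron.com", "bob@enron.com"),
            ("bob@enron.com", "carol@enron.com"),
            ("alice@enron.com", "carol@enron.com")]

# N(v): map every vertex to the set of its neighbors (undirected view)
neighbors = defaultdict(set)
for sender, recipient in messages:
    neighbors[sender].add(recipient)
    neighbors[recipient].add(sender)

print(sorted(neighbors["carol@enron.com"]))
```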

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. The co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

$$\text{Co-citation}(v_i, v_j) = |N(v_i) \cap N(v_j)| \tag{2.13}$$

In graph theory this is often known as the shared-neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases, because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

$$\text{Jaccard}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|} \tag{2.14}$$

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random guessing in predicting where new links will form. However, it sometimes performs only on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
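Given a neighborhood map like the one built in the earlier sketch, equation 2.14 reduces to a few lines; the toy neighborhoods below are hypothetical.

```python
def jaccard(neighbors, u, v):
    """Jaccard similarity of the neighborhoods of vertices u and v (eq. 2.14)."""
    union = neighbors[u] | neighbors[v]
    if not union:
        return 0.0
    return len(neighbors[u] & neighbors[v]) / len(union)

# Toy neighborhoods: alice and bob share the single neighbor carol
neighbors = {"alice": {"bob", "carol"},
             "bob": {"alice", "carol"},
             "carol": {"alice", "bob"}}
print(round(jaccard(neighbors, "alice", "bob"), 3))  # -> 0.333
```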

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively. An individual neighbor is denoted as I_x(v) or O_x(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

$$\text{SimRank}(v_i, v_j) = \frac{C}{|I(v_i)|\,|I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} \text{SimRank}\big(I_x(v_i), I_y(v_j)\big) \tag{2.15}$$

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range


of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
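A sketch of the fixed-point iteration for equation 2.15, over a small directed graph given as a map from each vertex to its set of in-going neighbors; the value of C, the iteration count and the toy graph are illustrative choices.

```python
def simrank(in_neighbors, C=0.8, iterations=5):
    """Solve eq. 2.15 by fixed-point iteration over all vertex pairs."""
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_sim = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[(a, b)] = 1.0
                elif in_neighbors[a] and in_neighbors[b]:
                    total = sum(sim[(x, y)]
                                for x in in_neighbors[a]
                                for y in in_neighbors[b])
                    new_sim[(a, b)] = C * total / (len(in_neighbors[a]) * len(in_neighbors[b]))
                else:
                    new_sim[(a, b)] = 0.0  # no in-links, no evidence of similarity
        sim = new_sim
    return sim

# Toy directed graph: both a and b receive mail from c and d
graph = {"a": {"c", "d"}, "b": {"c", "d"}, "c": set(), "d": set()}
print(round(simrank(graph)[("a", "b")], 3))  # -> 0.4
```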

PageSim [42] is another extension of the co-citation algorithm; it assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors of the 2nd, 3rd, ..., nth degree. The two important notions are (1) that the more unique a vertex on a path between two nodes is, the more strongly it indicates a possible similarity between these two nodes, and (2) that the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

$$\text{ConnectedPath}(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{\text{length}(p)} \tag{2.16}$$

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j up to length r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

$$U(p) = \sum_{v_x \in p,\; v_x \notin \{v_i, v_j\}} UQ(v_x) \tag{2.17}$$

Here UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

$$UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|} \tag{2.18}$$

where w_{x,g} denotes an edge between v_x ∈ p and any other vertex v_g ∈ V, and w_{x,x-1} and w_{x,x+1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which were also used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path finds the most aliases.


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].
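A sketch of equations 2.16-2.18 over an undirected, weighted link network; the depth-limited search, the representation (edge weights keyed by vertex pair, with message counts as weights) and the toy graph are assumptions made for illustration. Note that paths of length 1 contribute nothing, since they contain no intermediate vertex, which matches the use of paths of length 2 and 3 at search depth 3.

```python
def connected_path(neighbors, weight, vi, vj, r=3):
    """Sum U(p)/length(p) over all simple paths of length <= r between vi and vj (eq. 2.16)."""
    # Total edge weight attached to each vertex (the denominator of eq. 2.18)
    strength = {v: sum(weight[(v, g)] for g in neighbors[v]) for v in neighbors}
    score = 0.0
    stack = [(vi, [vi])]  # depth-first enumeration of simple paths
    while stack:
        node, path = stack.pop()
        if node == vj and len(path) > 1:
            # U(p): summed uniqueness of the intermediate vertices (eqs. 2.17-2.18)
            u = sum((weight[(path[k + 1], path[k])] + weight[(path[k + 1], path[k + 2])])
                    / strength[path[k + 1]]
                    for k in range(len(path) - 2))
            score += u / (len(path) - 1)
            continue
        if len(path) - 1 < r:
            stack.extend((nb, path + [nb]) for nb in neighbors[node] if nb not in path)
    return score

# Toy network: vi and vj are linked through the shared contact m
neighbors = {"vi": {"m"}, "vj": {"m"}, "m": {"vi", "vj", "x"}, "x": {"m"}}
weight = {(a, b): 1 for a in neighbors for b in neighbors[a]}
print(round(connected_path(neighbors, weight, "vi", "vj"), 3))
```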

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = αs_i + βs_j + γs_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j and k respectively, each normalized such that it falls in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm described in section 2.3 to generate normalized pair-wise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.
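The linear weighted combination itself is a one-liner; a sketch with hypothetical normalized scores for one candidate pair (string metric, authorship SVM, link analysis) and hand-picked weights:

```python
def combine(scores, weights):
    """f(x) = alpha*s_i + beta*s_j + gamma*s_k over normalized scores in [0, 1]."""
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical scores for one author pair and manually chosen weights
print(combine(scores=[0.92, 0.40, 0.15], weights=[0.5, 0.3, 0.2]))  # -> 0.61
```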

Another approach is to create a feature vector consisting of the scores assigned by the different techniques. A weighted voting mechanism such as a Support Vector Machine can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set was manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e., the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision; the remaining aliases can then be detected by a more complicated technique, e.g., a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias            false alias
retrieved        true positives (tp)      false positives (fp)
not retrieved    false negatives (fn)     true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table such as the one shown in Table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

$$\text{Accuracy} = \frac{\text{correct classifications}}{\text{total number of classifications}} = \frac{tp + tn}{tp + fp + fn + tn} \tag{2.19}$$

Although it looks like a good measure of performance, it is not hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all examples as negative. In most cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as

$$P = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{retrieved aliases}|} = \frac{tp}{tp + fp} \tag{2.20}$$

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

$$R = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{total correct aliases}|} = \frac{tp}{tp + fn} \tag{2.21}$$

These two measures are not as dependent on the class distribution as the accuracy measure, and are therefore a more sensible choice in this situation. Moreover, having these two measures of performance makes it possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user who wants to automate the complete process and be able to rely greatly on the classification given by the system will favor precision over recall. Since the


preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} \tag{2.22}$$

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can be written simply as

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{2.23}$$

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive: since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].
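The measures follow directly from the contingency counts of Table 2.2; a minimal sketch of equations 2.20-2.23 with a hypothetical count example:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and balanced F1 from contingency counts (eqs. 2.20-2.23)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical run: 8 true aliases found, 2 false alarms, 4 aliases missed
print(precision_recall_f1(tp=8, fp=2, fn=4))  # -> (0.8, 0.667, 0.727) approximately
```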

Averaging the precision and recall scores of different test runs can be done in two ways. One is micro-averaging, where a contingency table for all the problems together is constructed and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46] and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of overfitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques with each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company Enron. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in Figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; for the text stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well-known version of the data set was made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer when organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty amp Adibi still contained somenoise as well as information not useful for this thesis Therefore after convertingit to a Microsoft SQL-database the following preprocessing steps have beenapplied to it

1 A number of system messages and junk messages that were still present inthe data were removed Among these messages where calendar remindersand messages that only contained attachments

30

Step Records affected Percentage removed (cum)

1 17052 6703 13681 12004 26223 22505 4001 24006 25990 34007 3700 35808 52163 5650

Table 31 Preprocessing steps applied to the ENRON corpus

2 Forward and reply-parts of messages have been removed

3 Empty messages resulting from the removal of forward or reply-parts instep 2 were removed

4 Messages that contained lt= 10 words were removed since they containedtoo little useful information

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number as well as the cumulative percentage of records that have been removed per step.
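As an illustration of step 6, duplicate removal amounts to keying every message on the five identifying fields and keeping the first occurrence. A minimal Python sketch follows; the field names are hypothetical, and the actual work was done in the SQL database:

def deduplicate(messages):
    # Keep one copy of every (sender, receiver, body, send date, subject) tuple.
    seen = set()
    unique = []
    for m in messages:
        key = (m["sender"], m["receiver"], m["body"], m["send_date"], m["subject"])
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique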

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances that is needed for classification using Support Vector Machines. These results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44,912 emails by 246 different authors. For each message the sender, receiver, subject, body and send-date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors.


Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels for the authorship SVM. (The plot shows 10-fold cross-validation accuracy against the number of training instances per class, 20-200, for the linear and RBF kernels.)

Figure 3.4 provides an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it corresponds to the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


Figure 3.3: The distribution of email messages per author. (The plot shows the number of authors against the number of emails, 90-230.)

Figure 3.4: The distribution of the total number of words per author. (The plot shows the number of authors against the total number of words, on a logarithmic scale.)


Figure 3.5: A network graph of the authors in the subset of the ENRON data set. (Nodes are labeled with the authors' email address prefixes; node color represents degree.)


Type of Alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No alias                             193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed   Hard
High Jaro-Winkler    6       2
Low Jaro-Winkler     8       16
No alias             6       2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. johndoe@enron.comA & johndoe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
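To make the scoring step concrete, the sketch below implements the standard Jaro-Winkler metric and applies it to every author-author pair; the thesis does not prescribe a particular implementation, so this is an illustration only:

def jaro(s, t):
    # Jaro similarity: matching characters within a sliding window,
    # corrected for transpositions.
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(len(s), len(t)) // 2 - 1
    s_flags = [False] * len(s)
    t_flags = [False] * len(t)
    matches = 0
    for i in range(len(s)):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_flags[j] and s[i] == t[j]:
                s_flags[i] = t_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_m = [c for c, f in zip(s, s_flags) if f]
    t_m = [c for c, f in zip(t, t_flags) if f]
    transpositions = sum(a != b for a, b in zip(s_m, t_m)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, p=0.1):
    # Boost the Jaro score for a common prefix of up to four characters.
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def candidate_aliases(addresses, threshold):
    # Flag every address pair whose similarity reaches the decision threshold.
    pairs = []
    for i, a in enumerate(addresses):
        for b in addresses[i + 1:]:
            if jaro_winkler(a, b) >= threshold:
                pairs.append((a, b))
    return pairs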

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was calculated as follows:

ConnectedPath(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.
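A sketch of how such a score can be computed, assuming the raw Connected Path value counts the distinct simple paths of length 2 and 3 between two authors; the precise path weighting used by Boongoen et al. [6] may differ:

def connected_path_raw(neighbors, a, b, max_depth=3):
    # Count distinct simple paths of length 2..max_depth between a and b;
    # `neighbors` maps each author to the set of authors they exchanged
    # email with. The direct edge a-b, if any, is ignored.
    count = 0
    frontier = [(a, {a})]  # (current node, nodes already on this path)
    for depth in range(1, max_depth + 1):
        nxt = []
        for node, visited in frontier:
            for n in neighbors.get(node, set()):
                if n == b:
                    if depth >= 2:
                        count += 1
                elif n not in visited:
                    nxt.append((n, visited | {n}))
        frontier = nxt
    return count

def connected_path(neighbors, a, b, cp_max):
    # Equation 3.1: normalize by the maximum raw score over all author pairs.
    return connected_path_raw(neighbors, a, b) / cp_max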

The third technique that has been tested is the Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
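Because the Jaccard score depends only on the two sets of direct neighbors, a sketch of it reduces to a few lines:

def jaccard(neighbors, a, b):
    # Jaccard similarity of the direct neighborhoods of authors a and b.
    na, nb = neighbors.get(a, set()), neighbors.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0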

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Feiguina and Hirst [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function (RBF) kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest scoring combination of parameters is then chosen to train the actual SVM model.
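The grid search itself is standard; a sketch in terms of scikit-learn (the thesis used SVM.NET, so this is only an equivalent illustration, with X and y standing for the feature matrix and the binary author labels):

from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

param_grid = {
    "C": [2.0 ** e for e in range(-5, 16, 2)],      # 2^-5, 2^-3, ..., 2^15
    "gamma": [2.0 ** e for e in range(-15, 4, 2)],  # 2^-15, 2^-13, ..., 2^3
}

search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid,
    cv=RepeatedStratifiedKFold(n_splits=5, n_repeats=5),  # 5 x 5-fold CV
    scoring="accuracy",
)
# search.fit(X, y); search.best_params_ then holds the chosen C and gamma.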


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters (~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |)
54          Total number of words (M)
55          Total number of short words (less than four characters) / M
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena (frequency of once-occurring words)
62          Hapax dislegomena (frequency of twice-occurring words)
63-82       Word length frequency distribution / M
83-333      TF-IDF of the 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation (, . ? ! : ; ' ")
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.
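To make the feature mapping concrete, the sketch below computes a handful of the lexical features of table 3.4 for a single email; it is a partial illustration rather than the full 492-dimensional vector, and assumes the email has already passed the ten-word preprocessing filter (so the divisions are safe):

import re
from collections import Counter

def lexical_features(text):
    C = len(text)                                    # feature 1
    words = re.findall(r"[A-Za-z']+", text)
    M = len(words)                                   # feature 54
    freq = Counter(w.lower() for w in words)
    return {
        "alpha_ratio": sum(c.isalpha() for c in text) / C,       # feature 2
        "upper_ratio": sum(c.isupper() for c in text) / C,       # feature 3
        "digit_ratio": sum(c.isdigit() for c in text) / C,       # feature 4
        "short_word_ratio": sum(len(w) < 4 for w in words) / M,  # feature 55
        "avg_word_length": sum(len(w) for w in words) / M,       # feature 57
        "distinct_word_ratio": len(freq) / M,                    # feature 60
        "hapax_legomena": sum(f == 1 for f in freq.values()),    # feature 61
        "hapax_dislegomena": sum(f == 2 for f in freq.values()), # feature 62
    }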



The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C#-conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems using different kernels and parameters.
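As an illustration of this training scheme, the following sketch uses scikit-learn instead of SVM.NET; the helper features() stands for the feature extraction of table 3.4, and emails_by_author and best_params are assumed data structures:

import random

from sklearn.svm import SVC

def train_author_models(emails_by_author, best_params, features):
    # One binary SVM per author: the author's own emails form the positive
    # class, an equal number of random other emails the negative class.
    models = {}
    for author, emails in emails_by_author.items():
        positives = [features(e) for e in emails]
        others = [e for a, es in emails_by_author.items() if a != author for e in es]
        negatives = [features(e) for e in random.sample(others, len(positives))]
        X = positives + negatives
        y = [1] * len(positives) + [0] * len(negatives)
        model = SVC(kernel="rbf", probability=True, **best_params[author])
        model.fit(X, y)
        models[author] = model
    return models

def attribute(models, features, email):
    # Let every author model assign a probability; the highest one wins.
    x = [features(email)]
    return max(models, key=lambda a: models[a].predict_proba(x)[0][1])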

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in figure 3.6.


Figure 3.6: The structure of the combined approach.


In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
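The voting step can be sketched in the same style; the ordering of the three technique scores inside the input vector is an assumption, since the thesis only specifies that they together form the input:

from sklearn.svm import SVC

# Each row holds the three technique scores for one candidate author-alias
# pair: [jaro_winkler, link_similarity, svm_probability], where the link
# similarity is either the Jaccard or the Connected Path score.
def train_voting_svm(rows, labels):
    model = SVC(kernel="rbf", probability=True)
    model.fit(rows, labels)
    return model

def is_alias(model, jw, link, svm_prob, threshold):
    # Accept the candidate as an alias if the voting SVM's probability
    # exceeds the chosen decision threshold.
    return model.predict_proba([[jw, link, svm_prob]])[0][1] >= threshold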

After the two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the mixed test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the mixed test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the hard test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the hard test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond direct neighbors, since it takes more complicated link connections into account. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2; it is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative,


thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes direct neighbors into account, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. Because of the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths.


Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. Intelligent Systems, IEEE, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.


[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), September, pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval, Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.


[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611-es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.


[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING '08, August, pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.


[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. Int. J. Metadata Semant. Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.


[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.


Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



Figure 2.1: The structure of a supervised authorship attribution system.

of 88% on a data set with 4 candidate authors, each represented using 40 email messages. When the number of authors is increased to 20, the accuracy drops to 70%. Similarly, when the number of emails per author is decreased from 40 to 10, the accuracy with 4 candidate authors drops from 88% to 33%.

Supervised techniques

The second set of authorship attribution techniques is the set of supervised techniques. The main structure of every supervised authorship attribution system can be seen in figure 2.1. First, a set of training texts with known authorship is converted to a set of feature vectors. Based on these feature vectors, a machine learning algorithm creates a predictive model that can predict the authorship of new texts with unknown authorship.

One of the earliest supervised methods used to discriminate between authors analyzes the frequency distribution of a particular feature, and compares the distribution derived from an anonymous piece of text to those that have been derived from different authors. The author whose distribution most closely matches that of the anonymous text is considered the real author. A famous study by Mendenhall [47] that uses such frequency distributions is often considered the study that originated the field of authorship attribution. Mendenhall examined how often authors such as Bacon, Marlowe and Shakespeare use words of different lengths. By plotting the frequency of each word length on a graph, he created a so-called characteristic curve for each author.


His most important finding was that word length distributions tend to remain the same across different works by a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejects the claim of some literary scholars that Shakespeare and Bacon are the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers: a set of 12 political essays with disputed authorship between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naive Bayes probabilistic model to the frequency of these function words and found that all documents were written by Madison. The Naive Bayes model quantifies the idea of Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, ..., x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the original document can be expressed by

P(A_i | x_1, ..., x_n) ∝ P(x_1, ..., x_n | A_i) · P(A_i)    (2.11)

The real author is then calculated using

A* = arg max_{A_i} P(A_i | x_1, ..., x_n)    (2.12)
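A sketch of this rule over function-word counts, as a minimal multinomial Naive Bayes; the add-one smoothing and log-space computation are standard implementation choices rather than part of the original study:

import math
from collections import Counter

def train_nb(texts_by_author, vocabulary):
    # Estimate P(w | A) per author with add-one smoothing.
    models = {}
    for author, texts in texts_by_author.items():
        counts = Counter(w for t in texts for w in t.lower().split()
                         if w in vocabulary)
        total = sum(counts.values()) + len(vocabulary)
        models[author] = {w: (counts[w] + 1) / total for w in vocabulary}
    return models

def attribute_nb(models, priors, vocabulary, text):
    # Equation 2.12: choose the author maximizing log P(A) + sum log P(w | A).
    words = [w for w in text.lower().split() if w in vocabulary]
    return max(models, key=lambda a: math.log(priors[a])
               + sum(math.log(models[a][w]) for w in words))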

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be in the candidate set. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text".

Burrows computes the z-scores of the 30 most frequent words from a text against a reference corpus (a large contemporary corpus from which the mean and standard deviation of these 30 words are computed) and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta-score is considered the author of the unknown text. It has proven to be a good authorship attribution method for texts of at least 1,500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
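Burrows's Delta reduces to a short computation once the word list and the reference-corpus statistics are given; a minimal sketch:

def delta(test_freqs, author_freqs, corpus_mean, corpus_std, words):
    # Mean absolute difference of z-scores over the chosen word list; all
    # frequencies are relative, and mean/std come from the reference corpus.
    total = 0.0
    for w in words:
        z_test = (test_freqs[w] - corpus_mean[w]) / corpus_std[w]
        z_author = (author_freqs[w] - corpus_mean[w]) / corpus_std[w]
        total += abs(z_test - z_author)
    return total / len(words)

# The candidate with the lowest Delta is attributed the unknown text:
# best = min(candidates, key=lambda a: delta(test, profile[a], mu, sd, top30))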

Another learning method that has been employed to discriminate between different authors is the use of decision trees.


Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

Given a set of labeled training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier have been tested using 2-5 authors and 20-800 documents per author. Across different experiments C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest potential margin of error.


Figure 2.3: An example of linear separation using a maximum margin of error. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in very fast classification. An example of how SVM maximizes the margin of error of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and can separate the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations to binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-all using voting, and error-correcting codes [18]. Numerous authorship attribution studies exist that utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in figure 2.5.


Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training & validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique.


On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved with as little as 100 words per text.
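One Unmasking run can be sketched as follows, with a linear SVM from scikit-learn; the choice of k, the zeroing-out of removed features and the dense feature matrix are assumptions for illustration:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def unmasking_curve(X, y, n_iterations=10, k=3):
    # Repeatedly drop the k strongest positive and k strongest negative
    # features and record how fast cross-validation accuracy degrades;
    # a steep drop suggests both text sets come from the same author.
    X = X.copy()
    accuracies = []
    for _ in range(n_iterations):
        accuracies.append(cross_val_score(LinearSVC(), X, y, cv=10).mean())
        weights = LinearSVC().fit(X, y).coef_[0]
        order = np.argsort(weights)
        for idx in np.concatenate([order[:k], order[-k:]]):
            X[:, idx] = 0.0
    return accuracies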

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely link analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then an edge e_{v_i,v_j} ∈ W exists if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i,v_j} ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, is when two scientific documents share one or more bibliographical references. If paper A and paper B are both cited by a third paper C, it is possible that paper A and B are somehow related. The co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

$$\text{Co-citation}(v_i, v_j) = |N(v_i) \cap N(v_j)| \qquad (2.13)$$

In graph theory this is often known as the shared neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of $V$ with three vertices $v_i, v_j, v_k$ and two edges $w_{ik}, w_{jk}$, such that $v_i$ and $v_j$ are connected via the third vertex $v_k$. Figure 2.6 provides an example of a trivial network where email addresses $v_i$ and $v_j$ are considered to be aliases, because they are connected by three different Connected Triples (red, yellow and blue).
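As an illustration, counting the Connected Triples between two vertices reduces to counting their shared neighbors (a sketch; the dictionary-of-sets graph representation is an assumption):

def connected_triples(neighbors, vi, vj):
    # every vertex vk adjacent to both vi and vj forms one Connected Triple
    return len(neighbors[vi] & neighbors[vj])

neighbors = {"vi": {"a", "b", "c"}, "vj": {"a", "b", "c", "d"}}
print(connected_triples(neighbors, "vi", "vj"))  # 3, as in Figure 2.6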


Figure 2.6: An example of two vertices $v_i$ and $v_j$ being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of $v_i$ and $v_j$ is defined as follows:

$$\text{Jaccard}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|} \qquad (2.14)$$

where $N(v_i)$ again designates the set of neighbors of $v_i$. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random guessing in predicting where new links will form. However, it sometimes performs on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
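Equation (2.14) translates directly into code (a sketch using the same dictionary-of-sets representation as above):

def jaccard(neighbors, vi, vj):
    # ratio of shared neighbors to the union of both neighborhoods
    shared = neighbors[vi] & neighbors[vj]
    union = neighbors[vi] | neighbors[vj]
    return len(shared) / len(union) if union else 0.0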

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let $I(v)$ be the set of in-going neighbors of vertex $v$, and $O(v)$ the set of out-going neighbors of $v$. The in-degree and out-degree represent the number of in-going and out-going neighbors and are denoted by $|I(v)|$ and $|O(v)|$, respectively. An individual neighbor is denoted as $I_i(v)$ or $O_i(v)$. The similarity between vertices $v_i$ and $v_j$ can be calculated using the following recursive equation:

$$\text{SimRank}(v_i, v_j) = \frac{C}{|I(v_i)|\,|I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} \text{SimRank}(I_x(v_i), I_y(v_j)) \qquad (2.15)$$

where $C$ is a constant between 0 and 1. In practice, the equation can be solved by iterating to a fixed point, letting $\text{SimRank}(v_i, v_j) = 1$ if $v_i = v_j$ and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range


of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
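A sketch of equation (2.15) solved by fixed-point iteration over all vertex pairs might look as follows (the in_neighbors mapping and the value of C are assumptions, and the naive double loop is expensive on large graphs):

def simrank(vertices, in_neighbors, C=0.8, iterations=10):
    # base case: a vertex is maximally similar to itself
    sim = {(u, v): 1.0 if u == v else 0.0 for u in vertices for v in vertices}
    for _ in range(iterations):
        new = {}
        for u in vertices:
            for v in vertices:
                if u == v:
                    new[(u, v)] = 1.0
                    continue
                I_u, I_v = in_neighbors[u], in_neighbors[v]
                if not I_u or not I_v:
                    new[(u, v)] = 0.0   # no in-going links, no evidence
                    continue
                total = sum(sim[(x, y)] for x in I_u for y in I_v)
                new[(u, v)] = C * total / (len(I_u) * len(I_v))
        sim = new
    return sim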

PageSim [42] is another extension of the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices $v_i$ and $v_j$ is calculated using

$$\text{ConnectedPath}(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{\text{length}(p)} \qquad (2.16)$$

where $PATH(v_i, v_j, r)$ is the collection of all paths between $v_i$ and $v_j$ of length $r$, and $U(p)$ is the uniqueness of a particular path $p \in PATH$, which is calculated as follows:

$$U(p) = \sum_{v_x \in p,\ v_x \notin \{v_i, v_j\}} UQ(v_x) \qquad (2.17)$$

$UQ(v_x)$ denotes the uniqueness of a single vertex $v_x$ in the path $p$. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

$$UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|} \qquad (2.18)$$

where $w_{x,g}$ denotes an edge between $v_x \in path(v_i, v_j)$ and any other vertex $v_g \in V$, and $w_{x,x+1}$ and $w_{x,x-1}$ denote the edges from $v_x$ to its adjacent vertices in the path. Figure 2.7 provides an example of vertices $v_i$ and $v_j$ having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.


Figure 2.7: An example of three different paths between the vertices $v_i$ and $v_j$. The most direct path ($p_x$) is the most informative path. Image courtesy of Boongoen et al. [6].
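A sketch of equations (2.16)-(2.18), assuming an undirected graph stored as a vertex-to-neighbors mapping and symmetric edge weights (w[(a, b)] == w[(b, a)], counting the messages between two authors); this is an illustrative reading of the algorithm, not the thesis implementation:

def uniqueness(graph, w, path, pos):
    # UQ of equation (2.18): the weight of the two path edges at this
    # vertex, relative to the total weight of all its edges
    vx = path[pos]
    local = w[(path[pos - 1], vx)] + w[(vx, path[pos + 1])]
    return local / sum(w[(vx, vg)] for vg in graph[vx])

def connected_path(graph, w, vi, vj, r=3):
    score, stack = 0.0, [[vi]]
    while stack:
        path = stack.pop()
        last = path[-1]
        if last == vj:
            if len(path) > 2:           # at least one intermediate vertex
                u = sum(uniqueness(graph, w, path, k)
                        for k in range(1, len(path) - 1))
                score += u / (len(path) - 1)   # length(p) counted in edges
            continue
        if len(path) - 1 < r:           # only extend paths shorter than r
            for nxt in graph[last]:
                if nxt not in path:     # keep paths simple (no revisits)
                    stack.append(path + [nxt])
    return score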

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form $f(x) = \alpha s_i + \beta s_j + \gamma s_k$, where $s_i$, $s_j$ and $s_k$ denote the scores assigned by techniques $i$, $j$ and $k$ respectively, each normalized such that they fall in the range $[0, 1]$. The weights $\alpha, \beta, \gamma$ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.
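As a minimal illustration of such a linear combination (the weights below are hand-picked placeholders, not values taken from the literature):

def combine(s_i, s_j, s_k, alpha=0.4, beta=0.3, gamma=0.3):
    # each score is assumed to be normalized to [0, 1] beforehand
    return alpha * s_i + beta * s_j + gamma * s_k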

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism such as a Support


Vector Machine can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of such cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.
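Although no prior work employs it, the sifting idea itself is simple to sketch (jaro_winkler and svm_probability are hypothetical helper functions, and the thresholds are placeholders):

def sift(name_a, name_b, jaro_winkler, svm_probability, hi=0.95, lo=0.50):
    s = jaro_winkler(name_a, name_b)
    if s >= hi:
        return True      # obvious alias: accept without further work
    if s < lo:
        return False     # obvious non-alias: reject cheaply
    # only ambiguous pairs reach the expensive classifier
    return svm_probability(name_a, name_b) >= 0.5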


                  correct alias          false alias
retrieved         true positives (tp)    false positives (fp)
not retrieved     false negatives (fn)   true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table such as the one that can be seen in Table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

$$\text{Accuracy} = \frac{\text{correct classifications}}{\text{total number of classifications}} = \frac{tp + tn}{tp + fp + fn + tn} \qquad (2.19)$$

Although it looks like a good measure of performance, it is not that hard to obtain a high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain a high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can also be defined as

$$P = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{retrieved aliases}|} = \frac{tp}{tp + fp} \qquad (2.20)$$

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

$$R = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{total correct aliases}|} = \frac{tp}{tp + fn} \qquad (2.21)$$

These two measures are not as dependent on the class distributions as the accuracy measure; therefore, they are a more sensible choice in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user that wants to automate the complete process, and to be able to rely greatly on the classification given by the system, will favor precision over recall. Since the


preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} \qquad (2.22)$$

Often, the importance of precision and recall is balanced by choosing $\alpha = 0.5$. This results in the so-called F1-measure, which can now simply be written as

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (2.23)$$

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive: since all the correct aliases are then retrieved, the recall will be 100%, and the arithmetic mean will therefore be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].
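Computed from the contingency table of Table 2.2, these measures can be sketched as follows (the counts are purely illustrative):

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, alpha=0.5):
    # alpha = 0.5 yields the balanced F1-measure of equation (2.23)
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

p, r = precision(tp=14, fp=6), recall(tp=14, fn=4)
print(f_measure(p, r))   # harmonic mean of precision and recall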

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed, and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46], and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.
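The difference between the two averaging schemes can be sketched as follows (precision only, with illustrative (tp, fp) counts):

def micro_precision(tables):
    tp = sum(t for t, f in tables)
    fp = sum(f for t, f in tables)
    return tp / (tp + fp)

def macro_precision(tables):
    return sum(t / (t + f) for t, f in tables) / len(tables)

tables = [(90, 10), (2, 2)]        # one large and one small problem
print(micro_precision(tables))     # 0.885: dominated by the large problem
print(macro_precision(tables))     # 0.700: both problems weighted equally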

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that, when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan


distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the ones most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of over-fitting and perform automated feature selection. Therefore, SVM has been chosen as a classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented will be discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails, collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in Figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


Subject: SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set, and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty and Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step   Records affected   Percentage removed (cum.)
1      17,052             6.70
3      13,681             12.00
4      26,223             22.50
5      4,001              24.00
6      25,990             34.00
7      3,700              35.80
8      52,163             56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply-parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply-parts in step 2 were removed.

4. Messages that contained 10 or fewer words were removed, since they contained too little useful information.

5. Authors that had written a total number of 100 words or fewer were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.
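As an illustration, the duplicate-removal rule of step 6 can be sketched as follows (the dictionary keys are assumptions about the message representation, not the actual database schema):

def deduplicate(messages):
    # messages agreeing on sender, receiver, body, send date and subject
    # are collapsed to a single copy
    seen, unique = set(), []
    for m in messages:
        key = (m["sender"], m["receiver"], m["body"],
               m["send_date"], m["subject"])
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique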

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. These results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total of 80 or fewer emails were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set, the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44,912 messages by 246 different senders. For each message, the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides


[Figure 3.2 plots 10-fold cross-validation accuracy (y-axis, 0.5-1.0) against the number of training instances per class (x-axis, 20-200) for the linear and RBF kernels.]

Figure 3.2: Averages of 10 times 10-fold cross-validation using different training set sizes and kernels for the authorship SVM.

an overview of the total number of words per author. The x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it corresponds to the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total of 200 or more emails were selected from the data set, and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, as listed below:


[Figure 3.3 shows a histogram of the number of emails sent (x-axis, 90-230) against the number of authors (y-axis, 0-35).]

Figure 3.3: The distribution of email messages per author.

[Figure 3.4 shows a histogram of the total number of words per author (x-axis, from 10,000 upward on a logarithmic scale) against the number of authors (y-axis).]

Figure 3.4: The distribution of the total number of words per author.


[Figure 3.5 shows the link network of all senders in the final data set; node labels are (truncated) author addresses, and node colors represent node degree.]

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of Alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No alias                             193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed   Hard
High Jaro-Winkler    6       2
Low Jaro-Winkler     8       16
No alias             6       2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.comA and john.doe@enron.comB);

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden and abu_abdallah);

• Authors without an alias.

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
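A sketch of this pairwise procedure (assuming the third-party jellyfish package for the Jaro-Winkler metric, which the thesis does not use; the 0.94 threshold mirrors the best-performing threshold reported in Chapter 4):

import jellyfish

def jw_aliases(author, candidates, threshold=0.94):
    # flag every candidate address whose Jaro-Winkler similarity to the
    # author's address reaches the decision threshold
    return [c for c in candidates
            if c != author
            and jellyfish.jaro_winkler_similarity(author, c) >= threshold]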

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range $[0, 1]$, the score for a particular author-author pair was normalized as follows:

$$\text{ConnectedPath}_{\text{norm}}(v_i, v_j) = \frac{\text{ConnectedPath}(v_i, v_j)}{\text{ConnectedPath}_{\max}} \qquad (3.1)$$

where $\text{ConnectedPath}_{\max}$ is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4; the list of function words that has been used in the feature set can be found in the appendix.
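A small illustrative subset of the lexical features of Table 3.4 (a sketch only; the full 492-dimensional vector, including 3-gram TF-IDF values and function word frequencies, is not reproduced here):

def lexical_features(text):
    chars = max(len(text), 1)
    words = text.split()
    return [
        chars,                                            # 1: total characters (C)
        sum(c.isalpha() for c in text) / chars,           # 2: alphabetic characters / C
        sum(c.isupper() for c in text) / chars,           # 3: upper-case characters / C
        sum(c.isdigit() for c in text) / chars,           # 4: digit characters / C
        len(words),                                       # 54: total words (M)
        sum(len(w) < 4 for w in words),                   # 55: short words
        sum(len(w) for w in words) / max(len(words), 1),  # 57: average word length
    ]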

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter $C$ influences the penalty associated with classification errors, whereas $\gamma$ controls the shape of the separating hyperplane. In order to find optimal values of $C$ and $\gamma$, a straightforward grid search has been performed using exponentially growing sequences of $C$ and $\gamma$. Specifically, the accuracy of all combinations of $C = 2^{-5}, 2^{-3}, 2^{-1}, \ldots, 2^{15}$ and $\gamma = 2^{-15}, 2^{-13}, 2^{-11}, \ldots, 2^{3}$ is calculated using


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters (e.g. ~ @ # $ % ^ & * - _ = + > < [ ] |)
54          Total number of words (M)
55          Total number of short words / M (less than four characters)
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation marks (. , ! ? : ; ' ")
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.


5 times 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.
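The grid search can be sketched as follows (assuming scikit-learn rather than the SVM.NET library actually used in the thesis; note that GridSearchCV performs a single 5-fold cross-validation per grid point, whereas the thesis averages over 5 repetitions):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 2)],       # 2^-5, 2^-3, ..., 2^15
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],   # 2^-15, 2^-13, ..., 2^3
}
search = GridSearchCV(SVC(kernel="rbf", probability=True),
                      param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train) would select the best (C, gamma) pair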

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.
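In outline, the one-versus-all prediction step looks as follows (a sketch; models is assumed to map each author to a trained binary SVM with probability outputs):

def predict_author(text_features, models):
    # one probability per candidate author, from that author's binary SVM
    scores = {author: m.predict_proba([text_features])[0][1]
              for author, m in models.items()}
    return max(scores, key=scores.get)   # author with the highest probability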

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35], a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content.

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content.

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the


Figure 3.6: The structure of the combined approach.

results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
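The voting step can be sketched as follows (the threshold is a placeholder; Chapter 4 reports the best-performing thresholds per combination):

def vote(voting_svm, s_jw, s_link, s_author, threshold=0.5):
    # the three normalized technique scores form the feature vector
    p = voting_svm.predict_proba([[s_jw, s_link, s_author]])[0][1]
    return p >= threshold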

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88, using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-


Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89, using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Four panels plotting precision, recall and F1 against the decision threshold (0.0-1.0): (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set.

[Two panels plotting precision, recall and F1 against the decision threshold (0.0-1.0): (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set.


[Four panels plotting precision, recall and F1 against the decision threshold (0.0-1.0): (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set.

[Two panels plotting precision, recall and F1 against the decision threshold (0.0-1.0): (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set.


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that using a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases, and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2; it is expected that the same behavior of Connected Path would be observed on this data set if the search were performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative,


thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected

Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM, or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase compared with the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less


sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.


[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining – Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval, Workshop on Plagiarism Analysis, Authorship Identification and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.


[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics – COLING '04, pages 611–es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.


[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–246.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.


[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.


[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.


Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



His most important finding was that word-length distributions tend to remain the same across different works by a particular author. In a similar fashion, Mendenhall [48] discovered, after a tedious analysis of Shakespeare's and Bacon's work, that Shakespeare tends to use four-letter words most often, whereas Bacon uses three-letter words most often. He thereby rejects the claim of some literary scholars that Shakespeare and Bacon were the same person.

One of the most influential studies on authorship attribution, by Mosteller and Wallace [51], concerned the authorship of the Federalist Papers: a set of 12 political essays whose authorship is disputed between Alexander Hamilton and James Madison. Being among the first to use a small set of function words as an indicator of authorial style, Mosteller & Wallace applied a Naïve Bayes probabilistic model to the frequencies of these function words and found that all disputed documents were written by Madison. The Naïve Bayes model quantifies the idea of Mendenhall by using the probability density functions of a set of features to characterize each author. For a given set of features x_1, ..., x_n and a set of authors A, where A_i denotes an individual author, the probability that a given author A_i is the real author of the document can be expressed by

P(A_i | x_1, ..., x_n) ∝ P(x_1, ..., x_n | A_i) · P(A_i)    (2.11)

The real author is then calculated using

A* = argmax_{A_i} P(A_i | x_1, ..., x_n)    (2.12)
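To make equations (2.11) and (2.12) concrete, the following minimal Python sketch applies a multinomial Naïve Bayes model to function-word counts. The word list, the add-one smoothing and the training texts are illustrative assumptions of this sketch; Mosteller & Wallace themselves fitted more elaborate word-frequency models.

import math
from collections import Counter

# Illustrative function-word list; the thesis uses 150 such words (appendix).
FUNCTION_WORDS = ["upon", "while", "although", "both", "whilst"]

def train(texts_by_author):
    """Estimate P(x_i | A) per author with add-one smoothing."""
    models = {}
    for author, texts in texts_by_author.items():
        counts = Counter(w for t in texts for w in t.lower().split()
                         if w in FUNCTION_WORDS)
        total = sum(counts.values()) + len(FUNCTION_WORDS)
        models[author] = {w: (counts[w] + 1) / total for w in FUNCTION_WORDS}
    return models

def attribute(models, text, priors=None):
    """Equation (2.12): argmax over the candidate authors, in log space."""
    words = [w for w in text.lower().split() if w in FUNCTION_WORDS]
    best, best_score = None, float("-inf")
    for author, probs in models.items():
        score = math.log(priors[author]) if priors else 0.0
        score += sum(math.log(probs[w]) for w in words)
        if score > best_score:
            best, best_score = author, score
    return best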

Whereas many approaches only deal with closed candidate sets, Burrows [8] addressed the issue of open candidate sets: sets where the actual author might not be among the candidates. He introduced a new measure called Delta, which he defined as

"the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text"

Burrows computes the z-scores of the 30 most frequent words of a text against a reference corpus (a large contemporary corpus for which the mean and standard deviation of these 30 words are computed) and calculates the difference in z-scores between a known text and an unknown text for several authors. The author with the lowest Delta-score is considered the author of the unknown text. Delta has proven to be a good authorship attribution method for texts of at least 1500 words, but it can also be used to generate candidate authors for texts as little as 100 words long. Hoover [27] confirms the effectiveness of this method in authorship attribution of large problems with open candidate sets. Furthermore, he shows that using larger sets of frequent words improves the accuracy of the method significantly, as does removing personal pronouns and words with a high inverse document frequency from the feature set.
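As an illustration, a small Python sketch of the Delta computation is given below; it assumes that the relative frequencies of the chosen frequent words, as well as their mean and standard deviation in the reference corpus, have already been computed.

def delta(known_freqs, unknown_freqs, corpus_mean, corpus_std):
    """Burrows's Delta: the mean absolute difference between the z-scores
    of the frequent words in a known text and in the unknown text.
    Each argument maps a word to a relative frequency or corpus statistic."""
    diffs = []
    for w in corpus_mean:
        z_known = (known_freqs.get(w, 0.0) - corpus_mean[w]) / corpus_std[w]
        z_unknown = (unknown_freqs.get(w, 0.0) - corpus_mean[w]) / corpus_std[w]
        diffs.append(abs(z_known - z_unknown))
    return sum(diffs) / len(diffs)

# The candidate author whose known text yields the lowest delta(...) value
# is attributed the unknown text.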

Another learning method that has been employed to discriminate between different authors is the use of decision trees. Given a set of labeled training


Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e., the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier have been tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that deals very well with high-dimensional data. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest potential margin of error. This means that the generalization error of the model


Figure 2.3: An example of linear separation using a maximum margin of error. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in very fast classification. An example of how SVM maximizes the margin of error of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and on what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and separates the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations on binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-one using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a


Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training & validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e., to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking


technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.
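The iterative core of the technique can be sketched as follows; scikit-learn's LinearSVC is used here as the linear SVM, which is a choice of this sketch rather than of Koppel et al., and X is assumed to be a float feature matrix.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unmasking_curve(X, y, iterations=10, k=3):
    """Record how fast cross-validation accuracy degrades while the k
    strongest positive and k strongest negative features are removed."""
    X = X.copy()
    accuracies = []
    for _ in range(iterations):
        accuracies.append(cross_val_score(LinearSVC(), X, y, cv=5).mean())
        weights = LinearSVC().fit(X, y).coef_[0]
        order = np.argsort(weights)           # ascending by weight
        for idx in np.concatenate([order[:k], order[-k:]]):
            X[:, idx] = 0.0                   # "remove" the feature
    return accuracies  # a steep drop suggests a same-author pair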

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution originates from a different but related field, namely link analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let vi, vj ∈ V; then an edge e(vi,vj) ∈ W if a message has been sent from author vi to author vj. If there exists an edge e(vi,vj) ∈ W, then vi and vj are considered to be neighbors. The neighborhood N(vi) is the set of all neighbors of the vertex vi. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

Co-citation(vi, vj) = |N(vi) ∩ N(vj)|    (2.13)

In graph theory, this is often known as the shared neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices vi, vj, vk and two edges w(i,k), w(j,k), such that vi and vj are connected via the third vertex vk. Figure 2.6 provides an example of a trivial network where email addresses vi and vj are considered to be aliases because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices vi and vj being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of vi and vj is defined as follows:

Jaccard(vi, vj) = |N(vi) ∩ N(vj)| / |N(vi) ∪ N(vj)|    (2.14)

where N(vi) again designates the set of neighbors of vi. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random guessing in predicting where new links will form. However, it sometimes performs merely on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
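In code, equation (2.14) amounts to two set operations; the sketch below assumes the link network is stored as a dictionary mapping each vertex to its set of neighbors.

def jaccard(neighbors, vi, vj):
    """Jaccard similarity between the neighborhoods of vi and vj."""
    ni, nj = neighbors[vi], neighbors[vj]
    union = ni | nj
    return len(ni & nj) / len(union) if union else 0.0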

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors and are denoted by |I(v)| and |O(v)| respectively; an individual neighbor is denoted as Ii(v) or Oi(v). The similarity between vertices vi and vj can be calculated using the following recursive equation:

SimRank(vi, vj) = C / (|I(vi)| · |I(vj)|) · Σ_{x=1}^{|I(vi)|} Σ_{y=1}^{|I(vj)|} SimRank(Ix(vi), Iy(vj))    (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iterating to a fixed point, letting SimRank(vi, vj) = 1 if vi = vj and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range


of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
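A direct fixed-point implementation of equation (2.15) can look as follows; the decay constant C = 0.8 and the iteration count are illustrative choices, and every in-going neighbor is assumed to be a key of the input dictionary.

def simrank(in_neighbors, C=0.8, iterations=10):
    """Iterate equation (2.15) to a fixed point.
    `in_neighbors` maps each vertex to the set of its in-going neighbors."""
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif in_neighbors[a] and in_neighbors[b]:
                    total = sum(sim[(x, y)]
                                for x in in_neighbors[a]
                                for y in in_neighbors[b])
                    new[(a, b)] = C * total / (len(in_neighbors[a]) *
                                               len(in_neighbors[b]))
                else:
                    new[(a, b)] = 0.0
        sim = new
    return sim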

PageSim [42] is another extension of the co-citation algorithm; it assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices vi and vj is calculated using

ConnectedPath(vi, vj) = Σ_{p ∈ PATH(vi, vj, r)} U(p) / length(p)    (2.16)

where PATH(vi, vj, r) is the collection of all paths between vi and vj of length r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

U(p) = Σ_{vx ∈ path(vi, vj), vx ∉ {vi, vj}} UQ(vx)    (2.17)

UQ(vx) denotes the uniqueness of a single vertex vx in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(vx) = (|w(x, x−1)| + |w(x, x+1)|) / Σ_{∀vg ∈ V} |w(x, g)|    (2.18)

where w(x, g) denotes an edge between vx ∈ path(vi, vj) and any other vertex vg ∈ V, and w(x, x+1) and w(x, x−1) denote edges from vx to its adjacent vertices in the path. Figure 2.7 provides an example of vertices vi and vj having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
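A possible Python sketch of the Connected Path score is given below; it assumes an undirected network stored as nested dictionaries of symmetric message counts, and it enumerates simple paths of at most r edges, which suffices for the shallow search depths (2-3) used later in this thesis.

def connected_path(adj, vi, vj, r=3):
    """Equations (2.16)-(2.18); `adj[v]` is a dict {neighbor: edge weight}."""
    def paths(src, dst, max_edges):
        stack = [(src, [src])]
        while stack:
            node, path = stack.pop()
            if node == dst and len(path) > 1:
                yield path
                continue
            if len(path) - 1 >= max_edges:    # edge budget exhausted
                continue
            for nxt in adj[node]:
                if nxt not in path:           # simple paths only
                    stack.append((nxt, path + [nxt]))

    def uq(path, i):                          # uniqueness of vertex path[i]
        vx = path[i]
        total = sum(adj[vx].values())
        return (adj[vx][path[i - 1]] + adj[vx][path[i + 1]]) / total

    score = 0.0
    for p in paths(vi, vj, r):
        u = sum(uq(p, i) for i in range(1, len(p) - 1))
        score += u / (len(p) - 1)             # path length in edges
    return score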


Figure 2.7: An example of three different paths between the vertices vi and vj. The most direct path (px) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results from different techniques can becombined to improve the final result

One of the simplest methods is to create a linear combination of the form f(x) = αsi + βsj + γsk, where si, sj and sk denote the scores assigned by techniques i, j and k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm described in section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.
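A minimal sketch of such a linear combination, with illustrative weights and scores:

def combine(scores, weights):
    """f(x) = alpha*s_i + beta*s_j + gamma*s_k over normalized scores."""
    return sum(w * s for w, s in zip(weights, scores))

# e.g. weigh the string metric twice as heavily as the two link measures:
alias_score = combine([0.91, 0.40, 0.55], [0.50, 0.25, 0.25])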

Another approach is to create a feature vector consisting of the scores assigned by the different techniques. A weighted voting mechanism such as a Support


Vector Machine can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve on the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set was manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e., the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected using a more complicated technique, e.g., a neural network. Unfortunately, no previous research could be found that employs this approach.


                    correct alias           false alias
retrieved           true positives (tp)     false positives (fp)
not retrieved       false negatives (fn)    true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table such as the one shown in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = (correct classifications) / (total number of classifications) = (tp + tn) / (tp + fp + fn + tn)    (2.19)

Although it looks like a good measure of performance, it is not hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all examples as negative. In most cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as

P = |retrieved aliases ∩ correct aliases| / |retrieved aliases| = tp / (tp + fp)    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

R = |retrieved aliases ∩ correct aliases| / |total correct aliases| = tp / (tp + fn)    (2.21)

These two measures are not as dependent on the class distribution as the accuracy measure, and are therefore a more sensible choice in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user who wants to automate the complete process and be able to rely greatly on the classification given by the system will favor precision over recall. Since the


preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = 1 / (α · (1/P) + (1 − α) · (1/R))    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as

F1 = (2 · precision · recall) / (precision + recall)    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive: since all the correct aliases are retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable because it will be closer to the minimum of precision and recall than the arithmetic mean when the two values differ greatly [46].
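The measures (2.20)-(2.23) translate directly into code; the following helper functions operate on the contingency counts of table 2.2.

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p, r, alpha=0.5):
    """Weighted harmonic mean (2.22); alpha = 0.5 yields F1 (2.23)."""
    if p == 0.0 or r == 0.0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)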

Averaging the precision and recall scores of different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46] and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen (2006) found that, when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan


distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of overfitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; for text stored in attachments this assumption cannot be made. The inclusion of these attachments would therefore create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


Subject: SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well-known version of the data set was made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus's appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed, among them calendar reminders and messages that only contained attachments.


Step    Records affected    Percentage removed (cum.)
1       17052               6.70
3       13681               12.00
4       26223               22.50
5       4001                24.00
6       25990               34.00
7       3700                35.80
8       52163               56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained 10 words or fewer were removed, since they contained too little useful information.

5. Authors that had written a total number of words of 100 or fewer were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.
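As an illustration of step 6, a Python sketch of the duplicate removal is given below; the dictionary-based message schema is an assumption made for the example, not the actual database layout.

def deduplicate(messages):
    """Keep one copy of messages sharing sender, receiver, body,
    send date and subject."""
    seen, unique = set(), []
    for m in messages:
        key = (m["sender"], m["receiver"], m["body"],
               m["send_date"], m["subject"])
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique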

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in figure 3.2, showed that the highest cross-validation accuracy was achieved using 80 emails per author. Therefore, authors that had sent a total number of emails of 80 or fewer were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44,912 emails by 246 different senders. For each message the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides


Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes (20-200 instances per class) and kernels (linear, RBF) for the authorship SVM.

an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e., the number of in-going and out-going links; it reflects the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contains any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails of at least 200 were selected from the data set and their emails were split up over several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


Figure 3.3: The distribution of email messages per author (number of emails on the x-axis, number of authors on the y-axis).

Figure 3.4: The distribution of the total number of words per author (total number of words on the x-axis, number of authors on the y-axis).


Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of Alias                           Number of authors
High Jaro-Winkler with 1 alias          26
High Jaro-Winkler with 2 aliases        15
Low Jaro-Winkler with 1 alias           11
Low Jaro-Winkler with 2 aliases         1
No alias                                193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set                Mixed   Hard
High Jaro-Winkler       6       2
Low Jaro-Winkler        8       16
No alias                6       2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g., johndoe@enron.com.A & johndoe@enron.com.B)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g., bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
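For reference, a self-contained Python sketch of the Jaro and Jaro-Winkler similarities is given below, in the standard formulation with a common-prefix length of at most 4 and a scaling factor p = 0.1.

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1, match2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):                # count matching characters
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                               # count transpositions
    for i in range(len(s1)):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    m = matches
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3.0

def jaro_winkler(s1, s2, p=0.1):
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):                  # common prefix, capped at 4
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)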

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores are in the range [0, 1], the score for a particular author-author pair was calculated as follows:

ConnectedPath_norm(vi, vj) = ConnectedPath(vi, vj) / ConnectedPath_max    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhoods of their correspondents and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences between individual email messages. For example, an author may use different words when writing to friends instead of colleagues, or longer sentences when writing to superiors instead of subordinates. Moreover, Feiguina and Hirst [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in table 3.4; the list of function words that has been used in the feature set can be found in the appendix. A small sketch of this kind of feature extraction is given below.
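The sketch computes a handful of the lexical and structural features of table 3.4; the regular expressions used for tokenization are simplifying assumptions.

import re

def style_features(text):
    """A small slice of the table 3.4 feature set; the full vector
    used in the thesis has 492 entries."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_chars": len(text),                               # feature 1
        "n_alpha": sum(c.isalpha() for c in text),          # feature 2
        "n_upper": sum(c.isupper() for c in text),          # feature 3
        "n_digits": sum(c.isdigit() for c in text),         # feature 4
        "n_words": len(words),                              # feature 54
        "n_short_words": sum(len(w) < 4 for w in words),    # feature 55
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),  # 57
        "n_sentences": len(sentences),                      # feature 492
    }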

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function (RBF) kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model; a sketch of an equivalent grid search follows table 3.4.


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters (e.g. ~ $ ^ & - _ = + > < [ ] |)
54          Total number of words (M)
55          Total number of short words / M (less than four characters)
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation marks (e.g. , . ? ! : ; ' ")
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.


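An equivalent grid search can be written compactly with scikit-learn (the thesis itself used SVM.NET); the plain 5-fold cross-validation below stands in for the 5 × 5-fold procedure, and X and y denote the 492-dimensional feature matrix and the author-vs-rest labels.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C":     [2.0 ** k for k in range(-5, 16, 2)],   # 2^-5 ... 2^15
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],   # 2^-15 ... 2^3
}
search = GridSearchCV(SVC(kernel="rbf", probability=True),
                      param_grid, cv=5)
# search.fit(X, y); train the final model with search.best_params_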

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM classifies whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that as long as a good binary classifier is used, it makes little difference which multi-class scheme is used; therefore a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalance, the authorship SVMs are trained using an equal number of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from the other authors. For each author, all the author's emails are selected as positive examples, and an equal number of emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion, by Matthew A. Johnson, of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single- and multi-class problems using different kernels and parameters.

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the


Figure 3.6: The structure of the combined approach.

results of the three techniques for a single candidate author, and gives as output a prediction of whether the input author is an alias or not. An overview of the combined approach can be found in figure 3.6.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used for testing are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from figure 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this section. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the mixed test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM; each panel plots precision, recall and F1 against the decision threshold (0 to 1).

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the mixed test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM; each panel plots precision, recall and F1 against the decision threshold (0 to 1).

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the hard test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the hard test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.

Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If there was no Connected Path score returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed no better than authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67-75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48-57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9-17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. 2009 IEEE International Conference on Fuzzy Systems, pages 288-293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77-102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267-287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27-47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1-39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290-294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron/.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of string distance metrics for name-matching tasks. Methods, 20(1):73-78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273-297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265-292.

[16] de Vel, O. (2000). Mining e-mail authorship. Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278-285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889-1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3-6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289-1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486-509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), page 611, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172-177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453-476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis (2005), volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56-64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42-S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491-498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1-11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137-142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69-72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659-660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83-94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019-1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019-1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 08), pages 513-520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237-246.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97-105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39-41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267-270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89-99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101-141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482-491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45-72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249-252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265-269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538-556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425-442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17-24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1-15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174-189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science, 57(3):378-393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



Figure 2.2: An example of a decision tree with three internal nodes that test different feature values. The classification decision is made in the bottom nodes.

Given a set of training instances, these instances are recursively split into different subsets using tests on feature values. The decisions at the leaves of the tree assign the actual class (i.e. the most likely author). The features that are used at each internal node are usually chosen based on their Information Gain, which has been explained in section 2.2.3. At each step of creating the decision tree, the attribute with the highest information gain is chosen to partition the data set. An example of a trivial decision tree is given in figure 2.2: by testing the values of two different features, the classification can be made. Decision trees are a simple and effective method, but do not perform well on the authorship attribution task according to Zhao and Zobel [70]. In their evaluation of different authorship attribution methods on a data set of Associated Press articles, they used only function words as style features. Two Bayesian classifiers, two distance-based methods (k-nearest-neighbor) and the C4.5 decision tree classifier have been tested using 2-5 authors and 20-800 documents per author. Across different experiments, C4.5 consistently performed worst, whereas Bayesian networks gave the best results.
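As a small illustration (not taken from the thesis), a decision tree can be made to select its splits by information gain; in scikit-learn this corresponds to the entropy criterion. The toy feature vectors below are made up.

```python
from sklearn.tree import DecisionTreeClassifier

X = [[120, 0.8], [95, 0.2], [40, 0.9], [60, 0.1]]      # toy feature vectors
y = ["author_A", "author_B", "author_A", "author_B"]   # candidate authors

# criterion="entropy" picks, at every internal node, the split with the
# highest information gain, as described above.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)
print(tree.predict([[100, 0.7]]))  # class assigned at the reached leaf
```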

A commonly used method for supervised classification is Support Vector Machines (SVM). SVM is a supervised classification method that can deal with high-dimensional data very well. By providing it with a set of positively and negatively labeled training instances, the SVM builds a model that can classify new instances into one of the two categories. It does so by building a hyperplane in high-dimensional space that separates the two classes with the greatest potential margin of error. This means that the generalization error of the model is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in a very fast classification. An example of how SVM maximizes the margin of error of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

Figure 2.3: An example of linear separation using a maximum margin of error. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and can separate the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].


Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into a higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].
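A brief sketch of how the kernel choice plays out on data that is not linearly separable, assuming scikit-learn; this is an illustration, not an experiment from the thesis.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles are not linearly separable, so the RBF kernel should
# clearly outperform the linear one here.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=3, C=1.0)  # degree is used by "poly" only
    print(kernel, clf.fit(X_train, y_train).score(X_test, y_test))
```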

Variations of the binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-all using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.

Another machine learning technique that can be used for authorship attribution is Artificial Neural Networks (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training and validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.
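A minimal sketch of such a network, assuming scikit-learn's MLPClassifier: 5 input features and one hidden layer of 6 units, as in figure 2.5. The data is synthetic.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((40, 5))                   # 40 training vectors, 5 features
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # toy authorship labels

# One hidden layer with 6 neurons; weights are adjusted during training
# until the predictions fit the training labels.
net = MLPClassifier(hidden_layer_sizes=(6,), max_iter=2000, random_state=0)
net.fit(X, y)
print(net.predict(rng.random((1, 5))))
```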

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of the features that is may vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.
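The following sketch outlines the unmasking loop under stated assumptions (a linear SVM, 5-fold cross-validation, k features removed per side per iteration, and enough features to survive all iterations); it is a simplified reading of the procedure, not Koppel et al.'s code.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unmasking_curve(X, y, iterations=10, k=3):
    """Return cross-validation accuracies as strong features are removed."""
    X = np.asarray(X, dtype=float)
    active = np.arange(X.shape[1])          # indices of surviving features
    curve = []
    for _ in range(iterations):
        acc = cross_val_score(LinearSVC(), X[:, active], y, cv=5).mean()
        curve.append(acc)
        clf = LinearSVC().fit(X[:, active], y)
        w = clf.coef_[0]
        # Drop the k most negative and k most positive weighted features.
        drop = np.concatenate((np.argsort(w)[:k], np.argsort(w)[-k:]))
        active = np.delete(active, drop)
    return curve  # the attribution step analyzes how fast this curve drops
```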

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely link analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let vi, vj ∈ V; then an edge e_vivj ∈ W if a message has been sent from author vi to author vj. If there exists an edge e_vivj ∈ W, then vi and vj are considered to be neighbors. The neighborhood N(vi) is the set of all neighbors of the vertex vi. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.
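As an illustration, a link network of this kind can be built directly from the message table; the `messages` input below is a hypothetical (sender, receivers) list, not the thesis data structure.

```python
from collections import defaultdict

def build_network(messages):
    """`messages`: iterable of (sender, receivers) pairs; returns N(v)."""
    neighbors = defaultdict(set)
    for sender, receivers in messages:
        for receiver in receivers:
            neighbors[sender].add(receiver)   # an edge e_vivj in W ...
            neighbors[receiver].add(sender)   # ... makes both sides neighbors
    return neighbors                          # N(v) = neighbors[v]

net = build_network([("a@x.com", ["b@x.com", "c@x.com"]),
                     ("b@x.com", ["c@x.com"])])
print(net["c@x.com"])  # {'a@x.com', 'b@x.com'}
```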

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

Co-citation(vi, vj) = |N(vi) ∩ N(vj)|   (2.13)

In graph theory this is often known as the shared neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices vi, vj, vk and two edges w_ik, w_jk, such that vi and vj are connected via the third vertex vk. Figure 2.6 provides an example of a trivial network where the email addresses vi and vj are considered to be aliases, because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices vi and vj being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of vi and vj is defined as follows:

Jaccard(vi, vj) = |N(vi) ∩ N(vj)| / |N(vi) ∪ N(vj)|   (2.14)

where N(vi) again designates the set of neighbors of vi. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random in predicting where new links will form. However, it sometimes performs only on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
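Given a neighbors mapping like the one sketched above, the shared-neighbor count of equation (2.13) and the Jaccard similarity of equation (2.14) are one-liners:

```python
def cocitation(neighbors, vi, vj):
    """Shared neighbor frequency, equation (2.13)."""
    return len(neighbors[vi] & neighbors[vj])

def jaccard(neighbors, vi, vj):
    """Jaccard similarity of the two neighborhoods, equation (2.14)."""
    union = neighbors[vi] | neighbors[vj]
    return len(neighbors[vi] & neighbors[vj]) / len(union) if union else 0.0
```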

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors and are denoted by |I(v)| and |O(v)|, respectively. An individual neighbor is denoted as I_x(v) or O_x(v). The similarity between vertices vi and vj can be calculated using the following recursive equation:

SimRank(vi, vj) = C / (|I(vi)| · |I(vj)|) · Σ_{x=1..|I(vi)|} Σ_{y=1..|I(vj)|} SimRank(I_x(vi), I_y(vj))   (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting SimRank(vi, vj) = 1 if vi = vj and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
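A sketch of the fixed-point iteration for equation (2.15), assuming an unweighted directed graph given as a mapping from every vertex to its set of in-going neighbors (all referenced vertices are assumed to be keys):

```python
def simrank(in_nbrs, C=0.8, iterations=10):
    """Iterate equation (2.15) to a fixed point on a small directed graph."""
    nodes = list(in_nbrs)
    # Initialization: 1 on the diagonal, 0 elsewhere.
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v or not in_nbrs[u] or not in_nbrs[v]:
                    continue  # no in-links means the score stays 0
                total = sum(sim[x][y] for x in in_nbrs[u] for y in in_nbrs[v])
                new[u][v] = C * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
        sim = new
    return sim
```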

PageSim [42] is another extension of the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices vi and vj is calculated using:

ConnectedPath(vi, vj) = Σ_{p ∈ PATH(vi, vj, r)} U(p) / length(p)   (2.16)

where PATH(vi vj r) is the collection of all paths between vi and vj of lengthr U(p) is the uniqueness of a particular path p isin PATH which is calculatedas follows

U(p) = Σ_{vx ∈ path(vi, vj), vx ∉ {vi, vj}} UQ(vx)   (2.17)

where UQ(vx) denotes the uniqueness of a single vertex vx in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(vx) = (|w_x,x−1| + |w_x,x+1|) / Σ_{∀vg ∈ V} |w_x,g|   (2.18)

where w_x,g denotes an edge between vx ∈ path(vi, vj) and any other vertex vg ∈ V, and w_x,x+1 and w_x,x−1 denote the edges from vx to its adjacent vertices in the path. Figure 2.7 provides an example of the vertices vi and vj having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
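A sketch of equations (2.16)-(2.18) on an unweighted network, reusing the neighbors mapping from the earlier sketch. With unit edge weights, the uniqueness of an inner vertex reduces to 2 divided by its degree; this is an illustration under that assumption, not the thesis implementation.

```python
def connected_path(neighbors, vi, vj, r=3):
    """Sum U(p)/length(p) over all simple paths of length at most r."""
    score = 0.0
    stack = [(vi, [vi])]
    while stack:
        node, path = stack.pop()
        if node == vj and len(path) > 1:
            inner = path[1:-1]                    # vertices between vi and vj
            u = sum(2.0 / len(neighbors[vx]) for vx in inner)  # U(p), eq. 2.17
            score += u / (len(path) - 1)          # U(p) / length(p), eq. 2.16
            continue
        if len(path) - 1 >= r:                    # do not search deeper than r
            continue
        for nb in neighbors[node]:
            if nb not in path:                    # keep paths simple
                stack.append((nb, path + [nb]))
    return score
```

Note that a direct edge contributes nothing here, since a path without intermediate vertices has uniqueness zero; the signal comes from paths of length 2 and 3, matching the depth-3 search used later in this thesis.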


Figure 2.7: An example of three different paths between the vertices vi and vj. The most direct path (px) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = α·si + β·sj + γ·sk, where si, sj and sk denote the scores assigned by techniques i, j and k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in section 2.3 to generate normalized pair-wise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set was manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, one that can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias          false alias
retrieved        true positives (tp)    false positives (fp)
not retrieved    false negatives (fn)   true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table such as the one that can be seen in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = correct classifications / total number of classifications = (tp + tn) / (tp + fp + fn + tn)   (2.19)

Although it looks like a good measure of performance, it is not that hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as:

P = |retrieved aliases ∩ correct aliases| / |retrieved aliases| = tp / (tp + fp)   (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as:

R = |retrieved aliases ∩ correct aliases| / |total correct aliases| = tp / (tp + fn)   (2.21)

These two measures are not as dependent on the class distribution as the accuracy measure. Therefore, they are a more sensible choice in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user that wants to automate the complete process, and be able to rely greatly on the classification given by the system, will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as:

F = 1 / (α · (1/P) + (1 − α) · (1/R))   (2.22)

Often, the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as:

F1 = 2 · (precision · recall) / (precision + recall)   (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].
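Equations (2.20)-(2.23) translate directly into a small helper; the counts in the usage line are arbitrary:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from contingency-table counts."""
    p = tp / (tp + fp) if tp + fp else 0.0   # equation (2.20)
    r = tp / (tp + fn) if tp + fn else 0.0   # equation (2.21)
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # equation (2.23)
    return p, r, f1

print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.667, 0.727)
```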

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46], and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable classifiers for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of overfitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. This chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "noaddress@enron.com" and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252758 emails by 151 different employees.

The corpus that was made available by Shetty and Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step   Records affected   Percentage removed (cum.)
1      17052              6.70%
3      13681              12.00%
4      26223              22.50%
5      4001               24.00%
6      25990              34.00%
7      3700               35.80%
8      52163              56.50%

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained 10 or fewer words were removed, since they contained too little useful information.

5. Authors that had written a total number of words of 100 or fewer were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.
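As an illustration of step 6 (and the word filter of step 4), a pandas sketch; the thesis performed these steps in SQL, and the column names below are assumptions about the message table:

```python
import pandas as pd

# Hypothetical message table with the columns described above.
df = pd.DataFrame([
    {"sender": "a@enron.com", "receiver": "b@enron.com", "subject": "x",
     "body": "thank you very much we will give it a try see you tomorrow",
     "send_date": "2000-12-12"},
    {"sender": "a@enron.com", "receiver": "b@enron.com", "subject": "x",
     "body": "thank you very much we will give it a try see you tomorrow",
     "send_date": "2000-12-12"},
])

# Step 6: identical sender, receiver, body, send date and subject = duplicate.
df = df.drop_duplicates(subset=["sender", "receiver", "body",
                                "send_date", "subject"])

# Step 4: drop messages with 10 or fewer words.
df = df[df["body"].str.split().str.len() > 10]
print(len(df))  # 1
```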

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances that is needed for classification using Support Vector Machines. The results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consisted of 44,912 emails by 246 different senders. For each message the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors.

[Figure: plot of 10-fold cross-validation accuracy against the number of training instances per class (20-200), for linear and RBF kernels.]

Figure 3.2: Averages of 10 times 10-fold cross-validation using different training set sizes and kernels for the authorship SVM.

Figure 3.4 provides an overview of the total number of words per author. The x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set. Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it represents the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


[Figure: histogram of the number of emails per author; x-axis: number of emails (90-230), y-axis: number of authors.]

Figure 3.3: The distribution of email messages per author.

[Figure: histogram of the total number of words per author; x-axis: total number of words (logarithmic scale from 10,000 upwards), y-axis: number of authors.]

Figure 3.4: The distribution of the total number of words per author.


Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No alias                             193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed   Hard
High Jaro-Winkler    6       2
Low Jaro-Winkler     8       16
No alias             6       2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.comA & john.doe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
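For reference, a minimal self-contained sketch of the Jaro-Winkler computation, with the standard prefix weight p = 0.1 and a common prefix capped at four characters; the example addresses are illustrative:

def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Half-transpositions: matched characters that appear in a different order.
    m1 = [c for i, c in enumerate(s1) if matched1[i]]
    m2 = [c for j, c in enumerate(s2) if matched2[j]]
    t = sum(a != b for a, b in zip(m1, m2)) / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro score boosted for a common prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return score + prefix * p * (1.0 - score)

# Two addresses count as candidate aliases when the score exceeds a threshold.
print(jaro_winkler("john.doe@enron.com", "jon.doe@enron.com"))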

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was normalized as follows:

$$\mathrm{ConnectedPath_{norm}}(v_i, v_j) = \frac{\mathrm{ConnectedPath}(v_i, v_j)}{\mathrm{ConnectedPath_{max}}} \tag{3.1}$$

where $\mathrm{ConnectedPath_{max}}$ is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents, and therefore do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
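A minimal sketch of this neighborhood-based Jaccard score, assuming the link network is derived from (sender, receiver) pairs; the author names are illustrative:

from collections import defaultdict

# Build undirected neighborhoods from (sender, receiver) pairs.
messages = [("alice", "bob"), ("alice", "carol"), ("dave", "bob")]
neighbors = defaultdict(set)
for sender, receiver in messages:
    neighbors[sender].add(receiver)
    neighbors[receiver].add(sender)

def jaccard(a: str, b: str) -> float:
    """Equation 2.14: overlap of the two authors' neighbor sets."""
    na, nb = neighbors[a], neighbors[b]
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

print(jaccard("alice", "dave"))  # both correspond with bob -> 0.5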

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in table 3.4; the list of function words that has been used can be found in the appendix.
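To illustrate, a small sketch of how a few of the lexical features of table 3.4 can be computed for a single message; the tokenization and normalization choices shown are assumptions, and the full 492-dimensional vector is built analogously:

import re

def lexical_features(text: str) -> dict:
    n_chars = len(text)                              # feature 1 (C)
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)                             # feature 54 (M)
    return {
        "alpha_ratio": sum(c.isalpha() for c in text) / max(n_chars, 1),   # 2
        "upper_ratio": sum(c.isupper() for c in text) / max(n_chars, 1),   # 3
        "digit_ratio": sum(c.isdigit() for c in text) / max(n_chars, 1),   # 4
        "short_word_ratio": sum(len(w) < 4 for w in words) / max(n_words, 1),  # 55
        "avg_word_length": sum(map(len, words)) / max(n_words, 1),         # 57
        "distinct_word_ratio": len({w.lower() for w in words}) / max(n_words, 1),  # 60
    }

print(lexical_features("Thank you very much. We will give it a try."))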

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function (RBF) kernel, since its overall performance was better than that of the linear kernel. The parameter $C$ influences the penalty associated with classification errors, whereas $\gamma$ controls the shape of the separating hyperplane. In order to find optimal values of $C$ and $\gamma$, a straightforward grid search has been performed using exponentially growing sequences of $C$ and $\gamma$. Specifically, the accuracy of all combinations of $C = 2^{-5}, 2^{-3}, 2^{-1}, \ldots, 2^{15}$ and $\gamma = 2^{-15}, 2^{-13}, 2^{-11}, \ldots, 2^{3}$ is calculated using

Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters ~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |
54          Total number of words (M)
55          Total number of short words / M (less than four characters)
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation , . ? ! : ; ' "
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.


5 × 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.
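The thesis implements this in SVM.NET; a comparable grid search can be sketched with scikit-learn, with the parameter grids copied from the text and toy data standing in for the email feature vectors:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.svm import SVC

param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 2)],      # 2^-5, 2^-3, ..., 2^15
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],  # 2^-15, 2^-13, ..., 2^3
}
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid,
    cv=RepeatedKFold(n_splits=5, n_repeats=5, random_state=0),  # 5 x 5-fold CV
    scoring="accuracy",
)
X, y = make_classification(n_samples=100, random_state=0)  # toy stand-in data
search.fit(X, y)
print(search.best_params_)  # highest-scoring (C, gamma) combination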

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM classifies whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C#-conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.
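A sketch of the one-versus-all scheme with balanced negative sampling, as described above; scikit-learn is used here in place of SVM.NET, and the demo data is made up:

import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(X: np.ndarray, y: np.ndarray, seed: int = 0) -> dict:
    """One binary RBF-SVM per author: the author's emails form the positive
    class, an equally sized random sample from all other authors the negative."""
    rng = np.random.default_rng(seed)
    models = {}
    for author in np.unique(y):
        pos = np.where(y == author)[0]
        neg = rng.choice(np.where(y != author)[0], size=len(pos), replace=False)
        idx = np.concatenate([pos, neg])
        svm = SVC(kernel="rbf", probability=True)
        svm.fit(X[idx], (y[idx] == author).astype(int))
        models[author] = svm
    return models

def attribute(models: dict, x: np.ndarray):
    """Assign a text to the author whose SVM gives the highest probability."""
    probs = {a: m.predict_proba(x.reshape(1, -1))[0, 1] for a, m in models.items()}
    return max(probs, key=probs.get)

# Toy demo: 3 authors, 10 emails each, 5 features per email.
rng = np.random.default_rng(1)
X = rng.random((30, 5))
y = np.repeat(np.array(["alice", "bob", "carol"]), 10)
models = train_one_vs_all(X, y)
print(attribute(models, X[0]))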

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) and authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network and authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction of whether the input author is an alias or not. An overview of the combined approach can be found in figure 3.6.

Figure 3.6: The structure of the combined approach.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
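A minimal sketch of this setup; the score values are made up, but each row stands for the outputs of the three techniques on one labeled candidate pair, with the 1:5 positive-to-negative ratio described above:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Each row: [Jaro-Winkler score, link-network score, authorship-SVM probability]
pos = rng.uniform(0.6, 1.0, size=(14, 3))   # manually labeled alias pairs
neg = rng.uniform(0.0, 0.5, size=(70, 3))   # 5x as many negative examples
X_train = np.vstack([pos, neg])
y_train = np.array([1] * len(pos) + [0] * len(neg))

voting_svm = SVC(probability=True).fit(X_train, y_train)

# A candidate is predicted to be an alias when the voting SVM's probability
# exceeds the decision threshold under evaluation.
candidate = np.array([[0.90, 0.45, 0.75]])
print(voting_svm.predict_proba(candidate)[0, 1])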

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from table 3.3 are used to determine the precision and recall for various decision thresholds.

Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this section. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 at a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.

[Plots of precision, recall and F1 against the decision threshold (0-1) for (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set.

[Plots of precision, recall and F1 against the decision threshold (0-1) for (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set.


[Plots of precision, recall and F1 against the decision threshold (0-1) for (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set.

[Plots of precision, recall and F1 against the decision threshold (0-1) for (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set.


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler. The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, in which aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or from the use of different email addresses for work, home, etc.

Connected Path. It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases, and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes more complicated link connections into account. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard. Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes direct neighbors into account, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM. The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques. The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. Because of the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results on both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase relative to the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.

How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results of the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There is a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.

Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval (SIGIR), Workshop on Plagiarism Analysis, Authorship Identification and Near-Duplicate Detection, pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611–617, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–246.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


Page 22: Thesis Freek Maes - Final Version

Figure 2.3: An example of linear separation using a maximum margin of error. The 5 points on the boundaries are the support vectors. Image courtesy of Manning et al. [46].

is minimal, since the hyperplane can be moved by a maximum distance before a training error occurs. A benefit of SVM is that it only needs the instances on the decision boundary, the support vectors, in order to classify new instances, resulting in a very fast classification. An example of how SVM maximizes the margin of error of the hyperplane that separates two linearly separable classes can be seen in figure 2.3.

SVMs are defined by a number of parameters that greatly influence the accuracy of the resulting classification model. First and foremost is the choice of kernel function, since it depends on the problem at hand and what underlying trend is being modeled. The kernel function is a mapping function that transforms the original data to a higher-dimensional representation, since separation in higher dimensions is usually easier than in lower dimensions. It enables the SVM to perform separations even when the boundaries are very complex. The simplest kernel is the linear kernel, which assumes that the training data is linearly separable and separates the training data using hyperplanes. A polynomial kernel, as its name implies, can separate the data using combinations of features up to the polynomial order. A Radial Basis Function (RBF) kernel is the most common choice of kernel and can separate data using complex circles or hyperspheres. Figure 2.4 provides an example of how mapping the data from low- to high-dimensional space using an RBF kernel can simplify the separation. For a detailed exposition of the internal workings of Support Vector Machines, see Cortes and Vapnik [14].

Figure 2.4: The original vector space is shown on the left, where circles and squares represent positive and negative classes. On the right is the mapping by the RBF kernel into higher-dimensional space, making the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations on binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to various complexities in implementing this direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-one using voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVM for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.

Another machine learning technique that can be used for authorship attribution is the Artificial Neural Network (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer with a number of neurons that each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training & validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10,000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved with as little as 100 words per text.
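A compact sketch of the unmasking loop, assuming dense chunk-by-feature matrices for the paired text sets and illustrative values for k and the number of iterations:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def unmasking_curve(X, y, k=3, rounds=8):
    """Repeatedly drop the k strongest positive and k strongest negative
    features of a linear SVM and record the cross-validation accuracy;
    a fast drop suggests both text sets were written by the same author."""
    active = np.arange(X.shape[1])
    curve = []
    for _ in range(rounds):
        curve.append(cross_val_score(LinearSVC(), X[:, active], y, cv=5).mean())
        w = LinearSVC().fit(X[:, active], y).coef_[0]
        strongest = np.concatenate([np.argsort(w)[-k:], np.argsort(w)[:k]])
        active = np.delete(active, strongest)
    return curve

rng = np.random.default_rng(0)
X = rng.random((40, 100))          # toy feature matrix
y = np.array([0] * 20 + [1] * 20)  # known-author chunks vs. anonymous chunks
print(unmasking_curve(X, y))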

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution originates from a different but related field, namely link analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices $V$ and a set of edges $W$. Each author in the data set is mapped to a single vertex in $V$, and each message is mapped to a single edge in $W$. Let $v_i, v_j \in V$; then an edge $e_{v_i v_j} \in W$ if a message has been sent from author $v_i$ to author $v_j$. If there exists an edge $e_{v_i v_j} \in W$, then $v_i$ and $v_j$ are considered to be neighbors. The neighborhood $N(v_i)$ is the set of all neighbors of the vertex $v_i$. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and it can be expressed as follows:

$$\text{Co-citation}(v_i, v_j) = |N(v_i) \cap N(v_j)| \tag{2.13}$$

In graph theory this is often known as the shared-neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of $V$ with three vertices $v_i, v_j, v_k$ and two edges $w_{ik}, w_{jk}$, such that $v_i$ and $v_j$ are connected via the third vertex $v_k$. Figure 2.6 provides an example of a trivial network where email addresses $v_i$ and $v_j$ are considered to be aliases, because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices $v_i$ and $v_j$ being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of $v_i$ and $v_j$ is defined as follows:

$$\mathrm{Jaccard}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|} \tag{2.14}$$

where $N(v_i)$ again designates the set of neighbors of $v_i$. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random prediction of where new links will form. However, it sometimes performs only on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let $I(v)$ be the set of in-going neighbors of vertex $v$, and $O(v)$ the set of out-going neighbors of $v$. The in-degree and out-degree represent the number of in-going and out-going neighbors and are denoted by $|I(v)|$ and $|O(v)|$, respectively. An individual neighbor is denoted as $I_i(v)$ or $O_i(v)$. The similarity between vertices $v_i$ and $v_j$ can be calculated using the following recursive equation:

$$\mathrm{SimRank}(v_i, v_j) = \frac{C}{|I(v_i)|\,|I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} \mathrm{SimRank}(I_x(v_i), I_y(v_j)) \tag{2.15}$$

where $C$ is a constant between 0 and 1. In practice the equation can be solved by iterating to a fixed point, letting $\mathrm{SimRank}(v_i, v_j) = 1$ if $v_i = v_j$ and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
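A minimal fixed-point iteration for equation 2.15 on a toy directed graph; the node names and the constant C are illustrative:

from itertools import product

in_neighbors = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}}  # in-going neighbor sets
nodes = list(in_neighbors)
C = 0.8

# Initialize SimRank(v, v) = 1, all other pairs 0, then iterate equation 2.15.
sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, nodes)}
for _ in range(10):
    new = {}
    for u, v in product(nodes, nodes):
        if u == v:
            new[(u, v)] = 1.0
        elif in_neighbors[u] and in_neighbors[v]:
            total = sum(sim[(x, y)] for x in in_neighbors[u] for y in in_neighbors[v])
            new[(u, v)] = C * total / (len(in_neighbors[u]) * len(in_neighbors[v]))
        else:
            new[(u, v)] = 0.0
    sim = new

print(sim[("a", "b")])  # a and b share the in-neighbor c, so their score is high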

PageSim [42] is another extension of the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices $v_i$ and $v_j$ is calculated using:

$$\mathrm{ConnectedPath}(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{\mathrm{length}(p)} \tag{2.16}$$

where $PATH(v_i, v_j, r)$ is the collection of all paths between $v_i$ and $v_j$ of length at most $r$, and $U(p)$ is the uniqueness of a particular path $p \in PATH$, which is calculated as follows:

$$U(p) = \sum_{v_x \in path(v_i, v_j),\; v_x \notin \{v_i, v_j\}} UQ(v_x) \tag{2.17}$$

$UQ(v_x)$ denotes the uniqueness of a single vertex $v_x$ in the path $p$. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

$$UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|} \tag{2.18}$$

where $w_{x,g}$ denotes an edge between $v_x \in path(v_i, v_j)$ and any other vertex $v_g \in V$, and $w_{x,x+1}$ and $w_{x,x-1}$ denote edges from $v_x$ to its adjacent vertices in the path. Figure 2.7 provides an example of vertices $v_i$ and $v_j$ having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
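A sketch of equations 2.16-2.18 on a toy link network, assuming an undirected graph whose edge weights count the messages exchanged; networkx is used for path enumeration, and the names and weights are illustrative:

import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("vi", "a", 3), ("a", "vj", 2), ("vi", "b", 1),
    ("b", "c", 1), ("c", "vj", 4), ("a", "d", 10),
])

def uq(path, k):
    """Uniqueness of the intermediate vertex path[k] (equation 2.18)."""
    v = path[k]
    local = G[v][path[k - 1]]["weight"] + G[v][path[k + 1]]["weight"]
    total = sum(d["weight"] for _, _, d in G.edges(v, data=True))
    return local / total

def connected_path(vi, vj, r=3):
    """Equations 2.16 and 2.17: sum the uniqueness of every path between
    vi and vj up to length r, discounted by the path length (in edges)."""
    score = 0.0
    for p in nx.all_simple_paths(G, vi, vj, cutoff=r):
        uniqueness = sum(uq(p, k) for k in range(1, len(p) - 1))
        score += uniqueness / (len(p) - 1)
    return score

print(connected_path("vi", "vj"))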


Figure 2.7: An example of three different paths between the vertices $v_i$ and $v_j$. The most direct path ($p_x$) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results of different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form $f(x) = \alpha s_i + \beta s_j + \gamma s_k$, where $s_i$, $s_j$ and $s_k$ denote the scores assigned by techniques $i, j, k$ respectively, each normalized such that they fall in the range [0, 1]. The weights $\alpha, \beta, \gamma$ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in section 2.3 to generate normalized pair-wise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism, such as a Support Vector Machine, can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct for 88.2% of such cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.
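A minimal sketch of such a score-level voting SVM, using scikit-learn's SVC; the score vectors and labels are invented for illustration:

    import numpy as np
    from sklearn.svm import SVC

    # One row per candidate author pair, columns = (address similarity,
    # content similarity, link similarity); labels mark real aliases.
    X_train = np.array([
        [0.97, 0.81, 0.55],
        [0.95, 0.20, 0.05],   # high address similarity alone is no alias
        [0.40, 0.85, 0.60],
        [0.35, 0.30, 0.55],
        [0.30, 0.15, 0.10],
        [0.55, 0.90, 0.70],
    ])
    y_train = np.array([1, 0, 1, 0, 0, 1])

    voter = SVC(kernel="rbf").fit(X_train, y_train)   # the "voting" SVM
    candidate = [[0.90, 0.75, 0.50]]
    print(voter.predict(candidate))                   # alias or not
    print(voter.decision_function(candidate))         # distance to boundary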

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets. One data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first, to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.
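A possible shape for such a sifting cascade; the score functions and thresholds below are placeholders, not a published design:

    def sift(pairs, cheap_score, expensive_score, hi=0.95, lo=0.05):
        """Two-stage sifting: a cheap string metric settles the obvious
        cases; only ambiguous pairs reach the costly classifier."""
        aliases = []
        for pair in pairs:
            s = cheap_score(pair)
            if s >= hi:                      # obvious alias: accept directly
                aliases.append(pair)
            elif s > lo and expensive_score(pair) >= 0.5:
                aliases.append(pair)         # ambiguous: defer to slow method
        return aliases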


                   correct alias           false alias
    retrieved      true positives (tp)     false positives (fp)
    not retrieved  false negatives (fn)    true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table such as the one that can be seen in Table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = \frac{correct\ classifications}{total\ number\ of\ classifications} = \frac{tp + tn}{tp + fp + fn + tn}    (2.19)

Although it looks like a good measure of performance, it is not that hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as:

P = \frac{|\,retrieved\ aliases \cap correct\ aliases\,|}{|\,retrieved\ aliases\,|} = \frac{tp}{tp + fp}    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as:

R = \frac{|\,retrieved\ aliases \cap correct\ aliases\,|}{|\,total\ correct\ aliases\,|} = \frac{tp}{tp + fn}    (2.21)

These two measures are not as dependent on the class distributions as the accuracy measure. Therefore, they are a more sensible choice to use in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to manually evaluate them anyway. On the other hand, a user that wants to automate the complete process, and to be able to rely greatly on the classification given by the system, will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as:

F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as:

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed, and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46], and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis, in order to get a good view of effectiveness on the smaller classes.
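The difference between the two schemes can be made concrete with a small sketch; the per-author contingency counts below are invented for illustration:

    def precision_recall(tp, fp, fn):
        """Precision and recall from one contingency table (Eqs. 2.20-2.21)."""
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return p, r

    problems = [(8, 2, 1), (1, 0, 3), (2, 1, 1)]   # (tp, fp, fn) per author

    # Macro-averaging: score each problem first, then take the plain mean,
    # giving every author equal weight.
    per_problem = [precision_recall(*c) for c in problems]
    macro_p = sum(p for p, _ in per_problem) / len(per_problem)
    macro_r = sum(r for _, r in per_problem) / len(per_problem)

    # Micro-averaging: pool all counts into one global table first,
    # giving every individual decision equal weight.
    tp, fp, fn = (sum(col) for col in zip(*problems))
    micro_p, micro_r = precision_recall(tp, fp, fn)

    print(macro_p, macro_r)   # ~0.82, ~0.55
    print(micro_p, micro_r)   # ~0.79, ~0.69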

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques, and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of overfitting and perform automated feature selection. Therefore, SVM has been chosen as a classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure, in order to aid the comparison of the different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. This chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in Figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of the email messages it can be assumed that the sender of the email has written it, except for the forward and reply-parts. Concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


    Subject: SSN requirement
    From: monika.causholli@enron.com
    To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
    Sent: 12/12/2000 at 16:08

    Thank you very much. We will give it a try.

    Message id: 293
    From: monika.causholli@enron.com
    To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
    Subject: SSN requirement
    Sent date: 12/12/2000
    Body: Thank you very much. We will give it a try.
    Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set, and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


    Step   Records affected   Percentage removed (cum.)
    1      17,052             6.70
    3      13,681             12.00
    4      26,223             22.50
    5      4,001              24.00
    6      25,990             34.00
    7      3,700              35.80
    8      52,163             56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply-parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply-parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.
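As an illustration, the message-level filters of steps 4 and 6 could be expressed along the following lines, assuming a flat export of the message table with hypothetical file and column names (the actual work was done in a Microsoft SQL database):

    import pandas as pd

    # Hypothetical flat export of the message table.
    df = pd.read_csv("enron_messages.csv")

    # Step 4: remove messages containing 10 words or fewer.
    df = df[df["body"].str.split().str.len() > 10]

    # Step 6: messages sharing sender, receiver, body, send date and
    # subject are duplicates; keep a single copy of each.
    df = df.drop_duplicates(
        subset=["sender", "receiver", "body", "send_date", "subject"])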

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances that were needed for classification using Support Vector Machines. The results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44,912 messages by 246 different senders. For each message the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author. The x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides


[Figure 3.2 plots 10-fold cross-validation accuracy (0.5-1.0) against the number of training instances per class (20-200), for a linear and an RBF kernel.]

Figure 3.2: Averages of 10 times 10-fold cross-validation using different training set sizes and kernels for the authorship SVM.

an overview of the total number of words per author. The x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set. Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it represents the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set, and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


[Figure 3.3 is a histogram over the number of emails sent (90-230, x-axis) versus the number of authors (0-35, y-axis).]

Figure 3.3: The distribution of email messages per author.

[Figure 3.4 is a histogram over the total number of words per author (10,000 to 100,000,000 on a logarithmic x-axis) versus the number of authors (0-180, y-axis).]

Figure 3.4: The distribution of the total number of words per author.


[Figure 3.5 shows the link network of the authors in the final data set, including the artificial aliases; nodes are labeled with email address prefixes and colored by degree.]

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


    Type of Alias                       Number of authors
    High Jaro-Winkler with 1 alias      26
    High Jaro-Winkler with 2 aliases    15
    Low Jaro-Winkler with 1 alias       11
    Low Jaro-Winkler with 2 aliases     1
    No alias                            193

Table 3.2: Artificial aliases in the ENRON data set, by type.

    Test set            Mixed   Hard
    High Jaro-Winkler   6       2
    Low Jaro-Winkler    8       16
    No alias            6       2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.com.A & john.doe@enron.com.B)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
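As an illustration, a compact sketch of the Jaro-Winkler computation from section 2.1; the email addresses at the bottom are invented:

    def jaro(s, t):
        """Jaro similarity: matching characters within a sliding window,
        penalized by the number of transpositions."""
        if s == t:
            return 1.0
        window = max(len(s), len(t)) // 2 - 1
        t_used = [False] * len(t)
        s_matched = []
        for i, ch in enumerate(s):
            lo, hi = max(0, i - window), min(len(t), i + window + 1)
            for j in range(lo, hi):
                if not t_used[j] and t[j] == ch:
                    t_used[j] = True
                    s_matched.append(ch)
                    break
        m = len(s_matched)
        if m == 0:
            return 0.0
        t_matched = [ch for used, ch in zip(t_used, t) if used]
        transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) / 2
        return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

    def jaro_winkler(s, t, p=0.1, max_prefix=4):
        """Boost the Jaro score for a shared prefix of up to four characters."""
        j = jaro(s, t)
        prefix = 0
        for a, b in zip(s, t):
            if a != b or prefix == max_prefix:
                break
            prefix += 1
        return j + prefix * p * (1 - j)

    # Pairs scoring above the decision threshold are reported as aliases.
    print(jaro_winkler("john.doe@enron.com", "jon.doe@enron.com"))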

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range of [0, 1], the score for a particular author-author pair was calculated as follows:

ConnectedPath_{norm}(v_i, v_j) = \frac{ConnectedPath(v_i, v_j)}{ConnectedPath_{max}}    (3.1)

where ConnectedPath_{max} is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
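A minimal sketch of this neighborhood comparison; the neighbor sets are invented examples:

    def jaccard(neighbors_a, neighbors_b):
        """Jaccard similarity: shared neighbors over all distinct neighbors."""
        union = neighbors_a | neighbors_b
        return len(neighbors_a & neighbors_b) / len(union) if union else 0.0

    a = {"kim.warde", "mark.palmer", "tim.belden"}
    b = {"kim.warde", "tim.belden", "paul.kaufman"}
    print(jaccard(a, b))   # 2 shared out of 4 distinct -> 0.5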

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of writing to subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features, to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.


    Features   Description

    Lexical
    1          Total number of characters (C)
    2          Total number of alphabetic characters / C
    3          Total number of upper-case characters / C
    4          Total number of digit characters / C
    5          Total number of white-space characters / C
    6          Total number of tab spaces / C
    7-32       Frequency of letters A-Z
    33-53      Frequency of special characters (~ $ ^ & - _ = + > < [ ] | etc.)
    54         Total number of words (M)
    55         Total number of short words (less than four characters) / M
    56         Total number of characters in words / C
    57         Average word length
    58         Average sentence length (in characters)
    59         Average sentence length (in words)
    60         Total different words / M
    61         Hapax legomena: frequency of once-occurring words
    62         Hapax dislegomena: frequency of twice-occurring words
    63-82      Word length frequency distribution / M
    83-333     TF-IDF of 250 most frequent 3-grams

    Syntactic
    334-341    Frequency of punctuation: , . ? ! : ; ' "
    342-491    Frequency of function words

    Structural
    492        Total number of sentences

Table 3.4: Feature set for the authorship SVM.
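A sketch of the grid search described above, with scikit-learn standing in for the SVM software used in this thesis and a single 5-fold cross-validation per parameter pair for brevity; the feature matrix is a random placeholder for the 492 stylometric features:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((160, 492))             # placeholder feature vectors
    y = rng.integers(0, 2, 160)            # placeholder author labels

    best_acc, best_params = 0.0, None
    for log_C in range(-5, 16, 2):         # C = 2^-5, 2^-3, ..., 2^15
        for log_g in range(-15, 4, 2):     # gamma = 2^-15, ..., 2^3
            svm = SVC(kernel="rbf", C=2.0 ** log_C, gamma=2.0 ** log_g)
            acc = cross_val_score(svm, X, y, cv=5).mean()
            if acc > best_acc:
                best_acc, best_params = acc, (2.0 ** log_C, 2.0 ** log_g)

    print("best CV accuracy %.3f at (C, gamma) = %s" % (best_acc, best_params))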

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.
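The training scheme of the last two paragraphs might be sketched as follows, with scikit-learn again standing in for SVM.NET; emails_by_author is a hypothetical mapping from each author to the feature matrix of that author's emails:

    import numpy as np
    from sklearn.svm import SVC

    def train_one_vs_all(emails_by_author):
        """One binary authorship SVM per author: that author's emails form
        the positive class, and an equally sized random sample from all
        other authors forms the negative class (keeping classes balanced)."""
        rng = np.random.default_rng(0)
        models = {}
        for author, pos in emails_by_author.items():
            pool = np.vstack([m for a, m in emails_by_author.items()
                              if a != author])
            neg = pool[rng.choice(len(pool), size=len(pos), replace=False)]
            X = np.vstack([pos, neg])
            y = np.array([1] * len(pos) + [0] * len(neg))
            models[author] = SVC(kernel="rbf", probability=True).fit(X, y)
        return models

    # Attribution: let every authorship SVM assign a probability to a new
    # email vector x, then rank the candidate authors by that score, e.g.:
    # scores = {a: m.predict_proba(x.reshape(1, -1))[0, 1]
    #           for a, m in models.items()}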

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

Figure 3.6: The structure of the combined approach.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its best F1-score of 0.80 at a decision threshold of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on a decision threshold ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88, using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89, using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Figure 4.1 plots precision, recall and F1 against the decision threshold (0-1) in four panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set.

[Figure 4.2 plots precision, recall and F1 against the decision threshold (0-1) in two panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set.


[Figure 4.3 plots precision, recall and F1 against the decision threshold (0-1) in four panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set.

[Figure 4.4 plots precision, recall and F1 against the decision threshold (0-1) in two panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set.


Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that using a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases, and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search were performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If there was no Connected Path score returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author, and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase relative to the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that were formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD 2000).

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55–64.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval, Workshop on Plagiarism Analysis, Authorship Identification and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611–617, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–246.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


Page 23: Thesis Freek Maes - Final Version

Figure 2.4: The original vector space is shown on the left, where circles and squares represent the positive and negative classes. On the right is the mapping by the RBF kernel into a higher-dimensional space, which makes the separation easier. The black squares and circles represent the support vectors. Image courtesy of Yang [68].

Variations of the binary SVM exist that make it possible to perform direct multi-class classification [15]. However, due to the complexity of implementing such a direct multi-class solution, preference is usually given to a combination of binary classifiers. Popular methods are one-vs-all using a winner-takes-all strategy, one-vs-one using max-wins voting, and error-correcting codes [18]. Numerous authorship attribution studies utilize the power of Support Vector Machines. For example, Tsuboi and Matsumoto [66], Gamon [25], Abbasi and Chen [1], Zheng et al. [71] and Luyckx et al. [45] report good results when using SVMs for various authorship attribution tasks, such as determining the authorship of Japanese web forum messages, English and Chinese newsgroup messages, or student essays.

Another machine learning technique that can be used for authorship attribution is the Artificial Neural Network (ANN). An ANN mimics the workings of the neural networks in the human brain and can be used to predict the authorship of documents based on a set of training examples. The ANN consists of a number of neurons in the input layer equal to the number of features in each training vector, and a single output neuron that predicts the final class. In between is a hidden layer whose neurons each use input weights to transform a subset of the input neurons into an output. The output is then either routed to another neuron or set of neurons in the hidden layer, or to the output layer. An example of a simple neural network with 5 input nodes (features), one hidden layer with 6 nodes and 1 output node can be seen in figure 2.5. The network is trained by adjusting the weights of each node such that it results in a correct prediction.

Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

Tearle et al. [65] use an automated algorithm to select input metrics and training & validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of the features that is may vary among authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially well suited to preventing false positives when the author is not in the candidate set. Reasonable results can be achieved with as little as 100 words per text.
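The core loop of Unmasking is easy to sketch. The fragment below is a minimal illustration, not Koppel et al.'s implementation: it assumes chunk-level feature matrices for the known author (X_a) and the questioned text (X_b), and uses scikit-learn's LinearSVC for the repeatedly retrained linear SVM.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unmasking_curve(X_a, X_b, n_iterations=10, k=3):
    """Degradation curve for Unmasking [40] (sketch). Returns the
    cross-validation accuracy after each feature-elimination round;
    a steep drop suggests both chunk sets share the same author."""
    X = np.vstack([X_a, X_b])
    y = np.array([0] * len(X_a) + [1] * len(X_b))
    active = np.arange(X.shape[1])        # features still in play
    accuracies = []
    for _ in range(n_iterations):
        clf = LinearSVC(C=1.0, dual=False)
        accuracies.append(cross_val_score(clf, X[:, active], y, cv=5).mean())
        clf.fit(X[:, active], y)
        w = clf.coef_[0]
        # drop the k strongest negative and k strongest positive weights
        drop = np.concatenate([np.argsort(w)[:k], np.argsort(w)[-k:]])
        active = np.delete(active, drop)
    return accuracies
```

The attribution decision itself would then inspect how quickly the returned curve collapses, e.g. by thresholding the accuracy drop between the first and last iterations.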

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely link analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j \in V; then there is an edge e_{v_i,v_j} \in W if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i,v_j} \in W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures are discussed in the following section.
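As a concrete illustration, the neighborhood map N can be built directly from (sender, recipients) pairs. This is a minimal sketch under the assumption that each message is available in that form; the field layout is illustrative and not taken from the thesis implementation.

```python
from collections import defaultdict

def build_link_network(messages):
    """Build the undirected link network described above (sketch).

    `messages` is an iterable of (sender, recipients) pairs, e.g.
    [("alice@enron.com", ["bob@enron.com", "carol@enron.com"]), ...].
    Returns a neighborhood map N: author -> set of neighboring authors.
    """
    N = defaultdict(set)
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:      # ignore self-addressed mail
                N[sender].add(recipient)
                N[recipient].add(sender)
    return N
```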

2.3.1 Techniques

Co-citation occurs when two scientific documents are jointly cited by one or more other documents. If papers A and B are both cited by a third paper C, it is possible that A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

Co-citation(v_i, v_j) = |N(v_i) \cap N(v_j)|    (2.13)

In graph theory this is often known as the shared-neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases because they are connected by three different Connected Triples (red, yellow and blue).

Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

Jaccard(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}    (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random in predicting where new links will form. However, it sometimes performs on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
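Expressed over the neighborhood map N from the sketch above, equation (2.14) is a one-liner:

```python
def jaccard(N, vi, vj):
    """Jaccard similarity of two authors' neighborhoods, eq. (2.14)."""
    union = N[vi] | N[vj]
    return len(N[vi] & N[vj]) / len(union) if union else 0.0
```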

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively. An individual neighbor is denoted as I_i(v) or O_i(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

SimRank(v_i, v_j) = \frac{C}{|I(v_i)| \, |I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} SimRank(I_x(v_i), I_y(v_j))    (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iterating to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
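The fixed-point iteration can be sketched as follows. This is a naive version intended for small graphs only, assuming `I` maps every vertex to the set of its in-going neighbors; the decay constant C = 0.8 is chosen arbitrarily for illustration.

```python
import itertools

def simrank(I, C=0.8, iterations=5):
    """Fixed-point iteration for SimRank, eq. (2.15) (sketch).
    Starts from sim(v, v) = 1 and 0 elsewhere, as described above."""
    nodes = list(I)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_sim = {}
        for a, b in itertools.product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
            elif I[a] and I[b]:
                total = sum(sim[(x, y)] for x in I[a] for y in I[b])
                new_sim[(a, b)] = C * total / (len(I[a]) * len(I[b]))
            else:
                new_sim[(a, b)] = 0.0   # no in-links: similarity is zero
        sim = new_sim
    return sim
```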

PageSim [42] is another extension of the co-citation algorithm; it assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

ConnectedPath(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{length(p)}    (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length r, and U(p) is the uniqueness of a particular path p \in PATH, which is calculated as follows:

U(p) = \sum_{v_x \in path(v_i, v_j), \; v_x \notin \{v_i, v_j\}} UQ(v_x)    (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|}    (2.18)

where w_{x,g} denotes an edge between v_x \in path(v_i, v_j) and any other vertex v_g \in V, and w_{x,x-1} and w_{x,x+1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
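A depth-limited search makes equations (2.16)–(2.18) concrete. The sketch below again uses the neighborhood map N; with unweighted edges, |w_{x,x-1}| + |w_{x,x+1}| reduces to 2 and the denominator of (2.18) to the degree of v_x. The default r = 3 mirrors the depth used later in section 3.2.

```python
def connected_path(N, vi, vj, r=3):
    """Connected Path score, eqs. (2.16)-(2.18), for unweighted edges (sketch)."""
    score = 0.0
    stack = [(vi, [vi])]                 # depth-first search from vi
    while stack:
        node, path = stack.pop()
        for nb in N[node]:
            if nb == vj and len(path) >= 2:
                # completed path path + [vj]; length(p) = len(path) edges
                u = sum(2.0 / len(N[vx]) for vx in path[1:])   # eqs. (2.17)-(2.18)
                score += u / len(path)                          # eq. (2.16)
            elif nb != vj and nb not in path and len(path) < r:
                stack.append((nb, path + [nb]))
    return score
```

Direct edges between v_i and v_j are skipped because a path without intermediate vertices has U(p) = 0 and contributes nothing to the sum.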

Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results from different techniques can becombined to improve the final result

One of the simplest methods is to create a linear combination of the form f(x) = αs_i + βs_j + γs_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j and k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β and γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm described in section 2.3 to generate normalized pairwise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved a higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.
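Such a linear combination is trivial to implement; the weights below are purely illustrative placeholders, not values from any of the cited studies.

```python
def combine_scores(scores, weights=(0.4, 0.3, 0.3)):
    """f(x) = alpha*s_i + beta*s_j + gamma*s_k over normalized [0, 1] scores."""
    return sum(w * s for w, s in zip(weights, scores))

# e.g. a (Jaro-Winkler, link analysis, authorship SVM) score triple:
print(combine_scores((0.95, 0.31, 0.60)))   # ~0.65
```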

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism such as a Support Vector Machine can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve on the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, these results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set was manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.
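Since no published implementation exists, the following is a purely hypothetical sketch of such a cascade: a cheap string metric decides the clear-cut cases, and only the ambiguous middle band is passed to an expensive classifier. Both thresholds and the `svm_probability` callable are illustrative assumptions.

```python
def sift(pair, jw_score, svm_probability, hi=0.95, lo=0.05):
    """Cascade: cheap metric first, costly model only for ambiguous pairs."""
    if jw_score >= hi:
        return True    # confidently an alias, no expensive model needed
    if jw_score <= lo:
        return False   # confidently unrelated
    return svm_probability(pair) >= 0.5   # ambiguous band: ask the classifier
```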

                 correct alias            false alias
retrieved        true positives (tp)      false positives (fp)
not retrieved    false negatives (fn)     true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution it is common to construct a contingency table such as the one shown in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy, defined as the percentage of classifications that are correct:

Accuracy = \frac{correct\ classifications}{total\ number\ of\ classifications} = \frac{tp + tn}{tp + fp + fn + tn}    (2.19)

Although it looks like a good measure of performance, it is not hard to obtain a high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain a high accuracy simply by classifying all examples as negative. In most cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as

P = \frac{|retrieved\ aliases \cap correct\ aliases|}{|retrieved\ aliases|} = \frac{tp}{tp + fp}    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

R = \frac{|retrieved\ aliases \cap correct\ aliases|}{|total\ correct\ aliases|} = \frac{tp}{tp + fn}    (2.21)

These two measures are not as dependent on the class distribution as the accuracy measure, and are therefore a more sensible choice in this situation. Moreover, by having these two measures of performance it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user that wants to automate the complete process, and to be able to rely greatly on the classification given by the system, will favor precision over recall. Since the

preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can simply be written as

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive: since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].
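A quick numerical check illustrates the point for a degenerate classifier that labels every candidate pair an alias (the precision value is invented for the example):

```python
p, r = 0.02, 1.0                    # precision collapses, recall is perfect
arithmetic = (p + r) / 2            # 0.51: misleadingly close to 'decent'
f1 = 2 * p * r / (p + r)            # ~0.039: stays near min(p, r)
print(arithmetic, f1)
```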

Averaging the precision and recall scores of different test runs can be done in two ways. One is micro-averaging, where a contingency table for all the problems together is constructed, and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46], and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of the effectiveness on the smaller classes.

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the ones most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of overfitting, and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; for text stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it: the message id (293), the from- and to-addresses, the subject ("SSN requirement"), the send date (12/12/2000 at 16:08), the body ("Thank you very much. We will give it a try.") and an attachment flag (false).

The most well-known version of the data set was made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer when organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step   Records affected   Percentage removed (cum.)
1      17052              6.70
3      13681              12.00
4      26223              22.50
5      4001               24.00
6      25990              34.00
7      3700               35.80
8      52163              56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply parts of messages have been removed.

3. Empty messages, resulting from the removal of forward or reply parts in step 2, were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step. A sketch of the duplicate-removal rule of step 6 is shown below.
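The de-duplication of step 6 can be expressed compactly. This is a minimal sketch assuming the messages sit in a pandas DataFrame with these (hypothetical) column names; it is not the actual SQL used on the thesis database.

```python
import pandas as pd

def drop_duplicate_messages(emails: pd.DataFrame) -> pd.DataFrame:
    """Step 6: messages with identical sender, receiver, body, send date
    and subject are duplicates; keep a single copy of each group."""
    return emails.drop_duplicates(
        subset=["sender", "receiver", "body", "send_date", "subject"],
        keep="first",
    )
```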

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consisted of 44912 emails by 246 different senders; for each message the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides


Figure 3.2: Averages of 10 × 10-fold cross-validation accuracy for different training set sizes (20–200 instances per class) and kernels (linear, RBF) for the authorship SVM; x-axis: number of training instances per class, y-axis: 10-fold cross-validation accuracy.

an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10000 and 100000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; this corresponds to the number of messages that the author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set, and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


Figure 3.3: The distribution of email messages per author; x-axis: number of emails (90–230), y-axis: number of authors.

Figure 3.4: The distribution of the total number of words per author; x-axis: total number of words (logarithmic scale), y-axis: number of authors.


Figure 3.5: A network graph of the authors in the subset of the ENRON data set. Node color represents the degree of the node.


Type of alias                          Number of authors
High Jaro-Winkler with 1 alias         26
High Jaro-Winkler with 2 aliases       15
Low Jaro-Winkler with 1 alias          11
Low Jaro-Winkler with 2 aliases        1
No alias                               193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set              Mixed   Hard
High Jaro-Winkler     6       2
Low Jaro-Winkler      8       16
No alias              6       2

Table 3.3: Distribution of alias types in the two test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. johndoe@enron.comA & johndoe@enron.comB);

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah);

• Authors without an alias.

The distribution of authors and aliases in the final data set can be seen in more detail in table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
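In outline, the procedure is no more than a thresholded pairwise comparison. The sketch below assumes a `jaro_winkler(a, b)` function implementing the metric reviewed in section 2.1.1; the default threshold echoes the best-F1 region reported in chapter 4.

```python
def jw_alias_candidates(addresses, query, jaro_winkler, threshold=0.94):
    """Return all addresses whose Jaro-Winkler similarity to `query`
    is at least the decision threshold, i.e. the predicted aliases."""
    return [a for a in addresses
            if a != query and jaro_winkler(query, a) >= threshold]
```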

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores fall in the range [0, 1], the score for a particular author-author pair was normalized as follows:

ConnectedPath_{norm}(v_i, v_j) = \frac{ConnectedPath(v_i, v_j)}{ConnectedPath_{max}}    (3.1)

where ConnectedPath_{max} is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhoods of their correspondents, and thus do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.

The last individual technique that has been evaluated is the use of an SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences between individual email messages. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function (RBF) kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^{-5}, 2^{-3}, 2^{-1}, ..., 2^{15} and γ = 2^{-15}, 2^{-13}, 2^{-11}, ..., 2^{3} is calculated using


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters: ~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |
54          Total number of words (M)
55          Total number of short words / M (less than four characters)
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of the 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation characters: , . ? ! : ; ' "
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.

5 × 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.
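The grid search can be reproduced in a few lines. This sketch uses scikit-learn rather than the SVM.NET implementation actually used in the thesis; `X` and `y` stand for the prepared feature vectors of table 3.4 with binary author labels, and cv=5 approximates one run of the 5 × 5-fold scheme.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_rbf_svm(X, y):
    """Grid search over exponentially growing C and gamma (sketch)."""
    param_grid = {
        "C": 2.0 ** np.arange(-5, 17, 2),       # 2^-5, 2^-3, ..., 2^15
        "gamma": 2.0 ** np.arange(-15, 5, 2),   # 2^-15, 2^-13, ..., 2^3
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```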

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification of whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of emails from other authors is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single- and multi-class problems, using different kernels and parameters.
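Combining the one-versus-all scheme with the class balancing just described gives roughly the following training loop. It is a sketch only: `featurize` (mapping an email to the table 3.4 vector) and `pick_negatives` (randomly sampling an equal number of other authors' emails) are assumed helpers, and scikit-learn's SVC stands in for SVM.NET.

```python
from sklearn.svm import SVC

def train_authorship_svms(emails_by_author, featurize, pick_negatives):
    """One-versus-all training with balanced classes (sketch)."""
    models = {}
    for author, emails in emails_by_author.items():
        positives = [featurize(e) for e in emails]
        negatives = [featurize(e) for e in pick_negatives(author, len(emails))]
        X = positives + negatives
        y = [1] * len(positives) + [0] * len(negatives)
        clf = SVC(kernel="rbf", probability=True)   # probabilities for ranking
        clf.fit(X, y)
        models[author] = clf
    return models
```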

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) and authorship SVM on email content;

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network and authorship SVM on email content.

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the

results of the three techniques for a single candidate author, and gives as output a prediction of whether the input author is an alias or not. An overview of the combined approach can be found in figure 3.6.

Figure 3.6: The structure of the combined approach.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used for testing are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
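Training the voting SVM then amounts to fitting a classifier on three-dimensional score vectors. The sketch below assumes each training row holds the (Jaro-Winkler, link analysis, authorship SVM) scores for one candidate pair plus its manual label; the threshold in the comment is the best-performing one from chapter 4.

```python
from sklearn.svm import SVC

def train_voting_svm(rows):
    """Each row: (jw_score, link_score, svm_score, is_alias_label)."""
    X = [row[:3] for row in rows]
    y = [row[3] for row in rows]
    voter = SVC(kernel="rbf", probability=True)
    voter.fit(X, y)
    return voter

# A pair is flagged as an alias when the voting probability exceeds the
# chosen decision threshold, e.g. 0.78 for JW-Jaccard-SVM on the mixed set:
# voter.predict_proba([[0.96, 0.41, 0.72]])[0, 1] >= 0.78
```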

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets of table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments are discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set are given. Second, the results of the individual and combined techniques on the hard test set are given. Finally, an overview of the best results achieved by each individual and combined technique is given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas

JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores of all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the mixed test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the mixed test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for the individual techniques on the hard test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for the combined techniques on the hard test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.

Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.

Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler. The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with a low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or from the use of different email addresses for work, home, etc.

Connected Path. It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases, and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond direct neighbors, since it takes more complicated link connections into account. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved a better accuracy by searching to depth 4, compared with a search to depth 2; it is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard. Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes direct neighbors into account, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM. The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques. The combination JW-CP-SVM did not perform very well: on the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results on both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that were formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, combined by an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it would be worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to investigate how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results of the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There is a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.


[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining – Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.


[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics – COLING '04, pages 611-es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.


[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING '08), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.


[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The soundex coding system. US Patent 1261167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. Int. J. Metadata Semant. Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.


[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.


Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between,

both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including,

inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of,

off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something,

such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we,

what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



Figure 2.5: An example of an Artificial Neural Network with 1 hidden layer. The output node predicts the class of the instance.

correct prediction. Tearle et al. [65] use an automated algorithm to select input metrics and training and validation instances in order to train an artificial neural network. Although their method is computationally intensive, it achieves perfect accuracy when applied to the aforementioned Shakespeare/Marlowe debate and the Federalist Papers.

A fairly recent and novel technique for attributing authorship is Unmasking, by Koppel et al. [40]. Unmasking is based on the idea that only a small subset of the feature set is responsible for identifying a particular author; which subset of the features that is might vary among different authors. Using a linear SVM trained to distinguish between one author and an anonymous text, they iteratively remove the k strongest weighted positive and negative features from the feature set and retrain the SVM. If the anonymous text is written by the same author as the other training texts, the cross-validation accuracy will drop significantly after a certain number of iterations. In other words, when a particular subset of distinguishing features is removed, it becomes increasingly hard to distinguish between the two authors. The actual attribution of the text is performed by analyzing the speed with which the cross-validation accuracy degrades after each iteration. Since the speed of degradation can be quantified, it is possible to deal with open candidate sets (i.e. to verify that a text has been written by a particular author) using the Unmasking technique. On the same data set of 10000 blogs that Koppel et al. used in section 2.2.4, the Unmasking technique was able to attribute a 500-word snippet to one of 1000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.
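To make the procedure concrete, the following is a minimal sketch of how one unmasking curve could be computed, assuming scikit-learn and two pre-computed feature matrices; the function name, feature representation, and parameter values (k, number of iterations) are illustrative assumptions, not the exact setup of Koppel et al. [40].

```python
# A minimal sketch of the unmasking procedure; all names are illustrative.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unmasking_curve(X_author, X_unknown, n_iterations=10, k=3):
    """Cross-validation accuracies as the k strongest positive and negative
    features are removed in each iteration. Rows are text chunks."""
    X = np.vstack([X_author, X_unknown]).astype(float)
    y = np.array([1] * len(X_author) + [0] * len(X_unknown))
    active = np.arange(X.shape[1])          # indices of features still in use
    accuracies = []
    for _ in range(n_iterations):
        clf = LinearSVC(C=1.0, max_iter=10000)
        accuracies.append(cross_val_score(clf, X[:, active], y, cv=5).mean())
        clf.fit(X[:, active], y)
        w = clf.coef_[0]
        # Drop the k most strongly positive and k most strongly negative weights.
        strongest = np.concatenate([np.argsort(w)[-k:], np.argsort(w)[:k]])
        active = np.delete(active, strongest)
    return accuracies

# Same-author pairs show a steep drop in accuracy over the iterations;
# different-author pairs degrade much more slowly.
```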

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution is one that originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then an edge e_{v_i,v_j} ∈ W if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i,v_j} ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures will be discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, is when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that papers A and B are somehow related. Co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

Co-citation(v_i, v_j) = |N(v_i) ∩ N(v_j)|   (2.13)

In Graph Theory this is often known as the shared neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases because they are connected by three different Connected Triples (red, yellow, and blue).


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

Jaccard(v_i, v_j) = |N(v_i) ∩ N(v_j)| / |N(v_i) ∪ N(v_j)|   (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random in predicting where new links will form. However, it sometimes performs on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.

SimRank [33] is an iterative extension to co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively. An individual neighbor is denoted as I_x(v) or O_x(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

SimRank(v_i, v_j) = ( C / (|I(v_i)| · |I(v_j)|) ) · Σ_{x=1..|I(v_i)|} Σ_{y=1..|I(v_j)|} SimRank(I_x(v_i), I_y(v_j))   (2.15)

where C is a constant between 0 and 1. In practice the equation can be solved by iterating to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the Co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
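The fixed-point iteration can be sketched as follows, for a small directed graph given as a mapping from each node to its set of in-going neighbors; the values of C and the number of iterations are illustrative choices.

```python
# A sketch of the SimRank fixed-point iteration (equation 2.15).
def simrank(in_neighbors, C=0.8, iterations=10):
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_sim = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[(a, b)] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new_sim[(a, b)] = 0.0   # no in-going links: similarity 0
                    continue
                total = sum(sim[(x, y)] for x in ia for y in ib)
                new_sim[(a, b)] = C * total / (len(ia) * len(ib))
        sim = new_sim
    return sim
```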

PageSim [42] is another extension to the co-citation algorithm that assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

ConnectedPath(v_i, v_j) = Σ_{p ∈ PATH(v_i, v_j, r)} U(p) / length(p)   (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length at most r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

U(p) = Σ_{v_x ∈ p, v_x ∉ {v_i, v_j}} UQ(v_x)   (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = ( |w_{x,x−1}| + |w_{x,x+1}| ) / Σ_{∀ v_g ∈ V} |w_{x,g}|   (2.18)

where w_{x,g} denotes an edge between v_x ∈ p and any other vertex v_g ∈ V, and w_{x,x−1} and w_{x,x+1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim, and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = α·s_i + β·s_j + γ·s_k, where s_i, s_j, and s_k denote the scores assigned by techniques i, j, and k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in section 2.3 to generate normalized pair-wise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism such as a Support Vector Machine can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, one that can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision; the remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                 correct alias          false alias
retrieved        true positives (tp)    false positives (fp)
not retrieved    false negatives (fn)   true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution it is common to construct a contingency table such as the one that can be seen in Table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = correct classifications / total number of classifications = (tp + tn) / (tp + fp + fn + tn)   (2.19)

Although it looks like a good measure of performance, it is not hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as

P = |retrieved aliases ∩ correct aliases| / |retrieved aliases| = tp / (tp + fp)   (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

R = |retrieved aliases ∩ correct aliases| / |total correct aliases| = tp / (tp + fn)   (2.21)

These two measures are not as dependent on the class distribution as the accuracy measure; therefore they are a more sensible choice in this situation. Moreover, by having these two measures of performance it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user that wants to automate the complete process, and to be able to rely greatly on the classification given by the system, will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = 1 / ( α·(1/P) + (1 − α)·(1/R) )   (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as

F1 = (2 · precision · recall) / (precision + recall)   (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive: since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable because it will be closer to the minimum of precision and recall than the arithmetic mean when the two values differ greatly [46].

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46] and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis in order to get a good view of effectiveness on the smaller classes.
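The measures above, and the macro-averaging over test problems, translate directly into code; the following minimal sketch assumes each problem is summarized by its (tp, fp, fn) counts.

```python
# A sketch of equations (2.20)-(2.23) plus macro-averaging.
def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Macro-averaging: compute P and R per problem, then take the arithmetic mean.
def macro_average(problems):                     # problems: list of (tp, fp, fn)
    scores = [precision_recall_f1(*c) for c in problems]
    p = sum(s[0] for s in scores) / len(scores)
    r = sum(s[1] for s in scores) / len(scores)
    return p, r
```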

2.6 Conclusion

In this chapter several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques, and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the most used and most reliable classifiers for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of over-fitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure in order to aid the comparison of the different techniques to each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in Figure 3.1. Since the sender, receiver, and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


[Example record — Message id: 293; From: monikacausholli@enron.com; To: fleming@enron.com, rhondafleming@txdps.state.tx.us; Subject: SSN requirement; Sent date: 12/12/2000 at 16:08; Body: "Thank you very much. We will give it a try."; Attachment: false]

Figure 3.1: An example of a single email message in the ENRON data set, and the information that is extracted from it.

The most well-known version of the data set was made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed; the same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages, and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252758 emails by 151 different employees.

The corpus that was made available by Shetty and Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step   Records affected   Percentage removed (cum.)
1      17052              6.70
3      13681              12.00
4      26223              22.50
5      4001               24.00
6      25990              34.00
7      3700               35.80
8      52163              56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date, and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.
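As an illustration of how steps 4 and 6 could be implemented once the messages have been loaded from the database, consider the following sketch; the dictionary field names are illustrative assumptions about the record layout.

```python
# A sketch of preprocessing steps 4 (short messages) and 6 (exact duplicates).
def preprocess(messages, min_words=10):
    seen, kept = set(), []
    for m in messages:
        if len(m["body"].split()) <= min_words:       # step 4: <= 10 words
            continue
        key = (m["sender"], m["receiver"], m["body"], m["send_date"], m["subject"])
        if key in seen:                               # step 6: keep one copy
            continue
        seen.add(key)
        kept.append(m)
    return kept
```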

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances that are needed for classification using Support Vector Machines. The results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consisted of 44912 emails by 246 different senders. For each message the sender, receiver, subject, body, and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors.

[Plot: 10-fold cross-validation accuracy (y-axis) against the number of training instances per class (x-axis), for the linear and RBF kernels]

Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels for the authorship SVM.

Figure 3.4 provides an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10000 and 100000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it reflects the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contains any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set, and their emails were split up over several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


[Histogram: number of emails (x-axis) against number of authors (y-axis)]

Figure 3.3: The distribution of email messages per author.

[Histogram: total number of words (x-axis, logarithmic scale) against number of authors (y-axis)]

Figure 3.4: The distribution of the total number of words per author.


[Network graph of the senders in the data set; node color indicates the degree of the node]

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of Alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No Alias                             193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed   Hard
High Jaro-Winkler    6       2
Low Jaro-Winkler     8       16
No alias             6       2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. johndoe@enron.comA & johndoe@enron.comB).

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah).

• Authors without an alias.

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.
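A minimal sketch of how such artificial aliases can be generated is given below: every message from or to a selected author is randomly reassigned to one of that author's aliases. The alias-naming scheme and the field names are illustrative assumptions, not the exact procedure used to build the data set.

```python
import random

# Split one prolific author into n artificial aliases (illustrative sketch).
def split_into_aliases(author, messages, n_aliases=2, seed=0):
    rng = random.Random(seed)
    aliases = [f"{author}.{i}" for i in range(n_aliases)]
    for m in messages:
        target = rng.choice(aliases)                 # one alias per message
        if m["sender"] == author:
            m["sender"] = target
        m["receivers"] = [target if r == author else r for r in m["receivers"]]
    return aliases
```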

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores are in the range [0, 1], the score for a particular author-author pair was normalized as follows:

ConnectedPath_norm(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max   (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences between individual email messages. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Feiguina and Hirst [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic, and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4; the list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function (RBF) kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters: ~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |
54          Total number of words (M)
55          Total number of short words / M (less than four characters)
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation: , . ? ! : ; ' "
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.

37

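A sketch of this grid search is given below, using scikit-learn for illustration (the thesis itself used SVM.NET [35]); a single 5-fold cross-validation is shown where the thesis uses 5 × 5-fold, and the variable names are illustrative.

```python
# A sketch of the grid search over C and gamma for the RBF-kernel SVMs.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [2.0 ** e for e in range(-5, 16, 2)],       # 2^-5, 2^-3, ..., 2^15
    "gamma": [2.0 ** e for e in range(-15, 4, 2)],   # 2^-15, 2^-13, ..., 2^3
}
# X: per-email feature vectors from Table 3.4; y: 1 for the target author, 0 otherwise.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X, y)
# The best-scoring parameters are then used to train the actual model:
# model = SVC(kernel="rbf", **search.best_params_).fit(X, y)
```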

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used; therefore a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.
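The one-versus-all scheme, including the balanced negative sampling described in the next paragraph, can be sketched as follows, again using scikit-learn for illustration; features_by_author is assumed to map each author to a matrix of per-email feature vectors (Table 3.4).

```python
# A sketch of one-versus-all authorship classification: one RBF SVM per author.
import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(features_by_author, rng=np.random.default_rng(0)):
    models = {}
    for author, X_pos in features_by_author.items():
        X_neg = np.vstack([x for a, xs in features_by_author.items()
                           if a != author for x in xs])
        idx = rng.choice(len(X_neg), size=len(X_pos), replace=False)  # balance classes
        X = np.vstack([X_pos, X_neg[idx]])
        y = np.array([1] * len(X_pos) + [0] * len(X_pos))
        models[author] = SVC(kernel="rbf", probability=True).fit(X, y)
    return models

# At test time, every per-author SVM assigns a probability to a new email x:
# scores = {a: m.predict_proba(x.reshape(1, -1))[0, 1] for a, m in models.items()}
```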

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from those authors: for each author, all of the author's emails are selected as positive examples, and an equal amount of emails from other authors is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LIBSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19], and is able to handle classification, regression, and distribution estimation for single and multi-class problems using different kernels and parameters.

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains performs better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3), and authorship SVM on email content.

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network, and authorship SVM on email content.

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

Figure 3.6: The structure of the combined approach.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used for testing are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.
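A minimal sketch of the voting step follows: the three normalized scores for one candidate pair form the input vector, and a trained SVM votes on whether the pair is an alias. The training rows and the threshold are illustrative toy values, not the labeled data described above.

```python
# A sketch of the voting SVM over [Jaro-Winkler, Jaccard, authorship-SVM] scores.
import numpy as np
from sklearn.svm import SVC

X_train = np.array([[0.95, 0.60, 0.85], [0.35, 0.55, 0.80],   # alias pairs
                    [0.90, 0.40, 0.75], [0.30, 0.65, 0.90],
                    [0.40, 0.05, 0.30], [0.85, 0.02, 0.20],   # non-alias pairs
                    [0.20, 0.10, 0.45], [0.50, 0.08, 0.15]])
y_train = np.array([1, 1, 1, 1, 0, 0, 0, 0])
voter = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

candidate = np.array([[0.55, 0.45, 0.70]])
score = voter.predict_proba(candidate)[0, 1]       # probability of being an alias
print("alias" if score >= 0.78 else "no alias")    # threshold swept in Chapter 4
```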


Chapter 4

Results

The results that have been obtained from the various experiments are discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set are given. Second, the results of the individual and combined techniques on the hard test set are given. Finally, an overview of the best results achieved by each individual and combined technique is given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard, and authorship SVM respectively. The results are based on the decision thresholds 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard, and authorship SVM respectively. Again, the results are based on the decision thresholds 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.


Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes more complicated link connections into account. Second, because of time constraints the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2; it is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes direct neighbors into account, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of section 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. Intelligent Systems, IEEE, 20(5):67–75.
[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).
[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.
[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.
[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.
[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.
[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.
[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.
[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.
[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.
[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.
[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.
[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.
[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.
[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.
[16] de Vel, O. (2000). Mining e-mail authorship. Proc. Workshop on Text Mining, ACM International.
[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.
[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.
[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.
[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.
[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.
[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.
[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.
[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.
[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611-es, Morristown, NJ, USA. Association for Computational Linguistics.
[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.
[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.
[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.
[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.
[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.
[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.
[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.
[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.
[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.
[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.
[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.
[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.
[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.
[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.
[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.
[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.
[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.
[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.
[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.
[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING '08, pages 513–520. Association for Computational Linguistics.
[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.
[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science (New York, NY), 9(214S):237–46.
[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.
[49] Miller (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.
[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.
[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.
[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.
[53] Odell, M. and Russell, R. (1918). The soundex coding system. US Patent 1,261,167.
[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.
[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.
[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.
[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.
[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.
[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.
[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.
[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.
[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.
[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.
[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.
[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.
[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.
[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.
[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.
[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.
[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology Proceedings, 3689:174–189.
[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



technique was able to attribute a 500-word snippet to one of 1,000 authors with a coverage of 42% and a precision of 93%. It is interesting to note that the method is especially keen to prevent false positive results when the author is not in the candidate set. Reasonable results can be achieved for as little as 100 words per text.

2.3 Link analysis

The third approach to authorship disambiguation and alias resolution originates from a different but related field, namely Link Analysis. In essence, every problem that features a set of entities with pairwise relationships between them can be modeled as a link network. Since this thesis focuses specifically on alias resolution applied to email collections, which can easily be modeled as a graph problem, it is useful to consider the techniques that have been developed in this field.

A basic link network consists of two sets: a set of vertices V and a set of edges W. Each author in the data set is mapped to a single vertex in V, and each message is mapped to a single edge in W. Let v_i, v_j ∈ V; then an edge e_{v_i,v_j} ∈ W if a message has been sent from author v_i to author v_j. If there exists an edge e_{v_i,v_j} ∈ W, then v_i and v_j are considered to be neighbors. The neighborhood N(v_i) is the set of all neighbors of the vertex v_i. Numerous measures have been developed to calculate the similarity between two vertices based on their connections within a link network. The most important measures are discussed in the following section.

2.3.1 Techniques

Co-citation, or bibliographic coupling, occurs when two scientific documents share one or more bibliographical references. If papers A and B are both cited by a third paper C, it is possible that A and B are somehow related. The co-citation frequency [61] is simply the frequency with which other papers cite A and B together, and can be expressed as follows:

Co-citation(v_i, v_j) = |N(v_i) ∩ N(v_j)|    (2.13)

In Graph Theory this is often known as the shared neighbor frequency. The idea of co-citation can be extended to a network of email addresses: when many authors have written emails to both author A and author B, this indicates that there is probably a strong relationship between A and B. Reuther and Walter [55] define co-citation in terms of link analysis and apply it to detect duplicate authors in a publication database. They reformulate co-citation as a Connected Triple: a subgraph of V with three vertices v_i, v_j, v_k and two edges w_{ik}, w_{jk}, such that v_i and v_j are connected via the third vertex v_k. Figure 2.6 provides an example of a trivial network where email addresses v_i and v_j are considered to be aliases because they are connected by three different Connected Triples (red, yellow and blue).


Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

Jaccard(v_i, v_j) = |N(v_i) ∩ N(v_j)| / |N(v_i) ∪ N(v_j)|    (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as good as random in predicting where new links will form. However, it sometimes performs only on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.
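As a concrete illustration, a minimal sketch of equation 2.14 in Python follows; the neighbor sets and addresses are made up.

# A small sketch of Jaccard similarity over link-network neighborhoods
# (eq. 2.14); the network dict and addresses are illustrative assumptions.
def jaccard(n_i, n_j):
    """|N(vi) ∩ N(vj)| / |N(vi) ∪ N(vj)|, defined as 0 for two isolated nodes."""
    union = n_i | n_j
    return len(n_i & n_j) / len(union) if union else 0.0

neighbors = {
    "alice@enron.com":   {"bob@enron.com", "carol@enron.com", "dave@enron.com"},
    "a.smith@enron.com": {"bob@enron.com", "carol@enron.com"},
}
print(jaccard(neighbors["alice@enron.com"], neighbors["a.smith@enron.com"]))  # 2/3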

SimRank [33] is an iterative extension of co-citation frequency that can be applied to determine the similarity between any two pairs of objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)| respectively; an individual neighbor is denoted as I_x(v) or O_x(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

SimRank(v_i, v_j) = C / (|I(v_i)| |I(v_j)|) · Σ_{x=1}^{|I(v_i)|} Σ_{y=1}^{|I(v_j)|} SimRank(I_x(v_i), I_y(v_j))    (2.15)

where C is a constant between 0 and 1. In practice the equation can be solved by iterating to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
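A compact fixed-point iteration for equation 2.15 might look as follows; the graph encoding (a dict from vertex to its set of in-going neighbors) is an assumption for illustration.

# A fixed-point iteration for SimRank (eq. 2.15), assuming the graph is
# given as a dict mapping each vertex to its set of in-going neighbors.
def simrank(in_neighbors, C=0.8, iterations=10):
    nodes = list(in_neighbors)
    # Initialization: SimRank(vi, vj) = 1 if vi = vj, and 0 otherwise.
    sim = {u: {v: float(u == v) for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            new[u] = {}
            for v in nodes:
                I_u, I_v = in_neighbors[u], in_neighbors[v]
                if u == v:
                    new[u][v] = 1.0
                elif I_u and I_v:
                    total = sum(sim[x][y] for x in I_u for y in I_v)
                    new[u][v] = C * total / (len(I_u) * len(I_v))
                else:
                    new[u][v] = 0.0
        sim = new
    return sim

graph = {"a": {"c"}, "b": {"c"}, "c": set()}  # a and b share the in-neighbor c
print(simrank(graph)["a"]["b"])               # 0.8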

PageSim [42] is another extension of the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

ConnectedPath(v_i, v_j) = Σ_{p ∈ PATH(v_i, v_j, r)} U(p) / length(p)    (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length at most r, and U(p) is the uniqueness of a particular path p ∈ PATH, which is calculated as follows:

U(p) = Σ_{v_x ∈ path(v_i, v_j), v_x ∉ {v_i, v_j}} UQ(v_x)    (2.17)

Here UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = (|w_{x,x-1}| + |w_{x,x+1}|) / Σ_{∀ v_g ∈ V} |w_{x,g}|    (2.18)

where w_{x,g} denotes an edge between v_x ∈ path(v_i, v_j) and any other vertex v_g ∈ V, and w_{x,x+1} and w_{x,x-1} denote the edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets Connected Path is able to find the most aliases.
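The following sketch implements equations 2.16 to 2.18 on a small undirected network; the representation of edge weights as message counts, and the handling of the depth bound r, are illustrative assumptions rather than the exact procedure of [6].

# An illustrative implementation of Connected Path (eqs. 2.16-2.18). The
# graph maps each vertex to {neighbor: edge weight |w|}; the representation
# details are assumptions, not the thesis's actual data structures.
def uq(graph, path, i):
    """UQ(vx) for the intermediate vertex at position i of the path (eq. 2.18)."""
    x = path[i]
    local = graph[x][path[i - 1]] + graph[x][path[i + 1]]
    return local / sum(graph[x].values())

def connected_path(graph, vi, vj, r=3):
    """Sum U(p)/length(p) over all simple paths from vi to vj of length <= r."""
    score, stack = 0.0, [[vi]]
    while stack:
        path = stack.pop()
        if path[-1] == vj and len(path) > 2:
            u = sum(uq(graph, path, i) for i in range(1, len(path) - 1))  # eq. 2.17
            score += u / (len(path) - 1)  # path length = number of edges
        elif path[-1] != vj and len(path) <= r:
            stack.extend(path + [n] for n in graph[path[-1]] if n not in path)
    return score

g = {"vi": {"a": 2, "b": 1}, "a": {"vi": 2, "vj": 3},
     "b": {"vi": 1, "vj": 1}, "vj": {"a": 3, "b": 1}}
print(connected_path(g, "vi", "vj"))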


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results of different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = α·s_i + β·s_j + γ·s_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j and k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β and γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm described in section 2.3 to generate normalized pair-wise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism such as a Support Vector Machine can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach they could assign a class in 31.3% of all cases, and this classification was correct in 88.2% of those cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine the results of four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set was manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, one that can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach the least complex method is used first, to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision; the remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                  correct alias          false alias
retrieved         true positives (tp)    false positives (fp)
not retrieved     false negatives (fn)   true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution it is common to construct a contingency table such as the one that can be seen in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = correct classifications / total number of classifications = (tp + tn) / (tp + fp + fn + tn)    (2.19)

Although it looks like a good measure of performance, it is not that hard to obtain a high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as

P = |retrieved aliases ∩ correct aliases| / |retrieved aliases| = tp / (tp + fp)    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

R = |retrieved aliases ∩ correct aliases| / |total correct aliases| = tp / (tp + fn)    (2.21)

These two measures are not as dependent on the class distributions as the accuracy measure, and are therefore a more sensible choice in this situation. Moreover, by having these two measures of performance it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user who wants to automate the complete process, and to rely greatly on the classification given by the system, will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = 1 / (α · (1/P) + (1 - α) · (1/R))    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as

F1 = (2 · precision · recall) / (precision + recall)    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive: since all the correct aliases are then retrieved, the recall will be 100% and the arithmetic mean will be at least 50%. The harmonic mean is more suitable because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].
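Written as code, equations 2.20 to 2.23 amount to the following (the contingency counts are made up):

# The evaluation measures of eqs. 2.20-2.23, using counts from a contingency
# table like table 2.2; the example counts below are illustrative only.
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p, r, alpha=0.5):
    """Weighted harmonic mean of P and R; alpha = 0.5 gives the F1-measure."""
    if p == 0.0 or r == 0.0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

p, r = precision(tp=14, fp=4), recall(tp=14, fn=6)
print(f_measure(p, r))  # identical to 2*p*r/(p+r) for alpha = 0.5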

Averaging the precision and recall scores of different test runs can be done in two ways. One is micro-averaging, where a contingency table for all the problems together is constructed and a global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46] and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis, in order to get a good view of the effectiveness on the smaller classes.

2.6 Conclusion

In this chapter several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the ones most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of overfitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure, in order to aid the comparison of the different techniques.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. The chapter starts with an introduction of the corpus and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large, real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply parts; for the text stored in attachments this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000 at 16:08
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set and the information that is extracted from it.

The most well known version of the data set was made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and were removed; the same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "noaddress@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty and Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps were applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step | Records affected | Percentage removed (cum.)
1    | 17,052           | 6.70
3    | 13,681           | 12.00
4    | 26,223           | 22.50
5    | 4,001            | 24.00
6    | 25,990           | 34.00
7    | 3,700            | 35.80
8    | 52,163           | 56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply parts of messages were removed.

3. Empty messages resulting from the removal of forward or reply parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of ≤ 100 words were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number of records that were removed per step, as well as the cumulative percentage. A sketch of two of these steps is shown below.
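As an illustration, preprocessing steps 4 and 6 could be expressed as follows with pandas; the thesis performs these steps on a SQL database, so the small data frame and column names below are stand-ins.

# A sketch of preprocessing steps 4 and 6 with pandas; the frame is a toy
# stand-in for the message table, and the column names are assumptions.
import pandas as pd

emails = pd.DataFrame([
    {"sender": "a@enron.com", "receiver": "b@enron.com", "subject": "re: gas",
     "body": "Thank you very much we will give it a try later this week ok",
     "sent_date": "2000-12-12"},
    {"sender": "a@enron.com", "receiver": "b@enron.com", "subject": "re: gas",
     "body": "Thank you very much we will give it a try later this week ok",
     "sent_date": "2000-12-12"},
    {"sender": "c@enron.com", "receiver": "d@enron.com", "subject": "hi",
     "body": "short note", "sent_date": "2001-01-05"},
])

# Step 4: drop messages containing ten words or fewer.
emails = emails[emails["body"].str.split().str.len() > 10]

# Step 6: identical sender, receiver, body, send date and subject means
# duplicate; keep a single copy.
emails = emails.drop_duplicates(
    subset=["sender", "receiver", "body", "sent_date", "subject"])
print(len(emails))  # 1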

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent ≤ 80 emails in total were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set contains 44,912 messages by 246 different senders. For each message the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides an overview of the total number of words per author: the x-axis represents the total number of words that one author has written, whereas the y-axis again represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Figure 3.2: Averages of 10 times 10-fold cross-validation accuracy using different training set sizes and kernels (linear and RBF) for the authorship SVM.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it reflects the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contains any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


Figure 3.3: The distribution of email messages per author.

Figure 3.4: The distribution of the total number of words per author.


Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of alias                    | Number of authors
High Jaro-Winkler with 1 alias   | 26
High Jaro-Winkler with 2 aliases | 15
Low Jaro-Winkler with 1 alias    | 11
Low Jaro-Winkler with 2 aliases  | 1
No alias                         | 193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set          | Mixed | Hard
High Jaro-Winkler | 6     | 2
Low Jaro-Winkler  | 8     | 16
No alias          | 6     | 2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. johndoe@enron.comA and johndoe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden and abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in table 3.2. The number of authors, including aliases, in the final data set equaled 315. A sketch of the alias-splitting procedure follows.
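The splitting procedure can be sketched as follows; the alias naming scheme and the message identifiers are illustrative assumptions.

# A sketch of the artificial alias creation: each message from or to a
# selected author is randomly reassigned to one of that author's alias
# identities. Naming scheme and message ids are illustrative assumptions.
import random

def split_into_aliases(author, message_ids, n_aliases, seed=42):
    rng = random.Random(seed)
    aliases = [author + suffix for suffix in ["A", "B", "C"][:n_aliases]]
    return {msg_id: rng.choice(aliases) for msg_id in message_ids}

# An author whose identity is split into two aliases.
print(split_into_aliases("johndoe@enron.com", range(6), n_aliases=2))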

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW"; a sketch is given below.
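A minimal sketch, assuming the jellyfish library as the Jaro-Winkler implementation (any implementation of the metric would do):

# A sketch of the Jaro-Winkler technique; jellyfish is an assumed dependency,
# and the addresses and threshold below are illustrative.
import jellyfish

def jw_candidates(author, others, threshold):
    """Return every author whose address is similar enough to be an alias."""
    return [(o, jellyfish.jaro_winkler_similarity(author, o))
            for o in others
            if jellyfish.jaro_winkler_similarity(author, o) >= threshold]

candidates = ["johndoe@enron.comB", "janetteel@enron.com"]
print(jw_candidates("johndoe@enron.comA", candidates, threshold=0.94))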

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores are in the range [0, 1], the score for a particular author-author pair was normalized as follows:

ConnectedPath_norm(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that were removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhoods of their correspondents, and thus do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features, to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters (e.g. ~ $ ^ & - _ = + > < [ ] |)
54          Total number of words (M)
55          Total number of short words (less than four characters) / M
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation characters (e.g. ' ")
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM.


5 × 5-fold cross-validation for each authorship SVM. The highest scoring combination of parameters is then chosen to train the actual SVM model.
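The grid search itself can be sketched as follows. scikit-learn is used here purely as a stand-in for the SVM.NET implementation that was actually used, the data is random placeholder data, and a single 5-fold cross-validation replaces the 5 × 5-fold procedure described above.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.rand(160, 492)   # placeholder: 492-dimensional feature vectors
y = np.array([0, 1] * 80)      # placeholder: positive/negative instances

param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 2)],      # 2^-5, 2^-3, ..., 2^15
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],  # 2^-15, 2^-13, ..., 2^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # combination used to train the actual SVM model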

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.
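A sketch of the one-versus-all scheme, under the same caveats as before (scikit-learn instead of SVM.NET, invented data):

import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(X, y, authors):
    # One binary SVM per author: positives are that author's texts,
    # negatives are texts from all the other authors.
    models = {}
    for author in authors:
        labels = (y == author).astype(int)
        models[author] = SVC(kernel="rbf", probability=True).fit(X, labels)
    return models

def attribute(models, x):
    # Probability, per author, that the text represented by x was written
    # by that author.
    return {a: m.predict_proba([x])[0][1] for a, m in models.items()}

# Invented feature vectors for two authors.
X = np.vstack([np.random.rand(10, 4) + 0.5, np.random.rand(10, 4)])
y = np.array(["alice"] * 10 + ["bob"] * 10)
models = train_one_vs_all(X, y, ["alice", "bob"])
print(attribute(models, X[0]))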

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19], and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.
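The balanced training-set construction can be sketched as follows; emails_by_author is a hypothetical mapping from each author to that author's email bodies.

import random

def balanced_training_set(emails_by_author, author, seed=0):
    # All of the author's emails as positives; an equal number of emails
    # sampled at random from all other authors as negatives.
    rng = random.Random(seed)
    positives = list(emails_by_author[author])
    others = [e for a, msgs in emails_by_author.items()
              if a != author for e in msgs]
    negatives = rng.sample(others, len(positives))  # assumes enough negatives
    return positives, negatives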

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the


Figure 3.6: The structure of the combined approach.

results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in figure 3.6.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
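A sketch of the voting SVM, under the same caveats as before (scikit-learn stand-in, invented numbers): each training instance is the vector of the three normalized technique scores for one candidate alias.

import numpy as np
from sklearn.svm import SVC

# Columns: Jaro-Winkler, link-network similarity, authorship SVM score.
# The scores and labels below are invented for illustration.
X_train = np.array([
    [0.96, 0.40, 0.81],   # labeled a true alias
    [0.31, 0.05, 0.22],   # labeled negative
    [0.45, 0.62, 0.77],   # true alias found mainly by links and content
    [0.50, 0.10, 0.30],   # labeled negative
    [0.92, 0.55, 0.60],   # true alias
    [0.40, 0.20, 0.15],   # labeled negative
])
y_train = np.array([1, 0, 1, 0, 1, 0])

voter = SVC(kernel="rbf").fit(X_train, y_train)
candidate = np.array([[0.55, 0.48, 0.70]])
print(voter.predict(candidate))            # 1 = alias, 0 = no alias
print(voter.decision_function(candidate))  # distance to decision boundary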

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets of table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at a decision threshold of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on a decision threshold ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results, on both the mixed and the hard test set, are achieved by JW-Jaccard-SVM.


[Figure 4.1 consists of four panels, (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard and (d) Authorship SVM, each plotting precision, recall and F1 (y-axis, 0 to 1.2) against the decision threshold (x-axis, 0 to 1).]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set.

[Figure 4.2 consists of two panels, (a) JW-CP-SVM and (b) JW-Jaccard-SVM, each plotting precision, recall and F1 (y-axis, 0 to 1.2) against the decision threshold (x-axis, 0 to 1).]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set.


[Figure 4.3 consists of four panels, (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard and (d) Authorship SVM, each plotting precision, recall and F1 (y-axis, 0 to 1.2) against the decision threshold (x-axis, 0 to 1).]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set.

[Figure 4.4 consists of two panels, (a) JW-CP-SVM and (b) JW-Jaccard-SVM, each plotting precision, recall and F1 (y-axis, 0 to 1.2) against the decision threshold (x-axis, 0 to 1).]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set.


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that using a simple string metric can detect many aliases resulting from spelling errors, or from the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets for three reasons. First, since authors have been split up into aliases, and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search were performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If there was no Connected Path score returned for a particular author-alias pair, the alias had to be counted as a false negative,


thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. Because of the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary, and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It would be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.
[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).
[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.
[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.
[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.
[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.
[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.
[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.
[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.
[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.
[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.
[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron
[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.
[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.
[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.
[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.
[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.
[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.
[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.
[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.
[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp
[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.
[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.
[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.
[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611-es, Morristown, NJ, USA. Association for Computational Linguistics.
[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.
[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.
[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.
[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.
[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.
[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.
[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.
[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.
[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.
[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html
[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.
[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.
[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.
[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.
[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.
[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.
[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.
[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.
[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.
[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING '08, pages 513–520. Association for Computational Linguistics.
[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.
[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–46.
[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.
[49] Miller, G. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.
[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.
[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.
[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.
[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.
[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.
[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.
[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.
[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.
[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.
[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.
[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.
[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.
[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.
[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.
[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.
[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.
[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.
[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.
[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.
[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.
[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.
[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science, 57(3):378–393.


Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



Figure 2.6: An example of two vertices v_i and v_j being connected by three Connected Triples (each in a different color).

Jaccard similarity is another link analysis metric that can be used to determine the similarity between the neighborhoods of two vertices. The similarity between the neighborhoods of v_i and v_j is defined as follows:

Jaccard(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}    (2.14)

where N(v_i) again designates the set of neighbors of v_i. Liben-Nowell and Kleinberg [41] use Jaccard similarity to predict between which authors in a co-authorship network new links will form. The prediction of new links is based on the assumption that similar authors who have not yet worked together might do so in the future. On five different data sets, Jaccard performs between 16 and 42 times as well as random in predicting where new links will form. However, it sometimes performs only on par with other rather simple methods, such as the number of common neighbors or the negated length of the shortest path between vertices.

SimRank [33] is an iterative extension to co-citation frequency that can be applied to determine the similarity between any two objects in a graph. SimRank uses the notion of in-going and out-going links in a directed graph to calculate similarity. Let I(v) be the set of in-going neighbors of vertex v, and O(v) the set of out-going neighbors of v. The in-degree and out-degree represent the number of in-going and out-going neighbors, and are denoted by |I(v)| and |O(v)|, respectively. An individual neighbor is denoted as I_x(v) or O_x(v). The similarity between vertices v_i and v_j can be calculated using the following recursive equation:

SimRank(v_i, v_j) = \frac{C}{|I(v_i)||I(v_j)|} \sum_{x=1}^{|I(v_i)|} \sum_{y=1}^{|I(v_j)|} SimRank(I_x(v_i), I_y(v_j))    (2.15)

where C is a constant between 0 and 1. In practice, the equation can be solved by iteration to a fixed point, letting SimRank(v_i, v_j) = 1 if v_i = v_j and 0 otherwise. SimRank performs significantly better than the co-citation algorithm in a range

22

of experiments conducted by Jeh [33], although Lin et al. [43] report that Jaccard outperforms SimRank in their experiments on two different data sets.
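As an illustration of the fixed-point iteration, the following is a small self-contained sketch; the graph, the constant C = 0.8 and the iteration count are invented for illustration.

def simrank(in_nbrs, C=0.8, iters=10):
    # in_nbrs maps each vertex to the list of its in-going neighbors.
    nodes = list(in_nbrs)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif in_nbrs[a] and in_nbrs[b]:
                    total = sum(sim[(x, y)]
                                for x in in_nbrs[a] for y in in_nbrs[b])
                    new[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
                else:
                    new[(a, b)] = 0.0  # no in-neighbors: similarity 0
        sim = new
    return sim

# Toy directed graph: v and w share the in-neighbor u.
in_nbrs = {"u": [], "v": ["u"], "w": ["u"], "x": ["v", "w"]}
print(simrank(in_nbrs)[("v", "w")])  # 0.8 = C * SimRank(u, u)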

PageSim [42] is another extension to the co-citation algorithm, which assigns a weight to each link depending on its relative importance in the graph. The importance of a node is calculated using the well-known PageRank algorithm developed by Page et al. [54]. By propagating the PageRank score of a vertex to its neighbors, the similarity between two nodes can be calculated. In a comparison by Boongoen and Shen [5] on three different data sets, PageSim performs better than Connected Triples and SimRank on two out of three data sets. On one data set, Connected Triples finds more aliases than PageSim, whereas SimRank performs poorly in general.

The Connected Path algorithm is a technique developed by Boongoen et al. [6] that takes into account not only information from direct neighbors, but also from neighbors in the 2nd, 3rd, ..., nth degree. The two important notions are that (1) the more unique a vertex on a path between two nodes is, the stronger it indicates a possible similarity between these two nodes, and (2) the longer the path between two nodes, the less informative the connection. The similarity between two vertices v_i and v_j is calculated using

ConnectedPath(v_i, v_j) = \sum_{p \in PATH(v_i, v_j, r)} \frac{U(p)}{length(p)}    (2.16)

where PATH(v_i, v_j, r) is the collection of all paths between v_i and v_j of length r. U(p) is the uniqueness of a particular path p \in PATH, which is calculated as follows:

U(p) = \sum_{v_x \in path(v_i, v_j), v_x \notin \{v_i, v_j\}} UQ(v_x)    (2.17)

UQ(v_x) denotes the uniqueness of a single vertex v_x in the path p. It is an indication of how informative that vertex is in a particular path, and it can be calculated as follows:

UQ(v_x) = \frac{|w_{x,x-1}| + |w_{x,x+1}|}{\sum_{\forall v_g \in V} |w_{x,g}|}    (2.18)

where w_{x,g} denotes an edge between v_x \in path(v_i, v_j) and any other vertex v_g \in V, and w_{x,x+1} and w_{x,x-1} denote edges from v_x to its adjacent vertices in the path. Figure 2.7 provides an example of vertices v_i and v_j having a strong similarity, since they are connected via different paths. The figure also shows that longer paths are less informative than shorter paths. Boongoen et al. [6] use three different data sets, two of which have also been used in Boongoen and Shen [5], to compare Connected Path to Jaccard similarity, Connected Triples, SimRank, PageSim and Jaro-Winkler. On all three data sets, Connected Path is able to find the most aliases.
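The following is a rough sketch of equations 2.16-2.18 for an unweighted graph, using the third-party networkx library to enumerate simple paths up to length r; the graph is invented, and a direct edge contributes nothing, since it has no intermediate vertices.

import networkx as nx

def uniqueness(G, path):
    # U(p): sum of UQ(v) over the intermediate vertices of the path. In an
    # unweighted graph, the two path edges of v weigh 2 against degree(v).
    return sum(2.0 / G.degree(v) for v in path[1:-1])

def connected_path(G, vi, vj, r=3):
    score = 0.0
    for p in nx.all_simple_paths(G, vi, vj, cutoff=r):
        if len(p) > 2:  # paths with at least one intermediate vertex
            score += uniqueness(G, p) / (len(p) - 1)  # length in edges
    return score

G = nx.Graph([("a", "m"), ("m", "b"), ("a", "n"), ("n", "b"), ("m", "c")])
print(connected_path(G, "a", "b"))  # two 2-edge paths: (2/3)/2 + (2/2)/2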


Figure 2.7: An example of three different paths between the vertices v_i and v_j. The most direct path (p_x) is the most informative path. Image courtesy of Boongoen et al. [6].

2.4 Combining Approaches

There exist several ways in which the results from different techniques can be combined to improve the final result.

One of the simplest methods is to create a linear combination of the form f(x) = αs_i + βs_j + γs_k, where s_i, s_j and s_k denote the scores assigned by techniques i, j, k respectively, each normalized such that they fall in the range [0, 1]. The weights α, β, γ determine the relative importance of each of the techniques. For example, Baroni et al. [3] use this approach to successfully combine a string edit distance and semantic information obtained by using the Mutual Information measure. Boongoen et al. [6] use a related approach in combining link analysis results and string metrics. They use the Connected Path algorithm that is described in section 2.3 to generate normalized pair-wise similarity scores for all combinations of authors. After selecting the k highest-scoring pairs, they compute string-matching scores for each of the pairs. They compare the effectiveness of four different aggregation methods, namely ignoring the link analysis score, averaging the two scores, taking the maximum, and taking the minimum of the two scores. Each aggregation of string metrics and link analysis achieved higher recall than Connected Path by itself, but no higher than using string metrics alone. Note that the link analysis was performed to a shallow depth of 2, and that the aggregation methods were rather simple.

Another approach is to create a feature vector consisting of the scores assigned by the three techniques. A weighted voting mechanism such as a Support Vector Machine can then be used to distinguish between combinations of results that indicate real or false aliases. For example, a very high similarity in terms of email addresses might only indicate a real alias if there is also a high similarity in terms of content. A voting SVM can recognize these cases and improve the results of the individual techniques. Koppel et al. [39] have successfully applied this approach to the results of cosine similarity ranking on a data set with thousands of authors. For a given anonymous text, they calculated the cosine similarity between that text and every candidate author, after which they ranked the candidate authors by score. Using 18 features, such as the absolute similarity of the text to the top-ranked author and the difference in similarity between the top-ranked author and the k-ranked author, they trained a separate SVM to decide for which combinations of results they could confidently assign a real author. Using this approach, they could assign a class in 31.3% of all cases, and this classification was correct for 88.2% of such cases. Provided that it is acceptable that the technique returns no classification for many instances, the results are very promising, considering that there are thousands of candidate authors.

Hsiung et al. [28] use a similar approach in order to combine four string metrics and four link analysis metrics. It is one of the only approaches that utilizes information from more than one domain, as has been done in this thesis. Logistic Regression is used to combine the results and determine whether two names in a link data set are aliases. The approach has been tested on three different data sets: one data set is manually extracted and labeled from public web pages, whereas the other two sets consist of hand-labeled spam emails. The combined approach is compared with an approach using only string metrics and an approach using only link analysis. On all three data sets, the approach combining string metrics and link analysis metrics achieves better results than the approaches using either string metrics or link analysis alone.

Another approach to combining the strengths of different techniques, which can be advantageous both in terms of time and complexity, is to use a sifting method. In such an approach, the least complex method is used first to sift out the most obvious aliases, i.e. the ones that can be classified as aliases with high confidence. The ambiguous aliases are then passed to a second technique, and so on. The advantage is that the more complex and time-consuming approaches are only used when no other method can give a definite result. For example, a string distance metric can be used to eliminate the most obvious aliases from the data set with high precision. The remaining aliases can then be detected by using a more complicated technique, e.g. a neural network. Unfortunately, no previous research could be found that employs this approach.


                   correct alias           false alias
retrieved          true positives (tp)     false positives (fp)
not retrieved      false negatives (fn)    true negatives (tn)

Table 2.2: The contingency table that is used for performance measurements.

2.5 Evaluation measures

In order to correctly evaluate and compare different techniques, it is essential to use good evaluation measures. In authorship attribution and alias resolution, it is common to construct a contingency table such as the one that can be seen in table 2.2. Based on this contingency table, several evaluation measures can be derived.

A commonly used measure for evaluating the performance of machine learning systems is accuracy. Accuracy is defined as the percentage of classifications that are correct:

Accuracy = \frac{\text{correct classifications}}{\text{total number of classifications}} = \frac{tp + tn}{tp + fp + fn + tn}    (2.19)

Although it looks like a good measure of performance, it is not that hard to obtain high accuracy in an authorship attribution or alias resolution system. Since the classes are highly skewed towards the negative class, a classifier can attain high accuracy simply by classifying all the examples as negatives. In most of the cases this will be correct, since most candidate authors are in fact not aliases of the author under investigation.

Therefore, three other commonly used measures have been adopted in this thesis. Precision (P) measures the proportion of retrieved aliases that are actually correct. This can be defined as

P = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{retrieved aliases}|} = \frac{tp}{tp + fp}    (2.20)

Recall (R) measures the proportion of correct aliases that have been retrieved. This can be defined as

R = \frac{|\text{retrieved aliases} \cap \text{correct aliases}|}{|\text{total correct aliases}|} = \frac{tp}{tp + fn}    (2.21)

These two measures are not as dependent on the class distributions as the accuracy measure. Therefore, they are a more sensible choice to use in this situation. Moreover, by having these two measures of performance, it is possible to trade off one for the other, since precision and recall are highly interdependent. For example, a user might prefer to retrieve more potential aliases with a lower precision if he is going to evaluate them manually anyway. On the other hand, a user that wants to automate the complete process, and be able to rely greatly on the classification given by the system, will favor precision over recall. Since the preference for precision or recall is highly dependent on the user's preferences, a single measure is often used for evaluating systems. The F-measure is the weighted harmonic mean of precision and recall, defined as

F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}    (2.22)

Often the importance of precision and recall is balanced by choosing α = 0.5. This results in the so-called F1-measure, which can now simply be written as

F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}    (2.23)

The reason why a harmonic mean is taken instead of an arithmetic mean (average) is that it is always possible to achieve an arithmetic mean of at least 50% simply by classifying all instances as positive. Since all the correct aliases are retrieved, the recall will be 100%, and the arithmetic mean will be at least 50%. The harmonic mean is more suitable, because it will be closer to the minimum than to the arithmetic mean of precision and recall when the two values differ greatly [46].

Averaging the precision and recall scores for different test runs can be done in two different ways. One is micro-averaging, where a contingency table for all the problems together is constructed, and global precision and recall are calculated based on this table. The other is macro-averaging, where precision and recall are first calculated for each problem, after which a simple arithmetic mean is taken to determine the global precision and recall. Macro-averaging gives equal weight to each class, whereas micro-averaging gives equal weight to each document. Since micro-averaging tends to favor large classes over small classes [46], and gives more importance to accuracy on authors with many test documents, macro-averaging is used in this thesis, in order to get a good view of the effectiveness on the smaller classes.
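The difference between the two averaging schemes can be seen in a small numeric sketch; the per-problem contingency counts below are invented for illustration.

problems = [
    {"tp": 8, "fp": 2, "fn": 1},   # a large class
    {"tp": 1, "fp": 3, "fn": 2},   # a small class
]

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

# Micro-averaging: pool the contingency tables, then compute one score.
tp = sum(p["tp"] for p in problems)
fp = sum(p["fp"] for p in problems)
micro_p = precision(tp, fp)                       # 9 / 14 = 0.64

# Macro-averaging: score per problem, then take the arithmetic mean.
macro_p = sum(precision(p["tp"], p["fp"]) for p in problems) / len(problems)

print(micro_p, macro_p)  # 0.64 vs. 0.53: micro is pulled up by the large class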

2.6 Conclusion

In this chapter, several approaches to authorship disambiguation and alias resolution have been discussed. In particular, a distinction has been made between string metrics, authorship attribution techniques and techniques from link analysis. Finally, several approaches to combining multiple techniques, as well as different evaluation measures, have been discussed in the last sections.

The techniques that have been chosen for evaluation in the experiments are as follows:

• Christen [11] found that when dealing with surnames, the Jaro similarity metric performed best out of 27 techniques. Cohen and Fienberg [13] evaluated different string metrics on different data sets and found that the Monge-Elkan distance performed best. However, they conclude that the Jaro-Winkler metric performed almost as well as the Monge-Elkan distance, but is an order of magnitude faster. Therefore, the Jaro-Winkler metric has been used in the experiments to follow.

• Considering the authorship attribution techniques, SVMs are the ones most used and most reliable for authorship attribution (see e.g. [34, 16, 1, 71, 45, 2]). Moreover, they are not sensitive to the problem of overfitting and perform automated feature selection. Therefore, SVM has been chosen as the classifier for the authorship attribution approach.

• Concerning the graph analysis techniques, the Connected Path approach has been adopted, since it outperformed other link analysis techniques when applied to several data sets [6]. Moreover, the Jaccard technique has been adopted, since it performed well in the same research, but is considerably less complex in nature.

• The measures precision and recall are used to evaluate the techniques. Moreover, precision and recall values are combined into a single F1-measure, in order to aid the comparison of the different techniques with each other.


Chapter 3

Methods

This chapter explains the different approaches that have been taken to tackle the problems defined in the research questions, as well as the corpus that has been used for evaluation. Design choices that have been made, as well as important implementation details, will be elaborated upon. This chapter starts with an introduction of the corpus that has been used and an explanation of the preprocessing that has been applied to it. The individual techniques that have been implemented are discussed in section 3.2, whereas the different combinations of techniques are dealt with in section 3.3.

3.1 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personal computers of employees of the American energy company ENRON. On December 2nd, 2001, the company went bankrupt on account of one of the largest and most complex cases of accounting fraud in US history. The emails generated by 158 employees were seized by the Federal Energy Regulatory Commission [21] during its investigation, which started in 2002. The ENRON data set is the only large real-life email data set that is available for research. The information that is recorded for a single email message can be seen in figure 3.1. Since the sender, receiver and body of each email message are recorded, the data set is especially well suited for alias resolution techniques that use information from different domains. Although the original data set contains emails as well as attachments, only the emails have been used in this thesis. Many attachments do not contain written text, and for the ones that do, it cannot be verified who the actual author of the text is. For the body of an email message it can be assumed that the sender of the email has written it, except for the forward and reply-parts. Concerning the text that is stored in attachments, this assumption cannot be made. Therefore, the inclusion of these attachments would create too much noise in the data set, especially for the authorship attribution techniques that use the content of the email.


SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08
Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set, and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates, and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "no.address@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty and Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step    Records affected    Percentage removed (cum.)
1       17052               6.70
3       13681               12.00
4       26223               22.50
5       4001                24.00
6       25990               34.00
7       3700                35.80
8       52163               56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus.

2. Forward and reply-parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply-parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step.

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set, the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consisted of 44,912 emails by 246 different senders; for each message, the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides


[Figure 3.2 is a line plot of 10-fold cross-validation accuracy (y-axis, 0.5 to 1.0) against the number of training instances per class (x-axis, 20 to 200), with one curve for the linear kernel and one for the RBF kernel.]

Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels for the authorship SVM.

an overview of the total number of words per author. The x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set. Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it represents the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set, and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:

[Figure 3.3 here: histogram of the number of authors (y-axis, 0-35) against the number of emails sent (x-axis, 90-230).]

Figure 3.3: The distribution of email messages per author.

[Figure 3.4 here: histogram of the number of authors (y-axis, 0-180) against the total number of words written (x-axis, logarithmic, 10,000-100,000,000).]

Figure 3.4: The distribution of the total number of words per author.

[Figure 3.5 here: network graph of all senders in the final data set; node labels are truncated email-address prefixes and node color represents node degree.]

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of Alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No alias                             193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set             Mixed   Hard
High Jaro-Winkler    6       2
Low Jaro-Winkler     8       16
No alias             6       2

Table 3.3: Distribution of alias types in the two different test sets.

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. johndoe@enron.comA & johndoe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range of [0, 1], the score for a particular author-author pair was calculated as follows:

ConnectedPath_norm(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in Section 3.1 do not occur in the neighborhood of their correspondents anymore, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.

The last individual technique that has been evaluated is the use of SVMs on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features, to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using


Features   Description

Lexical
1          Total number of characters (C)
2          Total number of alphabetic characters / C
3          Total number of upper-case characters / C
4          Total number of digit characters / C
5          Total number of white-space characters / C
6          Total number of tab spaces / C
7-32       Frequency of letters A-Z
33-53      Frequency of special characters ~ $ ^ & - _ = + > < [ ] |
54         Total number of words (M)
55         Total number of short words (less than four characters) / M
56         Total number of characters in words / C
57         Average word length
58         Average sentence length (in characters)
59         Average sentence length (in words)
60         Total different words / M
61         Hapax legomena (frequency of once-occurring words)
62         Hapax dislegomena (frequency of twice-occurring words)
63-82      Word length frequency distribution / M
83-333     TF-IDF of 250 most frequent 3-grams

Syntactic
334-341    Frequency of punctuation marks , . ? ! : ; ' "
342-491    Frequency of function words

Structural
492        Total number of sentences

Table 3.4: Feature set for the authorship SVM.


5 × 5-fold cross-validation for each authorship SVM. The highest scoring combination of parameters is then chosen to train the actual SVM model.

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion, by Matthew A. Johnson, of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the


Figure 3.6: The structure of the combined approach.

results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this section. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at a decision threshold of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on a decision threshold ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Figure 4.1 here: precision, recall and F1 plotted against the decision threshold (0-1) for (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard and (d) Authorship SVM.]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set.

[Figure 4.2 here: precision, recall and F1 plotted against the decision threshold (0-1) for (a) JW-CP-SVM and (b) JW-Jaccard-SVM.]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set.


[Figure 4.3 here: precision, recall and F1 plotted against the decision threshold (0-1) for (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard and (d) Authorship SVM.]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set.

[Figure 4.4 here: precision, recall and F1 plotted against the decision threshold (0-1) for (a) JW-CP-SVM and (b) JW-Jaccard-SVM.]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set.


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that using a simple string metric can detect many aliases resulting from spelling errors, or from the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases, and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There is a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval, Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611-es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science (New York, NY), 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The soundex coding system. U.S. Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, I, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 27: Thesis Freek Maes - Final Version

of experiments conducted by Jeh [33] although Lin et al [43] report that Jaccardoutperforms SimRank in their experiments on two different data sets

PageSim [42] is another extension to the co-citation algorithm that assignsa weight to each link depending on its relative importance in the graph Theimportance of a node is calculated using the well-known PageRank-algorithmdeveloped by Page et al [54] By propagation of the PageRank score of avertice to its neighbors the similarity between two nodes can be calculated Ina comparison by Boongoen and Shen [5] on three different data sets PageSimperforms better than Connected Triples and SimRank on two out of three datasets On one data set Connected Triples finds more aliases than PageSimwhereas SimRank performs poorly in general

The Connected Path algorithm is a technique developed by Boongoen et al[6] that takes into account not only information from direct neighbors but alsofrom neighbors in the 2nd 3rd nth degree The two important notions arethat (1) the more unique a vertice on a path between two nodes is the strongerit indicates a possible similarity between these two nodes and (2) the longerthe path between two nodes the less informative the connection The similaritybetween two vertices vi and vj is calculated using

ConnectedPath(vi vj) =sum

pisinPATH(vivj r)

U(p)

length(p)(216)

where PATH(vi vj r) is the collection of all paths between vi and vj of lengthr U(p) is the uniqueness of a particular path p isin PATH which is calculatedas follows

U(p) =sum

vxisinpath(vivj)vx isinvivj

UQ(vx) (217)

UQ(vx) denotes the uniqueness of a single vertice vx in the path p It is anindication of how informative that vertice is in a particular path and it can becalculated as follows

UQ(vx) =|wxxminus1|+ |wxx+1|sumforallvgisinV

|wxg|(218)

where wxg denotes an edge between vx isin path(vi vj) and any other vertexvg isin V and wxx+1 and wxxminus1 denoted edges from vx to is adjacent vertices inthe path Figure 27 provides an example of vertices vj and vj having a strongsimilarity since they are connected via different paths The figure also showsthat longer paths are less informative than shorter paths Boongoen et al [6]use three different data sets two of which have also been used in Boongoen andShen [5] to compare Connected Path to Jaccard similarity Connected TriplesSimRank PageSim and Jaro-Winkler On all three data sets Connected Pathis able to find the most aliases

23

Figure 27 An example of three different paths between the vertices vi andvj The most direct path (px) is the most informative path Image courtesy ofBoongoen et al [6]

24 Combining Approaches

There exist several ways in which the results from different techniques can becombined to improve the final result

One of the simplest methods is to create a linear combination of the formf(x) = αsi + βsj + γsk where si sj and sk denote the scores assigned bytechniques i j k respectively each normalized such that they fall in the range[0 1] The weights α β γ determine the relative importance of each of thetechniques For example Baroni et al [3] use this approach to successfullycombine a string edit distance and semantic information obtained by using theMutual Information measure Boongoen et al [6] use a related approach incombining link analysis results and string metrics They use the ConnectedPath-algorithm that is described in section 23 to generate normalized pair-wisesimilarity scores for all combinations of authors After selecting the k highest-scoring pairs they compute string-matching scores for each of the pairs Theycompare the effectiveness of using four different aggregation methods namelyignoring the link analysis score averaging the two scores taking the maximumand taking the minimum of the two scores Each aggregation of string metricsand link analysis achieved higher recall than Connected Path by itself butno higher than using string metrics alone Note that the link analysis wasperformed to a shallow depth of 2 and that the aggregation methods wererather simple

Another approach is to create a feature vector consisting of the scores as-signed by the three techniques A weighted voting mechanism such as a Support

24

Vector Machine can then be used to distinguish between combinations of resultsthat indicate real or false aliases For example a very high similarity in termsof email addresses might only indicate a real alias if there is also a high similar-ity in terms of content A voting SVM can recognize these cases and improvethe results of the individual techniques Koppel et al [39] have successfullyapplied this approach to the results of cosine similarity ranking on a data setwith thousands of authors For a given anonymous text they calculated the co-sine similarity between that text and every candidate author after which theyranked the candidate authors by score Using 18 features such as the absolutesimilarity of the text to the top-ranked author and the difference in similaritybetween the top-ranked author and the k-ranked author they trained a separateSVM to decide for which combinations of results they could confidently assigna real author Using this approach they could assign a class in 313 of allcases and this classification was correct for 882 of such cases Provided thatit is acceptable that the technique returns no classification for many instancesthe results are very promising considering that there are thousands of candidateauthors

Hsiung et al [28] use a similar approach in order to combine the results ofcombine four string metrics and four link analysis metrics It is one of the onlyapproaches that utilizes information from more than on domain as has been donein this thesis Logistic Regression is used to combine the results and determinewhether two names in a link data set are aliases The approach has been testedon three different data sets One data set is manually extracted and labeledfrom public web pages whereas the other two sets consists of hand labeledspam emails The combined approach is compared with an approach using onlystring metrics and an approach using only link analysis On all three data setsthe approach combining string metrics and link analysis metrics achieves betterresults than the approaches using either string metrics or link analysis

Another approach to combining the strengths of different techniques thatcan be advantageous both in terms of time and complexity is to use a siftingmethod In such an approach the least complex method is used first to sift outthe most obvious aliases ie the ones that can be classified as aliases with highconfidence The ambiguous aliases are then passed to a second technique andso on The advantage is that the more complex and time-consuming approachesare only used when no other method can give a definite results For examplea string distance metric can be used to eliminate the most obvious aliases fromthe data set with high precision The remaining aliases can then be detected byusing a more complicated technique eg a neural network Unfortunately noprevious research could be found that employs this approach

25

correct alias false alias

retrieved true positives (tp) false positives (fp)not retrieved false negatives (fn) true negatives (tn)

Table 22 The contingency table that is used for performance measurements

25 Evaluation measures

In order to correctly evaluate and compare different techniques it is essentialto use good evaluation measures In authorship attribution and alias resolutionit is common to construct a contingency table such as the one that can be seenin table 22 Based on this contingency table several evaluation measures canbe derived

A commonly used measure for evaluating the performance of machine learn-ing systems is accuracy Accuracy is defined as the percentage of classificationsthat are correct

Accuracy =correct classification

total number of classifications=

tp+ tn

tp+ fp+ fn+ tn(219)

Although it looks like a good measure of performance it is not that hard toobtain high accuracy in an authorship attribution or alias resolution systemSince the classes are highly skewed towards the negative class a classifier canattain high accuracy simply by classifying all the examples as negatives Inmost of the cases this will be correct since most candidate authors are in factnot aliases of the author under investigation

Therefore three other commonly used measures have been adopted in thisthesis Precision (P) measures the proportion of retrieved aliases that are ac-tually correct This can also be defined as

P =| retrieved aliases cap correct aliases |

| retrieved aliases |=

tp

tp+ fn(220)

Recall (R) measures the proportion of correct aliases that have been retrievedThis can be defined as

R =| retrieved aliases cap correct aliases |

| total correct aliases |=

tp

tp+ fn(221)

These two measures are not as dependent on the class distributions as theaccuracy measure Therefore they are a more sensible choice to use in thissituation Moreover by having these two measures of performance it is possibleto trade off one for the other since precision and recall are highly interdependentFor example a user might prefer to retrieve more potential aliases with a lowerprecision if he is going to manually evaluated them anyway On the other handa user that wants to automate the complete process and be able to rely greatlyon the classification given by the system will favor precision over recall Since the

26

preference for precision or recall is highly dependent on the userrsquos preferencesa single measure is often used for evaluating systems The F-measure is theweighted harmonic mean of precision and recall defined as

F =1

α 1P + (1minus α) 1

R

(222)

Often the important of precision and recall is balanced by choosing α = 05This results in the so-called F1-measure which can now simply be written as

F1 = 2 middot precision middot recall

precision + recall(223)

The reason why a harmonic mean is taken instead of an arithmetic mean (aver-age) is that it is always possible to achieve an arithmetic mean of at least 50simply by classifying all instances as positive Since all the correct aliases areretrieved the recall will be 100 and the arithmetic mean will be at least 50The harmonic mean is more suitable because it will be closer to the minimumthan to the arithmetic mean of precision and recall when the two values differgreatly [46]

Averaging the precision and recall scores for different test runs can be donein two different ways one is micro-averaging where a contingency table for allthe problems together is constructed and global precision and recall is calculatedbased on this table The other is macro-averaging where precision and recall isfirst calculated for each problem after which a simple arithmetic mean is takento determine the global precision and recall Macro-averaging gives equal weightto each class whereas micro-averaging gives equal weight to each documentSince micro-averaging tends to favor large classes over small classes [46] andgive more importance to accuracy on authors with many test documents macro-averaging is used in this thesis in order to get a good view of effectiveness onthe smaller classes

26 Conclusion

In this chapter several approaches to authorship disambiguation and alias reso-lution have been discussed In particular a distinction has been made betweenstring metrics authorship attribution techniques and techniques from link anal-ysis Finally several approaches to combining multiple techniques as well asdifferent evaluation measures have been discussed in the last sections

The techniques that have been chosen for evaluation in the experiments areas follows

bull Christen2006a found that when dealing with surnames the Jaro similaritymetric performed best out of 27 techniques Cohen and Fienberg [13]evaluated different string metrics on different data sets and found thatthe Monge-Elkan distance performed best However they conclude thatthe Jaro-Winkler metric performed almost as well as the Monge-Elkan

27

distance but is an order of magnitude faster Therefore the Jaro-Winklermetric has been used in the experiments to follow

bull Considering the authorship attribution techniques SVMrsquos are the onesmost used and most reliable for authorship attribution (see eg [34 161 71 45 2] Moreover they are not sensitive to the problem of over-fitting and perform automated feature selection Therefore SVM hasbeen chosen as a classifier for the authorship attribution approach

bull Concerning the graph analysis techniques the Connected Path approachhas been adopted since it outperformed other link analysis techniqueswhen applied to several data sets [6] Moreover the Jaccard techniquehas been adopted since it performed well in the same research but isconsiderably less complex in nature

bull The measures precision and recall are used to evaluate the techniquesMoreover precision and recall values are combined into a single F1-measurein order to aid the comparison of different techniques to each other

28

Chapter 3

Methods

This chapter explains the different approaches that have been taken to tacklethe problems defined in the research questions as well as the corpus that hasbeen used for evaluation Design choices that have been made as well as impor-tant implementation details will be elaborated upon This chapter will startwith an introduction of the corpus that has been used and an explanation ofthe preprocessing that has been applied to it The individual techniques thathave been implemented will be discussed in section 32 whereas the differentcombinations of techniques are dealt with in section 33

31 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personalcomputers of employees of the American energy company ENRON On Decem-ber 2nd the company went bankrupt on account of one of the largest and mostcomplex cases of accounting fraud in US history The emails generated by158 employees were seized by the Federal Energy Regulatory Commission [21]during its investigation which started in 2002 The ENRON data set is the onlylarge real-life email data set that is available for research The informationthat is recorded for a single email message can be seen in figure 31 Since thesender receiver and body of each email message is recorded the data set isespecially well suited for alias resolution techniques that use information fromdifferent domains Although the original data set contains emails as well asattachments only the emails have been used in this thesis Many attachmentsdo not contain written text and for the ones that do it cannot be verified whothe actual author of the text is For the body of the email messages it can beassumed that the sender of the email has written it except for the forward andreply-parts Concerning the text that is stored in attachments this assumptioncannot be made Therefore the inclusion of these attachment would create toomuch noise in the data set especially for the authorship attribution techniquesthat use the content of the email

29

SSN requirement

From monikacaushollienroncom

To flemingenroncom rhondaflemingtxdpsstatetxus

Sent 12122000 at 1608

Thank you very much We will give it a try

Message id 293

From monikacaushollienroncom

To flemingenroncom rhondaflemingtxdpsstatetxus

Subject SSN requirement

Sent date 12122000

Body Thank you very much We will give it a try

Attachment false

Figure 31 An example of a single email message in the ENRON data setand the information that is extracted from it

The most well known version of the data set has been made public by Co-hen [12] in 2004 after some integrity issues had been dealt with The Cohen-version of the data set contains roughly 500000 email messages from 151 Enronemployees Shetty and Adibi [58] analyzed the corpusrsquo appropriateness for re-search and applied several preprocessing steps to the data Messages stored inone userrsquos in-box and in another userrsquos out-box were considered duplicates andhave been removed The same goes for messages that were created by the com-puter by organizing and storing messages into folders such as rdquoall documentsrdquoEmpty messages system messages and messages that contained only forwardsor junk data were removed Invalid email addresses were changed to the formatrdquonoaddressenroncomrdquo and undisclosed recipients were changed to the formatrdquoundisclosed-recipientsenroncomrdquo Finally the folder-based representation ofthe data set was converted to several tables in a MySQL database The cleanedversion of the data set contains 252758 emails by 151 different employees

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step    Records affected    Percentage removed (cum.)
1       17052               6.70
3       13681               12.00
4       26223               22.50
5       4001                24.00
6       25990               34.00
7       3700                35.80
8       52163               56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus

2. Forward and reply-parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply-parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number as well as the cumulative percentage of records that have been removed per step.
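To illustrate step 6, the following is a minimal sketch of such duplicate removal using pandas; the file and column names (sender, receiver, body, send_date, subject) are hypothetical stand-ins for the actual SQL tables used in this thesis.

import pandas as pd

# Hypothetical flat export of the message table.
messages = pd.read_csv("enron_messages.csv")

# Step 6: messages with identical sender, receiver, body, send date and
# subject are considered duplicates; only the first copy is retained.
deduplicated = messages.drop_duplicates(
    subset=["sender", "receiver", "body", "send_date", "subject"],
    keep="first",
)
print(f"Removed {len(messages) - len(deduplicated)} duplicate messages")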

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances that were needed for classification using Support Vector Machines. These results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44,912 emails by 246 different senders. For each message the sender, receiver, subject, body and send-date has been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author. The x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides an overview of the total number of words per author.


Figure 3.2: Averages of 10 times 10-fold cross-validation using different training set sizes and kernels (linear and RBF) for the authorship SVM. (x-axis: number of training instances per class, 20-200; y-axis: 10-fold cross-validation accuracy, 0.5-1.0)

The x-axis of Figure 3.4 represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set. Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it corresponds to the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:


Figure 3.3: The distribution of email messages per author. (x-axis: number of emails, 90-230; y-axis: number of authors, 0-35)

Figure 3.4: The distribution of the total number of words per author. (x-axis: total number of words, log scale from 10,000 upwards; y-axis: number of authors, 0-180)


Figure 3.5: A network graph of the authors in the subset of the ENRON data set. (Nodes are labeled with truncated sender names; node color represents degree.)


Type of Alias                       Number of authors
High Jaro-Winkler with 1 alias      26
High Jaro-Winkler with 2 aliases    15
Low Jaro-Winkler with 1 alias       11
Low Jaro-Winkler with 2 aliases     1
No alias                            193

Table 3.2: Artificial aliases in the ENRON data set, by type

Test set             Mixed    Hard
High Jaro-Winkler    6        2
Low Jaro-Winkler     8        16
No alias             6        2

Table 3.3: Distribution of alias-types in two different test sets

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.comA & john.doe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors including aliases in the final data set equaled 315. A sketch of this alias-creation procedure is given below.
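To make the construction concrete, the following is a minimal sketch of the splitting procedure, under the assumptions that messages are assigned uniformly at random and that aliases are named by appending a suffix to the author's address; the message schema is a hypothetical stand-in.

import random

def split_author_into_aliases(author, messages, num_aliases, seed=0):
    # Randomly assign an author's messages to one of several artificial
    # aliases, replacing the author on both sides of each link.
    rng = random.Random(seed)
    aliases = [f"{author}.{chr(ord('A') + i)}" for i in range(num_aliases)]
    for msg in messages:
        alias = rng.choice(aliases)
        if msg["sender"] == author:
            msg["sender"] = alias
        msg["receivers"] = [alias if r == author else r for r in msg["receivers"]]
    return aliases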

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
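For reference, the following is a minimal self-contained sketch of the Jaro-Winkler computation as commonly defined; it is an illustration, not the exact implementation used in this thesis.

def jaro(s1, s2):
    # Jaro similarity: based on matching characters within a sliding
    # window and the number of transpositions between them.
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    # Winkler correction: boost the score for a shared prefix (up to 4 chars).
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# Two addresses are flagged as a candidate alias pair above a threshold.
print(jaro_winkler("john.doe@enron.com", "jon.doe@enron.com") > 0.94)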

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was calculated as follows:

ConnectedPath_norm(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 do not occur in the neighborhood of their correspondents anymore, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
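The neighbor-based Jaccard score is small enough to sketch in full; the neighbor sets would be derived from the sender/receiver links in the message table.

def jaccard_similarity(neighbors_a, neighbors_b):
    # Jaccard similarity of two authors' sets of direct correspondents:
    # size of the intersection divided by size of the union.
    if not neighbors_a and not neighbors_b:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)

# Example: two authors sharing two of four distinct correspondents.
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5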

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of writing to subordinates. Moreover, Feiguina and Hirst [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features, to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using


Features    Description

Lexical
1           Total number of characters (C)
2           Total number of alphabetic characters / C
3           Total number of upper-case characters / C
4           Total number of digit characters / C
5           Total number of white-space characters / C
6           Total number of tab spaces / C
7-32        Frequency of letters A-Z
33-53       Frequency of special characters (~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |)
54          Total number of words (M)
55          Total number of short words (less than four characters) / M
56          Total number of characters in words / C
57          Average word length
58          Average sentence length (in characters)
59          Average sentence length (in words)
60          Total different words / M
61          Hapax legomena: frequency of once-occurring words
62          Hapax dislegomena: frequency of twice-occurring words
63-82       Word length frequency distribution / M
83-333      TF-IDF of 250 most frequent 3-grams

Syntactic
334-341     Frequency of punctuation marks
342-491     Frequency of function words

Structural
492         Total number of sentences

Table 3.4: Feature set for the authorship SVM


5 times 5-fold cross-validation for each authorship SVM. The highest scoring combination of parameters is then chosen to train the actual SVM model.
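Purely as an illustration of this parameter search, an equivalent grid search can be sketched with scikit-learn; the library choice and the synthetic data are assumptions for the sketch, not the original tooling.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Exponentially growing sequences, as described above.
param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 2)],      # 2^-5, 2^-3, ..., 2^15
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],  # 2^-15, 2^-13, ..., 2^3
}

# Stand-in for the 492-dimensional stylometric feature vectors.
X, y = make_classification(n_samples=160, n_features=20, random_state=0)

# Cross-validated accuracy for every (C, gamma) combination; the best
# combination is then used to train the final authorship SVM.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)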

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.
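As an illustration of the one-versus-all setup with balanced negative sampling, a schematic sketch follows; the data structures are hypothetical, and scikit-learn is used here only for illustration in place of the thesis's actual SVM.NET implementation.

import random
from sklearn.svm import SVC

def train_author_models(emails_by_author, seed=0):
    # emails_by_author: dict mapping author -> list of feature vectors.
    rng = random.Random(seed)
    models = {}
    for author, positives in emails_by_author.items():
        # Sample as many negatives as positives, drawn at random from
        # all other authors, to keep the classes balanced.
        pool = [x for a, xs in emails_by_author.items() if a != author for x in xs]
        negatives = rng.sample(pool, len(positives))
        X = positives + negatives
        y = [1] * len(positives) + [0] * len(negatives)
        # probability=True so each model can later score P(author | text).
        models[author] = SVC(kernel="rbf", probability=True).fit(X, y)
    return models

# At test time, every author's model scores the text; the per-author
# probabilities feed into the alias-resolution pipeline.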

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

Figure 3.6: The structure of the combined approach
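To make the stacking step concrete, here is a minimal sketch of such a voting SVM on synthetic scores; the variable names are hypothetical, and scikit-learn again stands in for the thesis's actual implementation.

import numpy as np
from sklearn.svm import SVC

# Each candidate author-alias pair is described by the three technique
# scores, e.g. [jaro_winkler, link_similarity, authorship_svm_probability].
rng = np.random.default_rng(0)
alias_pairs = rng.uniform(0.5, 1.0, size=(20, 3))      # synthetic positives
non_alias_pairs = rng.uniform(0.0, 0.5, size=(20, 3))  # synthetic negatives
X_train = np.vstack([alias_pairs, non_alias_pairs])
y_train = [1] * 20 + [0] * 20

voting_svm = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

# The predicted probability is then compared against a decision threshold.
candidate = [[0.93, 0.30, 0.70]]
print(voting_svm.predict_proba(candidate)[0][1])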

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.
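The threshold sweep itself is straightforward; a minimal sketch of how precision, recall and F1 can be computed per threshold (the score and label arrays are hypothetical):

def precision_recall_f1(scores, labels, threshold):
    # Count true positives, false positives and false negatives at this threshold.
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

scores = [0.97, 0.91, 0.40, 0.12, 0.88]
labels = [1, 1, 0, 0, 1]
for t in [i / 20 for i in range(21)]:  # thresholds 0.0, 0.05, ..., 1.0
    print(t, precision_recall_f1(scores, labels, t))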


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.80 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM. (x-axis: decision threshold, 0-1; y-axis: precision/recall/F1)

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM. (x-axis: decision threshold, 0-1; y-axis: precision/recall/F1)


Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM. (x-axis: decision threshold, 0-1; y-axis: precision/recall/F1)

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM. (x-axis: decision threshold, 0-1; y-axis: precision/recall/F1)


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that using a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If there was no Connected Path score returned for a particular author-alias pair, the alias had to be counted as a false negative,


thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There is a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J. and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q. and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.


[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M. and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H. and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.


[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611–es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D. and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis (2005), volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C. and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C. and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C. and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html

[36] Kern, R., Seifert, C., Zechner, M. and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.


[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R. and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J. and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R. and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A. and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING '08, pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P. and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science (New York, NY), 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.


[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Evangelos, S., Jiawei, H. and Usama, F., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R. and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. Int. J. Metadata Semant. Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.


[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N. and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K. and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. In Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H. and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.


Appendix

List of function words used in the authorship SVM:

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your



Chapter 3

Methods

This chapter explains the different approaches that have been taken to tacklethe problems defined in the research questions as well as the corpus that hasbeen used for evaluation Design choices that have been made as well as impor-tant implementation details will be elaborated upon This chapter will startwith an introduction of the corpus that has been used and an explanation ofthe preprocessing that has been applied to it The individual techniques thathave been implemented will be discussed in section 32 whereas the differentcombinations of techniques are dealt with in section 33

31 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personalcomputers of employees of the American energy company ENRON On Decem-ber 2nd the company went bankrupt on account of one of the largest and mostcomplex cases of accounting fraud in US history The emails generated by158 employees were seized by the Federal Energy Regulatory Commission [21]during its investigation which started in 2002 The ENRON data set is the onlylarge real-life email data set that is available for research The informationthat is recorded for a single email message can be seen in figure 31 Since thesender receiver and body of each email message is recorded the data set isespecially well suited for alias resolution techniques that use information fromdifferent domains Although the original data set contains emails as well asattachments only the emails have been used in this thesis Many attachmentsdo not contain written text and for the ones that do it cannot be verified whothe actual author of the text is For the body of the email messages it can beassumed that the sender of the email has written it except for the forward andreply-parts Concerning the text that is stored in attachments this assumptioncannot be made Therefore the inclusion of these attachment would create toomuch noise in the data set especially for the authorship attribution techniquesthat use the content of the email


SSN requirement
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Sent: 12/12/2000 at 16:08

Thank you very much. We will give it a try.

Message id: 293
From: monika.causholli@enron.com
To: fleming@enron.com, rhonda.fleming@txdps.state.tx.us
Subject: SSN requirement
Sent date: 12/12/2000
Body: Thank you very much. We will give it a try.
Attachment: false

Figure 3.1: An example of a single email message in the ENRON data set, and the information that is extracted from it.

The most well-known version of the data set has been made public by Cohen [12] in 2004, after some integrity issues had been dealt with. The Cohen version of the data set contains roughly 500,000 email messages from 151 Enron employees. Shetty and Adibi [58] analyzed the corpus' appropriateness for research and applied several preprocessing steps to the data. Messages stored in one user's in-box and in another user's out-box were considered duplicates and have been removed. The same goes for messages that were created by the computer by organizing and storing messages into folders such as "all documents". Empty messages, system messages, and messages that contained only forwards or junk data were removed. Invalid email addresses were changed to the format "noaddress@enron.com", and undisclosed recipients were changed to the format "undisclosed-recipients@enron.com". Finally, the folder-based representation of the data set was converted to several tables in a MySQL database. The cleaned version of the data set contains 252,758 emails by 151 different employees.

The corpus that was made available by Shetty & Adibi still contained some noise, as well as information not useful for this thesis. Therefore, after converting it to a Microsoft SQL database, the following preprocessing steps have been applied to it:

1. A number of system messages and junk messages that were still present in the data were removed. Among these messages were calendar reminders and messages that only contained attachments.


Step | Records affected | Percentage removed (cum.)
1 | 17052 | 6.70
3 | 13681 | 12.00
4 | 26223 | 22.50
5 | 4001 | 24.00
6 | 25990 | 34.00
7 | 3700 | 35.80
8 | 52163 | 56.50

Table 3.1: Preprocessing steps applied to the ENRON corpus

2. Forward and reply-parts of messages have been removed.

3. Empty messages resulting from the removal of forward or reply-parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number, as well as the cumulative percentage, of records that have been removed per step. Step 2 modified messages rather than removing them, which is why it does not appear as a separate row in the table.
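To make step 6 concrete, the following is a minimal sketch of the duplicate-removal logic, assuming an in-memory list of message dictionaries rather than the actual SQL tables used for the corpus:

    # Sketch of preprocessing step 6: messages with identical sender, receiver,
    # subject, body and send date are duplicates; only the first copy is kept.
    def deduplicate(messages):
        seen = set()
        unique = []
        for m in messages:
            key = (m["sender"], m["receiver"], m["subject"], m["body"], m["send_date"])
            if key not in seen:
                seen.add(key)
                unique.append(m)
        return unique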

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10,000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5,000 words per author. In the final data set, the average number of words per email equals 209, and with at least 80 emails per author, it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consists of 44,912 emails by 246 different senders. For each message, the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author. The x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors.

[Plot: 10-fold cross-validation accuracy (0.5-1.0) against the number of training instances per class (20-200), with one curve for the linear kernel and one for the RBF kernel.]

Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels for the authorship SVM.

Figure 3.4 provides an overview of the total number of words per author. The x-axis represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10,000 and 100,000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set. Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it reflects the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set, and their emails were split up among several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:

[Histogram: number of email messages sent (x-axis, 90-230) against number of authors (y-axis, 0-35).]

Figure 3.3: The distribution of email messages per author.

[Histogram: total number of words per author (x-axis, logarithmic scale from 10,000 upward) against number of authors (y-axis, 0-180).]

Figure 3.4: The distribution of the total number of words per author.

[Network graph: nodes are the authors in the final data set, labeled with their (partially truncated) email names and artificial alias names; node color represents degree.]

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of Alias | Number of authors
High Jaro-Winkler with 1 alias | 26
High Jaro-Winkler with 2 aliases | 15
Low Jaro-Winkler with 1 alias | 11
Low Jaro-Winkler with 2 aliases | 1
No alias | 193

Table 3.2: Artificial aliases in the ENRON data set, by type

Test set | Mixed | Hard
High Jaro-Winkler | 6 | 2
Low Jaro-Winkler | 8 | 16
No alias | 6 | 2

Table 3.3: Distribution of alias types in the two different test sets

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.com.A & john.doe@enron.com.B)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.
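The splitting procedure itself can be sketched as follows; the alias suffixes and the two-alias split are illustrative assumptions, not the exact conventions used:

    import random

    # Sketch of the artificial-alias construction: every message from or to a
    # selected author (>= 200 emails) is randomly reassigned to one of that
    # author's aliases.
    def split_into_aliases(author, messages, n_aliases=2, seed=42):
        rng = random.Random(seed)
        aliases = [author + "." + suffix for suffix in "ABC"[:n_aliases]]
        for m in messages:
            alias = rng.choice(aliases)
            if m["sender"] == author:
                m["sender"] = alias
            if m["receiver"] == author:
                m["receiver"] = alias
        return aliases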

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases.


The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
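As an illustration of this step, here is a small sketch using the third-party jellyfish library as a stand-in for the string-metric implementation actually used; the addresses and the 0.94 threshold are examples only:

    import jellyfish  # third-party library (jellyfish >= 0.8), not the thesis implementation

    # Return every other address whose Jaro-Winkler score reaches the threshold.
    def alias_candidates(address, all_addresses, threshold=0.94):
        candidates = []
        for other in all_addresses:
            if other != address:
                score = jellyfish.jaro_winkler_similarity(address, other)
                if score >= threshold:
                    candidates.append((other, score))
        return candidates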

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was calculated as follows:

    ConnectedPath_norm(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max    (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in Section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
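A minimal sketch of this computation over sets of direct neighbors (the example addresses are made up):

    # Jaccard similarity between the correspondent sets of two authors.
    def jaccard(neighbors_a, neighbors_b):
        if not neighbors_a and not neighbors_b:
            return 0.0
        return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)

    a = {"kim.ward@enron.com", "mark.palmer@enron.com", "tim.belden@enron.com"}
    b = {"kim.ward@enron.com", "mark.palmer@enron.com", "jeff.king@enron.com"}
    print(jaccard(a, b))  # 2 shared out of 4 distinct neighbors -> 0.5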

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of writing to subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4. The list of function words that has been used in the feature set can be found in the appendix.
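To give an impression of the feature extraction, the sketch below computes a handful of the features listed in Table 3.4 for a single email body; the full set, including the 3-gram TF-IDF block, is omitted for brevity, and only an excerpt of the appendix's function word list is used:

    import re
    from collections import Counter

    FUNCTION_WORDS = ["a", "about", "above", "after", "all", "although"]  # excerpt

    def style_features(text):
        words = re.findall(r"[A-Za-z']+", text)
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        counts = Counter(w.lower() for w in words)
        features = {
            "total_chars": len(text),                                    # feature 1
            "upper_chars": sum(c.isupper() for c in text),               # feature 3
            "digit_chars": sum(c.isdigit() for c in text),               # feature 4
            "total_words": len(words),                                   # feature 54
            "short_words": sum(len(w) < 4 for w in words),               # feature 55
            "avg_word_len": sum(map(len, words)) / max(len(words), 1),   # feature 57
            "avg_sent_words": len(words) / max(len(sentences), 1),       # feature 59
            "hapax_legomena": sum(1 for c in counts.values() if c == 1), # feature 61
            "n_sentences": len(sentences),                               # feature 492
        }
        for fw in FUNCTION_WORDS:                                        # features 342-491
            features["fw_" + fw] = counts.get(fw, 0)
        return features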

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest scoring combination of parameters is then chosen to train the actual SVM model.
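In a modern Python setting this grid search could be sketched as follows; scikit-learn here stands in for the SVM.NET/LibSVM setup actually used, and a single 5-fold pass replaces the 5 × 5-fold procedure for brevity:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {
        "C":     [2.0 ** e for e in range(-5, 16, 2)],   # 2^-5, 2^-3, ..., 2^15
        "gamma": [2.0 ** e for e in range(-15, 4, 2)],   # 2^-15, 2^-13, ..., 2^3
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
    # search.fit(X, y) would then select the best (C, gamma) pair:
    # best_C, best_gamma = search.best_params_["C"], search.best_params_["gamma"]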

Features | Description

Lexical
1 | Total number of characters (C)
2 | Total number of alphabetic characters / C
3 | Total number of upper-case characters / C
4 | Total number of digit characters / C
5 | Total number of white-space characters / C
6 | Total number of tab spaces / C
7-32 | Frequency of letters A-Z
33-53 | Frequency of special characters, e.g. ~ $ ^ & - _ = + > < [ ] |
54 | Total number of words (M)
55 | Total number of short words (less than four characters) / M
56 | Total number of characters in words / C
57 | Average word length
58 | Average sentence length (in characters)
59 | Average sentence length (in words)
60 | Total different words / M
61 | Hapax legomena: frequency of once-occurring words
62 | Hapax dislegomena: frequency of twice-occurring words
63-82 | Word length frequency distribution / M
83-333 | TF-IDF of 250 most frequent 3-grams

Syntactic
334-341 | Frequency of punctuation marks (including ' and ")
342-491 | Frequency of function words

Structural
492 | Total number of sentences

Table 3.4: Feature set for the authorship SVM


The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from the other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion, by Matthew A. Johnson, of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19], and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.
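A sketch of the resulting training procedure, combining the one-versus-all scheme with the class balancing just described (the C and gamma values are placeholders for the grid-search winners, and scikit-learn again stands in for SVM.NET):

    import random
    from sklearn.svm import SVC

    def train_one_vs_all(features_by_author, C=8.0, gamma=2.0 ** -7, seed=7):
        rng = random.Random(seed)
        models = {}
        for author, positives in features_by_author.items():
            # negatives: an equal number of emails drawn at random from all other authors
            others = [x for a, xs in features_by_author.items() if a != author for x in xs]
            negatives = rng.sample(others, min(len(positives), len(others)))
            X = positives + negatives
            y = [1] * len(positives) + [0] * len(negatives)
            models[author] = SVC(kernel="rbf", C=C, gamma=gamma, probability=True).fit(X, y)
        return models

    # models[author].predict_proba([x])[0][1] then gives the probability that
    # the email with feature vector x was written by that author.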

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3), and authorship SVM on email content.

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network, and authorship SVM on email content.

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.

Figure 3.6: The structure of the combined approach.

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and five times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
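A sketch of this setup, with invented score vectors in the order [Jaro-Winkler, Jaccard (or Connected Path), authorship-SVM probability]:

    from sklearn.svm import SVC

    X_train = [
        [0.97, 0.55, 0.81],   # manually labeled as real aliases ...
        [0.88, 0.47, 0.74],
        [0.91, 0.60, 0.70],
        [0.41, 0.08, 0.22],   # ... and as non-aliases
        [0.35, 0.12, 0.19],
        [0.52, 0.20, 0.33],
    ]
    y_train = [1, 1, 1, 0, 0, 0]

    voting_svm = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
    print(voting_svm.predict_proba([[0.92, 0.50, 0.78]])[0][1])  # alias probability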

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.80 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.
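For reference, the threshold sweep behind these curves can be sketched as follows, given hypothetical (score, is-true-alias) pairs produced by one technique:

    # Precision, recall and F1 for decision thresholds 0.0, 0.05, ..., 1.0.
    def sweep(scored_pairs):
        results = []
        for step in range(21):
            t = step * 0.05
            tp = sum(1 for s, alias in scored_pairs if s >= t and alias)
            fp = sum(1 for s, alias in scored_pairs if s >= t and not alias)
            fn = sum(1 for s, alias in scored_pairs if s < t and alias)
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * p * r / (p + r) if p + r else 0.0
            results.append((t, p, r, f1))
        return results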

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88, using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best results of all the techniques on this test set, namely an F1-score of 0.89, using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.

[Four plots of precision, recall and F1 against decision thresholds from 0 to 1: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set.

[Two plots of precision, recall and F1 against decision thresholds from 0 to 1: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set.

[Four plots of precision, recall and F1 against decision thresholds from 0 to 1: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set.

[Two plots of precision, recall and F1 against decision thresholds from 0 to 1: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set.


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM, or even Jaro-Winkler alone, and on the hard test set it performed even worse. Because of the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content, and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. Intelligent Systems, IEEE, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, page 611, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING '08, pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science (New York, N.Y.), 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


In order to get a better view of the data set that is being used several statis-tics have been calculated Figure 33 shows the distribution of email messagesper author The x-axis represents the number of email messages that have beensent whereas the y-axis represents the number of authors Figure 34 provides

31

20 40 60 80 100 120 140 160 180 20005

055

06

065

07

075

08

085

09

095

1

Number of training instances per class

10minus

fold

Cro

ssminus

valid

atio

n ac

cura

cy

Crossminusvalidation accuracy for different training set sizes

LinearRBF

Figure 32 Averages of 10 times 10-fold cross-validation using different train-ing set sizes and kernels for the Authorship SVM

an overview of the total number of words per author The x-axis represent thetotal number of words that one author has written whereas the y-axis repre-sents the number of authors It can be seen that by far the largest number ofauthors has written a total number of words between 10000 and 100000

Next to these statistical measures a network graph has been created offall the authors in the final data set Figure 35 shows this graph Only thesenders in the network are shown since the number of receivers runs into thethousands The color of a node represent the degree of that node ie thenumber of in-going and out-going links It represents the number of messagesthat this author has sent and received It can be concluded from this graph thatthe authors in the data set are highly connected The average degree is as highas 216 Moreover nodes that appear close to each other in the graph have moreconnections between them meaning that they have had regular email contact

Since there was no data to verify whether the ENRON-data set actuallycontained any real aliases and therefore no means to measure precision andrecall several artificial aliases have been created In order to create these aliasesauthors with a total number of emails gt= 200 were selected from the data setand their emails were split up into several aliases To be more precise messagesfrom and to the original author were randomly assigned to one out of a numberof aliases for that author three different categories of authors have thus beencreated namely

32

90 100 110 120 130 140 150 160 170 180 190 200 210 220 2300

5

10

15

20

25

30

35

Num

ber

of a

utho

rs

Number of emails

Figure 33 The distribution of email messages per author

10000 100000 1000000 100000000

20

40

60

80

100

120

140

160

180

Num

ber

of a

utho

rs

Total number of words

Figure 34 The distribution of the total number of words per author

33

40enronen

rickbuye

fazul_adba

ehaedick

markpalme

sheikh_ahm

abdul_rahm

fcalger

hasan_izz

paulkaufm

abu_islam

faggoud_ya

mscottetheresast

al-hourie

keeganfar

janetteel

mpresto

timbelden

paulybarboenro

rodhaysle

kathleenc

thodgee

taffymill

hollykeis

marcusnet

shonnieda

ahmad_ibra

travismcc

ali_atwa

kayyoung

tracygeac

40enronen

bakr_ahmad

jasonwill

michellel

amartin

brianredm

jimschwie

rodhayslerodhaysle

shonawils

pattithom

40enronen mikegrigs

stephanie

ahoward

lindydono

nabith_hen

ahmed_garbdannymcca

ericgadd

kevinhyat

hasan_izz-

kevinhyatkevinhyat

ashraf_ref

samir_salw

ipayitenr

jfarmer

kimberlyh

karenbuck

ashankma

sheilaglo alanaronoandyzippe

jkeanen

davidoxle

gregwhallrickbuye

jkaminski

lindarobe

abdul_rahm

susanmara

alancomne

daviddela

nu_manenr

markschro

lisayohorayalvare

yasein_tah

janelguer

harryking

leslielaw

susanmara

maureenmc

lnicolay

suenorde

sarahnovo

johnshelk

jennifert

daviddela

susanmara

karendenn

staceybol

mschmidt

kevinpres

daviddelachristini

maryhain

josephala

mohamed_at

mdaygmssr

rosaleefl

christian

karendenn

sgovenarg

christophe

jbennettg

michaeltr

chrisfost

kristinwa

mikegrigs

phillipal

mikegrigs

phillipal

miyungbus

abu_fatima

benjacoby

huntershi

gfergusbr

lisamelle

abd_al_wak

mscotte

mscottemtholte

stephanie

vweldone

abu_abdall

melissamu

bin_laden

davidport

beckyspen

sharistac

taylorenr

taylorenr

brenthend

hassan_ros

rhondaden

gregpiper

kimberlyh

mforney

carolst

justinboy

davidminn

bobshults stephanie

jeanmrha

susanpere

sandrabra

brittdavitorikuyke

jonathanm

fletchers

randallga

brantreve

the_teache

stacydick

ahmed_khal

janetholt

kevinrusc

kimwarde

mikecarso

martincui

patricemi

ustaz

kimwarde

reaganror

samuelsch

danadavis

tammiesch

mprestokallene

peterkeoh

sharencas

ahmed_shie

stephanie

dthomas

cgironemloveen

anas_al-sa

lmimsenabu_yussrr

wwhitee

abu_khadij

abu_omran

tanyaroha

stuartzis

robertbrususanbail

suzannead

samanthabstephanie

muhammad_a

ali_saed_b

russelldi

abu_seif_a

kennethth

ashraf_ref

markwhitt

outlookte

philliplo

muhamad_ib

the_emirglenhass

jefferyfa

lorrainel

johnbucha

jerrygrav

dennisleedarrellsc

juliearms

cindystar

veronicae

markmccon

bsanders

joanniewi

rosaleefl

billiiie

twandaswe

carasempe

ammar_mans

fouad_moha

sherriser

annschmid

audreyrob

philliplo

michellel

billrapp

tluccie

kimwarde

kaychapma

stevehoos

ericgilla

ccampbell

monikacau

juanherna

lavoratoe

lavoratoe

jkaminski

meganpark

cameronpenancysell

cheryljoh

robertcot

continenta

mpresto

dkinneyco

douglassa

enron_upda

enron_upda

enron_upda

sheikh_bah

ruthconca

exchangein

exchangein

saif_al-adil

foolmotle

gelliottimujahid_sh

infopmaco

sheikh_swe

marypoorm

jkaminskishirleycr

joeparksjoeparks

kaminskie

mforney

robinrodr kerrithom

khalid_mou

lgoldseth

liztaylor

robinrodr

marketing

philliplo

masterama

memberserv

messenger

mjones7tx

mjones7tx

mustafa_mu

navigator

newsreal-newsletter

noreplycc

noreplycc

nytdirect

perfmgmte

perfmgmte

robgayen

robinrodrsafa_tabah

sylviahu

truorange

webmaster

Figure 35 A network graph of the authors in the subset of the ENRON dataset

34

Type of Alias Number of authors

High Jaro-Winkler with 1 alias 26High Jaro-Winkler with 2 aliases 15Low Jaro-Winkler with 1 alias 11

Low Jaro-Winkler with 2 aliases 1No Alias 193

Table 32 Artificial Aliases in the ENRON data set by type

Test set Mixed Hard

High Jaro-Winkler 6 2Low Jaro-Winkler 8 16

No alias 6 2

Table 33 Distribution of alias-types in two different test sets

bull Authors with 1 or more artificial aliases with a high Jaro-Winkler simi-larity (eg johndoeenroncomA amp johndoeenroncomB)

bull Aliases with 1 or more artificial aliases with a low Jaro-Winkler similarity(eg bin laden amp abu abdallah)

bull Authors without an alias

The distribution of authors and aliases in the final data set can be seen in moredetail in table 32The number of authors including aliases in the final data setequaled 315

Test sets

In order to evaluate the results of the different techniques two different test setshave been created The first test set called the mixed test set has a fairly equaldivision of alias types as can be seen in table 33 The second test set calledthe hard test set is substantially more difficult since the majority of the aliasesare not easy to recognize by their email addresses The authors in each test setwere chosen at random from their respective alias categories

32 Individual Techniques

The first technique whose performance has been evaluated on the ENRON sub-set is the Jaro-Winkler similarity For each author in the test set the Jaro-Winkler similarity of that authorrsquos email address to that of each other authorhas been calculated If the Jaro-Winkler score of a particular author-authorpair is above a certain threshold the two authors are considered to be aliases

35

The precision and recall for different decision thresholds has been measured us-ing the test sets of table 33 This technique will hereafter be referred to asrdquoJaro-Winklerrdquo or rdquoJWrdquo

The second technique that has been evaluated is the Connected Path methodhereafter referred to as rdquoConnected Pathrdquo or rdquoCPrdquo For each author in the testset a Connected Path search to depth 3 has been performed In order to ensurethat the scores were in the range of [0 1] the score for a particular author-authorpair was calculated as follows

ConnectedPath(vi vj) =ConnectedPath(vi vj)

ConnectedPathmax(31)

where ConnectedPathmax is the maximum similarity scores found for any twoauthors in the data set

The third technique that has been tested is Jaccard similarity hereafterreferred to as rdquoJaccardrdquo For each author in the test set the Jaccard similaritybetween that authors neighbors and each other authors neighbors has beencalculated Note that authors that have been removed from the data set duringthe preprocessing steps described in section 31 do not occur in the neighborhoodof their correspondents anymore and do not contribute towards the Jaccardsimilarity score The same principle applies to the Connected Path score

The last individual technique that has been evaluated is the use of SVMon email content hereafter referred to as rdquoauthorship SVMrdquo or simply rdquoSVMrdquoThe first decision that had to be made was whether to treat the problem froman instance-based or profile-based perspective Since it is possible that authorsemploy different writing styles when writing to different contacts it is importantto retain the differences in each email message For example an author can usedifferent words when writing to friends instead of colleagues or he can uselonger sentences when writing to superiors instead of writing to subordinatesMoreover Hirst and Feiguina [20] conclude that using multiple short texts forauthorship attribution overcomes the problem of not having sufficiently longtraining texts available Therefore an instance-based approach to authorshipattribution has been adopted A combination of lexical syntactic and structuralfeatures has been adapted from [71] and extended with a number of additionalfeatures to create a larger overall feature variance The complete feature setthat has been used in the authorship SVMrsquos can be found in table 34 Thelist of function words that has been used in the feature set can be found in theappendix

Based on the experiment shown in figure 32 it was determined that theauthorship SVMrsquos should be trained using a Radial Basis Function-kernel sinceits overall performance was better than that of the linear kernel The parameterC influences the penalty associated with classification errors whereas γ controlsthe shape of the separating hyper plane In order to find optimal values of Cand γ a straightforward grid-search has been performed using exponentiallygrowing sequences of C and γ Specifically the accuracy all combinations ofC = 2minus5 2minus3 2minus1 215 and γ = 2minus15 2minus13 2minus11 23 is calculated using

36

Features Description

Lexical1 Total number of characters (C)2 Total number of alphabetic characters C3 Total number of upper-case characters C4 Total number of digit characters C5 Total number of white-space characters C6 Total number of tab spaces C

7-32 Frequency of letters A-Z33-53 Frequency of special characters ~$^amp-_=+gtlt[]|

54 Total number of words (M)55 Total number of short words M less than four characters56 Total number of characters in words C57 Average word length58 Average sentence length (in characters)59 Average sentence length (in words)60 Total different words M61 Hapax legomena Frequency of once-occurring words62 Hapax dislegomena Frequency of twice-occurring words

63-82 Word length frequency distribution M83-333 TFIDF of 250 most frequent 3-grams

Syntactic334-341 Frequency of punctuation rsquo rdquo342-491 Frequency of function words

Structural492 Total number of sentences

Table 34 Feature set for the authorship SVM

37

5 times 5-fold cross validation for each authorship SVM The highest scoringcombination of parameters is then chosen to train the actual SVM model

The authorship attribution problem is a multi-class problem because a giventext has to be attributed to one out of multiple candidate authors Since ordi-nary SVMrsquos can only solve binary classification problems a multi-class approachusing one-versus-all classification has been adopted In the one-versus-all ap-proach a single authorship SVM is trained on positive training instances fromone author and negative instances from all the other authors That is theauthorship SVM makes a classification whether a given text has been writtenby one particular author or not Once a separate SVM has been trained forevery author a given text can be classified by letting each authorship SVMassign a probability to the text being written by that author Rifkin and Klau-tau [56] show that as long as a good binary classifier is used it makes littledifference which multi-class scheme is used Therefore a simple scheme such asone-versus-all is preferable over more complex schemes such as error-correctingcodes

Since SVM is sensitive to class imbalances the authorship SVMrsquos are trainedusing an equal amount of positive and negative training instances In order tomake sure that the negative class is a fairly accurate representation of all theother authors emails have been selected at random from other authors For eachauthor all the authorrsquos emails are selected as positive examples and an equalamount of negative emails are used for the negative class The software thathas been used for the authorship SVMrsquos is called SVMNET [35] It is a cleanC-conversion by Matthew A Johnson of the popular LibSVM software suite[10] SVMNET uses the Sequential Minimal Optimization-algorithm describedin Fan et al [19] and is able to handle classification regression and distribu-tion estimation for single and multi-class problems using different kernels andparameters

33 Combinations of Techniques

In order to test whether a combination of techniques that operate on differentdomains will perform better than these techniques individually two differentcombinations of techniques have been tested

bull JW-CP-SVM Jaro-Winkler similarity of email addresses Connected Pathsimilarity of the link network (depth = 3) amp authorship SVM on emailcontent

bull JW-Jaccard-SVM Jaro-Winkler similarity of email addresses Jaccardsimilarity of direct neighbors in the link network amp authorship SVM onemail content

Each combination of techniques is realized by using a voting algorithm thatis based on a Support Vector Machine which will specifically be referred toas the rdquovoting SVMrdquo The voting SVM takes as input a vector containing the

38

Figure 36 The structure of the combined approach

results of the three techniques for a single candidate author and gives as outputa prediction whether the input author is an alias or not An overview of thecombined approach can be found in figure 36

In order to create the voting SVM model a set of training instances isrequired The training set for the voting SVM has the same distribution of aliastypes as the test set depending on which test set is being used Obviouslythe authors that are used to test on are not used for training the voting SVMFor each of the 20 authors in the voting SVM training set candidate aliasesare manually labeled as positive or negative All the positively labeled aliasestogether are used as positive training instances for the voting SVM and 5 timesthe number of positive instances are randomly selected from the negativelylabeled aliases as negative training examples The reason for this class imbalanceis that the number of positive instances in the training sets is rather low 14 and18 for the mixed and hard sets respectively In order to have enough traininginstances available for the voting svm more negative examples are chosen fromthe training set

After two voting SVMrsquos have been trained one using Jaccard similarity andone using Connected Path similarity the test sets from figure 33 are used todetermine the precision and recall for various decision thresholds

39

Chapter 4

Results

The results that have been obtained from the various experiments will be dis-cussed in this section First the results of the individual and combined tech-niques on the mixed test set will be given Second the results of the individualand combined techniques on the hard test set will be given Finally an overviewof the best results achieved by each individual and combined techniques will begiven

Figures 41a to 41d show the precision and recall scores achieved on themixed test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively The results are based on the following decisionthresholds 00 005 010 10 Jaro-Winkler achieves the best F1-scoreof 080 at a decision threshold of 094 and 096 Connected Path achieves amaximum F1-score of 048 on a decision threshold ranging from 08 minus 096Jaccardrsquos best F1-score of 069 occurs at a decision threshold of 058 Finallyauthorship SVM achieves a maximum F1-score of 079 for a decision thresholdof 060

Figures 42a and 42b show the results achieved by the two combinations oftechniques JW-CP-SVM and JW-Jaccard-SVM on the mixed test set JW-CP-SVM achieves its best F1-score of 079 at a threshold of 074 JW-Jaccard-SVMachieves the best result of all the techniques on this test set namely an F1-scoreof 088 using a threshold of 078

Figures 43a to 43d show the precision and recall scores achieved on thehard test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively Again the results are based on the followingdecision thresholds 00 005 010 10 Jaro-Winkler achieves its highestF1-score of 028 at a decision threshold of 088 The best F1-score for ConnectedPath is 053 using a decision threshold of 012 Jaccard achieves a maximum F1-score of 067 at a decision threshold of 038 Finally authorship SVM achievesa maximum F1-score of 076 at a decision threshold of 068

Figures 44a and 44b show the results achieved by the two combinationsof techniques JW-CP-SVM and JW-Jaccard-SVM on the hard test set JW-CP-SVM achieves its best F1-score of 065 at a threshold of 078 whereas JW-

40

Jaccard-SVM achieves the best results of all the techniques on this test setnamely an F1-score of 089 using a threshold of 092

In addition to these graphs the best F1-scores for all the techniques on eachtest set are summarized in figs 45 and 46 The precision and recall valuesthat are shown correspond to the best F1-scores achieved It can be concludedthat the best results on both the mixed and the hard test set are achieved byJW-Jaccard-SVM

41

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 41 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the mixed test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 42 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the mixed test set

42

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 43 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the hard test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 44 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the hard test set

43

Figure 45 Best results on the mixed test set for different techniques Preci-sion and recall values correspond to the given F1-scores

44

Figure 46 Best results on the hard test set for different techniques Precisionand recall values correspond to the given F1-scores

45

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used forresolving aliases and disambiguate authors in email data Specifically it wasinvestigated whether a combination of techniques could perform better thanindividual techniques on these tasks The main results of this thesis can besummarized as follows

Jaro Winkler The Jaro-Winkler approach gave good results on the mixedtest set but failed on the hard test set The high F1-score on the mixed test setcan be explained by the fact that many of the artificial aliases had an extremelyhigh Jaro-Winkler similarity The hard test set more closely mimics a real-worldscenario where aliases do not look as much alike For example Boongoen et al[6] showed that in their data set of real terrorist names derived from web pages70 of the true aliases had a Jaro-Winkler similarity of less than 06 Sincethe hard test set also features more aliases with low Jaro-Winkler similaritythe performance on this test set is significantly lower However the results stillshow that using a simple string metric can detect many aliases resulting fromspelling errors or the use of different email addresses for work home etc

Connected Path It can be concluded that the Connected Path algorithmfailed to achieve good results on both test sets because of three reasons Firstsince authors have been split up into aliases and some have been removed all to-gether the link networkrsquos structure might have been corrupted This especiallyaffects link analysis that goes beyond the analysis of direct neighbors since ittakes into account more complicated link connections Second because of timeconstraints the link network search has been performed to depth 3 which meansthat only the information contained in paths of length 2 and 3 have been usedin the calculation of the similarity score Boongoen et al [6] achieved betteraccuracy by searching to depth 4 compared with a search to depth 2 It isexpected that the same behavior of Connected Path can be observed on thisdata set if the search would have been performed to a greater depth Third theConnected Path method can only return similarity scores for authors that areconnected to the original author If there was no Connected Path score returnedfor a particular author-alias pair the alias had to be counted as a false negative

46

thereby decreasing the overall recallJaccard Using Jaccard similarity yielded better results than the Connected

Path algorithm Since Jaccard similarity only takes into account direct neigh-bors it is less affected by changes in the link network Moreover the Jaccardsimilarity can be calculated between any two authors in the data set which iswhy it scored better than the Connected Path method

Authorship SVM The use of authorship SVMrsquos gave good results overallespecially considering the fact that there are 314 candidate aliases for eachauthor and that the training texts are short Considering that Luyckx [44]reported scalability issues when using a multi-class SVM approach the one-versus-all approach that has been adopted in this thesis is very promising

Combined techniques The combination of JW-CP-SVM did not performvery well On the mixed test set it performed as good as authorship SVMor even Jaro-Winkler alone and for the hard test set it performed even worseBecause of aforementioned reasons the Connected Path method failed to achievegood results in general In combination with the low Jaro-Winkler performanceon the hard data set this resulted in the combination JW-CP-SVM failing toachieve reasonable results The best results for both test sets are achieved bythe combination of Jaro-Winkler Jaccard and authorship SVM On the hardtest set the increase in F1-score of this combination compared with the secondbest technique (SVM) is as high as 16 whereas on the mixed test set theincrease to the second-best technique (Jaro-Winkler) is 9

51 Conclusion

The results of the experiments confirm the hypothesis that a combination oftechniques can yield better results than using these techniques individuallyThe research questions that have been formulated to guide this research areanswered below

Which authorship disambiguation and alias resolution techniques ex-ist that can be used on email data

The literature review of section 2 has provided an extensive overview of thedifferent techniques that can be used to attribute authorship and resolve aliasesTechniques that operate on the domain of email addresses are able to resolvesuperficial aliases resulting from unintentional misspellings or simple variationsin naming conventions Authorship attribution techniques can predict the realauthor of a given set of email very well provided that there is enough trainingtext available for each author If this is the case dealing with a large authorset has also proven to be possible Link analysis techniques have low precisionand recall when used individually but can still manage to find aliases that othertechniques do not

47

How can techniques from different domains be combined

Section 24 has given an overview of approaches to combining techniques fromdifferent domains Previous literature on the subject is rare especially on com-bining techniques from more than two domains A common method is to usea linear weighted combination of the results of different techniques Since theweights are often manually constructed the results are not that good and theability of these techniques to generalize is low Better results are achieved whena classifier is trained to distinguish between good and bad combinations of re-sults

Can a combination of techniques from different domains increase per-formance when compared with individual techniques

The results of Chapter 4 show that a combination of techniques can indeedincrease precision and recall when compared with individual techniques Specif-ically a combination of Jaro-Winkler similarity on email addresses authorshipSVM on email content and Jaccard similarity of the link network using an SVMvoting algorithm achieves the best results when tested on a subset of the EN-RON data set The results of this combination of techniques also tend to bemore robust across different decision thresholds something that is useful whendetermining a proper threshold might be difficult It is important to note thatthe relative improvement in F1-score of the combined techniques over the in-dividual techniques is dependent on the number of low Jaro-Winkler aliases inthe test set Especially on the hard data set where aliases are more difficult torecognize the combination of techniques performs very well and achieves sig-nificantly higher F1-scores than the individual techniques This indicates thatthe different techniques are indeed complementary and can work together toachieve better results

To summarize it can be concluded that a combination of techniques thatoperate on different domains is more effective in disambiguating authors andresolving aliases than each of these techniques individually

52 Future Recommendations

The results and conclusion that have been put forward in the previous sectionsprovide good ground for future research It will be interesting to see how wellthe techniques that have been used in this thesis perform on a full data set withreal aliases which could not be found to use in this research Should such acollection not exist it is worthwhile to create one

The link analysis techniques that have been used in this paper only useinformation from the direct neighborhood of the authors Boongoen et al[6]have already shown that searching to a greater depth yields better results soit is useful to look at how the algorithm can be optimized to be less computa-tionally intensive in order to search to greater depths Moreover since the less

48

sophisticated Jaccard similarity surprisingly outperformed Connected Path itis worthwhile to experiment with different link analysis techniques

Finally the assumption has been made that the results from various tech-niques are independent of each other These assumptions have not been testedand it is not clear if and in what way various techniques affect each other Thereare a myriad of decisions to be made when implementing an alias resolution sys-tem that combines different techniques Therefore it is imperative that moreresearch will be done to examine the best choice of feature sets techniques andaggregation methods

49

Chapter 6

Bibliography

[1] Abbasi A and Chen H (2005) Applying authorship analysis to extremist-group Web forum messages Intelligent Systems IEEE 20(5)67ndash75

[2] Argamon S and Juola P (2011) Overview of the International Au-thorship Identification Competition at PAN-2011 In CLEF (Notebook Pa-persLabsWorkshop)

[3] Baroni M Matiasek J and Trost H (2002) Unsupervised discovery ofmorphologically related words based on orthographic and semantic similarityIn Proceedings of the ACL-02 workshop on Morphological and phonologicallearning-Volume 6 volume 6 pages 48ndash57 Morristown NJ USA Associationfor Computational Linguistics

[4] Binongo J N G (2003) Who Wrote the 15th Book of Oz An Applicationof Multivariate Analysis to Authorship Attribution Chance 16(2)9ndash17

[5] Boongoen T and Shen Q (2009) Semi-supervised OWA aggregation forlink-based similarity evaluation and alias detection 2009 IEEE InternationalConference on Fuzzy Systems pages 288ndash293

[6] Boongoen T Shen Q and Price C (2010) Disclosing false identitythrough hybrid link analysis Artificial Intelligence and Law 18(1)77ndash102

[7] Brunet E (1978) Le Vocabulaire de Jean Giraudoux structure et evolutionStatistique et informatique appliquees a letude des textes a partir des donneesdu Tresor de la langue francaise Le Vocabulaire des grands ecrivains francais1 Slatkine

[8] Burrows J (2002) rsquoDeltarsquo A measure of stylistic difference and a guide tolikely authorship Literary and Linguistic Computing 17(3)267ndash287

[9] Burrows J (2007) All the Way Through Testing for Authorship in Differ-ent Frequency Strata Literary and Linguistic Computing 22(1)27ndash47

50

[10] Chang C-c and Lin C-j (2011) LIBSVM A Library for Support VectorMachines ACM Transactions on Intelligent Systems and Technology 2(3)1ndash39

[11] Christen P (2006) A Comparison of Personal Name Matching Techniquesand Practical Issues In Sixth IEEE International Conference on Data Mining- Workshops (ICDMWrsquo06) number September pages 290ndash294 IEEE

[12] Cohen W W (2009) Enron email dataset Retrieved from httpwwwcscmuedu~enron

[13] Cohen W W and Fienberg S E (2003) A comparison of three stringmatching algorithms Methods 20(1)73ndash78

[14] Cortes C and Vapnik V (1995) Support-vector networks MachineLearning 20(3)273ndash297

[15] Crammer K and Singer Y (2002) On the Algorithmic Implementationof Multiclass Kernel-based Vector Machines Journal of Machine LearningResearch 2(2)265ndash292

[16] de Vel O (2000) Mining e-mail authorship Proc Workshop on TextMining ACM International

[17] de Vel O Anderson A Corney M and Mohay G (2001) Mining e-mailcontent for author identification forensics ACM SIGMOD Record 30(4)55

[18] Duan K-b and Keerthi S S (2005) Which Is the Best Multiclass SVMMethod An Empirical Study Multiple Classifier Systems 3541278ndash285

[19] Fan R-e Chen P-h and Lin C-j (2005) Working Set Selection UsingSecond Order Information for Training Support Vector Machines Journal ofMachine Learning Research 61889ndash1918

[20] Feiguina O and Hirst G (2007) Authorship attribution for small textsLiterary and forensic experiments In Proceedings of the 30th InternationalConference of the Special Interest Group on Information Retrieval Work-shop on Plagiarism Analysis Authorship Identification and Near-DuplicateDetection (SIGIR) pages 3ndash6

[21] FERC (2012) Information released in enron investigation Re-trieved from httpwwwfercgovindustrieselectricindus-act

wecenroninfo-releaseasp

[22] Forman G (2003) An Extensive Empirical Study of Feature SelectionMetrics for Text Classification Journal of Machine Learning Research 3(7-8)1289ndash1305

[23] Friedman C and Sideli R (1992) Tolerating spelling errors during pa-tient validation Computers and biomedical research an international journal25(5)486ndash509

51

[24] Gabrilovich E and Markovitch S (2004) Text categorization with manyredundant features Using aggressive feature selection to make SVMs compet-itive with C45 In Proceedings of the twenty-first international conference onMachine learning ACM International Conference Proceeding Series page 41Banff Canada ACM

[25] Gamon M (2004) Linguistic correlates of style In Proceedings of the 20thinternational conference on Computational Linguistics - COLING rsquo04 pages611ndashes Morristown NJ USA Association for Computational Linguistics

[26] Honore A (1979) Some simple measures of richness of vocabulary Asso-ciation for Literary and Linguistic Computing Bulletin 7(2)172ndash177

[27] Hoover D L (2004) Testing Burrowsrsquos Delta Literary and LinguisticComputing 19(4)453ndash476

[28] Hsiung P Moore A Neill D and Schneider J (2005) Alias Detec-tion in Link Data Sets In Proceedings of the International Conference onIntelligence Analysis (2005) volume 4

[29] Iqbal F Binsalleeh H Fung B C and Debbabi M (2010) Miningwriteprints from anonymous e-mails for forensic investigation Digital Inves-tigation 7(1-2)56ndash64

[30] Iqbal F Binsalleeh H Fung B C and Debbabi M (2011) A unifieddata mining solution for authorship analysis in anonymous textual commu-nications Information Sciences

[31] Iqbal F Hadjidj R Fung B C and Debbabi M (2008) A novelapproach of mining write-prints for authorship attribution in e-mail forensicsDigital Investigation 5S42ndashS51

[32] Jaro M A (1995) Probabilistic linkage of large public health data filesStatistics in Medicine 14(5-7)491ndash498

[33] Jeh G (2002) SimRank a measure of structural-context similarity Pro-ceedings of the eighth ACM SIGKDD international conference on Knowledgediscovery and data mining pages 1ndash11

[34] Joachims T (1998) Text categorization with support vector machinesLearning with many relevant features Machine Learning 1398(23)137ndash142

[35] Johnson M A (2009) Svmnet 163 Retrieved from httpwww

matthewajohnsonorgsoftwaresvmhtml

[36] Kern R Seifert C Zechner M and Granitzer M (2011) Vote VetoMeta-Classifier for Authorship Identification In Notebook for PAN at CLEF2011

52

[37] Koppel M (2003) Exploiting stylistic idiosyncrasies for authorship attri-bution In Proceedings of the 2003 International Joint Conferences on Artifi-cial Intelligence Workshop on Computational Approaches to Style Analysisand Synthesis number 2000 pages 69ndash72 Acapulco Mexico

[38] Koppel M and Akiva N (2003) A corpus-independent feature set forstyle-based text categorization In Proceedings of IJCAIrsquo03 Workshop onComputational Approaches to Style Analysis and Synthesis Acapulco Mex-ico

[39] Koppel M Gan R and Messeri E (2006) Authorship Attribution withThousands of Candidate Authors In Proceedings of the 29th annual interna-tional ACM SIGIR conference on Research and Development in InformationRetrieval pages 659ndash660 Seattle Washington USA ACM

[40] Koppel M Schler J and Argamon S (2010) Authorship attribution inthe wild Language Resources and Evaluation 45(1)83ndash94

[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx K (2010) Scalability Issues in Authorship Attribution Schaal-baarheid bij Auteursherkenning PhD thesis Universiteit Antwerpen

[45] Luyckx K Daelemans W Hamilton A and Madison J (2008) Au-thorship Attribution and Verification with Many Authors and Limited DataIn Proceedings of the 22nd International Conference on Computational Lin-guistics COLING 08 (2008) number August pages 513ndash520 Association forComputational Linguistics

[46] Manning C D Raghavan P and Schutze H (2009) An Introduction toInformation Retrieval Number c Cambridge University Press CambridgeEngland 2009 edition

[47] Mendenhall T C (1887) The Characteristic Curves of Composition Sci-ence (New York NY) 9(214S)237ndash46

[48] Mendenhall T C (1901) A Mechanical Solution of a Literary problemPopular Science Monthly 60(2)97ndash105

53

[49] Miller (1995) WordNet A Lexical Database Communications of theACM 38(11)39ndash41

[50] Monge A E and Elkan C P (1996) The field matching problem Algo-rithms and applications In Evangelos S Jiawei H and Usama F editorsProceedings of the Second International Conference on Knowledge Discoveryand Data Mining number Slaven 1992 pages 267ndash270 Menlo Park Califor-nia The AAAI Press Menlo Park California

[51] Mosteller F and Wallace D L (1964) Inference and Disputed AuthorshipThe Federalist The David Hume Series of Philosophy and Cognitive ScienceReissues Addison-Wesley

[52] Navarro G (2001) A guided tour to approximate string matching ACMComputing Surveys 33(1)31ndash88

[53] Odell M and Russell R (1918) The soundex coding system US Patents1261167

[54] Page L Brin S Motwani R and Winograd T (1998) The PageRankCitation Ranking Bringing Order to the Web Technical report StanfordInfoLab

[55] Reuther P and Walter B (2006) Survey on test collections and techniquesfor personal name matching Int J Metadata Semant Ontologies 1(2)89ndash99

[56] Rifkin R and Klautau A (2004) In Defense of One-Vs-All ClassificationJournal of Machine Learning Research 5101ndash141

[57] Sanderson C and Guenter S (2006) Short Text Authorship Attributionvia Sequence Kernels Markov Chains and Author Unmasking An Inves-tigation In Proceedings of the 2006 Conference on Empirical Methods inNatural Language Processing number July pages 482ndash491 Association forComputational Linguistics

[58] Shetty J and Adibi J (2004) The Enron Email Dataset DatabaseSchema and Brief Statistical Report Technical report Information SciencesInstitute

[59] Sichel H S (1986) Word Frequency Distributions and Type-Token Char-acteristics The Mathematical Scientist 1145ndash72

[60] Simeon M and Hilderman R (2009) An Empirical Study of CategorySkew on Feature Selection for Text Categorization In Proceedings of the22nd Canadian Conference on Artificial Intelligence Advances in ArtificialIntelligence pages 249ndash252 Springer-Verlag Berlin

[61] Small H (1973) Co-citation in the Scientific Literature A New Measure ofthe Relationship Between Two Documents Journal of the American Societyfor Information Science pages 265ndash269

54

[62] Solorio T and Pillay S (2011) Authorship Identification with ModalitySpecific Meta Features In Notebook for PAN at CLEF 2011

[63] Stamatatos E (2009) A survey of modern authorship attribution methodsJournal of the American Society for Information Science and Technology60(3)538ndash556

[64] Tanguy L Urieli A Calderone B Hathout N and Sajous F (2011)A multitude of linguistically-rich features for authorship attribution In Note-book for PAN at CLEF 2011

[65] Tearle M Taylor K and Demuth H (2008) An algorithm for auto-mated authorship attribution using neural networks Literary and LinguisticComputing 23(4)425ndash442

[66] Tsuboi Y and Matsumoto Y (2002) Authorship identification for Het-eregeneous Documents IPSJ SIG Notes pages 17ndash24

[67] Winkler W E (1999) The State of Record Linkage and Current ResearchProblems Statistical Research Division US Census Bureau pages 1ndash15

[68] Yang H (2003) Margin Variations in Support Vector Regression for theStock Market Prediction PhD thesis The Chinese University of Hong Kong

[69] Yule G U (1944) The statistical study of literary vocabulary CambridgeUniversity Press

[70] Zhao Y and Zobel J (2005) Effective and Scalable Authorship Attribu-tion Using Function Words Information Retrieval Technology Proceedings3689174ndash189

[71] Zheng R Li J Chen H and Huang Z (2005) A Framework forAuthorship Identification of Online Messages Writing-Style Features andJournal of the American Society for Information Science 57(3)378ndash393

55

Appendix

List of function words used in authorship SVM

aaboutaboveafterallalthoughamamonganandanotheranyanybodyanyoneanythingarearoundasatbebecausebeforebehindbelowbesidebetween

bothbutbycancosdodowneacheitherenougheveryeverybodyeveryoneeverythingfewfollowingforfromhaveheherhimiifinincluding

insideintoisititslatterlesslikelittlelotsmanymemoremostmuchmustmynearneedneithernonobodynonenornothingof

offononceoneontooppositeorouroutsideoverownpastperplentyplusregardingsameseveralsheshouldsincesosomesomebodysomeonesomething

suchthanthatthetheirthemthesetheythisthosethoughthroughtilltotowardtowardsunderunlessunlikeuntilupuponususedviawe

whatwhateverwhenwherewhetherwhichwhilewhowhoeverwhomwhosewillwithwithinwithoutworthwouldyesyouyour

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 30: Thesis Freek Maes - Final Version

correct alias false alias

retrieved true positives (tp) false positives (fp)not retrieved false negatives (fn) true negatives (tn)

Table 22 The contingency table that is used for performance measurements

25 Evaluation measures

In order to correctly evaluate and compare different techniques it is essentialto use good evaluation measures In authorship attribution and alias resolutionit is common to construct a contingency table such as the one that can be seenin table 22 Based on this contingency table several evaluation measures canbe derived

A commonly used measure for evaluating the performance of machine learn-ing systems is accuracy Accuracy is defined as the percentage of classificationsthat are correct

Accuracy =correct classification

total number of classifications=

tp+ tn

tp+ fp+ fn+ tn(219)

Although it looks like a good measure of performance it is not that hard toobtain high accuracy in an authorship attribution or alias resolution systemSince the classes are highly skewed towards the negative class a classifier canattain high accuracy simply by classifying all the examples as negatives Inmost of the cases this will be correct since most candidate authors are in factnot aliases of the author under investigation

Therefore three other commonly used measures have been adopted in thisthesis Precision (P) measures the proportion of retrieved aliases that are ac-tually correct This can also be defined as

P =| retrieved aliases cap correct aliases |

| retrieved aliases |=

tp

tp+ fn(220)

Recall (R) measures the proportion of correct aliases that have been retrievedThis can be defined as

R =| retrieved aliases cap correct aliases |

| total correct aliases |=

tp

tp+ fn(221)

These two measures are not as dependent on the class distributions as theaccuracy measure Therefore they are a more sensible choice to use in thissituation Moreover by having these two measures of performance it is possibleto trade off one for the other since precision and recall are highly interdependentFor example a user might prefer to retrieve more potential aliases with a lowerprecision if he is going to manually evaluated them anyway On the other handa user that wants to automate the complete process and be able to rely greatlyon the classification given by the system will favor precision over recall Since the

26

preference for precision or recall is highly dependent on the userrsquos preferencesa single measure is often used for evaluating systems The F-measure is theweighted harmonic mean of precision and recall defined as

F =1

α 1P + (1minus α) 1

R

(222)

Often the important of precision and recall is balanced by choosing α = 05This results in the so-called F1-measure which can now simply be written as

F1 = 2 middot precision middot recall

precision + recall(223)

The reason why a harmonic mean is taken instead of an arithmetic mean (aver-age) is that it is always possible to achieve an arithmetic mean of at least 50simply by classifying all instances as positive Since all the correct aliases areretrieved the recall will be 100 and the arithmetic mean will be at least 50The harmonic mean is more suitable because it will be closer to the minimumthan to the arithmetic mean of precision and recall when the two values differgreatly [46]

Averaging the precision and recall scores for different test runs can be donein two different ways one is micro-averaging where a contingency table for allthe problems together is constructed and global precision and recall is calculatedbased on this table The other is macro-averaging where precision and recall isfirst calculated for each problem after which a simple arithmetic mean is takento determine the global precision and recall Macro-averaging gives equal weightto each class whereas micro-averaging gives equal weight to each documentSince micro-averaging tends to favor large classes over small classes [46] andgive more importance to accuracy on authors with many test documents macro-averaging is used in this thesis in order to get a good view of effectiveness onthe smaller classes

26 Conclusion

In this chapter several approaches to authorship disambiguation and alias reso-lution have been discussed In particular a distinction has been made betweenstring metrics authorship attribution techniques and techniques from link anal-ysis Finally several approaches to combining multiple techniques as well asdifferent evaluation measures have been discussed in the last sections

The techniques that have been chosen for evaluation in the experiments areas follows

bull Christen2006a found that when dealing with surnames the Jaro similaritymetric performed best out of 27 techniques Cohen and Fienberg [13]evaluated different string metrics on different data sets and found thatthe Monge-Elkan distance performed best However they conclude thatthe Jaro-Winkler metric performed almost as well as the Monge-Elkan

27

distance but is an order of magnitude faster Therefore the Jaro-Winklermetric has been used in the experiments to follow

bull Considering the authorship attribution techniques SVMrsquos are the onesmost used and most reliable for authorship attribution (see eg [34 161 71 45 2] Moreover they are not sensitive to the problem of over-fitting and perform automated feature selection Therefore SVM hasbeen chosen as a classifier for the authorship attribution approach

bull Concerning the graph analysis techniques the Connected Path approachhas been adopted since it outperformed other link analysis techniqueswhen applied to several data sets [6] Moreover the Jaccard techniquehas been adopted since it performed well in the same research but isconsiderably less complex in nature

bull The measures precision and recall are used to evaluate the techniquesMoreover precision and recall values are combined into a single F1-measurein order to aid the comparison of different techniques to each other

28

Chapter 3

Methods

This chapter explains the different approaches that have been taken to tacklethe problems defined in the research questions as well as the corpus that hasbeen used for evaluation Design choices that have been made as well as impor-tant implementation details will be elaborated upon This chapter will startwith an introduction of the corpus that has been used and an explanation ofthe preprocessing that has been applied to it The individual techniques thathave been implemented will be discussed in section 32 whereas the differentcombinations of techniques are dealt with in section 33

31 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personalcomputers of employees of the American energy company ENRON On Decem-ber 2nd the company went bankrupt on account of one of the largest and mostcomplex cases of accounting fraud in US history The emails generated by158 employees were seized by the Federal Energy Regulatory Commission [21]during its investigation which started in 2002 The ENRON data set is the onlylarge real-life email data set that is available for research The informationthat is recorded for a single email message can be seen in figure 31 Since thesender receiver and body of each email message is recorded the data set isespecially well suited for alias resolution techniques that use information fromdifferent domains Although the original data set contains emails as well asattachments only the emails have been used in this thesis Many attachmentsdo not contain written text and for the ones that do it cannot be verified whothe actual author of the text is For the body of the email messages it can beassumed that the sender of the email has written it except for the forward andreply-parts Concerning the text that is stored in attachments this assumptioncannot be made Therefore the inclusion of these attachment would create toomuch noise in the data set especially for the authorship attribution techniquesthat use the content of the email

29

SSN requirement

From monikacaushollienroncom

To flemingenroncom rhondaflemingtxdpsstatetxus

Sent 12122000 at 1608

Thank you very much We will give it a try

Message id 293

From monikacaushollienroncom

To flemingenroncom rhondaflemingtxdpsstatetxus

Subject SSN requirement

Sent date 12122000

Body Thank you very much We will give it a try

Attachment false

Figure 31 An example of a single email message in the ENRON data setand the information that is extracted from it

The most well known version of the data set has been made public by Co-hen [12] in 2004 after some integrity issues had been dealt with The Cohen-version of the data set contains roughly 500000 email messages from 151 Enronemployees Shetty and Adibi [58] analyzed the corpusrsquo appropriateness for re-search and applied several preprocessing steps to the data Messages stored inone userrsquos in-box and in another userrsquos out-box were considered duplicates andhave been removed The same goes for messages that were created by the com-puter by organizing and storing messages into folders such as rdquoall documentsrdquoEmpty messages system messages and messages that contained only forwardsor junk data were removed Invalid email addresses were changed to the formatrdquonoaddressenroncomrdquo and undisclosed recipients were changed to the formatrdquoundisclosed-recipientsenroncomrdquo Finally the folder-based representation ofthe data set was converted to several tables in a MySQL database The cleanedversion of the data set contains 252758 emails by 151 different employees

The corpus that was made available by Shetty amp Adibi still contained somenoise as well as information not useful for this thesis Therefore after convertingit to a Microsoft SQL-database the following preprocessing steps have beenapplied to it

1 A number of system messages and junk messages that were still present inthe data were removed Among these messages where calendar remindersand messages that only contained attachments

30

Step Records affected Percentage removed (cum)

1 17052 6703 13681 12004 26223 22505 4001 24006 25990 34007 3700 35808 52163 5650

Table 31 Preprocessing steps applied to the ENRON corpus

2 Forward and reply-parts of messages have been removed

3. Empty messages resulting from the removal of forward or reply-parts in step 2 were removed.

4. Messages that contained ≤ 10 words were removed, since they contained too little useful information.

5. Authors that had written a total number of words ≤ 100 were removed, for the same reason as in step 4.

6. Messages that had the same sender, receiver, body, send date and subject were considered duplicates, and only one copy was retained.

Table 3.1 provides an overview of the number as well as the cumulative percentage of records that have been removed per step.

In addition to the above preprocessing steps, an experiment was conducted to determine the number of training instances needed for classification using Support Vector Machines. The results, which can be found in Figure 3.2, showed that the highest cross-validation accuracy was achieved by using 80 emails per author. Therefore, authors that had sent a total number of emails ≤ 80 were removed from the data set. In order to preserve balance in the data set, authors with more than 600 emails in total were also removed. According to Burrows [9], 10000 words per author is a reliable minimum for authorship attribution, whereas Sanderson and Guenter [57] mention a minimum of 5000 words per author. In the final data set the average number of words per email equals 209, and with at least 80 emails per author it is ensured that each author has a reliable number of words to train on. After preprocessing, the data set consisted of 44912 emails by 246 different authors. For each message the sender, receiver, subject, body and send date have been stored.

In order to get a better view of the data set that is being used, several statistics have been calculated. Figure 3.3 shows the distribution of email messages per author: the x-axis represents the number of email messages that have been sent, whereas the y-axis represents the number of authors. Figure 3.4 provides an overview of the total number of words per author.


[Plot of 10-fold cross-validation accuracy (y-axis, 0.5-1.0) against the number of training instances per class (x-axis, 20-200), for linear and RBF kernels.]

Figure 3.2: Averages of 10 × 10-fold cross-validation using different training set sizes and kernels for the authorship SVM.

The x-axis of that figure represents the total number of words that one author has written, whereas the y-axis represents the number of authors. It can be seen that by far the largest number of authors has written a total number of words between 10000 and 100000.

Next to these statistical measures, a network graph has been created of all the authors in the final data set; Figure 3.5 shows this graph. Only the senders in the network are shown, since the number of receivers runs into the thousands. The color of a node represents the degree of that node, i.e. the number of in-going and out-going links; it corresponds to the number of messages that this author has sent and received. It can be concluded from this graph that the authors in the data set are highly connected: the average degree is as high as 216. Moreover, nodes that appear close to each other in the graph have more connections between them, meaning that they have had regular email contact.

Since there was no data to verify whether the ENRON data set actually contained any real aliases, and therefore no means to measure precision and recall, several artificial aliases have been created. In order to create these aliases, authors with a total number of emails ≥ 200 were selected from the data set and their emails were split up into several aliases. To be more precise, messages from and to the original author were randomly assigned to one out of a number of aliases for that author. Three different categories of authors have thus been created, namely:

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.com.A & john.doe@enron.com.B)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin_laden & abu_abdallah)

• Authors without an alias


[Histogram of the number of authors (y-axis) against the number of emails sent (x-axis, 90-230).]

Figure 3.3: The distribution of email messages per author.

[Histogram of the number of authors (y-axis) against the total number of words written (x-axis, logarithmic scale).]

Figure 3.4: The distribution of the total number of words per author.


[Network graph rendering omitted; node labels are abbreviated author addresses.]

Figure 3.5: A network graph of the authors in the subset of the ENRON data set.


Type of Alias                        Number of authors
High Jaro-Winkler with 1 alias       26
High Jaro-Winkler with 2 aliases     15
Low Jaro-Winkler with 1 alias        11
Low Jaro-Winkler with 2 aliases      1
No alias                             193

Table 3.2: Artificial aliases in the ENRON data set, by type.

Test set            Mixed   Hard
High Jaro-Winkler   6       2
Low Jaro-Winkler    8       16
No alias            6       2

Table 3.3: Distribution of alias types in the two different test sets.


The distribution of authors and aliases in the final data set can be seen in more detail in Table 3.2. The number of authors, including aliases, in the final data set equaled 315.
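The construction of these artificial aliases can be sketched as follows. This is a minimal illustration assuming a simple message representation; the field names and the alias-naming scheme are not taken from the thesis:

```python
import random

def split_author_into_aliases(messages, author, n_aliases, seed=42):
    """Replace every occurrence of `author` in the corpus by one of n_aliases
    artificial identities, chosen at random per message (illustrative scheme)."""
    rng = random.Random(seed)
    aliases = ["%s.%d" % (author, k) for k in range(1, n_aliases + 1)]
    for msg in messages:
        alias = rng.choice(aliases)          # one alias drawn per message
        if msg["sender"] == author:
            msg["sender"] = alias
        msg["recipients"] = [alias if r == author else r
                             for r in msg["recipients"]]
    return aliases
```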

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in Table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of Table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
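For reference, a minimal self-contained implementation of the Jaro-Winkler metric, as applied here to pairs of email addresses, could look as follows (the function names are illustrative):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        return 0.0
    window = max(0, max(n1, n2) // 2 - 1)       # matching window around each position
    match1, match2 = [False] * n1, [False] * n2
    matches = 0
    for i in range(n1):
        for j in range(max(0, i - window), min(n2, i + window + 1)):
            if not match2[j] and s1[i] == s2[j]:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0                    # compare matched characters in order
    for i in range(n1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / n1 + matches / n2 + (matches - t) / matches) / 3.0

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Winkler modification: boost pairs sharing a common prefix (up to 4 chars)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * p * (1.0 - score)

# A candidate pair is flagged as an alias when its score clears the threshold:
# jaro_winkler("john.doe@enron.com", "jon.doe@enron.com") >= 0.94
```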

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was normalized as follows:

ConnectedPath_norm(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max          (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
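A sketch of this neighborhood comparison, assuming the link network is stored as a mapping from each author to the set of direct correspondents:

```python
def jaccard_similarity(network: dict, a: str, b: str) -> float:
    """Jaccard similarity of the direct neighborhoods of authors a and b:
    |N(a) ∩ N(b)| / |N(a) ∪ N(b)|."""
    na, nb = network.get(a, set()), network.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0
```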

The last individual technique that has been evaluated is the use of SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Hirst and Feiguina [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in Table 3.4. The list of function words that has been used in the feature set can be found in the appendix.
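To illustrate, a few of the lexical features of Table 3.4 can be computed as in the sketch below; the tokenization and the chosen feature subset are simplifying assumptions:

```python
import re

def lexical_features(text: str) -> dict:
    """A small illustrative subset of the lexical features of Table 3.4."""
    chars = len(text)                            # feature 1: total characters (C)
    words = re.findall(r"[A-Za-z']+", text)      # simple word tokenizer (assumption)
    m = len(words)                               # feature 54: total words (M)
    freq = {}
    for w in words:
        freq[w.lower()] = freq.get(w.lower(), 0) + 1
    return {
        "upper_ratio": sum(c.isupper() for c in text) / chars if chars else 0.0,
        "digit_ratio": sum(c.isdigit() for c in text) / chars if chars else 0.0,
        "short_word_ratio": sum(len(w) < 4 for w in words) / m if m else 0.0,
        "avg_word_len": sum(len(w) for w in words) / m if m else 0.0,
        "hapax_legomena": sum(v == 1 for v in freq.values()),
        "hapax_dislegomena": sum(v == 2 for v in freq.values()),
    }
```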

Based on the experiment shown in Figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ.


Features   Description

Lexical
1          Total number of characters (C)
2          Total number of alphabetic characters / C
3          Total number of upper-case characters / C
4          Total number of digit characters / C
5          Total number of white-space characters / C
6          Total number of tab spaces / C
7-32       Frequency of letters A-Z
33-53      Frequency of special characters: ~ @ # $ % ^ & * - _ = + > < [ ] { } / \ |
54         Total number of words (M)
55         Total number of short words (less than four characters) / M
56         Total number of characters in words / C
57         Average word length
58         Average sentence length (in characters)
59         Average sentence length (in words)
60         Total number of different words / M
61         Hapax legomena: frequency of once-occurring words
62         Hapax dislegomena: frequency of twice-occurring words
63-82      Word length frequency distribution / M
83-333     TF-IDF of 250 most frequent 3-grams

Syntactic
334-341    Frequency of punctuation marks: , . ? ! : ; ' "
342-491    Frequency of function words

Structural
492        Total number of sentences

Table 3.4: Feature set for the authorship SVM.


Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.
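A sketch of this parameter search using scikit-learn follows; the thesis itself used SVM.NET, so the library and variable names here are assumptions:

```python
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

param_grid = {
    "C":     [2.0**e for e in range(-5, 16, 2)],   # 2^-5, 2^-3, ..., 2^15
    "gamma": [2.0**e for e in range(-15, 4, 2)],   # 2^-15, 2^-13, ..., 2^3
}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)  # 5 x 5-fold CV
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="accuracy")
# X: one feature vector (Table 3.4) per email; y: binary author labels.
# search.fit(X, y); best_C, best_gamma = search.best_params_["C"], search.best_params_["gamma"]
```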

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM classifies whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is chosen. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.
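The one-versus-all scheme can be sketched as follows (illustrative names; the feature extraction is assumed to happen elsewhere):

```python
import random
from sklearn.svm import SVC

def train_one_versus_all(emails_by_author):
    """Train one binary RBF-SVM per author: the author's own emails form the
    positive class, an equal number of emails drawn at random from all other
    authors form the negative class."""
    models = {}
    for author, X_pos in emails_by_author.items():
        pool = [x for a, xs in emails_by_author.items() if a != author for x in xs]
        X_neg = random.sample(pool, len(X_pos))
        model = SVC(kernel="rbf", probability=True)
        model.fit(X_pos + X_neg, [1] * len(X_pos) + [0] * len(X_neg))
        models[author] = model
    return models

def most_likely_author(models, x):
    """Attribute a feature vector to the author whose SVM gives it the highest probability."""
    return max(models, key=lambda a: models[a].predict_proba([x])[0][1])
```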

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal number of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples and an equal number of negative emails is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in Figure 3.6.


Figure 3.6: The structure of the combined approach.


In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.
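A minimal sketch of this voting SVM, with one row of technique scores per candidate author-alias pair (the names and the exact ordering of the scores are assumptions):

```python
from sklearn.svm import SVC

def train_voting_svm(score_rows, labels):
    """score_rows: one [jaro_winkler, link_similarity, authorship_svm_prob]
    vector per manually labeled candidate pair; labels: 1 = alias, 0 = not."""
    voter = SVC(kernel="rbf", probability=True)
    voter.fit(score_rows, labels)
    return voter

def is_alias(voter, jw_score, link_score, svm_prob, threshold=0.5):
    """Compare the voter's probability output against a decision threshold,
    which is swept over [0, 1] in the evaluation of Chapter 4."""
    return voter.predict_proba([[jw_score, link_score, svm_prob]])[0][1] >= threshold
```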

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Table 3.3 are used to determine the precision and recall for various decision thresholds.
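The threshold sweep used in this evaluation can be sketched as follows, where scores are technique outputs in [0, 1] and labels mark the true aliases:

```python
def precision_recall_sweep(scores, labels):
    """Precision, recall and F1 at decision thresholds 0.0, 0.05, ..., 1.0."""
    results = []
    for step in range(21):
        t = step / 20.0
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
        fn = sum(1 for s, l in zip(scores, labels) if s < t and l)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results.append((t, precision, recall, f1))
    return results
```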


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on a decision threshold ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Four plots of precision, recall and F1 against decision threshold: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set.

[Two plots of precision, recall and F1 against decision threshold: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set.


[Four plots of precision, recall and F1 against decision threshold: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set.

[Two plots of precision, recall and F1 against decision threshold: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set.


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler. The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path. It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard. Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM. The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques. The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. Because of the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results on both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase relative to the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be obtained for this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. Intelligent Systems, IEEE, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron/.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval, Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611–617, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis (2005), volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence, Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING '08, pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–246.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. U.S. Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. Int. J. Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, U.S. Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


bull Authors without an alias

The distribution of authors and aliases in the final data set can be seen in moredetail in table 32The number of authors including aliases in the final data setequaled 315

Test sets

In order to evaluate the results of the different techniques two different test setshave been created The first test set called the mixed test set has a fairly equaldivision of alias types as can be seen in table 33 The second test set calledthe hard test set is substantially more difficult since the majority of the aliasesare not easy to recognize by their email addresses The authors in each test setwere chosen at random from their respective alias categories

32 Individual Techniques

The first technique whose performance has been evaluated on the ENRON sub-set is the Jaro-Winkler similarity For each author in the test set the Jaro-Winkler similarity of that authorrsquos email address to that of each other authorhas been calculated If the Jaro-Winkler score of a particular author-authorpair is above a certain threshold the two authors are considered to be aliases

35

The precision and recall for different decision thresholds has been measured us-ing the test sets of table 33 This technique will hereafter be referred to asrdquoJaro-Winklerrdquo or rdquoJWrdquo

The second technique that has been evaluated is the Connected Path methodhereafter referred to as rdquoConnected Pathrdquo or rdquoCPrdquo For each author in the testset a Connected Path search to depth 3 has been performed In order to ensurethat the scores were in the range of [0 1] the score for a particular author-authorpair was calculated as follows

ConnectedPath(vi vj) =ConnectedPath(vi vj)

ConnectedPathmax(31)

where ConnectedPathmax is the maximum similarity scores found for any twoauthors in the data set

The third technique that has been tested is Jaccard similarity hereafterreferred to as rdquoJaccardrdquo For each author in the test set the Jaccard similaritybetween that authors neighbors and each other authors neighbors has beencalculated Note that authors that have been removed from the data set duringthe preprocessing steps described in section 31 do not occur in the neighborhoodof their correspondents anymore and do not contribute towards the Jaccardsimilarity score The same principle applies to the Connected Path score

The last individual technique that has been evaluated is the use of SVMon email content hereafter referred to as rdquoauthorship SVMrdquo or simply rdquoSVMrdquoThe first decision that had to be made was whether to treat the problem froman instance-based or profile-based perspective Since it is possible that authorsemploy different writing styles when writing to different contacts it is importantto retain the differences in each email message For example an author can usedifferent words when writing to friends instead of colleagues or he can uselonger sentences when writing to superiors instead of writing to subordinatesMoreover Hirst and Feiguina [20] conclude that using multiple short texts forauthorship attribution overcomes the problem of not having sufficiently longtraining texts available Therefore an instance-based approach to authorshipattribution has been adopted A combination of lexical syntactic and structuralfeatures has been adapted from [71] and extended with a number of additionalfeatures to create a larger overall feature variance The complete feature setthat has been used in the authorship SVMrsquos can be found in table 34 Thelist of function words that has been used in the feature set can be found in theappendix

Based on the experiment shown in figure 32 it was determined that theauthorship SVMrsquos should be trained using a Radial Basis Function-kernel sinceits overall performance was better than that of the linear kernel The parameterC influences the penalty associated with classification errors whereas γ controlsthe shape of the separating hyper plane In order to find optimal values of Cand γ a straightforward grid-search has been performed using exponentiallygrowing sequences of C and γ Specifically the accuracy all combinations ofC = 2minus5 2minus3 2minus1 215 and γ = 2minus15 2minus13 2minus11 23 is calculated using

36

Features Description

Lexical1 Total number of characters (C)2 Total number of alphabetic characters C3 Total number of upper-case characters C4 Total number of digit characters C5 Total number of white-space characters C6 Total number of tab spaces C

7-32 Frequency of letters A-Z33-53 Frequency of special characters ~$^amp-_=+gtlt[]|

54 Total number of words (M)55 Total number of short words M less than four characters56 Total number of characters in words C57 Average word length58 Average sentence length (in characters)59 Average sentence length (in words)60 Total different words M61 Hapax legomena Frequency of once-occurring words62 Hapax dislegomena Frequency of twice-occurring words

63-82 Word length frequency distribution M83-333 TFIDF of 250 most frequent 3-grams

Syntactic334-341 Frequency of punctuation rsquo rdquo342-491 Frequency of function words

Structural492 Total number of sentences

Table 34 Feature set for the authorship SVM

37

5 times 5-fold cross validation for each authorship SVM The highest scoringcombination of parameters is then chosen to train the actual SVM model

The authorship attribution problem is a multi-class problem because a giventext has to be attributed to one out of multiple candidate authors Since ordi-nary SVMrsquos can only solve binary classification problems a multi-class approachusing one-versus-all classification has been adopted In the one-versus-all ap-proach a single authorship SVM is trained on positive training instances fromone author and negative instances from all the other authors That is theauthorship SVM makes a classification whether a given text has been writtenby one particular author or not Once a separate SVM has been trained forevery author a given text can be classified by letting each authorship SVMassign a probability to the text being written by that author Rifkin and Klau-tau [56] show that as long as a good binary classifier is used it makes littledifference which multi-class scheme is used Therefore a simple scheme such asone-versus-all is preferable over more complex schemes such as error-correctingcodes

Since SVM is sensitive to class imbalances the authorship SVMrsquos are trainedusing an equal amount of positive and negative training instances In order tomake sure that the negative class is a fairly accurate representation of all theother authors emails have been selected at random from other authors For eachauthor all the authorrsquos emails are selected as positive examples and an equalamount of negative emails are used for the negative class The software thathas been used for the authorship SVMrsquos is called SVMNET [35] It is a cleanC-conversion by Matthew A Johnson of the popular LibSVM software suite[10] SVMNET uses the Sequential Minimal Optimization-algorithm describedin Fan et al [19] and is able to handle classification regression and distribu-tion estimation for single and multi-class problems using different kernels andparameters

33 Combinations of Techniques

In order to test whether a combination of techniques that operate on differentdomains will perform better than these techniques individually two differentcombinations of techniques have been tested

bull JW-CP-SVM Jaro-Winkler similarity of email addresses Connected Pathsimilarity of the link network (depth = 3) amp authorship SVM on emailcontent

bull JW-Jaccard-SVM Jaro-Winkler similarity of email addresses Jaccardsimilarity of direct neighbors in the link network amp authorship SVM onemail content

Each combination of techniques is realized by using a voting algorithm thatis based on a Support Vector Machine which will specifically be referred toas the rdquovoting SVMrdquo The voting SVM takes as input a vector containing the

38

Figure 36 The structure of the combined approach

results of the three techniques for a single candidate author and gives as outputa prediction whether the input author is an alias or not An overview of thecombined approach can be found in figure 36

In order to create the voting SVM model a set of training instances isrequired The training set for the voting SVM has the same distribution of aliastypes as the test set depending on which test set is being used Obviouslythe authors that are used to test on are not used for training the voting SVMFor each of the 20 authors in the voting SVM training set candidate aliasesare manually labeled as positive or negative All the positively labeled aliasestogether are used as positive training instances for the voting SVM and 5 timesthe number of positive instances are randomly selected from the negativelylabeled aliases as negative training examples The reason for this class imbalanceis that the number of positive instances in the training sets is rather low 14 and18 for the mixed and hard sets respectively In order to have enough traininginstances available for the voting svm more negative examples are chosen fromthe training set

After two voting SVMrsquos have been trained one using Jaccard similarity andone using Connected Path similarity the test sets from figure 33 are used todetermine the precision and recall for various decision thresholds

39

Chapter 4

Results

The results that have been obtained from the various experiments will be dis-cussed in this section First the results of the individual and combined tech-niques on the mixed test set will be given Second the results of the individualand combined techniques on the hard test set will be given Finally an overviewof the best results achieved by each individual and combined techniques will begiven

Figures 41a to 41d show the precision and recall scores achieved on themixed test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively The results are based on the following decisionthresholds 00 005 010 10 Jaro-Winkler achieves the best F1-scoreof 080 at a decision threshold of 094 and 096 Connected Path achieves amaximum F1-score of 048 on a decision threshold ranging from 08 minus 096Jaccardrsquos best F1-score of 069 occurs at a decision threshold of 058 Finallyauthorship SVM achieves a maximum F1-score of 079 for a decision thresholdof 060

Figures 42a and 42b show the results achieved by the two combinations oftechniques JW-CP-SVM and JW-Jaccard-SVM on the mixed test set JW-CP-SVM achieves its best F1-score of 079 at a threshold of 074 JW-Jaccard-SVMachieves the best result of all the techniques on this test set namely an F1-scoreof 088 using a threshold of 078

Figures 43a to 43d show the precision and recall scores achieved on thehard test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively Again the results are based on the followingdecision thresholds 00 005 010 10 Jaro-Winkler achieves its highestF1-score of 028 at a decision threshold of 088 The best F1-score for ConnectedPath is 053 using a decision threshold of 012 Jaccard achieves a maximum F1-score of 067 at a decision threshold of 038 Finally authorship SVM achievesa maximum F1-score of 076 at a decision threshold of 068

Figures 44a and 44b show the results achieved by the two combinationsof techniques JW-CP-SVM and JW-Jaccard-SVM on the hard test set JW-CP-SVM achieves its best F1-score of 065 at a threshold of 078 whereas JW-

40

Jaccard-SVM achieves the best results of all the techniques on this test setnamely an F1-score of 089 using a threshold of 092

In addition to these graphs the best F1-scores for all the techniques on eachtest set are summarized in figs 45 and 46 The precision and recall valuesthat are shown correspond to the best F1-scores achieved It can be concludedthat the best results on both the mixed and the hard test set are achieved byJW-Jaccard-SVM

41

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 41 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the mixed test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 42 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the mixed test set

42

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 43 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the hard test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 44 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the hard test set

43

Figure 45 Best results on the mixed test set for different techniques Preci-sion and recall values correspond to the given F1-scores

44

Figure 46 Best results on the hard test set for different techniques Precisionand recall values correspond to the given F1-scores

45

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used forresolving aliases and disambiguate authors in email data Specifically it wasinvestigated whether a combination of techniques could perform better thanindividual techniques on these tasks The main results of this thesis can besummarized as follows

Jaro Winkler The Jaro-Winkler approach gave good results on the mixedtest set but failed on the hard test set The high F1-score on the mixed test setcan be explained by the fact that many of the artificial aliases had an extremelyhigh Jaro-Winkler similarity The hard test set more closely mimics a real-worldscenario where aliases do not look as much alike For example Boongoen et al[6] showed that in their data set of real terrorist names derived from web pages70 of the true aliases had a Jaro-Winkler similarity of less than 06 Sincethe hard test set also features more aliases with low Jaro-Winkler similaritythe performance on this test set is significantly lower However the results stillshow that using a simple string metric can detect many aliases resulting fromspelling errors or the use of different email addresses for work home etc

Connected Path It can be concluded that the Connected Path algorithmfailed to achieve good results on both test sets because of three reasons Firstsince authors have been split up into aliases and some have been removed all to-gether the link networkrsquos structure might have been corrupted This especiallyaffects link analysis that goes beyond the analysis of direct neighbors since ittakes into account more complicated link connections Second because of timeconstraints the link network search has been performed to depth 3 which meansthat only the information contained in paths of length 2 and 3 have been usedin the calculation of the similarity score Boongoen et al [6] achieved betteraccuracy by searching to depth 4 compared with a search to depth 2 It isexpected that the same behavior of Connected Path can be observed on thisdata set if the search would have been performed to a greater depth Third theConnected Path method can only return similarity scores for authors that areconnected to the original author If there was no Connected Path score returnedfor a particular author-alias pair the alias had to be counted as a false negative

46

thereby decreasing the overall recallJaccard Using Jaccard similarity yielded better results than the Connected

Path algorithm Since Jaccard similarity only takes into account direct neigh-bors it is less affected by changes in the link network Moreover the Jaccardsimilarity can be calculated between any two authors in the data set which iswhy it scored better than the Connected Path method

Authorship SVM The use of authorship SVMrsquos gave good results overallespecially considering the fact that there are 314 candidate aliases for eachauthor and that the training texts are short Considering that Luyckx [44]reported scalability issues when using a multi-class SVM approach the one-versus-all approach that has been adopted in this thesis is very promising

Combined techniques The combination of JW-CP-SVM did not performvery well On the mixed test set it performed as good as authorship SVMor even Jaro-Winkler alone and for the hard test set it performed even worseBecause of aforementioned reasons the Connected Path method failed to achievegood results in general In combination with the low Jaro-Winkler performanceon the hard data set this resulted in the combination JW-CP-SVM failing toachieve reasonable results The best results for both test sets are achieved bythe combination of Jaro-Winkler Jaccard and authorship SVM On the hardtest set the increase in F1-score of this combination compared with the secondbest technique (SVM) is as high as 16 whereas on the mixed test set theincrease to the second-best technique (Jaro-Winkler) is 9

51 Conclusion

The results of the experiments confirm the hypothesis that a combination oftechniques can yield better results than using these techniques individuallyThe research questions that have been formulated to guide this research areanswered below

Which authorship disambiguation and alias resolution techniques ex-ist that can be used on email data

The literature review of section 2 has provided an extensive overview of thedifferent techniques that can be used to attribute authorship and resolve aliasesTechniques that operate on the domain of email addresses are able to resolvesuperficial aliases resulting from unintentional misspellings or simple variationsin naming conventions Authorship attribution techniques can predict the realauthor of a given set of email very well provided that there is enough trainingtext available for each author If this is the case dealing with a large authorset has also proven to be possible Link analysis techniques have low precisionand recall when used individually but can still manage to find aliases that othertechniques do not

47

How can techniques from different domains be combined

Section 24 has given an overview of approaches to combining techniques fromdifferent domains Previous literature on the subject is rare especially on com-bining techniques from more than two domains A common method is to usea linear weighted combination of the results of different techniques Since theweights are often manually constructed the results are not that good and theability of these techniques to generalize is low Better results are achieved whena classifier is trained to distinguish between good and bad combinations of re-sults

Can a combination of techniques from different domains increase per-formance when compared with individual techniques

The results of Chapter 4 show that a combination of techniques can indeedincrease precision and recall when compared with individual techniques Specif-ically a combination of Jaro-Winkler similarity on email addresses authorshipSVM on email content and Jaccard similarity of the link network using an SVMvoting algorithm achieves the best results when tested on a subset of the EN-RON data set The results of this combination of techniques also tend to bemore robust across different decision thresholds something that is useful whendetermining a proper threshold might be difficult It is important to note thatthe relative improvement in F1-score of the combined techniques over the in-dividual techniques is dependent on the number of low Jaro-Winkler aliases inthe test set Especially on the hard data set where aliases are more difficult torecognize the combination of techniques performs very well and achieves sig-nificantly higher F1-scores than the individual techniques This indicates thatthe different techniques are indeed complementary and can work together toachieve better results

To summarize it can be concluded that a combination of techniques thatoperate on different domains is more effective in disambiguating authors andresolving aliases than each of these techniques individually

52 Future Recommendations

The results and conclusion that have been put forward in the previous sectionsprovide good ground for future research It will be interesting to see how wellthe techniques that have been used in this thesis perform on a full data set withreal aliases which could not be found to use in this research Should such acollection not exist it is worthwhile to create one

The link analysis techniques that have been used in this paper only useinformation from the direct neighborhood of the authors Boongoen et al[6]have already shown that searching to a greater depth yields better results soit is useful to look at how the algorithm can be optimized to be less computa-tionally intensive in order to search to greater depths Moreover since the less

48

sophisticated Jaccard similarity surprisingly outperformed Connected Path itis worthwhile to experiment with different link analysis techniques

Finally the assumption has been made that the results from various tech-niques are independent of each other These assumptions have not been testedand it is not clear if and in what way various techniques affect each other Thereare a myriad of decisions to be made when implementing an alias resolution sys-tem that combines different techniques Therefore it is imperative that moreresearch will be done to examine the best choice of feature sets techniques andaggregation methods

49

Chapter 6

Bibliography

[1] Abbasi A and Chen H (2005) Applying authorship analysis to extremist-group Web forum messages Intelligent Systems IEEE 20(5)67ndash75

[2] Argamon S and Juola P (2011) Overview of the International Au-thorship Identification Competition at PAN-2011 In CLEF (Notebook Pa-persLabsWorkshop)

[3] Baroni M Matiasek J and Trost H (2002) Unsupervised discovery ofmorphologically related words based on orthographic and semantic similarityIn Proceedings of the ACL-02 workshop on Morphological and phonologicallearning-Volume 6 volume 6 pages 48ndash57 Morristown NJ USA Associationfor Computational Linguistics

[4] Binongo J N G (2003) Who Wrote the 15th Book of Oz An Applicationof Multivariate Analysis to Authorship Attribution Chance 16(2)9ndash17

[5] Boongoen T and Shen Q (2009) Semi-supervised OWA aggregation forlink-based similarity evaluation and alias detection 2009 IEEE InternationalConference on Fuzzy Systems pages 288ndash293

[6] Boongoen T Shen Q and Price C (2010) Disclosing false identitythrough hybrid link analysis Artificial Intelligence and Law 18(1)77ndash102

[7] Brunet E (1978) Le Vocabulaire de Jean Giraudoux structure et evolutionStatistique et informatique appliquees a letude des textes a partir des donneesdu Tresor de la langue francaise Le Vocabulaire des grands ecrivains francais1 Slatkine

[8] Burrows J (2002) rsquoDeltarsquo A measure of stylistic difference and a guide tolikely authorship Literary and Linguistic Computing 17(3)267ndash287

[9] Burrows J (2007) All the Way Through Testing for Authorship in Differ-ent Frequency Strata Literary and Linguistic Computing 22(1)27ndash47

50

[10] Chang C-c and Lin C-j (2011) LIBSVM A Library for Support VectorMachines ACM Transactions on Intelligent Systems and Technology 2(3)1ndash39

[11] Christen P (2006) A Comparison of Personal Name Matching Techniquesand Practical Issues In Sixth IEEE International Conference on Data Mining- Workshops (ICDMWrsquo06) number September pages 290ndash294 IEEE

[12] Cohen W W (2009) Enron email dataset Retrieved from httpwwwcscmuedu~enron

[13] Cohen W W and Fienberg S E (2003) A comparison of three stringmatching algorithms Methods 20(1)73ndash78

[14] Cortes C and Vapnik V (1995) Support-vector networks MachineLearning 20(3)273ndash297

[15] Crammer K and Singer Y (2002) On the Algorithmic Implementationof Multiclass Kernel-based Vector Machines Journal of Machine LearningResearch 2(2)265ndash292

[16] de Vel O (2000) Mining e-mail authorship Proc Workshop on TextMining ACM International

[17] de Vel O Anderson A Corney M and Mohay G (2001) Mining e-mailcontent for author identification forensics ACM SIGMOD Record 30(4)55

[18] Duan K-b and Keerthi S S (2005) Which Is the Best Multiclass SVMMethod An Empirical Study Multiple Classifier Systems 3541278ndash285

[19] Fan R-e Chen P-h and Lin C-j (2005) Working Set Selection UsingSecond Order Information for Training Support Vector Machines Journal ofMachine Learning Research 61889ndash1918

[20] Feiguina O and Hirst G (2007) Authorship attribution for small textsLiterary and forensic experiments In Proceedings of the 30th InternationalConference of the Special Interest Group on Information Retrieval Work-shop on Plagiarism Analysis Authorship Identification and Near-DuplicateDetection (SIGIR) pages 3ndash6

[21] FERC (2012) Information released in enron investigation Re-trieved from httpwwwfercgovindustrieselectricindus-act

wecenroninfo-releaseasp

[22] Forman G (2003) An Extensive Empirical Study of Feature SelectionMetrics for Text Classification Journal of Machine Learning Research 3(7-8)1289ndash1305

[23] Friedman C and Sideli R (1992) Tolerating spelling errors during pa-tient validation Computers and biomedical research an international journal25(5)486ndash509

51

[24] Gabrilovich E and Markovitch S (2004) Text categorization with manyredundant features Using aggressive feature selection to make SVMs compet-itive with C45 In Proceedings of the twenty-first international conference onMachine learning ACM International Conference Proceeding Series page 41Banff Canada ACM

[25] Gamon M (2004) Linguistic correlates of style In Proceedings of the 20thinternational conference on Computational Linguistics - COLING rsquo04 pages611ndashes Morristown NJ USA Association for Computational Linguistics

[26] Honore A (1979) Some simple measures of richness of vocabulary Asso-ciation for Literary and Linguistic Computing Bulletin 7(2)172ndash177

[27] Hoover D L (2004) Testing Burrowsrsquos Delta Literary and LinguisticComputing 19(4)453ndash476

[28] Hsiung P Moore A Neill D and Schneider J (2005) Alias Detec-tion in Link Data Sets In Proceedings of the International Conference onIntelligence Analysis (2005) volume 4

[29] Iqbal F Binsalleeh H Fung B C and Debbabi M (2010) Miningwriteprints from anonymous e-mails for forensic investigation Digital Inves-tigation 7(1-2)56ndash64

[30] Iqbal F Binsalleeh H Fung B C and Debbabi M (2011) A unifieddata mining solution for authorship analysis in anonymous textual commu-nications Information Sciences

[31] Iqbal F Hadjidj R Fung B C and Debbabi M (2008) A novelapproach of mining write-prints for authorship attribution in e-mail forensicsDigital Investigation 5S42ndashS51

[32] Jaro M A (1995) Probabilistic linkage of large public health data filesStatistics in Medicine 14(5-7)491ndash498

[33] Jeh G (2002) SimRank a measure of structural-context similarity Pro-ceedings of the eighth ACM SIGKDD international conference on Knowledgediscovery and data mining pages 1ndash11

[34] Joachims T (1998) Text categorization with support vector machinesLearning with many relevant features Machine Learning 1398(23)137ndash142

[35] Johnson M A (2009) Svmnet 163 Retrieved from httpwww

matthewajohnsonorgsoftwaresvmhtml

[36] Kern R Seifert C Zechner M and Granitzer M (2011) Vote VetoMeta-Classifier for Authorship Identification In Notebook for PAN at CLEF2011

52

[37] Koppel M (2003) Exploiting stylistic idiosyncrasies for authorship attri-bution In Proceedings of the 2003 International Joint Conferences on Artifi-cial Intelligence Workshop on Computational Approaches to Style Analysisand Synthesis number 2000 pages 69ndash72 Acapulco Mexico

[38] Koppel M and Akiva N (2003) A corpus-independent feature set forstyle-based text categorization In Proceedings of IJCAIrsquo03 Workshop onComputational Approaches to Style Analysis and Synthesis Acapulco Mex-ico

[39] Koppel M Gan R and Messeri E (2006) Authorship Attribution withThousands of Candidate Authors In Proceedings of the 29th annual interna-tional ACM SIGIR conference on Research and Development in InformationRetrieval pages 659ndash660 Seattle Washington USA ACM

[40] Koppel M Schler J and Argamon S (2010) Authorship attribution inthe wild Language Resources and Evaluation 45(1)83ndash94

[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx K (2010) Scalability Issues in Authorship Attribution Schaal-baarheid bij Auteursherkenning PhD thesis Universiteit Antwerpen

[45] Luyckx K Daelemans W Hamilton A and Madison J (2008) Au-thorship Attribution and Verification with Many Authors and Limited DataIn Proceedings of the 22nd International Conference on Computational Lin-guistics COLING 08 (2008) number August pages 513ndash520 Association forComputational Linguistics

[46] Manning C D Raghavan P and Schutze H (2009) An Introduction toInformation Retrieval Number c Cambridge University Press CambridgeEngland 2009 edition

[47] Mendenhall T C (1887) The Characteristic Curves of Composition Sci-ence (New York NY) 9(214S)237ndash46

[48] Mendenhall T C (1901) A Mechanical Solution of a Literary problemPopular Science Monthly 60(2)97ndash105

53

[49] Miller (1995) WordNet A Lexical Database Communications of theACM 38(11)39ndash41

[50] Monge A E and Elkan C P (1996) The field matching problem Algo-rithms and applications In Evangelos S Jiawei H and Usama F editorsProceedings of the Second International Conference on Knowledge Discoveryand Data Mining number Slaven 1992 pages 267ndash270 Menlo Park Califor-nia The AAAI Press Menlo Park California

[51] Mosteller F and Wallace D L (1964) Inference and Disputed AuthorshipThe Federalist The David Hume Series of Philosophy and Cognitive ScienceReissues Addison-Wesley

[52] Navarro G (2001) A guided tour to approximate string matching ACMComputing Surveys 33(1)31ndash88

[53] Odell M and Russell R (1918) The soundex coding system US Patents1261167

[54] Page L Brin S Motwani R and Winograd T (1998) The PageRankCitation Ranking Bringing Order to the Web Technical report StanfordInfoLab

[55] Reuther P and Walter B (2006) Survey on test collections and techniquesfor personal name matching Int J Metadata Semant Ontologies 1(2)89ndash99

[56] Rifkin R and Klautau A (2004) In Defense of One-Vs-All ClassificationJournal of Machine Learning Research 5101ndash141

[57] Sanderson C and Guenter S (2006) Short Text Authorship Attributionvia Sequence Kernels Markov Chains and Author Unmasking An Inves-tigation In Proceedings of the 2006 Conference on Empirical Methods inNatural Language Processing number July pages 482ndash491 Association forComputational Linguistics

[58] Shetty J and Adibi J (2004) The Enron Email Dataset DatabaseSchema and Brief Statistical Report Technical report Information SciencesInstitute

[59] Sichel H S (1986) Word Frequency Distributions and Type-Token Char-acteristics The Mathematical Scientist 1145ndash72

[60] Simeon M and Hilderman R (2009) An Empirical Study of CategorySkew on Feature Selection for Text Categorization In Proceedings of the22nd Canadian Conference on Artificial Intelligence Advances in ArtificialIntelligence pages 249ndash252 Springer-Verlag Berlin

[61] Small H (1973) Co-citation in the Scientific Literature A New Measure ofthe Relationship Between Two Documents Journal of the American Societyfor Information Science pages 265ndash269

54

[62] Solorio T and Pillay S (2011) Authorship Identification with ModalitySpecific Meta Features In Notebook for PAN at CLEF 2011

[63] Stamatatos E (2009) A survey of modern authorship attribution methodsJournal of the American Society for Information Science and Technology60(3)538ndash556

[64] Tanguy L Urieli A Calderone B Hathout N and Sajous F (2011)A multitude of linguistically-rich features for authorship attribution In Note-book for PAN at CLEF 2011

[65] Tearle M Taylor K and Demuth H (2008) An algorithm for auto-mated authorship attribution using neural networks Literary and LinguisticComputing 23(4)425ndash442

[66] Tsuboi Y and Matsumoto Y (2002) Authorship identification for Het-eregeneous Documents IPSJ SIG Notes pages 17ndash24

[67] Winkler W E (1999) The State of Record Linkage and Current ResearchProblems Statistical Research Division US Census Bureau pages 1ndash15

[68] Yang H (2003) Margin Variations in Support Vector Regression for theStock Market Prediction PhD thesis The Chinese University of Hong Kong

[69] Yule G U (1944) The statistical study of literary vocabulary CambridgeUniversity Press

[70] Zhao Y and Zobel J (2005) Effective and Scalable Authorship Attribu-tion Using Function Words Information Retrieval Technology Proceedings3689174ndash189

[71] Zheng R Li J Chen H and Huang Z (2005) A Framework forAuthorship Identification of Online Messages Writing-Style Features andJournal of the American Society for Information Science 57(3)378ndash393

55

Appendix

List of function words used in authorship SVM

aaboutaboveafterallalthoughamamonganandanotheranyanybodyanyoneanythingarearoundasatbebecausebeforebehindbelowbesidebetween

bothbutbycancosdodowneacheitherenougheveryeverybodyeveryoneeverythingfewfollowingforfromhaveheherhimiifinincluding

insideintoisititslatterlesslikelittlelotsmanymemoremostmuchmustmynearneedneithernonobodynonenornothingof

offononceoneontooppositeorouroutsideoverownpastperplentyplusregardingsameseveralsheshouldsincesosomesomebodysomeonesomething

suchthanthatthetheirthemthesetheythisthosethoughthroughtilltotowardtowardsunderunlessunlikeuntilupuponususedviawe

whatwhateverwhenwherewhetherwhichwhilewhowhoeverwhomwhosewillwithwithinwithoutworthwouldyesyouyour

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 32: Thesis Freek Maes - Final Version

distance but is an order of magnitude faster Therefore the Jaro-Winklermetric has been used in the experiments to follow

bull Considering the authorship attribution techniques SVMrsquos are the onesmost used and most reliable for authorship attribution (see eg [34 161 71 45 2] Moreover they are not sensitive to the problem of over-fitting and perform automated feature selection Therefore SVM hasbeen chosen as a classifier for the authorship attribution approach

bull Concerning the graph analysis techniques the Connected Path approachhas been adopted since it outperformed other link analysis techniqueswhen applied to several data sets [6] Moreover the Jaccard techniquehas been adopted since it performed well in the same research but isconsiderably less complex in nature

bull The measures precision and recall are used to evaluate the techniquesMoreover precision and recall values are combined into a single F1-measurein order to aid the comparison of different techniques to each other

28

Chapter 3

Methods

This chapter explains the different approaches that have been taken to tacklethe problems defined in the research questions as well as the corpus that hasbeen used for evaluation Design choices that have been made as well as impor-tant implementation details will be elaborated upon This chapter will startwith an introduction of the corpus that has been used and an explanation ofthe preprocessing that has been applied to it The individual techniques thathave been implemented will be discussed in section 32 whereas the differentcombinations of techniques are dealt with in section 33

31 ENRON Corpus

The ENRON corpus is a large data set of emails collected from the personalcomputers of employees of the American energy company ENRON On Decem-ber 2nd the company went bankrupt on account of one of the largest and mostcomplex cases of accounting fraud in US history The emails generated by158 employees were seized by the Federal Energy Regulatory Commission [21]during its investigation which started in 2002 The ENRON data set is the onlylarge real-life email data set that is available for research The informationthat is recorded for a single email message can be seen in figure 31 Since thesender receiver and body of each email message is recorded the data set isespecially well suited for alias resolution techniques that use information fromdifferent domains Although the original data set contains emails as well asattachments only the emails have been used in this thesis Many attachmentsdo not contain written text and for the ones that do it cannot be verified whothe actual author of the text is For the body of the email messages it can beassumed that the sender of the email has written it except for the forward andreply-parts Concerning the text that is stored in attachments this assumptioncannot be made Therefore the inclusion of these attachment would create toomuch noise in the data set especially for the authorship attribution techniquesthat use the content of the email

29

SSN requirement

From monikacaushollienroncom

To flemingenroncom rhondaflemingtxdpsstatetxus

Sent 12122000 at 1608

Thank you very much We will give it a try

Message id 293

From monikacaushollienroncom

To flemingenroncom rhondaflemingtxdpsstatetxus

Subject SSN requirement

Sent date 12122000

Body Thank you very much We will give it a try

Attachment false

Figure 31 An example of a single email message in the ENRON data setand the information that is extracted from it

The most well known version of the data set has been made public by Co-hen [12] in 2004 after some integrity issues had been dealt with The Cohen-version of the data set contains roughly 500000 email messages from 151 Enronemployees Shetty and Adibi [58] analyzed the corpusrsquo appropriateness for re-search and applied several preprocessing steps to the data Messages stored inone userrsquos in-box and in another userrsquos out-box were considered duplicates andhave been removed The same goes for messages that were created by the com-puter by organizing and storing messages into folders such as rdquoall documentsrdquoEmpty messages system messages and messages that contained only forwardsor junk data were removed Invalid email addresses were changed to the formatrdquonoaddressenroncomrdquo and undisclosed recipients were changed to the formatrdquoundisclosed-recipientsenroncomrdquo Finally the folder-based representation ofthe data set was converted to several tables in a MySQL database The cleanedversion of the data set contains 252758 emails by 151 different employees

The corpus that was made available by Shetty amp Adibi still contained somenoise as well as information not useful for this thesis Therefore after convertingit to a Microsoft SQL-database the following preprocessing steps have beenapplied to it

1 A number of system messages and junk messages that were still present inthe data were removed Among these messages where calendar remindersand messages that only contained attachments

30

Step Records affected Percentage removed (cum)

1 17052 6703 13681 12004 26223 22505 4001 24006 25990 34007 3700 35808 52163 5650

Table 31 Preprocessing steps applied to the ENRON corpus

2 Forward and reply-parts of messages have been removed

3 Empty messages resulting from the removal of forward or reply-parts instep 2 were removed

4 Messages that contained lt= 10 words were removed since they containedtoo little useful information

5 Authors that had written a total numbers of words lt= 100 were removedfor the same reason as in step 4

6 Messages that had the same sender receiver body send date and subjectwere considered duplicates and only one copy was retained

Table 31 provides an overview of the number as well as the cumulative per-centage of records that have been removed per step

In addition to the above preprocessing steps an experiment was conductedto determine the number of training instances that were needed for classificationusing Support Vector Machines These results which can be found in Figure32 showed that the highest cross-validation accuracy was achieved by using 80emails per author Therefore authors that had send a total number of emailslt= 80 were removed from the data set In order to preserve balance in the dataset authors with more than 600 emails in total were also removed Accordingto Burrows [9] 10000 words per author is a reliable minimum for authorshipattribution whereas Sanderson and Guenter [57] mention a minimum of 5000words per author In the final data set the average number of words per emailequals 209 and with at least 80 emails per author it is ensured that each authorhas a reliable number of words to train on After preprocessing the data setconsisted of 44912 emails by 246 different authors After preprocessing thedatabase contains 44912 messages by 246 different senders For each messagethe sender receiver subject body and send-date has been stored

In order to get a better view of the data set that is being used several statis-tics have been calculated Figure 33 shows the distribution of email messagesper author The x-axis represents the number of email messages that have beensent whereas the y-axis represents the number of authors Figure 34 provides

31

20 40 60 80 100 120 140 160 180 20005

055

06

065

07

075

08

085

09

095

1

Number of training instances per class

10minus

fold

Cro

ssminus

valid

atio

n ac

cura

cy

Crossminusvalidation accuracy for different training set sizes

LinearRBF

Figure 32 Averages of 10 times 10-fold cross-validation using different train-ing set sizes and kernels for the Authorship SVM

an overview of the total number of words per author The x-axis represent thetotal number of words that one author has written whereas the y-axis repre-sents the number of authors It can be seen that by far the largest number ofauthors has written a total number of words between 10000 and 100000

Next to these statistical measures a network graph has been created offall the authors in the final data set Figure 35 shows this graph Only thesenders in the network are shown since the number of receivers runs into thethousands The color of a node represent the degree of that node ie thenumber of in-going and out-going links It represents the number of messagesthat this author has sent and received It can be concluded from this graph thatthe authors in the data set are highly connected The average degree is as highas 216 Moreover nodes that appear close to each other in the graph have moreconnections between them meaning that they have had regular email contact

Since there was no data to verify whether the ENRON-data set actuallycontained any real aliases and therefore no means to measure precision andrecall several artificial aliases have been created In order to create these aliasesauthors with a total number of emails gt= 200 were selected from the data setand their emails were split up into several aliases To be more precise messagesfrom and to the original author were randomly assigned to one out of a numberof aliases for that author three different categories of authors have thus beencreated namely

32

90 100 110 120 130 140 150 160 170 180 190 200 210 220 2300

5

10

15

20

25

30

35

Num

ber

of a

utho

rs

Number of emails

Figure 33 The distribution of email messages per author

10000 100000 1000000 100000000

20

40

60

80

100

120

140

160

180

Num

ber

of a

utho

rs

Total number of words

Figure 34 The distribution of the total number of words per author

33

40enronen

rickbuye

fazul_adba

ehaedick

markpalme

sheikh_ahm

abdul_rahm

fcalger

hasan_izz

paulkaufm

abu_islam

faggoud_ya

mscottetheresast

al-hourie

keeganfar

janetteel

mpresto

timbelden

paulybarboenro

rodhaysle

kathleenc

thodgee

taffymill

hollykeis

marcusnet

shonnieda

ahmad_ibra

travismcc

ali_atwa

kayyoung

tracygeac

40enronen

bakr_ahmad

jasonwill

michellel

amartin

brianredm

jimschwie

rodhayslerodhaysle

shonawils

pattithom

40enronen mikegrigs

stephanie

ahoward

lindydono

nabith_hen

ahmed_garbdannymcca

ericgadd

kevinhyat

hasan_izz-

kevinhyatkevinhyat

ashraf_ref

samir_salw

ipayitenr

jfarmer

kimberlyh

karenbuck

ashankma

sheilaglo alanaronoandyzippe

jkeanen

davidoxle

gregwhallrickbuye

jkaminski

lindarobe

abdul_rahm

susanmara

alancomne

daviddela

nu_manenr

markschro

lisayohorayalvare

yasein_tah

janelguer

harryking

leslielaw

susanmara

maureenmc

lnicolay

suenorde

sarahnovo

johnshelk

jennifert

daviddela

susanmara

karendenn

staceybol

mschmidt

kevinpres

daviddelachristini

maryhain

josephala

mohamed_at

mdaygmssr

rosaleefl

christian

karendenn

sgovenarg

christophe

jbennettg

michaeltr

chrisfost

kristinwa

mikegrigs

phillipal

mikegrigs

phillipal

miyungbus

abu_fatima

benjacoby

huntershi

gfergusbr

lisamelle

abd_al_wak

mscotte

mscottemtholte

stephanie

vweldone

abu_abdall

melissamu

bin_laden

davidport

beckyspen

sharistac

taylorenr

taylorenr

brenthend

hassan_ros

rhondaden

gregpiper

kimberlyh

mforney

carolst

justinboy

davidminn

bobshults stephanie

jeanmrha

susanpere

sandrabra

brittdavitorikuyke

jonathanm

fletchers

randallga

brantreve

the_teache

stacydick

ahmed_khal

janetholt

kevinrusc

kimwarde

mikecarso

martincui

patricemi

ustaz

kimwarde

reaganror

samuelsch

danadavis

tammiesch

mprestokallene

peterkeoh

sharencas

ahmed_shie

stephanie

dthomas

cgironemloveen

anas_al-sa

lmimsenabu_yussrr

wwhitee

abu_khadij

abu_omran

tanyaroha

stuartzis

robertbrususanbail

suzannead

samanthabstephanie

muhammad_a

ali_saed_b

russelldi

abu_seif_a

kennethth

ashraf_ref

markwhitt

outlookte

philliplo

muhamad_ib

the_emirglenhass

jefferyfa

lorrainel

johnbucha

jerrygrav

dennisleedarrellsc

juliearms

cindystar

veronicae

markmccon

bsanders

joanniewi

rosaleefl

billiiie

twandaswe

carasempe

ammar_mans

fouad_moha

sherriser

annschmid

audreyrob

philliplo

michellel

billrapp

tluccie

kimwarde

kaychapma

stevehoos

ericgilla

ccampbell

monikacau

juanherna

lavoratoe

lavoratoe

jkaminski

meganpark

cameronpenancysell

cheryljoh

robertcot

continenta

mpresto

dkinneyco

douglassa

enron_upda

enron_upda

enron_upda

sheikh_bah

ruthconca

exchangein

exchangein

saif_al-adil

foolmotle

gelliottimujahid_sh

infopmaco

sheikh_swe

marypoorm

jkaminskishirleycr

joeparksjoeparks

kaminskie

mforney

robinrodr kerrithom

khalid_mou

lgoldseth

liztaylor

robinrodr

marketing

philliplo

masterama

memberserv

messenger

mjones7tx

mjones7tx

mustafa_mu

navigator

newsreal-newsletter

noreplycc

noreplycc

nytdirect

perfmgmte

perfmgmte

robgayen

robinrodrsafa_tabah

sylviahu

truorange

webmaster

Figure 35 A network graph of the authors in the subset of the ENRON dataset

34

Type of Alias Number of authors

High Jaro-Winkler with 1 alias 26High Jaro-Winkler with 2 aliases 15Low Jaro-Winkler with 1 alias 11

Low Jaro-Winkler with 2 aliases 1No Alias 193

Table 32 Artificial Aliases in the ENRON data set by type

Test set Mixed Hard

High Jaro-Winkler 6 2Low Jaro-Winkler 8 16

No alias 6 2

Table 33 Distribution of alias-types in two different test sets

bull Authors with 1 or more artificial aliases with a high Jaro-Winkler simi-larity (eg johndoeenroncomA amp johndoeenroncomB)

bull Aliases with 1 or more artificial aliases with a low Jaro-Winkler similarity(eg bin laden amp abu abdallah)

bull Authors without an alias

The distribution of authors and aliases in the final data set can be seen in moredetail in table 32The number of authors including aliases in the final data setequaled 315

Test sets

In order to evaluate the results of the different techniques two different test setshave been created The first test set called the mixed test set has a fairly equaldivision of alias types as can be seen in table 33 The second test set calledthe hard test set is substantially more difficult since the majority of the aliasesare not easy to recognize by their email addresses The authors in each test setwere chosen at random from their respective alias categories

32 Individual Techniques

The first technique whose performance has been evaluated on the ENRON sub-set is the Jaro-Winkler similarity For each author in the test set the Jaro-Winkler similarity of that authorrsquos email address to that of each other authorhas been calculated If the Jaro-Winkler score of a particular author-authorpair is above a certain threshold the two authors are considered to be aliases

35

The precision and recall for different decision thresholds has been measured us-ing the test sets of table 33 This technique will hereafter be referred to asrdquoJaro-Winklerrdquo or rdquoJWrdquo

The second technique that has been evaluated is the Connected Path methodhereafter referred to as rdquoConnected Pathrdquo or rdquoCPrdquo For each author in the testset a Connected Path search to depth 3 has been performed In order to ensurethat the scores were in the range of [0 1] the score for a particular author-authorpair was calculated as follows

ConnectedPath(vi vj) =ConnectedPath(vi vj)

ConnectedPathmax(31)

where ConnectedPathmax is the maximum similarity scores found for any twoauthors in the data set

The third technique that has been tested is Jaccard similarity hereafterreferred to as rdquoJaccardrdquo For each author in the test set the Jaccard similaritybetween that authors neighbors and each other authors neighbors has beencalculated Note that authors that have been removed from the data set duringthe preprocessing steps described in section 31 do not occur in the neighborhoodof their correspondents anymore and do not contribute towards the Jaccardsimilarity score The same principle applies to the Connected Path score

The last individual technique that has been evaluated is the use of SVMon email content hereafter referred to as rdquoauthorship SVMrdquo or simply rdquoSVMrdquoThe first decision that had to be made was whether to treat the problem froman instance-based or profile-based perspective Since it is possible that authorsemploy different writing styles when writing to different contacts it is importantto retain the differences in each email message For example an author can usedifferent words when writing to friends instead of colleagues or he can uselonger sentences when writing to superiors instead of writing to subordinatesMoreover Hirst and Feiguina [20] conclude that using multiple short texts forauthorship attribution overcomes the problem of not having sufficiently longtraining texts available Therefore an instance-based approach to authorshipattribution has been adopted A combination of lexical syntactic and structuralfeatures has been adapted from [71] and extended with a number of additionalfeatures to create a larger overall feature variance The complete feature setthat has been used in the authorship SVMrsquos can be found in table 34 Thelist of function words that has been used in the feature set can be found in theappendix

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using

Features   Description

Lexical
1          Total number of characters (C)
2          Total number of alphabetic characters / C
3          Total number of upper-case characters / C
4          Total number of digit characters / C
5          Total number of white-space characters / C
6          Total number of tab spaces / C
7-32       Frequency of letters A-Z
33-53      Frequency of special characters, e.g. ~ $ ^ & - _ = + > < [ ] |
54         Total number of words (M)
55         Total number of short words (less than four characters) / M
56         Total number of characters in words / C
57         Average word length
58         Average sentence length (in characters)
59         Average sentence length (in words)
60         Total different words / M
61         Hapax legomena: frequency of once-occurring words
62         Hapax dislegomena: frequency of twice-occurring words
63-82      Word length frequency distribution / M
83-333     TF-IDF of 250 most frequent 3-grams

Syntactic
334-341    Frequency of punctuation , . ? ! : ; ' "
342-491    Frequency of function words

Structural
492        Total number of sentences

Table 3.4: Feature set for the authorship SVM.

5 × 5-fold cross-validation for each authorship SVM. The highest scoring combination of parameters is then chosen to train the actual SVM model.
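The thesis implementation uses SVM.NET for this; as a sketch of the same tuning procedure, a scikit-learn equivalent (which, like SVM.NET, wraps LIBSVM [10]) could look as follows. The feature matrix X and label vector y are assumed to come from the feature extraction above.

    from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
    from sklearn.svm import SVC

    def tune_rbf_svm(X, y, seed=0):
        # exponentially growing sequences of C and gamma, scored by
        # accuracy under 5 x 5-fold cross-validation
        grid = {
            "C": [2.0 ** e for e in range(-5, 16, 2)],      # 2^-5, 2^-3, ..., 2^15
            "gamma": [2.0 ** e for e in range(-15, 4, 2)],  # 2^-15, 2^-13, ..., 2^3
        }
        cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=seed)
        search = GridSearchCV(SVC(kernel="rbf"), grid, scoring="accuracy", cv=cv)
        search.fit(X, y)
        return search.best_estimator_, search.best_params_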

The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM makes a classification whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal number of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal number of negative emails are used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems using different kernels and parameters.
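Putting the previous steps together, a one-versus-all training loop with balanced sampling can be sketched as follows. Here vectorize is a hypothetical helper that maps an email body to the feature vector of table 3.4, the tuned C and gamma from the grid search are assumed, and each author is assumed to have fewer emails than all other authors combined.

    import random
    from sklearn.svm import SVC

    def train_authorship_svms(emails_by_author, vectorize, C, gamma, seed=0):
        # one binary RBF-SVM per author: that author's emails form the
        # positive class, an equally sized random sample of the other
        # authors' emails the negative class
        rng = random.Random(seed)
        models = {}
        for author, own in emails_by_author.items():
            others = [e for a, es in emails_by_author.items() if a != author for e in es]
            pos = [vectorize(e) for e in own]
            neg = [vectorize(e) for e in rng.sample(others, len(own))]
            svm = SVC(kernel="rbf", C=C, gamma=gamma, probability=True)
            svm.fit(pos + neg, [1] * len(pos) + [0] * len(neg))
            models[author] = svm
        return models

    def authorship_probability(models, author, email_vector):
        # probability that the email was written by this particular author
        return models[author].predict_proba([email_vector])[0][1]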

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in figure 3.6.

Figure 3.6: The structure of the combined approach.
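In code, each candidate author pair is therefore reduced to a three-element vector before it reaches the voting SVM. The sketch below (with scikit-learn again standing in for SVM.NET, and the field order being an assumption) shows the input construction and the voting model.

    from sklearn.svm import SVC

    def voting_input(jw_score, link_score, svm_prob):
        # [Jaro-Winkler on addresses, link-network similarity (Jaccard or
        # Connected Path), authorship-SVM probability] for one candidate
        return [jw_score, link_score, svm_prob]

    def train_voting_svm(vectors, labels):
        # vectors: one voting_input(...) per manually labeled candidate;
        # labels: 1 for a true alias, 0 otherwise
        model = SVC(probability=True)
        model.fit(vectors, labels)
        return model

    # usage: score = model.predict_proba([voting_input(0.95, 0.4, 0.8)])[0][1]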

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from table 3.3 are used to determine the precision and recall for various decision thresholds.
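The evaluation itself amounts to sweeping a decision threshold over the candidate scores. A small sketch, assuming scores maps candidate pairs to a similarity or probability and truth is the set of true alias pairs, is:

    def precision_recall_f1(scores, truth, threshold):
        predicted = {pair for pair, s in scores.items() if s >= threshold}
        tp = len(predicted & truth)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(truth) if truth else 0.0
        denom = precision + recall
        return precision, recall, (2 * precision * recall / denom) if denom else 0.0

    def sweep_thresholds(scores, truth, steps=20):
        # thresholds 0.0, 0.05, ..., 1.0 as used throughout chapter 4
        return {i / steps: precision_recall_f1(scores, truth, i / steps)
                for i in range(steps + 1)}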


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this section. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at a decision threshold of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on a decision threshold ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88, using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89, using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.

[Figure: precision, recall and F1 plotted against decision thresholds from 0 to 1, in four panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set.

[Figure: precision, recall and F1 plotted against decision thresholds from 0 to 1, in two panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set.

[Figure: precision, recall and F1 plotted against decision thresholds from 0 to 1, in four panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set.

[Figure: precision, recall and F1 plotted against decision thresholds from 0 to 1, in two panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set.

Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.

Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that using a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path would be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If there was no Connected Path score returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. Because of the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase compared with the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.

How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often manually constructed, the results are not that good and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it would be worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There is a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining – Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), page 611, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), page 1613, New York, NY, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING '08), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–246.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence: Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. In Information Retrieval Technology, volume 3689 of Lecture Notes in Computer Science, pages 174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


The distribution of authors and aliases in the final data set can be seen in moredetail in table 32The number of authors including aliases in the final data setequaled 315

Test sets

In order to evaluate the results of the different techniques two different test setshave been created The first test set called the mixed test set has a fairly equaldivision of alias types as can be seen in table 33 The second test set calledthe hard test set is substantially more difficult since the majority of the aliasesare not easy to recognize by their email addresses The authors in each test setwere chosen at random from their respective alias categories

32 Individual Techniques

The first technique whose performance has been evaluated on the ENRON sub-set is the Jaro-Winkler similarity For each author in the test set the Jaro-Winkler similarity of that authorrsquos email address to that of each other authorhas been calculated If the Jaro-Winkler score of a particular author-authorpair is above a certain threshold the two authors are considered to be aliases

35

The precision and recall for different decision thresholds has been measured us-ing the test sets of table 33 This technique will hereafter be referred to asrdquoJaro-Winklerrdquo or rdquoJWrdquo

The second technique that has been evaluated is the Connected Path methodhereafter referred to as rdquoConnected Pathrdquo or rdquoCPrdquo For each author in the testset a Connected Path search to depth 3 has been performed In order to ensurethat the scores were in the range of [0 1] the score for a particular author-authorpair was calculated as follows

ConnectedPath(vi vj) =ConnectedPath(vi vj)

ConnectedPathmax(31)

where ConnectedPathmax is the maximum similarity scores found for any twoauthors in the data set

The third technique that has been tested is Jaccard similarity hereafterreferred to as rdquoJaccardrdquo For each author in the test set the Jaccard similaritybetween that authors neighbors and each other authors neighbors has beencalculated Note that authors that have been removed from the data set duringthe preprocessing steps described in section 31 do not occur in the neighborhoodof their correspondents anymore and do not contribute towards the Jaccardsimilarity score The same principle applies to the Connected Path score

The last individual technique that has been evaluated is the use of SVMon email content hereafter referred to as rdquoauthorship SVMrdquo or simply rdquoSVMrdquoThe first decision that had to be made was whether to treat the problem froman instance-based or profile-based perspective Since it is possible that authorsemploy different writing styles when writing to different contacts it is importantto retain the differences in each email message For example an author can usedifferent words when writing to friends instead of colleagues or he can uselonger sentences when writing to superiors instead of writing to subordinatesMoreover Hirst and Feiguina [20] conclude that using multiple short texts forauthorship attribution overcomes the problem of not having sufficiently longtraining texts available Therefore an instance-based approach to authorshipattribution has been adopted A combination of lexical syntactic and structuralfeatures has been adapted from [71] and extended with a number of additionalfeatures to create a larger overall feature variance The complete feature setthat has been used in the authorship SVMrsquos can be found in table 34 Thelist of function words that has been used in the feature set can be found in theappendix

Based on the experiment shown in figure 32 it was determined that theauthorship SVMrsquos should be trained using a Radial Basis Function-kernel sinceits overall performance was better than that of the linear kernel The parameterC influences the penalty associated with classification errors whereas γ controlsthe shape of the separating hyper plane In order to find optimal values of Cand γ a straightforward grid-search has been performed using exponentiallygrowing sequences of C and γ Specifically the accuracy all combinations ofC = 2minus5 2minus3 2minus1 215 and γ = 2minus15 2minus13 2minus11 23 is calculated using

36

Features Description

Lexical1 Total number of characters (C)2 Total number of alphabetic characters C3 Total number of upper-case characters C4 Total number of digit characters C5 Total number of white-space characters C6 Total number of tab spaces C

7-32 Frequency of letters A-Z33-53 Frequency of special characters ~$^amp-_=+gtlt[]|

54 Total number of words (M)55 Total number of short words M less than four characters56 Total number of characters in words C57 Average word length58 Average sentence length (in characters)59 Average sentence length (in words)60 Total different words M61 Hapax legomena Frequency of once-occurring words62 Hapax dislegomena Frequency of twice-occurring words

63-82 Word length frequency distribution M83-333 TFIDF of 250 most frequent 3-grams

Syntactic334-341 Frequency of punctuation rsquo rdquo342-491 Frequency of function words

Structural492 Total number of sentences

Table 34 Feature set for the authorship SVM

37

5 times 5-fold cross validation for each authorship SVM The highest scoringcombination of parameters is then chosen to train the actual SVM model

The authorship attribution problem is a multi-class problem because a giventext has to be attributed to one out of multiple candidate authors Since ordi-nary SVMrsquos can only solve binary classification problems a multi-class approachusing one-versus-all classification has been adopted In the one-versus-all ap-proach a single authorship SVM is trained on positive training instances fromone author and negative instances from all the other authors That is theauthorship SVM makes a classification whether a given text has been writtenby one particular author or not Once a separate SVM has been trained forevery author a given text can be classified by letting each authorship SVMassign a probability to the text being written by that author Rifkin and Klau-tau [56] show that as long as a good binary classifier is used it makes littledifference which multi-class scheme is used Therefore a simple scheme such asone-versus-all is preferable over more complex schemes such as error-correctingcodes

Since SVM is sensitive to class imbalances the authorship SVMrsquos are trainedusing an equal amount of positive and negative training instances In order tomake sure that the negative class is a fairly accurate representation of all theother authors emails have been selected at random from other authors For eachauthor all the authorrsquos emails are selected as positive examples and an equalamount of negative emails are used for the negative class The software thathas been used for the authorship SVMrsquos is called SVMNET [35] It is a cleanC-conversion by Matthew A Johnson of the popular LibSVM software suite[10] SVMNET uses the Sequential Minimal Optimization-algorithm describedin Fan et al [19] and is able to handle classification regression and distribu-tion estimation for single and multi-class problems using different kernels andparameters

33 Combinations of Techniques

In order to test whether a combination of techniques that operate on differentdomains will perform better than these techniques individually two differentcombinations of techniques have been tested

bull JW-CP-SVM Jaro-Winkler similarity of email addresses Connected Pathsimilarity of the link network (depth = 3) amp authorship SVM on emailcontent

bull JW-Jaccard-SVM Jaro-Winkler similarity of email addresses Jaccardsimilarity of direct neighbors in the link network amp authorship SVM onemail content

Each combination of techniques is realized by using a voting algorithm thatis based on a Support Vector Machine which will specifically be referred toas the rdquovoting SVMrdquo The voting SVM takes as input a vector containing the

38

Figure 36 The structure of the combined approach

results of the three techniques for a single candidate author and gives as outputa prediction whether the input author is an alias or not An overview of thecombined approach can be found in figure 36

In order to create the voting SVM model a set of training instances isrequired The training set for the voting SVM has the same distribution of aliastypes as the test set depending on which test set is being used Obviouslythe authors that are used to test on are not used for training the voting SVMFor each of the 20 authors in the voting SVM training set candidate aliasesare manually labeled as positive or negative All the positively labeled aliasestogether are used as positive training instances for the voting SVM and 5 timesthe number of positive instances are randomly selected from the negativelylabeled aliases as negative training examples The reason for this class imbalanceis that the number of positive instances in the training sets is rather low 14 and18 for the mixed and hard sets respectively In order to have enough traininginstances available for the voting svm more negative examples are chosen fromthe training set

After two voting SVMrsquos have been trained one using Jaccard similarity andone using Connected Path similarity the test sets from figure 33 are used todetermine the precision and recall for various decision thresholds

39

Chapter 4

Results

The results that have been obtained from the various experiments will be dis-cussed in this section First the results of the individual and combined tech-niques on the mixed test set will be given Second the results of the individualand combined techniques on the hard test set will be given Finally an overviewof the best results achieved by each individual and combined techniques will begiven

Figures 41a to 41d show the precision and recall scores achieved on themixed test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively The results are based on the following decisionthresholds 00 005 010 10 Jaro-Winkler achieves the best F1-scoreof 080 at a decision threshold of 094 and 096 Connected Path achieves amaximum F1-score of 048 on a decision threshold ranging from 08 minus 096Jaccardrsquos best F1-score of 069 occurs at a decision threshold of 058 Finallyauthorship SVM achieves a maximum F1-score of 079 for a decision thresholdof 060

Figures 42a and 42b show the results achieved by the two combinations oftechniques JW-CP-SVM and JW-Jaccard-SVM on the mixed test set JW-CP-SVM achieves its best F1-score of 079 at a threshold of 074 JW-Jaccard-SVMachieves the best result of all the techniques on this test set namely an F1-scoreof 088 using a threshold of 078

Figures 43a to 43d show the precision and recall scores achieved on thehard test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively Again the results are based on the followingdecision thresholds 00 005 010 10 Jaro-Winkler achieves its highestF1-score of 028 at a decision threshold of 088 The best F1-score for ConnectedPath is 053 using a decision threshold of 012 Jaccard achieves a maximum F1-score of 067 at a decision threshold of 038 Finally authorship SVM achievesa maximum F1-score of 076 at a decision threshold of 068

Figures 44a and 44b show the results achieved by the two combinationsof techniques JW-CP-SVM and JW-Jaccard-SVM on the hard test set JW-CP-SVM achieves its best F1-score of 065 at a threshold of 078 whereas JW-

40

Jaccard-SVM achieves the best results of all the techniques on this test setnamely an F1-score of 089 using a threshold of 092

In addition to these graphs the best F1-scores for all the techniques on eachtest set are summarized in figs 45 and 46 The precision and recall valuesthat are shown correspond to the best F1-scores achieved It can be concludedthat the best results on both the mixed and the hard test set are achieved byJW-Jaccard-SVM


Figure 4.1: Precision, recall, and F1 calculated using various decision thresholds for individual techniques on the mixed test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM; each panel plots precision, recall, and F1 against the decision threshold (0.0-1.0).

Figure 4.2: Precision, recall, and F1 calculated using various decision thresholds for combined techniques on the mixed test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM; each panel plots precision, recall, and F1 against the decision threshold (0.0-1.0).


Figure 4.3: Precision, recall, and F1 calculated using various decision thresholds for individual techniques on the hard test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM; each panel plots precision, recall, and F1 against the decision threshold (0.0-1.0).

Figure 4.4: Precision, recall, and F1 calculated using various decision thresholds for combined techniques on the hard test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM; each panel plots precision, recall, and F1 against the decision threshold (0.0-1.0).

Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.

Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or from the use of different email addresses for work, home, etc.
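For reference, a compact Python sketch of the Jaro-Winkler metric in its standard formulation (prefix scaling factor p = 0.1, common prefix capped at four characters); this is an illustrative reimplementation, not the code used in the experiments:

    def jaro(s, t):
        # Jaro similarity: matching characters within a sliding window,
        # discounted by the number of transpositions among them
        if s == t:
            return 1.0
        if not s or not t:
            return 0.0
        window = max(len(s), len(t)) // 2 - 1
        s_match = [False] * len(s)
        t_match = [False] * len(t)
        matches = 0
        for i, c in enumerate(s):
            lo, hi = max(0, i - window), min(len(t), i + window + 1)
            for j in range(lo, hi):
                if not t_match[j] and t[j] == c:
                    s_match[i] = t_match[j] = True
                    matches += 1
                    break
        if matches == 0:
            return 0.0
        s_chars = [c for i, c in enumerate(s) if s_match[i]]
        t_chars = [c for j, c in enumerate(t) if t_match[j]]
        transpositions = sum(a != b for a, b in zip(s_chars, t_chars)) / 2
        return (matches / len(s) + matches / len(t)
                + (matches - transpositions) / matches) / 3

    def jaro_winkler(s, t, p=0.1):
        # boost the Jaro score for strings that share a common prefix;
        # e.g. jaro_winkler("martha", "marhta") gives roughly 0.96
        j = jaro(s, t)
        prefix = 0
        for a, b in zip(s, t):
            if a != b or prefix == 4:
                break
            prefix += 1
        return j + prefix * p * (1 - j)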

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond direct neighbors, since it takes more complicated link connections into account. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path would be observed on this data set if the search were performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If there was no Connected Path score returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes direct neighbors into account, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.
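For two authors a and b with sets of direct contacts N(a) and N(b), the score is J(a, b) = |N(a) ∩ N(b)| / |N(a) ∪ N(b)|. A minimal sketch, assuming the link network is stored as a mapping from each address to the set of addresses it exchanged mail with (names hypothetical):

    def jaccard_neighbors(contacts, a, b):
        # contacts: dict mapping an email address to its set of direct contacts
        na, nb = contacts.get(a, set()), contacts.get(b, set())
        if not na and not nb:
            return 0.0
        return len(na & nb) / len(na | nb)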

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.
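A sketch of this one-versus-all scheme, again with scikit-learn as a stand-in for SVM.NET and assuming the balanced positive/negative sampling described in Chapter 3 (names hypothetical):

    import random
    from sklearn.svm import SVC

    def train_one_vs_all(emails_by_author, seed=0):
        # emails_by_author: dict mapping an author to a list of stylometric
        # feature vectors, one per email
        rng = random.Random(seed)
        models = {}
        for author, positives in emails_by_author.items():
            pool = [v for other, vectors in emails_by_author.items()
                    if other != author for v in vectors]
            # balance the classes: as many negatives as positives
            negatives = rng.sample(pool, min(len(positives), len(pool)))
            X = positives + negatives
            y = [1] * len(positives) + [0] * len(negatives)
            models[author] = SVC(kernel="rbf", probability=True).fit(X, y)
        return models

    def attribution_scores(models, email_vector):
        # probability, per author, that the email was written by that author
        return {a: m.predict_proba([email_vector])[0][1]
                for a, m in models.items()}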

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general. In combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard, and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that were formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often constructed manually, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.
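In its generic form, such a linear combination scores a candidate pair (a, b) as

\[
s(a, b) = \sum_{i} w_i \, s_i(a, b), \qquad \sum_{i} w_i = 1,
\]

where each s_i is the normalized score of one technique and the weights w_i are fixed by hand; the voting-SVM approach adopted in this thesis instead learns a decision function over the score vector (s_1, ..., s_n) from labeled examples.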

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content, and Jaccard similarity of the link network, combined using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it would be worthwhile to create one.

The link analysis techniques that have been used in this thesis mainly use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques, and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67-75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48-57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9-17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288-293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77-102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267-287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27-47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1-39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290-294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73-78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273-297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265-292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278-285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889-1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval, Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3-6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289-1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486-509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611-es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172-177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453-476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56-64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42-S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491-498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1-11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137-142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69-72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659-660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83-94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019-1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019-1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING '08, pages 513-520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237-46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97-105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39-41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267-270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89-99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101-141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482-491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45-72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249-252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265-269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538-556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425-442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17-24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1-15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174-189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378-393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, I, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


[23] Friedman C and Sideli R (1992) Tolerating spelling errors during pa-tient validation Computers and biomedical research an international journal25(5)486ndash509

51

[24] Gabrilovich E and Markovitch S (2004) Text categorization with manyredundant features Using aggressive feature selection to make SVMs compet-itive with C45 In Proceedings of the twenty-first international conference onMachine learning ACM International Conference Proceeding Series page 41Banff Canada ACM

[25] Gamon M (2004) Linguistic correlates of style In Proceedings of the 20thinternational conference on Computational Linguistics - COLING rsquo04 pages611ndashes Morristown NJ USA Association for Computational Linguistics

[26] Honore A (1979) Some simple measures of richness of vocabulary Asso-ciation for Literary and Linguistic Computing Bulletin 7(2)172ndash177

[27] Hoover D L (2004) Testing Burrowsrsquos Delta Literary and LinguisticComputing 19(4)453ndash476

[28] Hsiung P Moore A Neill D and Schneider J (2005) Alias Detec-tion in Link Data Sets In Proceedings of the International Conference onIntelligence Analysis (2005) volume 4

[29] Iqbal F Binsalleeh H Fung B C and Debbabi M (2010) Miningwriteprints from anonymous e-mails for forensic investigation Digital Inves-tigation 7(1-2)56ndash64

[30] Iqbal F Binsalleeh H Fung B C and Debbabi M (2011) A unifieddata mining solution for authorship analysis in anonymous textual commu-nications Information Sciences

[31] Iqbal F Hadjidj R Fung B C and Debbabi M (2008) A novelapproach of mining write-prints for authorship attribution in e-mail forensicsDigital Investigation 5S42ndashS51

[32] Jaro M A (1995) Probabilistic linkage of large public health data filesStatistics in Medicine 14(5-7)491ndash498

[33] Jeh G (2002) SimRank a measure of structural-context similarity Pro-ceedings of the eighth ACM SIGKDD international conference on Knowledgediscovery and data mining pages 1ndash11

[34] Joachims T (1998) Text categorization with support vector machinesLearning with many relevant features Machine Learning 1398(23)137ndash142

[35] Johnson M A (2009) Svmnet 163 Retrieved from httpwww

matthewajohnsonorgsoftwaresvmhtml

[36] Kern R Seifert C Zechner M and Granitzer M (2011) Vote VetoMeta-Classifier for Authorship Identification In Notebook for PAN at CLEF2011

52

[37] Koppel M (2003) Exploiting stylistic idiosyncrasies for authorship attri-bution In Proceedings of the 2003 International Joint Conferences on Artifi-cial Intelligence Workshop on Computational Approaches to Style Analysisand Synthesis number 2000 pages 69ndash72 Acapulco Mexico

[38] Koppel M and Akiva N (2003) A corpus-independent feature set forstyle-based text categorization In Proceedings of IJCAIrsquo03 Workshop onComputational Approaches to Style Analysis and Synthesis Acapulco Mex-ico

[39] Koppel M Gan R and Messeri E (2006) Authorship Attribution withThousands of Candidate Authors In Proceedings of the 29th annual interna-tional ACM SIGIR conference on Research and Development in InformationRetrieval pages 659ndash660 Seattle Washington USA ACM

[40] Koppel M Schler J and Argamon S (2010) Authorship attribution inthe wild Language Resources and Evaluation 45(1)83ndash94

[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx K (2010) Scalability Issues in Authorship Attribution Schaal-baarheid bij Auteursherkenning PhD thesis Universiteit Antwerpen

[45] Luyckx K Daelemans W Hamilton A and Madison J (2008) Au-thorship Attribution and Verification with Many Authors and Limited DataIn Proceedings of the 22nd International Conference on Computational Lin-guistics COLING 08 (2008) number August pages 513ndash520 Association forComputational Linguistics

[46] Manning C D Raghavan P and Schutze H (2009) An Introduction toInformation Retrieval Number c Cambridge University Press CambridgeEngland 2009 edition

[47] Mendenhall T C (1887) The Characteristic Curves of Composition Sci-ence (New York NY) 9(214S)237ndash46

[48] Mendenhall T C (1901) A Mechanical Solution of a Literary problemPopular Science Monthly 60(2)97ndash105

53

[49] Miller (1995) WordNet A Lexical Database Communications of theACM 38(11)39ndash41

[50] Monge A E and Elkan C P (1996) The field matching problem Algo-rithms and applications In Evangelos S Jiawei H and Usama F editorsProceedings of the Second International Conference on Knowledge Discoveryand Data Mining number Slaven 1992 pages 267ndash270 Menlo Park Califor-nia The AAAI Press Menlo Park California

[51] Mosteller F and Wallace D L (1964) Inference and Disputed AuthorshipThe Federalist The David Hume Series of Philosophy and Cognitive ScienceReissues Addison-Wesley

[52] Navarro G (2001) A guided tour to approximate string matching ACMComputing Surveys 33(1)31ndash88

[53] Odell M and Russell R (1918) The soundex coding system US Patents1261167

[54] Page L Brin S Motwani R and Winograd T (1998) The PageRankCitation Ranking Bringing Order to the Web Technical report StanfordInfoLab

[55] Reuther P and Walter B (2006) Survey on test collections and techniquesfor personal name matching Int J Metadata Semant Ontologies 1(2)89ndash99

[56] Rifkin R and Klautau A (2004) In Defense of One-Vs-All ClassificationJournal of Machine Learning Research 5101ndash141

[57] Sanderson C and Guenter S (2006) Short Text Authorship Attributionvia Sequence Kernels Markov Chains and Author Unmasking An Inves-tigation In Proceedings of the 2006 Conference on Empirical Methods inNatural Language Processing number July pages 482ndash491 Association forComputational Linguistics

[58] Shetty J and Adibi J (2004) The Enron Email Dataset DatabaseSchema and Brief Statistical Report Technical report Information SciencesInstitute

[59] Sichel H S (1986) Word Frequency Distributions and Type-Token Char-acteristics The Mathematical Scientist 1145ndash72

[60] Simeon M and Hilderman R (2009) An Empirical Study of CategorySkew on Feature Selection for Text Categorization In Proceedings of the22nd Canadian Conference on Artificial Intelligence Advances in ArtificialIntelligence pages 249ndash252 Springer-Verlag Berlin

[61] Small H (1973) Co-citation in the Scientific Literature A New Measure ofthe Relationship Between Two Documents Journal of the American Societyfor Information Science pages 265ndash269

54

[62] Solorio T and Pillay S (2011) Authorship Identification with ModalitySpecific Meta Features In Notebook for PAN at CLEF 2011

[63] Stamatatos E (2009) A survey of modern authorship attribution methodsJournal of the American Society for Information Science and Technology60(3)538ndash556

[64] Tanguy L Urieli A Calderone B Hathout N and Sajous F (2011)A multitude of linguistically-rich features for authorship attribution In Note-book for PAN at CLEF 2011

[65] Tearle M Taylor K and Demuth H (2008) An algorithm for auto-mated authorship attribution using neural networks Literary and LinguisticComputing 23(4)425ndash442

[66] Tsuboi Y and Matsumoto Y (2002) Authorship identification for Het-eregeneous Documents IPSJ SIG Notes pages 17ndash24

[67] Winkler W E (1999) The State of Record Linkage and Current ResearchProblems Statistical Research Division US Census Bureau pages 1ndash15

[68] Yang H (2003) Margin Variations in Support Vector Regression for theStock Market Prediction PhD thesis The Chinese University of Hong Kong

[69] Yule G U (1944) The statistical study of literary vocabulary CambridgeUniversity Press

[70] Zhao Y and Zobel J (2005) Effective and Scalable Authorship Attribu-tion Using Function Words Information Retrieval Technology Proceedings3689174ndash189

[71] Zheng R Li J Chen H and Huang Z (2005) A Framework forAuthorship Identification of Online Messages Writing-Style Features andJournal of the American Society for Information Science 57(3)378ndash393

55

Appendix

List of function words used in authorship SVM

aaboutaboveafterallalthoughamamonganandanotheranyanybodyanyoneanythingarearoundasatbebecausebeforebehindbelowbesidebetween

bothbutbycancosdodowneacheitherenougheveryeverybodyeveryoneeverythingfewfollowingforfromhaveheherhimiifinincluding

insideintoisititslatterlesslikelittlelotsmanymemoremostmuchmustmynearneedneithernonobodynonenornothingof

offononceoneontooppositeorouroutsideoverownpastperplentyplusregardingsameseveralsheshouldsincesosomesomebodysomeonesomething

suchthanthatthetheirthemthesetheythisthosethoughthroughtilltotowardtowardsunderunlessunlikeuntilupuponususedviawe

whatwhateverwhenwherewhetherwhichwhilewhowhoeverwhomwhosewillwithwithinwithoutworthwouldyesyouyour

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 35: Thesis Freek Maes - Final Version

Step Records affected Percentage removed (cum)

1 17052 6703 13681 12004 26223 22505 4001 24006 25990 34007 3700 35808 52163 5650

Table 31 Preprocessing steps applied to the ENRON corpus

2 Forward and reply-parts of messages have been removed

3 Empty messages resulting from the removal of forward or reply-parts instep 2 were removed

4 Messages that contained lt= 10 words were removed since they containedtoo little useful information

5 Authors that had written a total numbers of words lt= 100 were removedfor the same reason as in step 4

6 Messages that had the same sender receiver body send date and subjectwere considered duplicates and only one copy was retained

Table 31 provides an overview of the number as well as the cumulative per-centage of records that have been removed per step

In addition to the above preprocessing steps an experiment was conductedto determine the number of training instances that were needed for classificationusing Support Vector Machines These results which can be found in Figure32 showed that the highest cross-validation accuracy was achieved by using 80emails per author Therefore authors that had send a total number of emailslt= 80 were removed from the data set In order to preserve balance in the dataset authors with more than 600 emails in total were also removed Accordingto Burrows [9] 10000 words per author is a reliable minimum for authorshipattribution whereas Sanderson and Guenter [57] mention a minimum of 5000words per author In the final data set the average number of words per emailequals 209 and with at least 80 emails per author it is ensured that each authorhas a reliable number of words to train on After preprocessing the data setconsisted of 44912 emails by 246 different authors After preprocessing thedatabase contains 44912 messages by 246 different senders For each messagethe sender receiver subject body and send-date has been stored

In order to get a better view of the data set that is being used several statis-tics have been calculated Figure 33 shows the distribution of email messagesper author The x-axis represents the number of email messages that have beensent whereas the y-axis represents the number of authors Figure 34 provides

31

20 40 60 80 100 120 140 160 180 20005

055

06

065

07

075

08

085

09

095

1

Number of training instances per class

10minus

fold

Cro

ssminus

valid

atio

n ac

cura

cy

Crossminusvalidation accuracy for different training set sizes

LinearRBF

Figure 32 Averages of 10 times 10-fold cross-validation using different train-ing set sizes and kernels for the Authorship SVM

an overview of the total number of words per author The x-axis represent thetotal number of words that one author has written whereas the y-axis repre-sents the number of authors It can be seen that by far the largest number ofauthors has written a total number of words between 10000 and 100000

Next to these statistical measures a network graph has been created offall the authors in the final data set Figure 35 shows this graph Only thesenders in the network are shown since the number of receivers runs into thethousands The color of a node represent the degree of that node ie thenumber of in-going and out-going links It represents the number of messagesthat this author has sent and received It can be concluded from this graph thatthe authors in the data set are highly connected The average degree is as highas 216 Moreover nodes that appear close to each other in the graph have moreconnections between them meaning that they have had regular email contact

Since there was no data to verify whether the ENRON-data set actuallycontained any real aliases and therefore no means to measure precision andrecall several artificial aliases have been created In order to create these aliasesauthors with a total number of emails gt= 200 were selected from the data setand their emails were split up into several aliases To be more precise messagesfrom and to the original author were randomly assigned to one out of a numberof aliases for that author three different categories of authors have thus beencreated namely

32

90 100 110 120 130 140 150 160 170 180 190 200 210 220 2300

5

10

15

20

25

30

35

Num

ber

of a

utho

rs

Number of emails

Figure 33 The distribution of email messages per author

10000 100000 1000000 100000000

20

40

60

80

100

120

140

160

180

Num

ber

of a

utho

rs

Total number of words

Figure 34 The distribution of the total number of words per author

33

40enronen

rickbuye

fazul_adba

ehaedick

markpalme

sheikh_ahm

abdul_rahm

fcalger

hasan_izz

paulkaufm

abu_islam

faggoud_ya

mscottetheresast

al-hourie

keeganfar

janetteel

mpresto

timbelden

paulybarboenro

rodhaysle

kathleenc

thodgee

taffymill

hollykeis

marcusnet

shonnieda

ahmad_ibra

travismcc

ali_atwa

kayyoung

tracygeac

40enronen

bakr_ahmad

jasonwill

michellel

amartin

brianredm

jimschwie

rodhayslerodhaysle

shonawils

pattithom

40enronen mikegrigs

stephanie

ahoward

lindydono

nabith_hen

ahmed_garbdannymcca

ericgadd

kevinhyat

hasan_izz-

kevinhyatkevinhyat

ashraf_ref

samir_salw

ipayitenr

jfarmer

kimberlyh

karenbuck

ashankma

sheilaglo alanaronoandyzippe

jkeanen

davidoxle

gregwhallrickbuye

jkaminski

lindarobe

abdul_rahm

susanmara

alancomne

daviddela

nu_manenr

markschro

lisayohorayalvare

yasein_tah

janelguer

harryking

leslielaw

susanmara

maureenmc

lnicolay

suenorde

sarahnovo

johnshelk

jennifert

daviddela

susanmara

karendenn

staceybol

mschmidt

kevinpres

daviddelachristini

maryhain

josephala

mohamed_at

mdaygmssr

rosaleefl

christian

karendenn

sgovenarg

christophe

jbennettg

michaeltr

chrisfost

kristinwa

mikegrigs

phillipal

mikegrigs

phillipal

miyungbus

abu_fatima

benjacoby

huntershi

gfergusbr

lisamelle

abd_al_wak

mscotte

mscottemtholte

stephanie

vweldone

abu_abdall

melissamu

bin_laden

davidport

beckyspen

sharistac

taylorenr

taylorenr

brenthend

hassan_ros

rhondaden

gregpiper

kimberlyh

mforney

carolst

justinboy

davidminn

bobshults stephanie

jeanmrha

susanpere

sandrabra

brittdavitorikuyke

jonathanm

fletchers

randallga

brantreve

the_teache

stacydick

ahmed_khal

janetholt

kevinrusc

kimwarde

mikecarso

martincui

patricemi

ustaz

kimwarde

reaganror

samuelsch

danadavis

tammiesch

mprestokallene

peterkeoh

sharencas

ahmed_shie

stephanie

dthomas

cgironemloveen

anas_al-sa

lmimsenabu_yussrr

wwhitee

abu_khadij

abu_omran

tanyaroha

stuartzis

robertbrususanbail

suzannead

samanthabstephanie

muhammad_a

ali_saed_b

russelldi

abu_seif_a

kennethth

ashraf_ref

markwhitt

outlookte

philliplo

muhamad_ib

the_emirglenhass

jefferyfa

lorrainel

johnbucha

jerrygrav

dennisleedarrellsc

juliearms

cindystar

veronicae

markmccon

bsanders

joanniewi

rosaleefl

billiiie

twandaswe

carasempe

ammar_mans

fouad_moha

sherriser

annschmid

audreyrob

philliplo

michellel

billrapp

tluccie

kimwarde

kaychapma

stevehoos

ericgilla

ccampbell

monikacau

juanherna

lavoratoe

lavoratoe

jkaminski

meganpark

cameronpenancysell

cheryljoh

robertcot

continenta

mpresto

dkinneyco

douglassa

enron_upda

enron_upda

enron_upda

sheikh_bah

ruthconca

exchangein

exchangein

saif_al-adil

foolmotle

gelliottimujahid_sh

infopmaco

sheikh_swe

marypoorm

jkaminskishirleycr

joeparksjoeparks

kaminskie

mforney

robinrodr kerrithom

khalid_mou

lgoldseth

liztaylor

robinrodr

marketing

philliplo

masterama

memberserv

messenger

mjones7tx

mjones7tx

mustafa_mu

navigator

newsreal-newsletter

noreplycc

noreplycc

nytdirect

perfmgmte

perfmgmte

robgayen

robinrodrsafa_tabah

sylviahu

truorange

webmaster

Figure 35 A network graph of the authors in the subset of the ENRON dataset

34

Type of Alias Number of authors

High Jaro-Winkler with 1 alias 26High Jaro-Winkler with 2 aliases 15Low Jaro-Winkler with 1 alias 11

Low Jaro-Winkler with 2 aliases 1No Alias 193

Table 32 Artificial Aliases in the ENRON data set by type

Test set Mixed Hard

High Jaro-Winkler 6 2Low Jaro-Winkler 8 16

No alias 6 2

Table 33 Distribution of alias-types in two different test sets

bull Authors with 1 or more artificial aliases with a high Jaro-Winkler simi-larity (eg johndoeenroncomA amp johndoeenroncomB)

bull Aliases with 1 or more artificial aliases with a low Jaro-Winkler similarity(eg bin laden amp abu abdallah)

bull Authors without an alias

The distribution of authors and aliases in the final data set can be seen in moredetail in table 32The number of authors including aliases in the final data setequaled 315

Test sets

In order to evaluate the results of the different techniques two different test setshave been created The first test set called the mixed test set has a fairly equaldivision of alias types as can be seen in table 33 The second test set calledthe hard test set is substantially more difficult since the majority of the aliasesare not easy to recognize by their email addresses The authors in each test setwere chosen at random from their respective alias categories

32 Individual Techniques

The first technique whose performance has been evaluated on the ENRON sub-set is the Jaro-Winkler similarity For each author in the test set the Jaro-Winkler similarity of that authorrsquos email address to that of each other authorhas been calculated If the Jaro-Winkler score of a particular author-authorpair is above a certain threshold the two authors are considered to be aliases

35

The precision and recall for different decision thresholds has been measured us-ing the test sets of table 33 This technique will hereafter be referred to asrdquoJaro-Winklerrdquo or rdquoJWrdquo

The second technique that has been evaluated is the Connected Path methodhereafter referred to as rdquoConnected Pathrdquo or rdquoCPrdquo For each author in the testset a Connected Path search to depth 3 has been performed In order to ensurethat the scores were in the range of [0 1] the score for a particular author-authorpair was calculated as follows

ConnectedPath(vi vj) =ConnectedPath(vi vj)

ConnectedPathmax(31)

where ConnectedPathmax is the maximum similarity scores found for any twoauthors in the data set

The third technique that has been tested is Jaccard similarity hereafterreferred to as rdquoJaccardrdquo For each author in the test set the Jaccard similaritybetween that authors neighbors and each other authors neighbors has beencalculated Note that authors that have been removed from the data set duringthe preprocessing steps described in section 31 do not occur in the neighborhoodof their correspondents anymore and do not contribute towards the Jaccardsimilarity score The same principle applies to the Connected Path score

The last individual technique that has been evaluated is the use of SVMon email content hereafter referred to as rdquoauthorship SVMrdquo or simply rdquoSVMrdquoThe first decision that had to be made was whether to treat the problem froman instance-based or profile-based perspective Since it is possible that authorsemploy different writing styles when writing to different contacts it is importantto retain the differences in each email message For example an author can usedifferent words when writing to friends instead of colleagues or he can uselonger sentences when writing to superiors instead of writing to subordinatesMoreover Hirst and Feiguina [20] conclude that using multiple short texts forauthorship attribution overcomes the problem of not having sufficiently longtraining texts available Therefore an instance-based approach to authorshipattribution has been adopted A combination of lexical syntactic and structuralfeatures has been adapted from [71] and extended with a number of additionalfeatures to create a larger overall feature variance The complete feature setthat has been used in the authorship SVMrsquos can be found in table 34 Thelist of function words that has been used in the feature set can be found in theappendix

Based on the experiment shown in figure 32 it was determined that theauthorship SVMrsquos should be trained using a Radial Basis Function-kernel sinceits overall performance was better than that of the linear kernel The parameterC influences the penalty associated with classification errors whereas γ controlsthe shape of the separating hyper plane In order to find optimal values of Cand γ a straightforward grid-search has been performed using exponentiallygrowing sequences of C and γ Specifically the accuracy all combinations ofC = 2minus5 2minus3 2minus1 215 and γ = 2minus15 2minus13 2minus11 23 is calculated using

36

Features Description

Lexical1 Total number of characters (C)2 Total number of alphabetic characters C3 Total number of upper-case characters C4 Total number of digit characters C5 Total number of white-space characters C6 Total number of tab spaces C

7-32 Frequency of letters A-Z33-53 Frequency of special characters ~$^amp-_=+gtlt[]|

54 Total number of words (M)55 Total number of short words M less than four characters56 Total number of characters in words C57 Average word length58 Average sentence length (in characters)59 Average sentence length (in words)60 Total different words M61 Hapax legomena Frequency of once-occurring words62 Hapax dislegomena Frequency of twice-occurring words

63-82 Word length frequency distribution M83-333 TFIDF of 250 most frequent 3-grams

Syntactic334-341 Frequency of punctuation rsquo rdquo342-491 Frequency of function words

Structural492 Total number of sentences

Table 34 Feature set for the authorship SVM

37

5 times 5-fold cross validation for each authorship SVM The highest scoringcombination of parameters is then chosen to train the actual SVM model

The authorship attribution problem is a multi-class problem because a giventext has to be attributed to one out of multiple candidate authors Since ordi-nary SVMrsquos can only solve binary classification problems a multi-class approachusing one-versus-all classification has been adopted In the one-versus-all ap-proach a single authorship SVM is trained on positive training instances fromone author and negative instances from all the other authors That is theauthorship SVM makes a classification whether a given text has been writtenby one particular author or not Once a separate SVM has been trained forevery author a given text can be classified by letting each authorship SVMassign a probability to the text being written by that author Rifkin and Klau-tau [56] show that as long as a good binary classifier is used it makes littledifference which multi-class scheme is used Therefore a simple scheme such asone-versus-all is preferable over more complex schemes such as error-correctingcodes

Since SVM is sensitive to class imbalances the authorship SVMrsquos are trainedusing an equal amount of positive and negative training instances In order tomake sure that the negative class is a fairly accurate representation of all theother authors emails have been selected at random from other authors For eachauthor all the authorrsquos emails are selected as positive examples and an equalamount of negative emails are used for the negative class The software thathas been used for the authorship SVMrsquos is called SVMNET [35] It is a cleanC-conversion by Matthew A Johnson of the popular LibSVM software suite[10] SVMNET uses the Sequential Minimal Optimization-algorithm describedin Fan et al [19] and is able to handle classification regression and distribu-tion estimation for single and multi-class problems using different kernels andparameters

33 Combinations of Techniques

In order to test whether a combination of techniques that operate on differentdomains will perform better than these techniques individually two differentcombinations of techniques have been tested

bull JW-CP-SVM Jaro-Winkler similarity of email addresses Connected Pathsimilarity of the link network (depth = 3) amp authorship SVM on emailcontent

bull JW-Jaccard-SVM Jaro-Winkler similarity of email addresses Jaccardsimilarity of direct neighbors in the link network amp authorship SVM onemail content

Each combination of techniques is realized by using a voting algorithm thatis based on a Support Vector Machine which will specifically be referred toas the rdquovoting SVMrdquo The voting SVM takes as input a vector containing the

38

Figure 36 The structure of the combined approach

results of the three techniques for a single candidate author and gives as outputa prediction whether the input author is an alias or not An overview of thecombined approach can be found in figure 36

In order to create the voting SVM model a set of training instances isrequired The training set for the voting SVM has the same distribution of aliastypes as the test set depending on which test set is being used Obviouslythe authors that are used to test on are not used for training the voting SVMFor each of the 20 authors in the voting SVM training set candidate aliasesare manually labeled as positive or negative All the positively labeled aliasestogether are used as positive training instances for the voting SVM and 5 timesthe number of positive instances are randomly selected from the negativelylabeled aliases as negative training examples The reason for this class imbalanceis that the number of positive instances in the training sets is rather low 14 and18 for the mixed and hard sets respectively In order to have enough traininginstances available for the voting svm more negative examples are chosen fromthe training set

After two voting SVMrsquos have been trained one using Jaccard similarity andone using Connected Path similarity the test sets from figure 33 are used todetermine the precision and recall for various decision thresholds

39

Chapter 4

Results

The results that have been obtained from the various experiments will be dis-cussed in this section First the results of the individual and combined tech-niques on the mixed test set will be given Second the results of the individualand combined techniques on the hard test set will be given Finally an overviewof the best results achieved by each individual and combined techniques will begiven

Figures 41a to 41d show the precision and recall scores achieved on themixed test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively The results are based on the following decisionthresholds 00 005 010 10 Jaro-Winkler achieves the best F1-scoreof 080 at a decision threshold of 094 and 096 Connected Path achieves amaximum F1-score of 048 on a decision threshold ranging from 08 minus 096Jaccardrsquos best F1-score of 069 occurs at a decision threshold of 058 Finallyauthorship SVM achieves a maximum F1-score of 079 for a decision thresholdof 060

Figures 42a and 42b show the results achieved by the two combinations oftechniques JW-CP-SVM and JW-Jaccard-SVM on the mixed test set JW-CP-SVM achieves its best F1-score of 079 at a threshold of 074 JW-Jaccard-SVMachieves the best result of all the techniques on this test set namely an F1-scoreof 088 using a threshold of 078

Figures 43a to 43d show the precision and recall scores achieved on thehard test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively Again the results are based on the followingdecision thresholds 00 005 010 10 Jaro-Winkler achieves its highestF1-score of 028 at a decision threshold of 088 The best F1-score for ConnectedPath is 053 using a decision threshold of 012 Jaccard achieves a maximum F1-score of 067 at a decision threshold of 038 Finally authorship SVM achievesa maximum F1-score of 076 at a decision threshold of 068

Figures 44a and 44b show the results achieved by the two combinationsof techniques JW-CP-SVM and JW-Jaccard-SVM on the hard test set JW-CP-SVM achieves its best F1-score of 065 at a threshold of 078 whereas JW-

40

Jaccard-SVM achieves the best results of all the techniques on this test setnamely an F1-score of 089 using a threshold of 092

In addition to these graphs the best F1-scores for all the techniques on eachtest set are summarized in figs 45 and 46 The precision and recall valuesthat are shown correspond to the best F1-scores achieved It can be concludedthat the best results on both the mixed and the hard test set are achieved byJW-Jaccard-SVM

41

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 41 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the mixed test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 42 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the mixed test set

42

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 43 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the hard test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 44 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the hard test set

43

Figure 45 Best results on the mixed test set for different techniques Preci-sion and recall values correspond to the given F1-scores

44

Figure 46 Best results on the hard test set for different techniques Precisionand recall values correspond to the given F1-scores

45

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used forresolving aliases and disambiguate authors in email data Specifically it wasinvestigated whether a combination of techniques could perform better thanindividual techniques on these tasks The main results of this thesis can besummarized as follows

Jaro Winkler The Jaro-Winkler approach gave good results on the mixedtest set but failed on the hard test set The high F1-score on the mixed test setcan be explained by the fact that many of the artificial aliases had an extremelyhigh Jaro-Winkler similarity The hard test set more closely mimics a real-worldscenario where aliases do not look as much alike For example Boongoen et al[6] showed that in their data set of real terrorist names derived from web pages70 of the true aliases had a Jaro-Winkler similarity of less than 06 Sincethe hard test set also features more aliases with low Jaro-Winkler similaritythe performance on this test set is significantly lower However the results stillshow that using a simple string metric can detect many aliases resulting fromspelling errors or the use of different email addresses for work home etc

Connected Path It can be concluded that the Connected Path algorithmfailed to achieve good results on both test sets because of three reasons Firstsince authors have been split up into aliases and some have been removed all to-gether the link networkrsquos structure might have been corrupted This especiallyaffects link analysis that goes beyond the analysis of direct neighbors since ittakes into account more complicated link connections Second because of timeconstraints the link network search has been performed to depth 3 which meansthat only the information contained in paths of length 2 and 3 have been usedin the calculation of the similarity score Boongoen et al [6] achieved betteraccuracy by searching to depth 4 compared with a search to depth 2 It isexpected that the same behavior of Connected Path can be observed on thisdata set if the search would have been performed to a greater depth Third theConnected Path method can only return similarity scores for authors that areconnected to the original author If there was no Connected Path score returnedfor a particular author-alias pair the alias had to be counted as a false negative

46

thereby decreasing the overall recallJaccard Using Jaccard similarity yielded better results than the Connected

Path algorithm Since Jaccard similarity only takes into account direct neigh-bors it is less affected by changes in the link network Moreover the Jaccardsimilarity can be calculated between any two authors in the data set which iswhy it scored better than the Connected Path method

Authorship SVM The use of authorship SVMrsquos gave good results overallespecially considering the fact that there are 314 candidate aliases for eachauthor and that the training texts are short Considering that Luyckx [44]reported scalability issues when using a multi-class SVM approach the one-versus-all approach that has been adopted in this thesis is very promising

Combined techniques The combination of JW-CP-SVM did not performvery well On the mixed test set it performed as good as authorship SVMor even Jaro-Winkler alone and for the hard test set it performed even worseBecause of aforementioned reasons the Connected Path method failed to achievegood results in general In combination with the low Jaro-Winkler performanceon the hard data set this resulted in the combination JW-CP-SVM failing toachieve reasonable results The best results for both test sets are achieved bythe combination of Jaro-Winkler Jaccard and authorship SVM On the hardtest set the increase in F1-score of this combination compared with the secondbest technique (SVM) is as high as 16 whereas on the mixed test set theincrease to the second-best technique (Jaro-Winkler) is 9

51 Conclusion

The results of the experiments confirm the hypothesis that a combination oftechniques can yield better results than using these techniques individuallyThe research questions that have been formulated to guide this research areanswered below

Which authorship disambiguation and alias resolution techniques ex-ist that can be used on email data

The literature review of section 2 has provided an extensive overview of thedifferent techniques that can be used to attribute authorship and resolve aliasesTechniques that operate on the domain of email addresses are able to resolvesuperficial aliases resulting from unintentional misspellings or simple variationsin naming conventions Authorship attribution techniques can predict the realauthor of a given set of email very well provided that there is enough trainingtext available for each author If this is the case dealing with a large authorset has also proven to be possible Link analysis techniques have low precisionand recall when used individually but can still manage to find aliases that othertechniques do not

47

How can techniques from different domains be combined

Section 24 has given an overview of approaches to combining techniques fromdifferent domains Previous literature on the subject is rare especially on com-bining techniques from more than two domains A common method is to usea linear weighted combination of the results of different techniques Since theweights are often manually constructed the results are not that good and theability of these techniques to generalize is low Better results are achieved whena classifier is trained to distinguish between good and bad combinations of re-sults

Can a combination of techniques from different domains increase per-formance when compared with individual techniques

The results of Chapter 4 show that a combination of techniques can indeedincrease precision and recall when compared with individual techniques Specif-ically a combination of Jaro-Winkler similarity on email addresses authorshipSVM on email content and Jaccard similarity of the link network using an SVMvoting algorithm achieves the best results when tested on a subset of the EN-RON data set The results of this combination of techniques also tend to bemore robust across different decision thresholds something that is useful whendetermining a proper threshold might be difficult It is important to note thatthe relative improvement in F1-score of the combined techniques over the in-dividual techniques is dependent on the number of low Jaro-Winkler aliases inthe test set Especially on the hard data set where aliases are more difficult torecognize the combination of techniques performs very well and achieves sig-nificantly higher F1-scores than the individual techniques This indicates thatthe different techniques are indeed complementary and can work together toachieve better results

To summarize it can be concluded that a combination of techniques thatoperate on different domains is more effective in disambiguating authors andresolving aliases than each of these techniques individually

52 Future Recommendations

The results and conclusion that have been put forward in the previous sectionsprovide good ground for future research It will be interesting to see how wellthe techniques that have been used in this thesis perform on a full data set withreal aliases which could not be found to use in this research Should such acollection not exist it is worthwhile to create one

The link analysis techniques that have been used in this paper only useinformation from the direct neighborhood of the authors Boongoen et al[6]have already shown that searching to a greater depth yields better results soit is useful to look at how the algorithm can be optimized to be less computa-tionally intensive in order to search to greater depths Moreover since the less

48

sophisticated Jaccard similarity surprisingly outperformed Connected Path itis worthwhile to experiment with different link analysis techniques

Finally the assumption has been made that the results from various tech-niques are independent of each other These assumptions have not been testedand it is not clear if and in what way various techniques affect each other Thereare a myriad of decisions to be made when implementing an alias resolution sys-tem that combines different techniques Therefore it is imperative that moreresearch will be done to examine the best choice of feature sets techniques andaggregation methods

49

Chapter 6

Bibliography

[1] Abbasi A and Chen H (2005) Applying authorship analysis to extremist-group Web forum messages Intelligent Systems IEEE 20(5)67ndash75

[2] Argamon S and Juola P (2011) Overview of the International Au-thorship Identification Competition at PAN-2011 In CLEF (Notebook Pa-persLabsWorkshop)

[3] Baroni M Matiasek J and Trost H (2002) Unsupervised discovery ofmorphologically related words based on orthographic and semantic similarityIn Proceedings of the ACL-02 workshop on Morphological and phonologicallearning-Volume 6 volume 6 pages 48ndash57 Morristown NJ USA Associationfor Computational Linguistics

[4] Binongo J N G (2003) Who Wrote the 15th Book of Oz An Applicationof Multivariate Analysis to Authorship Attribution Chance 16(2)9ndash17

[5] Boongoen T and Shen Q (2009) Semi-supervised OWA aggregation forlink-based similarity evaluation and alias detection 2009 IEEE InternationalConference on Fuzzy Systems pages 288ndash293

[6] Boongoen T Shen Q and Price C (2010) Disclosing false identitythrough hybrid link analysis Artificial Intelligence and Law 18(1)77ndash102

[7] Brunet E (1978) Le Vocabulaire de Jean Giraudoux structure et evolutionStatistique et informatique appliquees a letude des textes a partir des donneesdu Tresor de la langue francaise Le Vocabulaire des grands ecrivains francais1 Slatkine

[8] Burrows J (2002) rsquoDeltarsquo A measure of stylistic difference and a guide tolikely authorship Literary and Linguistic Computing 17(3)267ndash287

[9] Burrows J (2007) All the Way Through Testing for Authorship in Differ-ent Frequency Strata Literary and Linguistic Computing 22(1)27ndash47

50

[10] Chang C-c and Lin C-j (2011) LIBSVM A Library for Support VectorMachines ACM Transactions on Intelligent Systems and Technology 2(3)1ndash39

[11] Christen P (2006) A Comparison of Personal Name Matching Techniquesand Practical Issues In Sixth IEEE International Conference on Data Mining- Workshops (ICDMWrsquo06) number September pages 290ndash294 IEEE

[12] Cohen W W (2009) Enron email dataset Retrieved from httpwwwcscmuedu~enron

[13] Cohen W W and Fienberg S E (2003) A comparison of three stringmatching algorithms Methods 20(1)73ndash78

[14] Cortes C and Vapnik V (1995) Support-vector networks MachineLearning 20(3)273ndash297

[15] Crammer K and Singer Y (2002) On the Algorithmic Implementationof Multiclass Kernel-based Vector Machines Journal of Machine LearningResearch 2(2)265ndash292

[16] de Vel O (2000) Mining e-mail authorship Proc Workshop on TextMining ACM International

[17] de Vel O Anderson A Corney M and Mohay G (2001) Mining e-mailcontent for author identification forensics ACM SIGMOD Record 30(4)55

[18] Duan K-b and Keerthi S S (2005) Which Is the Best Multiclass SVMMethod An Empirical Study Multiple Classifier Systems 3541278ndash285

[19] Fan R-e Chen P-h and Lin C-j (2005) Working Set Selection UsingSecond Order Information for Training Support Vector Machines Journal ofMachine Learning Research 61889ndash1918

[20] Feiguina O and Hirst G (2007) Authorship attribution for small textsLiterary and forensic experiments In Proceedings of the 30th InternationalConference of the Special Interest Group on Information Retrieval Work-shop on Plagiarism Analysis Authorship Identification and Near-DuplicateDetection (SIGIR) pages 3ndash6

[21] FERC (2012) Information released in enron investigation Re-trieved from httpwwwfercgovindustrieselectricindus-act

wecenroninfo-releaseasp

[22] Forman G (2003) An Extensive Empirical Study of Feature SelectionMetrics for Text Classification Journal of Machine Learning Research 3(7-8)1289ndash1305

[23] Friedman C and Sideli R (1992) Tolerating spelling errors during pa-tient validation Computers and biomedical research an international journal25(5)486ndash509

51

[24] Gabrilovich E and Markovitch S (2004) Text categorization with manyredundant features Using aggressive feature selection to make SVMs compet-itive with C45 In Proceedings of the twenty-first international conference onMachine learning ACM International Conference Proceeding Series page 41Banff Canada ACM

[25] Gamon M (2004) Linguistic correlates of style In Proceedings of the 20thinternational conference on Computational Linguistics - COLING rsquo04 pages611ndashes Morristown NJ USA Association for Computational Linguistics

[26] Honore A (1979) Some simple measures of richness of vocabulary Asso-ciation for Literary and Linguistic Computing Bulletin 7(2)172ndash177

[27] Hoover D L (2004) Testing Burrowsrsquos Delta Literary and LinguisticComputing 19(4)453ndash476

[28] Hsiung P Moore A Neill D and Schneider J (2005) Alias Detec-tion in Link Data Sets In Proceedings of the International Conference onIntelligence Analysis (2005) volume 4

[29] Iqbal F Binsalleeh H Fung B C and Debbabi M (2010) Miningwriteprints from anonymous e-mails for forensic investigation Digital Inves-tigation 7(1-2)56ndash64

[30] Iqbal F Binsalleeh H Fung B C and Debbabi M (2011) A unifieddata mining solution for authorship analysis in anonymous textual commu-nications Information Sciences

[31] Iqbal F Hadjidj R Fung B C and Debbabi M (2008) A novelapproach of mining write-prints for authorship attribution in e-mail forensicsDigital Investigation 5S42ndashS51

[32] Jaro M A (1995) Probabilistic linkage of large public health data filesStatistics in Medicine 14(5-7)491ndash498

[33] Jeh G (2002) SimRank a measure of structural-context similarity Pro-ceedings of the eighth ACM SIGKDD international conference on Knowledgediscovery and data mining pages 1ndash11

[34] Joachims T (1998) Text categorization with support vector machinesLearning with many relevant features Machine Learning 1398(23)137ndash142

[35] Johnson M A (2009) Svmnet 163 Retrieved from httpwww

matthewajohnsonorgsoftwaresvmhtml

[36] Kern R Seifert C Zechner M and Granitzer M (2011) Vote VetoMeta-Classifier for Authorship Identification In Notebook for PAN at CLEF2011

52

[37] Koppel M (2003) Exploiting stylistic idiosyncrasies for authorship attri-bution In Proceedings of the 2003 International Joint Conferences on Artifi-cial Intelligence Workshop on Computational Approaches to Style Analysisand Synthesis number 2000 pages 69ndash72 Acapulco Mexico

[38] Koppel M and Akiva N (2003) A corpus-independent feature set forstyle-based text categorization In Proceedings of IJCAIrsquo03 Workshop onComputational Approaches to Style Analysis and Synthesis Acapulco Mex-ico

[39] Koppel M Gan R and Messeri E (2006) Authorship Attribution withThousands of Candidate Authors In Proceedings of the 29th annual interna-tional ACM SIGIR conference on Research and Development in InformationRetrieval pages 659ndash660 Seattle Washington USA ACM

[40] Koppel M Schler J and Argamon S (2010) Authorship attribution inthe wild Language Resources and Evaluation 45(1)83ndash94

[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx K (2010) Scalability Issues in Authorship Attribution Schaal-baarheid bij Auteursherkenning PhD thesis Universiteit Antwerpen

[45] Luyckx K Daelemans W Hamilton A and Madison J (2008) Au-thorship Attribution and Verification with Many Authors and Limited DataIn Proceedings of the 22nd International Conference on Computational Lin-guistics COLING 08 (2008) number August pages 513ndash520 Association forComputational Linguistics

[46] Manning C D Raghavan P and Schutze H (2009) An Introduction toInformation Retrieval Number c Cambridge University Press CambridgeEngland 2009 edition

[47] Mendenhall T C (1887) The Characteristic Curves of Composition Sci-ence (New York NY) 9(214S)237ndash46

[48] Mendenhall T C (1901) A Mechanical Solution of a Literary problemPopular Science Monthly 60(2)97ndash105

53

[49] Miller (1995) WordNet A Lexical Database Communications of theACM 38(11)39ndash41

[50] Monge A E and Elkan C P (1996) The field matching problem Algo-rithms and applications In Evangelos S Jiawei H and Usama F editorsProceedings of the Second International Conference on Knowledge Discoveryand Data Mining number Slaven 1992 pages 267ndash270 Menlo Park Califor-nia The AAAI Press Menlo Park California

[51] Mosteller F and Wallace D L (1964) Inference and Disputed AuthorshipThe Federalist The David Hume Series of Philosophy and Cognitive ScienceReissues Addison-Wesley

[52] Navarro G (2001) A guided tour to approximate string matching ACMComputing Surveys 33(1)31ndash88

[53] Odell M and Russell R (1918) The soundex coding system US Patents1261167

[54] Page L Brin S Motwani R and Winograd T (1998) The PageRankCitation Ranking Bringing Order to the Web Technical report StanfordInfoLab

[55] Reuther P and Walter B (2006) Survey on test collections and techniquesfor personal name matching Int J Metadata Semant Ontologies 1(2)89ndash99

[56] Rifkin R and Klautau A (2004) In Defense of One-Vs-All ClassificationJournal of Machine Learning Research 5101ndash141

[57] Sanderson C and Guenter S (2006) Short Text Authorship Attributionvia Sequence Kernels Markov Chains and Author Unmasking An Inves-tigation In Proceedings of the 2006 Conference on Empirical Methods inNatural Language Processing number July pages 482ndash491 Association forComputational Linguistics

[58] Shetty J and Adibi J (2004) The Enron Email Dataset DatabaseSchema and Brief Statistical Report Technical report Information SciencesInstitute

[59] Sichel H S (1986) Word Frequency Distributions and Type-Token Char-acteristics The Mathematical Scientist 1145ndash72

[60] Simeon M and Hilderman R (2009) An Empirical Study of CategorySkew on Feature Selection for Text Categorization In Proceedings of the22nd Canadian Conference on Artificial Intelligence Advances in ArtificialIntelligence pages 249ndash252 Springer-Verlag Berlin

[61] Small H (1973) Co-citation in the Scientific Literature A New Measure ofthe Relationship Between Two Documents Journal of the American Societyfor Information Science pages 265ndash269

54

[62] Solorio T and Pillay S (2011) Authorship Identification with ModalitySpecific Meta Features In Notebook for PAN at CLEF 2011

[63] Stamatatos E (2009) A survey of modern authorship attribution methodsJournal of the American Society for Information Science and Technology60(3)538ndash556

[64] Tanguy L Urieli A Calderone B Hathout N and Sajous F (2011)A multitude of linguistically-rich features for authorship attribution In Note-book for PAN at CLEF 2011

[65] Tearle M Taylor K and Demuth H (2008) An algorithm for auto-mated authorship attribution using neural networks Literary and LinguisticComputing 23(4)425ndash442

[66] Tsuboi Y and Matsumoto Y (2002) Authorship identification for Het-eregeneous Documents IPSJ SIG Notes pages 17ndash24

[67] Winkler W E (1999) The State of Record Linkage and Current ResearchProblems Statistical Research Division US Census Bureau pages 1ndash15

[68] Yang H (2003) Margin Variations in Support Vector Regression for theStock Market Prediction PhD thesis The Chinese University of Hong Kong

[69] Yule G U (1944) The statistical study of literary vocabulary CambridgeUniversity Press

[70] Zhao Y and Zobel J (2005) Effective and Scalable Authorship Attribu-tion Using Function Words Information Retrieval Technology Proceedings3689174ndash189

[71] Zheng R Li J Chen H and Huang Z (2005) A Framework forAuthorship Identification of Online Messages Writing-Style Features andJournal of the American Society for Information Science 57(3)378ndash393

55

Appendix

List of function words used in authorship SVM

aaboutaboveafterallalthoughamamonganandanotheranyanybodyanyoneanythingarearoundasatbebecausebeforebehindbelowbesidebetween

bothbutbycancosdodowneacheitherenougheveryeverybodyeveryoneeverythingfewfollowingforfromhaveheherhimiifinincluding

insideintoisititslatterlesslikelittlelotsmanymemoremostmuchmustmynearneedneithernonobodynonenornothingof

offononceoneontooppositeorouroutsideoverownpastperplentyplusregardingsameseveralsheshouldsincesosomesomebodysomeonesomething

suchthanthatthetheirthemthesetheythisthosethoughthroughtilltotowardtowardsunderunlessunlikeuntilupuponususedviawe

whatwhateverwhenwherewhetherwhichwhilewhowhoeverwhomwhosewillwithwithinwithoutworthwouldyesyouyour

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 36: Thesis Freek Maes - Final Version

20 40 60 80 100 120 140 160 180 20005

055

06

065

07

075

08

085

09

095

1

Number of training instances per class

10minus

fold

Cro

ssminus

valid

atio

n ac

cura

cy

Crossminusvalidation accuracy for different training set sizes

LinearRBF

Figure 32 Averages of 10 times 10-fold cross-validation using different train-ing set sizes and kernels for the Authorship SVM

an overview of the total number of words per author The x-axis represent thetotal number of words that one author has written whereas the y-axis repre-sents the number of authors It can be seen that by far the largest number ofauthors has written a total number of words between 10000 and 100000

Next to these statistical measures a network graph has been created offall the authors in the final data set Figure 35 shows this graph Only thesenders in the network are shown since the number of receivers runs into thethousands The color of a node represent the degree of that node ie thenumber of in-going and out-going links It represents the number of messagesthat this author has sent and received It can be concluded from this graph thatthe authors in the data set are highly connected The average degree is as highas 216 Moreover nodes that appear close to each other in the graph have moreconnections between them meaning that they have had regular email contact

Since there was no data to verify whether the ENRON-data set actuallycontained any real aliases and therefore no means to measure precision andrecall several artificial aliases have been created In order to create these aliasesauthors with a total number of emails gt= 200 were selected from the data setand their emails were split up into several aliases To be more precise messagesfrom and to the original author were randomly assigned to one out of a numberof aliases for that author three different categories of authors have thus beencreated namely

32

90 100 110 120 130 140 150 160 170 180 190 200 210 220 2300

5

10

15

20

25

30

35

Num

ber

of a

utho

rs

Number of emails

Figure 33 The distribution of email messages per author

10000 100000 1000000 100000000

20

40

60

80

100

120

140

160

180

Num

ber

of a

utho

rs

Total number of words

Figure 34 The distribution of the total number of words per author

33

40enronen

rickbuye

fazul_adba

ehaedick

markpalme

sheikh_ahm

abdul_rahm

fcalger

hasan_izz

paulkaufm

abu_islam

faggoud_ya

mscottetheresast

al-hourie

keeganfar

janetteel

mpresto

timbelden

paulybarboenro

rodhaysle

kathleenc

thodgee

taffymill

hollykeis

marcusnet

shonnieda

ahmad_ibra

travismcc

ali_atwa

kayyoung

tracygeac

40enronen

bakr_ahmad

jasonwill

michellel

amartin

brianredm

jimschwie

rodhayslerodhaysle

shonawils

pattithom

40enronen mikegrigs

stephanie

ahoward

lindydono

nabith_hen

ahmed_garbdannymcca

ericgadd

kevinhyat

hasan_izz-

kevinhyatkevinhyat

ashraf_ref

samir_salw

ipayitenr

jfarmer

kimberlyh

karenbuck

ashankma

sheilaglo alanaronoandyzippe

jkeanen

davidoxle

gregwhallrickbuye

jkaminski

lindarobe

abdul_rahm

susanmara

alancomne

daviddela

nu_manenr

markschro

lisayohorayalvare

yasein_tah

janelguer

harryking

leslielaw

susanmara

maureenmc

lnicolay

suenorde

sarahnovo

johnshelk

jennifert

daviddela

susanmara

karendenn

staceybol

mschmidt

kevinpres

daviddelachristini

maryhain

josephala

mohamed_at

mdaygmssr

rosaleefl

christian

karendenn

sgovenarg

christophe

jbennettg

michaeltr

chrisfost

kristinwa

mikegrigs

phillipal

mikegrigs

phillipal

miyungbus

abu_fatima

benjacoby

huntershi

gfergusbr

lisamelle

abd_al_wak

mscotte

mscottemtholte

stephanie

vweldone

abu_abdall

melissamu

bin_laden

davidport

beckyspen

sharistac

taylorenr

taylorenr

brenthend

hassan_ros

rhondaden

gregpiper

kimberlyh

mforney

carolst

justinboy

davidminn

bobshults stephanie

jeanmrha

susanpere

sandrabra

brittdavitorikuyke

jonathanm

fletchers

randallga

brantreve

the_teache

stacydick

ahmed_khal

janetholt

kevinrusc

kimwarde

mikecarso

martincui

patricemi

ustaz

kimwarde

reaganror

samuelsch

danadavis

tammiesch

mprestokallene

peterkeoh

sharencas

ahmed_shie

stephanie

dthomas

cgironemloveen

anas_al-sa

lmimsenabu_yussrr

wwhitee

abu_khadij

abu_omran

tanyaroha

stuartzis

robertbrususanbail

suzannead

samanthabstephanie

muhammad_a

ali_saed_b

russelldi

abu_seif_a

kennethth

ashraf_ref

markwhitt

outlookte

philliplo

muhamad_ib

the_emirglenhass

jefferyfa

lorrainel

johnbucha

jerrygrav

dennisleedarrellsc

juliearms

cindystar

veronicae

markmccon

bsanders

joanniewi

rosaleefl

billiiie

twandaswe

carasempe

ammar_mans

fouad_moha

sherriser

annschmid

audreyrob

philliplo

michellel

billrapp

tluccie

kimwarde

kaychapma

stevehoos

ericgilla

ccampbell

monikacau

juanherna

lavoratoe

lavoratoe

jkaminski

meganpark

cameronpenancysell

cheryljoh

robertcot

continenta

mpresto

dkinneyco

douglassa

enron_upda

enron_upda

enron_upda

sheikh_bah

ruthconca

exchangein

exchangein

saif_al-adil

foolmotle

gelliottimujahid_sh

infopmaco

sheikh_swe

marypoorm

jkaminskishirleycr

joeparksjoeparks

kaminskie

mforney

robinrodr kerrithom

khalid_mou

lgoldseth

liztaylor

robinrodr

marketing

philliplo

masterama

memberserv

messenger

mjones7tx

mjones7tx

mustafa_mu

navigator

newsreal-newsletter

noreplycc

noreplycc

nytdirect

perfmgmte

perfmgmte

robgayen

robinrodrsafa_tabah

sylviahu

truorange

webmaster

Figure 35 A network graph of the authors in the subset of the ENRON dataset

34

Type of Alias Number of authors

High Jaro-Winkler with 1 alias 26High Jaro-Winkler with 2 aliases 15Low Jaro-Winkler with 1 alias 11

Low Jaro-Winkler with 2 aliases 1No Alias 193

Table 32 Artificial Aliases in the ENRON data set by type

Test set Mixed Hard

High Jaro-Winkler 6 2Low Jaro-Winkler 8 16

No alias 6 2

Table 33 Distribution of alias-types in two different test sets

bull Authors with 1 or more artificial aliases with a high Jaro-Winkler simi-larity (eg johndoeenroncomA amp johndoeenroncomB)

bull Aliases with 1 or more artificial aliases with a low Jaro-Winkler similarity(eg bin laden amp abu abdallah)

bull Authors without an alias

The distribution of authors and aliases in the final data set can be seen in moredetail in table 32The number of authors including aliases in the final data setequaled 315

Test sets

In order to evaluate the results of the different techniques two different test setshave been created The first test set called the mixed test set has a fairly equaldivision of alias types as can be seen in table 33 The second test set calledthe hard test set is substantially more difficult since the majority of the aliasesare not easy to recognize by their email addresses The authors in each test setwere chosen at random from their respective alias categories

32 Individual Techniques

The first technique whose performance has been evaluated on the ENRON sub-set is the Jaro-Winkler similarity For each author in the test set the Jaro-Winkler similarity of that authorrsquos email address to that of each other authorhas been calculated If the Jaro-Winkler score of a particular author-authorpair is above a certain threshold the two authors are considered to be aliases

35

The precision and recall for different decision thresholds has been measured us-ing the test sets of table 33 This technique will hereafter be referred to asrdquoJaro-Winklerrdquo or rdquoJWrdquo

The second technique that has been evaluated is the Connected Path methodhereafter referred to as rdquoConnected Pathrdquo or rdquoCPrdquo For each author in the testset a Connected Path search to depth 3 has been performed In order to ensurethat the scores were in the range of [0 1] the score for a particular author-authorpair was calculated as follows

ConnectedPath(vi vj) =ConnectedPath(vi vj)

ConnectedPathmax(31)

where ConnectedPathmax is the maximum similarity scores found for any twoauthors in the data set

The third technique that has been tested is Jaccard similarity hereafterreferred to as rdquoJaccardrdquo For each author in the test set the Jaccard similaritybetween that authors neighbors and each other authors neighbors has beencalculated Note that authors that have been removed from the data set duringthe preprocessing steps described in section 31 do not occur in the neighborhoodof their correspondents anymore and do not contribute towards the Jaccardsimilarity score The same principle applies to the Connected Path score

The last individual technique that has been evaluated is the use of SVMon email content hereafter referred to as rdquoauthorship SVMrdquo or simply rdquoSVMrdquoThe first decision that had to be made was whether to treat the problem froman instance-based or profile-based perspective Since it is possible that authorsemploy different writing styles when writing to different contacts it is importantto retain the differences in each email message For example an author can usedifferent words when writing to friends instead of colleagues or he can uselonger sentences when writing to superiors instead of writing to subordinatesMoreover Hirst and Feiguina [20] conclude that using multiple short texts forauthorship attribution overcomes the problem of not having sufficiently longtraining texts available Therefore an instance-based approach to authorshipattribution has been adopted A combination of lexical syntactic and structuralfeatures has been adapted from [71] and extended with a number of additionalfeatures to create a larger overall feature variance The complete feature setthat has been used in the authorship SVMrsquos can be found in table 34 Thelist of function words that has been used in the feature set can be found in theappendix

Based on the experiment shown in figure 32 it was determined that theauthorship SVMrsquos should be trained using a Radial Basis Function-kernel sinceits overall performance was better than that of the linear kernel The parameterC influences the penalty associated with classification errors whereas γ controlsthe shape of the separating hyper plane In order to find optimal values of Cand γ a straightforward grid-search has been performed using exponentiallygrowing sequences of C and γ Specifically the accuracy all combinations ofC = 2minus5 2minus3 2minus1 215 and γ = 2minus15 2minus13 2minus11 23 is calculated using

36

Features Description

Lexical1 Total number of characters (C)2 Total number of alphabetic characters C3 Total number of upper-case characters C4 Total number of digit characters C5 Total number of white-space characters C6 Total number of tab spaces C

7-32 Frequency of letters A-Z33-53 Frequency of special characters ~$^amp-_=+gtlt[]|

54 Total number of words (M)55 Total number of short words M less than four characters56 Total number of characters in words C57 Average word length58 Average sentence length (in characters)59 Average sentence length (in words)60 Total different words M61 Hapax legomena Frequency of once-occurring words62 Hapax dislegomena Frequency of twice-occurring words

63-82 Word length frequency distribution M83-333 TFIDF of 250 most frequent 3-grams

Syntactic334-341 Frequency of punctuation rsquo rdquo342-491 Frequency of function words

Structural492 Total number of sentences

Table 34 Feature set for the authorship SVM

37

5 times 5-fold cross validation for each authorship SVM The highest scoringcombination of parameters is then chosen to train the actual SVM model

The authorship attribution problem is a multi-class problem because a giventext has to be attributed to one out of multiple candidate authors Since ordi-nary SVMrsquos can only solve binary classification problems a multi-class approachusing one-versus-all classification has been adopted In the one-versus-all ap-proach a single authorship SVM is trained on positive training instances fromone author and negative instances from all the other authors That is theauthorship SVM makes a classification whether a given text has been writtenby one particular author or not Once a separate SVM has been trained forevery author a given text can be classified by letting each authorship SVMassign a probability to the text being written by that author Rifkin and Klau-tau [56] show that as long as a good binary classifier is used it makes littledifference which multi-class scheme is used Therefore a simple scheme such asone-versus-all is preferable over more complex schemes such as error-correctingcodes

Since SVM is sensitive to class imbalances the authorship SVMrsquos are trainedusing an equal amount of positive and negative training instances In order tomake sure that the negative class is a fairly accurate representation of all theother authors emails have been selected at random from other authors For eachauthor all the authorrsquos emails are selected as positive examples and an equalamount of negative emails are used for the negative class The software thathas been used for the authorship SVMrsquos is called SVMNET [35] It is a cleanC-conversion by Matthew A Johnson of the popular LibSVM software suite[10] SVMNET uses the Sequential Minimal Optimization-algorithm describedin Fan et al [19] and is able to handle classification regression and distribu-tion estimation for single and multi-class problems using different kernels andparameters

33 Combinations of Techniques

In order to test whether a combination of techniques that operate on differentdomains will perform better than these techniques individually two differentcombinations of techniques have been tested

bull JW-CP-SVM Jaro-Winkler similarity of email addresses Connected Pathsimilarity of the link network (depth = 3) amp authorship SVM on emailcontent

bull JW-Jaccard-SVM Jaro-Winkler similarity of email addresses Jaccardsimilarity of direct neighbors in the link network amp authorship SVM onemail content

Each combination of techniques is realized by using a voting algorithm thatis based on a Support Vector Machine which will specifically be referred toas the rdquovoting SVMrdquo The voting SVM takes as input a vector containing the

38

Figure 36 The structure of the combined approach

results of the three techniques for a single candidate author and gives as outputa prediction whether the input author is an alias or not An overview of thecombined approach can be found in figure 36

In order to create the voting SVM model a set of training instances isrequired The training set for the voting SVM has the same distribution of aliastypes as the test set depending on which test set is being used Obviouslythe authors that are used to test on are not used for training the voting SVMFor each of the 20 authors in the voting SVM training set candidate aliasesare manually labeled as positive or negative All the positively labeled aliasestogether are used as positive training instances for the voting SVM and 5 timesthe number of positive instances are randomly selected from the negativelylabeled aliases as negative training examples The reason for this class imbalanceis that the number of positive instances in the training sets is rather low 14 and18 for the mixed and hard sets respectively In order to have enough traininginstances available for the voting svm more negative examples are chosen fromthe training set

After two voting SVMrsquos have been trained one using Jaccard similarity andone using Connected Path similarity the test sets from figure 33 are used todetermine the precision and recall for various decision thresholds

39

Chapter 4

Results

The results that have been obtained from the various experiments will be dis-cussed in this section First the results of the individual and combined tech-niques on the mixed test set will be given Second the results of the individualand combined techniques on the hard test set will be given Finally an overviewof the best results achieved by each individual and combined techniques will begiven

Figures 41a to 41d show the precision and recall scores achieved on themixed test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively The results are based on the following decisionthresholds 00 005 010 10 Jaro-Winkler achieves the best F1-scoreof 080 at a decision threshold of 094 and 096 Connected Path achieves amaximum F1-score of 048 on a decision threshold ranging from 08 minus 096Jaccardrsquos best F1-score of 069 occurs at a decision threshold of 058 Finallyauthorship SVM achieves a maximum F1-score of 079 for a decision thresholdof 060

Figures 42a and 42b show the results achieved by the two combinations oftechniques JW-CP-SVM and JW-Jaccard-SVM on the mixed test set JW-CP-SVM achieves its best F1-score of 079 at a threshold of 074 JW-Jaccard-SVMachieves the best result of all the techniques on this test set namely an F1-scoreof 088 using a threshold of 078

Figures 43a to 43d show the precision and recall scores achieved on thehard test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively Again the results are based on the followingdecision thresholds 00 005 010 10 Jaro-Winkler achieves its highestF1-score of 028 at a decision threshold of 088 The best F1-score for ConnectedPath is 053 using a decision threshold of 012 Jaccard achieves a maximum F1-score of 067 at a decision threshold of 038 Finally authorship SVM achievesa maximum F1-score of 076 at a decision threshold of 068

Figures 44a and 44b show the results achieved by the two combinationsof techniques JW-CP-SVM and JW-Jaccard-SVM on the hard test set JW-CP-SVM achieves its best F1-score of 065 at a threshold of 078 whereas JW-

40

Jaccard-SVM achieves the best results of all the techniques on this test setnamely an F1-score of 089 using a threshold of 092

In addition to these graphs the best F1-scores for all the techniques on eachtest set are summarized in figs 45 and 46 The precision and recall valuesthat are shown correspond to the best F1-scores achieved It can be concludedthat the best results on both the mixed and the hard test set are achieved byJW-Jaccard-SVM

41

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 41 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the mixed test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 42 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the mixed test set

42

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 43 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the hard test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 44 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the hard test set

43

Figure 45 Best results on the mixed test set for different techniques Preci-sion and recall values correspond to the given F1-scores

44

Figure 46 Best results on the hard test set for different techniques Precisionand recall values correspond to the given F1-scores

45

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used forresolving aliases and disambiguate authors in email data Specifically it wasinvestigated whether a combination of techniques could perform better thanindividual techniques on these tasks The main results of this thesis can besummarized as follows

Jaro Winkler The Jaro-Winkler approach gave good results on the mixedtest set but failed on the hard test set The high F1-score on the mixed test setcan be explained by the fact that many of the artificial aliases had an extremelyhigh Jaro-Winkler similarity The hard test set more closely mimics a real-worldscenario where aliases do not look as much alike For example Boongoen et al[6] showed that in their data set of real terrorist names derived from web pages70 of the true aliases had a Jaro-Winkler similarity of less than 06 Sincethe hard test set also features more aliases with low Jaro-Winkler similaritythe performance on this test set is significantly lower However the results stillshow that using a simple string metric can detect many aliases resulting fromspelling errors or the use of different email addresses for work home etc

Connected Path It can be concluded that the Connected Path algorithmfailed to achieve good results on both test sets because of three reasons Firstsince authors have been split up into aliases and some have been removed all to-gether the link networkrsquos structure might have been corrupted This especiallyaffects link analysis that goes beyond the analysis of direct neighbors since ittakes into account more complicated link connections Second because of timeconstraints the link network search has been performed to depth 3 which meansthat only the information contained in paths of length 2 and 3 have been usedin the calculation of the similarity score Boongoen et al [6] achieved betteraccuracy by searching to depth 4 compared with a search to depth 2 It isexpected that the same behavior of Connected Path can be observed on thisdata set if the search would have been performed to a greater depth Third theConnected Path method can only return similarity scores for authors that areconnected to the original author If there was no Connected Path score returnedfor a particular author-alias pair the alias had to be counted as a false negative

46

thereby decreasing the overall recallJaccard Using Jaccard similarity yielded better results than the Connected

Path algorithm Since Jaccard similarity only takes into account direct neigh-bors it is less affected by changes in the link network Moreover the Jaccardsimilarity can be calculated between any two authors in the data set which iswhy it scored better than the Connected Path method

Authorship SVM The use of authorship SVMrsquos gave good results overallespecially considering the fact that there are 314 candidate aliases for eachauthor and that the training texts are short Considering that Luyckx [44]reported scalability issues when using a multi-class SVM approach the one-versus-all approach that has been adopted in this thesis is very promising

Combined techniques The combination of JW-CP-SVM did not performvery well On the mixed test set it performed as good as authorship SVMor even Jaro-Winkler alone and for the hard test set it performed even worseBecause of aforementioned reasons the Connected Path method failed to achievegood results in general In combination with the low Jaro-Winkler performanceon the hard data set this resulted in the combination JW-CP-SVM failing toachieve reasonable results The best results for both test sets are achieved bythe combination of Jaro-Winkler Jaccard and authorship SVM On the hardtest set the increase in F1-score of this combination compared with the secondbest technique (SVM) is as high as 16 whereas on the mixed test set theincrease to the second-best technique (Jaro-Winkler) is 9

51 Conclusion

The results of the experiments confirm the hypothesis that a combination oftechniques can yield better results than using these techniques individuallyThe research questions that have been formulated to guide this research areanswered below

Which authorship disambiguation and alias resolution techniques ex-ist that can be used on email data

The literature review of section 2 has provided an extensive overview of thedifferent techniques that can be used to attribute authorship and resolve aliasesTechniques that operate on the domain of email addresses are able to resolvesuperficial aliases resulting from unintentional misspellings or simple variationsin naming conventions Authorship attribution techniques can predict the realauthor of a given set of email very well provided that there is enough trainingtext available for each author If this is the case dealing with a large authorset has also proven to be possible Link analysis techniques have low precisionand recall when used individually but can still manage to find aliases that othertechniques do not

47

How can techniques from different domains be combined

Section 24 has given an overview of approaches to combining techniques fromdifferent domains Previous literature on the subject is rare especially on com-bining techniques from more than two domains A common method is to usea linear weighted combination of the results of different techniques Since theweights are often manually constructed the results are not that good and theability of these techniques to generalize is low Better results are achieved whena classifier is trained to distinguish between good and bad combinations of re-sults

Can a combination of techniques from different domains increase per-formance when compared with individual techniques

The results of Chapter 4 show that a combination of techniques can indeedincrease precision and recall when compared with individual techniques Specif-ically a combination of Jaro-Winkler similarity on email addresses authorshipSVM on email content and Jaccard similarity of the link network using an SVMvoting algorithm achieves the best results when tested on a subset of the EN-RON data set The results of this combination of techniques also tend to bemore robust across different decision thresholds something that is useful whendetermining a proper threshold might be difficult It is important to note thatthe relative improvement in F1-score of the combined techniques over the in-dividual techniques is dependent on the number of low Jaro-Winkler aliases inthe test set Especially on the hard data set where aliases are more difficult torecognize the combination of techniques performs very well and achieves sig-nificantly higher F1-scores than the individual techniques This indicates thatthe different techniques are indeed complementary and can work together toachieve better results

To summarize it can be concluded that a combination of techniques thatoperate on different domains is more effective in disambiguating authors andresolving aliases than each of these techniques individually

52 Future Recommendations

The results and conclusion that have been put forward in the previous sectionsprovide good ground for future research It will be interesting to see how wellthe techniques that have been used in this thesis perform on a full data set withreal aliases which could not be found to use in this research Should such acollection not exist it is worthwhile to create one

The link analysis techniques that have been used in this paper only useinformation from the direct neighborhood of the authors Boongoen et al[6]have already shown that searching to a greater depth yields better results soit is useful to look at how the algorithm can be optimized to be less computa-tionally intensive in order to search to greater depths Moreover since the less

48

sophisticated Jaccard similarity surprisingly outperformed Connected Path itis worthwhile to experiment with different link analysis techniques

Finally the assumption has been made that the results from various tech-niques are independent of each other These assumptions have not been testedand it is not clear if and in what way various techniques affect each other Thereare a myriad of decisions to be made when implementing an alias resolution sys-tem that combines different techniques Therefore it is imperative that moreresearch will be done to examine the best choice of feature sets techniques andaggregation methods

49

Chapter 6

Bibliography

[1] Abbasi A and Chen H (2005) Applying authorship analysis to extremist-group Web forum messages Intelligent Systems IEEE 20(5)67ndash75

[2] Argamon S and Juola P (2011) Overview of the International Au-thorship Identification Competition at PAN-2011 In CLEF (Notebook Pa-persLabsWorkshop)

[3] Baroni M Matiasek J and Trost H (2002) Unsupervised discovery ofmorphologically related words based on orthographic and semantic similarityIn Proceedings of the ACL-02 workshop on Morphological and phonologicallearning-Volume 6 volume 6 pages 48ndash57 Morristown NJ USA Associationfor Computational Linguistics

[4] Binongo J N G (2003) Who Wrote the 15th Book of Oz An Applicationof Multivariate Analysis to Authorship Attribution Chance 16(2)9ndash17

[5] Boongoen T and Shen Q (2009) Semi-supervised OWA aggregation forlink-based similarity evaluation and alias detection 2009 IEEE InternationalConference on Fuzzy Systems pages 288ndash293

[6] Boongoen T Shen Q and Price C (2010) Disclosing false identitythrough hybrid link analysis Artificial Intelligence and Law 18(1)77ndash102

[7] Brunet E (1978) Le Vocabulaire de Jean Giraudoux structure et evolutionStatistique et informatique appliquees a letude des textes a partir des donneesdu Tresor de la langue francaise Le Vocabulaire des grands ecrivains francais1 Slatkine

[8] Burrows J (2002) rsquoDeltarsquo A measure of stylistic difference and a guide tolikely authorship Literary and Linguistic Computing 17(3)267ndash287

[9] Burrows J (2007) All the Way Through Testing for Authorship in Differ-ent Frequency Strata Literary and Linguistic Computing 22(1)27ndash47

50

[10] Chang C-c and Lin C-j (2011) LIBSVM A Library for Support VectorMachines ACM Transactions on Intelligent Systems and Technology 2(3)1ndash39

[11] Christen P (2006) A Comparison of Personal Name Matching Techniquesand Practical Issues In Sixth IEEE International Conference on Data Mining- Workshops (ICDMWrsquo06) number September pages 290ndash294 IEEE

[12] Cohen W W (2009) Enron email dataset Retrieved from httpwwwcscmuedu~enron

[13] Cohen W W and Fienberg S E (2003) A comparison of three stringmatching algorithms Methods 20(1)73ndash78

[14] Cortes C and Vapnik V (1995) Support-vector networks MachineLearning 20(3)273ndash297

[15] Crammer K and Singer Y (2002) On the Algorithmic Implementationof Multiclass Kernel-based Vector Machines Journal of Machine LearningResearch 2(2)265ndash292

[16] de Vel O (2000) Mining e-mail authorship Proc Workshop on TextMining ACM International

[17] de Vel O Anderson A Corney M and Mohay G (2001) Mining e-mailcontent for author identification forensics ACM SIGMOD Record 30(4)55

[18] Duan K-b and Keerthi S S (2005) Which Is the Best Multiclass SVMMethod An Empirical Study Multiple Classifier Systems 3541278ndash285

[19] Fan R-e Chen P-h and Lin C-j (2005) Working Set Selection UsingSecond Order Information for Training Support Vector Machines Journal ofMachine Learning Research 61889ndash1918

[20] Feiguina O and Hirst G (2007) Authorship attribution for small textsLiterary and forensic experiments In Proceedings of the 30th InternationalConference of the Special Interest Group on Information Retrieval Work-shop on Plagiarism Analysis Authorship Identification and Near-DuplicateDetection (SIGIR) pages 3ndash6

[21] FERC (2012) Information released in enron investigation Re-trieved from httpwwwfercgovindustrieselectricindus-act

wecenroninfo-releaseasp

[22] Forman G (2003) An Extensive Empirical Study of Feature SelectionMetrics for Text Classification Journal of Machine Learning Research 3(7-8)1289ndash1305

[23] Friedman C and Sideli R (1992) Tolerating spelling errors during pa-tient validation Computers and biomedical research an international journal25(5)486ndash509

51

[24] Gabrilovich E and Markovitch S (2004) Text categorization with manyredundant features Using aggressive feature selection to make SVMs compet-itive with C45 In Proceedings of the twenty-first international conference onMachine learning ACM International Conference Proceeding Series page 41Banff Canada ACM

[25] Gamon M (2004) Linguistic correlates of style In Proceedings of the 20thinternational conference on Computational Linguistics - COLING rsquo04 pages611ndashes Morristown NJ USA Association for Computational Linguistics

[26] Honore A (1979) Some simple measures of richness of vocabulary Asso-ciation for Literary and Linguistic Computing Bulletin 7(2)172ndash177

[27] Hoover D L (2004) Testing Burrowsrsquos Delta Literary and LinguisticComputing 19(4)453ndash476

[28] Hsiung P Moore A Neill D and Schneider J (2005) Alias Detec-tion in Link Data Sets In Proceedings of the International Conference onIntelligence Analysis (2005) volume 4

[29] Iqbal F Binsalleeh H Fung B C and Debbabi M (2010) Miningwriteprints from anonymous e-mails for forensic investigation Digital Inves-tigation 7(1-2)56ndash64

[30] Iqbal F Binsalleeh H Fung B C and Debbabi M (2011) A unifieddata mining solution for authorship analysis in anonymous textual commu-nications Information Sciences

[31] Iqbal F Hadjidj R Fung B C and Debbabi M (2008) A novelapproach of mining write-prints for authorship attribution in e-mail forensicsDigital Investigation 5S42ndashS51

[32] Jaro M A (1995) Probabilistic linkage of large public health data filesStatistics in Medicine 14(5-7)491ndash498

[33] Jeh G (2002) SimRank a measure of structural-context similarity Pro-ceedings of the eighth ACM SIGKDD international conference on Knowledgediscovery and data mining pages 1ndash11

[34] Joachims T (1998) Text categorization with support vector machinesLearning with many relevant features Machine Learning 1398(23)137ndash142

[35] Johnson M A (2009) Svmnet 163 Retrieved from httpwww

matthewajohnsonorgsoftwaresvmhtml

[36] Kern R Seifert C Zechner M and Granitzer M (2011) Vote VetoMeta-Classifier for Authorship Identification In Notebook for PAN at CLEF2011

52

[37] Koppel M (2003) Exploiting stylistic idiosyncrasies for authorship attri-bution In Proceedings of the 2003 International Joint Conferences on Artifi-cial Intelligence Workshop on Computational Approaches to Style Analysisand Synthesis number 2000 pages 69ndash72 Acapulco Mexico

[38] Koppel M and Akiva N (2003) A corpus-independent feature set forstyle-based text categorization In Proceedings of IJCAIrsquo03 Workshop onComputational Approaches to Style Analysis and Synthesis Acapulco Mex-ico

[39] Koppel M Gan R and Messeri E (2006) Authorship Attribution withThousands of Candidate Authors In Proceedings of the 29th annual interna-tional ACM SIGIR conference on Research and Development in InformationRetrieval pages 659ndash660 Seattle Washington USA ACM

[40] Koppel M Schler J and Argamon S (2010) Authorship attribution inthe wild Language Resources and Evaluation 45(1)83ndash94

[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx K (2010) Scalability Issues in Authorship Attribution Schaal-baarheid bij Auteursherkenning PhD thesis Universiteit Antwerpen

[45] Luyckx K Daelemans W Hamilton A and Madison J (2008) Au-thorship Attribution and Verification with Many Authors and Limited DataIn Proceedings of the 22nd International Conference on Computational Lin-guistics COLING 08 (2008) number August pages 513ndash520 Association forComputational Linguistics

[46] Manning C D Raghavan P and Schutze H (2009) An Introduction toInformation Retrieval Number c Cambridge University Press CambridgeEngland 2009 edition

[47] Mendenhall T C (1887) The Characteristic Curves of Composition Sci-ence (New York NY) 9(214S)237ndash46

[48] Mendenhall T C (1901) A Mechanical Solution of a Literary problemPopular Science Monthly 60(2)97ndash105

53

[49] Miller (1995) WordNet A Lexical Database Communications of theACM 38(11)39ndash41

[50] Monge A E and Elkan C P (1996) The field matching problem Algo-rithms and applications In Evangelos S Jiawei H and Usama F editorsProceedings of the Second International Conference on Knowledge Discoveryand Data Mining number Slaven 1992 pages 267ndash270 Menlo Park Califor-nia The AAAI Press Menlo Park California

[51] Mosteller F and Wallace D L (1964) Inference and Disputed AuthorshipThe Federalist The David Hume Series of Philosophy and Cognitive ScienceReissues Addison-Wesley

[52] Navarro G (2001) A guided tour to approximate string matching ACMComputing Surveys 33(1)31ndash88

[53] Odell M and Russell R (1918) The soundex coding system US Patents1261167

[54] Page L Brin S Motwani R and Winograd T (1998) The PageRankCitation Ranking Bringing Order to the Web Technical report StanfordInfoLab

[55] Reuther P and Walter B (2006) Survey on test collections and techniquesfor personal name matching Int J Metadata Semant Ontologies 1(2)89ndash99

[56] Rifkin R and Klautau A (2004) In Defense of One-Vs-All ClassificationJournal of Machine Learning Research 5101ndash141

[57] Sanderson C and Guenter S (2006) Short Text Authorship Attributionvia Sequence Kernels Markov Chains and Author Unmasking An Inves-tigation In Proceedings of the 2006 Conference on Empirical Methods inNatural Language Processing number July pages 482ndash491 Association forComputational Linguistics

[58] Shetty J and Adibi J (2004) The Enron Email Dataset DatabaseSchema and Brief Statistical Report Technical report Information SciencesInstitute

[59] Sichel H S (1986) Word Frequency Distributions and Type-Token Char-acteristics The Mathematical Scientist 1145ndash72

[60] Simeon M and Hilderman R (2009) An Empirical Study of CategorySkew on Feature Selection for Text Categorization In Proceedings of the22nd Canadian Conference on Artificial Intelligence Advances in ArtificialIntelligence pages 249ndash252 Springer-Verlag Berlin

[61] Small H (1973) Co-citation in the Scientific Literature A New Measure ofthe Relationship Between Two Documents Journal of the American Societyfor Information Science pages 265ndash269

54

[62] Solorio T and Pillay S (2011) Authorship Identification with ModalitySpecific Meta Features In Notebook for PAN at CLEF 2011

[63] Stamatatos E (2009) A survey of modern authorship attribution methodsJournal of the American Society for Information Science and Technology60(3)538ndash556

[64] Tanguy L Urieli A Calderone B Hathout N and Sajous F (2011)A multitude of linguistically-rich features for authorship attribution In Note-book for PAN at CLEF 2011

[65] Tearle M Taylor K and Demuth H (2008) An algorithm for auto-mated authorship attribution using neural networks Literary and LinguisticComputing 23(4)425ndash442

[66] Tsuboi Y and Matsumoto Y (2002) Authorship identification for Het-eregeneous Documents IPSJ SIG Notes pages 17ndash24

[67] Winkler W E (1999) The State of Record Linkage and Current ResearchProblems Statistical Research Division US Census Bureau pages 1ndash15

[68] Yang H (2003) Margin Variations in Support Vector Regression for theStock Market Prediction PhD thesis The Chinese University of Hong Kong

[69] Yule G U (1944) The statistical study of literary vocabulary CambridgeUniversity Press

[70] Zhao Y and Zobel J (2005) Effective and Scalable Authorship Attribu-tion Using Function Words Information Retrieval Technology Proceedings3689174ndash189

[71] Zheng R Li J Chen H and Huang Z (2005) A Framework forAuthorship Identification of Online Messages Writing-Style Features andJournal of the American Society for Information Science 57(3)378ndash393

55

Appendix

List of function words used in authorship SVM

aaboutaboveafterallalthoughamamonganandanotheranyanybodyanyoneanythingarearoundasatbebecausebeforebehindbelowbesidebetween

bothbutbycancosdodowneacheitherenougheveryeverybodyeveryoneeverythingfewfollowingforfromhaveheherhimiifinincluding

insideintoisititslatterlesslikelittlelotsmanymemoremostmuchmustmynearneedneithernonobodynonenornothingof

offononceoneontooppositeorouroutsideoverownpastperplentyplusregardingsameseveralsheshouldsincesosomesomebodysomeonesomething

suchthanthatthetheirthemthesetheythisthosethoughthroughtilltotowardtowardsunderunlessunlikeuntilupuponususedviawe

whatwhateverwhenwherewhetherwhichwhilewhowhoeverwhomwhosewillwithwithinwithoutworthwouldyesyouyour

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 37: Thesis Freek Maes - Final Version

90 100 110 120 130 140 150 160 170 180 190 200 210 220 2300

5

10

15

20

25

30

35

Num

ber

of a

utho

rs

Number of emails

Figure 33 The distribution of email messages per author

10000 100000 1000000 100000000

20

40

60

80

100

120

140

160

180

Num

ber

of a

utho

rs

Total number of words

Figure 34 The distribution of the total number of words per author

33

40enronen

rickbuye

fazul_adba

ehaedick

markpalme

sheikh_ahm

abdul_rahm

fcalger

hasan_izz

paulkaufm

abu_islam

faggoud_ya

mscottetheresast

al-hourie

keeganfar

janetteel

mpresto

timbelden

paulybarboenro

rodhaysle

kathleenc

thodgee

taffymill

hollykeis

marcusnet

shonnieda

ahmad_ibra

travismcc

ali_atwa

kayyoung

tracygeac

40enronen

bakr_ahmad

jasonwill

michellel

amartin

brianredm

jimschwie

rodhayslerodhaysle

shonawils

pattithom

40enronen mikegrigs

stephanie

ahoward

lindydono

nabith_hen

ahmed_garbdannymcca

ericgadd

kevinhyat

hasan_izz-

kevinhyatkevinhyat

ashraf_ref

samir_salw

ipayitenr

jfarmer

kimberlyh

karenbuck

ashankma

sheilaglo alanaronoandyzippe

jkeanen

davidoxle

gregwhallrickbuye

jkaminski

lindarobe

abdul_rahm

susanmara

alancomne

daviddela

nu_manenr

markschro

lisayohorayalvare

yasein_tah

janelguer

harryking

leslielaw

susanmara

maureenmc

lnicolay

suenorde

sarahnovo

johnshelk

jennifert

daviddela

susanmara

karendenn

staceybol

mschmidt

kevinpres

daviddelachristini

maryhain

josephala

mohamed_at

mdaygmssr

rosaleefl

christian

karendenn

sgovenarg

christophe

jbennettg

michaeltr

chrisfost

kristinwa

mikegrigs

phillipal

mikegrigs

phillipal

miyungbus

abu_fatima

benjacoby

huntershi

gfergusbr

lisamelle

abd_al_wak

mscotte

mscottemtholte

stephanie

vweldone

abu_abdall

melissamu

bin_laden

davidport

beckyspen

sharistac

taylorenr

taylorenr

brenthend

hassan_ros

rhondaden

gregpiper

kimberlyh

mforney

carolst

justinboy

davidminn

bobshults stephanie

jeanmrha

susanpere

sandrabra

brittdavitorikuyke

jonathanm

fletchers

randallga

brantreve

the_teache

stacydick

ahmed_khal

janetholt

kevinrusc

kimwarde

mikecarso

martincui

patricemi

ustaz

kimwarde

reaganror

samuelsch

danadavis

tammiesch

mprestokallene

peterkeoh

sharencas

ahmed_shie

stephanie

dthomas

cgironemloveen

anas_al-sa

lmimsenabu_yussrr

wwhitee

abu_khadij

abu_omran

tanyaroha

stuartzis

robertbrususanbail

suzannead

samanthabstephanie

muhammad_a

ali_saed_b

russelldi

abu_seif_a

kennethth

ashraf_ref

markwhitt

outlookte

philliplo

muhamad_ib

the_emirglenhass

jefferyfa

lorrainel

johnbucha

jerrygrav

dennisleedarrellsc

juliearms

cindystar

veronicae

markmccon

bsanders

joanniewi

rosaleefl

billiiie

twandaswe

carasempe

ammar_mans

fouad_moha

sherriser

annschmid

audreyrob

philliplo

michellel

billrapp

tluccie

kimwarde

kaychapma

stevehoos

ericgilla

ccampbell

monikacau

juanherna

lavoratoe

lavoratoe

jkaminski

meganpark

cameronpenancysell

cheryljoh

robertcot

continenta

mpresto

dkinneyco

douglassa

enron_upda

enron_upda

enron_upda

sheikh_bah

ruthconca

exchangein

exchangein

saif_al-adil

foolmotle

gelliottimujahid_sh

infopmaco

sheikh_swe

marypoorm

jkaminskishirleycr

joeparksjoeparks

kaminskie

mforney

robinrodr kerrithom

khalid_mou

lgoldseth

liztaylor

robinrodr

marketing

philliplo

masterama

memberserv

messenger

mjones7tx

mjones7tx

mustafa_mu

navigator

newsreal-newsletter

noreplycc

noreplycc

nytdirect

perfmgmte

perfmgmte

robgayen

robinrodrsafa_tabah

sylviahu

truorange

webmaster

Figure 35 A network graph of the authors in the subset of the ENRON dataset

34

Type of Alias Number of authors

High Jaro-Winkler with 1 alias 26High Jaro-Winkler with 2 aliases 15Low Jaro-Winkler with 1 alias 11

Low Jaro-Winkler with 2 aliases 1No Alias 193

Table 32 Artificial Aliases in the ENRON data set by type

Test set Mixed Hard

High Jaro-Winkler 6 2Low Jaro-Winkler 8 16

No alias 6 2

Table 33 Distribution of alias-types in two different test sets

bull Authors with 1 or more artificial aliases with a high Jaro-Winkler simi-larity (eg johndoeenroncomA amp johndoeenroncomB)

bull Aliases with 1 or more artificial aliases with a low Jaro-Winkler similarity(eg bin laden amp abu abdallah)

bull Authors without an alias

The distribution of authors and aliases in the final data set can be seen in moredetail in table 32The number of authors including aliases in the final data setequaled 315

Test sets

In order to evaluate the results of the different techniques two different test setshave been created The first test set called the mixed test set has a fairly equaldivision of alias types as can be seen in table 33 The second test set calledthe hard test set is substantially more difficult since the majority of the aliasesare not easy to recognize by their email addresses The authors in each test setwere chosen at random from their respective alias categories

32 Individual Techniques

The first technique whose performance has been evaluated on the ENRON sub-set is the Jaro-Winkler similarity For each author in the test set the Jaro-Winkler similarity of that authorrsquos email address to that of each other authorhas been calculated If the Jaro-Winkler score of a particular author-authorpair is above a certain threshold the two authors are considered to be aliases

35

The precision and recall for different decision thresholds has been measured us-ing the test sets of table 33 This technique will hereafter be referred to asrdquoJaro-Winklerrdquo or rdquoJWrdquo

The second technique that has been evaluated is the Connected Path methodhereafter referred to as rdquoConnected Pathrdquo or rdquoCPrdquo For each author in the testset a Connected Path search to depth 3 has been performed In order to ensurethat the scores were in the range of [0 1] the score for a particular author-authorpair was calculated as follows

ConnectedPath(vi vj) =ConnectedPath(vi vj)

ConnectedPathmax(31)

where ConnectedPathmax is the maximum similarity scores found for any twoauthors in the data set

The third technique that has been tested is Jaccard similarity hereafterreferred to as rdquoJaccardrdquo For each author in the test set the Jaccard similaritybetween that authors neighbors and each other authors neighbors has beencalculated Note that authors that have been removed from the data set duringthe preprocessing steps described in section 31 do not occur in the neighborhoodof their correspondents anymore and do not contribute towards the Jaccardsimilarity score The same principle applies to the Connected Path score

The last individual technique that has been evaluated is the use of SVMon email content hereafter referred to as rdquoauthorship SVMrdquo or simply rdquoSVMrdquoThe first decision that had to be made was whether to treat the problem froman instance-based or profile-based perspective Since it is possible that authorsemploy different writing styles when writing to different contacts it is importantto retain the differences in each email message For example an author can usedifferent words when writing to friends instead of colleagues or he can uselonger sentences when writing to superiors instead of writing to subordinatesMoreover Hirst and Feiguina [20] conclude that using multiple short texts forauthorship attribution overcomes the problem of not having sufficiently longtraining texts available Therefore an instance-based approach to authorshipattribution has been adopted A combination of lexical syntactic and structuralfeatures has been adapted from [71] and extended with a number of additionalfeatures to create a larger overall feature variance The complete feature setthat has been used in the authorship SVMrsquos can be found in table 34 Thelist of function words that has been used in the feature set can be found in theappendix

Based on the experiment shown in figure 32 it was determined that theauthorship SVMrsquos should be trained using a Radial Basis Function-kernel sinceits overall performance was better than that of the linear kernel The parameterC influences the penalty associated with classification errors whereas γ controlsthe shape of the separating hyper plane In order to find optimal values of Cand γ a straightforward grid-search has been performed using exponentiallygrowing sequences of C and γ Specifically the accuracy all combinations ofC = 2minus5 2minus3 2minus1 215 and γ = 2minus15 2minus13 2minus11 23 is calculated using

36

Features Description

Lexical1 Total number of characters (C)2 Total number of alphabetic characters C3 Total number of upper-case characters C4 Total number of digit characters C5 Total number of white-space characters C6 Total number of tab spaces C

7-32 Frequency of letters A-Z33-53 Frequency of special characters ~$^amp-_=+gtlt[]|

54 Total number of words (M)55 Total number of short words M less than four characters56 Total number of characters in words C57 Average word length58 Average sentence length (in characters)59 Average sentence length (in words)60 Total different words M61 Hapax legomena Frequency of once-occurring words62 Hapax dislegomena Frequency of twice-occurring words

63-82 Word length frequency distribution M83-333 TFIDF of 250 most frequent 3-grams

Syntactic334-341 Frequency of punctuation rsquo rdquo342-491 Frequency of function words

Structural492 Total number of sentences

Table 34 Feature set for the authorship SVM

37

5 times 5-fold cross validation for each authorship SVM The highest scoringcombination of parameters is then chosen to train the actual SVM model

The authorship attribution problem is a multi-class problem because a giventext has to be attributed to one out of multiple candidate authors Since ordi-nary SVMrsquos can only solve binary classification problems a multi-class approachusing one-versus-all classification has been adopted In the one-versus-all ap-proach a single authorship SVM is trained on positive training instances fromone author and negative instances from all the other authors That is theauthorship SVM makes a classification whether a given text has been writtenby one particular author or not Once a separate SVM has been trained forevery author a given text can be classified by letting each authorship SVMassign a probability to the text being written by that author Rifkin and Klau-tau [56] show that as long as a good binary classifier is used it makes littledifference which multi-class scheme is used Therefore a simple scheme such asone-versus-all is preferable over more complex schemes such as error-correctingcodes

Since SVM is sensitive to class imbalances the authorship SVMrsquos are trainedusing an equal amount of positive and negative training instances In order tomake sure that the negative class is a fairly accurate representation of all theother authors emails have been selected at random from other authors For eachauthor all the authorrsquos emails are selected as positive examples and an equalamount of negative emails are used for the negative class The software thathas been used for the authorship SVMrsquos is called SVMNET [35] It is a cleanC-conversion by Matthew A Johnson of the popular LibSVM software suite[10] SVMNET uses the Sequential Minimal Optimization-algorithm describedin Fan et al [19] and is able to handle classification regression and distribu-tion estimation for single and multi-class problems using different kernels andparameters

33 Combinations of Techniques

In order to test whether a combination of techniques that operate on differentdomains will perform better than these techniques individually two differentcombinations of techniques have been tested

bull JW-CP-SVM Jaro-Winkler similarity of email addresses Connected Pathsimilarity of the link network (depth = 3) amp authorship SVM on emailcontent

bull JW-Jaccard-SVM Jaro-Winkler similarity of email addresses Jaccardsimilarity of direct neighbors in the link network amp authorship SVM onemail content

Each combination of techniques is realized by using a voting algorithm thatis based on a Support Vector Machine which will specifically be referred toas the rdquovoting SVMrdquo The voting SVM takes as input a vector containing the

38

Figure 36 The structure of the combined approach

results of the three techniques for a single candidate author and gives as outputa prediction whether the input author is an alias or not An overview of thecombined approach can be found in figure 36

In order to create the voting SVM model a set of training instances isrequired The training set for the voting SVM has the same distribution of aliastypes as the test set depending on which test set is being used Obviouslythe authors that are used to test on are not used for training the voting SVMFor each of the 20 authors in the voting SVM training set candidate aliasesare manually labeled as positive or negative All the positively labeled aliasestogether are used as positive training instances for the voting SVM and 5 timesthe number of positive instances are randomly selected from the negativelylabeled aliases as negative training examples The reason for this class imbalanceis that the number of positive instances in the training sets is rather low 14 and18 for the mixed and hard sets respectively In order to have enough traininginstances available for the voting svm more negative examples are chosen fromthe training set

After the two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from Figure 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments will be discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set will be given. Second, the results of the individual and combined techniques on the hard test set will be given. Finally, an overview of the best results achieved by each individual and combined technique will be given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.
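For reference, the precision/recall/F1 curves in these figures can be produced by a sweep of the following form (a sketch; scores holds one technique's similarity score per candidate author-alias pair and labels the corresponding ground truth, both hypothetical inputs):

    def precision_recall_f1(scores, labels, threshold):
        # A pair is predicted to be an alias when its score meets the threshold.
        tp = sum(s >= threshold and l == 1 for s, l in zip(scores, labels))
        fp = sum(s >= threshold and l == 0 for s, l in zip(scores, labels))
        fn = sum(s < threshold and l == 1 for s, l in zip(scores, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    def sweep(scores, labels):
        # Decision thresholds 0.0, 0.05, ..., 1.0, as used in Figures 4.1-4.4.
        return [(t / 20, precision_recall_f1(scores, labels, t / 20))
                for t in range(21)]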

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


Figure 4.1: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM. (Each panel plots precision, recall and F1 against the decision threshold.)

Figure 4.2: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM. (Each panel plots precision, recall and F1 against the decision threshold.)


Figure 4.3: Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM. (Each panel plots precision, recall and F1 against the decision threshold.)

Figure 4.4: Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set: (a) JW-CP-SVM, (b) JW-Jaccard-SVM. (Each panel plots precision, recall and F1 against the decision threshold.)


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.
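In practice this check reduces to thresholding a single similarity value. The sketch below uses the third-party jellyfish package as one possible Jaro-Winkler implementation (an assumed choice, not the implementation used in this thesis); the two addresses are made-up examples:

    import jellyfish

    # 0.94 is one of the thresholds at which Jaro-Winkler scored best on the
    # mixed test set (Chapter 4); real aliases often fall well below it.
    sim = jellyfish.jaro_winkler_similarity("john.doe@enron.com",
                                            "jdoe@enron.com")
    alias_candidate = sim >= 0.94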

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2; the same behavior would be expected of Connected Path on this data set if the search were performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.
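For reference, a minimal sketch of this measure, assuming each author's direct correspondents are available as a Python set:

    def jaccard_similarity(neighbors_a, neighbors_b):
        # neighbors_a, neighbors_b: sets of direct correspondents of two authors.
        union = neighbors_a | neighbors_b
        if not union:
            return 0.0
        return len(neighbors_a & neighbors_b) / len(union)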

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often constructed manually, the results are mediocre and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, combined using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases; no such collection could be found for this research. Should such a collection indeed not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There is a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.


[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining – Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron/.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.


[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics – COLING '04, page 611, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. and Widom, J. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.


[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management – CIKM '09, page 1613, New York, NY, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.


[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.


[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.


Appendix

List of function words used in the authorship SVM:

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx K (2010) Scalability Issues in Authorship Attribution Schaal-baarheid bij Auteursherkenning PhD thesis Universiteit Antwerpen

[45] Luyckx K Daelemans W Hamilton A and Madison J (2008) Au-thorship Attribution and Verification with Many Authors and Limited DataIn Proceedings of the 22nd International Conference on Computational Lin-guistics COLING 08 (2008) number August pages 513ndash520 Association forComputational Linguistics

[46] Manning C D Raghavan P and Schutze H (2009) An Introduction toInformation Retrieval Number c Cambridge University Press CambridgeEngland 2009 edition

[47] Mendenhall T C (1887) The Characteristic Curves of Composition Sci-ence (New York NY) 9(214S)237ndash46

[48] Mendenhall T C (1901) A Mechanical Solution of a Literary problemPopular Science Monthly 60(2)97ndash105

53

[49] Miller (1995) WordNet A Lexical Database Communications of theACM 38(11)39ndash41

[50] Monge A E and Elkan C P (1996) The field matching problem Algo-rithms and applications In Evangelos S Jiawei H and Usama F editorsProceedings of the Second International Conference on Knowledge Discoveryand Data Mining number Slaven 1992 pages 267ndash270 Menlo Park Califor-nia The AAAI Press Menlo Park California

[51] Mosteller F and Wallace D L (1964) Inference and Disputed AuthorshipThe Federalist The David Hume Series of Philosophy and Cognitive ScienceReissues Addison-Wesley

[52] Navarro G (2001) A guided tour to approximate string matching ACMComputing Surveys 33(1)31ndash88

[53] Odell M and Russell R (1918) The soundex coding system US Patents1261167

[54] Page L Brin S Motwani R and Winograd T (1998) The PageRankCitation Ranking Bringing Order to the Web Technical report StanfordInfoLab

[55] Reuther P and Walter B (2006) Survey on test collections and techniquesfor personal name matching Int J Metadata Semant Ontologies 1(2)89ndash99

[56] Rifkin R and Klautau A (2004) In Defense of One-Vs-All ClassificationJournal of Machine Learning Research 5101ndash141

[57] Sanderson C and Guenter S (2006) Short Text Authorship Attributionvia Sequence Kernels Markov Chains and Author Unmasking An Inves-tigation In Proceedings of the 2006 Conference on Empirical Methods inNatural Language Processing number July pages 482ndash491 Association forComputational Linguistics

[58] Shetty J and Adibi J (2004) The Enron Email Dataset DatabaseSchema and Brief Statistical Report Technical report Information SciencesInstitute

[59] Sichel H S (1986) Word Frequency Distributions and Type-Token Char-acteristics The Mathematical Scientist 1145ndash72

[60] Simeon M and Hilderman R (2009) An Empirical Study of CategorySkew on Feature Selection for Text Categorization In Proceedings of the22nd Canadian Conference on Artificial Intelligence Advances in ArtificialIntelligence pages 249ndash252 Springer-Verlag Berlin

[61] Small H (1973) Co-citation in the Scientific Literature A New Measure ofthe Relationship Between Two Documents Journal of the American Societyfor Information Science pages 265ndash269

54

[62] Solorio T and Pillay S (2011) Authorship Identification with ModalitySpecific Meta Features In Notebook for PAN at CLEF 2011

[63] Stamatatos E (2009) A survey of modern authorship attribution methodsJournal of the American Society for Information Science and Technology60(3)538ndash556

[64] Tanguy L Urieli A Calderone B Hathout N and Sajous F (2011)A multitude of linguistically-rich features for authorship attribution In Note-book for PAN at CLEF 2011

[65] Tearle M Taylor K and Demuth H (2008) An algorithm for auto-mated authorship attribution using neural networks Literary and LinguisticComputing 23(4)425ndash442

[66] Tsuboi Y and Matsumoto Y (2002) Authorship identification for Het-eregeneous Documents IPSJ SIG Notes pages 17ndash24

[67] Winkler W E (1999) The State of Record Linkage and Current ResearchProblems Statistical Research Division US Census Bureau pages 1ndash15

[68] Yang H (2003) Margin Variations in Support Vector Regression for theStock Market Prediction PhD thesis The Chinese University of Hong Kong

[69] Yule G U (1944) The statistical study of literary vocabulary CambridgeUniversity Press

[70] Zhao Y and Zobel J (2005) Effective and Scalable Authorship Attribu-tion Using Function Words Information Retrieval Technology Proceedings3689174ndash189

[71] Zheng R Li J Chen H and Huang Z (2005) A Framework forAuthorship Identification of Online Messages Writing-Style Features andJournal of the American Society for Information Science 57(3)378ndash393

55

Appendix

List of function words used in authorship SVM

aaboutaboveafterallalthoughamamonganandanotheranyanybodyanyoneanythingarearoundasatbebecausebeforebehindbelowbesidebetween

bothbutbycancosdodowneacheitherenougheveryeverybodyeveryoneeverythingfewfollowingforfromhaveheherhimiifinincluding

insideintoisititslatterlesslikelittlelotsmanymemoremostmuchmustmynearneedneithernonobodynonenornothingof

offononceoneontooppositeorouroutsideoverownpastperplentyplusregardingsameseveralsheshouldsincesosomesomebodysomeonesomething

suchthanthatthetheirthemthesetheythisthosethoughthroughtilltotowardtowardsunderunlessunlikeuntilupuponususedviawe

whatwhateverwhenwherewhetherwhichwhilewhowhoeverwhomwhosewillwithwithinwithoutworthwouldyesyouyour

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 39: Thesis Freek Maes - Final Version

Type of Alias                         Number of authors

High Jaro-Winkler with 1 alias        26
High Jaro-Winkler with 2 aliases      15
Low Jaro-Winkler with 1 alias         11
Low Jaro-Winkler with 2 aliases       1
No alias                              193

Table 3.2: Artificial aliases in the ENRON data set, by type

Type of alias          Mixed    Hard

High Jaro-Winkler      6        2
Low Jaro-Winkler       8        16
No alias               6        2

Table 3.3: Distribution of alias types in the two test sets

• Authors with 1 or more artificial aliases with a high Jaro-Winkler similarity (e.g. john.doe@enron.comA & john.doe@enron.comB)

• Authors with 1 or more artificial aliases with a low Jaro-Winkler similarity (e.g. bin laden & abu abdallah)

• Authors without an alias

The distribution of authors and aliases in the final data set can be seen in more detail in table 3.2. The number of authors in the final data set, including aliases, equaled 315.

Test sets

In order to evaluate the results of the different techniques, two different test sets have been created. The first test set, called the mixed test set, has a fairly equal division of alias types, as can be seen in table 3.3. The second test set, called the hard test set, is substantially more difficult, since the majority of its aliases are not easy to recognize by their email addresses. The authors in each test set were chosen at random from their respective alias categories.

3.2 Individual Techniques

The first technique whose performance has been evaluated on the ENRON subset is the Jaro-Winkler similarity. For each author in the test set, the Jaro-Winkler similarity of that author's email address to that of each other author has been calculated. If the Jaro-Winkler score of a particular author-author pair is above a certain threshold, the two authors are considered to be aliases. The precision and recall for different decision thresholds have been measured using the test sets of table 3.3. This technique will hereafter be referred to as "Jaro-Winkler" or "JW".
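For illustration, this step can be sketched as follows in Python. This is a minimal re-implementation of the standard Jaro-Winkler metric, not the thesis software; the function names and the example threshold are illustrative.

    def jaro(s1, s2):
        # Jaro similarity: matching characters within a sliding window,
        # corrected for transpositions, averaged over both strings.
        if s1 == s2:
            return 1.0
        len1, len2 = len(s1), len(s2)
        if len1 == 0 or len2 == 0:
            return 0.0
        window = max(len1, len2) // 2 - 1
        match1, match2 = [False] * len1, [False] * len2
        matches = 0
        for i, c in enumerate(s1):
            for j in range(max(0, i - window), min(len2, i + window + 1)):
                if not match2[j] and s2[j] == c:
                    match1[i] = match2[j] = True
                    matches += 1
                    break
        if matches == 0:
            return 0.0
        half_transpositions, j = 0, 0
        for i in range(len1):
            if match1[i]:
                while not match2[j]:
                    j += 1
                if s1[i] != s2[j]:
                    half_transpositions += 1
                j += 1
        t = half_transpositions / 2
        return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

    def jaro_winkler(s1, s2, p=0.1):
        # Winkler's variant boosts pairs sharing a prefix of up to 4 characters.
        sim = jaro(s1, s2)
        prefix = 0
        for a, b in zip(s1, s2):
            if a != b or prefix == 4:
                break
            prefix += 1
        return sim + prefix * p * (1 - sim)

    # Two addresses are flagged as aliases when the score exceeds the threshold.
    print(jaro_winkler("john.doe@enron.com", "jon.doe@enron.com") > 0.94)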

The second technique that has been evaluated is the Connected Path method, hereafter referred to as "Connected Path" or "CP". For each author in the test set, a Connected Path search to depth 3 has been performed. In order to ensure that the scores were in the range [0, 1], the score for a particular author-author pair was calculated as follows:

    ConnectedPath_norm(v_i, v_j) = ConnectedPath(v_i, v_j) / ConnectedPath_max        (3.1)

where ConnectedPath_max is the maximum similarity score found for any two authors in the data set.
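A hedged sketch of this scoring, assuming the networkx library for the email link network. The per-path weighting used here, with every simple path of length L ≤ 3 contributing 1/L, is an assumption for illustration and not the exact weighting of [6]:

    import networkx as nx  # assumed helper library; not used in the thesis

    def connected_path(g, vi, vj, depth=3):
        # Sum a contribution from every simple path of length <= depth,
        # weighting shorter paths more heavily (assumed weighting).
        raw = 0.0
        for path in nx.all_simple_paths(g, vi, vj, cutoff=depth):
            raw += 1.0 / (len(path) - 1)
        return raw

    def normalize(scores):
        # Equation (3.1): divide by the maximum score found in the data set.
        cp_max = max(scores.values(), default=1.0) or 1.0
        return {pair: s / cp_max for pair, s in scores.items()}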

The third technique that has been tested is Jaccard similarity, hereafter referred to as "Jaccard". For each author in the test set, the Jaccard similarity between that author's neighbors and each other author's neighbors has been calculated. Note that authors that have been removed from the data set during the preprocessing steps described in section 3.1 no longer occur in the neighborhood of their correspondents, and do not contribute towards the Jaccard similarity score. The same principle applies to the Connected Path score.
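The Jaccard step amounts to a set computation on the direct neighborhoods; a minimal sketch, assuming neighbors are given as sets of author identifiers (the identifiers in the usage line are illustrative):

    def jaccard(neighbors_a, neighbors_b):
        # |intersection| / |union| of the two neighbor sets.
        union = neighbors_a | neighbors_b
        if not union:
            return 0.0
        return len(neighbors_a & neighbors_b) / len(union)

    print(jaccard({"kay.mann", "tana.jones"}, {"kay.mann", "sara.shackleton"}))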

The last individual technique that has been evaluated is the use of an SVM on email content, hereafter referred to as "authorship SVM" or simply "SVM". The first decision that had to be made was whether to treat the problem from an instance-based or a profile-based perspective. Since it is possible that authors employ different writing styles when writing to different contacts, it is important to retain the differences in each email message. For example, an author can use different words when writing to friends instead of colleagues, or he can use longer sentences when writing to superiors instead of subordinates. Moreover, Feiguina and Hirst [20] conclude that using multiple short texts for authorship attribution overcomes the problem of not having sufficiently long training texts available. Therefore, an instance-based approach to authorship attribution has been adopted. A combination of lexical, syntactic and structural features has been adapted from [71] and extended with a number of additional features to create a larger overall feature variance. The complete feature set that has been used in the authorship SVMs can be found in table 3.4. The list of function words that has been used in the feature set can be found in the appendix.

Based on the experiment shown in figure 3.2, it was determined that the authorship SVMs should be trained using a Radial Basis Function kernel, since its overall performance was better than that of the linear kernel. The parameter C influences the penalty associated with classification errors, whereas γ controls the shape of the separating hyperplane. In order to find optimal values of C and γ, a straightforward grid search has been performed using exponentially growing sequences of C and γ. Specifically, the accuracy of all combinations of C = 2^-5, 2^-3, 2^-1, ..., 2^15 and γ = 2^-15, 2^-13, 2^-11, ..., 2^3 is calculated using 5 × 5-fold cross-validation for each authorship SVM. The highest-scoring combination of parameters is then chosen to train the actual SVM model.
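The grid search can be sketched as follows; scikit-learn stands in here for the SVM.NET/LibSVM stack that was actually used, so the API is an assumption of this sketch:

    from sklearn.model_selection import GridSearchCV, RepeatedKFold
    from sklearn.svm import SVC

    param_grid = {
        "C":     [2.0 ** e for e in range(-5, 16, 2)],   # 2^-5, 2^-3, ..., 2^15
        "gamma": [2.0 ** e for e in range(-15, 4, 2)],   # 2^-15, 2^-13, ..., 2^3
    }
    cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)  # 5 x 5-fold
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="accuracy", cv=cv)
    # search.fit(X, y); SVC(kernel="rbf", **search.best_params_) is then trained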

Features     Description

Lexical
1            Total number of characters (C)
2            Total number of alphabetic characters / C
3            Total number of upper-case characters / C
4            Total number of digit characters / C
5            Total number of white-space characters / C
6            Total number of tab spaces / C
7-32         Frequency of letters A-Z
33-53        Frequency of special characters, e.g. ~ $ ^ & - _ = + > < [ ] |
54           Total number of words (M)
55           Total number of short words (less than four characters) / M
56           Total number of characters in words / C
57           Average word length
58           Average sentence length (in characters)
59           Average sentence length (in words)
60           Total different words / M
61           Hapax legomena: frequency of once-occurring words
62           Hapax dislegomena: frequency of twice-occurring words
63-82        Word length frequency distribution / M
83-333       TF-IDF of 250 most frequent 3-grams

Syntactic
334-341      Frequency of punctuation marks , . ? ! : ; ' "
342-491      Frequency of function words

Structural
492          Total number of sentences

Table 3.4: Feature set for the authorship SVM
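As an illustration of how such a vector is built, the following sketch extracts a few of the lexical features of table 3.4 from a raw email body; the full 492-dimensional vector is not reproduced, and the tokenization is an assumption:

    import re
    from collections import Counter

    def some_lexical_features(text):
        n = len(text)                              # feature 1: C
        words = re.findall(r"[A-Za-z']+", text)    # assumed word tokenization
        freq = Counter(w.lower() for w in words)
        return {
            "alpha_ratio": sum(c.isalpha() for c in text) / max(n, 1),      # 2
            "upper_ratio": sum(c.isupper() for c in text) / max(n, 1),      # 3
            "num_words": len(words),                                        # 54: M
            "avg_word_length": sum(map(len, words)) / max(len(words), 1),   # 57
            "hapax_legomena": sum(1 for v in freq.values() if v == 1),      # 61
            "hapax_dislegomena": sum(1 for v in freq.values() if v == 2),   # 62
        }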


The authorship attribution problem is a multi-class problem, because a given text has to be attributed to one out of multiple candidate authors. Since ordinary SVMs can only solve binary classification problems, a multi-class approach using one-versus-all classification has been adopted. In the one-versus-all approach, a single authorship SVM is trained on positive training instances from one author and negative instances from all the other authors. That is, the authorship SVM classifies whether a given text has been written by one particular author or not. Once a separate SVM has been trained for every author, a given text can be classified by letting each authorship SVM assign a probability to the text being written by that author. Rifkin and Klautau [56] show that, as long as a good binary classifier is used, it makes little difference which multi-class scheme is used. Therefore, a simple scheme such as one-versus-all is preferable over more complex schemes such as error-correcting codes.

Since SVM is sensitive to class imbalances, the authorship SVMs are trained using an equal amount of positive and negative training instances. In order to make sure that the negative class is a fairly accurate representation of all the other authors, emails have been selected at random from other authors. For each author, all the author's emails are selected as positive examples, and an equal amount of emails from other authors is used for the negative class. The software that has been used for the authorship SVMs is called SVM.NET [35]. It is a clean C# conversion by Matthew A. Johnson of the popular LibSVM software suite [10]. SVM.NET uses the Sequential Minimal Optimization algorithm described in Fan et al. [19] and is able to handle classification, regression and distribution estimation for single and multi-class problems, using different kernels and parameters.
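A sketch of this one-versus-all training and attribution loop, again with scikit-learn standing in for SVM.NET, and assuming `emails` maps each author to a list of feature vectors:

    import random
    import numpy as np
    from sklearn.svm import SVC

    def train_one_vs_all(emails):
        models = {}
        for author, pos in emails.items():
            pool = [v for a, vs in emails.items() if a != author for v in vs]
            neg = random.sample(pool, len(pos))      # balanced negative class
            X = np.array(pos + neg)
            y = np.array([1] * len(pos) + [0] * len(neg))
            models[author] = SVC(kernel="rbf", probability=True).fit(X, y)
        return models

    def most_likely_author(models, x):
        # each per-author SVM assigns a probability that the author wrote x
        return max(models, key=lambda a: models[a].predict_proba([x])[0, 1])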

3.3 Combinations of Techniques

In order to test whether a combination of techniques that operate on different domains will perform better than these techniques individually, two different combinations of techniques have been tested:

• JW-CP-SVM: Jaro-Winkler similarity of email addresses, Connected Path similarity of the link network (depth = 3) & authorship SVM on email content

• JW-Jaccard-SVM: Jaro-Winkler similarity of email addresses, Jaccard similarity of direct neighbors in the link network & authorship SVM on email content

Each combination of techniques is realized by using a voting algorithm that is based on a Support Vector Machine, which will specifically be referred to as the "voting SVM". The voting SVM takes as input a vector containing the results of the three techniques for a single candidate author, and gives as output a prediction whether the input author is an alias or not. An overview of the combined approach can be found in figure 3.6.

[Figure 3.6: The structure of the combined approach]
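A minimal sketch of this voting step; the feature rows are made-up numbers, scikit-learn stands in for the thesis software, and the raw SVM decision value is thresholded here instead of the calibrated probability used in the thesis:

    from sklearn.svm import SVC

    X_train = [[0.95, 0.40, 0.81],   # [Jaro-Winkler, link score, authorship SVM]
               [0.97, 0.55, 0.90],
               [0.31, 0.02, 0.12],
               [0.45, 0.10, 0.33]]
    y_train = [1, 1, 0, 0]           # 1 = alias, 0 = not an alias
    voter = SVC(kernel="rbf").fit(X_train, y_train)
    score = voter.decision_function([[0.88, 0.35, 0.77]])[0]
    print("alias" if score > 0 else "not an alias")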

In order to create the voting SVM model, a set of training instances is required. The training set for the voting SVM has the same distribution of alias types as the test set, depending on which test set is being used. Obviously, the authors that are used to test on are not used for training the voting SVM. For each of the 20 authors in the voting SVM training set, candidate aliases are manually labeled as positive or negative. All the positively labeled aliases together are used as positive training instances for the voting SVM, and 5 times the number of positive instances are randomly selected from the negatively labeled aliases as negative training examples. The reason for this class imbalance is that the number of positive instances in the training sets is rather low: 14 and 18 for the mixed and hard sets, respectively. In order to have enough training instances available for the voting SVM, more negative examples are chosen from the training set.

After two voting SVMs have been trained, one using Jaccard similarity and one using Connected Path similarity, the test sets from table 3.3 are used to determine the precision and recall for various decision thresholds.


Chapter 4

Results

The results that have been obtained from the various experiments are discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set are given. Second, the results of the individual and combined techniques on the hard test set are given. Finally, an overview of the best results achieved by each individual and combined technique is given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88, using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: 0.0, 0.05, 0.10, ..., 1.0. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best results of all the techniques on this test set, namely an F1-score of 0.89, using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.
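The evaluation procedure behind these curves can be sketched as a threshold sweep over scored candidate pairs; `pairs` is assumed to hold (score, is_true_alias) tuples:

    def sweep(pairs):
        # precision, recall and F1 at thresholds 0.0, 0.05, ..., 1.0
        for i in range(21):
            t = i * 0.05
            tp = sum(1 for s, alias in pairs if s > t and alias)
            fp = sum(1 for s, alias in pairs if s > t and not alias)
            fn = sum(1 for s, alias in pairs if s <= t and alias)
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * p * r / (p + r) if p + r else 0.0
            yield t, p, r, f1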

[Figure 4.1: Precision, recall and F1 calculated using various decision thresholds, for the individual techniques on the mixed test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM; each panel plots precision, recall and F1 against the decision threshold (0 to 1).]

[Figure 4.2: Precision, recall and F1 calculated using various decision thresholds, for the combined techniques on the mixed test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

[Figure 4.3: Precision, recall and F1 calculated using various decision thresholds, for the individual techniques on the hard test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM.]

[Figure 4.4: Precision, recall and F1 calculated using various decision thresholds, for the combined techniques on the hard test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]

[Figure 4.5: Best results on the mixed test set for the different techniques. Precision and recall values correspond to the given F1-scores.]

[Figure 4.6: Best results on the hard test set for the different techniques. Precision and recall values correspond to the given F1-scores.]

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used for resolving aliases and disambiguating authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4, compared with a search to depth 2. It is expected that the same behavior of Connected Path could be observed on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If there was no Connected Path score returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.

How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of the different techniques. Since the weights are often constructed manually, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if and in what way the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research is done to examine the best choice of feature sets, techniques and aggregation methods.

Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67-75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48-57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9-17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288-293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77-102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267-287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27-47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1-39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290-294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73-78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273-297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265-292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278-285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889-1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval: Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3-6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289-1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486-509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611-es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172-177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453-476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56-64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42-S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491-498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1-11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137-142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69-72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659-660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83-94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019-1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019-1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING '08), pages 513-520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237-46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97-105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39-41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267-270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88.

[53] Odell, M. and Russell, R. (1918). The soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89-99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101-141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482-491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45-72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249-252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265-269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538-556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425-442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17-24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1-15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. In Information Retrieval Technology, volume 3689, pages 174-189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378-393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 40: Thesis Freek Maes - Final Version

The precision and recall for different decision thresholds has been measured us-ing the test sets of table 33 This technique will hereafter be referred to asrdquoJaro-Winklerrdquo or rdquoJWrdquo

The second technique that has been evaluated is the Connected Path methodhereafter referred to as rdquoConnected Pathrdquo or rdquoCPrdquo For each author in the testset a Connected Path search to depth 3 has been performed In order to ensurethat the scores were in the range of [0 1] the score for a particular author-authorpair was calculated as follows

ConnectedPath(vi vj) =ConnectedPath(vi vj)

ConnectedPathmax(31)

where ConnectedPathmax is the maximum similarity scores found for any twoauthors in the data set

The third technique that has been tested is Jaccard similarity hereafterreferred to as rdquoJaccardrdquo For each author in the test set the Jaccard similaritybetween that authors neighbors and each other authors neighbors has beencalculated Note that authors that have been removed from the data set duringthe preprocessing steps described in section 31 do not occur in the neighborhoodof their correspondents anymore and do not contribute towards the Jaccardsimilarity score The same principle applies to the Connected Path score

The last individual technique that has been evaluated is the use of SVMon email content hereafter referred to as rdquoauthorship SVMrdquo or simply rdquoSVMrdquoThe first decision that had to be made was whether to treat the problem froman instance-based or profile-based perspective Since it is possible that authorsemploy different writing styles when writing to different contacts it is importantto retain the differences in each email message For example an author can usedifferent words when writing to friends instead of colleagues or he can uselonger sentences when writing to superiors instead of writing to subordinatesMoreover Hirst and Feiguina [20] conclude that using multiple short texts forauthorship attribution overcomes the problem of not having sufficiently longtraining texts available Therefore an instance-based approach to authorshipattribution has been adopted A combination of lexical syntactic and structuralfeatures has been adapted from [71] and extended with a number of additionalfeatures to create a larger overall feature variance The complete feature setthat has been used in the authorship SVMrsquos can be found in table 34 Thelist of function words that has been used in the feature set can be found in theappendix

Based on the experiment shown in figure 32 it was determined that theauthorship SVMrsquos should be trained using a Radial Basis Function-kernel sinceits overall performance was better than that of the linear kernel The parameterC influences the penalty associated with classification errors whereas γ controlsthe shape of the separating hyper plane In order to find optimal values of Cand γ a straightforward grid-search has been performed using exponentiallygrowing sequences of C and γ Specifically the accuracy all combinations ofC = 2minus5 2minus3 2minus1 215 and γ = 2minus15 2minus13 2minus11 23 is calculated using

36

Features Description

Lexical1 Total number of characters (C)2 Total number of alphabetic characters C3 Total number of upper-case characters C4 Total number of digit characters C5 Total number of white-space characters C6 Total number of tab spaces C

7-32 Frequency of letters A-Z33-53 Frequency of special characters ~$^amp-_=+gtlt[]|

54 Total number of words (M)55 Total number of short words M less than four characters56 Total number of characters in words C57 Average word length58 Average sentence length (in characters)59 Average sentence length (in words)60 Total different words M61 Hapax legomena Frequency of once-occurring words62 Hapax dislegomena Frequency of twice-occurring words

63-82 Word length frequency distribution M83-333 TFIDF of 250 most frequent 3-grams

Syntactic334-341 Frequency of punctuation rsquo rdquo342-491 Frequency of function words

Structural492 Total number of sentences

Table 34 Feature set for the authorship SVM

37

5 times 5-fold cross validation for each authorship SVM The highest scoringcombination of parameters is then chosen to train the actual SVM model

The authorship attribution problem is a multi-class problem because a giventext has to be attributed to one out of multiple candidate authors Since ordi-nary SVMrsquos can only solve binary classification problems a multi-class approachusing one-versus-all classification has been adopted In the one-versus-all ap-proach a single authorship SVM is trained on positive training instances fromone author and negative instances from all the other authors That is theauthorship SVM makes a classification whether a given text has been writtenby one particular author or not Once a separate SVM has been trained forevery author a given text can be classified by letting each authorship SVMassign a probability to the text being written by that author Rifkin and Klau-tau [56] show that as long as a good binary classifier is used it makes littledifference which multi-class scheme is used Therefore a simple scheme such asone-versus-all is preferable over more complex schemes such as error-correctingcodes

Since SVM is sensitive to class imbalances the authorship SVMrsquos are trainedusing an equal amount of positive and negative training instances In order tomake sure that the negative class is a fairly accurate representation of all theother authors emails have been selected at random from other authors For eachauthor all the authorrsquos emails are selected as positive examples and an equalamount of negative emails are used for the negative class The software thathas been used for the authorship SVMrsquos is called SVMNET [35] It is a cleanC-conversion by Matthew A Johnson of the popular LibSVM software suite[10] SVMNET uses the Sequential Minimal Optimization-algorithm describedin Fan et al [19] and is able to handle classification regression and distribu-tion estimation for single and multi-class problems using different kernels andparameters

33 Combinations of Techniques

In order to test whether a combination of techniques that operate on differentdomains will perform better than these techniques individually two differentcombinations of techniques have been tested

bull JW-CP-SVM Jaro-Winkler similarity of email addresses Connected Pathsimilarity of the link network (depth = 3) amp authorship SVM on emailcontent

bull JW-Jaccard-SVM Jaro-Winkler similarity of email addresses Jaccardsimilarity of direct neighbors in the link network amp authorship SVM onemail content

Each combination of techniques is realized by using a voting algorithm thatis based on a Support Vector Machine which will specifically be referred toas the rdquovoting SVMrdquo The voting SVM takes as input a vector containing the

38

Figure 36 The structure of the combined approach

results of the three techniques for a single candidate author and gives as outputa prediction whether the input author is an alias or not An overview of thecombined approach can be found in figure 36

In order to create the voting SVM model a set of training instances isrequired The training set for the voting SVM has the same distribution of aliastypes as the test set depending on which test set is being used Obviouslythe authors that are used to test on are not used for training the voting SVMFor each of the 20 authors in the voting SVM training set candidate aliasesare manually labeled as positive or negative All the positively labeled aliasestogether are used as positive training instances for the voting SVM and 5 timesthe number of positive instances are randomly selected from the negativelylabeled aliases as negative training examples The reason for this class imbalanceis that the number of positive instances in the training sets is rather low 14 and18 for the mixed and hard sets respectively In order to have enough traininginstances available for the voting svm more negative examples are chosen fromthe training set

After two voting SVMrsquos have been trained one using Jaccard similarity andone using Connected Path similarity the test sets from figure 33 are used todetermine the precision and recall for various decision thresholds

39

Chapter 4

Results

The results that have been obtained from the various experiments will be dis-cussed in this section First the results of the individual and combined tech-niques on the mixed test set will be given Second the results of the individualand combined techniques on the hard test set will be given Finally an overviewof the best results achieved by each individual and combined techniques will begiven

Figures 41a to 41d show the precision and recall scores achieved on themixed test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively The results are based on the following decisionthresholds 00 005 010 10 Jaro-Winkler achieves the best F1-scoreof 080 at a decision threshold of 094 and 096 Connected Path achieves amaximum F1-score of 048 on a decision threshold ranging from 08 minus 096Jaccardrsquos best F1-score of 069 occurs at a decision threshold of 058 Finallyauthorship SVM achieves a maximum F1-score of 079 for a decision thresholdof 060

Figures 42a and 42b show the results achieved by the two combinations oftechniques JW-CP-SVM and JW-Jaccard-SVM on the mixed test set JW-CP-SVM achieves its best F1-score of 079 at a threshold of 074 JW-Jaccard-SVMachieves the best result of all the techniques on this test set namely an F1-scoreof 088 using a threshold of 078

Figures 43a to 43d show the precision and recall scores achieved on thehard test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively Again the results are based on the followingdecision thresholds 00 005 010 10 Jaro-Winkler achieves its highestF1-score of 028 at a decision threshold of 088 The best F1-score for ConnectedPath is 053 using a decision threshold of 012 Jaccard achieves a maximum F1-score of 067 at a decision threshold of 038 Finally authorship SVM achievesa maximum F1-score of 076 at a decision threshold of 068

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Figure 4.1: Precision, recall and F1 calculated using various decision thresholds (0 to 1.0) for individual techniques on the mixed test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

[Figure 4.2: Precision, recall and F1 calculated using various decision thresholds (0 to 1.0) for combined techniques on the mixed test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]


[Figure 4.3: Precision, recall and F1 calculated using various decision thresholds (0 to 1.0) for individual techniques on the hard test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) Authorship SVM.]

[Figure 4.4: Precision, recall and F1 calculated using various decision thresholds (0 to 1.0) for combined techniques on the hard test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM.]


[Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.]

[Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.]

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.
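For reference, a compact implementation of the metric in plain Python; this is the standard Jaro-Winkler definition, not the thesis's own code, and the example addresses are invented:

    def jaro(s1, s2):
        # Jaro similarity of two strings, in [0, 1].
        if s1 == s2:
            return 1.0
        len1, len2 = len(s1), len(s2)
        if len1 == 0 or len2 == 0:
            return 0.0
        # Characters match if equal and no further apart than the window.
        window = max(len1, len2) // 2 - 1
        matched1, matched2 = [False] * len1, [False] * len2
        matches = 0
        for i, c in enumerate(s1):
            lo, hi = max(0, i - window), min(len2, i + window + 1)
            for j in range(lo, hi):
                if not matched2[j] and s2[j] == c:
                    matched1[i] = matched2[j] = True
                    matches += 1
                    break
        if matches == 0:
            return 0.0
        # Transpositions: matched characters that appear out of order.
        k = transpositions = 0
        for i in range(len1):
            if matched1[i]:
                while not matched2[k]:
                    k += 1
                if s1[i] != s2[k]:
                    transpositions += 1
                k += 1
        transpositions //= 2
        return (matches / len1 + matches / len2
                + (matches - transpositions) / matches) / 3.0

    def jaro_winkler(s1, s2, p=0.1):
        # Boost the Jaro score for strings sharing a prefix (capped at 4).
        j = jaro(s1, s2)
        prefix = 0
        for c1, c2 in zip(s1, s2):
            if c1 != c2 or prefix == 4:
                break
            prefix += 1
        return j + prefix * p * (1.0 - j)

    print(jaro_winkler("kenneth.lay@enron.com", "ken.lay@enron.com"))  # high
    print(jaro_winkler("kenneth.lay@enron.com", "klay99@gmail.com"))   # low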

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond the analysis of direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2, and the same behavior of Connected Path would be expected on this data set if the search had been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.
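To make the intuition concrete, a toy path-based score is sketched below: it enumerates simple paths between two authors up to a maximum depth and weights shorter paths more heavily. This is only an illustrative stand-in; the actual Connected Path weighting used in the thesis follows Boongoen et al. [6].

    from collections import defaultdict

    def path_score(graph, a, b, max_depth=3):
        # Sum 1/length over simple paths from a to b of length 2..max_depth.
        score = 0.0
        stack = [(a, [a])]
        while stack:
            node, path = stack.pop()
            for nxt in graph[node]:
                if nxt == b and len(path) >= 2:
                    score += 1.0 / len(path)  # shorter paths weigh more
                elif nxt not in path and len(path) < max_depth:
                    stack.append((nxt, path + [nxt]))
        return score

    # Toy undirected link network: a-x-b and a-y-z-b.
    graph = defaultdict(set)
    for u, v in [("a", "x"), ("x", "b"), ("a", "y"), ("y", "z"), ("z", "b")]:
        graph[u].add(v)
        graph[v].add(u)
    print(path_score(graph, "a", "b"))  # 1/2 + 1/3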

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.
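A minimal sketch of this neighbor-set computation over the same kind of adjacency-set graph (names invented):

    def jaccard_neighbors(graph, a, b):
        # Jaccard similarity of the direct-neighbor sets of two authors.
        na, nb = set(graph[a]), set(graph[b])
        union = na | nb
        return len(na & nb) / len(union) if union else 0.0

    # Two authors who share one of three distinct correspondents score 1/3.
    graph = {"a": {"x", "y"}, "b": {"x", "z"},
             "x": {"a", "b"}, "y": {"a"}, "z": {"b"}}
    print(jaccard_neighbors(graph, "a", "b"))  # 0.333...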

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering the fact that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.
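A schematic of the one-versus-all setup, assuming scikit-learn; the thesis trained on the 492 stylometric features of Table 3.4 with balanced positive and negative email sets, whereas this toy uses raw TF-IDF token weights and a handful of invented texts:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Toy corpus: a few short emails per author (illustrative only).
    texts = ["the deal closes friday", "please review the deal terms",
             "lunch at noon works", "noon works for me as well",
             "forecast numbers attached", "attached are the new numbers"]
    authors = ["alice", "alice", "bob", "bob", "carol", "carol"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)

    # One binary SVM per author: that author's emails versus everyone else's.
    models = {}
    for a in set(authors):
        y = [1 if lbl == a else 0 for lbl in authors]
        models[a] = LinearSVC().fit(X, y)

    # Score an unseen email against every author model; a larger margin
    # means the text looks more like that author.
    q = vec.transform(["the terms of the deal"])
    print({a: float(m.decision_function(q)[0]) for a, m in models.items()})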

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. Because of the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard, and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that have been formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author. If this is the case, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is rare, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often constructed manually, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content, and Jaccard similarity of the link network, combined using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases; no such collection could be found for use in this research. Should such a collection not exist, it would be worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the immediate neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to investigate how the algorithm can be made less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results of the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There is a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques, and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron/.

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp.

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611–617, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. and Widom, J. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between, both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including, inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of, off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something, such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we, what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your


After two voting SVMrsquos have been trained one using Jaccard similarity andone using Connected Path similarity the test sets from figure 33 are used todetermine the precision and recall for various decision thresholds

39

Chapter 4

Results

The results that have been obtained from the various experiments will be dis-cussed in this section First the results of the individual and combined tech-niques on the mixed test set will be given Second the results of the individualand combined techniques on the hard test set will be given Finally an overviewof the best results achieved by each individual and combined techniques will begiven

Figures 41a to 41d show the precision and recall scores achieved on themixed test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively The results are based on the following decisionthresholds 00 005 010 10 Jaro-Winkler achieves the best F1-scoreof 080 at a decision threshold of 094 and 096 Connected Path achieves amaximum F1-score of 048 on a decision threshold ranging from 08 minus 096Jaccardrsquos best F1-score of 069 occurs at a decision threshold of 058 Finallyauthorship SVM achieves a maximum F1-score of 079 for a decision thresholdof 060

Figures 42a and 42b show the results achieved by the two combinations oftechniques JW-CP-SVM and JW-Jaccard-SVM on the mixed test set JW-CP-SVM achieves its best F1-score of 079 at a threshold of 074 JW-Jaccard-SVMachieves the best result of all the techniques on this test set namely an F1-scoreof 088 using a threshold of 078

Figures 43a to 43d show the precision and recall scores achieved on thehard test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively Again the results are based on the followingdecision thresholds 00 005 010 10 Jaro-Winkler achieves its highestF1-score of 028 at a decision threshold of 088 The best F1-score for ConnectedPath is 053 using a decision threshold of 012 Jaccard achieves a maximum F1-score of 067 at a decision threshold of 038 Finally authorship SVM achievesa maximum F1-score of 076 at a decision threshold of 068

Figures 44a and 44b show the results achieved by the two combinationsof techniques JW-CP-SVM and JW-Jaccard-SVM on the hard test set JW-CP-SVM achieves its best F1-score of 065 at a threshold of 078 whereas JW-

40

Jaccard-SVM achieves the best results of all the techniques on this test setnamely an F1-score of 089 using a threshold of 092

In addition to these graphs the best F1-scores for all the techniques on eachtest set are summarized in figs 45 and 46 The precision and recall valuesthat are shown correspond to the best F1-scores achieved It can be concludedthat the best results on both the mixed and the hard test set are achieved byJW-Jaccard-SVM

41

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 41 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the mixed test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 42 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the mixed test set

42

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 43 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the hard test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 44 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the hard test set

43

Figure 45 Best results on the mixed test set for different techniques Preci-sion and recall values correspond to the given F1-scores

44

Figure 46 Best results on the hard test set for different techniques Precisionand recall values correspond to the given F1-scores

45

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used forresolving aliases and disambiguate authors in email data Specifically it wasinvestigated whether a combination of techniques could perform better thanindividual techniques on these tasks The main results of this thesis can besummarized as follows

Jaro Winkler The Jaro-Winkler approach gave good results on the mixedtest set but failed on the hard test set The high F1-score on the mixed test setcan be explained by the fact that many of the artificial aliases had an extremelyhigh Jaro-Winkler similarity The hard test set more closely mimics a real-worldscenario where aliases do not look as much alike For example Boongoen et al[6] showed that in their data set of real terrorist names derived from web pages70 of the true aliases had a Jaro-Winkler similarity of less than 06 Sincethe hard test set also features more aliases with low Jaro-Winkler similaritythe performance on this test set is significantly lower However the results stillshow that using a simple string metric can detect many aliases resulting fromspelling errors or the use of different email addresses for work home etc

Connected Path It can be concluded that the Connected Path algorithmfailed to achieve good results on both test sets because of three reasons Firstsince authors have been split up into aliases and some have been removed all to-gether the link networkrsquos structure might have been corrupted This especiallyaffects link analysis that goes beyond the analysis of direct neighbors since ittakes into account more complicated link connections Second because of timeconstraints the link network search has been performed to depth 3 which meansthat only the information contained in paths of length 2 and 3 have been usedin the calculation of the similarity score Boongoen et al [6] achieved betteraccuracy by searching to depth 4 compared with a search to depth 2 It isexpected that the same behavior of Connected Path can be observed on thisdata set if the search would have been performed to a greater depth Third theConnected Path method can only return similarity scores for authors that areconnected to the original author If there was no Connected Path score returnedfor a particular author-alias pair the alias had to be counted as a false negative

46

thereby decreasing the overall recallJaccard Using Jaccard similarity yielded better results than the Connected

Path algorithm Since Jaccard similarity only takes into account direct neigh-bors it is less affected by changes in the link network Moreover the Jaccardsimilarity can be calculated between any two authors in the data set which iswhy it scored better than the Connected Path method

Authorship SVM The use of authorship SVMrsquos gave good results overallespecially considering the fact that there are 314 candidate aliases for eachauthor and that the training texts are short Considering that Luyckx [44]reported scalability issues when using a multi-class SVM approach the one-versus-all approach that has been adopted in this thesis is very promising

Combined techniques The combination of JW-CP-SVM did not performvery well On the mixed test set it performed as good as authorship SVMor even Jaro-Winkler alone and for the hard test set it performed even worseBecause of aforementioned reasons the Connected Path method failed to achievegood results in general In combination with the low Jaro-Winkler performanceon the hard data set this resulted in the combination JW-CP-SVM failing toachieve reasonable results The best results for both test sets are achieved bythe combination of Jaro-Winkler Jaccard and authorship SVM On the hardtest set the increase in F1-score of this combination compared with the secondbest technique (SVM) is as high as 16 whereas on the mixed test set theincrease to the second-best technique (Jaro-Winkler) is 9

51 Conclusion

The results of the experiments confirm the hypothesis that a combination oftechniques can yield better results than using these techniques individuallyThe research questions that have been formulated to guide this research areanswered below

Which authorship disambiguation and alias resolution techniques ex-ist that can be used on email data

The literature review of section 2 has provided an extensive overview of thedifferent techniques that can be used to attribute authorship and resolve aliasesTechniques that operate on the domain of email addresses are able to resolvesuperficial aliases resulting from unintentional misspellings or simple variationsin naming conventions Authorship attribution techniques can predict the realauthor of a given set of email very well provided that there is enough trainingtext available for each author If this is the case dealing with a large authorset has also proven to be possible Link analysis techniques have low precisionand recall when used individually but can still manage to find aliases that othertechniques do not

47

How can techniques from different domains be combined

Section 24 has given an overview of approaches to combining techniques fromdifferent domains Previous literature on the subject is rare especially on com-bining techniques from more than two domains A common method is to usea linear weighted combination of the results of different techniques Since theweights are often manually constructed the results are not that good and theability of these techniques to generalize is low Better results are achieved whena classifier is trained to distinguish between good and bad combinations of re-sults

Can a combination of techniques from different domains increase per-formance when compared with individual techniques

The results of Chapter 4 show that a combination of techniques can indeedincrease precision and recall when compared with individual techniques Specif-ically a combination of Jaro-Winkler similarity on email addresses authorshipSVM on email content and Jaccard similarity of the link network using an SVMvoting algorithm achieves the best results when tested on a subset of the EN-RON data set The results of this combination of techniques also tend to bemore robust across different decision thresholds something that is useful whendetermining a proper threshold might be difficult It is important to note thatthe relative improvement in F1-score of the combined techniques over the in-dividual techniques is dependent on the number of low Jaro-Winkler aliases inthe test set Especially on the hard data set where aliases are more difficult torecognize the combination of techniques performs very well and achieves sig-nificantly higher F1-scores than the individual techniques This indicates thatthe different techniques are indeed complementary and can work together toachieve better results

To summarize it can be concluded that a combination of techniques thatoperate on different domains is more effective in disambiguating authors andresolving aliases than each of these techniques individually

52 Future Recommendations

The results and conclusion that have been put forward in the previous sectionsprovide good ground for future research It will be interesting to see how wellthe techniques that have been used in this thesis perform on a full data set withreal aliases which could not be found to use in this research Should such acollection not exist it is worthwhile to create one

The link analysis techniques that have been used in this paper only useinformation from the direct neighborhood of the authors Boongoen et al[6]have already shown that searching to a greater depth yields better results soit is useful to look at how the algorithm can be optimized to be less computa-tionally intensive in order to search to greater depths Moreover since the less

48

sophisticated Jaccard similarity surprisingly outperformed Connected Path itis worthwhile to experiment with different link analysis techniques

Finally the assumption has been made that the results from various tech-niques are independent of each other These assumptions have not been testedand it is not clear if and in what way various techniques affect each other Thereare a myriad of decisions to be made when implementing an alias resolution sys-tem that combines different techniques Therefore it is imperative that moreresearch will be done to examine the best choice of feature sets techniques andaggregation methods

49

Chapter 6

Bibliography

[1] Abbasi A and Chen H (2005) Applying authorship analysis to extremist-group Web forum messages Intelligent Systems IEEE 20(5)67ndash75

[2] Argamon S and Juola P (2011) Overview of the International Au-thorship Identification Competition at PAN-2011 In CLEF (Notebook Pa-persLabsWorkshop)

[3] Baroni M Matiasek J and Trost H (2002) Unsupervised discovery ofmorphologically related words based on orthographic and semantic similarityIn Proceedings of the ACL-02 workshop on Morphological and phonologicallearning-Volume 6 volume 6 pages 48ndash57 Morristown NJ USA Associationfor Computational Linguistics

[4] Binongo J N G (2003) Who Wrote the 15th Book of Oz An Applicationof Multivariate Analysis to Authorship Attribution Chance 16(2)9ndash17

[5] Boongoen T and Shen Q (2009) Semi-supervised OWA aggregation forlink-based similarity evaluation and alias detection 2009 IEEE InternationalConference on Fuzzy Systems pages 288ndash293

[6] Boongoen T Shen Q and Price C (2010) Disclosing false identitythrough hybrid link analysis Artificial Intelligence and Law 18(1)77ndash102

[7] Brunet E (1978) Le Vocabulaire de Jean Giraudoux structure et evolutionStatistique et informatique appliquees a letude des textes a partir des donneesdu Tresor de la langue francaise Le Vocabulaire des grands ecrivains francais1 Slatkine

[8] Burrows J (2002) rsquoDeltarsquo A measure of stylistic difference and a guide tolikely authorship Literary and Linguistic Computing 17(3)267ndash287

[9] Burrows J (2007) All the Way Through Testing for Authorship in Differ-ent Frequency Strata Literary and Linguistic Computing 22(1)27ndash47

50

[10] Chang C-c and Lin C-j (2011) LIBSVM A Library for Support VectorMachines ACM Transactions on Intelligent Systems and Technology 2(3)1ndash39

[11] Christen P (2006) A Comparison of Personal Name Matching Techniquesand Practical Issues In Sixth IEEE International Conference on Data Mining- Workshops (ICDMWrsquo06) number September pages 290ndash294 IEEE

[12] Cohen W W (2009) Enron email dataset Retrieved from httpwwwcscmuedu~enron

[13] Cohen W W and Fienberg S E (2003) A comparison of three stringmatching algorithms Methods 20(1)73ndash78

[14] Cortes C and Vapnik V (1995) Support-vector networks MachineLearning 20(3)273ndash297

[15] Crammer K and Singer Y (2002) On the Algorithmic Implementationof Multiclass Kernel-based Vector Machines Journal of Machine LearningResearch 2(2)265ndash292

[16] de Vel O (2000) Mining e-mail authorship Proc Workshop on TextMining ACM International

[17] de Vel O Anderson A Corney M and Mohay G (2001) Mining e-mailcontent for author identification forensics ACM SIGMOD Record 30(4)55

[18] Duan K-b and Keerthi S S (2005) Which Is the Best Multiclass SVMMethod An Empirical Study Multiple Classifier Systems 3541278ndash285

[19] Fan R-e Chen P-h and Lin C-j (2005) Working Set Selection UsingSecond Order Information for Training Support Vector Machines Journal ofMachine Learning Research 61889ndash1918

[20] Feiguina O and Hirst G (2007) Authorship attribution for small textsLiterary and forensic experiments In Proceedings of the 30th InternationalConference of the Special Interest Group on Information Retrieval Work-shop on Plagiarism Analysis Authorship Identification and Near-DuplicateDetection (SIGIR) pages 3ndash6

[21] FERC (2012) Information released in enron investigation Re-trieved from httpwwwfercgovindustrieselectricindus-act

wecenroninfo-releaseasp

[22] Forman G (2003) An Extensive Empirical Study of Feature SelectionMetrics for Text Classification Journal of Machine Learning Research 3(7-8)1289ndash1305

[23] Friedman C and Sideli R (1992) Tolerating spelling errors during pa-tient validation Computers and biomedical research an international journal25(5)486ndash509

51

[24] Gabrilovich E and Markovitch S (2004) Text categorization with manyredundant features Using aggressive feature selection to make SVMs compet-itive with C45 In Proceedings of the twenty-first international conference onMachine learning ACM International Conference Proceeding Series page 41Banff Canada ACM

[25] Gamon M (2004) Linguistic correlates of style In Proceedings of the 20thinternational conference on Computational Linguistics - COLING rsquo04 pages611ndashes Morristown NJ USA Association for Computational Linguistics

[26] Honore A (1979) Some simple measures of richness of vocabulary Asso-ciation for Literary and Linguistic Computing Bulletin 7(2)172ndash177

[27] Hoover D L (2004) Testing Burrowsrsquos Delta Literary and LinguisticComputing 19(4)453ndash476

[28] Hsiung P Moore A Neill D and Schneider J (2005) Alias Detec-tion in Link Data Sets In Proceedings of the International Conference onIntelligence Analysis (2005) volume 4

[29] Iqbal F Binsalleeh H Fung B C and Debbabi M (2010) Miningwriteprints from anonymous e-mails for forensic investigation Digital Inves-tigation 7(1-2)56ndash64

[30] Iqbal F Binsalleeh H Fung B C and Debbabi M (2011) A unifieddata mining solution for authorship analysis in anonymous textual commu-nications Information Sciences

[31] Iqbal F Hadjidj R Fung B C and Debbabi M (2008) A novelapproach of mining write-prints for authorship attribution in e-mail forensicsDigital Investigation 5S42ndashS51

[32] Jaro M A (1995) Probabilistic linkage of large public health data filesStatistics in Medicine 14(5-7)491ndash498

[33] Jeh G (2002) SimRank a measure of structural-context similarity Pro-ceedings of the eighth ACM SIGKDD international conference on Knowledgediscovery and data mining pages 1ndash11

[34] Joachims T (1998) Text categorization with support vector machinesLearning with many relevant features Machine Learning 1398(23)137ndash142

[35] Johnson M A (2009) Svmnet 163 Retrieved from httpwww

matthewajohnsonorgsoftwaresvmhtml

[36] Kern R Seifert C Zechner M and Granitzer M (2011) Vote VetoMeta-Classifier for Authorship Identification In Notebook for PAN at CLEF2011

52

[37] Koppel M (2003) Exploiting stylistic idiosyncrasies for authorship attri-bution In Proceedings of the 2003 International Joint Conferences on Artifi-cial Intelligence Workshop on Computational Approaches to Style Analysisand Synthesis number 2000 pages 69ndash72 Acapulco Mexico

[38] Koppel M and Akiva N (2003) A corpus-independent feature set forstyle-based text categorization In Proceedings of IJCAIrsquo03 Workshop onComputational Approaches to Style Analysis and Synthesis Acapulco Mex-ico

[39] Koppel M Gan R and Messeri E (2006) Authorship Attribution withThousands of Candidate Authors In Proceedings of the 29th annual interna-tional ACM SIGIR conference on Research and Development in InformationRetrieval pages 659ndash660 Seattle Washington USA ACM

[40] Koppel M Schler J and Argamon S (2010) Authorship attribution inthe wild Language Resources and Evaluation 45(1)83ndash94

[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx K (2010) Scalability Issues in Authorship Attribution Schaal-baarheid bij Auteursherkenning PhD thesis Universiteit Antwerpen

[45] Luyckx K Daelemans W Hamilton A and Madison J (2008) Au-thorship Attribution and Verification with Many Authors and Limited DataIn Proceedings of the 22nd International Conference on Computational Lin-guistics COLING 08 (2008) number August pages 513ndash520 Association forComputational Linguistics

[46] Manning C D Raghavan P and Schutze H (2009) An Introduction toInformation Retrieval Number c Cambridge University Press CambridgeEngland 2009 edition

[47] Mendenhall T C (1887) The Characteristic Curves of Composition Sci-ence (New York NY) 9(214S)237ndash46

[48] Mendenhall T C (1901) A Mechanical Solution of a Literary problemPopular Science Monthly 60(2)97ndash105

53

[49] Miller (1995) WordNet A Lexical Database Communications of theACM 38(11)39ndash41

[50] Monge A E and Elkan C P (1996) The field matching problem Algo-rithms and applications In Evangelos S Jiawei H and Usama F editorsProceedings of the Second International Conference on Knowledge Discoveryand Data Mining number Slaven 1992 pages 267ndash270 Menlo Park Califor-nia The AAAI Press Menlo Park California

[51] Mosteller F and Wallace D L (1964) Inference and Disputed AuthorshipThe Federalist The David Hume Series of Philosophy and Cognitive ScienceReissues Addison-Wesley

[52] Navarro G (2001) A guided tour to approximate string matching ACMComputing Surveys 33(1)31ndash88

[53] Odell M and Russell R (1918) The soundex coding system US Patents1261167

[54] Page L Brin S Motwani R and Winograd T (1998) The PageRankCitation Ranking Bringing Order to the Web Technical report StanfordInfoLab

[55] Reuther P and Walter B (2006) Survey on test collections and techniquesfor personal name matching Int J Metadata Semant Ontologies 1(2)89ndash99

[56] Rifkin R and Klautau A (2004) In Defense of One-Vs-All ClassificationJournal of Machine Learning Research 5101ndash141

[57] Sanderson C and Guenter S (2006) Short Text Authorship Attributionvia Sequence Kernels Markov Chains and Author Unmasking An Inves-tigation In Proceedings of the 2006 Conference on Empirical Methods inNatural Language Processing number July pages 482ndash491 Association forComputational Linguistics

[58] Shetty J and Adibi J (2004) The Enron Email Dataset DatabaseSchema and Brief Statistical Report Technical report Information SciencesInstitute

[59] Sichel H S (1986) Word Frequency Distributions and Type-Token Char-acteristics The Mathematical Scientist 1145ndash72

[60] Simeon M and Hilderman R (2009) An Empirical Study of CategorySkew on Feature Selection for Text Categorization In Proceedings of the22nd Canadian Conference on Artificial Intelligence Advances in ArtificialIntelligence pages 249ndash252 Springer-Verlag Berlin

[61] Small H (1973) Co-citation in the Scientific Literature A New Measure ofthe Relationship Between Two Documents Journal of the American Societyfor Information Science pages 265ndash269

54

[62] Solorio T and Pillay S (2011) Authorship Identification with ModalitySpecific Meta Features In Notebook for PAN at CLEF 2011

[63] Stamatatos E (2009) A survey of modern authorship attribution methodsJournal of the American Society for Information Science and Technology60(3)538ndash556

[64] Tanguy L Urieli A Calderone B Hathout N and Sajous F (2011)A multitude of linguistically-rich features for authorship attribution In Note-book for PAN at CLEF 2011

[65] Tearle M Taylor K and Demuth H (2008) An algorithm for auto-mated authorship attribution using neural networks Literary and LinguisticComputing 23(4)425ndash442

[66] Tsuboi Y and Matsumoto Y (2002) Authorship identification for Het-eregeneous Documents IPSJ SIG Notes pages 17ndash24

[67] Winkler W E (1999) The State of Record Linkage and Current ResearchProblems Statistical Research Division US Census Bureau pages 1ndash15

[68] Yang H (2003) Margin Variations in Support Vector Regression for theStock Market Prediction PhD thesis The Chinese University of Hong Kong

[69] Yule G U (1944) The statistical study of literary vocabulary CambridgeUniversity Press

[70] Zhao Y and Zobel J (2005) Effective and Scalable Authorship Attribu-tion Using Function Words Information Retrieval Technology Proceedings3689174ndash189

[71] Zheng R Li J Chen H and Huang Z (2005) A Framework forAuthorship Identification of Online Messages Writing-Style Features andJournal of the American Society for Information Science 57(3)378ndash393

55

Appendix

List of function words used in authorship SVM

aaboutaboveafterallalthoughamamonganandanotheranyanybodyanyoneanythingarearoundasatbebecausebeforebehindbelowbesidebetween

bothbutbycancosdodowneacheitherenougheveryeverybodyeveryoneeverythingfewfollowingforfromhaveheherhimiifinincluding

insideintoisititslatterlesslikelittlelotsmanymemoremostmuchmustmynearneedneithernonobodynonenornothingof

offononceoneontooppositeorouroutsideoverownpastperplentyplusregardingsameseveralsheshouldsincesosomesomebodysomeonesomething

suchthanthatthetheirthemthesetheythisthosethoughthroughtilltotowardtowardsunderunlessunlikeuntilupuponususedviawe

whatwhateverwhenwherewhetherwhichwhilewhowhoeverwhomwhosewillwithwithinwithoutworthwouldyesyouyour

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 43: Thesis Freek Maes - Final Version

Figure 36 The structure of the combined approach

results of the three techniques for a single candidate author and gives as outputa prediction whether the input author is an alias or not An overview of thecombined approach can be found in figure 36

In order to create the voting SVM model a set of training instances isrequired The training set for the voting SVM has the same distribution of aliastypes as the test set depending on which test set is being used Obviouslythe authors that are used to test on are not used for training the voting SVMFor each of the 20 authors in the voting SVM training set candidate aliasesare manually labeled as positive or negative All the positively labeled aliasestogether are used as positive training instances for the voting SVM and 5 timesthe number of positive instances are randomly selected from the negativelylabeled aliases as negative training examples The reason for this class imbalanceis that the number of positive instances in the training sets is rather low 14 and18 for the mixed and hard sets respectively In order to have enough traininginstances available for the voting svm more negative examples are chosen fromthe training set

After two voting SVMrsquos have been trained one using Jaccard similarity andone using Connected Path similarity the test sets from figure 33 are used todetermine the precision and recall for various decision thresholds

39

Chapter 4

Results

The results that have been obtained from the various experiments will be dis-cussed in this section First the results of the individual and combined tech-niques on the mixed test set will be given Second the results of the individualand combined techniques on the hard test set will be given Finally an overviewof the best results achieved by each individual and combined techniques will begiven

Figures 41a to 41d show the precision and recall scores achieved on themixed test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively The results are based on the following decisionthresholds 00 005 010 10 Jaro-Winkler achieves the best F1-scoreof 080 at a decision threshold of 094 and 096 Connected Path achieves amaximum F1-score of 048 on a decision threshold ranging from 08 minus 096Jaccardrsquos best F1-score of 069 occurs at a decision threshold of 058 Finallyauthorship SVM achieves a maximum F1-score of 079 for a decision thresholdof 060

Figures 42a and 42b show the results achieved by the two combinations oftechniques JW-CP-SVM and JW-Jaccard-SVM on the mixed test set JW-CP-SVM achieves its best F1-score of 079 at a threshold of 074 JW-Jaccard-SVMachieves the best result of all the techniques on this test set namely an F1-scoreof 088 using a threshold of 078

Figures 43a to 43d show the precision and recall scores achieved on thehard test set for the techniques Jaro-Winkler Connected Path Jaccard andauthorship SVM respectively Again the results are based on the followingdecision thresholds 00 005 010 10 Jaro-Winkler achieves its highestF1-score of 028 at a decision threshold of 088 The best F1-score for ConnectedPath is 053 using a decision threshold of 012 Jaccard achieves a maximum F1-score of 067 at a decision threshold of 038 Finally authorship SVM achievesa maximum F1-score of 076 at a decision threshold of 068

Figures 44a and 44b show the results achieved by the two combinationsof techniques JW-CP-SVM and JW-Jaccard-SVM on the hard test set JW-CP-SVM achieves its best F1-score of 065 at a threshold of 078 whereas JW-

40

Jaccard-SVM achieves the best results of all the techniques on this test setnamely an F1-score of 089 using a threshold of 092

In addition to these graphs the best F1-scores for all the techniques on eachtest set are summarized in figs 45 and 46 The precision and recall valuesthat are shown correspond to the best F1-scores achieved It can be concludedthat the best results on both the mixed and the hard test set are achieved byJW-Jaccard-SVM

41

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 41 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the mixed test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 42 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the mixed test set

42

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 43 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the hard test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 44 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the hard test set

43

Figure 45 Best results on the mixed test set for different techniques Preci-sion and recall values correspond to the given F1-scores

44

Figure 46 Best results on the hard test set for different techniques Precisionand recall values correspond to the given F1-scores

45

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used forresolving aliases and disambiguate authors in email data Specifically it wasinvestigated whether a combination of techniques could perform better thanindividual techniques on these tasks The main results of this thesis can besummarized as follows

Jaro Winkler The Jaro-Winkler approach gave good results on the mixedtest set but failed on the hard test set The high F1-score on the mixed test setcan be explained by the fact that many of the artificial aliases had an extremelyhigh Jaro-Winkler similarity The hard test set more closely mimics a real-worldscenario where aliases do not look as much alike For example Boongoen et al[6] showed that in their data set of real terrorist names derived from web pages70 of the true aliases had a Jaro-Winkler similarity of less than 06 Sincethe hard test set also features more aliases with low Jaro-Winkler similaritythe performance on this test set is significantly lower However the results stillshow that using a simple string metric can detect many aliases resulting fromspelling errors or the use of different email addresses for work home etc

Connected Path It can be concluded that the Connected Path algorithmfailed to achieve good results on both test sets because of three reasons Firstsince authors have been split up into aliases and some have been removed all to-gether the link networkrsquos structure might have been corrupted This especiallyaffects link analysis that goes beyond the analysis of direct neighbors since ittakes into account more complicated link connections Second because of timeconstraints the link network search has been performed to depth 3 which meansthat only the information contained in paths of length 2 and 3 have been usedin the calculation of the similarity score Boongoen et al [6] achieved betteraccuracy by searching to depth 4 compared with a search to depth 2 It isexpected that the same behavior of Connected Path can be observed on thisdata set if the search would have been performed to a greater depth Third theConnected Path method can only return similarity scores for authors that areconnected to the original author If there was no Connected Path score returnedfor a particular author-alias pair the alias had to be counted as a false negative

46

thereby decreasing the overall recallJaccard Using Jaccard similarity yielded better results than the Connected

Path algorithm Since Jaccard similarity only takes into account direct neigh-bors it is less affected by changes in the link network Moreover the Jaccardsimilarity can be calculated between any two authors in the data set which iswhy it scored better than the Connected Path method

Authorship SVM The use of authorship SVMrsquos gave good results overallespecially considering the fact that there are 314 candidate aliases for eachauthor and that the training texts are short Considering that Luyckx [44]reported scalability issues when using a multi-class SVM approach the one-versus-all approach that has been adopted in this thesis is very promising

Combined techniques The combination of JW-CP-SVM did not performvery well On the mixed test set it performed as good as authorship SVMor even Jaro-Winkler alone and for the hard test set it performed even worseBecause of aforementioned reasons the Connected Path method failed to achievegood results in general In combination with the low Jaro-Winkler performanceon the hard data set this resulted in the combination JW-CP-SVM failing toachieve reasonable results The best results for both test sets are achieved bythe combination of Jaro-Winkler Jaccard and authorship SVM On the hardtest set the increase in F1-score of this combination compared with the secondbest technique (SVM) is as high as 16 whereas on the mixed test set theincrease to the second-best technique (Jaro-Winkler) is 9

51 Conclusion

The results of the experiments confirm the hypothesis that a combination oftechniques can yield better results than using these techniques individuallyThe research questions that have been formulated to guide this research areanswered below

Which authorship disambiguation and alias resolution techniques ex-ist that can be used on email data

The literature review of section 2 has provided an extensive overview of thedifferent techniques that can be used to attribute authorship and resolve aliasesTechniques that operate on the domain of email addresses are able to resolvesuperficial aliases resulting from unintentional misspellings or simple variationsin naming conventions Authorship attribution techniques can predict the realauthor of a given set of email very well provided that there is enough trainingtext available for each author If this is the case dealing with a large authorset has also proven to be possible Link analysis techniques have low precisionand recall when used individually but can still manage to find aliases that othertechniques do not

47

How can techniques from different domains be combined

Section 24 has given an overview of approaches to combining techniques fromdifferent domains Previous literature on the subject is rare especially on com-bining techniques from more than two domains A common method is to usea linear weighted combination of the results of different techniques Since theweights are often manually constructed the results are not that good and theability of these techniques to generalize is low Better results are achieved whena classifier is trained to distinguish between good and bad combinations of re-sults

Can a combination of techniques from different domains increase per-formance when compared with individual techniques

The results of Chapter 4 show that a combination of techniques can indeedincrease precision and recall when compared with individual techniques Specif-ically a combination of Jaro-Winkler similarity on email addresses authorshipSVM on email content and Jaccard similarity of the link network using an SVMvoting algorithm achieves the best results when tested on a subset of the EN-RON data set The results of this combination of techniques also tend to bemore robust across different decision thresholds something that is useful whendetermining a proper threshold might be difficult It is important to note thatthe relative improvement in F1-score of the combined techniques over the in-dividual techniques is dependent on the number of low Jaro-Winkler aliases inthe test set Especially on the hard data set where aliases are more difficult torecognize the combination of techniques performs very well and achieves sig-nificantly higher F1-scores than the individual techniques This indicates thatthe different techniques are indeed complementary and can work together toachieve better results

To summarize it can be concluded that a combination of techniques thatoperate on different domains is more effective in disambiguating authors andresolving aliases than each of these techniques individually

52 Future Recommendations

The results and conclusion that have been put forward in the previous sectionsprovide good ground for future research It will be interesting to see how wellthe techniques that have been used in this thesis perform on a full data set withreal aliases which could not be found to use in this research Should such acollection not exist it is worthwhile to create one

The link analysis techniques that have been used in this paper only useinformation from the direct neighborhood of the authors Boongoen et al[6]have already shown that searching to a greater depth yields better results soit is useful to look at how the algorithm can be optimized to be less computa-tionally intensive in order to search to greater depths Moreover since the less

48

sophisticated Jaccard similarity surprisingly outperformed Connected Path itis worthwhile to experiment with different link analysis techniques

Finally the assumption has been made that the results from various tech-niques are independent of each other These assumptions have not been testedand it is not clear if and in what way various techniques affect each other Thereare a myriad of decisions to be made when implementing an alias resolution sys-tem that combines different techniques Therefore it is imperative that moreresearch will be done to examine the best choice of feature sets techniques andaggregation methods

49

Chapter 6

Bibliography

[1] Abbasi A and Chen H (2005) Applying authorship analysis to extremist-group Web forum messages Intelligent Systems IEEE 20(5)67ndash75

[2] Argamon S and Juola P (2011) Overview of the International Au-thorship Identification Competition at PAN-2011 In CLEF (Notebook Pa-persLabsWorkshop)

[3] Baroni M Matiasek J and Trost H (2002) Unsupervised discovery ofmorphologically related words based on orthographic and semantic similarityIn Proceedings of the ACL-02 workshop on Morphological and phonologicallearning-Volume 6 volume 6 pages 48ndash57 Morristown NJ USA Associationfor Computational Linguistics

[4] Binongo J N G (2003) Who Wrote the 15th Book of Oz An Applicationof Multivariate Analysis to Authorship Attribution Chance 16(2)9ndash17

[5] Boongoen T and Shen Q (2009) Semi-supervised OWA aggregation forlink-based similarity evaluation and alias detection 2009 IEEE InternationalConference on Fuzzy Systems pages 288ndash293

[6] Boongoen T Shen Q and Price C (2010) Disclosing false identitythrough hybrid link analysis Artificial Intelligence and Law 18(1)77ndash102

[7] Brunet E (1978) Le Vocabulaire de Jean Giraudoux structure et evolutionStatistique et informatique appliquees a letude des textes a partir des donneesdu Tresor de la langue francaise Le Vocabulaire des grands ecrivains francais1 Slatkine

[8] Burrows J (2002) rsquoDeltarsquo A measure of stylistic difference and a guide tolikely authorship Literary and Linguistic Computing 17(3)267ndash287

[9] Burrows J (2007) All the Way Through Testing for Authorship in Differ-ent Frequency Strata Literary and Linguistic Computing 22(1)27ndash47

50

[10] Chang C-c and Lin C-j (2011) LIBSVM A Library for Support VectorMachines ACM Transactions on Intelligent Systems and Technology 2(3)1ndash39

[11] Christen P (2006) A Comparison of Personal Name Matching Techniquesand Practical Issues In Sixth IEEE International Conference on Data Mining- Workshops (ICDMWrsquo06) number September pages 290ndash294 IEEE

[12] Cohen W W (2009) Enron email dataset Retrieved from httpwwwcscmuedu~enron

[13] Cohen W W and Fienberg S E (2003) A comparison of three stringmatching algorithms Methods 20(1)73ndash78

[14] Cortes C and Vapnik V (1995) Support-vector networks MachineLearning 20(3)273ndash297

[15] Crammer K and Singer Y (2002) On the Algorithmic Implementationof Multiclass Kernel-based Vector Machines Journal of Machine LearningResearch 2(2)265ndash292

[16] de Vel O (2000) Mining e-mail authorship Proc Workshop on TextMining ACM International

[17] de Vel O Anderson A Corney M and Mohay G (2001) Mining e-mailcontent for author identification forensics ACM SIGMOD Record 30(4)55

[18] Duan K-b and Keerthi S S (2005) Which Is the Best Multiclass SVMMethod An Empirical Study Multiple Classifier Systems 3541278ndash285

[19] Fan R-e Chen P-h and Lin C-j (2005) Working Set Selection UsingSecond Order Information for Training Support Vector Machines Journal ofMachine Learning Research 61889ndash1918

[20] Feiguina O and Hirst G (2007) Authorship attribution for small textsLiterary and forensic experiments In Proceedings of the 30th InternationalConference of the Special Interest Group on Information Retrieval Work-shop on Plagiarism Analysis Authorship Identification and Near-DuplicateDetection (SIGIR) pages 3ndash6

[21] FERC (2012) Information released in enron investigation Re-trieved from httpwwwfercgovindustrieselectricindus-act

wecenroninfo-releaseasp

[22] Forman G (2003) An Extensive Empirical Study of Feature SelectionMetrics for Text Classification Journal of Machine Learning Research 3(7-8)1289ndash1305

[23] Friedman C and Sideli R (1992) Tolerating spelling errors during pa-tient validation Computers and biomedical research an international journal25(5)486ndash509

51

[24] Gabrilovich E and Markovitch S (2004) Text categorization with manyredundant features Using aggressive feature selection to make SVMs compet-itive with C45 In Proceedings of the twenty-first international conference onMachine learning ACM International Conference Proceeding Series page 41Banff Canada ACM

[25] Gamon M (2004) Linguistic correlates of style In Proceedings of the 20thinternational conference on Computational Linguistics - COLING rsquo04 pages611ndashes Morristown NJ USA Association for Computational Linguistics

[26] Honore A (1979) Some simple measures of richness of vocabulary Asso-ciation for Literary and Linguistic Computing Bulletin 7(2)172ndash177

[27] Hoover D L (2004) Testing Burrowsrsquos Delta Literary and LinguisticComputing 19(4)453ndash476

[28] Hsiung P Moore A Neill D and Schneider J (2005) Alias Detec-tion in Link Data Sets In Proceedings of the International Conference onIntelligence Analysis (2005) volume 4

[29] Iqbal F Binsalleeh H Fung B C and Debbabi M (2010) Miningwriteprints from anonymous e-mails for forensic investigation Digital Inves-tigation 7(1-2)56ndash64

[30] Iqbal F Binsalleeh H Fung B C and Debbabi M (2011) A unifieddata mining solution for authorship analysis in anonymous textual commu-nications Information Sciences

[31] Iqbal F Hadjidj R Fung B C and Debbabi M (2008) A novelapproach of mining write-prints for authorship attribution in e-mail forensicsDigital Investigation 5S42ndashS51

[32] Jaro M A (1995) Probabilistic linkage of large public health data filesStatistics in Medicine 14(5-7)491ndash498

[33] Jeh G (2002) SimRank a measure of structural-context similarity Pro-ceedings of the eighth ACM SIGKDD international conference on Knowledgediscovery and data mining pages 1ndash11

[34] Joachims T (1998) Text categorization with support vector machinesLearning with many relevant features Machine Learning 1398(23)137ndash142

[35] Johnson M A (2009) Svmnet 163 Retrieved from httpwww

matthewajohnsonorgsoftwaresvmhtml

[36] Kern R Seifert C Zechner M and Granitzer M (2011) Vote VetoMeta-Classifier for Authorship Identification In Notebook for PAN at CLEF2011

52

[37] Koppel M (2003) Exploiting stylistic idiosyncrasies for authorship attri-bution In Proceedings of the 2003 International Joint Conferences on Artifi-cial Intelligence Workshop on Computational Approaches to Style Analysisand Synthesis number 2000 pages 69ndash72 Acapulco Mexico

[38] Koppel M and Akiva N (2003) A corpus-independent feature set forstyle-based text categorization In Proceedings of IJCAIrsquo03 Workshop onComputational Approaches to Style Analysis and Synthesis Acapulco Mex-ico

[39] Koppel M Gan R and Messeri E (2006) Authorship Attribution withThousands of Candidate Authors In Proceedings of the 29th annual interna-tional ACM SIGIR conference on Research and Development in InformationRetrieval pages 659ndash660 Seattle Washington USA ACM

[40] Koppel M Schler J and Argamon S (2010) Authorship attribution inthe wild Language Resources and Evaluation 45(1)83ndash94

[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx K (2010) Scalability Issues in Authorship Attribution Schaal-baarheid bij Auteursherkenning PhD thesis Universiteit Antwerpen


Chapter 4

Results

The results that have been obtained from the various experiments are discussed in this chapter. First, the results of the individual and combined techniques on the mixed test set are given. Second, the results of the individual and combined techniques on the hard test set are given. Finally, an overview of the best results achieved by each individual and combined technique is given.

Figures 4.1a to 4.1d show the precision and recall scores achieved on the mixed test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. The results are based on the following decision thresholds: {0.0, 0.05, 0.10, ..., 1.0}. Jaro-Winkler achieves the best F1-score of 0.80 at decision thresholds of 0.94 and 0.96. Connected Path achieves a maximum F1-score of 0.48 on decision thresholds ranging from 0.8 to 0.96. Jaccard's best F1-score of 0.69 occurs at a decision threshold of 0.58. Finally, authorship SVM achieves a maximum F1-score of 0.79 for a decision threshold of 0.60.
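For clarity, the evaluation procedure behind these curves can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the thesis code: a candidate pair is predicted to be an alias whenever its similarity score meets the threshold, and F1 is computed as 2PR/(P + R).

    def sweep_thresholds(scored_pairs, step=0.05):
        """Precision, recall and F1 at thresholds 0.0, 0.05, ..., 1.0.
        scored_pairs: hypothetical list of (similarity, is_true_alias)
        tuples, one per candidate author-alias pair."""
        results = []
        for i in range(int(round(1.0 / step)) + 1):
            t = round(i * step, 2)
            tp = sum(1 for s, alias in scored_pairs if s >= t and alias)
            fp = sum(1 for s, alias in scored_pairs if s >= t and not alias)
            fn = sum(1 for s, alias in scored_pairs if s < t and alias)
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * p * r / (p + r) if p + r else 0.0
            results.append((t, p, r, f1))
        return results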

Figures 4.2a and 4.2b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the mixed test set. JW-CP-SVM achieves its best F1-score of 0.79 at a threshold of 0.74. JW-Jaccard-SVM achieves the best result of all the techniques on this test set, namely an F1-score of 0.88 using a threshold of 0.78.

Figures 4.3a to 4.3d show the precision and recall scores achieved on the hard test set for the techniques Jaro-Winkler, Connected Path, Jaccard and authorship SVM, respectively. Again, the results are based on the following decision thresholds: {0.0, 0.05, 0.10, ..., 1.0}. Jaro-Winkler achieves its highest F1-score of 0.28 at a decision threshold of 0.88. The best F1-score for Connected Path is 0.53, using a decision threshold of 0.12. Jaccard achieves a maximum F1-score of 0.67 at a decision threshold of 0.38. Finally, authorship SVM achieves a maximum F1-score of 0.76 at a decision threshold of 0.68.

Figures 4.4a and 4.4b show the results achieved by the two combinations of techniques, JW-CP-SVM and JW-Jaccard-SVM, on the hard test set. JW-CP-SVM achieves its best F1-score of 0.65 at a threshold of 0.78, whereas JW-Jaccard-SVM achieves the best results of all the techniques on this test set, namely an F1-score of 0.89 using a threshold of 0.92.

In addition to these graphs, the best F1-scores for all the techniques on each test set are summarized in Figures 4.5 and 4.6. The precision and recall values that are shown correspond to the best F1-scores achieved. It can be concluded that the best results on both the mixed and the hard test set are achieved by JW-Jaccard-SVM.


[Figure 4.1 (plots omitted): Precision, recall and F1 calculated using various decision thresholds for individual techniques on the mixed test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM; each panel plots precision, recall and F1 against the decision threshold.]

[Figure 4.2 (plots omitted): Precision, recall and F1 calculated using various decision thresholds for combined techniques on the mixed test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM; each panel plots precision, recall and F1 against the decision threshold.]


[Figure 4.3 (plots omitted): Precision, recall and F1 calculated using various decision thresholds for individual techniques on the hard test set. Panels: (a) Jaro-Winkler, (b) Connected Path, (c) Jaccard, (d) authorship SVM; each panel plots precision, recall and F1 against the decision threshold.]

[Figure 4.4 (plots omitted): Precision, recall and F1 calculated using various decision thresholds for combined techniques on the hard test set. Panels: (a) JW-CP-SVM, (b) JW-Jaccard-SVM; each panel plots precision, recall and F1 against the decision threshold.]


Figure 4.5: Best results on the mixed test set for different techniques. Precision and recall values correspond to the given F1-scores.


Figure 4.6: Best results on the hard test set for different techniques. Precision and recall values correspond to the given F1-scores.


Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques can be used to resolve aliases and disambiguate authors in email data. Specifically, it was investigated whether a combination of techniques could perform better than individual techniques on these tasks. The main results of this thesis can be summarized as follows.

Jaro-Winkler: The Jaro-Winkler approach gave good results on the mixed test set, but failed on the hard test set. The high F1-score on the mixed test set can be explained by the fact that many of the artificial aliases had an extremely high Jaro-Winkler similarity. The hard test set more closely mimics a real-world scenario, where aliases do not look as much alike. For example, Boongoen et al. [6] showed that in their data set of real terrorist names derived from web pages, 70% of the true aliases had a Jaro-Winkler similarity of less than 0.6. Since the hard test set also features more aliases with low Jaro-Winkler similarity, the performance on this test set is significantly lower. However, the results still show that a simple string metric can detect many aliases resulting from spelling errors or the use of different email addresses for work, home, etc.
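For reference, the metric itself is compact. The following is a minimal self-contained Python sketch of Jaro-Winkler similarity as defined by Winkler [67], with the common prefix scaling factor p = 0.1 over at most four characters; library implementations may differ in edge-case handling.

    def jaro(s1, s2):
        """Jaro similarity: matching characters within a sliding window,
        discounted by transpositions."""
        if s1 == s2:
            return 1.0
        if not s1 or not s2:
            return 0.0
        window = max(max(len(s1), len(s2)) // 2 - 1, 0)
        m1, m2 = [False] * len(s1), [False] * len(s2)
        matches = 0
        for i, ch in enumerate(s1):
            lo, hi = max(0, i - window), min(len(s2), i + window + 1)
            for j in range(lo, hi):
                if not m2[j] and s2[j] == ch:
                    m1[i] = m2[j] = True
                    matches += 1
                    break
        if matches == 0:
            return 0.0
        transpositions, k = 0, 0
        for i in range(len(s1)):
            if m1[i]:
                while not m2[k]:
                    k += 1
                if s1[i] != s2[k]:
                    transpositions += 1
                k += 1
        transpositions //= 2
        return (matches / len(s1) + matches / len(s2)
                + (matches - transpositions) / matches) / 3.0

    def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
        """Boost the Jaro score for strings sharing a common prefix."""
        j = jaro(s1, s2)
        prefix = 0
        for a, b in zip(s1, s2):
            if a != b or prefix == max_prefix:
                break
            prefix += 1
        return j + prefix * p * (1.0 - j)

Addresses differing only by a typo or a transposition score close to 1.0, while structurally different aliases fall far below the decision thresholds reported above.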

Connected Path: It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets, for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2, and the same behavior of Connected Path would be expected on this data set had the search been performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.
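To illustrate the cost that limits the search depth, consider a depth-limited enumeration of the paths connecting two authors. The sketch below is a simplified stand-in written for this discussion, not the exact Connected Path weighting of Boongoen et al. [6], which additionally scores the enumerated paths.

    def connecting_paths(graph, src, dst, max_depth=3):
        """Enumerate simple paths with at most max_depth edges between
        two authors. graph maps each author to a set of neighbors.
        The cost grows roughly as (average degree) ** max_depth, which
        is why the search in this thesis stopped at depth 3."""
        found = []

        def dfs(node, path):
            if len(path) - 1 > max_depth:   # too many edges, prune
                return
            if node == dst and len(path) > 1:
                found.append(list(path))    # record one connecting path
                return
            for neighbor in graph.get(node, ()):
                if neighbor not in path:    # keep paths simple
                    path.append(neighbor)
                    dfs(neighbor, path)
                    path.pop()

        dfs(src, [src])
        return found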

Jaccard: Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.
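Concretely, the neighborhood-based Jaccard score reduces to a single set computation; a minimal sketch, assuming the link network is stored as adjacency sets:

    def jaccard_similarity(graph, a, b):
        """graph maps each author to the set of authors they exchanged
        email with; the score compares the two direct neighborhoods."""
        na, nb = graph.get(a, set()), graph.get(b, set())
        union = na | nb
        return len(na & nb) / len(union) if union else 0.0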

Authorship SVM: The use of authorship SVMs gave good results overall, especially considering that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach that has been adopted in this thesis is very promising.
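A minimal sketch of such a one-versus-all setup, written here against scikit-learn rather than the SVM.NET library [35] used for the experiments; the feature extraction mirrors the function-word frequencies of the appendix, and all identifiers are illustrative.

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    # Truncated for brevity; the full list of 150 words is in the appendix.
    FUNCTION_WORDS = "a about above after all although am among an and".split()

    def style_features(text):
        """Relative frequency of each function word in one email body."""
        tokens = text.lower().split()
        n = max(len(tokens), 1)
        return [tokens.count(w) / n for w in FUNCTION_WORDS]

    def train_authorship_svm(bodies, authors):
        """Fit one binary SVM per author; decision_function later yields
        a per-author score that can be thresholded like the other
        similarity measures."""
        X = [style_features(b) for b in bodies]
        model = OneVsRestClassifier(LinearSVC())
        model.fit(X, authors)
        return model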

Combined techniques: The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results for both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination compared with the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.
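In outline, the winning combination feeds the three individual scores into a small meta-classifier. The following hypothetical sketch uses scikit-learn; its names and parameters are illustrative, not the thesis configuration.

    from sklearn.svm import SVC

    def train_combiner(training_pairs):
        """training_pairs: list of ((jw, jac, svm_score), is_true_alias),
        where the three numbers are the Jaro-Winkler, Jaccard and
        authorship-SVM scores for one candidate author-alias pair."""
        X = [list(scores) for scores, _ in training_pairs]
        y = [int(label) for _, label in training_pairs]
        combiner = SVC(kernel="rbf", probability=True)
        combiner.fit(X, y)
        return combiner

    def alias_score(combiner, jw, jac, svm_score):
        """Probability of the alias class; thresholded like the
        individual similarity scores."""
        return combiner.predict_proba([[jw, jac, svm_score]])[0][1]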

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that were formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if so, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still manage to find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is scarce, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often constructed manually, the results are not that good, and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content, and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques used in this thesis perform on a full data set with real aliases; no such collection could be found for use in this research, and should one indeed not exist, it is worthwhile to create it.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining – Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics – COLING '04, pages 611–es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. and Widom, J. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management – CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K. and Daelemans, W. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING '08), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–246.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between,

both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including,

inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of,

off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something,

such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we,

what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 45: Thesis Freek Maes - Final Version

Jaccard-SVM achieves the best results of all the techniques on this test setnamely an F1-score of 089 using a threshold of 092

In addition to these graphs the best F1-scores for all the techniques on eachtest set are summarized in figs 45 and 46 The precision and recall valuesthat are shown correspond to the best F1-scores achieved It can be concludedthat the best results on both the mixed and the hard test set are achieved byJW-Jaccard-SVM

41

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 41 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the mixed test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 42 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the mixed test set

42

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 43 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the hard test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 44 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the hard test set

43

Figure 45 Best results on the mixed test set for different techniques Preci-sion and recall values correspond to the given F1-scores

44

Figure 46 Best results on the hard test set for different techniques Precisionand recall values correspond to the given F1-scores

45

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used forresolving aliases and disambiguate authors in email data Specifically it wasinvestigated whether a combination of techniques could perform better thanindividual techniques on these tasks The main results of this thesis can besummarized as follows

Jaro Winkler The Jaro-Winkler approach gave good results on the mixedtest set but failed on the hard test set The high F1-score on the mixed test setcan be explained by the fact that many of the artificial aliases had an extremelyhigh Jaro-Winkler similarity The hard test set more closely mimics a real-worldscenario where aliases do not look as much alike For example Boongoen et al[6] showed that in their data set of real terrorist names derived from web pages70 of the true aliases had a Jaro-Winkler similarity of less than 06 Sincethe hard test set also features more aliases with low Jaro-Winkler similaritythe performance on this test set is significantly lower However the results stillshow that using a simple string metric can detect many aliases resulting fromspelling errors or the use of different email addresses for work home etc

Connected Path It can be concluded that the Connected Path algorithmfailed to achieve good results on both test sets because of three reasons Firstsince authors have been split up into aliases and some have been removed all to-gether the link networkrsquos structure might have been corrupted This especiallyaffects link analysis that goes beyond the analysis of direct neighbors since ittakes into account more complicated link connections Second because of timeconstraints the link network search has been performed to depth 3 which meansthat only the information contained in paths of length 2 and 3 have been usedin the calculation of the similarity score Boongoen et al [6] achieved betteraccuracy by searching to depth 4 compared with a search to depth 2 It isexpected that the same behavior of Connected Path can be observed on thisdata set if the search would have been performed to a greater depth Third theConnected Path method can only return similarity scores for authors that areconnected to the original author If there was no Connected Path score returnedfor a particular author-alias pair the alias had to be counted as a false negative

46

thereby decreasing the overall recallJaccard Using Jaccard similarity yielded better results than the Connected

Path algorithm Since Jaccard similarity only takes into account direct neigh-bors it is less affected by changes in the link network Moreover the Jaccardsimilarity can be calculated between any two authors in the data set which iswhy it scored better than the Connected Path method

Authorship SVM The use of authorship SVMrsquos gave good results overallespecially considering the fact that there are 314 candidate aliases for eachauthor and that the training texts are short Considering that Luyckx [44]reported scalability issues when using a multi-class SVM approach the one-versus-all approach that has been adopted in this thesis is very promising

Combined techniques The combination of JW-CP-SVM did not performvery well On the mixed test set it performed as good as authorship SVMor even Jaro-Winkler alone and for the hard test set it performed even worseBecause of aforementioned reasons the Connected Path method failed to achievegood results in general In combination with the low Jaro-Winkler performanceon the hard data set this resulted in the combination JW-CP-SVM failing toachieve reasonable results The best results for both test sets are achieved bythe combination of Jaro-Winkler Jaccard and authorship SVM On the hardtest set the increase in F1-score of this combination compared with the secondbest technique (SVM) is as high as 16 whereas on the mixed test set theincrease to the second-best technique (Jaro-Winkler) is 9

51 Conclusion

The results of the experiments confirm the hypothesis that a combination oftechniques can yield better results than using these techniques individuallyThe research questions that have been formulated to guide this research areanswered below

Which authorship disambiguation and alias resolution techniques ex-ist that can be used on email data

The literature review of section 2 has provided an extensive overview of thedifferent techniques that can be used to attribute authorship and resolve aliasesTechniques that operate on the domain of email addresses are able to resolvesuperficial aliases resulting from unintentional misspellings or simple variationsin naming conventions Authorship attribution techniques can predict the realauthor of a given set of email very well provided that there is enough trainingtext available for each author If this is the case dealing with a large authorset has also proven to be possible Link analysis techniques have low precisionand recall when used individually but can still manage to find aliases that othertechniques do not

47

How can techniques from different domains be combined

Section 24 has given an overview of approaches to combining techniques fromdifferent domains Previous literature on the subject is rare especially on com-bining techniques from more than two domains A common method is to usea linear weighted combination of the results of different techniques Since theweights are often manually constructed the results are not that good and theability of these techniques to generalize is low Better results are achieved whena classifier is trained to distinguish between good and bad combinations of re-sults

Can a combination of techniques from different domains increase per-formance when compared with individual techniques

The results of Chapter 4 show that a combination of techniques can indeedincrease precision and recall when compared with individual techniques Specif-ically a combination of Jaro-Winkler similarity on email addresses authorshipSVM on email content and Jaccard similarity of the link network using an SVMvoting algorithm achieves the best results when tested on a subset of the EN-RON data set The results of this combination of techniques also tend to bemore robust across different decision thresholds something that is useful whendetermining a proper threshold might be difficult It is important to note thatthe relative improvement in F1-score of the combined techniques over the in-dividual techniques is dependent on the number of low Jaro-Winkler aliases inthe test set Especially on the hard data set where aliases are more difficult torecognize the combination of techniques performs very well and achieves sig-nificantly higher F1-scores than the individual techniques This indicates thatthe different techniques are indeed complementary and can work together toachieve better results

To summarize it can be concluded that a combination of techniques thatoperate on different domains is more effective in disambiguating authors andresolving aliases than each of these techniques individually

52 Future Recommendations

The results and conclusion that have been put forward in the previous sectionsprovide good ground for future research It will be interesting to see how wellthe techniques that have been used in this thesis perform on a full data set withreal aliases which could not be found to use in this research Should such acollection not exist it is worthwhile to create one

The link analysis techniques that have been used in this paper only useinformation from the direct neighborhood of the authors Boongoen et al[6]have already shown that searching to a greater depth yields better results soit is useful to look at how the algorithm can be optimized to be less computa-tionally intensive in order to search to greater depths Moreover since the less

48

sophisticated Jaccard similarity surprisingly outperformed Connected Path itis worthwhile to experiment with different link analysis techniques

Finally the assumption has been made that the results from various tech-niques are independent of each other These assumptions have not been testedand it is not clear if and in what way various techniques affect each other Thereare a myriad of decisions to be made when implementing an alias resolution sys-tem that combines different techniques Therefore it is imperative that moreresearch will be done to examine the best choice of feature sets techniques andaggregation methods

49

Chapter 6

Bibliography

[1] Abbasi A and Chen H (2005) Applying authorship analysis to extremist-group Web forum messages Intelligent Systems IEEE 20(5)67ndash75

[2] Argamon S and Juola P (2011) Overview of the International Au-thorship Identification Competition at PAN-2011 In CLEF (Notebook Pa-persLabsWorkshop)

[3] Baroni M Matiasek J and Trost H (2002) Unsupervised discovery ofmorphologically related words based on orthographic and semantic similarityIn Proceedings of the ACL-02 workshop on Morphological and phonologicallearning-Volume 6 volume 6 pages 48ndash57 Morristown NJ USA Associationfor Computational Linguistics

[4] Binongo J N G (2003) Who Wrote the 15th Book of Oz An Applicationof Multivariate Analysis to Authorship Attribution Chance 16(2)9ndash17

[5] Boongoen T and Shen Q (2009) Semi-supervised OWA aggregation forlink-based similarity evaluation and alias detection 2009 IEEE InternationalConference on Fuzzy Systems pages 288ndash293

[6] Boongoen T Shen Q and Price C (2010) Disclosing false identitythrough hybrid link analysis Artificial Intelligence and Law 18(1)77ndash102

[7] Brunet E (1978) Le Vocabulaire de Jean Giraudoux structure et evolutionStatistique et informatique appliquees a letude des textes a partir des donneesdu Tresor de la langue francaise Le Vocabulaire des grands ecrivains francais1 Slatkine

[8] Burrows J (2002) rsquoDeltarsquo A measure of stylistic difference and a guide tolikely authorship Literary and Linguistic Computing 17(3)267ndash287

[9] Burrows J (2007) All the Way Through Testing for Authorship in Differ-ent Frequency Strata Literary and Linguistic Computing 22(1)27ndash47

50

[10] Chang C-c and Lin C-j (2011) LIBSVM A Library for Support VectorMachines ACM Transactions on Intelligent Systems and Technology 2(3)1ndash39

[11] Christen P (2006) A Comparison of Personal Name Matching Techniquesand Practical Issues In Sixth IEEE International Conference on Data Mining- Workshops (ICDMWrsquo06) number September pages 290ndash294 IEEE

[12] Cohen W W (2009) Enron email dataset Retrieved from httpwwwcscmuedu~enron

[13] Cohen W W and Fienberg S E (2003) A comparison of three stringmatching algorithms Methods 20(1)73ndash78

[14] Cortes C and Vapnik V (1995) Support-vector networks MachineLearning 20(3)273ndash297

[15] Crammer K and Singer Y (2002) On the Algorithmic Implementationof Multiclass Kernel-based Vector Machines Journal of Machine LearningResearch 2(2)265ndash292

[16] de Vel O (2000) Mining e-mail authorship Proc Workshop on TextMining ACM International

[17] de Vel O Anderson A Corney M and Mohay G (2001) Mining e-mailcontent for author identification forensics ACM SIGMOD Record 30(4)55

[18] Duan K-b and Keerthi S S (2005) Which Is the Best Multiclass SVMMethod An Empirical Study Multiple Classifier Systems 3541278ndash285

[19] Fan R-e Chen P-h and Lin C-j (2005) Working Set Selection UsingSecond Order Information for Training Support Vector Machines Journal ofMachine Learning Research 61889ndash1918

[20] Feiguina O and Hirst G (2007) Authorship attribution for small textsLiterary and forensic experiments In Proceedings of the 30th InternationalConference of the Special Interest Group on Information Retrieval Work-shop on Plagiarism Analysis Authorship Identification and Near-DuplicateDetection (SIGIR) pages 3ndash6

[21] FERC (2012) Information released in enron investigation Re-trieved from httpwwwfercgovindustrieselectricindus-act

wecenroninfo-releaseasp

[22] Forman G (2003) An Extensive Empirical Study of Feature SelectionMetrics for Text Classification Journal of Machine Learning Research 3(7-8)1289ndash1305

[23] Friedman C and Sideli R (1992) Tolerating spelling errors during pa-tient validation Computers and biomedical research an international journal25(5)486ndash509

51

[24] Gabrilovich E and Markovitch S (2004) Text categorization with manyredundant features Using aggressive feature selection to make SVMs compet-itive with C45 In Proceedings of the twenty-first international conference onMachine learning ACM International Conference Proceeding Series page 41Banff Canada ACM

[25] Gamon M (2004) Linguistic correlates of style In Proceedings of the 20thinternational conference on Computational Linguistics - COLING rsquo04 pages611ndashes Morristown NJ USA Association for Computational Linguistics

[26] Honore A (1979) Some simple measures of richness of vocabulary Asso-ciation for Literary and Linguistic Computing Bulletin 7(2)172ndash177

[27] Hoover D L (2004) Testing Burrowsrsquos Delta Literary and LinguisticComputing 19(4)453ndash476

[28] Hsiung P Moore A Neill D and Schneider J (2005) Alias Detec-tion in Link Data Sets In Proceedings of the International Conference onIntelligence Analysis (2005) volume 4

[29] Iqbal F Binsalleeh H Fung B C and Debbabi M (2010) Miningwriteprints from anonymous e-mails for forensic investigation Digital Inves-tigation 7(1-2)56ndash64

[30] Iqbal F Binsalleeh H Fung B C and Debbabi M (2011) A unifieddata mining solution for authorship analysis in anonymous textual commu-nications Information Sciences

[31] Iqbal F Hadjidj R Fung B C and Debbabi M (2008) A novelapproach of mining write-prints for authorship attribution in e-mail forensicsDigital Investigation 5S42ndashS51

[32] Jaro M A (1995) Probabilistic linkage of large public health data filesStatistics in Medicine 14(5-7)491ndash498

[33] Jeh G (2002) SimRank a measure of structural-context similarity Pro-ceedings of the eighth ACM SIGKDD international conference on Knowledgediscovery and data mining pages 1ndash11

[34] Joachims T (1998) Text categorization with support vector machinesLearning with many relevant features Machine Learning 1398(23)137ndash142

[35] Johnson M A (2009) Svmnet 163 Retrieved from httpwww

matthewajohnsonorgsoftwaresvmhtml

[36] Kern R Seifert C Zechner M and Granitzer M (2011) Vote VetoMeta-Classifier for Authorship Identification In Notebook for PAN at CLEF2011

52

[37] Koppel M (2003) Exploiting stylistic idiosyncrasies for authorship attri-bution In Proceedings of the 2003 International Joint Conferences on Artifi-cial Intelligence Workshop on Computational Approaches to Style Analysisand Synthesis number 2000 pages 69ndash72 Acapulco Mexico

[38] Koppel M and Akiva N (2003) A corpus-independent feature set forstyle-based text categorization In Proceedings of IJCAIrsquo03 Workshop onComputational Approaches to Style Analysis and Synthesis Acapulco Mex-ico

[39] Koppel M Gan R and Messeri E (2006) Authorship Attribution withThousands of Candidate Authors In Proceedings of the 29th annual interna-tional ACM SIGIR conference on Research and Development in InformationRetrieval pages 659ndash660 Seattle Washington USA ACM

[40] Koppel M Schler J and Argamon S (2010) Authorship attribution inthe wild Language Resources and Evaluation 45(1)83ndash94

[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx K (2010) Scalability Issues in Authorship Attribution Schaal-baarheid bij Auteursherkenning PhD thesis Universiteit Antwerpen

[45] Luyckx K Daelemans W Hamilton A and Madison J (2008) Au-thorship Attribution and Verification with Many Authors and Limited DataIn Proceedings of the 22nd International Conference on Computational Lin-guistics COLING 08 (2008) number August pages 513ndash520 Association forComputational Linguistics

[46] Manning C D Raghavan P and Schutze H (2009) An Introduction toInformation Retrieval Number c Cambridge University Press CambridgeEngland 2009 edition

[47] Mendenhall T C (1887) The Characteristic Curves of Composition Sci-ence (New York NY) 9(214S)237ndash46

[48] Mendenhall T C (1901) A Mechanical Solution of a Literary problemPopular Science Monthly 60(2)97ndash105

53

[49] Miller (1995) WordNet A Lexical Database Communications of theACM 38(11)39ndash41

[50] Monge A E and Elkan C P (1996) The field matching problem Algo-rithms and applications In Evangelos S Jiawei H and Usama F editorsProceedings of the Second International Conference on Knowledge Discoveryand Data Mining number Slaven 1992 pages 267ndash270 Menlo Park Califor-nia The AAAI Press Menlo Park California

[51] Mosteller F and Wallace D L (1964) Inference and Disputed AuthorshipThe Federalist The David Hume Series of Philosophy and Cognitive ScienceReissues Addison-Wesley

[52] Navarro G (2001) A guided tour to approximate string matching ACMComputing Surveys 33(1)31ndash88

[53] Odell M and Russell R (1918) The soundex coding system US Patents1261167

[54] Page L Brin S Motwani R and Winograd T (1998) The PageRankCitation Ranking Bringing Order to the Web Technical report StanfordInfoLab

[55] Reuther P and Walter B (2006) Survey on test collections and techniquesfor personal name matching Int J Metadata Semant Ontologies 1(2)89ndash99

[56] Rifkin R and Klautau A (2004) In Defense of One-Vs-All ClassificationJournal of Machine Learning Research 5101ndash141

[57] Sanderson C and Guenter S (2006) Short Text Authorship Attributionvia Sequence Kernels Markov Chains and Author Unmasking An Inves-tigation In Proceedings of the 2006 Conference on Empirical Methods inNatural Language Processing number July pages 482ndash491 Association forComputational Linguistics

[58] Shetty J and Adibi J (2004) The Enron Email Dataset DatabaseSchema and Brief Statistical Report Technical report Information SciencesInstitute

[59] Sichel H S (1986) Word Frequency Distributions and Type-Token Char-acteristics The Mathematical Scientist 1145ndash72

[60] Simeon M and Hilderman R (2009) An Empirical Study of CategorySkew on Feature Selection for Text Categorization In Proceedings of the22nd Canadian Conference on Artificial Intelligence Advances in ArtificialIntelligence pages 249ndash252 Springer-Verlag Berlin

[61] Small H (1973) Co-citation in the Scientific Literature A New Measure ofthe Relationship Between Two Documents Journal of the American Societyfor Information Science pages 265ndash269

54

[62] Solorio T and Pillay S (2011) Authorship Identification with ModalitySpecific Meta Features In Notebook for PAN at CLEF 2011

[63] Stamatatos E (2009) A survey of modern authorship attribution methodsJournal of the American Society for Information Science and Technology60(3)538ndash556

[64] Tanguy L Urieli A Calderone B Hathout N and Sajous F (2011)A multitude of linguistically-rich features for authorship attribution In Note-book for PAN at CLEF 2011

[65] Tearle M Taylor K and Demuth H (2008) An algorithm for auto-mated authorship attribution using neural networks Literary and LinguisticComputing 23(4)425ndash442

[66] Tsuboi Y and Matsumoto Y (2002) Authorship identification for Het-eregeneous Documents IPSJ SIG Notes pages 17ndash24

[67] Winkler W E (1999) The State of Record Linkage and Current ResearchProblems Statistical Research Division US Census Bureau pages 1ndash15

[68] Yang H (2003) Margin Variations in Support Vector Regression for theStock Market Prediction PhD thesis The Chinese University of Hong Kong

[69] Yule G U (1944) The statistical study of literary vocabulary CambridgeUniversity Press

[70] Zhao Y and Zobel J (2005) Effective and Scalable Authorship Attribu-tion Using Function Words Information Retrieval Technology Proceedings3689174ndash189

[71] Zheng R Li J Chen H and Huang Z (2005) A Framework forAuthorship Identification of Online Messages Writing-Style Features andJournal of the American Society for Information Science 57(3)378ndash393

55

Appendix

List of function words used in authorship SVM

aaboutaboveafterallalthoughamamonganandanotheranyanybodyanyoneanythingarearoundasatbebecausebeforebehindbelowbesidebetween

bothbutbycancosdodowneacheitherenougheveryeverybodyeveryoneeverythingfewfollowingforfromhaveheherhimiifinincluding

insideintoisititslatterlesslikelittlelotsmanymemoremostmuchmustmynearneedneithernonobodynonenornothingof

offononceoneontooppositeorouroutsideoverownpastperplentyplusregardingsameseveralsheshouldsincesosomesomebodysomeonesomething

suchthanthatthetheirthemthesetheythisthosethoughthroughtilltotowardtowardsunderunlessunlikeuntilupuponususedviawe

whatwhateverwhenwherewhetherwhichwhilewhowhoeverwhomwhosewillwithwithinwithoutworthwouldyesyouyour

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 46: Thesis Freek Maes - Final Version

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 41 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the mixed test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 42 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the mixed test set

42

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 43 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the hard test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 44 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the hard test set

43

Figure 45 Best results on the mixed test set for different techniques Preci-sion and recall values correspond to the given F1-scores

44

Figure 46 Best results on the hard test set for different techniques Precisionand recall values correspond to the given F1-scores

45

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used forresolving aliases and disambiguate authors in email data Specifically it wasinvestigated whether a combination of techniques could perform better thanindividual techniques on these tasks The main results of this thesis can besummarized as follows

Jaro Winkler The Jaro-Winkler approach gave good results on the mixedtest set but failed on the hard test set The high F1-score on the mixed test setcan be explained by the fact that many of the artificial aliases had an extremelyhigh Jaro-Winkler similarity The hard test set more closely mimics a real-worldscenario where aliases do not look as much alike For example Boongoen et al[6] showed that in their data set of real terrorist names derived from web pages70 of the true aliases had a Jaro-Winkler similarity of less than 06 Sincethe hard test set also features more aliases with low Jaro-Winkler similaritythe performance on this test set is significantly lower However the results stillshow that using a simple string metric can detect many aliases resulting fromspelling errors or the use of different email addresses for work home etc

Connected Path. It can be concluded that the Connected Path algorithm failed to achieve good results on both test sets for three reasons. First, since authors have been split up into aliases and some have been removed altogether, the link network's structure might have been corrupted. This especially affects link analysis that goes beyond direct neighbors, since it takes into account more complicated link connections. Second, because of time constraints, the link network search has been performed to depth 3, which means that only the information contained in paths of length 2 and 3 has been used in the calculation of the similarity score. Boongoen et al. [6] achieved better accuracy by searching to depth 4 compared with a search to depth 2; the same behavior of Connected Path would be expected on this data set if the search were performed to a greater depth. Third, the Connected Path method can only return similarity scores for authors that are connected to the original author. If no Connected Path score was returned for a particular author-alias pair, the alias had to be counted as a false negative, thereby decreasing the overall recall.


Jaccard. Using Jaccard similarity yielded better results than the Connected Path algorithm. Since Jaccard similarity only takes into account direct neighbors, it is less affected by changes in the link network. Moreover, the Jaccard similarity can be calculated between any two authors in the data set, which is why it scored better than the Connected Path method.
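Its behavior is easy to see in code; a sketch over sets of direct correspondents (the names are invented):

    def jaccard_similarity(neighbors_a, neighbors_b):
        # Overlap between two authors' sets of direct correspondents.
        union = neighbors_a | neighbors_b
        if not union:
            return 0.0
        return len(neighbors_a & neighbors_b) / len(union)

    # Two aliases of one person tend to write to the same people:
    print(jaccard_similarity({"kay", "vince", "louise"},
                             {"kay", "vince", "greg"}))  # 0.5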

Authorship SVM. The use of authorship SVMs gave good results overall, especially considering that there are 314 candidate aliases for each author and that the training texts are short. Considering that Luyckx [44] reported scalability issues when using a multi-class SVM approach, the one-versus-all approach adopted in this thesis is very promising.
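The thesis classifiers were built with SVM.NET [35] on top of LIBSVM [10]; purely to illustrate the one-versus-all setup on function-word counts, a scikit-learn sketch with toy data:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    # Toy corpus; the thesis trains on Enron emails with 314 candidate authors.
    train_emails = ["i will be there because you asked me",
                    "we should do this once more before friday"]
    train_authors = ["author_a", "author_b"]
    function_words = ["a", "about", "because", "i", "me", "more",
                      "once", "should", "we", "will", "you"]  # full list: Appendix

    # token_pattern keeps one-letter function words such as "a" and "i"
    vectorizer = CountVectorizer(vocabulary=function_words,
                                 token_pattern=r"(?u)\b\w+\b")
    X = vectorizer.fit_transform(train_emails)  # function-word counts per email
    classifier = LinearSVC()                    # one-versus-rest by default
    classifier.fit(X, train_authors)
    print(classifier.predict(vectorizer.transform(["once more i will be there"])))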

Combined techniques. The combination JW-CP-SVM did not perform very well. On the mixed test set it performed only as well as authorship SVM or even Jaro-Winkler alone, and on the hard test set it performed even worse. For the aforementioned reasons, the Connected Path method failed to achieve good results in general; in combination with the low Jaro-Winkler performance on the hard data set, this resulted in the combination JW-CP-SVM failing to achieve reasonable results. The best results on both test sets are achieved by the combination of Jaro-Winkler, Jaccard and authorship SVM. On the hard test set, the increase in F1-score of this combination over the second-best technique (SVM) is as high as 16%, whereas on the mixed test set the increase over the second-best technique (Jaro-Winkler) is 9%.
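The voting idea behind the best combination can be sketched as a small SVM trained on per-pair score vectors, so that it learns which combinations of evidence indicate a true alias (all numbers below are invented):

    import numpy as np
    from sklearn.svm import SVC

    # One row per candidate author-alias pair; columns hold the pair's
    # Jaro-Winkler, Jaccard and authorship-SVM scores.
    X_train = np.array([[0.95, 0.60, 0.80],   # obvious alias: similar address
                        [0.40, 0.55, 0.75],   # alias with a dissimilar address
                        [0.30, 0.05, 0.10],   # unrelated pair
                        [0.85, 0.02, 0.05]])  # similar address, different person
    y_train = np.array([1, 1, 0, 0])          # 1 = true alias

    voter = SVC()  # learns a decision boundary over the three scores
    voter.fit(X_train, y_train)
    print(voter.predict([[0.45, 0.50, 0.70]]))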

5.1 Conclusion

The results of the experiments confirm the hypothesis that a combination of techniques can yield better results than using these techniques individually. The research questions that were formulated to guide this research are answered below.

Which authorship disambiguation and alias resolution techniques exist that can be used on email data?

The literature review of Chapter 2 has provided an extensive overview of the different techniques that can be used to attribute authorship and resolve aliases. Techniques that operate on the domain of email addresses are able to resolve superficial aliases resulting from unintentional misspellings or simple variations in naming conventions. Authorship attribution techniques can predict the real author of a given set of emails very well, provided that there is enough training text available for each author; if so, dealing with a large author set has also proven to be possible. Link analysis techniques have low precision and recall when used individually, but can still find aliases that other techniques do not.


How can techniques from different domains be combined?

Section 2.4 has given an overview of approaches to combining techniques from different domains. Previous literature on the subject is scarce, especially on combining techniques from more than two domains. A common method is to use a linear weighted combination of the results of different techniques. Since the weights are often constructed manually, the results are mediocre and the ability of these techniques to generalize is low. Better results are achieved when a classifier is trained to distinguish between good and bad combinations of results.
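For reference, the simplest such scheme scores an author pair (a, b) as a fixed weighted sum of the individual technique scores s_i(a, b):

    sim(a, b) = Σ_i w_i · s_i(a, b),   with Σ_i w_i = 1 and w_i ≥ 0,

where the weights w_i are set by hand; the trained-classifier alternative replaces this fixed linear rule with a decision boundary learned from labeled example pairs.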

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content, and Jaccard similarity of the link network, combined using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination also tend to be more robust across different decision thresholds, which is useful when determining a proper threshold is difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques depends on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective at disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques used in this thesis perform on a full data set with real aliases, which could not be found for this research. Should such a collection not exist, it would be worthwhile to create one.

The link analysis techniques used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is worth investigating how the algorithm can be made less computationally intensive in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.

Finally, the assumption has been made that the results of the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the techniques affect each other. There is a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, more research is needed to examine the best choice of feature sets, techniques, and aggregation methods.


Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), pages 611–617, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of the IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 08), pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence: Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science, 57(3):378–393.

Appendix

List of function words used in the authorship SVM:

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between,

both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including,

inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of,

off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something,

such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we,

what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 47: Thesis Freek Maes - Final Version

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) Jaro-Winkler

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) Connected Path

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(c) Jaccard

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(d) Authorship SVM

Figure 43 Precision recall and F1 calculated using various decision thresh-olds for individual techniques on the hard test set

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(a) JW-CP-SVM

0 01 02 03 04 05 06 07 08 09 10

02

04

06

08

1

12

Decision threshold

PrecisionRecallF1

(b) JW-Jaccard-SVM

Figure 44 Precision recall and F1 calculated using various decision thresh-olds for combined techniques on the hard test set

43

Figure 45 Best results on the mixed test set for different techniques Preci-sion and recall values correspond to the given F1-scores

44

Figure 46 Best results on the hard test set for different techniques Precisionand recall values correspond to the given F1-scores

45

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used forresolving aliases and disambiguate authors in email data Specifically it wasinvestigated whether a combination of techniques could perform better thanindividual techniques on these tasks The main results of this thesis can besummarized as follows

Jaro Winkler The Jaro-Winkler approach gave good results on the mixedtest set but failed on the hard test set The high F1-score on the mixed test setcan be explained by the fact that many of the artificial aliases had an extremelyhigh Jaro-Winkler similarity The hard test set more closely mimics a real-worldscenario where aliases do not look as much alike For example Boongoen et al[6] showed that in their data set of real terrorist names derived from web pages70 of the true aliases had a Jaro-Winkler similarity of less than 06 Sincethe hard test set also features more aliases with low Jaro-Winkler similaritythe performance on this test set is significantly lower However the results stillshow that using a simple string metric can detect many aliases resulting fromspelling errors or the use of different email addresses for work home etc

Connected Path It can be concluded that the Connected Path algorithmfailed to achieve good results on both test sets because of three reasons Firstsince authors have been split up into aliases and some have been removed all to-gether the link networkrsquos structure might have been corrupted This especiallyaffects link analysis that goes beyond the analysis of direct neighbors since ittakes into account more complicated link connections Second because of timeconstraints the link network search has been performed to depth 3 which meansthat only the information contained in paths of length 2 and 3 have been usedin the calculation of the similarity score Boongoen et al [6] achieved betteraccuracy by searching to depth 4 compared with a search to depth 2 It isexpected that the same behavior of Connected Path can be observed on thisdata set if the search would have been performed to a greater depth Third theConnected Path method can only return similarity scores for authors that areconnected to the original author If there was no Connected Path score returnedfor a particular author-alias pair the alias had to be counted as a false negative

46

thereby decreasing the overall recallJaccard Using Jaccard similarity yielded better results than the Connected

Path algorithm Since Jaccard similarity only takes into account direct neigh-bors it is less affected by changes in the link network Moreover the Jaccardsimilarity can be calculated between any two authors in the data set which iswhy it scored better than the Connected Path method

Authorship SVM The use of authorship SVMrsquos gave good results overallespecially considering the fact that there are 314 candidate aliases for eachauthor and that the training texts are short Considering that Luyckx [44]reported scalability issues when using a multi-class SVM approach the one-versus-all approach that has been adopted in this thesis is very promising

Combined techniques The combination of JW-CP-SVM did not performvery well On the mixed test set it performed as good as authorship SVMor even Jaro-Winkler alone and for the hard test set it performed even worseBecause of aforementioned reasons the Connected Path method failed to achievegood results in general In combination with the low Jaro-Winkler performanceon the hard data set this resulted in the combination JW-CP-SVM failing toachieve reasonable results The best results for both test sets are achieved bythe combination of Jaro-Winkler Jaccard and authorship SVM On the hardtest set the increase in F1-score of this combination compared with the secondbest technique (SVM) is as high as 16 whereas on the mixed test set theincrease to the second-best technique (Jaro-Winkler) is 9

51 Conclusion

The results of the experiments confirm the hypothesis that a combination oftechniques can yield better results than using these techniques individuallyThe research questions that have been formulated to guide this research areanswered below

Which authorship disambiguation and alias resolution techniques ex-ist that can be used on email data

The literature review of section 2 has provided an extensive overview of thedifferent techniques that can be used to attribute authorship and resolve aliasesTechniques that operate on the domain of email addresses are able to resolvesuperficial aliases resulting from unintentional misspellings or simple variationsin naming conventions Authorship attribution techniques can predict the realauthor of a given set of email very well provided that there is enough trainingtext available for each author If this is the case dealing with a large authorset has also proven to be possible Link analysis techniques have low precisionand recall when used individually but can still manage to find aliases that othertechniques do not

47

How can techniques from different domains be combined

Section 24 has given an overview of approaches to combining techniques fromdifferent domains Previous literature on the subject is rare especially on com-bining techniques from more than two domains A common method is to usea linear weighted combination of the results of different techniques Since theweights are often manually constructed the results are not that good and theability of these techniques to generalize is low Better results are achieved whena classifier is trained to distinguish between good and bad combinations of re-sults

Can a combination of techniques from different domains increase per-formance when compared with individual techniques

The results of Chapter 4 show that a combination of techniques can indeedincrease precision and recall when compared with individual techniques Specif-ically a combination of Jaro-Winkler similarity on email addresses authorshipSVM on email content and Jaccard similarity of the link network using an SVMvoting algorithm achieves the best results when tested on a subset of the EN-RON data set The results of this combination of techniques also tend to bemore robust across different decision thresholds something that is useful whendetermining a proper threshold might be difficult It is important to note thatthe relative improvement in F1-score of the combined techniques over the in-dividual techniques is dependent on the number of low Jaro-Winkler aliases inthe test set Especially on the hard data set where aliases are more difficult torecognize the combination of techniques performs very well and achieves sig-nificantly higher F1-scores than the individual techniques This indicates thatthe different techniques are indeed complementary and can work together toachieve better results

To summarize it can be concluded that a combination of techniques thatoperate on different domains is more effective in disambiguating authors andresolving aliases than each of these techniques individually

52 Future Recommendations

The results and conclusion that have been put forward in the previous sectionsprovide good ground for future research It will be interesting to see how wellthe techniques that have been used in this thesis perform on a full data set withreal aliases which could not be found to use in this research Should such acollection not exist it is worthwhile to create one

The link analysis techniques that have been used in this paper only useinformation from the direct neighborhood of the authors Boongoen et al[6]have already shown that searching to a greater depth yields better results soit is useful to look at how the algorithm can be optimized to be less computa-tionally intensive in order to search to greater depths Moreover since the less

48

sophisticated Jaccard similarity surprisingly outperformed Connected Path itis worthwhile to experiment with different link analysis techniques

Finally the assumption has been made that the results from various tech-niques are independent of each other These assumptions have not been testedand it is not clear if and in what way various techniques affect each other Thereare a myriad of decisions to be made when implementing an alias resolution sys-tem that combines different techniques Therefore it is imperative that moreresearch will be done to examine the best choice of feature sets techniques andaggregation methods

49

Chapter 6

Bibliography

[1] Abbasi A and Chen H (2005) Applying authorship analysis to extremist-group Web forum messages Intelligent Systems IEEE 20(5)67ndash75

[2] Argamon S and Juola P (2011) Overview of the International Au-thorship Identification Competition at PAN-2011 In CLEF (Notebook Pa-persLabsWorkshop)

[3] Baroni M Matiasek J and Trost H (2002) Unsupervised discovery ofmorphologically related words based on orthographic and semantic similarityIn Proceedings of the ACL-02 workshop on Morphological and phonologicallearning-Volume 6 volume 6 pages 48ndash57 Morristown NJ USA Associationfor Computational Linguistics

[4] Binongo J N G (2003) Who Wrote the 15th Book of Oz An Applicationof Multivariate Analysis to Authorship Attribution Chance 16(2)9ndash17

[5] Boongoen T and Shen Q (2009) Semi-supervised OWA aggregation forlink-based similarity evaluation and alias detection 2009 IEEE InternationalConference on Fuzzy Systems pages 288ndash293

[6] Boongoen T Shen Q and Price C (2010) Disclosing false identitythrough hybrid link analysis Artificial Intelligence and Law 18(1)77ndash102

[7] Brunet E (1978) Le Vocabulaire de Jean Giraudoux structure et evolutionStatistique et informatique appliquees a letude des textes a partir des donneesdu Tresor de la langue francaise Le Vocabulaire des grands ecrivains francais1 Slatkine

[8] Burrows J (2002) rsquoDeltarsquo A measure of stylistic difference and a guide tolikely authorship Literary and Linguistic Computing 17(3)267ndash287

[9] Burrows J (2007) All the Way Through Testing for Authorship in Differ-ent Frequency Strata Literary and Linguistic Computing 22(1)27ndash47

50

[10] Chang C-c and Lin C-j (2011) LIBSVM A Library for Support VectorMachines ACM Transactions on Intelligent Systems and Technology 2(3)1ndash39

[11] Christen P (2006) A Comparison of Personal Name Matching Techniquesand Practical Issues In Sixth IEEE International Conference on Data Mining- Workshops (ICDMWrsquo06) number September pages 290ndash294 IEEE

[12] Cohen W W (2009) Enron email dataset Retrieved from httpwwwcscmuedu~enron

[13] Cohen W W and Fienberg S E (2003) A comparison of three stringmatching algorithms Methods 20(1)73ndash78

[14] Cortes C and Vapnik V (1995) Support-vector networks MachineLearning 20(3)273ndash297

[15] Crammer K and Singer Y (2002) On the Algorithmic Implementationof Multiclass Kernel-based Vector Machines Journal of Machine LearningResearch 2(2)265ndash292

[16] de Vel O (2000) Mining e-mail authorship Proc Workshop on TextMining ACM International

[17] de Vel O Anderson A Corney M and Mohay G (2001) Mining e-mailcontent for author identification forensics ACM SIGMOD Record 30(4)55

[18] Duan K-b and Keerthi S S (2005) Which Is the Best Multiclass SVMMethod An Empirical Study Multiple Classifier Systems 3541278ndash285

[19] Fan R-e Chen P-h and Lin C-j (2005) Working Set Selection UsingSecond Order Information for Training Support Vector Machines Journal ofMachine Learning Research 61889ndash1918

[20] Feiguina O and Hirst G (2007) Authorship attribution for small textsLiterary and forensic experiments In Proceedings of the 30th InternationalConference of the Special Interest Group on Information Retrieval Work-shop on Plagiarism Analysis Authorship Identification and Near-DuplicateDetection (SIGIR) pages 3ndash6

[21] FERC (2012) Information released in enron investigation Re-trieved from httpwwwfercgovindustrieselectricindus-act

wecenroninfo-releaseasp

[22] Forman G (2003) An Extensive Empirical Study of Feature SelectionMetrics for Text Classification Journal of Machine Learning Research 3(7-8)1289ndash1305

[23] Friedman C and Sideli R (1992) Tolerating spelling errors during pa-tient validation Computers and biomedical research an international journal25(5)486ndash509

51

[24] Gabrilovich E and Markovitch S (2004) Text categorization with manyredundant features Using aggressive feature selection to make SVMs compet-itive with C45 In Proceedings of the twenty-first international conference onMachine learning ACM International Conference Proceeding Series page 41Banff Canada ACM

[25] Gamon M (2004) Linguistic correlates of style In Proceedings of the 20thinternational conference on Computational Linguistics - COLING rsquo04 pages611ndashes Morristown NJ USA Association for Computational Linguistics

[26] Honore A (1979) Some simple measures of richness of vocabulary Asso-ciation for Literary and Linguistic Computing Bulletin 7(2)172ndash177

[27] Hoover D L (2004) Testing Burrowsrsquos Delta Literary and LinguisticComputing 19(4)453ndash476

[28] Hsiung P Moore A Neill D and Schneider J (2005) Alias Detec-tion in Link Data Sets In Proceedings of the International Conference onIntelligence Analysis (2005) volume 4

[29] Iqbal F Binsalleeh H Fung B C and Debbabi M (2010) Miningwriteprints from anonymous e-mails for forensic investigation Digital Inves-tigation 7(1-2)56ndash64

[30] Iqbal F Binsalleeh H Fung B C and Debbabi M (2011) A unifieddata mining solution for authorship analysis in anonymous textual commu-nications Information Sciences

[31] Iqbal F Hadjidj R Fung B C and Debbabi M (2008) A novelapproach of mining write-prints for authorship attribution in e-mail forensicsDigital Investigation 5S42ndashS51

[32] Jaro M A (1995) Probabilistic linkage of large public health data filesStatistics in Medicine 14(5-7)491ndash498

[33] Jeh G (2002) SimRank a measure of structural-context similarity Pro-ceedings of the eighth ACM SIGKDD international conference on Knowledgediscovery and data mining pages 1ndash11

[34] Joachims T (1998) Text categorization with support vector machinesLearning with many relevant features Machine Learning 1398(23)137ndash142

[35] Johnson M A (2009) Svmnet 163 Retrieved from httpwww

matthewajohnsonorgsoftwaresvmhtml

[36] Kern R Seifert C Zechner M and Granitzer M (2011) Vote VetoMeta-Classifier for Authorship Identification In Notebook for PAN at CLEF2011

52

[37] Koppel M (2003) Exploiting stylistic idiosyncrasies for authorship attri-bution In Proceedings of the 2003 International Joint Conferences on Artifi-cial Intelligence Workshop on Computational Approaches to Style Analysisand Synthesis number 2000 pages 69ndash72 Acapulco Mexico

[38] Koppel M and Akiva N (2003) A corpus-independent feature set forstyle-based text categorization In Proceedings of IJCAIrsquo03 Workshop onComputational Approaches to Style Analysis and Synthesis Acapulco Mex-ico

[39] Koppel M Gan R and Messeri E (2006) Authorship Attribution withThousands of Candidate Authors In Proceedings of the 29th annual interna-tional ACM SIGIR conference on Research and Development in InformationRetrieval pages 659ndash660 Seattle Washington USA ACM

[40] Koppel M Schler J and Argamon S (2010) Authorship attribution inthe wild Language Resources and Evaluation 45(1)83ndash94

[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx K (2010) Scalability Issues in Authorship Attribution Schaal-baarheid bij Auteursherkenning PhD thesis Universiteit Antwerpen

[45] Luyckx K Daelemans W Hamilton A and Madison J (2008) Au-thorship Attribution and Verification with Many Authors and Limited DataIn Proceedings of the 22nd International Conference on Computational Lin-guistics COLING 08 (2008) number August pages 513ndash520 Association forComputational Linguistics

[46] Manning C D Raghavan P and Schutze H (2009) An Introduction toInformation Retrieval Number c Cambridge University Press CambridgeEngland 2009 edition

[47] Mendenhall T C (1887) The Characteristic Curves of Composition Sci-ence (New York NY) 9(214S)237ndash46

[48] Mendenhall T C (1901) A Mechanical Solution of a Literary problemPopular Science Monthly 60(2)97ndash105

53

[49] Miller (1995) WordNet A Lexical Database Communications of theACM 38(11)39ndash41

[50] Monge A E and Elkan C P (1996) The field matching problem Algo-rithms and applications In Evangelos S Jiawei H and Usama F editorsProceedings of the Second International Conference on Knowledge Discoveryand Data Mining number Slaven 1992 pages 267ndash270 Menlo Park Califor-nia The AAAI Press Menlo Park California

[51] Mosteller F and Wallace D L (1964) Inference and Disputed AuthorshipThe Federalist The David Hume Series of Philosophy and Cognitive ScienceReissues Addison-Wesley

[52] Navarro G (2001) A guided tour to approximate string matching ACMComputing Surveys 33(1)31ndash88

[53] Odell M and Russell R (1918) The soundex coding system US Patents1261167

[54] Page L Brin S Motwani R and Winograd T (1998) The PageRankCitation Ranking Bringing Order to the Web Technical report StanfordInfoLab

[55] Reuther P and Walter B (2006) Survey on test collections and techniquesfor personal name matching Int J Metadata Semant Ontologies 1(2)89ndash99

[56] Rifkin R and Klautau A (2004) In Defense of One-Vs-All ClassificationJournal of Machine Learning Research 5101ndash141

[57] Sanderson C and Guenter S (2006) Short Text Authorship Attributionvia Sequence Kernels Markov Chains and Author Unmasking An Inves-tigation In Proceedings of the 2006 Conference on Empirical Methods inNatural Language Processing number July pages 482ndash491 Association forComputational Linguistics

[58] Shetty J and Adibi J (2004) The Enron Email Dataset DatabaseSchema and Brief Statistical Report Technical report Information SciencesInstitute

[59] Sichel H S (1986) Word Frequency Distributions and Type-Token Char-acteristics The Mathematical Scientist 1145ndash72

[60] Simeon M and Hilderman R (2009) An Empirical Study of CategorySkew on Feature Selection for Text Categorization In Proceedings of the22nd Canadian Conference on Artificial Intelligence Advances in ArtificialIntelligence pages 249ndash252 Springer-Verlag Berlin

[61] Small H (1973) Co-citation in the Scientific Literature A New Measure ofthe Relationship Between Two Documents Journal of the American Societyfor Information Science pages 265ndash269

54

[62] Solorio T and Pillay S (2011) Authorship Identification with ModalitySpecific Meta Features In Notebook for PAN at CLEF 2011

[63] Stamatatos E (2009) A survey of modern authorship attribution methodsJournal of the American Society for Information Science and Technology60(3)538ndash556

[64] Tanguy L Urieli A Calderone B Hathout N and Sajous F (2011)A multitude of linguistically-rich features for authorship attribution In Note-book for PAN at CLEF 2011

[65] Tearle M Taylor K and Demuth H (2008) An algorithm for auto-mated authorship attribution using neural networks Literary and LinguisticComputing 23(4)425ndash442

[66] Tsuboi Y and Matsumoto Y (2002) Authorship identification for Het-eregeneous Documents IPSJ SIG Notes pages 17ndash24

[67] Winkler W E (1999) The State of Record Linkage and Current ResearchProblems Statistical Research Division US Census Bureau pages 1ndash15

[68] Yang H (2003) Margin Variations in Support Vector Regression for theStock Market Prediction PhD thesis The Chinese University of Hong Kong

[69] Yule G U (1944) The statistical study of literary vocabulary CambridgeUniversity Press

[70] Zhao Y and Zobel J (2005) Effective and Scalable Authorship Attribu-tion Using Function Words Information Retrieval Technology Proceedings3689174ndash189

[71] Zheng R Li J Chen H and Huang Z (2005) A Framework forAuthorship Identification of Online Messages Writing-Style Features andJournal of the American Society for Information Science 57(3)378ndash393

55

Appendix

List of function words used in authorship SVM

aaboutaboveafterallalthoughamamonganandanotheranyanybodyanyoneanythingarearoundasatbebecausebeforebehindbelowbesidebetween

bothbutbycancosdodowneacheitherenougheveryeverybodyeveryoneeverythingfewfollowingforfromhaveheherhimiifinincluding

insideintoisititslatterlesslikelittlelotsmanymemoremostmuchmustmynearneedneithernonobodynonenornothingof

offononceoneontooppositeorouroutsideoverownpastperplentyplusregardingsameseveralsheshouldsincesosomesomebodysomeonesomething

suchthanthatthetheirthemthesetheythisthosethoughthroughtilltotowardtowardsunderunlessunlikeuntilupuponususedviawe

whatwhateverwhenwherewhetherwhichwhilewhowhoeverwhomwhosewillwithwithinwithoutworthwouldyesyouyour

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 48: Thesis Freek Maes - Final Version

Figure 45 Best results on the mixed test set for different techniques Preci-sion and recall values correspond to the given F1-scores

44

Figure 46 Best results on the hard test set for different techniques Precisionand recall values correspond to the given F1-scores

45

Chapter 5

Discussion

The purpose of this thesis was to investigate which techniques could be used forresolving aliases and disambiguate authors in email data Specifically it wasinvestigated whether a combination of techniques could perform better thanindividual techniques on these tasks The main results of this thesis can besummarized as follows

Jaro Winkler The Jaro-Winkler approach gave good results on the mixedtest set but failed on the hard test set The high F1-score on the mixed test setcan be explained by the fact that many of the artificial aliases had an extremelyhigh Jaro-Winkler similarity The hard test set more closely mimics a real-worldscenario where aliases do not look as much alike For example Boongoen et al[6] showed that in their data set of real terrorist names derived from web pages70 of the true aliases had a Jaro-Winkler similarity of less than 06 Sincethe hard test set also features more aliases with low Jaro-Winkler similaritythe performance on this test set is significantly lower However the results stillshow that using a simple string metric can detect many aliases resulting fromspelling errors or the use of different email addresses for work home etc

Connected Path It can be concluded that the Connected Path algorithmfailed to achieve good results on both test sets because of three reasons Firstsince authors have been split up into aliases and some have been removed all to-gether the link networkrsquos structure might have been corrupted This especiallyaffects link analysis that goes beyond the analysis of direct neighbors since ittakes into account more complicated link connections Second because of timeconstraints the link network search has been performed to depth 3 which meansthat only the information contained in paths of length 2 and 3 have been usedin the calculation of the similarity score Boongoen et al [6] achieved betteraccuracy by searching to depth 4 compared with a search to depth 2 It isexpected that the same behavior of Connected Path can be observed on thisdata set if the search would have been performed to a greater depth Third theConnected Path method can only return similarity scores for authors that areconnected to the original author If there was no Connected Path score returnedfor a particular author-alias pair the alias had to be counted as a false negative

46

thereby decreasing the overall recallJaccard Using Jaccard similarity yielded better results than the Connected

Path algorithm Since Jaccard similarity only takes into account direct neigh-bors it is less affected by changes in the link network Moreover the Jaccardsimilarity can be calculated between any two authors in the data set which iswhy it scored better than the Connected Path method

Authorship SVM The use of authorship SVMrsquos gave good results overallespecially considering the fact that there are 314 candidate aliases for eachauthor and that the training texts are short Considering that Luyckx [44]reported scalability issues when using a multi-class SVM approach the one-versus-all approach that has been adopted in this thesis is very promising

Combined techniques The combination of JW-CP-SVM did not performvery well On the mixed test set it performed as good as authorship SVMor even Jaro-Winkler alone and for the hard test set it performed even worseBecause of aforementioned reasons the Connected Path method failed to achievegood results in general In combination with the low Jaro-Winkler performanceon the hard data set this resulted in the combination JW-CP-SVM failing toachieve reasonable results The best results for both test sets are achieved bythe combination of Jaro-Winkler Jaccard and authorship SVM On the hardtest set the increase in F1-score of this combination compared with the secondbest technique (SVM) is as high as 16 whereas on the mixed test set theincrease to the second-best technique (Jaro-Winkler) is 9

51 Conclusion

The results of the experiments confirm the hypothesis that a combination oftechniques can yield better results than using these techniques individuallyThe research questions that have been formulated to guide this research areanswered below

Which authorship disambiguation and alias resolution techniques ex-ist that can be used on email data

The literature review of section 2 has provided an extensive overview of thedifferent techniques that can be used to attribute authorship and resolve aliasesTechniques that operate on the domain of email addresses are able to resolvesuperficial aliases resulting from unintentional misspellings or simple variationsin naming conventions Authorship attribution techniques can predict the realauthor of a given set of email very well provided that there is enough trainingtext available for each author If this is the case dealing with a large authorset has also proven to be possible Link analysis techniques have low precisionand recall when used individually but can still manage to find aliases that othertechniques do not

47

How can techniques from different domains be combined

Section 24 has given an overview of approaches to combining techniques fromdifferent domains Previous literature on the subject is rare especially on com-bining techniques from more than two domains A common method is to usea linear weighted combination of the results of different techniques Since theweights are often manually constructed the results are not that good and theability of these techniques to generalize is low Better results are achieved whena classifier is trained to distinguish between good and bad combinations of re-sults

Can a combination of techniques from different domains increase per-formance when compared with individual techniques

The results of Chapter 4 show that a combination of techniques can indeedincrease precision and recall when compared with individual techniques Specif-ically a combination of Jaro-Winkler similarity on email addresses authorshipSVM on email content and Jaccard similarity of the link network using an SVMvoting algorithm achieves the best results when tested on a subset of the EN-RON data set The results of this combination of techniques also tend to bemore robust across different decision thresholds something that is useful whendetermining a proper threshold might be difficult It is important to note thatthe relative improvement in F1-score of the combined techniques over the in-dividual techniques is dependent on the number of low Jaro-Winkler aliases inthe test set Especially on the hard data set where aliases are more difficult torecognize the combination of techniques performs very well and achieves sig-nificantly higher F1-scores than the individual techniques This indicates thatthe different techniques are indeed complementary and can work together toachieve better results

To summarize it can be concluded that a combination of techniques thatoperate on different domains is more effective in disambiguating authors andresolving aliases than each of these techniques individually

52 Future Recommendations

The results and conclusion that have been put forward in the previous sectionsprovide good ground for future research It will be interesting to see how wellthe techniques that have been used in this thesis perform on a full data set withreal aliases which could not be found to use in this research Should such acollection not exist it is worthwhile to create one

The link analysis techniques that have been used in this paper only useinformation from the direct neighborhood of the authors Boongoen et al[6]have already shown that searching to a greater depth yields better results soit is useful to look at how the algorithm can be optimized to be less computa-tionally intensive in order to search to greater depths Moreover since the less

48

sophisticated Jaccard similarity surprisingly outperformed Connected Path itis worthwhile to experiment with different link analysis techniques

Finally the assumption has been made that the results from various tech-niques are independent of each other These assumptions have not been testedand it is not clear if and in what way various techniques affect each other Thereare a myriad of decisions to be made when implementing an alias resolution sys-tem that combines different techniques Therefore it is imperative that moreresearch will be done to examine the best choice of feature sets techniques andaggregation methods

49

Chapter 6

Bibliography

[1] Abbasi A and Chen H (2005) Applying authorship analysis to extremist-group Web forum messages Intelligent Systems IEEE 20(5)67ndash75

[2] Argamon S and Juola P (2011) Overview of the International Au-thorship Identification Competition at PAN-2011 In CLEF (Notebook Pa-persLabsWorkshop)

[3] Baroni M Matiasek J and Trost H (2002) Unsupervised discovery ofmorphologically related words based on orthographic and semantic similarityIn Proceedings of the ACL-02 workshop on Morphological and phonologicallearning-Volume 6 volume 6 pages 48ndash57 Morristown NJ USA Associationfor Computational Linguistics

[4] Binongo J N G (2003) Who Wrote the 15th Book of Oz An Applicationof Multivariate Analysis to Authorship Attribution Chance 16(2)9ndash17

[5] Boongoen T and Shen Q (2009) Semi-supervised OWA aggregation forlink-based similarity evaluation and alias detection 2009 IEEE InternationalConference on Fuzzy Systems pages 288ndash293

[6] Boongoen T Shen Q and Price C (2010) Disclosing false identitythrough hybrid link analysis Artificial Intelligence and Law 18(1)77ndash102

[7] Brunet E (1978) Le Vocabulaire de Jean Giraudoux structure et evolutionStatistique et informatique appliquees a letude des textes a partir des donneesdu Tresor de la langue francaise Le Vocabulaire des grands ecrivains francais1 Slatkine

[8] Burrows J (2002) rsquoDeltarsquo A measure of stylistic difference and a guide tolikely authorship Literary and Linguistic Computing 17(3)267ndash287

[9] Burrows J (2007) All the Way Through Testing for Authorship in Differ-ent Frequency Strata Literary and Linguistic Computing 22(1)27ndash47

50

[10] Chang C-c and Lin C-j (2011) LIBSVM A Library for Support VectorMachines ACM Transactions on Intelligent Systems and Technology 2(3)1ndash39

[11] Christen P (2006) A Comparison of Personal Name Matching Techniquesand Practical Issues In Sixth IEEE International Conference on Data Mining- Workshops (ICDMWrsquo06) number September pages 290ndash294 IEEE

[12] Cohen W W (2009) Enron email dataset Retrieved from httpwwwcscmuedu~enron

[13] Cohen W W and Fienberg S E (2003) A comparison of three stringmatching algorithms Methods 20(1)73ndash78

[14] Cortes C and Vapnik V (1995) Support-vector networks MachineLearning 20(3)273ndash297

[15] Crammer K and Singer Y (2002) On the Algorithmic Implementationof Multiclass Kernel-based Vector Machines Journal of Machine LearningResearch 2(2)265ndash292

[16] de Vel O (2000) Mining e-mail authorship Proc Workshop on TextMining ACM International

[17] de Vel O Anderson A Corney M and Mohay G (2001) Mining e-mailcontent for author identification forensics ACM SIGMOD Record 30(4)55

[18] Duan K-b and Keerthi S S (2005) Which Is the Best Multiclass SVMMethod An Empirical Study Multiple Classifier Systems 3541278ndash285

[19] Fan R-e Chen P-h and Lin C-j (2005) Working Set Selection UsingSecond Order Information for Training Support Vector Machines Journal ofMachine Learning Research 61889ndash1918

[20] Feiguina O and Hirst G (2007) Authorship attribution for small textsLiterary and forensic experiments In Proceedings of the 30th InternationalConference of the Special Interest Group on Information Retrieval Work-shop on Plagiarism Analysis Authorship Identification and Near-DuplicateDetection (SIGIR) pages 3ndash6

[21] FERC (2012) Information released in enron investigation Re-trieved from httpwwwfercgovindustrieselectricindus-act

wecenroninfo-releaseasp

[22] Forman G (2003) An Extensive Empirical Study of Feature SelectionMetrics for Text Classification Journal of Machine Learning Research 3(7-8)1289ndash1305

[23] Friedman C and Sideli R (1992) Tolerating spelling errors during pa-tient validation Computers and biomedical research an international journal25(5)486ndash509

51

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pages 611–es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis (2005), volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html.

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING '08, pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science (New York, N.Y.), 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. Int. J. Metadata Semant. Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between,
both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including,
inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of,
off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something,
such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we,
what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your
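To illustrate how such a fixed list of function words can be turned into features for a one-versus-all authorship SVM, the sketch below builds count vectors over this vocabulary and trains a linear classifier. It is a minimal, hypothetical example using scikit-learn, not the SVM.NET [35] / LIBSVM [10] implementation actually used in the thesis; the abbreviated word list, toy emails, and author labels are placeholders.

# Minimal sketch: function-word counts as features for a
# one-versus-all authorship SVM (hypothetical, scikit-learn-based).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Abbreviated vocabulary; the full function-word list is given above.
FUNCTION_WORDS = ["a", "about", "after", "all", "and", "the", "of",
                  "to", "in", "is", "it", "that", "will", "with"]

# The default token pattern ignores one-letter tokens such as "a" and "i",
# so widen it to count every word.
vectorizer = CountVectorizer(vocabulary=FUNCTION_WORDS,
                             token_pattern=r"(?u)\b\w+\b")

# Toy training data: emails of known authorship.
emails = ["I will send the report after the meeting.",
          "It is all in the attached file, as promised."]
authors = ["author_a", "author_b"]

X = vectorizer.fit_transform(emails)  # function-word counts per email

# LinearSVC fits one classifier per author (one-versus-rest by default),
# mirroring the one-versus-all setup adopted in the thesis.
clf = LinearSVC().fit(X, authors)
print(clf.predict(vectorizer.transform(["The report is in the file."])))

In practice the raw counts would typically be normalized by email length before training, since absolute counts conflate message length with writing style.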

Page 51: Thesis Freek Maes - Final Version

thereby decreasing the overall recallJaccard Using Jaccard similarity yielded better results than the Connected

Path algorithm Since Jaccard similarity only takes into account direct neigh-bors it is less affected by changes in the link network Moreover the Jaccardsimilarity can be calculated between any two authors in the data set which iswhy it scored better than the Connected Path method

Authorship SVM The use of authorship SVMrsquos gave good results overallespecially considering the fact that there are 314 candidate aliases for eachauthor and that the training texts are short Considering that Luyckx [44]reported scalability issues when using a multi-class SVM approach the one-versus-all approach that has been adopted in this thesis is very promising

Combined techniques The combination of JW-CP-SVM did not performvery well On the mixed test set it performed as good as authorship SVMor even Jaro-Winkler alone and for the hard test set it performed even worseBecause of aforementioned reasons the Connected Path method failed to achievegood results in general In combination with the low Jaro-Winkler performanceon the hard data set this resulted in the combination JW-CP-SVM failing toachieve reasonable results The best results for both test sets are achieved bythe combination of Jaro-Winkler Jaccard and authorship SVM On the hardtest set the increase in F1-score of this combination compared with the secondbest technique (SVM) is as high as 16 whereas on the mixed test set theincrease to the second-best technique (Jaro-Winkler) is 9

51 Conclusion

The results of the experiments confirm the hypothesis that a combination oftechniques can yield better results than using these techniques individuallyThe research questions that have been formulated to guide this research areanswered below

Which authorship disambiguation and alias resolution techniques ex-ist that can be used on email data

The literature review of section 2 has provided an extensive overview of thedifferent techniques that can be used to attribute authorship and resolve aliasesTechniques that operate on the domain of email addresses are able to resolvesuperficial aliases resulting from unintentional misspellings or simple variationsin naming conventions Authorship attribution techniques can predict the realauthor of a given set of email very well provided that there is enough trainingtext available for each author If this is the case dealing with a large authorset has also proven to be possible Link analysis techniques have low precisionand recall when used individually but can still manage to find aliases that othertechniques do not

47

How can techniques from different domains be combined

Section 24 has given an overview of approaches to combining techniques fromdifferent domains Previous literature on the subject is rare especially on com-bining techniques from more than two domains A common method is to usea linear weighted combination of the results of different techniques Since theweights are often manually constructed the results are not that good and theability of these techniques to generalize is low Better results are achieved whena classifier is trained to distinguish between good and bad combinations of re-sults

Can a combination of techniques from different domains increase performance when compared with individual techniques?

The results of Chapter 4 show that a combination of techniques can indeed increase precision and recall when compared with individual techniques. Specifically, a combination of Jaro-Winkler similarity on email addresses, authorship SVM on email content, and Jaccard similarity of the link network, using an SVM voting algorithm, achieves the best results when tested on a subset of the ENRON data set. The results of this combination of techniques also tend to be more robust across different decision thresholds, something that is useful when determining a proper threshold might be difficult. It is important to note that the relative improvement in F1-score of the combined techniques over the individual techniques is dependent on the number of low Jaro-Winkler aliases in the test set. Especially on the hard data set, where aliases are more difficult to recognize, the combination of techniques performs very well and achieves significantly higher F1-scores than the individual techniques. This indicates that the different techniques are indeed complementary and can work together to achieve better results.
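
The SVM voting idea can be sketched as follows: rather than fixing weights by hand, a binary SVM is trained on the vector of technique scores to decide whether a candidate pair is an alias. scikit-learn is used for illustration, and the training pairs shown are placeholders, not data from the thesis's experiments.

```python
# Hedged sketch of SVM-based score fusion for alias resolution.
from sklearn.svm import SVC

# Each row: (Jaro-Winkler, authorship-SVM, Jaccard) scores for a candidate pair.
X = [[0.95, 0.80, 0.30],   # known alias pair
     [0.20, 0.10, 0.05],   # known non-alias pair
     [0.40, 0.85, 0.25],   # alias detectable mainly through writing style
     [0.15, 0.20, 0.02]]
y = [1, 0, 1, 0]           # 1 = alias, 0 = different authors

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([[0.30, 0.90, 0.20]]))  # classify an unseen candidate pair
```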

To summarize, it can be concluded that a combination of techniques that operate on different domains is more effective in disambiguating authors and resolving aliases than each of these techniques individually.

5.2 Future Recommendations

The results and conclusions that have been put forward in the previous sections provide good ground for future research. It will be interesting to see how well the techniques that have been used in this thesis perform on a full data set with real aliases, which could not be found for use in this research. Should such a collection not exist, it is worthwhile to create one.

The link analysis techniques that have been used in this thesis only use information from the direct neighborhood of the authors. Boongoen et al. [6] have already shown that searching to a greater depth yields better results, so it is useful to look at how the algorithm can be optimized to be less computationally intensive, in order to search to greater depths. Moreover, since the less sophisticated Jaccard similarity surprisingly outperformed Connected Path, it is worthwhile to experiment with different link analysis techniques.
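
One direction for such experiments is sketched below: a breadth-first expansion that collects neighbors up to a chosen depth, over which a set-based similarity such as Jaccard could then be computed. The function name and the network representation are illustrative assumptions.

```python
# Hedged sketch: k-hop neighborhood expansion for deeper link analysis.
from collections import deque

def k_hop_neighbors(network, author, depth):
    seen, frontier = {author}, deque([(author, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # do not expand beyond the requested depth
        for nxt in network.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    seen.discard(author)
    return seen

network = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(k_hop_neighbors(network, "a", 2))  # {'b', 'c'} (set order may vary)
```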

Finally, the assumption has been made that the results from the various techniques are independent of each other. This assumption has not been tested, and it is not clear if, and in what way, the various techniques affect each other. There are a myriad of decisions to be made when implementing an alias resolution system that combines different techniques. Therefore, it is imperative that more research be done to examine the best choice of feature sets, techniques, and aggregation methods.

Chapter 6

Bibliography

[1] Abbasi, A. and Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5):67–75.

[2] Argamon, S. and Juola, P. (2011). Overview of the International Authorship Identification Competition at PAN-2011. In CLEF (Notebook Papers/Labs/Workshop).

[3] Baroni, M., Matiasek, J., and Trost, H. (2002). Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, volume 6, pages 48–57, Morristown, NJ, USA. Association for Computational Linguistics.

[4] Binongo, J. N. G. (2003). Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution. Chance, 16(2):9–17.

[5] Boongoen, T. and Shen, Q. (2009). Semi-supervised OWA aggregation for link-based similarity evaluation and alias detection. In 2009 IEEE International Conference on Fuzzy Systems, pages 288–293.

[6] Boongoen, T., Shen, Q., and Price, C. (2010). Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law, 18(1):77–102.

[7] Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux: structure et évolution. Statistique et informatique appliquées à l'étude des textes à partir des données du Trésor de la langue française. Le Vocabulaire des grands écrivains français, 1. Slatkine.

[8] Burrows, J. (2002). 'Delta': A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267–287.

[9] Burrows, J. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1):27–47.

[10] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–39.

[11] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. In Sixth IEEE International Conference on Data Mining – Workshops (ICDMW'06), pages 290–294. IEEE.

[12] Cohen, W. W. (2009). Enron email dataset. Retrieved from http://www.cs.cmu.edu/~enron

[13] Cohen, W. W. and Fienberg, S. E. (2003). A comparison of three string matching algorithms. Methods, 20(1):73–78.

[14] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[15] Crammer, K. and Singer, Y. (2002). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(2):265–292.

[16] de Vel, O. (2000). Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International.

[17] de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. ACM SIGMOD Record, 30(4):55.

[18] Duan, K.-B. and Keerthi, S. S. (2005). Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, 3541:278–285.

[19] Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918.

[20] Feiguina, O. and Hirst, G. (2007). Authorship attribution for small texts: Literary and forensic experiments. In Proceedings of the 30th International Conference of the Special Interest Group on Information Retrieval Workshop on Plagiarism Analysis, Authorship Identification and Near-Duplicate Detection (SIGIR), pages 3–6.

[21] FERC (2012). Information released in Enron investigation. Retrieved from http://www.ferc.gov/industries/electric/indus-act/wec/enron/info-release.asp

[22] Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

[23] Friedman, C. and Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25(5):486–509.

[24] Gabrilovich, E. and Markovitch, S. (2004). Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM International Conference Proceeding Series, page 41, Banff, Canada. ACM.

[25] Gamon, M. (2004). Linguistic correlates of style. In Proceedings of the 20th International Conference on Computational Linguistics – COLING '04, pages 611-es, Morristown, NJ, USA. Association for Computational Linguistics.

[26] Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2):172–177.

[27] Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4):453–476.

[28] Hsiung, P., Moore, A., Neill, D., and Schneider, J. (2005). Alias Detection in Link Data Sets. In Proceedings of the International Conference on Intelligence Analysis, volume 4.

[29] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1-2):56–64.

[30] Iqbal, F., Binsalleeh, H., Fung, B. C., and Debbabi, M. (2011). A unified data mining solution for authorship analysis in anonymous textual communications. Information Sciences.

[31] Iqbal, F., Hadjidj, R., Fung, B. C., and Debbabi, M. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5:S42–S51.

[32] Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491–498.

[33] Jeh, G. (2002). SimRank: a measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11.

[34] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning, 1398(23):137–142.

[35] Johnson, M. A. (2009). SVM.NET 1.6.3. Retrieved from http://www.matthewajohnson.org/software/svm.html

[36] Kern, R., Seifert, C., Zechner, M., and Granitzer, M. (2011). Vote/Veto Meta-Classifier for Authorship Identification. In Notebook for PAN at CLEF 2011.

[37] Koppel, M. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence Workshop on Computational Approaches to Style Analysis and Synthesis, pages 69–72, Acapulco, Mexico.

[38] Koppel, M. and Akiva, N. (2003). A corpus-independent feature set for style-based text categorization. In Proceedings of IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico.

[39] Koppel, M., Gan, R., and Messeri, E. (2006). Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 659–660, Seattle, Washington, USA. ACM.

[40] Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation, 45(1):83–94.

[41] Liben-Nowell, D. and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

[42] Lin, Z. and Lyu, M. (2006). PageSim: a novel link-based measure of web page similarity. In Proceedings of the 15th International Conference on World Wide Web, pages 1019–1020, Edinburgh.

[43] Lin, Z., Lyu, M. R., and King, I. (2009). MatchSim. In Proceedings of the 18th ACM Conference on Information and Knowledge Management – CIKM '09, page 1613, New York, New York, USA. ACM Press.

[44] Luyckx, K. (2010). Scalability Issues in Authorship Attribution (Schaalbaarheid bij Auteursherkenning). PhD thesis, Universiteit Antwerpen.

[45] Luyckx, K., Daelemans, W., Hamilton, A., and Madison, J. (2008). Authorship Attribution and Verification with Many Authors and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING '08, pages 513–520. Association for Computational Linguistics.

[46] Manning, C. D., Raghavan, P., and Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009 edition.

[47] Mendenhall, T. C. (1887). The Characteristic Curves of Composition. Science, 9(214S):237–46.

[48] Mendenhall, T. C. (1901). A Mechanical Solution of a Literary Problem. Popular Science Monthly, 60(2):97–105.

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The Soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. International Journal of Metadata, Semantics and Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence, Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.

[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.

Appendix

List of function words used in authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between,
both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including,
inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of,
off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something,
such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we,
what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 52: Thesis Freek Maes - Final Version

How can techniques from different domains be combined

Section 24 has given an overview of approaches to combining techniques fromdifferent domains Previous literature on the subject is rare especially on com-bining techniques from more than two domains A common method is to usea linear weighted combination of the results of different techniques Since theweights are often manually constructed the results are not that good and theability of these techniques to generalize is low Better results are achieved whena classifier is trained to distinguish between good and bad combinations of re-sults

Can a combination of techniques from different domains increase per-formance when compared with individual techniques

The results of Chapter 4 show that a combination of techniques can indeedincrease precision and recall when compared with individual techniques Specif-ically a combination of Jaro-Winkler similarity on email addresses authorshipSVM on email content and Jaccard similarity of the link network using an SVMvoting algorithm achieves the best results when tested on a subset of the EN-RON data set The results of this combination of techniques also tend to bemore robust across different decision thresholds something that is useful whendetermining a proper threshold might be difficult It is important to note thatthe relative improvement in F1-score of the combined techniques over the in-dividual techniques is dependent on the number of low Jaro-Winkler aliases inthe test set Especially on the hard data set where aliases are more difficult torecognize the combination of techniques performs very well and achieves sig-nificantly higher F1-scores than the individual techniques This indicates thatthe different techniques are indeed complementary and can work together toachieve better results

To summarize it can be concluded that a combination of techniques thatoperate on different domains is more effective in disambiguating authors andresolving aliases than each of these techniques individually

52 Future Recommendations

The results and conclusion that have been put forward in the previous sectionsprovide good ground for future research It will be interesting to see how wellthe techniques that have been used in this thesis perform on a full data set withreal aliases which could not be found to use in this research Should such acollection not exist it is worthwhile to create one

The link analysis techniques that have been used in this paper only useinformation from the direct neighborhood of the authors Boongoen et al[6]have already shown that searching to a greater depth yields better results soit is useful to look at how the algorithm can be optimized to be less computa-tionally intensive in order to search to greater depths Moreover since the less

48

sophisticated Jaccard similarity surprisingly outperformed Connected Path itis worthwhile to experiment with different link analysis techniques

Finally the assumption has been made that the results from various tech-niques are independent of each other These assumptions have not been testedand it is not clear if and in what way various techniques affect each other Thereare a myriad of decisions to be made when implementing an alias resolution sys-tem that combines different techniques Therefore it is imperative that moreresearch will be done to examine the best choice of feature sets techniques andaggregation methods

49

Chapter 6

Bibliography

[1] Abbasi A and Chen H (2005) Applying authorship analysis to extremist-group Web forum messages Intelligent Systems IEEE 20(5)67ndash75

[2] Argamon S and Juola P (2011) Overview of the International Au-thorship Identification Competition at PAN-2011 In CLEF (Notebook Pa-persLabsWorkshop)

[3] Baroni M Matiasek J and Trost H (2002) Unsupervised discovery ofmorphologically related words based on orthographic and semantic similarityIn Proceedings of the ACL-02 workshop on Morphological and phonologicallearning-Volume 6 volume 6 pages 48ndash57 Morristown NJ USA Associationfor Computational Linguistics

[4] Binongo J N G (2003) Who Wrote the 15th Book of Oz An Applicationof Multivariate Analysis to Authorship Attribution Chance 16(2)9ndash17

[5] Boongoen T and Shen Q (2009) Semi-supervised OWA aggregation forlink-based similarity evaluation and alias detection 2009 IEEE InternationalConference on Fuzzy Systems pages 288ndash293

[6] Boongoen T Shen Q and Price C (2010) Disclosing false identitythrough hybrid link analysis Artificial Intelligence and Law 18(1)77ndash102

[7] Brunet E (1978) Le Vocabulaire de Jean Giraudoux structure et evolutionStatistique et informatique appliquees a letude des textes a partir des donneesdu Tresor de la langue francaise Le Vocabulaire des grands ecrivains francais1 Slatkine

[8] Burrows J (2002) rsquoDeltarsquo A measure of stylistic difference and a guide tolikely authorship Literary and Linguistic Computing 17(3)267ndash287

[9] Burrows J (2007) All the Way Through Testing for Authorship in Differ-ent Frequency Strata Literary and Linguistic Computing 22(1)27ndash47

50

[10] Chang C-c and Lin C-j (2011) LIBSVM A Library for Support VectorMachines ACM Transactions on Intelligent Systems and Technology 2(3)1ndash39

[11] Christen P (2006) A Comparison of Personal Name Matching Techniquesand Practical Issues In Sixth IEEE International Conference on Data Mining- Workshops (ICDMWrsquo06) number September pages 290ndash294 IEEE

[12] Cohen W W (2009) Enron email dataset Retrieved from httpwwwcscmuedu~enron

[13] Cohen W W and Fienberg S E (2003) A comparison of three stringmatching algorithms Methods 20(1)73ndash78

[14] Cortes C and Vapnik V (1995) Support-vector networks MachineLearning 20(3)273ndash297

[15] Crammer K and Singer Y (2002) On the Algorithmic Implementationof Multiclass Kernel-based Vector Machines Journal of Machine LearningResearch 2(2)265ndash292

[16] de Vel O (2000) Mining e-mail authorship Proc Workshop on TextMining ACM International

[17] de Vel O Anderson A Corney M and Mohay G (2001) Mining e-mailcontent for author identification forensics ACM SIGMOD Record 30(4)55

[18] Duan K-b and Keerthi S S (2005) Which Is the Best Multiclass SVMMethod An Empirical Study Multiple Classifier Systems 3541278ndash285

[19] Fan R-e Chen P-h and Lin C-j (2005) Working Set Selection UsingSecond Order Information for Training Support Vector Machines Journal ofMachine Learning Research 61889ndash1918

[20] Feiguina O and Hirst G (2007) Authorship attribution for small textsLiterary and forensic experiments In Proceedings of the 30th InternationalConference of the Special Interest Group on Information Retrieval Work-shop on Plagiarism Analysis Authorship Identification and Near-DuplicateDetection (SIGIR) pages 3ndash6

[21] FERC (2012) Information released in enron investigation Re-trieved from httpwwwfercgovindustrieselectricindus-act

wecenroninfo-releaseasp

[22] Forman G (2003) An Extensive Empirical Study of Feature SelectionMetrics for Text Classification Journal of Machine Learning Research 3(7-8)1289ndash1305

[23] Friedman C and Sideli R (1992) Tolerating spelling errors during pa-tient validation Computers and biomedical research an international journal25(5)486ndash509

51

[24] Gabrilovich E and Markovitch S (2004) Text categorization with manyredundant features Using aggressive feature selection to make SVMs compet-itive with C45 In Proceedings of the twenty-first international conference onMachine learning ACM International Conference Proceeding Series page 41Banff Canada ACM

[25] Gamon M (2004) Linguistic correlates of style In Proceedings of the 20thinternational conference on Computational Linguistics - COLING rsquo04 pages611ndashes Morristown NJ USA Association for Computational Linguistics

[26] Honore A (1979) Some simple measures of richness of vocabulary Asso-ciation for Literary and Linguistic Computing Bulletin 7(2)172ndash177

[27] Hoover D L (2004) Testing Burrowsrsquos Delta Literary and LinguisticComputing 19(4)453ndash476

[28] Hsiung P Moore A Neill D and Schneider J (2005) Alias Detec-tion in Link Data Sets In Proceedings of the International Conference onIntelligence Analysis (2005) volume 4

[29] Iqbal F Binsalleeh H Fung B C and Debbabi M (2010) Miningwriteprints from anonymous e-mails for forensic investigation Digital Inves-tigation 7(1-2)56ndash64

[30] Iqbal F Binsalleeh H Fung B C and Debbabi M (2011) A unifieddata mining solution for authorship analysis in anonymous textual commu-nications Information Sciences

[31] Iqbal F Hadjidj R Fung B C and Debbabi M (2008) A novelapproach of mining write-prints for authorship attribution in e-mail forensicsDigital Investigation 5S42ndashS51

[32] Jaro M A (1995) Probabilistic linkage of large public health data filesStatistics in Medicine 14(5-7)491ndash498

[33] Jeh G (2002) SimRank a measure of structural-context similarity Pro-ceedings of the eighth ACM SIGKDD international conference on Knowledgediscovery and data mining pages 1ndash11

[34] Joachims T (1998) Text categorization with support vector machinesLearning with many relevant features Machine Learning 1398(23)137ndash142

[35] Johnson M A (2009) Svmnet 163 Retrieved from httpwww

matthewajohnsonorgsoftwaresvmhtml

[36] Kern R Seifert C Zechner M and Granitzer M (2011) Vote VetoMeta-Classifier for Authorship Identification In Notebook for PAN at CLEF2011

52

[37] Koppel M (2003) Exploiting stylistic idiosyncrasies for authorship attri-bution In Proceedings of the 2003 International Joint Conferences on Artifi-cial Intelligence Workshop on Computational Approaches to Style Analysisand Synthesis number 2000 pages 69ndash72 Acapulco Mexico

[38] Koppel M and Akiva N (2003) A corpus-independent feature set forstyle-based text categorization In Proceedings of IJCAIrsquo03 Workshop onComputational Approaches to Style Analysis and Synthesis Acapulco Mex-ico

[39] Koppel M Gan R and Messeri E (2006) Authorship Attribution withThousands of Candidate Authors In Proceedings of the 29th annual interna-tional ACM SIGIR conference on Research and Development in InformationRetrieval pages 659ndash660 Seattle Washington USA ACM

[40] Koppel M Schler J and Argamon S (2010) Authorship attribution inthe wild Language Resources and Evaluation 45(1)83ndash94

[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx K (2010) Scalability Issues in Authorship Attribution Schaal-baarheid bij Auteursherkenning PhD thesis Universiteit Antwerpen

[45] Luyckx K Daelemans W Hamilton A and Madison J (2008) Au-thorship Attribution and Verification with Many Authors and Limited DataIn Proceedings of the 22nd International Conference on Computational Lin-guistics COLING 08 (2008) number August pages 513ndash520 Association forComputational Linguistics

[46] Manning C D Raghavan P and Schutze H (2009) An Introduction toInformation Retrieval Number c Cambridge University Press CambridgeEngland 2009 edition

[47] Mendenhall T C (1887) The Characteristic Curves of Composition Sci-ence (New York NY) 9(214S)237ndash46

[48] Mendenhall T C (1901) A Mechanical Solution of a Literary problemPopular Science Monthly 60(2)97ndash105

53

[49] Miller (1995) WordNet A Lexical Database Communications of theACM 38(11)39ndash41

[50] Monge A E and Elkan C P (1996) The field matching problem Algo-rithms and applications In Evangelos S Jiawei H and Usama F editorsProceedings of the Second International Conference on Knowledge Discoveryand Data Mining number Slaven 1992 pages 267ndash270 Menlo Park Califor-nia The AAAI Press Menlo Park California

[51] Mosteller F and Wallace D L (1964) Inference and Disputed AuthorshipThe Federalist The David Hume Series of Philosophy and Cognitive ScienceReissues Addison-Wesley

[52] Navarro G (2001) A guided tour to approximate string matching ACMComputing Surveys 33(1)31ndash88

[53] Odell M and Russell R (1918) The soundex coding system US Patents1261167

[54] Page L Brin S Motwani R and Winograd T (1998) The PageRankCitation Ranking Bringing Order to the Web Technical report StanfordInfoLab

[55] Reuther P and Walter B (2006) Survey on test collections and techniquesfor personal name matching Int J Metadata Semant Ontologies 1(2)89ndash99

[56] Rifkin R and Klautau A (2004) In Defense of One-Vs-All ClassificationJournal of Machine Learning Research 5101ndash141

[57] Sanderson C and Guenter S (2006) Short Text Authorship Attributionvia Sequence Kernels Markov Chains and Author Unmasking An Inves-tigation In Proceedings of the 2006 Conference on Empirical Methods inNatural Language Processing number July pages 482ndash491 Association forComputational Linguistics

[58] Shetty J and Adibi J (2004) The Enron Email Dataset DatabaseSchema and Brief Statistical Report Technical report Information SciencesInstitute

[59] Sichel H S (1986) Word Frequency Distributions and Type-Token Char-acteristics The Mathematical Scientist 1145ndash72

[60] Simeon M and Hilderman R (2009) An Empirical Study of CategorySkew on Feature Selection for Text Categorization In Proceedings of the22nd Canadian Conference on Artificial Intelligence Advances in ArtificialIntelligence pages 249ndash252 Springer-Verlag Berlin

[61] Small H (1973) Co-citation in the Scientific Literature A New Measure ofthe Relationship Between Two Documents Journal of the American Societyfor Information Science pages 265ndash269

54

[62] Solorio T and Pillay S (2011) Authorship Identification with ModalitySpecific Meta Features In Notebook for PAN at CLEF 2011

[63] Stamatatos E (2009) A survey of modern authorship attribution methodsJournal of the American Society for Information Science and Technology60(3)538ndash556

[64] Tanguy L Urieli A Calderone B Hathout N and Sajous F (2011)A multitude of linguistically-rich features for authorship attribution In Note-book for PAN at CLEF 2011

[65] Tearle M Taylor K and Demuth H (2008) An algorithm for auto-mated authorship attribution using neural networks Literary and LinguisticComputing 23(4)425ndash442

[66] Tsuboi Y and Matsumoto Y (2002) Authorship identification for Het-eregeneous Documents IPSJ SIG Notes pages 17ndash24

[67] Winkler W E (1999) The State of Record Linkage and Current ResearchProblems Statistical Research Division US Census Bureau pages 1ndash15

[68] Yang H (2003) Margin Variations in Support Vector Regression for theStock Market Prediction PhD thesis The Chinese University of Hong Kong

[69] Yule G U (1944) The statistical study of literary vocabulary CambridgeUniversity Press

[70] Zhao Y and Zobel J (2005) Effective and Scalable Authorship Attribu-tion Using Function Words Information Retrieval Technology Proceedings3689174ndash189

[71] Zheng R Li J Chen H and Huang Z (2005) A Framework forAuthorship Identification of Online Messages Writing-Style Features andJournal of the American Society for Information Science 57(3)378ndash393

55

Appendix

List of function words used in authorship SVM

aaboutaboveafterallalthoughamamonganandanotheranyanybodyanyoneanythingarearoundasatbebecausebeforebehindbelowbesidebetween

bothbutbycancosdodowneacheitherenougheveryeverybodyeveryoneeverythingfewfollowingforfromhaveheherhimiifinincluding

insideintoisititslatterlesslikelittlelotsmanymemoremostmuchmustmynearneedneithernonobodynonenornothingof

offononceoneontooppositeorouroutsideoverownpastperplentyplusregardingsameseveralsheshouldsincesosomesomebodysomeonesomething

suchthanthatthetheirthemthesetheythisthosethoughthroughtilltotowardtowardsunderunlessunlikeuntilupuponususedviawe

whatwhateverwhenwherewhetherwhichwhilewhowhoeverwhomwhosewillwithwithinwithoutworthwouldyesyouyour

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 53: Thesis Freek Maes - Final Version

sophisticated Jaccard similarity surprisingly outperformed Connected Path itis worthwhile to experiment with different link analysis techniques

Finally the assumption has been made that the results from various tech-niques are independent of each other These assumptions have not been testedand it is not clear if and in what way various techniques affect each other Thereare a myriad of decisions to be made when implementing an alias resolution sys-tem that combines different techniques Therefore it is imperative that moreresearch will be done to examine the best choice of feature sets techniques andaggregation methods

49

Chapter 6

Bibliography

[1] Abbasi A and Chen H (2005) Applying authorship analysis to extremist-group Web forum messages Intelligent Systems IEEE 20(5)67ndash75

[2] Argamon S and Juola P (2011) Overview of the International Au-thorship Identification Competition at PAN-2011 In CLEF (Notebook Pa-persLabsWorkshop)

[3] Baroni M Matiasek J and Trost H (2002) Unsupervised discovery ofmorphologically related words based on orthographic and semantic similarityIn Proceedings of the ACL-02 workshop on Morphological and phonologicallearning-Volume 6 volume 6 pages 48ndash57 Morristown NJ USA Associationfor Computational Linguistics

[4] Binongo J N G (2003) Who Wrote the 15th Book of Oz An Applicationof Multivariate Analysis to Authorship Attribution Chance 16(2)9ndash17

[5] Boongoen T and Shen Q (2009) Semi-supervised OWA aggregation forlink-based similarity evaluation and alias detection 2009 IEEE InternationalConference on Fuzzy Systems pages 288ndash293

[6] Boongoen T Shen Q and Price C (2010) Disclosing false identitythrough hybrid link analysis Artificial Intelligence and Law 18(1)77ndash102

[7] Brunet E (1978) Le Vocabulaire de Jean Giraudoux structure et evolutionStatistique et informatique appliquees a letude des textes a partir des donneesdu Tresor de la langue francaise Le Vocabulaire des grands ecrivains francais1 Slatkine

[8] Burrows J (2002) rsquoDeltarsquo A measure of stylistic difference and a guide tolikely authorship Literary and Linguistic Computing 17(3)267ndash287

[9] Burrows J (2007) All the Way Through Testing for Authorship in Differ-ent Frequency Strata Literary and Linguistic Computing 22(1)27ndash47

50

[10] Chang C-c and Lin C-j (2011) LIBSVM A Library for Support VectorMachines ACM Transactions on Intelligent Systems and Technology 2(3)1ndash39

[11] Christen P (2006) A Comparison of Personal Name Matching Techniquesand Practical Issues In Sixth IEEE International Conference on Data Mining- Workshops (ICDMWrsquo06) number September pages 290ndash294 IEEE

[12] Cohen W W (2009) Enron email dataset Retrieved from httpwwwcscmuedu~enron

[13] Cohen W W and Fienberg S E (2003) A comparison of three stringmatching algorithms Methods 20(1)73ndash78

[14] Cortes C and Vapnik V (1995) Support-vector networks MachineLearning 20(3)273ndash297

[15] Crammer K and Singer Y (2002) On the Algorithmic Implementationof Multiclass Kernel-based Vector Machines Journal of Machine LearningResearch 2(2)265ndash292

[16] de Vel O (2000) Mining e-mail authorship Proc Workshop on TextMining ACM International

[17] de Vel O Anderson A Corney M and Mohay G (2001) Mining e-mailcontent for author identification forensics ACM SIGMOD Record 30(4)55

[18] Duan K-b and Keerthi S S (2005) Which Is the Best Multiclass SVMMethod An Empirical Study Multiple Classifier Systems 3541278ndash285

[19] Fan R-e Chen P-h and Lin C-j (2005) Working Set Selection UsingSecond Order Information for Training Support Vector Machines Journal ofMachine Learning Research 61889ndash1918

[20] Feiguina O and Hirst G (2007) Authorship attribution for small textsLiterary and forensic experiments In Proceedings of the 30th InternationalConference of the Special Interest Group on Information Retrieval Work-shop on Plagiarism Analysis Authorship Identification and Near-DuplicateDetection (SIGIR) pages 3ndash6

[21] FERC (2012) Information released in enron investigation Re-trieved from httpwwwfercgovindustrieselectricindus-act

wecenroninfo-releaseasp

[22] Forman G (2003) An Extensive Empirical Study of Feature SelectionMetrics for Text Classification Journal of Machine Learning Research 3(7-8)1289ndash1305

[23] Friedman C and Sideli R (1992) Tolerating spelling errors during pa-tient validation Computers and biomedical research an international journal25(5)486ndash509

51

[24] Gabrilovich E and Markovitch S (2004) Text categorization with manyredundant features Using aggressive feature selection to make SVMs compet-itive with C45 In Proceedings of the twenty-first international conference onMachine learning ACM International Conference Proceeding Series page 41Banff Canada ACM

[25] Gamon M (2004) Linguistic correlates of style In Proceedings of the 20thinternational conference on Computational Linguistics - COLING rsquo04 pages611ndashes Morristown NJ USA Association for Computational Linguistics

[26] Honore A (1979) Some simple measures of richness of vocabulary Asso-ciation for Literary and Linguistic Computing Bulletin 7(2)172ndash177

[27] Hoover D L (2004) Testing Burrowsrsquos Delta Literary and LinguisticComputing 19(4)453ndash476

[28] Hsiung P Moore A Neill D and Schneider J (2005) Alias Detec-tion in Link Data Sets In Proceedings of the International Conference onIntelligence Analysis (2005) volume 4

[29] Iqbal F Binsalleeh H Fung B C and Debbabi M (2010) Miningwriteprints from anonymous e-mails for forensic investigation Digital Inves-tigation 7(1-2)56ndash64

[30] Iqbal F Binsalleeh H Fung B C and Debbabi M (2011) A unifieddata mining solution for authorship analysis in anonymous textual commu-nications Information Sciences

[31] Iqbal F Hadjidj R Fung B C and Debbabi M (2008) A novelapproach of mining write-prints for authorship attribution in e-mail forensicsDigital Investigation 5S42ndashS51

[32] Jaro M A (1995) Probabilistic linkage of large public health data filesStatistics in Medicine 14(5-7)491ndash498

[33] Jeh G (2002) SimRank a measure of structural-context similarity Pro-ceedings of the eighth ACM SIGKDD international conference on Knowledgediscovery and data mining pages 1ndash11

[34] Joachims T (1998) Text categorization with support vector machinesLearning with many relevant features Machine Learning 1398(23)137ndash142

[35] Johnson M A (2009) Svmnet 163 Retrieved from httpwww

matthewajohnsonorgsoftwaresvmhtml

[36] Kern R Seifert C Zechner M and Granitzer M (2011) Vote VetoMeta-Classifier for Authorship Identification In Notebook for PAN at CLEF2011

52

[37] Koppel M (2003) Exploiting stylistic idiosyncrasies for authorship attri-bution In Proceedings of the 2003 International Joint Conferences on Artifi-cial Intelligence Workshop on Computational Approaches to Style Analysisand Synthesis number 2000 pages 69ndash72 Acapulco Mexico

[38] Koppel M and Akiva N (2003) A corpus-independent feature set forstyle-based text categorization In Proceedings of IJCAIrsquo03 Workshop onComputational Approaches to Style Analysis and Synthesis Acapulco Mex-ico

[39] Koppel M Gan R and Messeri E (2006) Authorship Attribution withThousands of Candidate Authors In Proceedings of the 29th annual interna-tional ACM SIGIR conference on Research and Development in InformationRetrieval pages 659ndash660 Seattle Washington USA ACM

[40] Koppel M Schler J and Argamon S (2010) Authorship attribution inthe wild Language Resources and Evaluation 45(1)83ndash94

[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx K (2010) Scalability Issues in Authorship Attribution Schaal-baarheid bij Auteursherkenning PhD thesis Universiteit Antwerpen

[45] Luyckx K Daelemans W Hamilton A and Madison J (2008) Au-thorship Attribution and Verification with Many Authors and Limited DataIn Proceedings of the 22nd International Conference on Computational Lin-guistics COLING 08 (2008) number August pages 513ndash520 Association forComputational Linguistics

[46] Manning C D Raghavan P and Schutze H (2009) An Introduction toInformation Retrieval Number c Cambridge University Press CambridgeEngland 2009 edition

[47] Mendenhall T C (1887) The Characteristic Curves of Composition Sci-ence (New York NY) 9(214S)237ndash46

[48] Mendenhall T C (1901) A Mechanical Solution of a Literary problemPopular Science Monthly 60(2)97ndash105

53

[49] Miller (1995) WordNet A Lexical Database Communications of theACM 38(11)39ndash41

[50] Monge A E and Elkan C P (1996) The field matching problem Algo-rithms and applications In Evangelos S Jiawei H and Usama F editorsProceedings of the Second International Conference on Knowledge Discoveryand Data Mining number Slaven 1992 pages 267ndash270 Menlo Park Califor-nia The AAAI Press Menlo Park California

[51] Mosteller F and Wallace D L (1964) Inference and Disputed AuthorshipThe Federalist The David Hume Series of Philosophy and Cognitive ScienceReissues Addison-Wesley

[52] Navarro G (2001) A guided tour to approximate string matching ACMComputing Surveys 33(1)31ndash88

[53] Odell M and Russell R (1918) The soundex coding system US Patents1261167

[54] Page L Brin S Motwani R and Winograd T (1998) The PageRankCitation Ranking Bringing Order to the Web Technical report StanfordInfoLab

[55] Reuther P and Walter B (2006) Survey on test collections and techniquesfor personal name matching Int J Metadata Semant Ontologies 1(2)89ndash99

[56] Rifkin R and Klautau A (2004) In Defense of One-Vs-All ClassificationJournal of Machine Learning Research 5101ndash141

[57] Sanderson C and Guenter S (2006) Short Text Authorship Attributionvia Sequence Kernels Markov Chains and Author Unmasking An Inves-tigation In Proceedings of the 2006 Conference on Empirical Methods inNatural Language Processing number July pages 482ndash491 Association forComputational Linguistics

[58] Shetty J and Adibi J (2004) The Enron Email Dataset DatabaseSchema and Brief Statistical Report Technical report Information SciencesInstitute

[59] Sichel H S (1986) Word Frequency Distributions and Type-Token Char-acteristics The Mathematical Scientist 1145ndash72

[60] Simeon M and Hilderman R (2009) An Empirical Study of CategorySkew on Feature Selection for Text Categorization In Proceedings of the22nd Canadian Conference on Artificial Intelligence Advances in ArtificialIntelligence pages 249ndash252 Springer-Verlag Berlin

[61] Small H (1973) Co-citation in the Scientific Literature A New Measure ofthe Relationship Between Two Documents Journal of the American Societyfor Information Science pages 265ndash269

54

[62] Solorio T and Pillay S (2011) Authorship Identification with ModalitySpecific Meta Features In Notebook for PAN at CLEF 2011

[63] Stamatatos E (2009) A survey of modern authorship attribution methodsJournal of the American Society for Information Science and Technology60(3)538ndash556

[64] Tanguy L Urieli A Calderone B Hathout N and Sajous F (2011)A multitude of linguistically-rich features for authorship attribution In Note-book for PAN at CLEF 2011

[65] Tearle M Taylor K and Demuth H (2008) An algorithm for auto-mated authorship attribution using neural networks Literary and LinguisticComputing 23(4)425ndash442

[66] Tsuboi Y and Matsumoto Y (2002) Authorship identification for Het-eregeneous Documents IPSJ SIG Notes pages 17ndash24

[67] Winkler W E (1999) The State of Record Linkage and Current ResearchProblems Statistical Research Division US Census Bureau pages 1ndash15

[68] Yang H (2003) Margin Variations in Support Vector Regression for theStock Market Prediction PhD thesis The Chinese University of Hong Kong

[69] Yule G U (1944) The statistical study of literary vocabulary CambridgeUniversity Press

[70] Zhao Y and Zobel J (2005) Effective and Scalable Authorship Attribu-tion Using Function Words Information Retrieval Technology Proceedings3689174ndash189

[71] Zheng R Li J Chen H and Huang Z (2005) A Framework forAuthorship Identification of Online Messages Writing-Style Features andJournal of the American Society for Information Science 57(3)378ndash393

55

Appendix

List of function words used in authorship SVM

aaboutaboveafterallalthoughamamonganandanotheranyanybodyanyoneanythingarearoundasatbebecausebeforebehindbelowbesidebetween

bothbutbycancosdodowneacheitherenougheveryeverybodyeveryoneeverythingfewfollowingforfromhaveheherhimiifinincluding

insideintoisititslatterlesslikelittlelotsmanymemoremostmuchmustmynearneedneithernonobodynonenornothingof

offononceoneontooppositeorouroutsideoverownpastperplentyplusregardingsameseveralsheshouldsincesosomesomebodysomeonesomething

suchthanthatthetheirthemthesetheythisthosethoughthroughtilltotowardtowardsunderunlessunlikeuntilupuponususedviawe

whatwhateverwhenwherewhetherwhichwhilewhowhoeverwhomwhosewillwithwithinwithoutworthwouldyesyouyour

56

  • List of Figures
  • List of Tables
  • Introduction
    • Structure of the thesis
      • Literature Review
        • String metrics
          • Techniques
            • Authorship Attribution
              • Instance vs profile-based
              • Features
              • Feature Selection
              • Techniques
                • Link analysis
                  • Techniques
                    • Combining Approaches
                    • Evaluation measures
                    • Conclusion
                      • Methods
                        • ENRON Corpus
                        • Individual Techniques
                        • Combinations of Techniques
                          • Results
                          • Discussion
                            • Conclusion
                            • Future Recommendations
                              • Bibliography
                              • Appendix
Page 54: Thesis Freek Maes - Final Version

Chapter 6

Bibliography

[1] Abbasi A and Chen H (2005) Applying authorship analysis to extremist-group Web forum messages Intelligent Systems IEEE 20(5)67ndash75

[2] Argamon S and Juola P (2011) Overview of the International Au-thorship Identification Competition at PAN-2011 In CLEF (Notebook Pa-persLabsWorkshop)

[3] Baroni M Matiasek J and Trost H (2002) Unsupervised discovery ofmorphologically related words based on orthographic and semantic similarityIn Proceedings of the ACL-02 workshop on Morphological and phonologicallearning-Volume 6 volume 6 pages 48ndash57 Morristown NJ USA Associationfor Computational Linguistics

[4] Binongo J N G (2003) Who Wrote the 15th Book of Oz An Applicationof Multivariate Analysis to Authorship Attribution Chance 16(2)9ndash17

[5] Boongoen T and Shen Q (2009) Semi-supervised OWA aggregation forlink-based similarity evaluation and alias detection 2009 IEEE InternationalConference on Fuzzy Systems pages 288ndash293

[6] Boongoen T Shen Q and Price C (2010) Disclosing false identitythrough hybrid link analysis Artificial Intelligence and Law 18(1)77ndash102

[7] Brunet E (1978) Le Vocabulaire de Jean Giraudoux structure et evolutionStatistique et informatique appliquees a letude des textes a partir des donneesdu Tresor de la langue francaise Le Vocabulaire des grands ecrivains francais1 Slatkine

[8] Burrows J (2002) rsquoDeltarsquo A measure of stylistic difference and a guide tolikely authorship Literary and Linguistic Computing 17(3)267ndash287

[9] Burrows J (2007) All the Way Through Testing for Authorship in Differ-ent Frequency Strata Literary and Linguistic Computing 22(1)27ndash47

50

[10] Chang C-c and Lin C-j (2011) LIBSVM A Library for Support VectorMachines ACM Transactions on Intelligent Systems and Technology 2(3)1ndash39

[11] Christen P (2006) A Comparison of Personal Name Matching Techniquesand Practical Issues In Sixth IEEE International Conference on Data Mining- Workshops (ICDMWrsquo06) number September pages 290ndash294 IEEE

[12] Cohen W W (2009) Enron email dataset Retrieved from httpwwwcscmuedu~enron

[13] Cohen W W and Fienberg S E (2003) A comparison of three stringmatching algorithms Methods 20(1)73ndash78

[14] Cortes C and Vapnik V (1995) Support-vector networks MachineLearning 20(3)273ndash297

[15] Crammer K and Singer Y (2002) On the Algorithmic Implementationof Multiclass Kernel-based Vector Machines Journal of Machine LearningResearch 2(2)265ndash292

[16] de Vel O (2000) Mining e-mail authorship Proc Workshop on TextMining ACM International

[17] de Vel O Anderson A Corney M and Mohay G (2001) Mining e-mailcontent for author identification forensics ACM SIGMOD Record 30(4)55

[18] Duan K-b and Keerthi S S (2005) Which Is the Best Multiclass SVMMethod An Empirical Study Multiple Classifier Systems 3541278ndash285

[19] Fan R-e Chen P-h and Lin C-j (2005) Working Set Selection UsingSecond Order Information for Training Support Vector Machines Journal ofMachine Learning Research 61889ndash1918

[20] Feiguina O and Hirst G (2007) Authorship attribution for small textsLiterary and forensic experiments In Proceedings of the 30th InternationalConference of the Special Interest Group on Information Retrieval Work-shop on Plagiarism Analysis Authorship Identification and Near-DuplicateDetection (SIGIR) pages 3ndash6

[21] FERC (2012) Information released in enron investigation Re-trieved from httpwwwfercgovindustrieselectricindus-act

wecenroninfo-releaseasp

[22] Forman G (2003) An Extensive Empirical Study of Feature SelectionMetrics for Text Classification Journal of Machine Learning Research 3(7-8)1289ndash1305

[23] Friedman C and Sideli R (1992) Tolerating spelling errors during pa-tient validation Computers and biomedical research an international journal25(5)486ndash509

51

[24] Gabrilovich E and Markovitch S (2004) Text categorization with manyredundant features Using aggressive feature selection to make SVMs compet-itive with C45 In Proceedings of the twenty-first international conference onMachine learning ACM International Conference Proceeding Series page 41Banff Canada ACM

[25] Gamon M (2004) Linguistic correlates of style In Proceedings of the 20thinternational conference on Computational Linguistics - COLING rsquo04 pages611ndashes Morristown NJ USA Association for Computational Linguistics

[26] Honore A (1979) Some simple measures of richness of vocabulary Asso-ciation for Literary and Linguistic Computing Bulletin 7(2)172ndash177

[27] Hoover D L (2004) Testing Burrowsrsquos Delta Literary and LinguisticComputing 19(4)453ndash476

[28] Hsiung P Moore A Neill D and Schneider J (2005) Alias Detec-tion in Link Data Sets In Proceedings of the International Conference onIntelligence Analysis (2005) volume 4

[29] Iqbal F Binsalleeh H Fung B C and Debbabi M (2010) Miningwriteprints from anonymous e-mails for forensic investigation Digital Inves-tigation 7(1-2)56ndash64

[30] Iqbal F Binsalleeh H Fung B C and Debbabi M (2011) A unifieddata mining solution for authorship analysis in anonymous textual commu-nications Information Sciences

[31] Iqbal F Hadjidj R Fung B C and Debbabi M (2008) A novelapproach of mining write-prints for authorship attribution in e-mail forensicsDigital Investigation 5S42ndashS51

[32] Jaro M A (1995) Probabilistic linkage of large public health data filesStatistics in Medicine 14(5-7)491ndash498

[33] Jeh G (2002) SimRank a measure of structural-context similarity Pro-ceedings of the eighth ACM SIGKDD international conference on Knowledgediscovery and data mining pages 1ndash11

[34] Joachims T (1998) Text categorization with support vector machinesLearning with many relevant features Machine Learning 1398(23)137ndash142

[35] Johnson M A (2009) Svmnet 163 Retrieved from httpwww

matthewajohnsonorgsoftwaresvmhtml

[36] Kern R Seifert C Zechner M and Granitzer M (2011) Vote VetoMeta-Classifier for Authorship Identification In Notebook for PAN at CLEF2011

52

[37] Koppel M (2003) Exploiting stylistic idiosyncrasies for authorship attri-bution In Proceedings of the 2003 International Joint Conferences on Artifi-cial Intelligence Workshop on Computational Approaches to Style Analysisand Synthesis number 2000 pages 69ndash72 Acapulco Mexico

[38] Koppel M and Akiva N (2003) A corpus-independent feature set forstyle-based text categorization In Proceedings of IJCAIrsquo03 Workshop onComputational Approaches to Style Analysis and Synthesis Acapulco Mex-ico

[39] Koppel M Gan R and Messeri E (2006) Authorship Attribution withThousands of Candidate Authors In Proceedings of the 29th annual interna-tional ACM SIGIR conference on Research and Development in InformationRetrieval pages 659ndash660 Seattle Washington USA ACM

[40] Koppel M Schler J and Argamon S (2010) Authorship attribution inthe wild Language Resources and Evaluation 45(1)83ndash94

[41] Liben-Nowell D and Kleinberg J (2007) The Link-Prediction Problemfor Social Networks Journal of the American Society for Information Scienceand Technology 58(7)1019ndash1031

[42] Lin Z and Lyu M (2006) PageSim a novel link-based measure of webpage aimilarity In Proceedings of the 15th International Conference on WorldWide Web pages 1019ndash1020 Edinburgh

[43] Lin Z Lyu M R and King I (2009) MatchSim In Proceeding of the18th ACM conference on Information and knowledge management - CIKMrsquo09 page 1613 New York New York USA ACM Press

[44] Luyckx K (2010) Scalability Issues in Authorship Attribution Schaal-baarheid bij Auteursherkenning PhD thesis Universiteit Antwerpen

[45] Luyckx K Daelemans W Hamilton A and Madison J (2008) Au-thorship Attribution and Verification with Many Authors and Limited DataIn Proceedings of the 22nd International Conference on Computational Lin-guistics COLING 08 (2008) number August pages 513ndash520 Association forComputational Linguistics

[46] Manning C D Raghavan P and Schutze H (2009) An Introduction toInformation Retrieval Number c Cambridge University Press CambridgeEngland 2009 edition

[47] Mendenhall T C (1887) The Characteristic Curves of Composition Sci-ence (New York NY) 9(214S)237ndash46

[48] Mendenhall T C (1901) A Mechanical Solution of a Literary problemPopular Science Monthly 60(2)97ndash105

53

[49] Miller, G. A. (1995). WordNet: A Lexical Database. Communications of the ACM, 38(11):39–41.

[50] Monge, A. E. and Elkan, C. P. (1996). The field matching problem: Algorithms and applications. In Simoudis, E., Han, J., and Fayyad, U., editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267–270, Menlo Park, California. The AAAI Press.

[51] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. The David Hume Series of Philosophy and Cognitive Science Reissues. Addison-Wesley.

[52] Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

[53] Odell, M. and Russell, R. (1918). The soundex coding system. US Patent 1,261,167.

[54] Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab.

[55] Reuther, P. and Walter, B. (2006). Survey on test collections and techniques for personal name matching. Int. J. Metadata Semant. Ontologies, 1(2):89–99.

[56] Rifkin, R. and Klautau, A. (2004). In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141.

[57] Sanderson, C. and Guenter, S. (2006). Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491. Association for Computational Linguistics.

[58] Shetty, J. and Adibi, J. (2004). The Enron Email Dataset: Database Schema and Brief Statistical Report. Technical report, Information Sciences Institute.

[59] Sichel, H. S. (1986). Word Frequency Distributions and Type-Token Characteristics. The Mathematical Scientist, 11:45–72.

[60] Simeon, M. and Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence: Advances in Artificial Intelligence, pages 249–252. Springer-Verlag, Berlin.

[61] Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. Journal of the American Society for Information Science, pages 265–269.


[62] Solorio, T. and Pillay, S. (2011). Authorship Identification with Modality Specific Meta Features. In Notebook for PAN at CLEF 2011.

[63] Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.

[64] Tanguy, L., Urieli, A., Calderone, B., Hathout, N., and Sajous, F. (2011). A multitude of linguistically-rich features for authorship attribution. In Notebook for PAN at CLEF 2011.

[65] Tearle, M., Taylor, K., and Demuth, H. (2008). An algorithm for automated authorship attribution using neural networks. Literary and Linguistic Computing, 23(4):425–442.

[66] Tsuboi, Y. and Matsumoto, Y. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, pages 17–24.

[67] Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, pages 1–15.

[68] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market Prediction. PhD thesis, The Chinese University of Hong Kong.

[69] Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge University Press.

[70] Zhao, Y. and Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. Information Retrieval Technology, Proceedings, 3689:174–189.

[71] Zheng, R., Li, J., Chen, H., and Huang, Z. (2005). A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393.


Appendix

List of function words used in the authorship SVM

a, about, above, after, all, although, am, among, an, and, another, any, anybody, anyone, anything, are, around, as, at, be, because, before, behind, below, beside, between,

both, but, by, can, cos, do, down, each, either, enough, every, everybody, everyone, everything, few, following, for, from, have, he, her, him, i, if, in, including,

inside, into, is, it, its, latter, less, like, little, lots, many, me, more, most, much, must, my, near, need, neither, no, nobody, none, nor, nothing, of,

off, on, once, one, onto, opposite, or, our, outside, over, own, past, per, plenty, plus, regarding, same, several, she, should, since, so, some, somebody, someone, something,

such, than, that, the, their, them, these, they, this, those, though, through, till, to, toward, towards, under, unless, unlike, until, up, upon, us, used, via, we,

what, whatever, when, where, whether, which, while, who, whoever, whom, whose, will, with, within, without, worth, would, yes, you, your
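
To illustrate how a list like this is typically turned into input for a Support Vector Machine, the short Python sketch below computes relative function-word frequencies for a piece of text. This is a minimal, hypothetical example rather than the implementation used in this thesis: the tokenizer, the normalization by document length, and the function_word_vector helper are assumptions made for illustration only.

    import re
    from collections import Counter

    # Subset of the function-word list above, kept short for illustration;
    # in practice the full list from this appendix would be used.
    FUNCTION_WORDS = ["a", "about", "and", "because", "her", "my", "of",
                      "the", "which", "would", "you"]

    def function_word_vector(text):
        # Lowercase and tokenize on alphabetic runs (a simplifying assumption).
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter(tokens)
        total = max(len(tokens), 1)  # guard against empty input
        # Relative frequency of each function word, in a fixed order, so that
        # every email maps to a vector of the same dimensionality.
        return [counts[w] / total for w in FUNCTION_WORDS]

    # One such vector per email could then be fed to an SVM as a
    # writing-style feature representation.
    email = "Let me know what you think of the proposal and whether it would work."
    print(function_word_vector(email))

Because the vector is normalized by the number of tokens, emails of different lengths remain comparable, which matters for short texts such as email messages.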

