08/24/15
2015 ISSA INTERNATIONAL CONFERENCE
N-Gram Analysis In Suspect Author Identification
of Anonymous Email
Copyright 2015, eVestigations Inc., All Rights Reserved
Paul Herrmann, CISSP, CISA, ENCE, CPP
• Problem Description
• Computational Linguistics as a Possible Solution
• N-Grams
• Solution Development
• Testing and Verification
• Problem Resolution
• Appropriate Usage
• Future Work
• Summary
N-Gram Analysis In Suspect Author Identification of Anonymous Email
Outline
I HATE XXXXX with a passion. I dislike everything about the company. The work…my pay, my hours, my co workers and my STUPID supervisor. I am exhausted EVERY single day. And on top of this I have to work a SECOND job at Emory. I do not have a fucking life and this is pissing me off. I have been doing this tired shit for YEARS! I did not go to college in Mississippi to move to Georgia and settle for this bullshit.
Anyway the reason for this message is concerning the work ethic at Xxxxx. I work with the most disrespectful people I have ever met in my life. They are very loud, very rude, and ignorant. In addition, they are very unprofessional on ALL levels. They do things like xxxx xxxxxxx everyday and hide or throw them away to cover up themselves.
But what gets to me is the constant nagging by my co workers. It has gotten to the point that I dread the WORD Xxxxx. I have been thinking about confronting several employees telling them to be respectful to me when they speak but they are so rude, it wouldn’t seem to work.
So I have decided that I will just bring my Gun to work next week. Look like it’s gonna be another Columbine very SOON if these motherfuckers dont behave.
Large Company - Anonymous Compliance Portal - Threat
Problem Description
• Five communications of similar length
• Over two week period
• Specific threat
• Various sources including anonymous portal and anonymous email
• Communications contained misinformation
• IP tracing led to dead ends
Threat Communications
Problem Description
• Suspect familiar with organization
• Aware of anonymous compliance employee portal
• Suspect knowledgeable of IP tracing/obscuring techniques
Knowns
Problem Description
• Threat Emails
• Reasonable Suspicion that Author is an Employee
• MS Exchange Server of Employee Emails
Have
Problem Description
Given an unknown writing, identify the author from a fixed population of writing.
Problem Description
• Master's thesis: all information can be represented as 1’s and 0’s; Alfred Noble Prize from the American Society of Civil Engineers in 1940.
• “A Mathematical Theory of Communication”, 1948, Bell Laboratories, Claude Shannon (introduced the term “bit” in print, crediting John Tukey)
• “Prediction and Entropy of Printed English”, 1951, The Bell System Technical Journal
• Information entropy and N-grams, core components of computational linguistics, emerge in the 1950s after failed attempts at computer language translation
• Computational linguistics became a sub-division of artificial intelligence in the 1960s, with emphasis on speech recognition and machine comprehension, and soon disappeared from the forefront of research …
Computational Linguistics
Computational Linguistics as a Possible Solution
Claude Shannon (1916-2001), The Father of Information Theory
• N stands for the number of consecutive lexical units (words, letters, etc.) whose frequencies are calculated from a known body of texts (the “training texts”)
• 3-gram example (“Now is the time that…”)
Now is the 92
is the time 13
the time that 32
What is an N-Gram Model?
Computational Linguistics as a Possible Solution
3‐gram data – Google Web N‐Gram Corpus
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection </S> 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
ceramics collection | 59
ceramics collections , 66
ceramics collections . 60
ceramics combined with 46
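Counts like the ones on this slide can be produced with a few lines of code. The following is a minimal sketch, assuming word-level units split on whitespace and case-sensitive matching; the function name word_ngrams and the sample sentence are illustrative and do not come from the presentation.

```python
from collections import Counter

def word_ngrams(text, n=3):
    """Count every run of n consecutive whitespace-separated words in text."""
    words = text.split()
    grams = (tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

# Illustrative input only; real counts come from a large body of training text.
counts = word_ngrams("Now is the time that now is the time", n=3)
for gram, count in counts.most_common(3):
    print(" ".join(gram), count)
```

Swapping text.split() for list(text) would give letter-level N-grams instead.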
• In the late 1950s, N-gram models (as well as other statistical models) were heavily criticized by Noam Chomsky, considered by many the father of modern linguistics.
• Formal grammar: finite sets of production rules, nonterminal symbols, terminal symbols and a starting symbol.
• Chomsky hierarchy: four types of grammars.
• By the 1960s, N-grams had all but disappeared from linguistic research
The Battle
Computational Linguistics as a Possible Solution
Noam Chomsky, The Father of Modern Linguistics
• An N-gram model predicts the likelihood of the next letter (or word) based only on the sequence of the preceding N-1 letters (or words), without regard to the “rules of grammar”
• N-grams returned in the mid-1980s
• By the 1990s, well entrenched: Google, search engines
• N-gram models have been extremely effective in modeling language applications
The Issue
Computational Linguistics as a Possible Solution
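The prediction idea on this slide can be illustrated with a small sketch. It assumes word-level units and plain relative frequencies with no smoothing; the names train and next_word_probability and the toy corpus are hypothetical and only serve the illustration.

```python
from collections import Counter, defaultdict

def train(words, n=3):
    """Map each (n-1)-word context to a Counter of the words seen right after it."""
    model = defaultdict(Counter)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        model[context][words[i + n - 1]] += 1
    return model

def next_word_probability(model, context, word):
    """Relative frequency of word after context; 0.0 if the context was never seen."""
    followers = model.get(tuple(context))
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())

# Toy training text, for illustration only.
corpus = "now is the time now is the hour now is the time".split()
model = train(corpus, n=3)
p_time = next_word_probability(model, ["is", "the"], "time")
```

Nothing in the model knows about grammar: it only records which word followed which two-word context, and how often.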
© 2005 Ryan North
• Search optimization
• Find likely candidates for the correct spelling of a misspelled word
• Improve compression in compression algorithms
• Assess the probability of a given word sequence in speech recognition and optical character recognition software
• Identify similar documents
• Improve hashing and information retrieval performance
• Identify text language or create random language-like text
• Identify species from DNA
• Associate text of unknown authorship to an author
Applications
N-Grams
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
Google Web N-Gram Corpus, “All Our N-gram are Belong to You”
N-Grams
The Lineup
Solution Development
Can we develop a system to use N‐grams to pick out writing similar to sample texts from a known population of text?
In other words, can we use N‐grams to match the writing style of the threat letters to a known author contained in the MS Exchange database?
“Author Identification Using Imbalanced and Limited Training Texts”, Efstathios Stamatatos, University of the Aegean
• Calculate the N-Grams for the unattributed texts
• Calculate the N-Grams for known author texts
• Calculate the “distance” of the N-Grams from one another
• Suggests that N = 3 to 5 yields the best results
• Issue: the “distance” measure is sensitive to text size and needs the compared texts to be of relatively similar size
Research
Solution Development
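The profile-and-distance idea on this slide can be sketched as follows. This is a hedged illustration: it assumes character-level N-grams, and the relative-difference measure shown is one common choice for comparing frequency profiles. The slides do not state the exact formula used, and the function names are hypothetical.

```python
from collections import Counter

def profile(text, n=3):
    """Normalized frequency of each character n-gram in text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def dissimilarity(p1, p2):
    """Sum of squared relative frequency differences over every n-gram seen.

    Assumed measure for illustration; 0.0 means identical profiles.
    """
    score = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        score += (2 * (f1 - f2) / (f1 + f2)) ** 2
    return score

unknown = profile("I have decided that I will bring it next week")
known = profile("I have decided that we will meet next week")
score = dissimilarity(unknown, known)
```

Every N-gram present in only one profile contributes the maximum of 4 to the sum, which is one way to see why a short text compared against a much longer one scores poorly regardless of authorship.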
“Effective Identification of Source Code Authors Using Byte-Level Information”, Frantzeskou, Stamatatos, Gritzalis, Katsikas, University of the Aegean
• Deals with limited text sizes
• Replaces “distance” calculation with a count of matched N-grams
• Accuracy reported between 94% and 100%
Research
Solution Development
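Counting matched N-grams, as this slide describes, can be sketched as the size of the overlap between two sets of frequent N-grams. The sketch assumes character-level 5-grams and a cap of 500 N-grams per profile; both values and the function names are assumptions for illustration, not figures from the cited paper or the presentation.

```python
from collections import Counter

def top_ngrams(text, n=5, size=500):
    """Set of the `size` most frequent character n-grams in text (assumed cap)."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g for g, _ in grams.most_common(size)}

def matched_count(text_a, text_b, n=5, size=500):
    """Number of n-grams the two profiles have in common; higher means more alike."""
    return len(top_ngrams(text_a, n, size) & top_ngrams(text_b, n, size))

same = matched_count("abcdefg", "abcdefg")
partial = matched_count("abcdefg", "xbcdefx")
none = matched_count("abcdefg", "zzzzzzz")
```

Because the score is a plain count rather than a ratio of frequencies, it does not need the two texts to be of similar length.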
• Normally would compare N-grams of unknown author text against a composite of each author's writing (“Who writes like this?”)
• Instead compare N-grams of threat text against individual texts of authors (“Who wrote an email like this?”)
Design Issue #1 – Writing is Intentionally Stylized
Solution Development
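The per-email comparison described on this slide can be sketched as below. It assumes word-level N-grams split on whitespace and resolves ties in favor of the first email examined; the function names and the sample emails are invented for illustration.

```python
def ngram_set(text, n=5):
    """Set of word n-grams in text (whitespace tokenization assumed)."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def closest_email(threat_text, emails, n=5):
    """emails is a list of (author, text) pairs.

    Returns (author, shared_count) for the single email sharing the most
    n-grams with the threat text, rather than scoring a per-author composite.
    """
    threat = ngram_set(threat_text, n)
    best_author, best_shared = None, -1
    for author, text in emails:
        shared = len(threat & ngram_set(text, n))
        if shared > best_shared:
            best_author, best_shared = author, shared
    return best_author, best_shared

# Invented sample data; n=3 keeps the toy example short.
emails = [
    ("alice", "please review the attached budget today"),
    ("bob", "they are very loud and very rude people"),
]
result = closest_email("my co workers are very loud and very rude", emails, n=3)
```

Scoring individual emails means one message written in the same register as the threat can surface its author even when most of that author's mail is written in a different style.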
• Retrieve email from “Sent” Folder
• Parse off previous non-author email text (Forwarded chains)
• Parse off signatures and boilerplate
• If remaining text is below minimum length discard
Design Issue #2 – Email Contains Non-Author Text
Solution Development
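A minimal sketch of the cleansing steps on this slide, assuming plain-text message bodies. The chain markers, the "-- " signature delimiter, and the 20-word minimum are all hypothetical choices; the slides do not say which patterns or threshold were actually used, and real mailboxes would need a tuned list.

```python
import re

# Hypothetical markers for where forwarded or replied-to text begins.
CHAIN_MARKERS = re.compile(
    r"^(-----\s*Original Message\s*-----|-+\s*Forwarded by .*|On .+ wrote:)\s*$",
    re.IGNORECASE | re.MULTILINE,
)
# Conventional "-- " signature delimiter; many signatures will not use it.
SIGNATURE_MARKER = re.compile(r"^--\s*$", re.MULTILINE)
MIN_WORDS = 20  # assumed threshold

def author_text(body, min_words=MIN_WORDS):
    """Return only the sender's own words, or None if too little text remains."""
    for marker in (CHAIN_MARKERS, SIGNATURE_MARKER):
        match = marker.search(body)
        if match:
            body = body[:match.start()]
    # Drop lines quoted with a leading '>'.
    lines = [ln for ln in body.splitlines() if not ln.lstrip().startswith(">")]
    text = "\n".join(lines).strip()
    return text if len(text.split()) >= min_words else None

body = (
    "Please send the report today.\n\n"
    "-----Original Message-----\nFrom: someone\nold text here"
)
kept = author_text(body, min_words=3)
```

Returning None for short remainders lets the caller discard the email, matching the last bullet on the slide.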
• Develop function to load and parse email archives
• Develop function to eliminate non-author text
• Develop (compiler-like) Lexical Analysis routine generating tokens
• Develop N-Gram calculation routine
• Develop N-Gram comparison routine
• Develop testing routine
Development Steps
Solution Development
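The lexical analysis and N-gram calculation steps might look like the sketch below. The token classes (words with internal apostrophes, numbers, single punctuation marks) are an assumption, as is keeping the original letter case; the presentation does not describe the token rules it used.

```python
import re

# Assumed token classes: words (with internal apostrophes), numbers,
# or any single character that is neither whitespace, letter, nor digit.
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\d+(?:\.\d+)?|[^\sA-Za-z\d]")

def tokenize(text):
    """Split text into tokens, keeping case and punctuation as style signals."""
    return TOKEN.findall(text)

def ngrams(tokens, n=5):
    """List of every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("I don't work, EVERY day!")
five_grams = ngrams(tokens, n=5)
```

Case and punctuation are kept as separate tokens on purpose here, since habits such as capitalizing whole words for emphasis may themselves be part of a writer's style.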
• Load all Enron PST
• Dedupe and cleanse non-author text
• Calculate 5-grams for all emails
• Perform 20 sets of:
  • Perform 100 trials:
    • Randomly pick one email author (mailbox)
    • Randomly pick an email from the mailbox
    • Compare all email 5-grams to the selected email
    • If the email having the most 5-grams in common is from the same author, then record “success”, else “fail”
Testing Protocol
Solution Development
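The testing protocol on this slide can be sketched as a small simulation. Several details are assumptions: word-level N-grams, excluding the selected email from its own comparison, breaking ties by first occurrence, and using a seeded random generator. The toy mailboxes stand in for the Enron data, and every mailbox is assumed to be non-empty.

```python
import random
from statistics import mean, stdev

def ngram_set(text, n=5):
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def run_protocol(mailboxes, sets=20, trials=100, n=5, seed=0):
    """mailboxes maps author -> list of email texts. Returns one accuracy per set."""
    rng = random.Random(seed)
    corpus = [(a, i, ngram_set(t, n))
              for a, texts in mailboxes.items() for i, t in enumerate(texts)]
    authors = list(mailboxes)
    accuracies = []
    for _ in range(sets):
        hits = 0
        for _ in range(trials):
            author = rng.choice(authors)
            index = rng.randrange(len(mailboxes[author]))
            probe = ngram_set(mailboxes[author][index], n)
            # Score every other email; the probe itself is excluded (assumption).
            scored = [(len(probe & g), a) for a, i, g in corpus
                      if (a, i) != (author, index)]
            best_author = max(scored, key=lambda pair: pair[0])[1]
            hits += best_author == author  # success if the top match is the same author
        accuracies.append(hits / trials)
    return accuracies

# Toy data for illustration; n=3 because the sample emails are very short.
mailboxes = {
    "a": ["red green blue yellow", "red green blue purple"],
    "b": ["one two three four", "one two three five"],
}
accuracies = run_protocol(mailboxes, sets=3, trials=10, n=3)
summary = (mean(accuracies), stdev(accuracies))
```

The mean and standard deviation of the per-set accuracies are the two figures a run like this would report.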
79.2% Accuracy with Standard Deviation of 4.2%
Testing and Verification
• Investigated suspect
• Poor performance report
• Former employer reported similar threats prior to employee’s separation
• Employee confessed and resigned
Identified Suspect
Problem Resolution
1. Suspect Pool Identification
2. Trial Evidence
A. Current testing results clearly place the methodology in the “more likely than not” category at best.
B. Has been used in British courts. In the British and Australian court systems it is the expert rather than the method that is recognized (see [1])
C. USA Courts - Daubert barrier
N-Gram Anonymous Author Usage Cases
Appropriate Usage
1. Empirical testing: whether the theory or technique is falsifiable, refutable, and/or testable.
2. Whether it has been subjected to peer review and publication.
3. The known or potential error rate.
4. The existence and maintenance of standards and controls concerning its operation.
5. The degree to which the theory and technique is generally accepted by a relevant scientific community.
Daubert Criteria
Appropriate Usage
• Coulthard [1] argues that the N-gram model, as well as other statistical methods, meets the Daubert criteria.
• Tiersma and Solan [5] argue that in most cases expert testimony is excluded but the documents in question are admitted, leaving the jury to discern their meaning and authorship without expert guidance
Linguists Argue Daubert Criteria is Adequately Met
Appropriate Usage
• United States v. Van Wyk (83 F. Supp. 2d 515 (D. N.J. 2000))
“Although Fitzgerald [the FBI agent offered as a stylistics expert] employed a particular methodology that may be subject to testing, neither Fitzgerald nor the Government has been able to identify a known rate of error, establish what amount of samples is necessary for an expert to be able to reach a conclusion as to probability of authorship, or pinpoint any meaningful peer review. Additionally, as Defendant argues, there is no universally recognized standard for certifying an individual as an expert in forensic stylistics. (83 F.Supp.2d at 522)” [5]
Daubert Criteria Challenge
Appropriate Usage
• JonBenét Ramsey, (1996) Ransom Note (See [5])
• Professor Donald Foster, using stylistic analysis, first attributed the note to someone who did not write it.
• Later, Professor Foster changed his position and determined that Mrs. Ramsey had written the note.
• “Such incidents help to justify the law's concern about methodology.” [5]
Daubert Criteria Challenge
Appropriate Usage
• More case studies
• Better testing to isolate best statistical approach and parameters
• Implement Linguistic Analysis techniques to determine:
• Nationality and region affiliations of author
• Sex of author
• Ethnicity of author
• Education level of author
• Hostility level of author
• Utilize MIT Simile project http://simile.mit.edu/wiki/NGram
Future Work
• N-gram analysis of anonymous email authorship has been successfully applied
• The methodology is testable
• The current error rate creates a “reasonable doubt” as to authorship
• The methodology can be an aid in identifying suspects
Summary
[1] M. Coulthard, “Author Identification, Idiolect and Linguistic Uniqueness”, Applied Linguistics, 2004, 13-14
[2] E. Stamatatos, “Author Identification Using Imbalanced and Limited Training Texts”, University of the Aegean, pp. 1-7
[3] C. Chaski, “Empirical Evaluations of Language-based Author Identification Techniques”, Forensic Linguistics, 2001, 1-65
[4] S. Banerjee, T. Pedersen, “The Design, Implementation and Use of the Ngram Statistics Package”, Carnegie Mellon University, 1-12
[5] P. Tiersma, L. Solan, “The Linguist on the Witness Stand: Forensic Linguistics in American Courts”, Language, Vol. 78, No. 2 (Jun 2002), pp. 221-239
[6] C. Shannon, “A Mathematical Theory of Communication”, The Bell System Technical Journal, Vol. 27, pp. 379-423, 633-656, July, October 1948
[7] C. Shannon, “Prediction and Entropy of Printed English”, The Bell System Technical Journal, Vol. 30, pp. 50-64, January 1951
References
Paul Herrmann, EnCE, CISSP, CISA, CPP
Contact