Determining Common Authorship Among Documents
Paul Bonamy
Mentor: Dr. Paul Kantor


Page 1: Determining Common Authorship Among Documents

Paul Bonamy
Mentor: Dr. Paul Kantor

Page 2: Author Identification & Common Authorship

Author Identification: “Who wrote this?”
- Mosteller/Wallace, 1964 – The Federalist: 12 disputed papers attributed to Madison
- Generally utilizes statistical analysis

Common Authorship: “Do these share an author?”
- Does not (necessarily) require statistics/training
- Useful for detecting forgeries, etc.

Page 3: BMR/BXR

- Implements Bayesian Multinomial Regression
- Used to perform 1-of-k classification
- BMRtrain accepts feature vectors, outputs an assignment model
- BMRclassify accepts the model & feature vectors, outputs assignments
- Can output author probability vectors

Page 4: Bayesian Analysis

Consider two match boxes. What is the probability that we have Box 1, given that we draw a black marble?

H0 = we have Box 1; E = we see a black marble

Bayes’ Theorem: P(H0 | E) = P(E | H0) P(H0) / P(E)

With P(E | H0) = .5, P(H0) = .5, P(E) = .75:

P(H0 | E) = (.5)(.5) / .75 = .25 / .75 = 1/3
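The match-box arithmetic can be checked in a few lines of Python (a minimal sketch; the probabilities are the ones from the slide):

```python
# Bayes' theorem for the two-box example:
#   P(H0 | E) = P(E | H0) * P(H0) / P(E)
p_e_given_h0 = 0.5   # P(black marble | Box 1)
p_h0 = 0.5           # prior: P(Box 1)
p_e = 0.75           # overall P(black marble)

p_h0_given_e = p_e_given_h0 * p_h0 / p_e
print(p_h0_given_e)  # 1/3: seeing a black marble makes Box 1 less likely
```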

Page 5: Bayesian Analysis in BMR

- Bayes’ Theorem is extendable to P(C | F1…FN), where C is a class and F1…FN are features
- Effectively applies Bayes’ Theorem to itself
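The multi-feature extension the slide alludes to is usually written as the naive-Bayes factorization, P(C | F1…FN) ∝ P(C) · P(F1 | C) ⋯ P(FN | C). A sketch with made-up numbers (the author names and word likelihoods are illustrative, not from the study):

```python
import math

# Hypothetical class priors and per-class word likelihoods (illustrative only).
priors = {"author_a": 0.5, "author_b": 0.5}
likelihoods = {
    "author_a": {"the": 0.060, "whilst": 0.001},
    "author_b": {"the": 0.050, "whilst": 0.010},
}
doc = ["the", "whilst", "the"]  # features observed in the test document

# Score each class by its unnormalized log-posterior:
#   log P(C) + sum_i log P(F_i | C)
scores = {c: math.log(p) + sum(math.log(likelihoods[c][w]) for w in doc)
          for c, p in priors.items()}
best = max(scores, key=scores.get)  # "author_b": the rare "whilst" dominates
```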

Page 6: BMR/BXR Workflow

(Diagram.) Data (doc corpus) → Feature Extractor → feature vectors → Test/Train Splitter. Training-set vectors → BMRtrain → model; model + testing-set vectors → BMRclassify → author identification and author probabilities.

Page 7: Corpus Construction

- Articles from 2006-07 issues of The Compass newspaper
- 16 authors, 130 documents
- 300-500 words: 69; 500+ words: 61
- Varied topics

Sample article:

On Friday, November 3, LSSU experienced its first closing of the semester due to inclement weather. The Soo Evening News reported a “number of minor mishaps,” and “slippery-road induced mishaps,” including two crashes near the campus of LSSU. All classes before 10 AM were canceled because of the snow and ice that had accumulated overnight, but many students arrived for classes as usual, unaware of the cancellation. …

Page 8: Feature Extraction

- Perl script using Lingua::EN::Tagger
- Selects words, part-of-speech (POS) tags, or both (wordPOS), e.g. address/VB vs. address/NN
- Used wordPOS features in the common authorship study
- Returns a vector of feature frequencies:

4:9.0 16:5.0 22:4.0 23:2.0 28:5.0 29:1.0 33:4.0 36:9.0 38:1.0 41:3.0 46:13.0 56:2.0 …
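The extractor itself is Perl (Lingua::EN::Tagger), but the wordPOS counting step can be sketched in Python; the tagger is replaced here by a tiny hypothetical lookup table, and vocabulary indices are assigned on first sight rather than taken from the study's feature file:

```python
from collections import Counter

# Hypothetical stand-in for a POS tagger (illustrative only).
POS = {"the": "DT", "board": "NN", "will": "MD", "address": "VB"}

def word_pos_features(tokens, vocab):
    """Count wordPOS tokens and return sparse index:frequency features."""
    counts = Counter(f"{w}/{POS.get(w, 'UNK')}" for w in tokens)
    return {vocab.setdefault(tok, len(vocab) + 1): float(n)
            for tok, n in counts.items()}

vocab = {}
vec = word_pos_features(["the", "board", "will", "address", "the"], vocab)
line = " ".join(f"{i}:{v}" for i, v in sorted(vec.items()))
print(line)  # sparse "index:frequency" form, as on the slide
```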

Page 9: Author Probability Vectors

- Produced by BMR/BXR upon request
- Gives the probability that a doc belongs to each author in the training set
- Not normalized (the sum is not necessarily 1):

0.17% 0.68% 9.13% 8.90% 2.42% 0.94% 10.55% 0.32% 0.72% 36.95% 0.31% 0.50% 0.48% 22.08% 1.34% 4.52%

Page 10: Computed With Features

- Start with feature vectors
- Select all distinct pairs of vectors
- Compute dot product and Euclidean distance for each pair
- Sort the data: descending by dot product, ascending by Euclidean distance
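The steps above can be sketched as follows (dense toy vectors and made-up document IDs stand in for the study's sparse feature vectors):

```python
import math
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy feature vectors keyed by document ID (illustrative only).
vectors = {1: [9.0, 5.0, 4.0], 2: [8.0, 5.0, 3.0], 3: [0.0, 1.0, 7.0]}

# All distinct pairs, with both measures computed per pair.
pairs = [(i, j, dot(vectors[i], vectors[j]), euclid(vectors[i], vectors[j]))
         for i, j in combinations(vectors, 2)]

by_dot = sorted(pairs, key=lambda p: -p[2])   # descending dot product
by_dist = sorted(pairs, key=lambda p: p[3])   # ascending Euclidean distance
```

Under both orderings the closest pair comes first; with these toy vectors, documents 1 and 2 top both lists.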

Page 11: Computed With Authors

- Start with author probability vectors
- Select all distinct pairs of vectors
- Compute dot product and Euclidean distance for each pair
- Sort the data: descending by dot product, ascending by Euclidean distance

Page 12: What Are We Looking For?

- Dot product (DP) and Euclidean distance both serve as measures of how close two vectors are
- Computed the distances between all pairs of vectors and sorted from closest to furthest
- Hypothesis: docs by the same author land close together; docs by different authors land far apart

Same Auth?  Doc #  Auth #  Doc #  Auth #  DP     Euclid
1           5      2       6      2       0.756  28.302
0           2      0       27     9       0.702  30.116
0           5      2       32     13      0.711  30.133
1           32     13      33     13      0.771  30.381
0           6      2       32     13      0.729  30.708

Page 13: ROC Curve

- Plots the fraction of same-author pairs accepted against the fraction of non-pairs accepted as the closeness threshold varies
- Area under the curve indicates model accuracy; higher is better
- This curve (Euclidean distance over feature vectors): 64.7% area under the curve
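Given labeled pair distances, the area under the ROC curve equals the probability that a randomly chosen same-author pair is closer than a randomly chosen different-author pair, with ties counted half. A minimal sketch with made-up labels and distances (not the study's measurements):

```python
def roc_auc(pairs):
    """AUC from (is_same_author, distance) pairs via the rank statistic:
    the fraction of (same, different) combinations in which the
    same-author pair has the smaller distance, ties counting half."""
    pos = [d for same, d in pairs if same]
    neg = [d for same, d in pairs if not same]
    wins = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative data only: (same_author?, Euclidean distance)
data = [(True, 28.3), (False, 30.1), (False, 30.1), (True, 30.4), (False, 30.7)]
auc = roc_auc(data)  # 4 wins out of 6 comparisons
```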

Pages 14-18: Can We Improve This?

Area under the ROC curve, by vector type and closeness measure:

          Euclid   Dot
Features  64.7%    65.2%
Authors   78.6%    83.3%

Page 19: Results for Other Data Splits

Area under the ROC curve for each analysis type, across author-ID models of varying accuracy:

Model accuracy   Features Euclid   Features DP   Author Euclid   Author DP
33.33%           73.5%             69.9%         95.1%           95.3%
38.10%           77.8%             65.7%         69.9%           75.2%
56.40%           64.7%             65.2%         78.6%           83.3%
80.00%           65.0%             77.0%         88.3%           92.0%

Page 20: Analyzing Other Corpora

- Obtained a second corpus: 9377 documents, 24 authors
- Results similar to those on the Compass dataset

Area under the ROC curve:

          Euclid   Dot
Features  55.2%    59.5%
Authors   79.7%    84.5%

Page 21: Open Questions

- Are the Area Under Curve variations significant?
- How does author-ID model accuracy affect same-author accuracy? (A model with low author-ID accuracy did very well.)
- Can we reduce memory/processing requirements?