Ranking Similarity between Political Speeches using Naive Bayes Text Classification
James Ryder and Sen Zhang
Dept. of Mathematics, Computer Science, & Statistics
State University of New York at Oneonta
[email protected] --- [email protected]
Text Classification Training
• Comprised of two subcomponents
  - Concordancer
  - Metaconcordancer
• All training is done offline and prior to any classification
• Training generates finished metaconcordance files that are used during classification
• A set of metaconcordance files is a set of categories containing N files
Training - Metaconcordancer
• Combine all concordances (C1 - CM) for a single author (Ai) into a single metaconcordance file (MTA)
[Diagram: concordances C1 ... CM → Metaconcordancer → MTA file for author Ai]
• Create a metaconcordance file for author Ai. This is a complete description of this author's texts and is used at classification time.
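The merge step above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation; the function name and the word → (count, relative frequency) file layout are assumptions.

```python
from collections import Counter

def merge_concordances(concordances):
    """Combine per-text concordances (word -> count) for one author
    into a single metaconcordance with recomputed relative frequencies."""
    totals = Counter()
    for conc in concordances:
        totals.update(conc)
    total_words = sum(totals.values())
    # Metaconcordance entry: word -> (count, relative frequency)
    return {w: (c, c / total_words) for w, c in totals.items()}

# Example: two small concordances (C1, C2) for the same author
c1 = {"freedom": 3, "nation": 2}
c2 = {"freedom": 1, "hope": 4}
mta = merge_concordances([c1, c2])
```

Recomputing the relative frequencies from the pooled counts (rather than averaging the per-text frequencies) keeps the metaconcordance consistent with treating all of the author's texts as one large sample.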
Inaugural Speech Experiment Design
• Training Phase
  - The inaugural speeches of ten recent U.S. presidents: Barack Obama, George W. Bush, Bill Clinton, George Bush, Ronald Reagan, Jimmy Carter, Richard Nixon, Lyndon Johnson, John Kennedy, and Dwight Eisenhower.
  - For those who served two terms, only the second inaugural speech was collected.
• Classification Phase
  - Obama's Inaugural speech
  - Bush's Second Inaugural speech
  - Bush's Farewell speech
Top Frequently Used Words of George W. Bush's Farewell Speech
[Word cloud figure: among the most frequent words are economy, justice, God, democracy, Americans, liberty, freedom, citizens, character, future, time, honor, terrorists, prosperity, United States, President, war, and America]
TC Authors Training Example
• Set of N authors
  - We are given a snippet of text said to be written by one of the authors in the set of authors (categories)
• This TC system should attempt to predict which author is most likely to have written the snippet of text
• In the training phase, we need to obtain samples of the writing for each author in the category (author) set
Prepare for Classification
• Collect all MTA (category) files into one folder.
• Edit the category list file by inserting the names of all category files to compare against.
• To be ready to classify some unknown snippet of text, one needs
  - All category files prepared (MTA)
  - The category list file
Categories_A = {MTA_A1, MTA_A2, ..., MTA_AN} is the set of all category files
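The preparation steps above can be sketched as below. The file names, the folder name, and the one-name-per-line list format are illustrative assumptions; the poster does not specify the actual file layout.

```python
import os

def parse_category_list(text):
    """Parse the category list file: one MTA (category) file name
    per line; blank lines are ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def category_paths(folder, names):
    """Resolve each listed category file name against the MTA folder."""
    return {name: os.path.join(folder, name) for name in names}

# Hypothetical category list file contents
names = parse_category_list("MTA_Obama.txt\nMTA_GWBush.txt\n")
paths = category_paths("categories", names)
```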
Preliminary Experiment Results {1.1}: Ranking via Comparing Their Speeches with Obama's Inaugural Speech
[Bar chart: the ten presidents' inaugural speeches ranked by similarity to Obama's Inaugural Speech; axis labels not recoverable from transcript]
Top Frequently Used Words of Barack Obama's Inaugural Speech
[Word cloud figure not captured in transcript]
Joint and Conditional Probability
Using variables X and Y, P(X, Y) means the probability that X will take on a specific value x and Y will take on a specific value y. That is, X will occur and Y will occur too.
P(X, Y) = P(Y, X). This idea is known as Joint Probability.
P(Y = y | X = x) is read "the probability that Y will take on the specific value y GIVEN THAT X has already taken on the specific value x". This is Conditional Probability, P(Y | X).
The joint probability factors in two ways:
P(X, Y) = P(X) P(Y | X)
P(Y, X) = P(Y) P(X | Y)
The first formula is read "the probability that X occurs and Y occurs is the same as the probability that X occurs and, given that X has occurred, Y occurs", and vice versa for the second.
Since P(X, Y) = P(Y, X):
P(X) P(Y | X) = P(Y) P(X | Y)
P(Y | X) = P(Y) P(X | Y) / P(X)
The formula above is read "(the chance of Y given that X occurred) is ((the chance of Y occurring) times (the chance of X occurring given that Y occurred)) divided by (the chance of X occurring)".
This is the standard Bayes Theorem.
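The identity above can be checked numerically. The joint distribution below is made up purely for illustration; the point is that both factorizations agree and Bayes' theorem recovers P(Y | X).

```python
# Hypothetical joint distribution over two binary variables X and Y.
# All numbers are invented for this example.
p_joint = {("x", "y"): 0.12, ("x", "not_y"): 0.28,
           ("not_x", "y"): 0.18, ("not_x", "not_y"): 0.42}

p_x = p_joint[("x", "y")] + p_joint[("x", "not_y")]      # marginal P(X = x)
p_y = p_joint[("x", "y")] + p_joint[("not_x", "y")]      # marginal P(Y = y)
p_x_given_y = p_joint[("x", "y")] / p_y                  # P(X = x | Y = y)
p_y_given_x = p_joint[("x", "y")] / p_x                  # P(Y = y | X = x)

# Bayes' theorem: P(Y | X) = P(Y) * P(X | Y) / P(X)
bayes = p_y * p_x_given_y / p_x
```

Here `bayes` and `p_y_given_x` come out identical, which is exactly the equality derived from the two factorizations of P(X, Y).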
TC Author Example Training
• Find many works of literature that each author has written
• For each author, create a single concordance for each of this author's M texts (T)
A = {A1, A2, ..., AN} The set of all N authors
T_Ai = {T1, T2, ..., TM} The set of all texts for author Ai
[Diagram: texts T_Ai → Concordancer → concordances C1 ... CM]
• Create a concordance file for each of the author's texts (T_Ai), C1 through CM.
Mapping Classic Bayes Theorem into Naive Bayes Text Classification System
We will use a modified (naive) version of Bayes Theorem to create an ordered list of categories to which a given input text may belong. The ordering is based upon the relative likelihood that the text is similar to a category instance, for all categories in the category set.
P(Y | X) = P(Y) P(X | Y) / P(X)
  (a)       (b)    (c)     (d)
a) X is the input text (source) that we attempt to classify. Y is a single instance of a category from among a set of categories being considered. (a) is read "the probability that text X belongs to category Y".
b) is the probability that this is instance Yi from the category set. If the category set contains 10 categories then P(Y) is 0.10.
c) is the probability of all words in the input text being found in category instance Yi from the set of categories.
d) is the probability that input text X is in fact the input text X. This is the same for every category, and therefore it will be discarded in the final formula without affecting relative scores between the categories.
Political Spectrum Experiment Design
• Training Phase
  - Ten prominent worldwide political figures
    • George W. Bush, Winston Churchill, Bill Clinton, Adolf Hitler, John Kennedy, Tse-tung Mao, Karl Marx, Barack Obama, Joseph Stalin, Margaret Thatcher.
  - For each of them, we randomly select five speeches or written works. By random, here we mean we simply collected these speeches from the Internet without prior knowledge about them and without reading them.
• Classification Phase
  - Obama's Inaugural speech
  - Bush's Second Inaugural speech
  - Bush's Farewell speech
References and Acknowledgments
• Beautiful Word Clouds: http://www.wordle.net
• Inaugural Speeches of Presidents of the United States: http://www.bartleby.com/124
• Thanks to Dr. William R. Wilkerson for his help in directing us to online political speech repositories.
• Thanks to the TLTC for printing the poster.
Naive Bayes Text Classifier
• Our text classification (TC) system is broken into two main components
  - Training
  - Classification
• Training must be done first
• We need to map the standard Bayes Theorem into a formula for quantifying the likelihood that a given text (X) falls into a certain category instance (Y).
Training - Concordancer
• For each text (Tj) in T_Ai, the concordancer
  - counts the number of occurrences of each unique word in the text (frequency)
  - counts the total number of words
  - calculates the relative frequency of each unique word in the text (frequency / total_words)
  - creates an output file concordance (C) containing the above information and the list of unique words
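The concordancer steps above can be sketched in a few lines. This is a minimal sketch assuming whitespace tokenization and lowercasing; the poster does not describe the actual tokenizer or file format.

```python
from collections import Counter

def build_concordance(text):
    """Concordance for one text: per-word count, total word count,
    and relative frequency (count / total), as the slide describes."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {w: {"count": c, "rel_freq": c / total} for w, c in counts.items()}

conc = build_concordance("we the people we the nation")
```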
c) For a given category Yi, what is the probability that the words in X appear in Yi?
X = {w1, w2, ..., wn} The set of all words in the snippet of text
P(X | Yi) = P((w1, w2, ..., wn) | Yi)
P(X | Yi) = ∏_{j=1}^{n} P(wj | Yi)
The probability of wj is the relative frequency of the word contained in the metaconcordance for category Yi.
If wj from X is not present in Yi, then we use a very small number for the probability, because the probability of a word not found is zero, and multiplying by zero destroys the product.
This product will result in an extremely small number that may be smaller than a computer can properly represent precisely. So, we use a trick: instead, we add the logarithms.
Trick: log(A · B) = log(A) + log(B)
log(P(X | Yi)) = ∑_{j=1}^{n} log(P(wj | Yi))
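The log-sum scoring above can be sketched as follows. The floor value for missing words and the in-memory category format are assumptions for illustration; since P(Y) is uniform across categories, it is omitted from the comparison.

```python
import math

FLOOR = 1e-10  # small stand-in probability for words absent from a category (assumption)

def log_score(snippet_words, category_rel_freqs):
    """Sum of log P(w | Y) over the snippet, using the relative
    frequencies from the category's metaconcordance; a word not in
    the category gets the small floor probability instead of zero."""
    return sum(math.log(category_rel_freqs.get(w, FLOOR)) for w in snippet_words)

def rank_categories(snippet_words, categories):
    """Order category names from most to least likely for the snippet."""
    return sorted(categories,
                  key=lambda name: log_score(snippet_words, categories[name]),
                  reverse=True)

# Toy metaconcordances (word -> relative frequency) for two categories
cats = {"A": {"freedom": 0.4, "hope": 0.2},
        "B": {"freedom": 0.05, "war": 0.3}}
ranking = rank_categories(["freedom", "hope"], cats)
```

Adding logarithms preserves the ordering of the products, so the top-ranked category is the same one the raw product formula would pick, without underflow.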
Preliminary Experiment Results {2.2}: Ranking via Comparing Their Speeches with G. W. Bush's Second Inaugural Speech
[Bar chart: the ten presidents' inaugural speeches ranked by similarity to G. W. Bush's Second Inaugural Speech; axis labels not recoverable from transcript]
Future Work
• To improve ranking accuracy, we plan to
  - use variants of Naive Bayes and address the poor independence assumption;
  - explore more linguistic, rhetorical, and stylistic features such as metaphors, analogies, similes, opposition, alliteration, antithesis, and parallelism;
  - select more representative training datasets;
  - conduct more intensive experiments.