Transcript

Ranking Similarity between Political Speeches using Naive Bayes Text Classification

James Ryder and Sen Zhang

Dept. of Mathematics, Computer Science, & Statistics

State University of New York at Oneonta

[email protected] --- [email protected]

Text Classification Training

• Comprised of two subcomponents
- Concordancer
- Metaconcordancer

• All training is done offline and prior to any classification

• Training generates finished metaconcordance files that are used during classification

• A set of metaconcordance files is a set of categories containing N files

Training - Metaconcordancer

Combine all concordances (C1 through CM) for a single author (Ai) into a single metaconcordance file (MTA)

[Diagram: concordance files C1 through CM → Metaconcordancer → MTA file for author Ai]

Create a metaconcordance file for author Ai. This is a complete description of this author's texts and is used at classification time.
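A minimal sketch of this step in Python, assuming each concordance file stores one "word count relative_frequency" line per word; the file format and function name are illustrative assumptions, not the authors' stated implementation:

from collections import Counter

def build_metaconcordance(concordance_paths, mta_path):
    # Merge the per-text concordances C1..CM for one author into combined counts.
    totals = Counter()
    for path in concordance_paths:
        with open(path) as f:
            for line in f:
                word, count, _rel = line.split()  # per-text relative frequency is recomputed below
                totals[word] += int(count)
    grand_total = sum(totals.values())
    # Write the MTA file: one "word count relative_frequency" line per unique word.
    with open(mta_path, "w") as out:
        for word, count in totals.most_common():
            out.write(f"{word} {count} {count / grand_total:.8f}\n")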

Inaugural Speech Experiment Design

• Training Phase

- The inaugural speeches of ten recent U.S. presidents: Barack Obama, George W. Bush, Bill Clinton, George Bush, Ronald Reagan, Jimmy Carter, Richard Nixon, Lyndon Johnson, John Kennedy, and Dwight Eisenhower.

- For those who served two terms, only the second inaugural speech was collected.

• Classification Phase

- Obama's Inaugural speech

- Bush's Second Inaugural speech

- Bush's Farewell speech

Top Frequently Used Words of George W. Bush's Farewell Speech

[Word cloud of the most frequent words in the speech, generated with Wordle]

TC Authors Training Example

• Set of N authors
- We are given a snippet of text said to be written by one of the authors in the set of authors (categories)

• This TC system should attempt to predict which author is most likely to have written the snippet of text

• In the training phase, we need to obtain samples of the writing for each author in the category (author) set

Prepare for Classification

• Collect all MTA (category) files into one folder.

• Edit the category list file by inserting the names of all category files to compare against.

• To be ready to classify some unknown snippet of text, one needs
- All category files prepared (MTA)
- The category list file

Categories_A = {MTA_A1, MTA_A2, ..., MTA_AN} is the set of all category files
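A sketch of this preparation step in Python; the ".mta" extension, the folder name, and the one-file-name-per-line list format are illustrative assumptions:

import glob
import os

category_dir = "categories"  # assumed folder holding every MTA category file
mta_files = sorted(glob.glob(os.path.join(category_dir, "*.mta")))

# The category list file: one category (MTA) file name per line.
with open("category_list.txt", "w") as f:
    for path in mta_files:
        f.write(path + "\n")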

Preliminary Experiment Results (1/2): Ranking via Comparing Their Speeches with Obama's Inaugural Speech

[Bar chart: the ten training categories ranked by similarity of their speeches to Obama's Inaugural Speech]

Top Frequently Used Words of Barack Obama's Inaugural Speech

Joint and Conditional Probability

Using variables X and Y, P(X, Y) means the probability that X will take on a specific value x and Y will take on a specific value y. That is, X will occur and Y will occur too.

P(X, Y) = P(Y, X). This idea is known as Joint Probability.

P(Y = y | X = x) is read "the probability that Y will take on the specific value y GIVEN THAT X has already taken on the specific value x". This is the Conditional Probability P(Y | X).

P(X, Y) = P(Y, X) and P(Y, X) = P(X, Y)
P(X, Y) = P(X) P(Y | X)
P(Y, X) = P(Y) P(X | Y)

The formulas above are read "the probability that X occurs and Y occurs is the same as the probability that X occurs times the probability that, given that X has occurred, Y occurs", and vice versa for P(Y, X).

P(X, Y) = P(Y, X)
P(X) P(Y | X) = P(Y) P(X | Y)

P(Y | X) = P(Y) P(X | Y) / P(X)

The formula above is read "(the chance of Y given that X occurred) is ((the chance of Y occurring) times (the chance of X occurring given that Y occurred)) divided by (the chance of X occurring)".

This is the standard Bayes Theorem
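A quick numeric check of the theorem, with made-up probabilities:

# Check P(Y | X) = P(Y) * P(X | Y) / P(X) with invented numbers.
p_y = 0.10          # prior: one of ten equally likely categories
p_x_given_y = 0.03  # chance of text X within category Y (invented)
p_x = 0.006         # overall chance of text X (invented)

p_y_given_x = p_y * p_x_given_y / p_x
print(p_y_given_x)  # 0.5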

TC Author Example Training

• Find many works of literature that each author has written

• For each author, create a single concordance for each of this author's M texts (T)

A = {A" A2, ... ,AN} The set of all N authors TAi = {T" T2, ... , T M} The set of all texts for author Ai

[Diagram: texts T1 through TM → Concordancer → concordance files C1 through CM]

Create a concordance file for each of author Ai's texts (TAi), C1 through CM.

Mapping Classic Bayes Theorem into Naïve Bayes Text Classification System

We will use a modified (naïve) version of Bayes Theorem to create an ordered list of categories to which a given input text may belong. The ordering is based upon the relative likelihood that the text is similar to a category instance, for all categories in the category set.

P(Y | X) = P(Y) P(X | Y) / P(X)

where the labeled terms are: (a) P(Y | X), (b) P(Y), (c) P(X | Y), (d) P(X)

a) X is the input text (source) that we attempt to classify. Y is a single instance of a category from among the set of categories being considered. (a) is read "the probability that text X belongs to category Y".

b) is the probability that this is instance 'i' from the category set. If the category set contains 10 categories, then P(Y) is 0.10.

c) is the probability of all words in the input text being found in category instance Yi from the set of categories.

d) is the probability that input text X is in fact the input text X. Clearly, this is 1, and therefore it will be discarded from the final formula without affecting the relative scores between the categories.
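Since P(X) in term (d) is the same for every category, dropping it rescales the scores but cannot change the ranking. A small illustration with invented numbers:

# Dropping the common denominator P(X) cannot change the ranking.
p_y = 0.10                                                    # uniform prior over 10 categories
p_x_given_y = {"Obama": 3e-5, "Bush": 2e-5, "Clinton": 1e-5}  # invented values
p_x = 6e-6                                                    # identical for every category

full = {y: p_y * v / p_x for y, v in p_x_given_y.items()}
short = {y: p_y * v for y, v in p_x_given_y.items()}

print(sorted(full, key=full.get, reverse=True))   # ['Obama', 'Bush', 'Clinton']
print(sorted(short, key=short.get, reverse=True)) # same order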

Political Spectrum Experiment Design

• Training Phase
- Ten prominent worldwide political figures:

• George W. Bush, Winston Churchill, Bill Clinton, Adolf Hitler, John Kennedy, Tse-tung Mao, Karl Marx, Barack Obama, Joseph Stalin, Margaret Thatcher.

- For each of them, we randomly selected five speeches or written works. By "random" we mean that we simply collected these speeches from the Internet without prior knowledge about them and without reading them.

• Classification Phase
- Obama's Inaugural speech
- Bush's Second Inaugural speech
- Bush's Farewell speech

References and Acknowledgments

• Beautiful Word Clouds: http://www.wordle.net

• Inaugural Speeches of Presidents of the United States: http://www.bartleby.com/124

• Thanks to Dr. William R. Wilkerson for his help in directing us to online political speech repositories.

• Thanks to the TLTC for printing out the Poster.

Naïve Bayes Text Classifier

• Our text classification (TC) system is broken into two main components
- Training
- Classification

• Training must be done first

• We need to map a standard Bayes Theorem into a formula for quantifying the likelihood that a given text (X) falls into a certain category instance (Y).

Training - Concordancer

• For each text (Tj) in TAi, the concordancer
- counts the number of occurrences of each unique word in the text (frequency)
- counts the total number of words
- calculates the relative frequency of each unique word in the text (frequency / total_words)
- creates an output file concordance (C) containing the above information and the list of unique words
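A minimal concordancer sketch in Python; the tokenizer and the "word count relative_frequency" output format are our assumptions, since the poster does not specify them:

import re
from collections import Counter

def build_concordance(text_path, concordance_path):
    # Naive tokenizer: lowercase words, letters and apostrophes only (an assumption).
    with open(text_path) as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    freq = Counter(words)
    total = len(words)
    # One "word count relative_frequency" line per unique word.
    with open(concordance_path, "w") as out:
        for word, count in freq.most_common():
            out.write(f"{word} {count} {count / total:.8f}\n")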

c) For a given category Yi, what is the probability that the words in X appear in Yi?

X = {w1, w2, ..., wn}  The set of all words in the snippet of text

P(X | Yi) = P((w1, w2, ..., wn) | Yi)

P(X | Yi) = Π (j = 1 to n) P(wj | Yi). The probability of wj is the relative frequency of the word contained in the metaconcordance for category Yi.

If wj from X is not present in Yi, then we use a very small number for its probability, because the probability of a word not found would be zero, and multiplying by zero destroys the product.

This product will result in an extremely small number that may be smaller than a computer can represent precisely. So, we use a trick: instead, we add the logarithms.

Trick: log(A · B) = log(A) + log(B)
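A short demonstration of why the trick is needed: the raw product of many small probabilities underflows to zero in floating point, while the sum of logarithms remains representable:

import math

probs = [1e-8] * 50                     # 50 small word probabilities
print(math.prod(probs))                 # 0.0 -- the product underflows
print(sum(math.log(p) for p in probs))  # about -921.03 -- still representable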

log(P(X | Yi)) = Σ (j = 1 to n) log(P(wj | Yi))
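Putting these pieces together, a hedged sketch of the log-space scorer; the exact floor value for unseen words is our choice, since the poster says only "a very small number":

import math

UNSEEN_PROB = 1e-10  # stand-in for the poster's "very small number" (assumed value)

def log_score(snippet_words, rel_freq):
    # rel_freq: word -> relative frequency from one category's metaconcordance.
    return sum(math.log(rel_freq.get(w, UNSEEN_PROB)) for w in snippet_words)

# Usage sketch: the category with the highest (least negative) score ranks first.
# scores = {name: log_score(words, rel_freq) for name, rel_freq in categories.items()}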

Preliminary Experiment Results (2/2): Ranking via Comparing Their Speeches with G. W. Bush's Second Inaugural Speech

[Bar chart: the ten training categories ranked by similarity of their speeches to G. W. Bush's Second Inaugural Speech]

Future Work

• To improve ranking accuracy, we plan to
- use variants of Naïve Bayes and address the weak word-independence assumption;
- explore more linguistic, rhetorical, and stylistic features, such as metaphors, analogies, similes, opposition, alliteration, antithesis, and parallelism;
- select more representative training datasets;
- conduct more intensive experiments.
