advanced quantitative research methodology, lecture notes: =1=text

137
Advanced Quantitative Research Methodology, Lecture Notes: Text Analysis I: How to Read 100 Million Blogs (& Classify Deaths Without Physicians) 1 Gary King http://GKing.Harvard.Edu April 25, 2010 1 c Copyright 2010 Gary King, All Rights Reserved. Gary King http://GKing.Harvard.Edu () Advanced Quantitative Research Methodology, Lecture Notes: April 25, 2010 1/1

Upload: others

Post on 12-Sep-2021

24 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Advanced Quantitative Research Methodology, LectureNotes: Text Analysis I: How to Read 100 Million Blogs

(& Classify Deaths Without Physicians)1

Gary Kinghttp://GKing.Harvard.Edu

April 25, 2010

1 c©Copyright 2010 Gary King, All Rights Reserved.Gary King http://GKing.Harvard.Edu () Advanced Quantitative Research Methodology, Lecture Notes: Text Analysis I: How to Read 100 Million Blogs (& Classify Deaths Without Physicians)April 25, 2010 1 / 1

Page 2: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

References

Daniel Hopkins and Gary King. “Extracting Systematic Social ScienceMeaning from Text” 54, 1 (January 2010): 229-24

commercializedvia:

Gary King and Ying Lu. “Verbal Autopsy Methods with MultipleCauses of Death,” Statistical Science 23, 1 (February, 2008): Pp.78–91

In use by (among others):

Copies at http://gking.harvard.edu

http://ow.ly/14hDU (play after ad)

http://ow.ly/14h36 (play 10:00-12:06)

Gary King (Harvard, IQSS) Text Analysis 2 / 1

Page 3: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

References

Daniel Hopkins and Gary King. “Extracting Systematic Social ScienceMeaning from Text” 54, 1 (January 2010): 229-24 commercializedvia:

Gary King and Ying Lu. “Verbal Autopsy Methods with MultipleCauses of Death,” Statistical Science 23, 1 (February, 2008): Pp.78–91

In use by (among others):

Copies at http://gking.harvard.edu

http://ow.ly/14hDU (play after ad)

http://ow.ly/14h36 (play 10:00-12:06)

Gary King (Harvard, IQSS) Text Analysis 2 / 1

Page 4: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

References

Daniel Hopkins and Gary King. “Extracting Systematic Social ScienceMeaning from Text” 54, 1 (January 2010): 229-24 commercializedvia:

Gary King and Ying Lu. “Verbal Autopsy Methods with MultipleCauses of Death,” Statistical Science 23, 1 (February, 2008): Pp.78–91

In use by (among others):

Copies at http://gking.harvard.edu

http://ow.ly/14hDU (play after ad)

http://ow.ly/14h36 (play 10:00-12:06)

Gary King (Harvard, IQSS) Text Analysis 2 / 1

Page 5: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

References

Daniel Hopkins and Gary King. “Extracting Systematic Social ScienceMeaning from Text” 54, 1 (January 2010): 229-24 commercializedvia:

Gary King and Ying Lu. “Verbal Autopsy Methods with MultipleCauses of Death,” Statistical Science 23, 1 (February, 2008): Pp.78–91 In use by (among others):

Copies at http://gking.harvard.edu

http://ow.ly/14hDU (play after ad)

http://ow.ly/14h36 (play 10:00-12:06)

Gary King (Harvard, IQSS) Text Analysis 2 / 1

Page 6: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

References

Daniel Hopkins and Gary King. “Extracting Systematic Social ScienceMeaning from Text” 54, 1 (January 2010): 229-24 commercializedvia:

Gary King and Ying Lu. “Verbal Autopsy Methods with MultipleCauses of Death,” Statistical Science 23, 1 (February, 2008): Pp.78–91 In use by (among others):

Copies at http://gking.harvard.edu

http://ow.ly/14hDU (play after ad)

http://ow.ly/14h36 (play 10:00-12:06)

Gary King (Harvard, IQSS) Text Analysis 2 / 1

Page 7: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

References

Daniel Hopkins and Gary King. “Extracting Systematic Social ScienceMeaning from Text” 54, 1 (January 2010): 229-24 commercializedvia:

Gary King and Ying Lu. “Verbal Autopsy Methods with MultipleCauses of Death,” Statistical Science 23, 1 (February, 2008): Pp.78–91 In use by (among others):

Copies at http://gking.harvard.edu

http://ow.ly/14hDU (play after ad)

http://ow.ly/14h36 (play 10:00-12:06)

Gary King (Harvard, IQSS) Text Analysis 2 / 1

Page 8: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

References

Daniel Hopkins and Gary King. “Extracting Systematic Social ScienceMeaning from Text” 54, 1 (January 2010): 229-24 commercializedvia:

Gary King and Ying Lu. “Verbal Autopsy Methods with MultipleCauses of Death,” Statistical Science 23, 1 (February, 2008): Pp.78–91 In use by (among others):

Copies at http://gking.harvard.edu

http://ow.ly/14hDU (play after ad)

http://ow.ly/14h36 (play 10:00-12:06)

Gary King (Harvard, IQSS) Text Analysis 2 / 1

Page 9: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Inputs and Target Quantities of Interest

Input Data:

Large set of text documents (blogs, web pages, emails, etc.)A set of (mutually exclusive and exhaustive) categoriesA small set of documents hand-coded into the categories

Quantities of interest

individual document classifications (spam filters)proportion in each category (proportion email which is spam)

Estimation

Can get the 2nd by counting the 1st (turns out not to be necessary!)High classification accuracy ; unbiased category proportions⇒ Different methods optimize estimation of the different quantities

Gary King (Harvard, IQSS) Text Analysis 3 / 1

Page 10: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Inputs and Target Quantities of Interest

Input Data:

Large set of text documents (blogs, web pages, emails, etc.)A set of (mutually exclusive and exhaustive) categoriesA small set of documents hand-coded into the categories

Quantities of interest

individual document classifications (spam filters)proportion in each category (proportion email which is spam)

Estimation

Can get the 2nd by counting the 1st (turns out not to be necessary!)High classification accuracy ; unbiased category proportions⇒ Different methods optimize estimation of the different quantities

Gary King (Harvard, IQSS) Text Analysis 3 / 1

Page 11: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Inputs and Target Quantities of Interest

Input Data:

Large set of text documents (blogs, web pages, emails, etc.)

A set of (mutually exclusive and exhaustive) categoriesA small set of documents hand-coded into the categories

Quantities of interest

individual document classifications (spam filters)proportion in each category (proportion email which is spam)

Estimation

Can get the 2nd by counting the 1st (turns out not to be necessary!)High classification accuracy ; unbiased category proportions⇒ Different methods optimize estimation of the different quantities

Gary King (Harvard, IQSS) Text Analysis 3 / 1

Page 12: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Inputs and Target Quantities of Interest

Input Data:

Large set of text documents (blogs, web pages, emails, etc.)A set of (mutually exclusive and exhaustive) categories

A small set of documents hand-coded into the categories

Quantities of interest

individual document classifications (spam filters)proportion in each category (proportion email which is spam)

Estimation

Can get the 2nd by counting the 1st (turns out not to be necessary!)High classification accuracy ; unbiased category proportions⇒ Different methods optimize estimation of the different quantities

Gary King (Harvard, IQSS) Text Analysis 3 / 1

Page 13: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Inputs and Target Quantities of Interest

Input Data:

Large set of text documents (blogs, web pages, emails, etc.)A set of (mutually exclusive and exhaustive) categoriesA small set of documents hand-coded into the categories

Quantities of interest

individual document classifications (spam filters)proportion in each category (proportion email which is spam)

Estimation

Can get the 2nd by counting the 1st (turns out not to be necessary!)High classification accuracy ; unbiased category proportions⇒ Different methods optimize estimation of the different quantities

Gary King (Harvard, IQSS) Text Analysis 3 / 1

Page 14: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Inputs and Target Quantities of Interest

Input Data:

Large set of text documents (blogs, web pages, emails, etc.)A set of (mutually exclusive and exhaustive) categoriesA small set of documents hand-coded into the categories

Quantities of interest

individual document classifications (spam filters)proportion in each category (proportion email which is spam)

Estimation

Can get the 2nd by counting the 1st (turns out not to be necessary!)High classification accuracy ; unbiased category proportions⇒ Different methods optimize estimation of the different quantities

Gary King (Harvard, IQSS) Text Analysis 3 / 1

Page 15: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Inputs and Target Quantities of Interest

Input Data:

Large set of text documents (blogs, web pages, emails, etc.)A set of (mutually exclusive and exhaustive) categoriesA small set of documents hand-coded into the categories

Quantities of interest

individual document classifications (spam filters)

proportion in each category (proportion email which is spam)

Estimation

Can get the 2nd by counting the 1st (turns out not to be necessary!)High classification accuracy ; unbiased category proportions⇒ Different methods optimize estimation of the different quantities

Gary King (Harvard, IQSS) Text Analysis 3 / 1

Page 16: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Inputs and Target Quantities of Interest

Input Data:

Large set of text documents (blogs, web pages, emails, etc.)A set of (mutually exclusive and exhaustive) categoriesA small set of documents hand-coded into the categories

Quantities of interest

individual document classifications (spam filters)proportion in each category (proportion email which is spam)

Estimation

Can get the 2nd by counting the 1st (turns out not to be necessary!)High classification accuracy ; unbiased category proportions⇒ Different methods optimize estimation of the different quantities

Gary King (Harvard, IQSS) Text Analysis 3 / 1

Page 17: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Inputs and Target Quantities of Interest

Input Data:

Large set of text documents (blogs, web pages, emails, etc.)A set of (mutually exclusive and exhaustive) categoriesA small set of documents hand-coded into the categories

Quantities of interest

individual document classifications (spam filters)proportion in each category (proportion email which is spam)

Estimation

Can get the 2nd by counting the 1st (turns out not to be necessary!)High classification accuracy ; unbiased category proportions⇒ Different methods optimize estimation of the different quantities

Gary King (Harvard, IQSS) Text Analysis 3 / 1

Page 18: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Inputs and Target Quantities of Interest

Input Data:

Large set of text documents (blogs, web pages, emails, etc.)A set of (mutually exclusive and exhaustive) categoriesA small set of documents hand-coded into the categories

Quantities of interest

individual document classifications (spam filters)proportion in each category (proportion email which is spam)

Estimation

Can get the 2nd by counting the 1st (turns out not to be necessary!)

High classification accuracy ; unbiased category proportions⇒ Different methods optimize estimation of the different quantities

Gary King (Harvard, IQSS) Text Analysis 3 / 1

Page 19: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Inputs and Target Quantities of Interest

Input Data:

Large set of text documents (blogs, web pages, emails, etc.)A set of (mutually exclusive and exhaustive) categoriesA small set of documents hand-coded into the categories

Quantities of interest

individual document classifications (spam filters)proportion in each category (proportion email which is spam)

Estimation

Can get the 2nd by counting the 1st (turns out not to be necessary!)High classification accuracy ; unbiased category proportions

⇒ Different methods optimize estimation of the different quantities

Gary King (Harvard, IQSS) Text Analysis 3 / 1

Page 20: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Inputs and Target Quantities of Interest

Input Data:

Large set of text documents (blogs, web pages, emails, etc.)A set of (mutually exclusive and exhaustive) categoriesA small set of documents hand-coded into the categories

Quantities of interest

individual document classifications (spam filters)proportion in each category (proportion email which is spam)

Estimation

Can get the 2nd by counting the 1st (turns out not to be necessary!)High classification accuracy ; unbiased category proportions⇒ Different methods optimize estimation of the different quantities

Gary King (Harvard, IQSS) Text Analysis 3 / 1

Page 21: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Blogs as a Running Example

Blogs (web logs): web version of a daily diary, with posts listed inreverse chronological order.

“We are living through the largest expansion of expressive capabilityin the history of the human race”

Measures classical notion of public opinion: active public expressionsdesigned to influence policy and politics (previously: strikes, boycotts,demonstrations, editorials)

(Public opinion ; surveys)

Gary King (Harvard, IQSS) Text Analysis 4 / 1

Page 22: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Blogs as a Running Example

Blogs (web logs): web version of a daily diary, with posts listed inreverse chronological order.

“We are living through the largest expansion of expressive capabilityin the history of the human race”

Measures classical notion of public opinion: active public expressionsdesigned to influence policy and politics (previously: strikes, boycotts,demonstrations, editorials)

(Public opinion ; surveys)

Gary King (Harvard, IQSS) Text Analysis 4 / 1

Page 23: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Blogs as a Running Example

Blogs (web logs): web version of a daily diary, with posts listed inreverse chronological order.

“We are living through the largest expansion of expressive capabilityin the history of the human race”

Measures classical notion of public opinion: active public expressionsdesigned to influence policy and politics (previously: strikes, boycotts,demonstrations, editorials)

(Public opinion ; surveys)

Gary King (Harvard, IQSS) Text Analysis 4 / 1

Page 24: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Blogs as a Running Example

Blogs (web logs): web version of a daily diary, with posts listed inreverse chronological order.

“We are living through the largest expansion of expressive capabilityin the history of the human race”

Measures classical notion of public opinion: active public expressionsdesigned to influence policy and politics (previously: strikes, boycotts,demonstrations, editorials)

(Public opinion ; surveys)

Gary King (Harvard, IQSS) Text Analysis 4 / 1

Page 25: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Blogs as a Running Example

Blogs (web logs): web version of a daily diary, with posts listed inreverse chronological order.

“We are living through the largest expansion of expressive capabilityin the history of the human race”

Measures classical notion of public opinion: active public expressionsdesigned to influence policy and politics (previously: strikes, boycotts,demonstrations, editorials)

(Public opinion ; surveys)

Gary King (Harvard, IQSS) Text Analysis 4 / 1

Page 26: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

One specific quantity of interest

Daily opinion about President Bush and 2008 candidates among allEnglish language blog posts

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”Informal language: “my crunchy gf thinks dubya hid the wmd’s, :)!”Little common internal structure (no inverted pyramid)

Gary King (Harvard, IQSS) Text Analysis 5 / 1

Page 27: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

One specific quantity of interest

Daily opinion about President Bush and 2008 candidates among allEnglish language blog posts

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”Informal language: “my crunchy gf thinks dubya hid the wmd’s, :)!”Little common internal structure (no inverted pyramid)

Gary King (Harvard, IQSS) Text Analysis 5 / 1

Page 28: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

One specific quantity of interest

Daily opinion about President Bush and 2008 candidates among allEnglish language blog posts

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”Informal language: “my crunchy gf thinks dubya hid the wmd’s, :)!”Little common internal structure (no inverted pyramid)

Gary King (Harvard, IQSS) Text Analysis 5 / 1

Page 29: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

One specific quantity of interest

Daily opinion about President Bush and 2008 candidates among allEnglish language blog posts

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”Informal language: “my crunchy gf thinks dubya hid the wmd’s, :)!”Little common internal structure (no inverted pyramid)

Gary King (Harvard, IQSS) Text Analysis 5 / 1

Page 30: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

One specific quantity of interest

Daily opinion about President Bush and 2008 candidates among allEnglish language blog posts

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization

“Sentiment categorization is more difficult than topic classification”Informal language: “my crunchy gf thinks dubya hid the wmd’s, :)!”Little common internal structure (no inverted pyramid)

Gary King (Harvard, IQSS) Text Analysis 5 / 1

Page 31: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

One specific quantity of interest

Daily opinion about President Bush and 2008 candidates among allEnglish language blog posts

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”

Informal language: “my crunchy gf thinks dubya hid the wmd’s, :)!”Little common internal structure (no inverted pyramid)

Gary King (Harvard, IQSS) Text Analysis 5 / 1

Page 32: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

One specific quantity of interest

Daily opinion about President Bush and 2008 candidates among allEnglish language blog posts

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”Informal language: “my crunchy gf thinks dubya hid the wmd’s, :)!”

Little common internal structure (no inverted pyramid)

Gary King (Harvard, IQSS) Text Analysis 5 / 1

Page 33: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

One specific quantity of interest

Daily opinion about President Bush and 2008 candidates among allEnglish language blog posts

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”Informal language: “my crunchy gf thinks dubya hid the wmd’s, :)!”Little common internal structure (no inverted pyramid)

Gary King (Harvard, IQSS) Text Analysis 5 / 1

Page 34: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Example of output: John Kerry’s Botched Joke

You know, education — if you make the most of it . . . you cando well. If you don’t, you get stuck in Iraq.

●●

●● ●

●●

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−2

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

0

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

2

Gary King (Harvard, IQSS) Text Analysis 6 / 1

Page 35: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Example of output: John Kerry’s Botched Joke

You know, education — if you make the most of it . . . you cando well. If you don’t, you get stuck in Iraq.

●●

●● ●

●●

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−2

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

0

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

2

Gary King (Harvard, IQSS) Text Analysis 6 / 1

Page 36: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Example of output: John Kerry’s Botched Joke

You know, education — if you make the most of it . . . you cando well. If you don’t, you get stuck in Iraq.

●●

●● ●

●●

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−2

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

0

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

2

Gary King (Harvard, IQSS) Text Analysis 6 / 1

Page 37: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Representing Text as Numbers

Filter: choose English language blogs that mention Bush

Preprocess: convert to lower case, remove punctuation, keep onlyword stems (“consist”, “consisted”, “consistency” “consist”)

Code variables: presence/absence of unique unigrams, bigrams,trigrams

Our Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.keep only unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

More sophisticated summaries: we’ve used, but they’re not necessary

(More systematic than than 1 dummy variable per document)

Gary King (Harvard, IQSS) Text Analysis 7 / 1

Page 38: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Representing Text as Numbers

Filter: choose English language blogs that mention Bush

Preprocess: convert to lower case, remove punctuation, keep onlyword stems (“consist”, “consisted”, “consistency” “consist”)

Code variables: presence/absence of unique unigrams, bigrams,trigrams

Our Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.keep only unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

More sophisticated summaries: we’ve used, but they’re not necessary

(More systematic than than 1 dummy variable per document)

Gary King (Harvard, IQSS) Text Analysis 7 / 1

Page 39: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Representing Text as Numbers

Filter: choose English language blogs that mention Bush

Preprocess: convert to lower case, remove punctuation, keep onlyword stems (“consist”, “consisted”, “consistency” “consist”)

Code variables: presence/absence of unique unigrams, bigrams,trigrams

Our Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.keep only unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

More sophisticated summaries: we’ve used, but they’re not necessary

(More systematic than than 1 dummy variable per document)

Gary King (Harvard, IQSS) Text Analysis 7 / 1

Page 40: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Representing Text as Numbers

Filter: choose English language blogs that mention Bush

Preprocess: convert to lower case, remove punctuation, keep onlyword stems (“consist”, “consisted”, “consistency” “consist”)

Code variables: presence/absence of unique unigrams, bigrams,trigrams

Our Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.keep only unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

More sophisticated summaries: we’ve used, but they’re not necessary

(More systematic than than 1 dummy variable per document)

Gary King (Harvard, IQSS) Text Analysis 7 / 1

Page 41: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Representing Text as Numbers

Filter: choose English language blogs that mention Bush

Preprocess: convert to lower case, remove punctuation, keep onlyword stems (“consist”, “consisted”, “consistency” “consist”)

Code variables: presence/absence of unique unigrams, bigrams,trigrams

Our Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.keep only unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

More sophisticated summaries: we’ve used, but they’re not necessary

(More systematic than than 1 dummy variable per document)

Gary King (Harvard, IQSS) Text Analysis 7 / 1

Page 42: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Representing Text as Numbers

Filter: choose English language blogs that mention Bush

Preprocess: convert to lower case, remove punctuation, keep onlyword stems (“consist”, “consisted”, “consistency” “consist”)

Code variables: presence/absence of unique unigrams, bigrams,trigrams

Our Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.

keep only unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

More sophisticated summaries: we’ve used, but they’re not necessary

(More systematic than than 1 dummy variable per document)

Gary King (Harvard, IQSS) Text Analysis 7 / 1

Page 43: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Representing Text as Numbers

Filter: choose English language blogs that mention Bush

Preprocess: convert to lower case, remove punctuation, keep onlyword stems (“consist”, “consisted”, “consistency” “consist”)

Code variables: presence/absence of unique unigrams, bigrams,trigrams

Our Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.keep only unigrams in > 1% or < 99% of documents: 3,672 variables

Groups infinite possible posts into “only” 23,672 distinct types

More sophisticated summaries: we’ve used, but they’re not necessary

(More systematic than than 1 dummy variable per document)

Gary King (Harvard, IQSS) Text Analysis 7 / 1

Page 44: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Representing Text as Numbers

Filter: choose English language blogs that mention Bush

Preprocess: convert to lower case, remove punctuation, keep onlyword stems (“consist”, “consisted”, “consistency” “consist”)

Code variables: presence/absence of unique unigrams, bigrams,trigrams

Our Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.keep only unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

More sophisticated summaries: we’ve used, but they’re not necessary

(More systematic than than 1 dummy variable per document)

Gary King (Harvard, IQSS) Text Analysis 7 / 1

Page 45: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Representing Text as Numbers

Filter: choose English language blogs that mention Bush

Preprocess: convert to lower case, remove punctuation, keep onlyword stems (“consist”, “consisted”, “consistency” “consist”)

Code variables: presence/absence of unique unigrams, bigrams,trigrams

Our Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.keep only unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

More sophisticated summaries: we’ve used, but they’re not necessary

(More systematic than than 1 dummy variable per document)

Gary King (Harvard, IQSS) Text Analysis 7 / 1

Page 46: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Representing Text as Numbers

Filter: choose English language blogs that mention Bush

Preprocess: convert to lower case, remove punctuation, keep onlyword stems (“consist”, “consisted”, “consistency” “consist”)

Code variables: presence/absence of unique unigrams, bigrams,trigrams

Our Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.keep only unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

More sophisticated summaries: we’ve used, but they’re not necessary

(More systematic than than 1 dummy variable per document)

Gary King (Harvard, IQSS) Text Analysis 7 / 1

Page 47: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Notation

Document Category

Di =

-2 extremely negative

-1 negative

0 neutral

1 positive

2 extremely positive

NA no opinion expressed

NB not a blog

Word Stem Profile:

Si =

Si1 = 1 if “awful” is used, 0 if not

Si2 = 1 if “good” is used, 0 if not...

...

SiK = 1 if “except” is used, 0 if not

Gary King (Harvard, IQSS) Text Analysis 8 / 1

Page 48: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Notation

Document Category

Di =

-2 extremely negative

-1 negative

0 neutral

1 positive

2 extremely positive

NA no opinion expressed

NB not a blog

Word Stem Profile:

Si =

Si1 = 1 if “awful” is used, 0 if not

Si2 = 1 if “good” is used, 0 if not...

...

SiK = 1 if “except” is used, 0 if not

Gary King (Harvard, IQSS) Text Analysis 8 / 1

Page 49: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Notation

Document Category

Di =

-2 extremely negative

-1 negative

0 neutral

1 positive

2 extremely positive

NA no opinion expressed

NB not a blog

Word Stem Profile:

Si =

Si1 = 1 if “awful” is used, 0 if not

Si2 = 1 if “good” is used, 0 if not...

...

SiK = 1 if “except” is used, 0 if not

Gary King (Harvard, IQSS) Text Analysis 8 / 1

Page 50: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Quantities of Interest

Computer Science: individual document classifications

D1,D2 . . . , DL

Social Science: proportions in each category

P(D) =

P(D = −2)P(D = −1)P(D = 0)P(D = 1)P(D = 2)

P(D = NA)P(D = NB)

Gary King (Harvard, IQSS) Text Analysis 9 / 1

Page 51: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Quantities of Interest

Computer Science: individual document classifications

D1,D2 . . . , DL

Social Science: proportions in each category

P(D) =

P(D = −2)P(D = −1)P(D = 0)P(D = 1)P(D = 2)

P(D = NA)P(D = NB)

Gary King (Harvard, IQSS) Text Analysis 9 / 1

Page 52: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Quantities of Interest

Computer Science: individual document classifications

D1,D2 . . . , DL

Social Science: proportions in each category

P(D) =

P(D = −2)P(D = −1)P(D = 0)P(D = 1)P(D = 2)

P(D = NA)P(D = NB)

Gary King (Harvard, IQSS) Text Analysis 9 / 1

Page 53: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Issues with Existing Statistical Approaches

1 Direct Sampling

Biased without a random samplenonrandomness common due to population drift, data subdivisions, etc.(Classification of population documents not necessary)

2 Aggregation of model-based individual classifications

Biased without a random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard, IQSS) Text Analysis 10 / 1

Page 54: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Issues with Existing Statistical Approaches

1 Direct Sampling

Biased without a random samplenonrandomness common due to population drift, data subdivisions, etc.(Classification of population documents not necessary)

2 Aggregation of model-based individual classifications

Biased without a random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard, IQSS) Text Analysis 10 / 1

Page 55: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Issues with Existing Statistical Approaches

1 Direct Sampling

Biased without a random sample

nonrandomness common due to population drift, data subdivisions, etc.(Classification of population documents not necessary)

2 Aggregation of model-based individual classifications

Biased without a random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard, IQSS) Text Analysis 10 / 1

Page 56: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Issues with Existing Statistical Approaches

1 Direct Sampling

Biased without a random samplenonrandomness common due to population drift, data subdivisions, etc.

(Classification of population documents not necessary)

2 Aggregation of model-based individual classifications

Biased without a random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard, IQSS) Text Analysis 10 / 1

Page 57: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Issues with Existing Statistical Approaches

1 Direct Sampling

Biased without a random samplenonrandomness common due to population drift, data subdivisions, etc.(Classification of population documents not necessary)

2 Aggregation of model-based individual classifications

Biased without a random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard, IQSS) Text Analysis 10 / 1

Page 58: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Issues with Existing Statistical Approaches

1 Direct Sampling

Biased without a random samplenonrandomness common due to population drift, data subdivisions, etc.(Classification of population documents not necessary)

2 Aggregation of model-based individual classifications

Biased without a random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard, IQSS) Text Analysis 10 / 1

Page 59: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Issues with Existing Statistical Approaches

1 Direct Sampling

Biased without a random samplenonrandomness common due to population drift, data subdivisions, etc.(Classification of population documents not necessary)

2 Aggregation of model-based individual classifications

Biased without a random sample

Models P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard, IQSS) Text Analysis 10 / 1

Page 60: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Issues with Existing Statistical Approaches

1 Direct Sampling

Biased without a random samplenonrandomness common due to population drift, data subdivisions, etc.(Classification of population documents not necessary)

2 Aggregation of model-based individual classifications

Biased without a random sampleModels P(D|S), but the world works as P(S|D)

Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard, IQSS) Text Analysis 10 / 1

Page 61: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Issues with Existing Statistical Approaches

1 Direct Sampling

Biased without a random samplenonrandomness common due to population drift, data subdivisions, etc.(Classification of population documents not necessary)

2 Aggregation of model-based individual classifications

Biased without a random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard, IQSS) Text Analysis 10 / 1

Page 62: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Issues with Existing Statistical Approaches

1 Direct Sampling

Biased without a random samplenonrandomness common due to population drift, data subdivisions, etc.(Classification of population documents not necessary)

2 Aggregation of model-based individual classifications

Biased without a random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.

S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard, IQSS) Text Analysis 10 / 1

Page 63: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Issues with Existing Statistical Approaches

1 Direct Sampling

Biased without a random samplenonrandomness common due to population drift, data subdivisions, etc.(Classification of population documents not necessary)

2 Aggregation of model-based individual classifications

Biased without a random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard, IQSS) Text Analysis 10 / 1

Page 64: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Issues with Existing Statistical Approaches

1 Direct Sampling

Biased without a random samplenonrandomness common due to population drift, data subdivisions, etc.(Classification of population documents not necessary)

2 Aggregation of model-based individual classifications

Biased without a random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard, IQSS) Text Analysis 10 / 1

Page 65: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Aggregate classifications to category proportions

Use labeled set to estimate misclassification rates (by cross-validation)

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(No new assumptions beyond that of the classifier)

(still requires random samples, individual classification, etc)

Gary King (Harvard, IQSS) Text Analysis 11 / 1

Page 66: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Aggregate classifications to category proportions

Use labeled set to estimate misclassification rates (by cross-validation)

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(No new assumptions beyond that of the classifier)

(still requires random samples, individual classification, etc)

Gary King (Harvard, IQSS) Text Analysis 11 / 1

Page 67: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Aggregate classifications to category proportions

Use labeled set to estimate misclassification rates (by cross-validation)

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(No new assumptions beyond that of the classifier)

(still requires random samples, individual classification, etc)

Gary King (Harvard, IQSS) Text Analysis 11 / 1

Page 68: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Aggregate classifications to category proportions

Use labeled set to estimate misclassification rates (by cross-validation)

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(No new assumptions beyond that of the classifier)

(still requires random samples, individual classification, etc)

Gary King (Harvard, IQSS) Text Analysis 11 / 1

Page 69: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Aggregate classifications to category proportions

Use labeled set to estimate misclassification rates (by cross-validation)

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(No new assumptions beyond that of the classifier)

(still requires random samples, individual classification, etc)

Gary King (Harvard, IQSS) Text Analysis 11 / 1

Page 70: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Aggregate classifications to category proportions

Use labeled set to estimate misclassification rates (by cross-validation)

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(No new assumptions beyond that of the classifier)

(still requires random samples, individual classification, etc)

Gary King (Harvard, IQSS) Text Analysis 11 / 1

Page 71: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Aggregate classifications to category proportions

Use labeled set to estimate misclassification rates (by cross-validation)

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(No new assumptions beyond that of the classifier)

(still requires random samples, individual classification, etc)

Gary King (Harvard, IQSS) Text Analysis 11 / 1

Page 72: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Aggregate classifications to category proportions

Use labeled set to estimate misclassification rates (by cross-validation)

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(No new assumptions beyond that of the classifier)

(still requires random samples, individual classification, etc)

Gary King (Harvard, IQSS) Text Analysis 11 / 1

Page 73: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Formalization from Epidemiology(Levy and Kass, 1970)

Accounting identity for 2 categories:

P(D̂ = 1) = (sens)P(D = 1) + (1− spec)P(D = 2)

Solve:

P(D = 1) =P(D̂ = 1)− (1− spec)

sens− (1− spec)

Use this equation to correct P(D̂ = 1)

Gary King (Harvard, IQSS) Text Analysis 12 / 1

Page 74: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Formalization from Epidemiology(Levy and Kass, 1970)

Accounting identity for 2 categories:

P(D̂ = 1) = (sens)P(D = 1) + (1− spec)P(D = 2)

Solve:

P(D = 1) =P(D̂ = 1)− (1− spec)

sens− (1− spec)

Use this equation to correct P(D̂ = 1)

Gary King (Harvard, IQSS) Text Analysis 12 / 1

Page 75: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Formalization from Epidemiology(Levy and Kass, 1970)

Accounting identity for 2 categories:

P(D̂ = 1) = (sens)P(D = 1) + (1− spec)P(D = 2)

Solve:

P(D = 1) =P(D̂ = 1)− (1− spec)

sens− (1− spec)

Use this equation to correct P(D̂ = 1)

Gary King (Harvard, IQSS) Text Analysis 12 / 1

Page 76: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Formalization from Epidemiology(Levy and Kass, 1970)

Accounting identity for 2 categories:

P(D̂ = 1) = (sens)P(D = 1) + (1− spec)P(D = 2)

Solve:

P(D = 1) =P(D̂ = 1)− (1− spec)

sens− (1− spec)

Use this equation to correct P(D̂ = 1)

Gary King (Harvard, IQSS) Text Analysis 12 / 1

Page 77: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Generalizations: J Categories, No Individual Classification(King and Lu, 2008)

Accounting identity for J categories

P(D̂ = j) =J∑

j ′=1

P(D̂ = j |D = j ′)P(D = j ′)

Drop D̂ calculation, since D̂ = f (S):

P(S = s) =J∑

j ′=1

P(S = s|D = j ′)P(D = j ′)

Simplify to an equivalent matrix expression:

P(S) = P(S|D)P(D)

Gary King (Harvard, IQSS) Text Analysis 13 / 1

Page 78: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Generalizations: J Categories, No Individual Classification(King and Lu, 2008)

Accounting identity for J categories

P(D̂ = j) =J∑

j ′=1

P(D̂ = j |D = j ′)P(D = j ′)

Drop D̂ calculation, since D̂ = f (S):

P(S = s) =J∑

j ′=1

P(S = s|D = j ′)P(D = j ′)

Simplify to an equivalent matrix expression:

P(S) = P(S|D)P(D)

Gary King (Harvard, IQSS) Text Analysis 13 / 1

Page 79: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Generalizations: J Categories, No Individual Classification(King and Lu, 2008)

Accounting identity for J categories

P(D̂ = j) =J∑

j ′=1

P(D̂ = j |D = j ′)P(D = j ′)

Drop D̂ calculation, since D̂ = f (S):

P(S = s) =J∑

j ′=1

P(S = s|D = j ′)P(D = j ′)

Simplify to an equivalent matrix expression:

P(S) = P(S|D)P(D)

Gary King (Harvard, IQSS) Text Analysis 13 / 1

Page 80: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Generalizations: J Categories, No Individual Classification(King and Lu, 2008)

Accounting identity for J categories

P(D̂ = j) =J∑

j ′=1

P(D̂ = j |D = j ′)P(D = j ′)

Drop D̂ calculation, since D̂ = f (S):

P(S = s) =J∑

j ′=1

P(S = s|D = j ′)P(D = j ′)

Simplify to an equivalent matrix expression:

P(S) = P(S|D)P(D)

Gary King (Harvard, IQSS) Text Analysis 13 / 1

Page 81: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 82: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Document category proportions (quantity of interest)

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 83: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Word stem profile proportions (estimate in unlabeled set by tabulation)

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 84: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Word stem profiles, by category (estimate in labeled set by tabulation)

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 85: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ

=⇒ β = (X ′X )−1X ′y

Alternative symbols (to emphasize the linear equation)

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 86: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Solve for quantity of interest (with no error term)

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 87: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 88: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computer

P(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 89: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparse

Elements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 90: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 91: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 92: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average results

Equivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 93: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical data

Use constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 94: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 95: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Result: fast, accurate, with very little (human) tuning required

Gary King (Harvard, IQSS) Text Analysis 14 / 1

Page 96: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

A Nonrandom Hand-coded Sample

0.0 0.1 0.2 0.3 0.4

0.0

0.1

0.2

0.3

0.4

Differences in Document Category Frequencies

Ph(D)

P(D

)

●●

●●

●●

●●

●●

● ●

● ●

●●

●●

0.01 0.02 0.05 0.10

0.01

0.02

0.05

0.10

Differences in Word Profile Frequencies

Ph(S)

P(S

)

All existing methods would fail with these data.

Gary King (Harvard, IQSS) Text Analysis 15 / 1

Page 97: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Accurate Estimates

0.0 0.1 0.2 0.3 0.4

0.0

0.1

0.2

0.3

0.4

Actual P(D)

Est

imat

ed P

(D)

Gary King (Harvard, IQSS) Text Analysis 16 / 1

Page 98: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Out-of-sample Comparison: 60 Seconds vs. 8.7 Days

●● ●●

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Affect in Blogs

Actual P(D)

Est

imat

ed P

(D)

●●●

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Affect in Blogs

Actual P(D)

Est

imat

ed P

(D)

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Affect in Blogs

Actual P(D)

Est

imat

ed P

(D)

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Affect in Blogs

Actual P(D)

Est

imat

ed P

(D)

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Affect in Blogs

Actual P(D)

Est

imat

ed P

(D)

Gary King (Harvard, IQSS) Text Analysis 17 / 1

Page 99: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Out of Sample Validation: Other Examples

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

Congressional Speeches

Actual P(D)

Est

imat

ed P

(D)

0.0 0.1 0.2 0.3 0.4 0.5 0.6

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Immigration Editorials

Actual P(D)

Est

imat

ed P

(D)

●●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

Enron Emails

Actual P(D)

Est

imat

ed P

(D)

Gary King (Harvard, IQSS) Text Analysis 18 / 1

Page 100: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Verbal Autopsy: Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard, IQSS) Text Analysis 19 / 1

Page 101: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Verbal Autopsy: Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard, IQSS) Text Analysis 19 / 1

Page 102: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policies

High quality death registration: only 23/192 countries

Existing Approaches

Verbal Autopsy: Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard, IQSS) Text Analysis 19 / 1

Page 103: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Verbal Autopsy: Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard, IQSS) Text Analysis 19 / 1

Page 104: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Verbal Autopsy: Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard, IQSS) Text Analysis 19 / 1

Page 105: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Verbal Autopsy: Ask relatives or caregivers 50-100 symptom questions

Ask physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard, IQSS) Text Analysis 19 / 1

Page 106: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Verbal Autopsy: Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)

Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard, IQSS) Text Analysis 19 / 1

Page 107: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Verbal Autopsy: Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)

Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard, IQSS) Text Analysis 19 / 1

Page 108: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Verbal Autopsy: Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard, IQSS) Text Analysis 19 / 1

Page 109: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

An Alternative Approach

Document Category, Cause of Death,

Di =

1 if bladder cancer

2 if cardiovascular disease

3 if transportation accident...

...

J if infectious respiratory

Word Stem Profile, Symptoms:

Si =

Si1 = 1 if “breathing difficulties”, 0 if not

Si2 = 1 if “stomach ache”, 0 if not...

...

SiK = 1 if “diarrhea”, 0 if not

Apply the same methods

Gary King (Harvard, IQSS) Text Analysis 20 / 1

Page 110: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

An Alternative Approach

Document Category, Cause of Death,

Di =

1 if bladder cancer

2 if cardiovascular disease

3 if transportation accident...

...

J if infectious respiratory

Word Stem Profile, Symptoms:

Si =

Si1 = 1 if “breathing difficulties”, 0 if not

Si2 = 1 if “stomach ache”, 0 if not...

...

SiK = 1 if “diarrhea”, 0 if not

Apply the same methods

Gary King (Harvard, IQSS) Text Analysis 20 / 1

Page 111: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

An Alternative Approach

Document Category, Cause of Death,

Di =

1 if bladder cancer

2 if cardiovascular disease

3 if transportation accident...

...

J if infectious respiratory

Word Stem Profile, Symptoms:

Si =

Si1 = 1 if “breathing difficulties”, 0 if not

Si2 = 1 if “stomach ache”, 0 if not...

...

SiK = 1 if “diarrhea”, 0 if not

Apply the same methods

Gary King (Harvard, IQSS) Text Analysis 20 / 1

Page 112: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

An Alternative Approach

Document Category, Cause of Death,

Di =

1 if bladder cancer

2 if cardiovascular disease

3 if transportation accident...

...

J if infectious respiratory

Word Stem Profile, Symptoms:

Si =

Si1 = 1 if “breathing difficulties”, 0 if not

Si2 = 1 if “stomach ache”, 0 if not...

...

SiK = 1 if “diarrhea”, 0 if not

Apply the same methods

Gary King (Harvard, IQSS) Text Analysis 20 / 1

Page 113: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Validation in Tanzania

●●

●●

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Random Split Sample

res.split.boot$true.CSMF

Est

imat

e

0.00 0.05 0.10 0.15 0.20 0.25 0.30

−0.

2−

0.1

0.0

0.1

0.2

TRUE

Err

or

●●

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Community Sample

res.split.boot$true.CSMFE

stim

ate

0.00 0.05 0.10 0.15 0.20 0.25 0.30

−0.

2−

0.1

0.0

0.1

0.2

TRUE

Err

or

● ●

Gary King (Harvard, IQSS) Text Analysis 21 / 1

Page 114: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Validation in China

●●

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Random Split Sample

res.split.boot$true.CSMF

Est

imat

e

0.00 0.05 0.10 0.15 0.20 0.25 0.30

−0.

100.

000.

050.

10

TRUE

Err

or

●●

●●

0.00

0.05

0.10

0.15

0.20

0.25

0.30

City Sample I

res.split.boot$true.CSMF

Est

imat

e

0.00 0.05 0.10 0.15 0.20 0.25 0.30

−0.

100.

000.

050.

10

TRUE

Err

or

●●

● ●

●●

0.00

0.05

0.10

0.15

0.20

0.25

0.30

City Sample II

res.big.boot$true.CSMF

Est

imat

e

0.00 0.05 0.10 0.15 0.20 0.25 0.30

−0.

100.

000.

050.

10

TRUE

Err

or

Gary King (Harvard, IQSS) Text Analysis 22 / 1

Page 115: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Implications for an Individual Classifier

All existing classifiers assume: Ph(S ,D) = P(S ,D)

For a different quantity we assume: Ph(S |D) = P(S |D)

How to use this (less restrictive) assumption for classification (BayesTheorem):

P(D`|S` = s`) =P(S` = s`|D` = j)P(D` = j)

P(S` = s`)

Gary King (Harvard, IQSS) Text Analysis 23 / 1

Page 116: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Implications for an Individual Classifier

All existing classifiers assume: Ph(S ,D) = P(S ,D)

For a different quantity we assume: Ph(S |D) = P(S |D)

How to use this (less restrictive) assumption for classification (BayesTheorem):

P(D`|S` = s`) =P(S` = s`|D` = j)P(D` = j)

P(S` = s`)

Gary King (Harvard, IQSS) Text Analysis 23 / 1

Page 117: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Implications for an Individual Classifier

All existing classifiers assume: Ph(S ,D) = P(S ,D)

For a different quantity we assume: Ph(S |D) = P(S |D)

How to use this (less restrictive) assumption for classification (BayesTheorem):

P(D`|S` = s`) =P(S` = s`|D` = j)P(D` = j)

P(S` = s`)

Gary King (Harvard, IQSS) Text Analysis 23 / 1

Page 118: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Implications for an Individual Classifier

All existing classifiers assume: Ph(S ,D) = P(S ,D)

For a different quantity we assume: Ph(S |D) = P(S |D)

How to use this (less restrictive) assumption for classification (BayesTheorem):

P(D`|S` = s`) =P(S` = s`|D` = j)P(D` = j)

P(S` = s`)

Gary King (Harvard, IQSS) Text Analysis 23 / 1

Page 119: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Implications for an Individual Classifier

All existing classifiers assume: Ph(S ,D) = P(S ,D)

For a different quantity we assume: Ph(S |D) = P(S |D)

How to use this (less restrictive) assumption for classification (BayesTheorem):

P(D`|S` = s`) =P(S` = s`|D` = j)P(D` = j)

P(S` = s`)

Gary King (Harvard, IQSS) Text Analysis 23 / 1

Page 120: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Implications for an Individual Classifier

All existing classifiers assume: Ph(S ,D) = P(S ,D)

For a different quantity we assume: Ph(S |D) = P(S |D)

How to use this (less restrictive) assumption for classification (BayesTheorem):

P(D`|S` = s`) =P(S` = s`|D` = j)P(D` = j)

P(S` = s`)

The goal: individual classification

Gary King (Harvard, IQSS) Text Analysis 23 / 1

Page 121: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Implications for an Individual Classifier

All existing classifiers assume: Ph(S ,D) = P(S ,D)

For a different quantity we assume: Ph(S |D) = P(S |D)

How to use this (less restrictive) assumption for classification (BayesTheorem):

P(D`|S` = s`) =P(S` = s`|D` = j)P(D` = j)

P(S` = s`)

Output from our estimator (described above)

Gary King (Harvard, IQSS) Text Analysis 23 / 1

Page 122: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Implications for an Individual Classifier

All existing classifiers assume: Ph(S ,D) = P(S ,D)

For a different quantity we assume: Ph(S |D) = P(S |D)

How to use this (less restrictive) assumption for classification (BayesTheorem):

P(D`|S` = s`) =P(S` = s`|D` = j)P(D` = j)

P(S` = s`)

Nonparametric estimate from labeled set (an assumption)

Gary King (Harvard, IQSS) Text Analysis 23 / 1

Page 123: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Implications for an Individual Classifier

All existing classifiers assume: Ph(S ,D) = P(S ,D)

For a different quantity we assume: Ph(S |D) = P(S |D)

How to use this (less restrictive) assumption for classification (BayesTheorem):

P(D`|S` = s`) =P(S` = s`|D` = j)P(D` = j)

P(S` = s`)

Nonparametric estimate from unlabeled set (no assumption)

Gary King (Harvard, IQSS) Text Analysis 23 / 1

Page 124: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Classification with Less Restrictive Assumptions

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

0.5

P(D=j)

Ph (D

=j)

●●

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.00

0.05

0.10

0.15

0.20

0.25

0.30

P(Sk=1)

Ph (S

k=1)

Gary King (Harvard, IQSS) Text Analysis 24 / 1

Page 125: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Classification with Less Restrictive Assumptions

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

0.5

P(D=j)

Ph (D

=j)

●●

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.00

0.05

0.10

0.15

0.20

0.25

0.30

P(Sk=1)

Ph (S

k=1)

Gary King (Harvard, IQSS) Text Analysis 24 / 1

Page 126: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Classification with Less Restrictive Assumptions

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

0.5

P(D=j)

Ph (D

=j)

●●

0.00 0.05 0.10 0.15 0.20 0.25 0.300.

000.

050.

100.

150.

200.

250.

30

P(Sk=1)

Ph (S

k=1)

Gary King (Harvard, IQSS) Text Analysis 24 / 1

Page 127: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Classification with Less Restrictive Assumptions

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

0.5

P(D=j)

P̂(D

=j)

SVMNonparametric

Percent correctly classified:

SVM (best existing classifier): 40.5%Our nonparametric approach: 59.8%

Gary King (Harvard, IQSS) Text Analysis 25 / 1

Page 128: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Classification with Less Restrictive Assumptions

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

0.5

P(D=j)

P̂(D

=j)

SVMNonparametric

Percent correctly classified:

SVM (best existing classifier): 40.5%Our nonparametric approach: 59.8%

Gary King (Harvard, IQSS) Text Analysis 25 / 1

Page 129: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Classification with Less Restrictive Assumptions

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

0.5

P(D=j)

P̂(D

=j)

SVMNonparametric

Percent correctly classified:

SVM (best existing classifier): 40.5%Our nonparametric approach: 59.8%

Gary King (Harvard, IQSS) Text Analysis 25 / 1

Page 130: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Classification with Less Restrictive Assumptions

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

0.5

P(D=j)

P̂(D

=j)

SVMNonparametric

Percent correctly classified:

SVM (best existing classifier): 40.5%

Our nonparametric approach: 59.8%

Gary King (Harvard, IQSS) Text Analysis 25 / 1

Page 131: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Classification with Less Restrictive Assumptions

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

0.5

P(D=j)

P̂(D

=j)

SVMNonparametric

Percent correctly classified:

SVM (best existing classifier): 40.5%Our nonparametric approach: 59.8%

Gary King (Harvard, IQSS) Text Analysis 25 / 1

Page 132: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

Misclassification Matrix for Blog Posts

-2 -1 0 1 2 NA NB P(D1)

-2 .70 .10 .01 .01 .00 .02 .16 .28-1 .33 .25 .04 .02 .01 .01 .35 .080 .13 .17 .13 .11 .05 .02 .40 .021 .07 .06 .08 .20 .25 .01 .34 .032 .03 .03 .03 .22 .43 .01 .25 .03

NA .04 .01 .00 .00 .00 .81 .14 .12NB .10 .07 .02 .02 .02 .04 .75 .45

Gary King (Harvard, IQSS) Text Analysis 26 / 1

Page 133: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

SIMEX Analysis of “Not a Blog” Category

0.0

0.2

0.4

0.6

0.8

1.0

Category NB

αα−1 0

0.0

0.2

0.4

0.6

0.8

1.0

αα

Gary King (Harvard, IQSS) Text Analysis 27 / 1

Page 134: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

SIMEX Analysis of “Not a Blog” Category

0.0

0.2

0.4

0.6

0.8

1.0

Category NB

αα−1 0

0.0

0.2

0.4

0.6

0.8

1.0

αα

0.0

0.2

0.4

0.6

0.8

1.0

αα

Gary King (Harvard, IQSS) Text Analysis 28 / 1

Page 135: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

SIMEX Analysis of “Not a Blog” Category

−1 0 1 2 3

0.0

0.2

0.4

0.6

0.8

1.0

Category NB

αα−1 0 1 2 3

0.0

0.2

0.4

0.6

0.8

1.0

Category NB

αα

0.0

0.2

0.4

0.6

0.8

1.0

αα

0.0

0.2

0.4

0.6

0.8

1.0

αα

Gary King (Harvard, IQSS) Text Analysis 29 / 1

Page 136: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

SIMEX Analysis of Other Categories

−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category −2

αα−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category −2

αα

−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category −1

αα−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category −1

αα

−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category 0

αα−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category 0

αα

−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category 1

αα−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category 1

αα

−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category 2

αα−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category 2

αα

−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category NA

αα−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category NA

αα

Gary King (Harvard, IQSS) Text Analysis 30 / 1

Page 137: Advanced Quantitative Research Methodology, Lecture Notes: =1=Text

For more information

http://GKing.Harvard.edu

Gary King (Harvard, IQSS) Text Analysis 31 / 1