TRANSCRIPT
Advanced Quantitative Research Methodology, Lecture Notes: Text Analysis I: How to Read 100 Million Blogs (& Classify Deaths Without Physicians)
Gary King, http://GKing.Harvard.Edu
April 25, 2010
© Copyright 2010 Gary King, All Rights Reserved.
References
Daniel Hopkins and Gary King. “Extracting Systematic Social Science Meaning from Text,” 54, 1 (January 2010): 229–247
Commercialized via:
Gary King and Ying Lu. “Verbal Autopsy Methods with Multiple Causes of Death,” Statistical Science 23, 1 (February 2008): 78–91
In use by (among others):
Copies at http://gking.harvard.edu
http://ow.ly/14hDU (play after ad)
http://ow.ly/14h36 (play 10:00-12:06)
Inputs and Target Quantities of Interest
Input Data:
Large set of text documents (blogs, web pages, emails, etc.)
A set of (mutually exclusive and exhaustive) categories
A small set of documents hand-coded into the categories
Quantities of interest
Individual document classifications (spam filters)
Proportion in each category (the proportion of email that is spam)
Estimation
Can get the 2nd by counting the 1st (turns out not to be necessary!)
High classification accuracy ⇏ unbiased category proportions
⇒ Different methods optimize estimation of the different quantities
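The distinction between the two quantities, and why high accuracy does not guarantee unbiased proportions, is easy to see in a toy example. The Python sketch below is my own illustration with made-up numbers, not material from the lecture: it counts individual classifications to get category proportions, and shows a classifier with 90% accuracy that still understates the spam share by half because its errors are one-sided.

```python
# Toy illustration (made-up data): proportions by counting classifications,
# and how asymmetric errors bias them despite high accuracy.
from collections import Counter

true_labels = ["spam"] * 200 + ["ham"] * 800             # true split: 20% / 80%
# Hypothetical classifier: misses half the spam, never mislabels ham.
predicted = ["spam"] * 100 + ["ham"] * 100 + ["ham"] * 800

accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
proportions = {k: v / len(predicted) for k, v in Counter(predicted).items()}

print(accuracy)     # 0.9 -- high classification accuracy...
print(proportions)  # {'spam': 0.1, 'ham': 0.9} -- ...but spam share is off by half
```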
Blogs as a Running Example
Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order.
“We are living through the largest expansion of expressive capability in the history of the human race”
Measures classical notion of public opinion: active public expressions designed to influence policy and politics (previously: strikes, boycotts, demonstrations, editorials)
(Public opinion ≠ surveys)
One specific quantity of interest
Daily opinion about President Bush and 2008 candidates among all English language blog posts
Specific categories:
  Label  Category
   −2    extremely negative
   −1    negative
    0    neutral
    1    positive
    2    extremely positive
   NA    no opinion expressed
   NB    not a blog
Hard case:
Part ordinal, part nominal categorization
“Sentiment categorization is more difficult than topic classification”
Informal language: “my crunchy gf thinks dubya hid the wmd’s, :)!”
Little common internal structure (no inverted pyramid)
Example of output: John Kerry’s Botched Joke
You know, education — if you make the most of it . . . you can do well. If you don’t, you get stuck in Iraq.
[Figure: Affect Towards John Kerry, 2006–2007; five panels showing the proportion of blog posts in each affect category (−2, −1, 0, 1, 2) by month, September through March.]
Representing Text as Numbers
Filter: choose English language blogs that mention Bush
Preprocess: convert to lower case, remove punctuation, keep only word stems (“consist”, “consisted”, “consistency” → “consist”)
Code variables: presence/absence of unique unigrams, bigrams, trigrams
Our Example:
Our 10,771 blog posts about Bush and Clinton: 201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.
Keep only unigrams in > 1% and < 99% of documents: 3,672 variables
Groups infinite possible posts into “only” 2^3,672 distinct types
More sophisticated summaries: we’ve used them, but they’re not necessary
(More systematic than 1 dummy variable per document)
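As a concrete illustration of this coding step, here is a short Python sketch. It is my own, using scikit-learn's CountVectorizer (assuming a recent scikit-learn) rather than the authors' tools, and it skips the stemming step, which would require a stemmer such as NLTK's PorterStemmer; the toy documents are invented.

```python
# Sketch of the representation step: lower-case, tokenize (dropping punctuation),
# code binary presence/absence of unigrams, and filter by document frequency.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "My crunchy gf thinks dubya hid the wmd's, :)!",
    "Bush gave a speech about Iraq today.",
    "Another blog post about the president.",
]

vectorizer = CountVectorizer(
    lowercase=True,   # convert to lower case
    binary=True,      # presence/absence rather than counts
    min_df=0.01,      # keep terms appearing in at least 1% of documents
    max_df=0.99,      # drop terms appearing in more than 99% of documents
)
S = vectorizer.fit_transform(docs)   # documents x word-profile indicator matrix
print(S.shape)
print(vectorizer.get_feature_names_out())
```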
Notation
Document Category:
Di =
  −2  extremely negative
  −1  negative
   0  neutral
   1  positive
   2  extremely positive
  NA  no opinion expressed
  NB  not a blog
Word Stem Profile:
Si =
  Si1 = 1 if “awful” is used, 0 if not
  Si2 = 1 if “good” is used, 0 if not
  ...
  SiK = 1 if “except” is used, 0 if not
Quantities of Interest
Computer Science: individual document classifications
D1, D2, . . . , DL
Social Science: proportions in each category
P(D) = ( P(D = −2), P(D = −1), P(D = 0), P(D = 1), P(D = 2), P(D = NA), P(D = NB) )
Issues with Existing Statistical Approaches
1. Direct Sampling
Biased without a random sample
Nonrandomness common due to population drift, data subdivisions, etc.
(Classification of population documents not necessary)
2. Aggregation of model-based individual classifications
Biased without a random sample
Models P(D|S), but the world works as P(S|D)
Bias unless
P(D|S) encompasses the “true” model
S spans the space of all predictors of D (i.e., all information in the document)
Bias even with optimal classification and high % correctly classified
Using Misclassification Rates to Correct Proportions
Use some method to classify unlabeled documents
Aggregate classifications to category proportions
Use labeled set to estimate misclassification rates (by cross-validation)
Use misclassification rates to correct proportions
Result: vastly improved estimates of category proportions
(No new assumptions beyond those of the classifier)
(Still requires random samples, individual classification, etc.)
Formalization from Epidemiology (Levy and Kass, 1970)
Accounting identity for 2 categories:
P(D̂ = 1) = (sens)P(D = 1) + (1 − spec)P(D = 2)
Solve:
P(D = 1) = [P(D̂ = 1) − (1 − spec)] / [sens − (1 − spec)]
Use this equation to correct P(D̂ = 1)
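A minimal numeric sketch of this correction, written by me with made-up numbers (in practice sensitivity and specificity would be estimated from the labeled set by cross-validation, as on the previous slide):

```python
def corrected_proportion(p_hat, sens, spec):
    """Invert P(D_hat=1) = sens*P(D=1) + (1-spec)*(1-P(D=1)) for P(D=1)."""
    return (p_hat - (1 - spec)) / (sens - (1 - spec))

# Example: the classifier labels 34% of documents as category 1; cross-validation
# on the labeled set gives sensitivity 0.80 and specificity 0.90.
print(corrected_proportion(p_hat=0.34, sens=0.80, spec=0.90))  # ~0.343
```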
Generalizations: J Categories, No Individual Classification (King and Lu, 2008)
Accounting identity for J categories:
P(D̂ = j) = Σ_{j′=1}^{J} P(D̂ = j | D = j′) P(D = j′)
Drop the D̂ calculation, since D̂ = f(S):
P(S = s) = Σ_{j′=1}^{J} P(S = s | D = j′) P(D = j′)
Simplify to an equivalent matrix expression:
P(S) = P(S|D) P(D)
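To see that this is the same identity as the Levy and Kass two-category case, set J = 2 with sens = P(D̂ = 1 | D = 1) and 1 − spec = P(D̂ = 1 | D = 2) (a restatement of the algebra above, not an additional result):
P(D̂ = 1) = P(D̂ = 1 | D = 1)P(D = 1) + P(D̂ = 1 | D = 2)P(D = 2)
         = (sens)P(D = 1) + (1 − spec)[1 − P(D = 1)]
⇒ P(D = 1) = [P(D̂ = 1) − (1 − spec)] / [sens − (1 − spec)]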
Estimation
The matrix expression again:
P(S) = P(S|D) P(D), with dimensions (2^K × 1) = (2^K × J)(J × 1)
P(S): word stem profile proportions (estimated in the unlabeled set by tabulation)
P(S|D): word stem profiles, by category (estimated in the labeled set by tabulation)
P(D): document category proportions (the quantity of interest)
In alternative symbols (to emphasize the linear equation): Y = Xβ; solve for the quantity of interest, with no error term, as β = (X′X)⁻¹X′y
Technical estimation issues:
2^K is enormous, far larger than any existing computer can handle
P(S) and P(S|D) will be too sparse
Elements of P(D) must be between 0 and 1 and sum to 1
Solutions:
Use subsets of S; average results
Equivalent to kernel density smoothing of sparse categorical data
Use constrained LS to constrain P(D) to the simplex
Result: fast, accurate, with very little (human) tuning required
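A simplified Python sketch of this procedure follows. It is my own construction under stated assumptions (binary word-stem matrices as NumPy arrays, every category present in the labeled set, at least subset_size word stems, and SLSQP as one convenient way to impose the simplex constraint); it is not the authors' released software and omits their refinements.

```python
# Sketch: tabulate P(S) in the unlabeled set and P(S|D) in the labeled set over
# many random subsets of word stems, solve least squares constrained to the
# simplex, and average the resulting estimates of P(D).
import numpy as np
from scipy.optimize import minimize

def estimate_PD(S_labeled, D_labeled, S_unlabeled, categories,
                n_subsets=50, subset_size=10, seed=0):
    rng = np.random.default_rng(seed)
    J, K = len(categories), S_labeled.shape[1]
    estimates = []
    for _ in range(n_subsets):
        cols = rng.choice(K, size=subset_size, replace=False)
        sub_u = [tuple(r) for r in S_unlabeled[:, cols]]
        sub_l = [tuple(r) for r in S_labeled[:, cols]]
        profiles = sorted(set(sub_u) | set(sub_l))
        # P(S = s) in the unlabeled set (the "Y" vector).
        y = np.array([sub_u.count(p) / len(sub_u) for p in profiles])
        # P(S = s | D = j) in the labeled set (the "X" matrix), one column per category.
        X = np.column_stack([
            [np.mean([s == p for s, d in zip(sub_l, D_labeled) if d == j] or [0.0])
             for p in profiles]
            for j in categories])
        # Constrained least squares: minimize ||y - X b||^2 with b >= 0, sum(b) = 1.
        res = minimize(lambda b: np.sum((y - X @ b) ** 2),
                       np.full(J, 1.0 / J),
                       bounds=[(0.0, 1.0)] * J,
                       constraints=({'type': 'eq', 'fun': lambda b: b.sum() - 1.0},),
                       method='SLSQP')
        estimates.append(res.x)
    return np.mean(estimates, axis=0)   # averaged estimate of P(D)
```

In the blog application, S_labeled and S_unlabeled would be the binary word-stem matrices from the representation step, D_labeled the hand-coded affect labels, and categories the seven labels from −2 through NB.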
A Nonrandom Hand-coded Sample
[Figure: two scatterplots comparing the hand-coded sample with the population, “Differences in Document Category Frequencies” (Ph(D) vs. P(D)) and “Differences in Word Profile Frequencies” (Ph(S) vs. P(S)).]
All existing methods would fail with these data.
Accurate Estimates
[Figure: estimated P(D) plotted against actual P(D).]
Out-of-sample Comparison: 60 Seconds vs. 8.7 Days
[Figure: “Affect in Blogs” panels plotting estimated P(D) against actual P(D) for out-of-sample data.]
Out of Sample Validation: Other Examples
[Figure: estimated P(D) vs. actual P(D) for three additional corpora: Congressional Speeches, Immigration Editorials, and Enron Emails.]
Verbal Autopsy Methods
The Problem
Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies
High quality death registration: only 23/192 countries
Existing Approaches
Verbal Autopsy: ask relatives or caregivers 50–100 symptom questions
Ask physicians to determine cause of death (low intercoder reliability)
Apply expert algorithms (high reliability, low validity)
Find deaths with medically certified causes from a local hospital, trace caregivers to their homes, ask the same symptom questions, and statistically classify deaths in the population (model-dependent, low accuracy)
An Alternative Approach

Document category (cause of death):

D_i = 1 if bladder cancer
      2 if cardiovascular disease
      3 if transportation accident
      ...
      J if infectious respiratory

Word stem profile (symptoms):

S_i = (S_i1, ..., S_iK), where
      S_i1 = 1 if “breathing difficulties”, 0 if not
      S_i2 = 1 if “stomach ache”, 0 if not
      ...
      S_iK = 1 if “diarrhea”, 0 if not

Apply the same methods.

Gary King (Harvard, IQSS) Text Analysis 20 / 1
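The coding step on the slide above is mechanical, so a small sketch may help make it concrete. This is an illustration only, not the authors' code: the symptom names, the toy interview records, and the helper to_profile are hypothetical. All that matters is that each death (or document) becomes a 0/1 word-stem-profile vector S_i, and that labeled cases additionally carry a category D_i.

```python
import numpy as np

# Hypothetical questionnaire items: one binary indicator per symptom,
# mirroring S_i1, ..., S_iK on the slide.
SYMPTOMS = ["breathing difficulties", "stomach ache", "diarrhea"]

# Toy labeled records: the symptoms each respondent reported, plus the
# medically certified cause of death D_i for hospital (labeled) deaths.
labeled_interviews = [
    ({"breathing difficulties"}, "infectious respiratory"),
    ({"stomach ache", "diarrhea"}, "cardiovascular disease"),
]

def to_profile(reported):
    """One interview -> the word-stem-profile vector S_i of 0/1 indicators."""
    return np.array([int(symptom in reported) for symptom in SYMPTOMS])

S_labeled = np.vstack([to_profile(r) for r, _ in labeled_interviews])  # n x K matrix
D_labeled = np.array([cause for _, cause in labeled_interviews])       # length-n causes
```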
Validation in Tanzania

[Figure: estimated versus true cause-specific mortality fractions (0.00-0.30) and estimation error (-0.2 to 0.2), for a Random Split Sample and a Community Sample.]

Gary King (Harvard, IQSS) Text Analysis 21 / 1
Validation in China

[Figure: estimated versus true cause-specific mortality fractions (0.00-0.30) and estimation error (-0.10 to 0.10), for a Random Split Sample, City Sample I, and City Sample II.]

Gary King (Harvard, IQSS) Text Analysis 22 / 1
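The "Estimate" axis in the two validation figures above is the cause-specific mortality fraction P(D = j) produced by the aggregate estimator referenced earlier in the lecture. As a rough sketch of the underlying idea only (not the authors' implementation), one can stack the identity P(S = s) = Σ_j P(S = s | D = j) P(D = j) over symptom profiles s, estimate P(S | D) from the labeled hospital deaths and P(S) from the target population, and solve for P(D) under non-negativity. The function names and the non-negative least-squares solver are my choices; the published estimators add refinements (e.g., averaging over subsets of symptoms and bootstrapping) that are omitted here.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_csmf(S_labeled, D_labeled, S_target, causes):
    """Sketch of an aggregate P(D) estimator: solve P(S) ~ P(S|D) P(D) by
    non-negative least squares over all 2^K symptom profiles, then renormalize.
    Assumes K is small enough to enumerate profiles and that every cause in
    `causes` appears in the labeled data."""
    K = S_labeled.shape[1]
    powers_of_two = 2 ** np.arange(K)

    def profile_dist(S):
        # Empirical probability of each of the 2^K full symptom profiles.
        idx = (S @ powers_of_two).astype(int)
        return np.bincount(idx, minlength=2 ** K) / len(S)

    PS_target = profile_dist(S_target)                      # P(S) in the population
    PS_given_D = np.column_stack(                           # P(S | D = j), labeled set
        [profile_dist(S_labeled[D_labeled == j]) for j in causes]
    )
    beta, _ = nnls(PS_given_D, PS_target)                   # non-negative solution
    return beta / beta.sum()                                 # cause-specific fractions
```

In a split-sample validation like the ones plotted, the labeled deaths are divided in two, the estimator is run treating one half as the "unlabeled" population, and the result is compared with that half's true CSMF.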
Implications for an Individual Classifier

All existing classifiers assume: P^h(S, D) = P(S, D)
For a different quantity we assume: P^h(S | D) = P(S | D)
(P^h denotes the distribution in the hand-coded, i.e. labeled, set.)

How to use this (less restrictive) assumption for classification, via Bayes' theorem:

P(D_ℓ = j | S_ℓ = s_ℓ) = P(S_ℓ = s_ℓ | D_ℓ = j) P(D_ℓ = j) / P(S_ℓ = s_ℓ)

Reading the terms:
- P(D_ℓ = j | S_ℓ = s_ℓ): the goal, individual classification
- P(S_ℓ = s_ℓ | D_ℓ = j): nonparametric estimate from the labeled set (an assumption)
- P(D_ℓ = j): output from our estimator (described above)
- P(S_ℓ = s_ℓ): nonparametric estimate from the unlabeled set (no assumption)

Gary King (Harvard, IQSS) Text Analysis 23 / 1
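Putting the labeled ingredients together, the individual-level step is a direct application of the Bayes' theorem expression above. Below is a minimal sketch under simplifying assumptions: the classify name, the smoothing constant, and the exact-profile matching are my choices, not the authors' implementation. P_D would be the aggregate estimate of P(D = j), e.g. from the estimate_csmf sketch earlier.

```python
import numpy as np

def classify(s, S_labeled, D_labeled, P_D, causes, smooth=1e-6):
    """Assign one case with symptom profile s to its most probable category via
    P(D = j | S = s) proportional to P(S = s | D = j) * P(D = j)."""
    posteriors = []
    for j, prior in zip(causes, P_D):
        S_j = S_labeled[D_labeled == j]
        # Nonparametric P(S = s | D = j): share of labeled-j cases whose profile is exactly s.
        lik = (np.all(S_j == s, axis=1).mean() if len(S_j) else 0.0) + smooth
        posteriors.append(lik * prior)
    posteriors = np.array(posteriors)
    # Dividing by the sum stands in for dividing by P(S = s); the slide's
    # unlabeled-set estimate of P(S = s) would rescale every category's
    # posterior the same way, so the argmax is unaffected.
    return causes[int(np.argmax(posteriors))], posteriors / posteriors.sum()
```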
Classification with Less Restrictive Assumptions

[Figure: two scatterplots comparing the hand-coded (labeled) set with the population: P^h(D = j) against P(D = j) on a 0.0-0.5 scale, and P^h(S_k = 1) against P(S_k = 1) on a 0.00-0.30 scale.]

Gary King (Harvard, IQSS) Text Analysis 24 / 1
Classification with Less Restrictive Assumptions

[Figure: estimated P̂(D = j) against true P(D = j) on a 0.0-0.5 scale, for SVM and for the nonparametric approach.]

Percent correctly classified:
SVM (best existing classifier): 40.5%
Our nonparametric approach: 59.8%

Gary King (Harvard, IQSS) Text Analysis 25 / 1
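For orientation, the accuracy figures above are simply the share of cases whose predicted category matches the hand code on a labeled evaluation set. A trivial sketch, reusing the hypothetical classify function from the Bayes sketch earlier (not how the reported numbers were produced):

```python
def percent_correct(S_test, D_test, S_labeled, D_labeled, P_D, causes):
    """Share (in percent) of held-out labeled cases whose argmax category
    from `classify` matches the hand-coded truth."""
    hits = sum(
        classify(s, S_labeled, D_labeled, P_D, causes)[0] == d
        for s, d in zip(S_test, D_test)
    )
    return 100.0 * hits / len(D_test)
```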
Misclassification Matrix for Blog Posts

        -2    -1     0     1     2    NA    NB   P(D1)
  -2   .70   .10   .01   .01   .00   .02   .16    .28
  -1   .33   .25   .04   .02   .01   .01   .35    .08
   0   .13   .17   .13   .11   .05   .02   .40    .02
   1   .07   .06   .08   .20   .25   .01   .34    .03
   2   .03   .03   .03   .22   .43   .01   .25    .03
  NA   .04   .01   .00   .00   .00   .81   .14    .12
  NB   .10   .07   .02   .02   .02   .04   .75    .45

Gary King (Harvard, IQSS) Text Analysis 26 / 1
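In the table, the first seven entries of each row sum to one, so each row is a conditional distribution given the row category, and the P(D1) column itself sums to one across rows, so it reads as a marginal over the row categories. The slide does not say which margin is conditioned on; the sketch below follows the common convention of conditioning on the hand-coded (row) category and is an illustration only, with the misclassification_matrix name and the category list as my choices.

```python
import numpy as np

CATS = ["-2", "-1", "0", "1", "2", "NA", "NB"]

def misclassification_matrix(hand_coded, classified, cats=CATS):
    """Row-normalized confusion matrix: entry (j, k) estimates
    P(classified as cats[k] | hand-coded as cats[j])."""
    M = np.zeros((len(cats), len(cats)))
    for truth, estimate in zip(hand_coded, classified):
        M[cats.index(truth), cats.index(estimate)] += 1
    row_sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)
```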
SIMEX Analysis of “Not a Blog” Category

[Figure: SIMEX plots for category NB, proportion (0.0-1.0) against the added-misclassification level α from -1 to 3.]

Gary King (Harvard, IQSS) Text Analysis 27 / 1
SIMEX Analysis of Other Categories

[Figure: SIMEX plots for categories -2, -1, 0, 1, 2, and NA, proportion (0.0-0.5) against the added-misclassification level α from -1 to 3.]

Gary King (Harvard, IQSS) Text Analysis 30 / 1
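The SIMEX panels trace how an estimated category proportion changes as extra misclassification is simulated at increasing intensities α and then extrapolated back toward α = -1 (the left edge of each panel), which in SIMEX stands for the error-free case. Below is a deliberately crude sketch of that simulate-then-extrapolate loop, not the authors' implementation: the added error reuses a 7x7 row-normalized misclassification matrix M (e.g., the table above without its P(D1) column), α is restricted to whole numbers of extra passes through M, and the extrapolant is assumed quadratic. Here estimate is any function mapping a list of category labels to the quantity being tracked (e.g., the proportion in category NB).

```python
import numpy as np

def simex_extrapolate(labels, M, cats, estimate, alphas=(1, 2, 3), n_sims=50, seed=0):
    """Simulation step: at each intensity alpha, push every label through the
    misclassification matrix M an extra alpha times and re-run `estimate`.
    Extrapolation step: fit a quadratic in alpha through the averaged estimates
    (alpha = 0 is the observed data) and evaluate it at alpha = -1."""
    rng = np.random.default_rng(seed)
    xs, ys = [0.0], [estimate(labels)]
    for a in alphas:
        sims = []
        for _ in range(n_sims):
            corrupted = list(labels)
            for _ in range(a):
                corrupted = [rng.choice(cats, p=M[cats.index(c)]) for c in corrupted]
            sims.append(estimate(corrupted))
        xs.append(float(a))
        ys.append(float(np.mean(sims)))
    coefficients = np.polyfit(xs, ys, deg=2)
    return float(np.polyval(coefficients, -1.0))
```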
For more information
http://GKing.Harvard.edu
Gary King (Harvard, IQSS) Text Analysis 31 / 1