TRANSCRIPT
Advanced Quantitative Research Methodology, Lecture Notes: Text Analysis I: How to Read 100 Million Blogs (& Classify Deaths Without Physicians)
Gary King, http://GKing.Harvard.Edu
April 25, 2010
© Copyright 2010 Gary King, All Rights Reserved.
References
Daniel Hopkins and Gary King. “Extracting Systematic Social Science Meaning from Text,” 54, 1 (January 2010): 229–247
Commercialized via:
Gary King and Ying Lu. “Verbal Autopsy Methods with Multiple Causes of Death,” Statistical Science 23, 1 (February 2008): 78–91
In use by (among others):
Copies at http://gking.harvard.edu
http://ow.ly/14hDU (play after ad)
http://ow.ly/14h36 (play 10:00-12:06)
Inputs and Target Quantities of Interest
Input Data:
Large set of text documents (blogs, web pages, emails, etc.)
A set of (mutually exclusive and exhaustive) categories
A small set of documents hand-coded into the categories
Quantities of interest
Individual document classifications (spam filters)
Proportion in each category (the proportion of email that is spam)
Estimation
Can get the 2nd by counting the 1st (turns out not to be necessary!)
High classification accuracy ⇏ unbiased category proportions
⇒ Different methods optimize estimation of the different quantities
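The distinction between the two quantities, and why high accuracy does not guarantee unbiased proportions, is easy to see in a toy example. The Python sketch below is my own illustration with made-up numbers, not material from the lecture: it counts individual classifications to get category proportions, and shows a classifier with 90% accuracy that still understates the spam share by half because its errors are one-sided.

```python
# Toy illustration (made-up data): proportions by counting classifications,
# and how asymmetric errors bias them despite high accuracy.
from collections import Counter

true_labels = ["spam"] * 200 + ["ham"] * 800             # true split: 20% / 80%
# Hypothetical classifier: misses half the spam, never mislabels ham.
predicted = ["spam"] * 100 + ["ham"] * 100 + ["ham"] * 800

accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
proportions = {k: v / len(predicted) for k, v in Counter(predicted).items()}

print(accuracy)     # 0.9 -- high classification accuracy...
print(proportions)  # {'spam': 0.1, 'ham': 0.9} -- ...but spam share is off by half
```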
Blogs as a Running Example
Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order.
“We are living through the largest expansion of expressive capability in the history of the human race”
Measures classical notion of public opinion: active public expressions designed to influence policy and politics (previously: strikes, boycotts, demonstrations, editorials)
(Public opinion ≠ surveys)
One specific quantity of interest
Daily opinion about President Bush and 2008 candidates among all English language blog posts
Specific categories:
  Label  Category
   −2    extremely negative
   −1    negative
    0    neutral
    1    positive
    2    extremely positive
   NA    no opinion expressed
   NB    not a blog
Hard case:
Part ordinal, part nominal categorization
“Sentiment categorization is more difficult than topic classification”
Informal language: “my crunchy gf thinks dubya hid the wmd’s, :)!”
Little common internal structure (no inverted pyramid)
Example of output: John Kerry’s Botched Joke
You know, education — if you make the most of it . . . you can do well. If you don’t, you get stuck in Iraq.
[Figure: Affect Towards John Kerry, 2006–2007; five panels showing the proportion of blog posts in each affect category (−2, −1, 0, 1, 2) by month, September through March.]
Representing Text as Numbers
Filter: choose English language blogs that mention Bush
Preprocess: convert to lower case, remove punctuation, keep only word stems (“consist”, “consisted”, “consistency” → “consist”)
Code variables: presence/absence of unique unigrams, bigrams, trigrams
Our Example:
Our 10,771 blog posts about Bush and Clinton: 201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.
Keep only unigrams in > 1% and < 99% of documents: 3,672 variables
Groups infinite possible posts into “only” 2^3,672 distinct types
More sophisticated summaries: we’ve used them, but they’re not necessary
(More systematic than 1 dummy variable per document)
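As a concrete illustration of this coding step, here is a short Python sketch. It is my own, using scikit-learn's CountVectorizer (assuming a recent scikit-learn) rather than the authors' tools, and it skips the stemming step, which would require a stemmer such as NLTK's PorterStemmer; the toy documents are invented.

```python
# Sketch of the representation step: lower-case, tokenize (dropping punctuation),
# code binary presence/absence of unigrams, and filter by document frequency.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "My crunchy gf thinks dubya hid the wmd's, :)!",
    "Bush gave a speech about Iraq today.",
    "Another blog post about the president.",
]

vectorizer = CountVectorizer(
    lowercase=True,   # convert to lower case
    binary=True,      # presence/absence rather than counts
    min_df=0.01,      # keep terms appearing in at least 1% of documents
    max_df=0.99,      # drop terms appearing in more than 99% of documents
)
S = vectorizer.fit_transform(docs)   # documents x word-profile indicator matrix
print(S.shape)
print(vectorizer.get_feature_names_out())
```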
Notation
Document Category:
Di =
  −2  extremely negative
  −1  negative
   0  neutral
   1  positive
   2  extremely positive
  NA  no opinion expressed
  NB  not a blog
Word Stem Profile:
Si =
  Si1 = 1 if “awful” is used, 0 if not
  Si2 = 1 if “good” is used, 0 if not
  ...
  SiK = 1 if “except” is used, 0 if not
Quantities of Interest
Computer Science: individual document classifications
D1, D2, . . . , DL
Social Science: proportions in each category
P(D) = ( P(D = −2), P(D = −1), P(D = 0), P(D = 1), P(D = 2), P(D = NA), P(D = NB) )
Issues with Existing Statistical Approaches
1. Direct Sampling
Biased without a random sample
Nonrandomness common due to population drift, data subdivisions, etc.
(Classification of population documents not necessary)
2. Aggregation of model-based individual classifications
Biased without a random sample
Models P(D|S), but the world works as P(S|D)
Bias unless
P(D|S) encompasses the “true” model
S spans the space of all predictors of D (i.e., all information in the document)
Bias even with optimal classification and high % correctly classified
Using Misclassification Rates to Correct Proportions
Use some method to classify unlabeled documents
Aggregate classifications to category proportions
Use labeled set to estimate misclassification rates (by cross-validation)
Use misclassification rates to correct proportions
Result: vastly improved estimates of category proportions
(No new assumptions beyond those of the classifier)
(Still requires random samples, individual classification, etc.)
Formalization from Epidemiology (Levy and Kass, 1970)
Accounting identity for 2 categories:
P(D̂ = 1) = (sens)P(D = 1) + (1 − spec)P(D = 2)
Solve:
P(D = 1) = [P(D̂ = 1) − (1 − spec)] / [sens − (1 − spec)]
Use this equation to correct P(D̂ = 1)
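A minimal numeric sketch of this correction, written by me with made-up numbers (in practice sensitivity and specificity would be estimated from the labeled set by cross-validation, as on the previous slide):

```python
def corrected_proportion(p_hat, sens, spec):
    """Invert P(D_hat=1) = sens*P(D=1) + (1-spec)*(1-P(D=1)) for P(D=1)."""
    return (p_hat - (1 - spec)) / (sens - (1 - spec))

# Example: the classifier labels 34% of documents as category 1; cross-validation
# on the labeled set gives sensitivity 0.80 and specificity 0.90.
print(corrected_proportion(p_hat=0.34, sens=0.80, spec=0.90))  # ~0.343
```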
Generalizations: J Categories, No Individual Classification (King and Lu, 2008)
Accounting identity for J categories:
P(D̂ = j) = Σ_{j′=1}^{J} P(D̂ = j | D = j′) P(D = j′)
Drop the D̂ calculation, since D̂ = f(S):
P(S = s) = Σ_{j′=1}^{J} P(S = s | D = j′) P(D = j′)
Simplify to an equivalent matrix expression:
P(S) = P(S|D) P(D)
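To see that this is the same identity as the Levy and Kass two-category case, set J = 2 with sens = P(D̂ = 1 | D = 1) and 1 − spec = P(D̂ = 1 | D = 2) (a restatement of the algebra above, not an additional result):
P(D̂ = 1) = P(D̂ = 1 | D = 1)P(D = 1) + P(D̂ = 1 | D = 2)P(D = 2)
         = (sens)P(D = 1) + (1 − spec)[1 − P(D = 1)]
⇒ P(D = 1) = [P(D̂ = 1) − (1 − spec)] / [sens − (1 − spec)]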
Estimation
The matrix expression again:
P(S) = P(S|D) P(D), with dimensions (2^K × 1) = (2^K × J)(J × 1)
P(S): word stem profile proportions (estimated in the unlabeled set by tabulation)
P(S|D): word stem profiles, by category (estimated in the labeled set by tabulation)
P(D): document category proportions (the quantity of interest)
In alternative symbols (to emphasize the linear equation): Y = Xβ; solve for the quantity of interest, with no error term, as β = (X′X)⁻¹X′y
Technical estimation issues:
2^K is enormous, far larger than any existing computer can handle
P(S) and P(S|D) will be too sparse
Elements of P(D) must be between 0 and 1 and sum to 1
Solutions:
Use subsets of S; average results
Equivalent to kernel density smoothing of sparse categorical data
Use constrained LS to constrain P(D) to the simplex
Result: fast, accurate, with very little (human) tuning required
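A simplified Python sketch of this procedure follows. It is my own construction under stated assumptions (binary word-stem matrices as NumPy arrays, every category present in the labeled set, at least subset_size word stems, and SLSQP as one convenient way to impose the simplex constraint); it is not the authors' released software and omits their refinements.

```python
# Sketch: tabulate P(S) in the unlabeled set and P(S|D) in the labeled set over
# many random subsets of word stems, solve least squares constrained to the
# simplex, and average the resulting estimates of P(D).
import numpy as np
from scipy.optimize import minimize

def estimate_PD(S_labeled, D_labeled, S_unlabeled, categories,
                n_subsets=50, subset_size=10, seed=0):
    rng = np.random.default_rng(seed)
    J, K = len(categories), S_labeled.shape[1]
    estimates = []
    for _ in range(n_subsets):
        cols = rng.choice(K, size=subset_size, replace=False)
        sub_u = [tuple(r) for r in S_unlabeled[:, cols]]
        sub_l = [tuple(r) for r in S_labeled[:, cols]]
        profiles = sorted(set(sub_u) | set(sub_l))
        # P(S = s) in the unlabeled set (the "Y" vector).
        y = np.array([sub_u.count(p) / len(sub_u) for p in profiles])
        # P(S = s | D = j) in the labeled set (the "X" matrix), one column per category.
        X = np.column_stack([
            [np.mean([s == p for s, d in zip(sub_l, D_labeled) if d == j] or [0.0])
             for p in profiles]
            for j in categories])
        # Constrained least squares: minimize ||y - X b||^2 with b >= 0, sum(b) = 1.
        res = minimize(lambda b: np.sum((y - X @ b) ** 2),
                       np.full(J, 1.0 / J),
                       bounds=[(0.0, 1.0)] * J,
                       constraints=({'type': 'eq', 'fun': lambda b: b.sum() - 1.0},),
                       method='SLSQP')
        estimates.append(res.x)
    return np.mean(estimates, axis=0)   # averaged estimate of P(D)
```

In the blog application, S_labeled and S_unlabeled would be the binary word-stem matrices from the representation step, D_labeled the hand-coded affect labels, and categories the seven labels from −2 through NB.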
A Nonrandom Hand-coded Sample
[Figure: two scatterplots comparing the hand-coded sample with the population, “Differences in Document Category Frequencies” (Ph(D) vs. P(D)) and “Differences in Word Profile Frequencies” (Ph(S) vs. P(S)).]
All existing methods would fail with these data.
Accurate Estimates
[Figure: estimated P(D) plotted against actual P(D).]
Out-of-sample Comparison: 60 Seconds vs. 8.7 Days
[Figure: “Affect in Blogs” panels plotting estimated P(D) against actual P(D) for out-of-sample data.]
Out of Sample Validation: Other Examples
[Figure: estimated P(D) vs. actual P(D) for three additional corpora: Congressional Speeches, Immigration Editorials, and Enron Emails.]
Verbal Autopsy Methods
The Problem
Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies
High quality death registration: only 23/192 countries
Existing Approaches
Verbal Autopsy: ask relatives or caregivers 50–100 symptom questions
Ask physicians to determine cause of death (low intercoder reliability)
Apply expert algorithms (high reliability, low validity)
Find deaths with medically certified causes from a local hospital, trace caregivers to their homes, ask the same symptom questions, and statistically classify deaths in the population (model-dependent, low accuracy)
An Alternative Approach

Document category (cause of death):

D_i = 1 if bladder cancer
      2 if cardiovascular disease
      3 if transportation accident
      ...
      J if infectious respiratory

Word stem profile (symptoms):

S_i = (S_i1, ..., S_iK), where
      S_i1 = 1 if “breathing difficulties”, 0 if not
      S_i2 = 1 if “stomach ache”, 0 if not
      ...
      S_iK = 1 if “diarrhea”, 0 if not

Apply the same methods.

Gary King (Harvard, IQSS) Text Analysis 20 / 1
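The coding step on the slide above is mechanical, so a small sketch may help make it concrete. This is an illustration only, not the authors' code: the symptom names, the toy interview records, and the helper to_profile are hypothetical. All that matters is that each death (or document) becomes a 0/1 word-stem-profile vector S_i, and that labeled cases additionally carry a category D_i.

```python
import numpy as np

# Hypothetical questionnaire items: one binary indicator per symptom,
# mirroring S_i1, ..., S_iK on the slide.
SYMPTOMS = ["breathing difficulties", "stomach ache", "diarrhea"]

# Toy labeled records: the symptoms each respondent reported, plus the
# medically certified cause of death D_i for hospital (labeled) deaths.
labeled_interviews = [
    ({"breathing difficulties"}, "infectious respiratory"),
    ({"stomach ache", "diarrhea"}, "cardiovascular disease"),
]

def to_profile(reported):
    """One interview -> the word-stem-profile vector S_i of 0/1 indicators."""
    return np.array([int(symptom in reported) for symptom in SYMPTOMS])

S_labeled = np.vstack([to_profile(r) for r, _ in labeled_interviews])  # n x K matrix
D_labeled = np.array([cause for _, cause in labeled_interviews])       # length-n causes
```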
Validation in Tanzania

[Figure: estimated versus true cause-specific mortality fractions (0.00-0.30) and estimation error (-0.2 to 0.2), for a Random Split Sample and a Community Sample.]

Gary King (Harvard, IQSS) Text Analysis 21 / 1
Validation in China

[Figure: estimated versus true cause-specific mortality fractions (0.00-0.30) and estimation error (-0.10 to 0.10), for a Random Split Sample, City Sample I, and City Sample II.]

Gary King (Harvard, IQSS) Text Analysis 22 / 1
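The "Estimate" axis in the two validation figures above is the cause-specific mortality fraction P(D = j) produced by the aggregate estimator referenced earlier in the lecture. As a rough sketch of the underlying idea only (not the authors' implementation), one can stack the identity P(S = s) = Σ_j P(S = s | D = j) P(D = j) over symptom profiles s, estimate P(S | D) from the labeled hospital deaths and P(S) from the target population, and solve for P(D) under non-negativity. The function names and the non-negative least-squares solver are my choices; the published estimators add refinements (e.g., averaging over subsets of symptoms and bootstrapping) that are omitted here.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_csmf(S_labeled, D_labeled, S_target, causes):
    """Sketch of an aggregate P(D) estimator: solve P(S) ~ P(S|D) P(D) by
    non-negative least squares over all 2^K symptom profiles, then renormalize.
    Assumes K is small enough to enumerate profiles and that every cause in
    `causes` appears in the labeled data."""
    K = S_labeled.shape[1]
    powers_of_two = 2 ** np.arange(K)

    def profile_dist(S):
        # Empirical probability of each of the 2^K full symptom profiles.
        idx = (S @ powers_of_two).astype(int)
        return np.bincount(idx, minlength=2 ** K) / len(S)

    PS_target = profile_dist(S_target)                      # P(S) in the population
    PS_given_D = np.column_stack(                           # P(S | D = j), labeled set
        [profile_dist(S_labeled[D_labeled == j]) for j in causes]
    )
    beta, _ = nnls(PS_given_D, PS_target)                   # non-negative solution
    return beta / beta.sum()                                 # cause-specific fractions
```

In a split-sample validation like the ones plotted, the labeled deaths are divided in two, the estimator is run treating one half as the "unlabeled" population, and the result is compared with that half's true CSMF.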
Implications for an Individual Classifier

All existing classifiers assume: P^h(S, D) = P(S, D)
For a different quantity we assume: P^h(S | D) = P(S | D)
(P^h denotes the distribution in the hand-coded, i.e. labeled, set.)

How to use this (less restrictive) assumption for classification, via Bayes' theorem:

P(D_ℓ = j | S_ℓ = s_ℓ) = P(S_ℓ = s_ℓ | D_ℓ = j) P(D_ℓ = j) / P(S_ℓ = s_ℓ)

Reading the terms:
- P(D_ℓ = j | S_ℓ = s_ℓ): the goal, individual classification
- P(S_ℓ = s_ℓ | D_ℓ = j): nonparametric estimate from the labeled set (an assumption)
- P(D_ℓ = j): output from our estimator (described above)
- P(S_ℓ = s_ℓ): nonparametric estimate from the unlabeled set (no assumption)

Gary King (Harvard, IQSS) Text Analysis 23 / 1
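Putting the labeled ingredients together, the individual-level step is a direct application of the Bayes' theorem expression above. Below is a minimal sketch under simplifying assumptions: the classify name, the smoothing constant, and the exact-profile matching are my choices, not the authors' implementation. P_D would be the aggregate estimate of P(D = j), e.g. from the estimate_csmf sketch earlier.

```python
import numpy as np

def classify(s, S_labeled, D_labeled, P_D, causes, smooth=1e-6):
    """Assign one case with symptom profile s to its most probable category via
    P(D = j | S = s) proportional to P(S = s | D = j) * P(D = j)."""
    posteriors = []
    for j, prior in zip(causes, P_D):
        S_j = S_labeled[D_labeled == j]
        # Nonparametric P(S = s | D = j): share of labeled-j cases whose profile is exactly s.
        lik = (np.all(S_j == s, axis=1).mean() if len(S_j) else 0.0) + smooth
        posteriors.append(lik * prior)
    posteriors = np.array(posteriors)
    # Dividing by the sum stands in for dividing by P(S = s); the slide's
    # unlabeled-set estimate of P(S = s) would rescale every category's
    # posterior the same way, so the argmax is unaffected.
    return causes[int(np.argmax(posteriors))], posteriors / posteriors.sum()
```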
Classification with Less Restrictive Assumptions

[Figure: two scatterplots comparing the hand-coded (labeled) set with the population: P^h(D = j) against P(D = j) on a 0.0-0.5 scale, and P^h(S_k = 1) against P(S_k = 1) on a 0.00-0.30 scale.]

Gary King (Harvard, IQSS) Text Analysis 24 / 1
Classification with Less Restrictive Assumptions

[Figure: estimated P̂(D = j) against true P(D = j) on a 0.0-0.5 scale, for SVM and for the nonparametric approach.]

Percent correctly classified:
SVM (best existing classifier): 40.5%
Our nonparametric approach: 59.8%

Gary King (Harvard, IQSS) Text Analysis 25 / 1
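For orientation, the accuracy figures above are simply the share of cases whose predicted category matches the hand code on a labeled evaluation set. A trivial sketch, reusing the hypothetical classify function from the Bayes sketch earlier (not how the reported numbers were produced):

```python
def percent_correct(S_test, D_test, S_labeled, D_labeled, P_D, causes):
    """Share (in percent) of held-out labeled cases whose argmax category
    from `classify` matches the hand-coded truth."""
    hits = sum(
        classify(s, S_labeled, D_labeled, P_D, causes)[0] == d
        for s, d in zip(S_test, D_test)
    )
    return 100.0 * hits / len(D_test)
```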
Misclassification Matrix for Blog Posts

        -2    -1     0     1     2    NA    NB   P(D1)
  -2   .70   .10   .01   .01   .00   .02   .16    .28
  -1   .33   .25   .04   .02   .01   .01   .35    .08
   0   .13   .17   .13   .11   .05   .02   .40    .02
   1   .07   .06   .08   .20   .25   .01   .34    .03
   2   .03   .03   .03   .22   .43   .01   .25    .03
  NA   .04   .01   .00   .00   .00   .81   .14    .12
  NB   .10   .07   .02   .02   .02   .04   .75    .45

Gary King (Harvard, IQSS) Text Analysis 26 / 1
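In the table, the first seven entries of each row sum to one, so each row is a conditional distribution given the row category, and the P(D1) column itself sums to one across rows, so it reads as a marginal over the row categories. The slide does not say which margin is conditioned on; the sketch below follows the common convention of conditioning on the hand-coded (row) category and is an illustration only, with the misclassification_matrix name and the category list as my choices.

```python
import numpy as np

CATS = ["-2", "-1", "0", "1", "2", "NA", "NB"]

def misclassification_matrix(hand_coded, classified, cats=CATS):
    """Row-normalized confusion matrix: entry (j, k) estimates
    P(classified as cats[k] | hand-coded as cats[j])."""
    M = np.zeros((len(cats), len(cats)))
    for truth, estimate in zip(hand_coded, classified):
        M[cats.index(truth), cats.index(estimate)] += 1
    row_sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)
```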
SIMEX Analysis of “Not a Blog” Category

[Figure: SIMEX plots for category NB, proportion (0.0-1.0) against the added-misclassification level α from -1 to 3.]

Gary King (Harvard, IQSS) Text Analysis 27 / 1
SIMEX Analysis of Other Categories

[Figure: SIMEX plots for categories -2, -1, 0, 1, 2, and NA, proportion (0.0-0.5) against the added-misclassification level α from -1 to 3.]

Gary King (Harvard, IQSS) Text Analysis 30 / 1
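The SIMEX panels trace how an estimated category proportion changes as extra misclassification is simulated at increasing intensities α and then extrapolated back toward α = -1 (the left edge of each panel), which in SIMEX stands for the error-free case. Below is a deliberately crude sketch of that simulate-then-extrapolate loop, not the authors' implementation: the added error reuses a 7x7 row-normalized misclassification matrix M (e.g., the table above without its P(D1) column), α is restricted to whole numbers of extra passes through M, and the extrapolant is assumed quadratic. Here estimate is any function mapping a list of category labels to the quantity being tracked (e.g., the proportion in category NB).

```python
import numpy as np

def simex_extrapolate(labels, M, cats, estimate, alphas=(1, 2, 3), n_sims=50, seed=0):
    """Simulation step: at each intensity alpha, push every label through the
    misclassification matrix M an extra alpha times and re-run `estimate`.
    Extrapolation step: fit a quadratic in alpha through the averaged estimates
    (alpha = 0 is the observed data) and evaluate it at alpha = -1."""
    rng = np.random.default_rng(seed)
    xs, ys = [0.0], [estimate(labels)]
    for a in alphas:
        sims = []
        for _ in range(n_sims):
            corrupted = list(labels)
            for _ in range(a):
                corrupted = [rng.choice(cats, p=M[cats.index(c)]) for c in corrupted]
            sims.append(estimate(corrupted))
        xs.append(float(a))
        ys.append(float(np.mean(sims)))
    coefficients = np.polyfit(xs, ys, deg=2)
    return float(np.polyval(coefficients, -1.0))
```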
For more information
http://GKing.Harvard.edu
Gary King (Harvard, IQSS) Text Analysis 31 / 1