
A Crash Course in Machine Learning Methods for Text Analysis

Mirko Draca

Warwick

6th October 2020


Introduction

- Massive increase in interest in machine learning tools.

- Economics goes through phases of interest in methods where adoption rates go wild...


You May Have Heard...


The End of Theory?


So How Will the Influence of ML Methods Unfold?

We have two paradigms in modern empirical micro...

- Research Design / Causal Inference: Getting credible estimates of a parameter of interest.

- Structural Econometrics: Mapping from simple models of economic relationships to statistical 'flesh'.

My current argument is that we are NOT in for a paradigm shift with the onset of ML methods. Instead...


Likely Innovations for Empirical Econ from ML Methods.

We are probably looking at three innovations:

- Qualitative Measurement: We can measure stuff that we couldn't measure before in a systematic way. Here, I'm talking about text.

- Measurement Depth: 'increasingly comprehensive aspects of human behaviour and the economy are recorded as data'. Richer, 'big' data: lots to play with.

- Pattern Recognition: We might learn something about processes of change and innovation from the principles of pattern recognition. Potentially deep implications.


Near Future?

We will probably see a hybrid of reduced form and ML.

- For example, use ML to construct new measures of objects based on text. Plug these into causal research designs. Also emerging: use ML to construct counterfactuals.

- So that's what I'll focus on. What do we need to know about ML and text to get started?

- The best mini-library of ML methods is represented by these two (free!) textbooks...


Hastie and Tibshirani


Usual ML Topics

Model Selection (eg: Lasso, Ridge regression)

Supervised Learning (including classifiers)

Unsupervised Learning (PCA, clustering tools, topic models)

Natural Language Applications, including basic tools (TFIDF, cosine distance).

The method that has diffused in economics the most is this...


Lasso in Econometrics


Supervised Learning: What I will cover

Bottom line: tools that are useful for a basic sentiment analysis problem.

- Naive Bayes: The 'OLS of text'. Maps neatly to the term-document-matrix, the fundamental data object you will be working with.

- Regression Trees: Classification tool. Extensions to 'bagging' and 'random forests'.

- Support Vector Machines: Again, a classification tool. This is almost as wild and WTF as the name suggests.

One last thing: super applied exposition.


Examples

So what does this 'hybrid' of ML and existing empirical micro tools look like? Some examples:

- Gentzkow and Shapiro (2010): Illustrates the principle of 'training'.

- Kelly, Papanikolaou, Seru and Taddy (2020): 'Measuring Technological Innovation in the Long Run' using TFIDF and cosine distance tools.

- Warwick Crew: Fetzer (2015) using natural language, Draca and Schwarz (2020) and Ash (2015) on a big kit of methods.


Dissecting Some Applied Papers

The plan is to go over these papers to illustrate some NLP 'building blocks':

- Kelly, Papanikolaou, Seru and Taddy (2020) 'Measuring Technological Innovation in the Long Run'. American Economic Review: Insights. Forthcoming.

- Hoberg and Phillips (2016) 'Text-Based Network Industries and Endogenous Product Differentiation'. Journal of Political Economy 124 (5), 1423-1465.

- Webb, M (2020) 'The Impact of Artificial Intelligence on the Labor Market'. Mimeo, Stanford.


Some Resources (1)

There's a huge range of material online related to ML & NLP coding & 'theory'.

- 'Basic Library': The Hastie-Tibshirani texts outlined above are the go-to. Usefully, they are written with notation that you'll recognise from econometrics. And there's not too much Bayesian stuff (which confuses us in economics).

- Coding: I recommend investing in Python and following two Udemy courses by Jose Portilla.

- These are: 'Python: From Zero to Hero' & 'Learning Python for Data Analysis and Visualization'. There's also a great 'NLP with Python' course.

- You should be able to get these for about 15 pounds each with specials and discount codes.


Some Resources (2)

On ‘theory’, there are a set of total ‘gurus’ out there.

- Joe Blitzstein (Harvard): How the hell does a Gamma distribution work? JB does this sort of thing in detail and makes it easy.

- Justin Esarey (Wake Forest University): Gets deep into the mechanics of commonly used ML tools. Plus classic OLS material.

- Ben Lambert (Imperial): Basically, if university education was actually about acquiring human capital, this guy would put us all out of business...


Part 1: Naive Bayes Classifiers


Multinomial Naive Bayes

- We write a conditional distribution over a class C_k in terms of the features x_i as:

  p(C_k | x_1, ..., x_n) = (1/Z) p(C_k) ∏_{i=1}^{n} p(x_i | C_k)

- where the term inside the product operator is the likelihood, 1/Z is a constant scaling factor and p(C_k) is the class prior.

- Then allocate the sentence according to the highest posterior:

  ŷ = argmax_{k ∈ {1,...,K}} p(C_k) ∏_{i=1}^{n} p(x_i | C_k)

- The chosen class here is the 'hypothesis that is the most probable' and this is known as the maximum a posteriori (MAP) decision rule.


Good References for Naive Bayes (NB)

Ground-level, ‘getting started’ references:

- Lantz (2013) 'Chapter 4: Probabilistic Learning – Classification Using Naive Bayes'. In Machine Learning With R, Packt Publishing, Birmingham.

- Stone (2014) 'Bayes' Rule: A Tutorial Introduction to Bayesian Analysis'. Sebtel Press, Sheffield.


Principle of NB Classification

- Remember the essence of Bayes' Rule is to use past events to extrapolate the probability of events outside our existing information set.

- In particular, BR is more explicit about the process of revising probabilities in the light of new information than the frequentist stuff we are used to.

- I'll illustrate this using the example of distinguishing between 'ham' and 'spam' in email messages. This will show how the spam filter updates the underlying information following BR.


‘Spam versus Ham’ as a Quantitative Problem

How do we organise the data?

- First step is to think of the data as a set of word frequencies per message. We literally cut the data up into a table of words ('terms') that report how often they turn up across emails ('documents').

- Second step is to feed in some information. Specifically, this is information about the 'class' of a document. This forms the training set for our NB classifier to learn patterns.

This gives us a term-document-matrix where the columns are the different words, which get called 'features' in the jargon.
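To make this concrete, here is a minimal sketch using scikit-learn's CountVectorizer (assuming a recent version of scikit-learn; the toy messages and labels are invented purely for illustration):

```python
# A hypothetical toy corpus: build the term-document matrix of word counts.
from sklearn.feature_extraction.text import CountVectorizer

messages = [
    "win free money now",           # labelled spam by a human coder
    "are we still on for lunch",    # labelled ham
    "claim your free prize money",  # labelled spam
]
labels = ["spam", "ham", "spam"]    # the 'class' information we feed in

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)      # rows = documents, columns = terms

print(vectorizer.get_feature_names_out())   # the terms ('features')
print(X.toarray())                          # the term-document matrix of counts
```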


‘Spam versus Ham’ as a Quantitative Problem (1)


‘Spam versus Ham’ as a Quantitative Problem (2)


Some Things to Notice

The NB classifier is not what we're used to. Rather than minimizing some function (eg: OLS) we are basically calculating probabilities from a frequency table and running it through Bayes Rule.

We encounter two assumptions as part of the NB classifier that act to make it 'naive':

- Independence of Features: The features are 'class-conditional independent'. This means we don't think that words co-occur or that ordering matters (the 'bag of words' assumption).

- Equal Importance of Features: The features have different class probabilities but are equally important from the outset. No weighting of features in the basic model.


Simplest Possible Setting

We have a set of messages and based on manual (ie: human) inspection we know that 20 percent are Spam and 80 percent are Ham.

We also know that 5 percent of the messages contain the word Viagra. This subset of messages overlaps with the Spam subset. We have: P(Spam) = 0.2; P(Ham) = 0.8; and P(Viagra) = 0.05.


Even Simpler...

- It really does come down to plugging word frequency information into Bayes Rule:

  P(A|B) = P(B|A) P(A) / P(B) = P(A ∩ B) / P(B)    (1)

- Now let's assign our A and B events to what we observe in our data. Specifically, assign event P(A) = P(Spam) and P(B) = P(Viagra).

- Practically, the probability we are interested in is the probability that a message is spam given that the word Viagra appears. This is P(Spam|Viagra) and is called the posterior.


Naive Bayes calculation for a single word

  P(Spam|Viagra) = P(Viagra|Spam) × P(Spam) / P(Viagra)

  where P(Spam|Viagra) is the posterior probability, P(Viagra|Spam) is the likelihood, P(Spam) is the prior probability and P(Viagra) is the marginal likelihood.

- Intuitively, the exercise that we are running involves plugging in the information built up from the training set to 'flip' the probabilities to get the posterior we are interested in.

- Let's see an example of this calculation.


Frequency Table of Data


Calculation - Weighted Likelihood

- For example, for the class of Spam we have the following:
  - P(Viagra|Spam) = 4/20 = 0.2 ('likelihood')
  - P(Spam) = 20/100 = 0.2 ('prior')

- We can multiply these two components together: Likelihood × Prior = 0.2 × 0.2 = 0.04

- This is sometimes called the 'weighted likelihood'. You can see it as the probability of seeing the word Viagra given that we know that the message is spam, weighted by how common it is to see Spam.


Calculation - Posterior

We can then calculate the posterior as:

  P(Spam|Viagra) = Posterior = (Likelihood × Prior) / Marginal = 0.04 / 0.05 = 0.8

These posterior probabilities therefore combine two pieces of information:

- How common is the feature (Viagra) within a class, and:

- How common is the class (Spam) in the whole message set.

This lets us make the best possible prediction about an incoming message's class based on all the information we have observed.
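As a quick sanity check, the same arithmetic in plain Python, using the counts from the frequency-table example above:

```python
# 4/20 spam messages contain Viagra; 20/100 messages are spam; 5/100 contain Viagra.
likelihood = 4 / 20     # P(Viagra | Spam)
prior      = 20 / 100   # P(Spam)
marginal   = 5 / 100    # P(Viagra)

posterior = likelihood * prior / marginal   # P(Spam | Viagra)
print(posterior)                            # 0.8
```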


More Realistic Example - Large Number of Words.

This first example shows the basic principle behind NB classifiers. We can extend this to use a large number of words.

Hence we will be computing a posterior probability that simultaneously takes into account a large number of features, which are all treated equally for the purpose of the calculations.

Let's build a classifier based on 4 words that either appear or do not appear in a given message: Viagra (W1), Money (W2), Groceries (W3) and Unsubscribe (W4).

Denote non-appearance of a word in a message as ¬W_i.


More Realistic Example - Large Number of Words.


Bayes Posterior - Multiple Words Example

- We can summarise this combination of words appearing as a single joint probability: P(W1 ∩ ¬W2 ∩ ¬W3 ∩ W4).

- This is then plugged into Bayes Rule to calculate the posterior:

  P(Spam | W1 ∩ ¬W2 ∩ ¬W3 ∩ W4) = P(W1 ∩ ¬W2 ∩ ¬W3 ∩ W4 | Spam) P(Spam) / P(W1 ∩ ¬W2 ∩ ¬W3 ∩ W4)

- The challenge then is calculating the joint probability P(W1 ∩ ¬W2 ∩ ¬W3 ∩ W4). As a joint intersection this is hard to compute and requires a lot of training data.


Naive Bayes - Class-Conditional Independence

- To simplify our expression we assume that the features (in this case the words W_i) are independent of each other conditional on observing the class. Hence we re-write the joint probability:

  P(W1 ∩ ¬W2 ∩ ¬W3 ∩ W4)

- As the following, using the independence assumption:

  P(W1|Spam) P(¬W2|Spam) P(¬W3|Spam) P(W4|Spam)

- In effect we have assumed that the words are independent. There is no co-occurrence, so we are assuming that there is a 'bag of words' we draw from in these problems.


General Notation

- We write a conditional distribution over a class C_k in terms of the features x_i as:

  p(C_k | x_1, ..., x_n) = (1/Z) p(C_k) ∏_{i=1}^{n} p(x_i | C_k)

  where the term inside the product operator is the likelihood, 1/Z is a constant scaling factor and p(C_k) is the class prior.


Allocating a Message to a Class

- Then allocate the sentence according to the highest posterior:

  ŷ = argmax_{k ∈ {1,...,K}} p(C_k) ∏_{i=1}^{n} p(x_i | C_k)

  The chosen class here is the 'hypothesis that is the most probable' and this is known as the maximum a posteriori (MAP) decision rule.
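In practice the whole pipeline (term-document matrix, likelihoods, MAP classification) can be run in a few lines. A minimal sketch with scikit-learn's MultinomialNB, on an invented toy training set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_msgs  = ["win free money now", "viagra money offer",
               "lunch tomorrow?", "see you at the seminar"]
train_class = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_msgs)   # term-document matrix

nb = MultinomialNB()                      # multinomial NB (Laplace smoothing by default)
nb.fit(X_train, train_class)

X_new = vec.transform(["free money tomorrow"])
print(nb.predict(X_new))                  # MAP class: the highest posterior
print(nb.predict_proba(X_new))            # the posterior probabilities themselves
```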


Training versus Test Data


Applications

As a researcher, what you want to do is feed in some kind of pattern that you are interested in.

- Sentiment Analysis: Train the data on a set of good or bad reviews.

- Article Classification: Hand-code a set of articles according to a topic and use this to train a classifier.

- Extra Non-word Features: Combine with meta-data features such as publication date, author, length etc.

- Image Patterns: Cut an image into boxes and use the shading according to grid point as the set of features.

The trick in actual applications is to unpack the components in terms of words driving the likelihood, the levels of the different priors and marginals.


Strengths of NB Classifiers

- Simple, fast and effective.

- Does well with noisy and missing data.

- Requires relatively few examples for training, but a large number of examples is better.

- Can easily back out important words by browsing the contents of the likelihood.

- Can 'learn' from mistakes by modifying the training set.


Weaknesses of NB Classifiers

- Naive assumptions of independence and equal importance of features are unrealistic and ignore information (eg: negation phrases such as 'not good').

- While it works well with these assumptions, no-one is sure why (best guess is that tools such as OLS over-fit the training data).

- Not ideal for data with numeric (continuous) features. In that case we need to discretize the data or use Gaussian Naive Bayes.

- The estimated probabilities are less accurate than the simple class predictions (ie: knowing that the posterior is 0.7 rather than 0.52 does not help us much).


Part 2: Regression Trees


Regression Trees - Battle Plan

- Naive Bayes looks a bit crazy for people from our background, and regression trees are also quite different from our usual intuitions and obsessions.

- Classification and Regression Trees (CART) are based on the principle of sequentially splitting the data into multi-dimensional 'boxes' according to the X features.

- The basic tree method then extends out to mysterious-sounding modifications known as 'Bagging', 'Random Forests' and 'Boosting'.


Reading for Regression Trees

- Chapter 8 “Tree-Based Methods” in James, Witten, Hastie, Tibshirani (2013) Introduction to Statistical Learning Using R. Springer: New York. Additional material taken from the more advanced version of this (2013) text: Chapter 12 “Random Forests” in Hastie, Tibshirani and Friedman (2009) Elements of Statistical Learning. Springer: New York.

- These books are accessible for economists: notation/jargon looks a lot like econometrics and not much hardcore Bayesian focus in there.


Basics of Regression Trees

The tree method involves stratifying or segmenting the predictor ('feature') space X into (1, ..., J) sub-regions, denoted R_j.

We then obtain a predicted value ŷ based on the mean value of the y outcome in the sub-region covered by the specific values of X predictors that we are considering.

To understand this, consider a 2-dimensional plot of two features X1 and X2 where we have colour-coded the outcome y into low, medium, and high. This allows us to see the clustering of y values into distinct regions of the feature space X.


Baseball Salary Data

Now we can imagine partitioning the data into 'boxes' to get the best possible representation of different segments of the data. These boxes are giving us the different values of y conditional on our predictors X.


Baseball Salary Data

This can also be depicted as a decision tree that gives us the two splits that we have seen in the 2D plot. Notice the terminology here: branches, internal nodes, terminal nodes and leaves.


Split it Up

Hence the principle of regression trees is to systematically split the data according to thresholds defined on the X predictors in order to create 'optimal' partitions. It is a two-step process (a code sketch follows below):

- Divide the Predictor Space: Using a 'training set' of data, partition the predictor or feature set X1, X2, ..., Xp into J distinct and non-overlapping regions R_1, R_2, ..., R_J. These R_j regions are the 'boxes' in the multi-dimensional feature space and will contain a set of y response values.

- Make the Prediction: For every test set observation that falls into the region R_j we make a prediction ŷ = c_m based on the mean (or the median) of the training set response values y in that region of feature space.
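A minimal sketch of these two steps with scikit-learn, on synthetic data (the variables and numbers are invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(200, 2))             # two predictors
y = 5 + 0.4 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 1, 200)

tree = DecisionTreeRegressor(max_depth=2)         # a small tree: few regions R_j
tree.fit(X, y)                                    # step 1: partition the predictor space

x_test = np.array([[6.0, 12.0]])
print(tree.predict(x_test))                       # step 2: mean of training y in that box
```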


Recursive Binary Splitting (1)

The core is a minimization problem across the predictors in X and the possible thresholds for cutting these predictors, denoted as s:

  argmin_{j,s} Σ_{j=1}^{J} Σ_{i ∈ R_j} (y_i − ŷ_{R_j})^2

where the outer operator sums over the J regions and the inner operator sums over the observations in region R_j. So within each region or 'box' we have a 'residual sum of squares' (RSS) and we want to make the sum of the RSS across boxes as small as possible.


Recursive Binary Splitting (2)

The set of boxes after minimization gives us a prediction function for given X feature values and regions (1, ..., M):

  f(x) = Σ_{m=1}^{M} c_m I(x ∈ R_m)

where c_m is a constant in the region m (typically the mean). Intuitively, we are allocating each test set observation i to a box based on where that i fits in terms of the 'grid' of X features mapped in the training set.


‘Top Down’ and ‘Greedy’

This recursive splitting method is described as 'top down' and 'greedy'.

'Top Down' = Since it starts at the top of the tree and works downwards.

'Greedy' = Because at each step the best split is made at that step rather than looking ahead to try and optimize across steps.


Two Region Example (1)

Let's look in detail at a single step, that is, one split. We want to divide the data into two half-planes:

  R_1(j, s) = [X | X_j ≤ s]  and  R_2(j, s) = [X | X_j > s]

The objective of the binary split at this step is to find the splitting variable X_j and split point or threshold s that solves the problem:

  min_{j,s} RSS = Σ_{i: x_i ∈ R_1(j,s)} (y_i − ŷ_{R_1})^2 + Σ_{i: x_i ∈ R_2(j,s)} (y_i − ŷ_{R_2})^2

Practically: scan through every potential (j, s) combination until we get the minimizer pair (j, s), as in the sketch below.
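A brute-force sketch of this one splitting step in Python: scan every predictor j and every observed value as a candidate threshold s, and keep the (j, s) pair that minimises the two-region RSS (the data below are synthetic, invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                       # n = 100 observations, p = 3 predictors
y = np.where(X[:, 1] > 0.5, 10.0, 2.0) + rng.normal(0, 1, 100)

def rss(values):
    return np.sum((values - values.mean()) ** 2) if len(values) else 0.0

best = None
for j in range(X.shape[1]):                         # loop over predictors
    for s in np.unique(X[:, j]):                    # candidate thresholds
        left, right = y[X[:, j] <= s], y[X[:, j] > s]
        total = rss(left) + rss(right)
        if best is None or total < best[0]:
            best = (total, j, s)

print("best (RSS, j, s):", best)                    # should pick predictor 1 near s = 0.5
```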


Two Region Example (2)

Having settled on the best (j, s) split we formulate a prediction as a 'within-region constant', that is, the mean value within a given box.

  c_1 = ave[y_i | x_i ∈ R_1(j, s)]  and  c_2 = ave[y_i | x_i ∈ R_2(j, s)]

The intuition is that the mean gives us the best 'minimum distance' in terms of the best possible RSS for a region. This goes back to the idea that running a regression of some y on a constant k gives us the mean of y. We continue the splitting iteratively until we no longer reduce the RSS.


This is the tree corresponding to our first example of a feasible split. Each of the 'leaves' is one of our boxes in the 2D diagram from before.


And this is the 2D plot compared to the 3D plot that incorporates the y response variable as the vertical axis (I think of this as a 'stepwise nonparametric density').


Problems with Basic Regression Trees

Regression trees do not perform so well for prediction. There are two main reasons for this.

First, the tree method trades off lower bias for higher variance. The intuition is that as we cut the X feature space into smaller boxes this gives a chance for sampling variation or noise to dominate (ie: higher variance for our estimates). In contrast, a less 'bushy' tree will draw bigger boxes and we might be off-target with our predictions.

Second, the tree method is basically made up of a complex stepwise function. Linear methods may provide a better approximation if we think that the underlying conditional relationship is relatively simple. For example:


Tree Pruning

These limitations mean that the recursive splitting approach is likely to overfit the training data and test poorly. We are therefore interested in an approach that will simplify our trees and improve performance. The aim is: a tree with fewer splits and lower variance at the cost of some bias.

Choosing based purely on the RSS threshold (eg: set some minimum threshold for the reduction of RSS and stop growing the tree when we hit it) could end up being short-sighted. By this we mean that we could be missing out on bigger reductions in RSS as we move further down the tree.

Hence the approach we outline here – “Cost Complexity Pruning” – is designed to deal with this problem. The basic strategy is to grow a large, full tree (denoted T0) and strategically prune it back to get the best test error rate.


Cost Complexity Pruning

We want to define a sub-tree T ⊂ T0 as a tree that prunes back the full tree by collapsing internal and terminal nodes. For a given α we minimize:

  Σ_{m=1}^{|T|} Σ_{x_i ∈ R_m} (y_i − ŷ_{R_m})^2 + α|T|

where |T| is the number of terminal nodes in a tree, i.e. the number of boxes, R_m is the rectangle pertaining to terminal node m and ŷ_{R_m} is the mean of the training observations in R_m. Same problem as before, except we have added a penalty according to the size of the sub-tree.


Cross Validation Error, Test Error and Training Error

Hence the idea is to vary α and have the computer compute different subtrees, one for each value of α.

- As α → 0, we obtain the unpruned tree.

- As α → ∞, we obtain a tree with a single terminal node.

- How to decide on the optimal α? Two approaches...

- We can perform a validation set approach to compute test and training error and, based on the minimum point of test error, choose α.

- Alternatively, we can obtain a sequence of subtrees and then compute the K-fold cross validation error (a code sketch follows below).
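A sketch of cost complexity pruning in scikit-learn: grow a full tree, recover the sequence of candidate α values, and choose α by K-fold cross-validation. This assumes X and y are an existing training feature matrix and response (e.g. from the earlier sketch):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

full_tree = DecisionTreeRegressor(random_state=0)
path = full_tree.cost_complexity_pruning_path(X, y)     # candidate alphas for subtrees

cv_scores = []
for alpha in path.ccp_alphas:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    scores = cross_val_score(pruned, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_scores.append(scores.mean())

best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
print("alpha chosen by 5-fold CV:", best_alpha)
```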


Classification Trees

Classification-based problems involve using a qualitative response variable for y. We make predictions ŷ based on the most commonly occurring class in a sub-region R_m.

In this case we do not use an RSS to evaluate model fit so much as measures based on the classification error rate.

Instead we measure fit using the most commonly occurring class (the classification error rate), the Gini index, and cross-entropy.

The principles are the same, we just add more jargon!


Bagging - ‘Bootstrap Averaging’

We have seen that the main problem with trees is that they have high variance and are noisy. The bagging method tries to smooth out this noise by using trees grown on bootstrapped training sets. The basic steps are as follows:

1. Construct a new bootstrapped dataset by sampling with replacement from our existing sample. We are treating the given sample as the 'population' and drawing hypothetical samples from it. Index these bootstrapped samples as (b = 1, 2, ..., B). Think of these b samples as alternative training samples.

2. Then grow a full tree on each of these (b = 1, 2, ..., B) bootstrapped samples. Denote these trees as (T_1, T_2, ..., T_B).


Bagging - ‘Bootstrap Averaging’

3. Then for a given set of X predictors, plug the values into each tree and get a set of predictions f_1(x), f_2(x), f_3(x), ..., f_B(x).

4. Finally, take the average of these predictions across bootstrapped samples:

  f_avg(x) = (1/B) Σ_{b=1}^{B} f^b(x)

This is the so-called 'bagged' estimate or prediction and exploits the usual logic of smoothing out noise across different samples, in this case bootstrapped training samples.

The associated validation procedure is known as 'out-of-bag' estimation: approximately 1/3 of the original observations are left out of each bootstrap sample and can be used for model validation.
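A minimal sketch of bagging 'by hand': draw B bootstrap samples, grow a full tree on each, and average the predictions. Assumes X and y exist as in the earlier sketches:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

B = 100
n = X.shape[0]
rng = np.random.default_rng(0)
trees = []

for b in range(B):
    idx = rng.integers(0, n, size=n)        # sample n rows with replacement
    tree = DecisionTreeRegressor()          # grow a full (unpruned) tree on sample b
    tree.fit(X[idx], y[idx])
    trees.append(tree)

x_new = X[:1]                               # an observation we want a prediction for
bagged = np.mean([t.predict(x_new) for t in trees])   # average across the B trees
print(bagged)
```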


Bagging - Variable Importance

Furthermore, we no longer have a single tree to describe the model but rather estimates based on the average across bootstrapped training samples. But we can construct a 'variable importance' measure by looking at the average reduction in RSS across the bootstrapped samples.


Random Forests

- The problem: our different bootstrapped f^b's are correlated mechanically, since they are obtained using the same set of observations. This increases the variance of our prediction.

- Correlation arises from using the same set of X's in the construction of each bootstrapped tree.

- One way to avoid this is to try to decorrelate the individual f^b's.

- To avoid this, when building these trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.

- This mechanically decorrelates the fitted values (see the sketch below).
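A sketch of this fix in scikit-learn: the same bagging logic, but with only m = max_features predictors tried at each split. Assumes X and y exist as before:

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=500,        # number of bootstrapped trees B
    max_features="sqrt",     # m: the random subset of predictors tried at each split
    oob_score=True,          # validate on the left-out ('out-of-bag') observations
    random_state=0,
)
forest.fit(X, y)
print(forest.oob_score_)               # out-of-bag R^2
print(forest.feature_importances_)     # a variable importance measure
```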


Part 3: Support Vector Machines


Support Vector Machines - Battleplan

Might be the craziest approach. We will discuss:

- Basic hyperplane concept for classification.

- Finding the optimal splitting hyperplane as a 'Maximal Margin Hyperplane'.

- Support Vector Classifier = allowing a 'soft margin' along our hyperplane in case we don't think a clean, linear separation is desirable.

- Support Vector Machine = nonlinear boundaries implemented using a kernel approach.

Reading: Hastie and Tibshirani texts.


Hyperplanes

We are going to use a 'hyperplane' as a tool for a two-class classification problem. Recall that a hyperplane is just a way of splitting up some p-dimensional space. We say that, given a p-dimensional space, a hyperplane is a sub-space of (p − 1) dimensions.

For example, in the two-dimensional p = 2 case the hyperplane is (p − 1) = 1 dimensional and therefore just a line.

The p = 2 ‘straight line’ hyperplane is defined as you would expect:

β0 + β1X1 + β2X2 = 0 (2)

where β0, β1 and β2 are our parameters. The points X = (X1, X2)^T for which this equation holds define the hyperplane.


Hyperplanes (continued)

This generalizes to p-dimensions as:

β0 + β1X1 + β2X2 + ... + βpXp = 0 (3)

again with those points X = (X1, X2, ..., Xp)^T satisfying (3) defining the hyperplane. But consider the cases when this is not met:

β0 + β1X1 + β2X2 + ... + βpXp > 0

β0 + β1X1 + β2X2 + ... + βpXp < 0

These will be points X that lie off the hyperplane to either side. In this sense we can see that the hyperplane is just a tool for dividing a p-dimensional space into different halves. See it like this...


Classification Using a Separating Hyperplane (1)

Our goal now is to build the best possible classifier using this hyperplane principle. Let's set up some basic notation. We are going to consider an (n × p) features matrix X of training observations.

Our response variable will fall into two classes, denoted with values or 'class labels' -1 and 1:

  y_1, y_2, ..., y_n ∈ {−1, 1}

The model from this training set will eventually be used on some test observations, written individually as: x* = (x*_1, x*_2, ..., x*_p).


Classification Using a Separating Hyperplane (2)

This translates into the following decision rules:

  β0 + β1x1 + β2x2 + ... + βpxp > 0 if y_i = 1

  β0 + β1x1 + β2x2 + ... + βpxp < 0 if y_i = −1

We then re-write this a bit to help us later on:

  y_i(β0 + β1x1 + β2x2 + ... + βpxp) > 0    (4)

All we're doing here is using y_i as an indicator function. If (y_i = −1) then the bit inside the brackets will also be negative, giving us a positive overall. If (y_i = 1) then we also get a positive. This is just a property we set up for mathematical convenience.


Classification Using a Separating Hyperplane (3)

Having defined this hyperplane, we can plug in our test observations in order to assign them to a class. Write this as:

  f(x*) = β0 + β1x*_1 + β2x*_2 + ... + βpx*_p > 0, classify as y = 1

  f(x*) = β0 + β1x*_1 + β2x*_2 + ... + βpx*_p < 0, classify as y = −1

Hence test observations are classified based on what side of the hyperplane they fall on. The magnitude of f(x*) is also informative here – the further away the test observation x* is from the hyperplane, the more confident we can be about its class assignment.


Classification Using a Separating Hyperplane (4)

The challenge then is: how can we come up with the best hyperplane? In principle, given that the classes are separable, we can define an infinite number of hyperplanes.

For example, a given hyperplane can just be shifted up or rotated slightly and still separate the data. This is where we need to adopt an approach based on finding a hyperplane with the 'maximum margin'.


Maximal Margin Hyperplane (MMH)

The best hyperplane (HP) will be the one that is farthest from the observations. We work this out by first defining a perpendicular distance between an observation and a proposed HP, and then adding this up across observations.

The MMH is then the HP with the 'farthest minimum distance' with respect to the training observations. The test observations are then classified according to how they fall on either side of the MMH.

Our hope is that a large margin on the training observations is also reflected in a large margin on the test observations. Overfitting is a concern when considering cases where p > N. Intuitively, the features are numerous enough to 'box in' many individual observations by defining a very high dimensional space.


Here Come the Support Vectors

Defining the MMH is where we encounter the key concept of 'support vectors'.

The crucial observations for the MMH will be those that sit on the margin. They 'support' the HP in the sense that if these observations shifted then our whole HP would move as a consequence.

The MMH depends on these 'support' observations directly. In contrast, the MMH would not be affected if we shifted the other observations that are more firmly located on either side of the hyperplane.


Constructing the Maximal Margin Classifier

We set up the following problem based on the (n × p) training data and class labels y_1, y_2, ..., y_n ∈ {−1, 1}:

  max_{β0,...,βp, M}  M

  subject to:

    Σ_{j=1}^{p} β_j^2 = 1

    y_i(β0 + β1x_i1 + ... + βpx_ip) ≥ M  for all i

The first constraint just normalizes any negative β's by squaring and ensures we will have a convex combination across features. The second constraint ensures that each observation is on the correct side of the hyperplane.


Non-Separable Case

The MMH approach will work if we can define a viable separating hyperplane, that is, a 'clean' separation of classes on the basis of the p dimensions of the features X.

If this is not feasible then we can give up on an exact separation and go with an 'almost separation' where we tolerate some observations being wrongly positioned. In this case we define a 'soft margin'.

The generalization of the Maximal Margin Hyperplane (MMH) to this non-separable case with soft margins is called a 'Support Vector Classifier' in the jargon.

We may also want to use a soft margin if the data is separable but with a hyperplane that is overly sensitive to particular observations.


Support Vector Classifier

Building a support vector classifier is then a matter of modifying our basic optimization problem with a softer margin in terms of the constraints that are used.

With a soft margin we are trading off (i) greater robustness to the position of some individual observations, against (ii) better classification performance for the rest of our training observations.

Practically, the soft margin means that we are going to allow some observations to fall on either the wrong side of the hyperplane or within the maximal margin we calculate.

This boils down to the following optimization problem that allows some extra 'slack' in terms of misclassification.


Soft Margins

We can rewrite the optimization problem, introducing slack variables ε_1, ..., ε_n, one for each data point. We rewrite the constrained optimization problem as:

  max_{β0,...,βp, ε_1,...,ε_n}  M

  subject to:

    Σ_{j=1}^{p} β_j^2 = 1

    y_i(β0 + β1x_i1 + ... + βpx_ip) ≥ M(1 − ε_i)

    ε_i ≥ 0,  Σ_{i=1}^{n} ε_i ≤ C


Soft Margins

We optimize now over the ε_i as well. They allow an observation to lie within the margin M (if 1 > ε_i > 0) or on the wrong side of the margin (if ε_i > 1). We have an additional parameter C, which we can vary. This is the total budget by which the margin can be violated. Setting C = 0 results in the hard-margin separating hyperplane, which may not exist. A code sketch follows below.
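A minimal sketch of a soft-margin linear classifier with scikit-learn, on an invented toy problem. Note that scikit-learn's C penalises margin violations, so it behaves roughly like the inverse of the slack budget C above (large C means a harder margin):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0)        # soft margin; try C=100 for a near-hard margin
clf.fit(X, y)

print(clf.support_vectors_[:5])          # the observations that 'support' the hyperplane
print(clf.predict([[0.2, -0.3]]))        # classify a new point by which side it falls on
```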


Support Vector Machines

The approach so far has been based on a linear classifier, with and without some tolerance for 'soft' margins. By definition, this restricts us to linear separations of the observations. The 'support vector machine' is an extension of the basic SV classifier that allows us to handle nonlinear boundaries.

The basic approach is to plug in some polynomial functions (eg: quadratics, cubics) and solve this through to a distance function defined in terms of a 'target' observation x that we are trying to classify and the subset of support vector points that define the hyperplane.
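A sketch of this kernel extension in scikit-learn: the same classifier but with a polynomial or radial-basis kernel, allowing a nonlinear boundary. Assumes X and y from the previous sketch:

```python
from sklearn.svm import SVC

poly_svm = SVC(kernel="poly", degree=3, C=1.0)      # cubic polynomial boundary
rbf_svm  = SVC(kernel="rbf", gamma="scale", C=1.0)  # radial basis function kernel

poly_svm.fit(X, y)
rbf_svm.fit(X, y)
print(poly_svm.predict([[0.2, -0.3]]), rbf_svm.predict([[0.2, -0.3]]))
```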


Part 4: Some Applications


Sentiment - Pang, Lee and Vaithyanathan (2002)

Uses the IMDb database of the rec.arts.movies.reviews newsgroup.

Tests Naive Bayes, Maximum Entropy (a spin on Naive Bayes), and Support Vector Machines on a 50-50 sample of reviews for 'thumbs up' or 'thumbs down' classification.

They first establish a 'simple model' baseline using 1) an ex ante human word list and 2) a word list defined after casual inspection of the data.

This gives 60-65 percent accuracy rates (the random choice baseline is 50 percent).


Sentiment - Pang, Lee and Vaithyanathan (2002)


Sentiment - Pang, Lee and Vaithyanathan (2002)


Sentiment - Draca, Garred, Stickland and Warrinnier (2016) "On Target?"

Look at differential sensitivity of 'targeted' firms on the Tehran Stock Exchange to news about potential sanctions relief.

Uses a diff-in-diff around the November 2013 Geneva deal plus high(er) frequency measures of news built up from the Tehran Times.

Uses a 'pre-loaded' likelihood to classify the sentiment of deal-related news. Also decomposes the topic structure.


Example of Tehran Times (1)


Example of Tehran Times (2)


News Shocks - Basic Approach

- Consider the first four pages of each edition (contains the main news items). Break this up to the sentence level. Creates 9,000-10,000 sentences per month of editions.

- Identify sets of articles that are related to the deal. Similar to Baker, Bloom and Davis (2015). Construct the news coverage measure as the number of sentences.

- Use a text classifier to tag sentences as positive, negative or neutral. Then count this up. Better measure of underlying deal probability. (A rough sketch of this counting step follows below.)
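This is not the paper's actual code, just a hypothetical sketch of the counting step: split article text into sentences and count how each one is classified. It assumes NLTK with the 'punkt' tokenizer downloaded; `classify` here is a placeholder for a trained sentiment classifier (e.g. an NB model like the one above):

```python
from collections import Counter
from nltk.tokenize import sent_tokenize

def classify(sentence):
    return "neutral"    # placeholder: plug in a real classifier here

def coverage_counts(articles):
    counts = Counter()
    for text in articles:
        for sentence in sent_tokenize(text):
            counts[classify(sentence)] += 1
    return counts

print(coverage_counts(["Sanctions relief looks likely. Talks continue in Geneva."]))
```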


News coverage


News coverage: share positive


News coverage: results

Columns: (1) (2) (3) (4) (5) (6). Samples: July to December; July to October; Positive coverage.

Target * coverage     0.149***   0.170***   0.152***   0.170***
                      (0.044)    (0.050)    (0.049)    (0.050)

Target * positive     0.161***   0.189***
                      (0.051)    (0.052)

Coverage              0.080***   0.071***
                      (0.025)    (0.027)

Positive              0.033
                      (0.029)

Month dummies           Yes      Yes      Yes      Yes      Yes      Yes
Firm FEs                Yes      Yes      Yes      Yes      Yes      Yes
Industry interactions   No       Yes      No       Yes      No       Yes
Observations            13,784   10,563   10,563   10,563   10,563   10,563
Number of firms         161      160      160      160      160      160
R2                      0.035    0.039    0.037    0.039    0.035    0.038

Coverage is a standardized count of sentences in deal-related articles.

Positive is a standardized count of deal-related sentences classified as positive.

Standard errors clustered by firm in parentheses.


Geneva deal: daily effects with negative coverage share


Fetzer (2015) - Social Insurance and Conflict

Research design is based on the effects of a major welfare-to-work programme in India (NREGA) in attenuating the rainfall-conflict link. Builds new measures of conflict from news reports.

Combination of 'natural language' techniques (for extraction) with classifiers applied over the top.


Natural Language Processing: Extracting Information

Use trained natural language processing algorithms to extract key pieces of data to fill an Event tuple:

  E = {L, T, A, S, O}

containing a Location, a Time or Date, an Action (Act), and a Subject and Object.
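This is not Fetzer's actual pipeline, just a hypothetical sketch of filling the event tuple E = {L, T, A, S, O} from one sentence with spaCy's pretrained English model (assumes `python -m spacy download en_core_web_sm` has been run; the example sentence is invented):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("On Monday, protesters attacked a police station in Bihar.")

event = {
    "location": [e.text for e in doc.ents if e.label_ == "GPE"],
    "time":     [e.text for e in doc.ents if e.label_ == "DATE"],
    "action":   [t.lemma_ for t in doc if t.dep_ == "ROOT"],               # main verb
    "subject":  [t.text for t in doc if t.dep_ in ("nsubj", "nsubjpass")],
    "object":   [t.text for t in doc if t.dep_ in ("dobj", "obj")],
}
print(event)
```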


Output after NLP


Asking humans to refine the set of acts


Asking Humans to Help Train Classifiers


Brief Notes

Gentzkow and Shapiro (2010): Uses the principles of training. First trains political speech on think-tank citations. Then maps over to newspaper slant.

Ash (2015): Uses a range of tools to pick out previously unobserved policy instruments.

Baker, Bloom and Davis: How to design a human coding system that you might use as a source for training ML classifiers.
