classification, linear models, naïve bayes · 2018. 12. 14. · classification, linear models,...

30
Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob Eisenstein

Upload: others

Post on 09-Sep-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Classification, Linear Models, Naïve Bayes

CMSC 470

Marine Carpuat

Slides credit: Dan Jurafsky & James Martin, Jacob Eisenstein

Page 2: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Today

• Text classification problems• and their evaluation

• Linear classifiers• Features & Weights

• Bag of words

• Naïve Bayes

Page 3: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Classification problems

Page 4: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Multiclass Classification

label1 label2 label3 label4

Classifiersupervised machine

learning algorithm

?unlabeled

document

label1?

label2?

label3?

label4?

TestingTraining

training data

Feature Functions

Page 5: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Is this spam?

From: "Fabian Starr“ <[email protected]>

Subject: Hey! Sofware for the funny prices!

Get the great discounts on popular software today for PC and Macintoshhttp://iiled.org/Cj4Lmx

70-90% Discounts from retail price!!!All sofware is instantly available to download - No Need Wait!

Page 6: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

What is the subject of this article?

• Antogonists and Inhibitors

• Blood Supply• Chemistry• Drug Therapy• Embryology• Epidemiology• …

MeSH Subject Category Hierarchy

?

MEDLINE Article

Page 7: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Text Classification

• Assigning subject categories, topics, or genres

• Spam detection

• Authorship identification

• Age/gender identification

• Language Identification

• Sentiment analysis

• …

Page 8: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Text Classification: definition

• Input:• a document d

• a fixed set of classes Y = {y1, y2,…, yJ}

• Output: a predicted class y Y

Page 9: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Classification Methods:Supervised Machine Learning

• Input• a document d

• a fixed set of classes Y = {y1, y2,…, yJ}

• a training set of m hand-labeled documents (d1,y1),....,(dm,ym)

• Output• a learned classifier d y

Page 10: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Aside: getting examples for supervised learning

• Human annotation• By experts or non-experts (crowdsourcing)

• Found data

• How do we know how good a classifier is?• Compare classifier predictions with human annotation

• On held out test examples

• Evaluation metrics: accuracy, precision, recall

Page 11: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

The 2-by-2 contingency table

correct not correct

selected tp fp

not selected fn tn

Page 12: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Precision and recall

• Precision: % of selected items that are correctRecall: % of correct items that are selected

correct not correct

selected tp fp

not selected fn tn

Page 13: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

A combined measure: F

• A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):

• People usually use balanced F1 measure• i.e., with = 1 (that is, = ½):

F = 2PR/(P+R)

RP

PR

RP

F+

+=

-+

=2

2 )1(

1)1(

1

1

b

b

aa

Page 14: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Linear Modelsfor Multiclass Classification

Page 15: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Linear Modelsfor Classification

Feature function

representation

Weights

Page 16: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Defining features: Bag of words

Page 17: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Defining features

Page 18: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Linear Classification

Page 19: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Linear Modelsfor Classification

Feature function

representation

Weights

Page 20: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

How can we learn weights?

• By hand

• Probability• e.g.,Naïve Bayes

• Discriminative training• e.g., perceptron, support vector machines

Page 21: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Naïve Bayes Modelsfor Text Classification

Page 22: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Generative Story for Multinomial Naïve Bayes

• A hypothetical stochastic process describing how training examples are generated

Page 23: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Prediction with Naïve BayesScore(x,y)

Definition of conditional probability

Generative story assumptions

This is a linear model!

Page 24: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Prediction with Naïve BayesScore(x,y)

Definition of conditional probability

Generative story assumptions

This is a linear model!

Page 25: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Prediction with Naïve BayesScore(x,y)

Definition of conditional probability

Generative story assumptions

This is a linear model!

Page 26: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Parameter Estimation

• “count and normalize”

• Parameters of a multinomial distribution

• Relative frequency estimator

• Formally: this is the maximum likelihood estimate• See CIML for derivation

Page 27: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Smoothing (add alpha)

Page 28: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Naïve Bayes recap

Page 29: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Why is this model called “Naïve Bayes”?Another view of the same model

𝑦 = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦 𝑃 𝑌 = 𝑦 𝑋 = 𝑥)

= 𝑎𝑟𝑔𝑚𝑎𝑥𝑦𝑃(𝑌 = 𝑦)𝑃 𝑋 = 𝑥 𝑌 = 𝑦)

= 𝑎𝑟𝑔𝑚𝑎𝑥𝑦𝑃(𝑌 = 𝑦)

𝑖=1

𝑑

𝑃 𝑋𝑖 = 𝑥𝑖 𝑌 = 𝑦)

Bayes rule

+ Conditional independence assumption

Page 30: Classification, Linear Models, Naïve Bayes · 2018. 12. 14. · Classification, Linear Models, Naïve Bayes CMSC 470 Marine Carpuat Slides credit: Dan Jurafsky & James Martin, Jacob

Today

• Text classification problems• and their evaluation

• Linear classifiers• Features & Weights

• Bag of words

• Naïve Bayes