
Classification: Naïve Bayes and Logistic Regression

Machine Learning: Jordan Boyd-Graber, University of Colorado Boulder

Lecture 2A

Slides adapted from Hinrich Schütze and Lauren Hannah


By the end of today . . .

• You’ll be able to frame many machine learning tasks as classification problems

• Apply logistic regression (given weights) to classify data

• Learn naïve Bayes from data


Classification

Outline

1 Classification

2 Motivating Naïve Bayes Example

3 Estimating Probability Distributions

4 Naïve Bayes Definition


Classification

Probabilistic Classification

Given:

• A universe X our examples can come from (e.g., English documents with a predefined vocabulary)

  ◦ Examples are represented in this space.

• A fixed set of labels y ∈ C = {c_1, c_2, . . . , c_J}

  ◦ The classes are human-defined for the needs of an application (e.g., spam vs. ham).

• A training set D of labeled documents {(x_1, y_1), . . . , (x_N, y_N)}

We learn a classifier γ that maps documents to class probabilities:

γ : (x, y) → [0, 1]

such that ∑_y γ(x, y) = 1
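
To make the definition concrete, here is a minimal sketch of such a γ in code. This is my illustration, not part of the slides: the uniform scores are a placeholder for whatever a real model (naïve Bayes, logistic regression) would compute, and the only point is the interface, a map from (x, y) to [0, 1] whose values sum to 1 over the labels.

    # Sketch of a probabilistic classifier gamma: (x, y) -> [0, 1] with sum_y gamma(x, y) = 1.
    # The scores here are uniform placeholders; a real model would supply them.

    LABELS = ["spam", "ham"]  # a fixed label set C = {c_1, ..., c_J}

    def gamma(x):
        """Return a dict mapping each label y to gamma(x, y)."""
        scores = {y: 1.0 for y in LABELS}                 # placeholder: every class scored equally
        total = sum(scores.values())
        return {y: s / total for y, s in scores.items()}  # normalize so the values sum to 1

    probs = gamma("cheap pills buy now")
    print(probs)                                          # {'spam': 0.5, 'ham': 0.5}
    assert abs(sum(probs.values()) - 1.0) < 1e-12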


Classification

Generative vs. Discriminative Models

Generative

Model joint probability p(x, y), including the data x.

Naïve Bayes

• Uses Bayes’ rule to reverse the conditioning: p(x | y) → p(y | x)

• Naïve because it ignores joint probabilities within the data distribution

Discriminative

Model only conditional probability p(y | x), excluding the data x.

Logistic regression

• Logistic: a special mathematical function it uses

• Regression: combines a weight vector with observations to create an answer

• General cookbook for building conditional probability distributions


Motivating Naïve Bayes Example

Outline

1 Classification

2 Motivating Naïve Bayes Example

3 Estimating Probability Distributions

4 Naïve Bayes Definition


Motivating Naïve Bayes Example

A Classification Problem

• Suppose that I have two coins, C1 and C2

• Now suppose I pull a coin out of my pocket, flip it a bunch of times, record the coin and outcomes, and repeat many times:

C1: 0 1 1 1 1
C1: 1 1 0
C2: 1 0 0 0 0 0 0 1
C1: 0 1
C1: 1 1 0 1 1 1
C2: 0 0 1 1 0 1
C2: 1 0 0 0

• Now suppose I am given a new sequence, 0 0 1; which coin is it from?


Motivating Naïve Bayes Example

A Classification Problem

This problem has particular challenges:

• different numbers of covariates for each observation

• number of covariates can be large

However, there is some structure:

• Easy to get P(C1) = 4/7, P(C2) = 3/7

• Also easy to get P(X_i = 1 | C1) = 12/16 and P(X_i = 1 | C2) = 6/18

• By conditional independence,

P(X = 010 | C1) = P(X_1 = 0 | C1) P(X_2 = 1 | C1) P(X_3 = 0 | C1)

• Can we use these to get P(C1 | X = 001)?
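
To check these numbers, here is a short sketch of mine (not from the slides) that recomputes P(C1), P(C2), and P(X_i = 1 | C) directly from the seven recorded sequences:

    # Recompute the maximum likelihood estimates from the recorded coin flips.
    from collections import defaultdict

    sequences = [
        ("C1", [0, 1, 1, 1, 1]),
        ("C1", [1, 1, 0]),
        ("C2", [1, 0, 0, 0, 0, 0, 0, 1]),
        ("C1", [0, 1]),
        ("C1", [1, 1, 0, 1, 1, 1]),
        ("C2", [0, 0, 1, 1, 0, 1]),
        ("C2", [1, 0, 0, 0]),
    ]

    seq_counts = defaultdict(int)   # how many sequences came from each coin
    heads = defaultdict(int)        # number of 1s observed for each coin
    flips = defaultdict(int)        # total flips observed for each coin

    for coin, outcomes in sequences:
        seq_counts[coin] += 1
        heads[coin] += sum(outcomes)
        flips[coin] += len(outcomes)

    n = len(sequences)
    for coin in ("C1", "C2"):
        print(coin,
              "P(coin) =", seq_counts[coin], "/", n,            # 4/7 and 3/7
              " P(X=1|coin) =", heads[coin], "/", flips[coin])  # 12/16 and 6/18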


Motivating Naïve Bayes Example

A Classification Problem

Summary: have P(data | class), want P(class | data)

Solution: Bayes’ rule!

P(class | data) = P(data | class) P(class) / P(data)

                = P(data | class) P(class) / ∑_{class′} P(data | class′) P(class′)

To compute, we need to estimate P(data | class) and P(class) for all classes.
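
Continuing the coin example, a small sketch of mine (not from the slides) plugs the estimates from the previous slide into Bayes’ rule to answer the earlier question: which coin produced 0 0 1?

    # Apply Bayes' rule to classify the new sequence 0 0 1,
    # assuming flips are independent Bernoulli draws given the coin.
    priors = {"C1": 4 / 7, "C2": 3 / 7}
    p_heads = {"C1": 12 / 16, "C2": 6 / 18}

    def likelihood(coin, outcomes):
        p = 1.0
        for x in outcomes:
            p *= p_heads[coin] if x == 1 else 1 - p_heads[coin]
        return p

    new_seq = [0, 0, 1]
    joint = {c: likelihood(c, new_seq) * priors[c] for c in priors}   # P(data | c) P(c)
    evidence = sum(joint.values())                                    # P(data)
    posterior = {c: joint[c] / evidence for c in joint}
    print(posterior)  # P(C1 | 001) comes out to roughly 0.30, so C2 is the better guess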


Motivating Naïve Bayes Example

Naive Bayes Classifier

This works because the coin flips are independent given the coin parameter. What about this case:

• want to identify the type of fruit given a set of features: color, shape, and size

• color: red, green, yellow or orange (discrete)

• shape: round, oval or long+skinny (discrete)

• size: diameter in inches (continuous)


Motivating Naïve Bayes Example

Naive Bayes Classifier

Conditioned on type of fruit, these features are not necessarily independent:

Given category “apple,” the color “green” has a higher probability given “size < 2”:

P(green | size < 2, apple) > P(green | apple)


Motivating Naïve Bayes Example

Naive Bayes Classifier

Using the chain rule,

P(apple | green, round, size = 2)

    = P(green, round, size = 2 | apple) P(apple) / ∑_j P(green, round, size = 2 | fruit_j) P(fruit_j)

    ∝ P(green | round, size = 2, apple) P(round | size = 2, apple) × P(size = 2 | apple) P(apple)

But computing these conditional probabilities is hard! There are many combinations of (color, shape, size) for each fruit.


Motivating Naïve Bayes Example

Naive Bayes Classifier

Idea: assume conditional independence for all features given class,

P(green | round, size = 2, apple) = P(green | apple)

P(round | green, size = 2, apple) = P(round | apple)

P(size = 2 | green, round, apple) = P(size = 2 | apple)


Estimating Probability Distributions

Outline

1 Classification

2 Motivating Naïve Bayes Example

3 Estimating Probability Distributions

4 Naïve Bayes Definition


Estimating Probability Distributions

How do we estimate a probability?

• Suppose we want to estimate P(w_n = “buy” | y = SPAM).

buy buy nigeria opportunity viagra
nigeria opportunity viagra fly money
fly buy nigeria fly buy
money buy fly nigeria viagra

• Maximum likelihood (ML) estimate of the probability is:

β̂_i = n_i / ∑_k n_k    (1)

• Is this reasonable?
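
As a quick sanity check, here is a sketch of mine (not part of the slides) that computes this ML estimate from the four SPAM documents listed above:

    # Maximum likelihood estimate of P(w = "buy" | y = SPAM): count and normalize.
    from collections import Counter

    spam_docs = [
        "buy buy nigeria opportunity viagra",
        "nigeria opportunity viagra fly money",
        "fly buy nigeria fly buy",
        "money buy fly nigeria viagra",
    ]

    counts = Counter(w for doc in spam_docs for w in doc.split())
    total = sum(counts.values())                  # 20 tokens in SPAM
    print(counts["buy"], "/", total)              # 5 / 20
    print("ML estimate:", counts["buy"] / total)  # 0.25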


Estimating Probability Distributions

The problem with maximum likelihood estimates: Zeros

• If there were no occurrences of “bagel” in documents in class SPAM, we’d get a zero estimate:

P̂(“bagel” | SPAM) = T_{SPAM, “bagel”} / ∑_{w′∈V} T_{SPAM, w′} = 0

• → We will get P(SPAM | d) = 0 for any document that contains “bagel”!

• Zero probabilities cannot be conditioned away.


Estimating Probability Distributions

How do we estimate a probability?

• For many applications, we often have a prior notion of what our probability distributions are going to look like (for example, non-zero, sparse, uniform, etc.).

• This estimate of a probability distribution is called the maximum a posteriori (MAP) estimate:

β_MAP = argmax_β f(x | β) g(β)    (2)


Estimating Probability Distributions

How do we estimate a probability?

• For a multinomial distribution (i.e. a discrete distribution, like over words):

β_i = (n_i + α_i) / ∑_k (n_k + α_k)    (3)

• α_i is called a smoothing factor, a pseudocount, etc.

• When α_i = 1 for all i, it’s called “Laplace smoothing” and corresponds to a uniform prior over all multinomial distributions (just do this).

• To geek out, the set {α_1, . . . , α_N} parameterizes a Dirichlet distribution, which is itself a distribution over distributions and is the conjugate prior of the multinomial (don’t need to know this).
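
To illustrate, a small sketch of mine (not from the slides) applies add-one (Laplace) smoothing to the SPAM counts from the earlier example, assuming a toy vocabulary that also contains “bagel” with zero occurrences:

    # Add-one (Laplace) smoothed estimates: beta_i = (n_i + 1) / sum_k (n_k + 1).
    from collections import Counter

    spam_tokens = ("buy buy nigeria opportunity viagra "
                   "nigeria opportunity viagra fly money "
                   "fly buy nigeria fly buy "
                   "money buy fly nigeria viagra").split()

    vocab = sorted(set(spam_tokens) | {"bagel"})   # "bagel" never occurs in SPAM
    counts = Counter(spam_tokens)

    denom = sum(counts[w] + 1 for w in vocab)      # 20 tokens + |V| pseudocounts
    smoothed = {w: (counts[w] + 1) / denom for w in vocab}

    print(smoothed["buy"])    # (5 + 1) / (20 + 7) ≈ 0.222
    print(smoothed["bagel"])  # (0 + 1) / (20 + 7) ≈ 0.037, no longer zero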


Naïve Bayes Definition

Outline

1 Classification

2 Motivating Naïve Bayes Example

3 Estimating Probability Distributions

4 Naïve Bayes Definition


Naïve Bayes Definition

The Naïve Bayes classifier

• The Naïve Bayes classifier is a probabilistic classifier.

• We compute the probability of a document d being in a class c as follows:

P(c | d) ∝ P(c) ∏_{1≤i≤n_d} P(w_i | c)

• n_d is the length of the document (number of tokens)

• P(w_i | c) is the conditional probability of term w_i occurring in a document of class c

• P(w_i | c) is a measure of how much evidence w_i contributes that c is the correct class.

• P(c) is the prior probability of c.

• If a document’s terms do not provide clear evidence for one class vs. another, we choose the c with higher P(c).
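
Read literally, the scoring rule is a one-liner. The sketch below is mine, not from the slides; prior and cond stand for the estimated tables P̂(c) and P̂(w | c), which a later slide shows how to compute with smoothing (without smoothing, any unseen word zeroes out the whole product).

    # Naive Bayes score for one class: P(c | d) is proportional to
    # P(c) times the product of P(w_i | c) over the document's tokens.
    def nb_score(doc_tokens, c, prior, cond):
        """prior[c] = P(c); cond[c][w] = P(w | c), both estimated from training data."""
        score = prior[c]
        for w in doc_tokens:
            score *= cond[c].get(w, 0.0)  # an unseen word makes the score 0 unless smoothing is used
        return score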


Naïve Bayes Definition

Maximum a posteriori class

• Our goal is to find the “best” class.

• The best class in Naïve Bayes classification is the most likely or maximum a posteriori (MAP) class c_map:

c_map = argmax_{c_j∈C} P̂(c_j | d) = argmax_{c_j∈C} P̂(c_j) ∏_{1≤i≤n_d} P̂(w_i | c_j)

• We write P̂ for P since these values are estimates from the training set.


Naïve Bayes Definition

Naïve Bayes conditional independence assumption

To reduce the number of parameters to a manageable size, recall the Naïve Bayes conditional independence assumption:

P(d | c_j) = P(⟨w_1, . . . , w_{n_d}⟩ | c_j) = ∏_{1≤i≤n_d} P(X_i = w_i | c_j)

We assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(X_i = w_i | c_j).

Our estimates for these priors and conditional probabilities:

P̂(c_j) = (N_{c_j} + 1) / (N + |C|)    and    P̂(w | c) = (T_cw + 1) / (∑_{w′∈V} T_cw′ + |V|)
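
Here is a sketch of mine (not from the slides) of these two smoothed estimates; the four-document training set is a toy placeholder, not data from the lecture:

    # Add-one smoothed Naive Bayes estimates:
    #   P_hat(c)   = (N_c + 1) / (N + |C|)
    #   P_hat(w|c) = (T_cw + 1) / (sum_w' T_cw' + |V|)
    from collections import Counter, defaultdict

    train = [
        ("buy viagra nigeria buy", "SPAM"),
        ("meeting notes attached", "HAM"),
        ("nigeria opportunity money", "SPAM"),
        ("lunch at noon", "HAM"),
    ]

    classes = sorted({c for _, c in train})
    vocab = sorted({w for d, _ in train for w in d.split()})
    doc_counts = Counter(c for _, c in train)                 # N_c
    token_counts = defaultdict(Counter)                       # T_cw
    for d, c in train:
        token_counts[c].update(d.split())

    N = len(train)
    prior = {c: (doc_counts[c] + 1) / (N + len(classes)) for c in classes}
    cond = {c: {w: (token_counts[c][w] + 1) /
                   (sum(token_counts[c].values()) + len(vocab))
                for w in vocab}
            for c in classes}

    print(prior["SPAM"])        # (2 + 1) / (4 + 2) = 0.5
    print(cond["SPAM"]["buy"])  # (2 + 1) / (7 + 11) ≈ 0.167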


Naïve Bayes Definition

Implementation Detail: Taking the log

• Multiplying lots of small probabilities can result in floating point underflow.

• From last time: lg is logarithm base 2; ln is logarithm base e.

lg x = a ⇔ 2^a = x        ln x = a ⇔ e^a = x    (4)

• Since ln(xy) = ln(x) + ln(y), we can sum log probabilities instead of multiplying probabilities.

• Since ln is a monotonic function, the class with the highest score does not change.

• So what we usually compute in practice is:

c_map = argmax_{c_j∈C} [ P̂(c_j) ∏_{1≤i≤n_d} P̂(w_i | c_j) ] = argmax_{c_j∈C} [ ln P̂(c_j) + ∑_{1≤i≤n_d} ln P̂(w_i | c_j) ]
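
Putting the pieces together, a sketch of mine (not from the slides) of this log-space decision rule; it assumes prior and cond tables like the ones estimated in the earlier training sketch:

    # MAP prediction in log space: argmax_c [ ln P_hat(c) + sum_i ln P_hat(w_i | c) ].
    import math

    def predict(doc, prior, cond):
        """Return the class maximizing the log posterior score of the document."""
        tokens = doc.split()
        best_class, best_score = None, float("-inf")
        for c in prior:
            score = math.log(prior[c])
            for w in tokens:
                if w in cond[c]:          # words outside the vocabulary are simply skipped here
                    score += math.log(cond[c][w])
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    # e.g. predict("buy nigeria money", prior, cond) would return "SPAM"
    # with the toy tables estimated above.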
