
Classification: Naïve Bayes and Logistic Regression

Machine Learning: Jordan Boyd-Graber, University of Colorado Boulder, Lecture 2A

Slides adapted from Hinrich Schütze and Lauren Hannah


By the end of today . . .

• You’ll be able to frame many machine learning tasks as classification problems

• Apply logistic regression (given weights) to classify data

• Learn Naïve Bayes from data


Classification

Outline

1 Classification

2 Motivating Naïve Bayes Example

3 Estimating Probability Distributions

4 Naïve Bayes Definition


Classification

Probabilistic Classification

Given:

• A universe X our examples can come from (e.g., English documents with a predefined vocabulary)
  ◦ Examples are represented in this space.

• A fixed set of labels y ∈ C = {c1, c2, ..., cJ}
  ◦ The classes are human-defined for the needs of an application (e.g., spam vs. ham).

• A training set D of labeled documents {(x1, y1), ..., (xN, yN)}

We learn a classifier γ that maps documents to class probabilities:

  γ : (x, y) → [0, 1]   such that   Σ_y γ(x, y) = 1
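As a minimal sketch (my own illustration, not from the slides; the `score` function is a hypothetical placeholder for a real model), here is a γ that satisfies the sum-to-one constraint:

    # Sketch: gamma maps (document, label) pairs to probabilities that sum to 1.
    LABELS = ["spam", "ham"]

    def score(x, y):
        """Hypothetical unnormalized score; a trained model would supply this."""
        return 1.0

    def gamma(x):
        """Return P(y | x) for every label, normalized so the values sum to 1."""
        scores = {y: score(x, y) for y in LABELS}
        total = sum(scores.values())
        return {y: s / total for y, s in scores.items()}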


Classification

Generative vs. Discriminative Models

Generative

Model the joint probability p(x, y), including the data x.

Naïve Bayes

• Uses Bayes rule to reverse conditioning: p(x | y) → p(y | x)

• Naïve because it ignores joint probabilities within the data distribution

Discriminative

Model only the conditional probability p(y | x), excluding the data x.

Logistic regression

• Logistic: a special mathematical function it uses

• Regression: combines a weight vector with observations to create an answer

• A general cookbook for building conditional probability distributions


Motivating Naïve Bayes Example

Outline

1 Classification

2 Motivating Naïve Bayes Example

3 Estimating Probability Distributions

4 Naïve Bayes Definition


Motivating Naïve Bayes Example

A Classification Problem

• Suppose that I have two coins, C1 and C2

• Now suppose I pull a coin out of my pocket, flip it a bunch of times, record the coin and outcomes, and repeat many times:

  C1: 0 1 1 1 1
  C1: 1 1 0
  C2: 1 0 0 0 0 0 0 1
  C1: 0 1
  C1: 1 1 0 1 1 1
  C2: 0 0 1 1 0 1
  C2: 1 0 0 0

• Now suppose I am given a new sequence, 0 0 1; which coin is it from?


Motivating Naïve Bayes Example

A Classification Problem

This problem has particular challenges:

• different numbers of covariates for each observation

• number of covariates can be large

However, there is some structure:

• Easy to get P(C1) = 4/7, P(C2) = 3/7

• Also easy to get P(Xi = 1 | C1) = 12/16 and P(Xi = 1 | C2) = 6/18

• By conditional independence,

  P(X = 010 | C1) = P(X1 = 0 | C1) P(X2 = 1 | C1) P(X3 = 0 | C1)

• Can we use these to get P(C1 | X = 001)?


Motivating Naïve Bayes Example

A Classification Problem

Summary: have P(data | class), want P(class | data)

Solution: Bayes’ rule!

  P(class | data) = P(data | class) P(class) / P(data)

                  = P(data | class) P(class) / Σ_{class'=1..C} P(data | class') P(class')

To compute, we need to estimate P(data | class) and P(class) for all classes.
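To make this concrete, here is a minimal sketch (my own illustration, not part of the slides) that plugs the coin estimates above into Bayes’ rule for the query sequence 0 0 1:

    # Sketch: apply Bayes' rule to the coin example using the slide's estimates.
    prior = {"C1": 4/7, "C2": 3/7}          # P(class), from the sequence counts
    p_heads = {"C1": 12/16, "C2": 6/18}     # P(X_i = 1 | class), from the flip counts

    def likelihood(seq, coin):
        """P(sequence | coin) under conditional independence of the flips."""
        p = p_heads[coin]
        prob = 1.0
        for flip in seq:
            prob *= p if flip == 1 else (1 - p)
        return prob

    seq = [0, 0, 1]
    joint = {c: likelihood(seq, c) * prior[c] for c in prior}
    evidence = sum(joint.values())                    # P(data), the denominator
    posterior = {c: joint[c] / evidence for c in joint}
    print(posterior)

The posterior comes out to roughly 0.30 for C1 and 0.70 for C2, so under these estimates the sequence more plausibly came from coin 2.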


Motivating Naïve Bayes Example

Naïve Bayes Classifier

This works because the coin flips are independent given the coin parameter. What about this case:

• want to identify the type of fruit given a set of features: color, shape, and size

• color: red, green, yellow, or orange (discrete)

• shape: round, oval, or long+skinny (discrete)

• size: diameter in inches (continuous)


Motivating Naïve Bayes Example

Naïve Bayes Classifier

Conditioned on the type of fruit, these features are not necessarily independent:

Given the category “apple,” the color “green” has a higher probability given “size < 2”:

  P(green | size < 2, apple) > P(green | apple)


Motivating Naïve Bayes Example

Naïve Bayes Classifier

Using Bayes’ rule and the chain rule,

  P(apple | green, round, size = 2)

    = P(green, round, size = 2 | apple) P(apple) / Σ_j P(green, round, size = 2 | fruit_j) P(fruit_j)

    ∝ P(green | round, size = 2, apple) P(round | size = 2, apple) × P(size = 2 | apple) P(apple)

But computing these conditional probabilities is hard! There are many combinations of (color, shape, size) for each fruit.


Motivating Naïve Bayes Example

Naïve Bayes Classifier

Idea: assume conditional independence for all features given the class:

  P(green | round, size = 2, apple) = P(green | apple)

  P(round | green, size = 2, apple) = P(round | apple)

  P(size = 2 | green, round, apple) = P(size = 2 | apple)
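As an illustration (not from the slides; the probability tables below are made-up numbers, and the continuous size feature is omitted for brevity, since it would need a density), the naïve assumption lets us score a fruit by multiplying per-feature conditionals:

    # Sketch: Naive Bayes scoring for the fruit example.
    # All probabilities here are hypothetical, for illustration only.
    prior = {"apple": 0.5, "banana": 0.5}
    p_color = {"apple": {"green": 0.4, "red": 0.5, "yellow": 0.1},
               "banana": {"green": 0.1, "red": 0.0, "yellow": 0.9}}
    p_shape = {"apple": {"round": 0.9, "long": 0.1},
               "banana": {"round": 0.05, "long": 0.95}}

    def score(fruit, color, shape):
        """P(color | fruit) * P(shape | fruit) * P(fruit), up to normalization."""
        return p_color[fruit][color] * p_shape[fruit][shape] * prior[fruit]

    print(score("apple", "green", "round"))   # 0.18
    print(score("banana", "green", "round"))  # 0.0025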


Estimating Probability Distributions

Outline

1 Classification

2 Motivating Naïve Bayes Example

3 Estimating Probability Distributions

4 Naïve Bayes Definition


Estimating Probability Distributions

How do we estimate a probability?

• Suppose we want to estimate P(wn = “buy” | y = SPAM).

  buy buy nigeria opportunity viagra
  nigeria opportunity viagra fly money
  fly buy nigeria fly buy
  money buy fly nigeria viagra

• The maximum likelihood (ML) estimate of the probability is:

  β̂_i = n_i / Σ_k n_k    (1)

• Is this reasonable?
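As a quick sketch (my own illustration, not from the slides), counting the toy SPAM corpus above gives the ML estimate directly:

    from collections import Counter

    # The four toy SPAM documents from the slide.
    spam_docs = ["buy buy nigeria opportunity viagra",
                 "nigeria opportunity viagra fly money",
                 "fly buy nigeria fly buy",
                 "money buy fly nigeria viagra"]

    counts = Counter(w for doc in spam_docs for w in doc.split())
    total = sum(counts.values())      # sum_k n_k = 20 tokens
    p_buy = counts["buy"] / total     # n_buy / sum_k n_k
    print(p_buy)                      # 5/20 = 0.25

As the next slide shows, this raw count estimate breaks down for words that never occur in the training data.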


Estimating Probability Distributions

The problem with maximum likelihood estimates: Zeros

• If there were no occurrences of “bagel” in documents in class SPAM, we’d get a zero estimate:

  P̂(“bagel” | SPAM) = T_{SPAM, “bagel”} / Σ_{w' ∈ V} T_{SPAM, w'} = 0

• → We will get P(SPAM | d) = 0 for any document that contains “bagel”!

• Zero probabilities cannot be conditioned away.


Estimating Probability Distributions

How do we estimate a probability?

• For many applications, we often have a prior notion of what our probability distributions are going to look like (for example, non-zero, sparse, uniform, etc.).

• This estimate of a probability distribution is called the maximum a posteriori (MAP) estimate:

  β_MAP = argmax_β f(x | β) g(β)    (2)


Estimating Probability Distributions

How do we estimate a probability?

• For a multinomial distribution (i.e., a discrete distribution, like one over words):

  β_i = (n_i + α_i) / Σ_k (n_k + α_k)    (3)

• α_i is called a smoothing factor, a pseudocount, etc.

• When α_i = 1 for all i, it’s called “Laplace smoothing” and corresponds to a uniform prior over all multinomial distributions (just do this).

• To geek out: the set {α_1, ..., α_N} parameterizes a Dirichlet distribution, which is itself a distribution over distributions and is the conjugate prior of the multinomial (don’t need to know this).
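A minimal sketch (my own illustration, not from the slides) of equation (3) with Laplace smoothing, showing that the unseen word “bagel” no longer gets probability zero:

    from collections import Counter

    spam_docs = ["buy buy nigeria opportunity viagra",
                 "nigeria opportunity viagra fly money",
                 "fly buy nigeria fly buy",
                 "money buy fly nigeria viagra"]
    counts = Counter(w for doc in spam_docs for w in doc.split())
    vocab = set(counts) | {"bagel"}          # vocabulary includes unseen words

    def smoothed(word, alpha=1.0):
        """(n_i + alpha) / sum_k (n_k + alpha), i.e., equation (3)."""
        total = sum(counts[w] + alpha for w in vocab)
        return (counts[word] + alpha) / total

    print(smoothed("buy"))    # (5 + 1) / (20 + 7) ≈ 0.222
    print(smoothed("bagel"))  # (0 + 1) / (20 + 7) ≈ 0.037, not zero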


Naïve Bayes Definition

Outline

1 Classification

2 Motivating Naïve Bayes Example

3 Estimating Probability Distributions

4 Naïve Bayes Definition


Naïve Bayes Definition

The Naïve Bayes classifier

• The Naïve Bayes classifier is a probabilistic classifier.

• We compute the probability of a document d being in a class c as follows:

  P(c | d) ∝ P(c) Π_{1 ≤ i ≤ n_d} P(w_i | c)

• n_d is the length of the document (number of tokens).

• P(w_i | c) is the conditional probability of term w_i occurring in a document of class c.

• P(w_i | c) is a measure of how much evidence w_i contributes that c is the correct class.

• P(c) is the prior probability of c.

• If a document’s terms do not provide clear evidence for one class vs. another, we choose the c with the higher P(c).


Naïve Bayes Definition

Maximum a posteriori class

• Our goal is to find the “best” class.

• The best class in Naïve Bayes classification is the most likely, or maximum a posteriori (MAP), class c_MAP:

  c_MAP = argmax_{c_j ∈ C} P̂(c_j | d) = argmax_{c_j ∈ C} P̂(c_j) Π_{1 ≤ i ≤ n_d} P̂(w_i | c_j)

• We write P̂ for P since these values are estimates from the training set.


Naïve Bayes Definition

Naïve Bayes conditional independence assumption

To reduce the number of parameters to a manageable size, recall the Naïve Bayes conditional independence assumption:

  P(d | c_j) = P(⟨w_1, ..., w_{n_d}⟩ | c_j) = Π_{1 ≤ i ≤ n_d} P(X_i = w_i | c_j)

We assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(X_i = w_i | c_j).

Our estimates for these priors and conditional probabilities (with add-one smoothing):

  P̂(c_j) = (N_c + 1) / (N + |C|)

  P̂(w | c) = (T_{cw} + 1) / (Σ_{w' ∈ V} T_{cw'} + |V|)
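A minimal training sketch (my own illustration, using the estimators above; the toy documents at the bottom are made up):

    from collections import Counter

    def train_nb(docs, labels):
        """Estimate priors P(c) and conditionals P(w|c) with add-one smoothing."""
        classes = set(labels)
        vocab = {w for d in docs for w in d.split()}
        n_docs = Counter(labels)                       # N_c per class
        word_counts = {c: Counter() for c in classes}  # T_cw per class
        for d, c in zip(docs, labels):
            word_counts[c].update(d.split())

        prior = {c: (n_docs[c] + 1) / (len(docs) + len(classes)) for c in classes}
        cond = {c: {w: (word_counts[c][w] + 1) /
                       (sum(word_counts[c].values()) + len(vocab))
                    for w in vocab}
                for c in classes}
        return prior, cond, vocab

    prior, cond, vocab = train_nb(
        ["buy viagra now", "meeting at noon", "buy buy buy"],
        ["spam", "ham", "spam"])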


Naïve Bayes Definition

Implementation Detail: Taking the log

• Multiplying lots of small probabilities can result in floating point underflow.

• From last time: lg is logarithm base 2; ln is logarithm base e.

  lg x = a ⇔ 2^a = x        ln x = a ⇔ e^a = x    (4)

• Since ln(xy) = ln(x) + ln(y), we can sum log probabilities instead of multiplying probabilities.

• Since ln is a monotonic function, the class with the highest score does not change.

• So what we usually compute in practice is:

  c_MAP = argmax_{c_j ∈ C} [ P̂(c_j) Π_{1 ≤ i ≤ n_d} P̂(w_i | c_j) ]

        = argmax_{c_j ∈ C} [ ln P̂(c_j) + Σ_{1 ≤ i ≤ n_d} ln P̂(w_i | c_j) ]
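To close the loop, a prediction sketch (again my own illustration) that works in log space, reusing the prior, cond, and vocab from the training sketch above:

    import math

    def classify(doc, prior, cond, vocab):
        """Return the MAP class using summed log probabilities."""
        best_class, best_score = None, -math.inf
        for c in prior:
            score = math.log(prior[c])
            for w in doc.split():
                if w in vocab:                  # ignore out-of-vocabulary words
                    score += math.log(cond[c][w])
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    print(classify("buy viagra", prior, cond, vocab))  # -> "spam"

Because ln is monotonic, this returns the same class as multiplying the raw probabilities, without any risk of underflow.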
