
Page 1: Naïve Bayes (Continued)

Naïve Bayes & Logistic Regression — see class website:
Mitchell’s chapter (required)
Ng & Jordan ’02 (optional)
Gradient ascent and extensions: Koller & Friedman Chapter 1.4

Naïve Bayes (Continued)
Naïve Bayes with continuous variables
Logistic Regression

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
January 30th, 2006

Page 2: Naïve Bayes (Continued)

Announcements

Recitations stay on Thursdays, 5-6:30pm in Wean 5409
This week: Naïve Bayes & Logistic Regression

Extension for the first homework: due Wed. Feb 8th, at the beginning of class
Mitchell’s chapter is the most useful reading

Go to the AI seminar: Tuesdays 3:30pm, Wean 5409
http://www.cs.cmu.edu/~aiseminar/
This week’s seminar is very relevant to what we are covering in class

Page 3: Naïve Bayes (Continued)

Classification

Learn: h: X → Y
X – features
Y – target classes

Suppose you know P(Y|X) exactly, how should you classify?

Bayes classifier: h_Bayes(x) = arg max_y P(Y=y | X=x)

Why?

Page 4: Naïve Bayes (Continued)

Optimal classification

Theorem: the Bayes classifier h_Bayes is optimal!

That is, for any other classifier h, the expected error of h_Bayes is no larger than the error of h.
Proof:

Page 5: Naïve Bayes (Continued)

How hard is it to learn the optimal classifier?

Data = a set of N labeled training examples {(x_j, y_j) : j = 1, …, N}

How do we represent these? How many parameters?
Prior, P(Y):
Suppose Y is composed of k classes → k − 1 independent parameters

Likelihood, P(X|Y):
Suppose X is composed of n binary features → k(2^n − 1) independent parameters

Complex model → High variance with limited data!!!

Page 6: Naïve Bayes (Continued)

Conditional Independence

X is conditionally independent of Y given Z, if the probability distribution governing X is independent of the value of Y, given the value of Z

e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
(∀ x, y, z: P(X=x | Y=y, Z=z) = P(X=x | Z=z))

Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z)

Page 7: Naïve Bayes (Continued)

The Naïve Bayes assumption

Naïve Bayes assumption: features are independent given the class:
P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

More generally: P(X1, …, Xn | Y) = ∏i P(Xi | Y)

How many parameters now?
Suppose X is composed of n binary features → only k·n likelihood parameters (plus k − 1 for the prior), instead of k(2^n − 1)

Page 8: Naïve Bayes (Continued)

The Naïve Bayes Classifier

Given:
Prior P(Y)
n conditionally independent features X1, …, Xn given the class Y
For each Xi, we have the likelihood P(Xi|Y)

Decision rule (see the sketch below):
y* = h_NB(x) = arg max_y P(y) ∏i P(xi | y)

If the assumption holds, NB is the optimal classifier!
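
As a quick illustration (not from the lecture), the decision rule can be sketched in a few lines of Python; the log_prior and log_likelihood tables are assumed to have been estimated as on the following slides, and sums of logs are used instead of products to avoid numerical underflow.

```python
import math

def nb_classify(x, classes, log_prior, log_likelihood):
    """Naive Bayes decision rule: argmax_y [ log P(y) + sum_i log P(x_i | y) ].

    x              -- tuple of observed feature values (x_1, ..., x_n)
    classes        -- iterable of class labels
    log_prior      -- dict mapping y to log P(Y = y)
    log_likelihood -- dict mapping (i, x_i, y) to log P(X_i = x_i | Y = y)
    """
    best_y, best_score = None, -math.inf
    for y in classes:
        score = log_prior[y] + sum(log_likelihood[(i, xi, y)] for i, xi in enumerate(x))
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```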

Page 9: Naïve Bayes (Continued)

MLE for the parameters of NB

Given a dataset, let Count(A=a, B=b) ← number of examples where A=a and B=b

MLE for NB is simply:
Prior: P(Y=y) = Count(Y=y) / N

Likelihood: P(Xi=xi | Y=y) = Count(Xi=xi, Y=y) / Count(Y=y)

Page 10: Naïve Bayes (Continued)

Subtleties of NB classifier 1 – Violating the NB assumption

Usually, features are not conditionally independent:
P(X1, …, Xn | Y) ≠ ∏i P(Xi | Y)

Thus, in NB, the actual probabilities P(Y|X) are often biased towards 0 or 1 (see homework 1)
Nonetheless, NB is the single most used classifier out there

NB often performs well, even when the assumption is violated
[Domingos & Pazzani ’96] discuss some conditions for good performance

Page 11: Naïve Bayes (Continued)

Subtleties of NB classifier 2 – Insufficient training data
What if you never see a training instance where X1=a when Y=b?

e.g., Y = {SpamEmail}, X1 = {‘Enlargement’}
P(X1=a | Y=b) = 0

Thus, no matter what values X2,…,Xn take:
P(Y=b | X1=a, X2,…,Xn) = 0

What now???

Page 12: Naïve Bayes (Continued)

MAP for Beta distribution

MAP: use the most likely parameter: θ̂ = arg max_θ P(θ | D)

Beta prior is equivalent to extra thumbtack flips
As N → ∞, the prior is “forgotten”
But for small sample sizes, the prior is important!
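
For reference, a hedged reconstruction of the closed form that goes here: assuming the prior is parameterized as Beta(βH, βT) and the data contain αH heads and αT tails (the β counts acting as the “virtual” flips), the posterior is Beta(αH + βH, αT + βT) and its mode gives

```latex
\hat{\theta}_{MAP} \;=\; \arg\max_{\theta} P(\theta \mid \mathcal{D})
  \;=\; \frac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H + \alpha_T + \beta_T - 2}
```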

Page 13: Naïve Bayes (Continued)

Bayesian learning for NB parameters – a.k.a. smoothing

Dataset of N examples
Prior: “distribution” Q(Xi, Y), Q(Y); m “virtual” examples

MAP estimate of P(Xi|Y)

Now, even if you never observe a feature/class combination, the posterior probability is never zero
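
A minimal sketch of the idea, assuming the simplest choice of prior “distribution” — a uniform Q, which reduces to add-m (Laplace-style) smoothing; the exact Q used in lecture may differ, and setting m = 0 recovers the plain MLE from the earlier slide.

```python
def smoothed_likelihood(count_xy, count_y, num_values, m=1.0):
    """MAP-style estimate of P(X_i = x | Y = y) with m 'virtual' examples
    spread uniformly over the num_values possible values of X_i.

    count_xy   -- Count(X_i = x, Y = y) in the training data
    count_y    -- Count(Y = y) in the training data
    num_values -- number of distinct values X_i can take
    m          -- equivalent sample size of the prior (m = 0 gives the MLE)
    """
    return (count_xy + m / num_values) / (count_y + m)

# A word never seen with Y = spam still gets a small nonzero probability:
print(smoothed_likelihood(count_xy=0, count_y=100, num_values=2, m=1.0))  # ~0.005
```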

Page 14: Naïve Bayes (Continued)

Text classification

Classify e-mails
Y = {Spam, NotSpam}

Classify news articles
Y = {what is the topic of the article?}

Classify webpages
Y = {Student, professor, project, …}

What about the features X?
The text!

Page 15: Naïve Bayes (Continued)

Features X are the entire document – Xi is the ith word in the article

Page 16: Naïve Bayes (Continued)

NB for Text classification

P(X|Y) is huge!!!
Article of at least 1000 words: X = {X1,…,X1000}
Xi represents the ith word in the document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster Dictionary (or more), 10,000 words, etc.

NB assumption helps a lot!!!
P(Xi=xi|Y=y) is just the probability of observing word xi in a document on topic y

Page 17: Naïve Bayes (Continued)

Bag of words model

Typical additional assumption – Position in document doesn’t matter: P(Xi=xi|Y=y) = P(Xk=xi|Y=y)

“Bag of words” model – order of words on the page ignored
Sounds really silly, but often works very well!

When the lecture is over, remember to wake up the person sitting next to you in the lecture room.

Page 18: Naïve Bayes (Continued)

Bag of words model

Typical additional assumption – Position in document doesn’t matter: P(Xi=xi|Y=y) = P(Xk=xi|Y=y)

“Bag of words” model – order of words on the page ignored
Sounds really silly, but often works very well!

in is lecture lecture next over person remember room sitting the the the to to up wake when you

Page 19: Naïve Bayes (Continued)

Bag of Words Approach

aardvark 0

about 2

all 2

Africa 1

apple 0

anxious 0

...

gas 1

...

oil 1

Zaire 0

Page 20: Naïve Bayes (Continued)

NB with Bag of Words for text classification

Learning phase:
Prior P(Y)

Count how many documents you have from each topic (+ prior)

P(Xi|Y): for each topic, count how many times you saw the word in documents of this topic (+ prior)

Test phase:
For each document, use the naïve Bayes decision rule (sketched below)
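
A small end-to-end sketch of both phases, assuming a shared vocabulary and add-one smoothing; the function and variable names are illustrative, not from the lecture.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents, labels):
    """Learning phase: count documents per topic and word occurrences per topic."""
    doc_counts = Counter(labels)                 # for the prior P(Y)
    word_counts = defaultdict(Counter)           # for the likelihoods P(X_i | Y)
    vocab = set()
    for words, y in zip(documents, labels):
        word_counts[y].update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab

def classify_nb(words, doc_counts, word_counts, vocab):
    """Test phase: naive Bayes decision rule in log space, with add-one smoothing."""
    n_docs = sum(doc_counts.values())
    best_y, best_score = None, -math.inf
    for y, n_y in doc_counts.items():
        total_y = sum(word_counts[y].values())
        score = math.log(n_y / n_docs)
        for w in words:
            score += math.log((word_counts[y][w] + 1) / (total_y + len(vocab)))
        if score > best_score:
            best_y, best_score = y, score
    return best_y

# usage: documents are lists of word tokens ("bags of words")
docs = [["cheap", "enlargement", "offer"], ["lecture", "notes", "posted"]]
labels = ["Spam", "NotSpam"]
model = train_nb(docs, labels)
print(classify_nb(["cheap", "offer"], *model))  # "Spam"
```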

Page 21: Naïve Bayes (Continued)

Twenty News Groups results

Page 22: Naïve Bayes (Continued)

Learning curve for Twenty News Groups

Page 23: Naïve Bayes (Continued)

What if we have continuous Xi ?

E.g., character recognition: Xi is the ith pixel

Gaussian Naïve Bayes (GNB):
P(Xi=x | Y=yk) = (1 / (σik √(2π))) exp( −(x − µik)² / (2 σik²) )

Sometimes we assume the variance is
independent of Y (i.e., σi),
or independent of Xi (i.e., σk),
or both (i.e., σ)

Page 24: Naïve Bayes (Continued)

Estimating Parameters: Y discrete, Xi continuous

Maximum likelihood estimates (superscript j indexes the jth training example):
µik = Σj Xi^j δ(Y^j = yk) / Σj δ(Y^j = yk)
σik² = Σj (Xi^j − µik)² δ(Y^j = yk) / Σj δ(Y^j = yk)

δ(z) = 1 if z is true, else 0
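
A short numpy sketch of these estimates and of the resulting class-conditional score; names are illustrative, and the variances are the plain MLE (no smoothing or variance sharing).

```python
import numpy as np

def fit_gnb(X, y):
    """MLE for Gaussian naive Bayes: per-class prior, feature means and variances.

    X -- array of shape (N, n) with continuous features
    y -- array of shape (N,) with discrete class labels
    """
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}
    means = {k: X[y == k].mean(axis=0) for k in classes}        # mu_{ik}
    variances = {k: X[y == k].var(axis=0) for k in classes}     # sigma_{ik}^2 (MLE)
    return priors, means, variances

def gnb_log_posterior(x, priors, means, variances):
    """Unnormalized log P(Y=k | x) = log P(k) + sum_i log N(x_i; mu_ik, sigma_ik)."""
    scores = {}
    for k in priors:
        var = variances[k]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - means[k]) ** 2 / var)
        scores[k] = np.log(priors[k]) + log_lik
    return scores

# usage: predict the class with the highest score
# priors, means, variances = fit_gnb(X_train, y_train)
# scores = gnb_log_posterior(x_new, priors, means, variances)
# y_hat = max(scores, key=scores.get)
```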

Page 25: Naïve Bayes (Continued)

Example: GNB for classifying mental states

~1 mm resolution

~2 images per sec.

15,000 voxels/image

non-invasive, safe

measures Blood Oxygen Level Dependent (BOLD) response

Typical impulse response: ~10 sec (figure)

[Mitchell et al.]

Page 26: Naïve Bayes (Continued)

Brain scans can track activation with precision and sensitivity

[Mitchell et al.]

Page 27: Naïve Bayes (Continued)

Gaussian Naïve Bayes: learned µ_voxel,word for P(BrainActivity | WordCategory = {People, Animal})

[Mitchell et al.]

Page 28: Naïve Bayes (Continued)

Learned Bayes models – means for P(BrainActivity | WordCategory)

[Mitchell et al.]
Pairwise classification accuracy: 85%

People words Animal words

Page 29: Naïve Bayes (Continued)

What you need to know about Naïve Bayes

Types of learning problems
Learning is (just) function approximation!

Optimal decision using the Bayes classifier

Naïve Bayes classifier
What’s the assumption
Why we use it
How do we learn it
Why Bayesian estimation is important

Text classification
Bag of words model

Gaussian NB
Features are still conditionally independent
Each feature has a Gaussian distribution given the class

Page 30: Naïve Bayes (Continued)

Generative v. Discriminative classifiers – Intuition

Want to learn: h: X → Y
X – features
Y – target classes

Bayes optimal classifier – P(Y|X)

Generative classifier, e.g., Naïve Bayes:
Assume some functional form for P(X|Y), P(Y)
Estimate parameters of P(X|Y), P(Y) directly from training data
Use Bayes rule to calculate P(Y|X=x)
This is a ‘generative’ model
Indirect computation of P(Y|X) through Bayes rule
But can generate a sample of the data: P(X) = ∑y P(y) P(X|y)

Discriminative classifiers, e.g., Logistic Regression:
Assume some functional form for P(Y|X)
Estimate parameters of P(Y|X) directly from training data
This is the ‘discriminative’ model
Directly learn P(Y|X)
But cannot obtain a sample of the data, because P(X) is not available

Page 31: Naïve Bayes (Continued)

Logistic Regression

Logistic function (or Sigmoid): σ(z) = 1 / (1 + exp(−z))

Learn P(Y|X) directly!
Assume a particular functional form
Sigmoid applied to a linear function of the data: z = w0 + ∑i wi Xi

Page 32: Naïve Bayes (Continued)

Understanding the sigmoid

[Figure: three plots of the sigmoid over x ∈ [−6, 6], y ∈ [0, 1], for the parameter settings w0=0, w1=1; w0=2, w1=1; w0=0, w1=0.5 — w0 shifts the curve and w1 controls how steep the transition is]
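
To see the effect of the parameters numerically, here is a small sketch that evaluates sigmoid(w0 + w1·x) for the three settings in the figure; the sign convention here (larger w0 + w1·x ⇒ output closer to 1) is one common choice and may be mirrored relative to the original plots.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate sigmoid(w0 + w1 * x) for the three parameter settings from the figure.
settings = [(0.0, 1.0), (2.0, 1.0), (0.0, 0.5)]
xs = [-6, -2, 0, 2, 6]
for w0, w1 in settings:
    values = [round(sigmoid(w0 + w1 * x), 3) for x in xs]
    print(f"w0={w0}, w1={w1}: {values}")
# w0 shifts the curve left/right; smaller |w1| makes the transition more gradual.
```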

Page 33: Naïve Bayes (Continued)

Logistic Regression – a linear classifier

[Figure: sigmoid plotted over x ∈ [−6, 6], y ∈ [0, 1]]

Page 34: Naïve Bayes (Continued)

Very convenient!

The predicted class depends on X only through the sign of w0 + ∑i wi Xi

implies

the decision boundary w0 + ∑i wi Xi = 0 is a hyperplane

implies

a linear classification rule!
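
A two-line sketch of the resulting rule; which side of the hyperplane maps to which label is a convention, so treat the returned labels as illustrative.

```python
def lr_decision(x, w0, w):
    """Linear classification rule from logistic regression:
    the predicted class depends only on the sign of w0 + sum_i w_i * x_i."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z > 0 else 0  # which side maps to which label is a convention

print(lr_decision([2.0, -1.0], w0=0.5, w=[1.0, 0.25]))  # 1, since 0.5 + 2 - 0.25 > 0
```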

Page 35: Naïve Bayes (Continued)

Logistic regression more generally

Logistic regression in the more general case, where Y ∈ {Y1, …, YR}: learn R−1 sets of weights (one standard way to write the resulting expressions is sketched below)

for k<R

for k=R (normalization, so no weights for this class)

Features can be discrete or continuous!
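
One standard way to write the two missing expressions, consistent with learning R−1 sets of weights and treating class yR as the normalizing reference class (a reconstruction, so the exact indexing used in lecture may differ):

```latex
% For k < R (these R-1 classes carry weights):
P(Y{=}y_k \mid X) = \frac{\exp\!\big(w_{k0} + \sum_{i=1}^{n} w_{ki} X_i\big)}
                         {1 + \sum_{j=1}^{R-1} \exp\!\big(w_{j0} + \sum_{i=1}^{n} w_{ji} X_i\big)}

% For k = R (the reference class; normalization, so no weights of its own):
P(Y{=}y_R \mid X) = \frac{1}{1 + \sum_{j=1}^{R-1} \exp\!\big(w_{j0} + \sum_{i=1}^{n} w_{ji} X_i\big)}
```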

Page 36: Naïve Bayes (Continued)

Logistic regression v. Naïve Bayes

Consider learning f: X → Y, where
X is a vector of real-valued features <X1, …, Xn>
Y is boolean

Could use a Gaussian Naïve Bayes classifier:
assume all Xi are conditionally independent given Y
model P(Xi | Y = yk) as Gaussian N(µik, σi)
model P(Y) as Bernoulli(θ, 1−θ)

What does that imply about the form of P(Y|X)?

Page 38: Naïve Bayes (Continued)

Derive form for P(Y|X) for continuous Xi
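
The derivation itself was worked on the board; below is a sketch of the standard argument, assuming the GNB model above (class-independent variances σi, prior P(Y=1) = θ) and the convention that larger w0 + ∑i wi Xi means Y=1 is more likely (the lecture’s slides may use the opposite sign).

```latex
% Bayes rule + the NB factorization:
P(Y{=}1 \mid X)
  = \frac{P(Y{=}1)\,P(X \mid Y{=}1)}{P(Y{=}1)\,P(X \mid Y{=}1) + P(Y{=}0)\,P(X \mid Y{=}0)}
  = \frac{1}{1 + \exp\!\Big(\ln\tfrac{P(Y=0)}{P(Y=1)} + \sum_i \ln\tfrac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}\Big)}

% With P(X_i \mid Y{=}k) = N(\mu_{ik}, \sigma_i), each class-conditional log-ratio is LINEAR in X_i:
\ln\frac{P(X_i \mid Y{=}0)}{P(X_i \mid Y{=}1)}
  = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\,X_i + \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}

% So P(Y=1|X) has exactly the logistic form \sigma(w_0 + \sum_i w_i X_i), with
w_i = \frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2}, \qquad
w_0 = \ln\frac{\theta}{1-\theta} + \sum_i \frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2}
```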

Page 39: Naïve Bayes (Continued)

Ratio of class-conditional probabilities

Page 40: Naïve Bayes (Continued)

Derive form for P(Y|X) for continuous Xi

Page 41: Naïve Bayes (Continued)

Gaussian Naïve Bayes v. Logistic Regression

Set of Gaussian Naïve Bayes parameters

Set of Logistic Regression parameters

Representation equivalence
But only in a special case!!! (GNB with class-independent variances)

But what’s the difference???
LR makes no assumptions about P(X|Y) in learning!!!
Loss function!!!

Optimize different functions → Obtain different solutions

Page 42: Naïve Bayes (Continued)

Loss functions: Likelihood v. Conditional Likelihood

Generative (Naïve Bayes) loss function: data likelihood
ln P(D | w) = ∑j ln P(xj, yj | w)

Discriminative models cannot compute P(xj | w)!
But, the discriminative (logistic regression) loss function is the conditional data likelihood:
ln P(D_Y | D_X, w) = ∑j ln P(yj | xj, w)

Doesn’t waste effort learning P(X) – focuses on P(Y|X), all that matters for classification

Page 43: Naïve Bayes (Continued)

Expressing Conditional Log Likelihood
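
The expression was developed on the board; a sketch for boolean labels yj ∈ {0, 1}, under the same sign convention as the GNB derivation above:

```latex
% z_j = w_0 + \sum_i w_i x_i^j, \quad P(Y{=}1 \mid x^j, w) = \sigma(z_j) = \frac{e^{z_j}}{1+e^{z_j}}
l(w) \equiv \sum_j \ln P(y^j \mid x^j, w)
     = \sum_j \Big[ y^j \ln P(Y{=}1 \mid x^j, w) + (1-y^j)\,\ln P(Y{=}0 \mid x^j, w) \Big]
     = \sum_j \Big[ y^j z_j - \ln\!\big(1 + e^{z_j}\big) \Big]
```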

Page 44: Naïve Bayes (Continued)

Maximizing Conditional Log Likelihood

Good news: l(w) is a concave function of w → no locally optimal solutions

Bad news: no closed-form solution to maximize l(w)

Good news: concave functions easy to optimize

Page 45: Naïve Bayes (Continued)

Optimizing a concave function – Gradient ascent

The conditional likelihood for logistic regression is concave → find the optimum with gradient ascent

Gradient ascent is the simplest of optimization approaches
e.g., conjugate gradient ascent is much better (see reading)

Gradient: ∇w l(w) = [∂l(w)/∂w0, …, ∂l(w)/∂wn]

Update rule: Δw = η ∇w l(w), i.e., wi ← wi + η ∂l(w)/∂wi

Learning rate, η > 0

Page 46: Naïve Bayes (Continued)

Maximize Conditional Log Likelihood: Gradient ascent

Gradient ascent algorithm: iterate until change < ε

For all i, take a gradient step on wi (see the sketch below)

repeat
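
A compact sketch of the loop, using the same sign convention as the conditional log likelihood block above (P(Y=1|x,w) = σ(w0 + ∑i wi xi)); the learning rate, stopping threshold, iteration cap, and the optional L2 penalty (the MAP/regularized variant discussed two slides ahead) are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_gradient_ascent(X, y, eta=0.1, eps=1e-6, lam=0.0, max_iters=10_000):
    """Maximize the conditional log likelihood of logistic regression by gradient ascent.

    X   -- array (N, n) of features; y -- array (N,) of 0/1 labels
    eta -- learning rate; eps -- stop when the largest update is smaller than this
    lam -- optional L2 penalty strength (lam > 0 gives the MAP / regularized estimate)
    """
    N, n = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])      # prepend x_0 = 1 for the bias w_0
    w = np.zeros(n + 1)
    for _ in range(max_iters):
        p = sigmoid(Xb @ w)                   # P(Y=1 | x_j, w) for every example
        grad = Xb.T @ (y - p) - lam * w       # gradient of l(w), minus the L2 penalty term
        step = eta * grad
        w += step
        if np.max(np.abs(step)) < eps:
            break
    return w

# usage (toy data): points with a larger feature value tend to have label 1
# X = np.array([[0.1], [0.4], [0.6], [0.9]]); y = np.array([0, 0, 1, 1])
# w = lr_gradient_ascent(X, y)
```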

Page 47: Naïve Bayes (Continued)

That’s all M(C)LE. How about MAP?

One common approach is to define a prior on w:
Normal distribution, zero mean, identity covariance
“Pushes” parameters towards zero

Corresponds to Regularization
Helps avoid very large weights and overfitting
Explore this in your homework
More on this later in the semester

MAP estimate: w* = arg max_w [ ln p(w) + ∑j ln P(yj | xj, w) ]

Page 48: Naïve Bayes (Continued)

Gradient of M(C)AP
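
A sketch of the missing expression, assuming the zero-mean Gaussian prior p(w) ∝ exp(−(λ/2) ∑i wi²) from the previous slide and the same sign convention as before:

```latex
\frac{\partial}{\partial w_i}\Big[\ln p(w) + \sum_j \ln P(y^j \mid x^j, w)\Big]
  = -\lambda\,w_i \;+\; \sum_j x_i^j\,\Big[\, y^j - \hat{P}(Y{=}1 \mid x^j, w) \,\Big]
```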

Page 49: Naïve Bayes (Continued)

MLE vs MAP

Maximum conditional likelihood estimate:
w*_MCLE = arg max_w ∑j ln P(yj | xj, w)

Maximum conditional a posteriori estimate:
w*_MCAP = arg max_w [ ln p(w) + ∑j ln P(yj | xj, w) ]

Page 50: Naïve Bayes (Continued)

What you should know about Logistic Regression (LR)

Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR

The solutions differ because of the objective (loss) function

In general, NB and LR make different assumptions
NB: features independent given class → assumption on P(X|Y)
LR: functional form of P(Y|X), no assumption on P(X|Y)

LR is a linear classifier
decision rule is a hyperplane

LR is optimized by conditional likelihood
no closed-form solution
concave → global optimum with gradient ascent
Maximum conditional a posteriori corresponds to regularization

Page 51: Naïve Bayes (Continued)

Acknowledgements

Some of the material in this presentation is courtesy of Tom Mitchell