Modeling Social Data, Lecture 6: Classification with Naive Bayes


TRANSCRIPT

Page 1: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Classification: Naive Bayes

APAM E4990

Modeling Social Data

Jake Hofman

Columbia University

February 27, 2015


Page 2: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Learning by example

• How did you solve this problem?

• Can you make this process explicit (e.g. write code to do so)?



Page 4: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Diagnoses a la Bayes¹

• You’re testing for a rare disease:
  • 1% of the population is infected

• You have a highly sensitive and specific test:
  • 99% of sick patients test positive
  • 99% of healthy patients test negative

• Given that a patient tests positive, what is the probability that the patient is sick?

¹ Wiggins, SciAm 2006

Page 5: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Diagnoses a la Bayes

Population: 10,000 people
• 1% sick: 100 people
  • 99% test positive: 99 people
  • 1% test negative: 1 person
• 99% healthy: 9,900 people
  • 1% test positive: 99 people
  • 99% test negative: 9,801 people


Page 6: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Diagnoses a la Bayes


So given that a patient tests positive (198 people), there is a 50% chance the patient is sick (99 people)!


Page 7: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Diagnoses a la Bayes


The small error rate on the large healthy population produces many false positives.


Page 8: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Natural frequencies a la Gigerenzer²

² http://bit.ly/ggbbc

Page 9: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Inverting conditional probabilities

Bayes’ Theorem

Equate the far right- and left-hand sides of the product rule

$$p(y \mid x)\, p(x) = p(x, y) = p(x \mid y)\, p(y)$$

and divide to get the probability of y given x from the probability of x given y:

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$

where $p(x) = \sum_{y \in \Omega_Y} p(x \mid y)\, p(y)$ is the normalization constant.


Page 10: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Diagnoses a la Bayes

Given that a patient tests positive, what is the probability that the patient is sick?

$$p(\text{sick} \mid +) = \frac{\overbrace{p(+ \mid \text{sick})}^{99/100}\;\overbrace{p(\text{sick})}^{1/100}}{\underbrace{p(+)}_{99/100^2 \,+\, 99/100^2 \,=\, 198/100^2}} = \frac{99}{198} = \frac{1}{2}$$

where $p(+) = p(+ \mid \text{sick})\, p(\text{sick}) + p(+ \mid \text{healthy})\, p(\text{healthy})$.
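A quick numerical check of this calculation (a minimal sketch in Python, not part of the original slides):

# Check the diagnosis example: P(sick | positive test)
p_sick = 0.01               # prior: 1% of the population is infected
p_pos_given_sick = 0.99     # sensitivity: 99% of sick patients test positive
p_pos_given_healthy = 0.01  # 1% of healthy patients test positive

p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
print(p_pos_given_sick * p_sick / p_pos)  # 0.5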


Page 11: Modeling Social Data, Lecture 6: Classification with Naive Bayes

(Super) Naive Bayes

We can use Bayes’ rule to build a one-word spam classifier:

$$p(\text{spam} \mid \text{word}) = \frac{p(\text{word} \mid \text{spam})\, p(\text{spam})}{p(\text{word})}$$

where we estimate these probabilities with ratios of counts:

$$p(\text{word} \mid \text{spam}) = \frac{\#\,\text{spam docs containing word}}{\#\,\text{spam docs}} \qquad p(\text{word} \mid \text{ham}) = \frac{\#\,\text{ham docs containing word}}{\#\,\text{ham docs}}$$

$$p(\text{spam}) = \frac{\#\,\text{spam docs}}{\#\,\text{docs}} \qquad p(\text{ham}) = \frac{\#\,\text{ham docs}}{\#\,\text{docs}}$$
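A minimal sketch of this one-word classifier in Python, using the document counts printed by enron_naive_bayes.sh on the following slides (the function below is an illustrative stand-in, not the course script):

def p_spam_given_word(n_spam_with_word, n_spam, n_ham_with_word, n_ham):
    # one-word Naive Bayes: P(spam | word) from document counts
    p_spam = n_spam / (n_spam + n_ham)
    p_ham = n_ham / (n_spam + n_ham)
    p_word_given_spam = n_spam_with_word / n_spam
    p_word_given_ham = n_ham_with_word / n_ham
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word

# counts for "meeting" from the next slide: 16 of 1500 spam docs, 153 of 3672 ham docs
print(p_spam_given_word(16, 1500, 153, 3672))  # ~0.095 (the slide's .0923 reflects the script's limited-precision intermediate arithmetic)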


Page 12: Modeling Social Data, Lecture 6: Classification with Naive Bayes

(Super) Naive Bayes

$ ./enron_naive_bayes.sh meeting

1500 spam examples

3672 ham examples

16 spam examples containing meeting

153 ham examples containing meeting

estimated P(spam) = .2900

estimated P(ham) = .7100

estimated P(meeting|spam) = .0106

estimated P(meeting|ham) = .0416

P(spam|meeting) = .0923


Page 13: Modeling Social Data, Lecture 6: Classification with Naive Bayes

(Super) Naive Bayes

$ ./enron_naive_bayes.sh money

1500 spam examples

3672 ham examples

194 spam examples containing money

50 ham examples containing money

estimated P(spam) = .2900

estimated P(ham) = .7100

estimated P(money|spam) = .1293

estimated P(money|ham) = .0136

P(spam|money) = .7957


Page 14: Modeling Social Data, Lecture 6: Classification with Naive Bayes

(Super) Naive Bayes

$ ./enron_naive_bayes.sh enron

1500 spam examples

3672 ham examples

0 spam examples containing enron

1478 ham examples containing enron

estimated P(spam) = .2900

estimated P(ham) = .7100

estimated P(enron|spam) = 0

estimated P(enron|ham) = .4025

P(spam|enron) = 0


Page 15: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Naive Bayes

Represent each document by a binary vector $\vec{x}$ where $x_j = 1$ if the $j$-th word appears in the document ($x_j = 0$ otherwise).

Modeling each word as an independent Bernoulli random variable, the probability of observing a document $\vec{x}$ of class $c$ is:

$$p(\vec{x} \mid c) = \prod_j \theta_{jc}^{x_j} (1 - \theta_{jc})^{1 - x_j}$$

where $\theta_{jc}$ denotes the probability that the $j$-th word occurs in a document of class $c$.
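A minimal sketch of this likelihood in Python (an illustrative helper assuming NumPy, not from the slides):

import numpy as np

def log_likelihood(x, theta_c):
    # log p(x | c) for a binary word-presence vector x and Bernoulli parameters theta_c
    x = np.asarray(x, dtype=float)
    theta_c = np.asarray(theta_c, dtype=float)
    return np.sum(x * np.log(theta_c) + (1 - x) * np.log(1 - theta_c))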


Page 16: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Naive Bayes

Using this likelihood in Bayes’ rule and taking a logarithm, we have:

$$\log p(c \mid \vec{x}) = \log \frac{p(\vec{x} \mid c)\, p(c)}{p(\vec{x})} = \sum_j x_j \log \frac{\theta_{jc}}{1 - \theta_{jc}} + \sum_j \log(1 - \theta_{jc}) + \log \frac{\theta_c}{p(\vec{x})}$$

where $\theta_c$ is the probability of observing a document of class $c$.


Page 17: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Naive Bayes

We can eliminate $p(\vec{x})$ by calculating the log-odds:

$$\log \frac{p(1 \mid \vec{x})}{p(0 \mid \vec{x})} = \sum_j x_j \underbrace{\log \frac{\theta_{j1}(1 - \theta_{j0})}{\theta_{j0}(1 - \theta_{j1})}}_{w_j} + \underbrace{\sum_j \log \frac{1 - \theta_{j1}}{1 - \theta_{j0}} + \log \frac{\theta_1}{\theta_0}}_{w_0}$$

which gives a linear classifier of the form $\vec{w} \cdot \vec{x} + w_0$.


Page 18: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Naive Bayes

We train by counting words and documents within classes to estimate $\theta_{jc}$ and $\theta_c$:

$$\theta_{jc} = \frac{n_{jc}}{n_c} \qquad \theta_c = \frac{n_c}{n}$$

and use these to calculate the weights $w_j$ and bias $w_0$:

$$w_j = \log \frac{\theta_{j1}(1 - \theta_{j0})}{\theta_{j0}(1 - \theta_{j1})} \qquad w_0 = \sum_j \log \frac{1 - \theta_{j1}}{1 - \theta_{j0}} + \log \frac{\theta_1}{\theta_0}.$$

We predict by simply adding the weights of the words that appear in the document to the bias term.
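A compact sketch of this train-and-predict procedure in Python (variable names are illustrative; it uses the unsmoothed estimates above, so zero counts cause the problem that the smoothing slide below addresses):

import numpy as np

def train(X, y):
    # X: binary document-word matrix (rows = documents, columns = words); y: labels in {0, 1}
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # theta_jc = n_jc / n_c: fraction of class-c documents containing word j
    theta0 = X[y == 0].mean(axis=0)
    theta1 = X[y == 1].mean(axis=0)
    theta_1 = np.mean(y == 1)  # theta_c for class 1; class 0 has prior 1 - theta_1
    w = np.log(theta1 * (1 - theta0)) - np.log(theta0 * (1 - theta1))
    w0 = np.sum(np.log(1 - theta1) - np.log(1 - theta0)) + np.log(theta_1 / (1 - theta_1))
    return w, w0

def predict(x, w, w0):
    # classify as class 1 (e.g. spam) when the log-odds w . x + w0 is positive
    return int(np.dot(w, x) + w0 > 0)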


Page 19: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Naive Bayes

In practice, this works better than one might expect given its simplicity³

³ http://www.jstor.org/pss/1403452

Page 20: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Naive Bayes

Training is computationally cheap and scalable, and the model is easy to update given new observations³

³ http://www.springerlink.com/content/wu3g458834583125/

Page 21: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Naive Bayes

Performance varies with document representations and corresponding likelihood models³

³ http://ceas.cc/2006/15.pdf

Page 22: Modeling Social Data, Lecture 6: Classification with Naive Bayes

Naive Bayes

It’s often important to smooth parameter estimates (e.g., by adding pseudocounts) to avoid overfitting

$$\theta_{jc} = \frac{n_{jc} + \alpha}{n_c + \alpha + \beta}$$
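A minimal sketch of this smoothed estimate in Python (α and β are pseudocounts; α = β = 1 gives Laplace smoothing):

def smoothed_theta(n_jc, n_c, alpha=1.0, beta=1.0):
    # smoothed estimate: theta_jc = (n_jc + alpha) / (n_c + alpha + beta)
    return (n_jc + alpha) / (n_c + alpha + beta)

# e.g. "enron", which appears in 0 of 1500 spam docs, no longer forces P(spam|enron) to zero:
print(smoothed_theta(0, 1500))  # ~0.0007 instead of exactly 0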
