Source: santini.se/teaching/ml/2016/lect_06/06_naivebayes.pdf
Naïve Bayes Lecture 6: Self-Study
-----
Marina Santini
Acknowledgements Slides borrowed and adapted from:
Data Mining by I. H. Witten, E. Frank and M. A. Hall
Lecture 6: Required Reading
Daumé III (2015: 53-59; 107-110)
Witten et al. (2011: 90-99; 305-308; 314-315; 322-323; 328-329; 331-332; 334)
Outline
• Naïve Bayes
• Zero-probability problem: smoothing
• Multinomial Naïve Bayes
• Discussion
Statistical modeling
• Use all the attributes
• Two assumptions: attributes are
  ♦ equally important
  ♦ statistically independent (given the class value)
• I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
• The independence assumption is never correct!
• But … this scheme works well in practice
Probabilities for weather data
Counts and relative frequencies for each attribute value, per class:

Outlook:      Sunny     yes 2 (2/9)    no 3 (3/5)
              Overcast  yes 4 (4/9)    no 0 (0/5)
              Rainy     yes 3 (3/9)    no 2 (2/5)
Temperature:  Hot       yes 2 (2/9)    no 2 (2/5)
              Mild      yes 4 (4/9)    no 2 (2/5)
              Cool      yes 3 (3/9)    no 1 (1/5)
Humidity:     High      yes 3 (3/9)    no 4 (4/5)
              Normal    yes 6 (6/9)    no 1 (1/5)
Windy:        False     yes 6 (6/9)    no 2 (2/5)
              True      yes 3 (3/9)    no 3 (3/5)
Play:                   yes 9 (9/14)   no 5 (5/14)

The underlying weather data (14 instances):

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
A new day:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?
Likelihood of the two classes
For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For “no” = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”) = 0.0206 / (0.0053 + 0.0206) = 0.795
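To make the arithmetic concrete, here is a minimal Python sketch (mine, not from the original slides) that reproduces the computation above directly from the table:

```python
# Conditional probabilities read off the table above: Pr[value | class].
cond = {
    ("Sunny", "yes"): 2/9, ("Cool", "yes"): 3/9,
    ("High", "yes"): 3/9,  ("True", "yes"): 3/9,
    ("Sunny", "no"): 3/5,  ("Cool", "no"): 1/5,
    ("High", "no"): 4/5,   ("True", "no"): 3/5,
}
prior = {"yes": 9/14, "no": 5/14}

new_day = ["Sunny", "Cool", "High", "True"]

# Likelihood of each class: product of the conditionals times the prior.
likelihood = {}
for cls in prior:
    p = prior[cls]
    for value in new_day:
        p *= cond[(value, cls)]
    likelihood[cls] = p

# Normalize so the two numbers sum to 1.
total = sum(likelihood.values())
for cls, p in likelihood.items():
    print(f"P({cls}) = {p:.4f} / {total:.4f} = {p / total:.3f}")
# P(yes) = 0.0053 / 0.0259 = 0.205
# P(no)  = 0.0206 / 0.0259 = 0.795
```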
Bayes’s rule

• Probability of event H given evidence E:

  Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]

• A priori probability of H: Pr[H]
  ♦ probability of the event before evidence is seen
• A posteriori probability of H: Pr[H | E]
  ♦ probability of the event after evidence is seen

Thomas Bayes. Born: 1702 in London, England. Died: 1761 in Tunbridge Wells, Kent, England.
Naïve Bayes for classification
• Classification learning: what’s the probability of the class given an instance?
  ♦ Evidence E = instance
  ♦ Event H = class value for instance
• Naïve assumption: evidence splits into parts (i.e. attributes) that are independent, so the conditional probabilities multiply:

  Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × … × Pr[En | H] × Pr[H] / Pr[E]
Weather data example
Evidence E:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Probability of class “yes”:

Pr[yes | E] = Pr[Outlook = Sunny | yes]
            × Pr[Temperature = Cool | yes]
            × Pr[Humidity = High | yes]
            × Pr[Windy = True | yes]
            × Pr[yes] / Pr[E]
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
The “zero-frequency problem”
• What if an attribute value doesn’t occur with every class value? (e.g. suppose “Humidity = High” never occurred with class “yes”)
  ♦ The conditional probability will be zero: Pr[Humidity = High | yes] = 0
  ♦ The a posteriori probability will also be zero: Pr[yes | E] = 0
    (no matter how likely the other values are!)
• Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
• Result: probabilities will never be zero! (also: stabilizes probability estimates)
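A minimal sketch of the Laplace estimator in Python (the function name and signature are mine, not from the slides):

```python
def laplace_estimate(count, class_total, n_values, alpha=1):
    """Smoothed Pr[attribute value | class]: add alpha to every
    attribute value-class count so no estimate can be exactly zero."""
    return (count + alpha) / (class_total + alpha * n_values)

# Hypothetical zero count: a value never seen with class "yes"
# (9 "yes" instances, attribute with 2 possible values):
print(laplace_estimate(0, 9, 2))  # 0.0909... instead of 0
print(laplace_estimate(3, 9, 2))  # 0.3636... instead of 3/9 = 0.333...
```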
Modified probability estimates
• In some cases adding a constant different from 1 might be more appropriate
• Example: attribute outlook for class yes (counts 2, 4, 3 out of 9), with a total weight μ split equally over the three values:

  Sunny: (2 + μ/3) / (9 + μ)   Overcast: (4 + μ/3) / (9 + μ)   Rainy: (3 + μ/3) / (9 + μ)

• Weights don’t need to be equal (but they must sum to 1):

  Sunny: (2 + μp1) / (9 + μ)   Overcast: (4 + μp2) / (9 + μ)   Rainy: (3 + μp3) / (9 + μ)
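The same idea as a sketch in code, with mu as the total weight and p as a value’s prior weight (names are mine, not the slides’):

```python
def weighted_estimate(count, class_total, mu, p):
    """Pr[value | class] with mu 'virtual' instances distributed
    according to prior weight p; the weights over all values sum to 1."""
    return (count + mu * p) / (class_total + mu)

# Attribute outlook for class yes (counts 2, 4, 3 out of 9).
# With mu = 3 and equal weights p = 1/3 this reduces to the Laplace estimator.
for value, count in [("Sunny", 2), ("Overcast", 4), ("Rainy", 3)]:
    print(value, weighted_estimate(count, 9, mu=3, p=1/3))
```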
Missing values
• Training: the instance is not included in the frequency count for that attribute value-class combination
• Classification: the attribute is omitted from the calculation
• Example:

  Outlook  Temp.  Humidity  Windy  Play
  ?        Cool   High      True   ?

  Likelihood of “yes” = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
  Likelihood of “no” = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
  P(“yes”) = 0.0238 / (0.0238 + 0.0343) = 41%
  P(“no”) = 0.0343 / (0.0238 + 0.0343) = 59%
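A small sketch, assuming the same weather-data estimates, of how a missing attribute is simply dropped from the product:

```python
# Conditional probabilities for the attributes that are present.
cond = {
    ("Cool", "yes"): 3/9, ("High", "yes"): 3/9, ("True", "yes"): 3/9,
    ("Cool", "no"): 1/5,  ("High", "no"): 4/5,  ("True", "no"): 3/5,
}
prior = {"yes": 9/14, "no": 5/14}

# Outlook is missing (None), so its factor is simply skipped.
instance = {"Outlook": None, "Temp": "Cool", "Humidity": "High", "Windy": "True"}

for cls in prior:
    p = prior[cls]
    for value in instance.values():
        if value is not None:        # omit missing attributes from the product
            p *= cond[(value, cls)]
    print(cls, round(p, 4))          # yes 0.0238, no 0.0343
```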
Numeric attributes

• Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
• The probability density function for the normal distribution is defined by two parameters:
  ♦ sample mean μ
  ♦ standard deviation σ
• Then the density function f(x) is

  f(x) = (1 / (√(2π) σ)) · e^(−(x−μ)² / (2σ²))
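A minimal sketch of this density function in Python (gaussian_density is my name for it; μ = 73 and σ = 6.2 are taken from the statistics table on the next slide):

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal density f(x) for sample mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Density of temperature = 66 given class "yes":
print(round(gaussian_density(66, 73, 6.2), 4))  # 0.034
```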
Statistics for weather data
Outlook:      Sunny     yes 2 (2/9)    no 3 (3/5)
              Overcast  yes 4 (4/9)    no 0 (0/5)
              Rainy     yes 3 (3/9)    no 2 (2/5)
Temperature:  yes: 64, 68, 69, 70, 72, …  μ = 73, σ = 6.2
              no:  65, 71, 72, 80, 85, …  μ = 75, σ = 7.9
Humidity:     yes: 65, 70, 70, 75, 80, …  μ = 79, σ = 10.2
              no:  70, 85, 90, 91, 95, …  μ = 86, σ = 9.7
Windy:        False     yes 6 (6/9)    no 2 (2/5)
              True      yes 3 (3/9)    no 3 (3/5)
Play:                   yes 9 (9/14)   no 5 (5/14)

• Example density value:

  f(temperature = 66 | yes) = (1 / (√(2π) · 6.2)) · e^(−(66−73)² / (2 · 6.2²)) = 0.0340
Classifying a new day
• Missing values during training are not included in the calculation of mean and standard deviation
• A new day:

  Outlook  Temp.  Humidity  Windy  Play
  Sunny    66     90        true   ?

  Likelihood of “yes” = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
  Likelihood of “no” = 3/5 × 0.0221 × 0.0381 × 3/5 × 5/14 = 0.000108
  P(“yes”) = 0.000036 / (0.000036 + 0.000108) = 25%
  P(“no”) = 0.000108 / (0.000036 + 0.000108) = 75%
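As a sketch, the same computation in Python; the per-factor values are the ones quoted above, so the Gaussian density values are simply multiplied in alongside the categorical probabilities:

```python
# The five factors quoted above: categorical probabilities for outlook and
# windy, Gaussian density values for temp = 66 and humidity = 90, and the prior.
factors = {
    "yes": [2/9, 0.0340, 0.0221, 3/9, 9/14],
    "no":  [3/5, 0.0221, 0.0381, 3/5, 5/14],
}

likelihood = {}
for cls, fs in factors.items():
    p = 1.0
    for f in fs:
        p *= f        # densities and probabilities are multiplied alike
    likelihood[cls] = p

total = sum(likelihood.values())
for cls, v in likelihood.items():
    print(cls, f"{v:.6f}", f"{v / total:.0%}")  # yes 0.000036 25%, no 0.000108 75%
```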
Probability densities

• Relationship between probability and density (for a small interval of width ε around c):

  Pr[c − ε/2 ≤ x ≤ c + ε/2] ≈ ε · f(c)

• But: this doesn’t change the calculation of a posteriori probabilities, because the same factor ε appears in every class’s likelihood and cancels out when normalizing
• Exact relationship:

  Pr[a ≤ x ≤ b] = ∫ₐᵇ f(t) dt
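A quick numeric check of the approximation (a sketch, reusing the temperature parameters μ = 73, σ = 6.2 from earlier):

```python
import math

def gaussian_density(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def gaussian_cdf(x, mu, sigma):
    """Exact cumulative probability Pr[X <= x] via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma, c, eps = 73, 6.2, 66, 0.5
approx = eps * gaussian_density(c, mu, sigma)
exact = gaussian_cdf(c + eps / 2, mu, sigma) - gaussian_cdf(c - eps / 2, mu, sigma)
print(approx, exact)  # the two values agree to several decimal places
```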
Multinomial naïve Bayes I

• Version of naïve Bayes used for document classification using the bag-of-words model
• n1, n2, …, nk: number of times word i occurs in the document
• P1, P2, …, Pk: probability of obtaining word i when sampling from documents in class H
• Probability of observing document E given class H (based on the multinomial distribution), where N = n1 + n2 + … + nk is the document length:

  Pr[E | H] ≈ N! × (P1^n1 / n1!) × (P2^n2 / n2!) × … × (Pk^nk / nk!)

• Ignores the probability of generating a document of the right length (this probability is assumed constant for each class)
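A sketch of this formula in Python (the function name is mine); the example numbers anticipate the yellow/blue illustration on the next slide:

```python
import math

def multinomial_likelihood(counts, probs):
    """Pr[E | H] under the multinomial model: counts[i] is how often
    word i occurs in the document, probs[i] is Pr[word i | class H]."""
    n = sum(counts)                      # document length N
    p = math.factorial(n)                # N!
    for n_i, p_i in zip(counts, probs):
        p *= p_i ** n_i / math.factorial(n_i)
    return p

# Two-word dictionary (yellow, blue): yellow once, blue twice.
print(multinomial_likelihood([1, 2], [0.75, 0.25]))  # 0.140625
```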
Multinomial naïve Bayes II

• Suppose the dictionary has two words, yellow and blue
• Suppose Pr[yellow | H] = 75% and Pr[blue | H] = 25%
• Suppose E is the document “blue yellow blue”
• Probability of observing the document:

  Pr[E | H] ≈ 3! × (0.75¹ / 1!) × (0.25² / 2!) = 9/64 ≈ 0.14

• Suppose there is another class H′ that has Pr[yellow | H′] = 10% and Pr[blue | H′] = 90%:

  Pr[E | H′] ≈ 3! × (0.10¹ / 1!) × (0.90² / 2!) ≈ 0.24

• Need to take the prior probability of the class into account to make the final classification
• Factorials don’t actually need to be computed
• Underflows can be prevented by using logarithms
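A sketch of the log-space version (it assumes every Pi is nonzero, e.g. after Laplace smoothing):

```python
import math

def log_multinomial_likelihood(counts, probs):
    """log Pr[E | H]: summing logs avoids underflow for long documents.
    The factorial terms are identical for every class, so they could
    even be dropped when only comparing classes."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)                       # log N!
    for n_i, p_i in zip(counts, probs):
        log_p += n_i * math.log(p_i) - math.lgamma(n_i + 1)
    return log_p

# "blue yellow blue" under H and H':
print(math.exp(log_multinomial_likelihood([1, 2], [0.75, 0.25])))  # ~0.14
print(math.exp(log_multinomial_likelihood([1, 2], [0.10, 0.90])))  # ~0.24
```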
Naïve Bayes: discussion
• Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
• Why? Because classification doesn’t require accurate probability estimates as long as the maximum probability is assigned to the correct class
• However: adding too many redundant attributes will cause problems (e.g. identical attributes: the same evidence gets multiplied in twice)
• Note also: many numeric attributes are not normally distributed (→ kernel density estimators)
The end