Bayes Rule
• The product rule gives us two ways to factor a joint probability:

P(A, B) = P(A | B) P(B) = P(B | A) P(A)

• Therefore,

P(A | B) = P(B | A) P(A) / P(B)

• Why is this useful?
– Can get diagnostic probability P(Cavity | Toothache) from causal probability P(Toothache | Cavity)
– Can update our beliefs based on evidence
– Key tool for probabilistic inference

Rev. Thomas Bayes (1702–1761)
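As a quick sketch, the rule is a one-liner in Python (the function name and argument names are mine, not from the slides):

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes rule: P(A | B) = P(B | A) P(A) / P(B)."""
    return p_b_given_a * p_a / p_b
```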
Bayes Rule example

• Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 ≈ 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding day?
P(Rain | Predict) = P(Predict | Rain) P(Rain) / P(Predict)

= P(Predict | Rain) P(Rain) / [P(Predict | Rain) P(Rain) + P(Predict | ¬Rain) P(¬Rain)]

= (0.9 × 0.014) / (0.9 × 0.014 + 0.1 × 0.986)

= 0.0126 / (0.0126 + 0.0986) ≈ 0.111
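The same arithmetic in Python, using the unrounded prior 5/365 (variable names are mine):

```python
p_rain = 5 / 365            # prior P(Rain)
p_predict_given_rain = 0.9  # P(Predict | Rain)
p_predict_given_dry = 0.1   # P(Predict | ¬Rain)

# P(Predict) by the law of total probability
p_predict = (p_predict_given_rain * p_rain
             + p_predict_given_dry * (1 - p_rain))
print(p_predict_given_rain * p_rain / p_predict)  # ≈ 0.111
```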
Bayes rule: Another example

• 1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
P(Cancer | Positive) = P(Positive | Cancer) P(Cancer) / P(Positive)

= P(Positive | Cancer) P(Cancer) / [P(Positive | Cancer) P(Cancer) + P(Positive | ¬Cancer) P(¬Cancer)]

= (0.8 × 0.01) / (0.8 × 0.01 + 0.096 × 0.99)

= 0.008 / (0.008 + 0.095) ≈ 0.0776
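And the mammography example with the slide's figures:

```python
p_cancer = 0.01             # prior P(Cancer)
p_pos_given_cancer = 0.8    # P(Positive | Cancer)
p_pos_given_healthy = 0.096 # P(Positive | ¬Cancer)

p_positive = (p_pos_given_cancer * p_cancer
              + p_pos_given_healthy * (1 - p_cancer))
print(p_pos_given_cancer * p_cancer / p_positive)  # ≈ 0.0776
```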
Independence
• Two events A and B are independent if and only if P(A, B) = P(A) P(B)
– In other words, P(A | B) = P(A) and P(B | A) = P(B)
– This is an important simplifying assumption for modeling, e.g., Toothache and Weather can be assumed to be independent

• Are two mutually exclusive events independent?
– No (assuming both have nonzero probability), but for mutually exclusive events we have P(A ∨ B) = P(A) + P(B)

• Conditional independence: A and B are conditionally independent given C iff P(A, B | C) = P(A | C) P(B | C)
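A quick numerical sanity check of the conditional-independence definition, on a toy joint distribution built so that the property holds; all of the numbers are made up for illustration:

```python
import itertools

# P(C), P(A=1 | C), P(B=1 | C): the joint is constructed so that
# A and B are conditionally independent given C
p_c = {0: 0.4, 1: 0.6}
p_a_given_c = {0: 0.2, 1: 0.7}
p_b_given_c = {0: 0.5, 1: 0.1}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(C=c) P(A=a | C=c) P(B=b | C=c)."""
    pa = p_a_given_c[c] if a else 1 - p_a_given_c[c]
    pb = p_b_given_c[c] if b else 1 - p_b_given_c[c]
    return p_c[c] * pa * pb

# verify P(A, B | C) = P(A | C) P(B | C) for every assignment
for a, b, c in itertools.product([0, 1], repeat=3):
    p_ab_c = joint(a, b, c) / p_c[c]
    p_a_c = sum(joint(a, b2, c) for b2 in (0, 1)) / p_c[c]
    p_b_c = sum(joint(a2, b, c) for a2 in (0, 1)) / p_c[c]
    assert abs(p_ab_c - p_a_c * p_b_c) < 1e-12
```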
Conditional independence: Example
• Toothache: boolean variable indicating whether the patient has a toothache
• Cavity: boolean variable indicating whether the patient has a cavity
• Catch: whether the dentist's probe catches in the cavity

• If the patient has a cavity, the probability that the probe catches in it doesn't depend on whether he/she has a toothache:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
• Therefore, Catch is conditionally independent of Toothache given Cavity
• Likewise, Toothache is conditionally independent of Catch given Cavity:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
• Equivalent statement:
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
Conditional independence: Example
• How many numbers do we need to represent the joint probability table P(Toothache, Cavity, Catch)? 2^3 − 1 = 7 independent entries
• Write out the joint distribution using chain rule:
P(Toothache, Catch, Cavity)
= P(Cavity) P(Catch | Cavity) P(Toothache | Catch, Cavity)
= P(Cavity) P(Catch | Cavity) P(Toothache | Cavity)
• How many numbers do we need to represent these distributions? 1 + 2 + 2 = 5 independent numbers (one for P(Cavity), two each for the two conditional tables)
• In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n
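The same count in a few lines of Python, generalized to n boolean features (a sketch; the variable names are mine):

```python
n = 2                          # Toothache and Catch
full_joint = 2 ** (n + 1) - 1  # joint over n features plus Cavity: 7 entries
factored = 1 + 2 * n           # P(Cavity) plus n two-entry conditionals: 5
print(full_joint, factored)    # the gap grows exponentially with n
```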
Probabilistic inference

• Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e
– Partially observable, stochastic, episodic environment
– Examples: X = {spam, not spam}, e = email message; X = {zebra, giraffe, hippo}, e = image features

• Bayes decision theory:
– The agent has a loss function, which is 0 if the value of X is guessed correctly and 1 otherwise
– The estimate of X that minimizes expected loss is the one that has the greatest posterior probability P(X = x | e)
– This is the Maximum a Posteriori (MAP) decision
MAP decision

• Value x of X that has the highest posterior probability given the evidence E = e:

x* = argmax_x P(X = x | E = e) = argmax_x P(E = e | X = x) P(X = x) / P(E = e)
   = argmax_x P(E = e | X = x) P(X = x)

posterior P(x | e) ∝ P(e | x) P(x) (likelihood × prior)

• Maximum likelihood (ML) decision:

x* = argmax_x P(e | x)
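A sketch contrasting the two decisions on a two-class toy problem; the priors and likelihoods are illustrative numbers, not from the slides:

```python
priors = {'spam': 0.1, 'not spam': 0.9}         # P(X = x)
likelihoods = {'spam': 0.05, 'not spam': 0.03}  # P(E = e | X = x) for the observed e

map_decision = max(priors, key=lambda x: likelihoods[x] * priors[x])
ml_decision = max(priors, key=lambda x: likelihoods[x])
print(map_decision, ml_decision)  # 'not spam' vs. 'spam': the prior changes the answer
```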
Naïve Bayes model

• Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X

• MAP decision involves estimating

P(X | E1, …, En) ∝ P(E1, …, En | X) P(X)

– If each feature Ei can take on k values, how many entries are in the (conditional) joint probability table P(E1, …, En | X = x)? (k^n for each value x)

• We can make the simplifying assumption that the different features are conditionally independent given the hypothesis:

P(E1, …, En | X) = ∏i=1..n P(Ei | X)

– If each feature can take on k values, what is the complexity of storing the resulting distributions? (n tables of k entries per class: O(nk) rather than O(k^n))
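A sketch of the factored likelihood in code; the table layout cond[i][x][e] is my own choice, not from the slides:

```python
def factored_likelihood(evidence, x, cond):
    """P(E1, …, En | X = x) = product over i of P(Ei | X = x).

    cond[i][x][e] holds the table entry P(Ei = e | X = x).
    """
    p = 1.0
    for i, e in enumerate(evidence):
        p *= cond[i][x][e]
    return p
```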
Naïve Bayes Spam Filter

• MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message)

• We have P(spam | message) ∝ P(message | spam) P(spam) and P(¬spam | message) ∝ P(message | ¬spam) P(¬spam)
Naïve Bayes Spam Filter

• We need to find P(message | spam) P(spam) and P(message | ¬spam) P(¬spam)
• The message is a sequence of words (w1, …, wn)
• Bag of words representation
– The order of the words in the message is not important
– Each word is conditionally independent of the others given the message class (spam or not spam)

P(message | spam) = P(w1, …, wn | spam) = ∏i=1..n P(wi | spam)

• Our filter will classify the message as spam if

P(spam) ∏i=1..n P(wi | spam) > P(¬spam) ∏i=1..n P(wi | ¬spam)
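A minimal sketch of this decision rule, assuming the parameter tables p_word_spam and p_word_ham have already been estimated (see the following slides); log probabilities avoid floating-point underflow on long messages:

```python
import math

def classify_as_spam(words, p_spam, p_word_spam, p_word_ham):
    """Return True if the naive Bayes posterior favors spam."""
    log_spam = math.log(p_spam) + sum(math.log(p_word_spam[w]) for w in words)
    log_ham = (math.log(1 - p_spam)
               + sum(math.log(p_word_ham[w]) for w in words))
    return log_spam > log_ham
```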
Bag of words illustration

US Presidential Speeches Tag Cloud: http://chir.ag/projects/preztags/
Naïve Bayes Spam Filter

P(spam | w1, …, wn) ∝ P(spam) ∏i=1..n P(wi | spam)

posterior ∝ prior × likelihood
Parameter estimation

• In order to classify a message, we need to know the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)
– These are the parameters of the probabilistic model
– How do we obtain the values of these parameters?

[Figure: a prior table (spam: 0.33, ¬spam: 0.67) and two likelihood tables, P(word | spam) and P(word | ¬spam), with one entry per vocabulary word]
Parameter estimation

• How do we obtain the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)?
– Empirically: use training data

P(word | spam) = (# of word occurrences in spam messages) / (total # of words in spam messages)

– This is the maximum likelihood (ML) estimate, or the estimate that maximizes the likelihood of the training data:

∏d=1..D ∏i=1..nd P(wd,i | classd)

(d: index of training document, i: index of a word)
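A sketch of the counting estimate (function and variable names are mine):

```python
from collections import Counter

def ml_estimates(spam_messages):
    """P(word | spam) as relative frequency over all words in spam messages."""
    counts = Counter(w for msg in spam_messages for w in msg)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```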
Parameter estimation

• Parameter smoothing: dealing with words that were never seen or seen too few times
– Laplacian smoothing: pretend you have seen every vocabulary word one more time than you actually did

P(word | spam) = (# of word occurrences in spam messages + 1) / (total # of words in spam messages + V)

(V: total number of unique words)
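And the smoothed estimate over a fixed vocabulary, again as a sketch:

```python
from collections import Counter

def smoothed_estimates(spam_messages, vocabulary):
    """Add-one (Laplacian) smoothing: every vocabulary word gets count + 1."""
    counts = Counter(w for msg in spam_messages for w in msg)
    total = sum(counts.values())
    V = len(vocabulary)
    return {w: (counts[w] + 1) / (total + V) for w in vocabulary}
```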
Summary of model and parameters

• Naïve Bayes model:

P(spam | message) ∝ P(spam) ∏i=1..n P(wi | spam)
P(¬spam | message) ∝ P(¬spam) ∏i=1..n P(wi | ¬spam)

• Model parameters:
– Prior: P(spam), P(¬spam)
– Likelihood of spam: P(w1 | spam), P(w2 | spam), …, P(wn | spam)
– Likelihood of ¬spam: P(w1 | ¬spam), P(w2 | ¬spam), …, P(wn | ¬spam)
Bag-of-word models for images
1. Extract image features
2. Learn "visual vocabulary"
3. Map image features to visual words

Csurka et al. (2004), Willamowski et al. (2005), Grauman & Darrell (2005), Sivic et al. (2003, 2005)
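A sketch of the three-step pipeline using k-means to learn the vocabulary, which is one common choice (e.g., in Csurka et al. 2004); feature extraction is stubbed out with random descriptors, where a real system would use a detector such as SIFT:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# 1. Extract image features (stubbed: 128-D descriptors, as SIFT would give)
training_descriptors = rng.normal(size=(1000, 128))

# 2. Learn a "visual vocabulary" by clustering the descriptors
vocabulary = KMeans(n_clusters=50, n_init=10).fit(training_descriptors)

# 3. Map a new image's features to visual words and build a word histogram
image_descriptors = rng.normal(size=(200, 128))
words = vocabulary.predict(image_descriptors)
bag_of_words = np.bincount(words, minlength=50)
```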
Bayesian decision making: Summary
• Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E
• Inference problem: given some evidence E = e, what is P(X | e)?
• Learning problem: estimate the parameters of the probabilistic model P(X | E) given a training sample {(x1,e1), …, (xn,en)}