Bayesian Decision Theory
Hong Chang
Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning Methods (Fall 2012)
Outline
1 Introduction
2 Losses and Risks
3 Discriminant Functions
4 Bayesian Networks
Coin Tossing Example
Outcome of tossing a coin ∈ {head, tail}. Random variable X:

$$X = \begin{cases} 1 & \text{if outcome is head} \\ 0 & \text{if outcome is tail} \end{cases}$$

X is Bernoulli-distributed:

$$P(X) = p_0^X (1 - p_0)^{1-X}$$

where the parameter $p_0$ is the probability that the outcome is head, i.e., $p_0 = P(X = 1)$.
Estimation and Prediction
Estimation of parameter $p_0$ from a sample $\mathcal{X} = \{x^{(i)}\}_{i=1}^N$:

$$\hat{p}_0 = \frac{\#\text{heads}}{\#\text{tosses}} = \frac{\sum_{i=1}^N x^{(i)}}{N}$$

Prediction of the outcome of the next toss:

$$\text{Predicted outcome} = \begin{cases} \text{head} & \text{if } \hat{p}_0 > 1/2 \\ \text{tail} & \text{otherwise} \end{cases}$$

Choosing the more probable outcome minimizes the probability of error, which is 1 minus the probability of the predicted outcome.
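A minimal sketch of this estimate-then-predict procedure (the sample below is hypothetical):

```python
# Estimate p0 from a sample of coin tosses, then predict the next outcome
# by choosing the more probable one. The sample is hypothetical.
sample = [1, 0, 1, 1, 0, 1, 0, 1]  # 1 = head, 0 = tail

p0_hat = sum(sample) / len(sample)              # #heads / #tosses
prediction = "head" if p0_hat > 0.5 else "tail"

print(p0_hat)      # 0.625
print(prediction)  # head
```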
Classification as Bayesian Decision
Credit scoring example:
Inputs: income and savings, or $\mathbf{x} = (x_1, x_2)^T$
Output: risk ∈ {low, high}, or C ∈ {0, 1}
Prediction:

$$\text{Choose} \begin{cases} C = 1 & \text{if } P(C = 1|\mathbf{x}) > 0.5 \\ C = 0 & \text{otherwise} \end{cases}$$

or, equivalently,

$$\text{Choose} \begin{cases} C = 1 & \text{if } P(C = 1|\mathbf{x}) > P(C = 0|\mathbf{x}) \\ C = 0 & \text{otherwise} \end{cases}$$

Probability of error:

$$1 - \max(P(C = 1|\mathbf{x}), P(C = 0|\mathbf{x}))$$
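As a sketch of the decision rule, with a hypothetical posterior value:

```python
# Decide the class from the posterior P(C=1|x) and report the probability
# of error. The posterior value here is hypothetical.
p_c1_given_x = 0.7

chosen = 1 if p_c1_given_x > 0.5 else 0
p_error = 1 - max(p_c1_given_x, 1 - p_c1_given_x)

print(chosen)   # 1
print(p_error)  # ~0.3
```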
Bayes’ Rule
Bayes' rule:

$$P(C|\mathbf{x}) = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} = \frac{p(\mathbf{x}|C)\,P(C)}{p(\mathbf{x})}$$

Some useful properties:
$P(C = 1) + P(C = 0) = 1$
$p(\mathbf{x}) = p(\mathbf{x}|C = 1)P(C = 1) + p(\mathbf{x}|C = 0)P(C = 0)$
$P(C = 0|\mathbf{x}) + P(C = 1|\mathbf{x}) = 1$
Bayes’ Rule for K > 2 Classes
Bayes' rule for the general case (K mutually exclusive and exhaustive classes):

$$P(C_i|\mathbf{x}) = \frac{p(\mathbf{x}|C_i)P(C_i)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|C_i)P(C_i)}{\sum_{k=1}^K p(\mathbf{x}|C_k)P(C_k)}$$

Optimal decision rule for the Bayes' classifier:

$$\text{Choose } C_i \text{ if } P(C_i|\mathbf{x}) = \max_k P(C_k|\mathbf{x})$$
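A minimal sketch of this classifier; the likelihood and prior values below are hypothetical stand-ins for what a model would supply at a given x:

```python
# Bayes' classifier for K classes: compute posteriors via Bayes' rule,
# then choose the class with the largest one. Numbers are hypothetical.
likelihoods = [0.5, 2.0, 0.1]   # p(x|C_k); densities, so values may exceed 1
priors      = [0.3, 0.3, 0.4]   # P(C_k); must sum to 1

evidence = sum(l * p for l, p in zip(likelihoods, priors))           # p(x)
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]

chosen = max(range(len(posteriors)), key=lambda k: posteriors[k])
print(posteriors)  # sums to 1
print(chosen)      # 1, the class with the highest posterior
```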
Losses and Risks
Different decisions or actions may not be equally good or costly.
Action $\alpha_i$: decision to assign the input $\mathbf{x}$ to class $C_i$
Loss $\lambda_{ik}$: loss incurred for taking action $\alpha_i$ when the actual state is $C_k$
Expected risk for taking action $\alpha_i$:

$$R(\alpha_i|\mathbf{x}) = \sum_{k=1}^K \lambda_{ik}\, P(C_k|\mathbf{x})$$

Optimal decision rule with minimum expected risk:

$$\text{Choose } \alpha_i \text{ if } R(\alpha_i|\mathbf{x}) = \min_k R(\alpha_k|\mathbf{x})$$
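A sketch of risk minimization with a hypothetical loss matrix and posterior; note how an asymmetric loss can override the class with the higher posterior:

```python
# Choose the action minimizing R(a_i|x) = sum_k lambda_ik * P(C_k|x).
# Loss matrix and posterior are hypothetical.
loss = [            # lambda_ik: rows = actions, columns = true classes
    [0.0, 10.0],    # deciding class 0 is very costly when the truth is 1
    [1.0, 0.0],     # deciding class 1 costs 1 when the truth is 0
]
posterior = [0.8, 0.2]   # P(C_k|x)

risks = [sum(l * p for l, p in zip(row, posterior)) for row in loss]
best = min(range(len(risks)), key=lambda i: risks[i])

print(risks)  # [2.0, 0.8]
print(best)   # 1: chosen despite class 0 having the higher posterior
```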
0-1 Loss
All correct decisions have no loss and all errors have unit cost:

$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ 1 & \text{if } i \neq k \end{cases}$$

Expected risk:

$$R(\alpha_i|\mathbf{x}) = \sum_{k=1}^K \lambda_{ik}\, P(C_k|\mathbf{x}) = \sum_{k \neq i} P(C_k|\mathbf{x}) = 1 - P(C_i|\mathbf{x})$$

Optimal decision rule with minimum expected risk (or, equivalently, highest posterior probability):

$$\text{Choose } \alpha_i \text{ if } P(C_i|\mathbf{x}) = \max_k P(C_k|\mathbf{x})$$
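The equivalence is easy to check numerically; a sketch with hypothetical posteriors:

```python
# Under 0-1 loss, minimizing expected risk and maximizing the posterior
# select the same class. Posteriors are hypothetical.
posterior = [0.2, 0.5, 0.3]
K = len(posterior)

risks = [1 - posterior[i] for i in range(K)]      # R(a_i|x) = 1 - P(C_i|x)
assert min(range(K), key=lambda i: risks[i]) == \
       max(range(K), key=lambda i: posterior[i])  # both pick class 1
```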
Reject Option
If the certainty of a decision is low but misclassification has a very high cost, the reject action ($\alpha_{K+1}$) is preferred.
Loss function:

$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ \lambda & \text{if } i = K + 1 \\ 1 & \text{otherwise} \end{cases}$$

where $0 < \lambda < 1$ is the loss incurred for choosing the reject action.
Expected risk:

$$R(\alpha_i|\mathbf{x}) = \begin{cases} \sum_{k=1}^K \lambda\, P(C_k|\mathbf{x}) = \lambda & \text{if } i = K + 1 \\ \sum_{k \neq i} P(C_k|\mathbf{x}) = 1 - P(C_i|\mathbf{x}) & \text{if } i \in \{1, \ldots, K\} \end{cases}$$
Reject Option (2)
Optimal decision rule:

$$\begin{cases} \text{Choose } C_i & \text{if } R(\alpha_i|\mathbf{x}) = \min_{1 \leq k \leq K} R(\alpha_k|\mathbf{x}) < R(\alpha_{K+1}|\mathbf{x}) \\ \text{Reject} & \text{otherwise} \end{cases}$$

Equivalent form of the optimal decision rule:

$$\begin{cases} \text{Choose } C_i & \text{if } P(C_i|\mathbf{x}) = \max_{1 \leq k \leq K} P(C_k|\mathbf{x}) > 1 - \lambda \\ \text{Reject} & \text{otherwise} \end{cases}$$

This approach is meaningful only if $0 < \lambda < 1$:
If $\lambda = 0$, we always reject (a reject is as good as a correct classification).
If $\lambda \geq 1$, we never reject (a reject is as costly as or more costly than a misclassification).
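A sketch of the equivalent form of the rule, with hypothetical posteriors:

```python
# Reject-option rule: choose the max-posterior class only if its posterior
# exceeds 1 - lambda; otherwise reject. Inputs are hypothetical.
def decide_with_reject(posterior, lam):
    assert 0 < lam < 1, "rejection is meaningful only for 0 < lambda < 1"
    best = max(range(len(posterior)), key=lambda k: posterior[k])
    return best if posterior[best] > 1 - lam else "reject"

print(decide_with_reject([0.45, 0.35, 0.20], lam=0.4))  # reject: 0.45 <= 0.6
print(decide_with_reject([0.75, 0.15, 0.10], lam=0.4))  # 0: 0.75 > 0.6
```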
Discriminant Functions
One way of performing classification is through a set of discriminant functions.
Classification rule:

$$\text{Choose } C_i \text{ if } g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})$$

Different ways of defining the discriminant functions:
$g_i(\mathbf{x}) = -R(\alpha_i|\mathbf{x})$
$g_i(\mathbf{x}) = P(C_i|\mathbf{x})$
$g_i(\mathbf{x}) = p(\mathbf{x}|C_i)P(C_i)$

For the two-class case, we may define a single discriminant function

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$$

with the following classification rule:

$$\text{Choose} \begin{cases} C_1 & \text{if } g(\mathbf{x}) > 0 \\ C_2 & \text{otherwise} \end{cases}$$
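A sketch of the two-class case, with hypothetical likelihood-times-prior discriminants:

```python
# Two-class classification with a single discriminant g(x) = g1(x) - g2(x).
# g1 and g2 are hypothetical p(x|C_i)P(C_i) discriminants built from toy
# triangular densities centered at +1 and -1.
def g1(x):
    return 0.6 * max(0.0, 1 - abs(x - 1))   # p(x|C1) P(C1)

def g2(x):
    return 0.4 * max(0.0, 1 - abs(x + 1))   # p(x|C2) P(C2)

def classify(x):
    return "C1" if g1(x) - g2(x) > 0 else "C2"

print(classify(0.9))   # C1
print(classify(-0.8))  # C2
```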
Decision Regions
The feature space is divided into K decision regions $\mathcal{R}_1, \ldots, \mathcal{R}_K$, where

$$\mathcal{R}_i = \{\mathbf{x} \mid g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})\}$$

The decision regions are separated by decision boundaries, where ties occur among the largest discriminant functions.
Bayesian Networks
A.k.a. belief networks, probabilistic networks, or, more generally, graphical models.
Node (or vertex): random variable
Edge (or arc or link): direct influence between variables
Structure: directed acyclic graph (DAG) formed by the nodes and edges
Parameters: probabilities and conditional probabilities
Causal Graph and Diagnostic Inference
Causal graph: rain is the cause of wet grass.
Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
Bayes' rule:

$$P(R|W) = \frac{P(W|R)P(R)}{P(W)} = \frac{0.9 \times 0.4}{0.9 \times 0.4 + 0.2 \times 0.6} = 0.75 > P(R) = 0.4$$
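The computation can be verified directly with the slide's numbers:

```python
# Diagnostic inference P(R|W) via Bayes' rule, using the slide's numbers:
# P(R) = 0.4, P(W|R) = 0.9, P(W|~R) = 0.2.
p_r = 0.4
p_w_given_r, p_w_given_not_r = 0.9, 0.2

p_w = p_w_given_r * p_r + p_w_given_not_r * (1 - p_r)   # evidence P(W)
p_r_given_w = p_w_given_r * p_r / p_w

print(p_r_given_w)  # 0.75 > P(R) = 0.4
```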
Two Causes: Causal Inference
Causal or predictive inference: if the sprinkler is on, what is the probability that the grass is wet?

$$\begin{aligned} P(W|S) &= P(W|R,S)P(R|S) + P(W|{\sim}R,S)P({\sim}R|S) \\ &= P(W|R,S)P(R) + P(W|{\sim}R,S)P({\sim}R) \\ &= 0.95 \times 0.4 + 0.9 \times 0.6 = 0.92 \end{aligned}$$
Two Causes: Diagnostic Inference
Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?

$$P(S|W) = \frac{P(W|S)P(S)}{P(W)}$$

where

$$\begin{aligned} P(W) &= P(W|R,S)P(R,S) + P(W|{\sim}R,S)P({\sim}R,S) \\ &\quad + P(W|R,{\sim}S)P(R,{\sim}S) + P(W|{\sim}R,{\sim}S)P({\sim}R,{\sim}S) \\ &= P(W|R,S)P(R)P(S) + P(W|{\sim}R,S)P({\sim}R)P(S) \\ &\quad + P(W|R,{\sim}S)P(R)P({\sim}S) + P(W|{\sim}R,{\sim}S)P({\sim}R)P({\sim}S) \\ &= 0.52 \end{aligned}$$

so

$$P(S|W) = \frac{0.92 \times 0.2}{0.52} = 0.35 > P(S) = 0.2$$
Two Causes: Explaining Away
Bayes' rule:

$$P(S|R,W) = \frac{P(W|R,S)P(S|R)}{P(W|R)} = \frac{P(W|R,S)P(S)}{P(W|R)} = 0.21$$

Explaining away:

$$0.21 = P(S|R,W) < P(S|W) = 0.35$$

Knowing that it has rained decreases the probability that the sprinkler is on.
Knowing that the grass is wet, rain and sprinkler become dependent:

$$P(S|R,W) \neq P(S|W)$$
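All three inference patterns on the two-cause network can be checked numerically. The slides give P(R) = 0.4, P(S) = 0.2, P(W|R,S) = 0.95, and P(W|∼R,S) = 0.9; the remaining table entries used below, P(W|R,∼S) = 0.9 and P(W|∼R,∼S) = 0.1, are assumptions chosen to reproduce the slides' P(W) = 0.52:

```python
# Rain (R) and Sprinkler (S) as independent causes of Wet grass (W).
# P(W|R,~S) = 0.9 and P(W|~R,~S) = 0.1 are assumed (not on the slides).
p_r, p_s = 0.4, 0.2
p_w_given = {(True, True): 0.95, (False, True): 0.9,   # keys: (R, S)
             (True, False): 0.9, (False, False): 0.1}

def p_rs(r, s):  # R and S are marginally independent
    return (p_r if r else 1 - p_r) * (p_s if s else 1 - p_s)

# Causal inference: P(W|S) = sum_r P(W|r,S) P(r)
p_w_given_s = sum(p_w_given[(r, True)] * (p_r if r else 1 - p_r)
                  for r in (True, False))

# Diagnostic inference: P(S|W) = P(W|S) P(S) / P(W)
p_w = sum(p_w_given[(r, s)] * p_rs(r, s)
          for r in (True, False) for s in (True, False))
p_s_given_w = p_w_given_s * p_s / p_w

# Explaining away: P(S|R,W) = P(W|R,S) P(S) / P(W|R)
p_w_given_r = sum(p_w_given[(True, s)] * (p_s if s else 1 - p_s)
                  for s in (True, False))
p_s_given_rw = p_w_given[(True, True)] * p_s / p_w_given_r

print(round(p_w_given_s, 2))   # 0.92
print(round(p_w, 2))           # 0.52
print(round(p_s_given_w, 2))   # 0.35
print(round(p_s_given_rw, 2))  # 0.21
```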
Dependent Causes
Causal inference:

$$\begin{aligned} P(W|C) &= P(W|R,S,C)P(R,S|C) + P(W|{\sim}R,S,C)P({\sim}R,S|C) \\ &\quad + P(W|R,{\sim}S,C)P(R,{\sim}S|C) + P(W|{\sim}R,{\sim}S,C)P({\sim}R,{\sim}S|C) \\ &= P(W|R,S)P(R|C)P(S|C) + P(W|{\sim}R,S)P({\sim}R|C)P(S|C) \\ &\quad + P(W|R,{\sim}S)P(R|C)P({\sim}S|C) + P(W|{\sim}R,{\sim}S)P({\sim}R|C)P({\sim}S|C) \end{aligned}$$

Independence: W and C are independent given R and S; R and S are independent given C.
Local Structures
The network represents conditional independence statements. The joint distribution can be broken down into local structures:

$$P(C,S,R,W,F) = P(C)\,P(S|C)\,P(R|C)\,P(W|S,R)\,P(F|R)$$

In general,

$$P(X_1, \ldots, X_d) = \prod_{i=1}^d P(X_i \mid \text{parents}(X_i))$$

where each $X_i$ is either continuous or discrete with two or more possible values.
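A sketch of evaluating this factorization; the conditional probability tables below are hypothetical, and F is treated only as a fifth binary node whose single parent is R (its meaning is not given in this transcript):

```python
from itertools import product

# Joint probability from local structures:
# P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|S,R) P(F|R).
# All table entries are hypothetical.
p_c = 0.5
p_s_given_c = {True: 0.1, False: 0.5}                      # P(S=1|C)
p_r_given_c = {True: 0.8, False: 0.1}                      # P(R=1|C)
p_w_given_sr = {(True, True): 0.95, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.1}   # P(W=1|S,R)
p_f_given_r = {True: 0.7, False: 0.05}                     # P(F=1|R)

def bern(p, value):  # P(X = value) for a binary X with P(X=1) = p
    return p if value else 1 - p

def joint(c, s, r, w, f):
    return (bern(p_c, c) * bern(p_s_given_c[c], s) * bern(p_r_given_c[c], r)
            * bern(p_w_given_sr[(s, r)], w) * bern(p_f_given_r[r], f))

# Sanity check: the joint sums to 1 over all 2^5 assignments.
total = sum(joint(*bits) for bits in product((True, False), repeat=5))
print(round(total, 10))  # 1.0
```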
Bayesian Network for Classification
Bayes' rule inverts the edge:

$$P(C|\mathbf{x}) = \frac{p(\mathbf{x}|C)P(C)}{p(\mathbf{x})}$$

Classification as diagnostic inference.
Naive Bayes’ Classifier
Given C, the input variables $x_j$ are independent:

$$p(\mathbf{x}|C) = \prod_{j=1}^d p(x_j|C)$$

The naive Bayes' classifier ignores possible dependencies among the input variables and reduces a multivariate problem to a group of univariate problems.
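A minimal sketch with hypothetical univariate Gaussian class models, using the factorized likelihood inside the Bayes' classifier:

```python
import math

# Naive Bayes': p(x|C) factorizes over dimensions, so each dimension is
# modeled by its own univariate density. All parameters are hypothetical.
def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

params = {0: [(0.0, 1.0), (5.0, 2.0)],   # per-class (mean, std) per dimension
          1: [(2.0, 1.0), (1.0, 2.0)]}
priors = {0: 0.6, 1: 0.4}                # P(C)

def naive_bayes_classify(x):
    scores = {}
    for c, dims in params.items():
        likelihood = 1.0
        for xj, (mean, std) in zip(x, dims):   # p(x|C) = prod_j p(x_j|C)
            likelihood *= gaussian_pdf(xj, mean, std)
        scores[c] = likelihood * priors[c]     # proportional to P(C|x)
    return max(scores, key=scores.get)

print(naive_bayes_classify([0.1, 4.5]))  # 0
print(naive_bayes_classify([2.2, 0.8]))  # 1
```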