Bayesian Decision Theory
Hong Chang
Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning Methods (Fall 2012)
Outline
1 Introduction
2 Losses and Risks
3 Discriminant Functions
4 Bayesian Networks
Coin Tossing Example
Outcome of tossing a coin ∈ {head, tail}. Random variable X:

$$X = \begin{cases} 1 & \text{if outcome is head} \\ 0 & \text{if outcome is tail} \end{cases}$$

X is Bernoulli-distributed:

$$P(X) = p_0^X (1 - p_0)^{1-X}$$

where the parameter $p_0$ is the probability that the outcome is head, i.e., $p_0 = P(X = 1)$.
Estimation and Prediction
Estimation of parameter $p_0$ from a sample $\mathcal{X} = \{x^{(i)}\}_{i=1}^N$:

$$\hat{p}_0 = \frac{\#\text{heads}}{\#\text{tosses}} = \frac{\sum_{i=1}^N x^{(i)}}{N}$$

Prediction of the outcome of the next toss:

$$\text{Predicted outcome} = \begin{cases} \text{head} & \text{if } \hat{p}_0 > 1/2 \\ \text{tail} & \text{otherwise} \end{cases}$$

Choosing the more probable outcome minimizes the probability of error, which is 1 minus the probability of the predicted outcome.
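A minimal sketch of this estimate-then-predict procedure (the sample below is hypothetical):

```python
# Estimate p0 from a sample of coin tosses, then predict the next outcome
# by choosing the more probable one. The sample is hypothetical.
sample = [1, 0, 1, 1, 0, 1, 0, 1]  # 1 = head, 0 = tail

p0_hat = sum(sample) / len(sample)              # #heads / #tosses
prediction = "head" if p0_hat > 0.5 else "tail"

print(p0_hat)      # 0.625
print(prediction)  # head
```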
Classification as Bayesian Decision
Credit scoring example:
Inputs: income and savings, or $\mathbf{x} = (x_1, x_2)^T$
Output: risk ∈ {low, high}, or C ∈ {0, 1}
Prediction:

$$\text{Choose} \begin{cases} C = 1 & \text{if } P(C = 1|\mathbf{x}) > 0.5 \\ C = 0 & \text{otherwise} \end{cases}$$

or, equivalently,

$$\text{Choose} \begin{cases} C = 1 & \text{if } P(C = 1|\mathbf{x}) > P(C = 0|\mathbf{x}) \\ C = 0 & \text{otherwise} \end{cases}$$

Probability of error:

$$1 - \max(P(C = 1|\mathbf{x}), P(C = 0|\mathbf{x}))$$
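As a sketch of the decision rule, with a hypothetical posterior value:

```python
# Decide the class from the posterior P(C=1|x) and report the probability
# of error. The posterior value here is hypothetical.
p_c1_given_x = 0.7

chosen = 1 if p_c1_given_x > 0.5 else 0
p_error = 1 - max(p_c1_given_x, 1 - p_c1_given_x)

print(chosen)   # 1
print(p_error)  # ~0.3
```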
Bayes’ Rule
Bayes' rule:

$$P(C|\mathbf{x}) = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} = \frac{p(\mathbf{x}|C)\,P(C)}{p(\mathbf{x})}$$

Some useful properties:
$P(C = 1) + P(C = 0) = 1$
$p(\mathbf{x}) = p(\mathbf{x}|C = 1)P(C = 1) + p(\mathbf{x}|C = 0)P(C = 0)$
$P(C = 0|\mathbf{x}) + P(C = 1|\mathbf{x}) = 1$
Bayes’ Rule for K > 2 Classes
Bayes' rule for the general case (K mutually exclusive and exhaustive classes):

$$P(C_i|\mathbf{x}) = \frac{p(\mathbf{x}|C_i)P(C_i)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|C_i)P(C_i)}{\sum_{k=1}^K p(\mathbf{x}|C_k)P(C_k)}$$

Optimal decision rule for the Bayes' classifier:

$$\text{Choose } C_i \text{ if } P(C_i|\mathbf{x}) = \max_k P(C_k|\mathbf{x})$$
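A minimal sketch of this classifier; the likelihood and prior values below are hypothetical stand-ins for what a model would supply at a given x:

```python
# Bayes' classifier for K classes: compute posteriors via Bayes' rule,
# then choose the class with the largest one. Numbers are hypothetical.
likelihoods = [0.5, 2.0, 0.1]   # p(x|C_k); densities, so values may exceed 1
priors      = [0.3, 0.3, 0.4]   # P(C_k); must sum to 1

evidence = sum(l * p for l, p in zip(likelihoods, priors))           # p(x)
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]

chosen = max(range(len(posteriors)), key=lambda k: posteriors[k])
print(posteriors)  # sums to 1
print(chosen)      # 1, the class with the highest posterior
```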
Losses and Risks
Different decisions or actions may not be equally good or costly.
Action $\alpha_i$: decision to assign the input $\mathbf{x}$ to class $C_i$
Loss $\lambda_{ik}$: loss incurred for taking action $\alpha_i$ when the actual state is $C_k$
Expected risk for taking action $\alpha_i$:

$$R(\alpha_i|\mathbf{x}) = \sum_{k=1}^K \lambda_{ik}\, P(C_k|\mathbf{x})$$

Optimal decision rule with minimum expected risk:

$$\text{Choose } \alpha_i \text{ if } R(\alpha_i|\mathbf{x}) = \min_k R(\alpha_k|\mathbf{x})$$
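A sketch of risk minimization with a hypothetical loss matrix and posterior; note how an asymmetric loss can override the class with the higher posterior:

```python
# Choose the action minimizing R(a_i|x) = sum_k lambda_ik * P(C_k|x).
# Loss matrix and posterior are hypothetical.
loss = [            # lambda_ik: rows = actions, columns = true classes
    [0.0, 10.0],    # deciding class 0 is very costly when the truth is 1
    [1.0, 0.0],     # deciding class 1 costs 1 when the truth is 0
]
posterior = [0.8, 0.2]   # P(C_k|x)

risks = [sum(l * p for l, p in zip(row, posterior)) for row in loss]
best = min(range(len(risks)), key=lambda i: risks[i])

print(risks)  # [2.0, 0.8]
print(best)   # 1: chosen despite class 0 having the higher posterior
```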
0-1 Loss
All correct decisions have no loss and all errors have unit cost:

$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ 1 & \text{if } i \neq k \end{cases}$$

Expected risk:

$$R(\alpha_i|\mathbf{x}) = \sum_{k=1}^K \lambda_{ik}\, P(C_k|\mathbf{x}) = \sum_{k \neq i} P(C_k|\mathbf{x}) = 1 - P(C_i|\mathbf{x})$$

Optimal decision rule with minimum expected risk (or, equivalently, highest posterior probability):

$$\text{Choose } \alpha_i \text{ if } P(C_i|\mathbf{x}) = \max_k P(C_k|\mathbf{x})$$
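The equivalence is easy to check numerically; a sketch with hypothetical posteriors:

```python
# Under 0-1 loss, minimizing expected risk and maximizing the posterior
# select the same class. Posteriors are hypothetical.
posterior = [0.2, 0.5, 0.3]
K = len(posterior)

risks = [1 - posterior[i] for i in range(K)]      # R(a_i|x) = 1 - P(C_i|x)
assert min(range(K), key=lambda i: risks[i]) == \
       max(range(K), key=lambda i: posterior[i])  # both pick class 1
```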
Reject Option
If the certainty of a decision is low but misclassification has a very high cost, the reject action ($\alpha_{K+1}$) is preferred.
Loss function:

$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ \lambda & \text{if } i = K + 1 \\ 1 & \text{otherwise} \end{cases}$$

where $0 < \lambda < 1$ is the loss incurred for choosing the reject action.
Expected risk:

$$R(\alpha_i|\mathbf{x}) = \begin{cases} \sum_{k=1}^K \lambda\, P(C_k|\mathbf{x}) = \lambda & \text{if } i = K + 1 \\ \sum_{k \neq i} P(C_k|\mathbf{x}) = 1 - P(C_i|\mathbf{x}) & \text{if } i \in \{1, \ldots, K\} \end{cases}$$
Reject Option (2)
Optimal decision rule:

$$\begin{cases} \text{Choose } C_i & \text{if } R(\alpha_i|\mathbf{x}) = \min_{1 \leq k \leq K} R(\alpha_k|\mathbf{x}) < R(\alpha_{K+1}|\mathbf{x}) \\ \text{Reject} & \text{otherwise} \end{cases}$$

Equivalent form of the optimal decision rule:

$$\begin{cases} \text{Choose } C_i & \text{if } P(C_i|\mathbf{x}) = \max_{1 \leq k \leq K} P(C_k|\mathbf{x}) > 1 - \lambda \\ \text{Reject} & \text{otherwise} \end{cases}$$

This approach is meaningful only if $0 < \lambda < 1$:
If $\lambda = 0$, we always reject (a reject is as good as a correct classification).
If $\lambda \geq 1$, we never reject (a reject is as costly as or more costly than a misclassification).
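A sketch of the equivalent form of the rule, with hypothetical posteriors:

```python
# Reject-option rule: choose the max-posterior class only if its posterior
# exceeds 1 - lambda; otherwise reject. Inputs are hypothetical.
def decide_with_reject(posterior, lam):
    assert 0 < lam < 1, "rejection is meaningful only for 0 < lambda < 1"
    best = max(range(len(posterior)), key=lambda k: posterior[k])
    return best if posterior[best] > 1 - lam else "reject"

print(decide_with_reject([0.45, 0.35, 0.20], lam=0.4))  # reject: 0.45 <= 0.6
print(decide_with_reject([0.75, 0.15, 0.10], lam=0.4))  # 0: 0.75 > 0.6
```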
Discriminant Functions
One way of performing classification is through a set of discriminant functions.
Classification rule:

$$\text{Choose } C_i \text{ if } g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})$$

Different ways of defining the discriminant functions:
$g_i(\mathbf{x}) = -R(\alpha_i|\mathbf{x})$
$g_i(\mathbf{x}) = P(C_i|\mathbf{x})$
$g_i(\mathbf{x}) = p(\mathbf{x}|C_i)P(C_i)$

For the two-class case, we may define a single discriminant function

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$$

with the following classification rule:

$$\text{Choose} \begin{cases} C_1 & \text{if } g(\mathbf{x}) > 0 \\ C_2 & \text{otherwise} \end{cases}$$
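A sketch of the two-class case, with hypothetical likelihood-times-prior discriminants:

```python
# Two-class classification with a single discriminant g(x) = g1(x) - g2(x).
# g1 and g2 are hypothetical p(x|C_i)P(C_i) discriminants built from toy
# triangular densities centered at +1 and -1.
def g1(x):
    return 0.6 * max(0.0, 1 - abs(x - 1))   # p(x|C1) P(C1)

def g2(x):
    return 0.4 * max(0.0, 1 - abs(x + 1))   # p(x|C2) P(C2)

def classify(x):
    return "C1" if g1(x) - g2(x) > 0 else "C2"

print(classify(0.9))   # C1
print(classify(-0.8))  # C2
```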
Decision Regions
The feature space is divided into K decision regions $\mathcal{R}_1, \ldots, \mathcal{R}_K$, where

$$\mathcal{R}_i = \{\mathbf{x} \mid g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})\}$$

The decision regions are separated by decision boundaries, where ties occur among the largest discriminant functions.
Bayesian Networks
A.k.a. belief networks, probabilistic networks, or, more generally, graphical models.
Node (or vertex): random variable
Edge (or arc or link): direct influence between variables
Structure: directed acyclic graph (DAG) formed by the nodes and edges
Parameters: probabilities and conditional probabilities
Causal Graph and Diagnostic Inference
Causal graph: rain is the cause of wet grass.
Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
Bayes' rule:

$$P(R|W) = \frac{P(W|R)P(R)}{P(W)} = \frac{0.9 \times 0.4}{0.9 \times 0.4 + 0.2 \times 0.6} = 0.75 > P(R) = 0.4$$
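The computation can be verified directly with the slide's numbers:

```python
# Diagnostic inference P(R|W) via Bayes' rule, using the slide's numbers:
# P(R) = 0.4, P(W|R) = 0.9, P(W|~R) = 0.2.
p_r = 0.4
p_w_given_r, p_w_given_not_r = 0.9, 0.2

p_w = p_w_given_r * p_r + p_w_given_not_r * (1 - p_r)   # evidence P(W)
p_r_given_w = p_w_given_r * p_r / p_w

print(p_r_given_w)  # 0.75 > P(R) = 0.4
```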
Two Causes: Causal Inference
Causal or predictive inference: if the sprinkler is on, what is the probability that the grass is wet?

$$\begin{aligned} P(W|S) &= P(W|R,S)P(R|S) + P(W|{\sim}R,S)P({\sim}R|S) \\ &= P(W|R,S)P(R) + P(W|{\sim}R,S)P({\sim}R) \\ &= 0.95 \times 0.4 + 0.9 \times 0.6 = 0.92 \end{aligned}$$
Two Causes: Diagnostic Inference
Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?

$$P(S|W) = \frac{P(W|S)P(S)}{P(W)}$$

where

$$\begin{aligned} P(W) &= P(W|R,S)P(R,S) + P(W|{\sim}R,S)P({\sim}R,S) \\ &\quad + P(W|R,{\sim}S)P(R,{\sim}S) + P(W|{\sim}R,{\sim}S)P({\sim}R,{\sim}S) \\ &= P(W|R,S)P(R)P(S) + P(W|{\sim}R,S)P({\sim}R)P(S) \\ &\quad + P(W|R,{\sim}S)P(R)P({\sim}S) + P(W|{\sim}R,{\sim}S)P({\sim}R)P({\sim}S) \\ &= 0.52 \end{aligned}$$

so

$$P(S|W) = \frac{0.92 \times 0.2}{0.52} = 0.35 > P(S) = 0.2$$
Two Causes: Explaining Away
Bayes' rule:

$$P(S|R,W) = \frac{P(W|R,S)P(S|R)}{P(W|R)} = \frac{P(W|R,S)P(S)}{P(W|R)} = 0.21$$

Explaining away:

$$0.21 = P(S|R,W) < P(S|W) = 0.35$$

Knowing that it has rained decreases the probability that the sprinkler is on.
Knowing that the grass is wet, rain and sprinkler become dependent:

$$P(S|R,W) \neq P(S|W)$$
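All three inference patterns on the two-cause network can be checked numerically. The slides give P(R) = 0.4, P(S) = 0.2, P(W|R,S) = 0.95, and P(W|∼R,S) = 0.9; the remaining table entries used below, P(W|R,∼S) = 0.9 and P(W|∼R,∼S) = 0.1, are assumptions chosen to reproduce the slides' P(W) = 0.52:

```python
# Rain (R) and Sprinkler (S) as independent causes of Wet grass (W).
# P(W|R,~S) = 0.9 and P(W|~R,~S) = 0.1 are assumed (not on the slides).
p_r, p_s = 0.4, 0.2
p_w_given = {(True, True): 0.95, (False, True): 0.9,   # keys: (R, S)
             (True, False): 0.9, (False, False): 0.1}

def p_rs(r, s):  # R and S are marginally independent
    return (p_r if r else 1 - p_r) * (p_s if s else 1 - p_s)

# Causal inference: P(W|S) = sum_r P(W|r,S) P(r)
p_w_given_s = sum(p_w_given[(r, True)] * (p_r if r else 1 - p_r)
                  for r in (True, False))

# Diagnostic inference: P(S|W) = P(W|S) P(S) / P(W)
p_w = sum(p_w_given[(r, s)] * p_rs(r, s)
          for r in (True, False) for s in (True, False))
p_s_given_w = p_w_given_s * p_s / p_w

# Explaining away: P(S|R,W) = P(W|R,S) P(S) / P(W|R)
p_w_given_r = sum(p_w_given[(True, s)] * (p_s if s else 1 - p_s)
                  for s in (True, False))
p_s_given_rw = p_w_given[(True, True)] * p_s / p_w_given_r

print(round(p_w_given_s, 2))   # 0.92
print(round(p_w, 2))           # 0.52
print(round(p_s_given_w, 2))   # 0.35
print(round(p_s_given_rw, 2))  # 0.21
```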
Dependent Causes
Causal inference:

$$\begin{aligned} P(W|C) &= P(W|R,S,C)P(R,S|C) + P(W|{\sim}R,S,C)P({\sim}R,S|C) \\ &\quad + P(W|R,{\sim}S,C)P(R,{\sim}S|C) + P(W|{\sim}R,{\sim}S,C)P({\sim}R,{\sim}S|C) \\ &= P(W|R,S)P(R|C)P(S|C) + P(W|{\sim}R,S)P({\sim}R|C)P(S|C) \\ &\quad + P(W|R,{\sim}S)P(R|C)P({\sim}S|C) + P(W|{\sim}R,{\sim}S)P({\sim}R|C)P({\sim}S|C) \end{aligned}$$

Independence: W and C are independent given R and S; R and S are independent given C.
Local Structures
The network represents conditional independence statements. The joint distribution can be broken down into local structures:

$$P(C,S,R,W,F) = P(C)\,P(S|C)\,P(R|C)\,P(W|S,R)\,P(F|R)$$

In general,

$$P(X_1, \ldots, X_d) = \prod_{i=1}^d P(X_i \mid \text{parents}(X_i))$$

where each $X_i$ is either continuous or discrete with two or more possible values.
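A sketch of evaluating this factorization; the conditional probability tables below are hypothetical, and F is treated only as a fifth binary node whose single parent is R (its meaning is not given in this transcript):

```python
from itertools import product

# Joint probability from local structures:
# P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|S,R) P(F|R).
# All table entries are hypothetical.
p_c = 0.5
p_s_given_c = {True: 0.1, False: 0.5}                      # P(S=1|C)
p_r_given_c = {True: 0.8, False: 0.1}                      # P(R=1|C)
p_w_given_sr = {(True, True): 0.95, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.1}   # P(W=1|S,R)
p_f_given_r = {True: 0.7, False: 0.05}                     # P(F=1|R)

def bern(p, value):  # P(X = value) for a binary X with P(X=1) = p
    return p if value else 1 - p

def joint(c, s, r, w, f):
    return (bern(p_c, c) * bern(p_s_given_c[c], s) * bern(p_r_given_c[c], r)
            * bern(p_w_given_sr[(s, r)], w) * bern(p_f_given_r[r], f))

# Sanity check: the joint sums to 1 over all 2^5 assignments.
total = sum(joint(*bits) for bits in product((True, False), repeat=5))
print(round(total, 10))  # 1.0
```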
Bayesian Network for Classification
Bayes' rule inverts the edge:

$$P(C|\mathbf{x}) = \frac{p(\mathbf{x}|C)P(C)}{p(\mathbf{x})}$$

Classification as diagnostic inference.
Naive Bayes’ Classifier
Given C, the input variables $x_j$ are independent:

$$p(\mathbf{x}|C) = \prod_{j=1}^d p(x_j|C)$$

The naive Bayes' classifier ignores possible dependencies among the input variables and reduces a multivariate problem to a group of univariate problems.
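A minimal sketch with hypothetical univariate Gaussian class models, using the factorized likelihood inside the Bayes' classifier:

```python
import math

# Naive Bayes': p(x|C) factorizes over dimensions, so each dimension is
# modeled by its own univariate density. All parameters are hypothetical.
def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

params = {0: [(0.0, 1.0), (5.0, 2.0)],   # per-class (mean, std) per dimension
          1: [(2.0, 1.0), (1.0, 2.0)]}
priors = {0: 0.6, 1: 0.4}                # P(C)

def naive_bayes_classify(x):
    scores = {}
    for c, dims in params.items():
        likelihood = 1.0
        for xj, (mean, std) in zip(x, dims):   # p(x|C) = prod_j p(x_j|C)
            likelihood *= gaussian_pdf(xj, mean, std)
        scores[c] = likelihood * priors[c]     # proportional to P(C|x)
    return max(scores, key=scores.get)

print(naive_bayes_classify([0.1, 4.5]))  # 0
print(naive_bayes_classify([2.2, 0.8]))  # 1
```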