Bayesian decision theory: A framework for making decisions when uncertainty exists. Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © 2010 The MIT Press (V1.0)


TRANSCRIPT

Page 1:

Bayesian decision theory: A framework for making decisions when uncertainty exists

Page 2:


Modeling data as random variables

Example: coin toss. Given sufficient knowledge, we could use Newton's laws of motion to calculate the result of each toss with minimal uncertainty.

In conjunction with our model, analysis of experimental trajectories will probably reveal why the coin is unfair if heads and tails do not occur with equal probability

Alternative: accept uncertainty about the result of the toss. Treat the result as a random variable X governed by P(X = x), and use P(X = x) to make a rational decision about the result of the next toss. Assume we are not interested in why the coin is unfair, if it is: "the reason is in the data."

Page 3:

Statistical Analysis of Coin-Toss Data

• Let heads = 1; tails = 0
• Boolean random variables obey Bernoulli statistics: P(x) = p_o^x (1 − p_o)^(1 − x), where p_o is the probability of heads
• Given a sample of N tosses, an unbiased estimator of p_o is the fraction of tosses that show heads
• Prediction of the next toss: heads if the estimate of p_o is > ½, tails otherwise
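A minimal sketch of this estimator and decision rule in Python; the sample of tosses below is made up for illustration:

```python
import numpy as np

# Hypothetical sample of N coin tosses (1 = heads, 0 = tails).
tosses = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Unbiased estimate of p_o: the fraction of tosses that show heads.
p_hat = tosses.mean()

# Predict the next toss: heads if the estimate exceeds 1/2, tails otherwise.
prediction = "heads" if p_hat > 0.5 else "tails"
print(f"p_hat = {p_hat:.2f}, predict {prediction}")
```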


Page 4:

Review: Bayes' rule for binary classification

P(C|x) = p(x|C) P(C) / p(x)

posterior = (class likelihood × prior) / normalization, where the normalization (evidence) is p(x) = p(x|C=1) P(C=1) + p(x|C=0) P(C=0)

The prior P(C) is information relevant to classification that is independent of the attributes x. The class likelihood p(x|C) is the probability that a member of class C has attributes x. Assign a client with attributes x to class C if P(C|x) > 0.5.
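A small numeric sketch of this rule; the prior and class likelihoods below are made-up values, not from the notes:

```python
# Assumed prior and class likelihoods for one observed attribute value x.
P_C1 = 0.3            # prior P(C = 1)
p_x_C1 = 0.7          # class likelihood p(x | C = 1)
p_x_C0 = 0.2          # class likelihood p(x | C = 0)

# Normalization (evidence): p(x) = p(x|C=1)P(C=1) + p(x|C=0)P(C=0)
p_x = p_x_C1 * P_C1 + p_x_C0 * (1 - P_C1)

# Posterior by Bayes' rule, and the 0.5-threshold decision.
P_C1_x = p_x_C1 * P_C1 / p_x
print(P_C1_x, "-> assign to C1" if P_C1_x > 0.5 else "-> assign to C0")
```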

Page 5:

Review: Bayes’ Rule: K>2 Classes

P(C_i|x) = p(x|C_i) P(C_i) / p(x) = p(x|C_i) P(C_i) / Σ_{k=1..K} p(x|C_k) P(C_k)

with P(C_i) ≥ 0 and Σ_{i=1..K} P(C_i) = 1

Assign client with attributes x to C_i if P(C_i|x) = max_k P(C_k|x)
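A sketch of the K-class rule; the priors and likelihoods are made-up numbers for illustration:

```python
import numpy as np

# Assumed priors P(C_i) and likelihoods p(x | C_i) for K = 3 classes.
priors = np.array([0.5, 0.3, 0.2])
likelihoods = np.array([0.10, 0.40, 0.30])

# Posterior is proportional to likelihood times prior; divide by the evidence.
joint = likelihoods * priors
posteriors = joint / joint.sum()        # p(x) = sum_k p(x|C_k) P(C_k)

# Assign x to the class with the maximum posterior.
print(posteriors, "-> choose C%d" % (np.argmax(posteriors) + 1))
```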

Page 6:

Review: Estimating priors and class likelihoods from data

If we assume the members of class C_i are Gaussian distributed, the mean μ_i and covariance Σ_i parameterize the class likelihood:

p(x|C_i) = 1 / ((2π)^(d/2) |Σ_i|^(1/2)) exp( −½ (x − μ_i)^T Σ_i^(−1) (x − μ_i) )

The fraction of examples belonging to a class is an estimate of its prior.

With class labels r_i^t (r_i^t = 1 if x^t belongs to C_i, 0 otherwise), the estimators are

P̂(C_i) = Σ_t r_i^t / N
m_i = Σ_t r_i^t x^t / Σ_t r_i^t
S_i = Σ_t r_i^t (x^t − m_i)(x^t − m_i)^T / Σ_t r_i^t
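A sketch of these estimators in Python; the data and one-hot labels below are randomly generated placeholders, not from the notes:

```python
import numpy as np

# Made-up training data: X is (N, d) attribute vectors, r is (N, K) one-hot labels r_i^t.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
r = np.eye(3)[rng.integers(0, 3, size=100)]

N, K = r.shape
priors, means, covs = [], [], []
for i in range(K):
    ri = r[:, i]                                  # indicator of class i
    Ni = ri.sum()
    priors.append(Ni / N)                         # P_hat(C_i) = sum_t r_i^t / N
    m_i = (ri[:, None] * X).sum(axis=0) / Ni      # m_i = sum_t r_i^t x^t / sum_t r_i^t
    diff = X - m_i
    S_i = (ri[:, None, None] * np.einsum('tj,tk->tjk', diff, diff)).sum(axis=0) / Ni
    means.append(m_i)
    covs.append(S_i)                              # S_i = sum_t r_i^t (x^t - m_i)(x^t - m_i)^T / sum_t r_i^t
```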

Page 7:

Review: naïve Bayes classification

A simpler model results from assuming that the components of x are independent random variables. The covariance matrix is then diagonal, and p(x|C_i) is the product of the probabilities for each component of x:

p(x|C_i) = Π_{j=1..d} 1/(√(2π) s_ij) exp( −½ ((x_j − m_ij) / s_ij)² )

Each class is characterized by a set of means m_ij and variances s_ij² for the components of the attributes in that class.
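A sketch of this product-of-Gaussians likelihood; the per-class means and standard deviations below are assumed values for illustration:

```python
import numpy as np

def naive_bayes_likelihood(x, m_i, s_i):
    """p(x | C_i) as a product of one-dimensional Gaussians, one per attribute."""
    per_dim = np.exp(-0.5 * ((x - m_i) / s_i) ** 2) / (np.sqrt(2 * np.pi) * s_i)
    return per_dim.prod()

# Made-up attribute vector and class parameters (d = 2).
x = np.array([1.0, 2.0])
print(naive_bayes_likelihood(x, m_i=np.array([0.5, 1.5]), s_i=np.array([1.0, 0.8])))
```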

Page 8:

Minimizing risk given attributes x

• Actions: α_i is the action of assigning x to C_i, one of K classes
• Loss λ_ik occurs if we take α_i when x actually belongs to C_k
• Expected risk (Duda and Hart, 1973):

R(α_i|x) = Σ_{k=1..K} λ_ik P(C_k|x)

Choose α_i if R(α_i|x) = min_k R(α_k|x)
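A sketch of this rule with a made-up loss matrix and made-up posteriors:

```python
import numpy as np

# loss[i, k] = lambda_ik: loss of taking action alpha_i when x belongs to C_k (assumed values).
loss = np.array([[0.0, 1.0, 4.0],
                 [2.0, 0.0, 1.0],
                 [4.0, 2.0, 0.0]])
posteriors = np.array([0.2, 0.5, 0.3])   # P(C_k | x), assumed

# R(alpha_i | x) = sum_k lambda_ik P(C_k | x), computed for every action at once.
risks = loss @ posteriors

# Take the action with minimum expected risk.
print(risks, "-> choose alpha_%d" % (np.argmin(risks) + 1))
```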

Page 9:

Special case: correct decisions incur no loss and all errors have equal cost: the "0/1 loss function"

λ_ik = 0 if i = k, 1 if i ≠ k

R(α_i|x) = Σ_{k=1..K} λ_ik P(C_k|x) = Σ_{k≠i} P(C_k|x) = 1 − P(C_i|x)

For minimum risk, choose the most probable class.

Page 10:

Add rejection option: don’t assign a class

λ_ik = 0 if i = k; λ if i = K + 1 (the reject action); 1 otherwise, with 0 < λ < 1

Risk of no assignment (reject):
R(α_{K+1}|x) = Σ_{k=1..K} λ P(C_k|x) = λ

Risk of choosing C_i:
R(α_i|x) = Σ_{k≠i} P(C_k|x) = 1 − P(C_i|x)

Choose C_i if P(C_i|x) > P(C_k|x) for all k ≠ i and P(C_i|x) > 1 − λ; otherwise reject.
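A sketch of the reject rule under this 0/1/λ loss, with made-up posteriors:

```python
import numpy as np

def classify_with_reject(posteriors, lam):
    """Return the index of the chosen class, or None to reject."""
    i = int(np.argmax(posteriors))
    # Choosing C_i has risk 1 - P(C_i|x); rejecting has risk lambda.
    if posteriors[i] > 1.0 - lam:
        return i
    return None   # no class is probable enough, so reject

print(classify_with_reject(np.array([0.55, 0.30, 0.15]), lam=0.3))  # None (reject)
print(classify_with_reject(np.array([0.80, 0.15, 0.05]), lam=0.3))  # 0 (choose C_1)
```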

Page 11:

Example of risk minimization with λ_11 = λ_22 = 0, λ_12 = 10, and λ_21 = 1

Recall that loss λ_ik occurs if we take α_i when x belongs to C_k, R(α_i|x) = Σ_k λ_ik P(C_k|x), and we choose the action with minimum risk.

R(α_1|x) = λ_11 P(C_1|x) + λ_12 P(C_2|x) = 10 P(C_2|x)
R(α_2|x) = λ_21 P(C_1|x) + λ_22 P(C_2|x) = P(C_1|x)

Choose C_1 if R(α_1|x) < R(α_2|x), which is true if 10 P(C_2|x) < P(C_1|x), which becomes P(C_1|x) > 10/11 using the normalization of posteriors (P(C_1|x) + P(C_2|x) = 1).

The consequence of erroneously assigning an instance to C_1 is so bad that we choose C_1 only when we are virtually certain it is correct.
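A quick numerical check of this threshold, with two made-up posterior values:

```python
# lambda_12 = 10, lambda_21 = 1, lambda_11 = lambda_22 = 0
for p1 in (0.85, 0.95):            # assumed values of P(C1|x)
    p2 = 1.0 - p1
    R1 = 10 * p2                   # R(alpha_1|x) = 10 P(C2|x)
    R2 = p1                        # R(alpha_2|x) = P(C1|x)
    choice = "C1" if R1 < R2 else "C2"
    print(f"P(C1|x) = {p1}: R1 = {R1:.2f}, R2 = {R2:.2f} -> choose {choice}")
# Only P(C1|x) = 0.95 exceeds 10/11 ~ 0.909, so only then is C1 chosen.
```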

Page 12:

Page 13:


Bayes' classifier based on neighbors

Recall Bayes' rule for K classes:
P(C_i|x) = p(x|C_i) P(C_i) / Σ_k p(x|C_k) P(C_k), summing over all classes

Consider a data set with N examples, N_i of which belong to class C_i, so P(C_i) = N_i / N.

Given a new example x, draw a hyper-sphere of volume V in attribute space, centered on x and containing precisely K training examples, irrespective of their class.

Suppose this sphere contains n_i examples from class C_i. Then p(x|C_i) ≈ n_i / (N_i V), so p(x|C_i) P(C_i) = (n_i / (N_i V)) (N_i / N) = n_i / (N V).

Substituting into Bayes' rule (the sum is over classes, and the n_k sum to K, the number of neighbors):

P(C_i|x) = p(x|C_i) P(C_i) / Σ_k p(x|C_k) P(C_k) = (n_i / (N V)) / Σ_k (n_k / (N V)) = n_i / K

Page 14:


Using Bayes’ rule we find posteriors p(Ck|x) = nk/K

Assign x to the class with highest posterior, which is the class with the highest representation among the K training examples in the hyper-sphere centered on x

K = 1 (nearest-neighbor rule): assign x to the class of its nearest neighbor in the training data.

Bayes’ classifier based on neighbors

Page 15:


Bayes' classifier based on K nearest neighbors (KNN)

Usually, K is chosen from a range of values based on validation error.

In 2D, we can visualize the classification by applying KNN to every point in the (x1, x2) plane. As K increases, expect fewer islands and smoother decision boundaries.
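A sketch of the KNN posterior estimate P(C_i|x) ≈ n_i / K; the training data and query point below are made up:

```python
import numpy as np

def knn_posteriors(x, X_train, y_train, K=5):
    """Estimate P(C_i | x) as n_i / K over the K nearest training examples."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:K]]
    classes = np.unique(y_train)
    return classes, np.array([(nearest == c).mean() for c in classes])

rng = np.random.default_rng(1)
X_train = rng.normal(size=(60, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)   # synthetic labels
classes, post = knn_posteriors(np.array([0.5, 0.5]), X_train, y_train, K=5)
print("predict class", classes[np.argmax(post)], "with posteriors", post)
```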

Page 16:

Page 17:

Analysis of binary classification: beyond the confusion matrix

Page 18:

Quantities defined by binary confusion matrix

Let C1 be the positive class, C2 the negative class, and N the number of instances.
Error rate = (FP + FN) / N = 1 − accuracy
False positive rate = FP / (FP + TN) = fraction of C2 instances misclassified
True positive rate = TP / (TP + FN) = fraction of C1 instances correctly classified
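A sketch of these quantities on a made-up confusion matrix:

```python
# Assumed counts: C1 = positive class, 100 instances in total.
TP, FN, FP, TN = 40, 10, 5, 45
N = TP + FN + FP + TN

error_rate = (FP + FN) / N          # = 1 - accuracy
fp_rate = FP / (FP + TN)            # fraction of C2 (negative) instances misclassified
tp_rate = TP / (TP + FN)            # fraction of C1 (positive) instances correctly classified
print(error_rate, fp_rate, tp_rate) # 0.15 0.1 0.8
```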


Page 19:


Receiver operating characteristic (ROC) curve

Let C1 be the positive class, and let q be the threshold on P(C1|x) for assigning x to C1.

If q is near 1, assignments to C1 are rare but have a high probability of being correct, so both the FP-rate and the TP-rate are small. As q decreases, both the FP-rate and the TP-rate increase. For every value of q, (FP-rate, TP-rate) is a point on the ROC curve.

Page 20:


ROC curves

[Figure: example ROC curves; the diagonal corresponds to chance alone, and a curve just above it to marginal success]

Page 21:

Drawing ROC curves

Assume C1 is the positive class. Rank all examples by decreasing P(C1|x). In decreasing rank order, move up by 1/N1 for each positive example and move right by 1/N2 for each negative example, where N1 and N2 are the numbers of positive and negative examples.

If all examples are correctly classified, the ROC curve hugs the upper-left corner.

If P(C1|x) is not correlated with the class labels, the ROC curve will be close to the diagonal.
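A sketch of this construction; the scores P(C1|x) and labels below are made-up values:

```python
import numpy as np

scores = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20])  # P(C1|x), assumed
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])                          # 1 = positive (C1)

order = np.argsort(-scores)                  # decreasing rank order
n_pos, n_neg = labels.sum(), (labels == 0).sum()

fpr, tpr = [0.0], [0.0]
for lab in labels[order]:
    # Move up by 1/N1 for a positive example, right by 1/N2 for a negative one.
    tpr.append(tpr[-1] + (lab == 1) / n_pos)
    fpr.append(fpr[-1] + (lab == 0) / n_neg)
print(list(zip(fpr, tpr)))                   # (FP-rate, TP-rate) points of the ROC curve
```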

Page 22:

Performance with the reduced attribute set is slightly improved: the number of misclassified malignant cases decreased by 2.