Bayesian decision theory — DISI, University of Trento
TRANSCRIPT
Source: disi.unitn.it/.../slides/05_bayesian_decision/talk.pdf · 2013-09-10
Bayesian decision theory: introduction
Overview
- Bayesian decision theory allows optimal decisions to be taken in a fully probabilistic setting
- It assumes all relevant probabilities are known
- It makes it possible to provide upper bounds on achievable errors and to evaluate classifiers accordingly
- Bayesian reasoning can be generalized to cases where the probabilistic structure is not entirely known
Bayesian decision theory: introduction

Binary classification
Assume examples (x, y) ∈ X × {−1, 1} are drawn from a known distribution p(x, y).
The task is to predict the class y of an example given its input x.
Bayes' rule allows us to write this in probabilistic terms as:

P(y|x) = p(x|y) P(y) / p(x)
Bayesian decision theory: introduction

Bayes' rule
Bayes' rule computes the posterior probability from the likelihood, the prior and the evidence:

posterior = (likelihood × prior) / evidence

- posterior P(y|x) is the probability that the class is y given that x was observed
- likelihood p(x|y) is the probability of observing x given that its class is y
- prior P(y) is the prior probability of the class, before any evidence
- evidence p(x) is the probability of the observation, which by the law of total probability can be computed as:

p(x) = ∑_{i=1}^{2} p(x|y_i) P(y_i)
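The computation above can be sketched numerically. A minimal Python sketch, with made-up likelihood and prior values for illustration:

```python
def posterior(likelihoods, priors):
    """Compute P(y|x) for each class from p(x|y) and P(y).

    The evidence p(x) is obtained via the law of total probability:
    p(x) = sum_i p(x|y_i) * P(y_i).
    """
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# p(x|y1) = 0.6, p(x|y2) = 0.2, equal priors (illustrative numbers)
post = posterior([0.6, 0.2], [0.5, 0.5])  # -> [0.75, 0.25]
```

Note the posteriors sum to one by construction, since the evidence is exactly the sum of the joint terms.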
Bayesian decision theory: introduction
Probability of error
The probability of error given x:

P(error|x) = P(y2|x) if we decide y1
             P(y1|x) if we decide y2

The average probability of error:

P(error) = ∫_{−∞}^{+∞} P(error|x) p(x) dx
Bayesian decision theory: introduction
Bayes decision rule

y_B = argmax_{y_i ∈ {−1,1}} P(y_i|x) = argmax_{y_i ∈ {−1,1}} p(x|y_i) P(y_i)

The probability of error given x is then:

P(error|x) = min[P(y1|x), P(y2|x)]

The Bayes decision rule minimizes the probability of error.
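The rule can be exercised with a small sketch (illustrative numbers; `bayes_decide` is a hypothetical helper, not from the slides):

```python
def bayes_decide(likelihoods, priors):
    """Return the Bayes decision (index of max posterior) and P(error|x)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)
    posteriors = [j / evidence for j in joint]
    decision = max(range(len(posteriors)), key=lambda i: posteriors[i])
    # In the binary case, P(error|x) is the posterior of the rejected class
    return decision, min(posteriors)

decision, p_error = bayes_decide([0.3, 0.1], [0.4, 0.6])
```

Here the posteriors are 2/3 and 1/3, so the rule decides class 0 and the conditional error is 1/3, the smaller posterior.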
Bayesian decision theory: Continuous features
Setting
- Inputs are vectors in a d-dimensional Euclidean space ℝ^d called the feature space
- The output is one of c possible categories or classes Y = {y1, ..., yc} (binary/multiclass classification)
- A decision rule is a function f : X → Y telling which class to predict for a given observation
- A loss function ℓ : Y × Y → ℝ measures the cost of predicting y_i when the true class is y_j

The conditional risk of predicting y given x is:

R(y|x) = ∑_{i=1}^{c} ℓ(y, y_i) P(y_i|x)
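The conditional risk is a posterior-weighted average of losses. A sketch with a hypothetical asymmetric 3-class loss matrix and illustrative posteriors:

```python
# Conditional risk R(y|x) = sum_i loss(y, y_i) * P(y_i|x)

loss = [
    [0.0, 1.0, 4.0],  # cost of predicting class 0 when truth is 0, 1, 2
    [1.0, 0.0, 1.0],
    [4.0, 1.0, 0.0],
]
posteriors = [0.5, 0.3, 0.2]  # P(y_i|x), illustrative

def conditional_risk(pred):
    return sum(c * p for c, p in zip(loss[pred], posteriors))

risks = [conditional_risk(y) for y in range(3)]
best = min(range(3), key=lambda y: risks[y])
```

With these numbers the risks are [1.1, 0.7, 2.3]: minimizing risk picks class 1 even though class 0 is the most probable, because the asymmetric costs penalize confusing the extreme classes.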
Bayesian decision theory: Continuous features
Bayes decision rule

y* = argmin_{y ∈ Y} R(y|x)

The overall risk of a decision rule f is given by:

R[f] = ∫ R(f(x)|x) p(x) dx = ∫ ∑_{i=1}^{c} ℓ(f(x), y_i) p(y_i, x) dx

The Bayes decision rule minimizes the overall risk.
The Bayes risk R* is the overall risk of a Bayes decision rule, and it is the best achievable performance.
Minimum-error-rate classification
Zero-one loss

ℓ(y, y_i) = 0 if y = y_i
            1 if y ≠ y_i

- It assigns a unit cost to a misclassified example and zero cost to a correctly classified one
- It assumes all errors have the same cost
- Its risk corresponds to the average probability of error:

R(y_i|x) = ∑_{j=1}^{c} ℓ(y_i, y_j) P(y_j|x) = ∑_{j≠i} P(y_j|x) = 1 − P(y_i|x)
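A quick numerical check that under zero-one loss the conditional risk reduces to 1 − P(y_i|x), using illustrative posteriors:

```python
posteriors = [0.6, 0.3, 0.1]  # P(y_j|x), illustrative

def zero_one_risk(i):
    # loss(y_i, y_j) = 0 if i == j else 1, so the risk sums the other posteriors
    return sum(p for j, p in enumerate(posteriors) if j != i)

risks = [zero_one_risk(i) for i in range(len(posteriors))]
```

Each risk equals one minus the corresponding posterior, so minimizing risk is the same as maximizing the posterior.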
Minimum-error-rate classification: binary case
[Figure: class-conditional densities p(x|ω1) and p(x|ω2) over x, with thresholds θa, θb and the induced decision regions R1, R2]

Bayes decision rule

y_B = argmax_{y_i ∈ {−1,1}} P(y_i|x) = argmax_{y_i ∈ {−1,1}} p(x|y_i) P(y_i)

Likelihood ratio: decide y1 if

p(x|y1) P(y1) > p(x|y2) P(y2)

i.e. if

p(x|y1) / p(x|y2) > P(y2) / P(y1) = θ_a
Representing classifiers
Discriminant functions
A classifier can be represented as a set of discriminant functions g_i(x), i ∈ 1, ..., c, giving:

y = argmax_{i ∈ 1,...,c} g_i(x)

A discriminant function is not unique ⇒ the most convenient one for computational or explanatory reasons can be used:

g_i(x) = P(y_i|x) = p(x|y_i) P(y_i) / p(x)
g_i(x) = p(x|y_i) P(y_i)
g_i(x) = ln p(x|y_i) + ln P(y_i)
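The three forms above differ only by a factor independent of i or by a monotone (log) transform, so they induce the same decision. A sketch with illustrative likelihoods and priors:

```python
import math

likelihoods = [0.05, 0.20, 0.10]  # p(x|y_i), illustrative
priors = [0.5, 0.2, 0.3]          # P(y_i), illustrative
evidence = sum(l * p for l, p in zip(likelihoods, priors))

g_posterior = [l * p / evidence for l, p in zip(likelihoods, priors)]
g_joint = [l * p for l, p in zip(likelihoods, priors)]
g_log = [math.log(l) + math.log(p) for l, p in zip(likelihoods, priors)]

def argmax(g):
    return max(range(len(g)), key=lambda i: g[i])
```

All three variants return the same argmax, here class 1.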
Representing classifiers
[Figure: densities p(x|ω1)P(ω1) and p(x|ω2)P(ω2) with the decision boundary separating regions R1 and R2]

Decision regions
The feature space is divided into decision regions R1, ..., Rc such that:

x ∈ R_i if g_i(x) > g_j(x) ∀ j ≠ i

Decision regions are separated by decision boundaries, i.e. regions where ties occur among the largest discriminant functions.
Normal density
Multivariate normal density

p(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) exp(−1/2 (x − µ)^t Σ^{−1} (x − µ))

- The covariance matrix Σ is always symmetric and positive semi-definite
- Σ is strictly positive definite if the data truly span d dimensions (otherwise |Σ| = 0 and the density is degenerate)
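The density formula above can be evaluated directly with numpy; a minimal sketch:

```python
import numpy as np

def mvn_density(x, mu, sigma):
    """p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-(x-mu)^T Sigma^{-1} (x-mu) / 2)."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm)

mu = np.zeros(2)
p_at_mean = mvn_density(mu, mu, np.eye(2))  # equals 1/(2*pi) for d=2, Sigma=I
```

At the mean the exponent vanishes, so for d = 2 and Σ = I the density is exactly 1/(2π).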
Normal density

[Figure: constant-density contours around µ in the (x1, x2) plane]

Hyperellipsoids
The loci of points of constant density are hyperellipsoids of constant Mahalanobis distance from x to µ.
The principal axes of such hyperellipsoids are the eigenvectors of Σ; their lengths are given by the corresponding eigenvalues.
Discriminant functions for normal density
Discriminant functions

g_i(x) = ln p(x|y_i) + ln P(y_i)
       = −1/2 (x − µ_i)^t Σ_i^{−1} (x − µ_i) − d/2 ln 2π − 1/2 ln|Σ_i| + ln P(y_i)

Discarding terms which are independent of i we obtain:

g_i(x) = −1/2 (x − µ_i)^t Σ_i^{−1} (x − µ_i) − 1/2 ln|Σ_i| + ln P(y_i)
Discriminant functions for normal density
case Σ_i = σ²I
- Features are statistically independent
- All features have the same variance σ²
- The covariance determinant |Σ_i| = σ^{2d} can be ignored, being independent of i
- The covariance inverse is Σ_i^{−1} = (1/σ²) I

The discriminant functions become:

g_i(x) = −||x − µ_i||² / (2σ²) + ln P(y_i)
Discriminant functions for normal density
[Figure: 1-D, 2-D and 3-D examples with equal priors P(ω1) = P(ω2) = .5, showing the densities p(x|ω_i) and the decision regions R1, R2]

case Σ_i = σ²I
Expansion of the quadratic form leads to:

g_i(x) = −1/(2σ²) [x^t x − 2 µ_i^t x + µ_i^t µ_i] + ln P(y_i)

Discarding terms which are independent of i we obtain linear discriminant functions:

g_i(x) = w_i^t x + w_{i0},  with  w_i = (1/σ²) µ_i  and  w_{i0} = −1/(2σ²) µ_i^t µ_i + ln P(y_i)
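A numerical check that the linear form w_i^t x + w_{i0} and the distance form −||x − µ_i||²/(2σ²) + ln P(y_i) differ only by −x^t x/(2σ²), which is independent of i and cancels when comparing classes. Means, priors and σ² are illustrative:

```python
import numpy as np

sigma2 = 2.0
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
priors = [0.6, 0.4]

def g_linear(x, mu, prior):
    w = mu / sigma2
    w0 = -(mu @ mu) / (2 * sigma2) + np.log(prior)
    return float(w @ x + w0)

def g_distance(x, mu, prior):
    return float(-np.sum((x - mu) ** 2) / (2 * sigma2) + np.log(prior))

x = np.array([1.0, 2.0])
gap_linear = g_linear(x, means[0], priors[0]) - g_linear(x, means[1], priors[1])
gap_distance = g_distance(x, means[0], priors[0]) - g_distance(x, means[1], priors[1])
```

Both forms yield the same between-class difference, hence the same decisions.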
case Σi = σ2I
Separating hyperplane

Setting g_i(x) = g_j(x) we note that the decision boundaries are pieces of hyperplanes w^t (x − x0) = 0, with:

w = µ_i − µ_j
x0 = 1/2 (µ_i + µ_j) − σ² / ||µ_i − µ_j||² ln(P(y_i)/P(y_j)) (µ_i − µ_j)

- The hyperplane is orthogonal to the vector w ⇒ orthogonal to the line linking the means
- The hyperplane passes through x0:
  - if the prior probabilities of the classes are equal, x0 is halfway between the means
  - otherwise, x0 shifts away from the more likely mean, enlarging the corresponding decision region
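A sketch computing x0 for the Σ_i = σ²I case and checking that the two discriminants tie there, i.e. that x0 lies on the decision boundary. Parameters are illustrative:

```python
import numpy as np

sigma2 = 1.0
mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 0.0])
p_i, p_j = 0.7, 0.3

def g(x, mu, prior):
    return float(-np.sum((x - mu) ** 2) / (2 * sigma2) + np.log(prior))

d = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / (d @ d) * np.log(p_i / p_j) * d
gap = g(x0, mu_i, p_i) - g(x0, mu_j, p_j)
```

Here P(y_i) > P(y_j), and x0 lands past the midpoint (1, 0) towards µ_j, i.e. away from the more likely mean.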
case Σi = σ2I
Separating hyperplane: derivation (1)

g_i(x) − g_j(x) = 0

(1/σ²) µ_i^t x − 1/(2σ²) µ_i^t µ_i + ln P(y_i) − (1/σ²) µ_j^t x + 1/(2σ²) µ_j^t µ_j − ln P(y_j) = 0

(µ_i − µ_j)^t x − 1/2 (µ_i^t µ_i − µ_j^t µ_j) + σ² ln(P(y_i)/P(y_j)) = 0

w^t (x − x0) = 0  with  w = µ_i − µ_j  and

(µ_i − µ_j)^t x0 = 1/2 (µ_i^t µ_i − µ_j^t µ_j) − σ² ln(P(y_i)/P(y_j))
case Σi = σ2I
Separating hyperplane: derivation (2)

(µ_i − µ_j)^t x0 = 1/2 (µ_i^t µ_i − µ_j^t µ_j) − σ² ln(P(y_i)/P(y_j))

Using (µ_i^t µ_i − µ_j^t µ_j) = (µ_i − µ_j)^t (µ_i + µ_j) and

σ² ln(P(y_i)/P(y_j)) = σ² (µ_i − µ_j)^t (µ_i − µ_j) / ||µ_i − µ_j||² ln(P(y_i)/P(y_j))

we obtain:

x0 = 1/2 (µ_i + µ_j) − σ² (µ_i − µ_j) / ||µ_i − µ_j||² ln(P(y_i)/P(y_j))
case Σi = σ2I
[Figure: effect of unequal priors in 1-D, 2-D and 3-D, for prior pairs P(ω1)/P(ω2) of .7/.3, .9/.1, .8/.2 and .99/.01 — the boundary between R1 and R2 shifts as the priors become more unbalanced]
Discriminant functions for normal density

case Σ_i = Σ
All classes have the same covariance matrix. The discriminant functions become:

g_i(x) = −1/2 (x − µ_i)^t Σ^{−1} (x − µ_i) + ln P(y_i)

Expanding the quadratic form and discarding terms independent of i we again obtain linear discriminant functions:

g_i(x) = w_i^t x + w_{i0},  with  w_i = Σ^{−1} µ_i  and  w_{i0} = −1/2 µ_i^t Σ^{−1} µ_i + ln P(y_i)

The separating hyperplanes are not necessarily orthogonal to the line linking the means: w^t (x − x0) = 0, with

w = Σ^{−1} (µ_i − µ_j)
x0 = 1/2 (µ_i + µ_j) − ln(P(y_i)/P(y_j)) / ((µ_i − µ_j)^t Σ^{−1} (µ_i − µ_j)) (µ_i − µ_j)
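A sketch for the shared-covariance case: w = Σ^{−1}(µ_i − µ_j) is in general not parallel to µ_i − µ_j, so the boundary need not be orthogonal to the line linking the means, while x0 still lies on the boundary. Parameters are illustrative:

```python
import numpy as np

sigma = np.array([[2.0, 0.9],
                  [0.9, 1.0]])
mu_i, mu_j = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
p_i, p_j = 0.3, 0.7

def g(x, mu, prior):
    diff = x - mu
    return float(-0.5 * diff @ np.linalg.solve(sigma, diff) + np.log(prior))

d = mu_i - mu_j
w = np.linalg.solve(sigma, d)
x0 = 0.5 * (mu_i + mu_j) - np.log(p_i / p_j) / (d @ w) * d
gap = g(x0, mu_i, p_i) - g(x0, mu_j, p_j)
tilt = w[0] * d[1] - w[1] * d[0]  # zero only if w is parallel to d
```

The discriminants tie at x0, and the nonzero 2-D cross product confirms w is tilted with respect to the mean difference because of the correlated covariance.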
case Σi = Σ
[Figure: shared-covariance examples in 2-D and 3-D for priors P(ω1)/P(ω2) of .5/.5 and .1/.9 — the linear boundary between R1 and R2 shifts with the priors and is generally tilted with respect to the line linking the means]
Discriminant functions for normal density
case Σ_i arbitrary
The discriminant functions are inherently quadratic:

g_i(x) = x^t W_i x + w_i^t x + w_{i0}

with:

W_i = −1/2 Σ_i^{−1}
w_i = Σ_i^{−1} µ_i
w_{i0} = −1/2 µ_i^t Σ_i^{−1} µ_i − 1/2 ln|Σ_i| + ln P(y_i)

In the two-category case, the decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, etc.
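A sketch of the quadratic discriminant for arbitrary per-class covariances, using the equivalent form g_i(x) = −1/2 (x − µ_i)^t Σ_i^{−1} (x − µ_i) − 1/2 ln|Σ_i| + ln P(y_i). Parameters are illustrative:

```python
import numpy as np

def g_quadratic(x, mu, sigma, prior):
    diff = x - mu
    maha = diff @ np.linalg.solve(sigma, diff)  # Mahalanobis distance squared
    return float(-0.5 * maha - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior))

# Two classes with different means and different covariances (illustrative)
params = [
    (np.array([0.0, 0.0]), np.eye(2), 0.5),
    (np.array([2.0, 2.0]), np.array([[3.0, 0.0], [0.0, 0.5]]), 0.5),
]

def classify(x):
    scores = [g_quadratic(x, mu, sig, p) for mu, sig, p in params]
    return int(np.argmax(scores))
```

Because the covariances differ, the implied boundary between the two classes is a hyperquadric rather than a hyperplane.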
case Σ_i arbitrary

[Figure: examples of hyperquadric decision boundaries for arbitrary covariance matrices]
Reducible error
[Figure: joint densities p(x|ω_i)P(ω_i) over x, with the Bayes boundary x_B and a suboptimal boundary x*; the extra shaded area is the reducible error]

Probability of error in binary classification

P(error) = P(x ∈ R1, y2) + P(x ∈ R2, y1)
         = ∫_{R1} p(x|y2) P(y2) dx + ∫_{R2} p(x|y1) P(y1) dx

The reducible error is the extra error resulting from a suboptimal choice of the decision boundary.
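A numerical sketch of the reducible error for two 1-D Gaussian classes: the error at a suboptimal threshold exceeds the error at the Bayes threshold, and the difference is the reducible error. All numbers are illustrative:

```python
import math

def normal_pdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

mu1, mu2, s, p1, p2 = 0.0, 2.0, 1.0, 0.5, 0.5

def error_at_threshold(t, n=20000, lo=-10.0, hi=10.0):
    # P(error) = int_{R1} p(x|y2)P(y2) dx + int_{R2} p(x|y1)P(y1) dx,
    # with R1 = (-inf, t) and R2 = (t, inf); simple midpoint Riemann sum.
    h = (hi - lo) / n
    err = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        err += h * (normal_pdf(x, mu2, s) * p2 if x < t else normal_pdf(x, mu1, s) * p1)
    return err

bayes_t = 0.5 * (mu1 + mu2)  # midpoint: equal priors, equal variances
reducible = error_at_threshold(0.3) - error_at_threshold(bayes_t)
```

With equal priors and variances the Bayes boundary is the midpoint of the means, and any other threshold pays a strictly positive reducible error.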
Bayesian decision theory: arbitrary inputs and outputs
Setting
Examples are input-output pairs (x, y) ∈ X × Y generated with probability p(x, y).
The conditional risk of predicting y* given x is:

R(y*|x) = ∫_Y ℓ(y*, y) P(y|x) dy

The overall risk of a decision rule f is given by:

R[f] = ∫ R(f(x)|x) p(x) dx = ∫_X ∫_Y ℓ(f(x), y) p(x, y) dy dx

Bayes decision rule

y_B = argmin_{y ∈ Y} R(y|x)
Handling missing features
Marginalize over missing variables
Assume the input x consists of an observed part x_o and a missing part x_m.
The posterior probability of y_i given the observation can be obtained from probabilities over entire inputs by marginalizing over the missing part:

P(y_i|x_o) = p(y_i, x_o) / p(x_o)
           = ∫ p(y_i, x_o, x_m) dx_m / p(x_o)
           = ∫ P(y_i|x_o, x_m) p(x_o, x_m) dx_m / ∫ p(x_o, x_m) dx_m
           = ∫ P(y_i|x) p(x) dx_m / ∫ p(x) dx_m
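A discrete sketch of the marginalization above: with x = (x_o, x_m) and x_m unobserved, the integral becomes a sum over the values of x_m. The joint table p[y][x_o][x_m] is made up for illustration:

```python
# Full joint distribution over (y, x_o, x_m), all variables binary (illustrative)
p = {
    0: {0: {0: 0.20, 1: 0.10}, 1: {0: 0.05, 1: 0.05}},
    1: {0: {0: 0.05, 1: 0.15}, 1: {0: 0.10, 1: 0.30}},
}

def posterior_given_observed(y, x_o):
    # P(y|x_o) = sum_{x_m} p(y, x_o, x_m) / sum_{y', x_m} p(y', x_o, x_m)
    joint = sum(p[y][x_o][x_m] for x_m in (0, 1))
    evidence = sum(p[yy][x_o][x_m] for yy in (0, 1) for x_m in (0, 1))
    return joint / evidence
```

Summing out the missing feature yields a proper posterior over the classes given only the observed part.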
Handling noisy features
Marginalize over true variables
Assume x consists of a clean part x_c and a noisy part x_n.
Assume we have a noise model p(x_n|x_t) giving the probability of the noisy feature given its true version x_t.
The posterior probability of y_i given the observation can be obtained from probabilities over clean inputs by marginalizing over the true variables via the noise model:

P(y_i|x_c, x_n) = p(y_i, x_c, x_n) / p(x_c, x_n)
               = ∫ p(y_i, x_c, x_n, x_t) dx_t / ∫ p(x_c, x_n, x_t) dx_t
               = ∫ P(y_i|x_c, x_n, x_t) p(x_c, x_n, x_t) dx_t / ∫ p(x_c, x_n, x_t) dx_t
               = ∫ P(y_i|x_c, x_t) p(x_n|x_c, x_t) p(x_c, x_t) dx_t / ∫ p(x_n|x_c, x_t) p(x_c, x_t) dx_t
               = ∫ P(y_i|x) p(x_n|x_t) p(x) dx_t / ∫ p(x_n|x_t) p(x) dx_t

where in the last line x = (x_c, x_t), using the fact that x_n depends only on x_t and carries no extra information about y_i once x_t is known.