Bayesian decision theory — DISI, University of Trento
TRANSCRIPT
Source: disi.unitn.it/.../slides/05_bayesian_decision/talk.pdf · 2013-09-10
Bayesian decision theory: introduction
Overview
- Bayesian decision theory allows optimal decisions to be taken in a fully probabilistic setting
- It assumes all relevant probabilities are known
- It makes it possible to provide upper bounds on achievable errors and to evaluate classifiers accordingly
- Bayesian reasoning can be generalized to cases where the probabilistic structure is not entirely known
Bayesian decision theory: introduction

Binary classification
Assume examples (x, y) ∈ X × {−1, 1} are drawn from a known distribution p(x, y).
The task is to predict the class y of an example given its input x.
Bayes' rule allows us to write this in probabilistic terms as:

P(y|x) = p(x|y) P(y) / p(x)
Bayesian decision theory: introduction

Bayes' rule
Bayes' rule computes the posterior probability from the likelihood, the prior and the evidence:

posterior = (likelihood × prior) / evidence

- posterior P(y|x) is the probability that the class is y given that x was observed
- likelihood p(x|y) is the probability of observing x given that its class is y
- prior P(y) is the prior probability of the class, before any evidence
- evidence p(x) is the probability of the observation, which by the law of total probability can be computed as:

p(x) = ∑_{i=1}^{2} p(x|y_i) P(y_i)
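The computation above can be sketched numerically. A minimal Python sketch, with made-up likelihood and prior values for illustration:

```python
def posterior(likelihoods, priors):
    """Compute P(y|x) for each class from p(x|y) and P(y).

    The evidence p(x) is obtained via the law of total probability:
    p(x) = sum_i p(x|y_i) * P(y_i).
    """
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# p(x|y1) = 0.6, p(x|y2) = 0.2, equal priors (illustrative numbers)
post = posterior([0.6, 0.2], [0.5, 0.5])  # -> [0.75, 0.25]
```

Note the posteriors sum to one by construction, since the evidence is exactly the sum of the joint terms.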
Bayesian decision theory: introduction
Probability of error
The probability of error given x:

P(error|x) = P(y2|x) if we decide y1
             P(y1|x) if we decide y2

The average probability of error:

P(error) = ∫_{−∞}^{+∞} P(error|x) p(x) dx
Bayesian decision theory: introduction
Bayes decision rule

y_B = argmax_{y_i ∈ {−1,1}} P(y_i|x) = argmax_{y_i ∈ {−1,1}} p(x|y_i) P(y_i)

The probability of error given x is then:

P(error|x) = min[P(y1|x), P(y2|x)]

The Bayes decision rule minimizes the probability of error.
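The rule can be exercised with a small sketch (illustrative numbers; `bayes_decide` is a hypothetical helper, not from the slides):

```python
def bayes_decide(likelihoods, priors):
    """Return the Bayes decision (index of max posterior) and P(error|x)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)
    posteriors = [j / evidence for j in joint]
    decision = max(range(len(posteriors)), key=lambda i: posteriors[i])
    # In the binary case, P(error|x) is the posterior of the rejected class
    return decision, min(posteriors)

decision, p_error = bayes_decide([0.3, 0.1], [0.4, 0.6])
```

Here the posteriors are 2/3 and 1/3, so the rule decides class 0 and the conditional error is 1/3, the smaller posterior.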
Bayesian decision theory: Continuous features
Setting
- Inputs are vectors in a d-dimensional Euclidean space ℝ^d called the feature space
- The output is one of c possible categories or classes Y = {y1, ..., yc} (binary/multiclass classification)
- A decision rule is a function f : X → Y telling which class to predict for a given observation
- A loss function ℓ : Y × Y → ℝ measures the cost of predicting y_i when the true class is y_j

The conditional risk of predicting y given x is:

R(y|x) = ∑_{i=1}^{c} ℓ(y, y_i) P(y_i|x)
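The conditional risk is a posterior-weighted average of losses. A sketch with a hypothetical asymmetric 3-class loss matrix and illustrative posteriors:

```python
# Conditional risk R(y|x) = sum_i loss(y, y_i) * P(y_i|x)

loss = [
    [0.0, 1.0, 4.0],  # cost of predicting class 0 when truth is 0, 1, 2
    [1.0, 0.0, 1.0],
    [4.0, 1.0, 0.0],
]
posteriors = [0.5, 0.3, 0.2]  # P(y_i|x), illustrative

def conditional_risk(pred):
    return sum(c * p for c, p in zip(loss[pred], posteriors))

risks = [conditional_risk(y) for y in range(3)]
best = min(range(3), key=lambda y: risks[y])
```

With these numbers the risks are [1.1, 0.7, 2.3]: minimizing risk picks class 1 even though class 0 is the most probable, because the asymmetric costs penalize confusing the extreme classes.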
Bayesian decision theory: Continuous features
Bayes decision rule

y* = argmin_{y ∈ Y} R(y|x)

The overall risk of a decision rule f is given by:

R[f] = ∫ R(f(x)|x) p(x) dx = ∫ ∑_{i=1}^{c} ℓ(f(x), y_i) p(y_i, x) dx

The Bayes decision rule minimizes the overall risk.
The Bayes risk R* is the overall risk of a Bayes decision rule, and it is the best achievable performance.
Minimum-error-rate classification
Zero-one loss

ℓ(y, y_i) = 0 if y = y_i
            1 if y ≠ y_i

- It assigns a unit cost to a misclassified example and zero cost to a correctly classified one
- It assumes all errors have the same cost
- Its risk corresponds to the average probability of error:

R(y_i|x) = ∑_{j=1}^{c} ℓ(y_i, y_j) P(y_j|x) = ∑_{j≠i} P(y_j|x) = 1 − P(y_i|x)
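A quick numerical check that under zero-one loss the conditional risk reduces to 1 − P(y_i|x), using illustrative posteriors:

```python
posteriors = [0.6, 0.3, 0.1]  # P(y_j|x), illustrative

def zero_one_risk(i):
    # loss(y_i, y_j) = 0 if i == j else 1, so the risk sums the other posteriors
    return sum(p for j, p in enumerate(posteriors) if j != i)

risks = [zero_one_risk(i) for i in range(len(posteriors))]
```

Each risk equals one minus the corresponding posterior, so minimizing risk is the same as maximizing the posterior.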
Minimum-error-rate classification: binary case
[Figure: class-conditional densities p(x|ω1) and p(x|ω2) over x, with thresholds θa, θb and the induced decision regions R1, R2]

Bayes decision rule

y_B = argmax_{y_i ∈ {−1,1}} P(y_i|x) = argmax_{y_i ∈ {−1,1}} p(x|y_i) P(y_i)

Likelihood ratio: decide y1 if

p(x|y1) P(y1) > p(x|y2) P(y2)

i.e. if

p(x|y1) / p(x|y2) > P(y2) / P(y1) = θ_a
Representing classifiers
Discriminant functions
A classifier can be represented as a set of discriminant functions g_i(x), i ∈ 1, ..., c, giving:

y = argmax_{i ∈ 1,...,c} g_i(x)

A discriminant function is not unique ⇒ the most convenient one for computational or explanatory reasons can be used:

g_i(x) = P(y_i|x) = p(x|y_i) P(y_i) / p(x)
g_i(x) = p(x|y_i) P(y_i)
g_i(x) = ln p(x|y_i) + ln P(y_i)
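The three forms above differ only by a factor independent of i or by a monotone (log) transform, so they induce the same decision. A sketch with illustrative likelihoods and priors:

```python
import math

likelihoods = [0.05, 0.20, 0.10]  # p(x|y_i), illustrative
priors = [0.5, 0.2, 0.3]          # P(y_i), illustrative
evidence = sum(l * p for l, p in zip(likelihoods, priors))

g_posterior = [l * p / evidence for l, p in zip(likelihoods, priors)]
g_joint = [l * p for l, p in zip(likelihoods, priors)]
g_log = [math.log(l) + math.log(p) for l, p in zip(likelihoods, priors)]

def argmax(g):
    return max(range(len(g)), key=lambda i: g[i])
```

All three variants return the same argmax, here class 1.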
Representing classifiers
[Figure: densities p(x|ω1)P(ω1) and p(x|ω2)P(ω2) with the decision boundary separating regions R1 and R2]

Decision regions
The feature space is divided into decision regions R1, ..., Rc such that:

x ∈ R_i if g_i(x) > g_j(x) ∀ j ≠ i

Decision regions are separated by decision boundaries, i.e. regions where ties occur among the largest discriminant functions.
Normal density
Multivariate normal density

p(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) exp(−1/2 (x − µ)^t Σ^{−1} (x − µ))

- The covariance matrix Σ is always symmetric and positive semi-definite
- Σ is strictly positive definite if the data truly span d dimensions (otherwise |Σ| = 0 and the density is degenerate)
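The density formula above can be evaluated directly with numpy; a minimal sketch:

```python
import numpy as np

def mvn_density(x, mu, sigma):
    """p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-(x-mu)^T Sigma^{-1} (x-mu) / 2)."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm)

mu = np.zeros(2)
p_at_mean = mvn_density(mu, mu, np.eye(2))  # equals 1/(2*pi) for d=2, Sigma=I
```

At the mean the exponent vanishes, so for d = 2 and Σ = I the density is exactly 1/(2π).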
Normal density

[Figure: constant-density contours around µ in the (x1, x2) plane]

Hyperellipsoids
The loci of points of constant density are hyperellipsoids of constant Mahalanobis distance from x to µ.
The principal axes of such hyperellipsoids are the eigenvectors of Σ; their lengths are given by the corresponding eigenvalues.
Discriminant functions for normal density
Discriminant functions

g_i(x) = ln p(x|y_i) + ln P(y_i)
       = −1/2 (x − µ_i)^t Σ_i^{−1} (x − µ_i) − d/2 ln 2π − 1/2 ln|Σ_i| + ln P(y_i)

Discarding terms which are independent of i we obtain:

g_i(x) = −1/2 (x − µ_i)^t Σ_i^{−1} (x − µ_i) − 1/2 ln|Σ_i| + ln P(y_i)
Discriminant functions for normal density
case Σ_i = σ²I
- Features are statistically independent
- All features have the same variance σ²
- The covariance determinant |Σ_i| = σ^{2d} can be ignored, being independent of i
- The covariance inverse is Σ_i^{−1} = (1/σ²) I

The discriminant functions become:

g_i(x) = −||x − µ_i||² / (2σ²) + ln P(y_i)
Discriminant functions for normal density
[Figure: 1-D, 2-D and 3-D examples with equal priors P(ω1) = P(ω2) = .5, showing the densities p(x|ω_i) and the decision regions R1, R2]

case Σ_i = σ²I
Expansion of the quadratic form leads to:

g_i(x) = −1/(2σ²) [x^t x − 2 µ_i^t x + µ_i^t µ_i] + ln P(y_i)

Discarding terms which are independent of i we obtain linear discriminant functions:

g_i(x) = w_i^t x + w_{i0},  with  w_i = (1/σ²) µ_i  and  w_{i0} = −1/(2σ²) µ_i^t µ_i + ln P(y_i)
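A numerical check that the linear form w_i^t x + w_{i0} and the distance form −||x − µ_i||²/(2σ²) + ln P(y_i) differ only by −x^t x/(2σ²), which is independent of i and cancels when comparing classes. Means, priors and σ² are illustrative:

```python
import numpy as np

sigma2 = 2.0
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
priors = [0.6, 0.4]

def g_linear(x, mu, prior):
    w = mu / sigma2
    w0 = -(mu @ mu) / (2 * sigma2) + np.log(prior)
    return float(w @ x + w0)

def g_distance(x, mu, prior):
    return float(-np.sum((x - mu) ** 2) / (2 * sigma2) + np.log(prior))

x = np.array([1.0, 2.0])
gap_linear = g_linear(x, means[0], priors[0]) - g_linear(x, means[1], priors[1])
gap_distance = g_distance(x, means[0], priors[0]) - g_distance(x, means[1], priors[1])
```

Both forms yield the same between-class difference, hence the same decisions.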
case Σi = σ2I
Separating hyperplane

Setting g_i(x) = g_j(x) we note that the decision boundaries are pieces of hyperplanes w^t (x − x0) = 0, with:

w = µ_i − µ_j
x0 = 1/2 (µ_i + µ_j) − σ² / ||µ_i − µ_j||² ln(P(y_i)/P(y_j)) (µ_i − µ_j)

- The hyperplane is orthogonal to the vector w ⇒ orthogonal to the line linking the means
- The hyperplane passes through x0:
  - if the prior probabilities of the classes are equal, x0 is halfway between the means
  - otherwise, x0 shifts away from the more likely mean, enlarging the corresponding decision region
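A sketch computing x0 for the Σ_i = σ²I case and checking that the two discriminants tie there, i.e. that x0 lies on the decision boundary. Parameters are illustrative:

```python
import numpy as np

sigma2 = 1.0
mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 0.0])
p_i, p_j = 0.7, 0.3

def g(x, mu, prior):
    return float(-np.sum((x - mu) ** 2) / (2 * sigma2) + np.log(prior))

d = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / (d @ d) * np.log(p_i / p_j) * d
gap = g(x0, mu_i, p_i) - g(x0, mu_j, p_j)
```

Here P(y_i) > P(y_j), and x0 lands past the midpoint (1, 0) towards µ_j, i.e. away from the more likely mean.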
case Σi = σ2I
Separating hyperplane: derivation (1)

g_i(x) − g_j(x) = 0

(1/σ²) µ_i^t x − 1/(2σ²) µ_i^t µ_i + ln P(y_i) − (1/σ²) µ_j^t x + 1/(2σ²) µ_j^t µ_j − ln P(y_j) = 0

(µ_i − µ_j)^t x − 1/2 (µ_i^t µ_i − µ_j^t µ_j) + σ² ln(P(y_i)/P(y_j)) = 0

w^t (x − x0) = 0  with  w = µ_i − µ_j  and

(µ_i − µ_j)^t x0 = 1/2 (µ_i^t µ_i − µ_j^t µ_j) − σ² ln(P(y_i)/P(y_j))
case Σi = σ2I
Separating hyperplane: derivation (2)

(µ_i − µ_j)^t x0 = 1/2 (µ_i^t µ_i − µ_j^t µ_j) − σ² ln(P(y_i)/P(y_j))

Using (µ_i^t µ_i − µ_j^t µ_j) = (µ_i − µ_j)^t (µ_i + µ_j) and

σ² ln(P(y_i)/P(y_j)) = σ² (µ_i − µ_j)^t (µ_i − µ_j) / ||µ_i − µ_j||² ln(P(y_i)/P(y_j))

we obtain:

x0 = 1/2 (µ_i + µ_j) − σ² (µ_i − µ_j) / ||µ_i − µ_j||² ln(P(y_i)/P(y_j))
case Σi = σ2I
[Figure: effect of unequal priors in 1-D, 2-D and 3-D, for prior pairs P(ω1)/P(ω2) of .7/.3, .9/.1, .8/.2 and .99/.01 — the boundary between R1 and R2 shifts as the priors become more unbalanced]
Discriminant functions for normal density

case Σ_i = Σ
All classes have the same covariance matrix. The discriminant functions become:

g_i(x) = −1/2 (x − µ_i)^t Σ^{−1} (x − µ_i) + ln P(y_i)

Expanding the quadratic form and discarding terms independent of i we again obtain linear discriminant functions:

g_i(x) = w_i^t x + w_{i0},  with  w_i = Σ^{−1} µ_i  and  w_{i0} = −1/2 µ_i^t Σ^{−1} µ_i + ln P(y_i)

The separating hyperplanes are not necessarily orthogonal to the line linking the means: w^t (x − x0) = 0, with

w = Σ^{−1} (µ_i − µ_j)
x0 = 1/2 (µ_i + µ_j) − ln(P(y_i)/P(y_j)) / ((µ_i − µ_j)^t Σ^{−1} (µ_i − µ_j)) (µ_i − µ_j)
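A sketch for the shared-covariance case: w = Σ^{−1}(µ_i − µ_j) is in general not parallel to µ_i − µ_j, so the boundary need not be orthogonal to the line linking the means, while x0 still lies on the boundary. Parameters are illustrative:

```python
import numpy as np

sigma = np.array([[2.0, 0.9],
                  [0.9, 1.0]])
mu_i, mu_j = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
p_i, p_j = 0.3, 0.7

def g(x, mu, prior):
    diff = x - mu
    return float(-0.5 * diff @ np.linalg.solve(sigma, diff) + np.log(prior))

d = mu_i - mu_j
w = np.linalg.solve(sigma, d)
x0 = 0.5 * (mu_i + mu_j) - np.log(p_i / p_j) / (d @ w) * d
gap = g(x0, mu_i, p_i) - g(x0, mu_j, p_j)
tilt = w[0] * d[1] - w[1] * d[0]  # zero only if w is parallel to d
```

The discriminants tie at x0, and the nonzero 2-D cross product confirms w is tilted with respect to the mean difference because of the correlated covariance.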
case Σi = Σ
[Figure: shared-covariance examples in 2-D and 3-D for priors P(ω1)/P(ω2) of .5/.5 and .1/.9 — the linear boundary between R1 and R2 shifts with the priors and is generally tilted with respect to the line linking the means]
Discriminant functions for normal density
case Σ_i arbitrary
The discriminant functions are inherently quadratic:

g_i(x) = x^t W_i x + w_i^t x + w_{i0}

with:

W_i = −1/2 Σ_i^{−1}
w_i = Σ_i^{−1} µ_i
w_{i0} = −1/2 µ_i^t Σ_i^{−1} µ_i − 1/2 ln|Σ_i| + ln P(y_i)

In the two-category case, the decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, etc.
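A sketch of the quadratic discriminant for arbitrary per-class covariances, using the equivalent form g_i(x) = −1/2 (x − µ_i)^t Σ_i^{−1} (x − µ_i) − 1/2 ln|Σ_i| + ln P(y_i). Parameters are illustrative:

```python
import numpy as np

def g_quadratic(x, mu, sigma, prior):
    diff = x - mu
    maha = diff @ np.linalg.solve(sigma, diff)  # Mahalanobis distance squared
    return float(-0.5 * maha - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior))

# Two classes with different means and different covariances (illustrative)
params = [
    (np.array([0.0, 0.0]), np.eye(2), 0.5),
    (np.array([2.0, 2.0]), np.array([[3.0, 0.0], [0.0, 0.5]]), 0.5),
]

def classify(x):
    scores = [g_quadratic(x, mu, sig, p) for mu, sig, p in params]
    return int(np.argmax(scores))
```

Because the covariances differ, the implied boundary between the two classes is a hyperquadric rather than a hyperplane.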
case Σ_i arbitrary

[Figure: examples of hyperquadric decision boundaries for arbitrary covariance matrices]
Reducible error
[Figure: joint densities p(x|ω_i)P(ω_i) over x, with the Bayes boundary x_B and a suboptimal boundary x*; the extra shaded area is the reducible error]

Probability of error in binary classification

P(error) = P(x ∈ R1, y2) + P(x ∈ R2, y1)
         = ∫_{R1} p(x|y2) P(y2) dx + ∫_{R2} p(x|y1) P(y1) dx

The reducible error is the extra error resulting from a suboptimal choice of the decision boundary.
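A numerical sketch of the reducible error for two 1-D Gaussian classes: the error at a suboptimal threshold exceeds the error at the Bayes threshold, and the difference is the reducible error. All numbers are illustrative:

```python
import math

def normal_pdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

mu1, mu2, s, p1, p2 = 0.0, 2.0, 1.0, 0.5, 0.5

def error_at_threshold(t, n=20000, lo=-10.0, hi=10.0):
    # P(error) = int_{R1} p(x|y2)P(y2) dx + int_{R2} p(x|y1)P(y1) dx,
    # with R1 = (-inf, t) and R2 = (t, inf); simple midpoint Riemann sum.
    h = (hi - lo) / n
    err = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        err += h * (normal_pdf(x, mu2, s) * p2 if x < t else normal_pdf(x, mu1, s) * p1)
    return err

bayes_t = 0.5 * (mu1 + mu2)  # midpoint: equal priors, equal variances
reducible = error_at_threshold(0.3) - error_at_threshold(bayes_t)
```

With equal priors and variances the Bayes boundary is the midpoint of the means, and any other threshold pays a strictly positive reducible error.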
Bayesian decision theory: arbitrary inputs and outputs
Setting
Examples are input-output pairs (x, y) ∈ X × Y generated with probability p(x, y).
The conditional risk of predicting y* given x is:

R(y*|x) = ∫_Y ℓ(y*, y) P(y|x) dy

The overall risk of a decision rule f is given by:

R[f] = ∫ R(f(x)|x) p(x) dx = ∫_X ∫_Y ℓ(f(x), y) p(x, y) dy dx

Bayes decision rule

y_B = argmin_{y ∈ Y} R(y|x)
Handling missing features
Marginalize over missing variables
Assume the input x consists of an observed part x_o and a missing part x_m.
The posterior probability of y_i given the observation can be obtained from probabilities over entire inputs by marginalizing over the missing part:

P(y_i|x_o) = p(y_i, x_o) / p(x_o)
           = ∫ p(y_i, x_o, x_m) dx_m / p(x_o)
           = ∫ P(y_i|x_o, x_m) p(x_o, x_m) dx_m / ∫ p(x_o, x_m) dx_m
           = ∫ P(y_i|x) p(x) dx_m / ∫ p(x) dx_m
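A discrete sketch of the marginalization above: with x = (x_o, x_m) and x_m unobserved, the integral becomes a sum over the values of x_m. The joint table p[y][x_o][x_m] is made up for illustration:

```python
# Full joint distribution over (y, x_o, x_m), all variables binary (illustrative)
p = {
    0: {0: {0: 0.20, 1: 0.10}, 1: {0: 0.05, 1: 0.05}},
    1: {0: {0: 0.05, 1: 0.15}, 1: {0: 0.10, 1: 0.30}},
}

def posterior_given_observed(y, x_o):
    # P(y|x_o) = sum_{x_m} p(y, x_o, x_m) / sum_{y', x_m} p(y', x_o, x_m)
    joint = sum(p[y][x_o][x_m] for x_m in (0, 1))
    evidence = sum(p[yy][x_o][x_m] for yy in (0, 1) for x_m in (0, 1))
    return joint / evidence
```

Summing out the missing feature yields a proper posterior over the classes given only the observed part.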
Handling noisy features
Marginalize over true variables
Assume x consists of a clean part x_c and a noisy part x_n.
Assume we have a noise model p(x_n|x_t) giving the probability of the noisy feature given its true version x_t.
The posterior probability of y_i given the observation can be obtained from probabilities over clean inputs by marginalizing over the true variables via the noise model:

P(y_i|x_c, x_n) = p(y_i, x_c, x_n) / p(x_c, x_n)
               = ∫ p(y_i, x_c, x_n, x_t) dx_t / ∫ p(x_c, x_n, x_t) dx_t
               = ∫ P(y_i|x_c, x_n, x_t) p(x_c, x_n, x_t) dx_t / ∫ p(x_c, x_n, x_t) dx_t
               = ∫ P(y_i|x_c, x_t) p(x_n|x_c, x_t) p(x_c, x_t) dx_t / ∫ p(x_n|x_c, x_t) p(x_c, x_t) dx_t
               = ∫ P(y_i|x) p(x_n|x_t) p(x) dx_t / ∫ p(x_n|x_t) p(x) dx_t

where in the last line x = (x_c, x_t), using the fact that x_n depends only on x_t and carries no extra information about y_i once x_t is known.