CSC 411 Lecture 14: Probabilistic Models II
Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla
University of Toronto
UofT CSC 411: 14-Probabilistic Models II 1 / 42
Overview
Bayesian parameter estimation
MAP estimation
Gaussian discriminant analysis
Data Sparsity
Maximum likelihood has a pitfall: if you have too little data, it can overfit.
E.g., what if you flip the coin twice and get H both times?
θ̂ML = NH / (NH + NT) = 2 / (2 + 0) = 1
Because it never observed T, it assigns this outcome probability 0. This problem is known as data sparsity.
If you observe a single T in the test set, the log-likelihood is −∞.
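This failure mode is easy to reproduce numerically. Below is a minimal Python sketch (not from the slides; the function names are our own) of the ML estimate and the resulting test log-likelihood:

```python
import math

def theta_ml(n_h, n_t):
    # Maximum likelihood estimate of the head probability.
    return n_h / (n_h + n_t)

def log_likelihood(theta, n_h, n_t):
    # Log-likelihood of n_h heads and n_t tails under parameter theta;
    # -inf as soon as an observed outcome was assigned probability 0.
    if (n_h > 0 and theta == 0.0) or (n_t > 0 and theta == 1.0):
        return float("-inf")
    ll = 0.0
    if n_h > 0:
        ll += n_h * math.log(theta)
    if n_t > 0:
        ll += n_t * math.log(1 - theta)
    return ll

theta = theta_ml(2, 0)              # two flips, both heads
print(theta)                        # 1.0
print(log_likelihood(theta, 0, 1))  # a single tail in the test set: -inf
```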
Bayesian Parameter Estimation
In maximum likelihood, the observations are treated as random variables, but the parameters are not.
The Bayesian approach treats the parameters as random variables as well.
To define a Bayesian model, we need to specify two distributions:
The prior distribution p(θ), which encodes our beliefs about the parameters before we observe the data
The likelihood p(D | θ), same as in maximum likelihood
When we update our beliefs based on the observations, we compute the posterior distribution using Bayes' Rule:
p(θ | D) = p(θ) p(D | θ) / ∫ p(θ′) p(D | θ′) dθ′.
We rarely ever compute the denominator explicitly.
Bayesian Parameter Estimation
Let’s revisit the coin example. We already know the likelihood:
L(θ) = p(D | θ) = θ^NH (1 − θ)^NT
It remains to specify the prior p(θ).
We can choose an uninformative prior, which assumes as little as possible. A reasonable choice is the uniform prior.
But our experience tells us 0.5 is more likely than 0.99. One particularly useful prior that lets us specify this is the beta distribution:
p(θ; a, b) = [ Γ(a + b) / (Γ(a) Γ(b)) ] θ^(a−1) (1 − θ)^(b−1).
The proportionality notation lets us ignore the normalization constant:
p(θ; a, b) ∝ θ^(a−1) (1 − θ)^(b−1).
Bayesian Parameter Estimation
Beta distribution for various values of a, b:
Some observations:
The expectation is E[θ] = a/(a + b).
The distribution gets more peaked when a and b are large.
The uniform distribution is the special case a = b = 1.
The beta distribution is used mainly as a prior for the Bernoulli distribution.
Bayesian Parameter Estimation
Computing the posterior distribution:
p(θ | D) ∝ p(θ) p(D | θ)
         ∝ [ θ^(a−1) (1 − θ)^(b−1) ] [ θ^NH (1 − θ)^NT ]
         = θ^(a−1+NH) (1 − θ)^(b−1+NT).
This is just a beta distribution with parameters NH + a and NT + b.
The posterior expectation of θ is:
E[θ | D] = (NH + a) / (NH + NT + a + b)
The parameters a and b of the prior can be thought of as pseudo-counts.
The reason this works is that the prior and likelihood have the same functional form. This phenomenon is known as conjugacy, and it's very useful.
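The conjugate update is a one-liner in code. A minimal Python sketch (our own illustration; the prior a = b = 2 is an arbitrary choice):

```python
def beta_posterior(a, b, n_h, n_t):
    # Conjugacy: a Beta(a, b) prior combined with a Bernoulli likelihood
    # of n_h heads and n_t tails gives a Beta(a + n_h, b + n_t) posterior.
    return a + n_h, b + n_t

def beta_mean(a, b):
    # E[theta] for a Beta(a, b) distribution.
    return a / (a + b)

a_post, b_post = beta_posterior(2, 2, n_h=2, n_t=0)  # prior a = b = 2
print(a_post, b_post)             # 4 2
print(beta_mean(a_post, b_post))  # posterior expectation 4/6
```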
Bayesian Parameter Estimation
Bayesian inference for the coin flip example:
Small data setting: NH = 2, NT = 0
Large data setting: NH = 55, NT = 45
When you have enough observations, the data overwhelm the prior.
Bayesian Parameter Estimation
What do we actually do with the posterior?
The posterior predictive distribution is the distribution over future observables given the past observations. We compute this by marginalizing out the parameter(s):
p(D′ | D) = ∫ p(θ | D) p(D′ | θ) dθ.   (1)
For the coin flip example:
θpred = Pr(x′ = H | D)
      = ∫ p(θ | D) Pr(x′ = H | θ) dθ
      = ∫ Beta(θ; NH + a, NT + b) · θ dθ
      = E_Beta(θ; NH+a, NT+b)[θ]
      = (NH + a) / (NH + NT + a + b).   (2)
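As a sanity check on the derivation, the integral ∫ Beta(θ; NH + a, NT + b) · θ dθ can be approximated numerically and compared against the closed form. A Python sketch (our own, using a midpoint Riemann sum; a = b = 2 is an arbitrary prior):

```python
import math

def beta_pdf(theta, a, b):
    # Beta density, with the normalizer computed via math.lgamma.
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta)
                    + (b - 1) * math.log(1 - theta))

def predictive_numeric(a, b, n_steps=10000):
    # Midpoint Riemann sum of  integral of Beta(theta; a, b) * theta.
    h = 1.0 / n_steps
    return sum(beta_pdf((i + 0.5) * h, a, b) * ((i + 0.5) * h) * h
               for i in range(n_steps))

a, b, n_h, n_t = 2, 2, 2, 0
closed_form = (n_h + a) / (n_h + n_t + a + b)   # 4/6
numeric = predictive_numeric(a + n_h, b + n_t)  # integrate under posterior
print(abs(closed_form - numeric) < 1e-6)        # True
```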
Bayesian Parameter Estimation
Bayesian estimation of the mean temperature in Toronto
Assume observations are i.i.d. Gaussian with known standard deviation σ and unknown mean µ
Broad Gaussian prior over µ, centered at 0
We can compute the posterior and posterior predictive distributions analytically (full derivation in notes)
Why is the posterior predictive distribution more spread out than the posterior distribution?
Bayesian Parameter Estimation
Comparison of maximum likelihood and Bayesian parameter estimation
The Bayesian approach deals better with data sparsity
Maximum likelihood is an optimization problem, while Bayesian parameter estimation is an integration problem
This means maximum likelihood is much easier in practice, since we can just do gradient descent
Automatic differentiation packages make it really easy to compute gradients
There aren't any comparable black-box tools for Bayesian parameter estimation (although Stan can do quite a lot)
Maximum A-Posteriori Estimation
Maximum a-posteriori (MAP) estimation: find the most likely parameter settings under the posterior
This converts the Bayesian parameter estimation problem into a maximization problem
θ̂MAP = arg max_θ p(θ | D)
      = arg max_θ p(θ, D)
      = arg max_θ p(θ) p(D | θ)
      = arg max_θ [ log p(θ) + log p(D | θ) ]
Maximum A-Posteriori Estimation
Joint probability in the coin flip example:
log p(θ, D) = log p(θ) + log p(D | θ)
            = const + (a − 1) log θ + (b − 1) log(1 − θ) + NH log θ + NT log(1 − θ)
            = const + (NH + a − 1) log θ + (NT + b − 1) log(1 − θ)
Maximize by finding a critical point
0 = (d/dθ) log p(θ, D) = (NH + a − 1)/θ − (NT + b − 1)/(1 − θ)
Solving for θ,
θ̂MAP = (NH + a − 1) / (NH + NT + a + b − 2)
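The closed form can be checked against a brute-force grid search over the log joint. A Python sketch (our own; the prior a = b = 2 is an arbitrary choice):

```python
import math

def log_joint(theta, a, b, n_h, n_t):
    # log p(theta, D) up to an additive constant.
    return ((n_h + a - 1) * math.log(theta)
            + (n_t + b - 1) * math.log(1 - theta))

def theta_map(a, b, n_h, n_t):
    # Closed-form MAP estimate from setting the derivative to zero.
    return (n_h + a - 1) / (n_h + n_t + a + b - 2)

a, b, n_h, n_t = 2, 2, 2, 0
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_joint(t, a, b, n_h, n_t))
print(theta_map(a, b, n_h, n_t))  # 3/4 = 0.75
print(best)                       # grid argmax agrees: 0.75
```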
Maximum A-Posteriori Estimation
Comparison of estimates in the coin flip example (using prior a = b = 2):

| Formula | NH = 2, NT = 0 | NH = 55, NT = 45 |
|---|---|---|
| θ̂ML = NH / (NH + NT) | 2/2 = 1 | 55/100 = 0.55 |
| θpred = (NH + a) / (NH + NT + a + b) | 4/6 ≈ 0.67 | 57/104 ≈ 0.548 |
| θ̂MAP = (NH + a − 1) / (NH + NT + a + b − 2) | 3/4 = 0.75 | 56/102 ≈ 0.549 |
θ̂MAP assigns nonzero probabilities as long as a, b > 1.
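The three estimates can be reproduced with a few lines of Python (our own sketch; the numbers match when the prior is a = b = 2, which is what the table's values imply):

```python
def estimates(a, b, n_h, n_t):
    # The three point estimates compared above.
    ml = n_h / (n_h + n_t)                           # maximum likelihood
    pred = (n_h + a) / (n_h + n_t + a + b)           # posterior predictive
    map_ = (n_h + a - 1) / (n_h + n_t + a + b - 2)   # MAP
    return ml, pred, map_

print(estimates(2, 2, 2, 0))    # small data: (1.0, 0.666..., 0.75)
print(estimates(2, 2, 55, 45))  # large data: all three close to 0.55
```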
Maximum A-Posteriori Estimation
Comparison of predictions in the Toronto temperatures example
1 observation 7 observations
Gaussian Discriminant Analysis
Motivation
Generative models - model p(x|t = k)
Instead of trying to separate classes, try to model what each class "looks like".
Recall that p(x|t = k) may be very complex
p(x1, · · · , xd, y) = p(x1 | x2, · · · , xd, y) · · · p(xd−1 | xd, y) p(xd, y)
Naive Bayes used a conditional independence assumption. What else could we do? Choose a simple distribution.
Today we will discuss fitting Gaussian distributions to our data.
Bayes Classifier
Let’s take a step back...
Bayes Classifier
h(x) = arg max_k p(t = k | x) = arg max_k p(x | t = k) p(t = k) / p(x)
     = arg max_k p(x | t = k) p(t = k)
We talked about discrete x; what if x is continuous?
Classification: Diabetes Example
Observation per patient: White blood cell count & glucose value.
How can we model p(x |t = k)? Multivariate Gaussian
Gaussian Discriminant Analysis (Gaussian Bayes Classifier)
Gaussian Discriminant Analysis in its general form assumes that p(x|t) is distributed according to a multivariate normal (Gaussian) distribution
Multivariate Gaussian distribution:
p(x | t = k) = 1 / ( (2π)^(d/2) |Σk|^(1/2) ) · exp[ −(1/2) (x − µk)^T Σk^(−1) (x − µk) ]

where |Σk| denotes the determinant of the matrix, and d is the dimension of x
Each class k has associated mean vector µk and covariance matrix Σk
Σk has O(d²) parameters, which could be hard to estimate (more on that later).
Multivariate Data
Multiple measurements (sensors)
d inputs/features/attributes
N instances/observations/examples
X = [ x_1^(1)  x_2^(1)  ···  x_d^(1) ]
    [ x_1^(2)  x_2^(2)  ···  x_d^(2) ]
    [   ⋮        ⋮       ⋱     ⋮    ]
    [ x_1^(N)  x_2^(N)  ···  x_d^(N) ]
Multivariate Parameters
Mean: E[x] = [µ1, · · · , µd]^T

Covariance:

Σ = Cov(x) = E[(x − µ)(x − µ)^T] = [ σ1²  σ12  ···  σ1d ]
                                   [ σ12  σ2²  ···  σ2d ]
                                   [  ⋮    ⋮    ⋱    ⋮  ]
                                   [ σd1  σd2  ···  σd² ]
For Gaussians, the mean and covariance are all you need to represent the distribution! (not true in general)
Multivariate Gaussian Distribution
x ∼ N (µ,Σ), a Gaussian (or normal) distribution defined as
p(x) = 1 / ( (2π)^(d/2) |Σ|^(1/2) ) · exp[ −(1/2) (x − µ)^T Σ^(−1) (x − µ) ]
The Mahalanobis distance (x − µ)^T Σ^(−1) (x − µ) measures the distance from x to µ in terms of Σ
It normalizes for differences in variances and correlations
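Evaluating this density is easiest in log space, with the Mahalanobis term computed via a linear solve rather than an explicit inverse. A NumPy sketch (our own, assuming NumPy is available; the example parameters are arbitrary):

```python
import numpy as np

def gaussian_log_pdf(x, mu, Sigma):
    # Log density of N(mu, Sigma), following the formula above.
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)   # Mahalanobis distance
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, -1.0])
print(gaussian_log_pdf(x, mu, Sigma))
```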
Bivariate Normal
Σ = [1 0; 0 1]      Σ = 0.5 · [1 0; 0 1]      Σ = 2 · [1 0; 0 1]
Figure: Probability density function
Figure: Contour plot of the pdf
Bivariate Normal
var(x1) = var(x2) var(x1) > var(x2) var(x1) < var(x2)
Figure: Probability density function
Figure: Contour plot of the pdf
Bivariate Normal
Σ = [1 0; 0 1]      Σ = [1 0.5; 0.5 1]      Σ = [1 0.8; 0.8 1]
Figure: Probability density function
Figure: Contour plot of the pdf
Bivariate Normal
Cov(x1, x2) = 0 Cov(x1, x2) > 0 Cov(x1, x2) < 0
Figure: Probability density function
Figure: Contour plot of the pdf
Gaussian Discriminant Analysis (Gaussian Bayes Classifier)
GDA (GBC) decision boundary is based on class posterior:
log p(tk | x) = log p(x | tk) + log p(tk) − log p(x)
             = −(d/2) log(2π) − (1/2) log |Σk| − (1/2) (x − µk)^T Σk^(−1) (x − µk)
               + log p(tk) − log p(x)

Decision boundary:

(x − µk)^T Σk^(−1) (x − µk) = (x − µℓ)^T Σℓ^(−1) (x − µℓ) + const

x^T Σk^(−1) x − 2 µk^T Σk^(−1) x = x^T Σℓ^(−1) x − 2 µℓ^T Σℓ^(−1) x + const
Quadratic function in x
What if Σk = Σ`?
Decision Boundary
Figure: likelihoods; posterior for t1; discriminant P(t1 | x) = 0.5
Learning
Learn the parameters for each class using maximum likelihood
Assume the prior is Bernoulli (we have two classes)
p(t | φ) = φ^t (1 − φ)^(1−t)
You can compute the ML estimate in closed form
φ = (1/N) ∑_{n=1}^N 1[t^(n) = 1]

µk = ( ∑_{n=1}^N 1[t^(n) = k] · x^(n) ) / ( ∑_{n=1}^N 1[t^(n) = k] )

Σk = ( 1 / ∑_{n=1}^N 1[t^(n) = k] ) ∑_{n=1}^N 1[t^(n) = k] (x^(n) − µk)(x^(n) − µk)^T
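These closed-form estimates translate directly into code. A NumPy sketch (our own; the synthetic data and seed are arbitrary):

```python
import numpy as np

def fit_gda(X, t):
    # Closed-form ML estimates for a two-class Gaussian Bayes classifier.
    phi = np.mean(t == 1)                 # Bernoulli class prior
    params = {}
    for k in (0, 1):
        Xk = X[t == k]                    # rows belonging to class k
        mu = Xk.mean(axis=0)              # class mean
        diff = Xk - mu
        Sigma = diff.T @ diff / len(Xk)   # per-class ML covariance
        params[k] = (mu, Sigma)
    return phi, params

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
t = np.array([0] * 50 + [1] * 50)
phi, params = fit_gda(X, t)
print(phi)           # 0.5
print(params[1][0])  # mean of class 1, roughly [3, 3]
```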
Simplifying the Model
What if x is high-dimensional?
For the Gaussian Bayes classifier, if the input x is high-dimensional, then the covariance matrix has many parameters
Save some parameters by using a shared covariance for the classes
Any other idea you can think of?
MLE in this case:
Σ = (1/N) ∑_{n=1}^N (x^(n) − µ_{t^(n)})(x^(n) − µ_{t^(n)})^T
Linear decision boundary.
Decision Boundary: Shared Variances (between Classes)
variances may be different
Gaussian Discriminant Analysis vs Logistic Regression
Binary classification: If you examine p(t = 1 | x) under GDA and assume Σ0 = Σ1 = Σ, you will find that it looks like this:

p(t = 1 | x, φ, µ0, µ1, Σ) = 1 / (1 + exp(−w^T x))

where w is an appropriate function of (φ, µ0, µ1, Σ), and φ = p(t = 1)
Same model as logistic regression!
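One way to see (and test) the equivalence: compute the GDA posterior directly via Bayes' rule, then compute w and a bias term from (φ, µ0, µ1, Σ) and check that the sigmoid matches. A NumPy sketch (our own; the parameter values are arbitrary, and the bias is written separately rather than folded into w):

```python
import numpy as np

def gda_posterior(x, phi, mu0, mu1, Sigma):
    # Class posterior via Bayes' rule; shared-Sigma normalizers cancel.
    def score(mu, prior):
        diff = x - mu
        return -0.5 * diff @ np.linalg.solve(Sigma, diff) + np.log(prior)
    a1, a0 = score(mu1, phi), score(mu0, 1 - phi)
    return 1 / (1 + np.exp(a0 - a1))

def logistic_weights(phi, mu0, mu1, Sigma):
    # Equivalent logistic form: weight vector w and bias b.
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu0)
    b = (-0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0
         + np.log(phi / (1 - phi)))
    return w, b

phi, mu0, mu1 = 0.5, np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
x = np.array([1.0, 0.5])
w, b = logistic_weights(phi, mu0, mu1, Sigma)
sigmoid = 1 / (1 + np.exp(-(w @ x + b)))
print(abs(gda_posterior(x, phi, mu0, mu1, Sigma) - sigmoid) < 1e-12)
```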
When should we prefer GDA to LR, and vice versa?
Gaussian Discriminant Analysis vs Logistic Regression
GDA makes a stronger modeling assumption: it assumes the class-conditional data is multivariate Gaussian
If this is true, GDA is asymptotically efficient (best model in limit of large N)
But LR is more robust, less sensitive to incorrect modeling assumptions(what loss is it optimizing?)
Many class-conditional distributions lead to logistic classifier
When these distributions are non-Gaussian (i.e., almost always), LR usually beats GDA
GDA can easily handle missing features (how would you do that with LR?)
Naive Bayes
Naive Bayes: Assumes features independent given the class
p(x | t = k) = ∏_{i=1}^d p(xi | t = k)
Assuming the likelihoods are Gaussian, how many parameters are required for the Naive Bayes classifier?
Equivalent to assuming Σk is diagonal.
Gaussian Naive Bayes
Gaussian Naive Bayes classifier assumes that the likelihoods are Gaussian:
p(xi | t = k) = 1 / (√(2π) σik) · exp[ −(xi − µik)² / (2σik²) ]

(this is just a 1-dim Gaussian, one for each input dimension)
The model is the same as Gaussian Discriminant Analysis with a diagonal covariance matrix
Maximum likelihood estimate of parameters
µik = ( ∑_{n=1}^N 1[t^(n) = k] · xi^(n) ) / ( ∑_{n=1}^N 1[t^(n) = k] )

σik² = ( ∑_{n=1}^N 1[t^(n) = k] · (xi^(n) − µik)² ) / ( ∑_{n=1}^N 1[t^(n) = k] )
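A NumPy sketch of these per-class, per-dimension estimates (our own illustration; the synthetic data and seed are arbitrary):

```python
import numpy as np

def fit_gnb(X, t, n_classes=2):
    # Per-class, per-dimension means and ML variances
    # (i.e., a diagonal covariance for each class).
    mu = np.zeros((n_classes, X.shape[1]))
    var = np.zeros((n_classes, X.shape[1]))
    for k in range(n_classes):
        Xk = X[t == k]
        mu[k] = Xk.mean(axis=0)
        var[k] = ((Xk - mu[k]) ** 2).mean(axis=0)  # ML (biased) variance
    return mu, var

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 2, (100, 2))])
t = np.array([0] * 100 + [1] * 100)
mu, var = fit_gnb(X, t)
print(mu[1])   # roughly [4, 4]
print(var[1])  # roughly [4, 4]
```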
What decision boundaries do we get?
Decision Boundary: isotropic
In this case: σi,k = σ (just one parameter), class priors equal (e.g., p(tk) = 0.5 for the 2-class case)
Going back to class posterior for GDA:
log p(tk | x) = log p(x | tk) + log p(tk) − log p(x)
             = −(d/2) log(2π) − (1/2) log |Σk| − (1/2) (x − µk)^T Σk^(−1) (x − µk)
               + log p(tk) − log p(x)

where we take Σk = σ²I and ignore terms that don't depend on k (they don't matter when we take the max over classes):

log p(tk | x) = −(1/(2σ²)) (x − µk)^T (x − µk)
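So in the isotropic, equal-prior case, classification reduces to picking the nearest class mean. A small NumPy sketch (our own; the means are arbitrary):

```python
import numpy as np

def classify_isotropic(x, means):
    # With Sigma_k = sigma^2 I and equal priors, the log posterior is
    # monotone in -||x - mu_k||^2, so we pick the nearest mean.
    dists = [np.sum((x - mu) ** 2) for mu in means]
    return int(np.argmin(dists))

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
print(classify_isotropic(np.array([0.5, 0.2]), means))  # 0
print(classify_isotropic(np.array([2.8, 3.1]), means))  # 1
```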
Decision Boundary: isotropic
Same variance across all classes and input dimensions, all class priors equal
Classification only depends on distance to the mean. Why?
Example
Generative models - Recap
GDA - quadratic decision boundary.
With a shared covariance it "collapses" to logistic regression.
Generative models:
Flexible models, easy to add/remove a class.
Handle missing data naturally
More "natural" way to think about things, but usually doesn't work as well.
Tries to solve a hard problem in order to solve an easy problem.