Lecture 9: Classification, LDA
Reading: Chapter 4
STATS 202: Data mining and analysis
Jonathan Taylor, 10/12
Slide credits: Sergio Bacallado
1 / 21
-
Review: Main strategy in Chapter 4
Find an estimate P̂(Y | X). Then, given an input x0, we predict the response as in a Bayes classifier:

ŷ0 = argmax_y P̂(Y = y | X = x0).
2 / 21
-
Linear Discriminant Analysis (LDA)
Instead of estimating P(Y | X), we will estimate:

1. P̂(X | Y): given the response, what is the distribution of the inputs?
2. P̂(Y): how likely is each of the categories?

Then, we use Bayes' rule to obtain the estimate:

P̂(Y = k | X = x) = P̂(X = x | Y = k) P̂(Y = k) / ∑_j P̂(X = x | Y = j) P̂(Y = j)
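To make the formula concrete, here is a minimal numerical sketch (not from the slides); the density values f_hat and priors pi_hat below are made-up illustrative numbers for K = 3 classes at a single input x0.

```python
import numpy as np

# Hypothetical estimates for K = 3 classes at a single input x0:
# f_hat[k] plays the role of P(X = x0 | Y = k), pi_hat[k] of P(Y = k).
f_hat = np.array([0.05, 0.30, 0.10])
pi_hat = np.array([0.50, 0.20, 0.30])

# Bayes' rule: posterior is proportional to density times prior,
# normalized by the sum over all classes (the denominator above).
posterior = f_hat * pi_hat / np.sum(f_hat * pi_hat)

print(posterior)             # estimated P(Y = k | X = x0) for each k
print(np.argmax(posterior))  # Bayes-classifier prediction: class with the largest posterior
```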
3 / 21
-
Linear Discriminant Analysis (LDA)
Instead of estimating P (Y | X), we will estimate:
1. We model P̂(X = x | Y = k) = f̂k(x) as a Multivariate Normal distribution:

[Figure: two panels with axes X1 and X2.]

2. P̂(Y = k) = π̂k is estimated by the fraction of training samples of class k.
4 / 21
-
LDA has linear decision boundaries
Suppose that:
- We know P(Y = k) = πk exactly.
- P(X = x | Y = k) is Multivariate Normal with density:

  fk(x) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp(−½ (x − µk)ᵀ Σ⁻¹ (x − µk))

  µk: mean of the inputs for category k.
  Σ: covariance matrix (common to all categories).

Then, what is the Bayes classifier?
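As a quick sanity check of the density formula, the sketch below (my own addition, assuming SciPy is available) evaluates fk(x) both directly and via scipy.stats.multivariate_normal; the mean, covariance, and input values are arbitrary examples.

```python
import numpy as np
from scipy.stats import multivariate_normal

p = 2
mu_k = np.array([1.0, -1.0])                 # example class mean
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])   # example common covariance matrix
x = np.array([0.5, 0.0])

# Direct evaluation of the density formula on the slide.
diff = x - mu_k
quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu_k)^T Sigma^{-1} (x - mu_k)
f_direct = np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))

# Same density via SciPy.
f_scipy = multivariate_normal(mean=mu_k, cov=Sigma).pdf(x)

print(f_direct, f_scipy)  # the two values agree
```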
5 / 21
-
LDA has linear decision boundaries
By Bayes' rule, the probability of category k, given the input x, is:

P(Y = k | X = x) = fk(x) πk / P(X = x)

The denominator does not depend on the response k, so we can write it as a constant:

P(Y = k | X = x) = C × fk(x) πk

Now, expanding fk(x):

P(Y = k | X = x) = (C πk / ((2π)^(p/2) |Σ|^(1/2))) exp(−½ (x − µk)ᵀ Σ⁻¹ (x − µk))
6 / 21
-
LDA has linear decision boundaries
P(Y = k | X = x) = (C πk / ((2π)^(p/2) |Σ|^(1/2))) exp(−½ (x − µk)ᵀ Σ⁻¹ (x − µk))

Now, let us absorb everything that does not depend on k into a constant C′:

P(Y = k | X = x) = C′ πk exp(−½ (x − µk)ᵀ Σ⁻¹ (x − µk))

and take the logarithm of both sides:

log P(Y = k | X = x) = log C′ + log πk − ½ (x − µk)ᵀ Σ⁻¹ (x − µk).

The constant log C′ is the same for every category k, so we want to find the maximum of the remaining terms over k.
7 / 21
-
LDA has linear decision boundaries
Goal: maximize the following over k:

log πk − ½ (x − µk)ᵀ Σ⁻¹ (x − µk)
  = log πk − ½ [xᵀ Σ⁻¹ x + µkᵀ Σ⁻¹ µk] + xᵀ Σ⁻¹ µk
  = C″ + log πk − ½ µkᵀ Σ⁻¹ µk + xᵀ Σ⁻¹ µk

We define the objective:

δk(x) = log πk − ½ µkᵀ Σ⁻¹ µk + xᵀ Σ⁻¹ µk
At an input x, we predict the response with the highest δk(x).
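A small sketch (with arbitrary example parameters, not taken from the slides) of computing δk(x) for each class and predicting the argmax:

```python
import numpy as np

# Example parameters for K = 2 classes in p = 2 dimensions (illustrative values).
pis = np.array([0.6, 0.4])
mus = np.array([[0.0, 0.0],
                [2.0, 1.0]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def delta(x, pi_k, mu_k):
    """Linear discriminant: log pi_k - 1/2 mu_k^T Sigma^{-1} mu_k + x^T Sigma^{-1} mu_k."""
    return np.log(pi_k) - 0.5 * mu_k @ Sigma_inv @ mu_k + x @ Sigma_inv @ mu_k

x = np.array([1.5, 0.5])
scores = [delta(x, pis[k], mus[k]) for k in range(2)]
print(scores, np.argmax(scores))  # predict the class with the largest delta_k(x)
```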
8 / 21
-
LDA has linear decision boundaries
What is the decision boundary? It is the set of points at which two classes do just as well:

δk(x) = δℓ(x)

log πk − ½ µkᵀ Σ⁻¹ µk + xᵀ Σ⁻¹ µk = log πℓ − ½ µℓᵀ Σ⁻¹ µℓ + xᵀ Σ⁻¹ µℓ
This is a linear equation in x.
[Figure: two panels with axes X1 and X2, showing the linear decision boundaries.]
9 / 21
-
Estimating πk
π̂k = #{i : yi = k} / n
In English, the fraction of training samples of class k.
10 / 21
-
Estimating the parameters of fk(x)
Estimate the center of each class µk:

µ̂k = (1 / #{i : yi = k}) ∑_{i : yi = k} xi

Estimate the common covariance matrix Σ:

- One predictor (p = 1):

  σ̂² = (1 / (n − K)) ∑_{k=1}^{K} ∑_{i : yi = k} (xi − µ̂k)².

- Many predictors (p > 1): compute the vectors of deviations (x1 − µ̂y1), (x2 − µ̂y2), . . . , (xn − µ̂yn) and use an unbiased estimate of their covariance matrix, Σ (see the sketch below).
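A minimal numpy sketch of these estimators, class priors, class means, and the pooled covariance, on a made-up training set X, y (my own illustration, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up training data: n = 100 points in p = 2 dimensions, K = 2 classes.
y = rng.integers(0, 2, size=100)
X = rng.normal(size=(100, 2)) + y[:, None] * np.array([2.0, 1.0])

classes = np.unique(y)
n, p = X.shape
K = len(classes)

pi_hat = np.array([np.mean(y == k) for k in classes])          # fraction of each class
mu_hat = np.array([X[y == k].mean(axis=0) for k in classes])   # per-class means

# Pooled covariance: squared deviations from each class mean, divided by n - K.
deviations = X - mu_hat[y]                                     # x_i - mu_hat_{y_i}
Sigma_hat = deviations.T @ deviations / (n - K)

print(pi_hat, mu_hat, Sigma_hat, sep="\n")
```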
11 / 21
-
LDA prediction
For an input x, predict the class with the largest:
δ̂k(x) = log π̂k − ½ µ̂kᵀ Σ̂⁻¹ µ̂k + xᵀ Σ̂⁻¹ µ̂k

The decision boundaries are defined by:

log π̂k − ½ µ̂kᵀ Σ̂⁻¹ µ̂k + xᵀ Σ̂⁻¹ µ̂k = log π̂ℓ − ½ µ̂ℓᵀ Σ̂⁻¹ µ̂ℓ + xᵀ Σ̂⁻¹ µ̂ℓ
Solid lines in:

[Figure: two panels with axes X1 and X2; the solid lines are the estimated LDA decision boundaries.]
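In practice these steps are packaged in standard libraries. A brief sketch, assuming scikit-learn is available, of fitting LinearDiscriminantAnalysis on made-up data and predicting new inputs:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
# Made-up two-class training data sharing a common covariance structure.
y_train = rng.integers(0, 2, size=200)
X_train = rng.normal(size=(200, 2)) + y_train[:, None] * np.array([2.0, 1.0])

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

X_new = np.array([[0.5, 0.2], [2.5, 1.0]])
print(lda.predict(X_new))        # class with the largest discriminant
print(lda.predict_proba(X_new))  # estimated posterior probabilities P(Y = k | X = x)
```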
12 / 21
-
Quadratic discriminant analysis (QDA)
The assumption that the inputs of every class have the same covariance Σ can be quite restrictive:
[Figure: two panels with axes X1 and X2.]
13 / 21
-
Quadratic discriminant analysis (QDA)
In quadratic discriminant analysis we estimate a mean µ̂k and a covariance matrix Σ̂k for each class separately.

Given an input, it is easy to derive an objective function:

δk(x) = log πk − ½ µkᵀ Σk⁻¹ µk + xᵀ Σk⁻¹ µk − ½ xᵀ Σk⁻¹ x − ½ log |Σk|

This objective is now quadratic in x, and so are the decision boundaries.
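A parallel sketch for QDA (again on made-up data, assuming scikit-learn), where each class gets its own covariance matrix:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(2)
# Made-up data in which the two classes have visibly different covariances.
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=100)
X1 = rng.multivariate_normal([2, 1], [[2.0, 0.8], [0.8, 0.5]], size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

qda = QuadraticDiscriminantAnalysis()
qda.fit(X, y)

print(qda.predict([[1.0, 0.5]]))        # prediction from the quadratic discriminants
print(qda.predict_proba([[1.0, 0.5]]))  # estimated posteriors
```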
14 / 21
-
Quadratic discriminant analysis (QDA)
- Bayes boundary (– – –)
- LDA (· · · · · ·)
- QDA (——)

[Figure: two panels with axes X1 and X2 comparing the three decision boundaries.]
15 / 21
-
Evaluating a classification method
We have talked about the 0-1 loss:

(1/m) ∑_{i=1}^{m} 1(yi ≠ ŷi).

It is possible to make the wrong prediction for some classes more often than others. The 0-1 loss doesn't tell you anything about this.

A much more informative summary of the error is a confusion matrix.
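For instance, with scikit-learn one can tabulate a confusion matrix from true and predicted labels; the labels below are made up purely to show the layout:

```python
from sklearn.metrics import confusion_matrix

# Made-up "No"/"Yes" default status for ten customers.
y_true = ["No", "No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes"]
y_pred = ["No", "No", "No", "No", "No", "No", "Yes", "Yes", "No", "No"]

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["No", "Yes"]))
```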
16 / 21
-
Example. Predicting default
Used LDA to predict credit card default in a dataset of 10K people.
Predicted “yes” if P (default = yes|X) > 0.5.
- The error rate among people who do not default (false positive rate) is very low.
- However, the rate of false negatives is 76%.
- It is possible that false negatives are a bigger source of concern!
- One possible solution: change the threshold.
17 / 21
-
Example. Predicting default
Changing the threshold to 0.2 makes it easier to classify to “yes”.
Predicted “yes” if P (default = yes|X) > 0.2.
Note that the rate of false positives became higher! That is the price to pay for fewer false negatives.
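A sketch of the mechanics (my own, with hypothetical posterior probabilities): thresholding the estimated P(default = yes | X) at 0.2 instead of the default 0.5 produces more "yes" predictions.

```python
import numpy as np

# Hypothetical estimated posteriors P(default = yes | X) for eight customers.
p_yes = np.array([0.03, 0.10, 0.22, 0.35, 0.51, 0.08, 0.27, 0.60])

pred_05 = p_yes > 0.5   # default threshold: predict "yes" only when quite sure
pred_02 = p_yes > 0.2   # lower threshold: more "yes" predictions,
                        # fewer false negatives but more false positives

print(pred_05.astype(int))
print(pred_02.astype(int))
```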
18 / 21
-
Example. Predicting default
Let’s visualize the dependence of the error on the threshold:
[Figure: error rate plotted against the threshold (0.0 to 0.5).]

- – – – False negative rate (error for defaulting customers)
- · · · · · False positive rate (error for non-defaulting customers)
- —— 0-1 loss or total error rate.
19 / 21
-
Example. The ROC curve
[Figure: ROC curve, plotting the true positive rate against the false positive rate.]

- Displays the performance of the method for any choice of threshold.
- The area under the curve (AUC) measures the quality of the classifier:
  - 0.5 is the AUC for a random classifier.
  - The closer AUC is to 1, the better.
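A short sketch (assuming scikit-learn) of computing the ROC curve and AUC from true labels and estimated posterior probabilities; the values below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels (1 = default) and estimated P(default = yes | X).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
p_yes = np.array([0.05, 0.10, 0.30, 0.20, 0.60, 0.15, 0.80, 0.40, 0.25, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, p_yes)  # one point per candidate threshold
auc = roc_auc_score(y_true, p_yes)               # area under the ROC curve

print(np.column_stack([thresholds, fpr, tpr]))
print("AUC:", auc)  # 0.5 corresponds to random guessing; closer to 1 is better
```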
20 / 21
-
Next time
- Comparison of logistic regression, LDA, QDA, and KNN classification.
- Start Chapter 5: Resampling.
21 / 21