Lecture 9: Classification, LDA
Reading: Chapter 4
STATS 202: Data mining and analysis
Jonathan Taylor, 10/12
Slide credits: Sergio Bacallado
1 / 21
-
Review: Main strategy in Chapter 4
Find an estimate P̂(Y | X). Then, given an input x0, we predict the response as in a Bayes classifier:

ŷ0 = argmax_y P̂(Y = y | X = x0).
2 / 21
-
Linear Discriminant Analysis (LDA)
Instead of estimating P(Y | X), we will estimate:

1. P̂(X | Y): given the response, what is the distribution of the inputs?
2. P̂(Y): how likely is each of the categories?

Then, we use Bayes' rule to obtain the estimate:

P̂(Y = k | X = x) = P̂(X = x | Y = k) P̂(Y = k) / ∑_j P̂(X = x | Y = j) P̂(Y = j)
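To make the formula concrete, here is a minimal numerical sketch (not from the slides); the density values f_hat and priors pi_hat below are made-up illustrative numbers for K = 3 classes at a single input x0.

```python
import numpy as np

# Hypothetical estimates for K = 3 classes at a single input x0:
# f_hat[k] plays the role of P(X = x0 | Y = k), pi_hat[k] of P(Y = k).
f_hat = np.array([0.05, 0.30, 0.10])
pi_hat = np.array([0.50, 0.20, 0.30])

# Bayes' rule: posterior is proportional to density times prior,
# normalized by the sum over all classes (the denominator above).
posterior = f_hat * pi_hat / np.sum(f_hat * pi_hat)

print(posterior)             # estimated P(Y = k | X = x0) for each k
print(np.argmax(posterior))  # Bayes-classifier prediction: class with the largest posterior
```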
3 / 21
-
Linear Discriminant Analysis (LDA)
Instead of estimating P (Y | X), we will estimate:
1. We model P̂(X = x | Y = k) = f̂k(x) as a Multivariate Normal distribution:

[Figure: two panels with axes X1 and X2.]

2. P̂(Y = k) = π̂k is estimated by the fraction of training samples of class k.
4 / 21
-
LDA has linear decision boundaries
Suppose that:
- We know P(Y = k) = πk exactly.
- P(X = x | Y = k) is Multivariate Normal with density:

  fk(x) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp(−½ (x − µk)ᵀ Σ⁻¹ (x − µk))

  µk: mean of the inputs for category k.
  Σ: covariance matrix (common to all categories).

Then, what is the Bayes classifier?
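As a quick sanity check of the density formula, the sketch below (my own addition, assuming SciPy is available) evaluates fk(x) both directly and via scipy.stats.multivariate_normal; the mean, covariance, and input values are arbitrary examples.

```python
import numpy as np
from scipy.stats import multivariate_normal

p = 2
mu_k = np.array([1.0, -1.0])                 # example class mean
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])   # example common covariance matrix
x = np.array([0.5, 0.0])

# Direct evaluation of the density formula on the slide.
diff = x - mu_k
quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu_k)^T Sigma^{-1} (x - mu_k)
f_direct = np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))

# Same density via SciPy.
f_scipy = multivariate_normal(mean=mu_k, cov=Sigma).pdf(x)

print(f_direct, f_scipy)  # the two values agree
```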
5 / 21
-
LDA has linear decision boundaries
By Bayes' rule, the probability of category k, given the input x, is:

P(Y = k | X = x) = fk(x) πk / P(X = x)

The denominator does not depend on the response k, so we can write it as a constant:

P(Y = k | X = x) = C × fk(x) πk

Now, expanding fk(x):

P(Y = k | X = x) = (C πk / ((2π)^(p/2) |Σ|^(1/2))) exp(−½ (x − µk)ᵀ Σ⁻¹ (x − µk))
6 / 21
-
LDA has linear decision boundaries
P(Y = k | X = x) = (C πk / ((2π)^(p/2) |Σ|^(1/2))) exp(−½ (x − µk)ᵀ Σ⁻¹ (x − µk))

Now, let us absorb everything that does not depend on k into a constant C′:

P(Y = k | X = x) = C′ πk exp(−½ (x − µk)ᵀ Σ⁻¹ (x − µk))

and take the logarithm of both sides:

log P(Y = k | X = x) = log C′ + log πk − ½ (x − µk)ᵀ Σ⁻¹ (x − µk).

The constant log C′ is the same for every category k, so we want to find the maximum of the remaining terms over k.
7 / 21
-
LDA has linear decision boundaries
Goal: maximize the following over k:

log πk − ½ (x − µk)ᵀ Σ⁻¹ (x − µk)
  = log πk − ½ [xᵀ Σ⁻¹ x + µkᵀ Σ⁻¹ µk] + xᵀ Σ⁻¹ µk
  = C″ + log πk − ½ µkᵀ Σ⁻¹ µk + xᵀ Σ⁻¹ µk

We define the objective:

δk(x) = log πk − ½ µkᵀ Σ⁻¹ µk + xᵀ Σ⁻¹ µk
At an input x, we predict the response with the highest δk(x).
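A small sketch (with arbitrary example parameters, not taken from the slides) of computing δk(x) for each class and predicting the argmax:

```python
import numpy as np

# Example parameters for K = 2 classes in p = 2 dimensions (illustrative values).
pis = np.array([0.6, 0.4])
mus = np.array([[0.0, 0.0],
                [2.0, 1.0]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def delta(x, pi_k, mu_k):
    """Linear discriminant: log pi_k - 1/2 mu_k^T Sigma^{-1} mu_k + x^T Sigma^{-1} mu_k."""
    return np.log(pi_k) - 0.5 * mu_k @ Sigma_inv @ mu_k + x @ Sigma_inv @ mu_k

x = np.array([1.5, 0.5])
scores = [delta(x, pis[k], mus[k]) for k in range(2)]
print(scores, np.argmax(scores))  # predict the class with the largest delta_k(x)
```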
8 / 21
-
LDA has linear decision boundaries
What is the decision boundary? It is the set of points at which two classes do just as well:

δk(x) = δℓ(x)

log πk − ½ µkᵀ Σ⁻¹ µk + xᵀ Σ⁻¹ µk = log πℓ − ½ µℓᵀ Σ⁻¹ µℓ + xᵀ Σ⁻¹ µℓ
This is a linear equation in x.
[Figure: two panels with axes X1 and X2, showing the linear decision boundaries.]
9 / 21
-
Estimating πk
π̂k = #{i : yi = k} / n
In English, the fraction of training samples of class k.
10 / 21
-
Estimating the parameters of fk(x)
Estimate the center of each class µk:

µ̂k = (1 / #{i : yi = k}) ∑_{i : yi = k} xi

Estimate the common covariance matrix Σ:

- One predictor (p = 1):

  σ̂² = (1 / (n − K)) ∑_{k=1}^{K} ∑_{i : yi = k} (xi − µ̂k)².

- Many predictors (p > 1): compute the vectors of deviations (x1 − µ̂y1), (x2 − µ̂y2), . . . , (xn − µ̂yn) and use an unbiased estimate of their covariance matrix, Σ (see the sketch below).
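A minimal numpy sketch of these estimators, class priors, class means, and the pooled covariance, on a made-up training set X, y (my own illustration, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up training data: n = 100 points in p = 2 dimensions, K = 2 classes.
y = rng.integers(0, 2, size=100)
X = rng.normal(size=(100, 2)) + y[:, None] * np.array([2.0, 1.0])

classes = np.unique(y)
n, p = X.shape
K = len(classes)

pi_hat = np.array([np.mean(y == k) for k in classes])          # fraction of each class
mu_hat = np.array([X[y == k].mean(axis=0) for k in classes])   # per-class means

# Pooled covariance: squared deviations from each class mean, divided by n - K.
deviations = X - mu_hat[y]                                     # x_i - mu_hat_{y_i}
Sigma_hat = deviations.T @ deviations / (n - K)

print(pi_hat, mu_hat, Sigma_hat, sep="\n")
```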
11 / 21
-
LDA prediction
For an input x, predict the class with the largest:
δ̂k(x) = log π̂k − ½ µ̂kᵀ Σ̂⁻¹ µ̂k + xᵀ Σ̂⁻¹ µ̂k

The decision boundaries are defined by:

log π̂k − ½ µ̂kᵀ Σ̂⁻¹ µ̂k + xᵀ Σ̂⁻¹ µ̂k = log π̂ℓ − ½ µ̂ℓᵀ Σ̂⁻¹ µ̂ℓ + xᵀ Σ̂⁻¹ µ̂ℓ
Solid lines in:

[Figure: two panels with axes X1 and X2; the solid lines are the estimated LDA decision boundaries.]
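In practice these steps are packaged in standard libraries. A brief sketch, assuming scikit-learn is available, of fitting LinearDiscriminantAnalysis on made-up data and predicting new inputs:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
# Made-up two-class training data sharing a common covariance structure.
y_train = rng.integers(0, 2, size=200)
X_train = rng.normal(size=(200, 2)) + y_train[:, None] * np.array([2.0, 1.0])

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

X_new = np.array([[0.5, 0.2], [2.5, 1.0]])
print(lda.predict(X_new))        # class with the largest discriminant
print(lda.predict_proba(X_new))  # estimated posterior probabilities P(Y = k | X = x)
```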
12 / 21
-
Quadratic discriminant analysis (QDA)
The assumption that the inputs of every class have the same covariance Σ can be quite restrictive:
[Figure: two panels with axes X1 and X2.]
13 / 21
-
Quadratic discriminant analysis (QDA)
In quadratic discriminant analysis we estimate a mean µ̂k and a covariance matrix Σ̂k for each class separately.

Given an input, it is easy to derive an objective function:

δk(x) = log πk − ½ µkᵀ Σk⁻¹ µk + xᵀ Σk⁻¹ µk − ½ xᵀ Σk⁻¹ x − ½ log |Σk|

This objective is now quadratic in x, and so are the decision boundaries.
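A parallel sketch for QDA (again on made-up data, assuming scikit-learn), where each class gets its own covariance matrix:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(2)
# Made-up data in which the two classes have visibly different covariances.
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=100)
X1 = rng.multivariate_normal([2, 1], [[2.0, 0.8], [0.8, 0.5]], size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

qda = QuadraticDiscriminantAnalysis()
qda.fit(X, y)

print(qda.predict([[1.0, 0.5]]))        # prediction from the quadratic discriminants
print(qda.predict_proba([[1.0, 0.5]]))  # estimated posteriors
```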
14 / 21
-
Quadratic discriminant analysis (QDA)
- Bayes boundary (– – –)
- LDA (· · · · · ·)
- QDA (——)

[Figure: two panels with axes X1 and X2 comparing the three decision boundaries.]
15 / 21
-
Evaluating a classification method
We have talked about the 0-1 loss:

(1/m) ∑_{i=1}^{m} 1(yi ≠ ŷi).

It is possible to make the wrong prediction for some classes more often than others. The 0-1 loss doesn't tell you anything about this.

A much more informative summary of the error is a confusion matrix.
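For instance, with scikit-learn one can tabulate a confusion matrix from true and predicted labels; the labels below are made up purely to show the layout:

```python
from sklearn.metrics import confusion_matrix

# Made-up "No"/"Yes" default status for ten customers.
y_true = ["No", "No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes"]
y_pred = ["No", "No", "No", "No", "No", "No", "Yes", "Yes", "No", "No"]

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["No", "Yes"]))
```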
16 / 21
-
Example. Predicting default
Used LDA to predict credit card default in a dataset of 10K people.
Predicted “yes” if P (default = yes|X) > 0.5.
- The error rate among people who do not default (false positive rate) is very low.
- However, the rate of false negatives is 76%.
- It is possible that false negatives are a bigger source of concern!
- One possible solution: change the threshold.
17 / 21
-
Example. Predicting default
Changing the threshold to 0.2 makes it easier to classify to “yes”.
Predicted “yes” if P (default = yes|X) > 0.2.
Note that the rate of false positives became higher! That is the price to pay for fewer false negatives.
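A sketch of the mechanics (my own, with hypothetical posterior probabilities): thresholding the estimated P(default = yes | X) at 0.2 instead of the default 0.5 produces more "yes" predictions.

```python
import numpy as np

# Hypothetical estimated posteriors P(default = yes | X) for eight customers.
p_yes = np.array([0.03, 0.10, 0.22, 0.35, 0.51, 0.08, 0.27, 0.60])

pred_05 = p_yes > 0.5   # default threshold: predict "yes" only when quite sure
pred_02 = p_yes > 0.2   # lower threshold: more "yes" predictions,
                        # fewer false negatives but more false positives

print(pred_05.astype(int))
print(pred_02.astype(int))
```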
18 / 21
-
Example. Predicting default
Let’s visualize the dependence of the error on the threshold:
[Figure: error rate plotted against the threshold (0.0 to 0.5).]

- – – – False negative rate (error for defaulting customers)
- · · · · · False positive rate (error for non-defaulting customers)
- —— 0-1 loss or total error rate.
19 / 21
-
Example. The ROC curve
[Figure: ROC curve, plotting the true positive rate against the false positive rate.]

- Displays the performance of the method for any choice of threshold.
- The area under the curve (AUC) measures the quality of the classifier:
  - 0.5 is the AUC for a random classifier.
  - The closer AUC is to 1, the better.
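A short sketch (assuming scikit-learn) of computing the ROC curve and AUC from true labels and estimated posterior probabilities; the values below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels (1 = default) and estimated P(default = yes | X).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
p_yes = np.array([0.05, 0.10, 0.30, 0.20, 0.60, 0.15, 0.80, 0.40, 0.25, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, p_yes)  # one point per candidate threshold
auc = roc_auc_score(y_true, p_yes)               # area under the ROC curve

print(np.column_stack([thresholds, fpr, tpr]))
print("AUC:", auc)  # 0.5 corresponds to random guessing; closer to 1 is better
```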
20 / 21
-
Next time
- Comparison of logistic regression, LDA, QDA, and KNN classification.
- Start Chapter 5: Resampling.
21 / 21