
Page 1

Chapter 4. Linear Models for Classification

Wei Pan

Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455

Email: [email protected]

PubH 7475/8475, © Wei Pan

Page 2

Linear Model and Least Squares

- Data: (Yi, Xi), Xi = (Xi1, ..., Xip)′, i = 1, ..., n; Yi: categorical with K classes; often K = 2.
- LM: define yk = I(class k),
  fk(x) = E(yk | x) = Pr(G = k | x) = β0k + β′k x.
- Decision boundary: fk(x) = fl(x), which is linear.
- Use the LSE (see the R sketch below).
- Is it OK to use an LM? Yes, but ... masking.
- Can be formulated as a multivariate/multiple-response LM.
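
To make the indicator-response LM concrete, here is a minimal R sketch (my own illustration on simulated data, not part of the course slides): it codes yk = I(class k) for a three-class problem, fits a multiple-response LM by least squares, and classifies each observation by the largest fitted fk(x).

set.seed(1)
n <- 150
x <- matrix(rnorm(n * 2), n, 2)                  # two predictors
g <- sample(1:3, n, replace = TRUE)              # class labels 1, 2, 3
Y <- model.matrix(~ factor(g) - 1)               # n x K indicator-response matrix, y_k = I(class k)
fit <- lm(Y ~ x)                                 # multivariate/multiple-response LM, fitted by LSE
ghat <- max.col(predict(fit))                    # assign to the class with the largest fitted f_k(x)
mean(ghat != g)                                  # training error rate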

Page 3

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap 4

[Figure 4.2 scatterplots; left panel: Linear Regression, right panel: Linear Discriminant Analysis; axes X1 (horizontal) and X2 (vertical).]

FIGURE 4.2. The data come from three classes in IR² and are easily separated by linear decision boundaries. The right plot shows the boundaries found by linear discriminant analysis. The left plot shows the boundaries found by linear regression of the indicator response variables. The middle class is completely masked (never dominates).

Page 4

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap 4

[Figure 4.3 panels; left: Degree = 1; Error = 0.33; right: Degree = 2; Error = 0.04; both axes span 0.0 to 1.0.]

FIGURE 4.3. The effects of masking on linear regression in IR for a three-class problem. The rug plot at the base indicates the positions and class membership of each observation. The three curves in each panel are the fitted regressions to the three-class indicator variables; for example, for the blue class, y_blue is 1 for the blue observations, and 0 for the green and orange. The fits are linear and quadratic polynomials. Above each plot is the training error rate. The Bayes error rate is 0.025 for this problem, as is the LDA error rate.

Page 5

Discriminant Analysis

- Optimal Bayes rule: G(x) = arg maxk Pr(k|x), where
  Pr(k|x) = πk fk(x) / ∑_{l=1}^K πl fl(x),
  πk: prior prob Pr(k); fk: density of x in class k.
- Assume fk = N(µk, Σ) ⇒ LDA.
- Assume fk = N(µk, Σk) ⇒ QDA.
- Assume fk = ∑_j pj N(µj, Σj) ⇒ MDA (§12.7).
- Estimate fk nonparametrically, e.g. by kernel density estimation ⇒ KDA. General, but does not work well for large p (the "curse of dimensionality").
- Naive Bayes: assume independence among the predictors, fk(x) = ∏_{j=1}^p fkj(xj). Often works quite well! (See the sketch below.)
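
As a hedged illustration of the naive Bayes bullet (assuming the e1071 package and the built-in iris data; this is not the course's code), a Gaussian naive Bayes fit looks like:

library(e1071)
fit_nb <- naiveBayes(Species ~ ., data = iris)   # estimates f_kj(x_j) for each predictor j and class k
ghat <- predict(fit_nb, iris)                    # classifies by arg max_k pi_k * prod_j f_kj(x_j)
table(ghat, iris$Species)                        # training confusion matrix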

Page 6

LDA

- Assume: x | k ∼ N(µk, Σ).
- log[πk fk(x)] ∝ log πk + x′Σ⁻¹µk − (1/2)µ′kΣ⁻¹µk = δk(x), so
  G(x) = arg maxk Pr(k|x) = arg maxk δk(x), which is linear in x.
- In practice, estimate (see the R sketch below)
  πk = nk/n,
  µk = ∑_{Gi=k} xi / nk,
  Σ = ∑_{k=1}^K ∑_{Gi=k} (xi − µk)(xi − µk)′ / (n − K).
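
The R sketch below (an illustration on the built-in iris data, not the course's ex4.1.r) computes the plug-in estimates πk, µk, and the pooled Σ as on this slide and classifies by arg maxk δk(x):

x <- as.matrix(iris[, 1:4]); g <- iris$Species
n <- nrow(x); K <- nlevels(g)
pik <- table(g) / n                                                     # pi_k = n_k / n
muk <- t(sapply(levels(g), function(k) colMeans(x[g == k, ])))          # K x p matrix of class means mu_k
Sig <- Reduce(`+`, lapply(levels(g), function(k) {
  xc <- scale(x[g == k, ], center = muk[k, ], scale = FALSE)
  t(xc) %*% xc
})) / (n - K)                                                           # pooled covariance Sigma
Sinv <- solve(Sig)
delta <- x %*% Sinv %*% t(muk) -                                        # x' Sigma^{-1} mu_k
  matrix(0.5 * diag(muk %*% Sinv %*% t(muk)), n, K, byrow = TRUE) +     # - (1/2) mu_k' Sigma^{-1} mu_k
  matrix(log(pik), n, K, byrow = TRUE)                                  # + log pi_k
ghat <- levels(g)[max.col(delta)]                                       # arg max_k delta_k(x)
mean(ghat != g)                                                         # training error; should match MASS::lda() with default priors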

Page 7

LDA

- LDA:
  log[Pr(k|x)/Pr(l|x)] = δk(x) − δl(x)
  = log(πk/πl) − (1/2)(µk + µl)′Σ⁻¹(µk − µl) + x′Σ⁻¹(µk − µl)
  = β0,kl + x′βkl,
  which is a linear logistic regression model!
- K = 2, code y = ±1: is the LM OK? With E(y) = b0 + x′b, the LSE gives b ∝ Σ⁻¹(µ2 − µ1), but the intercept differs. This explains LM ≈ LDA ≈ logistic reg for K = 2 (see the sketch below).
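
A small numerical check of the claim in the last bullet (my own sketch with simulated data, not from the slides): for K = 2 with y coded ±1, the LSE slope vector is proportional to Σ⁻¹(µ2 − µ1), the LDA direction.

set.seed(2)
n <- 200; p <- 3
x <- matrix(rnorm(n * p), n, p)
g <- rep(1:2, each = n / 2)
x[g == 2, 1] <- x[g == 2, 1] + 1.5                               # separate the classes along predictor 1
y <- ifelse(g == 2, 1, -1)                                       # code y = +/- 1
b_ls <- coef(lm(y ~ x))[-1]                                      # LSE slopes (intercept dropped)
mu1 <- colMeans(x[g == 1, ]); mu2 <- colMeans(x[g == 2, ])
S <- (crossprod(scale(x[g == 1, ], scale = FALSE)) +
      crossprod(scale(x[g == 2, ], scale = FALSE))) / (n - 2)    # pooled within-class covariance
b_lda <- solve(S, mu2 - mu1)                                     # Sigma^{-1} (mu_2 - mu_1)
b_ls / b_lda                                                     # elementwise ratios are a (numerically) common constant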

Page 8

QDA

- Assume: x | k ∼ N(µk, Σk). Then
  δk(x) = log πk − (1/2) log |Σk| − (1/2)(x − µk)′Σk⁻¹(x − µk),
  which is quadratic in x, or
  log[Pr(k|x)/Pr(l|x)] = β0,kl + x′β1,kl + x′B2,kl x,
  a quadratic logit model.
- Estimation: similar to LDA ...
- QDA: more general and thus better than LDA?
- Figs 4.1 and 4.6.
- Example code: ex4.1.r (a small illustrative sketch follows below).
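
The actual ex4.1.r is not reproduced here; the following hedged R sketch (assuming the MASS package and simulated two-class data with unequal covariances) shows the kind of LDA-vs-QDA comparison the slide refers to:

library(MASS)
set.seed(3)
n <- 300
g <- factor(rep(1:2, each = n / 2))
x <- rbind(mvrnorm(n / 2, c(0, 0), diag(2)),                         # class 1: identity covariance
           mvrnorm(n / 2, c(1, 1), matrix(c(2, 0.8, 0.8, 0.5), 2)))  # class 2: different covariance
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], g = g)
fit_lda <- lda(g ~ x1 + x2, data = dat)                              # linear decision boundary
fit_qda <- qda(g ~ x1 + x2, data = dat)                              # quadratic decision boundary
mean(predict(fit_lda)$class != dat$g)                                # training error, LDA
mean(predict(fit_qda)$class != dat$g)                                # training error, QDA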

Page 9

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap 4

[Figure 4.1 scatterplots of the three-class data, left and right panels.]

FIGURE 4.1. The left plot shows some data from three classes, with linear decision boundaries found by linear discriminant analysis. The right plot shows quadratic decision boundaries. These were obtained by finding linear boundaries in the five-dimensional space X1, X2, X1X2, X1², X2². Linear inequalities in this space are quadratic inequalities in the original space.

Page 10

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap 4

[Figure 4.6 scatterplots of the three-class data, left and right panels.]

FIGURE 4.6. Two methods for fitting quadratic boundaries. The left plot shows the quadratic decision boundaries for the data in Figure 4.1 (obtained using LDA in the five-dimensional space X1, X2, X1X2, X1², X2²). The right plot shows the quadratic decision boundaries found by QDA. The differences are small, as is usually the case.

Page 11

- LDA and QDA: too strong assumptions and thus do not work well, e.g. with categorical predictors?
- LDA: can be applied to x augmented with high-order terms; Fig 4.6.
- STATLOG: evaluations based on 22 real datasets.
  LDA: in the top 3 for 7 out of 22;
  QDA: in the top 3 for 4 out of 22;
  L/QDA: in the top 3 for 11 out of 22.
- Dudoit & Speed (JASA, 2001): for high-dim gene expression data, LDA/QDA performed well. In fact, a diagonalized LDA or QDA could perform even better! DLDA or DQDA: only use the diagonal elements of Σ or Σk. (8000) Theory: Bickel and Levina (2004).
- Why?

Page 12

DLDA (§18.2)

- DLDA: only use diag(Σ):
  δk(x∗) = − ∑_{j=1}^p (x∗j − x̄kj)² / s²j + 2 log πk,
  where x∗ = (x∗1, ..., x∗p)′ is a test obs, x̄kj is the within-class sample mean for predictor j in class k, and s²j is the pooled sample variance for predictor j, both based on a training set. (See the R sketch below.)
- Classification rule: C(x∗) = arg maxk δk(x∗).
- It is both a naive Bayes rule and a nearest-centroid rule.
- Problem: if p is too large ... need to do VS! (8000) Theory: Fan & Fan (2008); DLDA becomes random guessing as p → ∞ unless the signal levels are extremely high ...
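
A minimal DLDA sketch in R (an illustration of the formula above, not the course's code), with the class means, pooled per-predictor variances, and the rule C(x∗) = arg maxk δk(x∗):

dlda <- function(xtr, gtr, xte) {
  gtr <- factor(gtr); K <- nlevels(gtr)
  pik  <- table(gtr) / length(gtr)                                       # pi_k
  xbar <- t(sapply(levels(gtr), function(k) colMeans(xtr[gtr == k, , drop = FALSE])))
  s2   <- colSums((xtr - xbar[as.integer(gtr), ])^2) / (nrow(xtr) - K)   # pooled variance s_j^2
  delta <- sapply(levels(gtr), function(k)                               # delta_k(x*) for each test obs
    -colSums((t(xte) - xbar[k, ])^2 / s2) + 2 * log(pik[k]))
  levels(gtr)[max.col(delta)]                                            # C(x*) = arg max_k delta_k(x*)
}
dlda(as.matrix(iris[, 1:4]), iris$Species, as.matrix(iris[, 1:4]))       # e.g. on the built-in iris data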

Page 13

Nearest Shrunken Centroids (§18.2)

- Still use the DLDA rule, but first do VS.
- Define
  dkj = (x̄kj − x̄j) / [mk(sj + s0)],
  where x̄j is the overall mean for predictor j, s0 is a small constant, e.g. the median of the sj's, m²k = 1/Nk − 1/N, and Var(x̄kj − x̄j) = m²k σ².
  Like a (regularized) t-statistic!
- ST (soft thresholding): d′kj = sign(dkj)(|dkj| − ∆)+, or HT (hard thresholding): d′kj = dkj I(|dkj| ≥ ∆), where the tuning parameter ∆ is chosen by CV (see the sketch below).
- New centroids: x̄′kj = x̄j + mk(sj + s0) d′kj, and x̄′kj = x̄j if d′kj = 0; if so for all k, predictor j is not used: why?
- Still use the DLDA rule (with the NEW centroids):
  δk(x∗) = − ∑_{j=1}^p (x∗j − x̄′kj)² / s²j + 2 log πk, C(x∗) = arg maxk δk(x∗).
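
A hedged sketch of the shrinkage step only (my own illustration; the pamr R package implements the full nearest-shrunken-centroids procedure, including CV over ∆): it forms the standardized differences dkj, soft-thresholds them, and rebuilds the class centroids.

shrink_centroids <- function(x, g, Delta, s0 = NULL) {
  g <- factor(g); K <- nlevels(g); N <- nrow(x); Nk <- table(g)
  xbar_k <- t(sapply(levels(g), function(k) colMeans(x[g == k, , drop = FALSE])))
  xbar   <- colMeans(x)                                               # overall mean per predictor
  s  <- sqrt(colSums((x - xbar_k[as.integer(g), ])^2) / (N - K))      # pooled within-class sd s_j
  if (is.null(s0)) s0 <- median(s)                                    # small constant s_0
  mk <- sqrt(1 / Nk - 1 / N)                                          # m_k
  d  <- sweep(xbar_k, 2, xbar) / outer(mk, s + s0)                    # d_kj = (xbar_kj - xbar_j) / (m_k (s_j + s_0))
  d  <- sign(d) * pmax(abs(d) - Delta, 0)                             # soft thresholding with parameter Delta
  sweep(outer(mk, s + s0) * d, 2, xbar, "+")                          # new centroids xbar'_kj
}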

Page 14

- Estimate the class probability:
  pk(x∗) = exp[(1/2)δk(x∗)] / ∑_{l=1}^K exp[(1/2)δl(x∗)].
- A penalized approach; simple and yet often effective, but ...
- Example code: ex4.2.r

Page 15

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap 18

[Figure 18.4: the top panel plots Misclassification Error (Training, 10-fold CV, Test curves) against the Number of Genes / Amount of Shrinkage Δ; the bottom panels (BL, EWS, NB, RMS) show Centroids: Average Expression Centered at Overall Centroid, by Gene.]

FIGURE 18.4 (Top): Error curves for the SRBCT

Page 16

Regularized Discriminant Analysis

- A compromise b/w LDA and QDA: use Σk(α) in QDA,
  Σk(α) = αΣk + (1 − α)Σ,
  where α ∈ [0, 1] is to be determined by CV (see the sketch below).
- Fig 4.7.
- Similarly (§18.3.1),
  Σk(α) = αΣ + (1 − α)σ²I, or
  Σk(α) = αΣ + (1 − α)diag(Σ),
  or ..., covering DLDA, DQDA, ...
- (8000) A direct approach (Mai et al. 2012): no need to estimate Σ! How? A connection b/w LDA and LSE! So, use penalized LR!
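
A hedged R sketch of the compromise covariance Σk(α) used inside the QDA discriminant (my own illustration; in a real analysis α would be chosen by CV rather than fixed):

rda_classify <- function(x, g, xnew, alpha) {
  g <- factor(g); K <- nlevels(g); lev <- levels(g)
  pik <- table(g) / length(g)                                        # pi_k
  Sk  <- lapply(lev, function(k) cov(x[g == k, , drop = FALSE]))     # class covariances Sigma_k
  Sp  <- Reduce(`+`, Map(function(S, nk) (nk - 1) * S, Sk, table(g))) / (nrow(x) - K)  # pooled Sigma
  delta <- sapply(seq_len(K), function(k) {
    Sa <- alpha * Sk[[k]] + (1 - alpha) * Sp                         # Sigma_k(alpha)
    mu <- colMeans(x[g == lev[k], , drop = FALSE])
    apply(xnew, 1, function(z)                                       # QDA-type delta_k with Sigma_k(alpha)
      log(pik[k]) - 0.5 * log(det(Sa)) - 0.5 * crossprod(z - mu, solve(Sa, z - mu)))
  })
  lev[max.col(delta)]
}
rda_classify(as.matrix(iris[, 1:4]), iris$Species, as.matrix(iris[, 1:4]), alpha = 0.5)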

Page 17

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap 4

[Figure 4.7: "Regularized Discriminant Analysis on the Vowel Data"; Misclassification Rate plotted against α, with Test Data and Train Data curves.]

FIGURE 4.7. Test and training errors for the vowel data, using regularized discriminant analysis with a series of values of α ∈ [0, 1]. The optimum for the test data occurs around α = 0.9, close to quadratic discriminant analysis.

Page 18

Logistic regression

- Binary or multinomial logit model: for k = 1, ..., K − 1,
  log[Pr(k|x)/Pr(K|x)] = β0,k + x′β1,k,
  or equivalently,
  Pr(k|x) = exp(β0,k + x′β1,k) / [1 + ∑_{l=1}^{K−1} exp(β0,l + x′β1,l)].
  Then G(x) = arg maxk Pr(k|x).
- x can be expanded to include high-order terms.
- Parameter estimation: MLE.
  Note: approximately equivalent to fitting multiple binary logit models separately (Begg & Gray, 1984, Biometrika).
- Logistic reg vs L/QDA: the former is more general; the latter has a stronger assumption and thus is possibly more efficient if ...; logistic reg is quite good.
- Example code: ex4.1.r (a hedged sketch follows below).
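
The course's ex4.1.r is not reproduced here; as a hedged sketch (assuming the built-in iris data and the nnet package for the multinomial logit), fitting a binary and a multinomial logistic regression in R might look like:

library(nnet)
fit_bin <- glm(I(Species == "virginica") ~ Sepal.Length + Sepal.Width,
               family = binomial, data = iris)                       # binary logit model
fit_mul <- multinom(Species ~ Sepal.Length + Sepal.Width,
                    data = iris, trace = FALSE)                      # multinomial logit, fitted by MLE
ghat <- predict(fit_mul, type = "class")                             # G(x) = arg max_k Pr(k|x)
mean(ghat != iris$Species)                                           # training error rate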

Page 19

Penalized logistic regression (§18.3.2, 18.4)

- Need VS or regularization for a large p.
- Add a penalty term J(β) to − log L; J(β) can be Lasso, ..., as before.
- Computing algorithms: a Taylor expansion (i.e. quadratic approx) of log L, then the same as penalized LR.
- R package glmnet: an elastic net penalty, hence it can do either Lasso or Ridge (or both); see the sketch below.
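
A minimal glmnet sketch (my own illustration with simulated data; in glmnet, alpha = 1 gives the Lasso penalty, alpha = 0 the ridge penalty, and values in between the elastic net), with the penalty weight λ chosen by cross-validation:

library(glmnet)
set.seed(4)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))                  # only the first two predictors matter
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)    # Lasso-penalized logistic regression, CV over lambda
coef(cvfit, s = "lambda.min")[1:5, ]                        # intercept and first few coefficients (many are zero)
phat <- predict(cvfit, newx = x, s = "lambda.min", type = "response")
mean((phat > 0.5) != y)                                     # training misclassification rate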