mathematical techniques in data science - vucdinh.github.io filelogistic regression linear...

20
Mathematical techniques in data science Lecture 7: Classification – Logistic regression and LDA March 1st, 2019 Mathematical techniques in data science

Upload: others

Post on 21-Oct-2019

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Mathematical techniques in data science

Lecture 7: Classification – Logistic regression and LDA

March 1st, 2019

Mathematical techniques in data science

Page 2: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Chapter 4: Classification

Logistic regression

Linear Discriminant Analysis

Nearest neighbours

Support Vector Machines

Next Friday (03/08): Homework 1 due

Groups (for class projects) need to be formed by the end ofnext week

Mathematical techniques in data science

Page 3: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Class project: example

Dataset: 25,000 images of dogs and cats

Similar: Malaria cell images dataset, Sign language dataset

Mathematical techniques in data science

Page 4: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Class project: example

Dataset: 285,000 credit card transactions

Similar: heart disease dataset

Mathematical techniques in data science

Page 5: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Linear discriminant analysis

Mathematical techniques in data science

Page 6: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Linear discriminant analysis

Suppose

Y ∈ {1, 2, . . . ,K}P(Y = i) = πi , i = 1, 2, . . . ,K .

P(X = x |Y = i) ∼ fi (x)

Then

P(Y = i |X = x) =P(X = x |Y = i)P(Y = i)∑Kj=1 P(X = x |Y = j)P(Y = j)

=fi (x)πi∑Kj=1 fj(x)πj

Mathematical techniques in data science

Page 7: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Model P(X = x |Y = i) using Gaussian distributions

The natural model for fi (x) is the multivariate Gaussiandistribution

fi (x) =1√

(2π)p det(Σi )e−

12

(x−µi )T Σ−1

i (x−µi ), x ∈ Rp

µ: mean vector

Σ: covariance matrix

Σ = E [(X − µ)T (X − µ)]

Mathematical techniques in data science

Page 8: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

LDA and QDA

The natural model for fi (x) is the multivariate Gaussiandistribution

fi (x) =1√

(2π)p det(Σi )e−

12

(x−µi )T Σ−1

i (x−µi ), x ∈ Rp

Linear discriminant analysis (LDA): We assume

Σ1 = Σ2 = . . . = ΣK

Quadratic discriminant analysis (QDA): general cases

Mathematical techniques in data science

Page 9: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Parameter estimation: LDA

Suppose we have dataset (x1, y1), (x2, y2), . . . , (xn, yn) where niobservations have label i .

An estimate of the class probabilities πi

πi =nin

Estimate the mean vectors

µi =1

ni

∑yj=i

xj

Estimate the covariance matrix Σ

Σ =1

N − K

K∑i=1

∑yj=i

(xj − µi )(xj − µi )T

Mathematical techniques in data science

Page 10: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Decision rule

Suppose we have dataset (x1, y1), (x2, y2), . . . , (xn, yn) where niobservations have label i .A new instance x arrives, how to predict label of x?

Compute πi , µi and Σ

Compute

P(Y = i |X = x) ≈ pk(x) =fi (x , µi , Σ)πi∑Kj=1 fj(x , µj , Σ)πj

Bayes classifier: assign an observation to the class for whichthe posterior probability pk(x) is greatest.

Mathematical techniques in data science

Page 11: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

LDA

Mathematical techniques in data science

Page 12: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

LDA: linearity of the decision boundary

Note that

pi (x) =fi (x , µi , Σ)πi∑Kj=1 fj(x , µj , Σ)πj

and

fi (x) =1√

(2π)p det(Σ)e−

12

(x−µi )T Σ−1(x−µi ), x ∈ Rp

Thus

logpi (x)

pk(x)= log

πiπk− 1

2(µi + µk)T Σ−1(µi − µk) + xT Σ−1(µi − µk)

= β0 + xTβ

Mathematical techniques in data science

Page 13: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

How is that different from logistic regression?

Recall that for logistic regression

logP[Y = 1|X = x ]

P[Y = 0|X = x ]= β0 + xTβ

Both methods use linear decision boundary

Both are simple, and often perform very well.

However

The probability models are different

The estimations are different

Mathematical techniques in data science

Page 14: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Practical problem when n� p

Estimating covariance matrices when n� p is challenging

The sample covariance Σ is singular when n� p

Need regularization

Mathematical techniques in data science

Page 15: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

LDA

Mathematical techniques in data science

Page 16: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Nearest neigbours

Mathematical techniques in data science

Page 17: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Nearest neigbours

Suppose Y = {0, 1}Parameter: k

Use closest observations in the training set to make predictions

Y (x) =1

k

∑Nk (x)

yi

Here Nk(x) denotes the k-nearest neighbors of x (w.r.t. somemetric, e.g. Euclidean distance)

Decision:

label(x) =

{0 if Y (x) < 0.5

1 otherwise

Mathematical techniques in data science

Page 18: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Nearest neighbors

Note: Y = {0, 1},Y (x) =

1

k

∑Nk (x)

yi

Mathematical techniques in data science

Page 19: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Nearest neighbors

Decision boundary, k = 3

Mathematical techniques in data science

Page 20: Mathematical techniques in data science - vucdinh.github.io fileLogistic regression Linear Discriminant Analysis Nearest neighbours Support Vector Machines Next Friday (03/08): Homework

Nearest neighbors

Mathematical techniques in data science