# UVA CS 4501: Machine Learning, Lecture 13: Logistic Regression


• UVA CS 4501: Machine Learning

Lecture 13: Logistic Regression

Dr. Yanjun Qi

University of Virginia, Department of Computer Science

3/23/18

Dr. Yanjun Qi / UVA CS / s18


• Where are we? è Five major sections of this course

q Regression (supervised)
q Classification (supervised)
q Unsupervised models
q Learning theory
q Graphical models


• Where are we? è Three major sections for classification

• We can divide the large variety of classification approaches into roughly three major types:

1. Discriminative: directly estimate a decision rule/boundary
   - e.g., logistic regression, support vector machine, decision tree

2. Generative: build a generative statistical model
   - e.g., naïve Bayes classifier, Bayesian networks

3. Instance-based classifiers: use observations directly (no model)
   - e.g., K-nearest neighbors


• Today

q Bayes Classifier
q Logistic Regression
q Binary to multi-class
q Training LR by MLE


• Bayes classifiers

• Treat each feature attribute and the class label as random variables.

• Given a sample x with attributes (x1, x2, …, xp):
  – Goal: predict its class C.
  – Specifically, we want to find the value of Ci that maximizes p(Ci | x1, x2, …, xp).

• Can we estimate p(Ci | x) = p(Ci | x1, x2, …, xp) directly from data?


• Bayes Classifiers – MAP Rule

Task: classify a new instance X = (X1, X2, …, Xp), described by a tuple of attribute values, into one of the classes in C:

cMAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xp)

MAP = Maximum A Posteriori probability

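The MAP rule above can be sketched in a few lines: given posterior estimates P(cj | x) for one sample, pick the class with the largest posterior. The class names and probabilities below are made up purely for illustration.

```python
# MAP rule: choose the class c maximizing P(c | x).
# Hypothetical posterior estimates for a single sample x.
posteriors = {"spam": 0.2, "ham": 0.7, "phishing": 0.1}

c_map = max(posteriors, key=posteriors.get)  # argmax over classes
print(c_map)  # ham
```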

• A Dataset for classification

• Data/points/instances/examples/samples/records: [rows]
• Features/attributes/dimensions/independent variables/covariates/predictors/regressors: [columns, except the last]
• Target/outcome/response/label/dependent variable: the special column to be predicted [last column]

Output as discrete class label: C ∈ {C1, C2, …, CL}

• Establishing a probabilistic model for classification

– Discriminative model: estimate P(c | X) directly

x = (x1, x2, …, xp)

argmax_{c ∈ C} P(c | X), C = {c1, …, cL}

[Figure: a discriminative probabilistic classifier takes inputs x1, …, xp and outputs the posteriors P(c1 | x), P(c2 | x), …, P(cL | x)]

Adapted from Prof. Ke Chen's NB slides

• Generative model

argmax_{c ∈ C} P(c | X) = argmax_{c ∈ C} P(X, c) = argmax_{c ∈ C} P(X | c) P(c)

(vs. discriminative: argmax_{c ∈ C} P(c | X), C = {c1, …, cL})

Later!
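The chain of equalities for the generative model holds because P(c | X) = P(X, c) / P(X) and P(X) does not depend on c, so the argmax is unchanged. A minimal numerical check, using made-up priors P(c) and likelihoods P(X | c):

```python
# Generative classification: argmax_c P(X | c) P(c).
# Hypothetical priors and likelihoods for one observed X.
priors = {"c1": 0.6, "c2": 0.4}
likelihoods = {"c1": 0.05, "c2": 0.20}  # P(X | c)

joint = {c: likelihoods[c] * priors[c] for c in priors}  # P(X, c)
p_x = sum(joint.values())                                # P(X), a constant in c
posterior = {c: joint[c] / p_x for c in priors}          # P(c | X)

# The argmax is the same whether we rank by the joint or the posterior.
best_joint = max(joint, key=joint.get)
best_post = max(posterior, key=posterior.get)
print(best_joint, best_post)  # c2 c2
```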

• Today

q Bayes Classifier
q Logistic Regression
q Binary to multi-class
q Training LR by MLE


• From multivariate linear regression to logistic regression

Linear regression:

y = β0 + β1x1 + β2x2 + ... + βpxp

Logistic regression for binary classification:

ln[ P(y | x) / (1 − P(y | x)) ] = β0 + β1x1 + β2x2 + ... + βpxp


• Review: Probabilistic Interpretation of Linear Regression

• Let us assume that the target variable and the inputs are related by the equation:

  y_i = θ^T x_i + ε_i

  where ε_i is an error term capturing unmodeled effects or random noise.

• Now assume that ε follows a Gaussian N(0, σ²); then we have:

  p(y_i | x_i; θ) = 1/(√(2π) σ) · exp( −(y_i − θ^T x_i)² / (2σ²) )

• By the IID assumption:

  L(θ) = ∏_{i=1..n} p(y_i | x_i; θ) = ( 1/(√(2π) σ) )^n · exp( −∑_{i=1..n} (y_i − θ^T x_i)² / (2σ²) )
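Under this Gaussian noise model, maximizing L(θ) is the same as minimizing the sum of squared residuals, since log L(θ) is a constant minus ∑(y_i − θ^T x_i)² / (2σ²). A small numerical check with made-up 1-D data and candidate slopes:

```python
import math

# Made-up 1-D data (x, y) and noise scale sigma.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.1, 1.9, 3.2]
sigma = 1.0

def log_likelihood(theta):
    # log L(theta) = sum_i log N(y_i | theta * x_i, sigma^2)
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2)
        - (y - theta * x) ** 2 / (2 * sigma**2)
        for x, y in zip(xs, ys)
    )

def sse(theta):
    # Sum of squared residuals for the same candidate slope.
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))

# The slope with higher log-likelihood also has lower squared error.
for theta in (0.9, 1.03, 1.2):
    print(theta, round(log_likelihood(theta), 3), round(sse(theta), 3))
```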

• Logistic Regression: p(y | x)

P(y | x) = e^(β0 + β1x1 + β2x2 + ... + βpxp) / (1 + e^(β0 + β1x1 + β2x2 + ... + βpxp)) = 1 / (1 + e^(−β^T X))

ln[ P(y | x) / (1 − P(y | x)) ] = β0 + β1x1 + β2x2 + ... + βpxp

• Logistic Regression models a linear classification boundary!

y ∈ {0, 1}

ln[ P(y | x) / (1 − P(y | x)) ] = β0 + β1x1 + β2x2 + ... + βpxp

• (2) The p(y | x) view

y ∈ {0, 1}

ln[ P(y | x) / (1 − P(y | x)) ] = α + β1x1 + β2x2 + ... + βpxp

The decision boundary is where P(y | x) = 0.5, i.e., where the logit is zero.

Logistic Regression models a linear classification boundary!
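The boundary claim can be checked directly: P(y | x) = 0.5 exactly when the linear score is zero, so the boundary is the hyperplane where the logit vanishes. A minimal sketch with made-up coefficients for a two-feature model:

```python
import math

# Hypothetical 2-feature model: logit = b0 + b1*x1 + b2*x2.
b0, b1, b2 = -1.0, 2.0, 1.0

def prob(x1, x2):
    z = b0 + b1 * x1 + b2 * x2        # the logit (linear score)
    return 1.0 / (1.0 + math.exp(-z))  # P(y = 1 | x)

# A point on the line b0 + b1*x1 + b2*x2 = 0 gets probability exactly 0.5.
print(prob(0.5, 0.0))  # 0.5
# Points on either side of the line fall on either side of 0.5.
print(prob(1.0, 1.0) > 0.5, prob(0.0, 0.0) < 0.5)  # True True
```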

• The logistic function is a common "S"-shaped (sigmoid) function

P(y | x) = e^(α + βx) / (1 + e^(α + βx))

[Figure: sigmoid curve of P(Y = 1 | X) rising from 0 to 1 as x increases; e.g., probability of disease]
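The "S" shape can be seen numerically: the function is monotone increasing, approaches 0 and 1 in the tails, and passes through 0.5 where α + βx = 0. The intercept and slope below are made-up values for illustration.

```python
import math

alpha, beta = 0.0, 1.0  # hypothetical intercept and slope

def logistic(x):
    # P(y | x) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))
    z = alpha + beta * x
    return math.exp(z) / (1 + math.exp(z))

# Monotone rise from near 0 to near 1, crossing 0.5 at x = -alpha/beta.
for x in (-4, -2, 0, 2, 4):
    print(x, round(logistic(x), 3))
```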



• Logistic Regression—when?

Logistic regression models are appropriate for a target variable coded as 0/1.

We only observe "0" and "1" for the target variable, but we think of the target conceptually as the probability that "1" will occur.

This means we use a Bernoulli distribution to model the target variable, with its Bernoulli parameter p = p(y = 1 | x) given by the logistic function.

The main interest è predicting the probability that an event occurs, i.e., p(y = 1 | x).

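Modeling the 0/1 target as Bernoulli with parameter p(y = 1 | x) leads directly to the log-likelihood ∑ [y_i ln p_i + (1 − y_i) ln(1 − p_i)] that is maximized in training. A minimal sketch with made-up one-feature data and two candidate coefficient settings:

```python
import math

# Made-up 1-feature data: x values and observed 0/1 labels.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0, 0, 1, 1, 1]

def bernoulli_log_likelihood(b0, b1):
    # sum_i [ y_i * log p_i + (1 - y_i) * log(1 - p_i) ]
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # p(y = 1 | x)
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

# A slope matching the data's direction scores better than one that doesn't.
print(bernoulli_log_likelihood(0.5, 2.0) > bernoulli_log_likelihood(0.0, -2.0))  # True
```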

• The logit function view

Logit of P(y | x):

ln[ P(y | x) / (1 − P(y | x)) ] = α + βx

P(y | x) = e^(α + βx) / (1 + e^(α + βx))

ln[ P(y = 1 | x) / P(y = 0 | x) ] = ln[ P(y = 1 | x) / (1 − P(y = 1 | x)) ] = α + β1x1 + β2x2 + ... + βpxp

Decision boundary è the logit equals zero
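The logit and the logistic function are inverses: applying ln(p / (1 − p)) to P(y | x) recovers the linear score α + βx. A quick round-trip check with arbitrary scores:

```python
import math

def logistic(z):
    # Maps a linear score z to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    # Maps a probability back to its log-odds.
    return math.log(p / (1.0 - p))

# Round trip: logit(logistic(z)) == z for any linear score z.
for z in (-3.0, -0.5, 0.0, 1.7):
    assert abs(logit(logistic(z)) - z) < 1e-9
print("logit and logistic are inverses")

# At the decision boundary the probability is 0.5 and the logit is zero.
print(logit(0.5))  # 0.0
```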

• Logistic Regression Assumptions

• Linearity in the logit – the regression equation should have a linear relationship with the logit form of the target variable.

• There is no assumption that the feature variables (the predictors) are linearly related to each other.


• Today

q Bayes Classifier
q Logistic Regression
q Binary to multi-class
q Training LR by MLE


• Binary Logistic Regression

In summary, logistic regression tells us two things at once:

• Transformed, the "log odds" (logit) are linear.