Transcript
Page 1: UVA CS 4501: Machine Learning, Lecture 13: Logistic Regression

UVA CS 4501: Machine Learning

Lecture 13: Logistic Regression

Dr. Yanjun Qi

University of Virginia, Department of Computer Science

3/23/18

Dr. Yanjun Qi / UVA CS / s18

Page 2:

Where are we? → Five major sections of this course

• Regression (supervised)
• Classification (supervised)
• Unsupervised models
• Learning theory
• Graphical models

Page 3:

Where are we? → Three major sections for classification

• We can divide the large variety of classification approaches into roughly three major types:

1. Discriminative: directly estimate a decision rule/boundary, e.g., logistic regression, support vector machine, decision tree

2. Generative: build a generative statistical model, e.g., naïve Bayes classifier, Bayesian networks

3. Instance-based classifiers: use observations directly (no models), e.g., K nearest neighbors

Page 4:

Today

• Bayes classifier
• Logistic regression
• Binary to multi-class
• Training LR by MLE

Page 5:

Bayes classifiers

• Treat each feature attribute and the class label as random variables.

• Given a sample x with attributes (x1, x2, …, xp):
– Goal is to predict its class C.
– Specifically, we want to find the value of Ci that maximizes p(Ci | x1, x2, …, xp).

• Can we estimate p(Ci | x) = p(Ci | x1, x2, …, xp) directly from data?


Page 7:

Bayes Classifiers – MAP Rule

Task: Classify a new instance X, described by a tuple of attribute values X = (X1, X2, …, Xp), into one of the classes:

cMAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xp)

MAP = Maximum A Posteriori probability

Adapted from Carlos' probability tutorial

Please read the L13 extra slides for WHY.
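As a minimal sketch of the MAP rule above (the posterior values are hypothetical, standing in for estimates from data): once we have P(cj | x) for every class, classification is just an argmax.

```python
import numpy as np

# Assumed estimates of P(c_j | x) for one sample over 3 hypothetical classes.
posteriors = np.array([0.2, 0.5, 0.3])

# MAP rule: pick the class with the largest posterior probability.
c_map = int(np.argmax(posteriors))
```

The hard part, which the rest of the lecture addresses, is obtaining those posteriors in the first place.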

Page 8:

A dataset for classification

• Data/points/instances/examples/samples/records: [rows]
• Features/attributes/dimensions/independent variables/covariates/predictors/regressors: [columns, except the last]
• Target/outcome/response/label/dependent variable: special column to be predicted [last column]

Output as discrete class label: C ∈ {C1, C2, …, CL}

Discriminative: argmax_{c ∈ C} P(c | X), C = {c1, …, cL}

Page 9:

Establishing a probabilistic model for classification

– Discriminative model

[Figure: a discriminative probabilistic classifier takes the input x = (x1, x2, …, xp) and outputs the posteriors P(c1 | x), P(c2 | x), …, P(cL | x).]

argmax_{c ∈ C} P(c | X), C = {c1, …, cL}

Adapted from Prof. Ke Chen's NB slides

Page 10:

A dataset for classification

• Data/points/instances/examples/samples/records: [rows]
• Features/attributes/dimensions/independent variables/covariates/predictors/regressors: [columns, except the last]
• Target/outcome/response/label/dependent variable: special column to be predicted [last column]

Output as discrete class label: C ∈ {C1, C2, …, CL}

Discriminative: argmax_{c ∈ C} P(c | X), C = {c1, …, cL}

Generative (later!): argmax_{c ∈ C} P(c | X) = argmax_{c ∈ C} P(X, c) = argmax_{c ∈ C} P(X | c) P(c)

Page 11:

Today

• Bayes classifier
• Logistic regression
• Binary to multi-class
• Training LR by MLE

Page 12:

From multivariate linear regression to logistic regression

Linear regression:

y = β0 + β1 x1 + β2 x2 + … + βp xp

Logistic regression for binary classification:

ln[ P(y | x) / (1 − P(y | x)) ] = β0 + β1 x1 + β2 x2 + … + βp xp

Page 13:

Review: Probabilistic interpretation of linear regression

• Let us assume that the target variable and the inputs are related by the equation:

y_i = θ^T x_i + ε_i

where ε_i is an error term capturing unmodeled effects or random noise.

• Now assume that ε follows a Gaussian N(0, σ²); then we have:

p(y_i | x_i; θ) = (1 / (√(2π) σ)) exp( −(y_i − θ^T x_i)² / (2σ²) )

• By the IID assumption:

L(θ) = ∏_{i=1}^n p(y_i | x_i; θ) = (1 / (√(2π) σ))^n exp( −Σ_{i=1}^n (y_i − θ^T x_i)² / (2σ²) )
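The review slide's point, that under Gaussian noise, maximizing L(θ) is the same as minimizing the sum of squared residuals, can be checked numerically. This is a sketch with hypothetical data; `gaussian_log_likelihood` is a helper name of our own choosing.

```python
import numpy as np

def gaussian_log_likelihood(theta, sigma, X, y):
    # log L(theta) = -n*log(sqrt(2*pi)*sigma) - sum_i (y_i - theta^T x_i)^2 / (2*sigma^2)
    resid = y - X @ theta
    n = len(y)
    return -n * np.log(np.sqrt(2 * np.pi) * sigma) - np.sum(resid**2) / (2 * sigma**2)

# Hypothetical data; first column of X is the constant 1 (intercept term).
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
y = np.array([0.1, 1.1, 1.9, 3.0])

# The least-squares fit minimizes the squared residuals,
# so it attains the highest Gaussian likelihood.
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```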

Page 14:

Logistic regression: p(y | x)

P(y | x) = e^{β0 + β1 x1 + β2 x2 + … + βp xp} / (1 + e^{β0 + β1 x1 + β2 x2 + … + βp xp}) = 1 / (1 + e^{−β^T X})

Equivalently,

ln[ P(y | x) / (1 − P(y | x)) ] = β0 + β1 x1 + β2 x2 + … + βp xp
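The formula P(y | x) = 1 / (1 + e^{−β^T x}) can be written as a one-line function. A sketch with hypothetical coefficients (x carries a leading 1 so that β0 is the intercept):

```python
import numpy as np

def p_y1_given_x(beta, x):
    # P(y=1|x) = 1 / (1 + exp(-beta^T x)); x includes a leading 1 for the intercept.
    return 1.0 / (1.0 + np.exp(-np.dot(beta, x)))

beta = np.array([0.0, 2.0])   # hypothetical: beta0 = 0, beta1 = 2
x0 = np.array([1.0, 0.0])     # leading 1 is the intercept term
p = p_y1_given_x(beta, x0)    # the logit beta^T x is 0 here, so p = 0.5
```

Increasing x1 increases the logit, and hence the predicted probability, monotonically.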

Page 15:

Logistic regression models a linear classification boundary!

y ∈ {0, 1}

ln[ P(y | x) / (1 − P(y | x)) ] = β0 + β1 x1 + β2 x2 + … + βp xp

Page 16:

Logistic regression models a linear classification boundary!

y ∈ {0, 1}

(2) p(y | x):

ln[ P(y | x) / (1 − P(y | x)) ] = α + β1 x1 + β2 x2 + … + βp xp

[Figure: the linear decision boundary, the set of points where P(y | x) = 0.5.]

Page 17:

The logistic function (1) – a common "S"-shaped function

P(y | x) = e^{α + βx} / (1 + e^{α + βx})

[Figure: the S-shaped curve of P(Y=1 | X) versus x, ranging from 0.0 to 1.0; e.g., the probability of disease.]


Page 19:

Logistic regression – when?

Logistic regression models are appropriate for a target variable coded as 0/1.

We only observe "0" and "1" for the target variable, but we think of the target variable conceptually as a probability that "1" will occur.

Page 20:

Logistic regression – when? (continued)

This means we use a Bernoulli distribution to model the target variable, with its Bernoulli parameter defined as p = P(y = 1 | x).

The main interest → predicting the probability that an event occurs, i.e., P(y = 1 | x).

Page 21:

The logit function view

Logit of P(y | x):

ln[ P(y | x) / (1 − P(y | x)) ] = α + βx

P(y | x) = e^{α + βx} / (1 + e^{α + βx})

More generally:

ln[ P(y = 1 | x) / P(y = 0 | x) ] = ln[ P(y = 1 | x) / (1 − P(y = 1 | x)) ] = α + β1 x1 + β2 x2 + … + βp xp

Logit function

Decision boundary → logit equals zero
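The decision boundary fact above, logit = 0 if and only if P(y = 1 | x) = 0.5, can be sketched directly (coefficients here are hypothetical):

```python
import numpy as np

def logit(beta, x):
    # log-odds ln[p/(1-p)] = alpha + beta^T x; x carries a leading 1 for alpha.
    return float(np.dot(beta, x))

def predict(beta, x):
    # Decision boundary: logit = 0  <=>  P(y=1|x) = 0.5.
    return 1 if logit(beta, x) > 0 else 0

beta = np.array([-1.0, 2.0])  # hypothetical: alpha = -1, beta1 = 2
```

With these coefficients the boundary sits at x1 = 0.5: samples with x1 above it are labeled 1, those below it 0.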

Page 22:

Logistic regression assumptions

• Linearity in the logit – the regression equation should have a linear relationship with the logit form of the target variable.

• There is no assumption about the feature variables/target predictors being linearly related to each other.

Page 23:

Today

• Bayes classifier
• Logistic regression
• Binary to multi-class
• Training LR by MLE

Page 24:

Binary logistic regression

In summary, logistic regression tells us two things at once:

• Transformed, the "log odds" (logit) are linear in x.

• The response follows a logistic distribution: P(Y = 1 | x) is an S-shaped function of x.

[Figure: two panels, the logit ln[p/(1−p)] as a linear function of x, and P(Y = 1 | x) as an S-shaped curve in x.]

Odds = p / (1 − p)

Page 25:

Binary → Multinoulli logistic regression model

Pr(G = k | X = x) = exp(β_{k0} + β_k^T x) / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x)),  k = 1, …, K − 1

Pr(G = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x))

• Directly models the posterior probabilities as the output of regression.
• Note that the class boundaries are linear.
• x is a p-dimensional input vector; β_k is a p-dimensional vector for each class k.
• Total number of parameters is (K − 1)(p + 1).
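The two formulas above, with class K as the reference class, can be sketched in a few lines. The data and coefficients here are hypothetical; `multinoulli_posteriors` is a helper name of our own choosing.

```python
import numpy as np

def multinoulli_posteriors(B, x):
    # B: (K-1, p+1) coefficient matrix (class K is the reference class);
    # x: (p+1,) input with leading entry 1. Returns Pr(G=k|x) for k = 1..K.
    scores = np.exp(B @ x)                           # exp(beta_k0 + beta_k^T x), k = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)    # last entry is the reference class K

K, p = 3, 2
B = np.zeros((K - 1, p + 1))   # hypothetical all-zero parameters -> uniform posteriors
x = np.array([1.0, 0.3, -0.7])
probs = multinoulli_posteriors(B, x)

n_params = (K - 1) * (p + 1)   # matches the (K-1)(p+1) parameter count on the slide
```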

Page 26:

Binary → Multinoulli logistic regression model

Note that the class boundaries are linear.

Page 27:

Today

• Bayes classifier
• Logistic regression
• Binary to multi-class
• Training LR by MLE

Page 28:

Parameter estimation for LR → MLE from the data

• Review:
– Linear regression training → least squares
– Gaussian explanation of linear regression training → MLE

• Logistic regression training:
– Maximum Likelihood Estimation based
– Classic algorithms use the iterative Newton's method

Please read the L13 extra slides for HOW.

Page 29:

Review: Defining likelihood for a basic Bernoulli

• Likelihood = p(data | parameter)

PMF (a function of x_i):

f(x_i | p) = p^{x_i} (1 − p)^{1 − x_i}

Likelihood (a function of p):

L(p) = Π_{i=1}^n p^{x_i} (1 − p)^{1 − x_i} = p^x (1 − p)^{n − x}

→ e.g., for n independent tosses of a coin with unknown p.

Observed data → x heads-up from n trials, where x = Σ_{i=1}^n x_i.
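A quick numeric check of this likelihood, with a hypothetical coin dataset: the MLE for the Bernoulli parameter is x/n, and it attains a higher (log-)likelihood than other choices of p.

```python
import numpy as np

def bernoulli_log_likelihood(p, tosses):
    # log L(p) = x*log(p) + (n - x)*log(1 - p), with x heads out of n tosses.
    x = np.sum(tosses)
    n = len(tosses)
    return x * np.log(p) + (n - x) * np.log(1 - p)

tosses = np.array([1, 1, 1, 0, 0])   # hypothetical data: 3 heads in 5 trials
p_mle = tosses.mean()                # the Bernoulli MLE is x/n = 0.6
```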

Page 30:

MLE for logistic regression training (extra)

Let's fit the logistic regression model for K = 2, i.e., the number of classes is 2.

Training set: (x_i, y_i), i = 1, …, N. For the Bernoulli distribution, Pr(Y = y | x) = p_β(y | x)^y (1 − p_β(y | x))^{1 − y}, so the log-likelihood is:

l(β) = Σ_{i=1}^N log Pr(Y = y_i | X = x_i)
     = Σ_{i=1}^N [ y_i log Pr(Y = 1 | X = x_i) + (1 − y_i) log Pr(Y = 0 | X = x_i) ]
     = Σ_{i=1}^N [ y_i log( exp(β^T x_i) / (1 + exp(β^T x_i)) ) + (1 − y_i) log( 1 / (1 + exp(β^T x_i)) ) ]
     = Σ_{i=1}^N [ y_i β^T x_i − log(1 + exp(β^T x_i)) ]

We want to maximize the log-likelihood in order to estimate β.

The x_i are (p+1)-dimensional input vectors with leading entry 1; β is a (p+1)-dimensional vector.

How?
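One way to sketch the maximization is plain gradient ascent on l(β), rather than the Newton iterations covered in the extra slides: the gradient of the final expression above is X^T (y − p). The training set and step size here are hypothetical.

```python
import numpy as np

def log_likelihood(beta, X, y):
    # l(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]
    z = X @ beta
    return float(np.sum(y * z - np.log1p(np.exp(z))))

def fit_gradient_ascent(X, y, steps=500, lr=0.1):
    # Simple alternative to Newton's method: the gradient of l(beta) is X^T (y - p),
    # where p is the vector of current predicted probabilities.
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        beta += lr * X.T @ (y - p)
    return beta

# Tiny hypothetical training set; first column of X is the constant 1.
X = np.array([[1., -2.], [1., -1.], [1., 1.], [1., 2.]])
y = np.array([0., 0., 1., 1.])
beta_hat = fit_gradient_ascent(X, y)
```

Each step moves β in the direction that increases l(β); Newton's method instead also uses the Hessian, which converges in far fewer iterations.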

Page 31:

l(β) = Σ_{i=1}^N log Pr(Y = y_i | X = x_i)

Page 32:

Logistic Regression

• Task: classification
• Representation: log-odds (logit) is a linear function of the X's
• Score function: EPE, MLE
• Search/Optimization: iterative (Newton) method
• Models, parameters: logistic weights,

P(c = 1 | x) = e^{α + βx} / (1 + e^{α + βx})

Page 33:

References

• Prof. Tan, Steinbach, and Kumar's "Introduction to Data Mining" slides
• Prof. Andrew Moore's slides
• Prof. Eric Xing's slides
• Prof. Ke Chen's NB slides
• Hastie, Trevor, et al. The Elements of Statistical Learning. Vol. 2. No. 1. New York: Springer, 2009.

