logistic regression & classification - stony...

of 68 /68
Logistic Regression & Classification Logistic Regression & Classification 1

Upload: phammien

Post on 22-Apr-2019

224 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Logistic Regression & ClassificationLogistic Regression & Classification

1

Page 2: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Logistic Regression ModelLogistic Regression ModelIn 1938, Ronald Fisher and Frank Yates suggested the logit link for regression with a bibinary response variable.

ln(Odds of Y 1| x)

( 1| ) ( 1| )ln ln( 0| ) 1 ( 1| )

P Y x P Y xP Y x P Y x

x

0 1ln1

xx

x

x

xx

1

lnlogit

Page 3: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Logistic regression model is the most popular model for binary data.

Logistic regression model is generally used to study the relationship between a binary response variable and a group of predictors (can be either continuousand a group of predictors (can be either continuous or categorical).Y = 1 (true success YES etc ) orY 1 (true, success, YES, etc.) orY = 0 ( false, failure, NO, etc.)

Logistic regression model can be extended to Logistic regression model can be extended to model a categorical response variable with more than two categories. The resulting model is sometimes referred to as the multinomial logistic regression model (in contrast to the ‘binomial’ logistic regression for a binary response variable )logistic regression for a binary response variable.)

Page 4: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

C id bi i bl Y 0 1 d i l di t Consider a binary response variable Y=0 or 1and a single predictor variable x. We want to model E(Y|x) =P(Y=1|x) as a function of x. The logistic regression model expresses the logistic transform of P(Y=1|x) as a linear function of the predictor.

( 1| )l P Y x

This model can be rewritten as0 1

( | )ln1 ( 1| )

xP Y x

)exp(11

)exp(1)exp()|1( 10

xxxxYP

E(Y|x)= P(Y=1| x) *1 + P(Y=0|x) * 0 = P(Y=1|x) is bounded between 0 and 1 for all values of x The following linear model may

)exp(1)exp(1 1010 xx

between 0 and 1 for all values of x. The following linear model may violate this condition sometimes:

YP |1 xxYP 10|1

Page 5: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

More reasonable modelingMore reasonable modeling

Pi

LogitTransformTransform

Predictor Predictor

Page 6: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

In the simple logistic regression, the regression coefficient has the interpretation that it is the 1log of the odds ratio of a success event (Y=1) for a unit change in x.

11010 ][)]1([)|0()|1(ln

)1|0()1|1(ln

xxYP

xYPYP

xYP11010)|0()1|0(

xYPxYP

For multiple predictor variables, the logistic regression model is

kkk xx

xxxYPxxxYP

...

)|0(),...,,|1(

ln 11021

kxxxYP

),...,,|0( 21

Page 7: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

The Logistic Regression ModelThe Logistic Regression Model

Page 8: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Alternative Model EquationAlternative Model Equation 

xx xx

Page 9: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

L gi ti R g i SAS P dLogistic Regression, SAS Procedure http://www.ats.ucla.edu/stat/sas/output/SAS_logit_output.htm Proc Logistic This page shows an example of logistic regression with

footnotes explaining the output The data were collected onfootnotes explaining the output. The data were collected on 200 high school students, with measurements on various tests, including science, math, reading and social studies. The response variable is high writing test score (honcomp),where a writing score greater than or equal to 60 is considered high and less than 60 considered low; from whichconsidered high, and less than 60 considered low; from which we explore its relationship with gender (female), reading test score (read), and science test score (science). The dataset used in this page can be downloaded from

http://www.ats.ucla.edu/stat/sas/webbooks/reg/default.htm.

9

Page 10: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Logistic Regression SAS ProcedureLogistic Regression, SAS Procedure

data logit; set "c:\Temp\hsb2"; honcomp = (write >= 60); p ( );run; proc logistic data= logit descending; p g g g;model honcomp = female read science; run;run;

proc export data=logitproc export data logitoutfile = "c:\Temp\hsb2.xls"dbms = xls replace;dbms xls replace;run;

10

Page 11: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Logistic Regression SAS ProcedureLogistic Regression, SAS Procedure

data logit; set "c:\Temp\hsb2"; honcomp = (write >= 60); p ( );run; proc logistic data= logit descending; p g g g;model honcomp = female read science; output out=pred p=phat lower = lcl upper = ucl;output out pred p phat lower lcl upper ucl;run;

t d t dproc export data=predoutfile = "c:\Temp\pred.xls"db l ldbms = xls replace;run;

11

Page 12: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Logistic Regression SAS ProcedureLogistic Regression, SAS Procedure

data logit; set "c:\Temp\hsb2"; honcomp = (write >= 60); run; proc logistic data= logit descending; model honcomp = female read science /selection=stepwise; run;

12

Page 13: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Descending option in proc logistic and proc genmod

Th d di ti i SAS th The descending option in SAS causes the levels of your response variable to be sorted from highest to lowest (by defaultsorted from highest to lowest (by default, SAS models the probability of the lower category)category).

In the binary response setting, we code the event of interest as a ‘1’ and use theevent of interest as a 1 and use the descending option to model that probability P(Y = 1 | X = x).( | )

In our SAS example, we’ll see what happens when this option is not used.pp p

Page 14: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Logistic Regression, SAS Output

14

Page 15: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

LOGISTIC and GENMOD procedure for a single continuous predictor

PROC LOGISTIC DATA= dataset <options>;PROC LOGISTIC DATA= dataset <options>;MODEL response=predictor /<options>;OUTPUT OUT=SAS-dataset keyword=name y

</option>;RUN;

PROC GENMOD DATA=dataset <options>;MAKE ‘OBSTATS’ OUT=SAS-data-set;MAKE OBSTATS OUT SAS data set;MODEL response=predictors </ options>;

RUN;

☺ The GENMOD procedure is much more general than the logistic procedure as it can accommodate other generalized linear model, and also cope with repeated measures data. It is not required for exam.

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_genmod_sect053.htm

Page 16: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Logistic RegressionLogistic Regression

Logistic Regression - Dichotomous Logistic Regression Dichotomous Response variable and numeric and/or categorical explanatory variable(s)categorical explanatory variable(s) Goal: Model the probability of a particular outcome

as a function of the predictor variable(s)as a function of the predictor variable(s) Problem: Probabilities are bounded between 0 and

1 Distribution of Responses: Binomial Link Function: Link Function:

1

ln)(g

16

Page 17: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Logistic Regression with 1 PredictorLogistic Regression with 1 Predictor

• Response - Presence/Absence of characteristic

• Predictor - Numeric variable observed for each case

• Model - (x) Probability of presence at predictor level x

x

xex10

1)(

xe 101)(

• = 0 P(Presence) is the same at each level of x• = 0 P(Presence) is the same at each level of x

• > 0 P(Presence) increases as x increases

• < 0 P(Presence) decreases as x increases17

Page 18: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Hypothesis testingHypothesis testing

Significance tests focuses on a test of H0: = 0 vs. Ha: ≠ 0.0

The Wald, Likelihood Ratio, and Score test are used (we’ll focus on Waldtest are used (we ll focus on Wald method)

Wald CI easil obtained score and LR Wald CI easily obtained, score and LR CI numerically obtained.

For Wald, the 95% CI (on the log odds scale) is

))ˆ((961ˆ SE)

))((96.1 11 SE

Page 19: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Logistic Regression with 1 Predictor

are unknown parameters and must be estimated using statistical software such as SAS, or R

· Primary interest in estimating and testing y g ghypotheses regarding where S.E. stands for standard error.· Large-Sample test (Wald Test):· H0: = 0 HA: 0H0: 0 HA: 0

:..

2

12obsXST

:

..

22

1

obs

XRR

ES

)(:

:..22

1,

obs

obs

XPvalP

XRR

19

Page 20: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Example Rizatriptan for Migraine Example - Rizatriptan for Migraine

Response - Complete Pain Relief at 2 hours (Yes/No)( )

Predictor - Dose (mg): Placebo (0),2.5,5,10

Dose # Patients # Relieved % Relieved 0 67 2 3.0 0 67 3.0

2.5 75 7 9.3 5 130 29 22.3

10 145 40 27 610 145 40 27.6

20

Page 21: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

SAS DataSAS DataDose # Patients # Relieved % Relieved

0 67 2 3.0 2.5 75 7 9.3 5 130 29 22.3

10 145 40 27 610 145 40 27.6

Y Dose NumberY Dose Number0 0 651 0 21 0 20 2.5 681 2.5 70 5 1011 5 290 10 105

21

0 10 1051 10 40

Page 22: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

SAS ProcedureSAS ProcedureData dr gData drug;Input Y Dose Number;Datalines;Datalines;0 0 651 0 2 proc logistic data= drug descending;1 0 20 2.5 681 2.5 7

proc logistic data= drug descending; model Y = Dose; f N b0 5 101

1 5 29freq Number;run;

0 10 1051 10 40;Run;

22

Page 23: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Example - Rizatriptan for Migrainep p g

Variables in the Equation

.165 .037 19.819 1 .000 1.1802 490 285 76 456 1 000 083

DOSEConstant

Step1

a

B S.E. Wald df Sig. Exp(B)

-2.490 .285 76.456 1 .000 .083Constant1

Variable(s) entered on step 1: DOSE.a.

xe 165.0490.2^

xeex 165.0490.21

)(

1650

0:0:2

110 HH A

843:

819.19037.0165.0:..

22

2

XRR

XST obs

000.:

84.3: 1,05.

valPXRR obs

23

Page 24: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Interpretation of a single continuousparameter

The sign (±) of determines whether thelog odds of y is increasing or decreasing for every 1-unit increase in x.

If > 0, there is an increase in the log odds If 0, there is an increase in the log oddsof y for every 1-unit increase in x.

If < 0 there is a decrease in the log odds If < 0, there is a decrease in the log oddsof y for every 1-unit increase in x.

If = 0 there is no linear relationship If = 0 there is no linear relationship between the log odds and x.

Page 25: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Odds Ratio

Interpretation of Regression Coefficient ():I li i th l ffi i t i th h In linear regression, the slope coefficient is the change in the mean response as x increases by 1 unit

In logistic regression we can show that: In logistic regression, we can show that:

)()()1(

1xxoddsexodds

)(1

)()(

1

xxoddse

xodds

Th t th h i th dd f th t• Thus e represents the change in the odds of the outcome (multiplicatively) by increasing x by 1 unit

• If = 0, the odds and probability are the same at all x levels (e =1)

• If > 0 , the odds and probability increase as x increases (e>1)

• If < 0 , the odds and probability decrease as x increases (e<1)25

Page 26: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

95% Confidence Interval for Odds Ratio

Step 1: Construct a 95% CI for :

]..[*96.1,]..[*96.1]..[*96.1 111

ESESES 111

• Step 2: Raise e = 2.718 to the lower and upper bounds of the CI:

]..[*96.1]..[*96.1

11

11

, ESES

ee

,

• If entire interval is above 1, conclude positive association

• If entire interval is below 1, conclude negative association, g

• If interval contains 1, cannot conclude there is an association26

Page 27: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Example - Rizatriptan for MigraineExample Rizatriptan for Migraine

• 95% CI for :

)2375009250()0370(9611650:%95

037.0..165.01

1

CI

ES

)2375.0,0925.0()037.0(96.1165.0:%95 CI

• 95% CI for population odds ratio:95% CI for population odds ratio:

)27.1,10.1(, 2375.00925.0 ee ),(,

• Conclude positive association between dose and• Conclude positive association between dose and probability of complete relief

27

Page 28: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Logistic regression model with a single categorical (≥ 2 levels) predictor

logit (pk) = log (odds) = 0 + kXkwherelogit(pk) logit transformation of the

probability of the eventprobability of the event0 intercept of the regression linek difference between the logits for

category k vs. the reference category

Page 29: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

LOGISTIC and GENMOD procedures for a LOGISTIC and GENMOD procedures for a single categorical predictorPROC LOGISTIC DATA=dataset <options>;

CLASS variables </option>;MODEL di t / tiMODEL response=predictors </options>;OUTPUT OUT=SAS-data-set keyword=name

</option>;</option>;RUN;

PROC GENMOD DATA d t t < ti >PROC GENMOD DATA=dataset <options>;CLASS variables </option>;

MAKE ‘OBSTATS’ OUT=SAS-data-set;MAKE OBSTATS OUT SAS data set;MODEL response=predictors </ options>;

RUN;

Page 30: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Class statement in proc logisticClass statement in proc logistic SAS will create dummy variables for a

t i l i bl if t ll it tcategorical variable if you tell it to. We need to specify dummy coding by using the

param = ref option in the class statement; weparam = ref option in the class statement; we can also specify the comparison group by using the ref = option after the variable name.p

Using class automatically generates a test of significance for all parameters associated with th l i bl (t bl f T 3 t t ) ifthe class variable (table of Type 3 tests); if you use dummy variables instead (more on this soon) you will not automatically get an “overall”soon), you will not automatically get an overall test for that variable.

We will see this more clearly in the SAS yexamples.

Page 31: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Reference categoryReference category

Each factor has as many parameters as categories, but one is redundant, so we need to specify a reference category.

Similar concept to what you have just Similar concept to what you have just learned in the linear regression analysis.

Page 32: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Interpretation of a single categoricalparameter

If your reference group is level 0, then the coefficient of βk represents the difference in the log odds between level k of your variable and level 0.y

Therefore, eβk is an odds ratio for category k vs the reference category ofcategory k vs. the reference category of x.

Page 33: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Creating your own dummy variables and not using the class statement

An equivalent model uses dummy variables (that you create), which accounts for ( y ),redundancy by not including a dummy variable for your reference category.y g y

The choice of reference category is arbitrary. Remember this method will not produce an Remember, this method will not produce an

“overall” test of significance for that variable.

Page 34: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Multiple Logistic RegressionMultiple Logistic Regression

Extension to more than one predictor variable (either te s o to o e t a o e p ed cto a ab e (e t enumeric or dummy variables).

With k predictors, the model is written:p

kk xxe

110

kk xxe

1101Adj t d Odd ti f i i b 1 it h ldi• Adjusted Odds ratio for raising xi by 1 unit, holding

all other predictors constant:

OR ieORi

• Many models have nominal/ordinal predictors andMany models have nominal/ordinal predictors, and widely make use of dummy variables

34

Page 35: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Testing Regression Coefficients Testing the overall model:

0: 10 kH

))log(2())log(2(

0 allNot :2

iA

LLXST

H

..

))log(2())log(2(..2

,2

10

kobs

obs

XRR

LLXST

)( 22

,

obsXPP • L0, L1 are values of the maximized likelihood function, computed by statistical software packages. This logic can also be used to compare full and reduced models based on subsets of predictors Testing forfull and reduced models based on subsets of predictors. Testing for individual terms is done as in model with a single predictor.

35

Page 36: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Example - ED in Older Dutch Men

Response: Presence/Absence of ED (n=1688)( ) Predictors: (p=12)

Age stratum (50-54* 55-59 60-64 65-69 70-78) Age stratum (50-54 , 55-59, 60-64, 65-69, 70-78) Smoking status (Nonsmoker*, Smoker)

BMI t t (<25* 25 30 >30) BMI stratum (<25*, 25-30, >30) Lower urinary tract symptoms (None*, Mild,

M d t S )Moderate, Severe) Under treatment for cardiac symptoms (No*, Yes) Under treatment for COPD (No*, Yes)

* Baseline group for dummy variables

36

Page 37: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Example - ED in Older Dutch MenPredictor b sb Adjusted OR (95% CI) Age 55-59 (vs 50-54) 0.83 0.42 2.3 (1.0 – 5.2) Age 60-64 (vs 50-54) 1.53 0.40 4.6 (2.1 – 10.1)Age 60 64 (vs 50 54) 1.53 0.40 4.6 (2.1 10.1) Age 65-69 (vs 50-54) 2.19 0.40 8.9 (4.1 – 19.5) Age 70-78 (vs 50-54) 2.66 0.41 14.3 (6.4 – 32.1) Smoker (vs nonsmoker) 0.47 0.19 1.6 (1.1 – 2.3) BMI 25 30 (vs <25) 0 41 0 21 1 5 (1 0 2 3)BMI 25-30 (vs <25) 0.41 0.21 1.5 (1.0 – 2.3) BMI >30 (vs <25) 1.10 0.29 3.0 (1.7 – 5.4) LUTS Mild (vs None) 0.59 0.41 1.8 (0.8 – 4.3) LUTS Moderate (vs None) 1.22 0.45 3.4 (1.4 – 8.4) LUTS S ( N ) 2 01 0 56 7 5 (2 5 22 5)LUTS Severe (vs None) 2.01 0.56 7.5 (2.5 – 22.5) Cardiac symptoms (Yes vs No) 0.92 0.26 2.5 (1.5 – 4.3) COPD (Yes vs No) 0.64 0.28 1.9 (1.1 – 3.6)

Interpretations: Risk of ED appears to be:

• Increasing with age, BMI, and LUTS strataIncreasing with age, BMI, and LUTS strata

• Higher among smokers

• Higher among men being treated for cardiac or COPD37

Page 38: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Logistic Regression Variable SelectionLogistic Regression –Variable Selection

F t l ti Feature selection Stepwise variable selection Best subset variable selection Find a subset Best subset variable selection -- Find a subset

of the variables that are sufficient for explaining their joint effect on the response. j p

Regularization Maximum penalized likelihood Shrinking the parameters via an L1 constraint,

imposing a margin constraint in the separable casecase

2

2)|(log

CxyP ii

n

1 2i

Page 39: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Predicted ValuesPredicted ValuesThe output of a logit model is the predicted probability of a success for each observation.

Page 40: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Predicted ValuesPredicted ValuesThese are obtained and stored in a separate SAS data set using the OUTPUT statement (see the following code).

Page 41: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Predicted ValuesPredicted ValuesPROC LOGISTIC outputs the predicted values and 95% CI limits to an output data set that also contains the original raw data.

Page 42: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Predicted ValuesPredicted ValuesUse the PREDPROBS = I option in order to obtain the predicted category (which is saved in the _INTO_ variable).

Page 43: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Predicted ValuesPredicted Values_FROM_ = The observed response category = The same value as the response variable.

Page 44: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Predicted ValuesPredicted Values_INTO_ = The predicted response category.

Page 45: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Predicted ValuesPredicted ValuesIP_1 = The Individual Probability of a response of 1.

Page 46: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Note to my students in AMS394Note to my students in AMS394

Dear students, you have done well this semester by learning so many methods that are rarely taught to undergraduates!

For the final exam the required For the final exam, the required materials would end here. From the next slide and on those materials are notslide and on, those materials are not required.A l l d littl SAS As always, please read your little SAS book by Cody & Smith for logistic regression.

46

Page 47: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Scoring Observations in SASScoring Observations in SAS

Obtaining predicted probabilities and/or predicted outcomes (categories) for new observations (i.e., scoring new observations) is done in logit modeling ) g gusing the same procedure we used in scoring new observations in linearscoring new observations in linear regression. 

Page 48: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Scoring Observations in SASScoring Observations in SAS

1. Create a new data set with the desired values of the x variables and the y variable set to missing.

2 Merge the new data set with the2. Merge the new data set with the original data set.

3 Refit the fi al odel u i PROC3. Refit the final model using PROC LOGISTIC using the OUTPUT statement.

Page 49: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Classification Table & RatesClassification Table & RatesA Classification Table is used to summarize the results of the predictions and to ultimately evaluate the fitness of the model.Obtain a classification table using PROC FREQFREQ.

Page 50: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Classification Table & RatesClassification Table & RatesThe observed (or actual) response is in rows and the predicted response is in columns.

Page 51: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Classification Table & RatesClassification Table & RatesCorrect classifications are summarized on the main diagonal.

Page 52: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Classification Table & RatesClassification Table & RatesThe total number of correct classifications (i.e., ‘hits’) is the sum of the main diagonal frequencies.O = 130+9 = 139O   130+9   139

Page 53: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Classification Table & RatesClassification Table & RatesThe total‐group hit rate is the ratio of O and N.  HR = 139/196 = .698

Page 54: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Classification Table & RatesClassification Table & RatesIndividual group hit rates can also be calculated. These are essentially the row percents on the main diagonal.

Page 55: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

LDA vs Logistic RegressionLDA vs. Logistic Regression

LDA (G ti d l) LDA (Generative model) Assumes Gaussian class-conditional densities

and a common covarianceand a common covariance Model parameters are estimated by maximizing

the full log likelihood, parameters for each class g , pare estimated independently of other classes, Kp+p(p+1)/2+(K-1) parametersM k f i l d it i f ti P (X) Makes use of marginal density information Pr(X)

Easier to train, low variance, more efficient if model is correctmodel is correct

Higher asymptotic error, but converges faster

Page 56: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

LDA vs Logistic RegressionLDA vs. Logistic Regression

L i ti R i (Di i i ti d l) Logistic Regression (Discriminative model) Assumes class-conditional densities are

members of the (same) exponential familymembers of the (same) exponential family distribution

Model parameters are estimated by maximizing the conditional log likelihood simultaneousthe conditional log likelihood, simultaneous consideration of all other classes, (K-1)(p+1)parametersI i l d it i f ti P (X) Ignores marginal density information Pr(X)

Harder to train, robust to uncertainty about the data generation processg p

Lower asymptotic error, but converges more slowly

Page 57: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Generative vs Discriminative LearningGenerative vs. Discriminative Learning(Rubinstein 97)

Generative DiscriminativeExample Linear Discriminant

A l iLogistic Regression

Analysis

Objective Functions Full log likelihood: Conditional log likelihood

yxp )(log xyp )|(log

Model Assumptions Class densities: Discriminant functions

i

ii yxp ),(log i

ii xyp )|(log

)|( kyxp )(e.g. Gaussian in LDA

Parameter Estimation “Easy” – One single sweep “Hard” – iterative optimization

)|( kyxp )(xk

Advantages More efficient if model correct, borrows strength from p(x)

More flexible, robust because fewer assumptions

from p(x)

Disadvantages Bias if model is incorrect May also be biased. Ignores information in p(x)

Page 58: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Comparison between LDA and LOGREG

(ErrorRate / T Di t ib ti(ErrorRate / Standard Error)

True Distribution

Highly non- N/A Gaussiang yGaussian

LDA 25.2/0.47 9.6/0.61 7.6/0.12

LOGREG 12.6/0.94 4.1/0.17 8.1/0.27

(Rubinstein 97)

Page 59: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Link between LDA, QDA, and Logistic regression

Suppose we classify by computing a posterior probability. The posterior was calculated by modeling the likelihood and prior for

h leach class.

• In LDA & QDA, we compute the posterior, assuming that each class distribution was Gaussian with equal (LDA) or unequal (QDA) variances.

• In logistic regression, we directly model the posterior as a function of the variable x.

ˆ

• In practice when there are k classes to classify we model:

P l gx x

• In practice, when there are k classes to classify, we model:

1P x 11P

gP k

x

xx

Page 60: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

In this example we assume that the two distributions for the classesClassification by maximizing the posterior distribution

In this example we assume that the two distributions for the classes have equal variance. Suppose we want to classify a person as male or female based on height.g

| 0 and | 1 and 1

1|

p x y p x y P y q

P y x

What we have:What we want:

| 1p x y | 0p x y

Height is normally distributed in the

x

Height is normally distributed in the population of men and in the population of women, with different means, and similar variances. Let y be an indicator variable for

2

2

1 1| 1 exp22

1 1

fp x y x

ybeing a female. Then the conditional distribution of x (the height becomes):

22

1 1| 0 exp22 mp x y x

Page 61: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Posterior probability for classification when we have two classes of equal variance:

1 | 11|

1 | 1 0 | 0P y p x y

P y xP y p x y P y p x y

of equal variance:

2

2

2 2

1exp2

1 1exp 1 exp

fq x

q x q x

2 2

2

exp 1 exp2 2

11

f mq x q x

22

2

2

11 exp21

1exp2

m

f

q x

q x

222

21

1 11 exp log m fq x x

2p g2

1

m fq

1 1 fq 1 exp log 2 22 2

1 12

m fm f

q xq

Page 62: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Computing the probability that the subject is female, given that we observed height xobserved height x.

11|1 1

P y x

a logistic functionIn the denominator x

2 22 2

1 11 exp log2

m fm f

q xq

In the denominator, x appears linearly inside the exponential

176166

m

f

cmcm

| 1p x y | 0p x y

12

1 0.5cm

p y

1|P y x 0.8

1

So if we assume that the 1|P y x

0.4

0.6

Posterior:

So if we assume that the class membership densities p(x/y) are normal with equal variance then the posterior

120 140 160 180 200 220

0.2

x

variance, then the posterior probability will be a logistic function.

Page 63: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Logistic regression with assumption of equal variance

among density of classes implies a linear decision boundary

60 0Ta a x

linear decision boundary2

4

2x

-2

0 Class 0

1-4 -2 0 2 4 6

1x

( ) ( )

0

111 exp

i iiT

iT

P ya

xa x

0( ) ( )

0 0

exp10 11 exp 1 exp

iTi i

i iT T

aP y

a a

a xx

a x a x

( ) ( )( )

( ) ( )

11 if log 0

0

i ii

i i

P yy

P y

x

x

( ) ( )

0( ) ( )0

1 1log log0 exp

i iiT

ii i T

P ya

P y a

xa x

x a x

Page 64: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Modeling the posterior when the densities have unequal variance

1 | 11|

1 | 1 0 | 0P y p x y

P y xP y p x y P y p x y

212

11

1 1exp22

1 1 1 1

q x x

2 21 22 2

1 2

1 1 1 1exp 1 exp2 22 2

1f m

q x x q x x

222

22

2

1 11 exp221

1 1

q x x

212

11

1 1exp22

1

q x x

11 exp log q

q

2 212 12 2

2 2 1

1 1log2 2

1

x x x x

20 1 2

11 exp w w x w x

Page 65: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Related sites/papers:/p p

http://www.stats.ox.ac.uk/pub/StatMeth/Counts.pdfp p p ftp://gis.msl.mt.gov/Maxell/Models/Predictive_Modelin

g_for_DSS_Lincoln_NE_121510/Modeling_Literature/Efron Efficiency%20of%20LR%20versus%20DFA pdfEfron_Efficiency%20of%20LR%20versus%20DFA.pdf

http://math.arizona.edu/~hzhang/math574m/Read/LogitOrLDA.pdfg p

http://arxiv.org/ftp/arxiv/papers/0804/0804.0650.pdf http://www.diva-

t l / h/ t/di 2 724982/FULLTEXT01 dfportal.org/smash/get/diva2:724982/FULLTEXT01.pdf https://onlinecourses.science.psu.edu/stat857/node/1

8484

65

Page 66: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Related sites/papers:/p p

Paul D. Allison, “Logistic Regression Using , g g gthe SAS System: Theory and Application”, SAS Institute, Cary, North Carolina, 1999.

Alan Agresti “Categorical Data Analysis” 2nd Alan Agresti, Categorical Data Analysis , 2nd

Ed., Wiley Interscience, 2002. David W Hosmer and Stanley Lemeshow David W. Hosmer and Stanley Lemeshow

“Applied Logistic Regression”, Wiley-Interscience, 2nd Edition, 2000.

Rubinstein, Y. D., & Hastie, T. (1997). Discriminative vs. informative learning. In Proceedings Third International ConferenceProceedings Third International Conference on Knowledge Discovery and Data Mining, pp. 49--53.

66

Page 67: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Related sites/papers:/p p

Menard, S. Applied Logistic Regression , pp g gAnalysis. Series: Quantitative Applications in the Social Sciences. Sage University Series

Lemeshow S; Teres D ; Avrunin J S and Lemeshow, S; Teres, D.; Avrunin, J.S. and Pastides, H. Predicting the Outcome of Intensive Care Unit Patients. Journal of the American Statistical Association 83, 348-356

Hosmer, D.W.; Jovanovic, B. and Lemeshow, S. Best S bsets Logistic Regression BiometricsBest Subsets Logistic Regression. Biometrics 45, 1265-1270. 1989.

Pergibon D Logistic Regression Diagnostics Pergibon, D. Logistic Regression Diagnostics. The Annals of Statistics 19(4), 705-724. 1981.

67

Page 68: Logistic Regression & Classification - Stony Brookzhu/ams394/LogisticRegressionandClassification.pdf · Logistic regression model is the most popular model for binary data. Logistic

Acknowledgement:g

We also thank colleagues who posted We also thank colleagues who posted their lecture notes on the internet@!

68