logistic regression & classification - stony...
Post on 22-Apr-2019
216 views
Embed Size (px)
TRANSCRIPT
Logistic Regression & ClassificationLogistic Regression & Classification
1
Logistic Regression ModelLogistic Regression ModelIn 1938, Ronald Fisher and Frank Yates suggested the logit link for regression with a bibinary response variable.
ln(Odds of Y 1| x)
( 1| ) ( 1| )ln ln( 0| ) 1 ( 1| )
P Y x P Y xP Y x P Y x
x
0 1ln
1x
xx
x
xx
1
lnlogit
Logistic regression model is the most popular model for binary data.
Logistic regression model is generally used to study the relationship between a binary response variable and a group of predictors (can be either continuousand a group of predictors (can be either continuous or categorical).Y = 1 (true success YES etc ) orY 1 (true, success, YES, etc.) orY = 0 ( false, failure, NO, etc.)
Logistic regression model can be extended to Logistic regression model can be extended to model a categorical response variable with more than two categories. The resulting model is sometimes referred to as the multinomial logistic regression model (in contrast to the binomial logistic regression for a binary response variable )logistic regression for a binary response variable.)
C id bi i bl Y 0 1 d i l di t Consider a binary response variable Y=0 or 1and a single predictor variable x. We want to model E(Y|x) =P(Y=1|x) as a function of x. The logistic regression model expresses the logistic transform of P(Y=1|x) as a linear function of the predictor.
( 1| )l P Y x This model can be rewritten as
0 1( | )ln
1 ( 1| )x
P Y x
)exp(11
)exp(1)exp()|1( 10
xxxxYP
E(Y|x)= P(Y=1| x) *1 + P(Y=0|x) * 0 = P(Y=1|x) is bounded between 0 and 1 for all values of x The following linear model may
)exp(1)exp(1 1010 xx
between 0 and 1 for all values of x. The following linear model may violate this condition sometimes:
YP |1 xxYP 10|1
More reasonable modelingMore reasonable modeling
Pi
LogitTransformTransform
Predictor Predictor
In the simple logistic regression, the regression coefficient has the interpretation that it is the 1log of the odds ratio of a success event (Y=1) for a unit change in x.
11010 ][)]1([)|0()|1(ln
)1|0()1|1(ln
xxYP
xYPYP
xYP11010)|0()1|0( xYPxYP
For multiple predictor variables, the logistic regression model is
kkk xx
xxxYPxxxYP
...
)|0(),...,,|1(
ln 11021
kxxxYP ),...,,|0( 21
The Logistic Regression ModelTheLogisticRegressionModel
Alternative Model EquationAlternativeModelEquation
xx xx
L gi ti R g i SAS P dLogistic Regression, SAS Procedure http://www.ats.ucla.edu/stat/sas/output/SAS_logit_output.htm Proc Logistic This page shows an example of logistic regression with
footnotes explaining the output The data were collected onfootnotes explaining the output. The data were collected on 200 high school students, with measurements on various tests, including science, math, reading and social studies. The response variable is high writing test score (honcomp),where a writing score greater than or equal to 60 is considered high and less than 60 considered low; from whichconsidered high, and less than 60 considered low; from which we explore its relationship with gender (female), reading test score (read), and science test score (science). The dataset used in this page can be downloaded from
http://www.ats.ucla.edu/stat/sas/webbooks/reg/default.htm.
9
Logistic Regression SAS ProcedureLogistic Regression, SAS Procedure
data logit; set "c:\Temp\hsb2"; honcomp = (write >= 60); p ( );run; proc logistic data= logit descending; p g g g;model honcomp = female read science; run;run;
proc export data=logitproc export data logitoutfile = "c:\Temp\hsb2.xls"dbms = xls replace;dbms xls replace;run;
10
Logistic Regression SAS ProcedureLogistic Regression, SAS Procedure
data logit; set "c:\Temp\hsb2"; honcomp = (write >= 60); p ( );run; proc logistic data= logit descending; p g g g;model honcomp = female read science; output out=pred p=phat lower = lcl upper = ucl;output out pred p phat lower lcl upper ucl;run;
t d t dproc export data=predoutfile = "c:\Temp\pred.xls"db l ldbms = xls replace;run;
11
Logistic Regression SAS ProcedureLogistic Regression, SAS Procedure
data logit; set "c:\Temp\hsb2"; honcomp = (write >= 60); run; proc logistic data= logit descending; model honcomp = female read science /selection=stepwise; run;
12
Descending option in proc logistic and proc genmod
Th d di ti i SAS th The descending option in SAS causes the levels of your response variable to be sorted from highest to lowest (by defaultsorted from highest to lowest (by default, SAS models the probability of the lower category)category).
In the binary response setting, we code the event of interest as a 1 and use theevent of interest as a 1 and use the descending option to model that probability P(Y = 1 | X = x).( | )
In our SAS example, well see what happens when this option is not used.pp p
Logistic Regression, SAS Output
14
LOGISTIC and GENMOD procedure for a single continuous predictor
PROC LOGISTIC DATA= dataset ;PROC LOGISTIC DATA= dataset ;MODEL response=predictor /;OUTPUT OUT=SAS-dataset keyword=name y
;RUN;
PROC GENMOD DATA=dataset ;MAKE OBSTATS OUT=SAS-data-set;MAKE OBSTATS OUT SAS data set;MODEL response=predictors ;
RUN;
The GENMOD procedure is much more general than the logistic procedure as it can accommodate other generalized linear model, and also cope with repeated measures data. It is not required for exam.
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_genmod_sect053.htm
Logistic RegressionLogistic Regression
Logistic Regression - Dichotomous Logistic Regression Dichotomous Response variable and numeric and/or categorical explanatory variable(s)categorical explanatory variable(s) Goal: Model the probability of a particular outcome
as a function of the predictor variable(s)as a function of the predictor variable(s) Problem: Probabilities are bounded between 0 and
1 Distribution of Responses: Binomial Link Function: Link Function:
1
ln)(g
16
Logistic Regression with 1 PredictorLogistic Regression with 1 Predictor
Response - Presence/Absence of characteristic
Predictor - Numeric variable observed for each case
Model - (x) Probability of presence at predictor level x
x
xex10
1)(
xe 101)(
= 0 P(Presence) is the same at each level of x = 0 P(Presence) is the same at each level of x
> 0 P(Presence) increases as x increases
< 0 P(Presence) decreases as x increases17
Hypothesis testingHypothesis testing
Significance tests focuses on a test of H0: = 0 vs. Ha: 0.0
The Wald, Likelihood Ratio, and Score test are used (well focus on Waldtest are used (we ll focus on Wald method)
Wald CI easil obtained score and LR Wald CI easily obtained, score and LR CI numerically obtained.
For Wald, the 95% CI (on the log odds scale) is
))((961 SE)
))((96.1 11 SE
Logistic Regression with 1 Predictor
are unknown parameters and must be estimated using statistical software such as SAS, or R
Primary interest in estimating and testing y g ghypotheses regarding where S.E. stands for standard error. Large-Sample test (Wald Test): H0: = 0 HA: 0H0: 0 HA: 0
:..
2
12obsXST
:
..
22
1
obs
XRR
ES
)(:
:..22
1,
obs
obs
XPvalP
XRR
19
Example Rizatriptan for Migraine Example - Rizatriptan for Migraine
Response - Complete Pain Relief at 2 hours (Yes/No)( )
Predictor - Dose (mg): Placebo (0),2.5,5,10
Dose # Patients # Relieved % Relieved 0 67 2 3.0 0 67 3.0
2.5 75 7 9.3 5 130 29 22.3
10 145 40 27 610 145 40 27.6
20
SAS DataSAS DataDose # Patients # Relieved % Relieved
0 67 2 3.0 2.5 75 7 9.3 5 130 29 22.3
10 145 40 27 610 145 40 27.6
Y Dose NumberY Dose Number0 0 651 0 21 0 20 2.5 681 2.5 70 5 1011 5 290 10 105
21
0 10 1051 10 40
SAS ProcedureSAS ProcedureData dr gData drug;Input Y Dose Number;Datalines;Datalines;0 0 651 0 2 proc logistic data= drug descending;1 0 20 2.5 681 2.5 7
proc logistic data= drug descending; model Y = Dose; f N b0 5 101
1 5 29freq Number;run;
0 10 1051 10 40;Run;
22
Example - Rizatriptan for Migrainep p g
Variables in the Equation
.165 .037 19.819 1 .000 1.1802 490 285 76 456 1 000 083
DOSEConstant
Step1
a
B S.E. Wald df Sig. Exp(B)
-2.490 .285 76.456 1 .000 .083Constant1
Variable(s) entered on step 1: DOSE.a.
xe 165.0490.2^ xe
ex 165.0490.21)(
1650
0:0:2
110 HH A
843:
819.19037.0165.0:..
22
2
XRR
XST obs
000.:
84.3: 1,05.valPXRR obs
23
Interpretation of a single continuousparameter
The sign () of determines whether thelog odds of y is increasing or decreasing for every 1-unit increase in x.
If > 0, there is an increase in the log odds If 0, there is an increase in the