Page 1: Logistic Regression using SAS prepared by Voytek Grus for SAS user group, Halifax February 24, 2006

Logistic Regression using SAS

prepared by

Voytek Grus

for

SAS user group, Halifax February 24, 2006

Page 2

What is Logistic Regression?

• Regression analysis in which the response variable Y is discrete and represents either categories or counts. There are no restrictions on the predictors.

• A linear regression equation of the type $y_i = \alpha + \beta x_i + \varepsilon_i$ is not appropriate ...

• ... but, as in linear regression analysis, logistic regression is used to
  – test the statistical significance of the relationship between response and predictor variables
  – predict the category of an outcome given its predictors

• Falls into the category of generalized linear models and either complements or offers a flexible alternative to
  – Multiple linear regression: similar equations and statistical diagnostics
  – Contingency tables (cross tabulation)
  – Loglinear models
  – Discriminant analysis: answers similar questions but is less restrictive

• Relatively new statistical tool for the analysis of categorical data
  – Contingency tables: 1900s
  – Regression analysis: 1970s
  – Loglinear models: 1975
  – Logistic regression: late 1970s to early 1980s, but became more popular in the 1990s

Example data layout:

Y       X1    X2  X3
green   46.8  1   No
yellow  15.9  0   No
red     51.8  0   Yes

Page 3

Fields of application.

• Health sciences: questions about disease (yes or no?)

• Social sciences: deal with a great many dichotomous variables (employed vs. unemployed, married vs. unmarried, etc.)
  – Attitude to work based on demographic or behavioral predictors
  – Racial bias in judicial decisions, etc.

• Political science:
  – Which party will voters vote for, and why?
  – Which voters will vote for a particular party?

• Public opinion polls

• Used in economics and marketing to study consumer choice
  – Banks use it to assess the credit rating of customers
  – Some regulators require that utilities submit customer choice studies on energy conservation options
  – Choice of mode of transportation

• Used in demand forecasting

Page 4

PART I: Conceptual Framework of Logistic Regression

Page 5

Why not use OLS to estimate the categorical response equation?

• Multiple linear regression of a categorical response variable fails two of the assumptions of the linear model that are needed for efficient coefficient estimates and consistent standard errors (homoscedasticity and normality of the errors):
  1. Linearity in the coefficients: $y_i = \alpha + \beta x_i + \varepsilon_i$
  2. $E(\varepsilon_i) = 0$, so that $E(y_i) = 1 \cdot P(y_i = 1) + 0 \cdot P(y_i = 0) = p_i = \alpha + \beta x_i$
  3. Homoscedasticity is violated: $\mathrm{var}(\varepsilon_i) = \mathrm{var}(y_i) = p_i(1 - p_i) = (\alpha + \beta x_i)(1 - \alpha - \beta x_i) \neq \sigma^2$
  4. Errors are uncorrelated: $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$
  5. Normality is violated: $y_i$ is binomial (a single trial), so each error takes on only two values, $\varepsilon_i = 1 - \alpha - \beta x_i$ or $\varepsilon_i = 0 - \alpha - \beta x_i$

• As a result:
  1. Coefficient estimates are no longer efficient
  2. Standard error estimates are no longer consistent
  3. Estimated values of the response variable Y may be implausible: binary regression by OLS is a linear probability model, $E(y_i) = p_i = \alpha + \beta x_i$, but a linear function is unbounded, so estimates can fall outside the (0, 1) interval
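The two problems named above (variance that changes with x, and fitted values outside the (0, 1) interval) can be seen with a few lines of arithmetic. The following DATA step is only a sketch: the data set name lpm_demo and the values of alpha and beta are hypothetical, chosen so that the linear function crosses 1 inside the range of x.

```sas
/* Hypothetical coefficients chosen only to illustrate the problem */
data lpm_demo;
  alpha = 0.10;
  beta  = 0.02;
  do x = 0 to 60 by 10;
    p_linear     = alpha + beta*x;                  /* E(y) = alpha + beta*x       */
    var_error    = p_linear*(1 - p_linear);         /* var(e) = p(1 - p), not constant */
    out_of_range = (p_linear < 0 or p_linear > 1);  /* 1 when the "probability" is implausible */
    output;
  end;
run;

proc print data=lpm_demo noobs;
  var x p_linear var_error out_of_range;
run;
```

With these assumed values the flag out_of_range equals 1 at x = 50 and x = 60, where the linear "probability" exceeds 1 and the implied variance becomes negative.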

Page 6

Logit transformation: a remedy for violated OLS assumptions

• Instead of estimating the linear equation $y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$, we can apply the logit transformation: $\log[p_i/(1 - p_i)] = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}$, where $p_i/(1 - p_i)$ is the odds that the event $y = 1$ will occur.

• Consequences:
  – $p_i = \exp(\alpha + \beta_1 x_{i1} + \dots + \beta_k x_{ik}) / (1 + \exp(\alpha + \beta_1 x_{i1} + \dots + \beta_k x_{ik}))$ happens to be the cumulative logistic distribution function.
  – No matter what the coefficients are, $p_i$ is always between 0 and 1.
  – The absence of $\varepsilon_i$ complicates statistical analysis: how should standardized coefficients be defined?
  – The derivative of $p_i$ with respect to $x_i$ is a function of $p_i$: $\partial p_i/\partial x_i = \beta p_i(1 - p_i)$. It reflects the changing slope of the S curve, which makes coefficients difficult to interpret. Be cautious when interpreting coefficients from the probability perspective.
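The logit transformation and its inverse can be checked numerically. This DATA step is a minimal sketch under assumed values: alpha, beta, and the data set name logit_demo are hypothetical and not from the presentation.

```sas
/* Hypothetical intercept and slope, for illustration only */
data logit_demo;
  alpha = -2;
  beta  = 0.5;
  do x = -4 to 12 by 4;
    xb    = alpha + beta*x;           /* linear predictor                      */
    p     = exp(xb)/(1 + exp(xb));    /* inverse logit: always between 0 and 1 */
    odds  = p/(1 - p);                /* odds of the event y = 1               */
    logit = log(odds);                /* recovers the linear predictor xb      */
    slope = beta*p*(1 - p);           /* dp/dx, which changes along the S curve */
    output;
  end;
run;

proc print data=logit_demo noobs;
  var x xb p odds logit slope;
run;
```

At x = 4 the linear predictor is 0, so p = 0.5 and the slope reaches its maximum of beta/4 = 0.125; further from that point the same beta corresponds to a smaller change in probability.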

Page 7

Alternatives to logit transformation in the context of latent variables: probit and complementary log log

• In a perfect world there is a model for a continuous response variable $z_i$, and the dichotomous logit model is only its simplification. There is a true equation $z_i = \alpha_0 + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \dots + \alpha_k x_{ik} + \sigma\varepsilon_i$, but it cannot be observed: it is latent. Instead we observe a dichotomous y whose value (1 or 0) depends on z. The relationship of y with the predictors x depends on the probability distribution of ε.

• The assumed distribution of ε helps determine standardized coefficients.

Link | Distribution of ε | Standard deviation of ε | Link function (inverse of the CDF of ε) | CDF of ε
Logit | Logistic: $f(\varepsilon) = e^{\varepsilon}/(1 + e^{\varepsilon})^2$ | $\sigma = \pi/\sqrt{3} \approx 1.8138$ | $f(p) = \log(p/(1 - p))$ | $F(x) = e^{x}/(1 + e^{x})$
Probit | Standard normal | $\sigma = 1$ | $f(p) = \Phi^{-1}(p)$ | $\Phi(x) = (2\pi)^{-1/2} \int_{-\infty}^{x} \exp(-z^2/2)\,dz$
Complementary log-log | Double exponential (extreme value) | $\sigma = \pi/\sqrt{6} \approx 1.28$ | $f(p) = \log(-\log(1 - p))$ | $F(x) = 1 - \exp(-\exp(x))$
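The three CDFs in the table can be compared side by side. The DATA step below is a sketch only: the data set name link_demo and the grid of eta values are assumptions made for illustration. It uses the built-in SAS functions CONSTANT and PROBNORM.

```sas
data link_demo;
  pi          = constant('pi');
  sd_logistic = pi/sqrt(3);   /* standard deviation of the logistic error, about 1.8138 */
  sd_extreme  = pi/sqrt(6);   /* standard deviation of the extreme value error, about 1.28 */
  do eta = -2 to 2 by 1;                 /* eta plays the role of the linear predictor */
    p_logit   = exp(eta)/(1 + exp(eta)); /* logistic CDF                    */
    p_probit  = probnorm(eta);           /* standard normal CDF             */
    p_cloglog = 1 - exp(-exp(eta));      /* complementary log-log inverse   */
    output;
  end;
run;

proc print data=link_demo noobs;
  var eta p_logit p_probit p_cloglog sd_logistic sd_extreme;
run;
```

The logit and probit columns are symmetric around eta = 0 (both give 0.5 there), while the complementary log-log column is not, which is one practical reason the choice of link can matter.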

Page 8

Logistic Regression in the context of the generalized linear models.

Type of regression | Link | Link function | Distribution of the response variable Y | Regression model | Error distribution | Estimation procedure
Linear regression | Identity | $E(Y) = X^T\beta$ | Normal | $E(Y) = X^T\beta$ | Normal | OLS
Logistic regression | Logit | $f(p) = \log(p/(1 - p))$ | Binomial or multinomial | $E(Y) = \exp(X^T\beta)/(1 + \exp(X^T\beta))$ | Binomial | ML, sometimes WLS
Logistic regression | Probit | $f(p) = \Phi^{-1}(p)$ | Binomial or multinomial | $\Phi(y) = (2\pi)^{-1/2} \int_{-\infty}^{y} \exp(-z^2/2)\,dz$ | Binomial | ML, sometimes WLS
Logistic regression | Complementary log-log | $f(p) = \log(-\log(1 - p))$ | Gompertz (extreme value) | $E(Y) = 1 - \exp(-\exp(x))$ | Poisson, binomial, etc. | ML
Poisson regression | Log-linear | $f(p) = \log(y)$ | Poisson | $E(y) = \exp(X^T\beta)$ | Poisson | ML
? | Inverse | $E(y) = 1/y$ | Gamma | $E(Y) = 1/(X^T\beta)$ | Gamma | ML
Log-linear regression | Cumulative log-log | $f(p) = \log(-\log(1 - p))$ | Gompertz (extreme value) | $E(Y) = 1 - \exp(-\exp(x))$ | Poisson, binomial, etc. | ML
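Several rows of the table above correspond to DIST= and LINK= choices in PROC GENMOD. The sketch below simulates a small data set so that it is self-contained; the data set name glm_demo, the variable names, the seed, and the coefficients are all hypothetical.

```sas
/* Simulated data, for illustration only */
data glm_demo;
  call streaminit(2006);
  do i = 1 to 200;
    x       = rand('uniform')*4 - 2;
    y_norm  = 1 + 2*x + rand('normal');                        /* continuous response */
    y_bin   = rand('bernoulli', exp(-0.5 + x)/(1 + exp(-0.5 + x))); /* 0/1 response   */
    y_count = rand('poisson', exp(0.2 + 0.5*x));               /* count response      */
    output;
  end;
run;

/* Linear regression row: normal distribution, identity link */
proc genmod data=glm_demo;
  model y_norm = x / dist=normal link=identity;
run;

/* Logistic regression rows: binomial distribution with three links */
proc genmod data=glm_demo descending;
  model y_bin = x / dist=binomial link=logit;
run;

proc genmod data=glm_demo descending;
  model y_bin = x / dist=binomial link=probit;
run;

proc genmod data=glm_demo descending;
  model y_bin = x / dist=binomial link=cloglog;
run;

/* Poisson regression row: Poisson distribution, log link */
proc genmod data=glm_demo;
  model y_count = x / dist=poisson link=log;
run;
```

The DESCENDING option is used for the binary fits so that the probability of y_bin = 1 is modeled, in the same way as in the PROC LOGISTIC examples elsewhere in this presentation.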

Page 9

I Logistic Regression compared to ordinary linear regression

Analytical tools | Ordinary linear regression | Logistic regression
Coefficient interpretation | In general, have meaning. | Have no intuitive meaning except for the sign. Use the adjusted odds ratio instead ($e^{\text{coefficient}}$).
Coefficient confidence intervals and hypothesis testing | t test, partial and sequential F tests | Wald confidence intervals / profile likelihood confidence intervals / max. likelihood interval
Global hypothesis testing | H0 vs. H1, F test: $F = (SS_{reg}/1)/(SS_{res}/(n-2))$ | Likelihood ratio test: $\Lambda = \max_{\theta \in \omega_0} \mathrm{lik}(\theta) / \max_{\theta \in \Omega} \mathrm{lik}(\theta)$; under H0, $-2\log\Lambda \sim \chi^2_1$. Wald chi-square statistic. Score.
Goodness of fit | $R^2 = SS_{reg}/SS_{total} = 1 - SS_{res}/SS_{total}$; AIC, adjusted R-square, PRESS | Deviance (is there a better model than this one?): $-2\log(\max \mathrm{lik}(\theta)_{fitted} / \max \mathrm{lik}(\theta)_{saturated})$. Global chi-square (is this model better than nothing?): $\sum_{cells} (O_i - E_i)^2/E_i \sim \chi^2_1$. Hosmer-Lemeshow test. ROC curve.
Model (variable) selection method | Direct, Forward, Backward, Stepwise, Maxr, Minr, Rsqr, Rsqradj, Mallows Cp | Direct, Forward, Backward, Stepwise, Score
Multicollinearity | Detect collinear variables or groups of variables using PROC REG: TOL, VIF, COLLNOINT. And/or PROC CORR. | The same.
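Two of the logistic regression entries in the table reduce to simple arithmetic: the odds ratio is the exponentiated coefficient, and the likelihood ratio statistic is a difference of two -2 log likelihood values. The DATA step below is a sketch with hypothetical inputs (beta, se, and both -2 log L values are made up for illustration and are not results from any data set in this presentation).

```sas
/* All input numbers are hypothetical */
data or_lr_demo;
  beta = 0.6931;                       /* assumed logit coefficient       */
  se   = 0.25;                         /* assumed standard error          */

  odds_ratio = exp(beta);              /* about 2: the odds double per unit of x */
  wald_lower = exp(beta - 1.96*se);    /* 95% Wald limits for the odds ratio     */
  wald_upper = exp(beta + 1.96*se);
  wald_chisq = (beta/se)**2;           /* Wald chi-square, 1 df           */
  wald_p     = 1 - probchi(wald_chisq, 1);

  neg2logl_null = 140.2;               /* assumed -2 log L, intercept only */
  neg2logl_full = 131.5;               /* assumed -2 log L, with predictor */
  lr_chisq = neg2logl_null - neg2logl_full;   /* -2 log(Lambda)            */
  lr_p     = 1 - probchi(lr_chisq, 1);        /* 1 df: one added parameter */
run;

proc print data=or_lr_demo noobs;
run;
```

In practice PROC LOGISTIC reports these quantities directly; the sketch only shows how they relate to the coefficient and to the two likelihoods.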

Page 10

II Logistic Regression compared to ordinary linear regression

Analytical tools | Ordinary linear regression | Logistic regression
Influence diagnostics, residuals, predictive power, etc. | DFBETAS, DFFITS, Cook's distance, studentized residuals, partial residual plots, predicted values | DFBETAS, DIFCHISQ, DIFDEV, residuals (deviance, Pearson, raw), predicted values
Non-constant error variance | May transform the response Y to stabilize the variance (log(Y), 1/Y, sqrt(Y)) or run WLS. |
Autocorrelation / dependence of observations | Cause: correlation of the ε's in time series regression. Use Durbin-Watson to diagnose and autoregression to combat. | Cause: clustered or longitudinal data. Use GEE estimation or conditional logit analysis.
Over-dispersion / under-dispersion | | Lack of fit due to an underspecified model or dependence of observations.
Estimation | OLS | ML, WLS, OLS
Unobserved heterogeneity (heterogeneity shrinkage) | | Coefficients are related to the underlying continuous model: $\beta_j = \alpha_j/\sigma$. The random disturbance may reflect omitted explanatory variables. Include predictors known to be important even in the absence of statistical significance.

Page 11

PART II: SAS Application of Logistic Regression

Page 12

Summary of SAS procedures for logistic regression analysis

• Binary logit analysis
  – Procedures: LOGISTIC, GENMOD, CATMOD, PROBIT, MDC, NLMIXED

• Multinomial logit analysis
  – Predictors are characteristics of the individual
    • Nominal (no ordering of the Y's): PROC LOGISTIC; PROC CATMOD
    • Ordinal (inherent ordering of the Y's): PROC LOGISTIC; PROC CATMOD; PROC GENMOD

• Conditional logit analysis
  – Predictors are characteristics of the response categories (the choice alternatives)
    • Can use PROC MDC and PROC PHREG
  – Logit analysis of clustered data:
    • PROC LOGISTIC (or PROC PHREG)
    • PROC GENMOD (GEE)

Page 13

Binary Logit Models

• PROC LOGISTIC at its simplest: main effects model

  1. Individual-level data:

      PROC LOGISTIC DATA=input;
        FREQ frequency;              /* optional */
        MODEL y = X1 X2;
      RUN;

  2. Grouped data:

      PROC LOGISTIC DATA=input;
        MODEL events/trials = X1 X2;
      RUN;

• PROC LOGISTIC with more features:

      PROC LOGISTIC DATA=lrdata.penalty DESCENDING;
        CLASS culp;
        MODEL death = blackd|whitvic|culp / STB LACKFIT AGGREGATE RSQ
              LINK=logit TECHNIQUE=newton
              CLODDS=BOTH            /* profile likelihood and Wald limits */
              SELECTION=stepwise SCALE=WILLIAMS CORRB INFLUENCE IPLOTS;
        UNITS culp=2 / DEFAULT=1;
        OUTPUT OUT=results PRED=phat LOWER=lb UPPER=up
               RESCHI=stres DFBETAS=dfs;
      RUN;

• PROC GENMOD at its simplest:

      PROC GENMOD DATA=lrdata.penalty;
        MODEL y = X1 X2 / DIST=binomial;
      RUN;
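The two input layouts for PROC LOGISTIC (individual-level rows versus events/trials counts) describe the same model. The following sketch simulates individual-level data, aggregates it, and fits both forms. The data set names binary_demo and grouped_demo, the seed, and the coefficients are hypothetical.

```sas
/* Simulated individual-level data, for illustration only */
data binary_demo;
  call streaminit(2006);
  do id = 1 to 300;
    x1 = rand('bernoulli', 0.5);
    x2 = rand('bernoulli', 0.4);
    p  = 1/(1 + exp(-(-1 + 0.8*x1 + 0.5*x2)));
    y  = rand('bernoulli', p);
    output;
  end;
run;

/* 1. Individual-level form: one row per subject */
proc logistic data=binary_demo;
  model y(event='1') = x1 x2;
run;

/* Aggregate to one row per covariate pattern */
proc sql;
  create table grouped_demo as
  select x1, x2, sum(y) as events, count(*) as trials
  from binary_demo
  group by x1, x2;
quit;

/* 2. Grouped form: events/trials syntax */
proc logistic data=grouped_demo;
  model events/trials = x1 x2;
run;
```

Both fits should produce the same coefficient estimates, because the grouped likelihood differs from the individual-level one only by a constant. Goodness-of-fit statistics that depend on the number of rows can differ between the two layouts.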

Page 14

Multinomial Logit Models

• Multinomial logit for a nominal response (generalized logit)
  – A logit transformation of the type $\log(p_i/(1 - p_i))$ does not work for more than 2 categories, because separate binary logits do not constrain the category probabilities to sum to 1: $\sum_{j=1}^{k} p_{ij} \neq 1$
  – Instead, k-1 equations are estimated: $\log(p_{ij}/p_{ik}) = \alpha_j + \beta_j x_i$, where j = 1, 2, ..., k-1.

• Multinomial logit for an ordinal response (cumulative, adjacent categories, continuation ratio)
  – The inherent ordering of the Y responses makes it possible to avoid a separate set of coefficients for each odds equation.
  – Estimate k-1 equations for the odds of the cumulative probabilities $F_{ij}$:
    • $\log(F_{ij}/(1 - F_{ij})) = \alpha_j + \beta x_i$; all coefficients except the intercept stay the same across equations
  – Because there is a hierarchy in the categories of the response variable:
    • The model is easier to estimate and interpret
    • Hypothesis tests are more powerful
    • There is one coefficient for each predictor but k-1 intercepts

Available tools in SAS:

  1. PROC LOGISTIC:

      PROC LOGISTIC DATA=lrdata.wallet;
        MODEL wallet = male business punish explain / LINK=glogit;   /* or LINK=clogit */
      RUN;

  2. PROC CATMOD:

      PROC CATMOD DATA=lrdata.wallet;
        DIRECT male business punish explain;
        MODEL wallet = male business punish explain / NOITER PRED;
      RUN;
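Both model forms above imply category probabilities that sum to 1, which can be verified by hand. The DATA step below is a sketch for k = 3 categories; every parameter value and the data set name multinomial_demo are hypothetical.

```sas
/* Hypothetical parameters, k = 3 response categories */
data multinomial_demo;
  a1 = 0.5;  b1 = -0.4;        /* generalized logit, category 1 vs. category 3 */
  a2 = -0.2; b2 = 0.3;         /* generalized logit, category 2 vs. category 3 */
  c1 = -1;   c2 = 1;  b = 0.5; /* cumulative logit: two intercepts, one slope  */
  do x = 0 to 4 by 2;
    /* Generalized logit: log(p_j/p_3) = a_j + b_j*x */
    denom   = 1 + exp(a1 + b1*x) + exp(a2 + b2*x);
    g1      = exp(a1 + b1*x)/denom;
    g2      = exp(a2 + b2*x)/denom;
    g3      = 1/denom;                 /* reference category */
    g_total = g1 + g2 + g3;            /* equals 1           */

    /* Cumulative logit: log(F_j/(1 - F_j)) = c_j + b*x */
    F1      = 1/(1 + exp(-(c1 + b*x)));
    F2      = 1/(1 + exp(-(c2 + b*x)));
    o1      = F1;
    o2      = F2 - F1;                 /* positive because c1 < c2 */
    o3      = 1 - F2;
    o_total = o1 + o2 + o3;            /* equals 1           */
    output;
  end;
run;

proc print data=multinomial_demo noobs;
  var x g1 g2 g3 g_total o1 o2 o3 o_total;
run;
```

The generalized logit part needs a separate intercept and slope for each of the k-1 equations, while the cumulative logit part needs only k-1 intercepts and a single slope, which is the parsimony the slide refers to.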

Page 15

Conditional logit Models

• Consumer choice studies
  – Consumer taste preferences, choice of mode of transportation, locational characteristics for a retail store
  – Conditional logit:

      proc mdc;
        model decision = x1 x2 / type=clogit choice=(mode 1 2 3);
        id pid;
      run;

  – Nested logit:

      proc mdc data=newdata;
        model decision = ttime / type=nlogit choice=(mode 1 2 3) covest=hess;
        id pid;
        utility u(1,) = ttime;
        nest level(1) = (1 2 @ 1, 3 @ 2),
             level(2) = (1 2 @ 1);
      run;

• Analysis of clustered data
  – Observations within clusters can often be dependent: longitudinal data, students clustered in classrooms or schools, husbands and wives clustered in families, etc.
  – Dependent observations produce underestimated standard errors, overestimated test statistics, and inefficient coefficient estimates.
  – Remedies: GEE (PROC GENMOD), conditional logit (PROC LOGISTIC or PROC PHREG), and other methods such as mixed models or hybrids of the above.
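The two remedies named for clustered data can be sketched on simulated data. Everything specific here is an assumption made for illustration: the data set name cluster_demo, the seed, the coefficients, and the choice of an exchangeable working correlation. The second step assumes a SAS release in which PROC LOGISTIC supports the STRATA statement.

```sas
/* Simulated clustered data: 50 clusters of 4 members each */
data cluster_demo;
  call streaminit(2006);
  do cluster = 1 to 50;
    u = rand('normal', 0, 1);          /* shared cluster effect creates dependence */
    do member = 1 to 4;
      x = rand('normal');
      p = 1/(1 + exp(-(-0.5 + 0.7*x + u)));
      y = rand('bernoulli', p);
      output;
    end;
  end;
run;

/* Remedy 1: GEE with an exchangeable working correlation */
proc genmod data=cluster_demo descending;
  class cluster;
  model y = x / dist=binomial link=logit;
  repeated subject=cluster / type=exch;
run;

/* Remedy 2: conditional logit, conditioning on the cluster */
proc logistic data=cluster_demo;
  strata cluster;
  model y(event='1') = x;
run;
```

The GEE fit estimates a population-averaged effect of x with standard errors that allow for within-cluster correlation. The conditional fit removes the cluster effects from the likelihood, so clusters in which all members have the same response contribute nothing to the estimate.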

Page 16

Consumer choice Modeling: Nested Logit Example

• Example

      proc mdc data=travel2 maxit=200 outest=a;
        model choice = ttime time cost / type=nlogit choice=(mode);
        id id;
        utility u(1, 1 2 3 @ 1) = ttime time cost,
                u(1, 4 @ 2)     = time cost;
        nest level(1) = (1 2 3 @ 1, 4 @ 2),
             level(2) = (1 2 @ 1);
      run;

Decision tree:

                           Top
                  /                  \
Level 2:     1 (public)           2 (private)
            /     |     \              |
Level 1: 1 (plane) 2 (train) 3 (bus)  4 (car)


Page 18

Questions?