November 5, 2008 Logistic and Poisson Regression: Modeling Binary and Count Data LISA Short Course Series Mark Seiss, Dept. of Statistics

November 5, 2008

Logistic and Poisson Regression: Modeling Binary and Count Data

LISA Short Course Series

Mark Seiss, Dept. of Statistics

Presentation Outline

1. Introduction to Generalized Linear Models

2. Binary Response Data - Logistic Regression Model

3. Count Response Data - Poisson Regression Model

Reference Material

Categorical Data Analysis – Alan Agresti

Examples found with SAS Code at www.stat.ufl.edu/~aa/cda/cda.html

Presentation and Data from Examples

www.stat.vt.edu/consult/short_courses.html

Generalized Linear Models

• Generalized linear models (GLM) extend ordinary regression to non-normal response distributions.

• 3 Components

• Random – identifies the response Y and its probability distribution

• Systematic – explanatory variables in a linear predictor function (Xβ)

• Link function – function g(.) that links the mean of the response (E[Yi] = μi) to the systematic component

• Model

$$ g(\mu_i) = \sum_j \beta_j x_{ij}, \quad i = 1, \dots, n $$

Generalized Linear Models

• Why do we use GLMs?

• Linear regression assumes that the response is normally distributed

• GLMs allow us to analyze the linear relationship between predictor variables and a function of the mean of the response variable when it is not reasonable to assume the data are normally distributed.

Generalized Linear Models

• Predictor Variables

• Two Types: Continuous and Categorical

• Continuous Predictor Variables

• Examples – Time, Grade Point Average, Test Score, etc.

• Coded with one parameter – βixi

• Categorical Predictor Variables

• Examples – Sex, Political Affiliation, Marital Status, etc.

• Actual value assigned to a category is not important

• Ex) Sex - Male/Female, M/F, 1/2, 0/1, etc.

• Coded differently than continuous variables

Generalized Linear Models

• Categorical Predictor Variables cont.

• Consider a categorical predictor variable with L categories

• One category selected as reference category

• Assignment of reference category is arbitrary

• Variable represented by L-1 dummy variables

• Needed for model identifiability

• Two types of coding – Dummy and Effect

Generalized Linear Models

• Categorical Predictor Variables cont.

• Dummy Coding (Used in R)

• xk = 1 if predictor variable is equal to category k, 0 otherwise

• xk = 0 for all k if predictor variable equals the reference category L

• Effect Coding (Used in JMP)

• xk = 1 if predictor variable is equal to category k, 0 otherwise

• xk = -1 for all k if predictor variable equals the reference category L
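As a rough illustration in R, the two coding schemes can be inspected with the built-in contrast functions. The four-level factor `party` is hypothetical and not part of the course data. Note that R's default dummy coding (`contr.treatment`) uses the first level as the reference, while `contr.sum` assigns -1 to the last level.

```r
# Hypothetical 4-level categorical predictor (not from the course data)
party <- factor(c("A", "B", "C", "D"))

# Dummy (treatment) coding: the reference level is 0 in every column
dummy <- contr.treatment(levels(party))

# Effect (sum-to-zero) coding: the last level is -1 in every column
effect <- contr.sum(levels(party))

print(dummy)
print(effect)
```

Both matrices have L-1 = 3 columns, one per dummy variable, matching the count given on the previous slide.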

Generalized Linear Models

• Saturated Model

• Contains a separate indicator parameter for each observation

• Perfect fit μ = y

• Not useful since there is no data reduction, i.e. number of parameters equals number of observations.

• Maximum achievable log likelihood – baseline for comparison to other model fits

Generalized Linear Models

• Deviance

• Let L(μ|y) = maximum of the log likelihood for the model

L(y|y) = maximum of the log likelihood for the saturated model

• Deviance = D(y| μ) = -2 [L(μ|y) - L(y|y) ]

• Likelihood Ratio Statistic for testing the null hypothesis that the model is a good alternative to the saturated model

• Likelihood ratio statistic has an asymptotic chi-squared distribution with N – p degrees of freedom, where p is the number of parameters in the model.

• Allows for the comparison of one model to another using the likelihood ratio test.
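The deviance definition can be checked numerically. The sketch below uses simulated Poisson data (an assumption made for illustration, not the course data) and compares the deviance computed from the two log likelihoods with the value R reports.

```r
set.seed(1)                       # simulated data, for illustration only
x <- rnorm(50)
y <- rpois(50, lambda = exp(0.5 + 0.3 * x))

fit <- glm(y ~ x, family = poisson(link = "log"))

# Maximized log likelihood of the fitted model
ll_model <- as.numeric(logLik(fit))

# Log likelihood of the saturated model: each fitted mean equals y_i
ll_saturated <- sum(dpois(y, lambda = y, log = TRUE))

# Deviance = -2 [ log likelihood of model - log likelihood of saturated model ]
dev_by_hand <- -2 * (ll_model - ll_saturated)

print(c(by_hand = dev_by_hand, from_glm = deviance(fit)))
```

The two printed values should agree up to numerical precision.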

Generalized Linear Models

• Nested Models

• Model 1 - model with p predictor variables {X1, X2, X3,….,Xp} and vector of fitted values μ1

• Model 2 - model with q<p predictor variables {X1, X2, X3,….,Xq} and vector of fitted values μ2

• Model 2 is nested within Model 1 if all predictor variables found in Model 2 are included in Model 1.

• i.e. the set of predictor variables in Model 2 are a subset of the set of predictor variables in Model 1

• Model 2 is a special case of Model 1 - all the coefficients associated with Xq+1, Xq+2, Xq+3,….,Xp are equal to zero

$$ g(\mu) = \beta_0 + \beta_1 X_1 + \dots + \beta_q X_q + 0 \cdot X_{q+1} + 0 \cdot X_{q+2} + \dots + 0 \cdot X_p $$

Generalized Linear Models

• Likelihood Ratio Test

• Null Hypothesis: There is not a significant difference between the fit of two models.

• Null Hypothesis for Nested Models: The predictor variables in Model 1 that are not found in Model 2 are not significant to the model fit.

• Alternate Hypothesis for Nested Models: The predictor variables in Model 1 that are not found in Model 2 are significant to the model fit.

• Likelihood Ratio Statistic = -2 [L(y,μ2) - L(y,μ1)] = D(y,μ2) - D(y,μ1)

• Difference of the deviances of the two models

• D(y,μ2) ≥ D(y,μ1) always, which implies LRT ≥ 0

• Under the null hypothesis, the LRT is asymptotically distributed Chi-Squared with p-q degrees of freedom

Generalized Linear Models

• Likelihood Ratio Test cont.

• Later, we will use the Likelihood Ratio Test to test the significance of variables in Logistic and Poisson regression models.

Generalized Linear Models

• Theoretical Example of Likelihood Ratio Test

• 3 predictor variables – 1 Continuous (X1), 1 Categorical with 4 Categories (X2, X3, X4), 1 Categorical with 2 Categories (X5)

• Model 1 - predictor variables {X1, X2, X3, X4, X5}

• Model 2 - predictor variables {X1, X5}

• Null Hypothesis – Variable with 4 categories is not significant to the model (β2 = β3 = β4 = 0)

• Alternate Hypothesis - Variable with 4 categories is significant (at least one of β2, β3, β4 is nonzero)

• Likelihood Ratio Statistic = D(y,μ2) - D(y, μ1)

• Difference of the deviance statistics from the two models

• Chi-Squared Distribution with 5-2=3 degrees of freedom
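A minimal R sketch of this test, assuming simulated data with the same structure as the theoretical example (one continuous predictor, one 4-category factor, one 2-category factor). The variable names `x1`, `f4`, `f2` and the simulated response are hypothetical.

```r
set.seed(2)                                     # simulated data, for illustration only
n <- 200
d <- data.frame(
  x1 = rnorm(n),                                # continuous predictor (X1)
  f4 = factor(sample(1:4, n, replace = TRUE)),  # 4 categories -> 3 dummies (X2, X3, X4)
  f2 = factor(sample(1:2, n, replace = TRUE))   # 2 categories -> 1 dummy (X5)
)
d$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * d$x1))

model1 <- glm(y ~ x1 + f4 + f2, family = binomial, data = d)  # Model 1
model2 <- glm(y ~ x1 + f2, family = binomial, data = d)       # Model 2 (nested)

# Likelihood ratio statistic: difference of the two deviances
lrt <- deviance(model2) - deviance(model1)

# Degrees of freedom: number of dummy variables dropped from Model 1
df_diff <- df.residual(model2) - df.residual(model1)

p_value <- pchisq(lrt, df = df_diff, lower.tail = FALSE)
print(c(LRT = lrt, df = df_diff, p = p_value))
```

The same comparison is available through `anova(model2, model1, test = "Chisq")`.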

Generalized Linear Models

• Model Selection

• 2 Goals: Complex enough to fit the data well; simple to interpret, does not overfit the data

• Study the effect of each predictor on the response Y

• Continuous Predictor – Graph P[Y=1] versus X

• Discrete Predictor - Contingency Table of P[Y=1] versus categories of X

• Unbalanced Data – Few responses of one type

• Guideline – at least 10 outcomes of each type for each X term

• Example – Y=1 for only 30 observations out of 1000; model should contain no more than 3 X terms

Generalized Linear Models

• Model Selection cont.

• Multicollinearity

• Correlations among predictors inflate the variance of the coefficient estimates

• Larger standard errors make a variable appear less significant

• More likely when several predictor variables are used in the model

• Determining Model Fit

• Other criteria besides significance tests (e.g. Likelihood Ratio Test) can be used to select a model

Generalized Linear Models

• Model Selection cont.

• Determining Model Fit cont.

• Akaike Information Criterion (AIC)

– Penalizes model for having many parameters

– AIC = -2 Log L + 2*p where p is the number of parameters in model (equal to Deviance + 2*p up to a constant that is the same for every model fit to the same data)

• Bayesian Information Criterion (BIC)

– BIC = -2 Log L + ln(n)*p where p is the number of parameters in model and n is the number of observations
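Both criteria can be computed by hand from a fitted model and compared with R's `AIC()` and `BIC()`. The sketch below assumes simulated Poisson data purely for illustration.

```r
set.seed(3)                         # simulated data, for illustration only
x <- rnorm(100)
y <- rpois(100, lambda = exp(1 + 0.2 * x))
fit <- glm(y ~ x, family = poisson)

ll <- as.numeric(logLik(fit))       # maximized log likelihood
p  <- length(coef(fit))             # number of parameters (intercept + slope)
n  <- nobs(fit)                     # number of observations

aic_by_hand <- -2 * ll + 2 * p
bic_by_hand <- -2 * ll + log(n) * p

print(c(aic_by_hand = aic_by_hand, AIC = AIC(fit),
        bic_by_hand = bic_by_hand, BIC = BIC(fit)))
```

Smaller values indicate a better trade-off between fit and number of parameters; BIC penalizes each parameter more heavily than AIC once ln(n) exceeds 2.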

Generalized Linear Models

• Model Selection cont.

• Selection Algorithms

• Best subset – Tests all combinations of predictor variables to find best subset

• Algorithmic – Forward, Backward and Stepwise Procedures

Generalized Linear Models

• Best Subsets Procedure

• Run model with all possible combinations of the predictor variables

• Number of possible models equal to 2^p where p is the number of predictor variables

• Dummy Variables for categorical predictors considered together

• Ex) For a set of predictors {X1, X2, X3}

• runs models with sets of predictors {X1, X2, X3}, {X1, X2}, {X2, X3}, {X1, X3}, {X1}, {X2}, {X3}, and no predictor variables.

• 2^3 = 8 possible models

• Most programs only allow for a small set of predictor variables

• Larger sets cannot be run in a reasonable amount of time

• 2^10 = 1024 models run for a set of 10 predictor variables
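The enumeration behind best subsets can be sketched in a few lines of R. The response name `y` and the predictor names are placeholders; the code only builds the candidate model formulas and does not fit them.

```r
predictors <- c("X1", "X2", "X3")

# One row per subset: TRUE means the predictor is included
include <- expand.grid(rep(list(c(FALSE, TRUE)), length(predictors)))

# Build the right-hand side of each model formula;
# the empty subset is the intercept-only model
rhs <- apply(include, 1, function(row) {
  chosen <- predictors[row]
  if (length(chosen) == 0) "1" else paste(chosen, collapse = " + ")
})
formulas <- paste("y ~", rhs)

length(formulas)   # 2^3 = 8
print(formulas)
```

Each formula could then be passed to `glm()` and the fits ranked by a criterion such as AIC, which is where the cost grows as 2^p.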

Generalized Linear Models

• Forward Selection

• Idea: Start with no variables in the model and add one at a time

• Step One: Fit a model for each predictor variable on its own and determine fit

• Step Two: Select predictor variable with best fit and add to model

• Step Three: Add each remaining variable to the model one at a time and determine fit

• Step Four: If at least one variable produces better fit, return to step two. If no variables produce better fit, use the current model

• Drawback: Variables Added to the model cannot be taken out.

Generalized Linear Models

• Backward Selection

• Idea: Start with all variables in the model and take out one at a time

• Step One: Fit all predictor variables in model and determine fit

• Step Two: Delete one variable at a time and determine fit

• Step Three: If the deletion of at least one variable produces better fit, remove the variable that produces the best fit when deleted and return to step two. If the deletion of a variable does not produce a better fit, use the current model

• Drawback: Variables taken out of model cannot be added back in.

Generalized Linear Models

• Stepwise Selection

• Idea: Combination of forward and backward selection

• Forward step then backward step

• Step One: Fit each predictor variable as a single predictor variable and determine fit

• Step Two: Select variable that produces best fit and add to model.

• Step Three: Add each predictor variable one at a time to the model and determine fit

• Step Four: Select variable that produces best fit and add to the model

• Step Five: Delete each variable in the model one at a time and determine fit

• Step Six: Remove variable that produces best fit when deleted

• Step Seven: Return to Step Two

• Loop until no variables added or deleted improve the fit.
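One way to try these procedures in R is the `step()` function from the base `stats` package. This is a sketch on simulated data, not the procedure used in the slides: `step()` ranks candidate models by AIC rather than by entry and exit significance probabilities, and the variable names are hypothetical.

```r
set.seed(4)                                   # simulated data, for illustration only
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, size = 1, prob = plogis(-0.3 + 1.0 * d$x1))

# Start with no predictors in the model
null_model <- glm(y ~ 1, family = binomial, data = d)

# direction = "both" alternates forward and backward steps;
# "forward" or "backward" give the one-directional procedures
selected <- step(null_model,
                 scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
                 direction = "both",
                 trace = 0)

print(formula(selected))
```

Setting `trace = 1` prints each step, which makes the add and delete decisions visible.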

Generalized Linear Models

• Summary

• 3 Components of the GLM

• Random (Y)

• Link Function (g(E[Y]))

• Systematic (x^T β)

• Continuous and Categorical Predictor Variables

• Coding Categorical Variables – Effect and Dummy Coding

• Likelihood Ratio Test for Nested Models

• Test the significance of a predictor variable or set of predictor variables in the model.

• Model Selection – Best Subset, Forward, Backward, Stepwise

Generalized Linear Models

• Questions/Comments

Logistic Regression

• Consider a binary response variable.

• Variable with two outcomes

• One outcome represented by a 1 and the other represented by a 0

• Examples:

Does the person have a disease? Yes or No

Who is the person voting for? McCain or Obama

Outcome of a baseball game? Win or loss

Logistic Regression

• Logistic Regression Example Data Set

• Response Variable –> Admission to Grad School (Admit)

• 0 if admitted, 1 if not admitted

• Predictor Variables

• GRE Score (gre) – Continuous

• University Prestige (topnotch) – 1 if prestigious, 0 otherwise

• Grade Point Average (gpa) – Continuous

Logistic Regression

• First 10 Observations of the Data Set

ADMIT GRE TOPNOTCH GPA

1 380 0 3.61

0 660 1 3.67

0 800 1 4

0 640 0 3.19

1 520 0 2.93

0 760 0 3

0 560 0 2.98

1 400 0 3.08

0 540 0 3.39

1 700 1 3.92

Logistic Regression

• Consider the linear probability model

$$ \pi(x_i) = P(Y_i = 0 \mid x_i) = x_i \beta $$

where yi = response for observation i

xi = 1x(p+1) matrix of covariates for observation i

p = number of covariates

• GLM with binomial random component and identity link g(μ) = μ

• Issue: xiβ, and therefore π(xi), can take on values less than 0 or greater than 1

• Issue: Predicted probabilities for some subjects fall outside of the [0,1] range.

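Logistic Regression

• Consider the logistic regression model

$$ \pi(x_i) = P(Y_i = 0 \mid x_i) = \frac{\exp(x_i \beta)}{1 + \exp(x_i \beta)} $$

$$ \operatorname{logit}(\pi(x_i)) = \log\left(\frac{\pi(x_i)}{1 - \pi(x_i)}\right) = x_i \beta $$

• GLM with binomial random component and logit link g(μ) = logit(μ)

• Range of values for π(xi) is 0 to 1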
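The logistic transformation can be checked numerically in R, where `plogis()` is the inverse logit and `qlogis()` is the logit. The values of the linear predictor below are arbitrary and chosen only for illustration.

```r
eta <- c(-10, -1, 0, 1, 10)            # arbitrary values of the linear predictor

p <- plogis(eta)                       # inverse logit
by_hand <- exp(eta) / (1 + exp(eta))   # same quantity written out

print(p)                               # every value lies strictly between 0 and 1
print(qlogis(p))                       # log(p / (1 - p)) recovers eta
```

However large or small the linear predictor becomes, the predicted probability stays inside (0, 1), which is the property the linear probability model lacks.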

Logistic Regression

• Consider the logistic regression model

$$ \operatorname{logit}(\pi(x_i)) = \beta_0 + \beta \cdot \text{gpa}_i $$

And the linear probability model

$$ \pi(x_i) = \beta_0 + \beta \cdot \text{gpa}_i $$

Then the graph of the predicted probabilities for different grade point averages:

[Figure: predicted probabilities versus GPA for the logistic regression model and the linear probability model]

Important Note: JMP models P(Y=0) and effect coding is used for categorical variables

Logistic Regression

• Interpretation of Coefficient β – Odds Ratio

• The odds ratio is a statistic that compares the odds of one event to the odds of another event.

• Say the probability of Event 1 is π1 and the probability of Event 2 is π2. Then the odds ratio of Event 1 to Event 2 is:

$$ \text{Odds Ratio} = \frac{\text{Odds}(\pi_1)}{\text{Odds}(\pi_2)} = \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)} $$

• Values of the odds ratio range from 0 to infinity

• Values between 0 and 1 indicate the odds of Event 2 are greater

• Values between 1 and infinity indicate the odds of Event 1 are greater

• A value equal to 1 indicates the events are equally likely

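Logistic Regression

• Interpretation of Coefficient β – Odds Ratio cont.

• Link to Logistic Regression:

$$ \log(\text{Odds Ratio}) = \log\left(\frac{\pi_1}{1 - \pi_1}\right) - \log\left(\frac{\pi_2}{1 - \pi_2}\right) = \operatorname{logit}(\pi_1) - \operatorname{logit}(\pi_2) $$

• Thus the odds ratio between two events is

$$ \text{Odds Ratio} = \exp\{\operatorname{logit}(\pi_1) - \operatorname{logit}(\pi_2)\} $$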

Logistic Regression

• Interpretation of Coefficient β – Odds Ratio cont.

• Consider Event 1 is Y=0 given X+1 and Event 2 is Y=0 given X

• From our logistic regression model

$$ \log(\text{Odds Ratio}) = \operatorname{logit}(P(Y=0 \mid X+1)) - \operatorname{logit}(P(Y=0 \mid X)) = (\beta_0 + \beta (X+1)) - (\beta_0 + \beta X) = \beta $$

• Thus the ratio of the odds of Y=0 at X+1 to the odds of Y=0 at X is

$$ \text{Odds Ratio} = \exp(\beta) $$

Logistic Regression

• Single Continuous Predictor Variable - GPA

Generalized Linear Model Fit

Response: Admit

Modeling P(Admit=0)

Distribution: Binomial

Link: Logit

Observations (or Sum Wgts) = 400

Whole Model Test

Model -LogLikelihood L-R ChiSquare DF Prob>ChiSq

Difference 6.50444839 13.0089 1 0.0003

Full 243.48381

Reduced 249.988259

Goodness Of Fit Statistic ChiSquare DF Prob>ChiSq

Pearson 401.1706 398 0.4460

Deviance 486.9676 398 0.0015

Logistic Regression

• Single Continuous Predictor Variable – GPA cont.

Effect Tests

Source DF L-R ChiSquare Prob>ChiSq

GPA 1 13.008897 0.0003

Parameter Estimates

Term Estimate Std Error L-R ChiSquare Prob>ChiSq Lower CL Upper CL

Intercept -4.357587 1.0353175 19.117873 <.0001 -6.433355 -2.367383

GPA 1.0511087 0.2988695 13.008897 0.0003 0.4742176 1.6479411

Interpretation of the Parameter Estimate:

Exp{1.0511087} = 2.86 = ratio of the odds at x+1 to the odds at x, for all x

The ratio of the odds of being admitted for a person with a 3.0 GPA to the odds for a person with a 2.0 GPA is 2.86; equivalently, the odds for the person with the 3.0 are 2.86 times the odds for the person with the 2.0.
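This interpretation can be reproduced from the reported estimates. The sketch below plugs the intercept and GPA coefficient from the JMP output into the logistic model; the helper function `odds()` is a hypothetical name introduced for illustration.

```r
b0 <- -4.357587     # intercept estimate from the output above
b1 <-  1.0511087    # GPA estimate from the output above

# Odds of the modeled event at a given GPA
odds <- function(gpa) {
  p <- plogis(b0 + b1 * gpa)
  p / (1 - p)
}

ratio <- odds(3.0) / odds(2.0)   # odds ratio for a one-point increase in GPA
print(ratio)
print(exp(b1))                   # the same quantity, about 2.86
```

Because the ratio equals exp(b1) for any pair of GPAs one point apart, the starting GPA of 2.0 is arbitrary.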

Logistic Regression

• Single Categorical Predictor Variable – Top Notch

Generalized Linear Model Fit

Response: Admit

Modeling P(Admit=0)

Distribution: Binomial

Link: Logit

Observations (or Sum Wgts) = 400

Whole Model Test

Model -LogLikelihood L-R ChiSquare DF Prob>ChiSq

Difference 3.53984692 7.0797 1 0.0078

Full 246.448412

Reduced 249.988259

Goodness Of Fit Statistic ChiSquare DF Prob>ChiSq

Pearson 400.0000 398 0.4624

Deviance 492.8968 398 0.0008

Logistic Regression

• Single Categorical Predictor Variable – Top Notch cont.

Effect Tests

Source DF L-R ChiSquare Prob>ChiSq

TOPNOTCH 1 7.0796939 0.0078

Parameter Estimates

Term Estimate Std Error L-R ChiSquare Prob>ChiSq Lower CL Upper CL

Intercept -0.525855 0.138217 14.446085 0.0001 -0.799265 -0.255667

TOPNOTCH[0] -0.371705 0.138217 7.0796938 0.0078 -0.642635 -0.099011

Interpretation of the Parameter Estimate:

Exp{2*-.371705} = 0.4755 = ratio of the odds of admittance for a student from a less prestigious university to the odds of admittance for a student from a more prestigious university.

The odds of being admitted from a less prestigious university are .48 times the odds of being admitted from a more prestigious university.
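The factor of 2 in the exponent follows from effect coding. A short derivation, assuming JMP codes TOPNOTCH = 0 as +1 and TOPNOTCH = 1 as -1 (the effect-coding convention described earlier), with β the TOPNOTCH[0] estimate and β0 the intercept:

```latex
\begin{aligned}
\operatorname{logit} P(\text{Admit}=0 \mid \text{TOPNOTCH}=0) &= \beta_0 + \beta, \\
\operatorname{logit} P(\text{Admit}=0 \mid \text{TOPNOTCH}=1) &= \beta_0 - \beta, \\
\log(\text{Odds Ratio}) &= (\beta_0 + \beta) - (\beta_0 - \beta) = 2\beta, \\
\text{Odds Ratio} &= \exp(2 \times -0.371705) \approx 0.4755 .
\end{aligned}
```

Under dummy coding the same comparison would be exp(β) with no factor of 2, so the coding scheme has to be known before a coefficient is converted to an odds ratio.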

Logistic Regression

• Variable Selection – Likelihood Ratio Test

• Consider the model with GPA, GRE, and Top Notch as predictor variables

Generalized Linear Model Fit

Response: Admit

Modeling P(Admit=0)

Distribution: Binomial

Link: Logit

Observations (or Sum Wgts) = 400

Whole Model Test

Model -LogLikelihood L-R ChiSquare DF Prob>ChiSq

Difference 10.9234504 21.8469 3 <.0001

Full 239.064808

Reduced 249.988259

Goodness Of Fit Statistic ChiSquare DF Prob>ChiSq

Pearson 396.9196 396 0.4775

Deviance 478.1296 396 0.0029

Logistic Regression

• Variable Selection – Likelihood Ratio Test cont.

Effect Tests

Source DF L-R ChiSquare Prob>ChiSq

TOPNOTCH 1 2.2143635 0.1367

GPA 1 4.2909753 0.0383

GRE 1 5.4555484 0.0195

Parameter Estimates

Term Estimate Std Error L-R ChiSquare Prob>ChiSq Lower CL Upper CL

Intercept -4.382202 1.1352224 15.917859 <.0001 -6.657167 -2.197805

TOPNOTCH[0] -0.218612 0.1459266 2.2143635 0.1367 -0.503583 0.070142

GPA 0.6675556 0.3252593 4.2909753 0.0383 0.0356956 1.3133755

GRE 0.0024768 0.0010702 5.4555484 0.0195 0.0003962 0.0046006

Logistic Regression

• Model Selection – Forward

Stepwise Fit

Response: Admit

Stepwise Regression Control

Prob to Enter 0.250

Prob to Leave 0.100

Direction:

Rules:

Current Estimates

-LogLikelihood RSquare

239.06481 0.0437

Logistic Regression

• Model Selection – Forward cont.

Parameter Estimate nDF Wald/Score ChiSq "Sig Prob"

Intercept[1] -4.3821986 1 0 1.0000

GRE 0.00247683 1 5.356022 0.0207

GPA 0.66755511 1 4.212258 0.0401

TOPNOTCH{1-0} 0.21861181 1 2.244286 0.1341

Step History

Step Parameter Action L-R ChiSquare "Sig Prob" RSquare p

1 GRE Entered 13.92038 0.0002 0.0278 2

2 GPA Entered 5.712157 0.0168 0.0393 3

3 TOPNOTCH{1-0} Entered 2.214363 0.1367 0.0437 4

Logistic Regression

• Model Selection – Backward

• Start by selecting to enter all variables into the model

Stepwise Fit

Response: Admit

Stepwise Regression Control

Prob to Enter 0.250

Prob to Leave 0.100

Direction: Backward

Rules: Combine

Logistic Regression

• Model Selection – Backward cont.

Current Estimates

-LogLikelihood RSquare

240.17199 0.0393

Parameter Estimate nDF Wald/Score ChiSq "Sig Prob"

Intercept[1] -4.9493751 1 0 1.0000

GRE 0.00269068 1 6.473978 0.0109

GPA 0.75468641 1 5.576461 0.0182

TOPNOTCH{1-0} 0 1 2.259729 0.1328

Step History

Step Parameter Action L-R ChiSquare "Sig Prob" RSquare p

1 TOPNOTCH{1-0} Removed 2.214363 0.1367 0.0393 3

Logistic Regression

• Variable Selection – Stepwise

Stepwise Fit

Response: Admit

Stepwise Regression Control

Prob to Enter 0.250

Prob to Leave 0.250

Direction: Mixed

Rules: Combine

Current Estimates

-LogLikelihood RSquare

239.06481 0.0437

Logistic Regression

• Variable Selection – Stepwise cont.

Parameter Estimate nDF Wald/Score ChiSq "Sig Prob"

Intercept[1] -4.3821986 1 0 1.0000

GRE 0.00247683 1 5.356022 0.0207

GPA 0.66755511 1 4.212258 0.0401

TOPNOTCH{1-0} 0.21861181 1 2.244286 0.1341

Step History

Step Parameter Action L-R ChiSquare "Sig Prob" Rsquare p

1 GRE Entered 13.92038 0.0002 0.0278 2

2 GPA Entered 5.712157 0.0168 0.0393 3

3 TOPNOTCH{1-0} Entered 2.214363 0.1367 0.0437 4

Logistic Regression

• Summary

• Introduction to the Logistic Regression Model

• Interpretation of the Parameter Estimates β – Odds Ratio

• Variable Significance – Likelihood Ratio Test

• Model Selection

• Forward

• Backward

• Stepwise

Logistic Regression

• Questions/Comments

Poisson Regression

• Consider a count response variable.

• Response variable is the number of occurrences in a given time frame.

• Outcomes equal to 0, 1, 2, ….

• Examples:

Number of penalties during a football game.

Number of customers who shop at a store on a given day.

Number of car accidents at an intersection.

Poisson Regression

• Poisson Regression Example Data Set

• Response Variable –> Number of Days Absent – Integer

• Predictor Variables

• Gender – 1 if Female, 2 if Male

• Ethnicity – 6 Ethnic Categories

• School – 1 if School 1, 2 if School 2

• Math Test Score – Continuous

• Language Test Score – Continuous

• Bilingual Status – 4 Bilingual Categories (coded 0 to 3)

Poisson Regression

• First 10 Observations from the Poisson Regression Example Data Set

GENDER ethnicity school.1.or.2 ctbs.math.nce ctbs.lang.nce bilingual.status number.days.absent

1 2 4 1 56.988830 42.45086 2 4

2 2 4 1 37.094160 46.82059 2 4

3 1 4 1 32.275460 43.56657 2 2

4 1 4 1 29.056720 43.56657 2 3

5 1 4 1 6.748048 27.24847 3 3

6 1 4 1 61.654280 48.41482 0 13

7 1 4 1 56.988830 40.73543 2 11

8 2 4 1 10.390490 15.35938 2 7

9 2 4 1 50.527950 52.11514 2 10

10 2 6 1 49.472050 42.45086 0 9

Poisson Regression

• Consider the model

$$ E[Y_i \mid x_i] = \mu_i = x_i \beta $$

where Yi = response for observation i

xi = 1x(p+1) matrix of covariates for observation i

p = number of covariates

μi = expected number of events given xi

• GLM with Poisson random component and identity link g(μ) = μ

• Issue: Predicted values range from -∞ to +∞, but an expected count cannot be negative

Poisson Regression

• Consider the Poisson log-linear model

$$ E[Y_i \mid x_i] = \mu_i = \exp(x_i \beta) $$

$$ \log(\mu_i) = x_i \beta $$

• GLM with Poisson random component and log link g(μ) = log(μ)

• Predicted response values fall between 0 and +∞

• In the case of a single predictor, an increase of one unit in x multiplies μ by exp(β)
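To see why the effect of a one-unit change is multiplicative, consider a single predictor with the linear predictor written as β0 + βx (the intercept symbol β0 is an assumption made here for illustration):

```latex
\frac{\mu(x+1)}{\mu(x)}
  = \frac{\exp(\beta_0 + \beta (x+1))}{\exp(\beta_0 + \beta x)}
  = \exp(\beta)
```

The ratio does not depend on x, so the same multiplicative change applies at every value of the predictor.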

Poisson Regression

• Consider the Poisson log-linear model

$$ \log(\mu_i) = \beta_0 + \beta \cdot \text{Math\_Score}_i $$

And the Poisson linear model

$$ \mu_i = \beta_0 + \beta \cdot \text{Math\_Score}_i $$

Then a graph of the predicted values from the model:

[Figure: predicted values versus math score for the Poisson log-linear model and the Poisson linear model]

Poisson Regression

• Single Continuous Predictor Variable – Math Score

> fitline<-glm(number.days.absent~ctbs.math.nce,data=poisson_data,family=poisson(link=log))

> summary(fitline)

Call:

glm(formula = number.days.absent ~ ctbs.math.nce, family = poisson(link = log), data = poisson_data)

Deviance Residuals:

Min 1Q Median 3Q Max

-4.4451 -2.5583 -1.0842 0.6647 12.4431

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.302100 0.062776 36.671 <2e-16 ***

ctbs.math.nce -0.011568 0.001294 -8.939 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Poisson Regression

• Single Continuous Predictor Variable – Math Score cont.

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 2409.8 on 315 degrees of freedom

Residual deviance: 2330.6 on 314 degrees of freedom

AIC: 3196

Number of Fisher Scoring iterations: 6

Interpretation of the parameter estimate:

Exp{-0.011568} = 0.9885 = multiplicative effect on the expected number of days absent for an increase of 1 in the Math Score

Fabricated Example – If a student is expected to miss 5 days with a math score of 50, then another student with a math score of 51 is expected to miss 5*0.9885 = 4.94 days

Poisson Regression

• Single Categorical Predictor Variable – Gender

> fitline<-glm(number.days.absent~factor(GENDER),data=poisson_data,family=poisson(link=log))

> summary(fitline)

Call:

glm(formula = number.days.absent ~ factor(GENDER), family = poisson(link = log), data = poisson_data)

Deviance Residuals:

Min 1Q Median 3Q Max

-3.660 -2.755 -1.128 0.902 9.738

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.90174 0.03036 62.644 < 2e-16 ***

factor(GENDER)2 -0.31729 0.04747 -6.684 2.32e-11 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Poisson Regression

• Single Categorical Predictor Variable – Gender cont.

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 2409.8 on 315 degrees of freedom

Residual deviance: 2364.5 on 314 degrees of freedom

AIC: 3229.9

Number of Fisher Scoring iterations: 5

Important note: factor() treats GENDER as categorical and, by default, uses dummy (treatment) coding with the first level as the reference category.

Interpretation of the parameter estimate:

Exp{-0.31729} ≈ 0.7281 = multiplicative effect on the expected number of days absent of being male rather than female

If a female student is expected to miss X days, then a male student is expected to miss 0.7281*X days.
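To make the dummy coding and the multiplicative effect concrete, here is a small R sketch. The toy `gender` vector is made up for illustration; the two estimates are copied from the summary output for the gender model.

```r
# Toy factor with the same 1/2 coding as GENDER (values are made up)
gender <- factor(c(1, 2, 2, 1))

# Default treatment (dummy) coding: level 1 is the reference, so the
# design matrix has an intercept and one indicator column for level 2
design <- model.matrix(~ gender)
print(design)

# Estimates copied from the summary output for the gender model
intercept <- 1.90174
beta_gender2 <- -0.31729

# Expected days absent for each level under the fitted model
mean_level1 <- exp(intercept)
mean_level2 <- exp(intercept + beta_gender2)

# Their ratio equals exp(beta), the multiplicative effect
rate_ratio <- mean_level2 / mean_level1
print(round(c(level1 = mean_level1, level2 = mean_level2,
              ratio = rate_ratio), 4))
```

The intercept alone gives the expected count for the reference level; the coefficient for level 2 only shifts that value on the log scale.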

Poisson Regression

• Variable Selection – Likelihood Ratio Test

• Model with all variables

> fitline<-glm(number.days.absent~factor(GENDER)+factor(school.1.or.2)+ctbs.math.nce+ctbs.lang.nce+factor(bilingual.status)+factor(ethnicity),data=poisson_data,family=poisson(link=log))

> summary(fitline)

Call:

glm(formula = number.days.absent ~ factor(GENDER) + factor(school.1.or.2) +

ctbs.math.nce + ctbs.lang.nce + factor(bilingual.status) +

factor(ethnicity), family = poisson(link = log), data = poisson_data)

Deviance Residuals:

Min 1Q Median 3Q Max

-4.5222 -2.1863 -0.9622 0.7454 10.4077

Poisson Regression

• Variable Selection – Likelihood Ratio Test

• Model with all variables (cont.)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.972325 0.424645 7.000 2.57e-12 ***

factor(GENDER)2 -0.401980 0.048954 -8.211 < 2e-16 ***

factor(school.1.or.2)2 -0.582321 0.070717 -8.235 < 2e-16 ***

ctbs.math.nce -0.001043 0.001845 -0.565 0.57181

ctbs.lang.nce -0.003048 0.002003 -1.521 0.12822

factor(bilingual.status)1 -0.344696 0.083754 -4.116 3.86e-05 ***

factor(bilingual.status)2 -0.282194 0.070846 -3.983 6.80e-05 ***

factor(bilingual.status)3 -0.053406 0.081850 -0.652 0.51409

factor(ethnicity)2 -0.131202 0.420704 -0.312 0.75515

factor(ethnicity)3 -0.434061 0.418013 -1.038 0.29909

factor(ethnicity)4 -0.326230 0.419158 -0.778 0.43639

factor(ethnicity)5 -0.876270 0.416398 -2.104 0.03534 *

factor(ethnicity)6 -1.188835 0.457470 -2.599 0.00936 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Poisson Regression

• Variable Selection – Likelihood Ratio Test

• Model with all variables (cont.)

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 2409.8 on 315 degrees of freedom

Residual deviance: 1909.2 on 303 degrees of freedom

AIC: 2796.6

Number of Fisher Scoring iterations: 6

Poisson Regression

• Variable Selection – Likelihood Ratio Test

• Model with all variables except Ethnicity

> fitline<-glm(number.days.absent~factor(GENDER)+factor(school.1.or.2)+ctbs.math.nce+ctbs.lang.nce+factor(bilingual.status),data=poisson_data,family=poisson(link=log))

> summary(fitline)

Call:

glm(formula = number.days.absent ~ factor(GENDER) + factor(school.1.or.2) + ctbs.math.nce + ctbs.lang.nce + factor(bilingual.status),

family = poisson(link = log), data = poisson_data)

Deviance Residuals:

Min 1Q Median 3Q Max

-4.6955 -2.3130 -0.9115 0.7527 11.4247

Poisson Regression

• Variable Selection – Likelihood Ratio Test

• Model with all variables except Ethnicity (cont.)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.5741133 0.0838754 30.690 < 2e-16 ***

factor(GENDER)2 -0.4212841 0.0484383 -8.697 < 2e-16 ***

factor(school.1.or.2)2 -0.8242109 0.0570241 -14.454 < 2e-16 ***

ctbs.math.nce 0.0008193 0.0018278 0.448 0.65398

ctbs.lang.nce -0.0050753 0.0019380 -2.619 0.00882 **

factor(bilingual.status)1 -0.3080131 0.0762534 -4.039 5.36e-05 ***

factor(bilingual.status)2 -0.1815997 0.0581877 -3.121 0.00180 **

factor(bilingual.status)3 0.0363656 0.0686396 0.530 0.59625

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Poisson Regression

• Variable Selection – Likelihood Ratio Test

• Model with all variables except Ethnicity

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 2409.8 on 315 degrees of freedom

Residual deviance: 1984.1 on 308 degrees of freedom

AIC: 2861.5

Number of Fisher Scoring iterations: 6

Poisson Regression

• Variable Selection – Likelihood Ratio Test

• Model 1 with all variables – Deviance (-2 Log L, up to an additive constant) = 1909.2 with df = 303

• Model 2 without Ethnicity – Deviance (-2 Log L, up to the same constant) = 1984.1 with df = 308

• Likelihood Ratio Test statistic = Deviance (Model 2) – Deviance (Model 1) = 1984.1 – 1909.2 = 74.9

• Under the null hypothesis that all ethnicity coefficients are zero, the statistic is approximately Chi-Square with 308 – 303 = 5 degrees of freedom

• P-Value < .0001

• There is strong evidence that ethnicity is a significant predictor variable.
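The test above can be reproduced from the printed deviances alone. This is a minimal R sketch; the numbers are copied from the two model summaries, and the variable names are illustrative.

```r
# Residual deviances and residual df copied from the two summaries
dev_full <- 1909.2      # model with all variables
df_full <- 303
dev_reduced <- 1984.1   # model without ethnicity
df_reduced <- 308

# Likelihood ratio statistic and its degrees of freedom
lrt_stat <- dev_reduced - dev_full
lrt_df <- df_reduced - df_full

# Upper-tail chi-square probability
p_value <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)
print(c(statistic = lrt_stat, df = lrt_df, p_value = p_value))
```

When both fitted model objects are available (say, hypothetical objects `fit_reduced` and `fit_full`), `anova(fit_reduced, fit_full, test = "Chisq")` should report the same comparison.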

Poisson Regression

• Model Selection

• Forward Selection

> fitline<-glm(number.days.absent~1,data=data1,family=poisson(link=log))

> step(fitline,scope = list(upper = ~factor(GENDER)+factor(school.1.or.2)+ctbs.math.nce+ctbs.lang.nce+factor(bilingual.status)+factor(ethnicity), lower = ~1),direction="forward")

Start: AIC=3273.22

number.days.absent ~ 1

Df Deviance AIC

+ factor(school.1.or.2) 1 2103.7 2969.1

+ factor(ethnicity) 5 2095.9 2969.3

+ ctbs.lang.nce 1 2311.7 3177.0

+ ctbs.math.nce 1 2330.6 3196.0

+ factor(bilingual.status) 3 2339.2 3208.6

+ factor(GENDER) 1 2364.5 3229.9

<none> 2409.8 3273.2

Poisson Regression

• Model Selection

• Forward Selection cont.

Step: AIC=2969.12

number.days.absent ~ factor(school.1.or.2)

Df Deviance AIC

+ factor(ethnicity) 5 2018.7 2894.1

+ factor(GENDER) 1 2029.3 2896.7

+ factor(bilingual.status) 3 2066.0 2937.4

+ ctbs.lang.nce 1 2092.7 2960.1

+ ctbs.math.nce 1 2096.7 2964.1

<none> 2103.7 2969.1

Poisson Regression

• Model Selection

• Forward Selection cont.

Step: AIC=2894.07

number.days.absent ~ factor(school.1.or.2) + factor(ethnicity)

Df Deviance AIC

+ factor(GENDER) 1 1951.3 2828.7

+ factor(bilingual.status) 3 1981.6 2863.0

+ ctbs.math.nce 1 2011.1 2888.5

+ ctbs.lang.nce 1 2012.5 2889.9

<none> 2018.7 2894.1

Step: AIC=2828.67

number.days.absent ~ factor(school.1.or.2) + factor(ethnicity) + factor(GENDER)

Df Deviance AIC

+ factor(bilingual.status) 3 1915.3 2798.8

+ ctbs.lang.nce 1 1938.5 2817.8

+ ctbs.math.nce 1 1942.3 2821.7

<none> 1951.3 2828.7

Poisson Regression

• Model Selection

• Forward Selection cont.

Step: AIC=2798.75

number.days.absent ~ factor(school.1.or.2) + factor(ethnicity) + factor(GENDER) + factor(bilingual.status)

Df Deviance AIC

+ ctbs.lang.nce 1 1909.5 2794.9

+ ctbs.math.nce 1 1911.5 2796.9

<none> 1915.3 2798.8

Step: AIC=2794.89

number.days.absent ~ factor(school.1.or.2) + factor(ethnicity) + factor(GENDER) + factor(bilingual.status) + ctbs.lang.nce

Df Deviance AIC

<none> 1909.5 2794.9

+ ctbs.math.nce 1 1909.2 2796.6
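The forward steps above rank candidate terms by AIC, not by deviance, and stop once <none> has the smallest AIC. For a fixed data set, the Poisson AIC equals the deviance plus twice the number of parameters plus a constant, so a term that costs more parameters needs a larger drop in deviance to win. The R sketch below checks this against numbers copied from the step() tables; the parameter counts are the intercept plus the Df of the added terms, and the constant itself is inferred from the printed output rather than stated in the slides.

```r
# Deviance and AIC copied from the step() output
models <- data.frame(
  model    = c("null", "school", "ethnicity", "full"),
  n_params = c(1, 2, 6, 13),   # intercept + Df of the added terms
  deviance = c(2409.8, 2103.7, 2095.9, 1909.2),
  aic      = c(3273.2, 2969.1, 2969.3, 2796.6)
)

# For a fixed data set, AIC - deviance - 2 * n_params is one constant
models$offset <- models$aic - models$deviance - 2 * models$n_params
print(models)
```

This is why ethnicity, despite a lower deviance than school at the first step, ranks second: its five extra parameters add 10 to the AIC, while school's single parameter adds 2.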

Poisson Regression

• Model Selection

• Forward Selection cont.

Call: glm(formula = number.days.absent ~ factor(school.1.or.2) + factor(ethnicity) + factor(GENDER) + factor(bilingual.status) + ctbs.lang.nce, family = poisson(link = log), data = data1)

Coefficients:

(Intercept) factor(school.1.or.2)2 factor(ethnicity)2 factor(ethnicity)3 factor(ethnicity)4

2.948689 -0.586678 -0.126806 -0.423376 -0.313360

factor(ethnicity)5 factor(ethnicity)6 factor(GENDER)2 factor(bilingual.status)1 factor(bilingual.status)2

-0.862743 -1.175574 -0.404215 -0.343907 -0.284027

factor(bilingual.status)3 ctbs.lang.nce

-0.051558 -0.003763

Degrees of Freedom: 315 Total (i.e. Null); 304 Residual

Null Deviance: 2410

Poisson Regression

• Model Selection cont.

• Backward Selection

> fitline<-glm(number.days.absent~factor(GENDER)+factor(school.1.or.2)+ctbs.math.nce+ctbs.lang.nce+factor(bilingual.status)+factor(ethnicity),data=poisson_data,family=poisson(link=log))

> backwards<-step(fitline,direction="backward")

Start: AIC=2796.57

number.days.absent ~ factor(GENDER) + factor(school.1.or.2) + ctbs.math.nce + ctbs.lang.nce + factor(bilingual.status) +

factor(ethnicity)

Df Deviance AIC

- ctbs.math.nce 1 1909.5 2794.9

<none> 1909.2 2796.6

- ctbs.lang.nce 1 1911.5 2796.9

- factor(bilingual.status) 3 1937.8 2819.2

- factor(ethnicity) 5 1984.1 2861.5

- factor(GENDER) 1 1977.8 2863.2

- factor(school.1.or.2) 1 1983.6 2869.0

Poisson Regression

• Model Selection cont.

• Backward Selection cont.

Step: AIC=2794.89

number.days.absent ~ factor(GENDER) + factor(school.1.or.2) + ctbs.lang.nce + factor(bilingual.status) + factor(ethnicity)

Df Deviance AIC

<none> 1909.5 2794.9

- ctbs.lang.nce 1 1915.3 2798.8

- factor(bilingual.status) 3 1938.5 2817.8

- factor(ethnicity) 5 1984.3 2859.7

- factor(GENDER) 1 1979.4 2862.8

- factor(school.1.or.2) 1 1986.5 2869.9

Poisson Regression

• Model Selection cont.

• Stepwise Selection

> fitline<-glm(number.days.absent~1,data=data1,family=poisson(link=log))

> step(fitline,scope = list(upper=~factor(GENDER)+factor(school.1.or.2)+ctbs.math.nce+ctbs.lang.nce+factor(bilingual.status)+factor(ethnicity), lower = ~1),direction="both")

Start: AIC=3273.22

number.days.absent ~ 1

Df Deviance AIC

+ factor(school.1.or.2) 1 2103.7 2969.1

+ factor(ethnicity) 5 2095.9 2969.3

+ ctbs.lang.nce 1 2311.7 3177.0

+ ctbs.math.nce 1 2330.6 3196.0

+ factor(bilingual.status) 3 2339.2 3208.6

+ factor(GENDER) 1 2364.5 3229.9

<none> 2409.8 3273.2

Poisson Regression

• Model Selection cont.

• Stepwise Selection cont.

Step: AIC=2969.12

number.days.absent ~ factor(school.1.or.2)

Df Deviance AIC

+ factor(ethnicity) 5 2018.7 2894.1

+ factor(GENDER) 1 2029.3 2896.7

+ factor(bilingual.status) 3 2066.0 2937.4

+ ctbs.lang.nce 1 2092.7 2960.1

+ ctbs.math.nce 1 2096.7 2964.1

<none> 2103.7 2969.1

- factor(school.1.or.2) 1 2409.8 3273.2

Poisson Regression

• Model Selection cont.

• Stepwise Selection cont.

Step: AIC=2894.07

number.days.absent ~ factor(school.1.or.2) + factor(ethnicity)

Df Deviance AIC

+ factor(GENDER) 1 1951.3 2828.7

+ factor(bilingual.status) 3 1981.6 2863.0

+ ctbs.math.nce 1 2011.1 2888.5

+ ctbs.lang.nce 1 2012.5 2889.9

<none> 2018.7 2894.1

- factor(ethnicity) 5 2103.7 2969.1

- factor(school.1.or.2) 1 2095.9 2969.3

Poisson Regression

• Model Selection cont.

• Stepwise Selection cont.

Step: AIC=2828.67

number.days.absent ~ factor(school.1.or.2) + factor(ethnicity) + factor(GENDER)

Df Deviance AIC

+ factor(bilingual.status) 3 1915.3 2798.8

+ ctbs.lang.nce 1 1938.5 2817.8

+ ctbs.math.nce 1 1942.3 2821.7

<none> 1951.3 2828.7

- factor(GENDER) 1 2018.7 2894.1

- factor(ethnicity) 5 2029.3 2896.7

- factor(school.1.or.2) 1 2050.5 2925.9

Poisson Regression

• Model Selection cont.

• Stepwise Selection cont.

Step: AIC=2798.75

number.days.absent ~ factor(school.1.or.2) + factor(ethnicity) + factor(GENDER) + factor(bilingual.status)

Df Deviance AIC

+ ctbs.lang.nce 1 1909.5 2794.9

+ ctbs.math.nce 1 1911.5 2796.9

<none> 1915.3 2798.8

- factor(bilingual.status) 3 1951.3 2828.7

- factor(GENDER) 1 1981.6 2863.0

- factor(ethnicity) 5 1993.4 2866.8

- factor(school.1.or.2) 1 2003.4 2884.8

Poisson Regression

• Model Selection cont.

• Stepwise Selection cont.

Step: AIC=2794.89

number.days.absent ~ factor(school.1.or.2) + factor(ethnicity) + factor(GENDER) + factor(bilingual.status) + ctbs.lang.nce

Df Deviance AIC

<none> 1909.5 2794.9

+ ctbs.math.nce 1 1909.2 2796.6

- ctbs.lang.nce 1 1915.3 2798.8

- factor(bilingual.status) 3 1938.5 2817.8

- factor(ethnicity) 5 1984.3 2859.7

- factor(GENDER) 1 1979.4 2862.8

- factor(school.1.or.2) 1 1986.5 2869.9

Poisson Regression

• Model Selection cont.

• Stepwise Selection cont.

Call: glm(formula = number.days.absent ~ factor(school.1.or.2) + factor(ethnicity) + factor(GENDER) + factor(bilingual.status) + ctbs.lang.nce, family = poisson(link = log), data = data1)

Coefficients:

(Intercept) factor(school.1.or.2)2 factor(ethnicity)2 factor(ethnicity)3 factor(ethnicity)4

2.948689 -0.586678 -0.126806 -0.423376 -0.313360

factor(ethnicity)5 factor(ethnicity)6 factor(GENDER)2 factor(bilingual.status)1 factor(bilingual.status)2

-0.862743 -1.175574 -0.404215 -0.343907 -0.284027

factor(bilingual.status)3 ctbs.lang.nce

-0.051558 -0.003763

Degrees of Freedom: 315 Total (i.e. Null); 304 Residual

Null Deviance: 2410

Residual Deviance: 1909 AIC: 2795

Poisson Regression

• Let's look back at the Poisson log-linear model

$\log(\mu_i) = \alpha + \beta \cdot \text{Math\_Score}_i$

• Taking the sample mean and sample standard deviation of the response for intervals of Math Score:

Math Score    Sample Mean    Sample Standard Deviation
0-20          11.66666667    10.64397095
20-40         6.453333333    6.595029523
40-60         5.270072993    7.382913152
60-80         4.324675325    5.434881392
80-100        9.666666667    14.50861813
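Since a Poisson response has variance equal to its mean, the table is easier to read once the standard deviations are squared. The R sketch below does that with the values copied from the table; the variable names are illustrative.

```r
# Interval summaries copied from the table
math_interval <- c("0-20", "20-40", "40-60", "60-80", "80-100")
sample_mean <- c(11.66666667, 6.453333333, 5.270072993,
                 4.324675325, 9.666666667)
sample_sd <- c(10.64397095, 6.595029523, 7.382913152,
               5.434881392, 14.50861813)

# Variance is the squared standard deviation
sample_var <- sample_sd^2

# Under a Poisson model this ratio should be close to 1
var_to_mean <- sample_var / sample_mean
print(data.frame(math_interval, sample_mean, sample_var, var_to_mean))
```

Every interval has a variance several times its mean, which is the pattern the overdispersion discussion refers to.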

Poisson Regression

• Overdispersion for Poisson Regression Models

• For Yi ~ Poisson(λi), E[Yi] = Var[Yi] = λi

• In the absence data, the variance of the response is much larger than the mean.

• A variance larger than the Poisson model allows is known as overdispersion

• Consequences:

  • Parameter estimates are still consistent

  • Standard errors are inconsistent (typically too small)

• Remedy: Negative Binomial model
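The consequence and the remedy can be illustrated on simulated data. This is a sketch under stated assumptions, not the course's analysis: the data are generated here from a negative binomial distribution, the names `score` and `days` are made up, and `glm.nb()` from the MASS package is used as one common way to fit a negative binomial model in R.

```r
library(MASS)  # provides glm.nb()

set.seed(2008)
n <- 316
score <- runif(n, 0, 100)

# Simulated overdispersed counts: variance = mu + mu^2 / size
mu <- exp(2 - 0.01 * score)
days <- rnbinom(n, size = 1, mu = mu)

# Poisson and negative binomial fits of the same log-linear model
fit_pois <- glm(days ~ score, family = poisson(link = log))
fit_nb <- glm.nb(days ~ score)

# Pearson dispersion estimate; values well above 1 suggest overdispersion
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) /
  df.residual(fit_pois)

# Standard errors of the slope under each model
se_pois <- sqrt(diag(vcov(fit_pois)))[["score"]]
se_nb <- sqrt(diag(vcov(fit_nb)))[["score"]]
print(c(dispersion = dispersion, se_pois = se_pois, se_nb = se_nb))
```

On overdispersed data like these, the Poisson standard error should come out noticeably smaller than the negative binomial one, which is why Poisson p-values can look more convincing than they should.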

Poisson Regression

• Summary

• Introduction to the Poisson Regression Model

• Interpretation of β

• Variable Significance – Likelihood Ratio Test

• Model Selection

  • Forward

  • Backward

  • Stepwise

• Overdispersion

Poisson Regression

• Questions/Comments