Simple Logistic Regression An introduction to PROC FREQ and PROC LOGISTIC

Upload: marcos-idell

Post on 16-Dec-2015

Introduction to Logistic Regression

Logistic Regression is used when the outcome variable of interest is categorical, rather than continuous. Examples include: death vs. no death, recovery vs. no recovery, obese vs. not obese, etc. All of the examples you will see in this class have binary outcomes, meaning there are only two possible outcomes.

Simple Logistic Regression has only one predictor variable. When that predictor is binary, you may already be familiar with the result under a different name: the model reproduces the odds ratio you would calculate from a 2x2 table.

Simple Logistic Regression: An example

Imagine you are interested in investigating whether there is a relationship between race and party identification. Race (Black or White) is the independent variable, and Party Identification (Democrat or Republican) is the dependent variable. Consider the following table:

Example from Agresti, A. Categorical Data Analysis, 2nd ed. 2002.

Race x Party Identification

         Democrat   Republican
Black       103          11
White       341         405

The odds ratio for being a Democrat, Black vs. White, is:

• OR (odds ratio) = (103/11)/(341/405) = (103 x 405)/(341 x 11) = 11.12

• The odds of being a Democrat are 11.12 times as high for Blacks as for Whites.

The odds ratio for being a Republican, Black vs. White, is:

• OR = (11/103)/(405/341) = (11 x 341)/(405 x 103) = 0.09

• The odds of being a Republican are 91% lower (1 - 0.09) for Blacks than for Whites.
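If you would like to check the hand calculation above in SAS itself, a short DATA step is enough. This is only a sketch: the variable names (black_dem, or_dem, and so on) are made up for illustration, and the results are written to the SAS log rather than to a data set.

```sas
/* Hand calculation of the two odds ratios from the 2x2 table. */
/* Variable names are illustrative only.                       */
DATA _NULL_;
  /* Cell counts copied from the table */
  black_dem = 103;  black_rep = 11;
  white_dem = 341;  white_rep = 405;

  /* Odds of Democrat (Black) divided by odds of Democrat (White) */
  or_dem = (black_dem / black_rep) / (white_dem / white_rep);

  /* Odds of Republican (Black) divided by odds of Republican (White) */
  or_rep = (black_rep / black_dem) / (white_rep / white_dem);

  /* Writes roughly 11.12 and 0.0899 to the log */
  PUT or_dem= 8.2 or_rep= 8.4;
RUN;
```

Notice that the second odds ratio is simply the reciprocal of the first.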

Odds Ratios in SAS

Copy the following code into SAS:

DATA partyid;
  INPUT race $ party $ count;
  DATALINES;
B D 103
B R 11
W D 341
W R 405
;
RUN;

PROC PRINT DATA = partyid;
RUN;

Odds Ratios with PROC FREQ

There are two ways to get Odds Ratios in SAS when there is one predictor and one outcome variable. The first is with PROC FREQ. Type the following code into SAS:

PROC FREQ DATA = partyid;
  weight count;
  TABLES race * party / chisq relrisk;
RUN;

Notes about the SAS code:

• weight is a SAS statement that tells the procedure to treat each row of the data set as the number of observations given in the variable you specify (here, “count”). When you have a table you want to enter into SAS, it is often easier to enter one row per cell with a “count” variable rather than list each subject individually. Because the table covers 860 subjects, we would have to type out 860 separate datalines if we did not use the “count” variable and the “weight count” statement.

• TABLES tells SAS to construct a table with the two specified variables (in this case, race and party).

• The chisq option requests all Chi-Square statistics.

• The relrisk option gives you estimates of the odds ratio and relative risks for the two columns.

Output from PROC FREQ

Reading the Table

• Each cell has four numbers: count, percent, row %, and column %

• There are 103 Black Democrats, which is 11.98% of the total sample.

• 90.35% of Blacks are Democrats.

• 23.20% of Democrats are Black (103 of 444). Compare this to 2.64% of Republicans who are Black (11 of 416).
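The three percentages in the Black/Democrat cell can be reproduced from the counts alone. The DATA step below is a sketch with made-up variable names; it does not read the PROC FREQ output, it only repeats the arithmetic.

```sas
/* Percentages for the Black/Democrat cell, computed by hand. */
DATA _NULL_;
  black_dem = 103;
  n_total   = 103 + 11 + 341 + 405;   /* all subjects: 860   */
  n_black   = 103 + 11;               /* row total: 114      */
  n_dem     = 103 + 341;              /* column total: 444   */

  pct_total = 100 * black_dem / n_total;  /* percent of the whole sample */
  pct_row   = 100 * black_dem / n_black;  /* row percent                 */
  pct_col   = 100 * black_dem / n_dem;    /* column percent              */

  /* Writes roughly 11.98, 90.35 and 23.20 to the log */
  PUT pct_total= 6.2 pct_row= 6.2 pct_col= 6.2;
RUN;
```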

Interpreting Chi-Square Statistic

The Chi-Square (Χ2) test statistic tests the null hypothesis that two variables are independent versus the alternative, that they are not independent (that is, related).

Ho: race and party identification are independent

Ha: race and party identification are associated

Χ2 = 78.9082 (1 degree of freedom), p-value < 0.0001.

Reject Ho. Conclude that race and party identification are associated.
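For a 2x2 table, the Pearson Chi-Square statistic can be checked by hand with the shortcut formula n(ad - bc)^2 divided by the product of the four margin totals. The sketch below assumes that shortcut (it is not taken from the slides) and uses illustrative variable names; PROBCHI is the SAS chi-square distribution function.

```sas
/* Pearson chi-square for a 2x2 table, computed by hand. */
DATA _NULL_;
  a = 103; b = 11;     /* Black: Democrat, Republican */
  c = 341; d = 405;    /* White: Democrat, Republican */
  n = a + b + c + d;

  /* Shortcut formula for a 2x2 table (1 degree of freedom) */
  chisq = n * (a*d - b*c)**2 / ((a + b) * (c + d) * (a + c) * (b + d));

  /* Upper-tail probability of the chi-square distribution */
  p = 1 - PROBCHI(chisq, 1);

  /* chisq should come out near 78.908, matching the PROC FREQ output */
  PUT chisq= 10.4 p= PVALUE6.4;
RUN;
```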

Output of Odds Ratio

Interpreting the Odds Ratio

You can find the OR in the SAS output under “Case-Control (Odds Ratio).”

The odds ratio is 11.12 with a 95% Confidence Interval of [5.87, 21.05]. Because this C.I. does not contain 1, we know that the OR is statistically significant.

The odds of being Democratic are 11.12 times as high for Blacks as for Whites.

A note about the PROC FREQ table:

Notice the way the table is set up in SAS:

         Dem   Rep
Black    103    11
White    341   405

When calculating the OR in PROC FREQ, SAS will alphabetize the table, and this affects the OR it will calculate. SAS is calculating the odds of being a Democrat for Blacks versus Whites (or the odds of being Black for Democrats versus Republicans). If you wanted the odds of being Democratic for Whites versus Blacks, you would have to either calculate this by hand or use PROC LOGISTIC.
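Calculating by hand here just means taking the reciprocal, 1/11.12. One further possibility, which these slides do not cover, is to change the row order that PROC FREQ sees. The sketch below assumes the ORDER=DATA option of PROC FREQ, which orders levels by their first appearance in the data set; the data set name partyid_w is made up for illustration.

```sas
/* Put the White rows first, keeping D before R within each race. */
PROC SORT DATA = partyid OUT = partyid_w;
  BY DESCENDING race;
RUN;

/* ORDER=DATA: table levels follow the order of the sorted data, */
/* so the rows are W, B and the columns are D, R.                */
PROC FREQ DATA = partyid_w ORDER = DATA;
  weight count;
  TABLES race * party / relrisk;
RUN;
```

With the rows in this order, the odds ratio reported should be close to 0.09, the odds of being Democratic for Whites versus Blacks.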

Odds Ratio with PROC LOGISTIC

To simplify our data set, we will change our variables to have values of 1 and 0, rather than B/W and D/R. If someone is Black, s/he will have a value of “1” for the variable “race2.” Whites will have a value of “0.” If someone is a Democrat, s/he will have a value of “1” for “party2.” Republicans will have a value of “0.” Type the following code into SAS, which creates a new data set called “partyid2”:

DATA partyid2;
  SET partyid;
  if race = "B" then race2 = 1;
  else race2 = 0;
  if party = "D" then party2 = 1;
  else party2 = 0;
RUN;

PROC LOGISTIC

Once you have created the new data set, do regression analysis on the data, using PROC LOGISTIC (notice the format is similar to that of linear regression, with the model statement y = x):

PROC LOGISTIC descending data = partyid2;
  weight count;
  MODEL party2 = race2;
RUN;

• “Descending” tells SAS to model the probability that “party2” = 1 (Democratic). If you did not include the descending option, SAS would model the probability that “party2” = 0 (Republican). All subsequent interpretations will be in terms of the odds of being Democratic, not Republican.
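As an alternative to “descending,” you can name the modeled outcome directly. The sketch below is not from the original slides: it assumes the EVENT= response option of the MODEL statement, and it uses a FREQ statement so that each row is counted as “count” subjects. The coefficient estimates should agree with the code above.

```sas
/* Same model, with the modeled event named explicitly. */
PROC LOGISTIC DATA = partyid2;
  FREQ count;                          /* each row stands for `count` subjects */
  MODEL party2(EVENT = '1') = race2;   /* model Pr(party2 = 1), i.e. Democrat  */
RUN;
```

Naming the event in the MODEL statement makes it harder to misread which outcome the odds refer to.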

PROC LOGISTIC Output

Interpreting the Output

From PROC LOGISTIC, we now have an equation for our log(odds) of being a Democrat:

Log(odds) = β0 + β1x

Log(odds) = -0.1720 + 2.4088x

where x = 1 if the person is Black and x = 0 if the person is White.
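With a single 0/1 predictor, the two coefficients can be traced straight back to the table: the intercept is the log(odds) of being a Democrat among Whites, and the slope is the log of the odds ratio. The DATA step below is a sketch of that check, with illustrative variable names.

```sas
/* Recover the two coefficients from the cell counts. */
DATA _NULL_;
  b0 = LOG(341 / 405);                  /* log odds of Democrat for Whites (x = 0) */
  b1 = LOG((103 / 11) / (341 / 405));   /* log odds ratio, Black vs. White         */

  /* Writes roughly -0.1720 and 2.4088 to the log */
  PUT b0= 8.4 b1= 8.4;
RUN;
```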

Calculating the Odds Ratio

Suppose we wanted to know the odds of being a Democrat for Blacks vs. Whites.

• The log(odds) of being Democratic for Blacks is:

β0 + β1(1) = β0 + β1

• The log(odds) of being Democratic for Whites is:

β0 + β1(0) = β0

• To calculate the OR, take the log(odds) for Blacks minus the log(odds) for Whites:

β0 + β1 - (β0) = β1

• Then exponentiate this value:

exp(β1) = exp(2.4088) = 11.12

This is the same OR calculated earlier using PROC FREQ.

In addition, it is given to you in the PROC LOGISTIC output under “Odds Ratio Estimates” with the 95% C.I.

Calculating the OR, cont.

Suppose we wanted to know the odds of being a Democrat for Whites vs. Blacks.

• To calculate the OR, take the log(odds) for Whites minus the log(odds) for Blacks:

β0 – (β0 + β1) = -β1

• Then exponentiate this value:

exp(-β1) = exp(-2.4088) = 0.0899

The odds of being Democratic are 91% lower (1 - 0.0899) for Whites than for Blacks.
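Both odds ratios come from the same coefficient, so they can be checked together. This sketch simply exponentiates the estimate reported above; the variable names are illustrative.

```sas
/* Odds ratios in both directions from the slope estimate. */
DATA _NULL_;
  b1 = 2.4088;                  /* slope from the PROC LOGISTIC output */

  or_black_vs_white = EXP(b1);  /* about 11.12  */
  or_white_vs_black = EXP(-b1); /* about 0.0899 */

  PUT or_black_vs_white= 8.2 or_white_vs_black= 8.4;
RUN;
```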

Significance Testing

Testing the significance of a parameter estimate can be done by constructing a confidence interval around that parameter estimate.

If the C.I. for an estimate (or log(OR)) contains 0, the variable is not significantly associated with the outcome.

If the C.I. for an OR contains 1, the variable is not significantly associated with the outcome.

The Wald Chi-Square statistic tests whether the parameter estimate equals zero, that is Ho: β1 = 0 vs. Ha: β1 ≠ 0.

From the output, we see that the p-value of this test is < 0.0001, so we reject Ho and conclude that race is significantly related to party identification.
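The Wald Chi-Square is the squared ratio of the estimate to its standard error. The sketch below repeats that arithmetic using the rounded estimate (2.4088) and standard error (0.3256) quoted in these slides, so the result may differ slightly in the last digits from what SAS prints.

```sas
/* Wald chi-square from the estimate and its standard error. */
DATA _NULL_;
  b1 = 2.4088;               /* slope estimate          */
  se = 0.3256;               /* its standard error      */

  wald = (b1 / se)**2;       /* 1 degree of freedom     */
  p    = 1 - PROBCHI(wald, 1);

  PUT wald= 8.2 p= PVALUE6.4;
RUN;
```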

Confidence Interval Construction

Confidence interval construction is similar to what you have seen for linear regression, except that it is now on the natural log scale:

95% C.I. for β1 = β1 +/- 1.96*se(β1)

= 2.4088 +/- 1.96*(0.3256)

= [1.77, 3.05]. This C.I. does not contain 0.

Exponentiating the (unrounded) limits gives the 95% C.I. for the OR: exp([1.77, 3.05]) = [5.875, 21.052]. This C.I. does not contain 1.

Notice that [5.875, 21.052] is also the 95% C.I. for the OR given in the SAS output.
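The interval can be rebuilt step by step from the estimate and its standard error. This is a sketch using the rounded values quoted above, so the limits may differ from the SAS output in the third decimal place.

```sas
/* 95% confidence interval for the slope and for the odds ratio. */
DATA _NULL_;
  b1 = 2.4088;
  se = 0.3256;

  /* Interval on the log-odds scale */
  lower = b1 - 1.96 * se;
  upper = b1 + 1.96 * se;

  /* Exponentiate the limits to get the interval for the OR */
  or_lower = EXP(lower);
  or_upper = EXP(upper);

  PUT lower= 8.4 upper= 8.4 or_lower= 8.3 or_upper= 8.3;
RUN;
```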

Calculating the Probability

If you were asked to calculate the probability that someone is a Democrat, given that he is Black, you would use the following formula:

Π(probability) = exp(log(odds))/[1+ exp(log(odds))]

Π = exp(-0.1720+2.4088)/[1+ exp(-0.1720+2.4088)] = 0.9035

A Black person has a 90.35% chance of being a Democrat.
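The same formula works for either group once you plug in x = 1 or x = 0. The DATA step below is a sketch with illustrative variable names; the probability for Whites is not worked out in the slides and is included only for comparison.

```sas
/* Predicted probability of being a Democrat for each race. */
DATA _NULL_;
  b0 = -0.1720;
  b1 = 2.4088;

  logit_black = b0 + b1 * 1;   /* x = 1 */
  logit_white = b0 + b1 * 0;   /* x = 0 */

  p_black = EXP(logit_black) / (1 + EXP(logit_black));  /* about 0.9035 */
  p_white = EXP(logit_white) / (1 + EXP(logit_white));  /* about 0.457  */

  PUT p_black= 8.4 p_white= 8.4;
RUN;
```

These agree with the row percentages in the PROC FREQ table: 103 of 114 Blacks and 341 of 746 Whites are Democrats.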

Summary

This has been an introduction to calculating odds ratios in PROC FREQ and PROC LOGISTIC. The next section will introduce you to multiple predictors in logistic regression, including interactions.