Logistic Regression: Dichotomous Dependent Variables
March 21 & 23, 2011

Upload: franklin-mosley

Post on 26-Dec-2015

Objectives

By the end of this meeting, participants should be able to:
a) Explain why OLS regression is inappropriate for dichotomous dependent variables.
b) List the assumptions of a logistic regression model.
c) Estimate a logistic regression model in R.
d) Interpret the results of a logistic regression model.

The Problems with OLS and Binary Dependent Variables

• With a dichotomous (zero/one or similar) dependent variable, the assumptions of least squares regression (OLS) are violated.
• OLS assumes a linear relationship between the dependent variable and the independent variable, which cannot be true with only two categories for the dependent variable (more of a conceptual than a technical issue).

The Problems with OLS and Binary Dependent Variables

• Least squares regression (OLS) assumes normally distributed error terms, which cannot be true with a dichotomous dependent variable (not that difficult a problem).
• OLS assumes that the variances of the error terms are the same, which cannot be true for dichotomous dependent variables. This makes hypothesis testing very difficult (major problem).
• More intuitively, regression with dichotomous dependent variables will frequently predict values outside of the actual range of the dependent variable.
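The last two problems can be seen in a small simulation. This is only a sketch on made-up data (the names `x`, `y`, and `lpm` are illustrative and not from the class data set): it fits OLS to a 0/1 outcome and looks at the fitted values and at the variance of the outcome.

```r
# Simulated data: the true probability of y = 1 follows a logistic curve in x.
set.seed(1)
x <- seq(-4, 4, length.out = 200)
p <- plogis(2 * x)
y <- rbinom(length(x), size = 1, prob = p)

# OLS on a 0/1 outcome (often called a linear probability model).
lpm <- lm(y ~ x)

# Fitted values are meant to be probabilities, yet they leave the 0-1 range.
fitted_range <- range(fitted(lpm))
fitted_range

# The variance of a 0/1 outcome is p * (1 - p), so it changes with x
# instead of staying constant as OLS assumes.
error_variance <- p * (1 - p)
range(error_variance)
```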

A Brief Return to OLS

a) OLS is computed with a simple single formula
   • A straight linear model
   • Fit a line to various points
   • The fit will not be perfect but least squares will compute the smallest possible sum of squared differences
b) That method cannot work for dichotomous dependent variables
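To make "a simple single formula" concrete, here is a minimal sketch of the closed-form least squares solution, compared with `lm()`. It uses R's built-in `mtcars` data purely as an illustration; the choice of `mpg` and `wt` is an assumption and has nothing to do with the class data.

```r
# Built-in data set; mpg regressed on weight is used purely as an illustration.
X <- cbind(1, mtcars$wt)          # column of ones for the intercept, then wt
y <- mtcars$mpg

# Closed-form least squares solution: solve (X'X) b = X'y for b.
b_formula <- drop(solve(t(X) %*% X, t(X) %*% y))

# The same coefficients from lm().
b_lm <- unname(coef(lm(mpg ~ wt, data = mtcars)))

rbind(b_formula, b_lm)
```

No iteration is needed here: one matrix calculation gives the answer, which is what does not carry over to a dichotomous dependent variable.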

Logistic Regression and Maximum Likelihood

a) The way to compute regression with a dichotomous dependent variable is through a procedure known as maximum likelihood.
b) Maximum likelihood is an iterative process based on probability theory that needs the use of a computer.

Logistic Regression and Maximum Likelihood

c) Instead of fitting a single line, maximum likelihood is a guided trial and error process in which a set of coefficients is chosen and a likelihood function is computed.
d) From that initial function, different likelihood functions are computed to try to get closer and closer to the true probability of the dependent variable.
e) When a function can get no closer to the dependent variable's probability, the process ends.
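The guided trial and error process can be sketched by hand. The example below is an illustration on simulated data, not how `glm()` works internally (`glm()` uses iteratively reweighted least squares); the function name `neg_loglik` and the starting values are assumptions. A general-purpose optimizer tries coefficient values, evaluates the log-likelihood, and stops when it can no longer improve.

```r
# Simulated data with a known intercept (-0.5) and slope (1.2).
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Negative log-likelihood of a logit model with intercept par[1] and slope par[2].
neg_loglik <- function(par) {
  eta <- par[1] + par[2] * x
  -sum(y * eta - log1p(exp(eta)))
}

# Start from (0, 0) and let the optimizer search for better coefficients.
fit_optim <- optim(c(0, 0), neg_loglik, method = "BFGS")

# The built-in estimator arrives at essentially the same coefficients.
fit_glm <- glm(y ~ x, family = binomial(link = "logit"))

rbind(optim = fit_optim$par, glm = unname(coef(fit_glm)))
```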

Logistic Regression and Maximum Likelihood

• Logistic regression is the most commonly used of all of the maximum likelihood estimation methods. The other primary method for computing models involving dichotomous dependent variables is probit. Generally, probit results are comparable to logit results.
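A quick way to see that the two methods are comparable is to fit both to the same data. This is a sketch on simulated data (all names and values are made up): the coefficients sit on different scales, but the predicted probabilities are nearly identical.

```r
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.3 + 1.0 * x))

logit_fit  <- glm(y ~ x, family = binomial(link = "logit"))
probit_fit <- glm(y ~ x, family = binomial(link = "probit"))

# Coefficients are on different scales (logit ones are typically larger) ...
rbind(logit = coef(logit_fit), probit = coef(probit_fit))

# ... but the predicted probabilities are nearly the same.
prob_cor <- cor(fitted(logit_fit), fitted(probit_fit))
prob_cor
```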

Assumptions of Logit

• It does not assume a linear relationship between the dependent variable and the predictors (the relationship is linear in the log of the odds instead).
• The dependent variable needs to be binary and coded in a meaningful way. It is standard to code the category of interest as the higher value (for example: voter, Democrat, etc.).
• All the relevant variables need to be included in the model.
• The variables in the model need to be relevant.
• Error terms need to be independent.
• The predicting variables should be measured with little error.

Assumptions of Logit

• Independent variables should not be highly correlated with each other (multicollinearity). This problem will likely present as high standard errors.
• There should be no major outliers on the independent variables.
• Samples need to be relatively large (such as 10 cases per predictor). If the samples are too small, standard errors can be very large and in some cases very large coefficients will also occur.
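The multicollinearity symptom can be reproduced on purpose. In this sketch (simulated data; `x2` is deliberately built as a near copy of `x1`, and all names are illustrative), adding the redundant predictor inflates the standard error on `x1`. The last lines show a rough check against the 10-cases-per-predictor rule of thumb.

```r
set.seed(7)
n <- 400
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)    # x2 is almost a copy of x1
y  <- rbinom(n, size = 1, prob = plogis(0.8 * x1))

predictor_cor <- cor(x1, x2)      # very close to 1
predictor_cor

fit_alone     <- glm(y ~ x1, family = binomial)
fit_collinear <- glm(y ~ x1 + x2, family = binomial)

# Standard error on x1 without and with the redundant predictor.
se_alone     <- summary(fit_alone)$coefficients["x1", "Std. Error"]
se_collinear <- summary(fit_collinear)$coefficients["x1", "Std. Error"]
c(alone = se_alone, collinear = se_collinear)

# Rough sample-size check: cases available per predictor in the larger model.
cases_per_predictor <- n / 2
cases_per_predictor
```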

The Logit Formula

a) At first glance, the formula for logit is not all that much different than that for OLS:
   $y_i^* = a + b_1 x_{1i} + b_2 x_{2i} + e_i$
b) Where Y is the dependent variable, X is the independent variable(s), and e is the error term.
   • It is important to note that Y is a probability rather than a strict value
   • Also key: Y* is a transformation of Y, so you are modeling probability indirectly. (Technical: Y* is log of odds.)
c) The b values can be thought of like in OLS but their intuition is somewhat different.
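The transformation from Y to Y* can be worked through in a few lines. This is a minimal sketch with made-up probabilities; `qlogis()` and `plogis()` are the base R functions for the log of the odds and its inverse.

```r
p <- c(0.10, 0.50, 0.80)

odds     <- p / (1 - p)     # odds of the outcome occurring
log_odds <- log(odds)       # this is y*, the quantity on the left of the formula

# Base R has the same transformation built in as qlogis(), and its inverse
# as plogis(), which maps any value of y* back to a probability.
all.equal(log_odds, qlogis(p))
all.equal(plogis(log_odds), p)

# However large or small y* gets, the implied probability stays between 0 and 1.
bounded <- plogis(c(-10, 0, 10))
bounded
```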

The Logit Formula

d) a can be thought of as a shift parameter; it shifts the curve to the left or the right.
   • a < 0 shifts the curve to the right
   • a > 0 shifts the curve to the left
e) The value of $b_1$ can be thought of as the stretch parameter; it stretches the curve or shrinks it.
f) The sign of $b_1$ can be thought of as the direction parameter; it determines the direction of the curve.
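The shift, stretch, and direction roles can be checked numerically. The sketch below assumes a single predictor and a positive b for the shift examples (with a negative b the left/right direction of the shift reverses); the helper names `logit_curve`, `midpoint`, and `slope_at_midpoint` are illustrative.

```r
# Probability curve implied by y* = a + b * x (one predictor, for illustration).
logit_curve <- function(x, a, b) plogis(a + b * x)

# The curve crosses p = 0.5 where a + b * x = 0, that is at x = -a / b.
midpoint <- function(a, b) -a / b

midpoint(a = -2, b = 1)   # a < 0: midpoint at x = 2, curve shifted right
midpoint(a =  2, b = 1)   # a > 0: midpoint at x = -2, curve shifted left

# A larger |b| makes the curve steeper: its slope at the midpoint is b / 4.
slope_at_midpoint <- function(b) b / 4

# The sign of b sets the direction: rising for b > 0, falling for b < 0.
x <- c(-1, 0, 1)
rising  <- logit_curve(x, a = 0, b =  1.5)
falling <- logit_curve(x, a = 0, b = -1.5)
rbind(rising, falling)
```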

A Diagram of a Logit Function: y vs. y*

[Figure: a logit function, y plotted against y*]

Interpreting Logit Regression Results

a) Logit coefficients cannot be interpreted in the same way as in OLS. Since the relationship described is not a linear one, it cannot be said that a unit change in the independent variable leads to <blank> change in the dependent variable.
b) Logit coefficients can tell you the direction of the relationship between the dependent and independent variable, whether it is statistically significant, and give you a general sense of the magnitude.

Interpreting Logit Regression Results

c) Statistical significance in logit is based on whether the effect of the independent variable on the dependent variable is statistically different from zero. (Similar to OLS.)
d) Since logit coefficients lack the direct interpretation of OLS coefficients, many people prefer to use odds ratios instead.
   • Odds ratios show the effect of the independent variable on the odds of the dependent variable occurring.
   • Values greater than 1 mean that the predictor makes the dependent variable more likely to occur.
   • Values less than 1 mean that the predictor makes the dependent variable less likely to occur.
   • Example: Kentucky odds of winning pre & post Kansas

Interpreting Logit Regression Results

e) Coefficients in logit are the effect of the predictor on the log of the odds (for the dependent variable).
f) Odds ratios remove the log component of the coefficient and compute the effect of the predictor on the odds of the dependent variable occurring.
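A short worked example may help connect coefficients, odds ratios, and probabilities. The coefficient value below is made up and the helper names are illustrative. It shows that exponentiating a coefficient gives a constant multiplier on the odds, while the resulting change in probability depends on the starting point.

```r
# Helper functions (names are illustrative).
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

b <- 0.7                       # a made-up logit coefficient
odds_ratio <- exp(b)           # removing the log: effect on the odds

# The same odds ratio moves the probability by different amounts
# depending on where you start.
start_p <- c(0.10, 0.50, 0.90)
end_p   <- odds_to_prob(prob_to_odds(start_p) * odds_ratio)
change  <- end_p - start_p
round(cbind(start_p, end_p, change), 3)
```

This is why the slides say a unit change cannot be read as a fixed change in the dependent variable: the odds change by a fixed factor, the probability does not change by a fixed amount.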

Goodness of Fit: R²-Like Measures

a) Unlike linear regression, there is no intuitive equivalent of the R² statistic for logit models.
b) The desire to create a comparable measure has led to the creation of a variety of so-called pseudo R² measures.
c) In general, the findings are that these pseudo R² measures perform poorly, so R doesn't even report them.
d) If you do calculate pseudo R² values, you can report them as a general sense of the fit, but they do not have the same direct interpretation as in OLS.
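If a pseudo R² is wanted anyway, one common choice is McFadden's version, which compares the log-likelihood of the model with that of an intercept-only model. The slides do not say which measure they have in mind, so treat this as one assumed option, sketched on simulated data.

```r
set.seed(3)
n <- 300
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.2 + 1.0 * x))

fit  <- glm(y ~ x, family = binomial(link = "logit"))
null <- glm(y ~ 1, family = binomial(link = "logit"))   # intercept only

# McFadden's pseudo R-squared: 1 - logLik(model) / logLik(null model).
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
mcfadden
```

For a 0/1 dependent variable the same number can be read off the fitted object as `1 - fit$deviance / fit$null.deviance`.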

Example: Interpreting Logit Results

            Coef.   Std. Error   z       p>|z|   Odds Ratio
Bush         1.49   0.17          8.82   0.00    4.44
Party       -0.18   0.16         -1.09   0.27    0.84
Ideology     0.12   0.12          1.00   0.32    1.13
Constant    -3.80   0.76         -5.03   0.00    --

a) Data from the 2003 Carolina poll (N=423)
b) The dependent variable is a measure of whether the person thinks the country was heading on the right track (0) or the wrong track (1)
c) The predictors are:
   • Evaluation of Bush: (1) Excellent - (4) Poor
   • Party: (1) Democrat (2) Independent (3) Republican
   • Ideology: (1) Very Liberal - (5) Very Conservative
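The Odds Ratio column can be reproduced from the Coef. column, which is a useful check when reading a table like this. The sketch below copies the coefficients from the table; the respondent profile at the end is a hypothetical case chosen for illustration.

```r
# Coefficients copied from the table above.
coefs <- c(Bush = 1.49, Party = -0.18, Ideology = 0.12)
constant <- -3.80

# Exponentiating reproduces the Odds Ratio column (up to rounding).
odds_ratios <- round(exp(coefs), 2)
odds_ratios

# Percent change in the odds of answering "wrong track" per one-unit increase.
round((exp(coefs) - 1) * 100)

# Predicted probability for one hypothetical respondent:
# rates Bush "Poor" (4), Democrat (1), Very Liberal (1).
profile <- c(Bush = 4, Party = 1, Ideology = 1)
predicted_prob <- plogis(constant + sum(coefs * profile))
predicted_prob
```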

Example: Interpreting Logit Results

a) What do the results tell us about the relationship between evaluations of Bush and whether or not a person thinks the country is on the wrong track? How sure are we of this result?
b) What do the results tell us about the relationship between partisanship and whether or not a person thinks the country is on the wrong track? How sure are we of this result?
c) What do the results tell us about the relationship between ideology and whether or not a person thinks the country is on the wrong track? How sure are we of this result?

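Example in R

# read.spss() comes from the foreign package
library(foreign)
# Load the class data set (SPSS format) as a data frame with numeric codes
ps.are<-read.spss('http://j.mp/classdata',
  use.value.labels=FALSE,to.data.frame=TRUE)
# Dependent variable: 1 if po_4 equals 1, 0 otherwise
ps.are$voted<-as.numeric(ps.are$po_4==1)
# Partisan strength: absolute distance of po_party from 4
ps.are$strength<-abs(ps.are$po_party-4)
# Logit model of voting on income and partisan strength
logit.model<-glm(voted~dm_income+strength,
  data=ps.are, family=binomial(link="logit"))
# Exponentiate the coefficients to get odds ratios
odds.ratios<-exp(logit.model$coefficients)
summary(logit.model)
odds.ratios
# Percent change in the odds for a one-unit increase in each predictor
(odds.ratios-1)*100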
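Odds ratios can be supplemented with confidence intervals and predicted probabilities. Because the class data may not be reachable, the sketch below uses simulated stand-in data with the same variable names; the value ranges and effect sizes are invented, so the numbers say nothing about the real survey.

```r
# Stand-in data with the same variable names as the example above.
# The values and effect sizes are invented for illustration only.
set.seed(4)
n <- 600
ps.sim <- data.frame(
  dm_income = sample(1:9, n, replace = TRUE),
  strength  = sample(0:3, n, replace = TRUE)
)
ps.sim$voted <- rbinom(n, size = 1,
  prob = plogis(-1 + 0.15 * ps.sim$dm_income + 0.5 * ps.sim$strength))

sim.model <- glm(voted ~ dm_income + strength,
                 data = ps.sim, family = binomial(link = "logit"))

# Odds ratios with 95% Wald confidence intervals.
exp(cbind(odds.ratio = coef(sim.model), confint.default(sim.model)))

# Predicted probability of voting at each level of partisan strength,
# holding income at its median.
new.cases <- data.frame(dm_income = median(ps.sim$dm_income), strength = 0:3)
predicted <- predict(sim.model, newdata = new.cases, type = "response")
cbind(new.cases, predicted)
```

The same `predict(..., type = "response")` call should work on `logit.model` once the class data is loaded.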

Example: Interpreting Logit Results

a) What do the results tell us about the relationship between partisan strength and whether or not a person voted in 2004? How sure are we of this result?
b) What do the results tell us about the relationship between income and whether or not a person voted in 2004? How sure are we of this result?

What all this means for your papers…

• If your dependent variable is continuous or has a range longer than 4, use OLS regression. It is the simplest and the findings are most intuitive. Even when some of the assumptions are violated, OLS tends to be a very robust method.
• If your dependent variable is dichotomous, use logit. It is the simplest and most common of the maximum likelihood estimation methods.

What all this means for your papers…

• If your dependent variable is short ordered (i.e. less than 4 but more than 2 categories), try to reduce the number of categories to 2 or increase them to 4 or more
   • Drop DK/NA responses
   • Drop middle categories (unless they are the categories of interest)
   • Combine multiple categories
   • Split the data into two parts and run separate analyses
   • Create a scale to increase the range of the dependent variable
• Other circumstances: ordered logit (beyond the scope of this course).
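Reducing a short ordered variable to two categories usually takes only a few lines. The sketch below uses a hypothetical three-category survey item; the category labels and the "DK" code are assumptions, so adjust them to match your own data.

```r
# Hypothetical three-category survey item with "don't know" responses.
answers <- c("agree", "neutral", "disagree", "DK", "agree", "disagree", NA, "neutral")

# Drop DK/NA responses by turning DK into a missing value.
answers[answers %in% "DK"] <- NA

# Option 1: drop the middle category, leaving agree (1) vs. disagree (0).
drop_middle <- ifelse(answers == "agree", 1,
                      ifelse(answers == "disagree", 0, NA))

# Option 2: combine categories, here agree (1) vs. everything else (0).
combined <- as.numeric(answers == "agree")

table(drop_middle, useNA = "ifany")
table(combined, useNA = "ifany")
```

Either recoded variable can then be used as the dependent variable in `glm()` with `family=binomial(link="logit")`; rows with missing values are dropped by default.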

For March 25

a) Turn in your preliminary data analysis (one copy per group).
b) Read WKB chapter 15.
c) Based on your reading of chapter 15, what insight did you find most relevant for your final paper? (Turn in individually.)