# Binary Logistic Regression: Background and Examples

Post on 07-Feb-2018


Binary Logistic Regression

Background and Examples, Binary Logistic Regression in R, and Communicating Results

Prepared by Allison Horst for ESM 244

Bren School of Environmental Science & Management, UCSB

Introduction, Conceptual Background and Examples

Binary logistic regression is used to understand relationships between one or more independent variables (measured or categorical) and a dichotomous dependent variable (i.e., a variable with only two possible outcomes: yes/no, pass/fail, live/die, etc.). Logistic regression is particularly useful in determining the probability of a categorical outcome occurring based on the value or category of input variable(s). Examples that may benefit from analysis by logistic regression:

Example 1. You measure patient blood pressure and heart rate, and want to understand how they influence the probability of a patient having a heart attack within the next 5 years.

Independent measurement variables: blood pressure and heart rate. Dichotomous dependent variable: heart attack/no heart attack within 5 years.

Example 2. How do gender and age influence a person's likelihood of voting in an election?

Independent categorical variables: gender and age. Dichotomous dependent variable: votes/does not vote.

Logistic regression explores the odds of an event occurring. The odds are the probability of an event occurring (P) divided by the probability of the event not occurring:

$$\text{Odds} = \frac{P}{1 - P}$$

The log odds of an event (also called the logit) are given by:

$$\ln\left(\frac{P}{1 - P}\right) = \text{logit}(P) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

Where the coefficients ($\beta$) are determined from the logistic regression model (see approach in R below), and the variables ($x_1, x_2, \dots, x_n$) are the values of the independent variables. If the independent variables are measured, the values are input directly. If the independent variables are categorical, the use of dummy variables is required.

Dummy variables (typically) assign a value of 0 or 1 to categorical input variables and dichotomous dependent variables. For example, using Example 2 above, we might code the dichotomous dependent variable as [Does Not Vote = 0, Votes = 1], and the categorical input variable (gender) as [Female = 0, Male = 1]. Thus, categorical information is coded as quantitative information using proxies (also called indicator variables).
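As a minimal sketch of dummy coding in R (the vector `gender` and its four values are made up for illustration), a categorical variable can be turned into 0/1 values by hand, or R can build the dummy column itself when it is given a factor:

```r
# Hypothetical categorical input; "Female" is listed first, so it is the
# reference level (coded 0)
gender <- factor(c("Female", "Male", "Male", "Female"),
                 levels = c("Female", "Male"))

# Manual dummy coding: Female = 0, Male = 1
gender_dummy <- as.integer(gender == "Male")
print(gender_dummy)

# model.matrix() shows the dummy column R builds from a factor automatically
design <- model.matrix(~ gender)
print(design)
```

Because R's modeling functions do this for factors, you usually do not need to create the 0/1 columns yourself; with the default contrasts, the first factor level is treated as the reference category.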

Since we have an expression for the log odds (above), we can convert (using the exponential) that equation into an expression for the odds as follows:

$$\frac{P}{1 - P} = e^{(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}$$

And, if we want to express the probability of the outcome, we can rewrite the equation as:

$$P = \frac{e^{(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}{1 + e^{(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}$$

So if we know the coefficients ($\beta_0, \beta_1, \dots, \beta_n$), and we know the values of the input variables $x_1, \dots, x_n$ (either as measured values or coded dummy variables), then we can use the equation above to find the probability of an outcome. Before moving on to how to determine the coefficients in R, let's take a look at a simple (fictional) example to see how the equations above are useful in interpreting logistic regression results.
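The three conversions above (log odds, odds, probability) can be sketched as small R functions. The function names and the coefficient values below are hypothetical, chosen only so the arithmetic is easy to follow:

```r
# Log odds (logit) from coefficients beta = c(b0, b1, ..., bn)
# and input values x = c(x1, ..., xn)
log_odds <- function(beta, x) beta[1] + sum(beta[-1] * x)

# Odds and probability from a log-odds value z
odds_from_logit <- function(z) exp(z)
prob_from_logit <- function(z) exp(z) / (1 + exp(z))

# Hypothetical coefficients and input value, for illustration only
beta <- c(-1, 0.5)
x <- 2

z <- log_odds(beta, x)      # -1 + 0.5 * 2 = 0
odds <- odds_from_logit(z)  # e^0 = 1
p <- prob_from_logit(z)     # 1 / (1 + 1) = 0.5

print(c(log_odds = z, odds = odds, probability = p))

# Base R's plogis() performs the same logit-to-probability conversion
print(all.equal(p, plogis(z)))
```

In practice `plogis()` (log odds to probability) and `qlogis()` (probability to log odds) from base R can replace the hand-written versions.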

Example: One categorical input variable, one dichotomous dependent variable.

A researcher is investigating the influence of gender (Female = 0, Male = 1) on the likelihood of a child having an attention disorder (dichotomous: yes/no for disorder/no disorder). Logistic regression yields the following:

$$\text{logit}(P) = -1.45 + 1.61 \times \text{Gender}$$

What are the odds of a female child having an attention disorder?

The log odds are found by:

$$\text{logit}(P) = -1.45 + 1.61(0) = -1.45$$

So the odds of a girl having an attention disorder are calculated by:

$$\frac{P}{1 - P} = e^{(\beta_0 + \beta_1 x_1)} = e^{-1.45} = 0.234$$

What are the odds of a male child having an attention disorder?

The log odds are found by:

$$\text{logit}(P) = -1.45 + 1.61(1) = 0.16$$

So the odds of a boy having an attention disorder are calculated by:

$$\frac{P}{1 - P} = e^{(\beta_0 + \beta_1 x_1)} = e^{0.16} = 1.17$$

How do the odds for boys and girls compare? You can find the comparison of odds two ways: by dividing the boys' odds by the girls' odds (1.17/0.234 = 5.00), or simply by taking the exponential of the coefficient in front of the gender term (here, $\beta_1$) to find the odds ratio (male:female): $e^{1.61} = 5.00$. Either way, you see that the odds of a boy having an attention disorder are ~5 times the odds for a girl.

What if you have a measured predictor variable? Consider the example below, in which you are trying to predict the odds of a rat developing a tumor when exposed to varying dosages of a chemical.

Example (based on an example described in lecture by Dr. Bruce Kendall in ESM 244): One continuous (measured) input variable, one dichotomous dependent variable.

You are investigating the influence of a chemical dosage (in ppm) on tumor formation in rats, where the dependent variable is dichotomous (no tumor/tumor). Performing logistic regression yields the following expression:

$$\text{logit}(P) = -4.34 + 0.013 \times \text{Dose}$$

What types of information can we get from this expression? A lot! In fact, you can explore the complete logistic model to determine the likelihood of tumor formation at any concentration! But let's start by considering some simple ways to think about it:

What are the odds of a rat contracting a tumor at a dosage of 250 ppm?

$$\text{logit}(P) = -4.34 + 0.013(250) = -1.09$$

$$\frac{P}{1 - P} = e^{-1.09} = 0.34$$

And the probability of a rat contracting a tumor at a dosage of 250 ppm?

Rearrange the above equation and solve for P to find P = 0.34/(1 + 0.34) = 0.25 at 250 ppm. In other words, based on this model there is a 25% chance that a rat exposed to 250 ppm of the chemical will develop a tumor.
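The same arithmetic can be checked in R. This is a small sketch using the coefficients from the tumor example; the variable names are chosen here for illustration:

```r
# Coefficients from the fitted expression in the tumor example
b0 <- -4.34   # intercept
b1 <- 0.013   # change in log odds per ppm of dose

dose <- 250   # ppm

logit_250 <- b0 + b1 * dose              # log odds at 250 ppm
odds_250 <- exp(logit_250)               # odds at 250 ppm
prob_250 <- odds_250 / (1 + odds_250)    # probability at 250 ppm

print(round(c(log_odds = logit_250, odds = odds_250, probability = prob_250), 2))
```

Rounded to two decimals, this gives log odds of -1.09, odds of 0.34, and a probability of 0.25.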

Another interesting question might be to ask: at what dosage is there a 50% likelihood that a rat will develop a tumor? We can again use the logit expression, substituting 0.5 in for P:

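$$\ln\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n = -4.34 + 0.013 \times \text{Dose}$$

$$\ln\left(\frac{0.5}{1 - 0.5}\right) = -4.34 + 0.013 \times \text{Dose}$$

$$0 = -4.34 + 0.013 \times \text{Dose}$$

$$\text{Dose} = 334 \text{ ppm}$$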

So the dose at which the likelihood of a rat developing a tumor is 50% is 334 ppm (this differs slightly from the value reported in the lecture, due to rounding).
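Solving the logit expression for dose works for any target probability, not just 50%. A minimal sketch in R, where `dose_for_prob` is a hypothetical helper name and the coefficients are the ones from the tumor example:

```r
b0 <- -4.34   # intercept from the tumor example
b1 <- 0.013   # dose coefficient from the tumor example

# Rearranging logit(P) = b0 + b1 * Dose gives Dose = (logit(P) - b0) / b1.
# qlogis() is base R's logit function: qlogis(p) = log(p / (1 - p)).
dose_for_prob <- function(p, b0, b1) (qlogis(p) - b0) / b1

dose_50 <- dose_for_prob(0.5, b0, b1)
print(round(dose_50))   # about 334 ppm

# Check: plugging the dose back in should return a probability of 0.5
print(plogis(b0 + b1 * dose_50))
```

Because logit(0.5) = 0, the 50% dose reduces to -b0/b1, which is why only the two coefficients are needed.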

Example: Multiple (>1) predictor variables and one dichotomous dependent variable.

Often, you will have multiple predictor variables (categorical, measured, or a mix!) that influence the outcome of a dichotomous dependent variable. In those cases, it can also be useful to perform logistic regression.

Let's consider another example. A researcher is studying the influence of systolic blood pressure (SBP; mm Hg) and heart rate (HR; bpm) on the likelihood of a male over the age of 85 having a heart attack (no heart attack = 0, heart attack = 1). Logistic regression yields the following:

$$\text{logit}(P) = -5.09 + 0.015 \times \text{SBP} + 0.043 \times \text{HR}$$

As above, we can easily answer questions about the odds of a heart attack for varying values of SBP and HR.

What are the odds of a heart attack occurring for a patient with a SBP of 138 and HR of 76?

$$\frac{P}{1 - P} = e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2)} = e^{(-5.09 + 0.015 \times 138 + 0.043 \times 76)} = 1.28$$

Since the odds are 1.28, the probability of a heart attack is found by rearranging the odds equation: P = 1.28/(1 + 1.28) = 0.56.
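With more than one predictor, the calculation is the same sum of coefficient-times-value terms. A short sketch in R using the coefficients from this example (the object names are illustrative):

```r
# Coefficients from the heart attack example
coefs <- c(intercept = -5.09, SBP = 0.015, HR = 0.043)

# Patient values: SBP in mm Hg, HR in bpm
patient <- c(SBP = 138, HR = 76)

# Log odds = intercept + sum of (coefficient * value) over the predictors
logit_p <- coefs[["intercept"]] + sum(coefs[c("SBP", "HR")] * patient)

odds <- exp(logit_p)          # odds of a heart attack
prob <- odds / (1 + odds)     # probability of a heart attack

print(round(c(log_odds = logit_p, odds = odds, probability = prob), 2))
```

Rounded to two decimals, the odds are 1.28 and the probability is 0.56, matching the hand calculation.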

It can also be interesting to think about the influence of the individual predictor variables on the outcome likelihood. What can we learn from the coefficients for each independent variable?

Well, we know that the coefficient for SBP is 0.015, which means that the odds ratio for that component is $e^{0.015} = 1.015$. We can interpret the value as follows: for every one-unit increase in SBP (with HR held constant), the odds of a male over the age of 85 suffering a heart attack are multiplied by 1.015. Similarly, if we consider the influence of HR, the odds ratio is $e^{0.043} = 1.044$, which we can interpret as follows: for every one-unit increase in HR (with SBP held constant), the odds of having a heart attack are multiplied by 1.044. Or, an alternate way to say this is: for every one-unit increase in heart rate, the odds of a man over 85 years old having a heart attack increase by about 4.4%.
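Odds ratios for all predictors can be computed at once by exponentiating the coefficients. A minimal sketch in R with the two coefficients from this example (the object names are illustrative); note that, to three decimals, $e^{0.043}$ rounds to 1.044:

```r
# Slope coefficients from the heart attack example (log-odds scale)
slopes <- c(SBP = 0.015, HR = 0.043)

# Odds ratio per one-unit increase in each predictor
odds_ratios <- exp(slopes)
print(round(odds_ratios, 3))

# Percent change in the odds per one-unit increase
pct_change <- (odds_ratios - 1) * 100
print(round(pct_change, 1))

# Odds ratio for a larger step, e.g. a 10-unit increase: exp(10 * coefficient)
odds_ratios_10 <- exp(10 * slopes)
print(round(odds_ratios_10, 2))
```

The last lines show why the step size matters: odds ratios multiply, so a 10-unit increase corresponds to the one-unit odds ratio raised to the 10th power, not to 10 times the one-unit change.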

Binary Logistic Regression in R

Note: You can download an Excel file with the sheep data used in the example below to follow along. In the example below, the data frame is loaded from a .csv file called Sheep.

So now you (hopefully!) have an idea of when, how, and why logistic regression can be useful. There are two major things left to learn: how to do it (how do I find those coefficients?) and how to communicate the results graphically and in text.

Luckily, once you understand the basics of logistic regression, finding the coefficients in R is pretty straightforward using the glm() function (where glm stands for generalized linear model). Here, an example will be used to demonstrate binary logistic regression in R.

Let's consider a (fictional) dataset for which male bighorn sheep weight (in lbs) is used as a predictor variable for the success of male bighorn sheep in finding a mate (no mate = 0, mate = 1). The first 5 lines of the dataset are shown below (67 sheep total):

[Table: first 5 rows of the Sheep dataset]

Of course, we want to look at our data (from a dataset called Sheep):

And then we want to do some kind of analysis on it. Since we have a dichotomous dependent variable (success/failure) and a continuous predictor variable (weight), we can use logistic regression to explore the likelihood of mating success based on bighorn sheep weight.

So we can use the glm() function as follows for logistic regression:
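A minimal sketch of such a call is given here, under stated assumptions: the column names `Weight` and `Mate` are guesses, and the data frame is simulated so that the code runs on its own. It is a stand-in for the Sheep file, not the handout's data, so its estimates will not match the handout's.

```r
# Simulated stand-in for the Sheep data (NOT the handout's data):
# 67 hypothetical males with a weight in lbs and a 0/1 mating outcome.
set.seed(244)
Weight <- round(runif(67, min = 150, max = 300))
Mate <- rbinom(67, size = 1, prob = plogis(-6 + 0.03 * Weight))
Sheep <- data.frame(Weight = Weight, Mate = Mate)

# With the real file you would read it in instead, e.g.
# Sheep <- read.csv("Sheep.csv")   # file and column names are assumptions

# Binary logistic regression: outcome ~ predictor, with the binomial family
# (its default link is the logit)
SheepLR <- glm(Mate ~ Weight, family = binomial, data = Sheep)

# Estimated intercept and slope, on the log-odds scale
print(coef(SheepLR))
```

The formula `Mate ~ Weight` reads as "model mating success as a function of weight", and `family = binomial` is what makes `glm()` fit a logistic rather than an ordinary linear regression.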

So now we have our logistic regression information stored as the variable SheepLR. What does that tell us? We can see a brief version just by c