
Page 1: Introduction to Logistic Regression

Introduction to Logistic Regression

• Up to this point, we have been dealing with categorical exposure variables.

• If the exposure variable is continuous, for example, income, the results covered in previous lectures cannot be applied unless you categorize it. Keep in mind that categorizing a continuous exposure variable is problematic in some cases.

Page 2: Introduction to Logistic Regression

Introduction to Logistic Regression

• It is necessary to have methodologies to deal with a continuous exposure variable without categorization.

• Think about the research question whether or not income is associated with buying a new car.

Page 3: Introduction to Logistic Regression

Introduction to Logistic Regression

• Q: Can we study the association by modeling the probability of buying a car as a linear function of income, i.e.,

P(Y=1 | x) = a + bx,

where Y=1 if one buys a new car and 0 otherwise, and x = income.

Page 4: Introduction to Logistic Regression

Introduction to Logistic Regression

• A: No, because the right-hand side of the equation can be negative and can exceed 1, which is not allowable as probability is always between 0 and 1.

• Therefore, we are looking for a function of x that lies in (0,1).

• One such function is the logistic function, defined as:

P(Y=1 | x) = exp(α + βx) / (1 + exp(α + βx))
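A minimal numerical sketch (not part of the original slides), with hypothetical values of α and β, showing that the logistic function always stays strictly between 0 and 1:

import numpy as np

def logistic(x, alpha=-10.0, beta=0.0002):
    # logistic function: P(Y=1 | x) = exp(alpha + beta*x) / (1 + exp(alpha + beta*x))
    eta = alpha + beta * x
    return np.exp(eta) / (1.0 + np.exp(eta))

income = np.linspace(0, 100_000, 5)   # hypothetical income values
print(logistic(income))               # probabilities rise in an S-shape but never leave (0, 1)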

Page 5: Introduction to Logistic Regression

Introduction to Logistic Regression

• Note that this logistic function is S-shaped, which means that changing the exposure level does not affect the probability much if the exposure level is low or high.

• Therefore, the logistic function is suitable for describing the association of income with the probability of buying a new car because changing income does not affect the probability much if income is low or high.

Page 6: Introduction to Logistic Regression

Logistic Regression Model

• A Logistic Regression Model models the conditional probability of Y=1 given explanatory variables X1, ..., Xr as a logistic function of a linear combination of X1, ..., Xr, i.e.,

P(Y=1 | X1, ..., Xr) = exp(α + β1·X1 + ... + βr·Xr) / (1 + exp(α + β1·X1 + ... + βr·Xr))

Page 7: Introduction to Logistic Regression

Logistic Regression Model

• Equivalently, a Logistic Regression Model models the logarithm of the conditional odds of Y=1 given explanatory variables X1, ..., Xr as a linear function of X1, ..., Xr, i.e.,

Log odds(Y=1 | X1, ..., Xr) = α + β1·X1 + ... + βr·Xr
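A quick numerical check (illustrative only, with hypothetical coefficient values) that the two formulations agree: applying the log-odds transform to the probability form recovers the linear predictor:

import numpy as np

alpha, betas = -2.0, np.array([0.5, 1.2])   # hypothetical alpha, beta1, beta2
x = np.array([1.0, 3.0])                    # hypothetical values of X1, X2

eta = alpha + betas @ x                     # linear predictor alpha + beta1*X1 + beta2*X2
p = np.exp(eta) / (1 + np.exp(eta))         # probability form of the model
log_odds = np.log(p / (1 - p))              # logarithm of the odds of Y=1

print(np.isclose(log_odds, eta))            # True: the two formulations are equivalent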

Page 8: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• A Logistic Regression Model allows one to obtain a marginal (crude) odds ratio to study the association of an explanatory variable, X, with a binary response variable, Y, where X can be either a qualitative explanatory variable or a quantitative explanatory variable.

• The remainder of our discussion will be split into two parts: one for X = a qualitative explanatory variable, and one for X = a quantitative explanatory variable.

Page 9: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Q: how to set up a regression model to obtain marginal (crude) odds ratio to study the association of a qualitative explanatory variable with a binary response variable?

• Examples of qualitative explanatory variable: gender (female, male) and race (white, black, others).

• A: A qualitative explanatory variable cannot be included directly in a Logistic Regression Model. Dummy variable(s) need to be created and then included in the Logistic Regression Model. Dummy variables can be viewed as a way of quantitatively identifying the classes of a qualitative explanatory variable.

Page 10: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Example 1: qualitative explanatory variable = gender
– create a dummy variable, gdum, for gender:

gdum = 1 if a subject is female, 0 otherwise

– use the following logistic regression model to link the conditional odds of Y=1 (having lung cancer) given gdum with gdum:

Log odds(Y=1 | gdum) = α + β·gdum

Page 11: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Interpretation of exp(β): The ratio of the odds of Y=1 for gdum=1 to the odds of Y=1 for gdum=0.

• Why? Because β is the logarithm of the ratio of the odds of Y=1 for gdum=1 to the odds of Y=1 for gdum=0:

log[odds(Y=1 | gdum=1) / odds(Y=1 | gdum=0)]
= log[odds(Y=1 | gdum=1)] − log[odds(Y=1 | gdum=0)]
= (α + β·1) − (α + β·0)
= β
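A minimal sketch of fitting this model in Python with statsmodels (the slides do not prescribe software; the data frame and column names below are hypothetical). The exponentiated coefficient of gdum is the crude odds ratio for female vs. male:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: "y" is the 0/1 lung cancer outcome, "gender" is the qualitative exposure
df = pd.DataFrame({"y":      [1, 0, 0, 1, 0, 1, 0, 0],
                   "gender": ["female", "female", "male", "male",
                              "female", "male", "female", "male"]})

df["gdum"] = (df["gender"] == "female").astype(int)   # dummy: 1 if female, 0 otherwise

fit = smf.logit("y ~ gdum", data=df).fit(disp=0)      # Log odds(Y=1 | gdum) = alpha + beta*gdum
print(np.exp(fit.params["gdum"]))                     # exp(beta): crude odds ratio, female vs. male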

Page 12: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Example 2: qualitative explanatory variable = race (white, black, and hispanic)
– Choose two classes of race, for example, white and black, and then create a dummy variable for each of these two classes:

rdum1 = 1 if a subject is white, 0 otherwise
rdum2 = 1 if a subject is black, 0 otherwise

– use the following logistic regression model to link the conditional odds of Y=1 (having lung cancer) given rdum1 and rdum2 with rdum1 and rdum2:

Log odds(Y=1 | rdum1, rdum2) = α + β1·rdum1 + β2·rdum2

Page 13: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Interpretation of exp(β1): The ratio of the odds of Y=1 for rdum1=1 to the odds of Y=1 for rdum1=0, with rdum2 fixed at 0.

• Why? Because β1 is the logarithm of the ratio of the odds of Y=1 for rdum1=1 to the odds of Y=1 for rdum1=0, with rdum2 fixed at 0:

log[odds(Y=1 | rdum1=1, rdum2=0) / odds(Y=1 | rdum1=0, rdum2=0)]
= log[odds(Y=1 | rdum1=1, rdum2=0)] − log[odds(Y=1 | rdum1=0, rdum2=0)]
= (α + β1·1 + β2·0) − (α + β1·0 + β2·0)
= β1

Page 14: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Correspondence between classes of race and the values of rdum1 and rdum2

race rdum1 rdum2

white 1 0

black 0 1

hispanic 0 0
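An illustrative way (not from the slides) to reproduce exactly this coding in pandas, with hispanic as the reference class; the variable names are hypothetical:

import pandas as pd

race = pd.Series(["white", "black", "hispanic", "white", "black"])

rdum1 = (race == "white").astype(int)   # 1 if white, 0 otherwise
rdum2 = (race == "black").astype(int)   # 1 if black, 0 otherwise

# hispanic rows get 0 on both dummies, matching the correspondence table above
print(pd.DataFrame({"race": race, "rdum1": rdum1, "rdum2": rdum2}))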

Page 15: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Since rdum1=1 and rdum2=0 corresponds to the white group and rdum1=0 and rdum2=0 corresponds to the hispanic group, the interpretation of exp(β1) becomes the ratio of the odds of Y=1 for white to the odds of Y=1 for hispanic.

Page 16: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Interpretation of exp(β2): The ratio of the odds of Y=1 for rdum1=0 and rdum2=1 to the odds of Y=1 for rdum1=0 and rdum2=0.

• Why? Because β2 is the logarithm of the ratio of the odds of Y=1 for rdum1=0 and rdum2=1 to the odds of Y=1 for rdum1=0 and rdum2=0:

log[odds(Y=1 | rdum1=0, rdum2=1) / odds(Y=1 | rdum1=0, rdum2=0)]
= log[odds(Y=1 | rdum1=0, rdum2=1)] − log[odds(Y=1 | rdum1=0, rdum2=0)]
= (α + β1·0 + β2·1) − (α + β1·0 + β2·0)
= β2

Page 17: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Since rdum1=0 and rdum2=1 corresponds to the black group and rdum1=0 and rdum2=0 corresponds to the hispanic group, the interpretation of exp(β2) becomes the ratio of the odds of Y=1 for black to the odds of Y=1 for hispanic.

Page 18: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Q: Can we also create a dummy variable rdum3 for hispanic:

rdum3 = 1 if a subject is hispanic, 0 otherwise

and include it, along with the other two dummy variables rdum1 and rdum2, in a Logistic Regression Model? i.e.,

Log odds(Y=1 | rdum1, rdum2, rdum3) = α + β1·rdum1 + β2·rdum2 + β3·rdum3

• A: Definitely no. Two reasons are provided on the slides that follow.

Page 19: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Reason 1: the interpretations of exp(βj), j = 1, 2, 3, are no longer odds ratios.

• Correspondence between classes of race and values of rdum1, rdum2 and rdum3

race rdum1 rdum2 rdum3

white 1 0 0

black 0 1 0

hispanic 0 0 1

Page 20: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• The ratio of the odds of Y=1 for white to the odds of Y=1 for hispanic becomes exp(β1 − β3), because

log[odds(Y=1 | rdum1=1, rdum2=0, rdum3=0) / odds(Y=1 | rdum1=0, rdum2=0, rdum3=1)]
= log[odds(Y=1 | rdum1=1, rdum2=0, rdum3=0)] − log[odds(Y=1 | rdum1=0, rdum2=0, rdum3=1)]
= (α + β1·1 + β2·0 + β3·0) − (α + β1·0 + β2·0 + β3·1)
= β1 − β3

Page 21: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• The ratio of the odds of Y=1 for black to the odds of Y=1 for hispanic becomes exp(β2 − β3), because

log[odds(Y=1 | rdum1=0, rdum2=1, rdum3=0) / odds(Y=1 | rdum1=0, rdum2=0, rdum3=1)]
= log[odds(Y=1 | rdum1=0, rdum2=1, rdum3=0)] − log[odds(Y=1 | rdum1=0, rdum2=0, rdum3=1)]
= (α + β1·0 + β2·1 + β3·0) − (α + β1·0 + β2·0 + β3·1)
= β2 − β3

Page 22: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Reason 2: More importantly, including all three dummy variables causes an overparameterization problem, i.e., the number of parameters in the model is more than the number of quantities that can be estimated.

• The three estimable quantities are the three log odds, one corresponding to white, one to black, and one to hispanic, i.e.,

log odds(Y=1 | rdum1=1, rdum2=0, rdum3=0) = α + β1
log odds(Y=1 | rdum1=0, rdum2=1, rdum3=0) = α + β2
log odds(Y=1 | rdum1=0, rdum2=0, rdum3=1) = α + β3

Page 23: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• However, the model uses four parameters, α, β1, β2, β3, to describe these three estimable quantities: the data determine only the three sums

α + β1
α + β2
α + β3

As a result, α, β1, β2, β3 are not estimable.
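A small numerical illustration (not in the original slides) of the overparameterization: with an intercept column plus all three dummies, rdum1 + rdum2 + rdum3 equals the intercept column, so the design matrix has only rank 3 even though the model has 4 parameters:

import numpy as np

# design matrix columns: [intercept, rdum1, rdum2, rdum3]; rows: white, black, hispanic
X = np.array([[1, 1, 0, 0],    # white
              [1, 0, 1, 0],    # black
              [1, 0, 0, 1]])   # hispanic

print(X.shape[1])                 # 4 parameters (columns)
print(np.linalg.matrix_rank(X))   # rank 3: only three quantities are estimable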

Page 24: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• In general, for a qualitative explanatory variable that has s classes, s-1 dummy variables need to be created.

• Choose one class to be the reference class (for example, hispanic is used as the reference class in example 2). Create one dummy variable for each of the non-reference classes (for example, white and black are the non-reference classes in example 2), and then include those s-1 dummy variables in a Logistic Regression Model.

Page 25: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Q: how to set up a regression model to study the association of a quantitative explanatory variable with a binary response variable?

• Examples of quantitative explanatory variable: age, income.

• A: Unlike a qualitative explanatory variable, a quantitative explanatory variable can be included in the model directly.

Page 26: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Example 3: quantitative explanatory variable = age
– use the following logistic regression model to link the conditional odds of Y=1 (having lung cancer) given age with age:

Log odds(Y=1 | age) = α + β·age

Page 27: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Interpretation of exp(β): The ratio of the odds of Y=1 for age = age0+1 to the odds of Y=1 for age = age0.

• Why? Because β is the logarithm of the ratio of the odds of Y=1 for age = age0+1 to the odds of Y=1 for age = age0:

log[odds(Y=1 | age = age0+1) / odds(Y=1 | age = age0)]
= log[odds(Y=1 | age = age0+1)] − log[odds(Y=1 | age = age0)]
= (α + β·(age0+1)) − (α + β·age0)
= β
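A minimal sketch of example 3 in statsmodels (hypothetical data frame with columns y and age); exp(β) is the odds ratio per one-year increase in age:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: 0/1 lung cancer outcome and age in years
df = pd.DataFrame({"y":   [0, 0, 1, 0, 1, 1, 0, 1],
                   "age": [35, 60, 58, 40, 45, 70, 66, 55]})

fit = smf.logit("y ~ age", data=df).fit(disp=0)   # Log odds(Y=1 | age) = alpha + beta*age
print(np.exp(fit.params["age"]))                  # exp(beta): odds ratio for a one-year increase in age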

Page 28: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Q: how to set up a regression model to study the association of an ordinal explanatory variable with a binary response variable?

• Examples of ordinal explanatory variable: income level (= 0 if income < 30,000; = 1 if 30,000 ≤ income < 50,000; = 2 if income ≥ 50,000).

• A: an ordinal explanatory variable can be treated as either a qualitative explanatory variable or a quantitative explanatory variable. However, different treatments lead to different interpretations.

Page 29: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Example 4: ordinal explanatory variable = income_level (0, 1, 2) treated as a qualitative explanatory variable
– Choose two classes of income_level, for example, 1 and 2, and then create a dummy variable for each of these two classes:

incdum1 = 1 if a subject's income_level is 1, 0 otherwise
incdum2 = 1 if a subject's income_level is 2, 0 otherwise

– use the following logistic regression model to link the conditional odds of Y=1 (having lung cancer) given incdum1 and incdum2 with incdum1 and incdum2:

Log odds(Y=1 | incdum1, incdum2) = α + β1·incdum1 + β2·incdum2

Page 30: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• The interpretation of exp(β1): the ratio of the odds of Y=1 for income_level=1 to the odds of Y=1 for income_level=0.

• The interpretation of exp(β2): the ratio of the odds of Y=1 for income_level=2 to the odds of Y=1 for income_level=0.
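A sketch of example 4 using the statsmodels formula interface (hypothetical data frame with columns y and income_level coded 0/1/2); treating income_level as categorical with level 0 as the reference creates the two dummies automatically:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: income_level is ordinal with values 0, 1, 2
df = pd.DataFrame({"y":            [0, 1, 0, 1, 1, 0, 1, 0, 1, 0],
                   "income_level": [0, 0, 1, 1, 2, 2, 0, 1, 2, 2]})

# treat income_level as qualitative, with level 0 as the reference class
fit = smf.logit("y ~ C(income_level, Treatment(reference=0))", data=df).fit(disp=0)
print(np.exp(fit.params))   # the level-1 and level-2 entries are the odds ratios vs. level 0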

Page 31: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Example 5: ordinal explanatory variable = income_level treated as a quantitative explanatory variable
– use the following logistic regression model to link the conditional odds of Y=1 (having lung cancer) given income_level with income_level:

Log odds(Y=1 | income_level) = α + β·income_level

Page 32: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Interpretation of exp(β): The ratio of the odds of Y=1 for income_level=1 to the odds of Y=1 for income_level=0.

• Why? Because β is the logarithm of the ratio of the odds of Y=1 for income_level=1 to the odds of Y=1 for income_level=0:

log[odds(Y=1 | income_level=1) / odds(Y=1 | income_level=0)]
= log[odds(Y=1 | income_level=1)] − log[odds(Y=1 | income_level=0)]
= (α + β·1) − (α + β·0)
= β

Page 33: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Interpretation of exp(β): The ratio of the odds of Y=1 for income_level=2 to the odds of Y=1 for income_level=1.

• Why? Because β is the logarithm of the ratio of the odds of Y=1 for income_level=2 to the odds of Y=1 for income_level=1:

log[odds(Y=1 | income_level=2) / odds(Y=1 | income_level=1)]
= log[odds(Y=1 | income_level=2)] − log[odds(Y=1 | income_level=1)]
= (α + β·2) − (α + β·1)
= β

Page 34: Introduction to Logistic Regression

Logistic Regression Model: Marginal (crude) Odds Ratio

• Q: what is the interpretation of exp(2β)?

• A: The ratio of the odds of Y=1 for income_level=2 to the odds of Y=1 for income_level=0.

• Why? Because 2β is the logarithm of the ratio of the odds of Y=1 for income_level=2 to the odds of Y=1 for income_level=0:

log[odds(Y=1 | income_level=2) / odds(Y=1 | income_level=0)]
= log[odds(Y=1 | income_level=2)] − log[odds(Y=1 | income_level=0)]
= (α + β·2) − (α + β·0)
= 2β
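A companion sketch of example 5 with the same hypothetical data, now treating income_level as quantitative; a single slope β is estimated, and exp(β) and exp(2β) are the odds ratios for one-level and two-level increases:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# same hypothetical data as in the sketch for example 4
df = pd.DataFrame({"y":            [0, 1, 0, 1, 1, 0, 1, 0, 1, 0],
                   "income_level": [0, 0, 1, 1, 2, 2, 0, 1, 2, 2]})

# treat income_level as quantitative: it enters the model directly with a single slope
fit = smf.logit("y ~ income_level", data=df).fit(disp=0)

beta = fit.params["income_level"]
print(np.exp(beta))        # exp(beta): odds ratio for a one-level increase (1 vs. 0, or 2 vs. 1)
print(np.exp(2 * beta))    # exp(2*beta): odds ratio for income_level 2 vs. 0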

Page 35: Introduction to Logistic Regression

Logistic Regression Model: Common conditional (adjusted) Odds Ratio

• We have understood the issue of using the marginal (crude) odds ratio to assess the association of an exposure variable X with a binary variable Y in the presence of a confounding variable Z, and the need to use conditional (adjusted) odds ratios.

• Q: Let’s assume for now that the conditional odds ratios are the same and ask the question of how to use a Logistic Regression Model to estimate the Common conditional (adjusted) Odds Ratio?

Page 36: Introduction to Logistic Regression

Logistic Regression Model: Common conditional (adjusted) Odds Ratio

• A: All you need to do is add the confounding variable Z (or its dummy variables, in the case where Z is a qualitative explanatory variable) to the Logistic Regression Model that contains the exposure variable X (or its dummy variables, in the case where X is a qualitative explanatory variable).

Page 37: Introduction to Logistic Regression

Logistic Regression Model: Common conditional (adjusted) Odds Ratio

• Example 6: X = gender, Z = tea drinking:
– create one dummy variable for X and one dummy variable for Z:

gdum = 1 if a subject is female, 0 otherwise
tdum = 1 if a subject drinks tea, 0 otherwise

– and use the following Logistic Regression Model:

Log odds(Y=1 | gdum, tdum) = α + β1·gdum + β2·tdum
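A sketch of example 6 in statsmodels (hypothetical data frame with a 0/1 outcome y, gdum, and tdum); adding tdum to the model makes exp(β1) the tea-adjusted odds ratio for gender:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: outcome, gender dummy (1 = female), tea-drinking dummy (1 = drinks tea)
df = pd.DataFrame({"y":    [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
                   "gdum": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                   "tdum": [1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0]})

fit = smf.logit("y ~ gdum + tdum", data=df).fit(disp=0)   # alpha + beta1*gdum + beta2*tdum
print(np.exp(fit.params["gdum"]))   # exp(beta1): common conditional (adjusted) odds ratio for gender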

Page 38: Introduction to Logistic Regression

Logistic Regression Model: Common conditional (adjusted) Odds Ratio

• Interpretation of exp(β1): The common conditional (adjusted) odds ratio of Y and X with condition (adjustment) on Z.

• Why?
– exp(β1) is equal to the ratio of the odds of Y=1 for gdum=1 to the odds of Y=1 for gdum=0, with tdum fixed at 1:

log[odds(Y=1 | gdum=1, tdum=1) / odds(Y=1 | gdum=0, tdum=1)]
= log[odds(Y=1 | gdum=1, tdum=1)] − log[odds(Y=1 | gdum=0, tdum=1)]
= (α + β1·1 + β2·1) − (α + β1·0 + β2·1)
= β1

Page 39: Introduction to Logistic Regression

Logistic Regression Model: Common conditional (adjusted) Odds Ratio

• Why? (cont.)
– exp(β1) is also equal to the ratio of the odds of Y=1 for gdum=1 to the odds of Y=1 for gdum=0, with tdum fixed at 0:

log[odds(Y=1 | gdum=1, tdum=0) / odds(Y=1 | gdum=0, tdum=0)]
= log[odds(Y=1 | gdum=1, tdum=0)] − log[odds(Y=1 | gdum=0, tdum=0)]
= (α + β1·1 + β2·0) − (α + β1·0 + β2·0)
= β1

Page 40: Introduction to Logistic Regression

Logistic Regression Model: Test For effect modification

• Q: Now, let’s remove the assumption that the conditional odds ratios are the same and ask the question of how to use a Logistic Regression Model to test whether or not Z is an effect modifier with respect to the association of X with Y.

Page 41: Introduction to Logistic Regression

Logistic Regression Model: Test For effect modification

• A: All you need to do is to add the interaction term(s) into the logistic regression model assuming no effect modification and test the null hypothesis that all the beta coefficients of the interaction terms are equal to 0. How to create the interaction term(s) depends on which of the following four cases applies.
– Case 1: both X and Z are qualitative
– Case 2: X is qualitative and Z is quantitative
– Case 3: X is quantitative and Z is qualitative
– Case 4: both X and Z are quantitative

Page 42: Introduction to Logistic Regression

Logistic Regression Model: Test For effect modification

• Case 1: (r-1)(s-1) interaction terms need to be created:

XZdum_ij = Xdum_i · Zdum_j,  i = 1, ..., r-1,  j = 1, ..., s-1,

where Xdum_1, ..., Xdum_(r-1) are the r-1 dummy variables of X, and Zdum_1, ..., Zdum_(s-1) are the s-1 dummy variables of Z.

Page 43: Introduction to Logistic Regression

Logistic Regression Model: Test For effect modification

Example 1 of case 1: X = gender, Z = tea drinking:
– create one dummy variable for X and one dummy variable for Z:

gdum = 1 if a subject is female, 0 otherwise
tdum = 1 if a subject drinks tea, 0 otherwise

– create one interaction term:

gdumtdum = gdum · tdum

– and use the following Logistic Regression Model:

Log odds(Y=1 | gdum, tdum) = α + β1·gdum + β2·tdum + β3·gdumtdum
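A sketch of example 1 of case 1 (hypothetical data again): gdum * tdum in the formula expands to the two main effects plus the gdum:tdum interaction, and the p-value of the interaction coefficient tests the null hypothesis of no effect modification (β3 = 0):

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data as before: outcome, gender dummy, tea-drinking dummy
df = pd.DataFrame({"y":    [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
                   "gdum": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                   "tdum": [1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0]})

# gdum*tdum expands to gdum + tdum + gdum:tdum (the interaction term)
fit = smf.logit("y ~ gdum * tdum", data=df).fit(disp=0)
print(fit.pvalues["gdum:tdum"])   # Wald test of beta3 = 0, i.e., Z is not an effect modifier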

Page 44: Introduction to Logistic Regression

Logistic Regression Model: Test For effect modification

• Example 1 of case 1 (cont.)
– Fact: the null hypothesis that Z is not an effect modifier is equivalent to β3 = 0.
– Why?

• exp(β1 + β3) is equal to the ratio of the odds of Y=1 for gdum=1 and tdum=1 to the odds of Y=1 for gdum=0 and tdum=1:

log[odds(Y=1 | gdum=1, tdum=1) / odds(Y=1 | gdum=0, tdum=1)]
= log[odds(Y=1 | gdum=1, tdum=1)] − log[odds(Y=1 | gdum=0, tdum=1)]
= (α + β1·1 + β2·1 + β3·1·1) − (α + β1·0 + β2·1 + β3·0·1)
= β1 + β3

Page 45: Introduction to Logistic Regression

Logistic Regression Model: Test For effect modification

• Example 1 of case 1 (cont.)

• exp(β1) is equal to the ratio of the odds of Y=1 for gdum=1 and tdum=0 to the odds of Y=1 for gdum=0 and tdum=0:

log[odds(Y=1 | gdum=1, tdum=0) / odds(Y=1 | gdum=0, tdum=0)]
= log[odds(Y=1 | gdum=1, tdum=0)] − log[odds(Y=1 | gdum=0, tdum=0)]
= (α + β1·1 + β2·0 + β3·1·0) − (α + β1·0 + β2·0 + β3·0·0)
= β1

• Therefore, the conditional odds ratio for tdum=1, exp(β1 + β3), equals the conditional odds ratio for tdum=0, exp(β1), if and only if β3 = 0, which is why the null hypothesis of no effect modification is equivalent to β3 = 0.

Page 46: Introduction to Logistic Regression

Logistic Regression Model: Test For effect modification

Example 2 of case 1: X = gender, Z = race:
– create one dummy variable for X and two dummy variables for Z:

gdum = 1 if a subject is female, 0 otherwise
rdum1 = 1 if white, 0 otherwise
rdum2 = 1 if black, 0 otherwise

– create two interaction terms:

gdumrdum1 = gdum · rdum1
gdumrdum2 = gdum · rdum2

– and use the following Logistic Regression Model:

Log odds(Y=1 | gdum, rdum1, rdum2) = α + β1·gdum + β2·rdum1 + β3·rdum2 + β4·gdumrdum1 + β5·gdumrdum2

Page 47: Introduction to Logistic Regression

Logistic Regression Model: Test For effect modification

• Case 2: X is qualitative and Z is quantitative. r-1 interaction terms need to be created:

XdumZ_i = Xdum_i · Z,  i = 1, ..., r-1,

where Xdum_1, ..., Xdum_(r-1) are the r-1 dummy variables of X.

Page 48: Introduction to Logistic Regression

Logistic Regression Model: Test For effect modification

Example of case 2: X = race, Z = age:
– create two dummy variables for X:

rdum1 = 1 if white, 0 otherwise
rdum2 = 1 if black, 0 otherwise

– create two interaction terms:

rdum1age = rdum1 · age
rdum2age = rdum2 · age

– and use the following Logistic Regression Model:

Log odds(Y=1 | rdum1, rdum2, age) = α + β1·rdum1 + β2·rdum2 + β3·age + β4·rdum1age + β5·rdum2age

Page 49: Introduction to Logistic Regression

Logistic Regression Model: Test For effect modification

• Case 3: X is quantitative and Z is qualitative. s-1 interaction terms need to be created:

XZdum_j = X · Zdum_j,  j = 1, ..., s-1,

where Zdum_1, ..., Zdum_(s-1) are the s-1 dummy variables of Z.

Page 50: Introduction to Logistic Regression

Logistic Regression Model: Test For effect modification

Example of case 3: X = age, Z = race:
– create two dummy variables for Z:

rdum1 = 1 if white, 0 otherwise
rdum2 = 1 if black, 0 otherwise

– create two interaction terms:

agerdum1 = age · rdum1
agerdum2 = age · rdum2

– and use the following Logistic Regression Model:

Log odds(Y=1 | age, rdum1, rdum2) = α + β1·age + β2·rdum1 + β3·rdum2 + β4·agerdum1 + β5·agerdum2

Page 51: Introduction to Logistic Regression

Logistic Regression Model: Test For effect modification

• Case 4: both X and Z are quantitative. Only one interaction term needs to be created:

XZ = X · Z

• Example of case 4: X = age, Z = income
– create one interaction term:

ageincome = age · income

– and use the following Logistic Regression Model:

Log odds(Y=1 | age, income) = α + β1·age + β2·income + β3·ageincome

Page 52: Introduction to Logistic Regression

Logistic Regression Model: Test for Non-linearity

• If X is a quantitative variable, it is often desirable to check whether the relationship between the log odds and X is linear. This can be done by adding some higher-order terms of X to the model, for example:

Log odds(Y=1 | age) = α + β1·age + β2·age²
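A sketch of this check in the statsmodels formula interface (hypothetical data frame with columns y and age); I(age**2) adds the quadratic term, and its p-value indicates whether the linear log-odds assumption is adequate:

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data: 0/1 outcome and age in years
df = pd.DataFrame({"y":   [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
                   "age": [35, 60, 58, 40, 45, 70, 66, 55, 38, 48]})

# Log odds(Y=1 | age) = alpha + beta1*age + beta2*age^2
fit = smf.logit("y ~ age + I(age**2)", data=df).fit(disp=0)
print(fit.pvalues)   # a small p-value on the quadratic term suggests departure from linearity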

Page 53: Introduction to Logistic Regression

Homework Assignments

• Refer to example 2 in this lecture note. What do you think is going to change in terms of the interpretations of the exponentials of the beta coefficients if one chooses white as the reference group instead of hispanic, as in example 2? Provide your detailed arguments.