Basic statistics: logistic regression

Paul Blanche¹
Department of Biostatistics, University of Copenhagen
13/11/2019

¹ Material based on previous (wonderful) work from Thomas A Gerds.

Sections: Motivation, Overview, One binary covariate, Two binary covariates, One continuous covariate, Mutually adjusted, Interaction, Final words


Page 1: Basic statistics: logistic regression (publicifsv.sund.ku.dk/~tag/Teaching/BasicStats/lecturenotes/lecture...)


TOPICS OF TODAY

Logistic regression, especially:
- model
- parameter interpretation
- examples
- categorical and continuous covariates
- 95% CI and test
- interaction
- fitting the model with R and a few computational "tricks"


CARPENTER ET AL. (THE LANCET, 2004)


CARPENTER ET AL.


REGRESSION

The type of the outcome variable determines which kind of model is relevant:

Quantitative (continuous) outcome
- Linear regression
- Association parameters: differences between mean values

0-1 (binary) outcome
- Logistic regression
- Association parameters: odds ratio, differences between log(odds)


CATEGORICAL EXPLANATORY VARIABLE

Group 1, . . . , K (especially binary: K=2)

Linear regression, continuous outcome Y

mean(Y|group k) - mean(Y|reference group)

E.g., the average systolic blood pressure was higher in males compared to females

Logistic regression, binary outcome

$$\text{Odds ratio} = \frac{\text{odds}(\text{group } k)}{\text{odds}(\text{reference group})}$$

E.g., the risk (or the odds²) of coronary heart disease was higher in males compared to females

² Remember: odds(p) = p/(1 − p), and "higher odds" is equivalent to "higher risk".


QUANTITATIVE (CONTINUOUS) EXPLANATORY VARIABLES

Linear regression, continuous outcome Y

Differences in mean values per unit of X:

mean(Y|x+1)-mean(Y|x)

E.g., the average systolic blood pressure increased with age

Logistic regression, binary outcome

Differences in log(odds) per unit of X:

$$\text{Odds ratio} = \frac{\text{odds}(x+1)}{\text{odds}(x)}$$

E.g., the risk (odds) of coronary heart disease increased with age


LINEARITY IN REGRESSION MODELS

For a continuous explanatory variable X, linearity means that the effect of a unit change of X on the outcome does not depend on the value of X.

Linear regression, continuous outcome Y

$$\text{mean}(Y \mid 45+1) - \text{mean}(Y \mid 45) = \text{mean}(Y \mid 46+1) - \text{mean}(Y \mid 46) = \dots = \text{mean}(Y \mid 61+1) - \text{mean}(Y \mid 61)$$

Logistic regression, binary outcome

$$\frac{\text{odds}(45+1)}{\text{odds}(45)} = \frac{\text{odds}(46+1)}{\text{odds}(46)} = \dots = \frac{\text{odds}(61+1)}{\text{odds}(61)}$$

Linearity is a model assumption which should be checked!³

³ Categorization of the continuous covariate can be useful when linearity does not hold.
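The linearity assumption in logistic regression can be illustrated numerically. The sketch below uses made-up coefficients (a = -5 and b = 0.06 are hypothetical values, not estimates from any data) and shows that the odds ratio per one-unit increase equals exp(b) at every age, whereas the corresponding difference in risk is not constant.

```r
# Hypothetical coefficients, chosen only for illustration
a <- -5      # intercept: log-odds at x = 0
b <- 0.06    # change in log-odds per unit of x

odds <- function(x) exp(a + b * x)        # odds implied by logit(p) = a + b*x
risk <- function(x) plogis(a + b * x)     # corresponding probability

ages <- 45:61
or_per_year <- odds(ages + 1) / odds(ages)   # odds ratio for a one-year increase
risk_diff   <- risk(ages + 1) - risk(ages)   # risk difference for a one-year increase

print(unique(round(or_per_year, 10)))   # one value only, equal to exp(b)
print(range(risk_diff))                 # the risk difference varies with age
```

Linearity is therefore a statement about the log(odds) scale, not about the risk scale.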


BINARY OUTCOME REGRESSION

If the outcome variable is binary:

$$Y_i = \begin{cases} 1 & \text{if } i \text{ is diseased} \\ 0 & \text{if } i \text{ is not diseased} \end{cases}$$

then linear regression

$$Y_i = \beta_0 + \beta_1 X_i$$

is not good for many reasons.

One reason is that the regression line will go below 0 and above 1.

[Figure: scatter plot of the binary outcome against age (x-axis from 20 to 80) with the fitted regression line; the y-axis runs from −25% to 125%.]

(Fitted linear model lm(Y~AGE, data=framingham))
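The problem with fitting a straight line to a 0/1 outcome can be reproduced with a few lines of R. The sketch below uses a small artificial data set (the data frame `toy` is an assumption made for illustration, it is not the Framingham data) and shows that the fitted values leave the interval from 0 to 1.

```r
# Toy data (not the Framingham data): outcome is 0 up to age 50 and 1 afterwards
toy <- data.frame(AGE = 20:80)
toy$Y <- as.numeric(toy$AGE > 50)

# Linear regression of a 0/1 outcome on age
fit_lm <- lm(Y ~ AGE, data = toy)

# Fitted "risks" at ages 20, 50 and 80
pred <- predict(fit_lm, newdata = data.frame(AGE = c(20, 50, 80)))
print(round(pred, 2))
```

The fitted value at age 20 is negative and the one at age 80 exceeds 1, so neither can be read as a probability.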


(MULTIPLE) LOGISTIC REGRESSION

We denote the probability of the event $Y_i = 1$ for a subject with explanatory variables $X_i, Z_i, \dots$ as

$$P(Y_i = 1 \mid X_i, Z_i, \dots) = p_i.$$

The idea is to use the logit function. Instead of $p_i$, which is bounded between 0 and 1, we apply linear regression to log(odds):

$$\text{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = a + b_1 Z_i + b_2 X_i + \dots$$

$\log\left(\frac{p_i}{1 - p_i}\right)$ can take both negative and positive values.

We will see that exp(b1), exp(b2), ... can be interpreted as odds ratios, and exp(a) as the odds when all explanatory variables equal 0.


Equivalently, the logistic model is:

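$$p_i = \frac{\exp(a + b_1 Z_i + b_2 X_i + \dots)}{1 + \exp(a + b_1 Z_i + b_2 X_i + \dots)}$$

where $a + b_1 Z_i + b_2 X_i + \dots$ is the linear predictor, equal to $\text{logit}(p_i)$.

[Figure: risk (y-axis, 0% to 100%) as a function of the linear predictor = logit(risk) (x-axis, −10 to 10).]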


WARM-UP EXERCISES

Complete the following table

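Columns: p_i, odds_i, logit(p_i)

Given entries (in table order; the remaining cells are to be filled in): 0.001%, -7.0, -4.5, 2.1%, 8%, 50%, 3.8, 99%, 11.5

Hint: the following functions take a vector as argument

    logit <- function(p){log(p/(1 - p))}
    expit <- function(x){exp(x)/(1 + exp(x))}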
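As a usage sketch for the two helper functions from the hint, the code below applies them to a few probabilities that are deliberately different from the exercise values. It also compares them with the base R functions `qlogis()` and `plogis()`, which compute the same transformations.

```r
logit <- function(p){log(p/(1 - p))}
expit <- function(x){exp(x)/(1 + exp(x))}

p <- c(0.2, 0.5, 0.9)        # illustrative probabilities, not the exercise values
odds <- p / (1 - p)          # odds corresponding to each probability
lo <- logit(p)               # log(odds)

print(cbind(p = p, odds = odds, logit = lo))

# Round trip: expit() undoes logit()
print(all.equal(expit(lo), p))

# Base R provides the same transformations
print(all.equal(logit(p), qlogis(p)))
print(all.equal(expit(lo), plogis(lo)))

# For very large inputs exp(x) overflows, so this expit() returns NaN,
# whereas plogis() returns 1
print(expit(1000))
print(plogis(1000))
```

For the values met in this lecture the two versions agree; `plogis()` is simply the more robust choice for extreme linear predictors.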


EXAMPLE: FRAMINGHAM STUDY

- SEX: 1 for males, 2 for females
- AGE: age (years) at baseline (45-62)
- FRW: "Framingham relative weight" (pct.) at baseline (52-222; 11 persons have missing values)
- SBP: systolic blood pressure at baseline (mmHg) (90-300)
- DBP: diastolic blood pressure at baseline (mmHg) (50-160)
- CHOL: cholesterol at baseline (mg/100ml) (96-430)
- CIG: cigarettes per day at baseline (0-60; 1 person has missing value)
- CHD: 0 if no "coronary heart disease" during follow-up, 1 if "coronary heart disease" at baseline (prevalent cases), x=2-10 if "coronary heart disease" was diagnosed at follow-up no. x


FRAMINGHAM STUDY: DATA PREPARATION

    library(data.table)
    framingham <- fread("data/Framingham.csv")
    ## remove prevalent cases
    framingham <- framingham[CHD!=1,]
    ## define factor levels/labels
    framingham[,Smoke:=factor(CIG>0,levels=c(FALSE,TRUE),
                              labels=c("No","Yes"))]
    framingham[,Sex:=factor(SEX,levels=c(1,2),
                            labels=c("Male","Female"))]
    ## define binary outcome variable
    framingham[,Y:=factor(CHD>1,levels=c(FALSE,TRUE),
                          labels=c("no CHD","CHD"))]
    framingham

            ID SEX AGE FRW SBP DBP CHOL CIG CHD Smoke    Sex      Y
       1: 1070   2  45  93 100  62  220   0   0    No Female no CHD
       2: 1081   1  48  93 108  70  340   0   0    No   Male no CHD
       3: 1123   2  45  91 160 100  171   0   0    No Female no CHD
       4: 1215   1  50 110 110  70  224   0   0    No   Male no CHD
       5: 1267   1  48  85 110  70  229  25   0   Yes   Male no CHD
      ---
    1359: 6432   1  47 113 155 105  175   5   5   Yes   Male    CHD
    1360: 6434   1  59  98 124  84  227  20   2   Yes   Male    CHD
    1361: 6437   2  55 111 108  74  231   0   0    No Female no CHD
    1362: 6440   1  49 114 110  80  218  20   0   Yes   Male no CHD
    1363: 6442   2  51  95 152  90  199   1   0   Yes Female no CHD


FRAMINGHAM OUTCOME

$$Y_i = \begin{cases} 1 & \text{subject } i \text{ develops coronary heart disease (CHD)} \\ 0 & \text{subject } i \text{ does not develop CHD} \end{cases}$$

Probability of CHD for any subject similar to subject i, with respect to variables X, V, Z, ...

$$p_i = P(Y_i = 1 \mid X_i, V_i, Z_i, \dots)$$

Corresponding odds

$$\frac{p_i}{1 - p_i} = \text{odds of CHD of subject } i$$

For example:
- $X_i$: age of subject i
- $Z_i$: gender of subject i
- $V_i$: smoking status of subject i


A BINARY EXPLANATORY VARIABLE

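$$Z_i = \begin{cases} 1 & \text{if } i \text{ is a man} \\ 0 & \text{if } i \text{ is a woman} \end{cases}$$

Simple logistic regression:

$$\log\left(\frac{p_i}{1 - p_i}\right) = a + b Z_i = \begin{cases} a & \text{females} \\ a + b & \text{males} \end{cases}$$

That means,

$$b = (a + b) - a = \log(\text{odds for males}) - \log(\text{odds for females}) = \log\left(\frac{\text{Odds for males}}{\text{Odds for females}}\right)$$

and

$$-b = a - (a + b) = \log\left(\frac{\text{Odds for females}}{\text{Odds for males}}\right)$$

Note: remember that exp(−b) = 1/exp(b).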


EXERCISE: 2 BY 2 CONTINGENCY TABLE

framingham[,table(Sex,Y)]

            no CHD CHD
    Male       479 164
    Female     616 104

- use the tools for 2x2 tables⁴
- compute the odds ratio with 95% confidence limits and corresponding p-value
- report and interpret the result in a sentence

⁴ Note: pay attention to the order of rows and columns before using any formula or software.
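One possible way to work through the exercise in base R is sketched below. It types in the four counts from the table and uses the usual Wald formulas on the log(odds ratio) scale; other 2x2 tools may use slightly different confidence limits, so treat this as one option among several. The object names (`tab`, `or_f_vs_m`, ...) are chosen here for illustration.

```r
# Counts copied from the table above (rows: Sex, columns: Y)
tab <- matrix(c(479, 164,
                616, 104),
              nrow = 2, byrow = TRUE,
              dimnames = list(Sex = c("Male", "Female"),
                              Y = c("no CHD", "CHD")))

# Odds of CHD in each sex, and odds ratio for females vs. males
odds <- tab[, "CHD"] / tab[, "no CHD"]
or_f_vs_m <- unname(odds["Female"] / odds["Male"])

# Wald 95% confidence interval, computed on the log scale
log_or <- log(or_f_vs_m)
se <- sqrt(sum(1 / tab))               # sqrt(1/a + 1/b + 1/c + 1/d)
ci <- exp(log_or + c(-1, 1) * qnorm(0.975) * se)

# Two-sided Wald p-value
p_value <- 2 * pnorm(-abs(log_or / se))

print(round(c(OR = or_f_vs_m, lower = ci[1], upper = ci[2]), 2))
print(p_value)
```

Swapping the two rows of `tab` gives the odds ratio for males vs. females, which is why the order of rows and columns matters.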


LOGISTIC REGRESSION IN R

fit1 <- glm(Y~Sex, data=framingham, family=binomial)

- Y ~ Sex tells R that Y is the outcome and Sex the explanatory variable
- data=framingham tells R where to find Y and Sex
- glm means generalized linear model
- family=binomial tells R that the outcome is binary and the logit link should be used


LOGISTIC REGRESSION IN R

    fit1 <- glm(Y~Sex,data=framingham,family=binomial)
    summary(fit1)

    Call:
    glm(formula = Y ~ Sex, family = binomial, data = framingham)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.7674  -0.7674  -0.5586  -0.5586   1.9672

    Coefficients:
                Estimate Std. Error z value    Pr(>|z|)
    (Intercept) -1.07183    0.09047 -11.847     < 2e-16 ***
    SexFemale   -0.70702    0.13937  -5.073 0.000000392 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1351.2  on 1362  degrees of freedom
    Residual deviance: 1324.9  on 1361  degrees of freedom
    AIC: 1328.9

    Number of Fisher Scoring iterations: 4

Note: pay attention to the default reference group.
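The estimates above can be checked without the individual-level data file. The sketch below is an illustration under the assumption that the 2 by 2 table of Sex and Y summarises the same 1363 subjects: it fits the same model to the aggregated counts, with Male as the reference level. The names `grouped` and `fit_grouped` are made up for this example.

```r
# Aggregated counts taken from the 2 by 2 table of Sex and Y shown earlier
grouped <- data.frame(
  Sex   = factor(c("Male", "Female"), levels = c("Male", "Female")),
  chd   = c(164, 104),   # number with CHD
  nochd = c(479, 616)    # number without CHD
)

# Binomial GLM on grouped data: the response is cbind(successes, failures)
fit_grouped <- glm(cbind(chd, nochd) ~ Sex, data = grouped, family = binomial)

# Estimates and standard errors
print(round(coef(summary(fit_grouped)), 5))

# Odds ratios with Wald 95% limits (confint.default uses estimate +/- z * SE)
print(round(exp(cbind(OR = coef(fit_grouped), confint.default(fit_grouped))), 2))
```

The coefficients and standard errors agree with the individual-level fit; the deviances and degrees of freedom differ because the grouped data set has only two rows.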


CONFIDENCE INTERVALS FOR THE ODDS RATIO

    library(Publish)
    fit1 <- glm(Y~Sex,data=framingham,family=binomial)
    publish(fit1)

    Variable  Units OddsRatio       CI.95  p-value
         Sex   Male      1.00 [1.00;1.00]        1
             Female      0.49 [0.38;0.65]  <0.0001

Note: 0.49 = exp(−0.71)

Women have a significantly lower risk of developing coronary heart disease than men (odds ratio: 0.49, 95%-CI: [0.38; 0.65], p-value <0.0001).


CHANGING THE REFERENCE LEVEL

    framingham[,sex:=relevel(Sex,"Female")]
    fit1a <- glm(Y~sex,data=framingham,family=binomial)
    publish(fit1a)

    Variable  Units OddsRatio       CI.95  p-value
         sex Female      1.00 [1.00;1.00]        1
               Male      2.03 [1.54;2.66]  <0.0001

Note: 2.03 = exp(0.71)

Men have a significantly higher risk of developing coronary heart disease than women (odds ratio: 2.03, 95%-CI: [1.54; 2.66], p-value <0.0001).
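The effect of changing the reference level can be seen side by side in a small sketch. It again uses the aggregated counts from the 2 by 2 table of Sex and Y (an assumption made so that the example runs without the data file) and base R's `relevel()`; the object names are illustrative.

```r
# Aggregated counts from the 2 by 2 table of Sex and Y
grouped <- data.frame(
  Sex   = factor(c("Male", "Female"), levels = c("Male", "Female")),
  chd   = c(164, 104),
  nochd = c(479, 616)
)

# Model with "Male" as the reference level (the default here)
fit_male_ref <- glm(cbind(chd, nochd) ~ Sex, data = grouped, family = binomial)

# Same model with "Female" as the reference level
grouped$sex <- relevel(grouped$Sex, ref = "Female")
fit_female_ref <- glm(cbind(chd, nochd) ~ sex, data = grouped, family = binomial)

b_female <- unname(coef(fit_male_ref)["SexFemale"])   # log OR, female vs. male
b_male   <- unname(coef(fit_female_ref)["sexMale"])   # log OR, male vs. female

print(c(b_female = b_female, b_male = b_male))              # same size, opposite sign
print(c(OR_female = exp(b_female), OR_male = exp(b_male)))  # reciprocals of each other
```

Both fits describe the same model; only the parametrisation, and therefore the direction of the reported odds ratio, changes.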


TWO EXPLANATORY VARIABLES:

$$Z_i = \begin{cases} 1 & \text{if } i \text{ is male} \\ 0 & \text{if } i \text{ is female} \end{cases} \quad \text{and} \quad V_i = \begin{cases} 1 & \text{if } i \text{ smokes} \\ 0 & \text{otherwise} \end{cases}$$

Data can be summarized as two 2 by 2 tables in two ways⁵

    Males (Z=1)                  Females (Z=0)
             Y = 1  Y = 0                 Y = 1  Y = 0
    V = 1      107    288        V = 1       27    192
    V = 0       57    191        V = 0       77    423

    Smokers (V = 1)              Non-smokers (V = 0)
             Y = 1  Y = 0                 Y = 1  Y = 0
    Males      107    288        Males       57    191
    Females     27    192        Females     77    423

⁵ Usually, one option is more interesting than the other. Here, the first option is more suitable to study the association between the risk of CHD and smoking.


COCHRAN-MANTEL-HAENSZEL TEST

In this way, we can study the effect of smoking adjusted for sex:

$$\text{OR}_{\text{Mantel-Haenszel}} = 1.03; \quad p > 0.05$$

and also study the effect of Sex (male vs female) adjusted for smoking:

$$\text{OR}_{\text{Mantel-Haenszel}} = 2.03; \quad p < 0.05$$

Conclusions
- there is no significant effect of smoking adjusted for sex
- there is a significant effect of sex adjusted for smoking
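The two Mantel-Haenszel odds ratios can be recomputed from the stratified 2 by 2 tables. The sketch below enters the counts by hand and implements the standard Mantel-Haenszel estimator in a small helper (`mh_or` is a name made up for this example); the p-values come from base R's `mantelhaen.test()`, whose default continuity correction may differ from the software used for the slide.

```r
# Mantel-Haenszel common odds ratio for a 2 x 2 x K array
# (rows: exposure, columns: outcome, third dimension: strata)
mh_or <- function(x) {
  n <- apply(x, 3, sum)                          # stratum sizes
  sum(x[1, 1, ] * x[2, 2, ] / n) / sum(x[1, 2, ] * x[2, 1, ] / n)
}

# Smoking (rows) by CHD (columns), stratified by sex; arrays fill column by column
smoke_by_sex <- array(
  c(107, 57, 288, 191,     # males:   CHD (smoker, non-smoker), no CHD (smoker, non-smoker)
    27, 77, 192, 423),     # females: same order
  dim = c(2, 2, 2),
  dimnames = list(Smoke = c("Yes", "No"), Y = c("CHD", "no CHD"),
                  Sex = c("Male", "Female")))

# Sex (rows) by CHD (columns), stratified by smoking
sex_by_smoke <- array(
  c(107, 27, 288, 192,     # smokers:     CHD (male, female), no CHD (male, female)
    57, 77, 191, 423),     # non-smokers: same order
  dim = c(2, 2, 2),
  dimnames = list(Sex = c("Male", "Female"), Y = c("CHD", "no CHD"),
                  Smoke = c("Yes", "No")))

or_smoke <- mh_or(smoke_by_sex)   # smokers vs. non-smokers, adjusted for sex
or_sex   <- mh_or(sex_by_smoke)   # males vs. females, adjusted for smoking
print(round(c(smoking = or_smoke, sex = or_sex), 2))

# Cochran-Mantel-Haenszel tests from base R
p_smoke <- mantelhaen.test(smoke_by_sex)$p.value
p_sex   <- mantelhaen.test(sex_by_smoke)$p.value
print(c(smoking = p_smoke, sex = p_sex))
```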


LOGISTIC REGRESSION MODEL: TWO BINARY VARIABLES

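$$\log\left(\frac{p_i}{1 - p_i}\right) = a + b_1 Z_i + b_2 V_i = \begin{cases} a & \text{Female non-smoker} \\ a + b_1 & \text{Male non-smoker} \\ a + b_2 & \text{Female smoker} \\ a + b_1 + b_2 & \text{Male smoker} \end{cases}$$

Note:

$$\begin{aligned} b_1 &= (a + b_1) - a && \text{(non-smoker)} \\ &= (a + b_1 + b_2) - (a + b_2) && \text{(smoker)} \\ &= \log \text{OR} && \text{(males vs. females for given smoking status)} \end{aligned}$$

and

$$\begin{aligned} b_2 &= (a + b_2) - a && \text{(female)} \\ &= (a + b_1 + b_2) - (a + b_1) && \text{(male)} \\ &= \log \text{OR} && \text{(smokers vs. non-smokers for given gender)} \end{aligned}$$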


LOGISTIC REGRESSION RESULTS

    fit2=glm(Y~Sex+Smoke,data=framingham,family=binomial)
    summary(fit2)

    Call:
    glm(formula = Y ~ Sex + Smoke, family = binomial, data = framingham)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.7716  -0.7607  -0.5564  -0.5564   1.9708

    Coefficients:
                Estimate Std. Error z value   Pr(>|z|)
    (Intercept) -1.09215    0.12717  -8.588    < 2e-16 ***
    SexFemale   -0.69521    0.14635  -4.750 0.00000203 ***
    SmokeYes     0.03296    0.14457   0.228       0.82
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1350.8  on 1361  degrees of freedom
    Residual deviance: 1324.5  on 1359  degrees of freedom
      (1 observation deleted due to missingness)
    AIC: 1330.5

    Number of Fisher Scoring iterations: 4


EXTRACTING ODDS RATIOS WITH CONFIDENCE INTERVALS

    publish(fit2)

    Variable  Units Missing OddsRatio       CI.95  p-value
         Sex   Male       0       Ref
             Female              0.50 [0.37;0.66]  <0.0001
       Smoke     No       1       Ref
                Yes              1.03 [0.78;1.37]   0.8196

Logistic regression adjusted for smoking status showed a 50% decrease in the odds of CHD in women compared to men (odds ratio: 0.50, 95%-CI: [0.37; 0.66], i.e. a decrease of between 34% and 63%; p<0.0001).

Exercise: Based on this fitted model, compute the risk of CHD for a non-smoking woman, a non-smoking man, a smoking woman and a smoking man.
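One way to check an answer to this exercise is sketched below. Since the data file is not assumed to be available, the model is refitted to the aggregated counts from the stratified 2 by 2 tables (1362 subjects with known smoking status); the risks are then obtained both with `predict()` and by applying the inverse logit to the linear predictor by hand. The names `grouped2` and `fit2_grouped` are made up for this example.

```r
# Counts from the stratified 2 by 2 tables shown earlier (1362 subjects)
grouped2 <- data.frame(
  Sex   = factor(c("Male", "Male", "Female", "Female"), levels = c("Male", "Female")),
  Smoke = factor(c("No", "Yes", "No", "Yes"), levels = c("No", "Yes")),
  chd   = c(57, 107, 77, 27),
  nochd = c(191, 288, 423, 192)
)

fit2_grouped <- glm(cbind(chd, nochd) ~ Sex + Smoke, data = grouped2,
                    family = binomial)
b <- coef(fit2_grouped)
print(round(b, 5))

# Predicted risk for each of the four groups, directly from the fitted model
grouped2$risk <- as.numeric(predict(fit2_grouped, type = "response"))

# Same risks by hand: linear predictor a + b1*Female + b2*Smoker, then inverse logit
lp <- b["(Intercept)"] +
  b["SexFemale"] * (grouped2$Sex == "Female") +
  b["SmokeYes"]  * (grouped2$Smoke == "Yes")
grouped2$risk_by_hand <- as.numeric(plogis(lp))

print(grouped2[, c("Sex", "Smoke", "risk", "risk_by_hand")])
```

Note that the intercept alone gives the risk for the reference combination, here a non-smoking man.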


SIMPLE LOGISTIC REGRESSION: CATEGORICAL EXPLANATORY VARIABLE:

Categorize age into 4 intervals:

45-48, 49-52, 53-56, 57-62

Summarize in 2 by 4 table

             X = 0  X = 1  X = 2  X = 3
             45-48  49-52  53-56  57-62  Total
    Y = 0      308    298    254    235   1095
    Y = 1       51     61     64     92    268
    Total      359    359    318    327   1363

(Note: both males and females)


ANOVA: χ2 TEST

We may test whether the risk of CHD differs between the 4 age groups using a chi-square test statistic, in this case with 3 degrees of freedom:

Null hypothesis:

$$\text{odds}(\text{age } 45\text{-}48) = \text{odds}(\text{age } 49\text{-}52) = \text{odds}(\text{age } 53\text{-}56) = \text{odds}(\text{age } 57\text{-}62)$$

Pearson χ2 test:

$$\sum \frac{(\text{OBS} - \text{EXP})^2}{\text{EXP}} = 23.29 \sim \chi^2_3, \quad p < 0.001$$

Conclusion: the CHD risks in the four age groups are not all equal (the differences are statistically significant).
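The test statistic can be reproduced from the 2 by 4 table. The sketch below types in the counts, calls base R's `chisq.test()`, and also evaluates the sum of (OBS − EXP)²/EXP directly; the object names are illustrative.

```r
# 2 by 4 table of CHD status by age group (counts from the table shown earlier)
age_tab <- matrix(c(308, 298, 254, 235,
                     51,  61,  64,  92),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Y = c("0", "1"),
                                  Age = c("45-48", "49-52", "53-56", "57-62")))

test <- chisq.test(age_tab)
print(test$statistic)   # Pearson chi-square statistic
print(test$parameter)   # degrees of freedom
print(test$p.value)

# Same statistic from the definition: expected counts under the null hypothesis
# are (row total * column total) / grand total
expected <- outer(rowSums(age_tab), colSums(age_tab)) / sum(age_tab)
stat_by_hand <- sum((age_tab - expected)^2 / expected)
print(stat_by_hand)
```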


LOGISTIC REGRESSION: CATEGORICAL VARIABLE WITH 4 LEVELS:

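$$\log\left(\frac{p_i}{1 - p_i}\right) = \begin{cases} a & \text{age } 45\text{-}48 \\ a + b_1 & \text{age } 49\text{-}52 \\ a + b_2 & \text{age } 53\text{-}56 \\ a + b_3 & \text{age } 57\text{-}62 \end{cases}$$

Reference category 45-48

$$a = \log\left(\text{Odds}(45\text{-}48)\right)$$

$$b_1 = \log\left(\frac{\text{Odds}(49\text{-}52)}{\text{Odds}(45\text{-}48)}\right), \quad b_2 = \log\left(\frac{\text{Odds}(53\text{-}56)}{\text{Odds}(45\text{-}48)}\right), \quad b_3 = \log\left(\frac{\text{Odds}(57\text{-}62)}{\text{Odds}(45\text{-}48)}\right)$$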


RESULTS

    framingham[,AgeCut:=cut(AGE,c(40,48,52,56,99),
                            labels=c("45-48","49-52","53-56","57-62"))]
    fit3=glm(Y~AgeCut,data=framingham,family=binomial)
    publish(fit3)

    Variable Units OddsRatio       CI.95  p-value
      AgeCut 45-48       Ref
             49-52      1.24 [0.82;1.85]  0.30425
             53-56      1.52 [1.02;2.28]  0.04151
             57-62      2.36 [1.61;3.46] < 0.0001

Notes:

1. The interpretation depends on the cut-off values.
2. Not all (six) comparisons are in the table; for example, the odds ratio for group 57-62 vs 53-56 is not. But you should now be able to "easily" compute other odds ratio estimates from the above results.
3. Running similar code after changing the reference group is a convenient "trick" to easily obtain any odds ratio estimate, with corresponding 95% confidence interval and p-value.
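Because the model has one parameter per age group, any pairwise odds ratio can also be computed directly from the 2 by 4 table of counts. The sketch below does this with a small helper (`pair_or` is a hypothetical name) using Wald confidence limits; it is meant as an illustration of note 2, not as a replacement for refitting the model.

```r
# CHD counts by age group (from the 2 by 4 table shown earlier)
chd   <- c("45-48" = 51,  "49-52" = 61,  "53-56" = 64,  "57-62" = 92)
nochd <- c("45-48" = 308, "49-52" = 298, "53-56" = 254, "57-62" = 235)

# Odds ratio of group `g` versus group `ref`, with a Wald 95% confidence interval
pair_or <- function(g, ref) {
  log_or <- log((chd[g] / nochd[g]) / (chd[ref] / nochd[ref]))
  se <- sqrt(1 / chd[g] + 1 / nochd[g] + 1 / chd[ref] + 1 / nochd[ref])
  out <- exp(log_or + c(0, -1, 1) * qnorm(0.975) * se)
  setNames(out, c("OR", "lower", "upper"))
}

print(round(pair_or("57-62", "45-48"), 2))   # a comparison already in the table
print(round(pair_or("57-62", "53-56"), 2))   # a comparison not in the table

# The second odds ratio is also the ratio of two tabulated odds ratios
ratio <- unname(pair_or("57-62", "45-48")["OR"] / pair_or("53-56", "45-48")["OR"])
print(ratio)
```

The ratio of two tabulated odds ratios gives the point estimate only; the confidence interval requires either the counts, as here, or a refit with a different reference group.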


EQUIVALENT RESULTS

Let us change the reference group to 53-56 and re-fit the model.

framingham[,AgeCutb:=relevel(AgeCut,"53-56")]
fit3b=glm(Y~AgeCutb,data=framingham,family=binomial)
publish(fit3b)

Variable  Units  OddsRatio  CI.95        p-value
AgeCutb   53-56  Ref
          45-48  0.66       [0.44;0.98]  0.04151
          49-52  0.81       [0.55;1.20]  0.29468
          57-62  1.55       [1.08;2.24]  0.01798

As expected, we could have computed these estimates from the previous (equivalent) model. E.g.:
- 0.66 = 1/1.52, that is OR(45-48 vs 53-56) = 1/OR(53-56 vs 45-48)
- 1.55 = 2.36/1.52, that is OR(57-62 vs 53-56) = OR(57-62 vs 45-48)/OR(53-56 vs 45-48)
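The two conversions can be written out as a small calculation. This sketch only re-uses the rounded odds ratios printed in the first table (reference 45-48); the object names are hypothetical.

```r
# Odds ratios with 45-48 as reference, copied (rounded) from the first table
or_ref4548 <- c("49-52" = 1.24, "53-56" = 1.52, "57-62" = 2.36)

# OR(45-48 vs 53-56) = 1 / OR(53-56 vs 45-48)
or_4548_vs_5356 <- 1 / or_ref4548[["53-56"]]

# OR(57-62 vs 53-56) = OR(57-62 vs 45-48) / OR(53-56 vs 45-48)
or_5762_vs_5356 <- or_ref4548[["57-62"]] / or_ref4548[["53-56"]]

round(c(or_4548_vs_5356, or_5762_vs_5356), 2)
```

Only the point estimates can be converted this way. The confidence intervals and p-values for the new comparisons need the re-fitted model (or the covariance between the estimates), which is why changing the reference group is the convenient route.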


QUANTITATIVE EXPLANATORY FACTOR

It is sometimes more natural / better to include the variable AGE (in years) as a quantitative explanatory factor in the model (i.e., no grouping) [6]

$$
\log\left(\frac{p_i}{1-p_i}\right) = a + b \cdot \mathrm{age}_i
$$

$$
a = \log\left(\mathrm{odds}(\mathrm{age}=0)\right)
$$

$$
b = \log\left(\mathrm{odds}(\mathrm{age}=x+1)\right) - \log\left(\mathrm{odds}(\mathrm{age}=x)\right)
$$

Interpretation: for each year,

$$
\exp(b) = \text{odds ratio}
$$

is the factor by which the odds of CHD are multiplied with each one-unit increase in age (here 1 year).

[6] Sometimes better, but not always, because of the linearity assumption.


RESULTS

fit5=glm(Y~AGE,data=framingham,family=binomial)
summary(fit5)

Call:
glm(formula = Y ~ AGE, family = binomial, data = framingham)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.8600  -0.7052  -0.6082  -0.5224   2.0294

Coefficients:
            Estimate Std. Error z value       Pr(>|z|)
(Intercept) -4.88431    0.77372  -6.313 0.000000000274 ***
AGE          0.06581    0.01446   4.550 0.000005374208 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1351.2  on 1362  degrees of freedom
Residual deviance: 1330.2  on 1361  degrees of freedom
AIC: 1334.2

Number of Fisher Scoring iterations: 4
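The raw output is on the log-odds scale. As a rough sketch, the odds ratio per year of age, a 95% confidence interval and the z statistic can be recovered from the printed estimate and standard error. The interval below is a Wald-type interval, which is an assumption about how one would compute it by hand; the object names are hypothetical.

```r
# Estimate and standard error for AGE, copied from the summary output above
est <- 0.06581
se  <- 0.01446

# Odds ratio per one-year increase in age
or <- exp(est)

# Wald-type 95% confidence interval, computed on the log-odds scale
ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)

# z statistic and two-sided p-value
z <- est / se
p <- 2 * pnorm(-abs(z))

round(c(OR = or, lower = ci[1], upper = ci[2]), 2)
c(z = z, p = p)
```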


COMPARING MODELS (MODEL CHECK)

[Figure: estimated risk of CHD (%) against age (years, 45 to 62), comparing the model with no categorization of age (linear) and the model with categorization of age (cut-offs at 48, 52 and 56).]
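A comparison of this kind can be sketched as follows. This is not the code behind the slide: the data are simulated, and the names `d`, `nd`, `fit_lin`, `fit_cat` and the simulation parameters are assumptions. The idea is to predict the risk at each age from both models and overlay the two curves.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(2)
n <- 1500
breaks <- c(40, 48, 52, 56, 99)
labs <- c("45-48", "49-52", "53-56", "57-62")
d <- data.frame(age = sample(45:62, n, replace = TRUE))
d$y <- rbinom(n, 1, plogis(-4.9 + 0.066 * d$age))
d$grp <- cut(d$age, breaks, labels = labs)

# Model with age as quantitative (linear on the logit scale) and as categories
fit_lin <- glm(y ~ age, family = binomial, data = d)
fit_cat <- glm(y ~ grp, family = binomial, data = d)

# Predicted risks for every age from both models
nd <- data.frame(age = 45:62)
nd$grp <- cut(nd$age, breaks, labels = labs)
nd$risk_lin <- predict(fit_lin, newdata = nd, type = "response")
nd$risk_cat <- predict(fit_cat, newdata = nd, type = "response")

# Smooth curve (linear model) against step function (categorized model)
plot(nd$age, 100 * nd$risk_lin, type = "l", ylim = c(0, 40),
     xlab = "Age", ylab = "Risk of CHD (%)")
lines(nd$age, 100 * nd$risk_cat, type = "s", lty = 2)
legend("topleft", lty = c(1, 2),
       legend = c("no categorization (linear)", "categorization"))
```

If the step function wanders far from the smooth curve, the linearity assumption deserves a closer look.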


REPORTING RESULTS

One year change in age (not very good)

fit5=glm(Y~AGE,data=framingham,family=binomial)
publish(fit5)

Variable  Units  OddsRatio  CI.95        p-value
AGE              1.07       [1.04;1.10]  < 0.0001

10-year change in age (probably better)

framingham[,age10:=AGE/10]
fit5=glm(Y~age10,data=framingham,family=binomial)
publish(fit5)

Variable  Units  OddsRatio  CI.95        p-value
age10            1.93       [1.45;2.56]  < 0.0001

These results are completely equivalent: $1.93 = 1.07^{10}$ (up to rounding of the one-year odds ratio). The fitted models are the same, but the "default" way of presenting the results is different.
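The equivalence of the two parametrizations can be illustrated on simulated data. In this sketch the data, the names `d`, `fit1`, `fit10` and the simulation parameters are assumptions, not the Framingham data.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(3)
n <- 1500
d <- data.frame(age = sample(45:62, n, replace = TRUE))
d$y <- rbinom(n, 1, plogis(-4.9 + 0.066 * d$age))
d$age10 <- d$age / 10

# Same model, age expressed in years or in decades
fit1  <- glm(y ~ age,   family = binomial, data = d)
fit10 <- glm(y ~ age10, family = binomial, data = d)

b1  <- coef(fit1)[["age"]]
b10 <- coef(fit10)[["age10"]]

# The 10-year odds ratio is the one-year odds ratio to the power 10
c(per_year = exp(b1), per_10_years = exp(b10), per_year_power_10 = exp(b1)^10)

# Identical fit: same residual deviance
c(deviance(fit1), deviance(fit10))
```

With the estimates from the slides, exp(10 × 0.06581) is about 1.93, whereas raising the rounded value 1.07 to the power 10 gives about 1.97, so the conversion is best done with the unrounded coefficient.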


EXERCISES

If we subtract the value 50 from each person's age and then divide by 10:

framingham[,Age50:=(AGE-50)/10]
fit5a=glm(Y~Age50,data=framingham,family=binomial)
summary(fit5a)

Call:
glm(formula = Y ~ Age50, family = binomial, data = framingham)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.8600  -0.7052  -0.6082  -0.5224   2.0294

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.59382    0.08347  -19.09  < 2e-16 ***
Age50        0.65810    0.14465    4.55 5.37e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1351.2  on 1362  degrees of freedom
Residual deviance: 1330.2  on 1361  degrees of freedom
AIC: 1334.2

Number of Fisher Scoring iterations: 4

1. Report the coronary heart disease risk of a person aged 50.
2. Report the association between age and risk of coronary heart disease in a sentence with confidence interval and p-value.


MULTIPLE LOGISTIC REGRESSION

Additive effects of several explanatory variables:

$$
\log\left(\frac{p_i}{1-p_i}\right) = a + b_1 Z_i + b_2 X_i + \dots
$$

Multiple logistic regression is a way to control confounding:

The effect on the outcome (odds ratio) of each explanatory variable is mutually adjusted for the other explanatory variables.

- The model assumes that the effect (odds ratio) of Z on Y is the same for all values of X.
- In other words: the effect of Z on Y is not modified by the values of X (no statistical interaction).


ILLUSTRATION OF WHAT "MUTUALLY ADJUSTED" MEANS

Additive model (no statistical interactions)

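$$
\log\left(\frac{p_i}{1-p_i}\right) = a + b_1 Z_i + b_2 X_i
$$

Effect of sex $Z_i$ (0 = female, 1 = male) adjusted for age ($X_i$)

$$
\frac{\mathrm{odds}(\text{age}=50, \text{male})}{\mathrm{odds}(\text{age}=50, \text{female})}
= \frac{\exp(a + b_1 + b_2 \cdot 50)}{\exp(a + b_2 \cdot 50)}
= \exp(a + b_1 + b_2 \cdot 50 - a - b_2 \cdot 50)
= \exp(b_1).
$$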

The result is the same for age 46 and age 61 and all other ages.


ILLUSTRATION OF WHAT "MUTUALLY ADJUSTED" MEANS (CONTINUED)

Effect of age (Xi) for males:

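$$
\frac{\mathrm{odds}(\text{age}=51, \text{male})}{\mathrm{odds}(\text{age}=50, \text{male})}
= \frac{\exp(a + b_1 + b_2 \cdot 51)}{\exp(a + b_1 + b_2 \cdot 50)}
= \exp(a + b_1 + b_2 \cdot 51 - a - b_1 - b_2 \cdot 50)
= \exp(b_2).
$$

The result is the same for females:

$$
\frac{\mathrm{odds}(\text{age}=51, \text{female})}{\mathrm{odds}(\text{age}=50, \text{female})}
= \frac{\exp(a + b_2 \cdot 51)}{\exp(a + b_2 \cdot 50)}
= \exp(a + b_2 \cdot 51 - a - b_2 \cdot 50)
= \exp(b_2).
$$

Linearity means that the result is the same for a comparison of age 63 and age 62, and for all other one-year differences.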


RESULTS (RAW)

fit.add=glm(Y ~ AGE + Sex, family = binomial, data = framingham)
summary(fit.add)

Call:
glm(formula = Y ~ AGE + Sex, family = binomial, data = framingham)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9910  -0.6927  -0.5958  -0.4500   2.1913

Coefficients:
            Estimate Std. Error z value      Pr(>|z|)
(Intercept) -4.59208    0.78019  -5.886 0.00000000396 ***
AGE          0.06672    0.01458   4.575 0.00000475151 ***
SexFemale   -0.71613    0.14052  -5.096 0.00000034612 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1351.2  on 1362  degrees of freedom
Residual deviance: 1303.5  on 1360  degrees of freedom
AIC: 1309.5

Number of Fisher Scoring iterations: 4
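The "mutually adjusted" property can be checked numerically with the coefficients printed above. In this sketch the helper `odds()` and the vector `b` are hypothetical names. Note that R uses Male as the reference level here, so the sex coefficient compares females with males.

```r
# Coefficients copied from the summary output above (reference level: Male)
b <- c(intercept = -4.59208, age = 0.06672, female = -0.71613)

# Odds of CHD under the additive model (female = 1 for women, 0 for men)
odds <- function(age, female) {
  exp(b[["intercept"]] + b[["age"]] * age + b[["female"]] * female)
}

# Odds ratio female vs male at several ages: the same value every time
or_sex <- sapply(c(46, 50, 61), function(a) odds(a, 1) / odds(a, 0))

# Odds ratio for one more year of age: the same value for men and women
or_age <- c(male = odds(51, 0) / odds(50, 0),
            female = odds(51, 1) / odds(50, 1))

round(or_sex, 2)
round(or_age, 2)
```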


RESULTS (FORMATTED FOR PUBLICATION)

fit.add=glm(Y ~ AGE + Sex, family = binomial, data = framingham)
publish(fit.add)

Variable  Units   OddsRatio  CI.95         p-value
AGE               1.07       [1.04; 1.10]  < 0.0001
Sex       Male    Ref
          Female  0.49       [0.37; 0.64]  < 0.0001

Logistic regression was used to investigate gender differences in odds (risks) of CHD adjusted for age.

The age-adjusted odds ratio was 0.49 (95% CI: [0.37;0.64]), showing that the risks of CHD were significantly lower for women compared to men (p<0.0001).


PREDICTED RISKS BASED ON LOGISTIC REGRESSION MODEL

A logistic regression model can be used to predict personalized risks:

$$
\log\left(\frac{p_i}{1-p_i}\right) = a + b_1 Z_i + b_2 X_i + \dots
$$

is equivalent to

$$
p_i = \frac{\exp(a + b_1 Z_i + b_2 X_i + \dots)}{1 + \exp(a + b_1 Z_i + b_2 X_i + \dots)}
$$

The risks (and risk ratios) depend on all explanatory variables simultaneously.

We can predict a risk for any value of the covariates Z, X, ... once we have estimated the model parameters. [7]

[7] However, utmost caution is needed when using covariate values beyond the range of those observed (e.g. age=110). Usually we do not want to extrapolate beyond the observed data.


PREDICTED RISKS BASED ON LOGISTIC REGRESSION MODEL

mydata=expand.grid(AGE=c(50,55),Sex=factor(c("Female","Male")))
setDT(mydata)
mydata

   AGE    Sex
1:  50 Female
2:  55 Female
3:  50   Male
4:  55   Male

mydata[,risk:=predict(fit.add,newdata=mydata,type="response")]
mydata

   AGE    Sex      risk
1:  50 Female 0.1221381
2:  55 Female 0.1626353
3:  50   Male 0.2216284
4:  55   Male 0.2844255
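The same predictions can be reproduced by hand from the formula for $p_i$, which may help to see what `predict(..., type="response")` does. This sketch uses the rounded coefficients of the additive model `fit.add` printed earlier, so the last digits differ slightly; `b`, `newd` and `lp` are hypothetical names.

```r
# Rounded coefficients of the additive model (reference level for Sex: Male)
b <- c(intercept = -4.59208, age = 0.06672, female = -0.71613)

# Same covariate combinations as in mydata
newd <- expand.grid(AGE = c(50, 55), Sex = c("Female", "Male"))

# Linear predictor on the log-odds scale, then inverse logit
lp <- b[["intercept"]] + b[["age"]] * newd$AGE +
  b[["female"]] * (newd$Sex == "Female")
newd$risk <- plogis(lp)   # plogis(x) = exp(x) / (1 + exp(x))
newd

# Risk ratio female vs male at age 50 and at age 55, and the odds ratio
rr <- newd$risk[newd$Sex == "Female"] / newd$risk[newd$Sex == "Male"]
or <- exp(b[["female"]])
round(c(rr_age50 = rr[1], rr_age55 = rr[2], or = or), 3)
```

The odds ratio is the same at both ages, while the risk ratio changes with age: this is what "the risks (and risk ratios) depend on all explanatory variables simultaneously" means in practice.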


VISUALIZATION OF PREDICTED RISKS

mydata2 <- setDT(expand.grid(AGE=seq(45,62,1),Sex=factor(c("Female","Male"))))
mydata2[,risk:=predict(fit.add,newdata=mydata2,type="response")]
library(ggplot2)
ggplot(mydata2,aes(x=AGE,y=risk,group=Sex,colour=Sex))+geom_line()+ylim(c(0,1))+xlab("Age (years)")+ylab("Risk of CHD")

[Figure: predicted risk of CHD (0 to 1) against age (years, 45 to 60), one line per sex (Female, Male).]


EXAMPLE WITH MORE VARIABLES

framingham[,Chol10:=CHOL/10]
fit.multi <- glm(Y ~ AGE + Sex + Chol10 + SBP + Smoke, family = binomial, data = framingham)
publish(fit.multi)

Variable  Units   Missing  OddsRatio  CI.95        p-value
AGE               0        1.06       [1.02;1.09]  0.0004181
Sex       Male    0        Ref
          Female           0.38       [0.28;0.52]  < 0.0001
Chol10            0        1.05       [1.02;1.08]  0.0026086
SBP               0        1.02       [1.01;1.02]  < 0.0001
Smoke     No      1        Ref
          Yes              1.19       [0.88;1.60]  0.2510447


EXERCISE

1. Report the effect of cholesterol on coronary heart disease from the multiple logistic regression model.

2. Predict the coronary heart disease risks of four smoking females, all aged 50 and with an SBP of 150, but with different cholesterol values:
   - person 1: 235, person 2: 245, person 3: 351, person 4: 361

mydata=data.frame(AGE=50,
                  Sex=factor("Female",levels(framingham$Sex)),
                  Smoke=factor("Yes",levels(framingham$Smoke)),
                  SBP=150,
                  Chol10=c(23.5,24.5,35.1,36.1))

3. Compute the risk ratios for 10-unit cholesterol changes, 245 versus 235 and 361 versus 351. Are they similar? Did you expect that?

4. Repeat 2. and 3. for a male person.


Statistical interaction = Effect modification

The effect of X on Y depends on Z


EFFECT MODIFICATION

Setting: 3 variables

- two explanatory variables X, Z
- one outcome Y

Meaning
In logistic regression, an interaction means that the odds ratio which describes the effect of X on the odds of Y=1 depends on the value of Z.

Symmetry
If the effect of variable X on Y is modified by Z, then the effect of Z on Y is also modified by X.

Example
The age (Z) effect on the CHD risk (Y) may depend on sex (X).


A MODEL WITH AN INTERACTION

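$$
\log\left(\frac{p_i}{1-p_i}\right) = a + b_1 Z_i + b_2 X_i + b_3 X_i \cdot Z_i
$$

The effect of sex $Z_i$ (0 = female, 1 = male) depends on age ($X_i$)

$$
\frac{\mathrm{odds}(\text{age}=50, \text{male})}{\mathrm{odds}(\text{age}=50, \text{female})}
= \frac{\exp(a + b_1 + b_2 \cdot 50 + b_3 \cdot 50)}{\exp(a + b_2 \cdot 50)}
= \exp(b_1 + b_3 \cdot 50).
$$

The effect of age ($X_i$) depends on sex $Z_i$ (0 = female, 1 = male)

$$
\frac{\mathrm{odds}(\text{age}=50, \text{male})}{\mathrm{odds}(\text{age}=45, \text{male})}
= \exp(b_2 \cdot 5 + b_3 \cdot 5).
$$

$$
\frac{\mathrm{odds}(\text{age}=50, \text{female})}{\mathrm{odds}(\text{age}=45, \text{female})}
= \exp(b_2 \cdot 5).
$$

Note: $\exp(b_2)$ describes the odds ratio for age in the reference group for sex (female) only, while it is $\exp(b_2 + b_3)$ in the other group (male).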


STATISTICAL INTERACTION IN R

First option (more transparent):

glm(Y ~ AGE + Sex + AGE:Sex, family = binomial, data = framingham)

Shorter syntax:

glm(Y ~ AGE * Sex, family = binomial, data = framingham)
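That the two notations specify the same model can be verified directly. The sketch below uses simulated data; the names `d`, `fit_long`, `fit_short` and the simulation parameters are assumptions, not the Framingham data.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(4)
n <- 1000
d <- data.frame(
  age = sample(45:62, n, replace = TRUE),
  sex = factor(sample(c("Male", "Female"), n, replace = TRUE),
               levels = c("Male", "Female"))
)
d$y <- rbinom(n, 1, plogis(-3.5 + 0.05 * d$age - 0.7 * (d$sex == "Female")))

# Explicit notation: main effects plus the interaction term
fit_long  <- glm(y ~ age + sex + age:sex, family = binomial, data = d)
# Short notation: '*' expands to main effects plus interaction
fit_short <- glm(y ~ age * sex, family = binomial, data = d)

names(coef(fit_short))
all.equal(coef(fit_long), coef(fit_short))
```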


RESULTS

summary(glm(Y ~ AGE * Sex, family = binomial, data = framingham))

Call:
glm(formula = Y ~ AGE * Sex, family = binomial, data = framingham)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9171  -0.7284  -0.6074  -0.4010   2.3029

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -3.45290    1.00008  -3.453 0.000555 ***
AGE            0.04523    0.01883   2.402 0.016288 *
SexFemale     -3.54459    1.60431  -2.209 0.027146 *
AGE:SexFemale  0.05297    0.02987   1.773 0.076194 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1351.2  on 1362  degrees of freedom
Residual deviance: 1300.4  on 1359  degrees of freedom
AIC: 1308.4

Number of Fisher Scoring iterations: 4


STATISTICAL INTERACTION

publish(glm(Y ~ AGE * Sex, family = binomial, data = framingham))

Variable          Units  OddsRatio  CI.95        p-value
AGE: Sex(Male)           1.05       [1.01;1.09]  0.01629
AGE: Sex(Female)         1.10       [1.05;1.15]  < 0.0001

Notes:

- The main effects for AGE and Sex have no interpretation (and are therefore not shown).
- One year more in age increases the odds by 5% in males and by 10% in females.


PREDICTED RISK OF THE MODEL WITH AN ADDITIVE EFFECT OF AGE AND SEX (NO EFFECT MODIFICATION)

[Figure, two panels, each with one curve for Male and one for Female: estimated CHD risk (%) against age (years), and logit of estimated CHD risk against age (years).]


PREDICTED RISK OF THE MODEL WITH AN INTERACTION BETWEEN AGE AND SEX (WITH EFFECT MODIFICATION)

[Figure, two panels, each with one curve for Male and one for Female: estimated CHD risk (%) against age (years), and logit of estimated CHD risk against age (years).]


TESTING FOR STATISTICAL INTERACTION

fit.add=glm(Y ~ AGE + Sex, family = binomial, data = framingham)
fit.int=glm(Y ~ AGE * Sex, family = binomial, data = framingham)
anova(fit.add,fit.int,test="Chisq")

Analysis of Deviance Table

Model 1: Y ~ AGE + Sex
Model 2: Y ~ AGE * Sex
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1      1360     1303.5
2      1359     1300.4  1   3.1676  0.07511 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

There is no statistically significant modification of the age effect by gender.

Note: this method and code also work when neither of the two variables is binary.
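The p-value in the deviance table is a likelihood ratio test: the drop in residual deviance is compared with a chi-squared distribution whose degrees of freedom equal the number of extra parameters. A minimal sketch using the printed numbers (object names are hypothetical):

```r
# Drop in residual deviance and difference in residual degrees of freedom,
# copied from the analysis of deviance table above
lr_stat <- 3.1676
df_diff <- 1360 - 1359

# Likelihood ratio test p-value
p_lr <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# Wald test of the interaction coefficient, from its z value in the summary
p_wald <- 2 * pnorm(-abs(1.773))

round(c(likelihood_ratio = p_lr, wald = p_wald), 4)
```

The two tests address the same null hypothesis (no interaction) and usually give similar, though not identical, p-values.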


TAKE HOME MESSAGES

- (Multiple) logistic regression describes associations between one or several explanatory variables and the risk of an event (binary outcome).
- Analysis of an exposure of interest can be adjusted for potential confounders.
- In an additive model (no interactions), odds ratios do not depend on the other explanatory variables.
- Risks and risk ratios predicted by the model depend on the other explanatory variables.
- Linearity and absence of interaction are assumptions which should be investigated.
- Models with interactions are flexible and useful but need more concentration to be interpreted correctly.
- Several models can be fitted from the same data, but some are more relevant than others for a given research question (e.g. in terms of adjustments and interactions).
