# introduction to logistic regression. content simple and multiple linear regression simple logistic...

Post on 13-Jan-2016

240 views

Embed Size (px)

TRANSCRIPT

Introduction to Logistic Regression

ContentSimple and multiple linear regressionSimple logistic regressionThe logistic functionEstimation of parametersInterpretation of coefficientsMultiple logistic regressionInterpretation of coefficientsCoding of variables

Simple linear regressionTable 1 Age and systolic blood pressure (SBP) among 33 adult women

Age

SBP

Age

SBP

Age

SBP

22

131

41

139

52

128

23

128

41

171

54

105

24

116

46

137

56

145

27

106

47

111

57

141

28

114

48

115

58

153

29

123

49

133

59

157

30

117

49

128

63

155

32

122

50

183

67

176

33

99

51

130

71

172

35

121

51

133

77

178

40

147

51

144

81

217

SBP (mm Hg)Age (years)adapted from Colton T. Statistics in Medicine. Boston: Little Brown, 1974

Simple linear regressionRelation between 2 continuous variables (SBP and age)

Regression coefficient b1Measures association between y and xAmount by which y changes on average when x changes by one unitLeast squares method

yxSlope

Multiple linear regressionRelation between a continuous variable and a set of i continuous variables

Partial regression coefficients biAmount by which y changes on average when xi changes by one unit and all the other xis remain constantMeasures association between xi and y adjusted for all other xi

ExampleSBP versus age, weight, height, etc

Multiple linear regressionDependent Independent variablesPredicted Predictor variablesResponse variable Explanatory variablesOutcome variable Covariables

Multivariate analysis Model Outcome

Linear regression continousPoisson regression countsCox model survivalLogistic regression binomial......

Choice of the tool according to study, objectives, and the variablesControl of confoundingModel building, prediction

Logistic regressionModels the relationship between a set of variables xidichotomous (eat : yes/no)categorical (social class, ... )continuous (age, ...) and dichotomous variable Y Dichotomous (binary) outcome most common situation in biology and epidemiology

Logistic regression (1)Table 2 Age and signs of coronary heart disease (CD)

Age

CD

Age

CD

Age

CD

22

0

40

0

54

0

23

0

41

1

55

1

24

0

46

0

58

1

27

0

47

0

60

1

28

0

48

0

60

0

30

0

49

1

62

1

30

0

49

0

65

1

32

0

50

1

67

1

33

0

51

0

71

1

35

1

51

1

77

1

38

0

52

0

81

1

How can we analyse these data?Comparison of the mean age of diseased and non-diseased women

Non-diseased: 38.6 yearsDiseased: 58.7 years (p

Dot-plot: Data from Table 2

Logistic regression (2)Table 3 Prevalence (%) of signs of CD according to age group

Diseased

Age group

# in group

#

%

20 -29

5

0

0

30 - 39

6

1

17

40 - 49

7

2

29

50 - 59

7

4

57

60 - 69

5

4

80

70 - 79

2

2

100

80 - 89

1

1

100

Dot-plot: Data from Table 3DiseasedAge (years)

P1-P

Dot-plot: Data from Table 3Diseased %Age (years)

The logistic function (2)

The logistic function (3)Advantages of the logitSimple transformation of P(y|x)Linear relationship with xCan be continuous (Logit between - to + )Known binomial distribution (P between 0 and 1)Directly related to the notion of odds of disease

Interpretation of b (1)

Exposure (x)

Disease (y)

Yes

No

Yes

No

_872073004.unknown

_872073024.unknown

_872073042.unknown

_872060690.unknown

Interpretation of b (2) b = increase in log-odds for a one unit increase in xTest of the hypothesis that b=0 (Wald test)

Interval testing

- ExampleAge (
1374

18312459531628742442549929982501

Odds Odds

Results of fitting Logistic Regression Model

Coefficient

SE

Coeff/SE

Age

2.094

0.529

3.96

Constant

-0.841

0.255

-3.30

Log-odds = 2.094

OR = e2.094 = 8.1

Interpretation of b (1)

Exposure (x)

Disease (y)

Yes

No

Yes

No

_872073004.unknown

_872073024.unknown

_872073042.unknown

_872060690.unknown

Multiple logistic regressionMore than one independent variableDichotomous, ordinal, nominal, continuous

Interpretation of bi Increase in log-odds for a one unit increase in xi with all the other xis constantMeasures association between xi and log-odds adjusted for all other xi

Multiple logistic regressionEffect modificationCan be modelled by including interaction terms

Statistical testingQuestionDoes model including a given independent variable provide more information about dependent variable than model without this variable?Three testsLikelihood ratio statistic (LRS)Wald test

Fitting equation to the dataLinear regression: Least squaresLogistic regression: Maximum likelihoodLikelihood functionEstimates parameters a and b with property that likelihood (probability) of observed data is higher than for any other valuesPractically easier to work with log-likelihood

Likelihood ratio statisticCompares two nested models Log(odds) = + 1x1 + 2x2 + 3x3 + 4x4 (model 1) Log(odds) = + 1x1 + 2x2 (model 2)

Likelihood ratio statisticCompares two nested models Log(odds) = + 1x1 + 2x2 + 3x3 + 4x4 (model 1) Log(odds) = + 1x1 + 2x2 (model 2)

LR statistic-2 log (likelihood model 2 / likelihood model 1) =-2 log (likelihood model 2) minus -2log (likelihood model 1)

LR statistic is a 2 with DF = number of extra parameters in model

ExampleP Probability for cardiac arrestExc 1= lack of exercise, 0 = exerciseSmk 1= smokers, 0= non-smokersadapted from Kerr, Handbook of Public Health Methods, McGraw-Hill, 1998

Interactive effect between smoking and exercise?

Product term b3 = -0.4604 (SE 0.5332)

Wald test = 0.75 (1df)

-2log(L) = 342.092 with interaction term = 342.836 without interaction term

LR statistic = 0.74 (1df), p = 0.39 No evidence of any interaction

Coding of variables (1)Dichotomous variables: yes = 1, no = 0Continuous variablesIncrease in OR for a one unit change in exposure variableLogistic model is multiplicative OR increases exponentially with xIf OR = 2 for a one unit change in exposure and x increases from 2 to 5: OR = 2 x 2 x 2 = 23 = 8

Verify that OR increases exponentially with x. When in doubt, treat as qualitative variable

Continuous variable?Relationship between SBP>160 mmHg and body weight

Introduce BW as continuous variable?Code weight as single variable, eg. 3 equal classes: 40-60 kg = 0, 60-80 kg = 1, 80-100 kg = 2

Compatible with assumption of multiplicative modelIf not compatible, use indicator variables

BW

Cases

Controls

OR

0

20

40

1.0

1

22

30

1.5

1.52 ( 2.2

2

12

11

2.2

Coding of variables (2)Nominal variables or ordinal with unequal classes:Tobacco smoked: no=0, grey=1, brown=2, blond=3Model assumes that OR for blond tobacco = OR for grey tobacco3Use indicator variables (dummy variables)

Indicator variables: Type of tobacco Neutralises artificial hierarchy between classes in the variable "type of tobacco"No assumptions made3 variables (3 df) in model using same reference OR for each type of tobacco adjusted for the others in reference to non-smoking

Tobacco

consumption

Dummy variables

Dark

Light

Both

Dark

1

0

0

Light

0

1

0

Both

0

0

1

None

0

0

0

(LBW) . . . LBW .

: 2500 1380 1381 . " " .

LBW 295 5987 . LBW

LBW88008008806322880336188153618801244

LBW . . (3/82) (96 ) (52 ) .

Poisson RegressionPoisson Regression Analysis is used when the outcome variable comprises counts, usually of rather rare events e.g. number of cases of cancer over a defined period in a cohort of subjects.log (rate) = 0 + 1X1 + 2X2 + +nXnlog (event/person-time) = 0 + 1X1 + log (event) log (person-time) = 0 + 1X1 + log (event) = log (person-time) + 0 + 1X1 + log (event) = 0* + 1X1 +

ReferenceHosmer DW, Lemeshow S. Applied logistic regression.Wiley & Sons, New York, 1989

Example 1: Low Birth Weight Study198 observationsLow Birth Weigth [LBW] 1= Birth weight < 2500g0= Birth weight >= 2500gAge of mother in yearsWeight of mother in pounds [LWT]Race (1,2,3)Number of doctors visit in last trimester [FTV]

Example 2: Risk of death from bacterial meningitis according to treatment161 observationsDeath (0,1)Treatment1=Chloramphenicol, 2=Ampicillin) Delay before treatment (onset, in days)Convulsions (1,0)Level of consciousness (1-3)Severity of dehydration (1-3)Age in yearsPathogen1 Others, 2 HiB, 3 Streptococcus pneumoniae

The logistic function (1)Probability of diseasex

The logistic function (2)

Fitting equation to the dataLinear regression: Least squaresLogistic regression: Maximum likelihoodLikelihood functionEstimates parameters a and b with property that likelihood (probability) of observed data is higher than for any other valuesPractically easier to work with log-likelihood

Maximum likelihoodIterative computingChoice of an arbitrary value for the coefficients (usually 0)Computing of log-likelihoodVariation of coefficients valuesReiteration until maximisation (plateau)

ResultsMaximum Likelihood Estimates (MLE) for and Estimates of P(y) for a given value of x

*********************************************