introduction to logistic regression. content simple and multiple linear regression simple logistic...

of 48/48
Introduction to Logistic Regression

Post on 13-Jan-2016

240 views

Category:

Documents

0 download

Embed Size (px)

TRANSCRIPT

  • Introduction to Logistic Regression

  • ContentSimple and multiple linear regressionSimple logistic regressionThe logistic functionEstimation of parametersInterpretation of coefficientsMultiple logistic regressionInterpretation of coefficientsCoding of variables

  • Simple linear regressionTable 1 Age and systolic blood pressure (SBP) among 33 adult women

    Age

    SBP

    Age

    SBP

    Age

    SBP

    22

    131

    41

    139

    52

    128

    23

    128

    41

    171

    54

    105

    24

    116

    46

    137

    56

    145

    27

    106

    47

    111

    57

    141

    28

    114

    48

    115

    58

    153

    29

    123

    49

    133

    59

    157

    30

    117

    49

    128

    63

    155

    32

    122

    50

    183

    67

    176

    33

    99

    51

    130

    71

    172

    35

    121

    51

    133

    77

    178

    40

    147

    51

    144

    81

    217

  • SBP (mm Hg)Age (years)adapted from Colton T. Statistics in Medicine. Boston: Little Brown, 1974

  • Simple linear regressionRelation between 2 continuous variables (SBP and age)

    Regression coefficient b1Measures association between y and xAmount by which y changes on average when x changes by one unitLeast squares method

    yxSlope

  • Multiple linear regressionRelation between a continuous variable and a set of i continuous variables

    Partial regression coefficients biAmount by which y changes on average when xi changes by one unit and all the other xis remain constantMeasures association between xi and y adjusted for all other xi

    ExampleSBP versus age, weight, height, etc

  • Multiple linear regressionDependent Independent variablesPredicted Predictor variablesResponse variable Explanatory variablesOutcome variable Covariables

  • Multivariate analysis Model Outcome

    Linear regression continousPoisson regression countsCox model survivalLogistic regression binomial......

    Choice of the tool according to study, objectives, and the variablesControl of confoundingModel building, prediction

  • Logistic regressionModels the relationship between a set of variables xidichotomous (eat : yes/no)categorical (social class, ... )continuous (age, ...) and dichotomous variable Y Dichotomous (binary) outcome most common situation in biology and epidemiology

  • Logistic regression (1)Table 2 Age and signs of coronary heart disease (CD)

    Age

    CD

    Age

    CD

    Age

    CD

    22

    0

    40

    0

    54

    0

    23

    0

    41

    1

    55

    1

    24

    0

    46

    0

    58

    1

    27

    0

    47

    0

    60

    1

    28

    0

    48

    0

    60

    0

    30

    0

    49

    1

    62

    1

    30

    0

    49

    0

    65

    1

    32

    0

    50

    1

    67

    1

    33

    0

    51

    0

    71

    1

    35

    1

    51

    1

    77

    1

    38

    0

    52

    0

    81

    1

  • How can we analyse these data?Comparison of the mean age of diseased and non-diseased women

    Non-diseased: 38.6 yearsDiseased: 58.7 years (p

  • Dot-plot: Data from Table 2

  • Logistic regression (2)Table 3 Prevalence (%) of signs of CD according to age group

    Diseased

    Age group

    # in group

    #

    %

    20 -29

    5

    0

    0

    30 - 39

    6

    1

    17

    40 - 49

    7

    2

    29

    50 - 59

    7

    4

    57

    60 - 69

    5

    4

    80

    70 - 79

    2

    2

    100

    80 - 89

    1

    1

    100

  • Dot-plot: Data from Table 3DiseasedAge (years)

    P1-P

  • Dot-plot: Data from Table 3Diseased %Age (years)

  • The logistic function (2)

  • The logistic function (3)Advantages of the logitSimple transformation of P(y|x)Linear relationship with xCan be continuous (Logit between - to + )Known binomial distribution (P between 0 and 1)Directly related to the notion of odds of disease

  • Interpretation of b (1)

    Exposure (x)

    Disease (y)

    Yes

    No

    Yes

    No

    _872073004.unknown

    _872073024.unknown

    _872073042.unknown

    _872060690.unknown

  • Interpretation of b (2) b = increase in log-odds for a one unit increase in xTest of the hypothesis that b=0 (Wald test)

    Interval testing

  • ExampleAge (
  • 1374

    18312459531628742442549929982501

  • Odds Odds

  • Results of fitting Logistic Regression Model

    Coefficient

    SE

    Coeff/SE

    Age

    2.094

    0.529

    3.96

    Constant

    -0.841

    0.255

    -3.30

    Log-odds = 2.094

    OR = e2.094 = 8.1

  • Interpretation of b (1)

    Exposure (x)

    Disease (y)

    Yes

    No

    Yes

    No

    _872073004.unknown

    _872073024.unknown

    _872073042.unknown

    _872060690.unknown

  • Multiple logistic regressionMore than one independent variableDichotomous, ordinal, nominal, continuous

    Interpretation of bi Increase in log-odds for a one unit increase in xi with all the other xis constantMeasures association between xi and log-odds adjusted for all other xi

  • Multiple logistic regressionEffect modificationCan be modelled by including interaction terms

  • Statistical testingQuestionDoes model including a given independent variable provide more information about dependent variable than model without this variable?Three testsLikelihood ratio statistic (LRS)Wald test

  • Fitting equation to the dataLinear regression: Least squaresLogistic regression: Maximum likelihoodLikelihood functionEstimates parameters a and b with property that likelihood (probability) of observed data is higher than for any other valuesPractically easier to work with log-likelihood

  • Likelihood ratio statisticCompares two nested models Log(odds) = + 1x1 + 2x2 + 3x3 + 4x4 (model 1) Log(odds) = + 1x1 + 2x2 (model 2)

  • Likelihood ratio statisticCompares two nested models Log(odds) = + 1x1 + 2x2 + 3x3 + 4x4 (model 1) Log(odds) = + 1x1 + 2x2 (model 2)

    LR statistic-2 log (likelihood model 2 / likelihood model 1) =-2 log (likelihood model 2) minus -2log (likelihood model 1)

    LR statistic is a 2 with DF = number of extra parameters in model

  • ExampleP Probability for cardiac arrestExc 1= lack of exercise, 0 = exerciseSmk 1= smokers, 0= non-smokersadapted from Kerr, Handbook of Public Health Methods, McGraw-Hill, 1998

  • Interactive effect between smoking and exercise?

    Product term b3 = -0.4604 (SE 0.5332)

    Wald test = 0.75 (1df)

    -2log(L) = 342.092 with interaction term = 342.836 without interaction term

    LR statistic = 0.74 (1df), p = 0.39 No evidence of any interaction

  • Coding of variables (1)Dichotomous variables: yes = 1, no = 0Continuous variablesIncrease in OR for a one unit change in exposure variableLogistic model is multiplicative OR increases exponentially with xIf OR = 2 for a one unit change in exposure and x increases from 2 to 5: OR = 2 x 2 x 2 = 23 = 8

    Verify that OR increases exponentially with x. When in doubt, treat as qualitative variable

  • Continuous variable?Relationship between SBP>160 mmHg and body weight

    Introduce BW as continuous variable?Code weight as single variable, eg. 3 equal classes: 40-60 kg = 0, 60-80 kg = 1, 80-100 kg = 2

    Compatible with assumption of multiplicative modelIf not compatible, use indicator variables

    BW

    Cases

    Controls

    OR

    0

    20

    40

    1.0

    1

    22

    30

    1.5

    1.52 ( 2.2

    2

    12

    11

    2.2

  • Coding of variables (2)Nominal variables or ordinal with unequal classes:Tobacco smoked: no=0, grey=1, brown=2, blond=3Model assumes that OR for blond tobacco = OR for grey tobacco3Use indicator variables (dummy variables)

  • Indicator variables: Type of tobacco Neutralises artificial hierarchy between classes in the variable "type of tobacco"No assumptions made3 variables (3 df) in model using same reference OR for each type of tobacco adjusted for the others in reference to non-smoking

    Tobacco

    consumption

    Dummy variables

    Dark

    Light

    Both

    Dark

    1

    0

    0

    Light

    0

    1

    0

    Both

    0

    0

    1

    None

    0

    0

    0

  • (LBW) . . . LBW .

  • : 2500 1380 1381 . " " .

  • LBW 295 5987 . LBW

    LBW88008008806322880336188153618801244

  • LBW . . (3/82) (96 ) (52 ) .

  • Poisson RegressionPoisson Regression Analysis is used when the outcome variable comprises counts, usually of rather rare events e.g. number of cases of cancer over a defined period in a cohort of subjects.log (rate) = 0 + 1X1 + 2X2 + +nXnlog (event/person-time) = 0 + 1X1 + log (event) log (person-time) = 0 + 1X1 + log (event) = log (person-time) + 0 + 1X1 + log (event) = 0* + 1X1 +

  • ReferenceHosmer DW, Lemeshow S. Applied logistic regression.Wiley & Sons, New York, 1989

  • Example 1: Low Birth Weight Study198 observationsLow Birth Weigth [LBW] 1= Birth weight < 2500g0= Birth weight >= 2500gAge of mother in yearsWeight of mother in pounds [LWT]Race (1,2,3)Number of doctors visit in last trimester [FTV]

  • Example 2: Risk of death from bacterial meningitis according to treatment161 observationsDeath (0,1)Treatment1=Chloramphenicol, 2=Ampicillin) Delay before treatment (onset, in days)Convulsions (1,0)Level of consciousness (1-3)Severity of dehydration (1-3)Age in yearsPathogen1 Others, 2 HiB, 3 Streptococcus pneumoniae

  • The logistic function (1)Probability of diseasex

  • The logistic function (2)

  • Fitting equation to the dataLinear regression: Least squaresLogistic regression: Maximum likelihoodLikelihood functionEstimates parameters a and b with property that likelihood (probability) of observed data is higher than for any other valuesPractically easier to work with log-likelihood

  • Maximum likelihoodIterative computingChoice of an arbitrary value for the coefficients (usually 0)Computing of log-likelihoodVariation of coefficients valuesReiteration until maximisation (plateau)

    ResultsMaximum Likelihood Estimates (MLE) for and Estimates of P(y) for a given value of x

    *********************************************