
BIOST 536 Lecture 11

Lecture 11 – Additional topics in Logistic Regression

C-statistic (“concordance statistic”)

Same as area under the curve (AUC) in LROC (logistic receiver operating characteristic)

Fit a model and generate $\text{logit}(p)$ and $p$ for each observation
Form all possible pairings of the m cases and n controls (total number of pairs is m × n)
Compare $\text{logit}(p_{\text{case}})$ to $\text{logit}(p_{\text{control}})$
C-statistic is equal to

\[
\text{AUC} = \frac{\#\,\text{pairs}\,(\text{logit}_{\text{case}} > \text{logit}_{\text{control}}) + 0.50 \times \#\,\text{pairs}\,(\text{logit}_{\text{case}} = \text{logit}_{\text{control}})}{\#\,\text{pairs total}}
\]

. logistic chd age sc1 sbp

Logistic regression                               Number of obs   =        910
                                                  LR chi2(3)      =      43.01
                                                  Prob > chi2     =     0.0000
Log likelihood = -428.26102                       Pseudo R2       =     0.0478

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.027559   .0142093     1.97   0.049     1.000083    1.055789
         sc1 |   1.006021   .0019341     3.12   0.002     1.002237    1.009819
         sbp |   1.016381   .0035904     4.60   0.000     1.009368    1.023443
------------------------------------------------------------------------------


. predict xb , xb

. graph box xb, over(chd)

[Figure: box plots of the linear prediction (xb) by chd status (0, 1)]

Create all possible pairs: cases (m = 178) × controls (n = 732) = 130,296 pairs
Assess number of pairs where the case logit > control logit
No ties, so we get the same result as lroc
Can also compute the c-statistic for a validation sample to test prediction in a new sample

. lroc

Logistic model for chd

number of observations =      910
area under ROC curve   =   0.6473

. tabulate concord

    concord |      Freq.     Percent        Cum.
------------+-----------------------------------
Pcase<Pcont |     45,951       35.27       35.27
Pcase>Pcont |     84,345       64.73      100.00
------------+-----------------------------------
      Total |    130,296      100.00
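The pairwise definition of the c-statistic can be sketched in a few lines of Python (used here only for illustration; it is not part of the Stata session). The function name `c_statistic` and the small score lists are hypothetical, not the CHD data; the last line only re-does the arithmetic from the `tabulate concord` output.

```python
def c_statistic(case_scores, control_scores):
    """Concordance statistic from all case x control pairs (ties count one half)."""
    concordant = 0
    tied = 0
    for s_case in case_scores:
        for s_control in control_scores:
            if s_case > s_control:
                concordant += 1
            elif s_case == s_control:
                tied += 1
    total_pairs = len(case_scores) * len(control_scores)
    return (concordant + 0.5 * tied) / total_pairs

# Hypothetical linear predictors (logits), chosen only to show one tie.
cases = [0.2, -0.5, -1.1]
controls = [-1.5, -0.5, -2.0, 0.4]
c_toy = c_statistic(cases, controls)   # 7 concordant pairs + 1 tie out of 12 pairs

# With no ties, the lecture's counts give the AUC reported by lroc.
c_chd = 84345 / 130296
print(round(c_toy, 3), round(c_chd, 4))
```

The double loop is quadratic in the sample size, which is fine for 130,296 pairs; rank-based formulas give the same value more cheaply for large samples.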


Small sample sizes

Logistic regression LR tests, odds ratio estimates, and confidence intervals depend on asymptotic large-sample results
  May not work well for small samples
  May not even be able to get estimates in some cases if a category has all cases or all controls
Sir D. R. Cox proposed some small-sample exact logistic regression methods in his 1970 text Analysis of Binary Data
Not computationally feasible until an algorithm developed by Hirji, Mehta, and Patel (1987) reduced computations (programs marketed as StatXact and LogXact)
Exact logistic regression uses the sufficient statistics for all covariates in the model:

\[ \sum_i x_i y_i \]

Condition on the sufficient statistics and consider all permutations of the data consistent with the sufficient statistics
Can derive estimates and confidence intervals


Small sample sizes

Computation can be extensive
Can stratify by variables that we control for
Methods now included in SAS and Stata (Version 10 on?)
Small dose escalation example
  Too small for ordinary logistic regression

Dose   Deaths   N
0      0        3
1      0        3
2      0        3
3      0        3
4      1        3
5      2        3


Small sample sizes

Do this example in Stata using exact logistic regression (exlogistic command)
Do an incorrect standard logistic regression first
Wald test (p = 0.099) and LR test (p = 0.0043) disagree

. list

     +----------------------+
     | dose   count   death |
     |----------------------|
  1. |    0       3       0 |
  2. |    0       0       1 |
  3. |    1       3       0 |
  4. |    1       0       1 |
  5. |    2       3       0 |
  6. |    2       0       1 |
  7. |    3       3       0 |
  8. |    3       0       1 |
  9. |    4       2       0 |
 10. |    4       1       1 |
 11. |    5       1       0 |
 12. |    5       2       1 |
     +----------------------+

. logistic death dose [fw=count]

Logistic regression                               Number of obs   =         18
                                                  LR chi2(1)      =       8.15
                                                  Prob > chi2     =     0.0043
Log likelihood = -4.0362174                       Pseudo R2       =     0.5023

------------------------------------------------------------------------------
       death | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dose |   8.007606   10.09189     1.65   0.099     .6772423    94.68068
------------------------------------------------------------------------------


Small sample sizes

Exact logistic regression does show a significant relationship of deaths with dose and gives an odds ratio and a permutation-based confidence interval

Note sufficient statistic is

\[ \sum_i \text{dose}_i \times \text{deaths}_i = 0 \cdot 0 + 1 \cdot 0 + 2 \cdot 0 + 3 \cdot 0 + 4 \cdot 1 + 5 \cdot 2 = 14 \]

. exlogistic death dose [fw=count]

Enumerating sample-space combinations:
observation 1:   enumerations =          2
observation 2:   enumerations =          4
observation 3:   enumerations =          7
observation 4:   enumerations =         10
observation 5:   enumerations =         20
observation 6:   enumerations =         30
observation 7:   enumerations =         33
observation 8:   enumerations =         16

Exact logistic regression                        Number of obs =         18
                                                 Model score   =   5.472381
                                                 Pr >= score   =     0.0245
---------------------------------------------------------------------------
       death | Odds Ratio       Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
        dose |   6.049377          14      0.0245      1.122698    353.0003
---------------------------------------------------------------------------
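The 2*Pr(Suff.) value for the dose example can be checked by brute-force enumeration. The Python sketch below is an illustration, not the algorithm exlogistic uses. It assumes the test conditions on the total number of deaths (3 of 18), so that under the null hypothesis of no dose effect every choice of which 3 subjects died is equally likely. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# One entry per subject: 3 subjects at each of doses 0..5 (dose escalation table).
doses = [dose for dose in range(6) for _ in range(3)]
n_deaths = 3                   # total deaths, held fixed by conditioning
t_observed = 4 * 1 + 5 * 2     # observed sufficient statistic: sum of dose * death = 14

# Sufficient statistic for every possible set of 3 subjects who could have died.
stats = [sum(chosen) for chosen in combinations(doses, n_deaths)]

# Upper-tail probability of a statistic at least as large as the one observed.
p_upper = Fraction(sum(t >= t_observed for t in stats), len(stats))
p_two_sided = min(Fraction(1), 2 * p_upper)

print(len(stats), p_upper, round(float(p_two_sided), 4))
```

Only 10 of the 816 equally likely arrangements give a statistic of 14 or more, so the doubled tail probability is 20/816, about 0.0245, in agreement with the exlogistic output.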


Small sample sizes – Example 2

Two binary covariates
Only 3 outcomes observed
First consider Fisher’s exact test to relate A to outcome
Set up the data using frequency counts (Y = number with the outcome, N = number of subjects)

A   B   Y   N
0   0   1   1
0   1   0   2
1   0   1   8
1   1   1   21

. tabulate y a [fw=count] , exact chi2

           |           a
         y |         0          1 |     Total
-----------+----------------------+----------
         0 |         2         27 |        29
         1 |         1          2 |         3
-----------+----------------------+----------
     Total |         3         29 |        32

          Pearson chi2(1) =   2.2365   Pr = 0.135
           Fisher's exact =                 0.263
   1-sided Fisher's exact =                 0.263
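Fisher's exact p-values for this table can be reproduced directly from the hypergeometric distribution. The Python sketch below is illustrative only. It fixes both margins and uses the usual two-sided convention of summing the probabilities of all tables no more likely than the observed one, which is assumed here to be the convention behind the Stata output. The helper name `prob` is hypothetical.

```python
from math import comb

# 2 x 2 table from the tabulate output: rows y = 0, 1; columns a = 0, 1.
table = [[2, 27],
         [1, 2]]
row_totals = [sum(row) for row in table]         # 29 without the outcome, 3 with it
col_totals = [sum(col) for col in zip(*table)]   # 3 with a = 0, 29 with a = 1
n = sum(row_totals)                              # 32 subjects

events = row_totals[1]   # subjects with y = 1
a0 = col_totals[0]       # subjects with a = 0

def prob(k):
    """Hypergeometric probability that k of the events fall in the a = 0 column."""
    return comb(a0, k) * comb(n - a0, events - k) / comb(n, events)

observed = table[1][0]
support = range(0, min(events, a0) + 1)   # lower bound is 0 for these margins

# Two-sided: tables with probability no larger than the observed table's.
p_two_sided = sum(prob(k) for k in support if prob(k) <= prob(observed) * (1 + 1e-9))
# One-sided: at least as many events in the a = 0 column as observed.
p_one_sided = sum(prob(k) for k in support if k >= observed)

print(round(p_two_sided, 3), round(p_one_sided, 3))
```

Both values equal 1306/4960, about 0.263, because the only table more likely than the observed one has no events in the a = 0 column.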


Small sample sizes – Example 2

Same answer with exact logistic regression (Pr >= score = 0.2633 agrees with Fisher's exact p = 0.263)

. exlogistic y a [fw=count]

Enumerating sample-space combinations:
observation 1:   enumerations =          2
observation 2:   enumerations =          4
observation 3:   enumerations =          6
observation 4:   enumerations =          9
observation 5:   enumerations =         10
observation 6:   enumerations =          4

Exact logistic regression                        Number of obs =         32
                                                 Model score   =   2.166601
                                                 Pr >= score   =     0.2633
---------------------------------------------------------------------------
           y | Odds Ratio       Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
           a |   .1647192           2      0.5266      .0055396     13.0711
---------------------------------------------------------------------------

Now consider both covariates together

. exlogistic y a b [fw=count]

Enumerating sample-space combinations:
observation 1:   enumerations =          2
observation 2:   enumerations =          4
observation 3:   enumerations =          8
observation 4:   enumerations =         17
observation 5:   enumerations =         21
observation 6:   enumerations =         12

Exact logistic regression                        Number of obs =         32
                                                 Model score   =   4.360821
                                                 Pr >= score   =     0.0798
---------------------------------------------------------------------------
           y | Odds Ratio       Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
           a |   .1649572           2      0.5797      .0018228    14.92823
           b |   .1982354           1      0.4138      .0031804     4.02635
---------------------------------------------------------------------------


Small sample sizes – Example 3

Crossover design
  Same individuals get tested in all treatments
  Outcome is recorded after each treatment
  Treatment effect is assumed to wash out quickly after outcome is measured
  Order of treatments may still matter
Example has 15 individuals undergoing three treatments but in different orders

     +--------------------------+
     | person   time   y   drug |
     |--------------------------|
  1. |      1      1   0      1 |
  2. |      1      2   0      2 |
  3. |      1      3   0      0 |
  4. |      2      1   1      1 |
  5. |      2      2   1      2 |
  6. |      2      3   0      0 |
  7. |      3      1   0      1 |
  8. |      3      2   1      2 |
  9. |      3      3   1      0 |
 10. |      4      1   1      1 |
 11. |      4      2   0      0 |
 12. |      4      3   1      2 |
 13. |      5      1   1      1 |
 14. |      5      2   0      0 |
 15. |      5      3   0      2 |
 16. |      6      1   0      2 |
 17. |      6      2   0      1 |
 18. |      6      3   0      0 |
 19. |      7      1   1      2 |
 20. |      7      2   1      1 |
 21. |      7      3   0      0 |
 22. |      8      1   0      2 |
 23. |      8      2   0      0 |
 24. |      8      3   1      1 |
 25. |      9      1   1      2 |
 26. |      9      2   0      0 |
 27. |      9      3   1      1 |
 28. |     10      1   0      2 |
 29. |     10      2   1      0 |
 30. |     10      3   0      1 |
 31. |     11      1   0      0 |
 32. |     11      2   1      1 |
 33. |     11      3   0      2 |
 34. |     12      1   1      0 |
 35. |     12      2   0      2 |
 36. |     12      3   1      1 |
 37. |     13      1   0      0 |
 38. |     13      2   0      2 |
 39. |     13      3   1      1 |
 40. |     14      1   0      0 |
 41. |     14      2   1      2 |
 42. |     14      3   0      1 |
 43. |     15      1   0      0 |
 44. |     15      2   1      2 |
 45. |     15      3   1      1 |
     +--------------------------+


Small sample sizes – Example 3

Drug: 0 = Placebo; 1 = Drug A; 2 = Drug B
Treat time and drug as categorical variables
Need to group observations within individual (all comparisons are within individual)

. xi: exlogistic y i.drug i.time , group(person)
i.drug            _Idrug_0-2          (naturally coded; _Idrug_0 omitted)
i.time            _Itime_1-3          (naturally coded; _Itime_1 omitted)

Enumerating sample-space combinations:
observation 1:   enumerations =          1
observation 2:   enumerations =          1
observation 3:   enumerations =          1
observation 4:   enumerations =          2
observation 5:   enumerations =          3
observation 6:   enumerations =          3
observation 7:   enumerations =          6
observation 8:   enumerations =          8
etc.
observation 43:  enumerations =      10286
observation 44:  enumerations =      11395
observation 45:  enumerations =       6877


Small sample sizes – Example 3

Drug A is significantly different from Placebo
Drug B has higher odds of the outcome than Placebo (odds ratio 2.30), but the difference is not statistically significant
Time effects are not strong
Have accounted for the correlation within individual by grouping
Conditioning methods used extensively later

Exact logistic regression                        Number of obs      =        45
Group variable: person                           Number of groups   =        15

                                                 Obs per group: min =         3
                                                                avg =       3.0
                                                                max =         3

                                                 Model score        =   6.14764
                                                 Pr >= score        =    0.1835
---------------------------------------------------------------------------
           y | Odds Ratio       Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
    _Idrug_1 |   5.276637          10      0.0450      1.029821    49.90737
    _Idrug_2 |   2.301805           7      0.3516      .5056709    12.74542
    _Itime_2 |   1.468032           7      0.9203      .2282417     11.0921
    _Itime_3 |   .9958269           7      1.0000      .1534726     5.69607
---------------------------------------------------------------------------


More on confounding

Confounder is a covariate that is related to the outcome as well as the primary risk factor or scientific variable of interest

“Confounding is the distortion of a disease/exposure association brought about by the association of other factors with both disease and exposure, the latter association with disease being causal” Breslow & Day (1980)

“If any factor either increasing or decreasing the risk of disease besides the characteristic or exposure under study is unequally distributed in the groups that are being compared with regard to the disease, this itself will give rise to differences in disease frequency in the compared groups. Such distortion, termed confounding, leads to an invalid comparison” Lilienfeld & Stolley (1994)


More on confounding

Criteria for a confounding factor (Rothman & Greenland, 1998)

1. Confounding factor must be a risk factor for the disease

2. Confounding factor must be associated with the exposure under study in the source population (the population at risk from which the cases are derived)

3. Confounding factor must not be affected by exposure or the disease. In particular, it cannot be an intermediate step in the causal path between the exposure and the disease

Choosing confounders for statistical adjustment
  Choice should be based on a priori considerations
  Study design/protocol specifies particular exposure x disease association under investigation
  Confounders selected/measured based on their role as known risk factors for disease


More on confounding

Best not to select based on internal study results
  Selecting on the basis of statistical significance with outcome can leave residual confounding
  Selecting on change in association of exposure and outcome may not lead to correct inference
Reporting results
  Give unadjusted estimates
  Give estimates adjusted for known primary risk factors
  Give estimates adjusted for primary and secondary risk factors

How do we adjust?


Paradigms for controlling for confounding

Experimental methods
  Hold all other relevant factors constant
  Randomly allocate subjects to treatments
Statistical approaches for controlling for confounding

1. Estimate conditional treatment effects holding values of confounders constant
  Assume constant effect measure within strata – confounders are not effect modifiers
  Indirect standardization
  Mantel-Haenszel summary odds ratio in stratified analyses
  Get adjusted estimates using logistic regression
  Traditional method of adjusting for confounding involves comparison of unadjusted versus adjusted estimates
  Sometimes called “summarized effect measure” (Newman, 2001)



2. Estimate marginal treatment effects under simulated randomization

Randomization assumes equal distribution of potential confounders among treatment (exposure) groups

Treatment effects measured by contrasting marginal distributions of response

Direct standardization of rates or proportions

Simulates randomized experiment by fixing the distribution of the confounder to be equal across treatment groups

Compares marginal measures of response between “fixed” treatment groups

Less dependent on modeling assumptions, but less stable statistically and less generalizable to other populations with different confounder distributions



“Causal” analysis of unobserved (“counterfactual”) responses that would have been observed if subjects had been assigned to another treatment

Treatment/exposure effects measured by contrasting marginal distributions of responses (both observed and counterfactual) between treatment groups

Confounding controlled by use of inverse probability weighting of observed responses to compensate for missing counterfactual responses

Example: Probability that treated response is observed is equal to the probability of treatment estimated by logistic regression of treatment category on confounders (“propensity score”)

Assumption of no unmeasured confounders equivalent to assuming that unobserved (counterfactual) responses are “missing at random” in sense of Little and Rubin (1987)


Direct standardization

Compare response rates under different scenarios where both treatments are applied to equivalent populations assumed to equal the distribution of

1. Treatment 1 group

2. Treatment 2 group

3. 50% Treatment 1 and 50% Treatment 2

4. Some other arbitrary population

Difference depends on how population is standardized
Answers “What would be the difference if …”

          Number of Responses   Number Exposed    Percent Response
Stratum   Trt 1     Trt 2       Trt 1    Trt 2    Trt 1    Trt 2
Easy      560       80          800      100      70%      80%
Hard      40        360         200      900      20%      40%
Total     600       440         1000     1000     60%      44%

Standardized population   Trt 1    Trt 2    Difference
Like Trt 1                60%      72%      12%
Like Trt 2                25%      44%      19%
Like 50% of each          42.5%    58%      15.5%
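The standardized rates in the table can be reproduced by weighting each treatment's stratum-specific response rates by the stratum fractions of the chosen standard population. The Python sketch below is illustrative; the function and variable names are hypothetical. The "50% of each" population is formed by pooling the two groups, which gives an equal mix only because both groups have 1000 subjects.

```python
# Counts from the table, stored as (Trt 1, Trt 2) for each stratum.
responses = {"Easy": (560, 80), "Hard": (40, 360)}
exposed = {"Easy": (800, 100), "Hard": (200, 900)}

def stratum_weights(counts):
    """Convert stratum counts of a standard population into fractions."""
    total = sum(counts.values())
    return {stratum: count / total for stratum, count in counts.items()}

def standardized_rates(weights):
    """Weighted average of stratum-specific response rates for Trt 1 and Trt 2."""
    rates = []
    for trt in (0, 1):
        rate = sum(weights[s] * responses[s][trt] / exposed[s][trt] for s in weights)
        rates.append(rate)
    return rates

populations = {
    "Like Trt 1": stratum_weights({s: exposed[s][0] for s in exposed}),
    "Like Trt 2": stratum_weights({s: exposed[s][1] for s in exposed}),
    # Pooling works as a 50/50 mix here because both groups have 1000 subjects.
    "Like 50% of each": stratum_weights({s: sum(exposed[s]) for s in exposed}),
}

results = {name: standardized_rates(w) for name, w in populations.items()}
for name, (rate1, rate2) in results.items():
    print(f"{name:18s} Trt 1 {rate1:.1%}  Trt 2 {rate2:.1%}  difference {rate2 - rate1:.1%}")
```

Within each stratum Trt 2 has the higher response rate, yet the crude totals (60% versus 44%) favor Trt 1; standardizing to any common population removes that reversal.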


Causal models for binary outcome data

Assume that everyone has two possible binary outcomes, $Y_1$ if they received treatment (or were exposed) and $Y_0$ if not treated (or not exposed)
May be written as follows to emphasize that these are the outcomes that would occur if assigned to treatment or not

\[ Y = (Y_1, Y_0) = (Y_{A=1}, Y_{A=0}) \]

Counterfactual nature of Y occurs because only one of $Y_1$ and $Y_0$ can be observed
Define R to be the random variable that indicates whether a subject receives treatment (R=1) or not (R=0)
  R=1 indicates a subject was selected for treatment
  R and the counterfactual Y may be statistically correlated
Observed binary outcome is

\[ Y_R = R\,Y_1 + (1 - R)\,Y_0 \]


Causal models for binary outcome data

Fundamental problem of causal inference is that difference in expected outcome of treating everyone versus treating no one (LHS) does not equal difference in expected outcomes among those actually treated versus not treated (RHS) due to bias in allocation of treatment to subjects

\[ E(Y_1 - Y_0) \neq E(Y_1 \mid R = 1) - E(Y_0 \mid R = 0) \qquad (1) \]

Confounding is present if there is not equality between the LHS and RHS of expression (1)
If we randomized, then R and the counterfactual Y are independent, and the LHS may be estimated directly and there is no confounding
Without randomization, we cannot be sure if confounding is present or not


Causal models for binary outcome data

May be willing to assume that R is independent of the counterfactual Y within strata formed by Z, a composite of measured confounders, i.e. R and Y are conditionally independent given Z
  This is an assumption and not verifiable from the data
  May be a good assumption if randomization was carried out within strata
If the assumption of conditional independence is true given Z

\begin{align*}
\Pr(Y_1 = 1) &= \sum_k \Pr(Y_1 = 1 \mid Z = k)\,\Pr(Z = k) \\
             &= \sum_k \Pr(Y_1 = 1 \mid R = 1, Z = k)\,\Pr(Z = k) \qquad \text{(assumption)} \\
             &= \sum_k \frac{\Pr(Y_1 = 1, R = 1, Z = k)}{\Pr(R = 1 \mid Z = k)}
\end{align*}

\[ \Pr(Y_0 = 1) = \sum_k \frac{\Pr(Y_0 = 1, R = 0, Z = k)}{\Pr(R = 0 \mid Z = k)} \]


Causal models for binary outcome data

Numerator of probability is a standard logistic regression model for outcome given observed treatment and covariates
Denominator is the probability of receiving that treatment given the covariates
Suggests we can estimate these probabilities by using inverse probability weighting (IPW), estimating selection for treatment within strata
IPW results in a synthetic or pseudo-population where treatment (exposure) and the confounders are uncorrelated
Marginal (or crude) measures of treatment effect within the pseudo-population constitute the causal effect of interest

Direct standardization of effect measure is conceptually and algebraically identical to this marginal causal effect


Example (Robins, Hernan, & Brumback, 2000)

$n_{ijk}$ denotes number of study subjects in stratum Z = k who received treatment R = j where j = 0 or 1 and had outcome i where i = 0 or 1
$N_{ijk}$ denotes corresponding number in the constructed pseudo-population
Estimate $p_{jk} = \Pr(R = j \mid Z = k)$ by $n_{jk} / n_k$
Weight with $w_{jk} = 1/p_{jk} = n_k / n_{jk}$ so that

\[ N_{ijk} = \frac{n_{ijk}\, n_k}{n_{jk}} \]

and note that

\[ N_{jk} = \frac{n_{jk}\, n_k}{n_{jk}} = n_k \]

independently of j (R and Z unassociated)

                        Stratum Z=1        Stratum Z=0
Outcome                 Trt 1    Trt 0     Trt 1    Trt 0
Y = 1                   108      24        20       40
Y = 0                   252      16        30       10
Total $n_{jk}$          360      40        50       50
Stratum total $n_k$         400                100


Example (Robins, Hernan, & Brumback, 2000)

Estimate $p_{jk} = \Pr(R = j \mid Z = k)$ by $n_{jk} / n_k$
  e.g. $p_{11} = \Pr(R = 1 \mid Z = 1)$ by $360 / 400 = 0.90$
Weight $w_{jk} = 1/p_{jk} = n_k / n_{jk}$ so $w_{11} = 1/p_{11} = 1/0.90 = 1.11$

\[ N_{ijk} = \frac{n_{ijk}\, n_k}{n_{jk}} \quad \text{so} \quad N_{111} = \frac{108 \times 400}{360} = 120 \]

Observed data and synthetic population

Z   R   Y_R   n     p      w      N
1   1   1     108   0.90   1.11   120
1   1   0     252   0.90   1.11   280
1   0   1     24    0.10   10     240
1   0   0     16    0.10   10     160
0   1   1     20    0.50   2      40
0   1   0     30    0.50   2      60
0   0   1     40    0.50   2      80
0   0   0     10    0.50   2      20
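The pseudo-population column N can be rebuilt from the observed counts with a few lines of Python. This is a sketch of the weighting arithmetic only, using exact fractions to avoid rounding; the dictionary layout and names are illustrative choices, not part of the original example.

```python
from fractions import Fraction

# Observed counts keyed by (Z, R, Y), copied from the table above.
n = {
    (1, 1, 1): 108, (1, 1, 0): 252, (1, 0, 1): 24, (1, 0, 0): 16,
    (0, 1, 1): 20,  (0, 1, 0): 30,  (0, 0, 1): 40, (0, 0, 0): 10,
}

# Stratum totals n_k and treatment-by-stratum totals n_jk.
n_k, n_jk = {}, {}
for (z, r, y), count in n.items():
    n_k[z] = n_k.get(z, 0) + count
    n_jk[(z, r)] = n_jk.get((z, r), 0) + count

# Weight each cell by the inverse of the estimated probability of its treatment.
pseudo = {}
for (z, r, y), count in n.items():
    p = Fraction(n_jk[(z, r)], n_k[z])   # estimated Pr(R = r | Z = z)
    w = 1 / p                            # inverse probability weight
    pseudo[(z, r, y)] = count * w

for key in n:
    print(key, n[key], pseudo[key])
```

Within each stratum both treatment arms of the pseudo-population have size n_k, so treatment and Z are unassociated and the pseudo-population totals 1000, twice the original 500.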


Example (Robins, Hernan, & Brumback, 2000)

Note that R and Z are unassociated and that the pseudo-population is twice the size of the original sample

Pseudo-population
            Stratum Z=1       Stratum Z=0       Combined
Outcome     Trt 1   Trt 0     Trt 1   Trt 0     Trt 1   Trt 0
Y = 1       120     240       40      80        160     320
Y = 0       280     160       60      20        340     180
Total       400     400       100     100       500     500

Marginal risk difference is

\[ \frac{160}{500} - \frac{320}{500} = -0.32 \]

which equals the risk difference found by direct standardization to a population that has the same fractions in each stratum as the original sample pooled over treatment (or exposure)

Marginal (causal) odds ratio

\[ \frac{160 \times 180}{340 \times 320} = 0.265 \]


Example (Robins, Hernan, & Brumback, 2000)

Unadjusted crude data
            Combined
Outcome     Trt 1   Trt 0
Y = 1       128     64
Y = 0       282     26
Total       410     90

Crude risk difference is

\[ \frac{128}{410} - \frac{64}{90} = -0.40 \]

Crude odds ratio

\[ \frac{128 \times 26}{282 \times 64} = 0.184 \]

Note standard errors not shown for the IPW weighted results
Stata can weight inversely by “sampling probabilities” using pweight
Causal modeling is an area of research interest currently
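The crude and IPW-weighted effect measures can be compared side by side. The Python sketch below only redoes the arithmetic from the two combined 2 x 2 tables in this example; the function names are illustrative, and no standard errors are computed.

```python
def risk_difference(y1_trt1, y1_trt0, y0_trt1, y0_trt0):
    """Risk in Trt 1 minus risk in Trt 0."""
    risk1 = y1_trt1 / (y1_trt1 + y0_trt1)
    risk0 = y1_trt0 / (y1_trt0 + y0_trt0)
    return risk1 - risk0

def odds_ratio(y1_trt1, y1_trt0, y0_trt1, y0_trt0):
    """Odds of Y = 1 in Trt 1 divided by odds of Y = 1 in Trt 0."""
    return (y1_trt1 * y0_trt0) / (y0_trt1 * y1_trt0)

# Combined counts as (Y=1 Trt 1, Y=1 Trt 0, Y=0 Trt 1, Y=0 Trt 0).
crude = (128, 64, 282, 26)      # observed sample, pooled over Z
pseudo = (160, 320, 340, 180)   # IPW pseudo-population, pooled over Z

crude_rd, crude_or = risk_difference(*crude), odds_ratio(*crude)
ipw_rd, ipw_or = risk_difference(*pseudo), odds_ratio(*pseudo)

print(f"crude: RD = {crude_rd:.2f}, OR = {crude_or:.3f}")
print(f"IPW:   RD = {ipw_rd:.2f}, OR = {ipw_or:.3f}")
```

In this example the crude measures (risk difference about -0.40, odds ratio about 0.184) overstate the benefit of treatment relative to the marginal measures from the pseudo-population (-0.32 and about 0.265).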


Propensity scores

Causal inference requires no association between treatment (exposure) and counterfactual outcome Y conditional on Z
What if Z is multidimensional with continuous components? Can we find a function of Z such that conditioning on that function, rather than on Z itself, will suffice to meet this independence condition?
Answer: The propensity score

\[ \pi(Z) = \Pr(R = 1 \mid Z) \]

can satisfy this independence (Rosenbaum & Rubin, 1983)
  Find a model for exposure given covariate vector Z
  Stratify or match on estimated propensity score
Estimated association of exposure and disease is a summary over propensity score strata and not necessarily equal to the causal effect
Fine matching may be difficult but crude matching or modeling using the propensity score may allow residual confounding
