BIOST 536 Lecture 9

Lecture 9 – Prediction and Association example Low birth weight dataset

Consider a prediction model for low birth weight (< 2500 grams) given the collection of variables available
Do not particularly care which variables are included
Want to maximize our prediction of the outcome
Need to validate our prediction on data not used to generate the model

Variable                                                  Name
---------------------------------------------------------------
Identification Code                                       ID
Low Birth Weight (0 = Birth Weight >= 2500g,              LOW
                  1 = Birth Weight < 2500g)
Age of the Mother in Years                                AGE
Weight in Pounds at the Last Menstrual Period             LWT
Race (1 = White, 2 = Black, 3 = Other)                    RACE
Smoking Status During Pregnancy (1 = Yes, 0 = No)         SMOKE
History of Premature Labor (0 = None 1 = One, etc.)       PTL
History of Hypertension (1 = Yes, 0 = No)                 HT
Presence of Uterine Irritability (1 = Yes, 0 = No)        UI
Number of Physician Visits During the First Trimester     FTV
  (0 = None, 1 = One, 2 = Two, etc.)
Birth Weight in Grams                                     BWT
---------------------------------------------------------------


. describe

Contains data from H:\Biostat\Biost536\Fall2007\data\lbw.dta
  obs:           189
 vars:            11                          6 Oct 2007 11:21
 size:         3,402 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              int    %8.0g                  ID
low             byte   %8.0g                  LOW
age             byte   %8.0g                  AGE
lwt             int    %8.0g                  LWT
race            byte   %8.0g                  RACE
smoke           byte   %8.0g                  SMOKE
ptl             byte   %8.0g                  PTL
ht              byte   %8.0g                  HT
ui              byte   %8.0g                  UI
ftv             byte   %8.0g                  FTV
bwt             int    %8.0g                  BWT
-------------------------------------------------------------------------------

. summ

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |       189    121.0794    63.30363          4        226
         low |       189    .3121693    .4646093          0          1
         age |       189     23.2381    5.298678         14         45
         lwt |       189    129.8148    30.57938         80        250
        race |       189    1.846561    .9183422          1          3
       smoke |       189    .3915344    .4893898          0          1
         ptl |       189    .1957672    .4933419          0          3
          ht |       189    .0634921    .2444936          0          1
          ui |       189    .1481481    .3561903          0          1
         ftv |       189    .7936508    1.059286          0          6
         bwt |       189    2944.656    729.0224        709       4990

. histogram bwt, width(250) start(500) percent normal xline(2500, extend) xlabel(500(500)5000)
(bin=18, start=500, width=250)


Outcome variable

Look at the distribution of birthweights

Were low birthweight babies oversampled (31% of the 189 births are below 2500 g), or was there bias in recording?

[Figure: histogram of BWT in percent (0 to 15), bins of 250 g from 500 to 5000 g, with a normal density overlay and a vertical reference line at 2500 g]


Simple descriptives

Number of prior pre-term labors (ptl) may be too sparse to model as a count, since only 30 of the 189 women have any; consider dichotomizing into none versus 1 or more (everptl)

. tabulate low smoke, col

           |         SMOKE
       LOW |         0          1 |     Total
-----------+----------------------+----------
         0 |        86         44 |       130
           |     74.78      59.46 |     68.78
-----------+----------------------+----------
         1 |        29         30 |        59
           |     25.22      40.54 |     31.22
-----------+----------------------+----------
     Total |       115         74 |       189

. tabulate low race, col

           |               RACE
       LOW |         1          2          3 |     Total
-----------+---------------------------------+----------
         0 |        73         15         42 |       130
           |     76.04      57.69      62.69 |     68.78
-----------+---------------------------------+----------
         1 |        23         11         25 |        59
           |     23.96      42.31      37.31 |     31.22
-----------+---------------------------------+----------
     Total |        96         26         67 |       189

. tabulate low ptl, col

           |                     PTL
       LOW |         0          1          2          3 |     Total
-----------+--------------------------------------------+----------
         0 |       118          8          3          1 |       130
           |     74.21      33.33      60.00     100.00 |     68.78
-----------+--------------------------------------------+----------
         1 |        41         16          2          0 |        59
           |     25.79      66.67      40.00       0.00 |     31.22
-----------+--------------------------------------------+----------
     Total |       159         24          5          1 |       189


Number of first trimester physician visits may also have to be grouped for analysis (ftvgrp: 0, 1, 2+)

. tabulate low ht, col

           |          HT
       LOW |         0          1 |     Total
-----------+----------------------+----------
         0 |       125          5 |       130
           |     70.62      41.67 |     68.78
-----------+----------------------+----------
         1 |        52          7 |        59
           |     29.38      58.33 |     31.22
-----------+----------------------+----------
     Total |       177         12 |       189

. tabulate low ui, col

           |          UI
       LOW |         0          1 |     Total
-----------+----------------------+----------
         0 |       116         14 |       130
           |     72.05      50.00 |     68.78
-----------+----------------------+----------
         1 |        45         14 |        59
           |     27.95      50.00 |     31.22
-----------+----------------------+----------
     Total |       161         28 |       189

. tabulate low ftv, col

           |                                FTV
       LOW |         0          1          2          3          4          6 |     Total
-----------+------------------------------------------------------------------+----------
         0 |        64         36         23          3          3          1 |       130
           |     64.00      76.60      76.67      42.86      75.00     100.00 |     68.78
-----------+------------------------------------------------------------------+----------
         1 |        36         11          7          4          1          0 |        59
           |     36.00      23.40      23.33      57.14      25.00       0.00 |     31.22
-----------+------------------------------------------------------------------+----------
     Total |       100         47         30          7          4          1 |       189


Relationship of LBW with continuous variables: age of mother and weight of mother

Univariately, LBW is possibly related to mother's age (OR 0.95 per year, P = 0.105) and to mother's weight (OR 0.986 per pound, P = 0.023)
Need to consider setting aside an internal validation sample
Not much data, so use 75% for training and 25% for validation


. logistic low age

Logistic regression                               Number of obs   =        189
                                                  LR chi2(1)      =       2.76
                                                  Prob > chi2     =     0.0966
Log likelihood = -115.95598                       Pseudo R2       =     0.0118

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .9501333   .0299423    -1.62   0.105     .8932232    1.010669
------------------------------------------------------------------------------

. logistic low lwt

Logistic regression                               Number of obs   =        189
                                                  LR chi2(1)      =       5.98
                                                  Prob > chi2     =     0.0145
Log likelihood = -114.34533                       Pseudo R2       =     0.0255

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |   .9860401   .0060834    -2.28   0.023     .9741886    .9980358
------------------------------------------------------------------------------


Generate training and validation samples

Generate a random number U, uniform on the interval (0,1)
Assign an observation to the training sample if U < 0.75 and to the validation sample if U ≥ 0.75

This will not guarantee that exactly 75% of the observations or cases are in the training sample
If you want exactly 75% of the observations, sort by the random number and assign the first .75*n to the training sample
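The two assignment rules can be sketched as follows. This is a minimal illustration in Python rather than Stata (so that it runs without the lbw.dta file); the function names and the seed value 536 are assumptions made for the example, and the random draws will not match the ones in the Stata session.

```python
import random

def split_by_threshold(n, p=0.75, seed=536):
    """Bernoulli assignment: observation i is training when U_i < p."""
    rng = random.Random(seed)            # fixed seed (assumed) for reproducibility
    return [rng.random() < p for _ in range(n)]

def split_by_sorting(n, p=0.75, seed=536):
    """Fixed-size assignment: sort by U and keep the ranks below p*n."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(n)]
    order = sorted(range(n), key=lambda i: u[i])
    train = [False] * n
    for rank, i in enumerate(order, start=1):   # rank plays the role of Stata's _n
        train[i] = rank < p * n
    return train

n = 189
approx = split_by_threshold(n)
exact = split_by_sorting(n)
print(sum(approx), "of", n, "in training (depends on the seed)")
print(sum(exact), "of", n, "in training (141 whenever n = 189 and p = 0.75)")
```

Neither rule stratifies on the outcome, so the share of LBW cases in the training sample still varies from draw to draw.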


. gen random=uniform()

. gen training=(random<0.75)

. tabulate training low , row col

           |          LOW
  training |         0          1 |     Total
-----------+----------------------+----------
         0 |        28         13 |        41
           |     68.29      31.71 |    100.00
           |     21.54      22.03 |     21.69
-----------+----------------------+----------
         1 |       102         46 |       148
           |     68.92      31.08 |    100.00
           |     78.46      77.97 |     78.31
-----------+----------------------+----------
     Total |       130         59 |       189
           |     68.78      31.22 |    100.00


Sorting by the random number (variable train) gets closer to 75% of the observations (141 of 189, 74.6%), but still does not put 75% of the cases in the training set (42 of 59, 71.2%)
Just use the original classification (variable training) in the analysis below
Still need to consider how to model the continuous variables age and weight without being too complex

Could just use linear age and weight terms
Could categorize into age groups and weight groups
Could use a simple polynomial (e.g. age and age squared)
Could use a smoother or a spline
Could use fractional polynomials


. sort random

. gen train=( _n < (.75*189))

. tabulate train low , col

           |          LOW
     train |         0          1 |     Total
-----------+----------------------+----------
         0 |        31         17 |        48
           |     23.85      28.81 |     25.40
-----------+----------------------+----------
         1 |        99         42 |       141
           |     76.15      71.19 |     74.60
-----------+----------------------+----------
     Total |       130         59 |       189


Fractional polynomials

First consider a degree-2 fractional polynomial model for age

The degree-2 model is not significantly better than age as a linear term (P = 0.891), so just use age as a linear variable


. fracpoly logistic low age if training==1, compare

-> gen double Iage__1 = X^-2-.1836025735 if e(sample)
-> gen double Iage__2 = X^3-12.71106248 if e(sample)
   (where: X = age/10)

Logistic regression                               Number of obs   =        148
                                                  LR chi2(2)      =       4.51
                                                  Prob > chi2     =     0.1050
Log likelihood = -89.468565                       Pseudo R2       =     0.0246

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     Iage__1 |   8.281937   26.74684     0.65   0.513     .0147619    4646.447
     Iage__2 |   .9803986   .0314943    -0.62   0.538     .9205741    1.044111
------------------------------------------------------------------------------
Deviance: 178.94. Best powers of age among 44 models fit: -2 3.

Fractional polynomial model comparisons:
---------------------------------------------------------------
age             df    Deviance    Dev. dif.   P [*]    Powers
---------------------------------------------------------------
Not in model     0     183.445      4.508     0.342
Linear           1     179.560      0.623     0.891    1
m = 1            2     179.350      0.413     0.813    3
m = 2            4     178.937        --       --      -2 3
---------------------------------------------------------------
[*] P-value from deviance difference comparing reported model with m = 2 model

. fracplot
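The P-values in the comparison table are chi-square tail probabilities of the deviance differences, with degrees of freedom equal to the difference in df from the m = 2 model. A minimal Python sketch that re-derives them from the printed deviances is given below; the helper chi2_sf is written for this example (closed-form series for integer df) and is not part of Stata or of the lecture.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # odd df: erfc(sqrt(x/2)) + sqrt(2/pi) exp(-x/2) * sum_k x^(k-1/2) / (1*3*...*(2k-1))
    term, total = math.sqrt(x), 0.0
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= x / (2 * k - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total

# Deviance of the m = 2 model is -2 times its log likelihood.
deviance_m2 = -2 * -89.468565

# (model, deviance difference from m = 2, difference in df), copied from the table
comparisons = [("Not in model", 4.508, 4), ("Linear", 0.623, 3), ("m = 1", 0.413, 2)]
p_values = {name: chi2_sf(diff, df) for name, diff, df in comparisons}

print(f"deviance (m = 2) = {deviance_m2:.3f}")
for name, p in p_values.items():
    print(f"{name:<12s} P = {p:.3f}")
```

The same calculation applies to any of the likelihood ratio comparisons in these notes, since each is a deviance difference referred to a chi-square distribution.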


Fractional polynomials

Plot of the degree-2 model for age

[Figure: fracplot for age. Y axis: Predictor + residual of low (-6 to 2). X axis: AGE (10 to 50). Title: Fractional Polynomial (-2 3)]


Fractional polynomials

Now consider a degree-2 fractional polynomial model for weight

The degree-2 model is not significantly better than weight as a linear term (P = 0.771), so just use weight as a linear variable


. fracpoly logistic low lwt if training==1, compare

-> gen double Ilwt__1 = X^-2-.5847657897 if e(sample)
-> gen double Ilwt__2 = X^3-2.236284553 if e(sample)
   (where: X = lwt/100)

Logistic regression                               Number of obs   =        148
                                                  LR chi2(2)      =       7.57
                                                  Prob > chi2     =     0.0227
Log likelihood = -87.938036                       Pseudo R2       =     0.0413

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     Ilwt__1 |   5.387833   6.286794     1.44   0.149     .5472547    53.04431
     Ilwt__2 |   .9533033   .1428405    -0.32   0.750     .7107048    1.278713
------------------------------------------------------------------------------
Deviance: 175.88. Best powers of lwt among 44 models fit: -2 3.

Fractional polynomial model comparisons:
---------------------------------------------------------------
lwt             df    Deviance    Dev. dif.   P [*]    Powers
---------------------------------------------------------------
Not in model     0     183.445      7.569     0.109
Linear           1     177.000      1.124     0.771    1
m = 1            2     175.985      0.109     0.947    -2
m = 2            4     175.876        --       --      -2 3
---------------------------------------------------------------
[*] P-value from deviance difference comparing reported model with m = 2 model

. fracplot


Fractional polynomials

Plot of the degree-2 model for weight

[Figure: fracplot for weight. Y axis: Predictor + residual of low (-6 to 2). X axis: LWT (50 to 250). Title: Fractional Polynomial (-2 3)]


Model exploration (stepwise)

Try to screen for possible predictors (forward stepwise)

ptl is included as a linear term; may want to dichotomize
Other race may be excluded due to sample size rather than magnitude
Create indicator covariates so the race indicators can be entered or removed as a group


. xi: sw logistic low age lwt i.race smoke ptl ht ui ftv if training==1 , forw pe(.10) pr(.20) lr
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
LR test                    begin with empty model
p = 0.0031 <  0.1000       adding   ptl
p = 0.0343 <  0.1000       adding   lwt
p = 0.0207 <  0.1000       adding   ht
p = 0.0628 <  0.1000       adding   _Irace_2

Logistic regression                               Number of obs   =        148
                                                  LR chi2(4)      =      22.04
                                                  Prob > chi2     =     0.0002
Log likelihood =  -80.70039                       Pseudo R2       =     0.1202

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ptl |   2.719407   1.155793     2.35   0.019     1.182223    6.255311
         lwt |    .978786   .0079106    -2.65   0.008     .9634037     .994414
          ht |   5.185071   3.886264     2.20   0.028     1.193357    22.52885
    _Irace_2 |   2.635247   1.366314     1.87   0.062      .953879    7.280304
------------------------------------------------------------------------------

. gen raceblk=(race==2)

. gen raceoth=(race==3)

. gen everptl=(ptl>0)


Model exploration (stepwise)

Refit the model with the two race indicators grouped

Now other race is significant, and smoke is also added to the model
Backwards stepwise gives the same result (not shown)
Still may want to dichotomize ptl and put that in the model


. xi: sw logistic low age lwt ( raceblk raceoth ) smoke ptl ht ui ftv if training==1 , forw pe(.10) pr(.20) lr
LR test                    begin with empty model
p = 0.0031 <  0.1000       adding   ptl
p = 0.0343 <  0.1000       adding   lwt
p = 0.0207 <  0.1000       adding   ht
p = 0.0907 <  0.1000       adding   raceblk raceoth
p = 0.0180 <  0.1000       adding   smoke

Logistic regression                               Number of obs   =        148
                                                  LR chi2(6)      =      28.98
                                                  Prob > chi2     =     0.0001
Log likelihood = -77.233355                       Pseudo R2       =     0.1580

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ptl |     2.3882   1.030745     2.02   0.044     1.024927    5.564788
         lwt |   .9806299   .0079809    -2.40   0.016     .9651118    .9963976
          ht |   5.066391   3.798747     2.16   0.030     1.165396    22.02542
     raceblk |   4.101805   2.386227     2.43   0.015      1.31156    12.82809
     raceoth |   2.773365   1.403543     2.02   0.044     1.028564    7.477953
       smoke |   2.903573   1.342785     2.30   0.021     1.172969    7.187519
------------------------------------------------------------------------------


Model exploration

Refit model replacing ptl with everptl

Same number of parameters; the second model is "better" (log likelihood -76.73 versus -77.23) and simpler
Look at some goodness-of-fit tests

The Pearson test is not reliable here since the number of observations (148) is too close to the number of covariate patterns (123)


. logistic low ht lwt raceblk raceoth smoke everptl if training==1

Logistic regression                               Number of obs   =        148
                                                  LR chi2(6)      =      29.99
                                                  Prob > chi2     =     0.0000
Log likelihood =  -76.72783                       Pseudo R2       =     0.1635

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ht |   4.979157    3.77316     2.12   0.034     1.127516    21.98815
         lwt |   .9804978   .0080525    -2.40   0.016     .9648416    .9964081
     raceblk |   3.845191   2.251793     2.30   0.021     1.220235    12.11693
     raceoth |   2.612055   1.333204     1.88   0.060     .9605622    7.102957
       smoke |    2.74314   1.284086     2.16   0.031     1.095957    6.865979
     everptl |   3.442278   1.890798     2.25   0.024     1.172984    10.10182
------------------------------------------------------------------------------

. lfit

Logistic model for low, goodness-of-fit test

       number of observations =       148
 number of covariate patterns =       123
            Pearson chi2(116) =    113.28
                  Prob > chi2 =    0.5540


Model diagnostics

Use the Hosmer-Lemeshow goodness-of-fit test and calculate the c-statistic (lroc)

Model predicts pretty well for the training sample; also need to consider the validation sample
Compute estimated probabilities and logits for both samples and compute predictive power


. lfit , group(10)

Logistic model for low, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)

       number of observations =       148
             number of groups =        10
      Hosmer-Lemeshow chi2(8) =      7.75
                  Prob > chi2 =    0.4579

. lroc

Logistic model for low

number of observations =      148
area under ROC curve   =   0.7644

[Figure: ROC curve for the training sample. Y axis: Sensitivity (0.00 to 1.00). X axis: 1 - Specificity (0.00 to 1.00). Area under ROC curve = 0.7644]


Model diagnostics

Discrimination in the validation sample is still reasonable (area under the ROC curve 0.70), but lower than in the training sample (0.76)


. predict prob, pr

. predict xb, xb

. sum xb prob

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          xb |       189   -.9540967    1.111488   -3.59465   2.117284
        prob |       189     .315882    .2046601   .0267359   .8925718

. roctab low prob if training==1

                      ROC                    -Asymptotic Normal--
           Obs       Area     Std. Err.      [95% Conf. Interval]
         --------------------------------------------------------
           148     0.7644       0.0422        0.68168     0.84709

. roctab low xb if training==1

                      ROC                    -Asymptotic Normal--
           Obs       Area     Std. Err.      [95% Conf. Interval]
         --------------------------------------------------------
           148     0.7644       0.0422        0.68168     0.84709

. roctab low prob if training==0, graph

                      ROC                    -Asymptotic Normal--
           Obs       Area     Std. Err.      [95% Conf. Interval]
         --------------------------------------------------------
            41     0.6951       0.0937        0.51131     0.87880
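The logit (xb) and the estimated probability (prob) carry the same ranking information, which is why roctab gives the same area for both. A small Python sketch of the transformation, checked against the minimum and maximum in the summary table above (the function name inv_logit is chosen for this example):

```python
import math

def inv_logit(xb):
    """Convert a linear predictor (log odds) into a probability."""
    return 1.0 / (1.0 + math.exp(-xb))

xb_min, xb_mean, xb_max = -3.59465, -0.9540967, 2.117284

p_min = inv_logit(xb_min)        # should reproduce the minimum of prob
p_max = inv_logit(xb_max)        # should reproduce the maximum of prob
p_at_mean_xb = inv_logit(xb_mean)

print(f"prob at min xb  = {p_min:.7f}")
print(f"prob at max xb  = {p_max:.7f}")
print(f"prob at mean xb = {p_at_mean_xb:.4f}")
```

Because the transformation is monotone but nonlinear, the probability at the mean logit is not the mean probability (.315882 in the table), while the ordering of observations, and hence the ROC curve, is unchanged.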

[Figure: ROC curve for the validation sample. Y axis: Sensitivity (0.00 to 1.00). X axis: 1 - Specificity (0.00 to 1.00). Area under ROC curve = 0.6951]


Model prediction

Look at classification statistics in the training and validation samples

Are there any risk factors so high that LBW babies are probable?


. estat class if training==1

Logistic model for low

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |        19             8  |         27
     -     |        27            94  |        121
-----------+--------------------------+-----------
   Total   |        46           102  |        148

Classified + if predicted Pr(D) >= .5
True D defined as low != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   41.30%
Specificity                     Pr( -|~D)   92.16%
Positive predictive value       Pr( D| +)   70.37%
Negative predictive value       Pr(~D| -)   77.69%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)    7.84%
False - rate for true D         Pr( -| D)   58.70%
False + rate for classified +   Pr(~D| +)   29.63%
False - rate for classified -   Pr( D| -)   22.31%
--------------------------------------------------
Correctly classified                        76.35%
--------------------------------------------------

. estat class if training==0

Logistic model for low

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |         6             4  |         10
     -     |         7            24  |         31
-----------+--------------------------+-----------
   Total   |        13            28  |         41

Classified + if predicted Pr(D) >= .5
True D defined as low != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   46.15%
Specificity                     Pr( -|~D)   85.71%
Positive predictive value       Pr( D| +)   60.00%
Negative predictive value       Pr(~D| -)   77.42%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   14.29%
False - rate for true D         Pr( -| D)   53.85%
False + rate for classified +   Pr(~D| +)   40.00%
False - rate for classified -   Pr( D| -)   22.58%
--------------------------------------------------
Correctly classified                        73.17%
--------------------------------------------------
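Every percentage in the two classification tables follows from the four cell counts. A short Python sketch that recomputes them (the function and key names are chosen for this example):

```python
def classification_metrics(tp, fp, fn, tn):
    """Summary measures for a 2x2 table of classified (+/-) by true (D/~D) status."""
    return {
        "sensitivity": tp / (tp + fn),          # Pr(+ | D)
        "specificity": tn / (tn + fp),          # Pr(- | ~D)
        "ppv": tp / (tp + fp),                  # Pr(D | +)
        "npv": tn / (tn + fn),                  # Pr(~D | -)
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Cell counts copied from the two tables (cutoff: predicted Pr(D) >= .5)
training = classification_metrics(tp=19, fp=8, fn=27, tn=94)
validation = classification_metrics(tp=6, fp=4, fn=7, tn=24)

for name, m in (("training", training), ("validation", validation)):
    print(name, {k: f"{100 * v:.2f}%" for k, v in m.items()})
```

At the 0.5 cutoff sensitivity is below 50% in both samples, so the model misses more than half of the LBW babies even though overall accuracy is about 75%.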


Model prediction

Consider smoking as a risk factor

35% of smokers are predicted as LBW (but they have many other diverse risk factors)
Get estimated probability of LBW by weight and smoking, with no other elevated risk factors


. gen predlow=(prob>=.5)

. tabulate predlow smoke , col

           |         SMOKE
   predlow |         0          1 |     Total
-----------+----------------------+----------
         0 |       104         48 |       152
           |     90.43      64.86 |     80.42
-----------+----------------------+----------
         1 |        11         26 |        37
           |      9.57      35.14 |     19.58
-----------+----------------------+----------
     Total |       115         74 |       189

. twoway (connected prob lwt if ht==0 & raceblk==0 & raceoth==0 & smoke==0 & everptl==0, sort) (connected prob lwt if ht==0 & raceblk==0 & raceoth==0 & smoke==1 & everptl==0, sort), legend(order(1 "Non-smoker" 2 "Smoker"))


Model prediction

Higher risk for smokers with low weight, but a smoker would still need to have another risk factor for the probability of a low birthweight baby to exceed 50%
Consider hypertension as well

[Figure: estimated Pr(low) (0 to .4) against LWT (100 to 250) for women with no other risk factors; two connected series: Non-smoker and Smoker]


Model prediction

Would have to have several risk factors to be at high risk
We did not consider interactions, but they may help only slightly in improving overall model prediction


. twoway (connected prob lwt if ht==0 & raceblk==0 & raceoth==0 & smoke==1 & everptl==0, sort) (connected prob lwt if ht==1 & raceblk==0 & raceoth==0 & smoke==1 & everptl==0, sort), legend(order(1 "Non-hypertensive smoker" 2 "Hypertensive smoker"))

. twoway (connected xb lwt if ht==0 & raceblk==0 & raceoth==0 & smoke==1 & everptl==0, sort) (connected xb lwt if ht==1 & raceblk==0 & raceoth==0 & smoke==1 & everptl==0, sort), legend(order(1 "Non-hypertensive smoker" 2 "Hypertensive smoker"))

[Figure: estimated Pr(low) (0 to .6) against LWT (100 to 250); two connected series: Non-hypertensive smoker and Hypertensive smoker]

[Figure: Linear prediction (-3 to 1) against LWT (100 to 250); two connected series: Non-hypertensive smoker and Hypertensive smoker]


Model prediction

Do the same factors also predict actual birth weight in grams in a linear regression model?


. logistic low ht lwt raceblk raceoth smoke everptl

Logistic regression                               Number of obs   =        189
                                                  LR chi2(6)      =      34.19
                                                  Prob > chi2     =     0.0000
Log likelihood = -100.24113                       Pseudo R2       =     0.1457

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ht |   5.855858   4.148348     2.49   0.013       1.4608    23.47417
         lwt |   .9834105   .0068346    -2.41   0.016     .9701058    .9968976
     raceblk |   3.538576    1.87307     2.39   0.017     1.253901    9.986056
     raceoth |   2.373051     1.0325     1.99   0.047     1.011473    5.567492
       smoke |   2.401531   .9623179     2.19   0.029     1.094972    5.267121
     everptl |   3.426148   1.528933     2.76   0.006     1.428743    8.215961
------------------------------------------------------------------------------

. regress bwt ht lwt raceblk raceoth smoke everptl

      Source |       SS       df       MS              Number of obs =     189
-------------+------------------------------           F(  6,   182) =    7.42
       Model |  19637424.4     6  3272904.07           Prob > F      =  0.0000
    Residual |  80279628.2   182  441096.858           R-squared     =  0.1965
-------------+------------------------------           Adj R-squared =  0.1700
       Total |  99917052.6   188  531473.684           Root MSE      =  664.15

------------------------------------------------------------------------------
         bwt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ht |  -512.0872   204.6527    -2.50   0.013    -915.8842   -108.2902
         lwt |   4.620609   1.719624     2.69   0.008     1.227646    8.013572
     raceblk |  -470.8198   149.8334    -3.14   0.002    -766.4537   -175.1859
     raceoth |  -353.7262   115.8004    -3.05   0.003    -582.2101   -125.2424
       smoke |  -339.4116   108.2637    -3.14   0.002    -553.0251   -125.7982
     everptl |  -286.6176   135.9999    -2.11   0.036    -554.9569   -18.27832
       _cons |   2745.896   248.9675    11.03   0.000     2254.662     3237.13
------------------------------------------------------------------------------
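The linear regression coefficients can be read as gram differences in expected birth weight. As a worked illustration in Python, the sketch below evaluates the fitted equation for a hypothetical white mother weighing 130 pounds with no hypertension or prior pre-term labor, once as a non-smoker and once as a smoker; the covariate profile is chosen for the example and is not taken from the lecture.

```python
# Coefficients copied from the regress output (grams)
COEF = {
    "_cons": 2745.896,
    "ht": -512.0872,
    "lwt": 4.620609,
    "raceblk": -470.8198,
    "raceoth": -353.7262,
    "smoke": -339.4116,
    "everptl": -286.6176,
}

def predicted_bwt(**covariates):
    """Fitted birth weight in grams; covariates that are not given default to 0."""
    total = COEF["_cons"]
    for name, value in covariates.items():
        total += COEF[name] * value
    return total

nonsmoker = predicted_bwt(lwt=130)           # hypothetical profile
smoker = predicted_bwt(lwt=130, smoke=1)     # same profile, smoker

print(f"non-smoker: {nonsmoker:.1f} g")
print(f"smoker:     {smoker:.1f} g")
print(f"difference: {smoker - nonsmoker:.1f} g")
```

Both fitted values are well above 2500 g, which mirrors the logistic results: a single risk factor shifts the expected weight but does not by itself make a low birthweight baby the likely outcome.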


Association example

Same dataset

Now interested in whether smoking is related to low birthweight

Assume this has not been tested before, but some animal models suggest a causal relationship

May want to control for factors believed to be related to LBW and/or smoking

Other variables are not of particular interest, but smoking may be a modifiable risk factor

This specific hypothesis is proposed prior to data collection so use all the available data

Unadjusted odds ratio for smoking suggests a possible association
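The unadjusted odds ratio can be computed by hand from the low by smoke table shown earlier (30 of 74 smokers and 29 of 115 non-smokers had LBW babies). The Python sketch below does so and adds a log-scale Wald (Woolf) interval, which for a single binary covariate coincides with the interval that logistic regression reports; the function name is chosen for this example.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.959964):
    """Odds ratio (a*d)/(b*c) with a Woolf (log-scale Wald) 95% interval.

    a = exposed cases,   b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases.
    """
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_hat) - z * se_log)
    hi = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lo, hi

# Smokers: 30 LBW, 44 not LBW; non-smokers: 29 LBW, 86 not LBW
or_hat, lo, hi = odds_ratio_ci(a=30, b=44, c=29, d=86)
print(f"OR = {or_hat:.4f}, 95% CI {lo:.3f} to {hi:.3f}")
```

The odds of a LBW baby are about twice as high for smokers, and the interval excludes 1.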


Potential confounders or predictors of the outcome

May want to consider other variables that are potential confounders for the relationship of smoking with LBW
May also want to consider other variables that predict the outcome even if they are not confounders, since precision for the smoking variable may be improved
What is the relationship of smoking to some of the other variables in the data?

White mothers are much more likely to smoke (54%, versus 38% of Black mothers and 18% of mothers of other race), so race could be a potential confounder
Consider some of the other covariates


. tabulate smoke race , col

           |               RACE
     SMOKE |         1          2          3 |     Total
-----------+---------------------------------+----------
         0 |        44         16         55 |       115
           |     45.83      61.54      82.09 |     60.85
-----------+---------------------------------+----------
         1 |        52         10         12 |        74
           |     54.17      38.46      17.91 |     39.15
-----------+---------------------------------+----------
     Total |        96         26         67 |       189


Potential confounders or predictors of the outcome


. tabulate smoke everptl , col

           |        everptl
     SMOKE |         0          1 |     Total
-----------+----------------------+----------
         0 |       103         12 |       115
           |     64.78      40.00 |     60.85
-----------+----------------------+----------
         1 |        56         18 |        74
           |     35.22      60.00 |     39.15
-----------+----------------------+----------
     Total |       159         30 |       189

. tabulate smoke ht , col

           |          HT
     SMOKE |         0          1 |     Total
-----------+----------------------+----------
         0 |       108          7 |       115
           |     61.02      58.33 |     60.85
-----------+----------------------+----------
         1 |        69          5 |        74
           |     38.98      41.67 |     39.15
-----------+----------------------+----------
     Total |       177         12 |       189

. tabulate smoke ui , col

           |          UI
     SMOKE |         0          1 |     Total
-----------+----------------------+----------
         0 |       100         15 |       115
           |     62.11      53.57 |     60.85
-----------+----------------------+----------
         1 |        61         13 |        74
           |     37.89      46.43 |     39.15
-----------+----------------------+----------
     Total |       161         28 |       189


Potential confounders or predictors of the outcome

Explore continuous covariates as well


. ttest age , by(smoke)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |     115    23.42609    .5098662    5.467706    22.41605    24.43613
       1 |      74    22.94595    .5867511    5.047424    21.77655    24.11534
---------+--------------------------------------------------------------------
combined |     189     23.2381    .3854221    5.298678    22.47779     23.9984
---------+--------------------------------------------------------------------
    diff |             .480141    .7909778               -1.080245    2.040528
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   0.6070
Ho: diff = 0                                     degrees of freedom =      187

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.7277         Pr(|T| > |t|) = 0.5446          Pr(T > t) = 0.2723

. ttest lwt , by(smoke)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |     115    130.8957    2.650831      28.427    125.6444    136.1469
       1 |      74    128.1351    3.927628    33.78673    120.3074    135.9629
---------+--------------------------------------------------------------------
combined |     189    129.8148    2.224323    30.57938     125.427    134.2027
---------+--------------------------------------------------------------------
    diff |            2.760517    4.564873               -6.244749    11.76578
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   0.6047
Ho: diff = 0                                     degrees of freedom =      187

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.7270         Pr(|T| > |t|) = 0.5461          Pr(T > t) = 0.2730
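The equal-variance t statistics can be rebuilt from the group sizes, means and standard deviations in the output. A minimal Python sketch of the pooled calculation (the function name is chosen for this example):

```python
import math

def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Two-sample t test with equal variances, from summary statistics.

    Returns (difference in means, its standard error, t statistic, df).
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff, se, diff / se, df

# Non-smokers (group 0) versus smokers (group 1), values from the ttest output
age = pooled_t(115, 23.42609, 5.467706, 74, 22.94595, 5.047424)
lwt = pooled_t(115, 130.8957, 28.427, 74, 128.1351, 33.78673)

for name, (diff, se, t, df) in (("age", age), ("lwt", lwt)):
    print(f"{name}: diff = {diff:.4f}, se = {se:.4f}, t = {t:.4f}, df = {df}")
```

Both t statistics are near 0.6, so smokers and non-smokers have similar mean age and weight in this sample, and neither variable is likely to confound the smoking association strongly.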


Control for age and race


. logistic low smoke

Logistic regression                               Number of obs   =        189
                                                  LR chi2(1)      =       4.87
                                                  Prob > chi2     =     0.0274
Log likelihood =  -114.9023                       Pseudo R2       =     0.0207

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   2.021944   .6462912     2.20   0.028     1.080668    3.783083
------------------------------------------------------------------------------

. logistic low smoke age

Logistic regression                               Number of obs   =        189
                                                  LR chi2(2)      =       7.40
                                                  Prob > chi2     =     0.0248
Log likelihood = -113.63815                       Pseudo R2       =     0.0315

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   1.997405    .642777     2.15   0.032     1.063027    3.753081
         age |   .9514394   .0304194    -1.56   0.119     .8936481    1.012968
------------------------------------------------------------------------------

. xi: logistic low smoke age i.race
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(4)      =      15.81
                                                  Prob > chi2     =     0.0033
Log likelihood =  -109.4311                       Pseudo R2       =     0.0674

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |    3.00582   1.117995     2.96   0.003     1.449987    6.231058
         age |   .9657186   .0322571    -1.04   0.296     .9045209    1.031057
    _Irace_2 |   2.749483   1.356656     2.05   0.040     1.045321    7.231905
    _Irace_3 |   2.876948   1.167915     2.60   0.009     1.298319    6.375035
------------------------------------------------------------------------------


Control for other factors

Age is neither a predictor nor a confounder, but leave it in the model anyway
Race is both a predictor and a confounder of the association of smoking and low birthweight (the smoking OR rises from 2.00 to 3.01 when race is added)
Now consider some other potential confounders/predictors

Weight may also be a confounder and a predictor
Add hypertension and uterine irritability to the model


. xi: logistic low smoke age i.race lwt
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(5)      =      20.09
                                                  Prob > chi2     =     0.0012
Log likelihood = -107.28862                       Pseudo R2       =     0.0856

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   2.870363   1.090738     2.77   0.006     1.362952    6.044959
         age |   .9777725    .033411    -0.66   0.511     .9144329    1.045499
    _Irace_2 |   3.426952   1.772255     2.38   0.017     1.243677    9.442967
    _Irace_3 |   2.568347   1.069029     2.27   0.023     1.135942    5.806992
         lwt |   .9875525   .0063063    -1.96   0.050     .9752693    .9999903
------------------------------------------------------------------------------


Control for other factors

Hypertension and uterine irritability are predictors, but not confounders (the smoking OR changes only from 2.87 to 2.79)
The only other potentially modifiable risk factor and/or confounder might be the number of first trimester prenatal visits
Conduct an unplanned exploratory analysis of this variable and the outcome


. xi: logistic low smoke age i.race lwt ht ui
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(7)      =      30.72
                                                  Prob > chi2     =     0.0001
Log likelihood = -101.97403                       Pseudo R2       =     0.1309

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   2.794269   1.100761     2.61   0.009     1.291071    6.047646
         age |   .9819096   .0347149    -0.52   0.606     .9161736    1.052362
    _Irace_2 |   3.598944    1.89556     2.43   0.015     1.281882     10.1042
    _Irace_3 |   2.464232   1.070381     2.08   0.038     1.051835    5.773185
         lwt |   .9838469   .0067479    -2.37   0.018     .9707098    .9971617
          ht |   6.408447   4.414475     2.70   0.007     1.661118    24.72323
          ui |   2.448283   1.098045     2.00   0.046     1.016485    5.896877
------------------------------------------------------------------------------

. logistic low ftv

Logistic regression                               Number of obs   =        189
                                                  LR chi2(1)      =       0.77
                                                  Prob > chi2     =     0.3792
Log likelihood = -116.94943                       Pseudo R2       =     0.0033

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ftv |   .8736112   .1368936    -0.86   0.389     .6425932    1.187682
------------------------------------------------------------------------------


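Control for other factors

Not a strong predictor as a linear (count) variable (OR 0.87 per visit, P = 0.389)
Consider as a categorical variable

Runs into numerical issues due to sparseness of the data: the one woman with 6 visits did not have a LBW baby, so that category predicts the outcome perfectly and is dropped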


. tabulate low ftv , col

           |                                FTV
       LOW |         0          1          2          3          4          6 |     Total
-----------+------------------------------------------------------------------+----------
         0 |        64         36         23          3          3          1 |       130
           |     64.00      76.60      76.67      42.86      75.00     100.00 |     68.78
-----------+------------------------------------------------------------------+----------
         1 |        36         11          7          4          1          0 |        59
           |     36.00      23.40      23.33      57.14      25.00       0.00 |     31.22
-----------+------------------------------------------------------------------+----------
     Total |       100         47         30          7          4          1 |       189

. xi: logistic low i.ftv
i.ftv             _Iftv_0-6           (naturally coded; _Iftv_0 omitted)

note: _Iftv_6 != 0 predicts failure perfectly
      _Iftv_6 dropped and 1 obs not used

Logistic regression                               Number of obs   =        188
                                                  LR chi2(4)      =       5.43
                                                  Prob > chi2     =     0.2455
Log likelihood = -114.24311                       Pseudo R2       =     0.0232

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Iftv_1 |   .5432099   .2186982    -1.52   0.130     .2467578    1.195816
     _Iftv_2 |   .5410628   .2593369    -1.28   0.200     .2114746    1.384322
     _Iftv_3 |    2.37037   1.876543     1.09   0.276     .5022828    11.18624
     _Iftv_4 |   .5925926    .695315    -0.45   0.656     .0594298    5.908924
------------------------------------------------------------------------------


Control for other factors

Collapse into three levels

Not a strong predictor – return to model for smoking, but add this as a potential confounder


. gen ftvgrp=(ftv>0)+(ftv>1)

. tabulate ftv ftvgrp

           |              ftvgrp
       FTV |         0          1          2 |     Total
-----------+---------------------------------+----------
         0 |       100          0          0 |       100
         1 |         0         47          0 |        47
         2 |         0          0         30 |        30
         3 |         0          0          7 |         7
         4 |         0          0          4 |         4
         6 |         0          0          1 |         1
-----------+---------------------------------+----------
     Total |       100         47         42 |       189

. xi: logistic low i.ftvgrp
i.ftvgrp          _Iftvgrp_0-2        (naturally coded; _Iftvgrp_0 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(2)      =       2.59
                                                  Prob > chi2     =     0.2743
Log likelihood = -116.04255                       Pseudo R2       =     0.0110

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Iftvgrp_1 |   .5432099   .2186982    -1.52   0.130     .2467578    1.195816
  _Iftvgrp_2 |   .7111111   .2845062    -0.85   0.394     .3246258    1.557729
------------------------------------------------------------------------------
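The expression `(ftv>0)+(ftv>1)` works because each comparison evaluates to 0 or 1, so the sum is 0 for no visits, 1 for exactly one visit, and 2 for two or more. A small Python sketch of the same recoding, using the FTV frequencies from the table above (the function name `ftv_group` is illustrative, not from the lecture):

```python
from collections import Counter

# FTV frequencies copied from the `tabulate ftv ftvgrp` output
ftv_freq = {0: 100, 1: 47, 2: 30, 3: 7, 4: 4, 6: 1}


def ftv_group(ftv):
    # Mirrors the Stata expression (ftv>0)+(ftv>1): each comparison is 0 or 1
    return int(ftv > 0) + int(ftv > 1)


group_freq = Counter()
for ftv, n in ftv_freq.items():
    group_freq[ftv_group(ftv)] += n

print(dict(group_freq))
```

One caveat when reusing this idiom in Stata: missing numeric values compare as larger than any number, so a missing `ftv` would be coded into the top group. That does not matter here because all 189 observations have `ftv` recorded.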


Control for other factors

Small change in the OR for smoking – leave in anyway

Model shows no obvious lack-of-fit


. xi: logistic low smoke age i.race lwt ht ui i.ftvgrp
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
i.ftvgrp          _Iftvgrp_0-2        (naturally coded; _Iftvgrp_0 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(9)      =      31.09
                                                  Prob > chi2     =     0.0003
Log likelihood = -101.79203                       Pseudo R2       =     0.1325

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   2.713932   1.097076     2.47   0.014     1.228882    5.993598
         age |   .9826152   .0357058    -0.48   0.629      .915067     1.05515
    _Irace_2 |   3.588344   1.899526     2.41   0.016     1.271458    10.12712
    _Irace_3 |   2.407264   1.070489     1.98   0.048     1.006936    5.755001
         lwt |   .9834105   .0067902    -2.42   0.015     .9701917    .9968095
          ht |   6.676087   4.675312     2.71   0.007     1.692074    26.34053
          ui |   2.417519   1.087941     1.96   0.050     1.000712    5.840242
  _Iftvgrp_1 |   .8259409   .3736977    -0.42   0.673     .3402688    2.004822
  _Iftvgrp_2 |   1.126281   .5047641     0.27   0.791     .4679128    2.710996
------------------------------------------------------------------------------

. lfit, group(10)

Logistic model for low, goodness-of-fit test

  (Table collapsed on quantiles of estimated probabilities)

       number of observations =       189
             number of groups =        10
      Hosmer-Lemeshow chi2(8) =         7.08
                  Prob > chi2 =         0.5278
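The "small change in the OR for smoking" can be quantified by comparing the smoking odds ratio from the model without `i.ftvgrp` to the one with it. A minimal Python sketch of that change-in-estimate calculation, using the two reported odds ratios (variable names are illustrative):

```python
# Odds ratios for smoke copied from the two fitted models
or_without_ftvgrp = 2.794269  # smoke, age, race, lwt, ht, ui
or_with_ftvgrp = 2.713932     # same model plus i.ftvgrp

pct_change = 100 * (or_with_ftvgrp - or_without_ftvgrp) / or_without_ftvgrp
print(f"Change in smoking OR after adding ftvgrp: {pct_change:.1f}%")
```

A shift of roughly 3% is small relative to the width of the confidence interval for the smoking odds ratio, which is consistent with the slide's description of `ftvgrp` as a weak confounder that is retained anyway.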


Unadjusted and Adjusted Odds Ratios

Suggests an association of smoking and low birth weight that remains after adjustment for age, race, weight of the mother, hypertension, uterine irritability, and number of physician visits

LR test for smoking in the final model

History of premature labor was accidentally omitted from this analysis but should have been included


                                            OR for smoking   95% CI          Wald p-value
Unadjusted                                  2.02             (1.08, 3.78)    .028
Adjusted for age and race                   3.01             (1.45, 6.23)    .003
Adjusted for age and race & other factors   2.71             (1.23, 5.99)    .014

. lrtest A B

Likelihood-ratio test                                  LR chi2(1)  =      6.32
(Assumption: B nested in A)                            Prob > chi2 =    0.0119
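The Wald interval and p-value in the last row of the table can be rebuilt from the odds ratio and its standard error in the fully adjusted model. The sketch below is a Python check rather than Stata code. It relies on the fact that Stata reports the standard error on the odds-ratio scale via the delta method, so the standard error of the log odds ratio is SE(OR) / OR.

```python
import math
from statistics import NormalDist

# Smoking row of the fully adjusted model (values copied from the Stata output)
odds_ratio = 2.713932
se_or = 1.097076  # standard error reported on the odds-ratio scale

log_or = math.log(odds_ratio)
se_log_or = se_or / odds_ratio  # delta method: SE(log OR) = SE(OR) / OR

# 95% Wald interval is built on the log scale, then exponentiated
z_crit = NormalDist().inv_cdf(0.975)
ci_low = math.exp(log_or - z_crit * se_log_or)
ci_high = math.exp(log_or + z_crit * se_log_or)

# Two-sided Wald p-value for H0: log OR = 0
z = log_or / se_log_or
wald_p = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"OR {odds_ratio:.2f}, 95% CI ({ci_low:.2f}, {ci_high:.2f}), Wald p = {wald_p:.3f}")
```

The Wald p-value (.014) and the likelihood-ratio p-value (0.0119) test the same hypothesis and need not agree exactly; here they lead to the same conclusion.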


Effect Modification

Significant effect modification would render the entire previous analysis null and void

Some analysts prefer to start with interactions to rule out effect modification before looking at confounding

Add interactions between smoking and each covariate and test in a LR test

If significant, then the interpretation of the association of smoking and low birth weight depends on that effect modifier



Effect Modification with Age?

No apparent effect modification by age


. est store A

. xi: logistic low smoke age i.race lwt ht ui i.ftvgrp i.smoke*age
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
i.ftvgrp          _Iftvgrp_0-2        (naturally coded; _Iftvgrp_0 omitted)
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
i.smoke*age       _IsmoXage_#         (coded as above)

note: _Ismoke_1 dropped because of collinearity
note: age dropped because of collinearity

Logistic regression                               Number of obs   =        189
                                                  LR chi2(10)     =      32.13
                                                  Prob > chi2     =     0.0004
Log likelihood = -101.27162                       Pseudo R2       =     0.1369

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   .4793266    .839849    -0.42   0.675     .0154598    14.86137
         age |   .9479373   .0493236    -1.03   0.304     .8560309    1.049711
    _Irace_2 |   3.213281   1.736413     2.16   0.031      1.11422    9.266725
    _Irace_3 |   2.277838    1.02166     1.84   0.066     .9456721    5.486623
         lwt |   .9836575   .0067341    -2.41   0.016      .970547    .9969452
          ht |   6.934175   4.905822     2.74   0.006     1.732937    27.74641
          ui |   2.624468   1.203619     2.10   0.035     1.068238    6.447844
  _Iftvgrp_1 |   .7780536   .3555842    -0.55   0.583     .3176841    1.905564
  _Iftvgrp_2 |   1.101442   .4955076     0.21   0.830     .4560684     2.66007
 _IsmoXage_1 |   1.078295   .0802661     1.01   0.311     .9319145    1.247669
------------------------------------------------------------------------------

. est store B

. lrtest A B

Likelihood-ratio test                                  LR chi2(1)  =      1.04
(Assumption: A nested in B)                            Prob > chi2 =    0.3076
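In the interaction model the `smoke` row no longer gives an overall smoking odds ratio: it is the smoking odds ratio extrapolated to age 0, and `_IsmoXage_1` multiplies it for each additional year of maternal age. The Python sketch below (with illustrative names) shows what the fitted coefficients imply at ages inside the observed range. Because the interaction is not statistically significant, these age-specific values are for interpretation practice only.

```python
# Estimates copied from the smoke-by-age interaction model
or_smoke_at_age_0 = 0.4793266  # "smoke" row: smoking OR extrapolated to age 0
or_interaction = 1.078295      # "_IsmoXage_1" row: multiplier per year of age


def smoking_or_at(age):
    """Smoking odds ratio implied by the interaction model at a given age."""
    return or_smoke_at_age_0 * or_interaction ** age


for age in (20, 30):
    print(age, round(smoking_or_at(age), 2))
```

This is why the `smoke` odds ratio below 1 in this output is not evidence of a protective effect: no mother in the data is near age 0 (the observed range is 14 to 45).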


Effect Modification with Race?

No apparent effect modification by race

No significant effect modification by any variable


. xi: logistic low smoke age i.race lwt ht ui i.ftvgrp i.smoke*i.race
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
i.ftvgrp          _Iftvgrp_0-2        (naturally coded; _Iftvgrp_0 omitted)
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
i.smoke*i.race    _IsmoXrac_#_#       (coded as above)

note: _Ismoke_1 dropped because of collinearity
note: _Irace_2 dropped because of collinearity
note: _Irace_3 dropped because of collinearity

Logistic regression                               Number of obs   =        189
                                                  LR chi2(11)     =      33.19
                                                  Prob > chi2     =     0.0005
Log likelihood = -100.73884                       Pseudo R2       =     0.1414

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   4.035503   2.578857     2.18   0.029     1.153304    14.12055
         age |    .982677   .0369021    -0.47   0.642     .9129477    1.057732
    _Irace_2 |   3.941254   3.208098     1.68   0.092     .7994186    19.43097
    _Irace_3 |   3.773384   2.415678     2.07   0.038     1.075973    13.23307
         lwt |   .9840367   .0069244    -2.29   0.022     .9705582    .9977023
          ht |   6.279158   4.455962     2.59   0.010     1.562615    25.23195
          ui |   2.611334   1.189534     2.11   0.035     1.069342    6.376877
  _Iftvgrp_1 |   .8545712   .3896019    -0.34   0.730     .3496895    2.088401
  _Iftvgrp_2 |   1.121016   .5077453     0.25   0.801     .4613965    2.723637
_IsmoXrac_~2 |   1.081958   1.199233     0.07   0.943     .1232384    9.498937
_IsmoXrac_~3 |   .2958876   .2782807    -1.29   0.195     .0468356    1.869292
------------------------------------------------------------------------------

. est store B

. lrtest A B

Likelihood-ratio test                                  LR chi2(2)  =      2.11
(Assumption: A nested in B)                            Prob > chi2 =    0.3488
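Each likelihood-ratio statistic reported by `lrtest` is twice the difference between the log likelihoods of the nested models, referred to a chi-square distribution with degrees of freedom equal to the number of added parameters. The Python sketch below reproduces both interaction tests from the reported log likelihoods. The helper names are illustrative, and `chi2_sf` is deliberately limited to the closed forms for 1 and 2 degrees of freedom so that it needs only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail chi-square probability, closed forms for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("this sketch only handles df = 1 or df = 2")


def lr_test(loglik_reduced, loglik_full, df):
    """Return the LR statistic and its p-value for two nested models."""
    stat = 2 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df)


# Log likelihoods copied from the Stata output
ll_main = -101.79203        # model A: main effects only
ll_smoke_age = -101.27162   # adds smoke x age (1 extra parameter)
ll_smoke_race = -100.73884  # adds smoke x race (2 extra parameters)

age_stat, age_p = lr_test(ll_main, ll_smoke_age, df=1)
race_stat, race_p = lr_test(ll_main, ll_smoke_race, df=2)

print(f"smoke x age:  LR chi2(1) = {age_stat:.2f}, p = {age_p:.4f}")
print(f"smoke x race: LR chi2(2) = {race_stat:.2f}, p = {race_p:.4f}")
```

The race interaction uses 2 degrees of freedom because it adds two product terms, one for each non-reference race category, and the LR test assesses them jointly rather than one at a time.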


Final model

OR for a smoker with hypertension compared to a non-smoker without hypertension, all else being equal: OR = 2.714*6.676 = 18.12

OR for a smoker aged 20 compared to a nonsmoker aged 30, all else equal (the contrast smoke - 10*age): OR = 2.714/(0.9826^10) = 3.23


. xi: logistic low smoke age i.race lwt ht ui i.ftvgrp

Logistic regression                               Number of obs   =        189
                                                  LR chi2(9)      =      31.09
                                                  Prob > chi2     =     0.0003
Log likelihood = -101.79203                       Pseudo R2       =     0.1325

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   2.713932   1.097076     2.47   0.014     1.228882    5.993598
         age |   .9826152   .0357058    -0.48   0.629      .915067     1.05515
    _Irace_2 |   3.588344   1.899526     2.41   0.016     1.271458    10.12712
    _Irace_3 |   2.407264   1.070489     1.98   0.048     1.006936    5.755001
         lwt |   .9834105   .0067902    -2.42   0.015     .9701917    .9968095
          ht |   6.676087   4.675312     2.71   0.007     1.692074    26.34053
          ui |   2.417519   1.087941     1.96   0.050     1.000712    5.840242
  _Iftvgrp_1 |   .8259409   .3736977    -0.42   0.673     .3402688    2.004822
  _Iftvgrp_2 |   1.126281   .5047641     0.27   0.791     .4679128    2.710996
------------------------------------------------------------------------------

. lincom smoke + ht , or

 ( 1)  smoke + ht = 0

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   18.11844   14.76024     3.56   0.000     3.670178    89.44471
------------------------------------------------------------------------------

. lincom (20*age + smoke) - (30*age) , or

 ( 1)  smoke - 10 age = 0

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    3.23418   1.772894     2.14   0.032     1.104479    9.470457
------------------------------------------------------------------------------
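The point estimates from `lincom` follow from adding coefficients on the log-odds scale, which is the same as multiplying odds ratios. A minimal Python sketch of that arithmetic, using odds ratios copied from the final model (variable names are illustrative):

```python
import math

# Odds ratios copied from the final model
or_smoke = 2.713932
or_ht = 6.676087
or_age = 0.9826152  # per one-year increase in age

# Smoker with hypertension vs non-smoker without: log odds ratios add
or_smoke_ht = math.exp(math.log(or_smoke) + math.log(or_ht))

# smoke - 10*age: the combination shown in the second lincom output,
# i.e. a smoker who is ten years younger than the non-smoker
or_smoke_minus_10_age = math.exp(math.log(or_smoke) - 10 * math.log(or_age))

# smoke + 10*age: the reverse contrast, a smoker ten years older than the non-smoker
or_smoke_plus_10_age = math.exp(math.log(or_smoke) + 10 * math.log(or_age))

print(round(or_smoke_ht, 2), round(or_smoke_minus_10_age, 2), round(or_smoke_plus_10_age, 2))
```

Only the point estimates can be rebuilt this way. The confidence intervals from `lincom` depend on the covariances between the coefficient estimates, which are not shown in the regression table.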
