2/20/2012
Brook I. Martin, PhD, MPH
Dartmouth College
Linear regression
About me
• Health services PhD from University of Washington.
• Health services faculty at Dartmouth College.
• Affiliated with Dartmouth‐Hitchcock Medical Center Department of Orthopaedics.
• Primary research interest is in the quality of care for musculoskeletal & spinal problems.
Overall goals
1) Framework for choosing an analysis
2) Introduction to descriptive analysis of continuous data.
3) Applied introduction to regression models.
1) Modeling an outcome
2) Diagnostic tests of model assumptions
3) Post‐estimation
Data types
Type | Example
Continuous | Age; annual salary; WBC
Dichotomous | Infection (yes/no); Death (yes/no)
Count | Number (rates) of surgical procedures.
Survival | # days until death or other event.
Categorical (Ordinal) | Health rating: 1 = Excellent; 2 = Very good; 3 = Good; 4 = Fair; 5 = Poor
Categorical (Nominal) | Insurance: 1 = Medicare/aid; 2 = Private; 4 = HMO; 5 = Other
Analysis framework
Dependent variable | Univariable analysis (e.g. descriptive) | Bivariable analysis | Multivariable analysis
Continuous | Means; T‐test | Correlation; T‐test | Analysis of covariance
Dichotomous | Proportion | Chi‐square | Logistic regression
Count | Incidence & prevalence | Rate‐difference or ratio | Poisson regression
Survival | Kaplan‐Meier survival | Log‐rank; Wilcoxon (other) for survival data | Cox proportional hazard regression
Categorical (Ordinal) | Proportion; Wilcoxon signed rank | Spearman’s test; Mann‐Whitney test | Ordered logistic regression
Categorical (Nominal) | Proportion | Chi‐square; Mantel‐Haenszel test | Multinomial logistic regression
Example
Data source: Seattle Lumbar Imaging Project
Design: Randomized control trial of X‐ray versus Rapid MR
Patients: Low Back Pain without significant comorbidity.
Primary outcome: Roland Score
Independent variables: Age, sex, race, education, employment, sciatica, comorbidity, previous surgery, duration of symptoms, depression, bmi, planning disability.
Citation: Jarvik JG, et al. Rapid magnetic resonance imaging vs radiographs for patients with low back pain: a randomized controlled trial. JAMA. 2003 Jun 4;289(21):2810‐8.
Primary outcome: Roland score
• 23 yes/no items asking about disability due to back problems
• Higher score means more severe disability.
• Clinically significant difference is 3 points.
• We model ABSOLUTE Roland score at 12 months, controlling for baseline Roland score.
Part I:
1) Univariate (descriptive) analysis of continuous data.
2) Bivariate analysis of continuous outcome.
3) Descriptive & graphical summaries.
Univariate analysis
Continuous variables:
summarize roland0 age bmi
tabstat roland0 age bmi, stat(count mean sd min max)
Univariate analysis
Graph:
histogram roland0, freq
[Figure: frequency histograms of Roland Score (Baseline), age, and Body Mass Index]
Univariate analysis
Categorical or dichotomous variables:
tabulate female, miss
tabulate prevsurg, miss
Univariable analysis
Graph:
histogram education, discrete percent width(0.5) start(1) addl
[Figure: percent histogram of Education with labeled bars for Less than HS, Graduated HS, Some College, College degree, and Graduate degree]
Bivariate analysis
Continuous independent variable
pwcorr roland12 age, sig
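For reference, pwcorr reports the Pearson correlation coefficient, and the sig option adds its significance level. A sketch of the formula, writing x for age and y for roland12 (these symbols are labels chosen here, not taken from the slides):

```latex
r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}
```

r ranges from -1 to 1, with 0 indicating no linear association.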
Bivariate analysis
Continuous independent variable
scatter roland12 age
[Figure: scatter plot of Roland Score (12 months) against age]
Bivariate analysis
Continuous independent variable
twoway (scatter roland12 age) (lfit roland12 age)
[Figure: scatter plot of Roland Score (12 months) against age, with a line of fitted values]
Bivariate analysis
Categorical independent variable
tabstat roland12, by(female) stat(mean sd)
Bivariate analysis
Categorical independent variable
graph box roland12, over(female)
Bivariate analysis
Categorical independent variable
ttest roland12, by(female)
Bivariate analysis
Examine two categorical variables
tabulate random charlson, col
Bivariate analysis
Examine two categorical variables
tabulate random charlson, col chi2
“Table 1”
VARIABLE | X‐RAY (n = 190) | MRI (n = 190) | OVERALL (n = 380) | P‐VALUE
Age, mean (SD) | 51.9 (14.4) | 54.5 (14.9) | 53.2 (14.7) | 0.089
Female (%) | 55% | 56% | 56% | 0.757
White (%) | 78% | 80% | 79% | 0.562
BMI, mean (SD) | 28.5 (5.9) | 28.9 (6.1) | 28.7 (6.0) | 0.561
Current or former smoker | 47% | 53% | 51% | 0.302
Employment: Working | 54% | 56% | 55% | 0.902
Employment: Unemployed | 15% | 14% | 15% |
Employment: Other (retired/retire) | 31% | 31% | 31% |
Comorbidity (%) | 72% | 81% | 77% | 0.039
Roland (baseline), mean (SD) | 12.8 (5.6) | 13.6 (6.0) | 13.2 (5.8) | 0.176
Self check
• Suppose the SLIP study was a prospective cohort study rather than an RCT. How might this influence the analysis of the Roland outcome that we choose?
• Using a dataset that is relevant to your work, create a “table 1” for the factors of interest in your analysis.
PART 2: Multivariable analysis
• Introduction to building a multivariable regression.
• Testing model assumptions
• Using post‐estimation results
Multivariable regression
Basic regression model:
y_i = β0 + β1 X_i + e_i
OLS chooses β0 and β1 to minimize the sum of squared differences between the observed y and the predicted ŷ. These differences are called the residuals.
For the primary outcome:
Roland12_i = β0 + β1 Random_i + e_i
Primary outcome
Regression of outcome on primary variable of interest
regress roland12 random
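As a worked reference, the least-squares criterion and its solution for a single predictor can be written out. This is the standard textbook result, not reproduced from the slides. The notation is assumed here: X is the predictor (random), y is the outcome (roland12), and hats mark estimates.

```latex
(\hat{\beta}_0, \hat{\beta}_1)
  = \arg\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 X_i\right)^2

\hat{\beta}_1 = \frac{\sum_{i}(X_i - \bar{X})(y_i - \bar{y})}{\sum_{i}(X_i - \bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{X}
```

When X is a 0/1 indicator such as random, the estimated slope equals the difference in mean outcome between the two groups, and the intercept is the mean outcome in the group coded 0.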
Baseline Roland score
Control for baseline Roland score
regress roland12 random roland0
Additional covariates
regress roland12 random roland0 age female
Polynomial terms
Checking for polynomial on age
gen age2 = age * age
regress roland12 random roland0 age age2 female
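One way to read the quadratic model: the change in expected roland12 per year of age is no longer a single number. A sketch, where the subscripted symbols are illustrative labels for the coefficients on age and age2:

```latex
E[\text{roland12}] = \dots + \beta_{age}\,\text{age} + \beta_{age2}\,\text{age}^2 + \dots
\quad\Longrightarrow\quad
\frac{\partial E[\text{roland12}]}{\partial\,\text{age}} = \beta_{age} + 2\,\beta_{age2}\,\text{age}
```

If the coefficient on age2 is close to zero and not statistically significant, the linear term alone may be adequate.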
Interaction terms
Checking for effect modification
gen inter = female * age
regress roland12 random roland0 age female inter
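The same kind of interaction can also be specified with Stata's factor-variable notation instead of a hand-built product term. A minimal sketch on Stata's bundled auto dataset, which is an assumption made for illustration because the SLIP data are not available here. The variables price, foreign, and mpg stand in for roland12, female, and age.

```stata
* Illustrative sketch on the bundled auto data (not the SLIP data)
sysuse auto, clear

* i.foreign##c.mpg expands to foreign, mpg, and their product term
regress price i.foreign##c.mpg

* Wald test of the interaction term (effect modification)
testparm i.foreign#c.mpg
```

With this notation Stata knows the product term belongs to its components, which matters for later commands such as margins.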
Adding categorical variables
• Each level of a categorical variable needs to be entered into the model as a dichotomous variable.
• One level is left out of the model (“referent”)
quietly tabulate education, generate(ed_)
Adding categorical variables
regress roland12 random roland0 age female ed_HS ed_Some ed_Coll ed_grad
Adding categorical variables
Shortcut for categorical variables:
regress roland12 random roland0 age female i.education
In older versions of Stata:
xi: regress roland12 random roland0 age female i.education
Likelihood ratio test
regress roland12 random roland0 age female i.education
estimates store model1
regress roland12 random roland0 age female i.education i.episode
estimates store model2
lrtest model1 model2
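For reference, the statistic that lrtest reports compares the maximized log likelihoods of the two nested models. A sketch, with ℓ₁ and ℓ₂ as assumed labels for the log likelihoods of model1 and model2, and q for the number of parameters added in model2:

```latex
LR = 2\left(\ell_2 - \ell_1\right) \sim \chi^2_{q} \quad \text{under } H_0
```

A small p-value suggests that the added terms (here the i.episode indicators) improve the fit beyond what chance would explain.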
Grouped‐linear variable
Continuous Body Mass Index:
regress roland12 random roland0 age female i.education i.episode bmi
estimates store bmi_cont
Grouped‐linear variable
Make categorical Body Mass Index:
xtile bmi4 = bmi, n(4)
Grouped‐linear variable
Categorical Body Mass Index:
regress roland12 random roland0 age female i.education i.episode i.bmi4
estimates store bmi_cat
lrtest bmi_cont bmi_cat
Grouped‐linear variable
Grouped‐linear (ordered categorical) Body Mass Index:
regress roland12 random roland0 age female i.education i.episode bmi4
estimates store bmi_gl
lrtest bmi_cat bmi_gl
Assumptions for OLS = “LINE”
• Linear association between dependent & independent variables.
• Independent observations (no serial correlation)
• Normally distributed error terms
• Equal variance of error terms
Model Diagnostics
Linear association between dependent & independent variables.
cprplot bmi4
[Figure: component‐plus‐residual plot against 4 quantiles of bmi]
Model Diagnostics
Independent (no serial correlation)
‐ How was the data collected?
‐ What was the study design?
Model Diagnostics
Normal distribution of error terms
predict resid, resid
qnorm resid
[Figure: normal quantile plot of Residuals against Inverse Normal]
Model Diagnostics
Normal distribution of error terms
‐ Residual versus fitted plot
‐ Is the error term centered on zero and symmetric at every X?
rvfplot
[Figure: Residuals against Fitted values]
Model Diagnostics
Influential points
dfbeta
graph box _dfbeta_12 _dfbeta_3
[Figure: box plots of Dfbeta bmi4 and Dfbeta age]
Model diagnostics
Cook's distance for influential points
predict cooksd, cooksd
local max = 4/e(N)
generate index = _n
graph twoway scatter cooksd index, yline(`max') msymbol(p) yscale(log)
[Figure: Cook's D (log scale) against index]
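Beyond plotting, it can help to list the observations that exceed the 4/N rule-of-thumb cutoff. A minimal sketch on Stata's bundled auto dataset, used here as a stand-in because the SLIP data are not available. The model and variable names are illustrative only.

```stata
* Illustrative sketch on the bundled auto data (not the SLIP data)
sysuse auto, clear
regress price mpg weight

* Cook's distance for each observation in the estimation sample
predict cooksd, cooksd

* List observations above the 4/N rule-of-thumb cutoff
list make price cooksd if cooksd > 4/e(N) & !missing(cooksd)
```

Flagged observations are candidates for a data check or a sensitivity analysis, not automatic exclusions.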
Model diagnostics
Test for equal variance of error terms
‐ Looking for heteroskedasticity
symplot
[Figure: symmetry plot for age, Distance above median against Distance below median]
Model diagnostics
Test for equal variance of error terms
‐ Test for heteroskedasticity
estat hettest
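If estat hettest rejects constant variance, one common response (not covered on the slides, so treat this as a suggestion) is to keep the same coefficients and request heteroskedasticity-robust standard errors. A minimal sketch on Stata's bundled auto dataset, with an illustrative model:

```stata
* Illustrative sketch on the bundled auto data (not the SLIP data)
sysuse auto, clear
regress price mpg weight

* Breusch-Pagan / Cook-Weisberg test; H0 is constant variance
estat hettest

* Same point estimates, robust (Huber-White) standard errors
regress price mpg weight, vce(robust)
```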
Post‐estimation: lincom
Lincom (linear combination)
What is the predicted roland12 for a 40‐year‐old female with a graduate degree, a baseline Roland score of 10, a BMI of 25 (2nd quartile), and more than 5 episodes of back pain?
lincom roland0*10 + age*40 + female*1 + 5.education + 3.episode + 2.bmi4 + _cons
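A prediction built with lincom can be cross-checked against margins, which fixes covariates at chosen values. A minimal sketch on Stata's bundled auto dataset, used as a stand-in for the SLIP data. The model and the values 20 and 1 are arbitrary illustrations.

```stata
* Illustrative sketch on the bundled auto data (not the SLIP data)
sysuse auto, clear
regress price mpg i.foreign

* Predicted price for a foreign car with mpg = 20, with SE and 95% CI
lincom _cons + 20*mpg + 1.foreign

* The same prediction obtained through margins
margins, at(mpg=20 foreign=1)
```

Both commands should report the same point estimate. Any covariate left out of the lincom expression is implicitly set to zero.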
Post‐estimation: test
Hypothesis testing
Does Roland12 differ between those who have continuous back pain and those with 5+ episodes?
test 3.episode = 4.episode
Post‐estimation: Margins
Margins
What are the estimated Roland12 scores for those who received MRI (random = 1) and those who received X‐ray (random = 0)?
margins random, atmeans
margins , atmeans dydx(random)
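Note that the margins varname form expects the variable to have been entered in the regression as a factor variable (with the i. prefix). A minimal sketch on Stata's bundled auto dataset, where foreign stands in for random; the model is illustrative only.

```stata
* Illustrative sketch on the bundled auto data (not the SLIP data)
sysuse auto, clear

* Enter the group indicator with i. so that margins treats it as a factor
regress price i.foreign mpg weight

* Adjusted mean of price in each group, other covariates held at their means
margins foreign, atmeans

* Difference between the groups (discrete change from the base level)
margins, dydx(foreign) atmeans
```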
Post‐estimation
Hypothesis testing
Does the Roland12 differ between a 40‐year‐old female and a 50‐year‐old male?
test age*40 + female*1 = age*50 + female*0
Special situations
• Complex (survey) data
– NHANES
– Use Stata survey commands (“help svy”)
• Skewed data (such as costs)
– Variable transformation (log costs)
– Use non‐linear models (Generalized Linear Models)
• Correlated data
– Repeated measures for same patient
– Patients nested within hospitals that vary on outcomes
– Use Stata panel data commands (“help xtset”, “help xtreg”)
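The two options for skewed outcomes can be sketched side by side. This uses Stata's bundled auto dataset with price standing in for a cost variable. The gamma family and log link are one common choice for costs, assumed here for illustration rather than taken from the slides.

```stata
* Illustrative sketch on the bundled auto data (not the SLIP data)
sysuse auto, clear

* Option 1: transform the skewed outcome and fit OLS on the log scale
generate lnprice = ln(price)
regress lnprice mpg weight

* Option 2: generalized linear model with a log link and gamma family
glm price mpg weight, family(gamma) link(log)
```

The GLM models the log of the mean outcome directly, so predictions do not need retransformation from the log scale.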
Self check
1) What are the four main assumptions of OLS regression? Which one does not involve diagnostic plots?
2) Give an interpretation for each of the parameters in the model below: