09cer module9 linearregressionmartin · 2020. 1. 3. · bivariate analysis continuous independent...

2/20/2012

Brook I. Martin PhD MPH

Dartmouth College

Linear regression

About me

• Health services PhD from University of Washington.

• Health services faculty at Dartmouth College.

• Affiliated with Dartmouth‐Hitchcock Medical Center Department of Orthopaedics.

• Primary research interest is in the quality of care for musculoskeletal & spinal problems.

Overall goals

1) Framework for choosing an analysis

2) Introduction to descriptive analysis of continuous data.

3) Applied introduction to regression models.

1) Modeling an outcome

2) Diagnostic tests of model assumptions 

3) Post‐estimation

Data types

Type | Example
Continuous | Age; annual salary; WBC
Dichotomous | Infection (yes/no); Death (yes/no)
Count | Number (rates) of surgical procedures.
Survival | # days until death or other event.
Categorical (Ordinal) | Health rating: 1 = Excellent; 2 = Very good; 3 = Good; 4 = Fair; 5 = Poor
Categorical (Nominal) | Insurance: 1 = Medicare/aid; 2 = Private; 4 = HMO; 5 = Other

Analysis framework

Dependent variable | Univariable analysis (e.g. descriptive) | Bivariable analysis | Multivariable analysis
Continuous | Means; T‐test | Correlation; T‐test | Analysis of covariance
Dichotomous | Proportion | Chi‐square | Logistic regression
Count | Incidence & prevalence | Rate‐difference or ratio | Poisson regression
Survival | Kaplan‐Meier survival | Log‐rank; Wilcoxon (other) for survival data | Cox proportional hazard regression
Categorical (Ordinal) | Proportion; Wilcoxon signed rank | Spearman’s test; Mann‐Whitney test | Ordered logistic regression
Categorical (Nominal) | Proportion | Chi‐square; Mantel‐Haenszel test | Multinomial logistic regression

Example

Data source: Seattle Lumbar Imaging Project

Design: Randomized controlled trial of X‐ray versus rapid MRI

Patients: Low back pain without significant comorbidity.

Primary outcome: Roland score

Independent variables: Age, sex, race, education, employment, sciatica, comorbidity, previous surgery, duration of symptoms, depression, bmi, planning disability.

Citation: Jarvik JG, et al. Rapid magnetic resonance imaging vs radiographs for patients with low back pain: a randomized controlled trial. JAMA. 2003 Jun 4;289(21):2810‐8.

Primary outcome: Roland score

• 23 yes/no items asking about disability due to back problems 

• Higher score means more severe disability.

• Clinically significant difference is 3 points.

• We model ABSOLUTE Roland score at 12 months, controlling for baseline Roland score.

Part I: 

1) Univariate (descriptive) analysis of continuous data.

2) Bivariate analysis of continuous outcome.

3) Descriptive & graphical summaries.

Univariate analysis

Continuous variables:

summarize roland0 age bmi

tabstat roland0 age bmi, stat(count mean sd min max)
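As a rough illustration of what the `stat(count mean sd min max)` options report, the sketch below computes the same five quantities by hand. It is not Stata's implementation; the function name `describe` and the example values are hypothetical, and missing values are represented as `None`.

```python
import math

def describe(values):
    """Return count, mean, sample SD, min and max, skipping missing (None) values."""
    xs = [v for v in values if v is not None]
    n = len(xs)
    mean = sum(xs) / n
    # The sample standard deviation uses n - 1 in the denominator.
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return {"count": n, "mean": mean, "sd": sd, "min": min(xs), "max": max(xs)}

# Hypothetical baseline Roland scores; None marks a missing value.
roland0_example = [10, 14, None, 18, 6, 12]
summary = describe(roland0_example)
print(summary)
```

The count excludes the missing observation, which is why it is worth comparing the reported count against the sample size before modeling.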

Univariate analysis

Graph

histogram roland0, freq

[Figure: frequency histograms of Roland Score (Baseline), age, and Body Mass Index]

Univariate analysis

Categorical or dichotomous variables:

tabulate female, miss

tabulate prevsurg, miss

Univariable analysis

Graph

histogram education, discrete percent width(0.5) start(1) addl

[Figure: histogram of Education (percent) with bars for Less than HS, Graduated HS, Some College, College degree, and Graduate degree]

Bivariate analysis

Continuous independent variable

pwcorr roland12 age, sig
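The correlation coefficient that `pwcorr` reports can be sketched by hand as below. This is an illustrative sketch only: the function name `pearson_r` and the data are hypothetical, pairs with a missing value (`None`) are dropped to mimic pairwise deletion, and the p-value requested by `sig` is not computed here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation over pairs where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ages and 12-month Roland scores.
age_example = [30, 40, 50, 60, 70]
roland12_example = [6, 9, 8, 12, 15]
r = pearson_r(age_example, roland12_example)
print(round(r, 3))
```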

Bivariate analysis

Continuous independent variable

scatter roland12 age

[Figure: scatter plot of Roland Score (12 months) against age]

Bivariate analysis

Continuous independent variable

twoway (scatter roland12 age) (lfit roland12 age)

[Figure: scatter plot of Roland Score (12 months) against age, with the line of fitted values]

Bivariate analysis

Categorical independent variable

tabstat roland12, by(female) stat(mean sd)

Bivariate analysis

Categorical independent variable

graph box roland12, over(female)

Bivariate analysis

Categorical independent variable

ttest roland12, by(female)
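By default, `ttest` with `by()` assumes equal variances in the two groups. A minimal sketch of that pooled-variance t statistic follows; the function name `pooled_t` and the scores are hypothetical, and the p-value is left out because it needs the t distribution.

```python
import math

def pooled_t(group0, group1):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = sum(group0) / n0, sum(group1) / n1
    ss0 = sum((v - m0) ** 2 for v in group0)
    ss1 = sum((v - m1) ** 2 for v in group1)
    df = n0 + n1 - 2
    pooled_var = (ss0 + ss1) / df
    se = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
    # Difference is taken as mean(group 0) minus mean(group 1).
    return (m0 - m1) / se, df

# Hypothetical 12-month Roland scores by sex (female = 0, then female = 1).
male_example = [8, 10, 12, 14]
female_example = [10, 12, 14, 16, 18]
t_stat, df = pooled_t(male_example, female_example)
print(round(t_stat, 3), df)
```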

Bivariate analysis

Examine two categorical variables

tabulate random charlson, col

Bivariate analysis

Examine two categorical variables

tabulate random charlson, col chi2
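The `chi2` option adds Pearson's chi‐square test to the cross‐tabulation. The statistic compares observed cell counts with the counts expected if rows and columns were independent, which can be sketched as follows. The function name `pearson_chi2` and the counts are hypothetical and are not taken from the SLIP data.

```python
def pearson_chi2(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows = study arm, columns = comorbidity (no, yes).
counts_example = [[30, 70], [20, 80]]
chi2_stat, chi2_df = pearson_chi2(counts_example)
print(round(chi2_stat, 3), chi2_df)
```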

“Table 1”

VARIABLE | X‐RAY (n = 190) | MRI (n = 190) | OVERALL (n = 380) | P‐VALUE
Age, mean (SD) | 51.9 (14.4) | 54.5 (14.9) | 53.2 (14.7) | 0.089
Female (%) | 55% | 56% | 56% | 0.757
White (%) | 78% | 80% | 79% | 0.562
BMI | 28.5 (5.9) | 28.9 (6.1) | 28.7 (6.0) | 0.561
Current or former smoker | 47% | 53% | 51% | 0.302
Employment: Working | 54% | 56% | 55% | 0.902
Employment: Unemployed | 15% | 14% | 15% |
Employment: Other (retired/retire) | 31% | 31% | 31% |
Comorbidity (%) | 72% | 81% | 77% | 0.039
Roland (baseline) | 12.8 (5.6) | 13.6 (6.0) | 13.2 (5.8) | 0.176

Self check

• Suppose the SLIP study was a prospective cohort study rather than an RCT. How might this influence the analysis of the Roland outcome that we choose?

• Using a dataset that is relevant to your work, create a “Table 1” for the factors of interest in your analysis.

PART 2: Multivariable analysis

• Introduction to building a multivariable regression.

• Testing model assumptions

• Using post‐estimation results

Multivariable regression

Basic regression model:

yi = β0 + β1Xi + ei, with predicted value ŷi = β0 + β1Xi

OLS chooses β0 and β1 to minimize the sum of the squared differences between the observed y and the predicted ŷ. The differences are called the residuals.

For the primary outcome:

Roland12i = β0 + β1Randomi + ei

Primary outcome

Regression of outcome on primary variable of interest

regress roland12 random
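With a single predictor, the least‐squares estimates have a closed form, sketched below. This is an illustration rather than what `regress` does internally; the function name `simple_ols` and the data are hypothetical. With a 0/1 predictor such as the randomization arm, the intercept is the mean outcome in the group coded 0 and the slope is the difference in group means.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Hypothetical data: random = 0 (X-ray) or 1 (MRI), outcome = 12-month score.
random_example = [0, 0, 0, 1, 1, 1]
roland12_example = [9, 11, 13, 8, 10, 9]
b0, b1 = simple_ols(random_example, roland12_example)
print(b0, b1)  # intercept = mean of the X-ray group, slope = MRI mean minus X-ray mean
```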

Baseline Roland score

Control for baseline Roland score

regress roland12 random roland0

Additional covariates

regress roland12 random roland0 age female

Polynomial terms

Checking for polynomial on age

gen age2 = age * age

regress roland12 random roland0 age age2 female

Interaction terms

Checking for effect modification

gen inter = female * age

regress roland12 random roland0 age female inter

Adding categorical variables

• Each level of a categorical variable needs to be entered into the model as a dichotomous variable.

• One level is left out of the model (“referent”)

quietly tabulate education, generate(ed_)
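The indicator ("dummy") coding described above can be sketched as follows. The function name `indicator_columns` and the education codes are hypothetical. Note one difference from `tabulate, generate()`: that command creates an indicator for every level, and the referent is left out only when the variables are entered in the model, whereas this sketch drops the referent directly.

```python
def indicator_columns(values, referent):
    """One 0/1 column per level of a categorical variable, except the referent."""
    levels = sorted(set(values))
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
        if level != referent
    }

# Hypothetical education codes (1 = Less than HS ... 5 = Graduate degree).
education_example = [1, 3, 5, 3, 2, 4]
dummies = indicator_columns(education_example, referent=1)
for level, column in dummies.items():
    print(level, column)
```

A subject in the referent level has 0 in every column, so each coefficient is read as a difference from the referent level.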

Adding categorical variables

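regress roland12 random roland0 age female ed_HS ed_Some ed_Coll ed_grad

(tabulate, generate(ed_) names the indicators ed_1, ed_2, and so on; the names above assume they were renamed.)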

Adding categorical variables

Shortcut for categorical variables

regress roland12 random roland0 age female i.education

In older versions of Stata:

xi: regress roland12 random roland0 age female i.education

estimates store model1

Likelihood ratio test

regress roland12 random roland0 age female i.education i.episode

estimates store model2

lrtest model1 model2
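For reference, the statistic behind this comparison can be written out. The notation is introduced here for illustration: ℓ1 and ℓ2 are the maximized log‐likelihoods of model1 (without i.episode) and model2 (with i.episode), RSS is the residual sum of squares, n is the number of observations (which must be the same in both models), and k is the number of parameters added in model2. For linear regression with normal errors:

```latex
\mathrm{LR} = -2\,(\ell_{1} - \ell_{2})
            = n \,\ln\!\left(\frac{\mathrm{RSS}_{1}}{\mathrm{RSS}_{2}}\right),
\qquad
\mathrm{LR} \;\approx\; \chi^{2}_{k} \ \text{under the null hypothesis that the added coefficients are zero.}
```

A small p‐value suggests that the added terms improve the fit; the chi‐square reference distribution is a large‐sample approximation and assumes the smaller model is nested in the larger one.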

Grouped‐linear variable

Continuous Body Mass Index:

regress roland12 random roland0 age female i.education i.episode bmi

estimates store bmi_cont

Grouped‐linear variable

Make categorical Body Mass Index:

xtile bmi4 = bmi, n(4)

Grouped‐linear variable

Categorical Body Mass Index:

regress roland12 random roland0 age female i.education i.episode i.bmi4

estimates store bmi_cat

lrtest bmi_cont bmi_cat

Grouped‐linear variable

Grouped‐linear (ordered categorical) Body Mass Index:

regress roland12 random roland0 age female i.education i.episode bmi4

estimates store bmi_gl

lrtest bmi_cat bmi_gl

Assumptions for OLS = “LINE”

• Linear association between dependent & independent variables.

• Independent observations

• Normal distribution of the outcome (error terms)

• Equal variance of error terms

Model Diagnostics

Linear association between dependent & independent variables.

cprplot bmi4

[Figure: component‐plus‐residual plot against 4 quantiles of bmi]

Model Diagnostics

Independent (no serial correlation)

‐ How was the data collected?

‐What was the study design?

Model Diagnostics

Normal distribution of error terms

predict resid, resid

qnorm resid

[Figure: normal quantile plot of Residuals against Inverse Normal]

Model Diagnostics

Normal distribution of error terms

‐ Residual versus fitted plot

‐ Is the error term centered on zero and symmetric at every X?

rvfplot

[Figure: Residuals against Fitted values]

Model Diagnostics

Influential points

dfbeta

graph box _dfbeta_12 _dfbeta_3

[Figure: box plots of Dfbeta bmi4 and Dfbeta age]

Model diagnostics

Cook's distance for influential points

predict cooksd, cooksd

local max = 4/e(N)

generate index = _n

graph twoway scatter cooksd index, yline(`max') msymbol(p) yscale(log)

[Figure: Cook's D against index]

Model diagnostics

Test for equal variance of error terms

‐ Looking for heteroskedasticity

symplot

[Figure: symmetry plot for age, Distance above median against Distance below median]

Model diagnostics

Test for equal variance of error terms

‐ Test for heteroskedasticity

estat hettest

Post‐estimation: lincom

Lincom (linear combination)

What is the predicted roland12 for a 40 year old female with a graduate degree who has a baseline Roland score of 10, a BMI of 25 (2nd quartile), and more than 5 episodes of back pain?

lincom roland0*10 + age*40 + female*1 + 5.education + 3.episode + 2.bmi4 + _cons
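Written out, the lincom expression is the regression equation evaluated at this patient's covariate values. The equation below is a sketch that assumes the most recently fitted model includes random, roland0, age, female, i.education, i.episode and i.bmi4; the β subscripts are labels introduced here for the corresponding coefficients.

```latex
\widehat{\text{Roland12}}
  = \beta_{\text{cons}}
  + 10\,\beta_{\text{roland0}}
  + 40\,\beta_{\text{age}}
  + 1\,\beta_{\text{female}}
  + \beta_{5.\text{education}}
  + \beta_{3.\text{episode}}
  + \beta_{2.\text{bmi4}}
```

Because no term for random appears, the prediction is for a patient with random = 0 (the X‐ray arm). lincom also reports a standard error and confidence interval for this combination.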

Post‐estimation: test

Hypothesis testing

Does Roland12 differ between those who have continuous back pain and those with 5+ episodes?

test 3.episode = 4.episode

Post‐estimation: Margins

Margins

What are the estimated Roland12 scores for those who received MRI (random = 1) and those who received X‐ray (random = 0)?

margins random, atmeans

margins , atmeans dydx(random)

Post‐estimation

Hypothesis testing

Does the Roland12 differ between a 40 year old female and a 50 year old male?

test age*40 + female*1 = age*50 + female*0
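Collecting terms shows which single linear restriction this command tests. The β labels are introduced here for the age and female coefficients, and all other covariates are assumed to be equal for the two people being compared:

```latex
40\,\beta_{\text{age}} + 1\,\beta_{\text{female}}
  = 50\,\beta_{\text{age}} + 0\,\beta_{\text{female}}
\;\Longleftrightarrow\;
\beta_{\text{female}} - 10\,\beta_{\text{age}} = 0
```

In words, the test asks whether the female effect equals the effect of ten additional years of age.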

Special situation

• Complex (survey) data– NHANES– Use Stata survey commands (“help svy”)

• Skewed data (such as costs)– Variable transformation (log costs)

li d l ( li d i d l )– Use non‐linear models (Generalized Linear Models)

• Correlated data– Repeated measures for same patient– Patients nested within hospitals that vary on outcomes– Use Stata panel data commands (“help xtset, sxreg”)

Self check

1) What are the four main assumptions of OLS regression? Which one does not involve diagnostic plots?

2) Give an interpretation for each of the parameters in the model below:

Thanks! 
