Lecture 4 – Linear Regression Analysis
Steps in Data Analysis
Step 1: Collect and clean data (spreadsheet from heaven)
Step 2: Calculate descriptive statistics
Step 3: Explore graphics
Step 4: Choose outcome(s) and potential predictive variables (covariates)
Step 5: Pick an appropriate statistical procedure & execute
Step 6: Evaluate fitted model, make adjustments as needed
Four Considerations
1) Purpose of the investigation: descriptive orientation
2) The mathematical characteristics of the variables: level of measurement (nominal, ordinal, continuous) and distribution
3) The statistical assumptions made about these variables: distribution, independence, etc.
4) How the data are collected: random sample, cohort, case-control, etc.
Purpose of analysis: To relate two variables, where we designate one as the outcome of interest (dependent variable, or DV) and one or more as the predictor variables (independent variables, or IVs). In general, we will use k to represent the number of IVs; here k = 1.

Given a sample of n individuals, we observe pairs of values (Xi, Yi) for the 2 variables for each individual i.

Type of variables: continuous (interval or ratio).
Aims of regression analysis:
- Characterize the relationship by determining the extent, direction, and strength of association between the IVs and the DV
- Predict the DV as a function of the IVs
- Describe the relationship between the IVs and the DV controlling for other variables (confounders)
- Determine which IVs are important for predicting the DV and which ones are not
- Determine the best mathematical model for describing the relationship between the IVs and the DV
- Assess the interactive effects (effect modification) of 2 or more IVs with regard to the DV
- Obtain a valid and precise estimate of 1 or more regression coefficients from a larger set of regression coefficients in a given model
NOTE: When we find statistically significant associations between IVs and a DV, this does not imply that the particular IVs caused the DV to occur. Criteria to consider when assessing causality include:
Strength of association - does the association appear strong for a number of different studies?
Dose-response effect - The DV changes in a meaningful manner with changes in the IV
Lack of temporal ambiguity - The cause precedes the effect
Consistency of findings - Most studies show similar results
Biological and theoretical plausibility - The causal relationship is consistent with current biological and theoretical knowledge
Coherence of evidence - The findings do not seriously conflict with accepted facts about the DV being studied.
Specificity of association - The study factor is associated with only one effect
Simple Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where:
- $Y_i$ is the value of the response (outcome, dependent) variable for the ith unit (e.g., SBP)
- $\beta_0$ and $\beta_1$ are parameters which represent the intercept and slope, respectively
- $X_i$ is the value of the predictor (independent) variable (e.g., age) for the ith unit; X is considered fixed, not random
- $\varepsilon_i$ is a random error term that has mean 0 and variance $\sigma^2$; $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated for all $i \neq j$, $i = 1, \ldots, n$
Model is "simple" because there is only one independent variable.
Model is "linear in the parameters" because the parameters β0 and β1 do not appear as an exponent and they are not multiplied or divided by another parameter.
Model is also "linear in the independent variable" because this variable (Xi) appears only to the first power.
Simple Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

The observed value of Y for the ith unit is the sum of two components: (1) the constant term $\beta_0 + \beta_1 X_i$ and (2) the random error term $\varepsilon_i$. Hence, $Y_i$ is a random variable.

Since $\varepsilon_i$ has mean 0, $Y_i$ must have mean $\beta_0 + \beta_1 X_i$:

$$E(Y_i \mid X_i) = E(\beta_0 + \beta_1 X_i + \varepsilon_i) = \beta_0 + \beta_1 X_i + E(\varepsilon_i) = \beta_0 + \beta_1 X_i$$

where E = "expected value" = mean.
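A quick numerical check of this expectation, as a minimal Python sketch with made-up parameter values (not from the lecture's data): simulating many Y's at a fixed X shows the sample mean landing on β0 + β1X.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 50.0, 1.7, 9.0   # hypothetical parameter values
x = 40.0                               # a fixed value of X

# Draw many Y's at this X: Y = beta0 + beta1*x + eps, with eps ~ N(0, sigma^2)
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=100_000)

print(y.mean())           # close to beta0 + beta1*x = 118, since E(eps) = 0
print(beta0 + beta1 * x)  # the model mean E(Y|X)
```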
The fitted (or estimated) regression line is the expected value of Y at the given value of X, i.e., E(Y|X):

$$\hat{E}(Y \mid X) = \hat{\beta}_0 + \hat{\beta}_1 X$$

Define the residuals:

$$\hat{\varepsilon}_i = Y_i - \hat{Y}_i$$
[Figure: scatterplot of Y vs. X with the residuals (ε) drawn as vertical distances from the fitted line]
Interpreting the Coefficients

β1: the expected change in Y per unit change in X
β0: the expected value of Y when X = 0

[Figure: fitted line annotated with intercept β0 and slope β1 per 1.0 unit of X]
Model Assumptions

- Linear relationship between Y and X (i.e., only allow linear β's)
- Independent observations
- Normally distributed residuals, in particular $\varepsilon_i \sim N(0, \sigma^2)$
- Equal variances across values of X (homogeneity of variance)
Normality Assumption
[Figure: normal curves of Y centered on the regression line at each X; e.g., at X1 = 10 the distribution of Y is centered at β0 + β1x1]

$$E(y_i) = \beta_0 + \beta_1 x_i$$

$$Y_i \overset{\text{i.i.d.}}{\sim} N\!\left(\beta_0 + \beta_1 x_i,\ \sigma^2\right)$$
Homoscedasticity - The variance of Y is the same for any X
[Figure: scatterplot of Y vs. X (X from 5 to 35, Y from 20 to 45) with equal vertical spread of Y at each X]
Departures from the Normality Assumption

If the normality assumption is not "badly" violated, the model is generally robust to departures from normality.

If the normality assumption is badly violated, try a transformation of Y (e.g., the natural log).

If you transform the data, you must consider whether the transformed Y is normally distributed as well as whether the variance homogeneity assumption holds; the two often go together. A small illustration follows.
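A minimal sketch of the idea, assuming a right-skewed outcome generated as lognormal (hypothetical data, chosen only for illustration): the natural log pulls the skew toward zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.lognormal(mean=1.0, sigma=0.8, size=500)  # right-skewed outcome

print(stats.skew(y))          # strongly right-skewed on the raw scale
print(stats.skew(np.log(y)))  # near 0 after the log transform
```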
The "correct" model is fitted:
- All IVs included are truly related to the DV
- No (conceivable) IVs related to the DV have been left out

Violation of either of these assumptions can lead to "model misspecification bias".
Null Hypothesis: The simple linear regression model does not fit the data better than the baseline model (β1 = 0).

Alternative Hypothesis: The simple linear regression model fits the data better than the baseline model (β1 ≠ 0).
Fitting Data to a Linear Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

Linear regression determines the values of β0 and β1 that minimize:

$$\sum_{i} \hat{\varepsilon}_i^2 = \sum_{i} \left(Y_i - \hat{Y}_i\right)^2$$
The LEAST-SQUARES Solution
For each pair of observations (Xi, Yi), the method of least squares considers the deviation of Yi from its expected value:

$$\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2$$

The least-squares method finds the $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize this sum of squares. The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line the smallest. The resulting slope estimate is:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2} = \frac{\sum_{i=1}^{n} X_i Y_i - \frac{1}{n}\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2}$$
The Least-Squares Method

The intercept estimate is:

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
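A minimal Python sketch of these two formulas, using made-up (x, y) values rather than the lecture's data:

```python
import numpy as np

# Hypothetical illustrative data, e.g., age (x) and SBP (y)
x = np.array([35.0, 42.0, 51.0, 60.0, 68.0, 75.0])
y = np.array([118.0, 121.0, 136.0, 147.0, 155.0, 164.0])

# Least-squares slope and intercept, following the formulas above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"beta1_hat = {b1:.4f}, beta0_hat = {b0:.4f}")
```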
The method of least squares is purely mathematical. However, statistically the least squares estimators are very appealing because they are the Best Linear Unbiased Estimators (BLUE).

This means that among all of the equations we could have picked to estimate β0 and β1, the least squares equations will give us estimates:
1. That have expectation β0 and β1 (unbiased)
2. That have minimum variance among all of the possible linear estimators for β0 and β1 (most efficient)
$$SSE = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2, \qquad \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

SSE is the sum of squares due to error (i.e., the sum of the squared residuals), the quantity we wish to "minimize".
[Figure: scatterplot of Response (Y) vs. Predictor (X) decomposing the total variability $Y_i - \bar{Y}$ into explained variability $\hat{Y}_i - \bar{Y}$ and unexplained variability $Y_i - \hat{Y}_i$]
If SSE = 0, the model is a perfect fit. SSE is affected by:
1. A large σ² (a lot of variability)
2. Nonlinearity

Need to look at both (1) and (2). For now, assume linearity and estimate σ² as:

$$S_{Y|X}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 = \frac{SSE}{n-2}$$

We use n - 2 because we estimate 2 parameters, β0 and β1. SSE/(n - 2) is also known as the "mean squared error" or MSE.
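A short self-contained sketch of this estimate, again with hypothetical data:

```python
import numpy as np

# Same hypothetical data as in the earlier sketch
x = np.array([35.0, 42.0, 51.0, 60.0, 68.0, 75.0])
y = np.array([118.0, 121.0, 136.0, 147.0, 155.0, 164.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                # fitted values

sse = np.sum((y - y_hat) ** 2)     # sum of squared residuals
n = len(x)
mse = sse / (n - 2)                # divide by n-2: two parameters were estimated
print(f"SSE = {sse:.4f}, S^2_Y|X (MSE) = {mse:.4f}")
```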
How do I build my model? Using the tools of statistics:
1. First I use estimation, in particular least squares, to estimate $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\sigma}^2$, and $\hat{Y}$
2. Then I use my distributional assumptions to make inference about the estimates (hypothesis testing, e.g., is the slope 0?)
3. Interpretation: interpret the results in light of the assumptions
Hypothesis Testing for Regression Parameters

To test the hypothesis H0: β1 = β1^(0), where β1^(0) is some hypothesized value for β1, the test statistic used is

$$T = \frac{\hat{\beta}_1 - \beta_1^{(0)}}{S_{\hat{\beta}_1}}, \qquad \text{where } S_{\hat{\beta}_1} = \frac{S_{Y|X}}{S_X \sqrt{n-1}}$$

This test statistic has a t distribution with n - 2 degrees of freedom.

The CI is given by

$$\hat{\beta}_1 \pm t_{n-2,\,1-\alpha/2}\, S_{\hat{\beta}_1}$$
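A sketch of this slope test and CI in Python under H0: β1 = 0, with hypothetical data; note that $S_X\sqrt{n-1} = \sqrt{\sum_i (X_i - \bar{X})^2}$, which is what the code uses.

```python
import numpy as np
from scipy import stats

# Hypothetical data; testing H0: beta1 = 0 at the 5% level
x = np.array([35.0, 42.0, 51.0, 60.0, 68.0, 75.0])
y = np.array([118.0, 121.0, 136.0, 147.0, 155.0, 164.0])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))         # S_Y|X
se_b1 = s_yx / np.sqrt(np.sum((x - x.mean()) ** 2))  # S_Y|X / (S_X * sqrt(n-1))

t_stat = (b1 - 0.0) / se_b1
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 2)        # two-sided p-value
tcrit = stats.t.ppf(0.975, df=n - 2)                 # 95% CI critical value
print(t_stat, p_val, (b1 - tcrit * se_b1, b1 + tcrit * se_b1))
```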
Timeout: The t-Distribution

The t distribution (or Student's t distribution) arises when we use an estimated variance to construct the test statistic:

$$T = \frac{\bar{Y} - \mu}{S/\sqrt{n}}, \qquad \text{where } S = \sqrt{\frac{\sum_i \left(Y_i - \bar{Y}\right)^2}{n-1}} \text{ is the sample standard deviation}$$

As n → ∞, T → Z ~ N(0, 1). We have to pay a penalty for estimating σ².

Can think of the t distribution as a thick-tailed normal.
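A quick numerical illustration of that "penalty": the t distribution's 97.5th percentile shrinks toward the normal's 1.96 as the degrees of freedom grow.

```python
from scipy import stats

# 97.5th percentile of t for increasing df, vs. the standard normal
for df in (2, 5, 10, 30, 100):
    print(df, round(stats.t.ppf(0.975, df), 3))
print("normal:", round(stats.norm.ppf(0.975), 3))  # 1.96
```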
Inference Concerning the Intercept

To test the hypothesis H0: β0 = β0^(0) we use the following statistic:

$$T = \frac{\hat{\beta}_0 - \beta_0^{(0)}}{S_{\hat{\beta}_0}}, \qquad \text{where } S_{\hat{\beta}_0} = S_{Y|X}\sqrt{\frac{1}{n} + \frac{\bar{X}^2}{(n-1)S_X^2}}$$

which also has the t distribution with n - 2 degrees of freedom when H0: β0 = β0^(0) is true.

The CI is given by

$$\hat{\beta}_0 \pm t_{n-2,\,1-\alpha/2}\, S_{\hat{\beta}_0}$$
Recall the hypotheses:

Null Hypothesis: The simple linear regression model does not fit the data better than the baseline model (β1 = 0).

Alternative Hypothesis: The simple linear regression model does fit the data better than the baseline model (β1 ≠ 0).
Interpretations of Tests for Slope

Failure to reject H0: β1 = 0 could mean: $\bar{Y}$ is essentially as good as $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ for predicting Y.

[Figure: scatterplot of y vs. x with an essentially flat fitted line near $\bar{Y}$]
Interpretations of Tests for Slope

Failure to reject H0: β1 = 0 could mean: the true relationship between Y and X is not linear (i.e., it could be quadratic or some other higher power).

[Figure: scatterplot of y vs. x showing a curved (e.g., quadratic) relationship]
Dude, that’s why you always plot Y vs. X!
Interpretations of Tests for Slope

Failure to reject H0: β1 = 0 could mean: we do not have enough power to detect a significant slope.

Not rejecting H0: β1 = 0 implies that a straight-line model in X is not the best model to use, and does not provide much help for predicting Y (ignoring power).
The Intercept

We often leave the intercept, β0, in the model regardless of whether the hypothesis H0: β0 = 0 is rejected or not. This is because if we say the intercept is zero, then we must force the regression line through the origin (0, 0), and rarely is this true.
Regression of SBP on age:

Analysis of Variance

Source            DF   Sum of Squares   Mean Square
Model              1       4008.12372    4008.12372
Error             28       2319.37628      82.83487
Corrected Total   29       6327.50000

Root MSE   9.10137

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             54.21462         13.08530      4.14     0.0003
age          1              1.70995          0.24582      6.96     <.0001

Notes on reading the output: Root MSE = $S_{Y|X}$ (the square root of the MSE estimate of σ²); the Error Sum of Squares is SSE; the age row gives $\hat{\beta}_1$, SE($\hat{\beta}_1$), and the t test of H0: β1 = 0.
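As a check, the age row of this output can be reproduced from estimate/SE and a t distribution with n - 2 = 28 df (values copied from the table above):

```python
from scipy import stats

# t = estimate / SE for the age coefficient, df = n - 2 = 28
est, se, df = 1.70995, 0.24582, 28
t = est / se
print(round(t, 2))                 # 6.96, matching the SAS output
print(2 * stats.t.sf(abs(t), df))  # two-sided p-value, < .0001
```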
Multiple Regression

Response Variable: Y. Explanatory Variables: X1, ..., Xk.

Model (extension of simple regression):

$$E(Y) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k, \qquad V(Y) = \sigma^2$$

Partial Regression Coefficients (βi): the effect of increasing Xi by 1 unit, holding all other predictors constant.

Computer packages fit these models; hand calculations are very tedious.

Model parameters: β0, β1, ..., βk, σ

Least squares prediction equation:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k$$

Residuals: $\hat{\varepsilon}_i = Y_i - \hat{Y}_i$

Error Sum of Squares:

$$SSE = \sum_i \left(Y_i - \hat{Y}_i\right)^2 = \sum_i \hat{\varepsilon}_i^2$$

Estimated conditional standard deviation:

$$\hat{\sigma} = \sqrt{\frac{SSE}{n - k - 1}}$$
When there are 2 independent variables (X1 and X2) we can view the regression as fitting the best plane to the 3 dimensional set of points (as compared to the best line in simple linear regression)
When there are more than 2 IVs plotting becomes much more difficult
Analysis of Variance:

Regression Sum of Squares: $SSR = \sum\left(\hat{Y}_i - \bar{Y}\right)^2$, with $df_R = k$

Error Sum of Squares: $SSE = \sum\left(Y_i - \hat{Y}_i\right)^2$, with $df_E = n - k - 1$

Total Sum of Squares: $TSS = \sum\left(Y_i - \bar{Y}\right)^2$, with $df_T = n - 1$

Coefficient of (Multiple) Determination: R² = SSR/TSS (the % of variation explained by the model)

Least Squares Estimates: regression coefficients, estimated standard errors, t-statistics, and p-values (significance levels for 2-sided tests)
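A minimal sketch of a multiple regression fit and R² on simulated (hypothetical) data, using ordinary least squares via numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 2
X = rng.normal(size=(n, k))                        # two hypothetical predictors
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.3, size=n)

# Design matrix with a column of ones for the intercept
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares estimates

y_hat = X1 @ beta_hat
sse = np.sum((y - y_hat) ** 2)                     # df = n - k - 1
ssr = np.sum((y_hat - y.mean()) ** 2)              # df = k
tss = np.sum((y - y.mean()) ** 2)                  # df = n - 1
print(beta_hat, ssr / tss)                         # coefficients and R^2
```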
N = 706 participants (first 10 shown below)

ID       Gender  Reader  MaxDiam(mm)  T2Max(sec)  PreBL(mm)  PostBL(mm)  Age(yrs)
3000028  M       Crotts  6.835         84         6.559      6.573       84.2
3000052  F       Manli   2.905         89         2.809      2.829       75.3
3000079  M       Manli   3.677         52         3.583      3.576       80.1
3000087  M       Manli   4.974         57         4.957      4.909       78.3
3000257  F       Crotts  4.748         62         4.492      4.291       78
3000346  M       Drum    5.973        114         5.929      5.917       78.5
3000419  F       Drum    3.429         94         3.288      3.312       76.6
3000524  M       Drum    4.971         34         4.897      4.887       75.4
3000559  F       Crotts  4.162         46         3.825      3.751       76.5
3000591  M       Crotts  4.677        115         4.477      4.493       80.7

(MaxDiam = Max Diameter, Dilation Phase (mm); T2Max = Time to Max Diameter (sec); PreBL = Pre-cuff Baseline (mm); PostBL = Post-cuff Baseline (mm))
[Figure: histogram, boxplot, and normal probability plot for Max Diameter, Dilation Phase (mm); 702 observations spanning roughly 2.25 to 7.75 mm, approximately symmetric]
[Figure: histogram, boxplot, and normal probability plot for Pre-cuff Baseline (mm); 706 observations spanning roughly 2.25 to 7.75 mm, approximately symmetric; one histogram asterisk may represent up to 4 counts]
[Figure: stem-and-leaf plot, boxplot, and normal probability plot for Time to Max Diameter (sec); values span roughly 0 to 119 sec]
[Figure: histogram, boxplot, and normal probability plot for Age (years); values span roughly 63 to 93 years, concentrated around 75 to 81]
[Figure: scatterplot of Max. Diameter, Dilation Phase (mm) (2 to 8 mm) vs. Pre-cuff Baseline (mm) (2 to 8 mm)]
Regression of Max Diameter (mm) on Pre-cuff Baseline (mm)

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1        549.03550     549.03550   70386.5   <.0001
Error            700          5.46021       0.00780
Corrected Total  701        554.49571

Root MSE        0.08832   R-Square  0.9902
Dependent Mean  4.56780   Adj R-Sq  0.9901
Coeff Var       1.93352

Parameter Estimates

Variable    Label                    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                 1              0.17015          0.01691     10.06     <.0001
PREBL       Pre-cuff Baseline (mm)    1              0.99302          0.00374    265.30     <.0001
[Figure: Regression Line with 95% CI – Max. Diameter, Dilation Phase (mm) vs. Pre-cuff Baseline (mm), with the R² value annotated]
[Figure: scatterplot of Max. Diameter, Dilation Phase (mm) (2 to 8 mm) vs. Age (yrs) (60 to 100)]
Regression of Max Diameter (mm) on Age (yrs)

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1          0.04862       0.04862      0.06   0.8044
Error            700        554.44709       0.79207
Corrected Total  701        554.49571

Root MSE        0.88998   R-Square   0.0001
Dependent Mean  4.56780   Adj R-Sq  -0.0013
Coeff Var      19.48382

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1              4.42124          0.59247      7.46     <.0001
age         Age (yrs)    1              0.00186          0.00751      0.25     0.8044
[Figure: Diameter Dilation (mm) vs. Age (yrs) scatterplot, ages roughly 60 to 100]
Regression of Max Diameter (mm) on Pre-cuff Baseline (mm) and Age (yrs)

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2        549.09606     274.54803   35541.0   <.0001
Error            699          5.39965       0.00772
Corrected Total  701        554.49571

Root MSE        0.08789   R-Square  0.9903
Dependent Mean  4.56780   Adj R-Sq  0.9902
Coeff Var       1.92414

Parameter Estimates

Variable    Label                    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                 1              0.33282       0.06049          5.50     <.0001
PREBL       Pre-cuff Baseline (mm)    1              0.99323       0.00373        266.60     <.0001
age         Age (yrs)                 1             -0.00208       0.00074187      -2.80     0.0053
Multicollinearity

Many research studies have large numbers of predictor variables. Problems arise when the various predictors are highly related among themselves (collinear):

- Estimated regression coefficients can change dramatically, depending on whether or not other predictor(s) are included in the model
- Standard errors of regression coefficients can increase, causing non-significant t-tests and wide confidence intervals
- The variables are explaining the same variation in Y
Multicollinearity - example

Pearson Correlation Coefficients
(each cell: correlation, Prob > |r| under H0: Rho=0, number of observations)

          MAXD       PREBL      POSTBL     T2MAXD      age
MAXD      1.00000    0.99506    0.99475    0.02827     0.00936
                     <.0001     <.0001     0.4546      0.8044
          702        702        702        702         702
PREBL     0.99506    1.00000    0.99716    0.02597     0.02194
          <.0001                <.0001     0.4918      0.5605
          702        706        703        703         706
POSTBL    0.99475    0.99716    1.00000    0.01667     0.02075
          <.0001     <.0001                0.6590      0.5828
          702        703        703        703         703
T2MAXD    0.02827    0.02597    0.01667    1.00000    -0.04169
          0.4546     0.4918     0.6590                 0.2697
          702        703        703        703         703
age       0.00936    0.02194    0.02075   -0.04169     1.00000
          0.8044     0.5605     0.5828     0.2697
          702        706        703        703         706

(MAXD = Max. Diameter, Dilation Phase (mm); PREBL = Pre-cuff Baseline (mm); POSTBL = Post-cuff Baseline (mm); T2MAXD = Time to Max. Diameter (sec); age = Age (yrs))
Multicollinearity - example

Parameter Estimates (PREBL and age in the model):

Variable    Label                    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                 1              0.33282       0.06049          5.50     <.0001
PREBL       Pre-cuff Baseline (mm)    1              0.99323       0.00373        266.60     <.0001
age         Age (yrs)                 1             -0.00208       0.00074187      -2.80     0.0053

Parameter Estimates (adding POSTBL):

Variable    Label                     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                  1              0.32369       0.05707          5.67     <.0001
PREBL       Pre-cuff Baseline (mm)     1              0.55326       0.04716         11.73     <.0001
POSTBL      Post-cuff Baseline (mm)    1              0.44290       0.04735          9.35     <.0001
age         Age (yrs)                  1             -0.00213       0.00069985      -3.04     0.0025

Note how the coefficient for PREBL drops from 0.99 to 0.55, and its standard error increases more than tenfold, once the nearly collinear POSTBL enters the model.
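A small simulation sketch of the same phenomenon in Python (hypothetical data, not the study dataset): adding a nearly collinear predictor shifts the coefficients and inflates their standard errors. The helper fit_se below is ours, written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.05, size=n)   # nearly collinear with x1
y = 1.0 + 1.0 * x1 + rng.normal(0, 0.5, size=n)

def fit_se(predictors, y):
    """OLS coefficients and their standard errors (intercept first)."""
    X1 = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    mse = resid @ resid / (len(y) - X1.shape[1])
    cov = mse * np.linalg.inv(X1.T @ X1)   # estimated covariance of beta-hat
    return beta, np.sqrt(np.diag(cov))

print(fit_se([x1], y))      # x1 alone: stable coefficient, small SE
print(fit_se([x1, x2], y))  # adding x2: coefficients shift, SEs balloon
```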
We assume that the outcomes (the Y's) are normally distributed. What assumptions have we made about the distribution of the IVs (the X's)? None, except that they are RVs with some underlying distribution.

Recall that the model assumptions center on the conditional distribution of the Y's (conditional on the values of the X's).
ANOVA is simply linear regression with a series of dichotomous indicators for the "levels" of X.
Are there any differences among the population means?
Response: Continuous. Predictor: Categorical. → One-Way ANOVA

H0: All means are equal
H1: At least one mean is different

[Figure: side-by-side group plots illustrating equal means under H0 vs. at least one shifted mean under H1]
Assumptions:
- independent observations
- normally distributed data for each group, or the pooled error terms are normally distributed
- equal variances for each group
Comparing Populations

[Figure: four population distributions A, B, C, D with means μ1, μ2, μ3, μ4]
Total Variability = Variability between Groups + Variability within Groups

With i indexing the k groups and j indexing the $n_i$ individuals within group i:

Within Sum of Squares: $SSW = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{Y}_i\right)^2$

Between Sum of Squares: $SSB = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(\bar{Y}_i - \bar{Y}\right)^2$

Total Sum of Squares: $SST = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{Y}\right)^2$

SST = SSB + SSW

where $\bar{Y} = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}$ and $\bar{Y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}$.
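A minimal sketch computing SSB, SSW, and the F test by hand on hypothetical groups, cross-checked against scipy's built-in one-way ANOVA:

```python
import numpy as np
from scipy import stats

# Hypothetical groups (made-up values)
groups = [np.array([4.6, 4.8, 4.5, 4.7]),
          np.array([4.4, 4.5, 4.3, 4.6]),
          np.array([4.6, 4.7, 4.5, 4.8])]
all_y = np.concatenate(groups)
k, n = len(groups), len(all_y)

# Between- and within-group sums of squares, per the formulas above
ssb = sum(len(g) * (g.mean() - all_y.mean()) ** 2 for g in groups)
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)

f = (ssb / (k - 1)) / (ssw / (n - k))   # F statistic
p = stats.f.sf(f, k - 1, n - k)         # upper-tail p-value
print(f, p)
print(stats.f_oneway(*groups))          # cross-check with scipy
```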
ANOVA Example – Max Diameter by Reader

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2        3.6421805     1.8210902      2.31   0.0999
Error            699      550.8535268     0.7880594
Corrected Total  701      554.4957073

R-Square   Coeff Var   Root MSE   MAXD Mean
0.006568   19.43447    0.887727   4.567798

READER   MAXD LSMEAN   Standard Error   Pr > |t|   LSMEAN Number
Crotts   4.64841096    0.05998704       <.0001     1
Drum     4.48160364    0.05353196       <.0001     2
Manli    4.59687981    0.06155280       <.0001     3

(Here Corrected Total = SST, Model = SSB, and Error = SSW.)
The ability to mix categorical and continuous predictors (ANCOVA) is one of the things that makes the General Linear Model (or GLM) so flexible.

ANCOVA analyses should always assess possible interactions between continuous IVs and categorical IVs. If interactions are present, the model must be interpreted carefully.
ANCOVA Example – Max Diameter vs. Reader Adjusting for Pre-cuff

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              5      549.2229489   109.8445898   14499.4   <.0001
Error            696        5.2727583     0.0075758
Corrected Total  701      554.4957073

R-Square   Coeff Var   Root MSE   MAXD Mean
0.990491   1.905493    0.087039   4.567798

Source          DF   Type III SS   Mean Square   F Value   Pr > F
READER           2     0.0596745     0.0298373      3.94   0.0199
PREBL            1   541.7764744   541.7764744   71514.1   <.0001
PREBL*READER     2     0.0508130     0.0254065      3.35   0.0355

Continuous covariate: Pre-cuff diameter.
Interaction between Pre-cuff Diameter and Reader: yikes, it's significant!
ANCOVA Example – Stratified Analysis

Reader Effects in 1st Quartile of Pre-cuff Diameter
READER   MAXD LSMEAN   Standard Error
Crotts   3.50587755    0.04807411
Drum     3.51649412    0.03650059
Manli    3.50760000    0.04759094

Reader Effects in 2nd Quartile of Pre-cuff Diameter
READER   MAXD LSMEAN   Standard Error
Crotts   4.24069492    0.02395722
Drum     4.25861538    0.02282474
Manli    4.28982456    0.02437390

Reader Effects in 3rd Quartile of Pre-cuff Diameter
READER   MAXD LSMEAN   Standard Error
Crotts   4.84836735    0.02640555
Drum     4.84447541    0.02366619
Manli    4.77966667    0.02667919

Reader Effects in 4th Quartile of Pre-cuff Diameter
READER   MAXD LSMEAN   Standard Error
Crotts   5.78133871    0.06433425
Drum     5.64400000    0.06332105
Manli    5.78918868    0.06958252
Interaction is driven by “swapping” of effects for Reader at various levels of pre-cuff diameter.