Lecture 4 – Linear Regression Analysis
Steps in Data Analysis
Step 1: Collect and clean data (spreadsheet from heaven)
Step 2: Calculate descriptive statistics
Step 3: Explore graphics
Step 4: Choose outcome(s) and potential predictive variables (covariates)
Step 5: Pick an appropriate statistical procedure & execute
Step 6: Evaluate fitted model, make adjustments as needed
Four Considerations
1) Purpose of the investigation: descriptive orientation
2) The mathematical characteristics of the variables: level of measurement (nominal, ordinal, continuous) and distribution
3) The statistical assumptions made about these variables: distribution, independence, etc.
4) How the data are collected: random sample, cohort, case-control, etc.
Purpose of analysis: To relate two variables, where we designate one as the outcome of interest (dependent variable, or DV) and one or more as the predictor variables (independent variables, or IVs). In general, we will use k to represent the number of IVs; here k = 1.

Given a sample of n individuals, we observe pairs of values (Xi, Yi) for the 2 variables for each individual i.

Type of variables: continuous (interval or ratio).
Aims of regression analysis:
- Characterize the relationship by determining the extent, direction, and strength of association between the IVs and the DV
- Predict the DV as a function of the IVs
- Describe the relationship between the IVs and the DV controlling for other variables (confounders)
- Determine which IVs are important for predicting the DV and which ones are not
- Determine the best mathematical model for describing the relationship between the IVs and the DV
- Assess the interactive effects (effect modification) of 2 or more IVs with regard to the DV
- Obtain a valid and precise estimate of 1 or more regression coefficients from a larger set of regression coefficients in a given model
NOTE: When we find statistically significant associations between IVs and a DV, this does not imply that the particular IVs caused the DV to occur. Criteria to consider when assessing causality include:
Strength of association - does the association appear strong for a number of different studies?
Dose-response effect - The DV changes in a meaningful manner with changes in the IV
Lack of temporal ambiguity - The cause precedes the effect
Consistency of findings - Most studies show similar results
Biological and theoretical plausibility - The causal relationship is consistent with current biological and theoretical knowledge
Coherence of evidence - The findings do not seriously conflict with accepted facts about the DV being studied.
Specificity of association - The study factor is associated with only one effect
Simple Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where:
- $Y_i$ is the value of the response (outcome, dependent) variable for the ith unit (e.g., SBP)
- $\beta_0$ and $\beta_1$ are parameters which represent the intercept and slope, respectively
- $X_i$ is the value of the predictor (independent) variable (e.g., age) for the ith unit; X is considered fixed, not random
- $\varepsilon_i$ is a random error term that has mean 0 and variance $\sigma^2$; $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated for all $i \neq j$, $i = 1, \ldots, n$
Model is "simple" because there is only one independent variable.
Model is "linear in the parameters" because the parameters β0 and β1 do not appear as an exponent and they are not multiplied or divided by another parameter.
Model is also "linear in the independent variable" because this variable (Xi) appears only to the first power.
Simple Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

The observed value of Y for the ith unit is the sum of two components: (1) the constant term $\beta_0 + \beta_1 X_i$ and (2) the random error term $\varepsilon_i$. Hence, $Y_i$ is a random variable.

Since $\varepsilon_i$ has mean 0, $Y_i$ must have mean $\beta_0 + \beta_1 X_i$:

$$E(Y_i \mid X_i) = E(\beta_0 + \beta_1 X_i + \varepsilon_i) = \beta_0 + \beta_1 X_i + E(\varepsilon_i) = \beta_0 + \beta_1 X_i$$

where E = "expected value" = mean.
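A quick numerical check of this expectation, as a minimal Python sketch with made-up parameter values (not from the lecture's data): simulating many Y's at a fixed X shows the sample mean landing on β0 + β1X.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 50.0, 1.7, 9.0   # hypothetical parameter values
x = 40.0                               # a fixed value of X

# Draw many Y's at this X: Y = beta0 + beta1*x + eps, with eps ~ N(0, sigma^2)
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=100_000)

print(y.mean())           # close to beta0 + beta1*x = 118, since E(eps) = 0
print(beta0 + beta1 * x)  # the model mean E(Y|X)
```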
The fitted (or estimated) regression line is the expected value of Y at the given value of X, i.e., E(Y|X):

$$\hat{E}(Y \mid X) = \hat{\beta}_0 + \hat{\beta}_1 X$$

Define the residuals:

$$\hat{\varepsilon}_i = Y_i - \hat{Y}_i$$
[Figure: scatterplot of Y vs. X with the residuals (ε) drawn as vertical distances from the fitted line]
Interpreting the Coefficients

β1: the expected change in Y per unit change in X
β0: the expected value of Y when X = 0

[Figure: fitted line annotated with intercept β0 and slope β1 per 1.0 unit of X]
Model Assumptions

- Linear relationship between Y and X (i.e., only allow linear β's)
- Independent observations
- Normally distributed residuals, in particular $\varepsilon_i \sim N(0, \sigma^2)$
- Equal variances across values of X (homogeneity of variance)
Normality Assumption
[Figure: normal curves of Y centered on the regression line at each X; e.g., at X1 = 10 the distribution of Y is centered at β0 + β1x1]

$$E(y_i) = \beta_0 + \beta_1 x_i$$

$$Y_i \overset{\text{i.i.d.}}{\sim} N\!\left(\beta_0 + \beta_1 x_i,\ \sigma^2\right)$$
Homoscedasticity - The variance of Y is the same for any X
[Figure: scatterplot of Y vs. X (X from 5 to 35, Y from 20 to 45) with equal vertical spread of Y at each X]
Departures from the Normality Assumption

If the normality assumption is not "badly" violated, the model is generally robust to departures from normality.

If the normality assumption is badly violated, try a transformation of Y (e.g., the natural log).

If you transform the data, you must consider whether the transformed Y is normally distributed as well as whether the variance homogeneity assumption holds; the two often go together. A small illustration follows.
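A minimal sketch of the idea, assuming a right-skewed outcome generated as lognormal (hypothetical data, chosen only for illustration): the natural log pulls the skew toward zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.lognormal(mean=1.0, sigma=0.8, size=500)  # right-skewed outcome

print(stats.skew(y))          # strongly right-skewed on the raw scale
print(stats.skew(np.log(y)))  # near 0 after the log transform
```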
The "correct" model is fitted:
- All IVs included are truly related to the DV
- No (conceivable) IVs related to the DV have been left out

Violation of either of these assumptions can lead to "model misspecification bias".
Null Hypothesis: The simple linear regression model does not fit the data better than the baseline model (β1 = 0).

Alternative Hypothesis: The simple linear regression model fits the data better than the baseline model (β1 ≠ 0).
Fitting Data to a Linear Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

Linear regression determines the values of β0 and β1 that minimize:

$$\sum_{i} \hat{\varepsilon}_i^2 = \sum_{i} \left(Y_i - \hat{Y}_i\right)^2$$
The LEAST-SQUARES Solution
For each pair of observations (Xi, Yi), the method of least squares considers the deviation of Yi from its expected value:

$$\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2$$

The least-squares method finds the $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize this sum of squares. The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line the smallest. The resulting slope estimate is:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2} = \frac{\sum_{i=1}^{n} X_i Y_i - \frac{1}{n}\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2}$$
The Least-Squares Method

The intercept estimate is:

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
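A minimal Python sketch of these two formulas, using made-up (x, y) values rather than the lecture's data:

```python
import numpy as np

# Hypothetical illustrative data, e.g., age (x) and SBP (y)
x = np.array([35.0, 42.0, 51.0, 60.0, 68.0, 75.0])
y = np.array([118.0, 121.0, 136.0, 147.0, 155.0, 164.0])

# Least-squares slope and intercept, following the formulas above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"beta1_hat = {b1:.4f}, beta0_hat = {b0:.4f}")
```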
The method of least squares is purely mathematical. However, statistically the least squares estimators are very appealing because they are the Best Linear Unbiased Estimators (BLUE).

This means that among all of the equations we could have picked to estimate β0 and β1, the least squares equations will give us estimates:
1. That have expectation β0 and β1 (unbiased)
2. That have minimum variance among all of the possible linear estimators for β0 and β1 (most efficient)
$$SSE = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2, \qquad \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

SSE is the sum of squares due to error (i.e., the sum of the squared residuals), the quantity we wish to "minimize".
[Figure: scatterplot of Response (Y) vs. Predictor (X) decomposing the total variability $Y_i - \bar{Y}$ into explained variability $\hat{Y}_i - \bar{Y}$ and unexplained variability $Y_i - \hat{Y}_i$]
If SSE = 0, the model is a perfect fit. SSE is affected by:
1. A large σ² (a lot of variability)
2. Nonlinearity

Need to look at both (1) and (2). For now, assume linearity and estimate σ² as:

$$S_{Y|X}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 = \frac{SSE}{n-2}$$

We use n - 2 because we estimate 2 parameters, β0 and β1. SSE/(n - 2) is also known as the "mean squared error" or MSE.
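A short self-contained sketch of this estimate, again with hypothetical data:

```python
import numpy as np

# Same hypothetical data as in the earlier sketch
x = np.array([35.0, 42.0, 51.0, 60.0, 68.0, 75.0])
y = np.array([118.0, 121.0, 136.0, 147.0, 155.0, 164.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                # fitted values

sse = np.sum((y - y_hat) ** 2)     # sum of squared residuals
n = len(x)
mse = sse / (n - 2)                # divide by n-2: two parameters were estimated
print(f"SSE = {sse:.4f}, S^2_Y|X (MSE) = {mse:.4f}")
```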
How do I build my model? Using the tools of statistics:
1. First I use estimation, in particular least squares, to estimate $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\sigma}^2$, and $\hat{Y}$
2. Then I use my distributional assumptions to make inference about the estimates (hypothesis testing, e.g., is the slope 0?)
3. Interpretation: interpret the results in light of the assumptions
Hypothesis Testing for Regression Parameters

To test the hypothesis H0: β1 = β1^(0), where β1^(0) is some hypothesized value for β1, the test statistic used is

$$T = \frac{\hat{\beta}_1 - \beta_1^{(0)}}{S_{\hat{\beta}_1}}, \qquad \text{where } S_{\hat{\beta}_1} = \frac{S_{Y|X}}{S_X \sqrt{n-1}}$$

This test statistic has a t distribution with n - 2 degrees of freedom.

The CI is given by

$$\hat{\beta}_1 \pm t_{n-2,\,1-\alpha/2}\, S_{\hat{\beta}_1}$$
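A sketch of this slope test and CI in Python under H0: β1 = 0, with hypothetical data; note that $S_X\sqrt{n-1} = \sqrt{\sum_i (X_i - \bar{X})^2}$, which is what the code uses.

```python
import numpy as np
from scipy import stats

# Hypothetical data; testing H0: beta1 = 0 at the 5% level
x = np.array([35.0, 42.0, 51.0, 60.0, 68.0, 75.0])
y = np.array([118.0, 121.0, 136.0, 147.0, 155.0, 164.0])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))         # S_Y|X
se_b1 = s_yx / np.sqrt(np.sum((x - x.mean()) ** 2))  # S_Y|X / (S_X * sqrt(n-1))

t_stat = (b1 - 0.0) / se_b1
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 2)        # two-sided p-value
tcrit = stats.t.ppf(0.975, df=n - 2)                 # 95% CI critical value
print(t_stat, p_val, (b1 - tcrit * se_b1, b1 + tcrit * se_b1))
```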
Timeout: The t-Distribution

The t distribution (or Student's t distribution) arises when we use an estimated variance to construct the test statistic:

$$T = \frac{\bar{Y} - \mu}{S/\sqrt{n}}, \qquad \text{where } S = \sqrt{\frac{\sum_i \left(Y_i - \bar{Y}\right)^2}{n-1}} \text{ is the sample standard deviation}$$

As n → ∞, T → Z ~ N(0, 1). We have to pay a penalty for estimating σ².

Can think of the t distribution as a thick-tailed normal.
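A quick numerical illustration of that "penalty": the t distribution's 97.5th percentile shrinks toward the normal's 1.96 as the degrees of freedom grow.

```python
from scipy import stats

# 97.5th percentile of t for increasing df, vs. the standard normal
for df in (2, 5, 10, 30, 100):
    print(df, round(stats.t.ppf(0.975, df), 3))
print("normal:", round(stats.norm.ppf(0.975), 3))  # 1.96
```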
Inference Concerning the Intercept

To test the hypothesis H0: β0 = β0^(0) we use the following statistic:

$$T = \frac{\hat{\beta}_0 - \beta_0^{(0)}}{S_{\hat{\beta}_0}}, \qquad \text{where } S_{\hat{\beta}_0} = S_{Y|X}\sqrt{\frac{1}{n} + \frac{\bar{X}^2}{(n-1)S_X^2}}$$

which also has the t distribution with n - 2 degrees of freedom when H0: β0 = β0^(0) is true.

The CI is given by

$$\hat{\beta}_0 \pm t_{n-2,\,1-\alpha/2}\, S_{\hat{\beta}_0}$$
Recall the hypotheses:

Null Hypothesis: The simple linear regression model does not fit the data better than the baseline model (β1 = 0).

Alternative Hypothesis: The simple linear regression model does fit the data better than the baseline model (β1 ≠ 0).
Interpretations of Tests for Slope

Failure to reject H0: β1 = 0 could mean: $\bar{Y}$ is essentially as good as $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ for predicting Y.

[Figure: scatterplot of y vs. x with an essentially flat fitted line near $\bar{Y}$]
Interpretations of Tests for Slope

Failure to reject H0: β1 = 0 could mean: the true relationship between Y and X is not linear (i.e., it could be quadratic or some other higher power).

[Figure: scatterplot of y vs. x showing a curved (e.g., quadratic) relationship]
Dude, that’s why you always plot Y vs. X!
Interpretations of Tests for Slope

Failure to reject H0: β1 = 0 could mean: we do not have enough power to detect a significant slope.

Not rejecting H0: β1 = 0 implies that a straight-line model in X is not the best model to use, and does not provide much help for predicting Y (ignoring power).
The Intercept

We often leave the intercept, β0, in the model regardless of whether the hypothesis H0: β0 = 0 is rejected or not. This is because if we say the intercept is zero, then we must force the regression line through the origin (0, 0), and rarely is this true.
Regression of SBP on age:

Analysis of Variance

Source            DF   Sum of Squares   Mean Square
Model              1       4008.12372    4008.12372
Error             28       2319.37628      82.83487
Corrected Total   29       6327.50000

Root MSE   9.10137

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             54.21462         13.08530      4.14     0.0003
age          1              1.70995          0.24582      6.96     <.0001

Notes on reading the output: Root MSE = $S_{Y|X}$ (the square root of the MSE estimate of σ²); the Error Sum of Squares is SSE; the age row gives $\hat{\beta}_1$, SE($\hat{\beta}_1$), and the t test of H0: β1 = 0.
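As a check, the age row of this output can be reproduced from estimate/SE and a t distribution with n - 2 = 28 df (values copied from the table above):

```python
from scipy import stats

# t = estimate / SE for the age coefficient, df = n - 2 = 28
est, se, df = 1.70995, 0.24582, 28
t = est / se
print(round(t, 2))                 # 6.96, matching the SAS output
print(2 * stats.t.sf(abs(t), df))  # two-sided p-value, < .0001
```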
Multiple Regression

Response Variable: Y. Explanatory Variables: X1, ..., Xk.

Model (extension of simple regression):

$$E(Y) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k, \qquad V(Y) = \sigma^2$$

Partial Regression Coefficients (βi): the effect of increasing Xi by 1 unit, holding all other predictors constant.

Computer packages fit these models; hand calculations are very tedious.

Model parameters: β0, β1, ..., βk, σ

Least squares prediction equation:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k$$

Residuals: $\hat{\varepsilon}_i = Y_i - \hat{Y}_i$

Error Sum of Squares:

$$SSE = \sum_i \left(Y_i - \hat{Y}_i\right)^2 = \sum_i \hat{\varepsilon}_i^2$$

Estimated conditional standard deviation:

$$\hat{\sigma} = \sqrt{\frac{SSE}{n - k - 1}}$$
When there are 2 independent variables (X1 and X2) we can view the regression as fitting the best plane to the 3 dimensional set of points (as compared to the best line in simple linear regression)
When there are more than 2 IVs plotting becomes much more difficult
Analysis of Variance:

Regression Sum of Squares: $SSR = \sum\left(\hat{Y}_i - \bar{Y}\right)^2$, with $df_R = k$

Error Sum of Squares: $SSE = \sum\left(Y_i - \hat{Y}_i\right)^2$, with $df_E = n - k - 1$

Total Sum of Squares: $TSS = \sum\left(Y_i - \bar{Y}\right)^2$, with $df_T = n - 1$

Coefficient of (Multiple) Determination: R² = SSR/TSS (the % of variation explained by the model)

Least Squares Estimates: regression coefficients, estimated standard errors, t-statistics, and p-values (significance levels for 2-sided tests)
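A minimal sketch of a multiple regression fit and R² on simulated (hypothetical) data, using ordinary least squares via numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 2
X = rng.normal(size=(n, k))                        # two hypothetical predictors
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.3, size=n)

# Design matrix with a column of ones for the intercept
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares estimates

y_hat = X1 @ beta_hat
sse = np.sum((y - y_hat) ** 2)                     # df = n - k - 1
ssr = np.sum((y_hat - y.mean()) ** 2)              # df = k
tss = np.sum((y - y.mean()) ** 2)                  # df = n - 1
print(beta_hat, ssr / tss)                         # coefficients and R^2
```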
N = 706 participants (first 10 shown below)

ID       Gender  Reader  MaxDiam(mm)  T2Max(sec)  PreBL(mm)  PostBL(mm)  Age(yrs)
3000028  M       Crotts  6.835         84         6.559      6.573       84.2
3000052  F       Manli   2.905         89         2.809      2.829       75.3
3000079  M       Manli   3.677         52         3.583      3.576       80.1
3000087  M       Manli   4.974         57         4.957      4.909       78.3
3000257  F       Crotts  4.748         62         4.492      4.291       78
3000346  M       Drum    5.973        114         5.929      5.917       78.5
3000419  F       Drum    3.429         94         3.288      3.312       76.6
3000524  M       Drum    4.971         34         4.897      4.887       75.4
3000559  F       Crotts  4.162         46         3.825      3.751       76.5
3000591  M       Crotts  4.677        115         4.477      4.493       80.7

(MaxDiam = Max Diameter, Dilation Phase (mm); T2Max = Time to Max Diameter (sec); PreBL = Pre-cuff Baseline (mm); PostBL = Post-cuff Baseline (mm))
[Figure: histogram, boxplot, and normal probability plot for Max Diameter, Dilation Phase (mm); 702 observations spanning roughly 2.25 to 7.75 mm, approximately symmetric]
[Figure: histogram, boxplot, and normal probability plot for Pre-cuff Baseline (mm); 706 observations spanning roughly 2.25 to 7.75 mm, approximately symmetric; one histogram asterisk may represent up to 4 counts]
[Figure: stem-and-leaf plot, boxplot, and normal probability plot for Time to Max Diameter (sec); values span roughly 0 to 119 sec]
[Figure: histogram, boxplot, and normal probability plot for Age (years); values span roughly 63 to 93 years, concentrated around 75 to 81]
[Figure: scatterplot of Max. Diameter, Dilation Phase (mm) (2 to 8 mm) vs. Pre-cuff Baseline (mm) (2 to 8 mm)]
Regression of Max Diameter (mm) on Pre-cuff Baseline (mm)

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1        549.03550     549.03550   70386.5   <.0001
Error            700          5.46021       0.00780
Corrected Total  701        554.49571

Root MSE        0.08832   R-Square  0.9902
Dependent Mean  4.56780   Adj R-Sq  0.9901
Coeff Var       1.93352

Parameter Estimates

Variable    Label                    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                 1              0.17015          0.01691     10.06     <.0001
PREBL       Pre-cuff Baseline (mm)    1              0.99302          0.00374    265.30     <.0001
[Figure: Regression Line with 95% CI – Max. Diameter, Dilation Phase (mm) vs. Pre-cuff Baseline (mm), with the R² value annotated]
[Figure: scatterplot of Max. Diameter, Dilation Phase (mm) (2 to 8 mm) vs. Age (yrs) (60 to 100)]
Regression of Max Diameter (mm) on Age (yrs)

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1          0.04862       0.04862      0.06   0.8044
Error            700        554.44709       0.79207
Corrected Total  701        554.49571

Root MSE        0.88998   R-Square   0.0001
Dependent Mean  4.56780   Adj R-Sq  -0.0013
Coeff Var      19.48382

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1              4.42124          0.59247      7.46     <.0001
age         Age (yrs)    1              0.00186          0.00751      0.25     0.8044
[Figure: Diameter Dilation (mm) vs. Age (yrs) scatterplot, ages roughly 60 to 100]
Regression of Max Diameter (mm) on Pre-cuff Baseline (mm) and Age (yrs)

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2        549.09606     274.54803   35541.0   <.0001
Error            699          5.39965       0.00772
Corrected Total  701        554.49571

Root MSE        0.08789   R-Square  0.9903
Dependent Mean  4.56780   Adj R-Sq  0.9902
Coeff Var       1.92414

Parameter Estimates

Variable    Label                    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                 1              0.33282       0.06049          5.50     <.0001
PREBL       Pre-cuff Baseline (mm)    1              0.99323       0.00373        266.60     <.0001
age         Age (yrs)                 1             -0.00208       0.00074187      -2.80     0.0053
Multicollinearity

Many research studies have large numbers of predictor variables. Problems arise when the various predictors are highly related among themselves (collinear):

- Estimated regression coefficients can change dramatically, depending on whether or not other predictor(s) are included in the model
- Standard errors of regression coefficients can increase, causing non-significant t-tests and wide confidence intervals
- The variables are explaining the same variation in Y
Multicollinearity - example

Pearson Correlation Coefficients
(each cell: correlation, Prob > |r| under H0: Rho=0, number of observations)

          MAXD       PREBL      POSTBL     T2MAXD      age
MAXD      1.00000    0.99506    0.99475    0.02827     0.00936
                     <.0001     <.0001     0.4546      0.8044
          702        702        702        702         702
PREBL     0.99506    1.00000    0.99716    0.02597     0.02194
          <.0001                <.0001     0.4918      0.5605
          702        706        703        703         706
POSTBL    0.99475    0.99716    1.00000    0.01667     0.02075
          <.0001     <.0001                0.6590      0.5828
          702        703        703        703         703
T2MAXD    0.02827    0.02597    0.01667    1.00000    -0.04169
          0.4546     0.4918     0.6590                 0.2697
          702        703        703        703         703
age       0.00936    0.02194    0.02075   -0.04169     1.00000
          0.8044     0.5605     0.5828     0.2697
          702        706        703        703         706

(MAXD = Max. Diameter, Dilation Phase (mm); PREBL = Pre-cuff Baseline (mm); POSTBL = Post-cuff Baseline (mm); T2MAXD = Time to Max. Diameter (sec); age = Age (yrs))
Multicollinearity - example

Parameter Estimates (PREBL and age in the model):

Variable    Label                    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                 1              0.33282       0.06049          5.50     <.0001
PREBL       Pre-cuff Baseline (mm)    1              0.99323       0.00373        266.60     <.0001
age         Age (yrs)                 1             -0.00208       0.00074187      -2.80     0.0053

Parameter Estimates (adding POSTBL):

Variable    Label                     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept                  1              0.32369       0.05707          5.67     <.0001
PREBL       Pre-cuff Baseline (mm)     1              0.55326       0.04716         11.73     <.0001
POSTBL      Post-cuff Baseline (mm)    1              0.44290       0.04735          9.35     <.0001
age         Age (yrs)                  1             -0.00213       0.00069985      -3.04     0.0025

Note how the coefficient for PREBL drops from 0.99 to 0.55, and its standard error increases more than tenfold, once the nearly collinear POSTBL enters the model.
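A small simulation sketch of the same phenomenon in Python (hypothetical data, not the study dataset): adding a nearly collinear predictor shifts the coefficients and inflates their standard errors. The helper fit_se below is ours, written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.05, size=n)   # nearly collinear with x1
y = 1.0 + 1.0 * x1 + rng.normal(0, 0.5, size=n)

def fit_se(predictors, y):
    """OLS coefficients and their standard errors (intercept first)."""
    X1 = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    mse = resid @ resid / (len(y) - X1.shape[1])
    cov = mse * np.linalg.inv(X1.T @ X1)   # estimated covariance of beta-hat
    return beta, np.sqrt(np.diag(cov))

print(fit_se([x1], y))      # x1 alone: stable coefficient, small SE
print(fit_se([x1, x2], y))  # adding x2: coefficients shift, SEs balloon
```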
We assume that the outcomes (the Y's) are normally distributed. What assumptions have we made about the distribution of the IVs (the X's)? None, except that they are RVs with some underlying distribution.

Recall that the model assumptions center on the conditional distribution of the Y's (conditional on the values of the X's).
ANOVA is simply linear regression with a series of dichotomous indicators for the "levels" of X.
Are there any differences among the population means?
Response: Continuous. Predictor: Categorical. → One-Way ANOVA

H0: All means are equal
H1: At least one mean is different

[Figure: side-by-side group plots illustrating equal means under H0 vs. at least one shifted mean under H1]
Assumptions:
- independent observations
- normally distributed data for each group, or the pooled error terms are normally distributed
- equal variances for each group
Comparing Populations

[Figure: four population distributions A, B, C, D with means μ1, μ2, μ3, μ4]
Total Variability = Variability between Groups + Variability within Groups

With i indexing the k groups and j indexing the $n_i$ individuals within group i:

Within Sum of Squares: $SSW = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{Y}_i\right)^2$

Between Sum of Squares: $SSB = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(\bar{Y}_i - \bar{Y}\right)^2$

Total Sum of Squares: $SST = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{Y}\right)^2$

SST = SSB + SSW

where $\bar{Y} = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}$ and $\bar{Y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}$.
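A minimal sketch computing SSB, SSW, and the F test by hand on hypothetical groups, cross-checked against scipy's built-in one-way ANOVA:

```python
import numpy as np
from scipy import stats

# Hypothetical groups (made-up values)
groups = [np.array([4.6, 4.8, 4.5, 4.7]),
          np.array([4.4, 4.5, 4.3, 4.6]),
          np.array([4.6, 4.7, 4.5, 4.8])]
all_y = np.concatenate(groups)
k, n = len(groups), len(all_y)

# Between- and within-group sums of squares, per the formulas above
ssb = sum(len(g) * (g.mean() - all_y.mean()) ** 2 for g in groups)
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)

f = (ssb / (k - 1)) / (ssw / (n - k))   # F statistic
p = stats.f.sf(f, k - 1, n - k)         # upper-tail p-value
print(f, p)
print(stats.f_oneway(*groups))          # cross-check with scipy
```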
ANOVA Example – Max Diameter by Reader

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2        3.6421805     1.8210902      2.31   0.0999
Error            699      550.8535268     0.7880594
Corrected Total  701      554.4957073

R-Square   Coeff Var   Root MSE   MAXD Mean
0.006568   19.43447    0.887727   4.567798

READER   MAXD LSMEAN   Standard Error   Pr > |t|   LSMEAN Number
Crotts   4.64841096    0.05998704       <.0001     1
Drum     4.48160364    0.05353196       <.0001     2
Manli    4.59687981    0.06155280       <.0001     3

(Here Corrected Total = SST, Model = SSB, and Error = SSW.)
The ability to mix categorical and continuous predictors (ANCOVA) is one of the things that makes the General Linear Model (or GLM) so flexible.

ANCOVA analyses should always assess possible interactions between continuous IVs and categorical IVs. If interactions are present, the model must be interpreted carefully.
ANCOVA Example – Max Diameter vs. Reader Adjusting for Pre-cuff

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              5      549.2229489   109.8445898   14499.4   <.0001
Error            696        5.2727583     0.0075758
Corrected Total  701      554.4957073

R-Square   Coeff Var   Root MSE   MAXD Mean
0.990491   1.905493    0.087039   4.567798

Source          DF   Type III SS   Mean Square   F Value   Pr > F
READER           2     0.0596745     0.0298373      3.94   0.0199
PREBL            1   541.7764744   541.7764744   71514.1   <.0001
PREBL*READER     2     0.0508130     0.0254065      3.35   0.0355

Continuous covariate: Pre-cuff diameter.
Interaction between Pre-cuff Diameter and Reader: yikes, it's significant!
ANCOVA Example – Stratified Analysis

Reader Effects in 1st Quartile of Pre-cuff Diameter
READER   MAXD LSMEAN   Standard Error
Crotts   3.50587755    0.04807411
Drum     3.51649412    0.03650059
Manli    3.50760000    0.04759094

Reader Effects in 2nd Quartile of Pre-cuff Diameter
READER   MAXD LSMEAN   Standard Error
Crotts   4.24069492    0.02395722
Drum     4.25861538    0.02282474
Manli    4.28982456    0.02437390

Reader Effects in 3rd Quartile of Pre-cuff Diameter
READER   MAXD LSMEAN   Standard Error
Crotts   4.84836735    0.02640555
Drum     4.84447541    0.02366619
Manli    4.77966667    0.02667919

Reader Effects in 4th Quartile of Pre-cuff Diameter
READER   MAXD LSMEAN   Standard Error
Crotts   5.78133871    0.06433425
Drum     5.64400000    0.06332105
Manli    5.78918868    0.06958252
Interaction is driven by “swapping” of effects for Reader at various levels of pre-cuff diameter.