Tuan V. Nguyen, Garvan Institute of Medical Research
Sydney, Australia
Garvan Institute Biostatistical Workshop, 16/6/2015 © Tuan V. Nguyen
Introduction to linear regression analysis
• Purposes
• Ideas of regression
• Estimation of parameters
• R code
Three types of analysis
• Analysis of difference
• Association analysis
• Correlation analysis and prediction
Analysis of correlation
• Correlation analysis
• Linear regression analysis
• Logistic regression
• Cox's regression
• etc.
A note of history
• Developed by Sir Francis Galton (1822–1911) in his article "Regression towards mediocrity in hereditary stature"
Linear regression analysis
• Assess / quantify a LINEAR relationship between variables
• Estimate the magnitude of effect of risk factors on an outcome variable
• Build prediction models
Purposes of linear regression analysis
• Find an equation to describe the relationship between X and Y
– X is the predictor, risk factor, or independent variable
– Y is the outcome variable, or dependent variable
• Adjustment for confounding factors
• Prediction
Linear regression model
How to find an equation to link the 2 points?
[Figure: two points (x1, y1) and (x2, y2) plotted on the x–y plane]
Given two points on a plane (x1, y1) and (x2, y2)
• Find a slope
• Find the intercept (value of y when x=0)
[Figure: the line through (x1, y1) and (x2, y2) on the x–y plane]

slope = Δy / Δx = (y2 − y1) / (x2 − x1)
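The slope and intercept above can be computed directly; a minimal sketch in R, using two illustrative points that are not from the slides:

```r
# Line through two illustrative points (2, 3) and (6, 11)
x1 <- 2; y1 <- 3
x2 <- 6; y2 <- 11

slope <- (y2 - y1) / (x2 - x1)   # dy / dx = 8 / 4 = 2
intercept <- y1 - slope * x1     # value of y when x = 0: 3 - 2*2 = -1
```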
The linear regression model
• Simple linear regression model
• Y: response variable, dependent variable
– Y is a continuous variable
• X: predictor variable, independent variable
– X can be a continuous variable or a categorical variable
The linear regression model
• The statement: Y = α + βX + ε
α : intercept
β : slope / gradient
ε : random error – the variation in Y for each X value
Assumptions
• The relationship between X and Y is linear
• X does not have random error
• Values of Y are independent (eg, Y1 is not related to Y2)
• Random error ε follows a Normal distribution with mean 0 and constant variance:
ε ~ N(0, σ2)
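A small simulation that satisfies these assumptions can make them concrete; the parameter values below are illustrative, not from the slides:

```r
# Simulate data from Y = alpha + beta*X + eps, eps ~ N(0, sigma^2)
set.seed(42)
n     <- 200
alpha <- 10
beta  <- 0.5
sigma <- 2

x   <- runif(n, min = 0, max = 20)    # X fixed by design (no random error in X)
eps <- rnorm(n, mean = 0, sd = sigma) # independent Normal errors, mean 0, constant variance
y   <- alpha + beta * x + eps         # the linear model
```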
Aim of estimation
• The model (population):
Y = α + βX + ε
• We don't know α and β
• But we can use observed data to estimate the two parameters
• Estimates of α and β are a and b
• They could be estimated by eyeballing, but such estimates can be biased and inconsistent
• We want a method that gives unbiased and consistent estimates
[Figure: scatter plot of y versus x]
Criteria for estimation
[Figure: scatter of observed points around a fitted line, with di the vertical distance from point (xi, yi) to the line]

• Fitted line: ŷi = a + b·xi
• Deviation: di = yi − ŷi
• Find a formula (estimator) for a and b so that the sum of di2 is minimum → the least squares method
Carl Friedrich Gauss (1777 – 1855)
• Born in Brunswick, Germany
• A prodigy, "the greatest mathematician since antiquity"
• Inventor of the least squares method
Estimates of parameters
• For a series of pairs (Xi, Yi):

b = sXY / sX2 = Σi=1..n (Xi − X̄)(Yi − Ȳ) / Σi=1..n (Xi − X̄)2

a = Ȳ − b·X̄

• Then, the fitted equation is:

Ŷi = a + b·Xi
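These formulas can be computed by hand and checked against R's `lm()`; a sketch on simulated data (the parameter values are illustrative assumptions):

```r
# Least-squares estimates from the formulas, checked against lm()
set.seed(1)
x <- rnorm(100, mean = 68, sd = 2)
y <- 24 + 0.65 * x + rnorm(100, sd = 2)

b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
a <- mean(y) - b * mean(x)                                      # intercept

fit <- lm(y ~ x)
# coef(fit) returns (intercept, slope), which should equal (a, b)
```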
Francis Galton's data
• Galton F (1869). Hereditary Genius: An Inquiry into its Laws and Consequences. London: Macmillan
• Data from 928 adult children born to 205 fathers and mothers
• Data:
– mid-parent's height
– child's height
Galton's data
galton = read.csv("~/Google Drive/Garvan Lectures 2014/Datasets and Teaching Materials/Galton data.csv", header=T)
attach(galton)
head(galton)
  id parent child
1  1   70.5   61.7
2  2   68.5   61.7
3  3   65.5   61.7
4  4   64.5   61.7
5  5   64.0   61.7
6  6   67.5   62.2
Linear regression analysis
• Research statement: Child's height is related to parent's height
• Statistical statement:
Child = α + β·Parent + ε
• R code:
m = lm(child ~ parent, data=galton)

> m = lm(child ~ parent)
> summary(m)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.94153    2.81088   8.517   <2e-16 ***
parent       0.64629    0.04114  15.711   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.239 on 926 degrees of freedom
Multiple R-squared: 0.2105, Adjusted R-squared: 0.2096
F-statistic: 246.8 on 1 and 926 DF, p-value: < 2.2e-16
Interpretation of results
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.94153    2.81088   8.517   <2e-16 ***
parent       0.64629    0.04114  15.711   <2e-16 ***
• Remember that our model is:
Child = a + b·Parent
• Our fitted equation is now:
Child = 23.94 + 0.646·Parent
• Interpretation: each 1-in increase in parental height is associated with a 0.646-in increase in child's height.
Meaning of the regression line
Expected value: Child = 23.94 + 0.646·Parent
When parental height = 64 in: Child = 23.94 + 0.646 × 64 = 65.3
When parental height = 70 in: Child = 23.94 + 0.646 × 70 = 69.2
[Figure: scatter plot of child's height versus parent's height with the fitted regression line]
Questions concerning linear regression
• Is the model good enough?
• What criteria to judge the "goodness of fit"?
• Good = small differences between observed and predicted values
Residual (e)
• Residual = the part not explained by the model
• Predicted child's height = 23.94 + 0.646·Parent
• e = Observed height – Predicted height, calculated for each child
Residual and predicted values using R
m = lm(child ~ parent, data=galton)
res = resid(m)
pred = predict(m)
> cbind(parent, child, pred, res)
   parent child     pred         res
1    70.5  61.7 69.50502 -7.80501621
2    68.5  61.7 68.21244 -6.51243505
3    65.5  61.7 66.27356 -4.57356330
4    64.5  61.7 65.62727 -3.92727272
5    64.0  61.7 65.30413 -3.60412743
6    67.5  62.2 67.56614 -5.36614446
7    67.5  62.2 67.56614 -5.36614446
8    67.5  62.2 67.56614 -5.36614446
9    66.5  62.2 66.91985 -4.71985388
10   66.5  62.2 66.91985 -4.71985388
11   66.5  62.2 66.91985 -4.71985388
Analysis of variance
• Child = a + b·Parent + e
• Observed variation = model + random
• "Variation" = sum of squares
– SStotal = total sum of squares
– SSreg = sum of squares due to the regression model
– SSerror = sum of squares due to the random component
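This decomposition (SStotal = SSreg + SSerror) can be verified numerically; a sketch on simulated data, with illustrative parameter values:

```r
# Verify SStotal = SSreg + SSerror for a fitted simple linear regression
set.seed(7)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

SStotal <- sum((y - mean(y))^2)           # total sum of squares
SSreg   <- sum((fitted(fit) - mean(y))^2) # explained by the model
SSerror <- sum(resid(fit)^2)              # residual (unexplained) part
```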
Sources of variation
• Total SS = 1237 + 4640 = 5877
– Due to "parent": 1237
– Residuals (unexplained part): 4640
> m = lm(child ~ parent, data=galton)
> anova(m)
Analysis of Variance Table
Response: child
           Df Sum Sq Mean Sq F value    Pr(>F)
parent      1 1236.9 1236.93  246.84 < 2.2e-16 ***
Residuals 926 4640.3    5.01
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient of determination (R2)
• Total SS = 1237 + 4640 = 5877
• R2 = 1237 / 5877 = 0.21
> m = lm(child ~ parent, data=galton)
> anova(m)
Analysis of Variance Table
Response: child
           Df Sum Sq Mean Sq F value    Pr(>F)
parent      1 1236.9 1236.93  246.84 < 2.2e-16 ***
Residuals 926 4640.3    5.01
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residuals:
    Min      1Q  Median      3Q     Max
-7.8050 -1.3661  0.0487  1.6339  5.9264

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.94153    2.81088   8.517   <2e-16 ***
parent       0.64629    0.04114  15.711   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.239 on 926 degrees of freedom
Multiple R-squared: 0.2105, Adjusted R-squared: 0.2096
F-statistic: 246.8 on 1 and 926 DF, p-value: < 2.2e-16
Meaning of R2
Residual standard error: 2.239 on 926 degrees of freedom
Multiple R-squared: 0.2105, Adjusted R-squared: 0.2096
F-statistic: 246.8 on 1 and 926 DF, p-value: < 2.2e-16
• Coefficient of determination R2 = 0.21
• Interpretation: approximately 21% of the variance in children's height can be accounted for by parental height
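R2 can be computed directly from the sums of squares and matched against R's reported value; a sketch on simulated data with illustrative parameters:

```r
# R^2 = 1 - SSerror/SStotal (equivalently SSreg/SStotal),
# matched against summary(fit)$r.squared
set.seed(3)
x <- rnorm(100)
y <- 5 + 0.6 * x + rnorm(100, sd = 2)
fit <- lm(y ~ x)

SStotal <- sum((y - mean(y))^2)
SSerror <- sum(resid(fit)^2)
R2 <- 1 - SSerror / SStotal
```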
Adjusted R2
• Definition:
R2adj = 1 − (MSerror / MStotal)
– MSerror: mean square due to error
– MStotal: mean square (total)
Adjusted R2
• MStotal = (1237 + 4640) / 927 = 6.34
• MSerror = 5.01
• R2adj = 1 – (5.01 / 6.34) = 0.21
> m = lm(child ~ parent, data=galton)
> anova(m)
Analysis of Variance Table
Response: child
           Df Sum Sq Mean Sq F value    Pr(>F)
parent      1 1236.9 1236.93  246.84 < 2.2e-16 ***
Residuals 926 4640.3    5.01
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residuals:
    Min      1Q  Median      3Q     Max
-7.8050 -1.3661  0.0487  1.6339  5.9264

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.94153    2.81088   8.517   <2e-16 ***
parent       0.64629    0.04114  15.711   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.239 on 926 degrees of freedom
Multiple R-squared: 0.2105, Adjusted R-squared: 0.2096
F-statistic: 246.8 on 1 and 926 DF, p-value: < 2.2e-16
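The adjusted R2 definition above can be verified against R's own output; a sketch on simulated data (parameter values are illustrative):

```r
# Adjusted R^2 = 1 - MSerror/MStotal, matched against summary(fit)$adj.r.squared
set.seed(9)
x <- rnorm(80)
y <- 2 + x + rnorm(80)
fit <- lm(y ~ x)
n <- length(y)

MStotal <- sum((y - mean(y))^2) / (n - 1)       # total mean square
MSerror <- sum(resid(fit)^2) / fit$df.residual  # residual mean square (df = n - 2 here)
R2adj   <- 1 - MSerror / MStotal
```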