Statistical Data Analysis 2010/2011, M. de Gunst, Lecture 9
Statistical Data Analysis
2010/2011
M. de Gunst
Lecture 9
Statistical Data Analysis: Introduction
Topics:
- Summarizing data
- Investigating distributions
- Bootstrap
- Robust methods
- Nonparametric tests
- Analysis of categorical data
- Multiple linear regression
Multiple linear regression (Reader: Chapter 8)
Relationship between one response variable and one or more explanatory variables
- Statistical model: multiple linear regression model
- Parameter estimation
- Selection of explanatory variables:
  - numerical measures: determination coefficient, partial correlation coefficient
  - tests: F-tests, t-test
- Model quality: global methods; diagnostics (several plots)
Example
Body fat: difficult and expensive to obtain. Can it be predicted by one or more other, more easily measurable variables?
Possible explanatory variables:
- triceps skin-fold thickness
- thigh circumference
- mid-arm circumference
What kind of relationship? Try the simplest: linear
First make plot(s) of the available data. Which one(s)?
[Figure: pairwise scatter plots; data of body fat, triceps skin-fold thickness, thigh circumference and mid-arm circumference for twenty healthy females aged 20 to 34]
Looks promising!
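Statistical model
Multiple linear regression model:
$$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + e_i, \quad i = 1, \dots, n$$
- $Y_i$: i-th response
- $x_{ij}$: value of j-th explanatory variable corresponding to i-th response
- $e_i$: stochastic “measurement error” for i-th response
- $\beta_0, \beta_1, \dots, \beta_p$: unknown constants ($\beta_0$: intercept)
or in matrix notation:
$$Y = X\beta + e$$
with $X$ the $n \times (p+1)$ design matrix, whose first column consists of ones (for the intercept)
Assumption: $e_1, \dots, e_n$ independent and normally distributed, $e_i \sim N(0, \sigma^2)$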
Statistical model
Note: response and explanatory variables are continuous
Other types of models?
Statistical model
Issues:
1) estimate the parameters $\beta$ and $\sigma^2$
2) select explanatory variables
3) assess model quality
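1) Parameter estimation - $\beta$
Estimate $\beta$ with least squares:
minimize $S(\beta) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip})^2 = \|Y - X\beta\|^2$ w.r.t. $\beta$
Solution: $\hat\beta = (X^T X)^{-1} X^T Y$
$E\hat\beta = \beta$ → unbiased estimator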
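The least-squares step can be sketched in R as follows. This is a minimal illustration on simulated data (the data, the seed and all variable names are assumptions, not the lecture's body-fat data), assuming the standard closed form $\hat\beta = (X^T X)^{-1} X^T Y$; it compares the result with `lsfit`, the function used later in the lecture.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
n <- 50; p <- 2
x <- matrix(rnorm(n * p), n, p)            # explanatory variables
y <- 1 + 2 * x[, 1] - 0.5 * x[, 2] + rnorm(n)

X <- cbind(1, x)                            # design matrix: column of ones for the intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves the normal equations (X'X) beta = X'y

# lsfit adds the intercept column itself
beta_lsfit <- lsfit(x, y)$coefficients
print(cbind(beta_hat, beta_lsfit))
```

Solving the normal equations with `solve(A, b)` avoids forming the inverse explicitly; `lsfit` uses a QR decomposition internally, so the two columns should agree up to rounding.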
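1) Parameter estimation - $\sigma^2$
i-th residual: $\hat e_i = Y_i - \hat Y_i$, with $\hat Y = X\hat\beta$
Residual sum of squares: $RSS = \sum_{i=1}^n \hat e_i^2$
Estimate $\sigma^2$ by $\hat\sigma^2 = RSS/(n-p-1)$
Under normality of the $e_i$: $RSS/\sigma^2$ has a chi-square distribution with $n-p-1$ degrees of freedom, so $E\hat\sigma^2 = \sigma^2$ → unbiased estimator
What do residuals tell us? If large, model “not so good”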
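A small R sketch of the variance estimate, on simulated data (all names and numbers are assumptions for illustration). It computes the residual sum of squares and divides by $n-p-1$, then cross-checks against `lm`.

```r
set.seed(2)
n <- 40; p <- 3
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(2 + x %*% c(1, 0, -1) + rnorm(n, sd = 1.5))

fit <- lsfit(x, y)
res <- fit$residuals                  # residuals: response minus fitted value
RSS <- sum(res^2)                     # residual sum of squares
sigma2_hat <- RSS / (n - p - 1)       # divide by n - p - 1, not by n

# cross-check: summary.lm reports sqrt(sigma2_hat) as "sigma"
sigma2_lm <- summary(lm(y ~ x))$sigma^2
print(c(sigma2_hat = sigma2_hat, sigma2_lm = sigma2_lm))
```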
2) Selection of explanatory variables
Do more variables explain the variability in the responses better?
Do we want a large model?
Want: smallest possible model that explains as much of the variability in the responses as possible → contradictory requirements
Need: selection criterion/measure for how much variability is explained
2) Selection of variables – determination coefficient (1)
Sum of squares for Y: $SS_Y = \sum_{i=1}^n (Y_i - \bar Y)^2$ (what is this?)
Sum of squares for regression: $SS_{reg} = SS_Y - RSS$ (what is this?)
Determination coefficient: $R^2 = SS_{reg}/SS_Y = 1 - RSS/SS_Y$
= fraction of variability in Y explained by design matrix X
When is $R^2$ larger, with more or with fewer variables in the model?
What is better, large or small $R^2$? What is large?
2) Selection of variables – determination coefficient (2)
Determination coefficient $R^2$: fraction of variability in Y explained by design matrix X
For simple linear regression: $R^2 = \mathrm{cor}(Y, X_1)^2$
For multiple regression: $R^2$ is the square of $R$, the multiple correlation coefficient = largest correlation between Y and any linear combination of the $X_j$ (attained by the fitted values: $R = \mathrm{cor}(Y, \hat Y)$)
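The determination coefficient can be checked numerically. The sketch below uses simulated data (an assumption, as are all variable names) and the usual definitions $SS_Y = \sum (Y_i - \bar Y)^2$, $SS_{reg} = SS_Y - RSS$ and $R^2 = SS_{reg}/SS_Y$; it also illustrates that $R^2$ does not decrease when a variable is added.

```r
set.seed(3)
n <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.8 * x1 + 0.3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
RSS <- sum(residuals(fit)^2)
SSY <- sum((y - mean(y))^2)                # total variability in the responses
SSreg <- SSY - RSS                         # part explained by the regression
R2 <- SSreg / SSY

R2_cor <- cor(y, fitted(fit))^2            # squared multiple correlation coefficient
R2_small <- summary(lm(y ~ x1))$r.squared  # model with one variable fewer
print(c(R2 = R2, R2_cor = R2_cor, R2_small = R2_small))
```

Because the smaller model is nested in the larger one, `R2_small` can never exceed `R2`, which is why the determination coefficient alone cannot decide how many variables to include.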
2) Selection of variables – overall F-test
Another scaling of $SS_{reg}$ yields a test statistic for $H_0: \beta_1 = \dots = \beta_p = 0$ (overall F-test)
Test statistic: $F = \dfrac{SS_{reg}/p}{RSS/(n-p-1)} \sim F_{p,\,n-p-1}$ under $H_0$ (an F-distribution)
If F is large, it makes sense to include the p variables that are considered (at least one $\beta_j \neq 0$)
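A minimal R sketch of the overall F-test on simulated data (data and names are assumptions), assuming the usual form $F = (SS_{reg}/p) / (RSS/(n-p-1))$ with $p$ and $n-p-1$ degrees of freedom. The hand-computed value is compared with the one reported by `summary.lm`.

```r
set.seed(4)
n <- 25; p <- 3
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(0.5 + x %*% c(1, 0.5, 0) + rnorm(n))

fit <- lm(y ~ x)
RSS <- sum(residuals(fit)^2)
SSreg <- sum((y - mean(y))^2) - RSS
F_stat <- (SSreg / p) / (RSS / (n - p - 1))
p_value <- pf(F_stat, p, n - p - 1, lower.tail = FALSE)  # upper tail of F(p, n-p-1)

F_lm <- summary(fit)$fstatistic   # named vector: value, numdf, dendf
print(c(F_stat = F_stat, p_value = p_value))
```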
2) Selection of variables – partial F-test
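Partial F-test
Next to $X_1, \dots, X_q$ also $X_{q+1}, \dots, X_p$ in the model? Test $H_0: \beta_{q+1} = \dots = \beta_p = 0$
Which sums of squares give an indication? The residual sums of squares $RSS_q$ (model with $X_1, \dots, X_q$ only) and $RSS_p$ (model with all p variables)
Test statistic: $F = \dfrac{(RSS_q - RSS_p)/(p-q)}{RSS_p/(n-p-1)} \sim F_{p-q,\,n-p-1}$ under $H_0$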
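The partial F-test compares the residual sums of squares of a smaller and a larger nested model. The sketch below is an illustration on simulated data; the symbols `q` (number of variables in the smaller model) and `p` (in the larger model) are notation assumed here, and the statistic is the usual $((RSS_q - RSS_p)/(p-q)) / (RSS_p/(n-p-1))$.

```r
set.seed(5)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + x1 + 0.4 * x2 + rnorm(n)

small <- lm(y ~ x1)            # q = 1 variable
large <- lm(y ~ x1 + x2 + x3)  # p = 3 variables
q <- 1; p <- 3
RSS_q <- sum(residuals(small)^2)
RSS_p <- sum(residuals(large)^2)
F_partial <- ((RSS_q - RSS_p) / (p - q)) / (RSS_p / (n - p - 1))
p_value <- pf(F_partial, p - q, n - p - 1, lower.tail = FALSE)

tab <- anova(small, large)     # base R computes the same test for nested lm fits
print(tab)
```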
2) Selection of variables – t-test
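For testing whether or not 1 variable $X_k$ should be included: $H_0: \beta_k = 0$
Test statistic: $T_k = \hat\beta_k / \widehat{se}(\hat\beta_k) \sim t_{n-p-1}$ under $H_0$, where $\widehat{se}(\hat\beta_k)^2$ is $\hat\sigma^2$ times the diagonal element of $(X^T X)^{-1}$ that corresponds to $\beta_k$
Relationship t and partial F: $T_k^2 = F$, the partial F statistic for adding $X_k$ to the other p-1 variables
Very often used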
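A short R sketch of the t-test and its relation to the partial F-test, on simulated data (data and names are assumptions). The standard errors are taken from the estimated covariance matrix $\hat\sigma^2 (X^T X)^{-1}$ of $\hat\beta$.

```r
set.seed(6)
n <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + x1 + 0.5 * x2 + rnorm(n)
X <- cbind(1, x1, x2); p <- 2

beta_hat <- solve(t(X) %*% X, t(X) %*% y)
RSS <- sum((y - X %*% beta_hat)^2)
sigma2_hat <- RSS / (n - p - 1)
cov_beta <- sigma2_hat * solve(t(X) %*% X)      # estimated covariance matrix of beta-hat
t_stat <- as.numeric(beta_hat) / sqrt(diag(cov_beta))
p_two_sided <- 2 * pt(abs(t_stat), df = n - p - 1, lower.tail = FALSE)

# partial F for adding x2 to the model that already contains x1
F_x2 <- anova(lm(y ~ x1), lm(y ~ x1 + x2))$F[2]
print(c(t_x2_squared = unname(t_stat[3])^2, F_x2 = F_x2))
```

The two printed numbers agree: squaring the t-statistic of a single coefficient gives the partial F statistic for that variable.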
2) Selection of variables – partial correlation coefficient
Linear relationship of Y and $X_k$ corrected for the other p-1 variables in the model
Partial correlation coefficient = $\mathrm{cor}(r_Y, r_{X_k})$
- $r_Y$: vector of residuals from regression of Y on all $X_j$ except $X_k$
- $r_{X_k}$: vector of residuals from regression of $X_k$ on all other $X_j$
If large (in absolute value): indication that $X_k$ should be included next to the p-1 other variables
Equivalent to t-test
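The equivalence with the t-test can be made concrete. The sketch below, on simulated data (an assumption, as are the variable names), computes the partial correlation from two residual vectors and recovers the t-statistic through the standard identity $t = r\sqrt{n-p-1}/\sqrt{1-r^2}$.

```r
set.seed(7)
n <- 35
x1 <- rnorm(n); x2 <- 0.6 * x1 + rnorm(n); x3 <- rnorm(n)
y <- 1 + x1 + 0.5 * x2 + rnorm(n)
p <- 3

# partial correlation of y and x2, corrected for x1 and x3
res_y  <- residuals(lm(y ~ x1 + x3))
res_x2 <- residuals(lm(x2 ~ x1 + x3))
r_partial <- cor(res_y, res_x2)

# t-statistic of x2 in the full model, and the same value recovered from r_partial
t_x2 <- coef(summary(lm(y ~ x1 + x2 + x3)))["x2", "t value"]
t_from_r <- r_partial * sqrt(n - p - 1) / sqrt(1 - r_partial^2)
print(c(r_partial = r_partial, t_x2 = t_x2, t_from_r = t_from_r))
```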
2) Selection of variables – practice
How to select systematically in practice?
Two ways:
- build up step by step: determination coefficient, then t-test for the last step
- break down step by step: t-tests, then determination coefficient for the last step
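One possible reading of the build-up procedure is sketched below in R. The function name `build_up`, the stopping rule (stop as soon as the newly added coefficient is not significant in a two-sided t-test) and the level `alpha = 0.05` are assumptions made for this illustration, and the data are simulated.

```r
set.seed(8)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 1.5 * dat$x1 + 0.8 * dat$x2 + rnorm(n)

# Hypothetical helper: add variables one by one, largest R^2 first
build_up <- function(dat, response, alpha = 0.05) {
  candidates <- setdiff(names(dat), response)
  selected <- character(0)
  while (length(candidates) > 0) {
    # determination coefficient of each model with one candidate added
    r2 <- sapply(candidates, function(v) {
      summary(lm(reformulate(c(selected, v), response), data = dat))$r.squared
    })
    best <- names(which.max(r2))
    fit <- lm(reformulate(c(selected, best), response), data = dat)
    # two-sided t-test for the coefficient of the newly added variable
    p_new <- coef(summary(fit))[best, "Pr(>|t|)"]
    if (p_new > alpha) break
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
  }
  selected
}

selected <- build_up(dat, "y")
print(selected)
```

A break-down variant would start from the full model and repeatedly drop the variable with the largest p-value; as the body-fat example in this lecture shows, the two directions need not end at the same model.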
Example - bodyfat
Build up a model:
Determination coefficients of the univariate regressions:

          Triceps  Thigh  Midarm
Fat on    0.71     0.77   0.02

First: regression of Fat on Thigh
Example - bodyfat
R: (data in matrix bf)
> zglob3 = globalregression(bf[,3],bf[,1])
> zglob3
$RSS
[1] 113.4237
$detcoef
[1] 0.7710414
$beta  #estimate
  Intercept           X
-23.6344891   0.8565466
$covbeta  #estimate of cov matrix of beta-hat
                       x
   32.0063293 -0.61933288
x  -0.6193329  0.01210344
$sigmakw  #estimate
[1] 6.301316
...
$t  #value t-statistics
Intercept         X
-4.177614  7.785681
$pt  #onesided p-values
[1] 5.656662e-04 3.599996e-07
# Thigh significant at 0.05 (two-sided test)
$F
[1] 60.61684
$pF
[1] 3.599996e-07
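The numbers in this output are mutually consistent, which can be verified with a few lines of R. The sketch below only uses values copied from the output above, with $n = 20$ (twenty women) and $p = 1$; the formulas are the standard ones and are not taken from the source of `globalregression`, which is not shown in the lecture.

```r
# Values copied from the globalregression output above
RSS <- 113.4237; detcoef <- 0.7710414
beta <- c(-23.6344891, 0.8565466)
covbeta <- matrix(c(32.0063293, -0.6193329, -0.6193329, 0.01210344), 2, 2)
n <- 20; p <- 1                        # twenty women, one explanatory variable

sigma2_hat <- RSS / (n - p - 1)        # compare with $sigmakw
t_stat <- beta / sqrt(diag(covbeta))   # compare with $t
F_stat <- (detcoef / p) / ((1 - detcoef) / (n - p - 1))  # compare with $F
p_F <- pf(F_stat, p, n - p - 1, lower.tail = FALSE)
p_t <- 2 * pt(abs(t_stat[2]), n - p - 1, lower.tail = FALSE)  # two-sided, slope

print(c(sigma2_hat = sigma2_hat, t_intercept = t_stat[1],
        t_slope = t_stat[2], F = F_stat))
```

With a single explanatory variable the overall F statistic is the square of the slope's t-statistic, so the F-test p-value and the two-sided t-test p-value for the slope coincide up to rounding of the copied inputs.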
Example - bodyfat
R: (data in matrix bf)
> zfit3 = lsfit(bf[,3],bf[,1])
> zfit3
$coefficients
  Intercept           X
-23.6344891   0.8565466
$residuals
 [1] -1.3826690  3.7784688 -2.1202790 -2.7759908  0.3882229 -0.8333722
 [7]  0.6265135  4.4084117  2.1928142 -2.8907536  0.5539520  2.2682973
… and some more
Example - bodyfat
Adding one of the other variables:
> zglob32 = globalregression(bf[,c(3,2)],bf[,1])
> zglob34 = globalregression(bf[,c(3,4)],bf[,1])
yields almost the same value for the det.coef: 0.78; moreover, the coefficient of the additional variable is not significantly different from 0
So we stop adding variables
Building up leads to the univariate model with explanatory variable Thigh
Example - bodyfat
Breaking down a model
Shows problems: starting with all variables yields determination coefficient = 0.80. But: none of the betas is significantly different from 0!
Breaking down based on highest p-value first takes out Thigh (!) (det.coef = 0.78)
We leave the remaining variables in: both their coefficients are now significantly different from 0, and taking one of them out lowers the det.coef to 0.71 or 0.02
Breaking down leads to the bivariate model with explanatory variables Triceps and Midarm
Example - bodyfat
Building up and breaking down lead to different models. Which is the final model of our choice?
Breaking down leads to a model with one more variable that has only a slightly larger det.coef than the model obtained with the building-up procedure
So the smaller, univariate model with only Thigh as explanatory variable for response variable Body fat seems best
Estimates of its coefficients: -23.63 (intercept), 0.86 (Thigh); estimate of its error variance: 6.30
Thigh explains 77% of the variation in Body fat
3) Assessment of model quality
Is the linear regression model adequate for these data sets?
But: these data sets yield the same global quantities if a (simple) linear regression model is fitted
3) Assessment of model quality - diagnostics
Until now: model, incl. assumptions, assumed correct
Now: assessment of model quality, incl. appropriateness of assumptions
Globally: with global quantities like $R^2$ and tests; not sufficient
Diagnostics: investigation with quantities that have a different value for each observation point (= $(x_{i1}, \dots, x_{ip})$ combined with $Y_i$)
First: make suitable plots and investigate deviating points further
3) Assessment of model quality - plots
Types of plots:
i) Scatter plot of Y against each explanatory variable
   Gives overall picture + deviating values
ii) Added variable plot: scatter plot of residuals from regression of Y on all $X_j$ except $X_k$ against residuals from regression of $X_k$ on all other $X_j$
   Gives picture of relation between Y and $X_k$ corrected for other $X_j$ + deviating values (cf. partial correlation coefficient)
iii) Plots based on residuals
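An added variable plot can be produced with base R as sketched below. The data are simulated and the variable names are assumptions. The slope of the line through the two residual vectors equals the coefficient of the added variable in the full model, which is why this plot shows the relation "corrected for" the other variables.

```r
set.seed(10)
n <- 40
x1 <- rnorm(n); x2 <- 0.5 * x1 + rnorm(n)
y <- 1 + x1 + 0.7 * x2 + rnorm(n)

# added variable plot for x2, given x1
res_y  <- residuals(lm(y ~ x1))    # y corrected for the other variable
res_x2 <- residuals(lm(x2 ~ x1))   # x2 corrected for the other variable
av_fit <- lm(res_y ~ res_x2)
plot(res_x2, res_y,
     xlab = "residuals of x2 on x1", ylab = "residuals of y on x1",
     main = "Added variable plot for x2")
abline(av_fit)                     # least-squares line through the corrected points

slope_av <- coef(av_fit)[2]
slope_full <- coef(lm(y ~ x1 + x2))["x2"]
print(c(slope_av = unname(slope_av), slope_full = unname(slope_full)))
```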
3) Assessment of model quality - plots
iii) Plots based on residuals
- Scatter plot of residuals against each explanatory variable
  If pattern: linear model perhaps not correct
  Curvature: include higher-order term of the variable
  Systematic spread: linear model not correct or non-equal variance
- Scatter plot of residuals against a new explanatory variable
  If linear relationship: include this variable
- Scatter plot of residuals against predicted responses
  If spread increases/decreases: non-equal variance
- Normal QQ-plot of residuals
  Checks assumption of normality of the measurement errors
Plus: all these plots show deviating individual values
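The residual plots listed above can be drawn with base R graphics. The sketch below uses simulated data for a simple linear regression (the data and the plot titles are assumptions for illustration) and places three of the plots side by side.

```r
set.seed(11)
n <- 50
x <- runif(n, 40, 60)
y <- -20 + 0.8 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)
res <- residuals(fit)
pred <- fitted(fit)

op <- par(mfrow = c(1, 3))                 # three panels in one row
plot(x, res, main = "Residuals vs explanatory variable")
abline(h = 0, lty = 2)
plot(pred, res, main = "Residuals vs predicted responses")
abline(h = 0, lty = 2)
qqnorm(res); qqline(res)                   # normal QQ-plot of the residuals
par(op)                                    # restore previous graphics settings
```

For a model with an intercept the residuals sum to zero and are uncorrelated with each explanatory variable in the model, so a straight-line trend cannot appear in the first two panels; what to look for is curvature, changing spread and isolated points.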
Example - bodyfat
Model of choice: Bodyfat = -23.63 + 0.86 Thigh + measurement error
Some diagnostic checks for this model:
- scatter plot of pairs (above) showed no outliers
- scatter plot of residuals against explanatory variable (below, left)
- scatter plot of residuals against predicted responses (below, middle)
- normality check with normal QQ-plot of residuals (below, right)
None shows a particular pattern or outliers; QQ-plot OK
Conclusion: we stay with this model
3) Assessment of model quality – further diagnostics
Next week: further investigation of
- deviating observation points, with numerical measures and tests
- explanatory variables that are themselves linearly related