Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Page 1: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Statistical Data Analysis

2010/2011

M. de Gunst

Lecture 9

Page 2: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Statistical Data Analysis: Introduction

Topics:
- Summarizing data
- Investigating distributions
- Bootstrap
- Robust methods
- Nonparametric tests
- Analysis of categorical data
- Multiple linear regression

Page 3: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Multiple linear regression (Reader: Chapter 8)

Relationship between one response variable and one or more explanatory variables

Statistical model: multiple linear regression model

Parameter estimation

Selection of explanatory variables:
- numerical measures: determination coefficient, partial correlation coefficient
- tests: F-tests, t-test

Model quality:
- global methods/diagnostics
- several plots

Page 4: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Example

Body fat: difficult and expensive to obtain. Can it be predicted by one or more other, more easily measurable variables?

Possible explanatory variables:
- triceps skin-fold thickness
- thigh circumference
- mid-arm circumference

What kind of relationship? Try simplest: linear

First make plot(s) of available data. Which one(s)?

[Figure: pairwise scatter plots; data of body fat, triceps skin-fold thickness, thigh circumference and mid-arm circumference for twenty healthy females aged 20 to 34]

Looks promising!

Page 5: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Statistical model

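Multiple linear regression model:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + e_i, \qquad i = 1, \dots, n$$

- $Y_i$: i-th response
- $x_{ij}$: value of j-th explanatory variable corresponding to i-th response
- $e_i$: stochastic "measurement error" for i-th response
- $\beta_0, \beta_1, \dots, \beta_p$: unknown constants ($\beta_0$: intercept)

or in matrix notation:

$$Y = X\beta + e$$

with $X$ the $n \times (p+1)$ design matrix, whose first column consists of ones (for the intercept)

Assumption: $e_1, \dots, e_n$ independent and normally distributed with mean 0 and equal variance $\sigma^2$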

Page 6: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Statistical model

Note: response and explanatory variables continuous

Other type of models?

Page 7: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Statistical model

Issues:
1) estimate the unknown parameters $\beta$ and $\sigma^2$ (error variance)
2) select explanatory variables
3) assess model quality

Page 8: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

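1) Parameter estimation - $\beta$

Estimate $\beta$ with least squares: minimize

$$S(\beta) = \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 = \| Y - X\beta \|^2$$

w.r.t. $\beta$, where $Y$ is the vector of responses and $X$ the design matrix

Solution:

$$\hat\beta = (X^T X)^{-1} X^T Y$$

$E\hat\beta = \beta$ → unbiased estimator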
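A minimal R sketch of the least-squares solution, on synthetic data (not the body fat data; the variable names and true coefficients below are made up for illustration). It solves the normal equations directly and compares the result with the built-in `lm` and `lsfit`:

```r
# Synthetic data (NOT the body fat data): n = 20 observations, p = 2 variables.
set.seed(1)
n  <- 20
x1 <- runif(n, 40, 60)
x2 <- runif(n, 20, 35)
y  <- 5 + 0.8 * x1 - 0.3 * x2 + rnorm(n, sd = 2)

# Design matrix: a column of ones for the intercept, then the explanatory variables.
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Least-squares solution of the normal equations (X'X) beta = X'y.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# The same estimate from R's built-in fitting functions.
beta_lm    <- coef(lm(y ~ x1 + x2))
beta_lsfit <- lsfit(cbind(x1, x2), y)$coefficients
```

Solving the normal equations is shown only because it mirrors the formula; `lm` and `lsfit` use a QR decomposition, which is numerically more stable.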

Page 9: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

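1) Parameter estimation - $\sigma^2$

i-th residual: $R_i = Y_i - \hat Y_i$, with $\hat Y = X\hat\beta$ the fitted (predicted) responses

Residual sum of squares: $RSS = \sum_{i=1}^{n} R_i^2 = \| Y - X\hat\beta \|^2$

Estimate $\sigma^2$ by

$$\hat\sigma^2 = \frac{RSS}{n-p-1}$$

Under normality of the $e_i$: $RSS/\sigma^2$ has a chi-square distribution with df $n-p-1$, so $E\hat\sigma^2 = \sigma^2$ → unbiased estimator

What do residuals tell us? If large, model "not so good"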
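The residual-based variance estimate can be sketched in R as follows. The data are synthetic (not the body fat data) and the names are illustrative; the divisor is n - p - 1, the degrees of freedom given above:

```r
# Synthetic data (NOT the body fat data): simple regression, so p = 1.
set.seed(2)
n <- 20
p <- 1
x <- runif(n, 40, 60)
y <- -20 + 0.9 * x + rnorm(n, sd = 2.5)

fit <- lm(y ~ x)

# Residuals: observed minus fitted responses.
res <- y - fitted(fit)

# Residual sum of squares and the estimate of the error variance.
RSS        <- sum(res^2)
sigma2_hat <- RSS / (n - p - 1)

# summary(fit)$sigma is the square root of the same estimate.
sigma_lm <- summary(fit)$sigma
```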

Page 10: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

2) Selection of explanatory variables

Do more variables explain variability in responses better?
Do we want a large model?

Want: smallest possible model that explains variability in responses as much as possible → contradictory requirements

Need: selection criterion/measure for how much variability is explained

Page 11: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

2) Selection of variables – determination coefficient (1)

Sum of squares for Y: $SS_Y = \sum_{i=1}^{n} (Y_i - \bar Y)^2$ (what is this?)

Sum of squares for regression: $SS_{reg} = SS_Y - RSS$, with $RSS$ the residual sum of squares (what is this?)

Determination coefficient:

$$R^2 = \frac{SS_{reg}}{SS_Y} = 1 - \frac{RSS}{SS_Y}$$

fraction of variability in Y explained by design matrix X

When is $R^2$ larger, with more or with fewer variables in model?
What is better, large or small $R^2$? What is large?

Page 12: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

2) Selection of variables – determination coefficient (2)

Determination coefficient $R^2$: fraction of variability in Y explained by design matrix X

For simple linear regression: $R^2 = \mathrm{cor}(Y, X_1)^2$

For multiple regression: $R^2 = \mathrm{cor}(Y, \hat Y)^2$, with $\hat Y$ the fitted responses

$R$ is the multiple correlation coefficient = largest correlation between Y and any linear combination of the $X_j$
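A short R sketch of the determination coefficient, again on synthetic data with made-up names. It computes R^2 from the sums of squares, checks that it equals the squared correlation between responses and fitted values, and compares it with a smaller model:

```r
# Synthetic data (NOT the body fat data).
set.seed(3)
n  <- 20
x1 <- runif(n, 40, 60)
x2 <- runif(n, 20, 35)
y  <- 5 + 0.8 * x1 - 0.3 * x2 + rnorm(n, sd = 2)

fit <- lm(y ~ x1 + x2)

# Sums of squares.
SSY   <- sum((y - mean(y))^2)   # total variability in Y
RSS   <- sum(resid(fit)^2)      # variability left after the regression
SSreg <- SSY - RSS              # variability explained by the regression

# Determination coefficient, two equivalent ways.
R2     <- SSreg / SSY
R2_cor <- cor(y, fitted(fit))^2

# R^2 of the smaller model with x1 only: never larger than R2.
R2_small <- summary(lm(y ~ x1))$r.squared
```

Because adding a variable can never decrease R^2, a large R^2 alone cannot be the selection criterion; this is why the tests on the following slides are needed.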

Page 13: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

2) Selection of variables – overall F-test

Another scaling of $SS_{reg}$ yields test statistic for $H_0: \beta_1 = \dots = \beta_p = 0$

Test statistic:

$$F = \frac{SS_{reg}/p}{RSS/(n-p-1)} \sim F_{p,\,n-p-1} \text{ under } H_0$$

an F-distribution with $p$ and $n-p-1$ degrees of freedom

If F large: makes sense to include all p variables that are considered

Overall F-test
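The overall F-test can be sketched in R as below, assuming the standard form of the statistic, (SSreg/p) / (RSS/(n-p-1)), referred to an F-distribution with p and n-p-1 degrees of freedom. Data and names are synthetic and illustrative:

```r
# Synthetic data (NOT the body fat data): p = 2 explanatory variables.
set.seed(4)
n  <- 20
p  <- 2
x1 <- runif(n, 40, 60)
x2 <- runif(n, 20, 35)
y  <- 5 + 0.8 * x1 - 0.3 * x2 + rnorm(n, sd = 2)

fit   <- lm(y ~ x1 + x2)
RSS   <- sum(resid(fit)^2)
SSreg <- sum((y - mean(y))^2) - RSS

# Overall F statistic and its p-value (upper tail of the F-distribution).
F_stat <- (SSreg / p) / (RSS / (n - p - 1))
p_F    <- pf(F_stat, p, n - p - 1, lower.tail = FALSE)

# summary() reports the same statistic with its degrees of freedom.
F_lm <- summary(fit)$fstatistic
```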

Page 14: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

2) Selection of variables – partial F-test

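Next to $X_1, \dots, X_q$ also $X_{q+1}, \dots, X_p$ in model? ($H_0: \beta_{q+1} = \dots = \beta_p = 0$)

Which sums of squares give indication? Compare $RSS_q$ (model with $X_1, \dots, X_q$ only) with $RSS_p$ (model with all $p$ variables)

Test statistic:

$$F = \frac{(RSS_q - RSS_p)/(p-q)}{RSS_p/(n-p-1)} \sim F_{p-q,\,n-p-1} \text{ under } H_0$$

Partial F-test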
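A hedged R sketch of a partial F-test, comparing a smaller model with q variables to a larger one with p variables through their residual sums of squares. The symbols q, `small` and `full` are notation introduced here for illustration, and the data are synthetic:

```r
# Synthetic data (NOT the body fat data): does adding x2 and x3 next to x1 help?
set.seed(5)
n  <- 20
p  <- 3   # variables in the full model
q  <- 1   # variables in the small model
x1 <- runif(n, 40, 60)
x2 <- runif(n, 20, 35)
x3 <- runif(n, 10, 25)
y  <- 5 + 0.8 * x1 - 0.3 * x2 + rnorm(n, sd = 2)

small <- lm(y ~ x1)
full  <- lm(y ~ x1 + x2 + x3)

RSS_small <- sum(resid(small)^2)
RSS_full  <- sum(resid(full)^2)

# Partial F statistic: drop in RSS per added variable,
# scaled by the variance estimate of the full model.
F_partial <- ((RSS_small - RSS_full) / (p - q)) / (RSS_full / (n - p - 1))
p_value   <- pf(F_partial, p - q, n - p - 1, lower.tail = FALSE)

# anova() on two nested fits performs the same comparison.
tab <- anova(small, full)
```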

Page 15: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

2) Selection of variables – t-test

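For testing whether or not 1 variable $X_k$ should be included ($H_0: \beta_k = 0$)

Test statistic:

$$T_k = \frac{\hat\beta_k}{\sqrt{\widehat{\mathrm{Var}}(\hat\beta_k)}} \sim t_{n-p-1} \text{ under } H_0$$

with $\widehat{\mathrm{Var}}(\hat\beta_k)$ the diagonal element of $\hat\sigma^2 (X^T X)^{-1}$ belonging to $\beta_k$

Relationship t and partial F: $T_k^2 = F$, the partial F statistic for adding $X_k$ next to the other $p-1$ variables

Very often used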
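The t-test for a single coefficient, and its relation to the partial F-test, can be sketched in R as follows. The estimated covariance matrix is taken to be the variance estimate times the inverse of X'X (the standard result for this model); data and names are synthetic:

```r
# Synthetic data (NOT the body fat data): p = 2 explanatory variables.
set.seed(6)
n  <- 20
p  <- 2
x1 <- runif(n, 40, 60)
x2 <- runif(n, 20, 35)
y  <- 5 + 0.8 * x1 - 0.3 * x2 + rnorm(n, sd = 2)

X   <- cbind(1, x1, x2)
fit <- lm(y ~ x1 + x2)

# Estimated covariance matrix of beta-hat and the t statistics.
sigma2_hat <- sum(resid(fit)^2) / (n - p - 1)
covbeta    <- sigma2_hat * solve(crossprod(X))
t_stat     <- coef(fit) / sqrt(diag(covbeta))

# Two-sided p-values from the t-distribution with n - p - 1 df.
p_two_sided <- 2 * pt(abs(t_stat), df = n - p - 1, lower.tail = FALSE)

# Partial F statistic for adding x2 next to x1: equals the squared t statistic of x2.
F_x2 <- anova(lm(y ~ x1), fit)$F[2]
```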

Page 16: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

2) Selection of variables – partial correlation coefficient

Linear relationship of Y and $X_k$ corrected for other $p-1$ variables in model

partial correlation coefficient = cor($R_Y$, $R_{X_k}$)

- $R_Y$: vector of residuals from regression of Y on all $X_j$ except $X_k$
- $R_{X_k}$: vector of residuals from regression of $X_k$ on all other $X_j$

If large (in absolute value): indication that $X_k$ should be included next to $p-1$ other variables

Equivalent to t-test
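A small R sketch of the partial correlation coefficient as a correlation between two residual vectors, on synthetic data with illustrative names. The last lines use the identity t = r * sqrt(n-p-1) / sqrt(1 - r^2), which is one way to see the equivalence with the t-test:

```r
# Synthetic data (NOT the body fat data): x2 is related to x1.
set.seed(7)
n  <- 20
p  <- 2
x1 <- runif(n, 40, 60)
x2 <- 0.5 * x1 + rnorm(n, sd = 3)
y  <- 5 + 0.8 * x1 - 0.3 * x2 + rnorm(n, sd = 2)

# Residuals of Y and of x2, each after regression on the other variable x1.
res_y  <- resid(lm(y ~ x1))
res_x2 <- resid(lm(x2 ~ x1))

# Partial correlation of Y and x2, corrected for x1.
r_partial <- cor(res_y, res_x2)

# The t statistic of x2 in the full model is a function of r_partial.
df       <- n - p - 1
t_from_r <- r_partial * sqrt(df) / sqrt(1 - r_partial^2)
t_x2     <- summary(lm(y ~ x1 + x2))$coefficients["x2", "t value"]
```

Plotting `res_y` against `res_x2` gives the added variable plot for x2.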

Page 17: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

2) Selection of variables – practice

How to select systematically in practice?

Two ways:
- build up step by step: determination coefficient, then t-test for last step
- break down step by step: t-tests, then determination coefficient for last step
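One possible reading of the first building-up step, sketched in R on synthetic data: pick the variable with the largest univariate determination coefficient, then look at the t-test for each candidate second variable. The stopping rule and all names are assumptions for illustration, not a prescribed procedure:

```r
# Synthetic data (NOT the body fat data); x1 and x2 are strongly related.
set.seed(8)
n  <- 20
x1 <- rnorm(n, 50, 5)
x2 <- x1 + rnorm(n, sd = 1)
x3 <- rnorm(n, 28, 3)
y  <- -20 + 0.9 * x1 + rnorm(n, sd = 2.5)
dat <- data.frame(y, x1, x2, x3)

# Step 1: determination coefficient of each univariate regression.
candidates <- c("x1", "x2", "x3")
r2 <- sapply(candidates, function(v) {
  summary(lm(reformulate(v, response = "y"), data = dat))$r.squared
})
first <- names(which.max(r2))

# Step 2: add each remaining variable in turn and record the two-sided
# p-value of the t-test for its coefficient.
rest <- setdiff(candidates, first)
p_added <- sapply(rest, function(v) {
  fit <- lm(reformulate(c(first, v), response = "y"), data = dat)
  summary(fit)$coefficients[v, "Pr(>|t|)"]
})
```

If none of the values in `p_added` is small, building up stops with the univariate model.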

Page 18: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Example - bodyfat

Build up a model:

Determination coefficients univariate regression:

          Triceps   Thigh   Midarm
Fat on    0.71      0.77    0.02

First: regression of Fat on Thigh

Page 19: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Example - bodyfat

R: (data in matrix bf)

> zglob3 = globalregression(bf[,3],bf[,1])
> zglob3
$RSS
[1] 113.4237

$detcoef
[1] 0.7710414

$beta   # estimate
  Intercept           X
-23.6344891   0.8565466

$covbeta   # estimate of cov matrix of beta-hat
                       x
   32.0063293 -0.61933288
x  -0.6193329  0.01210344

$sigmakw   # estimate of error variance
[1] 6.301316

...

$t   # values of t-statistics
Intercept         X
-4.177614  7.785681

$pt   # onesided p-values
[1] 5.656662e-04 3.599996e-07
# Thigh significant at 0.05 (two-sided test)

$F
[1] 60.61684

$pF
[1] 3.599996e-07
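As a consistency check, the printed values can be reproduced from one another with n = 20 and p = 1. The formulas are the standard ones for this model; that `globalregression` computes them in exactly this way is an assumption:

```latex
\begin{align*}
\hat\sigma^2 &= \frac{RSS}{n-p-1} = \frac{113.4237}{18} \approx 6.3013 \\[4pt]
t_{\text{Thigh}} &= \frac{\hat\beta_1}{\sqrt{\widehat{\mathrm{Var}}(\hat\beta_1)}}
  = \frac{0.8565466}{\sqrt{0.01210344}} \approx 7.7857 \\[4pt]
F &= \frac{R^2/p}{(1-R^2)/(n-p-1)}
  = \frac{0.7710414}{0.2289586/18} \approx 60.617 \approx t_{\text{Thigh}}^2
\end{align*}
```

The last line illustrates that, with a single explanatory variable, the overall F statistic is the square of the t statistic.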

Page 20: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Example - bodyfat

R: (data in matrix bf)

> zfit3 = lsfit(bf[,3],bf[,1])
> zfit3
$coefficients
  Intercept           X
-23.6344891   0.8565466

$residuals
 [1] -1.3826690  3.7784688 -2.1202790 -2.7759908  0.3882229 -0.8333722
 [7]  0.6265135  4.4084117  2.1928142 -2.8907536  0.5539520  2.2682973
… and some more

Page 21: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Example - bodyfat

Adding one of the other variables:

> zglob32 = globalregression(bf[,c(3,2)],bf[,1])
> zglob34 = globalregression(bf[,c(3,4)],bf[,1])

yields almost the same value for det.coef: 0.78; moreover, coefficient of additional variable not significantly different from 0

So we stop adding variables.

Building up leads to univariate model with explanatory variable Thigh

Page 22: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Example - bodyfat

Breaking down a model

Shows problems: starting with all variables yields determination coefficient = 0.80. But: none of the betas significantly different from 0!

Breaking down based on highest p-value first takes out Thigh (!) (det.coef=0.78)

We leave remaining variables in, both their coefficients now significantly different from 0, and taking them out lowers the det.coef to 0.71 or 0.02

Breaking down leads to bivariate model with explanatory variables Triceps and Midarm

Page 23: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Example - bodyfat

Building up and breaking down lead to different models. Which is final model of our choice?

Breaking down leads to model with one variable more that has only slightly larger det.coef than model obtained with building-up procedure

So, smaller, univariate model with only Thigh as explanatory variable for response variable Body fat seems best.

Estimates of its coefficients are: -23.63 (intercept), 0.86 (Thigh)
Estimate of its error variance is 6.30
Thigh explains 77% of variation in Body fat

Page 24: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

3) Assessment of model quality

Is linear regression model adequate for these data sets?

But: these data sets have the same $\hat\beta$, $\hat\sigma^2$ and $R^2$ if a (simple) linear regression model is fitted

Page 25: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

3) Assessment of model quality - diagnostics

Until now: model, incl. assumptions, correct
Now: assessment of model quality, incl. appropriateness of assumptions

Globally: with global quantities like $R^2$ and tests: not sufficient

Diagnostics: investigation with quantities that have a different value for each observation point (= $Y_i$ combined with $x_{i1}, \dots, x_{ip}$)

First: make suitable plots and investigate deviating points further

Page 26: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

3) Assessment of model quality - plots

Types of plots:

i) Scatter plot of Y against each explanatory variable
Gives overall picture + deviating values

ii) Added variable plot: scatter plot of residuals from regression of Y on all Xj except Xk against residuals from regression of Xk on all other Xj
Gives picture of relation between Y and Xk corrected for other Xj + deviating values (cf. partial correlation coefficient)

iii) Plots based on residuals

Page 27: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

3) Assessment of model quality - plots

iii) Plots based on residuals

Scatter plot of residuals against each explanatory variable
- If pattern: linear model perhaps not correct
- Curvature: include higher order of variable
- Systematic spread: linear model not correct or non-equal variance

Scatter plot of residuals against new explanatory variable
- If linear relationship: include this variable

Scatter plot of residuals against predicted responses
- If spread increases/decreases: non-equal variance

Normal QQ-plot of residuals
- Checks assumption of normality of measurement errors

Plus: all these plots show deviating individual values
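The residual plots listed above can be produced in R along these lines; a minimal sketch on synthetic data (not the body fat data), with illustrative names and axis labels:

```r
# Synthetic data (NOT the body fat data): simple linear regression.
set.seed(9)
n <- 20
x <- runif(n, 40, 60)
y <- -20 + 0.9 * x + rnorm(n, sd = 2.5)

fit <- lm(y ~ x)
res <- resid(fit)

# Three diagnostic plots side by side.
op <- par(mfrow = c(1, 3))

# Residuals against the explanatory variable: look for curvature or patterns.
plot(x, res, xlab = "explanatory variable", ylab = "residuals")
abline(h = 0, lty = 2)

# Residuals against predicted responses: look for changing spread.
plot(fitted(fit), res, xlab = "predicted responses", ylab = "residuals")
abline(h = 0, lty = 2)

# Normal QQ-plot of the residuals: points close to the line support normality.
qqnorm(res)
qqline(res)

par(op)
```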

Page 28: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

Example - bodyfat

Model of choice: Bodyfat = -23.63 + 0.86 Thigh + measurement error

Some diagnostic checks for this model:
- scatter plot of pairs (above) showed no outliers
- scatter plot of residuals against explanatory variable (below, left)
- scatter plot of residuals against predicted responses (below, middle)
- normality check with normal QQ-plot of residuals (below, right)

None shows particular pattern or outliers; QQ-plot OK

Conclusion: we stay with this model

Page 29: Statistical Data Analysis 2010/2011 M. de Gunst Lecture 9

3) Assessment of model quality – further diagnostics

Next week: further investigation
- deviating observation points with numerical measures and tests
- explanatory variables that are themselves linearly related