sadc course in statistics assumptions underlying regression analysis (session 05)

20
SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

Upload: mary-schneider

Post on 28-Mar-2015

225 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

SADC Course in Statistics

Assumptions underlying

regression analysis

(Session 05)

Page 2: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

2To put your footer here go to View > Header and Footer

Learning Objectives

At the end of this session, you will be able to

• describe assumptions underlying a regression analysis

• conduct analyses that will allow a check on model assumptions

• discuss the consequences of failure of assumptions

• consider remedial action when assumption fail

Page 3: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

3To put your footer here go to View > Header and Footer

Checking assumptions

Describing the relationship carries no assumptions.

However, inferences concerning the slope of the line, e.g. by use of t-tests or F-tests, are subject to certain assumptions.

Checking assumptions is important to avoid pitfalls associated with making invalid conclusions.

Page 4: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

4To put your footer here go to View > Header and Footer

Assumptions

The simple linear regression model is:

yi = 0 + 1 xi + i

In addition to assuming a linear form for themodel, the i are assumed to be

• independent, with

• zero mean and constant variance 2,

• and be normally distributed.

Note: Model predictions, often called fitted values, are

Page 5: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

5To put your footer here go to View > Header and Footer

How to check assumptions?The usual approach is to conduct a residual analysis. Residuals are deviations of observed values from model fitted values.

Paddy data relating yield to fertiliser

Residual

Page 6: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

6To put your footer here go to View > Header and Footer

Residual PlotsPlotting residuals in various ways allowsfailure of assumptions to be detected.

e.g. To check the normality assumption,

• plot a histogram of the residuals (provided there are enough observations);

• or do a normal probability plot of residuals. A straight line plot indicates that the normality assumption is reasonable.

Page 7: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

7To put your footer here go to View > Header and Footer

Residual Plots - continuedMost useful is a plot of residuals against fitted values( ) .

It helps to detect failure of the variance homogeneity assumption. Also helps to identify potential outliers.

e.g. If standardised residuals are used, i.e. residuals/standard error,then 95% of observations would be expected to lie between –2 and +2.

A random scatter with no obvious pattern is good! Some examples follow….

ˆiy

Page 8: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

8To put your footer here go to View > Header and Footer

Some Residual Plots

xx

x

x

xx

xx

xx

x

x

x

xx

xx

x

x

x xx

x

x x

A random scatter as above is good. It shows no obvious departures of the variance homogeneity assumption.

0

Page 9: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

9To put your footer here go to View > Header and Footer

Some Residual Plotsx

x

xx

x

x

x

xxx

xxx

x

xx

x

xx

xx

x

x

x

x

Variance increases with increasing x.

Could try a loge(y) transformation.

x

0

Page 10: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

10To put your footer here go to View > Header and Footer

0

xx

xx

x

x

x

x x

x

xx

x

x xx

x

xx

x

x

x

x

x

x

Indication that the response is a binomial proportion. Use a logistic regression model.

Some Residual Plots

x

Page 11: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

11To put your footer here go to View > Header and Footer

x

xx

x

x

x

xx

xx

xx

xx

x

xxx

x

x

x

x

x

x

x

Some Residual Plots

Lack of linearity. Pattern indicates an incorrect model - probably due to a missing squared term.

0

Page 12: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

12To put your footer here go to View > Header and Footer

Some Residual Plots

xx

x

x

xx

xx

xx

x

x

xx

x

xx

x

x

x

x

x x

x x

Presence of an outlier. Investigate if there is a reason for this odd-point.

0

Page 13: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

13To put your footer here go to View > Header and Footer

Consequences of assumption failureStudies on consequences of assumptionfailure have demonstrated that:

• tests and confidence intervals for means are relatively robust to small departures from non-normality;

• the effects of non-homogeneous variance can be large, but not so serious if sample sizes in different sub-groupings are equal;

• dependence of observations can badly affect F-tests.

Page 14: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

14To put your footer here go to View > Header and Footer

Dealing with assumption failureOne approach is to find a transformationthat will stabilize the variance.

Some typical transformations are:

• taking logs (useful when there is skewness);

• square root transformation;

• reciprocal transformation.

Sometimes theoretical grounds will determinethe transformation to use, e.g. when data are Poisson or Binomial. However, in such cases, exact methods of analysis will be preferable.

Page 15: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

15To put your footer here go to View > Header and Footer

Dealing with non-independenceThe assumption of independence is quite critical. Some attention to this is needed at the data collection stage.

If observations are collected in time or space, plotting residuals in time (or space) order may reveal that subsequent observations are correlated.

Techniques similar to those used in time series analysis or analysis of repeated measurements data may be more appropriate.

Page 16: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

16To put your footer here go to View > Header and Footer

An illustration – using paddy data

Histogram of standardised residuals after fitting a linear regression of yield on fertiliser. This is a check on the normality assumption.

Page 17: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

17To put your footer here go to View > Header and Footer

A normal probability plot…

Another check on the normality assumption

Do you think the points follow a straight line?

Page 18: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

18To put your footer here go to View > Header and Footer

Std. residuals versus fitted values

Checking assumption of variance homogeneity, and identification of outliers:

Do you judge this to be a random scatter?

Are there any outliers?

Page 19: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

19To put your footer here go to View > Header and Footer

Conclusion:

The residual plots showed no evidence of departures from the model assumptions.

We may conclude that fertiliser does contribute significantly to explaining the variability in paddy yields.

Note: Always conduct a residual analysis after fitting a regression model. The same concepts carry over to more complex models that may be fitted.

Page 20: SADC Course in Statistics Assumptions underlying regression analysis (Session 05)

20To put your footer here go to View > Header and Footer

Practical work follows to ensure learning objectives are

achieved…