Regression and Models with Multiple Factors Ch. 17, 18


Page 1:

Regression and Models with Multiple Factors

Ch. 17, 18

Page 2:

Scatter Plot

[Scatter plot of Mass against Snout-Vent Length]

Page 3:

Linear Regression

[Scatter plot of Mass against Snout-Vent Length with the fitted regression line]

Page 4:

Least-squares

• The method of least squares chooses the line that minimizes the squared deviations along the Y-axis

• Y = bX + a

• The slope of the line is the average change in variable Y per unit change in variable X

• The slope is usually of greater interest than the intercept

• The slope is the covariance between X and Y divided by the variance in X
– A positive slope implies a positive covariance and a negative slope implies a negative covariance (see the sketch below)
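
A minimal R check of that formula, assuming the males newt data frame that is loaded on a later slide:

b <- cov(males$svl, males$mass) / var(males$svl)  # slope = cov(X, Y) / var(X)
a <- mean(males$mass) - b * mean(males$svl)       # the least-squares line passes through the means
c(intercept = a, slope = b)                       # agrees with coef(lm(mass ~ svl, data=males))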

Page 5:

Linear Regression Uses

• Depict the relationship between two variables in an eye-catching fashion

• Test the null hypothesis of no association between two variables
– The test is whether or not the slope is zero

• Predict the average value of variable Y for a group of individuals with a given value of variable X
– Note that variation around the line can make it very difficult to predict a value for a given individual with much confidence

Page 6:

What are Residuals?

Page 7:

What are Residuals?

Chimps have larger testes than predicted for their body weight; gorillas have smaller testes than predicted for their body weight. Chimps therefore have a positive residual, and gorillas a negative residual.

Page 8:

What are Residuals?

• In general, the residual is the individual’s departure from the value predicted by the model

• In this case the model is simple – the linear regression – but residuals also exist for more complex models

• For a model that fits better, the residuals will be smaller on average

• Residuals can be of interest in their own right, because they represent values that have been corrected for relationships that might be obscuring a pattern (e.g., the body weight-testes mass relationship)
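
In R, residuals come straight from the fitted model object; a minimal sketch using the reg_newt regression fitted on a later slide:

res <- residuals(reg_newt)  # one departure-from-prediction per individual
head(res)                   # positive = above the line, negative = below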

Page 9:

Strong Inference for Observational Studies

• Noticing a pattern in the data and reporting it represents a post hoc analysis
– This is not hypothesis testing
– The results, while potentially important, must be interpreted cautiously

• What can be done?
– Based on a post-hoc observational study, construct a new hypothesis for a novel group or system that has not yet been studied
– For example, given the primate data, a reasonable prediction is that residual testes mass in deer will be associated with mating system
– Collect a new observational data set from the new group to test the hypothesis

Page 10:

Phylogenetic Pseudoreplication

http://www.cell.com/trends/ecology-evolution/fulltext/S0169-5347(10)00146-1

Page 11:

Performing Linear Regression in R

• Run the regression:

> reg_newt <- lm(mass ~ svl, data=males)
> summary(reg_newt)
> confint(reg_newt)  # get the CI for the slope

• Add the line to a plot:

> plot(mass ~ svl, data=males)
> abline(reg_newt)

See http://whitlockschluter.zoology.ubc.ca/r-code/rcode17 for code to draw the line's 95% CI on the graph.

Page 12:

Interpreting the Output

Call:
lm(formula = mass ~ svl, data = males)

Residuals:
    Min      1Q  Median      3Q     Max
-5.5081 -1.7670 -0.5128  1.6322  6.4942

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.48430    7.43453  -4.235 0.000108 ***
svl           0.68766    0.09806   7.013 8.71e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.514 on 46 degrees of freedom
Multiple R-squared:  0.5167, Adjusted R-squared:  0.5062
F-statistic: 49.18 on 1 and 46 DF,  p-value: 8.712e-09

The svl row holds the slope (0.68766) and the p-value for the slope (8.71e-09).
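
To pull those two numbers out of the fitted model programmatically rather than reading them off the printout, a one-line sketch:

coef(summary(reg_newt))["svl", c("Estimate", "Pr(>|t|)")]  # slope and its p-value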

Page 13:

Least Squares Regression with 95% CI

[Regression of Mass on Snout-Vent Length with dashed 95% confidence bands around the line]

Page 14:

Code for That

males <- read.csv("MaleNewtSampleData.csv")
plot(males$mass ~ males$svl, pch=19,
     xlab=list("Snout-Vent Length", cex=1.5),
     ylab=list("Mass", cex=1.5))
box(lwd=3)
axis(1, lwd=3)
axis(2, lwd=3)
reg_newt <- lm(mass ~ svl, data=males)
abline(reg_newt, lwd=3, col="maroon")
summary(reg_newt)

# Add confidence limits for the regression line
xpt <- seq(min(males$svl), max(males$svl), length.out=100)
ypt <- data.frame(predict(reg_newt, newdata=data.frame(svl=xpt),
                          interval="confidence"))
lines(ypt$lwr ~ xpt, lwd=2, lty=2)
lines(ypt$upr ~ xpt, lwd=2, lty=2)

Page 15:

Assumptions of Linear Regression

• The true relationship must be linear

• At each value of X, the distribution of Y is normal (i.e., the residuals are normal)

• The variance in Y is independent of the value of X

• Note that there are no assumptions about the distribution of X
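
Base R's built-in diagnostics give a quick visual check of these assumptions; a minimal sketch using the model fitted on the earlier slides:

par(mfrow = c(2, 2))  # 2 x 2 grid for the four diagnostic panels
plot(reg_newt)        # residuals vs. fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))  # restore the default single-panel layout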

Page 16:

Common Problems

• Outliers
– Regression is extremely sensitive to outliers
– The line will be drawn to outliers, especially along the x-axis
– Consider performing the regression with and without outliers

• Non-linearity
– Best way to notice is by visually inspecting the plot and the line fit
– Try a transformation to get linearity [often a log transformation; see the sketch after this list]

• Non-normality of residuals
– Can be detected from a residual plot
– Possibly solved with a transformation

• Unequal variance
– Usually visible from a scatterplot or from a residual plot
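
A sketch of the transformation fix flagged under non-linearity; choosing a log-log refit here is an illustrative assumption, not something prescribed by the newt data:

reg_log <- lm(log(mass) ~ log(svl), data=males)       # refit on a log-log scale
plot(residuals(reg_log) ~ fitted(reg_log), pch=19)    # recheck linearity and spread
abline(h = 0, lty = 2)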

Page 17:

Residual Plot

[Plot of residuals(reg_newt) against svl]

plot(residuals(reg_newt) ~ svl, data=males, pch=19)
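
A horizontal reference line at zero makes departures easier to judge; a small optional addition:

abline(h = 0, lty = 2)  # residuals should scatter evenly around this line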

Page 18:

Examples of Problems

Page 19:

Outlier Example

Page 20:

Multiple Explanatory Variables

• The reason ANOVA is so widely used is that it provides a framework to simultaneously test the effects of multiple factors

• ANOVA also makes it possible to detect interactions among the factors

• ANOVA is a special case of a general linear model

Page 21:

General Linear Models

• GLMs handle categorical factors and continuous factors in a single modeling framework

• ANOVA is a special case with just categorical explanatory variables

• Linear regression is a special case with just continuous explanatory variables
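
In R, all three cases run through the same lm() call. A schematic sketch; the data frame and variable names below are invented purely to make it runnable:

set.seed(1)  # synthetic illustration only; the values carry no meaning
dat <- data.frame(y = rnorm(30), x = rnorm(30),
                  group = factor(rep(c("A", "B", "C"), each = 10)))
lm(y ~ group, data = dat)      # one-factor ANOVA: categorical predictor only
lm(y ~ x, data = dat)          # simple regression: continuous predictor only
lm(y ~ group + x, data = dat)  # general linear model: both kinds at once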

Page 22:

Two-factor ANOVA

Testing:

• The mean of “serum starved” versus “normal culture”
• The mean of “wild-type” versus “WTC911”
• The interaction (i.e., wild-type responds differently than WTC911 to the change in culturing conditions [serum starved versus normal culture])
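
A hedged sketch of fitting that design in R; the data frame and column names are assumptions, and the response values are random placeholders rather than the real measurements:

cells <- expand.grid(culture = c("normal", "starved"),
                     genotype = c("wild-type", "WTC911"),
                     rep = 1:5)
cells$response <- rnorm(nrow(cells))  # placeholder response values
fit2 <- lm(response ~ culture * genotype, data = cells)  # '*' = main effects + interaction
anova(fit2)  # rows for culture, genotype, and culture:genotype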

Page 23:

ANOVA with Blocking

• Analysis is similar to two-factor ANOVA (at least it was in 1995)

• Except:
– Typically do not test for an interaction term

– Don’t care if the block is significant or not

– Remember: you don’t care about the block; you simply want to reduce the variance introduced by it
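
A sketch of the blocked analysis with assumed names and placeholder data; the block enters as an additive term and no interaction is fitted:

blocks <- expand.grid(block = factor(1:6), treatment = c("control", "treated"))
blocks$y <- rnorm(nrow(blocks))  # placeholder response
fit_blk <- lm(y ~ treatment + block, data = blocks)  # one observation per block-treatment cell
anova(fit_blk)  # the block row absorbs variance; its p-value is not of interest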

Page 24:

Analysis of Covariance

• Used to test for a difference in means, while correcting for a variable that is correlated with the response variable
– The slopes must not differ in the two groups
– In other words, the mean comparison is only valid if the interaction term is not significant

• Also used to compare the slopes of two regression lines
– If the interaction term is significant, then the conclusion is that the slopes are different
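
That two-step logic, sketched in R by reusing the synthetic dat from the general-linear-model sketch above (x plays the covariate, group the factor):

fit_full <- lm(y ~ group * x, data = dat)  # the group:x term tests equality of slopes
fit_add  <- lm(y ~ group + x, data = dat)  # adjusted comparison of group means
anova(fit_full)  # if group:x is not significant, interpret fit_add instead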

Page 25:

ANCOVA Example

Page 26:

ANCOVA Example

Page 27:

Summary

• Statistical models can be quite complex, with potentially many factors and interaction terms

• The model is specified by something that looks like an equation: Y = μ + A + B + A*B

• General linear models allow you to combine categorical and continuous factors into a single model

• Your sample size will limit the complexity of the model

• For complex models, you will need to think about procedures to choose the best model (i.e., the model that best explains your observations)
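
One widely used procedure, offered as a suggestion rather than something these slides prescribe, is to compare candidate models directly; reusing the two ANCOVA fits sketched above:

AIC(fit_add, fit_full)    # lower AIC favors the better fit-complexity trade-off
anova(fit_add, fit_full)  # equivalent nested-model F-test of the interaction term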