
Multiple regression models

Experimental design and data analysis for biologists (Quinn & Keough, 2002)

Environmental sampling and analysis

Multiple regression models

• One response (dependent) variable: Y

• More than one predictor (independent) variable: X1, X2, X3, …, Xp

– number of predictors = p (j = 1 to p)

• Number of observations = n (i = 1 to n)

Forest fragmentation

• 56 forest patches in SE Victoria (Loyn 1987)

• Response variable:
– bird abundance

• Predictor variables:
– patch area (ha)
– years isolated (years)
– distance to nearest patch (km)
– distance to nearest larger patch (km)
– stock grazing intensity (1 to 5 scale)
– altitude (m)

Biomonitoring with Vallisneria

• Indicator of sublethal effects of organochlorine contamination:
– leaf-to-shoot surface area ratio of Vallisneria americana (response variable)

• Predictors:
– sediment contamination, plant density, PAR, rivermile, water depth

• 225 sites in Great Lakes

• Potter & Lovett-Doust (2001)

Regression models

Linear model:

yi = β0 + β1xi1 + β2xi2 + … + εi

Sample equation:

ŷi = b0 + b1xi1 + b2xi2 + …
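To make the sample equation concrete, here is a minimal Python sketch (not from Quinn & Keough: data are simulated with numpy and the model is fitted with statsmodels; all names and coefficient values are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data only: n = 50 observations, p = 2 predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))                       # columns are x1 and x2
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=50)

# Fit yi = b0 + b1*xi1 + b2*xi2 by ordinary least squares
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)   # b0, b1, b2: sample estimates of beta0, beta1, beta2
```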

Example

• Regression model:

(bird abundance)i = β0 + β1(patch area)i + β2(years isolated)i + β3(nearest patch distance)i + β4(nearest large patch distance)i + β5(stock grazing)i + β6(altitude)i + εi

Multiple regression plane

[3D plot: the fitted plane of bird abundance over altitude and log10 area]

Partial regression coefficients

• H0: β1 = 0

• Partial population regression coefficient (slope) for Y on X1, holding all other X’s constant, equals zero

• Example:
– slope of regression of bird abundance against patch area, holding years isolated, distance to nearest patch, distance to nearest larger patch, stock grazing intensity and altitude constant, equals 0.

Testing H0: βi = 0

• Use partial t-tests:

• t = bi / SE(bi)

• Compare with t-distribution with n − p − 1 df (residual df; e.g. 56 − 6 − 1 = 49 for the forest fragmentation study)

• Separate t-test for each partial regression coefficient in model

• Usual logic of t-tests:– reject H0 if P < 0.05
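A short sketch of the same test done by hand (simulated data, invented values; statsmodels reports identical tvalues and pvalues, which the last comment notes):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=50)
fit = sm.OLS(y, sm.add_constant(X)).fit()

# Partial t statistic for each coefficient: t = b_j / SE(b_j)
t = fit.params / fit.bse
# Two-tailed P value on n - p - 1 residual df
p = 2 * stats.t.sf(np.abs(t), fit.df_resid)
print(t, p)   # agrees with fit.tvalues and fit.pvalues
```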

Model comparison

• Test H0: β1 = 0

• Fit full model:
– y = β0 + β1x1 + β2x2 + β3x3 + …

• Fit reduced model:
– y = β0 + β2x2 + β3x3 + …

• Calculate SSextra:

– SSRegression(full) - SSRegression(reduced)

• F = MSextra / MSResidual(full)
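The same comparison in code (a sketch on simulated data; every number is invented). Because SSTotal is the same for both models, the extra regression SS equals the drop in residual SS:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=60)

full = sm.OLS(y, sm.add_constant(X)).fit()            # x1, x2, x3
reduced = sm.OLS(y, sm.add_constant(X[:, 1:])).fit()  # drop x1 to test H0: beta1 = 0

ss_extra = reduced.ssr - full.ssr   # = SSRegression(full) - SSRegression(reduced)
ms_extra = ss_extra / 1             # one df for the single dropped predictor
F = ms_extra / full.mse_resid       # F = MSExtra / MSResidual(full)
print(F, stats.f.sf(F, 1, full.df_resid))
```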

Overall regression model

• H0: β1 = β2 = … = 0 (all population slopes equal zero)

• Test of whether overall regression equation is significant

• Use ANOVA F-test:
– variation explained by regression
– unexplained (residual) variation

Explained variance

• r² = SSRegression / SSTotal

• proportion of variation in Y explained by linear relationship with X1, X2 etc.

Forest fragmentation

Parameter        Coefficient   SE      Stand coeff   P
Intercept        20.789        8.285   0             0.015
Log10 area       7.470         1.465   0.565         <0.001
Log10 distance   -0.907        2.676   -0.035        0.736
Log10 ldistance  -0.648        2.123   -0.035        0.761
Grazing          -1.668        0.930   -0.229        0.079
Altitude         0.020         0.024   0.079         0.419
Years            -0.074        0.045   -0.176        0.109

r² = 0.685, F6,49 = 17.754, P < 0.001

Biomonitoring with Vallisneria

Parameter                Coefficient   SE            P
Intercept                1.054         0.565         0.063
Sediment contamination   1.352         0.482         0.006
Plant density            0.028         0.007         <0.001
PAR                      -0.087        0.017         <0.001
Rivermile                1.00 × 10⁻⁴   9.17 × 10⁻⁵   0.277
Water depth              0.246         0.486         0.613

Assumptions

• Normality and homogeneity of variance for response variable

• Independence of observations

• Linearity

• No collinearity

Scatterplots

• Scatterplot matrix (SPLOM):
– pairwise plots for all variables

• Partial regression (added variable) plots:
– relationship between Y and Xj, holding other Xs constant
– residuals from Y against all Xs except Xj vs residuals from Xj against all other Xs
– graphs partial regression slope for Xj (see the sketch below)
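A sketch of that residual-on-residual construction (simulated data, invented values; the final comment states the standard result that the slope of the added-variable plot equals the partial slope from the full model):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(56, 3))   # treat column 0 as x_j; the rest are the other Xs
y = 3.0 + 1.2 * X[:, 0] + 0.7 * X[:, 1] + rng.normal(size=56)

others = sm.add_constant(X[:, 1:])
# Residuals of Y regressed on all predictors except x_j
e_y = sm.OLS(y, others).fit().resid
# Residuals of x_j regressed on the other predictors
e_x = sm.OLS(X[:, 0], others).fit().resid

# The slope of e_y on e_x equals the partial regression slope b_j
print(sm.OLS(e_y, sm.add_constant(e_x)).fit().params[1])
```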

Partial regression plot (log10 area)

[Added-variable plot: residuals of bird abundance (−20 to 20) against residuals of log10 area (−2 to 2)]

Regression diagnostics

• Residual:
– observed yi − predicted ŷi

• Residual plots:
– residuals against predicted ŷi
– residuals against each X

• Influence:
– Cook’s D statistic

Collinearity

• Collinearity:
– predictors correlated

• Assumption of no collinearity:
– predictor variables uncorrelated with (i.e. independent of) each other

• Effect of collinearity:
– estimates of βjs and significance tests unreliable

Response (Y) and 2 predictors (X1 and X2)

1. X1 and X2 weakly correlated (r = -0.24)

            coeff   se     tol    t      P
intercept   -0.17   1.03          -0.16  0.873
X1          1.13    0.14   0.95   7.86   <0.001
X2          0.12    0.14   0.95   0.86   0.404

r² = 0.787, F = 31.38, P < 0.001

Collinearity

2. Rearrange X2 so X1 and X2 highly correlated (r = 0.99)

            coeff   se     tol    t      P
intercept   0.49    0.72          0.69   0.503
X1          1.55    1.21   0.01   1.28   0.219
X2          -0.45   1.21   0.01   -0.37  0.714

r² = 0.780, F = 30.05, P < 0.001
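The same pattern as these two tables (overall fit barely changes, individual standard errors balloon) is easy to reproduce by simulation; a sketch with invented numbers:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 30
x1 = rng.normal(size=n)
y = 1.0 + 1.0 * x1 + rng.normal(size=n)

for label, x2 in [("weakly correlated", rng.normal(size=n)),
                  ("nearly collinear", x1 + rng.normal(scale=0.05, size=n))]:
    fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    # r2 stays about the same, but SE(b1) inflates badly under collinearity
    print(label, round(fit.bse[1], 3), round(fit.rsquared, 3))
```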

Checks for collinearity

• Correlation matrix and/or SPLOM between predictors

• Tolerance for each predictor:
– 1 − r² for regression of that predictor on all others
– if tolerance is low (near 0.1), collinearity is a problem

• VIF (variance inflation factor):
– the reciprocal of tolerance; a VIF above about 10 flags the same problem
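The tolerance computation in code, following the definition above (simulated predictors; the second column is deliberately constructed to be nearly collinear with the first):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(56, 4))
X[:, 1] = X[:, 0] + rng.normal(scale=0.2, size=56)   # build in collinearity

for j in range(X.shape[1]):
    others = sm.add_constant(np.delete(X, j, axis=1))
    r2 = sm.OLS(X[:, j], others).fit().rsquared   # xj regressed on all other Xs
    tol = 1 - r2                                  # tolerance; VIF is its reciprocal
    print(f"x{j + 1}: tolerance = {tol:.3f}, VIF = {1 / tol:.2f}")
```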

Forest fragmentation

[SPLOM of the predictors: L10DIST, L10LDIST, L10AREA, GRAZE, ALT, YRS]

Tolerances: 0.396 – 0.681

Solutions to collinearity

• Drop redundant (correlated) predictors

• Principal components regression:
– potentially useful
– replace predictors by independent components from a PCA on the predictor variables

• Ridge regression:
– controversial and complex

Predictor importance

• Tests on partial regression slopes

• Standardised partial regression slopes:

b*j = bj × sXj / sY
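A sketch of that rescaling (simulated data; every value invented). The two predictors are given very different spreads, so the raw slopes differ tenfold while the standardised slopes are comparable:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(56, 2)) * np.array([1.0, 10.0])   # x2 has 10x the spread of x1
y = 4.0 + 2.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=56)

fit = sm.OLS(y, sm.add_constant(X)).fit()
# b*_j = b_j * s_Xj / s_Y, i.e. slope in standard-deviation units
b_std = fit.params[1:] * X.std(axis=0, ddof=1) / y.std(ddof=1)
print(fit.params[1:])   # raw slopes differ about 10-fold
print(b_std)            # standardised slopes are roughly equal
```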

Predictor importance

• Change in explained variation:
– compare fit of full model to reduced model omitting Xj

• Hierarchical partitioning:
– splits total r² for each predictor into:
• independent contribution of each predictor
• joint contribution of each predictor with other predictors

r²Xj = SSExtra / SSResidual(Reduced)

Forest fragmentation

Predictor        Independent r²   Joint r²   Total r²   Stand coeff
Log10 area       0.315            0.232      0.548      0.565
Log10 distance   0.007            0.009      0.016      -0.035
Log10 ldistance  0.014            <0.001     0.014      -0.035
Altitude         0.057            0.092      0.149      0.079
Grazing          0.190            0.275      0.466      -0.229
Years            0.101            0.152      0.253      -0.176

Interactions

• Interactive effect of X1 and X2 on Y

• Dependence of partial regression slope of Y against X1 on the value of X2

• Dependence of partial regression slope of Y against X2 on the value of X1

• yi = β0 + β1xi1 + β2xi2 + β3xi1xi2 + εi

Forest fragmentation

• Does effect of grazing on bird abundance depend on area?
– log10 area × grazing interaction

• Does effect of grazing depend on years since isolation?
– grazing × years interaction

• Etc.

Interpreting interactions

• Interactions highly correlated with individual predictors:
– collinearity problem
– centring variables (subtracting the mean) removes the collinearity (see the sketch below)

• Simple regression slopes:
– slope of Y on X1 for different values of X2
– slope of Y on X2 for different values of X1
– use if interaction is significant
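A sketch of why centring helps (simulated data, invented values). The raw product x1·x2 is strongly correlated with x1; after centring, that correlation nearly vanishes, while the interaction coefficient itself is unchanged:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x1 = rng.uniform(5, 15, size=80)
x2 = rng.uniform(5, 15, size=80)
y = 1 + 0.5 * x1 + 0.5 * x2 + 0.2 * x1 * x2 + rng.normal(size=80)

print(np.corrcoef(x1, x1 * x2)[0, 1])   # raw product: strongly correlated with x1
c1, c2 = x1 - x1.mean(), x2 - x2.mean()
print(np.corrcoef(c1, c1 * c2)[0, 1])   # centred product: correlation near 0

fit = sm.OLS(y, sm.add_constant(np.column_stack([c1, c2, c1 * c2]))).fit()
print(fit.params[3])                    # b3 still estimates the interaction beta3
```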

Polynomial regression

• Modeling some curvilinear relationships

• Include quadratic (X²) or cubic (X³) etc. terms

• Quadratic model:

yi = β0 + β1xi1 + β2xi1² + εi

• Compare fit with:

yi = β0 + β1xi1 + εi

• Does quadratic fit better than linear?
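That comparison is just the extra-SS F test from earlier, applied to the x² term; a sketch on simulated curvilinear data (all numbers invented):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=40)
y = 2 + 1.5 * x - 0.1 * x**2 + rng.normal(size=40)

linear = sm.OLS(y, sm.add_constant(x)).fit()
quad = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()

# Extra-SS F test: does adding x^2 improve the fit over the linear model?
F = (linear.ssr - quad.ssr) / quad.mse_resid
print(F, stats.f.sf(F, 1, quad.df_resid))   # small P: quadratic fits better
```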

Local and regional species richness

• Relationship between local and regional species richness in North America
– Caley & Schluter (1997)

• Two models compared:

local spp = β0 + β1(regional spp) + β2(regional spp)² + ε

local spp = β0 + β1(regional spp) + ε

[Plot: local species richness (0–200) against regional species richness (0–250), with the linear and quadratic fits overlaid]

Model comparison

Full model: SSResidual = 376.620, df = 5

Reduced model: SSResidual = 1299.257, df = 6

Difference due to (regional spp)²:
SSExtra = 922.637, df = 1, MSExtra = 922.637
F = 12.249, P = 0.018

See Quinn & Keough Box 6.6

Categorical predictors

• Convert categorical predictors into multiple 0/1 predictors:
– dummy (indicator) variables

• Each dummy variable coded as 0 or 1

• Usually no. of dummy variables = no. of groups minus 1 (see the sketch below)
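A sketch of that coding in pandas (the category labels follow the grazing example; the function call is standard, the data are just the five labels):

```python
import pandas as pd

# Five grazing categories become groups - 1 = 4 dummies; "Zero" is the
# reference level and is dropped
grazing = pd.Series(pd.Categorical(
    ["Zero", "Low", "Medium", "High", "Intense"],
    categories=["Zero", "Low", "Medium", "High", "Intense"]))
print(pd.get_dummies(grazing, prefix="Grazing", drop_first=True))
```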

Forest fragmentation

Grazing intensity   Grazing1   Grazing2   Grazing3   Grazing4
Zero (1)            0          0          0          0
Low (2)             1          0          0          0
Medium (3)          0          1          0          0
High (4)            0          0          1          0
Intense (5)         0          0          0          1

Each dummy variable measures the effect of one category (low to intense) compared with the “reference” category, zero grazing

Forest fragmentation

Grazing as a single 1–5 predictor:

Coefficient   Est       SE      t       P
Intercept     21.603    3.092   6.987   <0.001
Grazing       -2.854    0.713   -4.005  <0.001
Log10 area    6.890     1.290   5.341   <0.001

Grazing as four dummy variables:

Coefficient   Est       SE      t       P
Intercept     15.716    2.767   5.679   <0.001
Grazing1      0.383     2.912   0.131   0.896
Grazing2      -0.189    2.549   -0.074  0.941
Grazing3      -1.592    2.976   -0.535  0.595
Grazing4      -11.894   2.931   -4.058  <0.001
Log10 area    7.247     1.255   5.774   <0.001

Categorical predictors

• All linear models fit categorical predictors using dummy variables

• ANOVA models combine dummy variables into a single factor effect:
– partition SS into factor and residual
– dummy variable effects often provided by software

• Models with both categorical (factor) and continuous (covariate) predictors:
– adjust factor effects based on covariate
– reduce residual based on strength of relationship between Y and covariate
– more powerful test of factor (see the sketch below)
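A sketch of such a factor-plus-covariate model using the statsmodels formula interface (the data frame is randomly generated purely to show the syntax; column names echo the forest fragmentation example but are invented here):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
df = pd.DataFrame({
    "abundance": rng.normal(20, 5, size=56),
    "log_area": rng.normal(1.5, 0.5, size=56),
    "grazing": rng.choice(["zero", "low", "high"], size=56),
})

# C(grazing) is expanded into dummy variables automatically; the covariate
# log_area absorbs residual variation, sharpening the test of the factor
fit = smf.ols("abundance ~ C(grazing) + log_area", data=df).fit()
print(fit.summary())
```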
