TRANSCRIPT
Multiple regression models
Experimental design and data analysis for biologists (Quinn & Keough, 2002)
Environmental sampling and analysis
Multiple regression models
• One response (dependent) variable: Y
• More than one predictor (independent) variable: X1, X2, X3, …, Xp
  – number of predictors = p (j = 1 to p)
• Number of observations = n (i = 1 to n)
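This setup (one response, p predictors, n observations) can be sketched as an ordinary least-squares fit. A minimal example with hypothetical data (n = 6, p = 2; the response is built to lie exactly on a plane so the recovered coefficients are easy to check):

```python
import numpy as np

# Hypothetical data: n = 6 observations, p = 2 predictors.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
# Response constructed as y = 1 + 2*X1 + 0.5*X2 (no noise, for illustration).
y = np.array([4.0, 5.5, 9.0, 10.5, 14.0, 15.5])

# Design matrix: a column of 1s for the intercept plus one column per predictor.
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares estimates: b[0] = intercept, b[1] and b[2] = partial slopes.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)
```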
Forest fragmentation
• 56 forest patches in SE Victoria (Loyn 1987)
• Response variable:
  – bird abundance
• Predictor variables:
  – patch area (ha)
  – years isolated (years)
  – distance to nearest patch (km)
  – distance to nearest larger patch (km)
  – stock grazing intensity (1 to 5 scale)
  – altitude (m)
Biomonitoring with Vallisneria
• Indicators of sublethal effects of organochlorine contamination
• Response variable:
  – leaf-to-shoot surface area ratio of Vallisneria americana
• Predictors:
  – sediment contamination, plant density, PAR, rivermile, water depth
• 225 sites in Great Lakes
• Potter & Lovett-Doust (2001)
Regression models
• Linear model:
  yi = β0 + β1xi1 + β2xi2 + … + βpxip + εi
• Sample equation:
  ŷi = b0 + b1xi1 + b2xi2 + … + bpxip
Example
• Regression model:
(bird abundance)i = β0 + β1(patch area)i + β2(years isolated)i + β3(nearest patch distance)i + β4(nearest large patch distance)i + β5(stock grazing)i + β6(altitude)i + εi
Multiple regression plane
[Figure: fitted plane of bird abundance against altitude and log10 area]
Partial regression coefficients
• H0: β1 = 0
• Partial population regression coefficient (slope) for Y on X1, holding all other Xs constant, equals zero
• Example:
  – slope of regression of bird abundance against patch area, holding years isolated, distance to nearest patch, distance to nearest larger patch, stock grazing intensity and altitude constant, equals 0
Testing H0: βi = 0
• Use partial t-tests:
• t = bi / SE(bi)
• Compare with t-distribution with n − (p + 1) df (e.g. 56 − 7 = 49 for the forest fragmentation example)
• Separate t-test for each partial regression coefficient in model
• Usual logic of t-tests:– reject H0 if P < 0.05
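The partial t statistics can be sketched from the least-squares machinery directly. A minimal example with simulated data (X2 is built with no real effect; SE(bi) comes from the residual mean square and the diagonal of (X'X)⁻¹):

```python
import numpy as np

# Simulated data for illustration: X2 has no real effect on y.
rng = np.random.default_rng(1)
n, p = 30, 2
X1, X2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 + 1.5 * X1 + 0.0 * X2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), X1, X2])
b = np.linalg.lstsq(X, y, rcond=None)[0]

resid = y - X @ b
ms_resid = resid @ resid / (n - p - 1)            # residual MS, df = n - (p + 1)
se = np.sqrt(np.diag(ms_resid * np.linalg.inv(X.T @ X)))
t = b / se                                        # one t statistic per coefficient
print(t)  # compare each with the t-distribution, df = n - (p + 1)
```

P values would come from the t-distribution (e.g. scipy.stats.t), omitted here to keep the sketch dependency-free.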
Model comparison
• Test H0: β1 = 0
• Fit full model:
  – y = β0 + β1x1 + β2x2 + β3x3 + …
• Fit reduced model:
  – y = β0 + β2x2 + β3x3 + …
• Calculate SSExtra:
  – SSRegression(full) − SSRegression(reduced)
• F = MSExtra / MSResidual(full)
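The comparison above can be sketched as follows (simulated data; since the two models share the same SSTotal, SSRegression(full) − SSRegression(reduced) equals SSResidual(reduced) − SSResidual(full), which is what the code computes):

```python
import numpy as np

# Simulated data: x1 has a real effect, so dropping it should inflate F.
rng = np.random.default_rng(7)
n = 40
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 2 * x1 + 0.5 * x2 + rng.normal(size=n)

def ss_residual(X, y):
    """Residual SS from a least-squares fit of y on the columns of X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r

full = np.column_stack([np.ones(n), x1, x2, x3])
reduced = np.column_stack([np.ones(n), x2, x3])   # drop x1 to test H0: beta1 = 0

ss_full, ss_red = ss_residual(full, y), ss_residual(reduced, y)
# SSExtra = SSResidual(reduced) - SSResidual(full); 1 df for the one dropped term.
f = (ss_red - ss_full) / 1 / (ss_full / (n - 4))  # full model has 4 parameters
print(f)  # compare with F distribution, df = 1, n - 4
```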
Overall regression model
• H0: β1 = β2 = … = βp = 0 (all population slopes equal zero)
• Test of whether overall regression equation is significant
• Use ANOVA F-test:
  – variation explained by regression
  – unexplained (residual) variation
Explained variance
• r2 = SSRegression / SSTotal
• proportion of variation in Y explained by linear relationship with X1, X2, etc.
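A short sketch of r2 with simulated data (computed as 1 − SSResidual/SSTotal, which equals SSRegression/SSTotal for a model with an intercept):

```python
import numpy as np

# Simulated data for illustration.
rng = np.random.default_rng(3)
n = 50
x1, x2 = rng.normal(size=(2, n))
y = 1 + x1 + x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b

ss_total = np.sum((y - y.mean()) ** 2)
ss_resid = resid @ resid
r2 = 1 - ss_resid / ss_total        # = SSRegression / SSTotal
print(r2)
```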
Forest fragmentation
Parameter         Coefficient  SE     Stand coeff  P
Intercept         20.789       8.285  0            0.015
Log10 area        7.470        1.465  0.565        <0.001
Log10 distance    -0.907       2.676  -0.035       0.736
Log10 ldistance   -0.648       2.123  -0.035       0.761
Grazing           -1.668       0.930  -0.229       0.079
Altitude          0.020        0.024  0.079        0.419
Years             -0.074       0.045  -0.176       0.109

r2 = 0.685, F6,49 = 17.754, P < 0.001
Biomonitoring with Vallisneria

Parameter               Coefficient  SE           P
Intercept               1.054        0.565        0.063
Sediment contamination  1.352        0.482        0.006
Plant density           0.028        0.007        <0.001
PAR                     -0.087       0.017        <0.001
Rivermile               1.00 × 10-4  9.17 × 10-5  0.277
Water depth             0.246        0.486        0.613
Assumptions
• Normality and homogeneity of variance for response variable
• Independence of observations
• Linearity
• No collinearity
Scatterplots
• Scatterplot matrix (SPLOM)
  – pairwise plots for all variables
• Partial regression (added variable) plots
  – relationship between Y and Xj, holding other Xs constant
  – residuals from Y against all Xs except Xj vs residuals from Xj against all other Xs
  – graphs partial regression slope for Xj
Partial regression plot (log10 area)
[Figure: residual bird abundance vs residual log10 area, holding other predictors constant]
Regression diagnostics
• Residual:
  – observed yi − predicted yi
• Residual plots:
  – residual against predicted yi
  – residual against each X
• Influence:
  – Cook's D statistic
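These diagnostics can be sketched directly from the fitted model. The example below simulates data, plants one outlier, and computes residuals, leverages (diagonal of the hat matrix), and Cook's D, which combines residual size and leverage:

```python
import numpy as np

# Simulated data with one planted influential outlier (observation 0).
rng = np.random.default_rng(5)
n = 25
x1, x2 = rng.normal(size=(2, n))
y = 1 + x1 - x2 + rng.normal(scale=0.5, size=n)
y[0] += 5.0                                   # plant the outlier

X = np.column_stack([np.ones(n), x1, x2])
k = X.shape[1]                                # number of parameters (incl. intercept)
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix; leverages on the diagonal
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b                                 # residuals: observed - predicted
mse = e @ e / (n - k)
h = np.diag(H)
cooks_d = (e**2 / (k * mse)) * (h / (1 - h) ** 2)
print(cooks_d.argmax())  # the planted outlier should rank highest
```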
Collinearity
• Collinearity:
  – predictors correlated
• Assumption of no collinearity:
  – predictor variables uncorrelated with (i.e. independent of) each other
• Effect of collinearity:
  – estimates of βjs and significance tests unreliable
Response (Y) and 2 predictors (X1 and X2)

1. X1 and X2 uncorrelated (r = -0.24)

Parameter  coeff  se    tol   t      P
intercept  -0.17  1.03        -0.16  0.873
X1         1.13   0.14  0.95  7.86   <0.001
X2         0.12   0.14  0.95  0.86   0.404

r2 = 0.787, F = 31.38, P < 0.001

2. Rearrange X2 so X1 and X2 highly correlated (r = 0.99)

Parameter  coeff  se    tol   t      P
intercept  0.49   0.72        0.69   0.503
X1         1.55   1.21  0.01  1.28   0.219
X2         -0.45  1.21  0.01  -0.37  0.714

r2 = 0.780, F = 30.05, P < 0.001
Checks for collinearity
• Correlation matrix and/or SPLOM between predictors
• Tolerance for each predictor:
  – 1 − r2 for regression of that predictor on all others
  – if tolerance is low (near 0.1) then collinearity is a problem
  – VIF (variance inflation factor) = 1 / tolerance
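Tolerance can be computed by regressing each predictor on the others, as described above. A sketch with simulated data in which x3 is nearly a copy of x1:

```python
import numpy as np

# Simulated predictors: x3 is nearly a copy of x1, so both should have
# tolerances near zero (collinearity problem); x2 is independent.
rng = np.random.default_rng(11)
n = 60
x1, x2 = rng.normal(size=(2, n))
x3 = x1 + rng.normal(scale=0.05, size=n)

def tolerance(j, cols):
    """1 - r2 of predictor j regressed on the remaining predictors."""
    y = cols[j]
    others = [c for i, c in enumerate(cols) if i != j]
    X = np.column_stack([np.ones(n)] + others)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return (resid @ resid) / np.sum((y - y.mean()) ** 2)

cols = [x1, x2, x3]
tols = [tolerance(j, cols) for j in range(3)]
vifs = [1 / t for t in tols]
print(tols)
```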
Forest fragmentation
[Figure: SPLOM of predictors L10DIST, L10LDIST, L10AREA, GRAZE, ALT, YRS]
Tolerances: 0.396 – 0.681
Solutions to collinearity
• Drop redundant (correlated) predictors
• Principal components regression
  – potentially useful
  – replace predictors by independent components from PCA on predictor variables
• Ridge regression
  – controversial and complex
Predictor importance
• Tests on partial regression slopes
• Standardised partial regression slopes:

  b*j = bj × (sXj / sY)

  where sXj and sY are the sample standard deviations of Xj and Y
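The standardisation can be sketched with simulated data; b*j from the formula equals the slope obtained by regressing z-scored Y on z-scored Xj:

```python
import numpy as np

# Simulated data: predictor on a deliberately large scale.
rng = np.random.default_rng(9)
n = 200
x1 = rng.normal(scale=4.0, size=n)
y = 3 + 0.5 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1])
b1 = np.linalg.lstsq(X, y, rcond=None)[0][1]
b1_star = b1 * x1.std(ddof=1) / y.std(ddof=1)   # b* = b * (sX / sY)

# Equivalent route: slope from regressing z-scored y on z-scored x1.
zx = (x1 - x1.mean()) / x1.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
b_z = np.linalg.lstsq(np.column_stack([np.ones(n), zx]), zy, rcond=None)[0][1]
print(b1_star, b_z)
```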
Predictor importance
• Change in explained variation
  – compare fit of full model to reduced model omitting Xj
• Hierarchical partitioning
  – splits total r2 for each predictor into:
    • independent contribution of each predictor
    • joint contribution of each predictor with other predictors
r2Xj = SSExtra / SSResidual(reduced)
Forest fragmentation
Predictor        Independent r2  Joint r2  Total r2  Stand coeff
Log10 area       0.315           0.232     0.548     0.565
Log10 distance   0.007           0.009     0.016     -0.035
Log10 ldistance  0.014           <0.001    0.014     -0.035
Altitude         0.057           0.092     0.149     0.079
Grazing          0.190           0.275     0.466     -0.229
Years            0.101           0.152     0.253     -0.176
Interactions
• Interactive effect of X1 and X2 on Y
• Dependence of partial regression slope of Y against X1 on the value of X2
• Dependence of partial regression slope of Y against X2 on the value of X1
• yi = β0 + β1xi1 + β2xi2 + β3xi1xi2 + εi
Forest fragmentation
• Does effect of grazing on bird abundance depend on area?
  – log10 area × grazing interaction
• Does effect of grazing depend on years since isolation?
  – grazing × years interaction
• Etc.
Interpreting interactions
• Interactions highly correlated with individual predictors:
  – collinearity problem
  – centring variables (subtracting mean) removes collinearity
• Simple regression slopes:
  – slope of Y on X1 for different values of X2
  – slope of Y on X2 for different values of X1
  – use if interaction is significant
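The centring point can be sketched numerically: with predictors whose means are far from zero, the raw interaction term X1X2 is strongly correlated with X1 itself, and centring removes most of that correlation (simulated data):

```python
import numpy as np

# Simulated predictors with means far from zero, for illustration.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(loc=10, size=n)
x2 = rng.normal(loc=10, size=n)

# Raw interaction term is highly correlated with x1 itself.
raw_r = np.corrcoef(x1, x1 * x2)[0, 1]

# Centring (subtracting the mean) before forming the product removes it.
x1c, x2c = x1 - x1.mean(), x2 - x2.mean()
centred_r = np.corrcoef(x1c, x1c * x2c)[0, 1]
print(raw_r, centred_r)
```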
Polynomial regression
• Modelling some curvilinear relationships
• Include quadratic (X²) or cubic (X³) etc. terms
• Quadratic model:
  yi = β0 + β1xi1 + β2xi1² + εi
• Compare fit with:
  yi = β0 + β1xi1 + εi
• Does quadratic fit better than linear?
Local and regional species richness
• Relationship between local and regional species richness in North America– Caley & Schluter (1997)
• Two models compared:
local spp = β0 + β1(regional spp) + β2(regional spp)² + ε
local spp = β0 + β1(regional spp) + ε
[Figure: local vs regional species richness in North America, with linear and quadratic fits]
Model comparison
Full model: SSResidual = 376.620, df = 5
Reduced model: SSResidual = 1299.257, df = 6
Difference due to (regional spp)²:
SSExtra = 922.637, df = 1, MSExtra = 922.637
F = 12.249, P = 0.018
See Quinn & Keough Box 6.6
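The F ratio in this comparison can be checked directly from the residual sums of squares given for the two models:

```python
# Check of the worked comparison, using the residual SS values given above.
ss_resid_full, df_full = 376.620, 5
ss_resid_reduced = 1299.257

ss_extra = ss_resid_reduced - ss_resid_full       # extra SS for the quadratic term
f = (ss_extra / 1) / (ss_resid_full / df_full)    # MSExtra / MSResidual(full)
print(round(f, 3))
```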
Categorical predictors
• Convert categorical predictors into multiple continuous predictors– dummy (indicator) variables
• Each dummy variable coded as 0 or 1
• Usually no. of dummy variables = no. groups minus 1
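The coding scheme is easy to sketch in plain Python for the 5-level grazing factor used below, with level 1 (zero grazing) as the reference category:

```python
# Dummy (indicator) coding for a 5-level grazing factor; level 1 (zero
# grazing) is the reference, so there are 5 - 1 = 4 dummy variables.
levels = [1, 2, 3, 4, 5]                 # zero, low, medium, high, intense

def dummies(level, reference=1):
    """Return the groups-minus-1 dummy variables for one observation."""
    return [1 if level == g else 0 for g in levels if g != reference]

print(dummies(1))  # reference category: all zeros
print(dummies(5))  # intense grazing: last dummy = 1
```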
Forest fragmentation
Grazing intensity  Grazing1  Grazing2  Grazing3  Grazing4
Zero (1)           0         0         0         0
Low (2)            1         0         0         0
Medium (3)         0         1         0         0
High (4)           0         0         1         0
Intense (5)        0         0         0         1

Each dummy variable measures the effect of one category (low to intense) compared with the "reference" category, zero grazing
Forest fragmentation
Grazing as a single continuous (1–5) predictor:

Coefficient  Est      SE     t       P
Intercept    21.603   3.092  6.987   <0.001
Grazing      -2.854   0.713  -4.005  <0.001
Log10 area   6.890    1.290  5.341   <0.001

Grazing as dummy variables:

Coefficient  Est      SE     t       P
Intercept    15.716   2.767  5.679   <0.001
Grazing1     0.383    2.912  0.131   0.896
Grazing2     -0.189   2.549  -0.074  0.941
Grazing3     -1.592   2.976  -0.535  0.595
Grazing4     -11.894  2.931  -4.058  <0.001
Log10 area   7.247    1.255  5.774   <0.001
Categorical predictors
• All linear models fit categorical predictors using dummy variables
• ANOVA models combine dummy variables into single factor effect
  – partition SS into factor and residual
  – dummy variable effects often provided by software
• Models with both categorical (factor) and continuous (covariate) predictors
  – adjust factor effects based on covariate
  – reduce residual based on strength of relationship between Y and covariate
  – more powerful test of factor