entering multidimensional space: multiple regression peter t. donnan professor of epidemiology and...
TRANSCRIPT
Entering Multidimensional Space: Multiple
Regression Peter T. Donnan
Professor of Epidemiology and Biostatistics
Statistics for Health Research
Objectives of session
• Recognise the need for multiple regression
• Understand methods of selecting variables
• Understand strengths and weaknesses of selection methods
• Carry out Multiple Regression in SPSS and interpret the output
Why do we need multiple regression?
Research is not as simple as the effect of one variable on one outcome,
especially with observational data
Need to assess many factors simultaneously; more realistic models
Consider fitted line of $y = a + b_1 x_1 + b_2 x_2$
[Figure: axes Explanatory (x1), Explanatory (x2) and Dependent (y)]
3-dimensional scatterplot from SPSS of Min LDL in relation to baseline LDL and age
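As a rough illustration of fitting y = a + b1x1 + b2x2, here is a minimal least-squares sketch in plain Python. The data are made up (they are not the course's LDL dataset), and the helper names `solve_linear_system` and `fit_least_squares` are hypothetical; SPSS does the equivalent fit internally.

```python
def solve_linear_system(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(predictors, y):
    """Return [a, b1, b2, ...] minimising the sum of squared residuals."""
    design = [[1.0] + list(row) for row in predictors]  # leading 1 gives the intercept a
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data (not from the course dataset): x1 = baseline LDL, x2 = age.
x1 = [3.0, 4.0, 5.0, 3.5, 4.5]
x2 = [50.0, 60.0, 55.0, 70.0, 65.0]
# Outcome generated exactly from y = 1.0 + 0.25*x1 - 0.005*x2, so the fit should recover it.
y = [1.0 + 0.25 * u - 0.005 * v for u, v in zip(x1, x2)]

a, b1, b2 = fit_least_squares(list(zip(x1, x2)), y)
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Because the made-up outcome has no noise, the fitted coefficients should match the generating values; with real data they are estimates with standard errors.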
When to use multiple regression modelling (1)
Assess relationship between two variables while adjusting or allowing for another variable
Sometimes the second variable is considered a ‘nuisance’ factor
Example: Physical Activity allowing for age and medications
When to use multiple regression modelling (2)
In RCT whenever there is imbalance between arms of the trial at baseline in characteristics of subjects
e.g. survival in colorectal cancer on two different randomised therapies adjusted for age, gender, stage, and co-morbidity
When to use multiple regression modelling (2)
A special case of this is when adjusting for baseline level of the primary outcome in an RCT
Baseline level added as a factor in regression model
This will be covered in Trials part of the course
When to use multiple regression modelling (3)
With observational data in order to produce a prognostic equation for future prediction of risk of mortality
e.g. Predicting future risk of CHD used 10-year data from the Framingham cohort
When to use multiple regression modelling (4)
With observational data in order to adjust for possible confounders
e.g. survival in colorectal cancer in those with hypertension adjusted for age, gender, social deprivation and co-morbidity
Definition of Confounding
A confounder is a factor which is related to both the variable of interest (explanatory) and the outcome, but is not an intermediary in a causal pathway
Example of Confounding
Deprivation
Lung Cancer
Smoking
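A toy sketch of why adjustment matters, using invented numbers rather than real deprivation or smoking data. The outcome here depends only on the confounder, yet the crude regression on the exposure alone shows a slope; adding the confounder to the model removes it. All variable and function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(predictors, y):
    """Return [intercept, coefficient_1, ...] from ordinary least squares."""
    design = [[1.0] + list(row) for row in predictors]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Invented data: the confounder is related to both exposure and outcome.
confounder = [0.0, 0.0, 1.0, 1.0]
exposure = [0.0, 1.0, 1.0, 2.0]             # tends to be higher when confounder = 1
outcome = [3.0 * c for c in confounder]     # depends on the confounder only

# Crude model: outcome on exposure alone.
_, crude_slope = fit_least_squares([[e] for e in exposure], outcome)

# Adjusted model: outcome on exposure and confounder together.
_, adjusted_exposure, adjusted_confounder = fit_least_squares(
    list(zip(exposure, confounder)), outcome
)

print(f"crude slope for exposure:    {crude_slope:.2f}")
print(f"adjusted slope for exposure: {adjusted_exposure:.2f}")
print(f"slope for confounder:        {adjusted_confounder:.2f}")
```

In this constructed case the crude slope is 1.5 while the adjusted exposure coefficient is essentially zero; real data will rarely separate this cleanly.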
But, also worth adjusting for factors only related to outcome
Deprivation
Lung Cancer
Exercise
Not worth adjusting for intermediate factor in a causal pathway
Exercise
Stroke
Blood viscosity
In a causal pathway each factor is merely a marker of the other factors, i.e. correlated: collinearity
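Collinearity between two predictors can be quantified with the tolerance and variance inflation factor (VIF) that SPSS reports. For the two-predictor case, tolerance is 1 - r² and VIF is its reciprocal, where r is the correlation between the predictors. The sketch below uses invented values and hypothetical function names.

```python
import math


def pearson_r(u, v):
    """Pearson correlation between two equal-length lists."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def tolerance_and_vif(u, v):
    """Tolerance (1 - r^2) and VIF (1 / tolerance) for a two-predictor model."""
    r = pearson_r(u, v)
    tolerance = 1.0 - r ** 2
    return tolerance, 1.0 / tolerance


# Invented predictor values for illustration only.
predictor_a = [1.0, 2.0, 3.0, 4.0, 5.0]
predictor_b = [2.0, 1.0, 4.0, 3.0, 5.0]

tolerance, vif = tolerance_and_vif(predictor_a, predictor_b)
print(f"tolerance = {tolerance:.3f}, VIF = {vif:.3f}")
```

A tolerance near 1 (VIF near 1) suggests little collinearity; as the predictors become more strongly correlated, tolerance falls towards 0 and VIF grows.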
SPSS: Add both baseline LDL and age in the independent box in linear regression
Output from SPSS linear regression on Age at baseline
Coefficients (a)
| Model 1 | B | Std. Error | Beta | t | Sig. | 95% CI for B: Lower Bound | 95% CI for B: Upper Bound | Tolerance | VIF |
|---|---|---|---|---|---|---|---|---|---|
| (Constant) | 2.024 | .105 | | 19.340 | .000 | 1.819 | 2.229 | | |
| Age at baseline | -.008 | .002 | -.121 | -4.546 | .000 | -.011 | -.004 | 1.000 | 1.000 |
B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient; Tolerance and VIF are the collinearity statistics.
a. Dependent Variable: Min LDL achieved
Output from SPSS linear regression on Baseline LDL
Coefficients (a)
| Model 1 | B | Std. Error | Beta | t | Sig. | 95% CI for B: Lower Bound | 95% CI for B: Upper Bound |
|---|---|---|---|---|---|---|---|
| (Constant) | .668 | .066 | | 10.091 | .000 | .538 | .798 |
| Baseline LDL | .257 | .018 | .351 | 13.950 | .000 | .221 | .293 |
B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: Min LDL achieved
Output: Multiple regression
Model Summary
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .360 (a) | .130 | .129 | .6753538 |
a. Predictors: (Constant), Age at baseline, Baseline LDL
Coefficients (a)
| Model 1 | B | Std. Error | Beta | t | Sig. | 95% CI for B: Lower Bound | 95% CI for B: Upper Bound |
|---|---|---|---|---|---|---|---|
| (Constant) | 1.003 | .124 | | 8.086 | .000 | .760 | 1.246 |
| Baseline LDL | .250 | .019 | .342 | 13.516 | .000 | .214 | .286 |
| Age at baseline | -.005 | .002 | -.081 | -3.187 | .001 | -.008 | -.002 |
a. Dependent Variable: Min LDL achieved
R² now improved to 13%
Both variables still significant INDEPENDENTLY of each other
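The unstandardized coefficients from the multiple regression output define a prediction equation: Min LDL = 1.003 + 0.250 × baseline LDL - 0.005 × age. A small sketch of using it follows; the patient values are hypothetical and the function name is illustrative.

```python
# Coefficients (B column) taken from the multiple regression output in the slides.
INTERCEPT = 1.003
B_BASELINE_LDL = 0.250
B_AGE = -0.005


def predict_min_ldl(baseline_ldl, age):
    """Predicted minimum LDL achieved, from the fitted equation."""
    return INTERCEPT + B_BASELINE_LDL * baseline_ldl + B_AGE * age


# Hypothetical patient: baseline LDL of 4.0 and age 60.
prediction = predict_min_ldl(4.0, 60)
print(f"Predicted Min LDL: {prediction:.3f}")

# Holding baseline LDL fixed, ten extra years of age shift the prediction by 10 * B_AGE.
age_effect = predict_min_ldl(4.0, 70) - predict_min_ldl(4.0, 60)
print(f"Change for 10 years older: {age_effect:.3f}")
```

Each coefficient is interpreted with the other variable held constant, which is what "independently of each other" means in the output above.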
How do you select which variables to enter the model?
• Usually consider: what hypotheses are you testing?
• If there is a main ‘exposure’ variable, enter it first and assess confounders one at a time
• For derivation of a CPR you want powerful predictors
• Also clinically important factors, e.g. cholesterol in CHD prediction
• Significance is important, but
• It is acceptable to have an ‘important’ variable without statistical significance
How do you decide what variables to enter in model?
Correlations? With great difficulty!
3-dimensional scatterplot from SPSS of Time from Surgery in relation to Dukes’ staging and age
Approaches to model building
1. Let Scientific or Clinical factors guide selection
2. Use automatic selection algorithms
3. A mixture of above
1) Let Science or Clinical factors guide selection
Baseline LDL cholesterol is an important factor determining LDL outcome so enter first
Next allow for age and gender
Add adherence as important?
Add BMI and smoking?
1) Let Science or Clinical factors guide selection
Results in model of:
1. Baseline LDL
2. Age and gender
3. Adherence
4. BMI and smoking
Is this a ‘good’ model?
1) Let Science or Clinical factors guide selection: Final Model
Note three variables entered but not statistically significant
1) Let Science or Clinical factors guide selection
Is this the ‘best’ model?
Should I leave out the non-significant factors (Model 2)?
| Model | Adj R² | F from ANOVA | No. of Parameters p |
|---|---|---|---|
| 1 | 0.137 | 37.48 | 7 |
| 2 | 0.134 | 72.021 | 4 |
Adj R² lower, F has increased and number of parameters is less in 2nd model. Is this better?
Kullback-Leibler Information
Kullback and Leibler (1951) quantified the meaning of ‘information’, related to Fisher’s ‘sufficient statistics’
Basically we have reality f
And a model g to approximate f
So K-L information is I(f,g)
[Figure: reality f and model g]
Kullback-Leibler Information
We want to minimise I(f,g) to obtain the best model over other models
I(f,g) is the information lost or ‘distance’ between reality and a model so need to minimise:
$$I(f,g) = \int f(x)\,\log\!\left(\frac{f(x)}{g(x)}\right)\,dx$$
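For intuition, the discrete analogue of I(f,g) replaces the integral with a sum over categories. The sketch below uses invented probability distributions and a hypothetical function name; it shows that a model closer to 'reality' loses less information.

```python
import math


def kl_information(f, g):
    """Discrete K-L information: sum of f(x) * log(f(x) / g(x)) over categories."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)


# Invented distributions over three categories.
reality = [0.2, 0.5, 0.3]           # plays the role of f
close_model = [0.25, 0.45, 0.30]    # a model g that is near f
poor_model = [0.6, 0.2, 0.2]        # a model g that is far from f

loss_close = kl_information(reality, close_model)
loss_poor = kl_information(reality, poor_model)
loss_perfect = kl_information(reality, reality)

print(f"I(f, close model) = {loss_close:.4f}")
print(f"I(f, poor model)  = {loss_poor:.4f}")
print(f"I(f, f)           = {loss_perfect:.4f}")
```

In practice reality f is unknown, which is why a criterion computable from the fitted model alone is needed.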
Akaike’s Information Criterion
It turns out that the function I(f,g) is related to a very simple measure of goodness-of-fit:
Akaike’s Information Criterion or AIC
Selection Criteria
• With a large number of factors the type 1 error is large, so likely to have a model with many variables
• Two standard criteria:
1) Akaike’s Information Criterion (AIC)
2) Schwarz’s Bayesian Information Criterion (BIC)
• Both penalise models with a large number of variables; the BIC penalty also grows with sample size
Akaike’s Information Criterion
$$\mathrm{AIC} = -2 \times \text{log likelihood} + 2 \times p$$
• Where p = number of parameters and -2*log likelihood is in the output
• Hence AIC penalises models with large number of variables
• Select model that minimises (-2LL + 2p)
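The AIC formula is simple enough to compute by hand. The sketch below implements it alongside the standard BIC definition, -2LL + p × ln(n), for comparison. The log likelihood, parameter count and sample size are hypothetical numbers chosen for illustration.

```python
import math


def aic(log_likelihood, n_parameters):
    """Akaike's Information Criterion: -2LL + 2p."""
    return -2.0 * log_likelihood + 2.0 * n_parameters


def bic(log_likelihood, n_parameters, n_observations):
    """Schwarz's Bayesian Information Criterion: -2LL + p * ln(n)."""
    return -2.0 * log_likelihood + n_parameters * math.log(n_observations)


# Hypothetical fitted model: log likelihood -100, 3 parameters, 50 observations.
example_aic = aic(-100.0, 3)
example_bic = bic(-100.0, 3, 50)

print(f"AIC = {example_aic:.1f}")
print(f"BIC = {example_bic:.1f}")
```

Since ln(n) exceeds 2 once n is 8 or more, BIC applies the heavier penalty per parameter in most practical datasets.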
Generalized linear models
• Unfortunately the standard REGRESSION in SPSS does not give these statistics
• Need to use Analyze > Generalized Linear Models…
Generalized linear models. Default is linear
• Add Min LDL achieved as dependent as in REGRESSION in SPSS
• Next go to predictors…
Generalized linear models: Predictors
• WARNING!
• Make sure you add the predictors in the correct box
• Categorical in FACTORS box
• Continuous in COVARIATES box
Generalized linear models: Model
• Add all factors and covariates in the model as main effects
Generalized Linear Models Parameter Estimates
Note identical to REGRESSION output
Generalized Linear Models Goodness-of-fit
Note output gives log likelihood (-1409.6) and AIC = 2835
(With p = 7, -2 x -1409.6 + 2 x 7 = 2833.2; the reported 2835 is consistent with SPSS also counting the estimated scale parameter, -2 x -1409.6 + 2 x 8 = 2835.2)
Footnote explains smaller AIC is ‘better’
Let Science or Clinical factors guide selection: ‘Optimal’ model
• The log likelihood is a measure of GOODNESS-OF-FIT
• Seek ‘optimal’ model that maximises the log likelihood or minimises the AIC
| Model | Log likelihood | p | AIC |
|---|---|---|---|
| 1 Full Model | -1409.6 | 7 | 2835.6 |
| 2 Non-significant variables removed | -1413.6 | 4 | 2837.2 |
Change is 1.6
1) Let Science or Clinical factors guide selection
Key points:
1. Results demonstrate a significant association with baseline LDL, Age and Adherence
2. Difficult choices with Gender, smoking and BMI
3. AIC only changes by 1.6 when removed
4. Generally changes of 4 or more in AIC are considered important
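The rule of thumb in the key points can be written as a small check. The AIC values are those reported in the slides for the full and reduced models; the function name and the default cut-off argument are illustrative.

```python
def aic_change_is_important(aic_a, aic_b, threshold=4.0):
    """Apply the rule of thumb: an AIC change of `threshold` or more is important."""
    return abs(aic_a - aic_b) >= threshold


# AIC values reported in the slides.
AIC_FULL_MODEL = 2835.6        # Model 1: full model
AIC_REDUCED_MODEL = 2837.2     # Model 2: non-significant variables removed

change = abs(AIC_FULL_MODEL - AIC_REDUCED_MODEL)
important = aic_change_is_important(AIC_FULL_MODEL, AIC_REDUCED_MODEL)

print(f"Change in AIC: {change:.1f}")
print(f"Important by the rule of thumb: {important}")
```

A change below the cut-off means the AIC alone does not separate the models, so clinical judgement has to carry the decision.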
1) Let Science or Clinical factors guide selection
Key points:
1. Conclude little to choose between models
2. AIC actually lower with larger model, and Gender and BMI are considered important factors, so keep larger model but have to justify
3. Model building manual, logical, transparent and under your control
2) Use automatic selection procedures
These are based on automatic mechanical algorithms usually related to statistical significance
Common ones are stepwise, forward or backward elimination
Can be selected in SPSS using ‘Method’ in dialogue box
2) Use automatic selection procedures (e.g. Stepwise)
Select Method = Stepwise
2) Use automatic selection procedures (e.g. Stepwise)
[SPSS output annotations: 1st step; 2nd step; Final Model]
2) Change in AIC with Stepwise selection
Note: Only available from Generalized Linear Models
| Step | Model | Log Likelihood | AIC | Change in AIC | No. of Parameters p |
|---|---|---|---|---|---|
| 1 | Baseline LDL | -1423.1 | 2852.2 | - | 2 |
| 2 | +Adherence | -1418.0 | 2844.1 | 8.1 | 3 |
| 3 | +Age | -1413.6 | 2837.2 | 6.9 | 4 |
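To make the mechanics of stepwise entry concrete, here is a minimal forward-selection sketch. It is not SPSS's procedure: SPSS stepwise uses significance-based entry and removal criteria, whereas this sketch adds, at each step, the variable that lowers AIC most and stops when no addition lowers it. The data are invented, the names are hypothetical, and the AIC counts the residual variance as one extra parameter.

```python
import math


def solve_linear_system(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def residual_sum_of_squares(columns, y):
    """Fit y on an intercept plus the given predictor columns; return the RSS."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    fitted = [sum(c * d for c, d in zip(coef, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def gaussian_aic(rss, n, n_coefficients):
    """AIC for a normal linear model; +1 counts the residual variance (an assumption)."""
    log_likelihood = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1.0)
    return -2.0 * log_likelihood + 2.0 * (n_coefficients + 1)


def forward_select(candidates, y):
    """Add the candidate that lowers AIC most; stop when none lowers it."""
    n = len(y)
    selected = []
    current_aic = gaussian_aic(residual_sum_of_squares([], y), n, 1)
    history = [("intercept only", current_aic)]
    while len(selected) < len(candidates):
        trials = []
        for name in candidates:
            if name not in selected:
                cols = [candidates[s] for s in selected + [name]]
                trial_aic = gaussian_aic(residual_sum_of_squares(cols, y), n, len(cols) + 1)
                trials.append((trial_aic, name))
        best_aic, best_name = min(trials)
        if best_aic >= current_aic:
            break  # no candidate improves the AIC
        selected.append(best_name)
        current_aic = best_aic
        history.append(("+" + best_name, current_aic))
    return selected, history


# Invented data: y depends on x1 and x2 only; x3 is unrelated to the outcome.
candidates = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "x3": [2.0, 1.0, 1.0, 0.0, 3.0, 2.0, 2.0, 1.0],
}
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
y = [2.0 * a + 3.0 * b + e for a, b, e in zip(candidates["x1"], candidates["x2"], noise)]

selected, history = forward_select(candidates, y)
for step, value in history:
    print(f"{step:15s} AIC = {value:.1f}")
```

The printed history plays the same role as the 'Change in AIC' table: each row shows the AIC after one more variable has entered.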
2) Advantages and disadvantages of stepwise
Advantages
• Simple to implement
• Gives a parsimonious model
• Selection is certainly objective
Disadvantages
• Unstable selection: stepwise considers many models that are very similar
• P-value on entry may be smaller once procedure is finished, so exaggeration of p-value
• Predictions in external dataset usually worse for stepwise procedures
2) Automatic procedures: Backward elimination
Backward starts by eliminating the least significant factor from the full model and has a few advantages over forward:
• Modeller has to consider the ‘full’ model and sees results for all factors simultaneously
• Correlated factors can remain in the model (in forward methods they may not even enter)
• Criteria for removal tend to be more lax in backward so end up with more parameters
2) Use automatic selection procedures (e.g. Backward)
Select Method = Backward
2) Backward elimination in SPSS
[SPSS output annotations: 1st step, Gender removed; 2nd step, BMI removed; Final Model]
Summary of automatic selection
• Automatic selection may not give ‘optimal’ model (may leave out important factors)
• Different methods may give different results (forward vs. backward elimination)
• Backward elimination preferred as less stringent
• Too easily fitted in SPSS!
• Model assessment still requires some thought
3) A mixture of automatic procedures and self selection
• Use automatic procedures as a guide
• Think about what factors are important
• Add ‘important’ factors
• Do not blindly follow statistical significance
• Consider AIC
Summary of Model selection
• Selection of factors for Multiple Linear regression models requires some judgement
• Automatic procedures are available but treat results with caution
• They are easily fitted in SPSS
• Check AIC or log likelihood for fit
Summary
• Multiple regression models are the most used analytical tool in quantitative research
• They are easily fitted in SPSS
• Model assessment requires some thought
• Parsimony is better: Occam’s Razor
Remember Occam’s Razor
‘Entia non sunt multiplicanda praeter necessitatem’
‘Entities must not be multiplied beyond necessity’
William of Ockham, 14th-century friar and logician, 1288-1347
Summary
After fitting any model check assumptions
• Functional form: linearity or not
• Check residuals for normality
• Check residuals for outliers
• All accomplished within SPSS
• See publications for further info
• Donnelly LA, Palmer CNA, Whitley AL, Lang C, Doney ASF, Morris AD, Donnan PT. Apolipoprotein E genotypes are associated with lipid lowering response to statin treatment in diabetes: A Go-DARTS study. Pharmacogenetics and Genomics, 2008; 18: 279-87.
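As one possible way to screen residuals for outliers outside SPSS, the sketch below computes residuals from observed and fitted values and flags any that lie more than two standard deviations from the mean residual. The values are invented, and the cut-off of 2 is an assumption for illustration, not a rule from the course; it is a crude screen, not a substitute for residual plots.

```python
import statistics

# Invented observed outcomes and model-fitted values for ten subjects.
observed = [1.6, 1.5, 1.8, 1.7, 2.0, 1.9, 2.2, 2.1, 2.4, 3.9]
fitted = [1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4]

# Residual = observed minus fitted.
residuals = [o - f for o, f in zip(observed, fitted)]

# Scale each residual by the sample standard deviation of all residuals.
mean_residual = statistics.mean(residuals)
sd_residual = statistics.stdev(residuals)
z_scores = [(r - mean_residual) / sd_residual for r in residuals]

CUTOFF = 2.0  # assumed screening threshold, chosen for illustration
flagged = [i for i, z in enumerate(z_scores) if abs(z) > CUTOFF]

print("Flagged subject indices:", flagged)
```

A flagged observation is a prompt to look at the data, not an automatic exclusion.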
Practical on Multiple Regression
Read in ‘LDL Data.sav’
1) Try fitting multiple regression model on Min LDL obtained using forward and backward elimination. Are the results the same? Add other factors than those considered in the presentation such as BMI, smoking. Remember the goal is to assess the association of APOE with LDL response.
2) Try fitting multiple regression models for Min Chol achieved. Is the model similar to that found for Min LDL?