Entering Multidimensional Space: Multiple Regression. Peter T. Donnan, Professor of Epidemiology and Biostatistics. Statistics for Health Research


Post on 30-Dec-2015


Page 1: Entering Multidimensional Space: Multiple Regression Peter T. Donnan Professor of Epidemiology and Biostatistics Statistics for Health Research

Entering Multidimensional Space: Multiple Regression

Peter T. Donnan

Professor of Epidemiology and Biostatistics

Statistics for Health Research

Page 2: Objectives of session

• Recognise the need for multiple regression

• Understand methods of selecting variables

• Understand strengths and weaknesses of selection methods

• Carry out multiple regression in SPSS and interpret the output

Page 3: Why do we need multiple regression?

Research is rarely as simple as the effect of one variable on one outcome, especially with observational data

Need to assess many factors simultaneously; more realistic models

Page 4: Consider fitted line of y = a + b₁x₁ + b₂x₂

[Figure: axes Dependent (y), Explanatory (x₁) and Explanatory (x₂)]
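
The fitted equation y = a + b₁x₁ + b₂x₂ can be illustrated outside SPSS. The following is a minimal sketch in Python with NumPy, using simulated data; the "true" values of a, b₁ and b₂ are hypothetical and chosen only for illustration.

```python
import numpy as np

# Simulated data: the true coefficients below are hypothetical.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
a_true, b1_true, b2_true = 1.0, 0.5, -0.3
y = a_true + b1_true * x1 + b2_true * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with a column of ones for the intercept a.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares estimates of (a, b1, b2).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a_hat, b1_hat, b2_hat = coef

print(f"a = {a_hat:.3f}, b1 = {b1_hat:.3f}, b2 = {b2_hat:.3f}")
```

With two explanatory variables the fitted "line" is a plane in three dimensions, which is what the 3-dimensional scatterplots in this session display.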

Page 5: 3-dimensional scatterplot from SPSS of Min LDL in relation to baseline LDL and age

Page 6: When to use multiple regression modelling (1)

Assess the relationship between two variables while adjusting or allowing for another variable

Sometimes the second variable is considered a ‘nuisance’ factor

Example: Physical activity allowing for age and medications

Page 7: When to use multiple regression modelling (2)

In an RCT whenever there is imbalance between arms of the trial at baseline in characteristics of subjects

e.g. survival in colorectal cancer on two different randomised therapies adjusted for age, gender, stage, and co-morbidity

Page 8: When to use multiple regression modelling (2, continued)

A special case of this is when adjusting for baseline level of the primary outcome in an RCT

Baseline level added as a factor in the regression model

This will be covered in the Trials part of the course

Page 9: When to use multiple regression modelling (3)

With observational data in order to produce a prognostic equation for future prediction of risk of mortality

e.g. Predicting future risk of CHD used 10-year data from the Framingham cohort

Page 10: When to use multiple regression modelling (4)

With observational data in order to adjust for possible confounders

e.g. survival in colorectal cancer in those with hypertension adjusted for age, gender, social deprivation and co-morbidity

Page 11: Definition of Confounding

A confounder is a factor which is related to both the variable of interest (explanatory) and the outcome, but is not an intermediary in a causal pathway
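
The effect of adjusting for a confounder can be shown with a small simulation. This is a sketch only: the variables x (exposure), z (confounder) and y (outcome) and all effect sizes are hypothetical, chosen so that z is related to both x and y as in the definition above.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients, intercept first."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)                       # confounder
x = 0.8 * z + rng.normal(size=n)             # exposure, related to z
y = 0.5 * x + 1.0 * z + rng.normal(size=n)   # outcome; true effect of x is 0.5

crude = ols(x, y)[1]                              # y on x only
adjusted = ols(np.column_stack([x, z]), y)[1]     # y on x, adjusting for z

print(f"crude coefficient for x:    {crude:.2f}")
print(f"adjusted coefficient for x: {adjusted:.2f}")
```

In this simulated setting the crude coefficient overstates the effect of x because part of the association runs through z; entering z in the model recovers a value close to the true 0.5.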

Page 12: Example of Confounding

[Diagram linking Deprivation, Smoking and Lung Cancer]

Page 13: But also worth adjusting for factors only related to outcome

[Diagram linking Deprivation, Exercise and Lung Cancer]

Page 14: Not worth adjusting for an intermediate factor in a causal pathway

[Diagram linking Exercise, Blood viscosity and Stroke]

In a causal pathway each factor is merely a marker of the other factors, i.e. correlated (collinearity)
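
Collinearity is commonly quantified with the tolerance and variance inflation factor (VIF) that SPSS reports under Collinearity Statistics. The sketch below, assuming simulated predictors with hypothetical names x1, x2 and x3, computes VIF as 1 / (1 - R²), where R² comes from regressing each predictor on the others; tolerance is 1 / VIF.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()   # R-squared of column j on the rest
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # almost a copy of x1: strongly collinear
x3 = rng.normal(size=n)              # unrelated to the others

v = vif(np.column_stack([x1, x2, x3]))
for name, value in zip(["x1", "x2", "x3"], v):
    print(f"{name}: VIF = {value:.1f}, tolerance = {1.0 / value:.3f}")
```

A VIF close to 1 indicates little collinearity, while large values indicate that a predictor is nearly determined by the others.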

Page 15: SPSS: Add both baseline LDL and age in the independent box in linear regression

Page 16: Output from SPSS linear regression on Age at baseline

Coefficients (Model 1; dependent variable: Min LDL achieved)

                   B       Std. Error   Beta     t        Sig.   95% CI for B      Tolerance   VIF
(Constant)         2.024   .105                  19.340   .000   1.819 to 2.229
Age at baseline    -.008   .002         -.121    -4.546   .000   -.011 to -.004    1.000       1.000

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient; Tolerance and VIF are the collinearity statistics.

Page 17: Output from SPSS linear regression on Baseline LDL

Coefficients (Model 1; dependent variable: Min LDL achieved)

                B      Std. Error   Beta    t        Sig.   95% CI for B
(Constant)      .668   .066                 10.091   .000   .538 to .798
Baseline LDL    .257   .018         .351    13.950   .000   .221 to .293

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

Page 18: Output: Multiple regression

Model Summary (Model 1; predictors: (Constant), Age at baseline, Baseline LDL)

R      R Square   Adjusted R Square   Std. Error of the Estimate
.360   .130       .129                .6753538

Coefficients (Model 1; dependent variable: Min LDL achieved)

                   B       Std. Error   Beta     t        Sig.   95% CI for B
(Constant)         1.003   .124                  8.086    .000   .760 to 1.246
Baseline LDL       .250    .019         .342     13.516   .000   .214 to .286
Age at baseline    -.005   .002         -.081    -3.187   .001   -.008 to -.002

R² now improved to 13%

Both variables still significant INDEPENDENTLY of each other
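
The unstandardized coefficients in the multiple regression output define a prediction equation: Min LDL = 1.003 + 0.250 × Baseline LDL - 0.005 × Age. The short sketch below applies it in Python; the function name and the example inputs (baseline LDL of 3.5 and age 60) are hypothetical, and units are whatever the dataset uses, which the slides do not state.

```python
def predict_min_ldl(baseline_ldl, age):
    """Predicted Min LDL from the reported B coefficients (rounded as in the output)."""
    intercept = 1.003
    b_baseline_ldl = 0.250
    b_age = -0.005
    return intercept + b_baseline_ldl * baseline_ldl + b_age * age

# Hypothetical subject: baseline LDL 3.5, aged 60.
example = predict_min_ldl(3.5, 60)
print(f"Predicted Min LDL: {example:.3f}")

# Holding baseline LDL fixed, a subject 10 years older differs by 10 * (-0.005).
difference = predict_min_ldl(3.5, 70) - predict_min_ldl(3.5, 60)
print(f"Difference for 10 extra years of age: {difference:.3f}")
```

Each coefficient is interpreted with the other variable held constant, which is what "independently of each other" means here.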

Page 19: How do you select which variables to enter the model?

• Usually consider what hypotheses you are testing

• If there is a main ‘exposure’ variable, enter it first and assess confounders one at a time

• For derivation of a CPR you want powerful predictors

• Also clinically important factors, e.g. cholesterol in CHD prediction

• Significance is important, but it is acceptable to have an ‘important’ variable without statistical significance

Page 20: How do you decide what variables to enter in the model?

Correlations? With great difficulty!

Page 21: 3-dimensional scatterplot from SPSS of Time from Surgery in relation to Dukes’ staging and age

Page 22: Approaches to model building

1. Let Scientific or Clinical factors guide selection

2. Use automatic selection algorithms

3. A mixture of above

Page 23: 1) Let Science or Clinical factors guide selection

Baseline LDL cholesterol is an important factor determining LDL outcome so enter first

Next allow for age and gender

Add adherence as important?

Add BMI and smoking?

Page 24: 1) Let Science or Clinical factors guide selection

Results in model of:

1. Baseline LDL

2. Age and gender

3. Adherence

4. BMI and smoking

Is this a ‘good’ model?

Page 25: 1) Let Science or Clinical factors guide selection: Final Model

Note three variables entered but not statistically significant

Page 26: 1) Let Science or Clinical factors guide selection

Is this the ‘best’ model? Should I leave out the non-significant factors (Model 2)?

Model   Adj R²   F from ANOVA   No. of parameters p
1       0.137    37.48          7
2       0.134    72.021         4

Adj R² is lower, F has increased and the number of parameters is less in the 2nd model. Is this better?

Page 27: Kullback-Leibler Information

Kullback and Leibler (1951) quantified the meaning of ‘information’, related to Fisher’s ‘sufficient statistics’

Basically we have reality f

And a model g to approximate f

So K-L information is I(f,g)

[Diagram: f and g]

Page 28: Kullback-Leibler Information

We want to minimise I(f,g) to obtain the best model over other models

I(f,g) is the information lost or ‘distance’ between reality and a model so need to minimise:

$$I(f,g) = \int f(x)\,\log\!\left(\frac{f(x)}{g(x)}\right)dx$$
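
To make I(f,g) concrete, the integral can be evaluated numerically when f and g are both known. The sketch below is an illustration under stated assumptions: "reality" f is taken to be a normal density with mean 0 and SD 1 and the model g a normal density with mean 1 and SD 2 (both hypothetical), and the numerical result is compared with the standard closed form for two normal distributions.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def kl_numeric(mu_f, sd_f, mu_g, sd_g, lo=-12.0, hi=12.0, m=200001):
    """Approximate I(f,g) = integral of f(x) * log(f(x) / g(x)) dx on a fine grid."""
    x = np.linspace(lo, hi, m)
    dx = x[1] - x[0]
    f = normal_pdf(x, mu_f, sd_f)
    g = normal_pdf(x, mu_g, sd_g)
    return float(np.sum(f * np.log(f / g)) * dx)

def kl_closed_form(mu_f, sd_f, mu_g, sd_g):
    """Closed form of I(f,g) for two normal densities."""
    return float(np.log(sd_g / sd_f)
                 + (sd_f ** 2 + (mu_f - mu_g) ** 2) / (2.0 * sd_g ** 2)
                 - 0.5)

numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)
exact = kl_closed_form(0.0, 1.0, 1.0, 2.0)
same = kl_numeric(0.0, 1.0, 0.0, 1.0)   # model identical to reality

print(f"numeric I(f,g) = {numeric:.4f}, closed form = {exact:.4f}")
print(f"I(f,f) = {same:.4f}")
```

When the model equals reality no information is lost and I(f,f) is zero; the further g is from f, the larger I(f,g) becomes. In practice f is unknown, which is why a criterion estimated from the data is needed.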

Page 29: Akaike’s Information Criterion

It turns out that the function I(f,g) is related to a very simple measure of goodness-of-fit:

Akaike’s Information Criterion or AIC

Page 30: Selection Criteria

• With a large number of factors the type 1 error is large, so likely to have a model with many variables

• Two standard criteria:

1) Akaike’s Information Criterion (AIC)

2) Schwarz’s Bayesian Information Criterion (BIC)

• Both penalise models with a large number of variables; the BIC penalty also grows with sample size

Page 31: Akaike’s Information Criterion

$$\mathrm{AIC} = -2 \times \log\text{likelihood} + 2 \times p$$

• Where p = number of parameters and -2*log likelihood is in the output

• Hence AIC penalises models with a large number of variables

• Select the model that minimises (-2LL + 2p)
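
The AIC formula is simple enough to compute by hand or in a few lines of code. The sketch below defines hypothetical helper functions `aic` and `bic`; the BIC form used, -2LL + p × ln(n), is the standard definition rather than something shown on the slides, and the sample size n = 1000 is an assumption for illustration only.

```python
import math

def aic(log_likelihood, p):
    """Akaike's Information Criterion: -2LL + 2p."""
    return -2.0 * log_likelihood + 2.0 * p

def bic(log_likelihood, p, n):
    """Schwarz's Bayesian Information Criterion: -2LL + p * ln(n)."""
    return -2.0 * log_likelihood + p * math.log(n)

# Log likelihood and p taken from a model in this session; n is hypothetical.
ll, p, n = -1413.6, 4, 1000
aic_value = aic(ll, p)
bic_value = bic(ll, p, n)

print(f"AIC = {aic_value:.1f}")
print(f"BIC = {bic_value:.1f}")
```

Software may report values that differ from a hand calculation by a constant if it counts additional estimated parameters (for example the error variance), so comparisons should use AIC values computed the same way for every model.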

Page 32: Generalized linear models

• Unfortunately the standard REGRESSION in SPSS does not give these statistics

• Need to use Analyze, then Generalized Linear Models…

Page 33: Generalized linear models. Default is linear

• Add Min LDL achieved as dependent as in REGRESSION in SPSS

• Next go to predictors…

Page 34: Generalized linear models: Predictors

• WARNING!

• Make sure you add the predictors in the correct box

• Categorical in FACTORS box

• Continuous in COVARIATES box

Page 35: Generalized linear models: Model

• Add all factors and covariates in the model as main effects

Page 36: Generalized Linear Models Parameter Estimates

Note identical to REGRESSION output

Page 37: Generalized Linear Models Goodness-of-fit

Note output gives log likelihood (-1409.6) and AIC = 2835

By hand, -2 × (-1409.6) + 2 × 7 = 2833.2; the reported AIC is about 2 higher, which would be consistent with SPSS counting one additional estimated parameter, although the output shown does not confirm this

Footnote explains smaller AIC is ‘better’

Page 38: Let Science or Clinical factors guide selection: ‘Optimal’ model

• The log likelihood is a measure of GOODNESS-OF-FIT

• Seek ‘optimal’ model that maximises the log likelihood or minimises the AIC

Model                                 Log likelihood   p   AIC
1 Full model                          -1409.6          7   2835.6
2 Non-significant variables removed   -1413.6          4   2837.2

Change in AIC is 1.6

Page 39: 1) Let Science or Clinical factors guide selection

Key points:

1. Results demonstrate a significant association with baseline LDL, Age and Adherence

2. Difficult choices with Gender, smoking and BMI

3. AIC only changes by 1.6 when removed

4. Generally changes of 4 or more in AIC are considered important

Page 40: 1) Let Science or Clinical factors guide selection

Key points:

1. Conclude little to choose between models

2. AIC actually lower with larger model and consider Gender and BMI important factors, so keep larger model but have to justify

3. Model building manual, logical, transparent and under your control

Page 41: 2) Use automatic selection procedures

These are based on automatic mechanical algorithms usually related to statistical significance

Common ones are stepwise, forward or backward elimination

Can be selected in SPSS using ‘Method’ in dialogue box

Page 42: 2) Use automatic selection procedures (e.g. Stepwise)

Select Method = Stepwise

Page 43: 2) Use automatic selection procedures (e.g. Stepwise)

[SPSS output annotated: 1st step, 2nd step, Final Model]

Page 44: 2) Change in AIC with Stepwise selection

Note: Only available from Generalized Linear Models

Step   Model          Log likelihood   AIC      Change in AIC   No. of parameters p
1      Baseline LDL   -1423.1          2852.2   -               2
2      +Adherence     -1418.0          2844.1   8.1             3
3      +Age           -1413.6          2837.2   6.9             4
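
The change in AIC at each step is just the difference between successive AIC values. As a small worked check in Python, using the AIC values from the table and the rule of thumb from this session that changes of 4 or more are considered important:

```python
# (model label, AIC) for each stepwise step, as reported in the table.
steps = [("Baseline LDL", 2852.2), ("+Adherence", 2844.1), ("+Age", 2837.2)]

# Reduction in AIC when each new variable is added.
changes = [round(steps[i - 1][1] - steps[i][1], 1) for i in range(1, len(steps))]

# Rule of thumb quoted in this session: a change of 4 or more is important.
important = [change >= 4 for change in changes]

for (label, _), change, flag in zip(steps[1:], changes, important):
    print(f"{label}: AIC reduced by {change} (important: {flag})")
```

Both additions reduce the AIC by more than 4, so each step is a worthwhile improvement by this criterion.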

Page 45: 2) Advantages and disadvantages of stepwise

Advantages

Simple to implement

Gives a parsimonious model

Selection is certainly objective

Disadvantages

Non-stable selection: stepwise considers many models that are very similar

P-value on entry may be smaller once procedure is finished so exaggeration of p-value

Predictions in external dataset usually worse for stepwise procedures

Page 46: 2) Automatic procedures: Backward elimination

Backward starts by eliminating the least significant factor from the full model and has a few advantages over forward:

• Modeller has to consider the ‘full’ model and sees results for all factors simultaneously

• Correlated factors can remain in the model (in forward methods they may not even enter)

• Criteria for removal tend to be more lax in backward so end up with more parameters
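
The mechanics of backward elimination can be sketched in code. Note that this sketch swaps the criterion: SPSS removes the least significant factor using p-values, whereas the version below removes the variable whose removal lowers the AIC most and stops when no removal lowers it. The data are simulated, the variable names x1 to x4 are hypothetical, and the AIC here counts regression coefficients only.

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC of a least squares fit with intercept (coefficients counted in p)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X]) if X.shape[1] else np.ones((n, 1))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ coef) ** 2))
    loglik = -0.5 * n * (np.log(2.0 * np.pi) + np.log(rss / n) + 1.0)
    return -2.0 * loglik + 2.0 * A.shape[1]

def backward_eliminate(X, y, names):
    """Start from the full model; drop one variable at a time while AIC falls."""
    keep = list(range(X.shape[1]))
    best = gaussian_aic(X[:, keep], y)
    while keep:
        trials = [(gaussian_aic(X[:, [k for k in keep if k != j]], y), j)
                  for j in keep]
        trial_aic, drop = min(trials)
        if trial_aic >= best:      # no removal improves the AIC: stop
            break
        best = trial_aic
        keep.remove(drop)
    return [names[k] for k in keep], best

rng = np.random.default_rng(4)
n = 400
X = rng.normal(size=(n, 4))
# Only x1 and x2 truly affect the outcome; x3 and x4 are noise.
y = 2.0 + 1.0 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=n)

selected, final_aic = backward_eliminate(X, y, ["x1", "x2", "x3", "x4"])
print("Variables retained:", selected)
print(f"Final AIC: {final_aic:.1f}")
```

Because every candidate starts in the model, correlated or weak factors are only removed if the fit does not suffer, which mirrors the advantages listed above; a noise variable can still be retained by chance.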

Page 47: 2) Use automatic selection procedures (e.g. Backward)

Select Method = Backward

Page 48: 2) Backward elimination in SPSS

[SPSS output annotated: 1st step, Gender removed; 2nd step, BMI removed; Final Model]

Page 49: Summary of automatic selection

• Automatic selection may not give ‘optimal’ model (may leave out important factors)

• Different methods may give different results (forward vs. backward elimination)

• Backward elimination preferred as less stringent

• Too easily fitted in SPSS!

• Model assessment still requires some thought

Page 50: 3) A mixture of automatic procedures and self selection

• Use automatic procedures as a guide

• Think about what factors are important

• Add ‘important’ factors

• Do not blindly follow statistical significance

• Consider AIC

Page 51: Summary of Model selection

• Selection of factors for Multiple Linear regression models requires some judgement

• Automatic procedures are available but treat results with caution

• They are easily fitted in SPSS

• Check AIC or log likelihood for fit

Page 52: Summary

• Multiple regression models are the most used analytical tool in quantitative research

• They are easily fitted in SPSS

• Model assessment requires some thought

• Parsimony is better: Occam’s Razor

Page 53: Remember Occam’s Razor

‘Entia non sunt multiplicanda praeter necessitatem’

‘Entities must not be multiplied beyond necessity’

William of Ockham, 14th century friar and logician, 1288-1347

Page 54: Summary

After fitting any model check assumptions

• Functional form: linearity or not

• Check residuals for normality

• Check residuals for outliers

• All accomplished within SPSS

• See publications for further info

Donnelly LA, Palmer CNA, Whitley AL, Lang C, Doney ASF, Morris AD, Donnan PT. Apolipoprotein E genotypes are associated with lipid lowering response to statin treatment in diabetes: A Go-DARTS study. Pharmacogenetics and Genomics, 2008; 18: 279-87.
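
The residual checks listed above can also be sketched outside SPSS. The example below uses simulated data with one deliberately injected outlier; flagging standardized residuals beyond ±3 is a common convention assumed here, not a rule from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=n)
y[0] += 5.0   # inject one gross outlier for illustration

# Fit y = a + b*x by least squares and compute residuals.
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

# Standardize residuals by the residual SD (n - 2 degrees of freedom here).
z = resid / resid.std(ddof=A.shape[1])

# Outliers: standardized residuals beyond +/- 3 (assumed cut-off).
outliers = np.flatnonzero(np.abs(z) > 3.0)

# Rough normality check: about 95% of residuals should lie within +/- 1.96.
within = float(np.mean(np.abs(z) < 1.96))

print("Indices flagged as outliers:", outliers)
print(f"Proportion within +/-1.96: {within:.2f}")
```

A histogram or normal probability plot of the residuals, and a plot of residuals against fitted values for checking linearity, would complete the picture.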

Page 55: Practical on Multiple Regression

Read in ‘LDL Data.sav’

1) Try fitting a multiple regression model on Min LDL obtained using forward and backward elimination. Are the results the same? Add other factors than those considered in the presentation such as BMI, smoking. Remember the goal is to assess the association of APOE with LDL response.

2) Try fitting multiple regression models for Min Chol achieved. Is the model similar to that found for Min LDL?