multivariate analysis : introduction

Third training Module, EpiSouth: Multivariate analysis, 15 th to 19 th June 2009 1/29 Multivariate analysis: Introduction Third training Module EpiSouth Madrid, 15 th to 19 th June, 2009 Dr D. Hannoun National Institute of Public Health Algeria

Upload: genero

Post on 20-Jan-2016



Page 1: Multivariate analysis : Introduction

Third training Module, EpiSouth: Multivariate analysis, 15th to 19th June 2009 1/29

Multivariate analysis: Introduction

Third training Module, EpiSouth

Madrid, 15th to 19th June, 2009

Dr D. Hannoun

National Institute of Public Health, Algeria

Page 2: Multivariate analysis : Introduction

Introduction: Generality

Stratification allows us to:

• Control confounding

• Reveal effect modification

Limits of stratification:

• Only a small number of confounders can be controlled simultaneously

• The joint effect of confounders cannot be analysed correctly +++

• Classes have to be chosen for quantitative variables

Other tools: MULTIVARIATE ANALYSIS

Assess whether the effect of exposure on the disease is real

Page 3: Multivariate analysis : Introduction

Introduction: Joint effect

Example: Hepatitis B SEP

Potential confounders: age (children/adults), immunity (good/deficient)

• Joint effect: the effect of two or more factors combined

• Marginal effect: the effect of one confounder alone, without taking the other potential confounders into consideration

Crude effect: 2.0

Control on        Stratum 1       Stratum 2       Stratum 3       Stratum 4       Adjusted measure
Age (F1)          F+: 2.0         F-: 2.0                                         2.0
Immunity (F2)     F+: 2.0         F-: 2.0                                         2.0
Factors 1+2       F1+/F2+: 1.0    F1-/F2-: 1.0    F1+/F2-: 1.0    F1-/F2+: 1.0    1.0

Page 4: Multivariate analysis : Introduction

Multivariate analysis: Definition

Definition: procedures, at the analysis phase, that:

• Simultaneously adjust for several variables

• Simultaneously control for several potential confounders

Several models:

• Multiple linear regression

• Logistic regression

• Cox regression …

Vocabulary

• Disease Y = dependent variable

• Risk factors = independent variables or predictors

Page 5: Multivariate analysis : Introduction

Multivariate analysis: Definition

How:

• Representation of the disease Y as a function of other variables:

• Risk factors

• Potential confounders

• By modelling the relationship studied

Set of variables → statistical procedures (multivariate analysis) →

• The best subset of variables that describes the relationship between RF and disease

• Measure of the relationship: parameters

• To describe the disease via an equation

• The best model fitting the data

Page 6: Multivariate analysis : Introduction

Multivariate analysis: Definition

Writing the model:

• $E(Y \mid E, X_1, X_2, \dots, X_p) = f(E, X_1, X_2, \dots, X_p)$

• Y: a given disease

• E: exposure

• $X_1, X_2, \dots$: other variables

Example:

• f = linear function

$E(Y \mid E, X_1, X_2, \dots, X_p) = \alpha + \beta E + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$

• $\beta, \beta_1, \beta_2, \dots$ measure the relationship between the exposure E, the other risk factors $X_1, X_2, \dots$ and the disease Y, each controlled for the other variables

• If $\beta = 0$: no relationship between the exposure and the disease

Page 7: Multivariate analysis : Introduction

Multivariate analysis: Definition

The adjusted measures of association we obtain from multivariable analysis:

• For each variable in the model, we obtain the effect measure of the relationship between this variable and the disease, controlled for the other variables

• These are direct effects, not total effects

Page 8: Multivariate analysis : Introduction

Multivariate analysis: Advantages

Advantages/techniques:

• Estimate effects while controlling for more than one confounder simultaneously

• Study the joint effect of several risk factors and quantify the intensity of interaction

• Continuous risk factors can be used

• Study the dose-response relationship: of interest for causality and for the specific risk at intermediate levels

• Study the trend in effect according to the level of the risk factor

• Predict the disease

Page 9: Multivariate analysis : Introduction

Multivariate analysis: Steps

Several steps:

• Choose the appropriate model to summarize the data

• Define the variable selection strategy

• Estimate the model coefficients:

• Method of least squares (LS) estimation

• Method of maximum likelihood (ML) estimation

• Write and interpret the model

• Assess the adequacy of the model

Page 10: Multivariate analysis : Introduction

Multivariate analysis: Choice of the model

Depends on the form of the function f:

1. Nature of the outcome variable

• Continuous outcome → multiple linear regression

• Categorical outcome → logistic regression (LR)

• Outcome is time to an event → Cox regression

2. Nature of the joint effect

• Additive → multiple linear regression

• Multiplicative → logistic regression, Cox regression

3. Form of the distribution of the variable

• Normally distributed…

4. Assumptions

Page 11: Multivariate analysis : Introduction

Multivariate analysis: Variables selection

The final model depends on which variables are selected:

• At the study design:

• Decide which variables to adjust or control for

• How the variables will be coded

• Which interactions should be considered

• At the analytical phase:

• Which variables must be entered in the model

• Which variables must be forced

• P value

• E.g.: 7 variables coded 0/1 with all interaction terms → $2^7 = 128$ coefficients to estimate in the final model!

Necessity of a STRATEGY
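
The count of 128 coefficients can be checked by enumeration. The sketch below assumes that "all interaction terms" means one coefficient for the intercept, one per variable, and one per interaction of every order; the function name is illustrative only.

```python
from itertools import combinations


def count_coefficients(n_variables):
    """Count the intercept, the main effects and every interaction of any order."""
    total = 0
    # order 0 is the intercept, order 1 the main effects, order 2 the pairs, ...
    for order in range(n_variables + 1):
        total += sum(1 for _ in combinations(range(n_variables), order))
    return total


print(count_coefficients(7))
```

Each additional binary variable doubles the number of coefficients, which is why a selection strategy is needed before modelling.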

Page 12: Multivariate analysis : Introduction

Multivariate analysis: Parameters estimation

Purpose of multivariate analysis:

• To obtain a measure of effect that describes the exposure-outcome relationship, adjusted for relevant extraneous factors

The parameters estimated depend on the model used:

• In MLR → regression coefficients β

• In LR → odds ratio

• In Cox → hazard ratio
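
In logistic and Cox regression the model is fitted on the log scale, so the odds ratio or hazard ratio is obtained by exponentiating the fitted coefficient. A minimal sketch, assuming a coefficient `beta` and its standard error `se` have already been estimated by some software; the function names and the Wald-type 95% interval (z = 1.96) are illustrative choices, not taken from the slides.

```python
import math


def ratio_from_coefficient(beta):
    """Odds ratio (logistic) or hazard ratio (Cox) for a one-unit increase."""
    return math.exp(beta)


def ratio_confidence_interval(beta, se, z=1.96):
    """Wald-type interval, computed on the log scale and then exponentiated."""
    return math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical fitted values, for illustration only
beta, se = math.log(2.0), 0.25
print(ratio_from_coefficient(beta))
print(ratio_confidence_interval(beta, se))
```

A coefficient of 0 gives a ratio of 1, which matches the earlier statement that β = 0 means no relationship between exposure and disease.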

Page 13: Multivariate analysis : Introduction

Multivariate analysis: Model adequacy

Verify the adequacy of the model:

• Capacity of the model to represent correctly the value of the disease, given the values of the subset of risk factors

Steps:

• Adequacy of the model:

• Graphical methods +++

• Statistical tests

• Interpreting the tests: beware of outliers

• The best model is not necessarily the best statistical model: choose the model that gives the best understanding of the disease

The fitted model can be used for prediction

Page 14: Multivariate analysis : Introduction

MLR: Introduction

MLR = multivariate model used in the case of continuous data

Principle:

• Describe one variable as a linear function of one or more other variables

• Form: $E(Y) = f(E, X_1, X_2, \dots)$, with f a linear function

• Simple linear regression model: $E(Y \mid X) = \alpha + \beta X$

• Multiple linear regression model: $E(Y \mid X_1, \dots, X_p) = \alpha + \beta_1 X_1 + \dots + \beta_p X_p$

[Figure: scatter plot of the disease (vertical axis) with the straight line $E(Y) = \alpha + \beta X$]

Page 15: Multivariate analysis : Introduction

MLR: Introduction

[Figure: scatter plot of the incidence rate of ARI against atmospheric pollution (density of PM10)]

In simple linear regression:

• Statistical model: $Y = \alpha + \beta X + \varepsilon$

• $\beta$ = slope of the straight line

• Estimates the change in Y for a one-unit change in X

• E.g. when atmospheric pollution increases by 1%, the incidence rate of ARI increases by 2 cases per 100,000 persons

• $\alpha$ = intercept, which corresponds to the value of the disease when the exposure equals 0 or, more generally, describes the baseline

• $\varepsilon$ = error term in the model

• Fitted line: $\hat{Y} = \hat{\alpha} + \hat{\beta} X$

Page 16: Multivariate analysis : Introduction

MLR: Introduction

In multiple linear regression:

• Statistical model: $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$

• E.g.:

• Variation of the incidence rate of ARI with atmospheric pollution

• Potential confounders: age and smoking

• $X_1$ = density of PM10

• $X_2$ = age of the person

• $X_3$ = smoking

$\text{ARI Inc. Rate} = \alpha + \beta_1\,\text{density of PM10} + \beta_2\,\text{Age} + \beta_3\,\text{smoking} + \varepsilon$

Page 17: Multivariate analysis : Introduction

MLR: Introduction

In multiple linear regression:

• $\beta_1$ = slope along the $X_1$ dimension: variation of ARI for a change of one unit of PM10 density, controlled for the other variables

• $\beta_2$ = slope along the $X_2$ dimension: variation of ARI for a change of one unit of age, controlled for the other variables

• $\beta_3$ = slope along the $X_3$ dimension: variation of ARI for a change of one unit of smoking (person/year), controlled for the other variables

• $\alpha$ = intercept, value of the disease when there is no risk factor…

• $\varepsilon$ = error term in the model

$\text{ARI Inc. Rate} = \alpha + \beta_1\,\text{density of PM10} + \beta_2\,\text{Age} + \beta_3\,\text{smoking} + \varepsilon$
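
To make the interpretation of the coefficients concrete, here is a minimal pure-Python sketch that fits a model of this form by solving the normal equations. The data are invented and noise-free (generated with α = 5, β1 = 2, β2 = 0.1, β3 = 3), and the helper names `solve` and `fit_mlr` are assumptions for illustration; real analyses would rely on statistical software.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_mlr(rows, y):
    """Least-squares coefficients [alpha, beta1, ..., betap]."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


# Invented data: (density of PM10, age, smoking 0/1)
rows = [(10, 30, 0), (20, 45, 1), (15, 60, 0), (30, 25, 1),
        (25, 50, 0), (40, 35, 1), (35, 65, 1), (12, 40, 0)]
# Noise-free outcome, so the fit should recover the generating coefficients
y = [5 + 2 * pm10 + 0.1 * age + 3 * smoking for pm10, age, smoking in rows]

alpha, b_pm10, b_age, b_smoking = fit_mlr(rows, y)
print(alpha, b_pm10, b_age, b_smoking)
```

Each fitted slope is the change in the outcome for one unit of its variable with the other two held fixed, which is what "controlled for the other variables" means.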

Page 18: Multivariate analysis : Introduction

MLR: Parameters estimation

Method used: least squares estimation

Principle:

• Identify the best straight line, the one that minimizes the sum of squared residuals

[Figure: least-squares line fit, showing an observed point $(X_i, Y_i)$ and the corresponding fitted point $(X_i, \hat{Y}_i)$ on the line]

$SSR = \sum_i (Y_i - \hat{Y}_i)^2 = \sum_i (Y_i - \alpha - \beta X_i)^2$
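
For a single predictor, the values of α and β that minimize the SSR have a closed form. The sketch below applies it to a handful of invented points and lets one check that moving the line away from the fitted values can only increase the SSR. The function names and data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Return (alpha, beta) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx                  # slope
    alpha = mean_y - beta * mean_x    # intercept
    return alpha, beta


def ssr(xs, ys, alpha, beta):
    """Sum of squared residuals of the line alpha + beta * x."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))


# Invented data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha, beta = fit_line(xs, ys)
best = ssr(xs, ys, alpha, beta)
print(alpha, beta, best)
# Any other line has a larger SSR, for example a shifted intercept:
print(ssr(xs, ys, alpha + 0.1, beta) >= best)
```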

Page 19: Multivariate analysis : Introduction

MLR: Variables selection

Decide which variables to control for:

1. Prediction of the risk of the disease

• We do not have to take all confounders into consideration, only the best group of predictors

• Important in terms of public health +++

• E.g.: incidence rate of ARI; exposure: atmospheric pollution; predictors: age and smoking

2. Estimation of the relation between exposure and disease

• We have to take ALL confounders into consideration to control confounding

• Important in terms of causal association

• E.g.: incidence rate of ARI; exposure: atmospheric pollution; predictors: age, smoking, breastfeeding, ROR…

Page 20: Multivariate analysis : Introduction

MLR: Variables selection

Which variables must be entered in the initial model: 2 situations

• Some are obligatory in the model because they are recognized risk factors: the exposure

• Other variables: those with a significant relationship with the disease in the bivariate analysis

Together, these are the candidate variables for modelling

Page 21: Multivariate analysis : Introduction

MLR: Variables selection

Which interactions should be considered:

• The problem of interaction must be approached in a manner which facilitates understanding of the nature of the causal effect

• Statistical considerations should serve rather than determine our objectives

• Adding an interaction term means:

• Adding another regression coefficient to the equation

• More difficulty in interpreting the model

• For a given interaction, you must ensure that the variables in the interaction term are also contained in the model
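
The last rule above can be checked mechanically. The sketch below assumes model terms are written as strings and interactions use the `*` notation (as in "Age*smoking"); the function name is hypothetical.

```python
def missing_main_effects(terms):
    """Return variables used in an interaction but absent as main effects."""
    main_effects = {t for t in terms if "*" not in t}
    missing = set()
    for term in terms:
        if "*" in term:
            for variable in term.split("*"):
                if variable not in main_effects:
                    missing.add(variable)
    return sorted(missing)


# Both variables of the interaction are in the model: nothing is missing
print(missing_main_effects(["pm10", "age", "smoking", "age*smoking"]))
# Smoking appears only inside the interaction: the rule is violated
print(missing_main_effects(["pm10", "age", "age*smoking"]))
```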

Page 22: Multivariate analysis : Introduction

MLR: Variables selection

Example: incidence rate of ARI

1. Model WITH an interaction term:

• Interaction BETWEEN smoking and age: $\beta_{2,3} X_2 X_3$

$\text{ARI Inc. Rate} = \alpha + \beta_1\,\text{density of PM10} + \beta_2\,\text{Age} + \beta_3\,\text{smoking} + \beta_{2,3}\,\text{Age} \times \text{smoking} + \beta_4\,\text{breastfeeding} + \beta_5\,\text{ROR} + \varepsilon$

$\text{ARI Inc. Rate} = \alpha + \beta_1\,\text{density of PM10} + \beta_2\,\text{Age} + \beta_{2,3}\,\text{Age} \times \text{smoking} + \varepsilon$
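
An interaction term makes the effect of one variable depend on the level of the other, which is why the model becomes harder to interpret. The sketch below uses a reduced version of the first equation (breastfeeding and ROR are left out for brevity) with invented coefficient values; every number and name in it is a labelled assumption.

```python
# Hypothetical coefficients, chosen only to illustrate the arithmetic
coef = {"alpha": 10.0, "pm10": 2.0, "age": 0.1,
        "smoking": 3.0, "age_smoking": 0.05}


def ari_rate(pm10, age, smoking, coef):
    """Expected ARI incidence rate under the model with an age*smoking term."""
    return (coef["alpha"]
            + coef["pm10"] * pm10
            + coef["age"] * age
            + coef["smoking"] * smoking
            + coef["age_smoking"] * age * smoking)


def smoking_effect(age, coef):
    """Change in expected rate per unit of smoking, at a given age."""
    return coef["smoking"] + coef["age_smoking"] * age


for age in (20, 60):
    difference = ari_rate(15, age, 1, coef) - ari_rate(15, age, 0, coef)
    print(age, difference, smoking_effect(age, coef))
```

With these invented values the effect of smoking is larger at age 60 than at age 20, so a single "smoking coefficient" no longer summarizes the relationship.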

Page 23: Multivariate analysis : Introduction

MLR: Variables selection

Which variables must be entered in the initial model: 2 situations

• …

• How the variables must be entered in the initial model: a strategy must be defined

• Start with ALL variables → backward elimination

• Start with NO variable → forward selection

• Mix of the two previous methods → stepwise selection

Page 24: Multivariate analysis : Introduction

MLR: Variables selection

At the study design, variables considered: sex, age, pollution, ROR, smoking, breastfeeding, region, profession, age*smoking

First part of the analytical phase: bivariate analysis and stratification

• Significant variables: pollution, age, smoking, breastfeeding, ROR

• Variables that must be forced: pollution

Candidate variables for modelling: the largest possible model

Define how the variables could be entered in the model: backward, forward or stepwise

Second part of the analytical phase: multivariate analysis (rules)

Final model: pollution, age, smoking

Page 25: Multivariate analysis : Introduction

MLR: Backward strategy

Principle:

• Begin with ALL candidate variables in the model → the largest POSSIBLE model

• At each step, drop one variable; the choice of this variable is based on statistical rules → remove a variable which is not significant

• Continue until no more variables can be dropped, meaning all remaining variables are relevant

Advantages: evaluates the joint confounding effects of all variables

Limits: with many risk factors, some strata may provide no information
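
The loop structure of backward elimination can be sketched in a few lines. Note the swap: instead of a significance test, this sketch drops a variable when removing it increases the SSR by less than a fraction `tol` of the total sum of squares. That criterion, the helper names and the invented noise-free data (which is why `tol` can be tiny) are all assumptions made for illustration; the exposure is forced into the model, as in the flow chart.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ssr(columns, y):
    """SSR of the least-squares fit of y on an intercept plus `columns`."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    coef = solve(XtX, Xty)
    return sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, yi in zip(X, y))


def backward_elimination(data, y, forced, tol):
    """Start from all variables; repeatedly drop the least useful one."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    kept = list(data)
    while True:
        full = ssr([data[v] for v in kept], y)
        droppable = [v for v in kept if v not in forced]
        if not droppable:
            return kept
        # Relative increase in SSR caused by removing each candidate
        cost = {v: (ssr([data[u] for u in kept if u != v], y) - full) / sst
                for v in droppable}
        weakest = min(cost, key=cost.get)
        if cost[weakest] >= tol:      # every remaining variable matters
            return kept
        kept.remove(weakest)


# Invented data; "region" plays no role in the outcome
data = {
    "pm10":    [10, 20, 15, 30, 25, 40, 35, 12],
    "age":     [30, 45, 60, 25, 50, 35, 65, 40],
    "smoking": [0, 1, 0, 1, 0, 1, 1, 0],
    "region":  [1, 2, 3, 1, 2, 3, 1, 2],
}
y = [5 + 2 * p + 1.0 * a + 10 * s
     for p, a, s in zip(data["pm10"], data["age"], data["smoking"])]

kept = backward_elimination(data, y, forced={"pm10"}, tol=1e-6)
print(kept)
```

With real, noisy data the stopping rule would normally be a statistical test on each coefficient rather than this SSR threshold.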

Page 26: Multivariate analysis : Introduction

MLR: Forward strategy

Principle:

• Begin with NO variable in the model → the smallest POSSIBLE model

• At each step, keep one variable in the model; the choice of this variable is based on statistical rules:

• Start with the variable that has the biggest change-in-estimate impact when evaluated individually

• Keep the variable which meaningfully changes the adjusted estimate

• Continue until no other variables can be added

Advantages: avoids the initial sparse-cell problem of the backward approach

Limits: does not evaluate the joint confounding effects of many variables
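
The change-in-estimate rule above can be sketched as follows. The 10% threshold is a commonly used convention, not a value given in the slides, and the data, variable names and helper functions are invented for illustration. The outcome is generated as 1 + 2·pollution + 4·age_group, so the crude exposure coefficient is confounded by `age_group`, while `region` is unrelated to everything.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


def forward_selection(exposure, candidates, y, threshold=0.10):
    """Add, one at a time, the variable that most changes the exposure estimate."""
    selected = []
    remaining = dict(candidates)
    current = fit([exposure], y)[1]   # crude estimate; assumed non-zero
    while remaining:
        changes = {}
        for name, column in remaining.items():
            columns = [exposure] + [candidates[s] for s in selected] + [column]
            new = fit(columns, y)[1]
            changes[name] = (abs(new - current) / abs(current), new)
        best = max(changes, key=lambda k: changes[k][0])
        relative_change, new = changes[best]
        if relative_change < threshold:
            break                     # no candidate changes the estimate enough
        selected.append(best)
        current = new
        del remaining[best]
    return selected, current


# Invented data, for illustration only
pollution = [0, 0, 0, 0, 1, 1, 1, 1]
candidates = {
    "age_group": [0, 0, 0, 1, 0, 1, 1, 1],    # related to exposure and outcome
    "region":    [1, -1, 0, 0, 0, 1, -1, 0],  # unrelated to both
}
y = [1 + 2 * e + 4 * z for e, z in zip(pollution, candidates["age_group"])]

selected, adjusted = forward_selection(pollution, candidates, y)
print(selected, adjusted)
```

Because each step looks at one added variable at a time, the sketch shares the limitation stated above: it never evaluates the joint confounding effect of several variables entered together.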

Page 27: Multivariate analysis : Introduction

MLR: Conclusion

Goal of modelling: to obtain

• The smallest subset of relevant risk factors that describes the disease

• With the best understanding of the disease

As for stratification, you must:

• First, identify significant interaction terms (statistical significance + biological consideration); don't forget to verify that the variables in the interaction term are also contained in the model

• Second, assess the confounding effect: there is no statistical test for it

• Retain the significant risk factors, the confounders and the interaction terms that help us to understand and explain the occurrence of the disease

Page 28: Multivariate analysis : Introduction

Conclusion

Multivariate analysis allows us to estimate the effect of exposure while controlling and adjusting for several extraneous factors simultaneously

The adjusted measures of association are direct effects, not total effects

Multivariate analysis is a useful tool, but it can be very dangerous if the strategy has not been defined beforehand:

• Purpose of the study

• Method of variable selection

• Assumptions

• Adequacy of the model…

Page 29: Multivariate analysis : Introduction

Conclusion

As with the stratification method, statistical considerations should serve rather than determine our objectives

Multivariate analysis requires a computer to run the statistical programme

The choice of the model depends on many factors: the outcome variable, the form of the relationship between exposure and disease…