biostatistics case studies 2014

Biostatistics Case Studies 2014

Youngju Pak, PhD.

Biostatistician


Session 4:

Regression Models and Multivariate Analyses

What and Why?

Multivariate analysis (MVA) techniques allow more than two variables to be analysed at once, in contrast to univariate or bivariate analyses.

Data have become richer as computational technologies have advanced => data reduction or classification, e.g., Factor Analysis, Principal Component Analysis (PCA)

Several variables are potentially correlated to some degree => potential confounding can bias the results, e.g., Analysis of Covariance (ANCOVA), Multiple Linear or Generalized Linear Regression Models

What and Why?

Many variables are all interrelated, with multiple dependent and independent variables, e.g., Multivariate Analysis of Variance (MANOVA), Path Models, Structural Equation Models (SEM), Partial Least Squares (PLS) Models.

This session will focus on multiple regression models.

Why regression models?

To reduce “random noise” in the data => better variance estimation by accounting for additional sources of variability in the dependent variable, e.g., ANCOVA

To determine an optimal set of predictors => predictive models, e.g., variable selection procedures for multiple regression models

To adjust for potential confounding effects, e.g., regression models with covariates

Actual mathematical Models

ANOVA

Yij = μ + τi + εij, where Yij represents the jth observation (j = 1, 2, …, n) on the ith treatment (i = 1, 2, …, l levels).

The errors εij are assumed to be normally and independently distributed (NID), with mean zero and variance σ².

ANCOVA with k covariates

Yij = μ + τi + β1X1ij + β2X2ij + … + βkXkij + εij

MANOVA (with p outcome variables)

Y(n × p) = X(n × [q+1]) B([q+1] × p) + E(n × p)


Simple Linear Regression Models (SLR)

Yi = β0 + β1Xi + εi

μY = β0 + β1X is the true mean value of Y at a given X.

ε = “error” (random noise due to random sampling error); ε is assumed to follow a normal distribution with mean = 0 and variance = σ².

β0 and β1 = intercept and slope, often called regression (or beta) coefficients.

Y = Dependent Variable (DV), X = Independent Variable (IV), e.g., Y = insulin sensitivity, X = fatty acid in percentage
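As a rough illustration of how the intercept and slope of the SLR model are estimated, the sketch below fits a line by ordinary least squares. The data values and the function name `fit_slr` are made up for illustration; they are not the insulin sensitivity data shown on the slides.

```python
def fit_slr(x, y):
    """Return (b0, b1): least-squares estimates of the intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # estimated slope
    b0 = y_bar - b1 * x_bar   # estimated intercept
    return b0, b1


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = fit_slr(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

The estimated slope is read the same way as in the SPSS output: the change in mean Y for a one-unit increase in X.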

Related models: Multiple Linear Regression Models (MLR), Simple Logistic Models (SL), Multiple Logistic Models (ML)

SLR: Example SPSS output

• Two-sided p-value = 0.002. Thus, there is statistically significant evidence (alpha = 0.05) to conclude that the true slope is not zero, i.e., Fatty Acid (%) is significantly related to insulin sensitivity.

• Mean insulin sensitivity increases by 37.208 units as Fatty Acid (%) increases by one percentage point.

SLR w/CI

Checking the assumptions using a residual plot

The plot should look “random”: no particular pattern should appear if the assumptions are met.


Multiple Linear Regression Models (MLR)

Y = β0+ β1X1 + β2 X2 + … + βk Xk + ε

μY = β0 + β1X1 + β2X2 + … + βkXk is the true mean value of Y.

Assumptions are the same as for SLR, with one addition: the Xs must not be highly correlated with one another. If they are, this is called “multicollinearity”, which makes the model very unstable.

Diagnosis for multicollinearity: Variance Inflation Factor (VIF)

VIF = 1: OK

VIF < 5: Tolerable

VIF > 5: Problematic => remove the variable which has a high VIF, or do PCA
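The VIF for predictor j is 1 / (1 - Rj²), where Rj² comes from regressing Xj on the other predictors. A minimal sketch for the two-predictor case, where Rj² reduces to the squared Pearson correlation between X1 and X2; the data and function names are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two Xs."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictor values, for illustration only
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

vif = vif_two_predictors(x1, x2)
print(f"r = {pearson_r(x1, x2):.2f}, VIF = {vif:.2f}")
```

With more than two predictors, each Rj² has to come from a full regression of Xj on all the other Xs, which statistical packages report directly.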


MLR: Example

μY = -56.935 + 1.634X1 + 0.249X2

Coefficients (a)

Model 1          B (Unstandardized)   Std. Error   Beta (Standardized)   t        Sig.
(Constant)       -56.935              55.217                             -1.031   .327
R_Flexibility    1.634                .714         .490                  2.290    .045
O_Strength       .249                 .116         .458                  2.137    .058

a. Dependent Variable: Distance

1.634 * Flexibility: For every 1-degree increase in flexibility, mean punt distance increases by 1.634 feet, adjusting for leg strength.

0.249 * Strength: For every 1-lb increase in strength, mean punt distance increases by 0.249 feet, adjusting for flexibility.
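The "adjusting for" reading of each coefficient can be checked numerically: hold one predictor fixed, raise the other by one unit, and the predicted mean changes by exactly that coefficient. The sketch below uses the fitted coefficients from the SPSS output; the input values (90 degrees, 150 lb) and the function name are hypothetical.

```python
def mean_punt_distance(flexibility, strength):
    """Predicted mean punt distance (feet) from the fitted MLR equation."""
    return -56.935 + 1.634 * flexibility + 0.249 * strength


# Hypothetical punter: 90 degrees of flexibility, 150 lb of leg strength
base = mean_punt_distance(90, 150)

# One more degree of flexibility, strength held fixed
flex_effect = mean_punt_distance(91, 150) - base

# One more lb of strength, flexibility held fixed
strength_effect = mean_punt_distance(90, 151) - base

print(f"predicted mean distance: {base:.3f} feet")
print(f"effect of +1 degree flexibility: {flex_effect:.3f} feet")
print(f"effect of +1 lb strength: {strength_effect:.3f} feet")
```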

What do we mean by “adjusted for”?

What if the covariates are categorical? E.g.,

Mean muscle gain % (n)

          Exercise & Diet   Exercise only
Male      20% (10)          15% (40)
Female    10% (40)          5% (10)

Mean % gain without adjustment for gender
Exercise & Diet: (20% x 10 + 10% x 40) / 50 = 12%
Exercise only: (15% x 40 + 5% x 10) / 50 = 13%

Mean % gain with adjustment for gender
Exercise & Diet: Male avg. x 0.5 + Female avg. x 0.5 = 20% x 0.5 + 10% x 0.5 = 15%
Exercise only: Male avg. x 0.5 + Female avg. x 0.5 = 15% x 0.5 + 5% x 0.5 = 10%
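The unadjusted and gender-adjusted means in this example can be reproduced with a few lines of code. This is only a sketch of the arithmetic: the unadjusted mean weights each gender by its sample size, while the adjusted mean gives both genders the same weight (0.5). The variable and function names are illustrative.

```python
# Cell means (% muscle gain) and sample sizes from the example table
cells = {
    "Exercise & Diet": {"Male": (20.0, 10), "Female": (10.0, 40)},
    "Exercise only": {"Male": (15.0, 40), "Female": (5.0, 10)},
}


def unadjusted_mean(group):
    """Sample-size-weighted mean across genders."""
    total_n = sum(n for _, n in group.values())
    return sum(cell_mean * n for cell_mean, n in group.values()) / total_n


def adjusted_mean(group, weights):
    """Mean across genders using the same fixed weights for every group."""
    return sum(group[gender][0] * w for gender, w in weights.items())


equal_weights = {"Male": 0.5, "Female": 0.5}

for name, group in cells.items():
    print(
        f"{name}: unadjusted = {unadjusted_mean(group):.1f}%, "
        f"adjusted = {adjusted_mean(group, equal_weights):.1f}%"
    )
```

Without adjustment, exercise only looks slightly better (13% vs. 12%); with equal gender weights, exercise & diet is better by 5 percentage points (15% vs. 10%).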

Why different?

% gain for males is 10 percentage points higher than for females in both groups => potential confounding

However, the two groups are unbalanced in terms of gender, i.e., 80% male in the exercise-only group versus 20% male in the exercise & diet group => this dilutes the “treatment effect”

For continuous covariates such as baseline age, a similar adjustment is performed based on the correlation between % gain and baseline age.
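For a continuous covariate, the classical ANCOVA adjustment moves each group mean along a common (pooled within-group) slope to the overall covariate mean: adjusted mean = group mean of Y - slope x (group mean of X - overall mean of X). The sketch below shows that calculation on made-up data; the group names, values, and function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def ancova_adjusted_means(groups):
    """groups maps a group name to (covariate values, outcome values).

    Returns the outcome mean of each group, adjusted to the overall
    covariate mean using the pooled within-group slope.
    """
    all_x = [x for xs, _ in groups.values() for x in xs]
    grand_x = mean(all_x)

    # Pooled within-group slope: deviations are taken from each group's own means
    sxy = 0.0
    sxx = 0.0
    for xs, ys in groups.values():
        x_bar, y_bar = mean(xs), mean(ys)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        sxx += sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx

    return {
        name: mean(ys) - slope * (mean(xs) - grand_x)
        for name, (xs, ys) in groups.items()
    }


# Hypothetical data: group B has higher covariate values than group A
groups = {
    "A": ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    "B": ([3.0, 4.0, 5.0], [5.0, 7.0, 9.0]),
}

adjusted = ancova_adjusted_means(groups)
for name, (xs, ys) in groups.items():
    print(f"{name}: unadjusted = {mean(ys):.2f}, adjusted = {adjusted[name]:.2f}")
```

In this toy data the unadjusted means favour group B, but once both groups are compared at the same covariate value the ordering reverses, the same kind of shift seen with the categorical example.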

Graphical illustration : Adjusting for a continuous covariate

[Figure: Adjustment by Analysis of Covariance. Axis labels: HbA1c: Post-Pre; Adiponectin: Post-Pre. Unadjusted group means: 2.07 vs. 0.55, Diff = 1.52. Adjusted group means: +1.79 vs. -0.02, Diff = 1.81. A value of -0.56 is also marked on the plot.]

* Changes in Adiponectin (a glucose-regulating protein) between two groups

Multiple Logistic Regression Models

• The model:

Logit(π) = β0 + β1X1 + β2X2 + … + βkXk

where π = Prob(event = 1) and Logit(π) = ln[π / (1 - π)]

• or, equivalently,

π = e^LP / (1 + e^LP),

where LP = β0 + β1X1 + β2X2 + … + βkXk
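The two forms of the model are inverses of each other, which a short sketch can confirm. The coefficient and predictor values below are made up for illustration, as are the function names.

```python
import math


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(lp):
    """Probability for a linear predictor lp; equals e^lp / (1 + e^lp)."""
    return 1.0 / (1.0 + math.exp(-lp))


def predicted_probability(intercept, coefficients, x_values):
    """Prob(event = 1) for one subject under a multiple logistic model."""
    lp = intercept + sum(b * x for b, x in zip(coefficients, x_values))
    return inv_logit(lp)


# Hypothetical coefficients (b0, b1, b2) and predictor values (X1, X2)
p = predicted_probability(-1.0, [0.5, 2.0], [2.0, 0.0])
print(f"linear predictor = {logit(p):.3f}, probability = {p:.3f}")
```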

Interpretation of the coefficients in logistic regression models

For a continuous predictor, the exponentiated coefficient (e^β) represents the multiplicative change in the odds of Y = 1 for a one-unit change in X, i.e., the odds ratio for X+1 versus X.

Similarly, for a nominal predictor, e^β represents the odds ratio for one group (X = 1) versus another (X = 0).

Remember, a multiple logistic regression model has other covariates. Hence, the interpretation of each coefficient applies with the other covariates adjusted for.
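A quick numerical check of the odds ratio interpretation: under a logistic model, the odds at X+1 divided by the odds at X equals e^β regardless of the starting value of X. The coefficients below are hypothetical.

```python
import math


def probability(b0, b1, x):
    """Prob(Y = 1) under a simple logistic model with one predictor."""
    lp = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-lp))


def odds(p):
    return p / (1.0 - p)


# Hypothetical intercept and slope, for illustration only
b0, b1 = -3.0, 0.4

odds_ratio_at_5 = odds(probability(b0, b1, 6.0)) / odds(probability(b0, b1, 5.0))
odds_ratio_at_10 = odds(probability(b0, b1, 11.0)) / odds(probability(b0, b1, 10.0))

print(f"exp(b1)            = {math.exp(b1):.4f}")
print(f"odds ratio, 5 -> 6   = {odds_ratio_at_5:.4f}")
print(f"odds ratio, 10 -> 11 = {odds_ratio_at_10:.4f}")
```

The odds ratio is constant across X, but the change in probability is not: the same one-unit step moves the probability by different amounts depending on where it starts.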


Estimated Prob. Vs. Age


Other Models

Ordinal logistic regression: for ordinal responses such as cancer stage I, II, III, IV; assumes the odds ratio is constant across any split of the ordered categories (the proportional odds assumption).

Poisson regression: for count responses such as number of pregnancies; overdispersion is common, and sometimes a negative binomial distribution is used instead.

Mixed models: commonly used for repeated measures ANOVA or ANCOVA. Time is used as a within-subject factor and random factor. Mixed models are also used for nested designs.

Cox proportional hazards models: multivariate models for survival data.

General Linear Model vs. Generalized Linear Model (GLM)

A linear model => General Linear Model, e.g., ANOVA, ANCOVA, MANOVA, MANCOVA, linear regression, mixed model

A non-linear model => Generalized Linear Model, e.g., Logistic, Ordinal Logistic, Poisson

All of these use a link function for the response variable (Y), such as a logit link (logistic) or a log link (Poisson).

GEE (Generalized Estimating Equation) models are an extension of GLM.

Variable Selection Procedures

Forward: Add the new predictor that has the lowest p-value, and keep repeating this step until no more predictors can be added at the 0.05 alpha level.

Backward: Start with a full model containing all predictors, eliminate the predictor with the highest p-value, and keep repeating this procedure until no predictors are left to be eliminated at the 0.05 alpha level.

Stepwise: Combination of Forward and Backward. Level of stay: 0.01 and level of entry: 0.05 are usually used.

Many simulation studies suggest that Backward is the most recommendable.
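The backward procedure can be sketched as a simple loop. To keep the example self-contained, the model fit is replaced by a stand-in: `toy_fit_pvalues` returns canned p-values for each predictor set, where a real analysis would refit the regression each time. All names and p-values here are hypothetical.

```python
def backward_elimination(predictors, fit_pvalues, alpha=0.05):
    """Drop the predictor with the highest p-value until all are <= alpha.

    fit_pvalues(names) is expected to refit the model using only `names`
    and return a dict mapping each name to its p-value.
    """
    selected = list(predictors)
    while selected:
        pvalues = fit_pvalues(selected)
        worst = max(selected, key=lambda name: pvalues[name])
        if pvalues[worst] <= alpha:
            break  # every remaining predictor is significant at alpha
        selected.remove(worst)
    return selected


# Stand-in for a real model fit: canned p-values keyed by the predictor set
CANNED_PVALUES = {
    frozenset({"age", "bmi", "sex"}): {"age": 0.01, "bmi": 0.20, "sex": 0.60},
    frozenset({"age", "bmi"}): {"age": 0.008, "bmi": 0.04},
}


def toy_fit_pvalues(names):
    return CANNED_PVALUES[frozenset(names)]


kept = backward_elimination(["age", "bmi", "sex"], toy_fit_pvalues)
print(kept)
```

Note that p-values change after each removal (here `bmi` goes from 0.20 to 0.04 once `sex` is dropped), which is why the model is refitted at every step instead of dropping all non-significant predictors at once.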

Bariatric Surgery

• Roux-en-Y gastric bypass,

• Sleeve gastrectomy,

• Gastric banding,

• Biliopancreatic diversion.

Table 1

Figure 1

Appendix ?

Factors Associated with Achieving The Primary End Points at 3 Years
