Biostatistics Case Studies 2014
Youngju Pak, PhD.
Biostatistician
Session 4:
Regression Models and Multivariate Analyses
What and Why?
Multivariate analysis (MVA) techniques allow more than two variables to be analysed at once, in contrast to univariate or bivariate analysis.
• Data have become richer as computational technologies have advanced, which calls for data reduction or classification, e.g., Factor Analysis, Principal Component Analysis (PCA).
• Several variables are potentially correlated to some degree, so potential confounding can bias the result, e.g., Analysis of Covariance (ANCOVA), Multiple Linear or Generalized Linear Regression Models.
What and Why?
• Many variables are interrelated, with multiple dependent and independent variables, e.g., Multivariate Analysis of Variance (MANOVA), Path Models, Structural Equation Models (SEM), Partial Least Squares (PLS) Models.
• This session will focus on multiple regression models.
Why regression models?
• To reduce "random noise" in the data => better variance estimates, obtained by adding sources of variability of the dependent variable to the model, e.g., ANCOVA
• To determine an optimal set of predictors => predictive models, e.g., variable selection procedures for multiple regression models
• To adjust for potential confounding effects, e.g., regression models with covariates
Actual mathematical Models
ANOVA
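$Y_{ij} = \mu + \tau_i + \epsilon_{ij}$, where $Y_{ij}$ represents the $j$th observation ($j = 1, 2, \ldots, n$) on the $i$th treatment ($i = 1, 2, \ldots, l$ levels).
The errors $\epsilon_{ij}$ are assumed to be normally and independently distributed (NID), with mean zero and variance $\sigma^2$.
ANCOVA with k covariates
$Y_{ij} = \mu + \tau_i + \beta_1 X_{1ij} + \beta_2 X_{2ij} + \cdots + \beta_k X_{kij} + \epsilon_{ij}$, where $\beta_1, \ldots, \beta_k$ are the covariate slopes.
MANOVA (with p outcome variables)
$Y_{(n \times p)} = X_{(n \times [q+1])} B_{([q+1] \times p)} + E_{(n \times p)}$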
Actual mathematical Models
Simple Linear Regression Models (SLR)
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
$\mu_Y = \beta_0 + \beta_1 X$ (true mean value of Y at a given X)
$\varepsilon$ = "error" (random noise due to random sampling error); $\varepsilon$ is assumed to follow a normal distribution with mean = 0 and variance = $\sigma^2$
$\beta_0$ & $\beta_1$ = intercept & slope, often called regression (or beta) coefficients
Y = Dependent Variable (DV), X = Independent Variable (IV), e.g., Y = Insulin Sensitivity, X = Fatty Acid in percentage
Other models: Multiple Linear Regression Models (MLR), Simple Logistic Models (SL), Multiple Logistic Models (ML)
SLR: Example SPSS output
• Two-sided p-value = 0.002. Thus, there is statistically significant evidence (alpha = 0.05) to conclude that the true slope is not zero, i.e., Fatty Acid (%) is significantly related to insulin sensitivity.
• Mean insulin sensitivity increases by 37.208 units as Fatty Acid (%) increases by one percentage point.
SLR w/CI
Checking the assumptions using a residual Plot
The residual plot should look "RANDOM": no special pattern should appear if the assumptions are met.
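As a rough illustration of fitting a simple linear regression and extracting the residuals that would go into such a plot, here is a minimal sketch using NumPy. The data values are hypothetical and are not the insulin sensitivity data from the slides.

```python
import numpy as np

# Hypothetical (x, y) pairs for illustration only; not the slide data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit of Y = b0 + b1 * X; polyfit returns [slope, intercept] for deg=1.
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals = observed minus fitted values; these are what a residual plot shows.
fitted = intercept + slope * x
residuals = y - fitted

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print("residuals:", np.round(residuals, 3))
```

Plotting `residuals` against `fitted` (or against `x`) and looking for curvature or a funnel shape is the usual visual check.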
Actual mathematical Models
Multiple Linear Regression Models (MLR)
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$
$\mu_Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$ (true mean value of Y)
Assumptions are the same as for SLR, with one addition: the Xs should not be highly correlated with each other. If they are, this is called "multicollinearity", which makes the model very unstable.
Diagnosis for multicollinearity: Variance Inflation Factor (VIF)
• VIF = 1: OK
• VIF < 5: Tolerable
• VIF > 5: Problematic. Remove the variable that has a high VIF, or do PCA.
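One common way to compute the VIF, sketched here under the assumption that the predictors are stored as columns of a NumPy array, is to take the diagonal of the inverse of the predictor correlation matrix. This is equivalent to 1 / (1 - R²) from regressing each predictor on the others. The function name and the example data are hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per predictor column of X (needs at least two columns)."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)   # predictor correlation matrix
    return np.diag(np.linalg.inv(corr))   # VIF_j = j-th diagonal of its inverse

# Hypothetical predictors: two uncorrelated columns give VIF = 1 for both.
uncorrelated = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]])
print(variance_inflation_factors(uncorrelated))

# Hypothetical predictors that are strongly correlated give a large VIF.
correlated = np.array([[1, 1], [2, 2], [3, 3], [4, 5]])
print(variance_inflation_factors(correlated))
```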
MLR: Example
$\mu_Y = -56.935 + 1.634 X_1 + 0.249 X_2$
Coefficients(a)
Model 1          Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)       -56.935            55.217                           -1.031   .327
R_Flexibility    1.634              .714         .490                2.290    .045
O_Strength       .249               .116         .458                2.137    .058
a. Dependent Variable: Distance
1.634 × Flexibility: For every 1 degree increase in flexibility, MEAN punt distance increases by 1.634 feet, adjusting for leg strength.
0.249 × Strength: For every 1 lb increase in strength, MEAN punt distance increases by 0.249 feet, adjusting for flexibility.
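The fitted equation can be turned into a small calculator to make the "adjusting for" interpretation concrete. This is only a sketch: the coefficients come from the SPSS output above, while the input values (100 degrees, 150 lb) are hypothetical.

```python
# Coefficients taken from the SPSS output on the slide.
INTERCEPT = -56.935
B_FLEXIBILITY = 1.634   # feet per degree of flexibility
B_STRENGTH = 0.249      # feet per lb of leg strength

def predicted_mean_distance(flexibility_deg, strength_lb):
    """Estimated mean punt distance (feet) from the fitted MLR equation."""
    return INTERCEPT + B_FLEXIBILITY * flexibility_deg + B_STRENGTH * strength_lb

# Hypothetical inputs: holding strength fixed, one extra degree of flexibility
# changes the predicted mean by exactly the flexibility coefficient.
base = predicted_mean_distance(100, 150)
plus_one_degree = predicted_mean_distance(101, 150)
print(round(base, 3), round(plus_one_degree - base, 3))

# Each t statistic in the table is B divided by its standard error.
t_flexibility = B_FLEXIBILITY / 0.714
print(round(t_flexibility, 2))
```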
What do we mean by "adjusted for"?
If the covariate is categorical, e.g., gender:
Mean muscle gain % (n)
          Exercise & Diet   Exercise only
Male      20% (10)          15% (40)
Female    10% (40)          5% (10)
Mean % gain without adjustment for gender
• Exercise & Diet: (20% x 10 + 10% x 40) / 50 = 12%
• Exercise only: (15% x 40 + 5% x 10) / 50 = 13%
Mean % gain with adjustment for gender
• Exercise & Diet: Male avg. x 0.5 + Female avg. x 0.5 = 20% x 0.5 + 10% x 0.5 = 15%
• Exercise only: Male avg. x 0.5 + Female avg. x 0.5 = 15% x 0.5 + 5% x 0.5 = 10%
Why different?
• % gain for males is 10 percentage points higher than for females in both groups => gender is a potential confounder.
• However, the two groups are unbalanced in terms of gender, i.e., 80% male in the exercise-only group versus 20% male in the exercise & diet group => this dilutes the "treatment effect".
• If the covariate is continuous, such as baseline age, a similar adjustment is performed based on the correlation between % gain and baseline age.
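The unadjusted and gender-adjusted means for the muscle gain table can be reproduced with a few lines of arithmetic. The sketch below uses the cell means and sample sizes from the table; the dictionary layout and function names are illustrative choices only.

```python
# Cell means (% gain) and sample sizes from the muscle gain table.
cells = {
    "Exercise & Diet": {"Male": (20.0, 10), "Female": (10.0, 40)},
    "Exercise only":   {"Male": (15.0, 40), "Female": (5.0, 10)},
}

def unadjusted_mean(group):
    """Weight each gender's mean by its observed sample size."""
    total_n = sum(n for _, n in cells[group].values())
    return sum(mean * n for mean, n in cells[group].values()) / total_n

def adjusted_mean(group, male_weight=0.5):
    """Weight each gender's mean by a common reference mix (50/50 by default)."""
    male_mean = cells[group]["Male"][0]
    female_mean = cells[group]["Female"][0]
    return male_mean * male_weight + female_mean * (1 - male_weight)

for group in cells:
    print(group, unadjusted_mean(group), adjusted_mean(group))
```

Without adjustment the exercise-only group looks slightly better (13% vs 12%); with a common gender mix the exercise & diet group is ahead by 5 percentage points (15% vs 10%).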
Graphical illustration : Adjusting for a continuous covariate
[Figure: Adjustment by Analysis of Covariance. x-axis: HbA1c: Post-Pre; y-axis: Adiponectin: Post-Pre. Values marked on the plot: -0.56, -0.02, +1.79 (Diff = 1.81); 2.07, 0.55 (Diff = 1.52). Panel labels: Unadjusted, Adjusted.]
* Changes in Adiponectin (a glucose regulating protein) b/w two groups
Multiple Logistic Regression Models
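• The model:
$\mathrm{Logit}(\pi) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$
where $\pi = \mathrm{Prob}(\text{event} = 1)$ and $\mathrm{Logit}(\pi) = \ln[\pi / (1 - \pi)]$
• or
$\pi = e^{LP} / (1 + e^{LP})$,
where $LP = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$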
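The two forms of the model are inverses of each other, which a short sketch can confirm. The function names and the example coefficients here are hypothetical, chosen only for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(lp):
    """Probability from a linear predictor: e^LP / (1 + e^LP)."""
    return 1 / (1 + math.exp(-lp))

def predicted_probability(betas, xs):
    """betas = [b0, b1, ..., bk], xs = [x1, ..., xk]."""
    lp = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return inv_logit(lp)

# Hypothetical coefficients and predictor values.
p = predicted_probability([-1.0, 0.5, 0.25], [2.0, 4.0])
print(round(p, 4), round(logit(p), 4))
```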
Interpretation of the coefficients in logistic regression models
• For a continuous predictor, the exponentiated coefficient ($e^{\beta}$) represents the multiplicative change in the odds of Y=1 for a one-unit increase in X, i.e., the odds ratio for X+1 versus X.
• Similarly, for a nominal predictor, the exponentiated coefficient represents the odds ratio for one group (X=1) versus another (X=0).
• Remember, a multiple logistic regression model has other covariates. Hence, the interpretation of one coefficient applies with the other covariates adjusted for.
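A quick numerical check of the odds ratio interpretation: under the logistic model, the ratio of the odds at X+1 to the odds at X equals e^β regardless of the starting value of X. The coefficients below are hypothetical.

```python
import math

# Hypothetical intercept and slope for a single continuous predictor (e.g., age).
beta0, beta1 = -4.0, 0.08

def odds(x):
    """Odds of Y=1 at predictor value x: exp(linear predictor)."""
    return math.exp(beta0 + beta1 * x)

# The odds ratio for a one-unit increase is the same at any starting x.
or_at_50 = odds(51) / odds(50)
or_at_70 = odds(71) / odds(70)
print(round(or_at_50, 4), round(or_at_70, 4), round(math.exp(beta1), 4))
```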
Estimated Prob. Vs. Age
Other Models
• Ordinal Logistic Regression for ordinal responses such as cancer stage I, II, III, IV: assumes the odds ratio is constant across the splits between adjacent categories (the proportional odds assumption).
• Poisson regression when responses are count data such as number of pregnancies: overdispersion is common, and sometimes a negative binomial distribution is used instead.
• Mixed models: commonly used for repeated measures ANOVA or ANCOVA. Time is used as a within-subject factor and a random factor. Mixed models are also used for nested designs.
• Cox proportional hazards models: multivariate models for survival data.
General Linear Model vs. Generalized Linear Model (GLM)
• A linear model => General Linear Model, e.g., ANOVA, ANCOVA, MANOVA, MANCOVA, linear regression, mixed model
• A non-linear model => Generalized Linear Model, e.g., Logistic, Ordinal Logistic, Poisson
• All of these use a link function for the response variable (Y), such as a logit link (logistic) or a log link (Poisson).
• GEE (Generalized Estimating Equation) models are an extension of GLM.
Variable Selection Procedures
• Forward: Start with no predictors, add the new predictor that has the lowest p-value, and keep repeating this step until no more predictors can be added at the 0.05 alpha level.
• Backward: Start with a full model containing all predictors, eliminate the predictor with the highest p-value, and keep repeating this procedure until no more predictors can be eliminated at the 0.05 alpha level.
• Stepwise: Combination of Forward and Backward. Level of stay: 0.01 and level of entry: 0.05 are usually used.
• Many simulation studies show that Backward is the most recommendable.
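The backward procedure can be sketched in a few lines. This is an illustrative version only, not what SPSS or any other package does internally: it uses ordinary least squares via NumPy and, as a simplifying assumption, a large-sample normal approximation to the t distribution for the p-values. The data are simulated and the variable names are hypothetical.

```python
import math
import numpy as np

def slope_p_values(X, y):
    """Two-sided p-values for each slope (normal approximation, an assumption)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - design.shape[1])      # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    p = [math.erfc(abs(t) / math.sqrt(2)) for t in beta / se]
    return np.array(p[1:])                              # drop the intercept

def backward_elimination(X, y, names, alpha=0.05):
    """Drop the predictor with the highest p-value until all are <= alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        p = slope_p_values(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break
        keep.pop(worst)
    return [names[i] for i in keep]

# Simulated data: x1 and x2 truly affect y, x3 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)
selected = backward_elimination(X, y, ["x1", "x2", "x3"])
print(selected)
```

With small samples, the t distribution should be used for the p-values instead of the normal approximation.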
Bariatric Surgery
• Roux-en-Y gastric bypass,
• Sleeve gastrectomy,
• Gastric banding,
• Biliopancreatic diversion.
Table 1
Figure 1
Appendix ?
Factors Associated with Achieving The Primary End Points at 3 Years