BIO503: Lecture 4 Statistical models in R --- Recap --- Stefan Bentink bentink@jimmy.harvard.edu
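Linear Regression Models
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
where $y_i$ is the dependent variable, $\beta_0$ the intercept, $\beta_1$ the regression coefficient, $x_i$ the independent variable and $\varepsilon_i$ the residual error.
Using the method of least squares, we can derive the following estimators:
$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
Our goal is to test the hypothesis $H_0: \beta_1 = 0$.
We can do this with a t test:
$$t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$$
Under the null hypothesis, this follows a t distribution with (n-2) df for a model with an intercept and one predictor.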
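As a rough check of the estimators and the t test, the sketch below computes them by hand on simulated data and compares the result with lm(). The data, the seed and the true coefficients (2 and 0.5) are arbitrary assumptions for illustration; the residual degrees of freedom are taken as n - 2, the usual value for one predictor plus an intercept.

```r
# Simulated data; the true intercept (2) and slope (0.5) are arbitrary choices.
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)

# Closed-form least squares estimates for one predictor.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Standard error of the slope from the residual variance (n - 2 df).
res <- y - (b0 + b1 * x)
s2 <- sum(res^2) / (n - 2)
se_b1 <- sqrt(s2 / sum((x - mean(x))^2))

# t statistic and two-sided p-value for H0: slope = 0.
t_stat <- b1 / se_b1
p_val <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# lm() should report the same estimates, t value and p-value.
fit <- lm(y ~ x)
coef(summary(fit))
```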
my.model <- lm(y ~ x)
Some important functions
my.model <- lm(y ~ x)
summary(my.model)
anova(my.model)
predict(my.model, new.data)
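A minimal sketch of these calls on real data, assuming the built-in cars data set as a stand-in (it is not used in the lecture). The point to note is that the new data passed to predict() should be a data frame whose column names match the predictors in the formula.

```r
# 'cars' ships with R: stopping distance (dist) versus speed.
my.model <- lm(dist ~ speed, data = cars)

summary(my.model)   # coefficients, standard errors, t tests, R-squared
anova(my.model)     # analysis of variance table for the fit

# newdata must be a data frame whose column names match the predictors.
new.data <- data.frame(speed = c(10, 20, 30))
pred <- predict(my.model, newdata = new.data, interval = "confidence")
pred
```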
Specifying Models
In R we use a model formula to specify the model we want to fit to our data.
y ~ x                 Simple Linear Regression
y ~ x - 1             Simple Linear Regression without the intercept (line goes through origin)
y ~ x1 + x2 + x3      Multiple Regression
y ~ x + I(x^2)        Quadratic Regression
log(y) ~ x1 + x2      Multiple Regression of Transformed Variable
For factors A, B:
y ~ A                 1-way ANOVA
y ~ A + B             2-way ANOVA
y ~ A*B               2-way ANOVA + interaction term
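One way to see what a formula means is to look at the columns of the design matrix it generates. The sketch below does this with model.matrix() on a small made-up data frame; the values and the factor level names (a1, a2, b1, b2) are hypothetical.

```r
# Toy data: one numeric predictor and two two-level factors (illustrative only).
d <- data.frame(
  y = c(1.2, 2.3, 2.9, 4.1, 5.2, 5.8, 7.1, 8.0),
  x = 1:8,
  A = factor(rep(c("a1", "a2"), each = 4)),
  B = factor(rep(c("b1", "b2"), times = 4))
)

colnames(model.matrix(y ~ x, d))           # intercept and slope
colnames(model.matrix(y ~ x - 1, d))       # slope only, no intercept
colnames(model.matrix(y ~ x + I(x^2), d))  # adds the squared term
colnames(model.matrix(y ~ A * B, d))       # main effects plus A:B interaction
```

Note that A*B is shorthand for A + B + A:B.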
ANOVA Example
Let's use a different dataset:
> library(MASS)
> data(ChickWeight)
> attach(ChickWeight)
The factor Diet has 4 levels.
> levels(Diet)
> anova(lm(weight ~ Diet, data=ChickWeight))
Analysis of Variance Table

Response: weight
           Df  Sum Sq Mean Sq F value    Pr(>F)
Diet        3  155863   51954   10.81 6.433e-07
Residuals 574 2758693    4806
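The columns of the table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and the F value is the ratio of the two mean squares. The sketch below redoes that calculation from the rounded numbers printed in the table, so small rounding differences are possible.

```r
# Values copied from the ANOVA table above.
ss_diet <- 155863;  df_diet <- 3
ss_res  <- 2758693; df_res  <- 574

ms_diet <- ss_diet / df_diet   # mean square for Diet
ms_res  <- ss_res / df_res     # residual mean square
f_value <- ms_diet / ms_res    # F statistic

# Upper-tail probability of the F distribution gives Pr(>F).
p_value <- pf(f_value, df_diet, df_res, lower.tail = FALSE)
round(c(MeanSq_Diet = ms_diet, MeanSq_Res = ms_res, F = f_value), 2)
p_value
```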
Two-way ANOVA
We can fit a two-way ANOVA:
> anova(lm(weight ~ Diet + Chick, data=ChickWeight))
Analysis of Variance Table

Response: weight
           Df  Sum Sq Mean Sq F value    Pr(>F)
Diet        3  155863   51954 11.5045 2.571e-07
Chick      46  374243    8136  1.8015  0.001359
Residuals 528 2384450    4516
The interpretation of the model output is sequential, from the bottom to the top: each line tests its term after the terms listed above it.
The Diet line tests the model weight ~ Diet against the intercept-only model.
The Chick line tests the model weight ~ Diet vs weight ~ Diet + Chick.
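The sequential reading can be checked by comparing the two nested models directly. In this sketch the object names m1, m2 and cmp are illustrative; anova() called with two fitted models performs the F test for the added term, which should agree with the Chick line of the sequential table.

```r
# ChickWeight ships with R (package datasets).
m1 <- lm(weight ~ Diet, data = ChickWeight)
m2 <- lm(weight ~ Diet + Chick, data = ChickWeight)

# Sequential table: each term is tested after the terms listed before it.
anova(m2)

# Explicit comparison of the two nested models; the F test should match
# the Chick line of the sequential table.
cmp <- anova(m1, m2)
cmp
```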
Generalized Linear Models
Linear regression models hinge on the assumption that the response variable, given the predictors, follows a Normal distribution.
Generalized linear models are able to handle non-Normal response variables and transformations to linearity.
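To illustrate the relationship, the sketch below fits the same linear predictor with lm() and with glm() using a Gaussian family and identity link; the two sets of coefficients should agree. The choice of ChickWeight and of Time as the predictor is an assumption made only for this example.

```r
# Same linear predictor fitted two ways on the built-in ChickWeight data.
fit_lm  <- lm(weight ~ Time, data = ChickWeight)
fit_glm <- glm(weight ~ Time, family = gaussian(link = "identity"),
               data = ChickWeight)

# The coefficient estimates should agree.
cbind(lm = coef(fit_lm), glm = coef(fit_glm))

# Other families cover non-Normal responses, e.g. binomial() for binary
# outcomes and poisson() for counts.
```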
Logistic Regression
When faced with a binary response Y = (0,1), we use logistic regression.
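$$\pi_i = P(Y_i = 1 \mid x_i, \beta)$$
where $x_i^T = (x_{i1}, \dots, x_{ip})$ and $\beta^T = (\beta_1, \dots, \beta_p)$.
Logit:
$$\operatorname{logit}(\pi_i) = \log\frac{\pi_i}{1 - \pi_i} = \log\frac{P(Y_i = 1 \mid x_i, \beta)}{P(Y_i = 0 \mid x_i, \beta)} = x_i^T \beta = \sum_j \beta_j x_{ij}$$
$$\pi_i = \frac{\exp\left(\sum_j \beta_j x_{ij}\right)}{1 + \exp\left(\sum_j \beta_j x_{ij}\right)}$$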
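The logit and its inverse are easy to write out in R. The helper names logit and inv_logit below are illustrative; base R already provides the same pair as qlogis() and plogis().

```r
# Logit and inverse logit written out explicitly.
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

p <- c(0.1, 0.5, 0.9)
eta <- logit(p)   # log-odds; a probability of 0.5 maps to 0

# Base R provides the same pair as qlogis() and plogis().
all.equal(eta, qlogis(p))
all.equal(inv_logit(eta), plogis(eta))
all.equal(inv_logit(eta), p)   # round trip recovers the probabilities
```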
Fit the Logistic Regression Model
> anes.logit <- glm(move ~ conc, family=binomial(link=logit), data=anesthetic)
The output summary looks like this:
> summary(anes.logit)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    6.469      2.418   2.675  0.00748 **
conc          -5.567      2.044  -2.724  0.00645 **
Estimates of P(Y=1) are given by:
> fitted.values(anes.logit)
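The reported coefficients can be turned into probabilities by hand, which may help with reading the output. The sketch below uses only the two estimates printed in the summary; the concentration values in conc are made up for illustration and are not taken from the anesthetic data.

```r
# Coefficients copied from the summary output above.
b0 <- 6.469    # (Intercept)
b1 <- -5.567   # conc

# Fitted probability P(move = 1) at a few concentrations.
# The conc values below are illustrative, not taken from the data.
conc <- c(0.8, 1.0, 1.2, 1.4)
p_hat <- plogis(b0 + b1 * conc)
data.frame(conc = conc, p_hat = round(p_hat, 3))

# Concentration at which the fitted probability is 0.5 (linear predictor = 0).
conc50 <- -b0 / b1
conc50

# Odds ratio for a one-unit increase in conc.
exp(b1)
```

Because the conc coefficient is negative, the fitted probability falls as the concentration rises.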
Update models and model selection
Some handy functions to know about:
new.model <- update(old.model, new.formula)
Model selection functions (drop1, add1 and step are in the stats package; dropterm, addterm and stepAIC are in the MASS package):
drop1, dropterm
add1, addterm
step, stepAIC
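A short sketch of how these functions fit together, assuming the ChickWeight data from the earlier slides and the illustrative model names m1, m2 and best. In update(), a dot on either side of the formula stands for whatever the old formula had there.

```r
library(MASS)  # dropterm(), addterm(), stepAIC()

# Start from a one-predictor model on the built-in ChickWeight data.
m1 <- lm(weight ~ Diet, data = ChickWeight)

# update(): "." stands for "whatever was on this side of the old formula".
m2 <- update(m1, . ~ . + Time)
formula(m2)

# Single-term deletions and additions with F tests.
drop1(m2, test = "F")
add1(m1, scope = ~ Diet + Time, test = "F")

# Stepwise selection by AIC, starting from the larger model.
best <- stepAIC(m2, trace = FALSE)
formula(best)
```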
SURVIVAL ANALYSIS
Problem 5 – Survival Analysis
1. Read in the data file aml.txt. This file stores survival data on patients with Acute Myelogenous Leukemia.
2. Compute the Kaplan-Meier estimate for all patients in this data. Draw the corresponding Kaplan-Meier plot. Construct Kaplan-Meier plots grouped by chemotherapy status.
3. Using a log-rank test, test if the two survival curves (patients on maintenance chemotherapy, patients who are not) are identical.
4. Fit a Cox proportional hazards model to the data set.
5. Plot the survival functions for patients from the different groups.
Survival Analysis
library(survival)
Example: aml leukemia data
Kaplan-Meier curve
fit1 <- survfit(Surv(aml$time,aml$status)~1)
summary(fit1)
plot(fit1)
Log-rank test
survdiff(Surv(time, status)~x, data=aml)
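The log-rank statistic returned by survdiff() is compared with a chi-squared distribution with (number of groups - 1) degrees of freedom. The sketch below extracts the p-value that way, assuming the aml data set that ships with the survival package (columns time, status and x); the object names fit2 and lr are illustrative.

```r
library(survival)  # provides the aml data set, Surv(), survfit(), survdiff()

# Kaplan-Meier estimates for the two maintenance groups.
fit2 <- survfit(Surv(time, status) ~ x, data = aml)
print(fit2)

# Log-rank test; the statistic is chi-squared with (groups - 1) df.
lr <- survdiff(Surv(time, status) ~ x, data = aml)
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
p_value
```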
Survival analysis
> cp <- coxph(Surv(aml$time,
+ aml$status)~x,data=aml)
> summary(cp)
> plot(survfit(Surv(aml$time,aml$status)~x,
+ data=aml),col=c("red","green"),lwd=2)
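The plot above shows Kaplan-Meier curves. As a possible complement, the sketch below reads the hazard ratio off the Cox fit and plots the model-based survival curves for each group by passing a new data frame to survfit(). It assumes the aml data from the survival package; the names groups and sf are illustrative.

```r
library(survival)

cp <- coxph(Surv(time, status) ~ x, data = aml)

# Hazard ratio (relative to the reference level of x) with 95% CI.
exp(cbind(HR = coef(cp), confint(cp)))

# Model-based survival curves, one per level of x.
groups <- data.frame(x = factor(levels(aml$x), levels = levels(aml$x)))
sf <- survfit(cp, newdata = groups)
plot(sf, col = c("red", "green"), lwd = 2,
     xlab = "Time", ylab = "Survival probability")
legend("topright", legend = levels(aml$x), col = c("red", "green"), lwd = 2)
```

Unlike the Kaplan-Meier curves, these curves are constrained by the proportional hazards assumption, so the two will not look identical.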