bio503: lecture 4 statistical models in r --- recap --- stefan bentink [email protected]

BIO503: Lecture 4Statistical models in R

--- Recap ---

Stefan Bentink [email protected]

Linear Regression Models

residual error

regression coefficient

dependent variable

intercept independent variable

Using the methods of least squares, we can derive the following estimators:

Our goal is to test the hypothesis: 0^

We can do this with a T test:

)(

0^

^

SEt

under the null hypothesis, this follows a T distribution with (n-1) df.

my.model <- lm(y ~ x)

Some important functions

my.model <- lm(x~y)

summary(my.model)anova(my.model)predict(my.model,new.data)

Specifying ModelsIn R we use model formula to specify the model we want to fit to

our data. y ~ x Simple Linear Regressiony ~ x – 1 Simple Linear Regression

without the intercept (line goes through origin)

y ~ x1 + x2 + x3 Multiple Regressiony ~ x + I(x^2) Quadratic Regressionlog(y) ~ x1 + x2 Multiple Regression of

Transformed VariableFor factors A, B:y ~ A 1-way ANOVA y ~ A + B 2-way ANOVAy ~ A*B 2-way ANOVA + interaction term

ANOVA ExampleLet's use a different dataset:

> library(MASS)> data(ChickWeight)> attach(ChickWeight)The factor Diet has 4 levels.> levels(Diet)> anova(lm(weight ~ Diet, data=ChickWeight))Analysis of Variance Table

Response: weight Df Sum Sq Mean Sq F value Pr(>F) Diet 3 155863 51954 10.81 6.433e-07Residuals 574 2758693 4806

Two-way ANOVAWe can fit a two-way ANOVA:

> anova(lm(weight ~ Diet + Chick, data=ChickWeight))

Analysis of Variance TableResponse: weight Df Sum Sq Mean Sq F value Pr(>F)

Diet 3 155863 51954 11.5045 2.571e-07Chick 46 374243 8136 1.8015 0.001359Residuals 528 2384450 4516

The interpretation of the model output is sequential, from the

bottom to the top. This line tests the model: weight ~ Diet + Chick

This line tests the model: weight ~ Diet vs weight ~ Diet + Chick.

Generalized Linear Models

Linear regression models hinge on the assumption that the response variable follows a Normal distribution.

Generalized linear models are able to handle non-Normal response variables and transformations to linearity.

Logistic Regression

When faced with a binary response Y = (0,1), we use logistic regression.

),|1( xiii YP

T

ip

i

i

x

x

x

1

T

p

i

1

where

jijj

T

ii

i

ii

iix

YP

YPxx

x

1log

),|0(

),|1(log

jijj

jijj

i

x

x

exp1

exp

Logit

Fit the Logistic Regression Model

> anes.logit <- glm(move ~ conc, family=binomial(link=logit), data=anesthetic)

The output summary looks like this: > summary(anes.logit)

Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 6.469 2.418 2.675 0.00748 **conc -5.567 2.044 -2.724 0.00645 **

Estimates of P(Y=1) are given by: > fitted.values(anes.logit)

Update models and model selection

Some handy functions to know about:

new.model <- update(old.model, new.formula)

Model Selection functions available in the MASS package

drop1, droptermadd1, addtermstep

stepAIC

SURVIVAL ANALYSIS

Problem 5 – Survival Analysis

1.Read in the data file aml.txt. This data stores the survival data on patients with Acute Myelogenous Leukemia.

2.Compute the Kaplan-Meier estimate for all patients in this data. Compute the corresponding Kaplan-Meier plot. Construct Kaplan-Meier plots grouped by chemotherapy status.

3.Using a log-rank test, test if the two survival curves (patients on maintenance chemotherapy, patients who are not) are identical.

4.Fit a Cox proportional hazards model to the data set.

5.Plot these survival functions for patients from the different groups.

Survival Analysis

library(survival)Example: aml leukemia data

Kaplan-Meier curve

fit1 <- survfit(Surv(aml$time,aml$status)~1)summary(fit1)plot(fit1)

Log-rank test

survdiff(Surv(time, status)~x, data=aml)

Survival analysis

> cp <- coxph(Surv(aml$time,+ aml$status)~x,data=aml)>> summary(cp)>> plot(survfit(Surv(aml$time,aml$status)~x,+ data=aml),col=c("red","green"),lwd=2)

bio503: lecture 4 statistical models in r --- recap --- stefan bentink [email protected]

Documents