BIO503: Lecture 4 Statistical models in R --- Recap --- Stefan Bentink bentink@jimmy.harvard.edu


Post on 16-Dec-2015


Page 1: BIO503: Lecture 4 Statistical models in R --- Recap --- Stefan Bentink bentink@jimmy.harvard.edu

BIO503: Lecture 4
Statistical models in R

--- Recap ---

Stefan Bentink bentink@jimmy.harvard.edu

Page 3

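Linear Regression Models

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

where $y_i$ is the dependent variable, $\alpha$ the intercept, $\beta$ the regression coefficient, $x_i$ the independent variable, and $\varepsilon_i$ the residual error.

Using the method of least squares, we can derive the following estimators:

$$\hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$

Our goal is to test the hypothesis $H_0: \beta = 0$.

We can do this with a t-test:

$$t = \frac{\hat{\beta} - 0}{SE(\hat{\beta})}$$

Under the null hypothesis, this follows a t distribution with (n-2) df, because two parameters (intercept and slope) are estimated.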
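A minimal sketch of the least-squares estimators and the slope t statistic, checked against lm(). The simulated data, the seed, and all variable names are assumptions made for illustration; they are not from the lecture.

```r
# Simulated data (an assumption for illustration only)
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)

# Least-squares estimators for y = alpha + beta * x + error
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

# Standard error of the slope; the residual variance uses n - 2 df,
# which is also what lm() uses for simple linear regression
res     <- y - (alpha_hat + beta_hat * x)
sigma2  <- sum(res^2) / (n - 2)
se_beta <- sqrt(sigma2 / sum((x - mean(x))^2))

# t statistic and two-sided p-value for H0: beta = 0
t_stat  <- (beta_hat - 0) / se_beta
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# The same quantities as reported by lm()
fit <- lm(y ~ x)
coef(summary(fit))
```

The row for x in coef(summary(fit)) should agree with beta_hat, se_beta, t_stat and p_value computed by hand.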

Page 4

my.model <- lm(y ~ x)

Page 5

Some important functions

my.model <- lm(y ~ x)

summary(my.model)
anova(my.model)
predict(my.model, new.data)
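As a hedged illustration of these functions, the sketch below uses the cars data set that ships with R; the choice of data set and the speeds in new.data are assumptions, not part of the lecture. Note that the columns of new.data must carry the same names as the predictors in the formula.

```r
# 'cars' (columns speed, dist) ships with R; using it here is an assumption
my.model <- lm(dist ~ speed, data = cars)

summary(my.model)   # coefficients, standard errors, t tests, R-squared
anova(my.model)     # analysis of variance table for the fitted model

# New observations: column names must match the predictor names
new.data <- data.frame(speed = c(10, 20))
pred <- predict(my.model, newdata = new.data)
pred
```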

Page 6

Specifying Models

In R we use a model formula to specify the model we want to fit to our data.

y ~ x                 Simple Linear Regression
y ~ x - 1             Simple Linear Regression without the intercept (line goes through origin)
y ~ x1 + x2 + x3      Multiple Regression
y ~ x + I(x^2)        Quadratic Regression
log(y) ~ x1 + x2      Multiple Regression of Transformed Variable

For factors A, B:

y ~ A                 1-way ANOVA
y ~ A + B             2-way ANOVA
y ~ A*B               2-way ANOVA + interaction term
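One way to see what a formula means is to look at the columns of the design matrix it generates. The sketch below does this with model.matrix() on a small simulated data frame; the data frame d and its contents are assumptions for illustration.

```r
# Small simulated data frame (an assumption for illustration)
set.seed(2)
d <- data.frame(x = rnorm(20),
                A = gl(2, 10),      # factor with 2 levels, 10 rows each
                B = gl(2, 5, 20))   # factor with 2 levels, blocks of 5
d$y <- 1 + 2 * d$x + rnorm(20)

# Column names of the design matrix show which terms a formula creates
cols_slr   <- colnames(model.matrix(y ~ x, d))           # intercept and slope
cols_noint <- colnames(model.matrix(y ~ x - 1, d))       # slope only
cols_quad  <- colnames(model.matrix(y ~ x + I(x^2), d))  # adds a squared term
cols_inter <- colnames(model.matrix(y ~ A * B, d))       # main effects + interaction

cols_slr
cols_noint
cols_quad
cols_inter
```

With the default treatment contrasts, A * B expands to the two main effects plus the A:B interaction column, while x - 1 removes the "(Intercept)" column.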

Page 7

ANOVA Example

Let's use a different dataset:

> library(MASS)
> data(ChickWeight)
> attach(ChickWeight)

The factor Diet has 4 levels.

> levels(Diet)
> anova(lm(weight ~ Diet, data=ChickWeight))
Analysis of Variance Table

Response: weight
           Df  Sum Sq Mean Sq F value    Pr(>F)
Diet        3  155863   51954   10.81 6.433e-07
Residuals 574 2758693    4806
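The F value in this table is the ratio of the two mean squares (51954 / 4806 ≈ 10.81), and the p-value is the upper tail of the F distribution with 3 and 574 degrees of freedom. A short sketch that recomputes both from the table; the object names are illustrative.

```r
# One-way ANOVA table for the ChickWeight data (ships with R)
tab <- anova(lm(weight ~ Diet, data = ChickWeight))

# F statistic: mean square for Diet divided by the residual mean square
f_by_hand <- tab["Diet", "Mean Sq"] / tab["Residuals", "Mean Sq"]

# p-value: upper tail of the F distribution with (3, 574) df
p_by_hand <- pf(f_by_hand,
                df1 = tab["Diet", "Df"],
                df2 = tab["Residuals", "Df"],
                lower.tail = FALSE)

c(F = f_by_hand, p = p_by_hand)
```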

Page 8

Two-way ANOVA

We can fit a two-way ANOVA:

> anova(lm(weight ~ Diet + Chick, data=ChickWeight))

Analysis of Variance Table

Response: weight
           Df  Sum Sq Mean Sq F value    Pr(>F)
Diet        3  155863   51954 11.5045 2.571e-07
Chick      46  374243    8136  1.8015  0.001359
Residuals 528 2384450    4516

The interpretation of the model output is sequential, from the bottom to the top.

The Chick line tests the model weight ~ Diet vs weight ~ Diet + Chick, that is, whether adding Chick improves on Diet alone.

The Diet line tests whether adding Diet improves on the intercept-only model; its F value uses the residual mean square of the full model weight ~ Diet + Chick.
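The sequential reading can be checked by comparing the two nested models directly: anova() called with two fitted models performs the weight ~ Diet vs weight ~ Diet + Chick comparison. This is a sketch with illustrative object names.

```r
# Nested models on the ChickWeight data (ships with R)
fit_diet       <- lm(weight ~ Diet, data = ChickWeight)
fit_diet_chick <- lm(weight ~ Diet + Chick, data = ChickWeight)

# Sequential table for the larger model
seq_tab <- anova(fit_diet_chick)

# Explicit comparison of the nested models
cmp <- anova(fit_diet, fit_diet_chick)

# Both give the same F statistic for adding Chick after Diet
seq_tab["Chick", "F value"]
cmp$F[2]
```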

Page 9

Generalized Linear Models

Linear regression models hinge on the assumption that, given the predictors, the response variable follows a Normal distribution.

Generalized linear models are able to handle non-Normal response variables, with a link function providing the transformation to linearity.
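As a small sketch of how the two model classes relate, an ordinary linear regression is the special case of a generalized linear model with a Normal (gaussian) family and identity link. The cars data set and the simulated counts below are assumptions chosen for illustration.

```r
# Linear regression fitted two ways ('cars' ships with R)
fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, family = gaussian(link = "identity"), data = cars)

# The coefficient estimates agree
rbind(lm = coef(fit_lm), glm = coef(fit_glm))

# A non-Normal response: simulated counts with a log link (assumed data)
set.seed(3)
x      <- runif(100)
counts <- rpois(100, lambda = exp(0.5 + 1.2 * x))
fit_pois <- glm(counts ~ x, family = poisson(link = "log"))
coef(fit_pois)
```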

Page 10

Logistic Regression

When faced with a binary response Y = (0,1), we use logistic regression.

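$$\pi_i = P(Y_i = 1 \mid x_i, \beta), \qquad x_i = (x_{i1}, \dots, x_{ip})^T, \qquad \beta = (\beta_1, \dots, \beta_p)^T$$

where

$$\operatorname{logit}(\pi_i) = \log\frac{\pi_i}{1 - \pi_i} = \log\frac{P(Y_i = 1 \mid x_i, \beta)}{P(Y_i = 0 \mid x_i, \beta)} = x_i^T \beta = \sum_j \beta_j x_{ij}$$

$$\pi_i = \frac{\exp\left(\sum_j \beta_j x_{ij}\right)}{1 + \exp\left(\sum_j \beta_j x_{ij}\right)}$$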
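The logit and its inverse can be written out directly. In this sketch the coefficient vector beta and the covariate vector x_i are hypothetical values chosen only to show the arithmetic; plogis() is the inverse logit built into R.

```r
# Logit and inverse logit written out
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

# Hypothetical values for illustration (not from the lecture)
beta <- c(-1, 0.8)   # first entry plays the role of an intercept
x_i  <- c(1, 2.5)    # leading 1 pairs with the intercept

eta  <- sum(beta * x_i)   # linear predictor: sum_j beta_j * x_ij
pi_i <- inv_logit(eta)    # P(Y_i = 1 | x_i, beta)

pi_i
logit(pi_i)   # recovers eta
plogis(eta)   # built-in inverse logit, same value as pi_i
```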

Page 11

Fit the Logistic Regression Model

> anes.logit <- glm(move ~ conc, family=binomial(link=logit), data=anesthetic)

The output summary looks like this:

> summary(anes.logit)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    6.469      2.418   2.675  0.00748 **
conc          -5.567      2.044  -2.724  0.00645 **

Estimates of P(Y=1) are given by:

> fitted.values(anes.logit)
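To read the coefficients, the sketch below plugs the two estimates from the summary output into the inverse logit. The concentrations passed to prob_move() are hypothetical, and prob_move, conc_50 and odds_ratio are illustrative names rather than part of the lecture code.

```r
# Estimates copied from the summary output above
b0 <- 6.469    # (Intercept)
b1 <- -5.567   # conc

# Estimated P(move = 1) at a given concentration
prob_move <- function(conc) plogis(b0 + b1 * conc)

# Hypothetical concentrations, chosen only for illustration
prob_move(c(0.8, 1.0, 1.2, 1.4))

# Concentration at which the estimated probability is 0.5:
# the linear predictor b0 + b1 * conc is zero there
conc_50 <- -b0 / b1

# Odds ratio for a one-unit increase in conc
odds_ratio <- exp(b1)

c(conc_50 = conc_50, odds_ratio = odds_ratio)
```

Because the coefficient of conc is negative, the estimated probability of movement falls as the concentration rises.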

Page 12

Update models and model selection

Some handy functions to know about:

new.model <- update(old.model, new.formula)

Model selection functions (drop1, add1 and step are in the base stats package; dropterm, addterm and stepAIC are in the MASS package):

drop1, dropterm
add1, addterm
step, stepAIC
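A brief sketch of update() and the stats versions of the selection functions, using the mtcars data set that ships with R. The data set and the choice of predictors are assumptions; the MASS functions dropterm, addterm and stepAIC are used in a similar way.

```r
# Starting model on 'mtcars' (ships with R); predictors chosen arbitrarily
old.model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# update(): '.' stands for "what was there before"
new.model <- update(old.model, . ~ . - qsec)
formula(new.model)

# Effect of dropping each term from the larger model
drop1(old.model, test = "F")

# Effect of adding each candidate term to the smaller model
add1(new.model, scope = ~ wt + hp + qsec, test = "F")

# Stepwise selection by AIC; trace = 0 suppresses the step-by-step log
selected <- step(old.model, trace = 0)
formula(selected)
```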

Page 13

SURVIVAL ANALYSIS

Page 14

Problem 5 – Survival Analysis

1. Read in the data file aml.txt. This file stores survival data on patients with Acute Myelogenous Leukemia.

2. Compute the Kaplan-Meier estimate for all patients in this data set. Draw the corresponding Kaplan-Meier plot. Construct Kaplan-Meier plots grouped by chemotherapy status.

3. Using a log-rank test, test whether the two survival curves (patients on maintenance chemotherapy, patients who are not) are identical.

4. Fit a Cox proportional hazards model to the data set.

5. Plot the survival functions for patients from the different groups.

Page 15

Survival Analysis

library(survival)

Example: aml leukemia data

Kaplan-Meier curve

fit1 <- survfit(Surv(aml$time,aml$status)~1)
summary(fit1)
plot(fit1)

Log-rank test

survdiff(Surv(time, status)~x, data=aml)
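For intuition about what survfit() computes, the Kaplan-Meier (product-limit) estimate multiplies, at each event time, the previous survival probability by (1 - events / number at risk). The sketch below recomputes it by hand, assuming the aml data set bundled with the survival package; km_by_hand is a hypothetical helper written for this illustration.

```r
library(survival)

# Hypothetical helper: Kaplan-Meier estimate at each distinct event time
km_by_hand <- function(time, status) {
  event_times <- sort(unique(time[status == 1]))
  surv <- numeric(length(event_times))
  s <- 1
  for (k in seq_along(event_times)) {
    t_k       <- event_times[k]
    n_at_risk <- sum(time >= t_k)                 # still under observation
    n_events  <- sum(time == t_k & status == 1)   # events at t_k
    s <- s * (1 - n_events / n_at_risk)
    surv[k] <- s
  }
  data.frame(time = event_times, surv = surv)
}

manual <- km_by_hand(aml$time, aml$status)

# summary() of a survfit object reports the event times by default
fit1 <- survfit(Surv(time, status) ~ 1, data = aml)
s1   <- summary(fit1)

cbind(manual, survfit = s1$surv)
```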

Page 16

Survival analysis

> cp <- coxph(Surv(time, status) ~ x, data=aml)
> summary(cp)
> plot(survfit(Surv(time, status) ~ x, data=aml),
+      col=c("red","green"), lwd=2)
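A short follow-up sketch for reading the fitted Cox model, again assuming the aml data set bundled with the survival package. The object name hr is illustrative; cox.zph() is included as one common check of the proportional hazards assumption, not as a step the lecture prescribes.

```r
library(survival)

cp <- coxph(Surv(time, status) ~ x, data = aml)

# Hazard ratio between the two groups of x: exponentiated coefficient
hr <- exp(coef(cp))
hr

# Hazard ratio with its 95% confidence interval
summary(cp)$conf.int

# Test of the proportional hazards assumption
cox.zph(cp)
```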