R workshop xiv--Survival Analysis with R


Upload: vivian-s-zhang

Post on 05-Dec-2014


Category:

Education



DESCRIPTION

NYC Data Science Academy, NYC Open Data Meetup, Big Data, Data Science, NYC, Vivian Zhang, SupStat Inc, survival analysis with R, R programming

TRANSCRIPT

Page 1: R workshop xiv--Survival Analysis with R

Survival Analysis in R

Yuan Huang, Project Manager Intern @ SupStat Inc

Kai Xiao, Data Scientist @ SupStat Inc

Survival Analysis in R http://nycdatascience.com/slides/survivial_analysis/index.html#1

1 of 43 6/13/14, 9:49 PM

Page 2: R workshop xiv--Survival Analysis with R

Outline

· Introduction to survival analysis
  - Data types
  - Statistics of interest (Survival function, Hazard function, Relative risk)

· Method and Implementation in R
  1. Create survival objects
  2. Estimate survival functions
  3. Test for equality of survival functions
  4. Cox proportional hazards model

· Case study: ADDICTS data


Page 3: R workshop xiv--Survival Analysis with R

Introduction to survival analysis


Page 4: R workshop xiv--Survival Analysis with R

What is survival analysis?

Survival analysis is a collection of statistical procedures for answering questions such as how long subjects in a population survive past a certain time or event, and which variables explain that duration. The data usually take the form of the time until an event of interest occurs.

Convention:

· time: years/months/weeks/days from the beginning of an individual's follow-up until an event occurs
· event: death, heart attack, disease incidence, or any other event of interest

· time --> survival time
· event --> failure


Page 5: R workshop xiv--Survival Analysis with R

Examples:

· Clinical trial
  - Test the effect of a medicine; study the time until disease/death. (event: disease/death)
  - Assess the risk of organ transplant; study the survival time after transplant. (event: death)

· Finance
  - Credit model; study the time until a client defaults. (event: default)

· Economics
  - Study the duration of unemployment. (event: employment)

· Industrial engineering
  - Study the lifetime of a product: a light bulb fails, a computer crashes.


Page 6: R workshop xiv--Survival Analysis with R

Data Types

Complete data

· True survival time = Observed survival time (follow-up time)

Truncation

· Truncation may occur when a subject enters the study: whether the subject is observed depends on the event.
· Subjects may not be observed. If a subject is observed, the event time is precisely known.
· e.g., instruments with limits of detection

Censoring

· Censoring may occur when a subject leaves the study: the time of the event is not known precisely.
· All subjects are observed, but the event time may not be precisely known.
· Three types of censored data: right censoring, left censoring, and interval censoring.


Page 7: R workshop xiv--Survival Analysis with R

More on censoring

Right censoring: True survival time > Observed survival time (most common)

· e.g., patients are alive at the end of the follow-up time.

Left censoring: True survival time < Observed survival time

· e.g., consider following persons until they become HIV positive. We may record a failure when a subject first tests positive for the virus at time t. In this case, we only know that the failure occurred before t, not the exact failure time.

Interval censoring: Observed survival time 1 < True survival time < Observed survival time 2

· e.g., consider following persons until they become HIV positive. A subject may have had two HIV tests, where he/she was HIV negative at the time (say, t1) of the first test and HIV positive at the time (t2) of the second test. In such a case, the subject's true survival time occurred after time t1 and before time t2, i.e., the subject is interval censored in the time interval (t1, t2).

Note: In this talk, we focus on events with right-censored data.
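A minimal sketch of how the three censoring types might be encoded with Surv( ) from the "survival" package (the function this talk uses later to create survival objects). The numbers are hypothetical toy values invented for illustration; they are not from any dataset in the talk.

```r
library(survival)

# Hypothetical toy data, invented for illustration only.

# Right censoring: event = 1 means the failure was observed,
# event = 0 means the subject was still event-free at that time.
right_cens <- Surv(time = c(5, 8, 12), event = c(1, 0, 1))

# Left censoring: event = 0 means the failure happened some time
# before the recorded time.
left_cens <- Surv(time = c(5, 8, 12), event = c(1, 0, 1), type = "left")

# Interval censoring ("interval2" coding): each subject has a lower and an
# upper bound; NA marks an open end of the interval.
interval_cens <- Surv(time  = c(2, 4, NA),
                      time2 = c(6, NA, 9),
                      type  = "interval2")

print(right_cens)
print(left_cens)
print(interval_cens)
```

The "interval2" coding is convenient because one pair of columns can describe interval-, right- and left-censored subjects at the same time.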


Page 8: R workshop xiv--Survival Analysis with R

Data layout

Event time data are usually represented by the pair (t, d), together with covariates x, where

· t: time
· d: censoring indicator. d=1 if failure and d=0 if censored.
· x: covariates of interest


Page 9: R workshop xiv--Survival Analysis with R

Data layout: example

Example: Acute Myelogenous Leukemia (AML) data. It is included in the R package "survival".

Description: Survival in patients with AML. The experiment was designed to investigate whether the standard course of chemotherapy should be extended ('maintenance') for additional cycles.

library("survival")

head(aml)

  time status          x
1    9      1 Maintained
2   13      1 Maintained
3   13      0 Maintained
4   18      1 Maintained
5   23      1 Maintained
6   28      0 Maintained


Page 10: R workshop xiv--Survival Analysis with R

Statistics of interest


Page 11: R workshop xiv--Survival Analysis with R

Survival function: $S(t)$

Definition

$$S(t) = \Pr(T > t)$$

The survival function gives the proportion of the population still without the event by time t.

Graph

$S(t)$ is graphed as a decreasing smooth curve, which begins at $S(t)=1$ at $t=0$ and heads downward toward zero as t increases toward infinity.


Page 12: R workshop xiv--Survival Analysis with R

Estimated/Empirical survival curves: $\hat{S}(t)$

Estimator

The survival curve is estimated by the Kaplan-Meier (KM) estimator $\hat{S}(t)$, also known as the "product-limit estimator".

Graph

· $\hat{S}(t)$ is a step function, rather than a smooth curve.
· The estimated survival curve jumps only at observed failure times, and the information from the censored observations contributes to the sizes of the steps.


Page 13: R workshop xiv--Survival Analysis with R

Hazard function: $h(t)$

Alternative names

Hazard function, incidence rate, instantaneous risk, and force of mortality

Definition

The hazard function gives the instantaneous potential per unit time for the event to occur, given that the individual has survived up to time $t$.

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$$

· The hazard is the event rate at t among those at risk, rather than a probability. Thus, the values of the hazard function range between zero and infinity.
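A short worked relation, added as a sketch: it assumes T is continuous with density $f(t)$ (the symbol $f$ is not used in the slides). Under that assumption the hazard and the survival function determine each other, which is why either one can serve as the statistic of interest.

```latex
% Assumption: T is continuous with density f(t) = -S'(t).
\begin{align*}
h(t) &= \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
      = \frac{f(t)}{S(t)}
      = -\frac{d}{dt} \log S(t), \\
S(t) &= \exp\left( -\int_0^t h(u)\, du \right).
\end{align*}
% Example: a constant hazard h(t) = \lambda gives S(t) = e^{-\lambda t}.
```

For instance, a constant hazard $h(t) = \lambda$ corresponds to the exponential survival curve $S(t) = e^{-\lambda t}$.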


Page 14: R workshop xiv--Survival Analysis with R

Relative risks

Alternative names

Relative risk, risk ratio, and hazard ratio (RR/HR)

Definition

RR is a measure of the strength of the effect on survival.

· Let $h_1(t)$ denote the hazard rate of the treatment group at time t.
· Let $h_0(t)$ denote the hazard rate of the control group at time t.

The risk ratio is defined by

$$\frac{h_1(t)}{h_0(t)}$$


Page 15: R workshop xiv--Survival Analysis with R

Goals of survival analysis

Step 1: Estimate and interpret survival and hazard functions from survival data. (Descriptive statistics)

Step 2: Compare survival and/or hazard functions. (Two-sample mean test)

Step 3: Assess the relationship of explanatory variables to survival time. (Regression analysis)


Page 16: R workshop xiv--Survival Analysis with R

Methods and implementation in R


Page 17: R workshop xiv--Survival Analysis with R

Data: aml

str(aml) # check variables defined in aml dataset.

'data.frame': 23 obs. of 3 variables:
 $ time  : num 9 13 13 18 23 28 31 34 45 48 ...
 $ status: num 1 1 0 1 1 0 1 1 0 1 ...
 $ x     : Factor w/ 2 levels "Maintained","Nonmaintained": 1 1 1 1 1 1 1 1 1 1 ...


Page 18: R workshop xiv--Survival Analysis with R

Step 0: Create survival objects


Page 19: R workshop xiv--Survival Analysis with R

Purpose: Create survival objects

Usage: A survival object is usually used as the response variable in a model formula.

Syntax

Surv(time, event, type=c('right', 'left', 'interval', 'counting', 'interval2', 'mstate'))

· time: for right censored data, this is the follow up time.
· event: the status indicator, normally 0=alive, 1=dead.
· type: character string specifying the type of censoring. The default is "right".

Example

surv.aml <- with(aml, Surv(time, status))
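A small sketch of what the resulting object looks like, using the same aml data. The helper names n_events, n_censored and cens_type are introduced here for illustration only.

```r
library(survival)

# Build the survival object for the aml data
# (time = follow-up time, status = 1 if the event was observed).
surv.aml <- with(aml, Surv(time, status))

# Printing marks right-censored observations with a trailing "+".
print(surv.aml)

# Underneath, the object is a two-column matrix ("time", "status")
# carrying a "type" attribute that records the censoring type.
surv_matrix <- unclass(surv.aml)
n_events    <- sum(surv_matrix[, "status"] == 1)
n_censored  <- sum(surv_matrix[, "status"] == 0)
cens_type   <- attr(surv.aml, "type")

cat("events:", n_events, " censored:", n_censored, " type:", cens_type, "\n")
```

The object is rarely inspected like this in practice; it is normally passed directly as the left-hand side of a model formula.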


Page 20: R workshop xiv--Survival Analysis with R

Step 1: Estimate the survival curves


Page 21: R workshop xiv--Survival Analysis with R

Estimate the survival curves

· Method: Kaplan-Meier estimator
· Implementation in R: survfit( )
· Visualization: Plot of survival curves


Page 22: R workshop xiv--Survival Analysis with R

Example: Data

aml[with(aml,x=="Maintained"),]

   time status          x
1     9      1 Maintained
2    13      1 Maintained
3    13      0 Maintained
4    18      1 Maintained
5    23      1 Maintained
6    28      0 Maintained
7    31      1 Maintained
8    34      1 Maintained
9    45      0 Maintained
10   48      1 Maintained
11  161      0 Maintained


Page 23: R workshop xiv--Survival Analysis with R

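Procedures

Construct a table as follows:

· Each row represents a time point at which an event happens.
· For each row, calculate
  1. Number of people at risk at time t: $n_{risk,t}$;
  2. Number of people who die at time t: $n_{death,t}$;
  3. Survival rate at time t:

$$S(t) = S(t-1) \times \left(1 - \frac{n_{death,t}}{n_{risk,t}}\right)$$

Here $S(t-1)$ denotes the estimate at the previous event time, starting from 1 before the first event.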


Page 24: R workshop xiv--Survival Analysis with R

Each row represents a time point at which an event happens.

TIME   # AT RISK   # OF DEATHS   S(T)
9
13
18
23
31
34
48


Page 25: R workshop xiv--Survival Analysis with R

For each row, calculate the three quantities.

TIME   # AT RISK   # OF DEATHS   S(T)
9      11          1             1 × (1 − 1/11) = 0.909
13     10          1             0.909 × (1 − 1/10) = 0.818
18     8           1             0.818 × (1 − 1/8) = 0.716
23
31
34
48


Page 26: R workshop xiv--Survival Analysis with R

Finished table:

TIME   # AT RISK   # OF DEATHS   S(T)
9      11          1             1 × (1 − 1/11) = 0.909
13     10          1             0.909 × (1 − 1/10) = 0.818
18     8           1             0.818 × (1 − 1/8) = 0.716
23     7           1             0.716 × (1 − 1/7) = 0.614
31     5           1             0.614 × (1 − 1/5) = 0.491
34     4           1             0.491 × (1 − 1/4) = 0.368
48     2           1             0.368 × (1 − 1/2) = 0.184

Plotting the pairs (t, S(t)) as a step function gives the estimated survival curve.
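The hand calculation above can be sketched in a few lines of R. This is an illustrative re-implementation of the table for the Maintained group, not the code used by survfit( ); the names m, event_times, n_risk, n_death and km_table are introduced here for the sketch.

```r
library(survival)

# Maintained group only, as in the worked table.
m <- subset(aml, x == "Maintained")

# One row per distinct time at which at least one event (death) occurs.
event_times <- sort(unique(m$time[m$status == 1]))

# Number at risk just before each event time, and number of deaths at it.
n_risk  <- sapply(event_times, function(tj) sum(m$time >= tj))
n_death <- sapply(event_times, function(tj) sum(m$time == tj & m$status == 1))

# Kaplan-Meier product: S(t) = S(t-1) * (1 - n_death / n_risk).
S <- cumprod(1 - n_death / n_risk)

km_table <- data.frame(time = event_times,
                       n.risk = n_risk,
                       n.death = n_death,
                       S = round(S, 3))
print(km_table)
```

Censored subjects never add a row, but they do leave the risk set, which is why the number at risk drops from 10 to 8 between times 13 and 18.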


Page 27: R workshop xiv--Survival Analysis with R

Implementation in R: survfit( )

fit <- survfit(Surv(time, status) ~ x, data=aml)
summary(fit)

Call: survfit(formula = Surv(time, status) ~ x, data = aml)

                x=Maintained
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    9     11       1    0.909  0.0867       0.7541        1.000
   13     10       1    0.818  0.1163       0.6192        1.000
   18      8       1    0.716  0.1397       0.4884        1.000
   23      7       1    0.614  0.1526       0.3769        0.999
   31      5       1    0.491  0.1642       0.2549        0.946
   34      4       1    0.368  0.1627       0.1549        0.875
   48      2       1    0.184  0.1535       0.0359        0.944

                x=Nonmaintained
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    5     12       2   0.8333  0.1076       0.6470        1.000
    8     10       2   0.6667  0.1361       0.4468        0.995
   12      8       1   0.5833  0.1423       0.3616        0.941
   23      6       1   0.4861  0.1481       0.2675        0.883
   27      5       1   0.3889  0.1470       0.1854        0.816


Page 28: R workshop xiv--Survival Analysis with R

Visualization: Plot survival function

plot(fit, lty = 1:2) # basic plot: plot( ) function.
legend("topright", lty=1:2, legend=levels(aml$x))

· Better looking survival curves: 1. KM plot with at-risk-table 2. Good-looking KM curves
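A slightly extended sketch of the basic plot, using only base graphics and the "survival" package: it adds axis labels and censoring marks, and reads the median survival time of each group from the fitted object. The axis label text and the name medians are choices made for this sketch.

```r
library(survival)

fit <- survfit(Surv(time, status) ~ x, data = aml)

# Step curves for both groups; mark.time = TRUE draws a tick at each
# censored observation.
plot(fit, lty = 1:2, mark.time = TRUE,
     xlab = "Time", ylab = "Estimated survival probability")
legend("topright", lty = 1:2, legend = levels(aml$x))

# Median survival time per group: the first time the curve reaches 0.5 or below.
medians <- summary(fit)$table[, "median"]
print(medians)
```

The medians can also be read off the summary(fit) output: each is the first event time at which the estimated survival falls to 0.5 or less.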


Page 29: R workshop xiv--Survival Analysis with R

Step 2: Test for equality of survival functions


Page 30: R workshop xiv--Survival Analysis with R

Question: Is there a statistically significant difference between the two survival curves?

Method:

· 1. Log-rank test: tests equality of two survival curves.
· 2. Stratified log-rank test: tests equality of two survival curves within every stratum of a categorical explanatory variable.

Implementation in R: survdiff( )


Page 31: R workshop xiv--Survival Analysis with R

Log-rank test

The log-rank test is the most popular test for testing

$$S_1(\cdot) = S_2(\cdot)$$

· By testing the survival curves, we are testing at infinitely many time points.
· What the log-rank test does: if $\frac{h_1(t)}{h_2(t)} = c$ for all t, it tests $H_0: c = 1$ versus $H_1: c \ne 1$.
· The test is based on a statistic constructed over a series of tables.


Page 32: R workshop xiv--Survival Analysis with R

Calculate log-rank statistic

· Step 1: For each time with events $t_j$, $j = 1, \dots, J$, construct the table

                                  GROUP 1    GROUP 2    ROW TOTAL
NO. of deaths at $t_j$            $d_{1j}$   $d_{2j}$   $d_j$
NO. of survivors beyond $t_j$     $s_{1j}$   $s_{2j}$   $s_j$
Column total                      $n_{1j}$   $n_{2j}$   $n_j$

· Step 2: Compute three quantities:

$$O_j = d_{1j}, \quad E_j = \frac{n_{1j} d_j}{n_j}, \quad V_j = \frac{n_{1j} n_{2j} d_j s_j}{n_j^2 (n_j - 1)}$$

· Step 3: The log-rank statistic is

$$\chi^2_L = \frac{\left[\sum_{j=1}^{J} (O_j - E_j)\right]^2}{\sum_{j=1}^{J} V_j}$$
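The three steps can be sketched by hand for the aml data, treating "Maintained" as group 1. This is an illustrative re-implementation of the formulas on this slide, not the internals of survdiff( ); the names g1, terms, chisq and p_value are introduced for the sketch.

```r
library(survival)

# Step 1: distinct event times t_j across both groups.
event_times <- sort(unique(aml$time[aml$status == 1]))
g1 <- aml$x == "Maintained"          # group 1 indicator

# Step 2: O_j, E_j and V_j from the 2x2 table at each t_j.
terms <- t(sapply(event_times, function(tj) {
  at_risk <- aml$time >= tj
  died    <- aml$time == tj & aml$status == 1
  n1 <- sum(at_risk & g1)            # n_1j
  n2 <- sum(at_risk & !g1)           # n_2j
  n  <- n1 + n2                      # n_j
  d1 <- sum(died & g1)               # d_1j
  d  <- sum(died)                    # d_j
  s  <- n - d                        # s_j
  V  <- if (n > 1) n1 * n2 * d * s / (n^2 * (n - 1)) else 0
  c(O = d1, E = n1 * d / n, V = V)
}))

# Step 3: log-rank statistic and its p-value (chi-square, 1 df).
chisq   <- sum(terms[, "O"] - terms[, "E"])^2 / sum(terms[, "V"])
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

cat("Observed:", sum(terms[, "O"]),
    " Expected:", round(sum(terms[, "E"]), 2),
    " Chisq:", round(chisq, 1),
    " p:", round(p_value, 4), "\n")
```

The observed and expected totals and the chi-square value should agree with the Maintained row of the survdiff( ) output for the same data.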


Page 33: R workshop xiv--Survival Analysis with R

Implementation in R: survdiff( )

Syntax

survdiff(formula, data, subset, na.action, rho=0)

Example

logrank <- survdiff(Surv(time, status) ~ x, data=aml)
# Stratified log-rank test (syntax only; aml has no sex variable): survdiff(Surv(time,status)~x+strata(sex))
logrank

Call:
survdiff(formula = Surv(time, status) ~ x, data = aml)

                 N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained    11        7    10.69      1.27       3.4
x=Nonmaintained 12       11     7.31      1.86       3.4

 Chisq= 3.4 on 1 degrees of freedom, p= 0.0653

The p-value is 0.065, which is greater than 0.05. Therefore, there is insufficient evidence to conclude that the two survival curves differ.


Page 34: R workshop xiv--Survival Analysis with R

Step 3: Cox proportional hazards model


Page 35: R workshop xiv--Survival Analysis with R

Cox proportional hazards model

· Model setup
· Model assumption
· Model interpretation
· Implementation in R: coxph( )


Page 36: R workshop xiv--Survival Analysis with R

Model setup

The Cox PH model specifies the hazard for individual i as

$$\lambda_i(t) = \lambda_0(t) \exp(\beta_1 x_{i1} + \dots + \beta_p x_{ip})$$

· $x_{ij}$ is the value of variable j for subject i. $x$ and $\beta$ do not depend on the time t.
· $\lambda_0(t)$ is the baseline hazard. It depends on the time t and is the same for all individuals.
· Note: There is no intercept term in the Cox model.

The effects of covariates are additive and linear on the log-risk scale:

$$\log(\lambda_i(t)) = \log(\lambda_0(t)) + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$$

· $\beta_1 x_{i1} + \dots + \beta_p x_{ip}$ is called the linear predictor or risk score.


Page 37: R workshop xiv--Survival Analysis with R

Model interpretation

For the model

$$\lambda_i(t) = \lambda_0(t) \exp(\beta_1 x_{i1} + \dots + \beta_p x_{ip})$$

· $\beta_k$ is the log risk ratio associated with a one-unit change in $x_k$, given that the other $X$'s are held constant.

Intuition: Look at the model with only one treatment indicator as an example. In this case, the model is $\lambda_i(t) = \lambda_0(t) \exp(\beta X_i)$, where

$$X_i = \begin{cases} 1, & \text{if } i \text{ is treated} \\ 0, & \text{if } i \text{ is control} \end{cases}$$

· For a subject from the control group, $\lambda_i(t) = \lambda_0(t)$.
· For a subject from the treatment group, $\lambda_i(t) = \lambda_0(t) \exp(\beta)$.
· Hence the hazard ratio between the treatment group and the control group is

$$\frac{\lambda_i(t)}{\lambda_0(t)} = \exp(\beta)$$

where $\lambda_i(t)$ is the hazard of a treated subject.
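A minimal sketch of the one-indicator case using coxph( ) on the aml data, where the single covariate x is the maintenance group. With R's default factor coding, "Maintained" is the reference level, so exp(β) compares Nonmaintained to Maintained. The names cox_fit and hr are introduced here for illustration.

```r
library(survival)

# Cox PH model with one indicator covariate (maintenance group).
cox_fit <- coxph(Surv(time, status) ~ x, data = aml)
summary(cox_fit)

# beta is the log hazard ratio; exp(beta) is the hazard ratio of the
# Nonmaintained group relative to the Maintained (reference) group.
hr <- exp(coef(cox_fit))
print(hr)

# 95% confidence interval for the hazard ratio.
print(exp(confint(cox_fit)))
```

A hazard ratio above 1 means the non-reference group has the higher estimated hazard at every time point, under the proportional hazards assumption.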


Page 38: R workshop xiv--Survival Analysis with R

Model assumption

Proportional hazards (PH) assumption

· Let $\lambda_i(t)$ denote the hazard function for person i.
· Let $\lambda_j(t)$ denote the hazard function for person j.

The PH assumption requires: $\frac{\lambda_i(t)}{\lambda_j(t)}$ is a constant over time.

· This means that the hazard for one individual is proportional to the hazard for any other individual, where the proportionality constant is independent of time.
· If the model setup is correct, then we can see from the formula that

$$\frac{\lambda_i(t)}{\lambda_j(t)} = \exp(\beta_1 (x_{i1} - x_{j1}) + \dots + \beta_p (x_{ip} - x_{jp}))$$

indeed is a constant over time.


Page 39: R workshop xiv--Survival Analysis with R

Modeling when PH assumption is violated

1. Stratified Cox model

It is applied when the PH assumption is not fulfilled across strata, but is satisfied within each stratum.

$$\lambda_i(t) = \lambda_{0k}(t) \exp(\beta_1 X_1 + \dots + \beta_p X_p)$$

where $\lambda_{0k}(t)$ stands for the baseline hazard for the kth group.

2. Accelerated failure-time models (AFT).

(further inquiry, email [email protected])
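A hedged sketch of how the two alternatives might look in R. The choice of the lung data (shipped with the "survival" package), the covariate age, the stratifying variable sex, and the Weibull distribution are all assumptions made for this illustration; none of them comes from the talk.

```r
library(survival)

# 1. Stratified Cox model: a separate baseline hazard for each stratum (sex),
#    with a common coefficient for age. In the lung data, status is coded
#    1 = censored, 2 = dead, which Surv() accepts.
strat_fit <- coxph(Surv(time, status) ~ age + strata(sex), data = lung)
print(strat_fit)

# 2. Accelerated failure-time model: a parametric alternative that models
#    log survival time directly. A Weibull distribution is assumed here.
aft_fit <- survreg(Surv(time, status) ~ x, data = aml, dist = "weibull")
print(aft_fit)
```

Note that the stratifying variable gets no coefficient of its own: its effect is absorbed into the stratum-specific baseline hazards.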


Page 40: R workshop xiv--Survival Analysis with R

Check assumptions

1. Model with only one indicator variable

· Graphical approach: Plot the log-log Kaplan-Meier survival estimates against time and evaluate whether the curves are reasonably parallel.

2. General cases

In general cases, we apply a statistical test for assessing the PH assumption. We will skip the theory here. In R, it is implemented by the cox.zph( ) function.

· This statistical test is a test of correlation between the Schoenfeld residuals and survival time (or ranked survival time).
· A correlation of zero supports the proportional hazards assumption (the null hypothesis).
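Both checks can be sketched on the aml data. The names cox_fit, zph and km are introduced for this illustration; the exact form of the test statistic printed by cox.zph( ) may differ between versions of the "survival" package.

```r
library(survival)

# General case: test of the PH assumption based on Schoenfeld residuals.
cox_fit <- coxph(Surv(time, status) ~ x, data = aml)
zph <- cox.zph(cox_fit)
print(zph)   # a small p-value would suggest the PH assumption is violated

# One indicator variable: complementary log-log plot of the KM estimates.
# Roughly parallel curves are consistent with proportional hazards.
km <- survfit(Surv(time, status) ~ x, data = aml)
plot(km, fun = "cloglog", lty = 1:2,
     xlab = "Time (log scale)", ylab = "log(-log(S(t)))")
legend("topleft", lty = 1:2, legend = levels(aml$x))
```

With only 23 subjects the test has little power, so for data this small the graphical check is a useful complement to the p-value.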


Page 41: R workshop xiv--Survival Analysis with R

Case study: ADDICTS data


Page 42: R workshop xiv--Survival Analysis with R

Further topics

· Parametric model - Accelerated failure time model (AFT)
· Modeling with time-dependent covariates
· Competing risk model
· Model for interval censored data
· Model for recurrent events


