A Comparison of Modeling Scales in Flexible Parametric Models
Noori Akhtar-Danesh, PhDMcMaster UniversityHamilton, [email protected]
2
Outline
BackgroundA review of splinesFlexible parametric modelsResults
• Ovarian cancer
• Colorectal cancer
Conclusions
Stata Conference 2015, Columbus, USA
Stata Conference 2015, Columbus, USA 3
Background
Cox-regression and parametric survival models are quite common in the analysis of survival data
Recently, Flexible Parametric Models (FPM), have been introduced as an extension to the parametric models such as Weibull model (hazard- scale), loglogistic model (odds-scale), and lognormal model (probit-scale)
Stata Conference 2015, Columbus, USA 4
Objectives & Methods
In this presentation different FPMs will be compared based on these modeling scales
Used two subsets of the U.S. National Cancer Institute's Surveillance, Epidemiology and End Results (SEER) dataset from the original 9 registries; • Ovarian cancer diagnosed 1991 - 2010
• Colorectal cancer in men 60+ diagnosed 2001 - 2010
5
A review of splines
The daily statistical practice usually involves assessing relationship between one outcome variable and one or more explanatory variables
We usually assume linear relationship between some function of the outcome variable and the explanatory variables
However, in many situations this assumption may not be appropriate
Stata Conference 2015, Columbus, USA
6
Linear splinesysuse auto
twoway (scatter mpg weight) (lfit mpg weight), xlabel(1700(300)4900)
Stata Conference 2015, Columbus, USA
10
20
30
40
1,700 2,000 2,300 2,600 2,900 3,200 3,500 3,800 4,100 4,400 4,700 5,000Weight (lbs.)
7
Linear splinesysuse auto
twoway (scatter mpg weight) (lfit mpg weight), xlabel(1700(300)4900)
Stata Conference 2015, Columbus, USA
10
20
30
40
1,700 2,000 2,300 2,600 2,900 3,200 3,500 3,800 4,100 4,400 4,700 5,000Weight (lbs.)
8
Linear spline mkspline lnweight1 2400 lnweight2= weight
regress mpg lnweight*
predict linsp
twoway (scatter mpg weight) (line linsp weight, sort)
10
20
30
40
2,000 3,000 4,000 5,000Weight (lbs.)
Mileage (mpg) Fitted values
Stata Conference 2015, Columbus, USA
9
Cubic Splines
Cubic splines are piecewise cubic polynomials with a separate cubic polynomial fit in each of the predefined number of intervals
The number of intervals is chosen by the user and the split points are known as knots
Continuity restrictions are imposed to join the splines at knots to fit a smooth function
Stata Conference 2015, Columbus, USA
10
Restricted Cubic Splines
In RCS the spline function is forced (restricted) to be linear before the first and after the last knot (the boundary knots)
When modeling survival time, the boundary knots are usually defined as the minimum and maximum of the uncensored survival times
Stata Conference 2015, Columbus, USA
11
Restricted Cubic SplinesLet s(x) be the restricted cubic spline function, if
we define m interior knots, k1,…, km, and two boundary knots, kmin and kmax, we can write s(x) as a function of parameters and some newly defined variables z1,…, zm+1,
0 1 1 2 2 1 1( ) ... m ms x z z z
Stata Conference 2015, Columbus, USA
12
Restricted Cubic SplinesThe derived variables (zj, also know as the basis
functions) are calculated as following
where for j=2,…,m+1, and
(Royston & Lambert, 2011)
1
3 3 3min max( ) ( ) (1 )( )j j j j
z x
z x k x k x k
max
max min
jj
k k
k k
Stata Conference 2015, Columbus, USA
3 3( ) ( ) if it is positive and 0 otherwisej jx k x k
13
Restricted Cubic Splines
These RCSs can be calculated using a number of Stata commands, including mkspline (an official Stata command), rcsgen, and splinegen (two user written commands)
The rcsgen command can orthogonalize the derived spline variables which can lead to more stable parameter estimates and quicker model convergence
Stata Conference 2015, Columbus, USA
14
rcsgen; an alternative spline macro
sysuse auto, clear
rcsgen weight, gen(rcs) df(3)
regress mpg rcs1-rcs3
predictnl pred = xb(), ci(lci uci)
twoway (rarea lci uci weight, sort) ///
(scatter mpg weight, sort) ///
(line pred weight, sort lcolor(black)), /// legend(off)
Stata Conference 2015, Columbus, USA
15
Flexible Parametric Models rcsgen; an alternative spline macro
10
20
30
40
2,000 3,000 4,000 5,000Weight (lbs.)
Stata Conference 2015, Columbus, USA
16
FPM: Royston-Parmer (RP) Models
RP models are a extension of the parametric models (Weibull, log-logistic, and log-normal) which offer greater flexibility with respect to shape of the survival distribution
The additional flexibility of an RP model is because, for instance for a hazard model, it represents the baseline distribution function as a restricted cubic spline function of log time instead of simply as a linear function of log time
The complexity of modeling spline functions is determined by the number and positions of the knots in the log time
Stata Conference 2015, Columbus, USA
17
FPM: Royston-Parmer (RP) Models
By default, the internal knots for modeling baseline distribution function in RP models are positioned on the distribution of uncensored log event-times
Internal knots d.f. Knot positions (centiles)1 2 50
2 3 33, 67
3 4 25, 50, 75
4 5 20, 40, 60, 80
5 6 17, 33, 50, 67, 83
6 7 14, 29, 43, 57, 71, 86
7 8 12.5, 25, 37.5, 50, 625.5, 75, 87.5
8 9 11.1, 22.2, 33.3, 44.4, 55.6, 66.7, 77.8. 88.9
9 10 10, 20, 30, 40, 50, 60, 70, 80, 90
Stata Conference 2015, Columbus, USA
18
FPM: Royston-Parmer (RP) Models
Spline models can be chosen by the appearance of the survival functions, hazard functions, etc. or more formally, by minimizing the value of an information criterion [Akaike (AIC) or Bayes (BIC)]Estimation of parameters is by maximum likelihood
Stata Conference 2015, Columbus, USA
19
FPM: A review of Weibull distribution
The cumulative hazard function for a Weibull distribution is
To make it consistent with rest of this presentation let’s change the notation as
where 1 is the shape parameter. Then, the Weibull hazard function is
( ) pH t t
1( )H t t
1 11( ) ( ) /h t dH t dt t
Stata Conference 2015, Columbus, USA
20
FPM: A review of Weibull distributionuse "C:Desktop/ovary.dta", clear
keep if agegrp==3
stset SRV_TIME_MON, f( STAT_REC=4) scale(12) exit(time 10*12)
sts gen S=s sts gen Slb=lb(s) sts gen Sub=ub(s)
sts gen Hna=nasts gen Hlb=lb(na)sts gen Hub=ub(na)
quietly streg, d(w)
gen Hweib=exp(_b[_cons])*(_t)^(e(aux_p))gen Sweib=exp(-Hweib)
bysort _t: drop if _n>1
twoway(rarea Hlb Hub _t, pstyle(ci) sort) /// (line Hna Hweib _t, sort), leg(off) ytitle(" Cumulative hazard function") /// xtitle("") ylab(, angle(h)) name(g1, replace) nodraw
twoway(rarea Slb Sub _t, pstyle(ci) sort) /// (line S Sweib _t, sort), leg(off) ytitle(" Survival function") /// xtitle("") ylab(, angle(h)) name(g2, replace) nodraw
graph combine g1 g2, b2title(Years from diagnosis)
Stata Conference 2015, Columbus, USA
21
FPM: A review of Weibull distribution Women age 60-69 diagnosed with ovarian cancer
Stata Conference 2015, Columbus, USA
0
.5
1
1.5
Cum
ulat
ive
haza
rd fu
nctio
n
0 2 4 6 8 10
.2
.4
.6
.8
1
Sur
viva
l fun
ctio
n
0 2 4 6 8 10
Years from diagnosis
22
FPM: A review of Weibull distribution
One reason that a Weibull model does not fit very well to the dataset is that it has a monotonic hazard function To have a more flexible form, we begin by writing the Weibull cumulative hazard function in logarithmic form
Now, suppose that f(t;) represents some general family of nonlinear functions of time t, with some parameter vector and
1 0 1ln ( ) ln ln lnH t t t
Stata Conference 2015, Columbus, USA
ln ( ) ( ; )H t f t
23
FPM: Royston-Parmer (RP) Models
Because cumulative hazard functions are monotonic in time, f(t;) must be monotonic too
Two potentially appropriate functions are fractional polynomials (Royston & Altman 1994) and splines (de Boor 2001)
Stata Conference 2015, Columbus, USA
24
FPM: Royston-Parmer (RP) Models
We write a restricted cubic spline function as s(lnt;) instead of f(t;) with s standing for spline and lnt to emphasize that we are working on the scale of log time
where lnt, z1(lnt), z2(lnt), ..., are the basis functions of the restricted cubic spline
2 1 30 1 2ln ( ) (ln ; ) ln (ln ) (ln ) ...z t zH s t t tt
Stata Conference 2015, Columbus, USA
25
FPM: Royston-Parmer (RP) Models
When we specify one or more knots, the spline function includes a constant term (0), a linear function of lnt with parameter 1, and a basis function for each knot
By convention, the “no knots” case for a hazard model corresponds to the linear function, s(lnt;)= 0+ 1 lnt, which is the Weibull model
Stata Conference 2015, Columbus, USA
26
FPM: Royston-Parmer (RP) Models
We estimate the parameters by maximum likelihood method using the stpm2 routine (Lambert & Royston 2009)
We identify df for each model based on AIC criteria and evaluate the variables in the model using lrtest
We use options of hazard, odds, and normal in stpm2 for fitting different scales
Stata Conference 2015, Columbus, USA
27
FPM: Ovarian cancer
. tab agegrpAge group | Freq. Percent Cum.-------------+-----------------------------------40- 49 years | 2,700 19.55 19.5550- 59 years | 3,896 28.21 47.7660- 69 years | 3,466 25.10 72.8670- 79 years | 2,606 18.87 91.73 >=80 years | 1,142 8.27 100.00-------------+----------------------------------- Total | 13,810 100.00
gen year=DATE_yr-1990mkspline yearsp=year, cubic nknots(3)
stpm2 agegrp2-agegrp5 yearsp*, df(7) tvc(agegrp2- /// agegrp5 yearsp1) dftvc(2) ///
scale(hazard) eform nolog
Stata Conference 2015, Columbus, USA
28
FPM: Ovarian cancer
. estat ic
Scale | AIC
-------------+--------------
Hazard | 35615.92
Odds | 35616.51
Normal | 35564.12
Stata Conference 2015, Columbus, USA
29
FPM: Ovarian cancer
Stata Conference 2015, Columbus, USA
30
FPM: Ovarian cancer
Stata Conference 2015, Columbus, USA
0
.2
.4
.6
.8
1
Re
lativ
e su
rviv
al r
atio
0 1 2 3 4 5 6 7 8 9 10Years from diagnosis
40- 49 years
0
.2
.4
.6
.8
1
Re
lativ
e su
rviv
al r
atio
0 1 2 3 4 5 6 7 8 9 10Years from diagnosis
50- 59 years
0
.2
.4
.6
.8
1
Re
lativ
e su
rviv
al r
atio
0 1 2 3 4 5 6 7 8 9 10Years from diagnosis
60- 69 years
0
.2
.4
.6
.8
1
Re
lativ
e su
rviv
al r
atio
0 1 2 3 4 5 6 7 8 9 10Years from diagnosis
70- 79 years
0
.2
.4
.6
.8
1
Re
lativ
e su
rviv
al r
atio
0 1 2 3 4 5 6 7 8 9 10Years from diagnosis
>= 80 years
95%CI HazardOdds Normal
Scale
31
FPM: Ovarian cancer
Stata Conference 2015, Columbus, USA
0
.2
.4
.6
.8
1
On
e-y
ea
r R
SR
1991 1994 1997 2000 2003 2006 2009Years of diagnosis
40- 49 years
0
.2
.4
.6
.8
1
On
e-y
ea
r R
SR
1991 1994 1997 2000 2003 2006 2009Years of diagnosis
50- 59 years
0
.2
.4
.6
.8
1
On
e-y
ea
r R
SR
1991 1994 1997 2000 2003 2006 2009Years of diagnosis
60- 69 years
0
.2
.4
.6
.8
1
On
e-y
ea
r R
SR
1991 1994 1997 2000 2003 2006 2009Years of diagnosis
70- 79 years
0
.2
.4
.6
.8
1
On
e-y
ea
r R
SR
1991 1994 1997 2000 2003 2006 2009Years of diagnosis
>= 80 years
HazardOddsNormal
Scale
32
FPM: Ovarian cancer
Stata Conference 2015, Columbus, USA
.3
.5
.7
Fiv
e-y
ea
r R
SR
1991 1994 1997 2000 2003 2006 2009Years of diagnosis
40- 49 years
.3
.5
.7
Fiv
e-y
ea
r R
SR
1991 1994 1997 2000 2003 2006 2009Years of diagnosis
50- 59 years
.3
.5
.7
Fiv
e-y
ea
r R
SR
1991 1994 1997 2000 2003 2006 2009Years of diagnosis
60- 69 years
.3
.5
.7
Fiv
e-y
ea
r R
SR
1991 1994 1997 2000 2003 2006 2009Years of diagnosis
70- 79 years
.3
.5
.7
Fiv
e-y
ea
r R
SR
1991 1994 1997 2000 2003 2006 2009Years of diagnosis
>= 80 years
HazardOddsNormal
Scale
33
FPM: Ovarian cancer
Stata Conference 2015, Columbus, USA
0
10
20
30
40
50
60
Mo
rta
lity
rate
(p
er
10
0 P
Y)
0 1 2 3 4 5Years from Diagnosis
40- 49 years
0
10
20
30
40
50
60
Mo
rta
lity
rate
(p
er
10
0 P
Y)
0 1 2 3 4 5Years from Diagnosis
50- 59 years
0
10
20
30
40
50
60
Mo
rta
lity
rate
(p
er
10
0 P
Y)
0 1 2 3 4 5Years from Diagnosis
60- 69 years
0
10
20
30
40
50
60
Mo
rta
lity
rate
(p
er
10
0 P
Y)
0 1 2 3 4 5Years from Diagnosis
70- 79 years
0
10
20
30
40
50
60
Mo
rta
lity
rate
(p
er
10
0 P
Y)
0 1 2 3 4 5Years from Diagnosis
>= 80 years
Hazard
OddsNormal
Scale:
34
FPM: Colorectal cancer
. tab agegrp
Age group | Freq. Percent Cum.-------------+-----------------------------------60- 69 years | 14,837 35.32 35.3270- 79 years | 16,026 38.16 73.48 >=80 years | 11,139 26.52 100.00-------------+----------------------------------- Total | 42,002 100.00
stpm2 agegrp2-agegrp3 year, df(9) tvc(agegrp2- /// agegrp3) dftvc(2) scale(hazard) eform nolog
Stata Conference 2015, Columbus, USA
35
FPM: Colorectal cancer
. estat ic
Scale | AIC
-------------+--------------
Hazard | 111997.6
Odds | 112056.2
Normal | 111829.9
Stata Conference 2015, Columbus, USA
36
FPM: Colorectal cancer
Stata Conference 2015, Columbus, USA
0
.1
.2
.3
.4
.5
.6
.7
.8
.9
1
Su
rviv
al r
ate
0 1 2 3 4 5 6 7 8 9 10Years from diagnosis
60- 69 years
0
.1
.2
.3
.4
.5
.6
.7
.8
.9
1
Su
rviv
al r
ate
0 1 2 3 4 5 6 7 8 9 10Years from diagnosis
70- 79 years
0
.1
.2
.3
.4
.5
.6
.7
.8
.9
1
Su
rviv
al r
ate
0 1 2 3 4 5 6 7 8 9 10Years from diagnosis
>= 80 years
95%CI HazardOdds Normal
Scale
37
FPM: Colorectal cancer
Stata Conference 2015, Columbus, USA
.7
.75
.8
.85
.9
On
e-y
ea
r su
rviv
al r
ate
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010Years of diagnosis
60- 69 years
.7
.75
.8
.85
.9
On
e-y
ea
r su
rviv
al r
ate
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010Years of diagnosis
70- 79 years
.7
.75
.8
.85
.9
On
e-y
ea
r su
rviv
al r
ate
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010Years of diagnosis
>= 80 years
Hazard
OddsNormal
Scale
38
FPM: Colorectal cancer
Stata Conference 2015, Columbus, USA
0
10
20
30
40
50
60
70
80
Mo
rta
lity
rate
(p
er
10
0 P
Y)
0 1 2 3 4 5Years from Diagnosis
60- 69 years
0
10
20
30
40
50
60
70
80
Mo
rta
lity
rate
(p
er
10
0 P
Y)
0 1 2 3 4 5Years from Diagnosis
70- 79 years
0
10
20
30
40
50
60
70
80
Mo
rta
lity
rate
(p
er
10
0 P
Y)
0 1 2 3 4 5Years from Diagnosis
>= 80 years
HazardOdds
Normal
Scale
39
Conclusion
In general, there were no substantial differences between the estimates from the three modeling scales, although the probit-scale showed slightly better fit based on the Akaike information criterion (AIC) for both datasets
Stata Conference 2015, Columbus, USA
40
References de Boor, C. 2001. A Practical Guide to Splines, Revised ed ed. New York,
Springer. Durrleman, S. & Simon, R. 1989. Flexible regression models with cubic splines.
Stat.Med., 8, (5) 551-561 available from: PM:2657958 Lambert, P.C. & Royston, P. 2009. Further development of flexible parametric
models for survival analysis. The Stata Journal, 9, 265-290 Royston, P. & Altman, D.G. 1994. Regression using fractional polynomials of
continuous covariates: Parsimonious parametric modelling (with discussion). Applied Statistics, 43, 429-467
Royston, P. & Lambert, P.C. 2011. Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model Stata Press.
Royston, P. & Parmar, M.K. 2002. Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Stat.Med., 21, (15) 2175-2197 available from: PM:12210632
Stata Conference 2015, Columbus, USA