Principles of Survival Analysis
Version of July 22nd 2012

ACC Coolen
King's College London
Preface
When I first tried to learn about survival analysis I found there to be an unwelcome gap in
the literature. There are many textbooks and papers that give the formulas of survival analysis, explain
how one should use standard statistical software packages, and give examples of applications of
standard methods to real data. Then there are hardcore statistics papers that often focus on
very mathematical/technical questions and are often written in the language of measure theory.
I could not find good textbooks that sit in the middle, to explain in detail the conceptual and
mathematical basis of the formulas of survival analysis. Where do all these formulas come from?
What assumptions were made? How exactly are key quantities defined?
The traditional survival analysis methods such as proportional hazards regression and Kaplan-
Meier risk estimators were perfect for their time (the 1970s), when each university had just one
computer (that filled several rooms, and was probably slower than today’s average laptop) and
mathematical methods had to be simple in order to be applied to real data. However, nowadays
the use of this traditional methodology is increasingly inappropriate. In modern biomedicine we
have new problems and new ambitions: we want to use the wealth of new data for personalised
medicine, but face complex heterogeneous diseases and cohorts, and a vast dimensional mismatch
between the number of (e.g. genetic) covariates and the number of patients on which we have data.
I would say that at this moment the pressing problems in survival analysis are not measure-
theoretic; they are more basic. We need to develop new methods that do not deviate unnecessarily
from the traditional ones (and preferably include these in special simplifying limits), but can
handle the big questions of today – individualised prediction, cohort heterogeneity and dimensional
mismatch. To do this we need to understand and review in full detail the principles and
mathematical derivations of traditional survival analysis, and rebuild the edifice where this is needed
to accommodate the new questions that we want survival analysis to answer.
These lecture notes are written as an attempt to fill the above gap. I try to map out and explain
the definitions, assumptions and derivations of the main methods in survival analysis. The style is
that of the physicist who appreciates that there is a time and place for investigating mathematical
subtleties like Lebesgue measures, noncommuting limits, and distribution theory, but who first
wants to erect the building in terms of structure. Once the roof is on and the windows are in, we
can start thinking about the colour of the door handles. I also write with the benefit of hindsight.
In the 1970s maximum-likelihood estimation was the norm, and that is often the language in which
original derivations are given. Nowadays we prefer the Bayesian route (within which maximum
likelihood is but a special limit), and this makes derivations and subtleties a lot more transparent.
In these notes I will not give journal references. Below I simply list four textbooks on the
subject, which contain a wealth of references to research papers and other texts. With the possible
exception of Crowder, these books tend not to give full mathematical derivations to the extent that
I would have liked. The texts of Klein and Moeschberger and of Crowder I like best in terms of
subject coverage and writing style, but of course such assessments are subjective ...
P Hougaard: Analysis of multivariate survival data. Springer (2001).
JG Ibrahim, MH Chen and D Sinha: Bayesian survival analysis. Springer (2001).
JP Klein and ML Moeschberger: Survival analysis – techniques for censored and
truncated data. Springer (2005).
M Crowder: Multivariate survival analysis and competing risks. CRC Press (2012).
CONTENTS
1 Why probability and statistics are tricky
2 Definitions and basic properties in survival analysis
2.1 Notation, data and objective
2.2 Survival probability and cause-specific hazard rates
2.3 Examples
3 Event time correlations and the identifiability problem
3.1 Independently distributed event times
3.2 The (Tsiatis) identifiability problem
3.3 Examples
4 Incorporating cure as a possible outcome
4.1 The clean way to include cure
4.2 The quick and dirty way to include cure
4.3 Examples
5 Individual versus cohort level survival statistics
5.1 Population level survival functions
5.2 Population hazard rates and data likelihood
5.3 Examples
6 Survival prediction
6.1 Cause-specific survival functions
6.2 Estimation of cause-specific hazard rates
6.3 Derivation of the Kaplan-Meier estimator
6.4 Examples
7 Including covariates
7.1 Definition via covariate sub-cohorts
7.2 Definition by conditioning individual hazard rates on covariates
7.3 Connection between the conditioning picture and the sub-cohort picture
7.4 Conditionally homogeneous cohorts
7.5 Nonparametrised determination of covariates-to-risk connection
7.6 Examples
8 Proportional hazards (Cox) regression
8.1 Definitions, assumptions and regression equations
8.2 Uniqueness and p-values for regression parameters
8.3 Properties and limitations of Cox regression
8.4 Examples
9 Overfitting and p-values
9.1 What is overfitting?
9.2 Overfitting in binary classification
9.3 Overfitting in Cox regression
9.4 p-values for Kaplan-Meier curves
9.5 Notes and examples
10 Heterogeneous cohorts and competing risks
10.1 Population-level hazard rate correlations and competing risks
10.2 Rational parametrisations for heterogeneous population models
10.3 Types of heterogeneity
10.4 Impact of heterogeneity on hazard ratios
10.5 False protectivity
10.6 Frailty models
10.7 Fine and Gray regression
10.8 Bayesian regression
10.9 Notes and examples
11 Further topics
11.1 Binary classification
11.2 Multiple testing corrections
11.3 Log-rank test
Appendix A The δ-distribution
Appendix B Steepest descent integration
Appendix C Maximum likelihood and Bayesian parameter estimation
1. Why probability and statistics are tricky
Probability and statistics are probably the most tricky and most abused areas of mathematics.
In order to get some feeling for why this is so, let us start with some simple examples of
statistical/probabilistic questions that as yet have nothing to do with survival analysis.
Example 1: The Monty Hall problem
This problem is based loosely on the scenario played out at the end of many typical television
game shows of the 1970s. The archetypical show in the USA after which the problem was
named was called ‘Let’s Make a Deal’ (USA, 1963-1977) and was hosted by Monty Hall.
At the end of the show the winner faces a final challenge to claim a prize. There are three
closed doors; behind one of these is a big prize (large amount of money, a car, etc) , behind
the other two is something silly (e.g. goats or llamas). What happens next is:
• the winner is asked to choose one of the
three closed doors (randomly, as he/she
has no clue).
• Monty opens one of the remaining two
doors, behind which there is a goat/llama
(this is always possible, irrespective of
the winner’s selection, since only one
door has the true prize).
We are then left with two closed doors,
one of which was picked by the winner.
We still don’t know which of these leads to the prize.
• Monty then offers the winner the option to change his/her mind at the last minute and
switch from the initial selection to the other closed door.
The question then is: will it make a difference to the likelihood of winning the prize if he/she
were to switch at the last minute? Intuitively one is tempted to say no. It would seem that each
door simply has a 50% chance of leading to the prize, and switching would make no difference.
In fact the correct answer is yes. Careful analysis of all possible events and their probabilities
shows that switching at the last minute doubles one's likelihood of winning the prize: the initial
pick is correct with probability 1/3, so switching wins exactly when the initial pick was wrong,
i.e. with probability 2/3. A minimal Monte Carlo check of this is sketched below.
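The following simulation is my own illustration, not part of the original notes; the function
name play and the door-numbering convention are arbitrary choices:

import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)     # door hiding the prize
        pick = random.randrange(3)      # winner's initial (random) choice
        # Monty opens a door that is neither the pick nor the prize
        opened = next(d for d in range(3) if d != pick and d != prize)
        if switch:                      # move to the one remaining closed door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == prize)
    return wins / trials

print(play(switch=False))   # ~0.33: staying wins 1/3 of the time
print(play(switch=True))    # ~0.67: switching wins 2/3 of the time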
Example 2: share price statistics
Imagine one is asked to produce a statistical report on the typical behaviour of share prices
over a twenty-year period, of corporations that are listed on the London Stock Exchange. For
the sake of the argument, let us pretend that the most recent financial crisis hadn’t happened.
How would we go about this task? It would seem natural to proceed as follows:
• Make a list of all companies that have
been on the LSE since 1992.
• Find/buy the data that give the daily
share values of all companies on this
list over the last 20 years.
• Carry out a careful statistical analysis
of these data.
Yet in doing so we would make fundamental mistakes. In fact we are already in trouble from
the first step. By putting only those companies on our list that have been on the LSE for the
last twenty years, we are biasing our sample to those companies that are sufficiently healthy
to remain in business (and listed on the LSE) for at least twenty years. Irrespective of the
statistical analysis methods used, this will lead to a picture of share statistics that is too rosy.
Pitfalls and dangers in probability and statistics. Probability and statistics are in principle as
sound and unambiguous as any other area of mathematics. The pitfalls that so often get us
into trouble with statistics do not relate to the precision and consistency of the formal theory
or mathematical manipulations, but they tend to emerge when we apply statistics and probability
to practical scenarios and real-world problems. The main ones are related to:
• The meaning of uncertainty
Probabilities quantify uncertainty, but there are two types of uncertainty. Probabilities can
express our ignorance of:
(a) something that cannot be known, because it is still to happen and can still go either way
(e.g. the probability of finding a six for a dice that is still to be rolled ...)
(b) something that is known, but not by us
(e.g. the probability of finding a six for a dice that has been rolled inside a black box ...)
For instance, if we write Prob(phenotype) = Σ_genotypes Prob(genotype) × Prob(phenotype|genotype) in
biomedicine, then the uncertainty in an individual's genotype, expressed by Prob(genotype),
would be of type (b) (in principle written in stone, but we don't have the information), whereas
given the genotype we would still expect variability in the phenotype, described by
Prob(phenotype|genotype), that results at least partly from non-predictable events (e.g. cell
signalling, mutations), i.e. type (a) uncertainty. In medicine the difference can be quite relevant.
• Accidental conditioning
This is what happens when we inadvertently collect our information from a non-representative
subset of the events or individuals on which we seek to make statistical statements. This is
what happens in the Monty Hall problem (where Monty's decision of which door to open is
constrained or conditioned by the initial selection of the winner; this brings in subtle extra
information that is exploited when the winner switches at the last minute), and in the example
of share price analysis (where we condition our sample of companies on at least 20 years'
survival). Extra information B generally modifies the probability to observe A: we wish to
sample according to a prior P(A), but end up sampling according to a conditioned posterior
measure P(A|B), described by the Bayesian relation
$$\underbrace{P(A|B)}_{\rm posterior} = \frac{P(A,B)}{P(B)} = \frac{\overbrace{P(A)}^{\rm prior}\times P(B|A)}{P(B)}$$
• Limitations of our intuition
Possibly because of the evolutionary advantages of pattern detection, humans are obsessed with
patterns, and consequently very poor at judging likelihoods objectively. We struggle to accept
intuitively that even after we have thrown ten successive sixes with a fair dice (an unlikely
sequence of events, for which the a priori probability is around $1.7\times 10^{-8}$), the next roll
still carries a probability of 1/6 of giving yet another six (in spite of the fact that this would
lead to an even more remarkable sequence of eleven sixes in a row). By the same token most
humans struggle to generate sequences of random numbers 01001010001011010010...; it
is trivial to write a simple computer program that can predict the next digit in such human-
generated sequences correctly with a probability of some 60% (a minimal sketch of such a
predictor is given below). This failing statistical intuition explains why most would get the
answer to the Monty Hall question wrong, and partly explains the profitability of the gambling
industry ...
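The sketch below is my own construction (the context length k = 3 and the name predict_next
are arbitrary choices); it simply counts how often each short pattern of recent digits was
followed by a 0 or a 1, and guesses the majority continuation:

from collections import defaultdict

def predict_next(bits, k=3):
    # count, for each length-k context, how often it was followed by 0 or by 1
    counts = defaultdict(lambda: [0, 0])
    for i in range(len(bits) - k):
        counts[tuple(bits[i:i+k])][bits[i+k]] += 1
    zeros, ones = counts[tuple(bits[-k:])]
    return 0 if zeros >= ones else 1

seq = [0,1,0,0,1,0,1,0,0,0,1,0,1,1,0,1,0,0,1,0]
print(predict_next(seq))   # guess for the next digit of this sequence

On genuinely random input such a predictor scores 50%; on typical human-generated sequences
it does distinctly better, because humans avoid the long repeats that true randomness produces.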
• Assumptions behind methods
All statistical methods involve explicit or implicit assumptions, and many involve further
mathematical approximations. If nothing is assumed, nothing can be calculated. For instance,
in least-squares data fitting and in principal component analysis one assumes that the noise in
the data is Gaussian; in order to use the central limit theorem it is not sufficient just to have
a large sum of independent random variables (but we have to satisfy very specific criteria on
the distributions of the random variables, e.g. those quantified by Lindeberg’s condition), etc.
Obviously, the correctness of any outcome of such methods depends on the extent to which
these assumptions and approximations are reasonable in the context of the problem at hand.
It is vital that one has a basic understanding of what these assumptions and approximations
are, so that one can convince oneself that the chosen method can be used.
• Imprecise definitions
In statistics it is vital that we are very precise in defining quantities. When we speak about
the probability of getting a specific disease within a given time, do we mean the probability for
one individual? Or the probability for a randomly drawn individual from a population? Do
we include our ignorance of the previous two? (i.e. the probability of the probability) ...
2. Definitions and basic properties in survival analysis
2.1. Notation, data and objective
Notation and data. Imagine we have data on a cohort of N patients, labelled i = 1 . . . N . They
are subject to R distinct ‘hazards’ or ‘risks’, labelled r = 1 . . . R, which trigger irreversible events
such as onset of a given disease of interest, death due to causes other than the disease of interest,
etc. We also measure p characteristics of our patients (e.g. gender, blood serum counts, BMI,
socio-economic factors, genetic variables, etc), resulting for each patient i in a list of p numbers
(Zi1, . . . , Zip), the so-called ‘covariates’. Covariates can be discrete (e.g. gender) or real-valued (e.g.
BMI). We monitor our cohort during a trial of finite duration, and record for each patient when
the first event happened to them, and which event this was; the start of the trial is taken as time
zero. To label those patients that did not record any event during the trial (they could be lost to
follow-up along the way, or may have reached the end of the trial without experiencing any of the
R events), we introduce a further ‘risk’ r = 0 (which we will refer to simply as end-of-trial). Our
data thus take the following form. For each patient i we have
$Z_i = (Z_{i1},\dots,Z_{ip})$: values of the p covariates
$X_i \geq 0$: time at which the first event occurred
$\Delta_i \in \{0,\dots,R\}$: label indicating which event occurred at time $X_i$
A typical example of such data is shown in figure 1. The covariates (or ‘explanatory factors’) can
be divided into three qualitatively distinct groups:
• uncontrolled covariates: e.g. gender, genetic make-up, etc
• controlled covariates: e.g. medical treatment,
• modifiable covariates: e.g. smoking, drinking, nutrition, etc
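For concreteness, the data $(Z_i, X_i, \Delta_i)$ could be held in memory as follows (a sketch of
one possible layout, with values in the spirit of figure 1; nothing in these notes prescribes this
representation):

import numpy as np

# N = 3 patients, p = 2 covariates (here BMI and selenium), R risks plus r = 0
Z = np.array([[22.6, 105.0],
              [34.2,  65.0],
              [20.1,  72.0]])         # shape (N, p): covariate values
X = np.array([33.69, 24.81, 23.10])   # shape (N,): first-event times
Delta = np.array([0, 1, 2])           # shape (N,): event labels, 0 = end-of-trial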
Objectives of survival analysis. Survival analysis is the statistical discipline that deals with data
of the above type, and tries to extract patterns from these data to quantify the relations (if any)
between the covariates and the risks. Usually we are interested mainly in one particular risk,
traditionally chosen as r = 1, so the other risks $r\in\{0,2,3,\dots,R\}$ are unfortunate complications.
More specifically we would like to
• Evaluate the effects of covariates
• Predict event times from knowledge of the covariates
• Compare and validate different models with which to explain the data
Censoring. ‘Censoring’ means that the value of a measurement is only partially known, as is the
case here. An end-of-trial outcome ∆i = 0 means that all we know is that patient i is either ‘lost’
along the way, or will experience his/her first actual event from the risk set $\{1,\dots,R\}$ at some
time ti ≥ C, where C is the duration of our trial. This latter option is called ‘right censoring’.
Alternative types of censoring (which we will not deal with here) are ‘left censoring’, i.e. ti ≤ C for
some C, or ‘interval censoring’, i.e. ti ∈ [C1, C2] for some C1, C2. Here we label all the patients i
that are censored as ∆i = 0, and use Xi to denote the time where they left our trial.
Complications. The main complications in survival analysis are caused by (i) the statistical ‘noise’
caused by censoring, (ii) the fact that different risks prevent each other from happening (or from
being observed), e.g. if a patient dies we will never know whether and when he/she would have
got the disease of interest, (iii) possible correlations between the different risks, (iv) heterogeneity
in cohorts (in terms of covariates, and in terms of what covariates imply in terms of risks), and
(v) the fact that most studies are ‘underpowered’, i.e. we want to extract complicated statistical
patterns from data on relatively small patient cohorts, which brings the danger of overfitting
and non-reproducibility of results.
2.2. Survival probability and cause-specific hazard rates
Joint event times and survival function. Imagine the hypothetical situation where for each individual
i all events r = 0 . . . R could in principle be observed (irrespective of their nature and their order
in time), and let tr denote the time at which event r occurs. If we assume also that all events will
ultimately always happen (some earlier and some later, and some perhaps only at times so late as
to be practically irrelevant), we write the joint distribution for individual i of the event times as
$$P_i(t_0,\dots,t_R) \qquad (1)$$
Since we assume that each event will ultimately happen, this distribution must be normalised, so
$$\int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R) = 1 \qquad (2)$$
We can next define the integrated event time distribution for individual i:
$$S_i(t_0,\dots,t_R) = \int_{t_0}^\infty\!\!\cdots\!\int_{t_R}^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R) = \int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R)\prod_{r=0}^R \theta(s_r-t_r) \qquad (3)$$
with the step function defined as θ(z>0) = 1 and θ(z<0) = 0. $S_i(t_0,\dots,t_R)$ gives the probability
that for individual i event 0 occurs later than $t_0$, and event 1 occurs later than $t_1$, and ... etc. Note
that
$$S_i(0,\dots,0) = \int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R) = 1 \qquad (4)$$
We can now define the survival function $S_i(t)$ as the probability that for individual i all events
r = 0 ... R will happen later than time t:
$$S_i(t) = S_i(t_0,\dots,t_R)\Big|_{t_r=t\ \forall r} = S_i(t,t,\dots,t) \qquad (5)$$
Cause-specific hazard rates. We next want to characterise for each individual risk how likely it is to
trigger an event as a function of time, and how it impacts on the overall survival probability Si(t)
defined above. This is done via the so-called cause-specific hazard rates, defined as
$$\pi_\mu^i(t) = -\Big[\frac{\partial}{\partial t_\mu}\log S_i(t_0,\dots,t_R)\Big]_{t_r=t\ \forall r} \qquad (6)$$
Whenever we write log(.) we will mean the natural logarithm. Inserting the definition of $S_i(t)$ above,
and using $\frac{d}{dz}\theta(z) = \delta(z)$ (see Appendix A on the δ-distribution) allows us to work this out:
$$\pi_\mu^i(t) = \Bigg[\frac{\int_0^\infty\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R)\,\delta(s_\mu-t_\mu)\prod_{r\neq\mu}\theta(s_r-t_r)}{S_i(t_0,\dots,t_R)}\Bigg]_{t_r=t\ \forall r} = \frac{\int_t^\infty\!\cdots\!\int_t^\infty\big(\prod_{r\neq\mu}ds_r\big)\ P_i(s_0,\dots,s_{\mu-1},t,s_{\mu+1},\dots,s_R)}{S_i(t)} \qquad (7)$$
[Figure 1 appeared here: a data table with columns pat, BMI, SELENIUM, PHYS_ACT_LEIS,
PHYS_ACT_WORK, Smoking, Time and censoring label.]

Figure 1. Sample survival data from the ULSAM prostate cancer study. Column one:
patient label i. Columns two to six: values of five covariates $(Z_{i1},\dots,Z_{i5})$, of which four
are modifiable (BMI, leisure time physical activity, physical activity at work, smoking) and
one is uncontrolled (selenium level in the blood). Last two columns: event time and label
$(X_i,\Delta_i)$. Entries '-10' refer to missing data.
11
Hence $\pi_\mu^i(t)\,dt$ gives the probability for individual i that event µ happens in the time interval
$[t, t+dt)$, given that no event has happened to i yet prior to time t:
$$\pi_\mu^i(t)\,dt = {\rm Prob}\Big(t_\mu^i\in[t,t+dt)\ \Big|\ \mbox{$i$ had no events yet at time }t\Big) \qquad (dt\downarrow 0) \qquad (8)$$
Since $\pi_\mu^i(t)$ gives a probability of hazardous events per unit time, it is called a 'hazard rate'. It
depends on which risk µ we are discussing, hence it is 'cause-specific'. The subtlety is in the
conditioning: it is defined conditional on the individual still being event-free at the relevant time.
Survival function in terms of cause-specific hazard rates. It turns out that the overall survival
probability Si(t) can be written in terms of the cause-specific hazard rates, in a simple way. To see
this we calculate
$$\frac{d}{dt}\log S_i(t) = \frac{d}{dt}\log S_i(t,t,\dots,t) = \sum_{r=0}^R\Big[\frac{\partial}{\partial t_r}\log S_i(t_0,\dots,t_R)\Big]_{t_r=t\ \forall r} = -\sum_{r=0}^R \pi_r^i(t) \qquad (9)$$
Hence, using $S_i(0) = 1$,
$$\log S_i(t) = \log S_i(0) - \sum_{r=0}^R\int_0^t ds\ \pi_r^i(s) = -\sum_{r=0}^R\int_0^t ds\ \pi_r^i(s) \qquad (10)$$
so
$$S_i(t) = e^{-\sum_{r=0}^R\int_0^t ds\ \pi_r^i(s)} \qquad (11)$$
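As a quick numerical illustration of (11) (my own sketch, not from the notes): for a single risk
with hazard rate π(t) = 2t the cumulative hazard is t², so (11) predicts $S(t) = e^{-t^2}$; event
times with this hazard can be sampled by inverse transform as $T = \sqrt{-\log U}$ with U uniform
on (0,1).

import numpy as np

rng = np.random.default_rng(0)
T = np.sqrt(-np.log(rng.random(200_000)))   # event times with hazard pi(t) = 2t
for t in (0.5, 1.0, 1.5):
    print((T > t).mean(), np.exp(-t**2))    # empirical S(t) vs prediction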
Data likelihood in terms of cause-specific hazard rates. Similarly we can express also the likelihood
Pi(X,∆)dX to observe in our trial patient i reporting a first event of type ∆ at a time in the interval
[X,X + dX) (with dX ↓ 0) in terms of the cause specific hazard rates. To observe the above the
following three statements must be true:
• the time of the event is in [X,X + dX),
• the type of the event is ∆, and
• no events occurred prior to X.
This can all be written in terms of properties of the joint event times $(t_0,\dots,t_R)$ of individual i:
$$\theta(t_\Delta-X)\,\theta(X+dX-t_\Delta)\prod_{r\neq\Delta}\theta(t_r-X) = 1 \qquad (12)$$
and the likelihood $P_i(X,\Delta)$ can therefore be written as ‡
$$P_i(X,\Delta) = \lim_{dX\downarrow 0}\frac{1}{dX}\,{\rm Prob}_i\Big(\theta(t_\Delta-X)\,\theta(X+dX-t_\Delta)\prod_{r\neq\Delta}\theta(t_r-X) = 1\Big)$$
$$= \lim_{dX\downarrow 0}\frac{1}{dX}\int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R)\,\theta(t_\Delta-X)\,\theta(X+dX-t_\Delta)\prod_{r\neq\Delta}\theta(t_r-X)$$
$$= \int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R)\,\lim_{\varepsilon\downarrow 0}h_\varepsilon(t_\Delta-X)\prod_{r\neq\Delta}\theta(t_r-X) \qquad (13)$$
‡ Note that we implicitly assume that the joint event time distribution $P_i(t_0,\dots,t_R)$ is continuous and
smooth, so that the probability of seeing ties in the timing of events, i.e. $t_\mu = t_\nu$ for $\mu\neq\nu$, is negligible.
with
$$h_\varepsilon(z) = \varepsilon^{-1}\theta(z)\,\theta(\varepsilon-z) = \begin{cases} \varepsilon^{-1} & \mbox{for } z\in[0,\varepsilon]\\ 0 & \mbox{elsewhere}\end{cases} \qquad (14)$$
We note that the function $\lim_{\varepsilon\downarrow 0}h_\varepsilon(z)$ has all the properties that define the δ-function (see Appendix
A): $h_\varepsilon(z)\geq 0$ for all $\varepsilon>0$, $\int dz\ h_\varepsilon(z) = 1$ for all $\varepsilon>0$, $\lim_{\varepsilon\downarrow 0}h_\varepsilon(z) = 0$ for all $z\neq 0$, and
$\lim_{\varepsilon\downarrow 0}h_\varepsilon(0) = \infty$. So $\lim_{\varepsilon\downarrow 0}h_\varepsilon(z) = \delta(z)$, and we get
$$P_i(X,\Delta) = \int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R)\,\delta(t_\Delta-X)\prod_{r\neq\Delta}\theta(t_r-X) = S_i(X)\,\pi_\Delta^i(X) = \pi_\Delta^i(X)\,e^{-\sum_{r=0}^R\int_0^X ds\ \pi_r^i(s)} \qquad (15)$$
where we used (7) in the first step, and (11) in the second. So the survival probabilities Si(t) and the
data likelihoods Pi(X,∆) can both be written strictly in terms of the cause-specific hazard rates.
The final picture is as in the diagram below. We can therefore anticipate that in all our statistical
analyses of the data the cause-specific hazard rates will play a central role.
$$\underbrace{P_1(t_0,\dots,t_R)\ \ \dots\dots\ \ P_N(t_0,\dots,t_R)}_{\mbox{individual event time statistics}}$$
$$\Downarrow$$
$$\underbrace{(\pi_0^1(t),\dots,\pi_R^1(t))\ \ \dots\dots\ \ (\pi_0^N(t),\dots,\pi_R^N(t))}_{\mbox{individual hazard rates}}$$
$$\Downarrow$$
$$\underbrace{(X_1,\Delta_1)\ \ \dots\dots\ \ (X_N,\Delta_N)}_{\mbox{observed survival data}}$$
Starting our description from the distributions Pi(t0, . . . , tR) was useful in terms of understanding
how cause-specific hazard rates πir(t) emerge, but working directly at the level of these rates has
advantages. It avoids us having to think in terms of the event times (t0, . . . , tR) and their distribution
(which refers to a hypothetical situation where all event times could be observed - including e.g. the
onset of a disease after death). Secondly, we will see that upon using the hazard rates as a starting
point we can also deal in a transparent way with events that have a nonzero probability of never
happening; these we have so far ruled out as a result of starting with a normalised Pi(t0, . . . , tR).
Cause-specific hazard rates in terms of data probabilities. We have seen that the data probabilities
Pi(X,∆) can be written fully in terms of the cause specific hazard rates πir(t). It turns out that the
converse is also true, i.e. the cause-specific hazard rates can be written fully and explicitly in terms
of the data probabilities Pi(X,∆). To see this, let us first sum over ∆ in (15)
$$\sum_{\Delta=0}^R P_i(X,\Delta) = \Big(\sum_{\Delta=0}^R \pi_\Delta^i(X)\Big)\,e^{-\sum_{r=0}^R\int_0^X ds\ \pi_r^i(s)} = -\frac{d}{dX}\,e^{-\sum_{r=0}^R\int_0^X ds\ \pi_r^i(s)} \qquad (16)$$
Hence
$$e^{-\sum_{r=0}^R\int_0^X ds\ \pi_r^i(s)} = 1 - \int_0^X dt\sum_{r=0}^R P_i(t,r) = \sum_{r=0}^R\int_X^\infty dt\ P_i(t,r) \qquad (17)$$
Substituting this into the right-hand side of (15), followed by re-arranging in order to make the
hazard rate the subject of the equation, then immediately gives us
$$\pi_\Delta^i(X) = \frac{P_i(X,\Delta)}{\sum_{r=0}^R\int_X^\infty dt\ P_i(t,r)} \qquad (18)$$
So, if we wanted, we could build our theory entirely in the language of data probabilities $P_i(X,\Delta)$,
as opposed to the language of the cause-specific hazard rates. Summation over all ∆ on both sides of
(18) gives another transparent and useful identity, relating the cumulative hazard rate $\sum_{r=0}^R\pi_r^i(X)$
(i.e. the rate of events, irrespective of type, conditional on there not having been any events prior
to time X) to the distribution $P_i(X) = \sum_{r=0}^R P_i(X,r)$ of reported event times (of any type):
$$\sum_{r=0}^R\pi_r^i(X) = \frac{\sum_{r=0}^R P_i(X,r)}{\sum_{r=0}^R\int_X^\infty dt\ P_i(t,r)} = \frac{P_i(X)}{\int_X^\infty dt\ P_i(t)} \qquad (19)$$
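Identity (19) is easily checked numerically (my own sketch; the constant hazard 0.7 and the
binning are arbitrary): estimate the event time density by a histogram and the tail probability
by counting, and the ratio recovers the hazard rate in every bin.

import numpy as np

rng = np.random.default_rng(1)
pi_true = 0.7
T = rng.exponential(1/pi_true, 500_000)     # one risk, constant hazard pi_true

edges = np.linspace(0.0, 3.0, 31)
mids = (edges[:-1] + edges[1:]) / 2
dens = np.histogram(T, edges)[0] / (len(T) * np.diff(edges))   # P_i(X) per bin
tail = np.array([(T >= m).mean() for m in mids])               # tail probability
print(dens / tail)                          # each entry is ~0.7, cf. (19)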
Possible pitfalls and misconceptions. The most tricky aspect of survival analysis is its formulation
in terms of cause-specific hazard rates, which involve nontrivial conditioning at any time t on there
not having been any event prior to time t. This causes interpretation issues. For instance
• Expression (11) can be written in a form that factorises over the different risks, as
$S_i(t) = \prod_r \exp[-\int_0^t ds\ \pi_r^i(s)]$. Does this imply that the risks are uncorrelated? No. All
risks $r\neq\mu$ will generally contribute to each $\pi_\mu^i(t)$, since the alternative risks modify
the conditioning, i.e. the likelihood that nothing has happened yet prior to t. The risks
may well interact strongly with each other, but we can no longer see this after we have
calculated the rates $\pi_\mu^i(t)$ and forgotten about the times $(t_0,\dots,t_R)$.
• Starting from the survival function (11), do we get the survival function for the
hypothetical situation where risk µ is disabled by setting $\pi_\mu^i(t)$ to zero, i.e.
$S_i(t) \to \exp[-\sum_{r\neq\mu}\int_0^t ds\ \pi_r^i(s)]$? No. We would indeed have $\pi_\mu^i(t) = 0$ for all t, but that is not
all. If we disable a risk µ, all other risks will in principle be more likely to happen first,
and hence the removal of risk µ changes in principle also all hazard rates $\pi_r^i(t)$ with $r\neq\mu$.
2.3. Examples
Example 1: time-independent hazard rates
Here we have $\pi_r^i(t) = \pi_r^i$, independent of t, for all (r,i). Thus $\int_0^X ds\ \pi_\mu^i(s) = \pi_\mu^i X$. This gives
the following simple formulae for the survival function and the data likelihood:
$$S_i(t) = e^{-t\sum_{r=0}^R \pi_r^i}, \qquad P_i(X,\Delta) = \pi_\Delta^i\,e^{-X\sum_{r=0}^R \pi_r^i} \qquad (20)$$
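Formula (20) is easy to probe by simulation (my own sketch; the rate values are arbitrary):
give each risk an independent exponential time and record the smallest one together with its
label. The empirical survival curve and the event-type fractions then reproduce (20).

import numpy as np

rng = np.random.default_rng(2)
rates = np.array([0.5, 1.0, 1.5])                  # pi_r for r = 0, 1, 2
t_all = rng.exponential(1/rates, (200_000, 3))     # one candidate time per risk
X = t_all.min(axis=1)                              # first-event time
Delta = t_all.argmin(axis=1)                       # label of the first event

print((X > 0.4).mean(), np.exp(-0.4*rates.sum()))          # S(t) at t = 0.4
print(np.bincount(Delta)/len(Delta), rates/rates.sum())    # Prob(Delta)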
Example 2: a single risk r = 1
Suppose we have only one hazard, r = 1, and for each patient i a single hazard rate $\pi^i(t)$:
$$S_i(t) = e^{-\int_0^t ds\ \pi^i(s)}, \qquad P_i(X,\Delta) = \pi^i(X)\,e^{-\int_0^X ds\ \pi^i(s)}\,\delta_{\Delta,1} \qquad (21)$$
In this case we can write the event time distribution in terms of the hazard rate via (3):
$$P_i(t) = -\frac{d}{dt}S_i(t) = \pi^i(t)\,e^{-\int_0^t ds\ \pi^i(s)} \qquad (22)$$
which is of course no surprise in view of (15). Now it makes perfect sense to think in terms
of $P_i(t)$: there is only one risk, so there is nothing hypothetical about the event time t (as no other
events can prevent it from being observed).
If we have one risk only, and moreover a time-independent hazard rate (i.e. a combination of
the two examples discussed above), we obtain
$$S_i(t) = e^{-t\pi^i}, \qquad P_i(X,\Delta) = P_i(X)\,\delta_{\Delta,1}, \qquad P_i(t) = \pi^i e^{-t\pi^i} \qquad (23)$$
Example 3: the most probable event time distribution for R = 1, given the value of the average
Finally, let us show how the exponential distribution of event times in (21) can be seen as
the simplest natural choice for the case R = 1, in an information-theoretic sense. Suppose
the only knowledge we have of $P_i(t)$ is the value of the average event time $\langle t\rangle_i$. The
most probable distribution $P_i(t)$ with this average is found by maximising the Shannon
entropy $H_i = -\int_0^\infty dt\ P_i(t)\log P_i(t)$, subject to the two constraints $\int_0^\infty dt\ P_i(t) = 1$ and
$\int_0^\infty dt\ P_i(t)\,t = \langle t\rangle_i$ §. The maximum is found via the Lagrange method:
$$\frac{\delta}{\delta P_i(x)}\int_0^\infty ds\ P_i(s)\log P_i(s) = \frac{\delta}{\delta P_i(x)}\Big[\lambda_0\int_0^\infty ds\ P_i(s) + \lambda_1\int_0^\infty ds\ P_i(s)\,s\Big]$$
$$1 + \log P_i(x) = \lambda_0 + \lambda_1 x \qquad\mbox{so}\qquad P_i(t) = e^{\lambda_0-1+\lambda_1 t}$$
We note that $\lambda_1 < 0$ is required for $P_i(t)$ to be normalisable. Normalisation gives
$$1 = e^{\lambda_0-1}\int_0^\infty dt\ e^{\lambda_1 t} = -\lambda_1^{-1}e^{\lambda_0-1}$$
So $e^{\lambda_0-1} = -\lambda_1$, giving $P_i(t) = |\lambda_1|e^{-|\lambda_1|t}$. Finally we demand that the average time is $\langle t\rangle_i$:
$$\langle t\rangle_i = \int_0^\infty dt\ t\,|\lambda_1|e^{-|\lambda_1|t} = \frac{1}{|\lambda_1|}\int_0^\infty ds\ s\,e^{-s} = \frac{1}{|\lambda_1|}\Big(\big[-s\,e^{-s}\big]_0^\infty + \int_0^\infty ds\ e^{-s}\Big) = \frac{1}{|\lambda_1|}\Big(0 - \big[e^{-s}\big]_0^\infty\Big) = \frac{1}{|\lambda_1|}$$
Hence
$$P_i(t) = \pi^i\,e^{-t\pi^i} \qquad\mbox{with}\qquad \pi^i = 1/\langle t\rangle_i \qquad (24)$$
§ Strictly speaking we must demand also that $P_i(t)\geq 0$ for all $t\geq 0$, but it turns out that this latter demand
will be satisfied automatically.
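A quick numerical cross-check of this maximum-entropy property (my own sketch, using scipy's
entropy routines): among a few distributions on [0,∞) with the same mean ⟨t⟩ = 1, the
exponential indeed has the largest differential entropy.

from scipy import stats

candidates = [stats.expon(scale=1.0),             # exponential, mean 1
              stats.gamma(a=2.0, scale=0.5),      # gamma, mean 1
              stats.uniform(loc=0.0, scale=2.0)]  # uniform on [0,2], mean 1
for dist in candidates:
    print(dist.dist.name, float(dist.entropy()))
# expon ~1.000 > gamma ~0.884 > uniform ~0.693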
3. Event time correlations and the identifiability problem
We have seen that knowing the joint event time statistics Pi(t0, . . . , tR) of an individual i allows us
to calculate the cause-specific hazard rates πi0(t), . . . , πiR(t). We now ask the following: if we know
the cause-specific hazard rates, which we expect can be estimated from the data via (15), can we
deduce from this the distribution Pi(t0, . . . , tR)? In particular, can we deduce from the hazard rates
whether or not the event times of different risks are statistically independent? This will become
important when we turn to competing risks later.
3.1. Independently distributed event times
If the event times are all uncorrelated, i.e. if knowing one such time conveys no information on the
others, the joint distribution factorises by definition into the simple form
$$P_i(t_0,\dots,t_R) = \prod_{r=0}^R P_r^i(t_r) \qquad (25)$$
Via (3) we then get
$$S_i(t_0,\dots,t_R) = \prod_{r=0}^R\int_0^\infty ds_r\ P_r^i(s_r)\,\theta(s_r-t_r) = \prod_{r=0}^R S_r^i(t_r) \qquad (26)$$
$$S_r^i(t) = \int_t^\infty ds\ P_r^i(s) \qquad (27)$$
So the probability to observe that each event r happens at a time later than $t_r$ is just the product of
the individual survival probabilities $S_r^i(t_r)$ for the risks. The cause-specific hazard rates (6) become
$$\pi_\mu^i(t) = -\Big[\frac{\partial}{\partial t_\mu}\sum_{r=0}^R\log S_r^i(t_r)\Big]_{t_r=t\ \forall r} = -\Big[\frac{\partial}{\partial t_\mu}\log S_\mu^i(t_\mu)\Big]_{t_\mu=t} = -\frac{d}{dt}\log S_\mu^i(t) \qquad (28)$$
Hence, if we integrate both sides, and use $S_r^i(0) = 1$ (no events have occurred yet at time t = 0):
$$\log S_\mu^i(t) = \log S_\mu^i(0) - \int_0^t ds\ \pi_\mu^i(s) = -\int_0^t ds\ \pi_\mu^i(s) \qquad (29)$$
giving, as expected,
$$S_\mu^i(t) = e^{-\int_0^t ds\ \pi_\mu^i(s)} \qquad (30)$$
If we now differentiate (27) and use our formula for $S_r^i(t)$, we find that we can express the event
time probabilities for each risk in terms of the associated hazard rates. This results in the following
generalisation to multiple independent risks of formula (22):
$$P_r^i(t) = -\frac{d}{dt}S_r^i(t) = -\frac{d}{dt}e^{-\int_0^t ds\ \pi_r^i(s)} = \pi_r^i(t)\,e^{-\int_0^t ds\ \pi_r^i(s)} \qquad (31)$$
3.2. The (Tsiatis) identifiability problem
We have seen above that for the special case of statistically independent event times one can indeed
calculate the event time probabilities uniquely from the cause-specific hazard rates. However, we
can also deduce something else from the above derivation:
For any set of cause-specific hazard rates $\pi_0^i(t),\dots,\pi_R^i(t)$, including those that
correspond to statistically dependent event times, there always exists a distribution for
independent event times that will give exactly the same cause-specific hazard rates, namely
$$P_i(t_0,\dots,t_R) = \prod_{r=0}^R\Big[\pi_r^i(t_r)\,e^{-\int_0^{t_r}ds\ \pi_r^i(s)}\Big] \qquad (32)$$
It follows that knowledge of the cause-specific hazard rates (which is all we may ever hope to extract
from survival data alone) does not generally permit us to identify the underlying joint distribution
of event times; in particular, we cannot find out from survival data alone whether or not the event
times of the different risks are statistically independent. This is Tsiatis' identifiability problem.
Tsiatis’ result appears to have created some pessimism in the past as to what can be achieved
with statistical analyses, especially in the context of so-called ‘competing risks’. We will turn to
these in more detail later; for now let us just say that ‘competing risks’ describes the situation where
at the level of populations or trial cohorts the event times of different risks appear to be correlated.
The identifiability problem suggested to some that in the case of competing risks there is not much
that survival analysis can do. Let us counteract this with a few observations:
• If the cohort under study is homogeneous, then $P_i(t_0,\dots,t_R)$ will be identical for all
i and thus also describe the statistics of the cohort; here we do indeed have a problem.
However, if competing risks are due to population-level correlations of hazard rates in
an inhomogeneous population, then the identifiability problem doesn't arise. One could
imagine all individuals having independent event times, i.e. $P_i(t_0,\dots,t_R) = \prod_r P_r^i(t_r)$
for each i, but correlated hazard rates: those individuals with a higher hazard rate for
diabetes might for instance also have a higher hazard rate for pancreatic cancer. Here we
would at population level have $P(t_0,\dots,t_R) \neq \prod_r P_r(t_r)$ and $S(t) \neq \prod_r S_r(t)$. But since
the correlations are now generated at the level of hazard rates (which are in principle
accessible via data), this would represent a competing-risk problem that can be solved.
• Even if we do have correlated event times at the level of individuals, then still all is
not lost. If we only extract hazard rates from our data then we indeed cannot extract
from these the distribution Pi(t0, . . . , tR). But in Bayesian regression we do not require
uniqueness of explanations anyway. We would calculate the likelihood of each possible
explanation Pi(t0, . . . , tR) for the observed hazard rates, and find the most plausible one.
• Finally, it might well be that the 'statistically independent' event time explanation
above, which we can always construct for any set of observed hazard rates, has unwanted or
unlikely mathematical or interpretational features. For instance, to have mathematically
acceptable distributions $P_r^i(t)$ they need to be normalised, i.e. we must demand
$$1 = \int_0^\infty dt\ P_r^i(t) = \int_0^\infty dt\ \pi_r^i(t)\,e^{-\int_0^t ds\ \pi_r^i(s)} = \int_0^\infty dt\Big[-\frac{d}{dt}e^{-\int_0^t ds\ \pi_r^i(s)}\Big] = 1 - e^{-\int_0^\infty ds\ \pi_r^i(s)} \qquad (33)$$
Hence the independent-times explanation for the hazard rates requires that
$$\lim_{t\to\infty}\int_0^t ds\ \pi_r^i(s) = \infty \qquad (34)$$
This is just another way of saying that the probability of event r never occurring must be
zero. We will see in an example below that this is not always satisfied.
3.3. Examples
Example 1: true versus independent-times explanation for observed hazard rates
Let us inspect the following event time distribution for the times $t_1,t_2\geq 0$, with parameters
$a,b,\tau > 0$ and $\varepsilon\in[0,1]$, for a single individual:
$$P(t_1,t_2) = a\,e^{-at_2}\Big[\varepsilon\,\delta(t_1-t_2-\tau) + (1-\varepsilon)\,b\,e^{-bt_1}\Big] \qquad (35)$$
It has the form $P(t_1,t_2) = P(t_1|t_2)P(t_2)$, with
$$P(t_2) = a\,e^{-at_2}, \qquad P(t_1|t_2) = \varepsilon\,\delta(t_1-t_2-\tau) + (1-\varepsilon)\,b\,e^{-bt_1} \qquad (36)$$
$P(t_1,t_2)$ is clearly nonnegative and normalised, so it is a bona fide joint distribution. With
probability $1-\varepsilon$ the two times are statistically independent, and with probability ε event 1
happens precisely a duration τ later than event 2. The integrated distribution $S(t_1,t_2)$ is
$$S(t_1,t_2) = \int_{t_1}^\infty ds_1\int_{t_2}^\infty ds_2\ a\,e^{-as_2}\Big[\varepsilon\,\delta(s_1-s_2-\tau) + (1-\varepsilon)\,b\,e^{-bs_1}\Big]$$
$$= \varepsilon\int_{t_2}^\infty ds_2\ a\,e^{-as_2}\int_{t_1}^\infty ds_1\ \delta(s_1-s_2-\tau) + (1-\varepsilon)\int_{t_2}^\infty ds_2\ a\,e^{-as_2}\int_{t_1}^\infty ds_1\ b\,e^{-bs_1}$$
$$= \varepsilon\int_{t_2}^\infty ds_2\ a\,e^{-as_2}\,\theta(s_2+\tau-t_1) + (1-\varepsilon)\,e^{-bt_1}\int_{t_2}^\infty ds_2\ a\,e^{-as_2}$$
$$= \varepsilon\int_{\max(t_2,t_1-\tau)}^\infty ds_2\ a\,e^{-as_2} + (1-\varepsilon)\,e^{-bt_1-at_2}$$
$$= \varepsilon\,e^{-a\max(t_2,t_1-\tau)} + (1-\varepsilon)\,e^{-bt_1-at_2}$$
$$= \begin{cases} \varepsilon\,e^{-at_2} + (1-\varepsilon)\,e^{-bt_1-at_2} & \mbox{if } t_2 > t_1-\tau\\ \varepsilon\,e^{-a(t_1-\tau)} + (1-\varepsilon)\,e^{-bt_1-at_2} & \mbox{if } t_2 < t_1-\tau\end{cases} \qquad (37)$$
This gives for the survival function $S(t) = S(t,t)$:
$$S(t) = e^{-at}\big(\varepsilon + (1-\varepsilon)\,e^{-bt}\big) \qquad (38)$$
Next we calculate the cause-specific hazard rates for this example, via (6). Since after the
partial differentiations we must set $t_1,t_2\to t$, we need only use the formula that applies for
$t_2 > t_1-\tau$:
$$\pi_1(t) = -\frac{\partial}{\partial t_1}\log S(t_1,t_2)\Big|_{t_1=t_2=t} = -\frac{\partial}{\partial t_1}\log\Big[\varepsilon\,e^{-at_2} + (1-\varepsilon)\,e^{-bt_1-at_2}\Big]\Big|_{t_1=t_2=t}$$
$$= \frac{-\frac{\partial}{\partial t_1}\big[(1-\varepsilon)\,e^{-bt_1-at_2}\big]\big|_{t_1=t_2=t}}{\varepsilon\,e^{-at} + (1-\varepsilon)\,e^{-(a+b)t}} = \frac{b(1-\varepsilon)\,e^{-(a+b)t}}{\varepsilon\,e^{-at} + (1-\varepsilon)\,e^{-(a+b)t}} = \frac{b(1-\varepsilon)\,e^{-bt}}{\varepsilon + (1-\varepsilon)\,e^{-bt}} = b\Big(1 + \frac{\varepsilon}{1-\varepsilon}\,e^{bt}\Big)^{-1} \qquad (39)$$
$$\pi_2(t) = -\frac{\partial}{\partial t_2}\log S(t_1,t_2)\Big|_{t_1=t_2=t} = \frac{-\frac{\partial}{\partial t_2}\big[e^{-at_2}\big(\varepsilon + (1-\varepsilon)\,e^{-bt_1}\big)\big]\big|_{t_1=t_2=t}}{\varepsilon\,e^{-at} + (1-\varepsilon)\,e^{-(a+b)t}} = a \qquad (40)$$
The hazard rate for cause 1 decays monotonically, from the initial value $\pi_1(0) = b(1-\varepsilon)$
down to zero as $t\to\infty$. The hazard rate for cause 2 is independent of time.
We can now calculate the alternative 'independent times' explanation $P_{\rm indep}(t_1,t_2) = P_1(t_1)P_2(t_2)$
for the above cause-specific hazard rates, as given in (32). For this we first require
the time integrals over the hazard rates:
$$\int_0^t ds\ \pi_1(s) = \int_0^t ds\ \frac{b(1-\varepsilon)\,e^{-bs}}{\varepsilon + (1-\varepsilon)\,e^{-bs}} = -\Big[\log\big(\varepsilon + (1-\varepsilon)\,e^{-bs}\big)\Big]_0^t = -\log\big(\varepsilon + (1-\varepsilon)\,e^{-bt}\big) \qquad (41)$$
$$\int_0^t ds\ \pi_2(s) = at \qquad (42)$$
with which we obtain
$$P_1(t_1) = \pi_1(t_1)\,e^{-\int_0^{t_1}ds\ \pi_1(s)} = \pi_1(t_1)\big(\varepsilon + (1-\varepsilon)\,e^{-bt_1}\big) = b(1-\varepsilon)\,e^{-bt_1} \qquad (43)$$
$$P_2(t_2) = a\,e^{-at_2} \qquad (44)$$
However, $P_1(t_1)$ is not normalised to 1 as soon as $\varepsilon > 0$, which follows from the fact that here
condition (34) is violated:
$$\lim_{t\to\infty}\int_0^t ds\ \pi_1(s) = -\lim_{t\to\infty}\log\big(\varepsilon + (1-\varepsilon)\,e^{-bt}\big) = \log(1/\varepsilon) \qquad (45)$$
We also see this by explicit integration:
$$\int_0^\infty dt\ P_1(t) = \int_0^\infty dt\ \pi_1(t)\,e^{-\int_0^t ds\ \pi_1(s)} = \int_0^\infty dt\ \frac{d}{dt}\Big[-e^{-\int_0^t ds\ \pi_1(s)}\Big] = 1 - e^{-\int_0^\infty ds\ \pi_1(s)} = 1 - e^{-\log(1/\varepsilon)} = 1 - \varepsilon \qquad (46)$$
Hence, in the independent-times explanation for the cause-specific hazard rates we have a
probability ε that event 1 will never happen. If e.g. event 1 represents death and event 2 the
onset of some disease, then in the original correlated time distribution death will inevitably
occur, but in the independent-times explanation there is a probability ε of our individual being
immortal.
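The identifiability problem can be made tangible in simulation (my own sketch; parameter
values arbitrary): sampling survival data (X,∆) from the correlated model (35), and from the
independent-times explanation derived above, in which event 1 simply never happens with
probability ε, yields statistically indistinguishable observations.

import numpy as np

rng = np.random.default_rng(3)
a, b, eps, tau, n = 1.0, 2.0, 0.3, 0.5, 200_000

def observe(t1, t2):
    # what a trial records: the first event time and its label
    return np.minimum(t1, t2), np.where(t1 < t2, 1, 2)

# correlated model (35): with prob eps, t1 = t2 + tau, else t1 ~ Exp(b)
t2 = rng.exponential(1/a, n)
dep = rng.random(n) < eps
t1 = np.where(dep, t2 + tau, rng.exponential(1/b, n))
X_c, D_c = observe(t1, t2)

# independent-times explanation: with prob eps event 1 never happens at all
t2 = rng.exponential(1/a, n)
t1 = np.where(rng.random(n) < eps, np.inf, rng.exponential(1/b, n))
X_i, D_i = observe(t1, t2)

print((D_c == 1).mean(), (D_i == 1).mean())   # fraction of type-1 first events
print(X_c.mean(), X_i.mean())                 # mean first-event time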
Example 2: true versus independent-times explanation for observed hazard rates
Let us not think that the above always happens. Inspect the following event time distribution
for the times $t_1,t_2\geq 0$, with a parameter $a > 0$ and a normalisation constant $Z(a)$, again
referring to an individual:
$$P(t_1,t_2) = \frac{1}{Z(a)}\,e^{-a(t_1+t_2)-a^2t_1t_2} \qquad (47)$$
We will need the following function:
$$F(x) = \int_x^\infty ds\ \frac{e^{-s}}{s} \qquad (48)$$
It decreases monotonically, i.e. $F'(x) < 0$, from $F(0) = \infty$ down to $F(\infty) = 0$. Note that
$F'(x) = -e^{-x}/x$. Let us calculate for this example the joint survival probability $S(t_1,t_2)$:
$$S(t_1,t_2) = \int_{t_1}^\infty ds_1\int_{t_2}^\infty ds_2\ P(s_1,s_2) = \frac{1}{Z(a)}\int_{t_1}^\infty ds_1\int_{t_2}^\infty ds_2\ e^{-a(s_1+s_2)-a^2s_1s_2}$$
$$= \frac{1}{a^2Z(a)}\int_{at_1}^\infty ds_1\int_{at_2}^\infty ds_2\ e^{-s_1-s_2-s_1s_2}$$
$$= -\frac{1}{a^2Z(a)}\int_{at_1}^\infty ds_1\ \frac{e^{-s_1}}{1+s_1}\Big[e^{-s_2(1+s_1)}\Big]_{at_2}^\infty$$
$$= \frac{1}{a^2Z(a)}\int_{at_1}^\infty ds_1\ \frac{e^{-s_1-at_2(1+s_1)}}{1+s_1} = \frac{1}{a^2Z(a)}\int_{1+at_1}^\infty du\ \frac{e^{-(u-1)-at_2u}}{u}$$
$$= \frac{e}{a^2Z(a)}\int_{1+at_1}^\infty du\ \frac{e^{-u(1+at_2)}}{u} = \frac{e}{a^2Z(a)}\int_{(1+at_1)(1+at_2)}^\infty dx\ \frac{e^{-x}}{x}$$
$$= \frac{e}{a^2Z(a)}\,F\big((1+at_1)(1+at_2)\big) \qquad (49)$$
The normalisation factor $Z(a)$ follows from using $S(0,0) = 1$ (no events yet at time zero):
$$1 = \frac{e}{a^2Z(a)}\,F(1) \qquad\mbox{so}\qquad Z(a) = e\,F(1)/a^2 \qquad (50)$$
Hence
$$S(t_1,t_2) = F\big((1+at_1)(1+at_2)\big)\big/F(1) \qquad (51)$$
Next we calculate the cause-specific hazard rates, using $F'(x) = -e^{-x}/x$:
$$\pi_1(t) = -\Big[\frac{\partial}{\partial t_1}\log S(t_1,t_2)\Big]_{t_1=t_2=t} = -\Big[\frac{\partial}{\partial t_1}\log F\big((1+at_1)(1+at_2)\big)\Big]_{t_1=t_2=t}$$
$$= -\frac{a(1+at)\,F'\big((1+at)^2\big)}{F\big((1+at)^2\big)} = \frac{a}{1+at}\,\frac{e^{-(1+at)^2}}{F\big((1+at)^2\big)} \qquad (52)$$
and since $S(t_1,t_2)$ is a symmetric function of $(t_1,t_2)$ we get the same for risk 2, i.e. $\pi_2(t) = \pi_1(t)$.
We can rewrite both hazard rates as
$$\pi_r(t) = -\frac{1}{2}\,\frac{d}{dt}\log F\big((1+at)^2\big) \qquad (53)$$
and hence, using $\log F(\infty) = \log 0 = -\infty$,
$$\int_0^\infty dt\ \pi_r(t) = -\frac{1}{2}\int_0^\infty dt\Big[\frac{d}{dt}\log F\big((1+at)^2\big)\Big] = -\frac{1}{2}\Big[\log F\big((1+at)^2\big)\Big]_0^\infty = -\frac{1}{2}\Big(\log F(\infty) - \log F(1)\Big) = \infty \qquad (54)$$
We conclude that condition (34) is satisfied, and we can indeed have an independent-times
explanation for our cause-specific hazard rates with fully normalised event time distributions
(i.e. all events happen at finite times).
4. Incorporating cure as a possible outcome
4.1. The clean way to include cure
So far we have worked with properly normalised joint event time distributions $P_i(t_0,\dots,t_R)$, in which
all events will ultimately occur. We also saw that it is quite possible to define cause-specific hazard
rates for which the corresponding event has a finite probability of not happening at all. The question
here is how this can be incorporated in our formalism in a clean way. The natural solution is to
assign to each risk two random variables, i.e. replace $t_r \to (\tau_r,t_r)$, in which $t_r$ is an event time and
$\tau_r\in\{0,1\}$ tells us whether ($\tau_r=1$) or not ($\tau_r=0$) risk r will actually trigger an event at time $t_r$.
The more general starting point for any individual i would then have to be the distribution
$$P_i(t_0,\dots,t_R;\tau_0,\dots,\tau_R) \qquad (55)$$
which is now normalised according to
$$\sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R;\tau_0,\dots,\tau_R) = 1 \qquad (56)$$
The probability that for individual i all events will actually happen, for instance, would be
$$P_i(\tau_0\!=\!1,\dots,\tau_R\!=\!1) = \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R;\tau_0,\dots,\tau_R)\prod_r\delta_{\tau_r,1}$$
$$= \int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R;1,1,\dots,1) \qquad (57)$$
We next define the integrated event time distribution for individual i, i.e. the probability that event
0 has not happened yet at time $t_0$, event 1 hasn't happened yet at time $t_1$, etc. The conditions for
this are now somewhat more involved: for each r we demand that either $\tau_r = 0$ or the event time
$s_r$ is later than $t_r$, i.e.
$$\prod_{r=0}^R\Big[\tau_r\,\theta(s_r-t_r) + (1-\tau_r)\Big] = 1 \qquad (58)$$
Hence
$$S_i(t_0,\dots,t_R) = \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R)\prod_{r=0}^R\Big[\tau_r\,\theta(s_r-t_r) + (1-\tau_r)\Big] \qquad (59)$$
As before, nothing is assumed to have happened yet at time zero, so
$$S_i(0,\dots,0) = \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R)\prod_{r=0}^R\Big[\tau_r\,\theta(s_r) + (1-\tau_r)\Big]$$
$$= \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R) = 1 \qquad (60)$$
The survival function $S_i(t)$, i.e. the probability that for individual i all events r = 0 ... R will
happen later than time t, now becomes
$$S_i(t) = S_i(t,t,\dots,t) = \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R)\prod_{r=0}^R\Big[\tau_r\,\theta(s_r-t) + (1-\tau_r)\Big] \qquad (61)$$
In contrast to our earlier formulation, we now need no longer find $S_i(\infty) = 0$. Here we get
$$S_i(\infty) = \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R)\prod_{r=0}^R(1-\tau_r)$$
$$= \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1 P_i(\tau_0,\dots,\tau_R)\prod_{r=0}^R(1-\tau_r) \qquad (62)$$
This is the probability that all variables $\tau_r$ are zero, i.e. that none of the events occur. In practice,
we normally include the end-of-trial risk r = 0 for the specific purpose of assigning an event to
each individual, so we would choose to have $P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R) = 0$ if $\tau_0\neq 1$; this ensures that
$S_i(\infty) = 0$ (even if none of the medical events happen, at least end-of-trial will always kick in).
Cause-specific hazard rates. At this stage matters seem to get a bit more tricky, but in fact
everything proceeds as before, just with slightly more complicated formulae. To keep our notation
compact we will henceforth use the following short-hands: $\sum_{\tau_0\dots\tau_R}\dots = \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\dots$ and
$\int ds_0\cdots ds_R\ \dots = \int_0^\infty\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ \dots$. Our expressions also compactify if we use the identity
$$\tau_r\,\theta(s_r-t_r) + (1-\tau_r) = 1 - \tau_r\,\theta(t_r-s_r) \qquad (63)$$
We now define the usual cause-specific hazard rates
$$\pi_\mu^i(t) = -\Big[\frac{\partial}{\partial t_\mu}\log S_i(t_0,\dots,t_R)\Big]_{t_r=t\ \forall r} \qquad (64)$$
Working this out for our present function $S_i(t_0,\dots,t_R)$ gives
$$\pi_\mu^i(t) = \Bigg[\frac{\sum_{\tau_0\dots\tau_R}\tau_\mu\int ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R)\,\delta(s_\mu-t_\mu)\prod_{r\neq\mu}\big[1-\tau_r\theta(t_r-s_r)\big]}{S_i(t_0,\dots,t_R)}\Bigg]_{t_r=t\ \forall r}$$
$$= \frac{\sum_{\tau_0\dots\tau_R}\tau_\mu\int ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R)\,\delta(s_\mu-t)\prod_{r\neq\mu}\big[1-\tau_r\theta(t-s_r)\big]}{S_i(t)} \qquad (65)$$
Hence $\pi_\mu^i(t)\,dt$ still gives the probability for individual i that event µ happens in the time interval
$[t, t+dt)$, given that no event has happened to i yet prior to time t. What has changed is that
finding a nonzero value now requires having $\tau_\mu = 1$ (hence the new factor in the numerator), and
that the conditioning on the events other than µ has become somewhat more involved.
Survival function and data likelihood. Let us find out which of the properties involving the survival
function survive our generalisation to include cure as an outcome. Since definition
(64) is still valid, and since still $S_i(0) = 1$ (no events yet at time zero), our earlier simple expression
for the survival function in terms of hazard rates still holds:
$$\frac{d}{dt}\log S_i(t) = \frac{d}{dt}\log S_i(t,t,\dots,t) = \sum_{r=0}^R\Big[\frac{\partial}{\partial t_r}\log S_i(t_0,\dots,t_R)\Big]_{t_r=t\ \forall r} = -\sum_{r=0}^R \pi_r^i(t) \qquad (66)$$
and thus we continue to have
$$S_i(t) = e^{-\sum_{r=0}^R\int_0^t ds\ \pi_r^i(s)} \qquad (67)$$
To see event ∆ first, in time interval [X,X + dX), the following conditions need to be met:
• τ∆ = 1
• the time of event ∆ is in [X,X + dX)
• no events occurred prior to X
The combination can be written compactly in terms of $(t_0,\dots,t_R)$ and $(\tau_0,\dots,\tau_R)$ as
$$\tau_\Delta\,\theta(t_\Delta-X)\,\theta(X+dX-t_\Delta)\prod_{r\neq\Delta}\big[1-\tau_r\,\theta(X-t_r)\big] = 1 \qquad (68)$$
So the probability $P_i(X,\Delta)$ per unit time of this happening, for infinitesimally small time intervals
dX, becomes
$$P_i(X,\Delta) = \lim_{dX\downarrow 0}\frac{1}{dX}\sum_{\tau_0\dots\tau_R}\int dt_0\cdots dt_R\ P_i(t_0,\dots,t_R;\tau_0,\dots,\tau_R)\ \tau_\Delta\,\theta(t_\Delta-X)\,\theta(X+dX-t_\Delta)\prod_{r\neq\Delta}\big[1-\tau_r\,\theta(X-t_r)\big]$$
$$= \sum_{\tau_0\dots\tau_R}\tau_\Delta\int dt_0\cdots dt_R\ P_i(t_0,\dots,t_R;\tau_0,\dots,\tau_R)\,\delta(t_\Delta-X)\prod_{r\neq\Delta}\big[1-\tau_r\,\theta(X-t_r)\big]$$
$$= \pi_\Delta^i(X)\,S_i(X) = \pi_\Delta^i(X)\,e^{-\sum_{r=0}^R\int_0^X ds\ \pi_r^i(s)} \qquad (69)$$
So also this relation continues to hold. This is nice, since it implies that at the level of survival
functions and hazard rates we don’t need to change anything – we now know that if there are risks
with nonzero probability of the associated event never happening, then we can describe this also at
the level of event times if we want to.
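A minimal simulation sketch of the $(t_r,\tau_r)$ formulation, for a single risk with a cure
probability (my own illustration; parameter names arbitrary): individuals with τ = 0 never
trigger the event, so the empirical survival function levels off at the cure probability, giving
$S_i(\infty) > 0$ exactly as in (62).

import numpy as np

rng = np.random.default_rng(4)
n, cure, b = 100_000, 0.3, 2.0
tau = (rng.random(n) >= cure).astype(int)    # tau = 0: event never happens
t = rng.exponential(1/b, n)                  # event time, used only when tau = 1

def S_emp(u):
    # survival at u: either the event never happens, or it happens later than u
    return np.mean((tau == 0) | (t > u))

for u in (0.0, 1.0, 50.0):
    print(S_emp(u), cure + (1-cure)*np.exp(-b*u))   # empirical vs exact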
4.2. The quick and dirty way to include cure
An alternative way to include cure, often found in textbooks and papers (regretfully), is to extend
the time set $[0,\infty)$ and include $t_\mu = \infty$ in the event time distribution; events that don't happen are
said to happen at $t = \infty$. For instance, in the case where there is just one risk we would write the
symbolic expression
$$\mathcal{P}_i(t) = \varepsilon_i P_i(t) + (1-\varepsilon_i)\,\delta(t-\infty) \qquad (70)$$
with $\mathcal{P}_i(t)$ the extended distribution, and $P_i(t)$ an ordinary normalised distribution, describing
event time statistics for the case where the event does happen, which would be the case with
probability $\varepsilon_i = {\rm Prob}_i(\tau=1)$ (in terms of our previous set-up). We would then define
$\int_0^\infty dt\ \delta(t-\infty) = 1$, and find
$$\int_0^\infty dt\ \mathcal{P}_i(t) = \varepsilon_i + (1-\varepsilon_i)\int_0^\infty dt\ \delta(t-\infty) = 1 \qquad (71)$$
but
$$\lim_{X\to\infty}\int_0^X dt\ \mathcal{P}_i(t) = \varepsilon_i\lim_{X\to\infty}\int_0^X dt\ P_i(t) = \varepsilon_i \qquad (72)$$
The survival function and the hazard rate would at finite times become
$$S_i(t) = \int_t^\infty ds\ \mathcal{P}_i(s) = 1 - \int_0^t ds\ \mathcal{P}_i(s) = 1 - \varepsilon_i\int_0^t ds\ P_i(s) \qquad (73)$$
$$\pi^i(t) = -\frac{d}{dt}\log S_i(t) = \mathcal{P}_i(t)/S_i(t) = \varepsilon_i P_i(t)/S_i(t) \qquad (74)$$
And so we find
$$S_i(t) = e^{-\int_0^t ds\ \pi^i(s)}, \qquad \pi^i(t)\,e^{-\int_0^t ds\ \pi^i(s)} = \varepsilon_i P_i(t) \qquad (75)$$
We can now express $\varepsilon_i$ (the probability of the event actually happening) by integration of both
sides of the second identity over time, since $P_i(t)$ is normalised:
$$\varepsilon_i = \int_0^\infty dt\ \pi^i(t)\,e^{-\int_0^t ds\ \pi^i(s)} = -\int_0^\infty dt\Big[\frac{d}{dt}e^{-\int_0^t ds\ \pi^i(s)}\Big] = 1 - e^{-\int_0^\infty ds\ \pi^i(s)} \qquad (76)$$
This makes sense, since we know that $\int_0^\infty ds\ \pi^i(s) < \infty$ is indeed the condition for finding a finite
'no event' probability. We see that we can now also write the initial time distribution $\mathcal{P}_i(t)$ as
$$\mathcal{P}_i(t) = \pi^i(t)\,e^{-\int_0^t ds\ \pi^i(s)} + e^{-\int_0^\infty ds\ \pi^i(s)}\,\delta(t-\infty) \qquad (77)$$
It is in principle possible to analyse the situation this way, but it is mathematically somewhat
messy. For instance: we have had to give up the standard convention of calculus that
$\int_0^\infty ds\ G(s) = \lim_{z\to\infty}\int_0^z ds\ G(s)$, so we would always have to indicate whether we mean one or the other. It is
therefore more prone to mistakes. And as soon as we ask about the joint distribution $\mathcal{P}_i(t_0,\dots,t_R)$
it all becomes even worse ...
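Interestingly, in numerical work the same bookkeeping appears in a harmless form (my own
sketch): never-happening events can be represented by the event time np.inf, and the finite-time
statements (72,73) then come out directly.

import numpy as np

rng = np.random.default_rng(5)
n, eps, b = 100_000, 0.6, 1.5    # eps: probability that the event does happen
t = np.where(rng.random(n) < eps, rng.exponential(1/b, n), np.inf)

print(np.isfinite(t).mean())                             # ~eps, cf. (72)
print((t > 1.0).mean(), 1 - eps*(1 - np.exp(-b*1.0)))    # S(1), cf. (73)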
4.3. Examples
Let us return to an earlier example, where an independent-times explanation of cause-specific hazard
rates led to a risk with a nonzero probability of not generating events. We start with the following
cause-specific hazard rates, for an individual subject to two risks:
$$\pi_1(t) = \frac{b(1-\varepsilon)\,e^{-bt}}{\varepsilon + (1-\varepsilon)\,e^{-bt}}, \qquad \pi_2(t) = a \qquad (78)$$
We now try to construct the independent-times explanation $P(t_1,t_2;\tau_1,\tau_2) = P_1(t_1,\tau_1)P_2(t_2,\tau_2)$ for
these hazard rates, so $S(t_1,t_2) = S_1(t_1)S_2(t_2)$ with
$$S_1(t) = \sum_\tau\int_0^\infty ds\ P_1(s,\tau)\big[\tau\,\theta(s-t) + (1-\tau)\big] \qquad (79)$$
$$S_2(t) = \sum_\tau\int_0^\infty ds\ P_2(s,\tau)\big[\tau\,\theta(s-t) + (1-\tau)\big] \qquad (80)$$
We first calculate the two risk-specific survival probabilities:
$$S_1(t) = e^{-\int_0^t ds\ \pi_1(s)} = \exp\Big[-\int_0^t ds\ \frac{b(1-\varepsilon)\,e^{-bs}}{\varepsilon+(1-\varepsilon)\,e^{-bs}}\Big] = \exp\Big[\int_0^t ds\ \frac{d}{ds}\log\big(\varepsilon+(1-\varepsilon)\,e^{-bs}\big)\Big] = \varepsilon + (1-\varepsilon)\,e^{-bt} \qquad (81)$$
$$S_2(t) = e^{-\int_0^t ds\ \pi_2(s)} = e^{-at} \qquad (82)$$
Thus our equations (79,80), from which to calculate $P_1(t,\tau)$ and $P_2(t,\tau)$, become, after working out
the summations over τ:
$$\varepsilon + (1-\varepsilon)\,e^{-bt} = \int_0^\infty ds\ P_1(s,0) + \int_0^\infty ds\ P_1(s,1)\,\theta(s-t) \qquad (83)$$
$$e^{-at} = \int_0^\infty ds\ P_2(s,0) + \int_0^\infty ds\ P_2(s,1)\,\theta(s-t) \qquad (84)$$
The functions $P_{1,2}(s,0)$ are superfluous, since if τ = 0 (i.e. if the event doesn't happen) the associated
event time is not used. We use normalisation and write $\int_0^\infty ds\ P_{1,2}(s,0) = 1 - \int_0^\infty ds\ P_{1,2}(s,1)$, giving
$$\varepsilon + (1-\varepsilon)\,e^{-bt} = 1 - \int_0^\infty ds\ P_1(s,1)\big[1-\theta(s-t)\big] = 1 - \int_0^\infty ds\ P_1(s,1)\,\theta(t-s) = 1 - \int_0^t ds\ P_1(s,1) \qquad (85)$$
$$e^{-at} = 1 - \int_0^\infty ds\ P_2(s,1)\big[1-\theta(s-t)\big] = 1 - \int_0^\infty ds\ P_2(s,1)\,\theta(t-s) = 1 - \int_0^t ds\ P_2(s,1) \qquad (86)$$
Finally we differentiate both sides of both equations. This gives
$$P_1(t,1) = b(1-\varepsilon)\,e^{-bt}, \qquad P_2(t,1) = a\,e^{-at} \qquad (87)$$
This, in turn, means that
$$\int_0^\infty dt\ P_1(t,0) = 1 - \int_0^\infty dt\ P_1(t,1) = 1 - \int_0^\infty dt\ b(1-\varepsilon)\,e^{-bt} = 1 - (1-\varepsilon) = \varepsilon \qquad (88)$$
$$\int_0^\infty dt\ P_2(t,0) = 1 - \int_0^\infty dt\ P_2(t,1) = 1 - \int_0^\infty dt\ a\,e^{-at} = 1 - 1 = 0 \qquad (89)$$
We conclude that
$$P_1(t,\tau) = \varepsilon\,\delta_{\tau,0}P(t) + (1-\varepsilon)\,\delta_{\tau,1}\,b\,e^{-bt} \qquad (90)$$
$$P_2(t,\tau) = \delta_{\tau,1}\,a\,e^{-at} \qquad (91)$$
with some irrelevant normalised distribution P(t) (since for τ = 0 the event to which t refers will
by definition not materialise). This gives in combination:
$$P(t_1,t_2;\tau_1,\tau_2) = \delta_{\tau_2,1}\,a\,e^{-at_2}\Big[\varepsilon\,\delta_{\tau_1,0}P(t_1) + (1-\varepsilon)\,\delta_{\tau_1,1}\,b\,e^{-bt_1}\Big] \qquad (92)$$
This distribution, with a nonzero probability of event 1 never happening, would in terms of observed
survival data be indistinguishable from the original one, where both events always happen:
$$P(t_1,t_2) = a\,e^{-at_2}\Big[\varepsilon\,\delta(t_1-t_2-\tau) + (1-\varepsilon)\,b\,e^{-bt_1}\Big] \qquad (93)$$
Both (92) and (93) give exactly the same cause-specific hazard rates (78). The independent-event-times
explanation for the fraction ε of cases in (93), where in reality event 1 would not be reported
in a trial because it happens at a fixed time later than event 2, is to say that in the fraction ε of
cases event 1 simply does not happen (irrespective of risk 2).
5. Individual versus cohort level survival statistics
5.1. Population level survival functions
The quantities defined so far describe statistical features at the level of individuals. If we want to
characterize the cohort as a whole (or if perhaps we have no information at the level of individuals)
we would work instead with the cohort averages of these functions, i.e.
$$S(t) = \frac{1}{N}\sum_{i=1}^N S_i(t), \qquad P(t_0,\dots,t_R) = \frac{1}{N}\sum_{i=1}^N P_i(t_0,\dots,t_R) \qquad (94)$$
S(t) gives the probability that a randomly picked individual in the cohort will not have experienced
any event prior to time t, and $P(t_0,\dots,t_R)$ gives the probability density for a randomly picked
individual to have joint event times $(t_0,\dots,t_R)$. If we inspect the derivation of $S_i(t)$ from
$P_i(t_0,\dots,t_R)$ we note that we can simply insert $\frac{1}{N}\sum_{i=1}^N$ everywhere and get also
$$S(t) = S(t,t,\dots,t), \qquad S(t_0,\dots,t_R) = \int_{t_0}^\infty\!\!\cdots\!\int_{t_R}^\infty ds_0\cdots ds_R\ P(s_0,\dots,s_R) \qquad (95)$$
In fact, we could have started developing our previous theory fully at population level. This is what
most textbooks do. It would effectively have meant dropping the indices i from all identities in the
previous sections. We would have defined population-level cause-specific hazard rates $\pi_r(t)$, such
that the above population survival function would be written as
$$S(t) = e^{-\sum_{r=0}^R\int_0^t ds\ \pi_r(s)} \qquad (96)$$
However, one would not have $\pi_r(t) = N^{-1}\sum_i\pi_r^i(t)$, since $\log\frac{1}{N}\sum_i(\dots) \neq \frac{1}{N}\sum_i\log(\dots)$. We must
therefore always clarify whether we talk about individual or population functions. Failure to make
this distinction leads to confusion and mistakes. With the drive towards personalised medicine the
differences between cohort and individual survival statistics will become even more important.
Event time uncertainty versus hazard rate uncertainty. The description of survival statistics at the cohort level, via S(t), involves two sources of uncertainty, which one cannot easily disentangle: the uncertainty of event times, as described by the individual functions S_i(t), and the uncertainty of which individual we pick from the cohort, represented by the averaging N^{−1}Σ_i. To illustrate this, imagine we have just one risk, and we observe at population level what seems to be a simple exponentially decaying survival function
S(t) = e^{−πt}   (97)
This can arise in many ways. For instance, all individuals could be identical, and the uncertainty fully due to event time uncertainty at the individual level: the choice S_i(t) = e^{−πt} for all i would trivially give the above S(t). The opposite extreme would be the case where the individuals have no event time uncertainty at all, i.e. S_i(t) = θ[t_i^⋆ − t], so each i dies fully deterministically at some time t_i^⋆, but the pre-ordained times t_i^⋆ vary from one individual to another. This would mean
π_i(t) = −(d/dt) log S_i(t) = −(d/dt) log θ[t_i^⋆ − t]   (98)
Here we would find, with W(t^⋆) = N^{−1} Σ_i δ[t^⋆ − t_i^⋆] (the distribution of death times over the population):
S(t) = (1/N) Σ_i θ[t_i^⋆ − t] = ∫_0^∞ dt^⋆ W(t^⋆) θ(t^⋆ − t) = ∫_t^∞ dt^⋆ W(t^⋆)   (99)
It is easy to see that also here we can recover the above exponential form for S(t), if the predestined times of death are distributed exponentially over the population, according to W(t^⋆) = π e^{−πt^⋆}:
S(t) = ∫_t^∞ dt^⋆ π e^{−πt^⋆} = e^{−πt}   (100)
So in both cases we find the population survival function (97), but for very different reasons. In real patient data one would typically expect to have a combination of both types of uncertainty.
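A minimal simulation sketch in Python (assuming π = 1) makes the indistinguishability explicit: the observed death times produced by the two mechanisms are statistically identical, so survival data alone cannot tell them apart:

    import numpy as np

    rng = np.random.default_rng(1)
    pi, N = 1.0, 100_000

    # mechanism A: identical individuals, each with S_i(t) = exp(-pi*t)
    deaths_A = rng.exponential(1/pi, N)

    # mechanism B: each individual dies deterministically at a pre-ordained
    # time t*_i, with the t*_i drawn across the cohort from W(t*) = pi e^{-pi t*}
    deaths_B = rng.exponential(1/pi, N)   # these ARE the deterministic t*_i

    for t in (0.5, 1.0, 2.0):
        print(t, (deaths_A > t).mean(), (deaths_B > t).mean(), np.exp(-pi*t))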
5.2. Population hazard rates and data likelihood

Relation between population hazard rates and individual hazard rates. It will be instructive to express the population-level cause-specific hazard rates π_r(t) of (96) in terms of the individual cause-specific hazard rates π_r^i(t), by application of (7) to population-level functions:
π_µ(t) S(t) = ∫_t^∞ ... ∫_t^∞ (Π_{r≠µ} ds_r) P(s_0,...,s_{µ−1},t,s_{µ+1},...,s_R)
= (1/N) Σ_i ∫_t^∞ ... ∫_t^∞ (Π_{r≠µ} ds_r) P_i(s_0,...,s_{µ−1},t,s_{µ+1},...,s_R)
= (1/N) Σ_i π_µ^i(t) S_i(t)   (101)
So we find
π_µ(t) = Σ_i π_µ^i(t) S_i(t) / Σ_i S_i(t) = Σ_i π_µ^i(t) e^{−Σ_r ∫_0^t ds π_r^i(s)} / Σ_i e^{−Σ_r ∫_0^t ds π_r^i(s)}   (102)
It is clear that π_r(t) ≠ N^{−1} Σ_i π_r^i(t) as soon as the cohort is not strictly homogeneous, i.e. as soon as the hazard rates of different individuals i are not all identical.
In fact, we see that heterogeneity will give us a time-dependent population-level hazard rate even if all individuals in the population have time-independent hazard rates. Suppose π_µ^i(t) = π_µ^i for all (i,µ) and all t. We would then obtain
π_µ(t) = Σ_i π_µ^i(t) S_i(t) / Σ_i S_i(t) = Σ_i π_µ^i e^{−t Σ_r π_r^i} / Σ_i e^{−t Σ_r π_r^i}   (103)
For large times, the individuals with the lowest hazard rates contribute most to the average in (103):
π_µ(0) = (1/N) Σ_i π_µ^i,   lim_{t→∞} π_µ(t) = π_µ^{i⋆},   i⋆ = argmin_i (Σ_r π_r^i)   (104)
In Cox regression (see a later section) one indeed often observes population hazard ratios that
appear to decay over time; it is now clear that this should not necessarily be interpreted as a time
dependence at the level of individuals, as it could be due simply to cohort heterogeneity.
Data likelihood. In the same way we find that if we work at population level, with population hazard rates and population survival functions, we can no longer use (15) to quantify the likelihood of finding an individual reporting a first event of type ∆ at time X. Instead we would now use
P(X,∆) = π_∆(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r(s)}   (105)
It now follows from (101), upon inserting our formulae for S(t) and S_i(t), that in fact P(X,∆) = N^{−1} Σ_i P_i(X,∆). The final picture is therefore as in the diagram below:

  INDIVIDUAL LEVEL                                          POPULATION LEVEL

  P_1(t_0,...,t_R)  ...  P_N(t_0,...,t_R)                   P(t_0,...,t_R) = (1/N) Σ_i P_i(t_0,...,t_R)
  (individual event time statistics)
          ⇓
  π_0^1(X),...,π_R^1(X)  ...  π_0^N(X),...,π_R^N(X)         π_∆(X) = P(X,∆) / Σ_r ∫_X^∞ dt P(t,r)
  (individual hazard rates)
  P_1(X,0),...,P_1(X,R)  ...  P_N(X,0),...,P_N(X,R)         P(X,∆) = (1/N) Σ_i P_i(X,∆)
  (individual data likelihoods)                             P(X,∆) = π_∆(X) e^{−Σ_r ∫_0^X ds π_r(s)}
          ⇓
  (X_1,∆_1)  ...  (X_N,∆_N)
  (observed survival data)
5.3. Examples

Example 1: impact of heterogeneity on population hazard rates

Imagine a population of two distinct groups of individuals, A and B: {1,...,N} = A∪B. Let there be N_A = fN patients in group A and N_B = (1−f)N patients in group B. They are all subject to just one risk, and have time-independent individual hazard rates: π_i(t) = 1 if i ∈ A and π_i(t) = 3 if i ∈ B. At population level this would give the time-dependent hazard rate
π(t) = Σ_i π_i e^{−tπ_i} / Σ_i e^{−tπ_i} = (Σ_{i∈A} e^{−t} + Σ_{i∈B} 3e^{−3t}) / (Σ_{i∈A} e^{−t} + Σ_{i∈B} e^{−3t}) = (f + 3(1−f)e^{−2t}) / (f + (1−f)e^{−2t})   (106)
The result is shown in figure 2 for f ∈ {0, 1/4, 1/2, 3/4, 1}.
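A few lines of Python suffice to evaluate (106) numerically (the rates 1 and 3 are those of this example); one sees the decay from π(0) = 3−2f towards the lowest rate present in the cohort:

    import numpy as np

    # population-level hazard rate of eq (106), two-group cohort with
    # individual rates 1 (group A, fraction f) and 3 (group B, fraction 1-f)
    def pop_hazard(t, f):
        return (f + 3*(1 - f)*np.exp(-2*t)) / (f + (1 - f)*np.exp(-2*t))

    t = np.linspace(0, 3, 7)
    for f in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f, np.round(pop_hazard(t, f), 3))   # decays for 0 < f < 1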
Figure 2. The population-level hazard rate π(t) as given by (106), for different values of the relative sizes of the sub-classes in the cohort. The hazard rate decays over time as soon as there is heterogeneity in the cohort (i.e. for 0 < f < 1), in spite of all individuals of the cohort having strictly time-independent hazard rates.

Example 2: correlated population risks without correlated individual risks

Imagine again a population of two distinct groups of individuals, A and B: {1,...,N} = A∪B. Let there be N_A = fN patients in group A and N_B = (1−f)N patients in group B. They are
all subject to two risks r = 1 and r = 2. Assume that all have independent event times, and hence factorising survival functions as in (26), and constant hazard rates:
i ∈ A:   P_i(t_1,t_2) = (π_1^A e^{−t_1 π_1^A})(π_2^A e^{−t_2 π_2^A})   (107)
i ∈ B:   P_i(t_1,t_2) = (π_1^B e^{−t_1 π_1^B})(π_2^B e^{−t_2 π_2^B})   (108)
So
i ∈ A:   S_i(t_1,t_2) = S_{A1}(t_1) S_{A2}(t_2),   S_{A1}(t) = e^{−tπ_1^A},   S_{A2}(t) = e^{−tπ_2^A}   (109)
i ∈ B:   S_i(t_1,t_2) = S_{B1}(t_1) S_{B2}(t_2),   S_{B1}(t) = e^{−tπ_1^B},   S_{B2}(t) = e^{−tπ_2^B}   (110)
Within each group the two risks are clearly independent. At population level we find the overall survival function
S(t) = (1/N) Σ_{i=1}^N S_i(t) = (1/N) Σ_{i∈A} S_{A1}(t) S_{A2}(t) + (1/N) Σ_{i∈B} S_{B1}(t) S_{B2}(t)
= (N_A/N) S_{A1}(t) S_{A2}(t) + (N_B/N) S_{B1}(t) S_{B2}(t) = f e^{−t(π_1^A+π_2^A)} + (1−f) e^{−t(π_1^B+π_2^B)}   (111)
One would naively expect the population survival functions for the individual risks to be the cohort averages of the corresponding individual survival functions:
S_r(t) = (1/N) Σ_{i=1}^N S_{ir}(t) = (N_A/N) S_{Ar}(t) + (N_B/N) S_{Br}(t) = f e^{−tπ_r^A} + (1−f) e^{−tπ_r^B}   (112)
This gives for the product S_1(t)S_2(t):
S_1(t)S_2(t) = (f e^{−tπ_1^A} + (1−f) e^{−tπ_1^B})(f e^{−tπ_2^A} + (1−f) e^{−tπ_2^B})
= f² e^{−t(π_1^A+π_2^A)} + (1−f)² e^{−t(π_1^B+π_2^B)} + f(1−f)[e^{−t(π_1^A+π_2^B)} + e^{−t(π_2^A+π_1^B)}]
If we had population-level independence of risks (as is true at the level of the individual groups) we would have expected to find S(t) = S_1(t)S_2(t). Instead here we get
S(t) − S_1(t)S_2(t) = f(1−f)[e^{−t(π_1^A+π_2^A)} + e^{−t(π_1^B+π_2^B)} − e^{−t(π_1^A+π_2^B)} − e^{−t(π_2^A+π_1^B)}]
= f(1−f)[e^{−tπ_1^A} − e^{−tπ_1^B}][e^{−tπ_2^A} − e^{−tπ_2^B}]   (113)
We see that generally our two risks will be correlated at population level, i.e. S(t) ≠ Π_r S_r(t), except for f = 0, 1, or when either π_1^A = π_1^B or π_2^A = π_2^B. These are precisely the cases where the correlation C_{12} over the population of the hazard rates of the two risks vanishes:
C_{12} = ⟨π_1π_2⟩ − ⟨π_1⟩⟨π_2⟩ = (1/N) Σ_i π_1^i π_2^i − ((1/N) Σ_i π_1^i)((1/N) Σ_i π_2^i)
= f π_1^A π_2^A + (1−f) π_1^B π_2^B − (f π_1^A + (1−f) π_1^B)(f π_2^A + (1−f) π_2^B)
= f(1−f)[π_1^A π_2^A + π_1^B π_2^B − π_1^A π_2^B − π_1^B π_2^A]
= f(1−f)[π_1^A − π_1^B][π_2^A − π_2^B]   (114)
This illustrates how risk correlations at population level can emerge in a natural way as a result
of correlations of the cause-specific hazard rates of the individuals in a heterogeneous cohort,
in spite of each individual having independent event times.
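For readers who like numerical reassurance, the following Python sketch (with arbitrarily chosen rates) verifies identity (113) and evaluates the hazard rate correlation (114):

    import numpy as np

    # check of eqs (113,114) for a two-group cohort with independent
    # event times within each individual; rates chosen arbitrarily
    f, pA1, pA2, pB1, pB2 = 0.3, 1.0, 2.0, 3.0, 0.5
    t = 0.7

    S  = f*np.exp(-t*(pA1+pA2)) + (1-f)*np.exp(-t*(pB1+pB2))
    S1 = f*np.exp(-t*pA1) + (1-f)*np.exp(-t*pB1)
    S2 = f*np.exp(-t*pA2) + (1-f)*np.exp(-t*pB2)

    lhs = S - S1*S2
    rhs = f*(1-f)*(np.exp(-t*pA1)-np.exp(-t*pB1))*(np.exp(-t*pA2)-np.exp(-t*pB2))
    print(lhs, rhs)                          # identical: eq (113)
    print(f*(1-f)*(pA1-pB1)*(pA2-pB2))       # C_12 of eq (114), nonzero here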
6. Survival prediction

To predict survival we need to know (or estimate) the cause-specific hazard rates. The natural object to use would be the survival function S_i(t), which gives the probability that individual i does not experience any of the risk events prior to time t. However, sometimes we cannot use this (for instance if we only have information on the cause-specific hazard rates of the cohort as a whole, rather than those of the individual i), or we may wish to calculate different probabilities.
6.1. Cause-specific survival functions

Non-hypothetical survival probabilities. Instead of predicting overall survival, via S_i(t) or S(t), we will often be interested in other predictions. We have already seen P_i(X,µ), the probability density for seeing event µ reported first, in a small time interval located at time X (see (15)):
P_i(X,µ) = π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (115)
From this follows e.g. the so-called cumulative incidence function F_{iµ}(t), which is the probability that individual i 'fails' from cause µ at any time prior to t:
F_{iµ}(t) = ∫_0^t dX P_i(X,µ) = ∫_0^t dX π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (116)
or the population average F_µ(t) of this function, which gives the probability that a randomly drawn individual from our cohort 'fails' from cause µ at any time prior to t:
F_µ(t) = (1/N) Σ_i ∫_0^t dX π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (117)
Equivalently, in terms of the global cause-specific hazard rates:
F_µ(t) = ∫_0^t dX π_µ(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r(s)}   (118)
Again it is important to specify which of the above cumulative incidence functions one is referring to (unless the cohort consists of clones, in which case the difference between the two vanishes). An equivalent quantity is the cause-specific survival probability G_{iµ}(t), defined as the likelihood that at time t individual i has not yet failed from cause µ, either because he/she experienced another event prior to t, or because nothing has happened yet at time t:
G_{iµ}(t) = 1 − F_{iµ}(t) = 1 − ∫_0^t dX π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (119)
The likelihood that individual i will never report event µ would then be
G_{iµ}(∞) = 1 − ∫_0^∞ dX π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (120)
We see that F_{iµ}(t), F_µ(t), and G_{iµ}(t) all depend on all cause-specific hazard rates, not just that of risk µ, since the other risks influence how likely it is for risk µ to trigger an event first. Note that even in the case of statistically independent event times, where S_i(t) = Π_r S_{ir}(t), the function G_{iµ}(t) is not the same as S_{iµ}(t): both describe how likely it is for event µ not to have taken place yet at time t, but G_{iµ}(t) takes into account the likelihood that we haven't seen event µ because other events happened earlier, whereas S_{iµ}(t) does not.
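As a small illustration, the following Python sketch (assuming two arbitrarily chosen constant hazard rates) evaluates (116) and (120) by simple Riemann sums; for constant rates one finds F_µ(∞) = π_µ/Σ_r π_r:

    import numpy as np

    pi = np.array([0.5, 1.5])     # two risks with constant cause-specific rates
    dt, T = 0.001, 10.0
    t = np.arange(0.0, T, dt)
    S = np.exp(-pi.sum()*t)       # overall survival exp(-sum_r int_0^t pi_r)

    # cumulative incidence F_mu(t) of eq (116), one row per risk
    F = np.array([np.cumsum(p*S)*dt for p in pi])
    print(F[:, -1])               # F_mu(inf) -> pi_mu/sum(pi): 0.25 and 0.75
    print(1 - F[0, -1])           # G_mu(inf) of eq (120) for the first risk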
The effects of disabling risks on cause-specific hazard rates. We might also be interested in
hypothetical quantities, such as what would be the survival probabilities if one or more of the risks
could be eliminated. Often we wish to study one specific 'primary' risk, and would want to eliminate the obscuring effects of the others. If we denote the 'active' set of risks as A ⊆ {0,1,2,...,R}, then we have to disable all risks r ∉ A. We have already noted earlier that this does not simply mean setting π_r^i(t) or π_r(t) to zero for all r ∉ A, due to the conditioning in the definition of the cause-specific hazard rates. If we start from the general distribution P_i(t_0,...,t_R;τ_0,...,τ_R) and disable all risks other than those in the set A, we effectively change this distribution into
P_i′(t_0,...,t_R;τ_0,...,τ_R) = P_i(t_0,...,t_R;τ_0,...,τ_R) Π_{r∉A} δ_{τ_r,0} / Σ_{τ_0′...τ_R′} ∫ ds_0′...ds_R′ P_i(s_0′,...,s_R′;τ_0′,...,τ_R′) Π_{r∉A} δ_{τ_r′,0}   (121)
and the new cause-specific hazard rates become
π_µ^{i′}(t) = Σ_{τ_0...τ_R} τ_µ ∫ ds_0...ds_R P_i′(s_0,...,s_R;τ_0,...,τ_R) δ(s_µ−t) Π_{r≠µ}[1−τ_r θ(t−s_r)] / Σ_{τ_0...τ_R} ∫ ds_0...ds_R P_i′(s_0,...,s_R;τ_0,...,τ_R) Π_r[1−τ_r θ(t−s_r)]
= Σ_{τ_0...τ_R} τ_µ ∫ ds_0...ds_R P_i(s_0,...,s_R;τ_0,...,τ_R) δ(s_µ−t) Π_{r∉A} δ_{τ_r,0} Π_{r≠µ}[1−τ_r θ(t−s_r)] / Σ_{τ_0...τ_R} ∫ ds_0...ds_R P_i(s_0,...,s_R;τ_0,...,τ_R) Π_{r∉A} δ_{τ_r,0} Π_r[1−τ_r θ(t−s_r)]   (122)
If we started from events that always happen, i.e. a distribution of the form P_i(t_0,...,t_R), then we would have upon disabling the risks r ∉ A:
P_i′(t_0,...,t_R;τ_0,...,τ_R) = P_i(t_0,...,t_R) (Π_{r∈A} δ_{τ_r,1})(Π_{r∉A} δ_{τ_r,0})   (123)
and find
π_µ^{i′}(t) = Σ_{τ_0...τ_R} τ_µ ∫ ds_0...ds_R P_i′(s_0,...,s_R;τ_0,...,τ_R) δ(s_µ−t) Π_{r≠µ}[1−τ_r θ(t−s_r)] / Σ_{τ_0...τ_R} ∫ ds_0...ds_R P_i′(s_0,...,s_R;τ_0,...,τ_R) Π_r[1−τ_r θ(t−s_r)]
= ∫ ds_0...ds_R P_i(s_0,...,s_R) δ(s_µ−t) Σ_{τ_0...τ_R} τ_µ Π_{r∈A} δ_{τ_r,1} Π_{r∉A} δ_{τ_r,0} Π_{r≠µ}[1−τ_r θ(t−s_r)] / ∫ ds_0...ds_R P_i(s_0,...,s_R) Σ_{τ_0...τ_R} Π_{r∈A} δ_{τ_r,1} Π_{r∉A} δ_{τ_r,0} Π_{r∈A}[1−θ(t−s_r)]   (124)
As expected, if µ ∉ A (so risk µ is disabled) we always get π_µ^{i′}(t) = 0 for all t. If µ ∈ A (so risk µ is not disabled), then it is clear from the above that all cause-specific hazard rates of risks in the active set A are affected by our disabling of risks, and that we need to know the distribution P_i(s_0,...,s_R;τ_0,...,τ_R) to calculate the new rates. In the case of (124) and µ ∈ A we can simplify our formula for π_µ^{i′}(t) further to
π_µ^{i′}(t) = ∫ ds_0...ds_R P_i(s_0,...,s_R) δ(s_µ−t) Π_{r∈A\{µ}}[1−θ(t−s_r)] / ∫ ds_0...ds_R P_i(s_0,...,s_R) Π_{r∈A}[1−θ(t−s_r)]   (125)
which in effect involves only the marginal of Pi(s0, . . . , sR), obtained by integrating out all times
tr with r /∈ A. We now appreciate why the question of whether Pi(s0, . . . , sR; τ0, . . . , τR) can be
calculated was a relevant one. In particular, in view of the Tsiatis identifiability problem we cannot
expect to write the new hazard rates in terms of the old ones, since we cannot generally express
Pi(s0, . . . , sR; τ0, . . . , τR) in terms of the old hazard rates . . .
Hypothetical survival probabilities for uncorrelated event times. Only if the event times are known to be uncorrelated can we proceed to calculate the new hazard rates. In this case we may write P_i(s_0,...,s_R;τ_0,...,τ_R) = Π_{r=0}^R P_{ir}(s_r,τ_r), and simplify the above formula for the new cause-specific hazard rates for µ ∈ A to
π_µ^{i′}(t) = [Σ_{τ_µ} τ_µ ∫ ds_µ P_{iµ}(s_µ,τ_µ) δ(s_µ−t)] Π_{r∈A\{µ}}[Σ_{τ_r} ∫ ds_r P_{ir}(s_r,τ_r)(1−τ_r θ(t−s_r))] / Π_{r∈A}[Σ_{τ_r} ∫ ds_r P_{ir}(s_r,τ_r)(1−τ_r θ(t−s_r))]
= Σ_{τ_µ} τ_µ ∫ ds_µ P_{iµ}(s_µ,τ_µ) δ(s_µ−t) / Σ_{τ_µ} ∫ ds_µ P_{iµ}(s_µ,τ_µ)(1−τ_µ θ(t−s_µ)) = π_µ^i(t)   (126)
The reason that the last formula is simply the old rate π_µ^i(t) is that it no longer involves the active risk set A, so it must also be true for the choice A = {0,1,...,R} (i.e. none of the risks is disabled), for which we must recover the old hazard rate. In conclusion: if we disable risks, all cause-specific hazard rates will generally be affected in a complicated way, and to calculate their new values we need to know the joint event time distribution. Only if the event times are uncorrelated does disabling risks simply mean setting all cause-specific hazard rates of the disabled risks to zero.
For instance, if all risks except risk µ were eliminated (i.e. A = {µ}) and the event times are uncorrelated, then one would have π_r^{i′}(t) = 0 for all r ≠ µ, and find the following hypothetical cause-specific survival probability S_{iµ}(t), describing a world where only risk µ is active:
S_{iµ}(t) = 1 − ∫_0^t dX π_µ^i(X) e^{−∫_0^X ds π_µ^i(s)} = 1 + ∫_0^t dX (d/dX) e^{−∫_0^X ds π_µ^i(s)} = 1 + [e^{−∫_0^X ds π_µ^i(s)}]_{X=0}^{X=t} = e^{−∫_0^t ds π_µ^i(s)}   (127)
which is identical to the risk-specific survival factor S_{iµ}(t) in S_i(t) = Π_r S_{ir}(t) (due to the assumed event time independence).
6.2. Estimation of cause-specific hazard rates

Strategies for extracting hazard rate information from the data. The previous subsection dealt with how we can make predictions once we know the cause-specific hazard rates. In reality we must find these rates first. Suppose we have survival data D = {(X_1,∆_1),...,(X_N,∆_N)}, referring to N individuals who can be regarded as independently drawn random samples from a population characterised by an as yet unknown set of population-level cause-specific hazard rates {π_0,...,π_R}. There are two (connected) approaches to the question of how to get the rates from D.‖

The first (traditional) approach is to construct formulas for so-called estimators: expressions π̂_0,...,π̂_R written in terms of the data D, for which we can prove that in the limit N→∞ they converge to the true values π_0,...,π_R. Once these estimators are chosen and their properties verified, one then uses in prediction the estimators instead of the real (unknown) cause-specific hazard rates. There are two downsides to this approach. The first is the difficulty of constructing good (i.e. unbiased and fast-converging) candidates for these formulas, which is easy for trivial quantities but not for the present case of hazard rates. The second is that, especially when N is not very large, we know that estimators are not exact, yet we have no way of accounting for this imprecision in our predictions; we would just have to hope for the best. The second approach is to use Bayesian arguments, which not only take into account our residual uncertainty regarding the true hazard rates after having observed the data, but also lead us to systematic formulae for estimators. In Appendix C we work out and compare the different routes available for estimating model parameters from data for a simple example.

‖ The procedures described here apply more generally to the extraction of parameters from data, not just to the extraction of cause-specific hazard rates {π_0,...,π_R} from our survival data D.
The maximum likelihood estimator. Let us try to determine the most probable values of our hazard rates, given our observation of the data D, in the Bayesian way. This means finding the maximum over {π_0,...,π_R} of the distribution 𝒫({π_0,...,π_R}|D).¶ The standard Bayesian identity p(a|b)p(b) = p(b|a)p(a) allows us to express this distribution in terms of its counterpart, the data likelihood 𝒫(D|{π_0,...,π_R}) given the hazard rates:
𝒫({π_0,...,π_R}|D) = 𝒫(D|{π_0,...,π_R}) 𝒫({π_0,...,π_R}) / 𝒫(D)
= 𝒫(D|{π_0,...,π_R}) 𝒫({π_0,...,π_R}) / ∫ dπ_0′...dπ_R′ 𝒫(D|{π_0′,...,π_R′}) 𝒫({π_0′,...,π_R′})   (128)
Within the Bayesian framework one would choose for the prior the maximum-entropy distribution, subject to applicable constraints (such as π_r(t) ≥ 0 for all t). If, on the other hand, we choose a so-called flat prior, i.e. we take 𝒫({π_0,...,π_R}) to be independent of {π_0,...,π_R} (so we have no prior preference either way, beyond the constraints), then the most probable set {π_0,...,π_R} is the one that maximises 𝒫(D|{π_0,...,π_R}). This is called a maximum-likelihood estimator. Equivalently, we can maximise the logarithm of the data likelihood, viz. L(D|{π_0,...,π_R}) = log 𝒫(D|{π_0,...,π_R}), which gives slightly more compact equations. To proceed we need a formula for L(D|{π_0,...,π_R}), which, given our independence assumption and in view of (105), is
L(D|{π_0,...,π_R}) = log Π_i P(X_i,∆_i) = Σ_i log P(X_i,∆_i)
= Σ_i log π_{∆_i}(X_i) − Σ_i Σ_{r=0}^R ∫_0^{X_i} ds π_r(s)
= Σ_{r=0}^R ∫_0^∞ ds Σ_i {δ_{r,∆_i} δ(s−X_i) log π_r(s) − π_r(s) θ(X_i−s)}   (129)
Now we maximise this latter expression by variation of each of the functions π_r(t), giving us an estimator for each of the population-level cause-specific hazard rates. It is standard convention to write estimators with a 'hat' symbol, so after differentiation of (129) we get
(∀r)(∀t):   (1/π̂_r(t)) Σ_i δ_{r,∆_i} δ(t−X_i) = Σ_i θ(X_i−t)   (130)
(∀r)(∀t):   π̂_r(t) = Σ_i δ_{r,∆_i} δ(t−X_i) / Σ_i θ(X_i−t)   (131)
This latter result seems very sensible: at any time t, the rate for event r is estimated as the number of observed type-r failures per unit time, divided by the number of patients observed to be still 'at risk' at time t.
¶ We will use ordinary Roman capitals (e.g. P(..), W(..)) for distributions describing intrinsic survival statistics of individuals and groups, and calligraphic capitals (e.g. 𝒫(..)) for Bayesian probabilities, which quantify our confidence in having extracted information correctly from survival data.
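To see (131) in action, here is a minimal discretised sketch in Python (the time binning and the synthetic single-risk data are assumptions of the illustration, not part of the formula): in each bin, the estimated rate is the number of type-r events per unit time divided by the number still at risk.

    import numpy as np

    def ml_hazard(X, Delta, r, bins):
        # binned version of eq (131): events of type r per unit time,
        # divided by the number of individuals still at risk
        edges = np.linspace(0, X.max(), bins + 1)
        dt = edges[1] - edges[0]
        events = np.histogram(X[Delta == r], bins=edges)[0]
        at_risk = np.array([(X >= e).sum() for e in edges[:-1]])
        return edges[:-1], events / np.maximum(at_risk, 1) / dt

    rng = np.random.default_rng(2)
    X = rng.exponential(2.0, 5000)        # true constant hazard rate 0.5
    Delta = np.ones(5000, dtype=int)
    t, haz = ml_hazard(X, Delta, 1, 20)
    print(np.round(haz[:5], 3))           # hovers around 0.5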
6.3. Derivation of the Kaplan-Meier estimator

Estimator for the survival function in the presence of just one risk. From the estimator (131) we can construct estimators for the various survival functions defined earlier. In particular, provided we know (or may assume) that the event times are statistically independent, we can estimate the hypothetical survival function that would describe a situation where all risks except µ (the risk of interest, or the 'primary' risk) are eliminated. The latter is estimated by Ŝ_µ(t), where
log Ŝ_µ(t) = −∫_0^t ds π̂_µ(s) = −∫_0^t ds Σ_i δ_{µ,∆_i} δ(s−X_i) / Σ_i θ(X_i−s)
= −Σ_i ∫_0^t ds δ_{µ,∆_i} δ(s−X_i) / Σ_j θ(X_j−s) = −Σ_i δ_{µ,∆_i} θ(t−X_i) / Σ_j θ(X_j−X_i)   (132)
If we denote by Ω_µ ⊂ {1,...,N} the set of all patients that report event µ, this becomes
log Ŝ_µ(t) = −Σ_{i∈Ω_µ} θ(t−X_i) / Σ_j θ(X_j−X_i)   (133)
Note that R(X_i) = Σ_j θ(X_j−X_i) is the number of individuals still 'at risk' at time X_i. We may now write
Ŝ_µ(t) = Π_{i∈Ω_µ} e^{−θ(t−X_i)/R(X_i)}   (134)
Estimator for the overall survival function. The estimate for the overall survival function S(t) = exp[−Σ_µ ∫_0^t ds π_µ(s)] can be obtained by summing over all risks in the derivation above. We get
log Ŝ(t) = −∫_0^t ds Σ_µ π̂_µ(s) = −∫_0^t ds Σ_i δ(s−X_i) / Σ_i θ(X_i−s) = −Σ_i θ(t−X_i) / Σ_j θ(X_j−X_i)   (135)
Hence
Ŝ(t) = exp[−Σ_i θ(t−X_i) / Σ_j θ(X_j−X_i)] = Π_i e^{−θ(t−X_i)/R(X_i)}   (136)
Here there is no issue relating to independence of event times; this estimator is always valid. Even without the arguments leading to (136), one can easily convince oneself that for N→∞ the expression (136) will indeed converge to the true survival function. Upon defining the empirical distribution P_em(X) = N^{−1} Σ_i δ(X−X_i) we may write:
lim_{N→∞} Ŝ(t) = lim_{N→∞} exp[−Σ_i θ(t−X_i) / Σ_j θ(X_j−X_i)]
= lim_{N→∞} exp[−∫_0^∞ dX P_em(X) θ(t−X) / ∫_0^∞ dX′ P_em(X′) θ(X′−X)]
= lim_{N→∞} exp[−∫_0^t dX P_em(X) / ∫_X^∞ dX′ P_em(X′)]   (137)
For N→∞ we will have P_em(X) → P(X) (the true distribution of first event times corresponding to our cohort), and the fraction inside the above integral becomes identical to the right-hand side of the population-level version of (19), viz.
Σ_{r=0}^R π_r(X) = P(X) / ∫_X^∞ dt P(t)   (138)
Thus we find, provided limits and integrations commute (i.e. for non-pathological P(X)), that
lim_{N→∞} Ŝ(t) = e^{−Σ_{r=0}^R ∫_0^t dX π_r(X)} = S(t)   (139)
The Kaplan-Meier estimators. From equations (134) and (136) it is only a small step to the so-called Kaplan-Meier curves. We collect and order all distinct times at which one or more events are reported in our cohort, giving an ordered set of time points t_1 < t_2 < t_3 < ..., labelled by t_ℓ. This allows us to write (134) as
Ŝ_µ(t) = e^{−Σ_{i∈Ω_µ} θ(t−X_i)/R(X_i)} = e^{−Σ_ℓ Σ_{i∈Ω_µ, X_i=t_ℓ} θ(t−t_ℓ)/R(t_ℓ)} = e^{−Σ_ℓ θ(t−t_ℓ) R^{−1}(t_ℓ) Σ_{i∈Ω_µ, X_i=t_ℓ} 1}   (140)
We recognise that D_µ(t_ℓ) = Σ_{i∈Ω_µ, X_i=t_ℓ} 1 is the number of individuals that reported event µ at time t_ℓ, so
Ŝ_µ(t) = Π_{ℓ, t_ℓ≤t} e^{−D_µ(t_ℓ)/R(t_ℓ)} = Π_{ℓ, t_ℓ≤t} [1 − D_µ(t_ℓ)/R(t_ℓ) + O((D_µ(t_ℓ)/R(t_ℓ))²)]   (141)
If finally we truncate the expansion of the exponentials after the first two terms (which is valid until times become so large that the number R(t) of individuals at risk becomes of order one), we obtain what is known as the Kaplan-Meier estimator of the risk-specific survival function for the case of uncorrelated risks:
Ŝ_µ^{KM}(t) = Π_{ℓ, t_ℓ≤t} [1 − D_µ(t_ℓ)/R(t_ℓ)],   with R(t): nr at risk at time t, and D_µ(t): nr reporting event µ at time t   (142)
Similarly, since the only difference between (134) and (136) is whether or not we limit the contributing individuals i to those that report event µ, we can retrace the above argument with only minor adjustments, and obtain the Kaplan-Meier estimator of the overall survival function:
Ŝ^{KM}(t) = Π_{ℓ, t_ℓ≤t} [1 − D(t_ℓ)/R(t_ℓ)],   with R(t): nr at risk at time t, and D(t): nr reporting an event at time t   (143)
Examples of cause-specific and overall Kaplan-Meier survival curves are shown in figure 3.
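A minimal Python sketch of the estimators (142,143) (the synthetic single-risk data below are just for illustration; for the cause-specific curve (142) one would pass report = (∆_i = µ) instead of all-True):

    import bisect
    import numpy as np

    def kaplan_meier(X, report):
        # product-limit curve: S(t) = prod over event times t_l <= t
        # of [1 - D(t_l)/R(t_l)], cf. eqs (142,143)
        times = np.unique(X[report])
        S, curve = 1.0, []
        for tl in times:
            R = (X >= tl).sum()              # nr at risk at time t_l
            D = ((X == tl) & report).sum()   # nr reporting the event at t_l
            S *= 1 - D/R
            curve.append((tl, S))
        return curve

    rng = np.random.default_rng(3)
    X = rng.exponential(1.0, 1000)           # synthetic event times
    t, S = zip(*kaplan_meier(X, np.ones(1000, dtype=bool)))
    k = bisect.bisect_right(t, 1.0) - 1
    print(S[k], np.exp(-1.0))                # KM at t=1 versus true exp(-1)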
Some properties of KM curves. The formulae for Kaplan-Meier curves are simple and compact, but they have limitations. The main one is that Ŝ_µ^{KM}(t) only estimates the survival probability for risk µ correctly for uncorrelated risks. For correlated risks we can still use Ŝ^{KM}(t), but Ŝ_µ^{KM}(t) may bear no relation at all to the event statistics of the individual risks that would be found if the alternative risks were disabled. Also, the expansion used means in both cases that for small values of R(t_ℓ), i.e. for large times when only few patients are still event-free, the curves are no longer reliable.

Secondly, by definition KM curves have the shape of descending staircases, with steps at the times where events occurred in the cohort. The smaller the cohort size N, the smaller the number of steps and the larger the jumps involved. This jagged nature of the curves is an artifact of the maximum-likelihood procedure that was followed; since common sense dictates that the true survival curves are smooth, a better procedure would be to add a non-flat prior 𝒫({π_0,...,π_R}) to our derivation, which punishes non-smooth time dependencies of the cause-specific hazard rates. The only reason this is usually not done is that we would then get a more complicated equation than (130), from which π̂_r(t) can no longer be solved in explicit form.
Figure 3. Kaplan-Meier curves for a large prostate cancer data set (top row, N = 2047 patients, primary risk: onset of prostate cancer) and for a smaller breast cancer data set (bottom row, N = 70 patients). Left column: the KM estimator for the overall survival probability (including end-of-trial censoring events). Middle two columns: KM estimators of cause-specific survival probabilities (assuming independence of risks), for the primary risk 1 (cancer onset for PC patients monitored from age 50 onwards, cancer recurrence for BC patients following a primary tumour at time t = 0) and for risk 2 (other deaths). Right column: KM estimator of the survival probability of the end-of-trial risk, which tells us about the distribution of end-of-trial censoring times. We see that prostate cancer risk increases with time (the slope of Ŝ_1^{KM}(t) becomes more negative with age), and that the recurrence risk for breast cancer patients decreases with time (the slope of Ŝ_1^{KM}(t) gets less negative over time).
In view of the interpretation and the underlying assumptions of the KM curves, we should expect that Ŝ^{KM}(t) = Π_µ Ŝ_µ^{KM}(t). This is indeed true (within the orders of accuracy considered in the derivation of the formulae):
Π_µ Ŝ_µ^{KM}(t) = Π_µ Π_{ℓ, t_ℓ≤t} [1 − D_µ(t_ℓ)/R(t_ℓ)] = Π_{ℓ, t_ℓ≤t} Π_µ [1 − D_µ(t_ℓ)/R(t_ℓ)]   (144)
If there are no ties in timing, i.e. all events happen at distinct times, then D_µ(t_ℓ), D(t_ℓ) ∈ {0,1} and D_µ(t_ℓ) = D(t_ℓ) δ_{µ,µ(t_ℓ)}, with µ(t_ℓ) denoting the type of event observed at time t_ℓ. We then immediately get
Π_µ Ŝ_µ^{KM}(t) = Π_{ℓ, t_ℓ≤t} [1 − D(t_ℓ)/R(t_ℓ)] = Ŝ^{KM}(t)   (145)
If there are ties, the identity is true to the relevant orders, since
Π_µ Ŝ_µ^{KM}(t) = Π_{ℓ, t_ℓ≤t} Π_µ e^{−D_µ(t_ℓ)/R(t_ℓ) + O(D_µ²(t_ℓ)/R²(t_ℓ))} = Π_{ℓ, t_ℓ≤t} e^{−D(t_ℓ)/R(t_ℓ) + O(Σ_µ [D_µ(t_ℓ)/R(t_ℓ)]²)}
= Π_{ℓ, t_ℓ≤t} [1 − D(t_ℓ)/R(t_ℓ) + O(D²(t_ℓ)/R²(t_ℓ))] ≈ Ŝ^{KM}(t)   (146)
How to measure cause-specific risk in the presence of risk correlations. We have already seen that the cause-specific survival function Ŝ_µ(t) only estimates the true survival with respect to risk µ in the absence of the other risks if all risks are independent. The same is true for the cause-specific Kaplan-Meier curves, which are just approximations of the Ŝ_µ(t). How then do we measure the cause-specific survival prospects of a cohort when there are correlated risks? Unless we know the joint event time distribution, we have no choice but to return to non-hypothetical survival probabilities, such as the cause-specific incidence functions
F_{iµ}(t) = ∫_0^t dX π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (147)
F_µ(t) = (1/N) Σ_i ∫_0^t dX π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (148)
The only limitation is that if we then compare two groups in terms of their risk-µ incidence, we can never be sure whether any differences are due to changes in the event time statistics of risk µ itself, or due to differences in the other risks r ≠ µ, which can influence F_{iµ}(t) or F_µ(t) via censoring (i.e. by changing the probability that type-µ events are the first to occur). This limitation is fundamental, as it involves the joint statistics of timings, which we know cannot be inferred from hazard rates, and therefore cannot be inferred from survival data.
6.4. Examples

Example 1: risk elimination

To get a better feel for the perhaps counter-intuitive statement that removal of one risk affects the hazard rates of the remaining risks, let us work out a simple example where two risks 1 and 2 are both consequences of a third event 3, with an exponentially distributed event time t_3 (which we do not observe, and which in itself has no negative direct consequences). We assume that always t_1 = t_3 + 10 and t_2 = t_3 + 20, so
P(t_1,t_2,t_3) = τ^{−1} e^{−t_3/τ} δ(t_1−t_3−10) δ(t_2−t_3−20)   (149)
Integration over the unknown t_3 gives the event time distribution for the observable events:
P(t_1,t_2) = τ^{−1} ∫_0^∞ dt_3 e^{−t_3/τ} δ(t_1−t_3−10) δ(t_2−t_3−20) = τ^{−1} e^{−(t_1−10)/τ} δ(t_2−t_1−10) θ(t_1−10)   (150)
From this we obtain
S(t_1,t_2) = ∫_{t_1}^∞ ∫_{t_2}^∞ ds_1 ds_2 τ^{−1} e^{−(s_1−10)/τ} δ(s_2−s_1−10) θ(s_1−10)
= τ^{−1} ∫_{max(t_1,10)}^∞ ds_1 e^{−(s_1−10)/τ} θ(s_1+10−t_2)
= τ^{−1} ∫_{max(t_1−10, 0, t_2−20)}^∞ ds e^{−s/τ} = e^{−max(t_1−10, 0, t_2−20)/τ}
= 1 if t_1 < 10 and t_2 < 20;  e^{−(t_1−10)/τ} if t_1 > 10 and t_1 > t_2−10;  e^{−(t_2−20)/τ} if t_2 > 20 and t_1 < t_2−10   (151)
To next calculate the hazard rates via (7) we need S(t_1,t_2) for |t_1−t_2| small, i.e. we need only the first two options in the result above:
t < 10:   π_1(t) = π_2(t) = 0   (152)
t > 10:   π_1(t) = −(∂/∂t) log e^{−(t−10)/τ} = 1/τ,   π_2(t) = 0   (153)
We can understand this: event 2 always happens after event 1, so it will never be observed. Event 1 happens with a constant rate (that of its cause, i.e. of event 3) as soon as t > 10.
Next we disable risk 1. Evidently this means that π_1′(t) = 0. However, event 2 still happens exactly at t_2 = t_3 + 20, but it is no longer preceded by the 'masking' event 1. So we must get
t < 20:   π_1′(t) = π_2′(t) = 0   (154)
t > 20:   π_1′(t) = 0,   π_2′(t) = 1/τ   (155)
Let us also calculate π_2′(t) via the formal route, i.e. formula (125):
π_2′(t) = ∫ ds_1 ds_2 P(s_1,s_2) δ(s_2−t) / ∫ ds_1 ds_2 P(s_1,s_2)[1−θ(t−s_2)] = P(t) / ∫_t^∞ ds P(s)   (156)
in which
P(t) = ∫_0^∞ dt_1 P(t_1,t) = ∫_{10}^∞ dt_1 τ^{−1} e^{−(t_1−10)/τ} δ(t−t_1−10) = τ^{−1} e^{−(t−20)/τ} θ(t−20)   (157)
and so insertion into our formula for π_2′(t) gives indeed
π_2′(t) = θ(t−20) e^{−(t−20)/τ} / ∫_t^∞ ds e^{−(s−20)/τ} θ(s−20) = θ(t−20) e^{−(t−20)/τ} / (τ e^{−(t−20)/τ}) = θ(t−20) τ^{−1}   (158)
Example 2: correlated risks and false protectivity

Let us inspect a previously used distribution for two event times t_1, t_2 ≥ 0, with parameters a, b, τ > 0 and ε ∈ [0,1], assumed to apply at population level:
P(t_1,t_2) = a e^{−at_2} [ε δ(t_1−t_2−τ) + (1−ε) b e^{−bt_1}]   (159)
We note that the first marginal of this distribution is
P(t_1) = ∫_0^∞ dt_2 P(t_1,t_2) = ε ∫_0^∞ dt_2 a e^{−at_2} δ(t_1−t_2−τ) + (1−ε) b e^{−bt_1} ∫_0^∞ dt_2 a e^{−at_2}
= ε θ(t_1−τ) a e^{−a(t_1−τ)} + (1−ε) b e^{−bt_1}   (160)
We also note that in the above distribution the two event times are generally correlated:
⟨t_1t_2⟩ − ⟨t_1⟩⟨t_2⟩ = ∫ dt_2 P(t_2) t_2 ∫ dt_1 P(t_1|t_2) t_1 − (∫ dt_2 P(t_2) ∫ dt_1 P(t_1|t_2) t_1)(∫ dt_2 P(t_2) t_2)
= ∫ dt_2 P(t_2) t_2 (ε(t_2+τ) + (1−ε)/b) − (1/a) ∫ dt_2 P(t_2) (ε(t_2+τ) + (1−ε)/b)
= ε(⟨t_2²⟩ + τ/a) + (1−ε)/(ab) − ε(1/a² + τ/a) − (1−ε)/(ab) = ε(⟨t_2²⟩ − 1/a²)   (161)
The remaining average is
⟨t_2²⟩ = ∫_0^∞ dt t² a e^{−at} = a (d²/da²) ∫_0^∞ dt e^{−at} = a (d²/da²)(1/a) = 2/a²   (162)
Hence
⟨t_1t_2⟩ − ⟨t_1⟩⟨t_2⟩ = ε/a²   (163)
We conclude that in the above distribution the two times are positively correlated as soon as ε > 0.
Let us calculate the actual cause-specific hazard rate π_1′(t) that would be found if risk 2 were disabled. This rate is given in (125), which here simplifies to
π_1′(t) = ∫ dt_1 dt_2 P(t_1,t_2) δ(t_1−t) / ∫ dt_1 dt_2 P(t_1,t_2) θ(t_1−t) = P(t) / ∫_t^∞ dt_1 P(t_1)
= [ε θ(t−τ) a e^{−a(t−τ)} + (1−ε) b e^{−bt}] / [ε ∫_{max{t,τ}}^∞ dt_1 a e^{−a(t_1−τ)} + (1−ε) ∫_t^∞ dt_1 b e^{−bt_1}]
= [ε θ(t−τ) a e^{−a(t−τ)} + (1−ε) b e^{−bt}] / [ε e^{−a max{t−τ,0}} + (1−ε) e^{−bt}]   (164)
Hence
t < τ:   π_1′(t) = (1−ε) b e^{−bt} / (ε + (1−ε)e^{−bt}) = −(d/dt) log[ε + (1−ε)e^{−bt}]   (165)
t > τ:   π_1′(t) = (ε a e^{−a(t−τ)} + (1−ε) b e^{−bt}) / (ε e^{−a(t−τ)} + (1−ε)e^{−bt}) = −(d/dt) log[ε e^{−a(t−τ)} + (1−ε)e^{−bt}]   (166)
We can now immediately read off the true survival function S_1(t) = exp[−∫_0^t ds π_1′(s)] for risk 1 that would correspond to a world where risk 2 was disabled:
t < τ:   S_1(t) = ε + (1−ε)e^{−bt}   (167)
t > τ:   S_1(t) = ε e^{−a(t−τ)} + (1−ε)e^{−bt}   (168)
Next we want to compare this result with the estimator Ŝ_1(t) in (134), of which the risk-1 KM curve Ŝ_1^{KM}(t) is an approximation: it aims to describe the survival statistics for risk 1 alone, but its derivation relied on assuming independence of the event times. To do this we generate N time pairs (t_1^i, t_2^i) from the distribution P(t_1,t_2), and define the corresponding survival data D = {(X_1,∆_1),...,(X_N,∆_N)}, where
(∀i = 1...N):   X_i = min{t_1^i, t_2^i},   ∆_i = 1 if t_1^i < t_2^i, and ∆_i = 2 if t_2^i < t_1^i   (169)
It is convenient to rewrite Ŝ_1(t) first as
Ŝ_1(t) = exp{−Σ_i δ_{∆_i,1} θ(t−X_i) / Σ_j θ(X_j−X_i)} = exp{−(1/N) Σ_i δ_{∆_i,1} θ(t−X_i) / [(1/N) Σ_j θ(X_j−X_i)]}
= exp{−Σ_{∆=1}^2 ∫_0^∞ dX P̂(X,∆) δ_{∆,1} θ(t−X) / Σ_{∆′=1}^2 ∫_0^∞ dX′ P̂(X′,∆′) θ(X′−X)}
= exp{−∫_0^t dX P̂(X,1) / [∫_X^∞ dX′ P̂(X′,1) + ∫_X^∞ dX′ P̂(X′,2)]}   (170)
Here P̂(X,∆) is the empirical joint distribution of reported event times and event types, i.e.
P̂(X,∆) = (1/N) Σ_i δ_{∆,∆_i} δ(X−X_i)   (171)
For sufficiently large cohorts, i.e. for N→∞, and given that our 'patients' were generated independently, the law of large numbers guarantees that P̂(X,∆) will converge to the true distribution P(X,∆) defined in (15):
P(X,∆) = π_∆(X) e^{−∫_0^X ds π_1(s) − ∫_0^X ds π_2(s)}   (172)
We have already calculated the cause-specific hazard rates π_{1,2}(t) and their time integrals for our present example, see (39,40), which resulted in
π_1(t) = b(1−ε)e^{−bt} / (ε + (1−ε)e^{−bt}),   ∫_0^t ds π_1(s) = −log(ε + (1−ε)e^{−bt})   (173)
π_2(t) = a,   ∫_0^t ds π_2(s) = at   (174)
Hence we find
P(X,1) = [b(1−ε)e^{−bX} / (ε + (1−ε)e^{−bX})] (ε + (1−ε)e^{−bX}) e^{−aX} = b(1−ε)e^{−(a+b)X}   (175)
P(X,2) = a(ε + (1−ε)e^{−bX}) e^{−aX} = aε e^{−aX} + a(1−ε)e^{−(a+b)X}   (176)
Hence for N→∞ our estimator Ŝ_1(t) will report
Ŝ_1(t) = exp{−∫_0^t dX P(X,1) / [∫_X^∞ dX′ P(X′,1) + ∫_X^∞ dX′ P(X′,2)]}
= exp{−∫_0^t dX b(1−ε)e^{−(a+b)X} / [(a+b)(1−ε) ∫_X^∞ ds e^{−(a+b)s} + aε ∫_X^∞ ds e^{−as}]}
= exp{−∫_0^t dX b(1−ε)e^{−(a+b)X} / [(1−ε)e^{−(a+b)X} + ε e^{−aX}]}
= exp{−∫_0^t dX b(1−ε)e^{−bX} / [(1−ε)e^{−bX} + ε]}
= e^{−∫_0^t dX π_1(X)} = ε + (1−ε)e^{−bt}   (177)

Figure 4. We compare the estimators Ŝ_1(t) (dashed, left) and Ŝ_1^{KM}(t) (dashed, right) for the survival function of risk 1 to the true survival function S_1(t) for risk 1 (solid) that would be found if risk 2 were disabled. The joint event times are assumed to be distributed according to the example (159), with parameter values a = b = τ = 1. The curves correspond to formulae (167,168) and (177,142). The three Kaplan-Meier curves on the right were calculated from N = 1000 synthetic patient data (X_i,∆_i), generated according to (169). Since the two risks in this example are positively correlated, the KM estimator Ŝ_1^{KM}(t) and its precursor Ŝ_1(t) (both of which assume there are no correlations between the two risks) underestimate the severity of risk 1. This effect is called 'false protectivity due to competing risks'.
Comparison with the true survival function (167,168) for risk 1, describing correctly the world where risk 2 is disabled, shows that our estimator is only correct for short times. See figure 4 for example curves, corresponding to ε ∈ {1/4, 1/2, 3/4}. As soon as t > τ (and provided ε > 0, so there are indeed event time correlations), the estimator Ŝ_1(t) and the Kaplan-Meier curve Ŝ_1^{KM}(t) both grossly over-estimate the survival probability for risk 1. This effect, called 'false protectivity', is entirely a consequence of the fact that KM-type estimators neglect risk correlations. In our present example the two times are positively correlated, so we know that high-risk individuals with respect to event type 2 tend also to be high-risk with respect to event type 1. The early events of type 2 are therefore more likely to 'filter out' those individuals that would also have given early type-1 events. Event-2 censoring thereby changes the composition of the population over time, increasing the fraction of individuals with lower type-1 risk.

An example of real medical data affected by competing risks are those of figure 1. The corresponding Kaplan-Meier curves are shown in figure 5, or rather the incidence estimator 1 − Ŝ_1^{KM}(t), which estimates the probability of having experienced event type 1 (here: prostate cancer) as a function of time. The curves are shown for different groups: smokers, ex-smokers, and non-smokers. The result suggests a preventative effect of smoking with respect to prostate cancer. In fact, a more careful analysis reveals that this effect is not real, but caused by the false protectivity effect of lung cancer: those of the smokers who did not get lung cancer by the time they reach the age of 75 are inherently more robust, and therefore also less likely to get prostate cancer.

Figure 5. The incidence estimator 1 − Ŝ_1^{KM}(t) for prostate cancer, calculated for three subgroups of a population of males. This illustrates the false protectivity effect. Here smoking seems to have a preventative effect on prostate cancer, but this is in fact caused by correlations between the risk of prostate cancer and the risk of lung cancer; in the presence of such correlations the Kaplan-Meier estimators can no longer be trusted.
Note that the deviation between the true S_1(t) and the estimator Ŝ_1(t) could also work in the opposite direction. If our two risks had been negatively correlated, then risk-2 events would have been more likely to filter out individuals with low type-1 risk; we would then have found our estimators Ŝ_1(t) and Ŝ_1^{KM}(t) under-estimating the risk-1 survival probability. Risk-specific Kaplan-Meier curves were not designed for, and should therefore not be used in, a context where different risks may be correlated.
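The false protectivity mechanism is easy to reproduce numerically. The Python sketch below (assuming a = b = τ = 1 and ε = 1/2) generates data according to (169) from the distribution (159), computes the risk-1 product-limit estimate, and compares it with the true disabled-risk-2 survival function (167,168):

    import numpy as np

    rng = np.random.default_rng(4)
    a = b = tau = 1.0; eps = 0.5; N = 100_000

    t2 = rng.exponential(1/a, N)
    dep = rng.random(N) < eps                      # correlated branch of (159)
    t1 = np.where(dep, t2 + tau, rng.exponential(1/b, N))
    X, D = np.minimum(t1, t2), np.where(t1 < t2, 1, 2)   # eq (169)

    def S1_true(t):                                # eqs (167,168)
        return (eps + (1-eps)*np.exp(-b*t) if t < tau
                else eps*np.exp(-a*(t-tau)) + (1-eps)*np.exp(-b*t))

    order = np.argsort(X)
    Xs, Ds = X[order], D[order]
    atrisk = N - np.arange(N)                      # nr with X >= Xs[i]
    factors = 1 - (Ds == 1)/atrisk                 # KM factor per event time
    for t in (0.5, 2.0, 4.0):
        km = np.prod(factors[Xs <= t])
        print(t, round(km, 3), round(float(S1_true(t)), 3))

For t > τ the product-limit estimate stays close to ε + (1−ε)e^{−bt}, as predicted by (177), while the true survival probability is much lower: the estimator is falsely protective.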
7. Including covariates

All survival probabilities and data likelihoods considered so far were dependent only upon the cause-specific hazard rates π = {π_0,...,π_R}. Let us emphasise this in our notation, and write
S(t|π) = e^{−Σ_r ∫_0^t ds π_r(s)}   (178)
P(X,∆|π) = π_∆(X) e^{−Σ_r ∫_0^X ds π_r(s)}   (179)
Note that with these conventions we can also write survival functions and data probabilities at the level of individuals simply as S_i(t) = S(t|π^i) and P_i(X,∆) = P(X,∆|π^i). If we want to predict survival for individuals on which we have further information, in the form of the values of covariates Z = (Z_1,...,Z_p), then we would want to use this information. There are two distinct ways to do this, both of which are valid, internally consistent and correct, but they differ in strategy.
7.1. Definition via covariate sub-cohorts

How to relate covariates to prediction. Let us assume for simplicity that our covariates are discrete. We can then define a sub-cohort Ω_Z ⊆ {1,...,N}, consisting of those individuals i that have covariates Z_i = Z, and apply the analysis in section 5 of the link between individual-level descriptions and population-level descriptions to this sub-cohort. Ω_Z will be characterised by some cohort-level cause-specific hazard rates π(Z), which are related to the individual cause-specific hazard rates via (102), which here takes the form
π_µ(t|Z) = Σ_{i∈Ω_Z} π_µ^i(t) e^{−Σ_r ∫_0^t ds π_r^i(s)} / Σ_{i∈Ω_Z} e^{−Σ_r ∫_0^t ds π_r^i(s)}   (180)
All our previous analysis linking cohort-level to individual-level quantities applies to Ω_Z, so we can immediately write down the formulas for the probability S(t|π(Z)) that a randomly drawn individual from Ω_Z will be alive at time t, and for the likelihood per unit time P(X,∆|π(Z)) that a randomly drawn individual from Ω_Z will report an event of type ∆ at time X:
S(t|π(Z)) = e^{−Σ_r ∫_0^t ds π_r(s|Z)}   (181)
P(X,∆|π(Z)) = π_∆(X|Z) e^{−Σ_r ∫_0^X ds π_r(s|Z)}   (182)
with the generic definitions (178) and (179). Similarly, we find identity (101) translated into
π_µ(t|Z) S(t|π(Z)) = (1/|Ω_Z|) Σ_{i∈Ω_Z} π_µ^i(t) S(t|π^i)   (183)
with |Ω_Z| = Σ_{i∈Ω_Z} 1. We think strictly in terms of cohort-level cause-specific hazard rates π(Z), which by definition depend solely and uniquely on Z, as opposed to individual-level cause-specific hazard rates π^i (which can and generally will vary within the sub-cohort Ω_Z).
Estimation of π(Z) from the data, and Bayesian prediction. Within the sub-cohort picture, the Bayesian estimation of π(Z) is straightforward. To distinguish between the π (cause-specific hazard rates, being functions of time but without limiting oneself to individuals with specific covariates) and the time- and covariate-dependent rates π(Z), let us write the latter as π^⋆ when used as arguments in probability distributions. In Bayesian estimation we would simply write
𝒫(π^⋆|D) = 𝒫(D|π^⋆) 𝒫(π^⋆) / ∫ dπ^{⋆′} 𝒫(D|π^{⋆′}) 𝒫(π^{⋆′})   (184)
𝒫(D|π^⋆) = Π_{i=1}^N P(X_i,∆_i|π^⋆(Z_i)) = Π_{i=1}^N π^⋆_{∆_i}(X_i|Z_i) e^{−Σ_r ∫_0^{X_i} dt π_r^⋆(t|Z_i)}   (185)
Here 𝒫(π^⋆) is a distribution that codes for any prior knowledge we have on the relation π^⋆(Z) (including applicable constraints). Fully Bayesian prediction, taking into account our limited certainty about whether we have extracted the correct π^⋆(Z) from the data D, would become
S(t|Z,D) = ∫ dπ^⋆ 𝒫(π^⋆|D) S(t|π^⋆(Z))   (186)
P(X,∆|Z,D) = ∫ dπ^⋆ 𝒫(π^⋆|D) P(X,∆|π^⋆(Z))   (187)
Most probable covariate-to-rates relation. The most probable function π^⋆(Z) is the one that maximises 𝒫(π^⋆|D) in (184), i.e. that maximises (up to an irrelevant π^⋆-independent constant)
log 𝒫(π^⋆|D) = log 𝒫(D|π^⋆) + log 𝒫(π^⋆)
= Σ_{i=1}^N log{π^⋆_{∆_i}(X_i|Z_i) e^{−Σ_r ∫_0^{X_i} dt π_r^⋆(t|Z_i)}} + log 𝒫(π^⋆)
= Σ_{i=1}^N log π^⋆_{∆_i}(X_i|Z_i) − Σ_{i=1}^N Σ_r ∫_0^{X_i} dt π_r^⋆(t|Z_i) + log 𝒫(π^⋆)
= Σ_r {Σ_{i=1}^N δ_{r,∆_i} log π_r^⋆(X_i|Z_i) − Σ_{i=1}^N ∫_0^{X_i} dt π_r^⋆(t|Z_i)} + log 𝒫(π^⋆)   (188)
Unless we have prior evidence that suggests we should couple risks, we should use the maximum entropy prior, which is of the form 𝒫(π^⋆) = Π_r 𝒫(π_r^⋆). In that case the posterior 𝒫(π^⋆|D) factorises fully over risks, and hence
log 𝒫(π^⋆|D) = Σ_r log 𝒫(π_r^⋆|D)   (189)
log 𝒫(π_r^⋆|D) = Σ_{i=1}^N δ_{r,∆_i} log π_r^⋆(X_i|Z_i) − Σ_{i=1}^N ∫_0^{X_i} dt π_r^⋆(t|Z_i) + log 𝒫(π_r^⋆)   (190)
We see that the functions π_r^⋆(t|Z) for different risks r are calculated from disconnected maximisation problems. However, this does not mean that we can simply forget about the other risks r ≠ 1: the risks could still be correlated, so eliminating competing risks can still impact upon the primary hazard rate π_1^⋆(t|Z). Only with the further assumption of uncorrelated risks can we take π_1^⋆(t|Z) as a correct measure of risk in a world where only risk 1 can materialise.
Information-theoretic interpretation. There is a nice interpretation of what the above formulae are effectively doing. To show this we first need to define the empirical covariate distribution and the empirical covariate-conditioned data distribution:
P̂(Z) = (1/N) Σ_i δ(Z−Z_i),   P̂(t,r|Z) = Σ_i δ(t−X_i) δ_{r,∆_i} δ(Z−Z_i) / Σ_i δ(Z−Z_i)   (191)
From (184,185) we obtain, using the definitions (191):
(1/N) log 𝒫(π^⋆|D) = (1/N) Σ_{i=1}^N log P(X_i,∆_i|π^⋆(Z_i)) + (1/N) log 𝒫(π^⋆) + constant
= ∫ dZ Σ_r {(1/N) Σ_{i=1}^N δ(Z−Z_i) δ_{r,∆_i} ∫_0^∞ dt δ(t−X_i) log P(t,r|π^⋆(Z))} + (1/N) log 𝒫(π^⋆) + constant
= ∫ dZ P̂(Z) Σ_r ∫_0^∞ dt P̂(t,r|Z) log P(t,r|π^⋆(Z)) + (1/N) log 𝒫(π^⋆) + constant
= −∫ dZ P̂(Z) Σ_r ∫_0^∞ dt P̂(t,r|Z) log[P̂(t,r|Z) / P(t,r|π^⋆(Z))] + (1/N) log 𝒫(π^⋆) + Constant   (192)
Apart from the regularising influence of the prior 𝒫(π^⋆), the most probable function π^⋆ is apparently the one that minimises the Z-averaged Kullback-Leibler distance between the empirical covariate-conditioned distribution P̂(t,r|Z) and its theoretical expectation P(t,r|π^⋆(Z)).
7.2. Definition by conditioning individual hazard rates on covariates

How to relate covariates to prediction. The second approach to bringing in covariate information is formulated in terms of the individual cause-specific hazard rates. We regard individual covariates as predictors of individual cause-specific hazard rates, which in turn predict individual survival:
Z_i → predicts → π^i → predicts → (X_i,∆_i)
The question then is how to formalise this. If we are given the values Z of the covariates of an individual, then the survival probability and data likelihood for that individual, conditional on knowing their covariates to be Z, can be written as
S(t|Z,W) = ∫ dπ W(π|Z) S(t|π)   (193)
P(X,∆|Z,W) = ∫ dπ W(π|Z) P(X,∆|π)   (194)
Here ∫dπ represents functional integration over the values of all hazard rates at all times, subject to π_r(t) ≥ 0 for all (t,r), and W(π|Z) gives the probability that a randomly drawn individual with covariates Z will have individual cause-specific hazard rates π.
The distribution W(π|Z) depends strictly on the degree to which Z is informative of π, i.e. on biochemistry. If Z is very informative, then W(π|Z) will be very narrow, and will point us to a very small set of cause-specific hazard rates compatible with observing covariates Z. One cannot conclude that any patterns linking Z to π, embodied in W(π|Z), are causal. For instance, π and (components of) Z could both be (partially) effects of a common cause Y. W(π|Z) only answers the question: if one knows Z for an individual, what does this tell us about his/her π?
Estimation of the covariates-to-risk connection. To proceed we need to estimate the unknown distribution W(π|Z), from analysis of the complete data D = {(X_1,∆_1;Z_1),...,(X_N,∆_N;Z_N)} (i.e. the survival data plus the covariates of all patients). For infinitely large cohorts we would expect W(π|Z) to become identical to the empirical frequency Ŵ(π|Z) with which the hazard rates π are observed among the individuals with covariates Z:
W(π|Z) = lim_{N→∞} Ŵ(π|Z)   (195)
Ŵ(π|Z) = Σ_i δ(π−π^i) δ(Z−Z_i) / Σ_i δ(Z−Z_i)   (196)
For finite data sets we will only be able to say how likely each possible function W(π|Z) is, in the light of the data D: we will calculate 𝒫(W|D), where W is the conditional distribution W(π|Z). If we assume that all patients in D are independently drawn from a given population, the standard Bayesian formula P(A|B) = P(B|A)P(A)/P(B) tells us that
𝒫(W|D) = 𝒫(D|W) 𝒫(W) / ∫ dW′ 𝒫(D|W′) 𝒫(W′)   (197)
with
𝒫(D|W) = Π_{i=1}^N P(X_i,∆_i|Z_i,W) = Π_{i=1}^N ∫ dπ W(π|Z_i) P(X_i,∆_i|π)   (198)
Here ∫dW denotes functional integration over all distributions W(π|Z), subject to the constraint ∫dπ W(π|Z) = 1 for all Z. The survival prediction formulae for an individual with covariates Z will then be
S(t|Z,D) = ∫ dW 𝒫(W|D) S(t|Z,W) = ∫ dW 𝒫(W|D) ∫ dπ W(π|Z) S(t|π)   (199)
(the factor 𝒫(W|D) giving the likelihood that W is right, and S(t|Z,W) the survival prediction given W), and
P(X,∆|Z,D) = ∫ dW 𝒫(W|D) P(X,∆|Z,W) = ∫ dW 𝒫(W|D) ∫ dπ W(π|Z) P(X,∆|π)   (200)
Equivalently:
S(t|Z,D) = ∫ dπ W(π|Z,D) S(t|π)   (201)
P(X,∆|Z,D) = ∫ dπ W(π|Z,D) P(X,∆|π)   (202)
with
W(π|Z,D) = ∫ dW 𝒫(W|D) W(π|Z)   (203)
The distribution W(π|Z,D) combines two sources of uncertainty: (i) uncertainty in the individual hazard rates π given an individual's covariates Z (coded in W(π|Z)), and (ii) our ignorance about which is the true relation W(π|Z), given the data D (described by 𝒫(W|D)). The first uncertainty can be reduced by using more informative covariates, the second by acquiring more data.
Information-theoretic interpretation. Again there exists a nice information-theoretic interpretation of our Bayesian formulae, since
(1/N) log 𝒫(W|D) = (1/N) Σ_{i=1}^N log ∫ dπ W(π|Z_i) P(X_i,∆_i|π) + (1/N) log 𝒫(W) + constant
= Σ_r (1/N) Σ_{i=1}^N ∫ dZ δ(Z−Z_i) ∫_0^∞ dt δ(t−X_i) δ_{r,∆_i} log ∫ dπ W(π|Z) P(t,r|π) + (1/N) log 𝒫(W) + constant
= ∫ dZ P̂(Z) Σ_r ∫_0^∞ dt P̂(t,r|Z) log ∫ dπ W(π|Z) P(t,r|π) + (1/N) log 𝒫(W) + constant   (204)
We can rewrite (204) in terms of a Kullback-Leibler distance:
(1/N) log 𝒫(W|D) = −∫ dZ P̂(Z) Σ_r ∫_0^∞ dt P̂(t,r|Z) log[P̂(t,r|Z) / ∫ dπ W(π|Z) P(t,r|π)] + (1/N) log 𝒫(W) + Constant   (205)
(in which Constant is a new constant that differs from the previous one by a further W-independent term, namely the Z-averaged Shannon entropy of P̂(t,r|Z)). So we see that, apart from the regularising influence of the prior 𝒫(W), the most probable function W(π|Z) is the one that minimises the Z-averaged Kullback-Leibler distance between the empirical distribution P̂(t,r|Z) and its theoretical expectation P(t,r|Z) = ∫ dπ W(π|Z) P(t,r|π).
7.3. Connection between the conditioning picture and the sub-cohort picture

Conditioned cohort-level hazard rates in terms of Ŵ. It is instructive to inspect the relation between the two routes for incorporating covariates into survival prediction in more detail. We note that (180) can be written in terms of the empirical estimator Ŵ(π|Z) given in (196):
π_µ(t|Z) = ∫ dπ Σ_i δ_{Z,Z_i} δ(π−π^i) π_µ(t) e^{−Σ_r ∫_0^t ds π_r(s)} / ∫ dπ Σ_i δ_{Z,Z_i} δ(π−π^i) e^{−Σ_r ∫_0^t ds π_r(s)}
= ∫ dπ Ŵ(π|Z) π_µ(t) e^{−Σ_r ∫_0^t ds π_r(s)} / ∫ dπ Ŵ(π|Z) e^{−Σ_r ∫_0^t ds π_r(s)}   (206)
Relation in terms of prediction. We note the difference between the definitions of S(t|π(Z)) and S(t|Z,W). The first gives the survival probability for a randomly drawn individual from the data set with covariates Z; the second gives the survival probability for a randomly drawn individual with covariates Z (not necessarily from the data set). Any finite-size imperfections of the data set will affect S(t|π(Z)). Within the sub-cohort picture we find, using (183) and (180):
S(t|π(Z)) = (1/|Ω_Z|) Σ_{i∈Ω_Z} π_µ^i(t) S(t|π^i) / π_µ(t|Z) = (1/|Ω_Z|) Σ_{i∈Ω_Z} π_µ^i(t) e^{−Σ_r ∫_0^t ds π_r^i(s)} / π_µ(t|Z)
= ((1/|Ω_Z|) Σ_{i∈Ω_Z} π_µ^i(t) e^{−Σ_r ∫_0^t ds π_r^i(s)}) × (Σ_{i∈Ω_Z} e^{−Σ_r ∫_0^t ds π_r^i(s)} / Σ_{i∈Ω_Z} π_µ^i(t) e^{−Σ_r ∫_0^t ds π_r^i(s)})
= (1/|Ω_Z|) Σ_{i∈Ω_Z} e^{−Σ_r ∫_0^t ds π_r^i(s)}
= ∫ dπ (Σ_{i∈Ω_Z} δ(π−π^i) / Σ_{i∈Ω_Z} 1) e^{−Σ_r ∫_0^t ds π_r(s)}
= ∫ dπ Ŵ(π|Z) S(t|π) = S(t|Z,Ŵ)   (207)
Similarly we can connect the expressions P(X,∆|π(Z)) and P(X,∆|Z,W). Starting from the sub-cohort picture we get
P(X,∆|π(Z)) = π_∆(X|Z) S(X|π(Z)) = (1/|Ω_Z|) Σ_{i∈Ω_Z} π_∆^i(X) S(X|π^i)
= ∫ dπ ((1/|Ω_Z|) Σ_{i∈Ω_Z} δ(π−π^i)) π_∆(X) S(X|π)
= ∫ dπ Ŵ(π|Z) π_∆(X) S(X|π) = P(X,∆|Z,Ŵ)   (208)
There is no contradiction between the two approaches; they just focus on different quantities. In the conditioning picture we capture the variability in the connection Z → π in a distribution W(π|Z), where π refers to the cause-specific hazard rates of individuals. In the sub-cohort picture we describe the variability in the connection Z → π via sub-cohort level cause-specific hazard rates π(Z). In both cases we still need to estimate this variability from the data.
7.4. Conditionally homogeneous cohorts

Trivial versus nontrivial heterogeneity. There are two types of cohort heterogeneity. The trivial one is heterogeneity in covariates, meaning that the Z_i are not identical for all individuals i. We always allow for this by default; it would be silly to include covariates and then assume they take identical values for all individuals (as they would then give no information). The nontrivial type of heterogeneity refers to the link between covariates and risks. A conditionally homogeneous cohort is one in which the cause-specific hazard rates are identical for all individuals i with identical covariates Z_i.

If we work within the sub-cohort picture, formulated in terms of the sub-cohort level cause-specific hazard rates π(Z) of individuals with covariates Z, we need not make any statements on the presence or absence of covariate-to-risk heterogeneity. In either case we simply calculate survival statistics and data likelihoods for any individual with covariates Z via
S(t|π(Z)) = e^{−Σ_r ∫_0^t ds π_r(s|Z)}   (209)
P(X,∆|π(Z)) = π_∆(X|Z) e^{−Σ_r ∫_0^X ds π_r(s|Z)}   (210)
We can get these equations also starting from the conditioning picture, but there we need conditional cohort homogeneity, i.e. W(π|Z) = δ[π−π(Z)], with π(Z) now representing the individual cause-specific hazard rates of individuals with covariates Z. The risk statistics of each individual with covariates Z are then described by the cause-specific hazard rates π(Z). Due to the conditional homogeneity these are trivially identical to the cohort-level hazard rates, and hence (209,210) again hold. Also the Bayesian estimation of π(Z), which involves evaluation of the quantity P(X,∆|π(Z)) for the available data points (X_i,∆_i), would proceed identically in both approaches, but there would be different interpretations of why we write all this and of what π(Z) means:

• conditioning picture: the cohort is taken to be homogeneous in the covariate-to-risk patterns, and we assume that all individuals with covariates Z have individual hazard rates π(Z) (capturing covariate-to-risk heterogeneity is not possible because we assumed there isn't any)
• sub-cohort picture: we make no assumptions regarding homogeneity or heterogeneity, but we assume that π(Z) represents sub-cohort level hazard rates (capturing covariate-to-risk heterogeneity is not possible because we lack the information)

Figure 6. Description of general and of conditionally homogeneous cohorts, within the framework where covariate information is used to condition the probability for individuals to have individual cause-specific hazard rates π. [The figure consists of two nested boxes: the outer box, 'all models', has covariate-to-risk connection W(π|Z), estimated from the data via 𝒫(W|D); the inner box, 'conditionally homogeneous cohorts', has W(π|Z) = δ[π−π^⋆(Z)], with covariate-to-risk connection π^⋆(Z), estimated from the data via 𝒫(π^⋆|D).] In conditionally homogeneous cohorts all cause-specific hazard rates are fully determined by the covariates; there are no 'hidden' covariates that impact upon risk. The remaining uncertainty lies only in our limited ability to infer the function π^⋆(Z) from the data (this uncertainty is described by 𝒫(π^⋆|D)).
The difference between the two interpretations will become relevant when we start thinking about how to capture covariate-to-risk heterogeneity. Then the most suitable starting point will be the conditioning picture, since W(π|Z) is defined in terms of individual cause-specific hazard rates.
Within the conditioning picture, the assumption of cohort homogeneity must be brought in via the prior 𝒫(W) in formula (197), by choosing
𝒫(W) = ∫ dπ^⋆ δ[W − W(π^⋆)] 𝒫(π^⋆)   (211)
Here W(π^⋆) is the covariates-to-rates distribution of a conditionally homogeneous cohort with W(π|Z) = δ[π − π^⋆(Z)], and formula (197) becomes
𝒫(W|D) = 𝒫(D|W) ∫ dπ^⋆ δ[W − W(π^⋆)] 𝒫(π^⋆) / ∫ dW′ 𝒫(D|W′) ∫ dπ^⋆ δ[W′ − W(π^⋆)] 𝒫(π^⋆)
= 𝒫(D|W(π^⋆)) ∫ dπ^⋆ δ[W − W(π^⋆)] 𝒫(π^⋆) / ∫ dπ^⋆ 𝒫(D|W(π^⋆)) 𝒫(π^⋆)
= ∫ dπ^⋆ δ[W − W(π^⋆)] 𝒫(π^⋆|D)   (212)
With (212) our earlier prediction formula for the conditioning picture simplifies to
W(π|Z,D) = ∫ dπ^⋆ 𝒫(π^⋆|D) δ[π − π^⋆(Z)]   (213)
which, in turn, leads us directly to (186,187).
7.5. Nonparametrised determination of covariates-to-risk connection
If we do not wish to take into account our uncertainty regarding the covariates-to-risk relations
π?(t|Z), we could turn to the simple recipe of the maximum likelihood estimator. We saw earlier
that this means finding the maximum of logP(π?|D), but with a flat prior P(π?); equivalently,
maximising logP(D|π?). A flat prior also factorises trivially over risks, so we find the disconnected
maximisation problems (190), with constant functions P(π?r ). Hence the maximum likelihood
estimator πr(t|Z) is found by maximising
L(D|π?r ) =N∑i=1
δr,∆i log π?r (Xi|Zi)−N∑i=1
∫ Xi
0dt π?r (t|Zi)
=N∑i=1
∫ ∞0
dtδ(t−Xi)δr,∆i log π?r (t|Zi)− θ(Xi−t)π?r (t|Zi)
=
∫ ∞0
dtN∑i=1
∫dZ δ(Z−Zi)
δ(t−Xi)δr,∆i log π?r (t|Z)− θ(Xi−t)π?r (t|Z)
=
∫ ∞0
dt
∫dZ
log π?r (t|Z)
N∑i=1
δ(Z−Zi)δ(t−Xi)δr,∆i
−π?r (t|Z)N∑i=1
δ(Z−Zi)θ(Xi−t)
(214)
Straigthforward functional differentiation of this latter expression gives:
(∀Z)(∀t ≥ 0) :1
π?r (t|Z)
N∑i=1
δ(Z−Zi)δ(t−Xi)δr,∆i =N∑i=1
δ(Z−Zi)θ(Xi−t) (215)
which gives us the maximum likelihood estimator
(∀r)(∀t) : πr(t|Z) =
∑i δ(Z−Zi)δr,∆iδ(t−Xi)∑i δ(Z−Zi)θ(Xi−t)
(216)
This is very similar to the earlier estimator (216), but now we find the sums over individuals
restricted to those with covariates Z. Although formally correct, expressions such as (216) are in
practice rather useless. The problem is that we are here estimating functions of p + 1 arguments.
Even if we we reduce our ambition and ask for just five or so points per dimension (a rather small
number), and we have e.g. p = 5 covariates (a modest number), we would still already need in
excess of 5p+1 = 15,625 data points to start covering the space of all (Z, t) combinations. If we
want in addition to estimate values of π?r (t|Z) with, say, 10% accuracy, we need to multiply the
number if data points needed further by a factor 100.
We conclude that, even for conditionally homogeneous cohorts, we have no choice but to find
suitable parametrisations of the functions π⋆_r(t|Z), i.e. we will propose a specific sensible formula for
π⋆_r(t|Z) with a modest number of free parameters, and use the data to estimate these parameters.
This is the idea behind Cox regression.
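Concretely, one can discretise (216): replace the δ-functions by event counts per small time bin among the individuals at risk who share the covariate value of interest. The sketch below (with an illustrative binary covariate and simulated exponential event times; all names and numbers are our own choices) shows both why the estimator is usable for a single discrete covariate and why it collapses in higher dimensions: with several real-valued covariates the selection `Z == z` would almost never match any individuals.

```python
import numpy as np

def raw_hazard_estimate(X, delta, Z, z, bins):
    """Binned version of estimator (216) for one cause: events per time bin
    over person-bins at risk, restricted to individuals with covariate z."""
    mask = (Z == z)
    rate = np.zeros(len(bins) - 1)
    for k in range(len(bins) - 1):
        t0, t1 = bins[k], bins[k + 1]
        events = np.sum(mask & (delta == 1) & (X >= t0) & (X < t1))
        at_risk = np.sum(mask & (X >= t0))      # theta(X_i - t) at bin start
        if at_risk > 0:
            rate[k] = events / (at_risk * (t1 - t0))
    return rate

# toy data: exponential event times, one binary covariate doubling the rate
rng = np.random.default_rng(0)
N = 500
Z = rng.integers(0, 2, N)
X = rng.exponential(1.0 / (1.0 + Z))
delta = np.ones(N, dtype=int)
print(raw_hazard_estimate(X, delta, Z, 1, np.linspace(0, 2, 9)))  # ~2 per bin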
7.6. Examples
Let us get intuition for the effect of Bayesian priors in regression. We saw that nonparametrised
maximisation of (190) with a flat prior P(π⋆_r) gives as the most probable hazard rate a ‘spiky’
estimator (216), with δ-functions at the times where the events in the data set occurred. One does
not expect the real hazard rate to have spikes; this knowledge can be coded into a prior of the form
P(π⋆_r) = (1/C(α)) exp[ −α ∫dZ W(Z) ∫_0^∞ dt ( dπ⋆_r(t|Z)/dt )² ]    (217)
with some normalisation constant C(α). This prior ‘punishes’ explanations with discontinuous
behaviour, while reducing to the flat prior for α→ 0. Let us choose the simplest example, with just
one binary covariate Z_i ∈ {0,1} and just one risk (i.e. ∆_i = 1 for all i, so we can drop the index r).
We make the most natural choice W(Z) = ½δ_{Z,0} + ½δ_{Z,1} in (217). Expression (190) then becomes,
apart from an irrelevant normalisation constant,
logP(π⋆|D) = ∑_{i=1}^N log π⋆(X_i|Z_i) − ∑_{i=1}^N ∫_0^{X_i} dt π⋆(t|Z_i) − ½α ∫_0^∞ dt [ (dπ⋆(t|0)/dt)² + (dπ⋆(t|1)/dt)² ]
= ∑_{i=1}^N δ_{Z_i,0} log π⋆(X_i|0) − ∑_{i=1}^N δ_{Z_i,0} ∫_0^{X_i} dt π⋆(t|0) − ½α ∫_0^∞ dt (dπ⋆(t|0)/dt)²
+ ∑_{i=1}^N δ_{Z_i,1} log π⋆(X_i|1) − ∑_{i=1}^N δ_{Z_i,1} ∫_0^{X_i} dt π⋆(t|1) − ½α ∫_0^∞ dt (dπ⋆(t|1)/dt)²    (218)
The quantity to be maximised has separated into independent expressions, one for π⋆(t|0) and one
for π⋆(t|1). For each Z ∈ {0,1} we have to maximise an expression of the form
L_Z(π) = ∑_{i=1}^N δ_{Z_i,Z} [ log π(X_i|Z) − ∫_0^{X_i} dt π(t|Z) ] − ½α ∫_0^∞ dt ( dπ(t|Z)/dt )²    (219)
To differentiate LZ(π) with respect to π(t|Z) we will use the following identity
δ/δf(t) ∫_0^∞ ds (f′(s))² = 2 ∫_0^∞ ds f′(s) δf′(s)/δf(t)
= 2 lim_{ε→0} (1/ε) ∫_0^∞ ds f′(s) δ/δf(t) [ f(s+ε) − f(s) ]
= 2 lim_{ε→0} (1/ε) [ f′(t−ε) − f′(t) ] = −2f″(t)    (220)
Application to LZ(π) tells us that the most probable function π(t|Z) is to be solved from
π(t|Z) = 0,   or   (1/π(t|Z)) ∑_{i=1}^N δ_{Z_i,Z} δ(t−X_i) − ∑_{i=1}^N δ_{Z_i,Z} θ(X_i−t) + α d²π(t|Z)/dt² = 0    (221)
This can be rewritten in terms of the maximum likelihood estimator
π̂(t|Z) = ∑_i δ_{Z_i,Z} δ(t−X_i) / ∑_i δ_{Z_i,Z} θ(X_i−t)    (222)
as
π(t|Z) = 0,   or   d²π(t|Z)/dt² = (1/α) ( 1 − π̂(t|Z)/π(t|Z) ) ∑_{i=1}^N δ_{Z_i,Z} θ(X_i−t)    (223)
For α→ 0 we recover the maximum-likelihood solutions. For α > 0 we will have jumps in the first
derivative of π(t|Z), but continuous (i.e. non-spiky) rates π(t|Z), as a consequence of the prior. To
Figure 7. Left: the maximum likelihood estimator π̂(t|Z) in (222) for Z = 0, calculated
from a data set with N = 71 patients and a binary covariate Z ∈ {0,1}, in which there
are many early and many late events (but with few at intermediate times). By definition
this estimator always consists of weighted delta-peaks (‘spikes’) at the observed event times.
Right: the most probable solution π⋆(t|Z) for Z = 0 within the Bayesian formalism, which
differs from the previous one in the addition of a ‘smoothness’ prior P(π). Here α = 50.
calculate the corresponding value for LZ we rewrite
L_Z(π) = ∑_{i=1}^N δ_{Z_i,Z} [ log π(X_i|Z) − ∫_0^{X_i} dt π(t|Z) ] − ½α [ π(t|Z) dπ(t|Z)/dt ]_0^∞ + ½α ∫_0^∞ dt π(t|Z) d²π(t|Z)/dt²
= ∑_{i=1}^N δ_{Z_i,Z} [ log π(X_i|Z) − ∫_0^{X_i} dt π(t|Z) ] + ½α ( π(0|Z)π′(0|Z) − π(∞|Z)π′(∞|Z) )
+ ½ ∫_0^∞ dt ( π(t|Z) − π̂(t|Z) ) ∑_{i=1}^N δ_{Z_i,Z} θ(X_i−t)
= ∑_{i=1}^N δ_{Z_i,Z} [ log π(X_i|Z) − ½ ∫_0^{X_i} dt π(t|Z) ] + ½α π(0|Z)π′(0|Z) − ½ ∑_{i=1}^N δ_{Z_i,Z}    (224)
Note that a finite nonzero derivative of π(t|Z) as t → ∞ is ruled out, as it would give either negative
or diverging hazard rates. It is also clear that the maximum must have π(X_i|Z) > 0 for all i, in view
of the term with log π(X_i|Z). Zero rates can only occur in between the data times X_1,...,X_N.
Let us inspect the shape of π(t|Z) when it is nonzero, and assume that there are no ties, i.e.
X_i ≠ X_j if i ≠ j. We can then order our individuals i such that X_0 < X_1 < X_2 < ... < X_{N−1} < X_N
(with the definition X_0 ≡ 0). At any time t ∉ {X_1,...,X_N} equation (223) simplifies considerably:
t < X_1:   d²π(t|Z)/dt² = γ_1(Z) = (1/α) ∑_{i=1}^N δ_{Z_i,Z}    (225)
t ∈ (X_ℓ, X_{ℓ+1}):   d²π(t|Z)/dt² = γ_{ℓ+1}(Z) = (1/α) ∑_{i=ℓ+1}^N δ_{Z_i,Z}    (226)
t ∈ (X_{N−1}, X_N):   d²π(t|Z)/dt² = γ_N(Z) = (1/α) δ_{Z_N,Z}    (227)
t > X_N:   d²π(t|Z)/dt² = 0    (228)
In each interval I` = (X`−1, X`) we apparently have a hazard rate in the shape of a local parabola:
t ∈ (X_{ℓ−1}, X_ℓ):   π(t|Z) = ½γ_ℓ (t − t_ℓ)² + δ_ℓ    (229)
t > X_N:   π(t|Z) = π(∞|Z)    (230)
We only need to determine the constants (t_ℓ, δ_ℓ) for each interval. The solutions in adjacent time
intervals are related by the continuity condition, i.e. lim_{ε↓0} π(X_ℓ+ε) = lim_{ε↓0} π(X_ℓ−ε), giving
ℓ < N:   ½γ_ℓ(X_ℓ − t_ℓ)² + δ_ℓ = ½γ_{ℓ+1}(X_ℓ − t_{ℓ+1})² + δ_{ℓ+1}    (231)
ℓ = N:   ½γ_N(X_N − t_N)² + δ_N = π(∞|Z)    (232)
The second identity which we can use relates to the first derivative of π(t|Z) near each X_ℓ.
Integration over both sides of (223), with ε > 0, gives:
π′(X_ℓ+ε|Z) − π′(X_ℓ−ε|Z) = (1/α) ∑_{i=1}^N δ_{Z_i,Z} ∫_{X_ℓ−ε}^{X_ℓ+ε} dt [ θ(X_i−t) − δ(t−X_i)/π(X_i|Z) ]
= (1/α) ∑_{i=1}^N δ_{Z_i,Z} [ (t−X_i)θ(X_i−t) − θ(t−X_i)/π(X_i|Z) ]_{X_ℓ−ε}^{X_ℓ+ε}
= −(1/α) ∑_{i=1}^N ( δ_{Z_i,Z}/π(X_i|Z) ) [ θ(X_ℓ+ε−X_i) − θ(X_ℓ−ε−X_i) ] = − δ_{Z_ℓ,Z} / ( α π(X_ℓ|Z) )    (233)
Thus we find
ℓ < N:   γ_ℓ(X_ℓ − t_ℓ) = γ_{ℓ+1}(X_ℓ − t_{ℓ+1}) + (1/α) δ_{Z_ℓ,Z} / [ ½γ_{ℓ+1}(X_ℓ − t_{ℓ+1})² + δ_{ℓ+1} ]    (234)
ℓ = N:   γ_N = 0   or   X_N − t_N = 1/π(∞|Z)    (235)
In combination, we end up with the following iteration for the unknown constants (t_ℓ, δ_ℓ), where we
note (and use) the fact that t_ℓ is irrelevant if γ_ℓ = 0:
t_ℓ = X_ℓ − (γ_{ℓ+1}/γ_ℓ)(X_ℓ − t_{ℓ+1}) − (1/(αγ_ℓ)) δ_{Z_ℓ,Z} / [ ½γ_{ℓ+1}(X_ℓ − t_{ℓ+1})² + δ_{ℓ+1} ]    (236)
δ_ℓ = δ_{ℓ+1} + ½γ_{ℓ+1}(X_ℓ − t_{ℓ+1})² − ½γ_ℓ(X_ℓ − t_ℓ)²    (237)
to be iterated downwards, starting with
t_N = X_N − 1/π(∞|Z),   δ_N = π(∞|Z) − γ_N/(2π²(∞|Z))    (238)
The only remaining freedom in our solution is the value chosen for π(∞|Z). This value is determined
by the requirement that our solution must maximise expression (224). For the present solution
π(t|Z) this expression reduces to
L_Z(π) = ∑_{ℓ=1}^N δ_{Z_ℓ,Z} log[ ½γ_ℓ(X_ℓ−t_ℓ)² + δ_ℓ ] − ½ ( ∑_{i=1}^N δ_{Z_i,Z} ) [ 1 + t_1(½γ_1t_1² + δ_1) ]
− ½ ∑_{i=1}^N δ_{Z_i,Z} ∑_{ℓ=1}^i ∫_{X_{ℓ−1}}^{X_ℓ} dt ( ½γ_ℓ(t−t_ℓ)² + δ_ℓ )
= ∑_{i=1}^N δ_{Z_i,Z} { log[ ½γ_i(X_i−t_i)² + δ_i ] − ½ [ 1 + t_1(½γ_1t_1² + δ_1)
+ ∑_{ℓ=1}^i ( (γ_ℓ/6)(X_ℓ−t_ℓ)³ − (γ_ℓ/6)(X_{ℓ−1}−t_ℓ)³ + δ_ℓ(X_ℓ−X_{ℓ−1}) ) ] }    (239)
The resulting solution π⋆(t|Z) is shown for an example data set in figure 7, for α = 50, together
with the ‘spiky’ maximum likelihood estimator π̂(t|Z). The new estimator combines evidence from
the data (the event times) with our prior belief that the true hazard rate should be smooth.
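Rather than implementing the exact piecewise-parabola recursion (236)–(238), a simpler numerical route to essentially the same maximum is to discretise time and maximise the penalised objective (219) for one covariate group directly on a grid. The sketch below is such an approximation: it parametrises π = e^u, which keeps the rate strictly positive (the exact solution may touch zero between events); grid size, α and the toy data are illustrative choices of our own.

```python
import numpy as np
from scipy.optimize import minimize

def smooth_hazard(X_events, alpha, T=8.0, K=160):
    """Discretised maximisation of (219): log-likelihood of the event times
    minus the smoothness penalty (alpha/2) * int (dpi/dt)^2 dt."""
    t = np.linspace(0.0, T, K)
    dt = t[1] - t[0]
    idx = np.minimum((X_events / dt).astype(int), K - 1)      # event bins
    at_risk = (X_events[None, :] >= t[:, None]).sum(axis=1)   # sum_i theta(X_i - t)

    def neg_LZ(u):
        pi = np.exp(u)
        loglik = u[idx].sum() - (at_risk * pi).sum() * dt
        penalty = 0.5 * alpha * (np.diff(pi) ** 2).sum() / dt
        return penalty - loglik

    res = minimize(neg_LZ, np.full(K, -1.0), method="L-BFGS-B")
    return t, np.exp(res.x)

rng = np.random.default_rng(1)
X = np.concatenate([rng.uniform(0, 1, 30), rng.uniform(6, 8, 30)])  # early + late events
t_grid, pi_smooth = smooth_hazard(X, alpha=50.0)  # qualitatively like figure 7 (right)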
8. Proportional hazards (Cox) regression
Most existing survival analysis protocols that aim to quantify the impact of covariate values on
risk, and/or predict survival outcome from an individual’s covariates, can be obtained from the
general Bayesian description in the previous section, upon implementing specific further complexity
reductions. These reductions are always of the following types (often in combination):
• Within the sub-cohort picture: assumptions on the form of π(Z)
These assumptions are designed to reduce the complexity of the mathematical formulas, and
are formulated via simple low-dimensional parametrisations.
• Within the conditioning picture: assumptions on the form of the distribution W (π|Z)
These aim again to reduce the complexity of the mathematical formulas, and are implemented
via the prior P(W) (which is set to zero for all W(π|Z) that are not of the assumed form).
• Assumptions on correlations between risks in the cohort
These relate to the interpretation of results. For instance, the assumption of independence is
needed if we want to interpret the cause specific hazard rates of the primary risk as indicative
of risk in a world where the other risks are eliminated.
• Mathematical approximations
These are short-cuts which in principle always induce some error. An example is limiting
oneself to the most probable value of a parameter, even if its distribution has finite width (e.g.
maximum likelihood estimation versus Bayesian regression).
8.1. Definitions, assumptions and regression equations
Definition of Cox regression. ‘Proportional hazards regression’ or ‘Cox regression’ (dating from
1972) is a formalism that is indeed obtained from the general Bayesian picture via several
simplifications of the type listed above. To appreciate its definition, let us inspect which formulas
we could in principle write for the hazard rate of the primary risk. We must demand π1(t|Z) ≥ 0
for all t ≥ 0 and all Z, so we can always write it in exponential form. If we then also expand the
exponent in powers of Z we see that any acceptable cause-specific hazard rate can be written as
π_1(t|Z) = π_1(t|0) exp[ ∑_{µ=1}^p β_µ(t)Z_µ + ½ ∑_{µ,ν=1}^p β_{µν}(t)Z_µZ_ν + O(Z³) ]    (240)
Cox regression boils down to an inspired simplification of this general expression, crucial at the time
of the method’s conception when computation resources were very limited (remember that in 1972
the average university would have just one big but slow computer):
We assume that (conditional on the covariates Z) all risks are statistically independent,
and that the cause-specific hazard rate of the primary risk for individuals with covariates
Z is a function of the following parametrized form:
π_1(t|Z) = λ_0(t) e^{β·Z}    (241)
Here β·Z = ∑_{µ=1}^p β_µZ_µ, with time-independent parameters β = (β_1,...,β_p). We then
focus on calculating the most probable β and the most probable function λ_0(t).
The function λ0(t) ≥ 0 is called the ‘base hazard rate’. It is the primary risk hazard rate one would
find for the trivial covariates Z = (0, 0, . . . , 0). The name ‘proportional hazards’ refers to the fact
that, due to the exponential form of (241), the effect of each covariate is multiplicative:
π_1(t) = λ_0(t) × e^{β_1Z_1} × ··· × e^{β_pZ_p}
with λ_0(t) the base hazard rate, and the exponential factors the ‘proportional hazards’.
The main implications of (241) are that the effects of the covariates are taken to be mutually
independent and independent of time. One effectively assumes that there exists a time-independent
hyper-plane in covariate space that separates high risk individuals from low risk individuals:
‘high-risk covariates’:   β_1Z_1 + ... + β_pZ_p large
‘low-risk covariates’:   β_1Z_1 + ... + β_pZ_p small
In addition we can now quantify the risk impact of each individual covariate µ in a single time-
independent number, the so-called ‘hazard ratio’
HR_µ = π_1(t|Z)|_{Z_µ=1} / π_1(t|Z)|_{Z_µ=0} = λ_0(t) e^{β_µ·1+∑_{ν≠µ}β_νZ_ν} / λ_0(t) e^{β_µ·0+∑_{ν≠µ}β_νZ_ν} = e^{β_µ}    (242)
Covariates with no impact on risk, i.e. with βµ = 0, would thus give HRµ = 1. Note that in
the more general case (240) the ratio π1(t|Z)|Zµ=1/π1(t|Z)|Zµ=0 would still have depended on the
remaining covariates Zν with ν 6= µ. The main virtue of the choice (241) is that it is the simplest
nontrivial definition to meet the main criteria that we need to build into any parametrisation of
cause-specific hazard rates (nonnegativity, possible dependence on time and on covariates) in which
we can effectively decouple the time variable from the variables relating to covariates.
Derivation of equations for regression parameters. Within Cox regression we seek to find the most
probable parameters β = (β1, . . . , βp) and the most probable function λ0(t) in (241). In Cox’s
original paper he did not in fact calculate λ0(t) explicitly, but instead focused on calculating β
using an argument (‘partial likelihood’) that avoids having to know the base hazard rate. Here we
use the benefit of hindsight and the fact that we have already done much of the preparatory work,
and calculate the most probable parameters directly from (190) (with a flat prior P(π⋆_1), where the
most probable Bayesian solution reduces to maximum likelihood estimation):
logP(β,λ_0|D) = ∑_{i=1}^N [ δ_{1,∆_i} log π_1(X_i|Z_i) − ∫_0^{X_i} dt π_1(t|Z_i) ] + constant
= ∑_{i=1}^N [ ∫_0^∞ dt log λ_0(t) δ_{1,∆_i}δ(t−X_i) + δ_{1,∆_i} β·Z_i − e^{β·Z_i} ∫_0^∞ dt θ(X_i−t)λ_0(t) ] + constant    (243)
Maximisation of this expression is done as always via the Lagrange formalism. Let us first maximise
over λ0(t), and define L(β|D) = maxλ0 logP(β, λ0|D). It will again turn out that the constraint
λ0(t) ≥ 0 will be met automatically, so the Lagrange equations from which to solve λ0(t) become
0 = δ logP(β,λ_0|D)/δλ_0(t) = (1/λ_0(t)) ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) − ∑_{i=1}^N e^{β·Z_i} θ(X_i−t)    (244)
It follows that the maximising function λ0(t), given β, is
λ_0(t|β) = ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) / ∑_{i=1}^N e^{β·Z_i} θ(X_i−t)    (245)
For β = 0 this expression reduces to the simple covariate-free maximum likelihood estimator of the
cause-specific hazard rate, as one would expect. Having calculated the most probable base hazard rate (245) in terms of
the regression parameters β, we are then left with the following function to be maximised over β:
L(β|D) = max_{λ_0} logP(β,λ_0|D)
= ∫_0^∞ dt ( ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) ) log( ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) ) + constant
− ∫_0^∞ dt ( ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) ) log( ∑_{i=1}^N e^{β·Z_i}θ(X_i−t) )
+ ∑_{i=1}^N δ_{1,∆_i} β·Z_i − ∑_{i=1}^N e^{β·Z_i} ∫_0^∞ dt θ(X_i−t) [ ∑_{j=1}^N δ_{1,∆_j}δ(t−X_j) / ∑_{j=1}^N e^{β·Z_j}θ(X_j−t) ]
= L(0|D) − ∫_0^∞ dt ( ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) ) log( ∑_{i=1}^N e^{β·Z_i}θ(X_i−t) )
+ ∑_{i=1}^N δ_{1,∆_i} β·Z_i − ∑_{i=1}^N e^{β·Z_i} ∫_0^∞ dt θ(X_i−t) ( ∑_{j=1}^N δ_{1,∆_j}δ(t−X_j) / ∑_{j=1}^N e^{β·Z_j}θ(X_j−t) )
+ ∫_0^∞ dt ( ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) ) log( ∑_{i=1}^N θ(X_i−t) ) + ∑_{i=1}^N ∫_0^∞ dt θ(X_i−t) ( ∑_{j=1}^N δ_{1,∆_j}δ(t−X_j) / ∑_{j=1}^N θ(X_j−t) )
= L(0|D) + ∑_{i=1}^N δ_{1,∆_i} β·Z_i − ∑_{i=1}^N δ_{1,∆_i} log( ∑_{j=1}^N e^{β·Z_j}θ(X_j−X_i) ) + ∑_{i=1}^N δ_{1,∆_i} log( ∑_{j=1}^N θ(X_j−X_i) )
= L(0|D) + ∑_{i=1}^N δ_{1,∆_i} β·Z_i − ∑_{i=1}^N δ_{1,∆_i} log[ ∑_{j=1}^N e^{β·Z_j}θ(X_j−X_i) / ∑_{j=1}^N θ(X_j−X_i) ]    (246)
From this result we obtain by differentiation the equations 0 = ∂L(β|D)/∂β_µ from
which to solve the most probable β, giving
for all µ:   ∑_{i=1}^N δ_{1,∆_i} [ Z_µ^i − ∑_{j=1}^N Z_µ^j e^{β·Z_j}θ(X_j−X_i) / ∑_{j=1}^N e^{β·Z_j}θ(X_j−X_i) ] = 0    (247)
This is a relatively simple set of coupled nonlinear equations for just p parameters (β1, . . . , βp),
which could indeed be analysed with the computing power of the 1970s. Once the parameters β
have been determined, the corresponding hazard ratios follow via (242), and the most probable base
hazard rate λ0(t) follows from (245).
Finally, once the most probable base hazard rate and the most probable regression parameters
β are known, the Cox formalism allows us to predict the survival time for any individual with
covariates Z via the primary risk-specific version of (181), which now reduces to
S_Cox(t|Z) = exp( −∫_0^t ds π_1(s|Z) ) = exp( −e^{β̂·Z} Λ_0(t|β̂) )    (248)
with, upon integrating (245) over time,
Λ_0(t|β̂) = ∫_0^t ds λ_0(s|β̂) = ∑_{i=1}^N δ_{1,∆_i} θ(t−X_i) / ∑_{j=1}^N e^{β̂·Z_j}θ(X_j−X_i)    (249)
(which is Breslow’s estimator, first given in the comments at the end of Cox’s 1972 paper). In
combination this gives
S_Cox(t|Z) = exp( − ∑_{i=1}^N δ_{1,∆_i} e^{β̂·Z} θ(t−X_i) / ∑_{j=1}^N e^{β̂·Z_j}θ(X_j−X_i) )    (250)
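The chain (246) → (247) → (249) → (250) can be checked numerically. The sketch below fits β by maximising the log partial likelihood with a generic optimiser rather than solving (247) directly, then evaluates Breslow's estimator and the predicted survival curve; the toy data, the O(N²) risk-set loops and the neglect of ties are simplifying assumptions of our own.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N, p = 200, 3
Z = rng.normal(size=(N, p))
beta_true = np.array([0.8, -0.5, 0.0])
T = rng.exponential(1.0 / np.exp(Z @ beta_true))   # true base hazard lambda_0 = 1
C = rng.exponential(2.0, N)                        # independent censoring
X, delta = np.minimum(T, C), (T <= C).astype(int)

def neg_log_partial_likelihood(beta):
    eta = Z @ beta
    ll = 0.0
    for i in range(N):
        if delta[i] == 1:                               # delta_{1,Delta_i}
            ll += eta[i] - np.log(np.exp(eta[X >= X[i]]).sum())
    return -ll

beta_hat = minimize(neg_log_partial_likelihood, np.zeros(p)).x

def Lambda0(t):          # Breslow's estimator (249)
    eta = Z @ beta_hat
    return sum(1.0 / np.exp(eta[X >= X[i]]).sum()
               for i in range(N) if delta[i] == 1 and X[i] <= t)

def S_cox(t, z):         # predicted survival, (248)/(250)
    return np.exp(-np.exp(z @ beta_hat) * Lambda0(t))

print(beta_hat)                  # should be close to beta_true for large N
print(S_cox(1.0, np.zeros(p)))   # baseline survival at t=1; true value e^{-1}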
8.2. Uniqueness and p-values for regression parameters
Curvature of L(β|D) and uniqueness. To find out whether there could be multiple solutions of
equation (247), it is helpful to inspect the second derivative (or curvature) of (246). We note that (247)
was derived from the first derivative of L(β|D):
∂L(β|D)/∂β_µ = ∑_i δ_{1,∆_i} [ Z_µ^i − ∑_j Z_µ^j e^{β·Z_j}θ(X_j−X_i) / ∑_j e^{β·Z_j}θ(X_j−X_i) ]    (251)
Hence, upon introducing the short-hand
〈u_j〉_i = ∑_j p_j(i) u_j,   p_j(i) = e^{β·Z_j}θ(X_j−X_i) / ∑_j e^{β·Z_j}θ(X_j−X_i)    (252)
we obtain by further differentiation
∂²L(β|D)/∂β_µ∂β_ν = −∑_i δ_{1,∆_i} [ 〈Z_µ^j Z_ν^j〉_i − 〈Z_µ^j〉_i 〈Z_ν^j〉_i ]
= −∑_i δ_{1,∆_i} 〈 (Z_µ^j − 〈Z_µ^j〉_i)(Z_ν^j − 〈Z_ν^j〉_i) 〉_i    (253)
Unless all event times are equal, the matrix of second derivatives is seen to be negative definite
everywhere, since for any vector y ∈ IR^p one has
∑_{µ,ν=1}^p y_µ ( ∂²L(β|D)/∂β_µ∂β_ν ) y_ν = − ∑_{i=1}^N δ_{1,∆_i} 〈 ( ∑_{µ=1}^p y_µ(Z_µ^j − 〈Z_µ^j〉_i) )² 〉_i < 0    (254)
Hence there will be only one extremal point of L(β|D) and therefore only one solution of (247), and
we know it will indeed be a maximum. From now on we will write the relevant point as β̂.
Shape of P(β, λ0|D) near the most probable point. We next write the entries of the curvature matrix
of L(β|D) at the most probable point β̂ as minus A_µν (where A is a positive definite matrix):
A_µν(β̂) = − ∂²L(β|D)/∂β_µ∂β_ν |_{β̂} = ∑_i δ_{1,∆_i} 〈 (Z_µ^j−〈Z_µ^j〉_i)(Z_ν^j−〈Z_ν^j〉_i) 〉_i |_{β̂}    (255)
We can now expand L(β|D) close to the maximum point:
L(β|D) = L(β̂|D) − ½ ∑_{µν} A_µν(β̂)(β_µ−β̂_µ)(β_ν−β̂_ν) + O(|β−β̂|³)    (256)
Given the definition of L(β|D), we can now also write
max_{λ_0} P(β,λ_0|D) = e^{ L(β̂|D) − ½∑_{µν}A_µν(β̂)(β_µ−β̂_µ)(β_ν−β̂_ν) + O(|β−β̂|³) }    (257)
It follows that, provided max_{λ_0}P(β,λ_0|D) is a narrow distribution around the most probable point
β̂, and provided we can disregard the uncertainty in the base hazard rate⁺, we can approximate
P(β|D) = max_{λ_0}P(β,λ_0|D) by a multi-variate Gaussian distribution, from which we can also obtain
the Gaussian marginals for individual regression parameters:
P(β_µ|D) ≈ (σ_µ√(2π))^{−1} e^{−½(β_µ−β̂_µ)²/σ_µ²},   σ_µ² = (A^{−1})_{µµ}(β̂)    (258)
p-values for Cox regression parameters and hazard ratios. The latter result (258) allows us to define
approximate p-values for Cox regression. We do this in the usual way: given the observed value β̂_µ
in regression, we define the p-value as the probability to observe |β_µ| ≥ |β̂_µ| in a ‘null model’.
The null model chosen here is the distribution (258) that corresponds to the trivial value β̂ = 0.
However, this is a further approximation, since one could also set only β̂_µ = 0 in the null model,
leaving the other regression parameters nonzero (in which case the variance in (258) could depend,
via the matrix A(β̂), on all other β_ν with ν ≠ µ). If we choose the null model β̂ = 0 we get
P_0(β_µ) = (σ_µ√(2π))^{−1} e^{−½β_µ²/σ_µ²},   σ_µ² = (A^{−1})_{µµ}(0)    (259)
and our p-value approximation will be
p-value = Prob( |β_µ| ≥ |β̂_µ| ) = 1 − (2/(σ_µ√(2π))) ∫_0^{|β̂_µ|} dβ e^{−½β²/σ_µ²}
= 1 − (2/√π) ∫_0^{|β̂_µ|/(σ_µ√2)} dx e^{−x²} = 1 − Erf( |β̂_µ|/(σ_µ√2) )    (260)
The ratio |β̂_µ|/σ_µ is called the z-score. Note that the approximations underlying this final simple
result (260) are quite drastic: (i) forget about uncertainty in the base hazard rate λ0(t), (ii)
approximate the posterior distribution for β by a Gaussian, (iii) assume a null model in which
all regression parameters are zero, and (iv) ignore all correlations between regression parameters of
different covariates. Note also that the p-values do not measure the possible error introduced by
overfitting. We will come back to overfitting later.
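In code, the recipe (260) is a one-liner; the numbers below are illustrative.

```python
from math import erf, sqrt

# p-value (260): p = 1 - Erf(z/sqrt(2)) with z = |beta_hat|/sigma,
# i.e. the usual two-sided Gaussian p-value.
def cox_p_value(beta_hat, sigma):
    z = abs(beta_hat) / sigma
    return 1.0 - erf(z / sqrt(2.0))

print(cox_p_value(0.40, 0.15))   # z ~ 2.67, p ~ 0.0077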
8.3. Properties and limitations of Cox regression
Normalisation of covariates. The optimal value one will find for the regression parameter vector
β will obviously depend on the units chosen for the covariates, since only the sums ∑_{µ=1}^p β_µZ_µ^i
appear in the parameter likelihood. For instance, a renormalisation Z_µ^i → ϱ_µZ_µ^i for all i would
simply rescale the most probable regression parameters via β̂_µ → β̂_µ/ϱ_µ. This implies that, unless
we prescribe a normalisation convention for the covariates, we cannot use the value of β̂_µ directly
as a quantitative measure of the impact of covariate µ on survival. In addition, the definition of the
hazard ratio given in (242) as yet makes sense only for binary covariates Z_µ ∈ {0,1}. To resolve
+ Note: this is an assumption for which we have no justification yet, but which is essential if in the Cox
approach we want to quantify regression uncertainty, since the whole point in Cox regression is to eliminate
the base hazard rate from the problem and formulate everything strictly in terms of β alone.
these problems we need a unified normalisation of the covariates. One natural convention is to
choose units for all covariates such that
covariate normalisation:   (1/N) ∑_i Z_µ^i = 0,   (1/N) ∑_i (Z_µ^i)² = 1    (261)
Unless a covariate takes the same value for all individuals, this can always be achieved by linear
rescaling. Upon adopting (261), different components β̂_µ can be compared meaningfully, with those
further away from zero implying a more prominent impact of covariates on survival.
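The convention (261) is the familiar per-covariate standardisation with the biased (1/N) variance; a minimal sketch:

```python
import numpy as np

# Enforce (261): zero empirical mean and unit empirical variance per
# covariate (columns of Z are covariates); fails if a column is constant.
def normalise_covariates(Z):
    Z = np.asarray(Z, dtype=float)
    return (Z - Z.mean(axis=0)) / Z.std(axis=0)

Zn = normalise_covariates(np.random.default_rng(3).normal(2.0, 5.0, size=(100, 4)))
assert np.allclose(Zn.mean(axis=0), 0) and np.allclose((Zn**2).mean(axis=0), 1)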
Hazard ratios for normalised covariates. With the normalisation (261) we can also generalise
our previous definition (242) of hazard ratios (which as yet applied to binary covariates only)
to include arbitrary (e.g. real-valued) covariates. In a balanced cohort and for binary covariates the
normalisation (261) would imply Z_µ ∈ {−1,1}, and consistency with our earlier definition (242) of
hazard ratios would then demand that we define
HR_µ = π_1(t|Z)|_{Z_µ=1} / π_1(t|Z)|_{Z_µ=−1} = λ_0(t) e^{∑_{ν≠µ}β_νZ_ν + β_µ} / λ_0(t) e^{∑_{ν≠µ}β_νZ_ν − β_µ} = e^{2β_µ}    (262)
This is the appropriate hazard ratio definition corresponding to the convention (261). The Gaussian
approximation (258) allows us to calculate for each covariate µ the so-called 95% confidence intervals
for hazard ratios:
[HR_µ^−, HR_µ^+],   HR_µ^± = e^{2(β̂_µ ± d_µ)}   (d_µ > 0)    (263)
such that Prob(β̂_µ−d_µ < β_µ < β̂_µ+d_µ) = 0.95. From (258) we can calculate the quantity d_µ:
0.95 = ∫_{β̂_µ−d_µ}^{β̂_µ+d_µ} ( dβ_µ/(σ_µ√(2π)) ) e^{−½(β_µ−β̂_µ)²/σ_µ²} = Erf( d_µ/(σ_µ√2) )    (264)
Hence we can express dµ via the inverse error function as
d_µ = √2 Erf^{−1}(0.95) σ_µ ≈ 1.96 σ_µ    (265)
To be on the safe side, since the above is still an approximation in view of the Gaussian assumption
for the βµ-distribution, many authors in fact use dµ = 2σµ. This gives the convention
95% confidence interval:   HR_µ ∈ [ e^{2(β̂_µ−2σ_µ)}, e^{2(β̂_µ+2σ_µ)} ]    (266)
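Given β̂_µ and σ_µ, the hazard ratio (262) and the conservative interval (266) follow immediately; a minimal sketch with illustrative numbers:

```python
import numpy as np

# Hazard ratio under convention (261), following (262), with the
# conservative 95% interval (266) using d_mu = 2 sigma_mu.
def hazard_ratio_ci(beta_hat, sigma):
    return (np.exp(2 * (beta_hat - 2 * sigma)),
            np.exp(2 * beta_hat),
            np.exp(2 * (beta_hat + 2 * sigma)))

lo, hr, hi = hazard_ratio_ci(0.3, 0.1)   # HR = e^0.6 ~ 1.82, CI [e^0.2, e^1.0]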
Univariate versus multivariate regression and correlated covariates. The proportional hazards
assumption in Cox regression, i.e. that each covariate contributes an independent multiplicative
factor to the primary risk hazard rate, will be violated as soon as covariates are correlated. We
should expect strictly uncorrelated covariates to be the exception rather than the norm. This has
implications. It can cause degeneracies in the parameter likelihood, such that there is no longer a
unique optimal regression vector. To see this just consider the extreme case where we have just two
covariates Z_{1,2} ∈ {−1,1}, and these two covariates contain exactly the same information:
Z_2 = Z_1:   π_1(t|Z) = λ_0(t) e^{β_1Z_1+β_2Z_2} = λ_0(t) e^{(β_1+β_2)Z_1}
In either case one can at best find an optimal linear combination of covariates, but no unique values;
any risk evidence in the covariates can be shared in arbitrary ratios among the two covariates.
A further consequence of covariate correlations is that there will now be a difference between
the regression parameters βµ that one would find in univariate regression, i.e. in Cox regression
using π_1(t|Z_µ) = λ_0(t)e^{β_µZ_µ} (with only one covariate µ included), and multivariate regression, where
π_1(t|Z) = λ_0(t)e^{∑_{ν=1}^p β_νZ_ν} with µ ∈ {1,...,p}. If covariate µ correlates with one or more other
covariates, then in multivariate regression the predictive evidence will generally be ‘shared’ among
the covariates, whereas in univariate regression it will not. Again, to appreciate this just consider
the previous example, with Z_1 = Z_2 = Z:
univariate regression:   π_1(t|Z_1) = λ_0(t)e^{β_1Z},   π_1(t|Z_2) = λ_0(t)e^{β_2Z}    (267)
bivariate regression:   π_1(t|Z_1,Z_2) = λ_0(t)e^{(β′_1+β′_2)Z}    (268)
Here (β_1,β_2) are the regression parameters found upon studying the impact of the two covariates
via univariate regression, and (β′_1,β′_2) are the regression parameters found via bivariate regression.
Since the data likelihood only depends on the cause-specific hazard rates, we will always find
β_1 = β_2 = β′_1 + β′_2. Hence, unless β_1 = 0 or β_2 = 0, one will inevitably have β_1 ≠ β′_1 or β_2 ≠ β′_2
(or both). In fact even for uncorrelated covariates and infinitely large data sets (where there are no
issues with finite size corrections and uncertainties) one will still generally find different regression
parameters when comparing univariate to multivariate regression – see example 1 at the end of this
chapter. The regression parameters (and hence also the hazard ratios) depend on the modelling
context, i.e. on exactly which other covariates were included; they are not objective quantitative
measures of the impact of individual covariates on risk.
Issues related to inclusion of treatment parameters as covariates. Often one includes treatment
parameters as covariates, with the objective of quantifying treatment effect on survival. As soon as
such treatment decisions involve human judgement based on observing covariates (as they usually
do), this may affect the outcome of regression for the initial covariates:
• Adding the treatment decision as a new covariate by definition turns the extended covariate
set into a correlated one, even if the initial set was not. For instance, assume a covariate
Z_1 ∈ {1,2,3} indicates the grade of a tumour, and the clinical protocol is to give a patient with
grade 2 or 3 chemotherapy; then we could indicate this decision with a variable Z_2 = θ(Z_1−3/2)
(with the step function), and the pair (Z1, Z2) will be strongly correlated.
• Provided they are medically effective, treatments will reduce or undo any patterns that connect
the other covariates to risk: if high-risk patients are correctly identified from covariates
(implying a significant predictive signal in the covariates), and selected for medical treatment,
then these individuals are thereby converted by this treatment to low-risk patients. This
removes the prior link between covariates and medical outcome.
Issues related to interpretation of regression parameters. Once the regression parameters β have
been calculated, and assuming there are no problems caused by risk correlations, there are still
pitfalls in the interpretation of these parameters. To name two:
• In heterogeneous cohorts one will typically find dependence of the regression parameters on the
duration of the trial, with these parameters moving closer to zero with longer trial durations.
This need not be due to a true time dependence, but may well reflect ‘cohort filtering’, similar
to the mechanism underlying figure 2.
• One should not interpret nonzero regression parameters as evidence for causal effects of the
associated covariates on risk. Finding βµ > 0 simply tells us that individuals with larger Zµ are
more likely to experience the primary hazard. Hazard and covariate could both be consequences
of a common cause, or perhaps the impending hazard could even cause the elevated covariate.
Imagine what we would find upon including the frequency of hospital visits as a covariate; we
would undoubtedly find a significantly large associated regression parameter, as individuals
who visit hospitals more are more likely to be ill. Naive interpretation of the outcome of
regression would then lead us to recommend that hospital visits should generally be avoided.
8.4. Examples
Example 1: multivariate versus univariate analysis
We explore further the differences between univariate Cox regression (one covariate included at
a time) and multivariate regression (with multiple covariates simultaneously), and the effects
of covariate correlations. Assume for simplicity that we have just one risk (so ∆i = 1 for all i)
and only two covariates, and start from expression (246), with β = (β_1,β_2) and Z_i = (Z_1^i, Z_2^i):
L(β) = ∑_{i=1}^N β·Z_i − ∑_{i=1}^N log( (1/N) ∑_{j=1}^N e^{β·Z_j}θ(X_j−X_i) ) + constant    (269)
Let us assume our data are of the form X_i = f(Z_1^i+Z_2^i), where f(x) is some monotonically
decreasing function (so those with larger values of Z_1^i+Z_2^i experience primary events earlier).
This implies that θ(X_j−X_i) = θ[Z_1^i+Z_2^i−Z_1^j−Z_2^j]. In addition we define L̄(β) = L(β)/N, and
assume that all Z_i are drawn randomly from a zero-average distribution P(Z). For very large
populations we will then find, using the law of large numbers:
lim_{N→∞} L̄(β) = lim_{N→∞} (1/N) ∑_{i=1}^N β·Z_i − lim_{N→∞} (1/N) ∑_{i=1}^N log( (1/N) ∑_{j=1}^N e^{β·Z_j} θ[Z_1^i+Z_2^i−Z_1^j−Z_2^j] ) + constant
= −〈 log〈 e^{β·Z′} θ[Z_1+Z_2−Z′_1−Z′_2] 〉_{Z′} 〉_Z + constant    (270)
in which 〈...〉_Z = ∫dZ P(Z)(...). Our different Cox regression versions are now:
multivariate:   find min_{β_1,β_2} 〈 log〈 e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] 〉_{Z′} 〉_Z    (271)
univariate:   find min_{β_1} 〈 log〈 e^{β_1Z′_1} θ[Z_1+Z_2−Z′_1−Z′_2] 〉_{Z′} 〉_Z    (272)
and find min_{β_2} 〈 log〈 e^{β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] 〉_{Z′} 〉_Z    (273)
Upon doing the required differentiations with respect to the regression parameters, we then
find the following equations from which to solve β_1 and β_2:
multivariate:   ∫dZ P(Z) [ ∫dZ′P(Z′) Z′_1 e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′P(Z′) e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] ] = 0    (274)
∫dZ P(Z) [ ∫dZ′P(Z′) Z′_2 e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′P(Z′) e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] ] = 0    (275)
univariate:   ∫dZ P(Z) [ ∫dZ′P(Z′) Z′_1 e^{β_1Z′_1} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′P(Z′) e^{β_1Z′_1} θ[Z_1+Z_2−Z′_1−Z′_2] ] = 0    (276)
∫dZ P(Z) [ ∫dZ′P(Z′) Z′_2 e^{β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′P(Z′) e^{β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] ] = 0    (277)
We next work out two choices for the covariate statistics P(Z), which we take to be zero-average
Gaussian. In the first choice the covariates are identical, P(Z_1,Z_2) = δ(Z_2−Z_1) e^{−½Z_1²}/√(2π).
In the second choice they are independent: P(Z_1,Z_2) = ( e^{−½Z_1²}/√(2π) )( e^{−½Z_2²}/√(2π) ).
• Correlated (identical) covariates:
Here our previous equations from which to solve (β1, β2) can now all be written in terms
of the following function:
F(u) = ∫dZ ( e^{−½Z²}/√(2π) ) [ ∫dY Y e^{−½Y²+uY} θ[Z−Y] / ∫dY e^{−½Y²+uY} θ[Z−Y] ]    (278)
To be specific, one finds
multivariate Cox regression:   F(β_1+β_2) = 0    (279)
univariate Cox regression:   F(β_1) = F(β_2) = 0    (280)
One can prove easily that the function F(u) is convex, i.e. F″(u) > 0 for all u, so the
equation F(u) = 0 has exactly one solution u⋆. Thus we find:
multivariate Cox regression:   β_1 + β_2 = u⋆    (281)
univariate Cox regression:   β_1 = β_2 = u⋆    (282)
As expected, multivariate regression and univariate regression do not lead to the same
regression parameters. In univariate regression each individual covariate provides the
same amount of evidence for survival outcome, quantified by the value u⋆ for each β_{1,2}.
In multivariate regression, in contrast, the evidence is shared between the two covariates.
• Uncorrelated covariates:
Here, with P(Z_1,Z_2) = e^{−½(Z_1²+Z_2²)}/2π, the calculations are slightly more involved. It will
be advantageous to first transform the variables Z and Z′ according to
X_1 = (Z_1+Z_2)/√2,   X_2 = (Z_1−Z_2)/√2,   P(X_1,X_2) = (1/2π) e^{−½(X_1²+X_2²)}    (283)
This will give θ[Z_1+Z_2−Z′_1−Z′_2] = θ[X_1−X′_1]. Let us start with the ratio of integrals
appearing in our expression for multivariate analysis:
∫dZ′P(Z′) Z′_1 e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′P(Z′) e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2]
= (1/√2) ∫dX′ e^{−½(X′_1²+X′_2²)} (X′_1+X′_2) e^{(1/√2)[X′_1(β_1+β_2)+X′_2(β_1−β_2)]} θ[X_1−X′_1] / ∫dX′ e^{−½(X′_1²+X′_2²)} e^{(1/√2)[X′_1(β_1+β_2)+X′_2(β_1−β_2)]} θ[X_1−X′_1]
= (1/√2) ∫dX′_1 X′_1 e^{−½X′_1²+(1/√2)X′_1(β_1+β_2)} θ[X_1−X′_1] / ∫dX′_1 e^{−½X′_1²+(1/√2)X′_1(β_1+β_2)} θ[X_1−X′_1]
+ (1/√2) ∫dX′_2 X′_2 e^{−½X′_2²+(1/√2)X′_2(β_1−β_2)} / ∫dX′_2 e^{−½X′_2²+(1/√2)X′_2(β_1−β_2)}
= (1/√2) ∫_{−∞}^{X_1} dx x e^{−½x²+(1/√2)x(β_1+β_2)} / ∫_{−∞}^{X_1} dx e^{−½x²+(1/√2)x(β_1+β_2)} + ½(β_1−β_2)
= ½(β_1+β_2) − (1/√2) ∫_{−∞}^{X_1} dx (d/dx) e^{−½[x−(β_1+β_2)/√2]²} / ∫_{−∞}^{X_1} dx e^{−½[x−(β_1+β_2)/√2]²} + ½(β_1−β_2)
= β_1 − (1/√2) e^{−½[X_1−(β_1+β_2)/√2]²} / ∫_{−∞}^{X_1−(β_1+β_2)/√2} dx e^{−½x²}
= β_1 − (1/√π) e^{−½[X_1−(β_1+β_2)/√2]²} / ( 1 + Erf[ (X_1−(β_1+β_2)/√2)/√2 ] )    (284)
After further averaging over X we then find that all equations for regression parameters
can now be written in terms of the function
G(u) = ∫dx ( e^{−½x²}/√(2π) ) (1/√π) e^{−½[x−u/√2]²} / ( 1 + Erf[(x−u/√2)/√2] )
= ∫ ( dx/(π√2) ) e^{−½x²−½[x+u/√2]²} / ( 1 + Erf(x/√2) )    (285)
To be specific, we find
multivariate Cox regression:   β_1 = G(β_1+β_2),   β_2 = G(β_1+β_2)    (286)
univariate Cox regression:   β_1 = G(β_1),   β_2 = G(β_2)    (287)
So for either version of Cox analysis we have β_1 = β_2 = β, but the equations from
which to solve β, and therefore the values found for β, are not identical in the two cases.
Even when covariates are not correlated, one can apparently still find different regression
parameters and hazard ratios when comparing univariate to multivariate regression:
multivariate Cox regression:   β = G(2β)    (288)
univariate Cox regression:   β = G(β)    (289)
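Equations (288) and (289) are easily solved numerically. The sketch below evaluates G(u) from the second form in (285) by quadrature, using the identity 1 + Erf(x/√2) = erfc(−x/√2) for numerical stability at x ≪ 0; the integration range and root brackets are choices made by inspection of G, and are assumptions of this sketch.

```python
import numpy as np
from math import sqrt, pi
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.special import erfc

def G(u):
    # second form of (285); erfc(-x/sqrt(2)) = 1 + Erf(x/sqrt(2))
    f = lambda x: np.exp(-0.5*x**2 - 0.5*(x + u/sqrt(2))**2) / erfc(-x/sqrt(2))
    return quad(f, -15, 15)[0] / (pi * sqrt(2))

beta_uni = brentq(lambda b: b - G(b), 0.0, 2.0)      # univariate: beta = G(beta)
beta_multi = brentq(lambda b: b - G(2*b), 0.0, 2.0)  # multivariate: beta = G(2*beta)
print(beta_uni, beta_multi)   # two different values, as claimed in the text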
Example 2: effect of duration of trial on regression parameters
Imagine we have data D on a cohort of size N , with one primary risk and the end-of-trial risk.
The trial is terminated at time τ > 0. All individuals i with a primary event prior to time τ
will have ∆i = 1, and all others will have ∆i = 0. This implies that if ti is the time at which
the primary event would occur for individual i, then the actually reported data will be
(X_i,∆_i) = (t_i, 1) if t_i < τ,   (τ, 0) if t_i ≥ τ    (290)
We measure one binary covariate Z_i ∈ {0,1}, and we assume that our cohort is heterogeneous,
involving two distinct event time distributions. We write the probability to find t_i = t as p_i(t):
i ≤ N/2:   p_i(t) = a e^{−at},    i > N/2:   p_i(t) = a(1+Z_i) e^{−a(1+Z_i)t}    (291)
with a > 0. Let us first determine the true cause-specific hazard rates for the above example.
The individual data distributions are
P_i(X,∆) = δ_{∆,1} θ(τ−X) p_i(X) + δ_{∆,0} δ(X−τ) ∫_τ^∞ dt p_i(t)    (292)
We note that
∫_X^∞ dt [ P_i(t,0)+P_i(t,1) ] = ∫_X^∞ dt [ θ(τ−t) p_i(t) + δ(t−τ) ∫_τ^∞ ds p_i(s) ] = ∫_X^∞ dt p_i(t)    (293)
Via (18) we can now calculate the individual cause-specific hazard rates:
π_0^i(X) = δ(X−τ) ∫_τ^∞ dt p_i(t) / ∫_X^∞ dt [ P_i(t,0)+P_i(t,1) ] = δ(X−τ) ∫_τ^∞ dt p_i(t) / ∫_X^∞ dt p_i(t) = δ(X−τ)    (294)
π_1^i(X) = θ(τ−X) p_i(X) / ∫_X^∞ dt [ P_i(t,0)+P_i(t,1) ] = θ(τ−X) p_i(X) / ∫_X^∞ dt p_i(t)    (295)
We note that all individuals in the cohort have exponential event time distributions for the
primary risk, so prior to the trial termination time τ all should have time-independent cause-
specific hazard rates for the primary risk. This indeed follows from the above formula:
X > τ, all i:   π_1^i(X) = 0    (296)
X < τ, i ≤ N/2:   π_1^i(X) = a e^{−aX} / ∫_X^∞ dt a e^{−at} = a    (297)
X < τ, i > N/2:   π_1^i(X) = a(1+Z_i) e^{−a(1+Z_i)X} / ∫_X^∞ dt a(1+Z_i) e^{−a(1+Z_i)t} = a(1+Z_i)    (298)
The covariate is positively associated with the risk, since a value Zi = 1 shortens the average
time to the primary event by a factor two for half of our cohort. We draw all Zi randomly and
independently from P(Z) = ½δ_{Z,1} + ½δ_{Z,0}. Note that we can write all individual hazard rates
above in the Cox form, with the same base hazard rate but with different regression parameters
for the two sub-groups:
i ≤ N/2:   π_1^i(t) = λ_0(t) e^{βZ_i},   λ_0(t) = aθ(τ−t),   β = 0    (299)
i > N/2:   π_1^i(t) = λ_0(t) e^{βZ_i},   λ_0(t) = aθ(τ−t),   β = ln 2 ≈ 0.693    (300)
From this, in turn, we can calculate the true sub-cohort primary risk hazard rate, via (180)
(in which we abbreviate βi = 0 if i ≤ N/2 and βi = ln(2) for i > N/2):
π_1(t|Z) = ∑_{i∈Ω_Z} π_1^i(t) e^{−∫_0^t ds [δ(τ−s)+aθ(τ−s)e^{β_iZ_i}]} / ∑_{i∈Ω_Z} e^{−∫_0^t ds [δ(τ−s)+aθ(τ−s)e^{β_iZ_i}]}
= ∑_{i∈Ω_Z} π_1^i(t) e^{−a e^{β_iZ_i} ∫_0^t ds θ(τ−s)} / ∑_{i∈Ω_Z} e^{−a e^{β_iZ_i} ∫_0^t ds θ(τ−s)}    (301)
We clearly always have π1(t|Z) = 0 for t > τ . For t < τ we find
π_1(t<τ|0) = ∑_i (1−Z_i) π_1^i(t) e^{−at} / ∑_i (1−Z_i) e^{−at} = a ∑_i(1−Z_i) / ∑_i(1−Z_i) = a    (302)
π_1(t<τ|1) = [ ∑_{i≤N/2} Z_i π_1^i(t) e^{−at} + ∑_{i>N/2} Z_i π_1^i(t) e^{−2at} ] / [ ∑_{i≤N/2} Z_i e^{−at} + ∑_{i>N/2} Z_i e^{−2at} ]
= a [ ∑_{i≤N/2} Z_i + 2 ∑_{i>N/2} Z_i e^{−at} ] / [ ∑_{i≤N/2} Z_i + ∑_{i>N/2} Z_i e^{−at} ]    (303)
For N → ∞ this would become
π_1(t<τ|0) = a,    π_1(t<τ|1) = a (1+2e^{−at})/(1+e^{−at})    (304)
So in combination we can write
π_1(t|Z) = a θ(τ−t) e^{β(t)Z},    β(t) = ln[ (1+2e^{−at})/(1+e^{−at}) ]    (305)
This shows that the effect identified earlier, of heterogeneous cohorts giving decaying hazard
rates even if all individual hazard rates are strictly time-independent, also impacts on regression
parameters. Here we find that the sub-cohort primary hazard rate is nearly of the Cox form,
but with a time-dependent regression parameter, which is not allowed in Cox regression.
According to (246) we want to maximise in Cox regression the following quantity over β (apart
from an irrelevant constant):
L(β|D) = (β/N) ∑_{i=1}^N δ_{1,∆_i} Z_i − (1/N) ∑_{i=1}^N δ_{1,∆_i} log[ (1/N) ∑_{j=1}^N e^{βZ_j} θ(X_j−X_i) ]
= (β/N) ∑_{i=1}^N θ(τ−t_i) Z_i − (1/N) ∑_{i=1}^N θ(τ−t_i) log[ (1/N) ∑_{j=1}^N e^{βZ_j} ( θ(τ−t_j)θ(t_j−t_i) + θ(t_j−τ)θ(τ−t_i) ) ]    (306)
We now inspect the case where our cohort is very large, so that we may send N → ∞. By the
law of large numbers we then obtain
lim_{N→∞} L(β|D) = β 〈 ½Z ∫_0^τ dt [ a e^{−at} + a(1+Z) e^{−a(1+Z)t} ] 〉_Z
− 〈 ½ ∫_0^τ dt [ a e^{−at} + a(1+Z) e^{−a(1+Z)t} ] log〈 ½ e^{βZ′} ∫_t^∞ dt′ ( a e^{−at′} + a(1+Z′) e^{−a(1+Z′)t′} ) 〉_{Z′} 〉_Z
= β 〈 ½Z ( 2 − e^{−aτ} − e^{−a(1+Z)τ} ) 〉_Z − 〈 ½ ∫_0^τ dt [ a e^{−at} + a(1+Z) e^{−a(1+Z)t} ] log〈 ½ e^{βZ′} ( e^{−at} + e^{−a(1+Z′)t} ) 〉_{Z′} 〉_Z
= ¼ β ( 2 − e^{−aτ} − e^{−2aτ} ) − 〈 ½ ∫_0^τ dt [ a e^{−at} + a(1+Z) e^{−a(1+Z)t} ] log[ ½ e^{−at} ] 〉_Z
− 〈 ½ ∫_0^τ dt [ a e^{−at} + a(1+Z) e^{−a(1+Z)t} ] log[ 1 + ½ e^{β}(1+e^{−at}) ] 〉_Z    (307)
We need to calculate the maximum with respect to β of this expression. Differentiation with
respect to β gives
4 (d/dβ) lim_{N→∞} L(β|D) = 2−e^{−aτ}−e^{−2aτ} − ∫_0^τ dt ( 3a e^{−at} + 2a e^{−2at} ) e^{β}(1+e^{−at}) / [ 2 + e^{β}(1+e^{−at}) ]
= 2−e^{−aτ}−e^{−2aτ} − ∫_0^τ dt ( 3a e^{−at} + 2a e^{−2at} ) + 2a ∫_0^τ dt ( 3e^{−at} + 2e^{−2at} ) / [ 2 + e^{β}(1+e^{−at}) ]
= 2 ∫_0^{aτ} ds ( 3e^{−s} + 2e^{−2s} ) / [ 2 + e^{β}(1+e^{−s}) ] − 2(1−e^{−aτ})    (308)
So β̂ is the unique solution of
∫_0^{aτ} ds ( 3e^{−s} + 2e^{−2s} ) / [ 2 + e^{β̂}(1+e^{−s}) ] = 1 − e^{−aτ}    (309)
Figure 8. The most probable parameter value β̂ in Cox regression, for the example data
(290,291), in the limit N → ∞. All individuals have strictly time-independent hazard rates,
but due to cohort filtering the cohort-level primary hazard rate becomes time dependent.
In Cox regression this is not allowed, and as a consequence one finds that the most probable
parameter β̂ becomes dependent on (and decays with) the duration of the trial.
which via the transformation x = e^{−s} can be rewritten as
1−e^{−aτ} = ∫_{e^{−aτ}}^1 dx (3+2x) / [ 2 + e^{β̂}(1+x) ] = 2e^{−β̂} ∫_{e^{−aτ}}^1 dx ( 3/2 + x ) / ( 2e^{−β̂}+1+x )
= 2e^{−β̂}(1−e^{−aτ}) + 2e^{−β̂}( ½ − 2e^{−β̂} ) ∫_{e^{−aτ}}^1 dx 1/( 2e^{−β̂}+1+x )
= 2e^{−β̂}(1−e^{−aτ}) + e^{−β̂}(1−4e^{−β̂}) log[ (2e^{−β̂}+2) / (2e^{−β̂}+1+e^{−aτ}) ]    (310)
Hence we get
(1−e^{−aτ})(1−2e^{−β̂}) = e^{−β̂}(1−4e^{−β̂}) log[ (2e^{−β̂}+2) / (2e^{−β̂}+1+e^{−aτ}) ]    (311)
For small τ we find
aτ(1−2e^{−β̂}) + O((aτ)²) = −e^{−β̂}(1−4e^{−β̂}) log( 1 − aτ/(2e^{−β̂}+2) )
(1−2e^{−β̂}) = e^{−β̂}(1−4e^{−β̂}) / (2e^{−β̂}+2) + O(aτ)
2(1−2e^{−β̂})(e^{−β̂}+1) = e^{−β̂}(1−4e^{−β̂}) + O(aτ)
2 = 3e^{−β̂} + O(aτ),   so   β̂ = ln(3/2) + O(aτ) ≈ 0.405 + O(aτ)    (312)
Numerical solution of β̂ from equation (311) for different values of the trial cut-off time τ results
in the curve of figure 8. Here the cohort ‘filtering’ results by definition in a time-independent
regression parameter β̂ (since this is what Cox regression allows for), but the value found for β̂
decays with increasing trial durations, in spite of the fact that at the level of individuals there
is not a single time-dependent risk parameter.
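The curve of figure 8 can be reproduced by solving (311) numerically for a range of trial durations; the root-finder bracket below is an assumption based on the range of β̂ values seen in figure 8.

```python
from math import exp, log
from scipy.optimize import brentq

# Solve (311) for beta_hat at given a*tau; beta_hat -> ln(3/2) ~ 0.405
# as a*tau -> 0, and decays with longer trials, as in figure 8.
def beta_hat(atau):
    def f(b):
        e, c = exp(-atau), exp(-b)
        return (1 - e) * (1 - 2*c) - c * (1 - 4*c) * log((2*c + 2) / (2*c + 1 + e))
    return brentq(f, 1e-6, 2.0)

for atau in [0.01, 0.5, 1.0, 2.0, 5.0]:
    print(atau, beta_hat(atau))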
Example 3: parametrisation of the base hazard rate
The derivation of Cox’s regression equations for the parameters β was based on first maximising
(243) over the base hazard rates, giving (245), which was then substituted into (243) and
subsequently maximised over β. However, it is clear that the sum over δ-peaks (245) is a
maximum-likelihood estimator which is only realistic for infinitely large cohorts; we expect the
true base hazard rate to be smooth in time. We have already seen earlier that in the Bayesian
formalism one could deal with this via a suitable smoothness prior (although this leads to
complicated equations). An alternative route for implementing smoothness of the base hazard
rate within the Cox formalism is to insert into (243) a simple parametrised form λ_0(t|θ):
logP(β,θ|D) = ∑_{i=1}^N δ_{1,∆_i} log λ_0(X_i|θ) + β·∑_{i=1}^N δ_{1,∆_i} Z_i − ∑_{i=1}^N e^{β·Z_i} ∫_0^{X_i} dt λ_0(t|θ)    (313)
We now maximise this expression over (β,θ) instead of (β,λ_0). One tends to look for
parametrisations for which the integral in the last term can be done analytically. For instance,
a popular parametrisation is
λ_0(t|y,τ) = (y/τ)(t/τ)^{y−1},   τ > 0, y > 0    (314)
which gives us the following quantity, to be maximised over (τ,y,β):
L(β,y,τ|D) = log(y/τ^y) ∑_i δ_{1,∆_i} + (y−1) ∑_i δ_{1,∆_i} log X_i + β·∑_i δ_{1,∆_i} Z_i − ∑_i e^{β·Z_i} (X_i/τ)^y    (315)
• We first maximise (315) over τ via
∂L(β,y,τ|D)/∂τ = (y/τ) [ ∑_{i=1}^N e^{β·Z_i}(X_i/τ)^y − ∑_{i=1}^N δ_{1,∆_i} ]    (316)
giving
τ^y = ∑_{i=1}^N e^{β·Z_i}(X_i)^y / ∑_{i=1}^N δ_{1,∆_i}    (317)
Upon substituting this optimal value for the time-scale parameter τ into (315), we are left
with the following function to be maximised over (y,β):
L(β,y|D) = log[ y ∑_i δ_{1,∆_i} / ∑_i e^{β·Z_i}(X_i)^y ] ( ∑_i δ_{1,∆_i} ) + (y−1) ∑_i δ_{1,∆_i} log X_i
+ β·∑_i δ_{1,∆_i} Z_i − ∑_i δ_{1,∆_i}    (318)
• We next maximise the expression L(β,y|D) over (y,β), via
∂L(β,y|D)/∂y = ( ∑_i δ_{1,∆_i} ) ∂/∂y log[ y ∑_i δ_{1,∆_i} / ∑_i e^{β·Z_i}(X_i)^y ] + ∑_i δ_{1,∆_i} log X_i
= ( ∑_i δ_{1,∆_i} ) [ 1/y − ∑_i e^{β·Z_i} log(X_i)(X_i)^y / ∑_i e^{β·Z_i}(X_i)^y ] + ∑_i δ_{1,∆_i} log X_i    (319)
and
∂L(β,y|D)/∂β_µ = ∑_i δ_{1,∆_i} Z_µ^i − ( ∑_i δ_{1,∆_i} ) ∂/∂β_µ log( ∑_i e^{β·Z_i}(X_i)^y )
= ∑_i δ_{1,∆_i} Z_µ^i − ( ∑_i δ_{1,∆_i} ) ∑_i e^{β·Z_i} Z_µ^i (X_i)^y / ∑_i e^{β·Z_i}(X_i)^y    (320)
Thus we find that the optimal y and β are to be solved simultaneously from the following two
equations:
1/y = ∑_i e^{β·Z_i} log(X_i)(X_i)^y / ∑_i e^{β·Z_i}(X_i)^y − ∑_i δ_{1,∆_i} log X_i / ∑_i δ_{1,∆_i}    (321)
∑_i δ_{1,∆_i} Z_µ^i / ∑_i δ_{1,∆_i} = ∑_i e^{β·Z_i} Z_µ^i (X_i)^y / ∑_i e^{β·Z_i}(X_i)^y    (322)
In contrast, the standard Cox equations for β are (247), which we can also write as
∑_i δ_{1,∆_i} Z_µ^i / ∑_i δ_{1,∆_i} = ∑_i e^{β·Z_i} Z_µ^i ∫_0^{X_i} dt λ_0(t|β) / ∑_i δ_{1,∆_i}    (323)
with the base hazard rate (245). It will be clear that the most probable values for β
corresponding to the choice of a parametrised base hazard rate will generally be different from
the standard Cox values that follow from (247). It follows from the above that they will only
be identical if
( ∑_i e^{β·Z_i}(X_i)^y ) ( ∑_i e^{β·Z_i} Z_µ^i ∫_0^{X_i} dt λ_0(t|β) ) = ( ∑_i δ_{1,∆_i} ) ( ∑_i e^{β·Z_i} Z_µ^i (X_i)^y )    (324)
In the simplest case of just one risk (i.e. ∆i = 1 for all i), for instance, this condition and the
equation for y reduce after some simple rewriting to
(1/N) ∑_i e^{β̂·Z_i} Z_µ^i [ (1/N) ∑_k θ(X_i−X_k) / ∑_j e^{β̂·Z_j}θ(X_j−X_k) − (X_i)^y / ∑_j e^{β̂·Z_j}(X_j)^y ] = 0    (325)
1/y = ∑_i e^{β̂·Z_i} log(X_i)(X_i)^y / ∑_i e^{β̂·Z_i}(X_i)^y − (1/N) ∑_i log X_i    (326)
This set of equations simplifies when written in terms of the new variables Y_i = X_i^y:
(1/N) ∑_i Z_µ^i [ (1/N) ∑_k e^{β̂·Z_i} θ(Y_i−Y_k) / ∑_j e^{β̂·Z_j}θ(Y_j−Y_k) − e^{β̂·Z_i} Y_i / ∑_j e^{β̂·Z_j} Y_j ] = 0    (327)
1 = ∑_i e^{β̂·Z_i} log(Y_i) Y_i / ∑_i e^{β̂·Z_i} Y_i − (1/N) ∑_i log Y_i    (328)
This will only be satisfied in very special cases. Generally, therefore, one should not use
parametrised base hazard rates in conjunction with the conventional formulae for the Cox
regression parameters.
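As a numerical illustration of this warning, one can maximise (315) directly over (β, y, τ) with a generic optimiser (using log-parametrisations to keep y, τ > 0) and compare the resulting β with a partial-likelihood fit of (247); the Weibull toy data below are our own construction.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, p = 300, 2
Z = rng.normal(size=(N, p))
# Weibull proportional-hazards data: S(t|Z) = exp(-e^{beta.Z} t^y), y = 1.5
X = rng.weibull(1.5, N) / np.exp((Z @ np.array([0.7, -0.3])) / 1.5)
d1 = np.ones(N)   # one risk: delta_{1,Delta_i} = 1 for all i

def neg_L(params):                 # params = (beta_1..beta_p, log y, log tau)
    beta, y, tau = params[:p], np.exp(params[p]), np.exp(params[p + 1])
    val = np.log(y / tau**y) * d1.sum() + (y - 1) * np.sum(np.log(X))  # (315)
    val += (Z @ beta).sum() - np.sum(np.exp(Z @ beta) * (X / tau)**y)
    return -val

res = minimize(neg_L, np.zeros(p + 2))
beta_w, y_w = res.x[:p], np.exp(res.x[p])
print(beta_w, y_w)   # generally differs from the partial-likelihood beta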
9. Overfitting and p-values
9.1. What is overfitting?
Extracting covariate-to-outcome patterns from examples. In survival analysis we are given examples
of covariates Zi and corresponding ‘outcome variables’ (Xi,∆i). We assumed that the outcomes are
not generated purely randomly, but drawn from a distribution P (X,∆|Z) that actually depends on
Z. We have no direct information on P (X,∆|Z), but have to infer it from the N input-outcome
combinations (Zi, Xi,∆i) in our data set D. This is the objective of regression or classification.
One uses the term regression when the outcome is a real variable, classification when it is discrete;
in survival analysis we generally have both.
Model complexity. The danger in extracting patterns from such data is that we may extract
regularities that describe perfectly the detailed realisation of the specific data in D, but not the larger
population from which the data set D was drawn. If the individuals in D were selected randomly
from a larger population, there will be sampling noise in this selection (exactly which individuals
were picked?); we are not interested in any information that pertains only to the individuals in D
but that is not representative of the population. Let us be more specific. Suppose we were to take
the estimator P(X,∆|Z) in (191) as our description of the true event statistics in the population,
i.e.
P(X,∆|Z) = ∑_i δ(X−X_i) δ_{∆,∆_i} δ(Z−Z_i) / ∑_i δ(Z−Z_i)    (329)
This would suggest that we truly believe that no events can ever occur at times other than those
observed in the data set D. This would be nonsense. At best we hope that for sufficiently large data
sets this expression gives a reasonable approximation of the survival statistics in a distributional
sense, but the precise locations of the individual δ-peaks reflect randomness specific to D that most
probably does not describe actual biochemistry in the population.
Overfitting happens when we use a model that is too complex for the amount of available
data, and that therefore describes not just generic regularities but also the ‘noise’ in the data at
hand. It is clear that when one uses a model with more adjustable parameters than the number of
available data points, one is simply fitting a curve through data, and one cannot expect this curve
to generalise to the wider population and make reproducible predictions.
The example in figure 7 (in section 7.6) illustrates the impact of complexity reduction. The maximum
likelihood estimator π̂(t|Z) on the left is a finite sum of δ-peaks, whose precise locations we do not
expect to be reproducible. We allowed π(t|Z) to take any shape, however irregular or discontinuous.
The Bayesian estimator on the right in the figure, in contrast, is constrained by the prior to describe
the data with a smooth function π(t|Z), and consequently focuses on the density of δ-peaks in each
time range, which we would hope to be a more realistic description of the population.
9.2. Overfitting in binary classification
Binary outcome prediction from binary expression data. Let us make a small detour and simplify
our problem to its core. We assume that (i) we have just one risk (so the label ∆i is obsolete), (ii) all
our covariates are binary, i.e. Z_µ^i ∈ {0,1} for all µ and all patients i (for instance we have covariates
giving gene expression levels, rounded off to the values Z = 1 ‘expressed’ or Z = 0 ‘non-expressed’).
Rather than predicting the event time X via P (X|Z) we try to classify our patients i simply into
good outcome and poor outcome ones, via a time cut-off τ that separates the classes: σi = 1 if
Xi > τ , and σi = 0 if Xi < τ . For example, if we have 4 patients and in each patient we measure
60 expression levels, our data could look like this:
outcome expression pattern
i = 1 : σ1 = 1 Z1 = (100101001010010101010010001010111001001001001001001000011111)
i = 2 : σ2 = 1 Z2 = (010001000010101001010101010010101000111100101001001010101000)
i = 3 : σ3 = 0 Z3 = (001010001110101101100100100111001110010100101010101000101010)
i = 4 : σ4 = 0 Z4 = (101011001010110010100100111100100101100111010111010001010010)
We would then try to find patterns in the Zi that enable us to predict the outcome σi; if our
covariates are gene expresison levels, such patterns are often called ‘gene signatures’. Here a
candidate signature gene would be a gene that takes the same values for all patients i in the
σi = 1 group, and opposite values for all patients in the σi = 0 group. Inspection reveals that there
are seven candidate signature genes in the above data, coloured in red (and marked with arrows):
outcome expression pattern ↓↓ ↓ ↓ ↓ ↓ ↓
i = 1 : σ1 = 1 Z1 = (100101001010010101010010001010111001001001001001001000011111)
i = 2 : σ2 = 1 Z2 = (010001000010101001010101010010101000111100101001001010101000)
i = 3 : σ3 = 0 Z3 = (001010001110101101100100100111001110010100101010101000101010)
i = 4 : σ4 = 0 Z4 = (101011001010110010100100111100100101100111010111010001010010)
However, the above data were in fact generated purely randomly, so these ‘patterns’ are not real,
but just random accidents. To illustrate this, let us randomize the outcome labels. This will again
reveal a set of alternative candidate ‘signature genes’ that appear to predict outcome:
outcome expression pattern ↓ ↓ ↓
i = 1 : σ1 = 1 Z1 = (100101001010010101010010001010111001001001001001001000011111)
i = 2 : σ2 = 0 Z2 = (010001000010101001010101010010101000111100101001001010101000)
i = 3 : σ3 = 1 Z3 = (001010001110101101100100100111001110010100101010101000101010)
i = 4 : σ4 = 0 Z4 = (101011001010110010100100111100100101100111010111010001010010)
Clearly, the larger the number p of covariates, the more frequent will be the accidental ‘signature
genes’. The numbers p = 60 and N = 4 chosen in the above example may be too small to be
realistic, but their ratio p/N = 15 is in fact similar to that in genetic data bases (where we may
well have values like N ≈ 1000 and p ≈ 15000).
Multiple-testing correction to p-values. Let us quantify all this in more detail. If we generate the
values Z_µ = (Z_µ^1,...,Z_µ^N) for each gene randomly, with Prob(Z=1) = Prob(Z=0) = ½ (as was
done for the above data), then for a given assignment of outcome variables (σ_1,...,σ_N) we find
Prob(µ is signature gene) = Prob[ Z_µ^i = σ_i for all i ] + Prob[ Z_µ^i = 1−σ_i for all i ] = 2·(½)^N = (½)^{N−1}    (330)
If we have p such random genes and N patients, we will find on average p(½)^{N−1} accidental signature
genes (giving an average of 60·(½)³ = 7.5 for our above example), and the probability to see k
accidental signature genes will be
P(k signature genes) = (p choose k) (½)^{k(N−1)} ( 1 − (½)^{N−1} )^{p−k}    (331)
In particular, the probability to find at least one candidate signature gene is
∑_{k=1}^p P(k signature genes) = 1 − P(0 signature genes) = 1 − ( 1 − (½)^{N−1} )^p    (332)
If we look for candidate signature genes in real data, and we find one or more of these, we should
calculate the p-value (i.e. the probability that our observation is just the result of chance) using as
our ‘null hypothesis’ the above situation of randomly generated expression values. This gives:
p-value of observed signature gene = 1 − ( 1 − (½)^{N−1} )^p    (333)
We cannot just use the naive probability (½)^{N−1} of an individual gene being a signature as our
p-value (unless we had indeed selected this particular gene beforehand and investigated only this
chosen gene, rather than looking for interesting genes in a list of p candidates). We must take into
account the number of genes we inspect for candidacy. This is called ‘multiple-testing correction’.
For the data above the probability for a given gene to be a signature gene would be
(½)^{N−1} = 0.125 (nearly a signal), but with multiple testing the burden of evidence is much higher,
and we find a multiple-testing p-value of 1 − (7/8)^{60} ≈ 0.9997 (no signal at all). If in our data set the
balance between expressed and non-expressed genes differs from 50%, or the balance between good
and poor outcome patients differs from 50%, then formula (333) is modified in a trivial way.
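The correction (333) is trivial to evaluate; for the toy example above:

```python
# Multiple-testing corrected p-value (333): p = 60 random binary genes,
# N = 4 patients, as in the example in the text.
N, p = 4, 60
single = 0.5 ** (N - 1)              # chance that one given gene is a 'signature'
corrected = 1 - (1 - single) ** p    # chance of >= 1 accidental signature gene
print(single, corrected)             # 0.125 and ~0.9997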
9.3. Overfitting in Cox regression
Using Cox regression for binary classification. Although Cox regression was not designed for binary
classification, it can be used for classifying patients into low risk (σ_i = 1 if X_i > τ) versus high
risk (σ_i = 0 if X_i < τ) classes. Once the Cox regression parameters β̂ have been calculated from
our data D = {(Z_1,X_1,∆_1),...,(Z_N,X_N,∆_N)}, we can use (250) to predict new classifications
σ ∈ {0,1} on the basis of covariate information Z alone. To emphasise the similarity with the
above example of gene signatures, we first rewrite S_Cox(τ|Z) as
S_Cox(τ|Z) = Φ(w·Z)    (334)
with w = −β̂, and where
Φ(u) = exp( −e^{−u} Λ_0(τ|β̂) )    (335)
with the integrated base hazard rate Λ_0(τ|β̂) as defined in (249). The function Φ(u) is monotonically
increasing, with Φ(−∞) = 0 (provided there is at least one i with X_i < τ and ∆_i = 1) and Φ(∞) = 1.
This definition allows us to write the class probability for an individual with covariates Z within
the Cox formalism as
P(σ|Z) = δ_{σ,1} Φ(w·Z) + δ_{σ,0} [ 1 − Φ(w·Z) ]    (336)
Figure 9. Classification performance Q_T on training sets and classification performance
Q_V on validation sets, in Cox regression used for binary classification, with breast cancer
tumour biomarkers as covariates. Classes were defined by cancer relapse before 8 years
(σ_i = 0) or after 8 years (σ_i = 1) following diagnosis of the primary tumour. Here N = 70
(with randomly selected training and validation sets of equal sizes N_V = N_T = 35), and
p = 1,...,18. The curves were generated iteratively, starting from Cox regression and
performance measurement for all p = 18 available covariates, followed by iterative removal of
the least important covariate (the one with the smallest average value of |β̂_µ|) and repetition
of the Cox regression and performance measurement, down to p = 1 (which leaves only the
most informative covariate). Classes are assigned according to (337), and the measures Q_V
and Q_T are defined in (339). Error bars indicate standard deviations over all divisions into
training and validation sets.
Expression (336) is of a form commonly used by all machine learning binary classification algorithms
that are based on linear separation. We can then allocate a class σ(Z) to each individual with
covariates Z, defined as the most probable class according to (336):
σ(Z) = 1 if P(1|Z) > ½,   σ(Z) = 0 if P(0|Z) > ½    (337)
Evidence of overfitting in classification performance on training and validation sets. We would like to know
whether overfitting also affects Cox regression. To answer this we need to identify a measurable
marker of overfitting. Overfitting in pattern detection algorithms and regression methods means
that the final model (i.e. a quantitative pattern linking covariates to outcome) predicts outcome
significantly better on the specific data set used in the analysis than on data from the wider population
from which the data were drawn, i.e. patterns which claim to predict outcome are insufficiently
reproducible beyond the data set from which they were extracted. The only unambiguous test for
overfitting in pattern detection algorithms and methods therefore requires having two data sets:
one set from which to extract the patterns (if any), the ‘training set’, and one set on which to test
these patterns, called the ‘validation set’. Starting from our full cohort {1,...,N} we would thus
separate this cohort in two, {1,...,N} = Ω_T ∪ Ω_V (the simplest division would be into two random
subsets of equal size N_T = N_V = ½N), where only data from Ω_T are used for finding classification
parameters. The fractions correctly classified for a given division are defined as
Q_T(Ω_T) = (1/N_T) ∑_{i∈Ω_T} δ_{σ_i,σ(Z_i)},   Q_V(Ω_V) = (1/N_V) ∑_{i∈Ω_V} δ_{σ_i,σ(Z_i)}    (338)
and the average performance measures are defined as the averages of the above values over all
possible divisions of {1,...,N} into equally large training and validation sets:
Q_T = 〈Q_T(Ω_T)〉_{Ω_T},   Q_V = 〈Q_V(Ω_V)〉_{Ω_V}    (339)
An example calculation of the quantities in (339) as a function of (iteratively reduced) values of p
(the number of covariates included) is shown in figure 9. Here the problem was to predict whether
breast cancer relapse occurs after 8 years (σ_i = 1) or before 8 years (σ_i = 0). This figure reveals
obvious overfitting as soon as p > 6 (where N_T/p < 6). For p ≤ 6 both training and validation
performance increase with increased model complexity, but for p > 7 we see the classic fingerprint
of overfitting: the performance on the training set continues to increase, but the performance
on the validation set deteriorates consistently (here the regression model is trying to predict the
‘noise’ in the training set, as opposed to a reproducible pattern). One finds very similar figures if
in this same N = 70 data set one uses clinical covariates (e.g. tumour size, tumour grade, lymph
node involvement, etc); also there overfitting sets in after inclusion of just a few covariates.
It is not possible to say beforehand which is the optimal ratio NT /p, i.e. the ratio where QT and
QV start to separate, in a given classification problem. This ratio must depend on various details of
the data, e.g. correlations amongst covariates, any imbalance between the sizes of the σ = 1 versus
σ = 0 classes, etc. For large training sets Ω_T with randomly generated Z_i ∈ {0,1}^p (with equal
probabilities) and randomly generated outcome variables σ_i ∈ {0,1} (with equal probabilities) an
old exact result by Cover on binary classifiers tells us that we need at least N_T/p > 2.
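The protocol (338)–(339) is easy to simulate. In the sketch below we replace the full Cox-based classifier (337) by a least-squares linear score, purely to keep the code short — an assumption on our part, but the overfitting fingerprint Q_T > Q_V appears in the same way. With pure-noise covariates and outcomes, Q_V should hover near ½ while Q_T is inflated by the p + 1 fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, n_splits = 70, 18, 200
Z = rng.normal(size=(N, p))
sigma = rng.integers(0, 2, N)     # pure-noise outcomes: nothing to learn

QT = QV = 0.0
for _ in range(n_splits):
    perm = rng.permutation(N)
    train, valid = perm[:N // 2], perm[N // 2:]
    A = np.c_[Z[train], np.ones(len(train))]          # intercept column
    w = np.linalg.lstsq(A, sigma[train], rcond=None)[0]
    pred = lambda idx: (np.c_[Z[idx], np.ones(len(idx))] @ w > 0.5).astype(int)
    QT += np.mean(pred(train) == sigma[train]) / n_splits   # (338), training half
    QV += np.mean(pred(valid) == sigma[valid]) / n_splits   # (338), validation half

print(QT, QV)   # e.g. QT well above 0.5, QV close to 0.5: overfitting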
Evidence of overfitting in regression parameters. We have seen that Cox regression suffers from
overfitting if the number p of covariates is too large relative to the size N of the data set. One also
observes this when calculating β for small data sets. The simplest case is to consider p = N = 2,
with just one risk, where we know we are in the overfitting regime. For reasons that will become clear
we add a weak Gaussian prior P(β) ∼ exp(−½αβ²) to the Bayesian description, with 0 < α ≪ 1.
Here we have D = {(Z_1^1,Z_2^1,X_1), (Z_1^2,Z_2^2,X_2)} and find (246) reducing to
L(β_1,β_2|D) = L(0,0|D) + β_1(Z_1^1+Z_1^2) + β_2(Z_2^1+Z_2^2) − ½α(β_1²+β_2²)
− log[ ( ½e^{β_1Z_1^1+β_2Z_2^1} + e^{β_1Z_1^2+β_2Z_2^2}θ(X_2−X_1) ) / ( ½ + θ(X_2−X_1) ) ]
− log[ ( e^{β_1Z_1^1+β_2Z_2^1}θ(X_1−X_2) + ½e^{β_1Z_1^2+β_2Z_2^2} ) / ( θ(X_1−X_2) + ½ ) ]    (340)
using θ(0) = ½. Without loss of generality we may take X_1 < X_2, giving
L(β_1,β_2|D) = L(0,0|D) + β_1Z_1^1 + β_2Z_2^1 − ½α(β_1²+β_2²) − log( ⅓e^{β_1Z_1^1+β_2Z_2^1} + ⅔e^{β_1Z_1^2+β_2Z_2^2} )    (341)
To find the maximum we differentiate, and find the equations from which to solve (β1, β2) and the
curvature matrix at the extremal point:
∂L(β_1,β_2|D)/∂β_1 = Z_1^1 − αβ_1 − [ Z_1^1 e^{β_1Z_1^1+β_2Z_2^1} + 2Z_1^2 e^{β_1Z_1^2+β_2Z_2^2} ] / [ e^{β_1Z_1^1+β_2Z_2^1} + 2e^{β_1Z_1^2+β_2Z_2^2} ]    (342)
= 2(Z_1^1−Z_1^2) e^{β_1Z_1^2+β_2Z_2^2} / [ e^{β_1Z_1^1+β_2Z_2^1} + 2e^{β_1Z_1^2+β_2Z_2^2} ] − αβ_1
= 2(Z_1^1−Z_1^2) / [ e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} + 2 ] − αβ_1    (343)
∂L(β_1,β_2|D)/∂β_2 = Z_2^1 − αβ_2 − [ Z_2^1 e^{β_1Z_1^1+β_2Z_2^1} + 2Z_2^2 e^{β_1Z_1^2+β_2Z_2^2} ] / [ e^{β_1Z_1^1+β_2Z_2^1} + 2e^{β_1Z_1^2+β_2Z_2^2} ]    (344)
= 2(Z_2^1−Z_2^2) e^{β_1Z_1^2+β_2Z_2^2} / [ e^{β_1Z_1^1+β_2Z_2^1} + 2e^{β_1Z_1^2+β_2Z_2^2} ] − αβ_2
= 2(Z_2^1−Z_2^2) / [ e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} + 2 ] − αβ_2    (345)
and
∂²L(β_1,β_2|D)/∂β_1² = −2(Z_1^1−Z_1^2)² e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} / [ e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} + 2 ]² − α    (346)
∂²L(β_1,β_2|D)/∂β_1∂β_2 = −2(Z_1^1−Z_1^2)(Z_2^1−Z_2^2) e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} / [ e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} + 2 ]²    (347)
∂²L(β_1,β_2|D)/∂β_2² = −2(Z_2^1−Z_2^2)² e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} / [ e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} + 2 ]² − α    (348)
If we write the first derivatives in terms of the quantity Ξ = ½e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} we find
∂L/∂β_1 = (Z_1^1−Z_1^2)/(Ξ+1) − αβ_1,   ∂L/∂β_2 = (Z_2^1−Z_2^2)/(Ξ+1) − αβ_2    (349)
The maximum is achieved for
β_1 = (Z_1^1−Z_1^2)/(α(Ξ+1)),   β_2 = (Z_2^1−Z_2^2)/(α(Ξ+1))    (350)
with Ξ(α) to be solved from the transcendental equation
(1+Ξ) log(2Ξ) = (Z^1−Z^2)²/α    (351)
Clearly Ξ(α) → ∞ for α → 0, such that limα→0 αΞ(α) = 0. Hence the regression parameters β1,2
both diverge as α→ 0, which is a clear sign of overfitting.
Although in the example above the regression parameters diverge, it turns out that the z-scores do not diverge, as a result of which we also find large p-values (260) for the hazard ratios HR_µ (equivalently: for the β_µ). The latter are given by

µ-th p-value = 1 − Erf(z_µ/√2),   z_µ = |β_µ|/σ_µ     (352)
The second derivatives of L are needed to calculate p-values, and must be evaluated for the null model β_1 = β_2 = 0, giving

∂²L/∂β_1²|_{β=0} = −(2/9)(Z^1_1−Z^2_1)² − α,   ∂²L/∂β_1∂β_2|_{β=0} = −(2/9)(Z^1_1−Z^2_1)(Z^1_2−Z^2_2)     (353)

∂²L/∂β_2²|_{β=0} = −(2/9)(Z^1_2−Z^2_2)² − α     (354)

For the matrix A(0) this implies

A_{µν}(0) = −∂²L/∂β_µ∂β_ν|_{β=0} = αδ_{µν} + Z_µ Z_ν,   with Z_µ = (√2/3)(Z^1_µ−Z^2_µ)     (355)
The normalised eigenvectors of A(0) are e_1 = (Z_1, Z_2)/√(Z_1²+Z_2²) and e_2 = (Z_2, −Z_1)/√(Z_1²+Z_2²), with respective eigenvalues:

λ_1 = e_1·A(0)e_1 = α + Z_1² + Z_2²     (356)
λ_2 = e_2·A(0)e_2 = α     (357)

This allows us to write down the entries of A^{-1}(0) in explicit form:

A^{-1}(0)_{µν} = (1/λ_1)(e_1)_µ(e_1)_ν + (1/λ_2)(e_2)_µ(e_2)_ν     (358)
In particular we find the two variances appearing in our formulae for p-values:

σ_1² = A^{-1}(0)_{11} = (e_1)_1²/λ_1 + (e_2)_1²/λ_2 = Z_1²/[λ_1(Z_1²+Z_2²)] + Z_2²/[λ_2(Z_1²+Z_2²)]
     = Z_1²/[(α + Z_1² + Z_2²)(Z_1²+Z_2²)] + Z_2²/[α(Z_1²+Z_2²)]     (359)

σ_2² = A^{-1}(0)_{22} = (e_1)_2²/λ_1 + (e_2)_2²/λ_2 = Z_2²/[λ_1(Z_1²+Z_2²)] + Z_1²/[λ_2(Z_1²+Z_2²)]
     = Z_2²/[(α + Z_1² + Z_2²)(Z_1²+Z_2²)] + Z_1²/[α(Z_1²+Z_2²)]     (360)
It follows that for small α we will have

σ_1 = √{(1/α)(1 + Z_1²/Z_2²)^{-1} + O(1)} = (1/√α)(1 + Z_1²/Z_2²)^{-1/2} + O(√α)     (361)

σ_2 = √{(1/α)(1 + Z_2²/Z_1²)^{-1} + O(1)} = (1/√α)(1 + Z_2²/Z_1²)^{-1/2} + O(√α)     (362)
For small α we thus obtain the following z-scores for our regression parameters β_{1,2}:

z_1 = |β_1|/σ_1 = [|Z^1_1−Z^2_1| / (√α (Ξ + 1))] (1 + Z_1²/Z_2²)^{1/2} (1 + O(α))     (363)

z_2 = |β_2|/σ_2 = [|Z^1_2−Z^2_2| / (√α (Ξ + 1))] (1 + Z_2²/Z_1²)^{1/2} (1 + O(α))     (364)
The remaining question is: how does [1 + Ξ(α)]√α scale as α → 0? Let us write Q = [1 + Ξ(α)]√α, so α = Q²/[1 + Ξ(α)]². Substitution into (351) gives

Q² log(2Ξ)/(1 + Ξ) = (Z^1−Z^2)²     (365)

From this we conclude that as α → 0 (where Ξ → ∞) we must have Q → ∞. Consequently [1 + Ξ(α)]√α → ∞ and therefore

lim_{α→0} z_1 = lim_{α→0} z_2 = 0     (366)
This, in turn, implies that for α→ 0 both p-values become 1, so here the individual diverging values
of the regression parameters β1,2 are recognised to be nonsignificant.
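This scenario can be reproduced in a few lines. The sketch below (Python; the two covariate vectors are arbitrary illustrative values) solves the transcendental equation (351) for Ξ(α) by root bracketing, and then evaluates the maximum (350), the small-α z-scores (363,364) and the p-values (352). As α → 0 the regression parameters grow without bound, while the z-scores shrink and both p-values approach 1:

import numpy as np
from scipy.optimize import brentq
from scipy.special import erf

# hypothetical covariate vectors of the two individuals (p = N = 2)
Z1 = np.array([1.0, -0.5])
Z2 = np.array([-0.3, 0.2])
d = Z1 - Z2                                    # Z^1 - Z^2
c = d @ d                                      # (Z^1 - Z^2)^2 in (351)

for alpha in [1e-1, 1e-2, 1e-3, 1e-4]:
    # solve (1 + Xi) log(2 Xi) = c/alpha for Xi, eq. (351)
    Xi = brentq(lambda x: (1 + x) * np.log(2 * x) - c / alpha, 0.51, 1e12)
    beta = d / (alpha * (Xi + 1))              # maximum, eq. (350)
    # small-alpha z-scores (363),(364); note Z_mu = (sqrt(2)/3) d_mu from (355)
    z1 = abs(d[0]) * np.sqrt(1 + d[0]**2 / d[1]**2) / (np.sqrt(alpha) * (Xi + 1))
    z2 = abs(d[1]) * np.sqrt(1 + d[1]**2 / d[0]**2) / (np.sqrt(alpha) * (Xi + 1))
    pv = 1 - erf(np.array([z1, z2]) / np.sqrt(2))      # p-values, eq. (352)
    print(f"alpha={alpha:.0e}  beta=({beta[0]:.2f},{beta[1]:.2f})  "
          f"z=({z1:.3f},{z2:.3f})  p=({pv[0]:.3f},{pv[1]:.3f})")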
Generally, if we carry out Cox regression and then ask whether any of the regression parameters
are significantly away from zero, we are once more doing multiple testing, and consequently need
to use a corrected p-value for the combined test. If we write the p-value for component βµ as pµ,
then the corrected p-value is 1 minus the probability that none of the parameters are significant:
p-value = 1 − ∏_{µ=1}^p (1 − p_µ) = 1 − ∏_{µ=1}^p Erf(z_µ/√2)     (367)
9.4. p-values for Kaplan-Meier curves
9.5. Notes and examples
10. Heterogeneous cohorts and competing risks
In the picture below we have rows representing a number of selected genes, and columns representing
patients of a breast cancer cohort. Each small square gives the expression level Z^i_µ of a specific gene
µ for a specific individual i (red: upregulated; green: downregulated). The patients are clustered
on the basis of similarity in their expression profiles. This gives a dendrogram (at the top), which
reveals the existence of breast cancer sub-types (at least in terms of expression profiles):
[figure omitted: clustered gene expression heatmap of the breast cancer cohort, with the subtype dendrogram at the top]
The clustering in the previous figure was done in the space IR^p of the covariates Z^i, so this maps
covariate heterogeneity. However, the individuals are all breast cancer patients, so relative to the
overall population we are looking at a high-risk group. If this cohort is homogeneous in terms of
the covariate-to-risk relation, we expect to see some ‘colour pattern’ that is common to all patients
(in contrast to the situation of generating such a figure for all members of the population, with low
and high breast cancer risk). We see in the figure that there is no such pattern: most genes are
consistently upregulated in some cancer sub-types, but consistently downregulated in others. If we
applied Cox regression to this full cohort, with all disease sub-types included, we would not
extract significant information. In Cox regression there is just one parameter for each covariate (an
upregulated gene is assumed to be either consistently good news for any patient, or consistently
bad news, but never good news for some and bad news for others), so the regression parameter
βµ of any gene µ that is not mostly of one colour would in our formulae be averaged out by the
different patient sub-groups, and take a value close to zero (note that the expression patterns of the
basal-like cancers are nearly opposite to those of the luminal subtype A).
In the data set at hand we are lucky: the covariate heterogeneity (which is directly measurable)
allows us to produce clusters which also describe the covariate-to-risk heterogeneity. This is not
always the case: covariate clusters could be absent, while there are still clusters in the covariate-to-risk
patterns. In this section we try to develop intuition and theory for handling such scenarios.
10.1. Population-level hazard rate correlations and competing risks
Competing risks caused by hazard rate heterogeneity. We have seen that most survival analysis methods assume different risks to be uncorrelated. Risk correlations are defined by a non-factorising joint event time distribution, i.e. P(t_0,...,t_R) ≠ ∏_r P(t_r). This could happen because at the level of individuals we have P_i(t_0,...,t_R) ≠ ∏_r P_i(t_r). A more plausible mechanism is for correlated risks to be caused by correlations in risk-specific susceptibilities. We could have P_i(t_0,...,t_R) = ∏_r P_i(t_r) for each i, and still find correlated risks in the population, since generally

P(t_0,...,t_R) = (1/N) Σ_i (∏_r P_i(t_r)) ≠ ∏_r P(t_r)     (368)
Let us illustrate this. Imagine we have a cohort of N = 1000 individuals, subject to two risks. At
the level of individuals the event times are not correlated, so P_i(t_1, t_2) = P_i(t_1) P_i(t_2) for all i, and
all individual hazard rates (π_{i1}, π_{i2}) are time-independent.
• If our cohort is homogeneous, all individuals have (π_{i1}, π_{i2}) = (π_1, π_2). There cannot be risk correlations since there is no risk variability. There can never be competing risk or false protectivity problems.
  [diagram: a single group of N = 1000 individuals, all with hazard rates (π_1, π_2)]
• Imagine that our cohort is heterogeneous, e.g. there are four subgroups (A,B,C,D) with distinct cause-specific hazard rates, but the subgroups are of equal size. There is risk variability, but there are still no risk correlations.
  [diagram: N_A = 250 with (π_1↑, π_2↑), N_B = 250 with (π_1↓, π_2↑), N_C = 250 with (π_1↑, π_2↓), N_D = 250 with (π_1↓, π_2↓)]
• Imagine that our cohort is heterogeneous, e.g. there are four subgroups (A,B,C,D) with distinct cause-specific hazard rates, and distinct sizes. Here there will be significant risk correlations: patients with elevated risk for event 1 tend to also have elevated risk for event 2.
  [diagram: N_A = 480 with (π_1↑, π_2↑), N_B = 20 with (π_1↓, π_2↑), N_C = 20 with (π_1↑, π_2↓), N_D = 480 with (π_1↓, π_2↓)]
In the third scenario risk 2 will exert a ‘false protectivity’ effect on risk 1: the type-2 censoring
events will tend to remove individuals from the population that have an elevated type-1 risk.
We have seen ‘cohort filtering’ already in previous examples. Failure to take heterogeneity and
competing risks into account (as in Kaplan-Meier estimation and Cox regression) can give serious
problems – it can underestimate risks and hazard rates (if competing risks correlate positively) or
overestimate them (if they correlate negatively). It can make harmful treatments appear beneficial
and beneficial treatments appear harmful. The point here is that finding competing risks does
not require risk correlations at the level of individuals; they could be generated at the level of the
statistics of cause-specific hazard rates over the cohort. But cause-specific hazard rates can be
estimated from survival data, so this problem is in principle solvable.
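A minimal simulation of the third scenario, assuming for illustration constant cause-specific hazard rates with 'up' = 0.05 and 'down' = 0.01 events per unit time, confirms this: each individual's two event times are drawn independently, yet their population-level correlation is clearly positive.

import numpy as np

rng = np.random.default_rng(1)

# third scenario: four subgroups with assumed constant hazard rates (pi_1, pi_2)
groups = [(0.05, 0.05, 480),   # A: both risks elevated
          (0.01, 0.05, 20),    # B: risk 1 lowered, risk 2 elevated
          (0.05, 0.01, 20),    # C: risk 1 elevated, risk 2 lowered
          (0.01, 0.01, 480)]   # D: both risks lowered

t1, t2 = [], []
for pi1, pi2, n in groups:
    # within each individual: P_i(t1,t2) = P_i(t1) P_i(t2), independent exponentials
    t1.append(rng.exponential(1.0 / pi1, n))
    t2.append(rng.exponential(1.0 / pi2, n))
t1, t2 = np.concatenate(t1), np.concatenate(t2)

# at population level the two event times are nevertheless correlated
print("corr(t1, t2) =", round(np.corrcoef(t1, t2)[0, 1], 3))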
Structure of regression models for heterogeneous cohorts. Any regression model that is to capture
the relation between covariates and risk in a heterogeneous cohort, where risks can compete, should
(i) be able to describe the distribution of the individual cause-specific hazard rates, and how this
distribution depends on covariates, and (ii) do so for all risks simultaneously, not just for the primary
risk. We are then led automatically to the conditioning picture of survival analysis with covariates
in subsection 7.2, which was formulated in terms of W (π|Z), the probability that a randomly drawn
individual with covariates Z will have individual cause-specific hazard rates π.
All prediction formulae are already given in subsection 7.2. However, just as in the sub-cohort
formulation we were forced (by computational limitations and the danger of overfitting) to use simple
parametrisations of π(Z), also here we need simple parametrisations of W(π|Z) for the same
reasons. Let us write our chosen parametrised form as W_θ = W(π|Z,θ), in which θ ∈ IR^q are the
parameters. We choose one-to-one parametrisations, so that W_θ = W_θ′ if and only if θ = θ′. The
parametrisation is built into the prior P(W), which is zero if W is not of the required form:

P(W) = ∫dθ P(θ) δ[W − W_θ]     (369)
in which P(θ) is a parameter prior. Insertion into (197,198) tells us that

log P(W|D) = Σ_{i=1}^N log ∫dπ W(π|Z^i) P(X_i, Δ_i|π) + log ∫dθ P(θ) δ[W − W_θ] + constant     (370)
So indeed P(W|D) > 0 only for those W that are of the parametrised form. All statistical predictions to be made with P(W|D) can now always be written as

Q = ∫dW P(W|D) ∫dπ W(π|Z) Q(π,Z)
  = [∫dW P(W) P(D|W) ∫dπ W(π|Z) Q(π,Z)] / [∫dW P(W) P(D|W)]
  = ∫dθ P(θ|D) ∫dπ W(π|Z,θ) Q(π,Z)     (371)
in which
P(θ|D) = P(θ) P(D|θ) / ∫dθ′ P(θ′) P(D|θ′)     (372)

P(D|θ) = ∏_{i=1}^N ∫dπ W(π|Z^i,θ) P(X_i, Δ_i|π)     (373)
The structure of the formalism has remained unchanged by the imposition of our parametrisation,
but our predictions are now written in terms of averages over a posterior distribution of θ (instead
of averages over all W ), with the usual contributions from a prior and from the evidence in the data.
The log-likelihood of the posterior parameter distribution, L(θ|D) = log P(θ|D), takes the form

L(θ|D) = Σ_{i=1}^N log ∫dπ W(π|Z^i,θ) P(X_i, Δ_i|π) + log P(θ) + constant     (374)
These formulae are still completely general. Any theory will have this form; we could even choose
as our parametrisation the set of all possible functions W(π|Z) and work our way from (374) back
to (197,198). Theories can differ only in the chosen parametrisation W(π|Z,θ) and prior P(θ).
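As a minimal numerical illustration of (372)-(374), the sketch below evaluates the posterior of θ on a grid for the simplest conceivable parametrisation (all of the following are illustrative assumptions, not a realistic model): one risk, no censoring, a single scalar covariate, and the homogeneous choice W(π|Z,θ) = δ[π − e^{θZ}] with time-independent individual hazard rates.

import numpy as np

rng = np.random.default_rng(2)

# synthetic data: constant individual hazard rate pi = exp(theta*Z), no censoring
theta_true = 0.8
Z = rng.normal(size=200)
X = rng.exponential(np.exp(-theta_true * Z))       # event times (scale = 1/rate)

def log_L(theta):
    # eq. (374) with a flat prior: sum_i [log pi_i - pi_i X_i]
    pi = np.exp(theta * Z)
    return np.sum(np.log(pi) - pi * X)

theta_grid = np.linspace(0.0, 2.0, 401)
logp = np.array([log_L(t) for t in theta_grid])
post = np.exp(logp - logp.max())                   # unnormalised posterior (372)
post /= np.trapz(post, theta_grid)
print("posterior mean of theta:", np.trapz(theta_grid * post, theta_grid))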
10.2. Rational parametrisations for heterogeneous population models
Heterogeneity caused by missing covariates in Cox-type models – frailty models. Let us try
to construct a parametrisation W (π|Z,θ) from the following reasonable (but not necessarily
true) assumption: any covariate-to-risk heterogeneity in our cohort is due to the fact that we
haven’t measured enough information-carrying covariates. It follows that for each heterogeneous
cohort with covariates Z = (Z_1, . . . , Z_p) there exists an expanded set of covariates (Z, Z̃) =
(Z_1, . . . , Z_p, Z̃_1, . . . , Z̃_n) such that upon measuring the expanded set the cohort would have been
homogeneous. Hence

W(π|Z, Z̃) = δ[π − π(Z, Z̃)]     (375)

for some cause-specific hazard rates π(Z, Z̃), and

W(π|Z) = ∫dZ̃ P(Z̃) W(π|Z, Z̃) = ∫dZ̃ P(Z̃) δ[π − π(Z, Z̃)]     (376)
In the spirit of Cox regression, we now focus only on the primary risk hazard rate, and choose the
full primary hazard rate to be of the Cox form:
π_1(t|Z, Z̃) = λ_0(t) e^{β·Z + β̃·Z̃} = π_1(t|Z) e^{β̃·Z̃}     (377)

with the conventional Cox formula π_1(t|Z) = λ_0(t) e^{β·Z}. This implies that we postulate that the
primary risk hazard rate of each individual i is of the form

π_{i1}(t) = λ_0(t) e^{u_i + β·Z^i}     (378)
The (to us) unknown variable u_i raises or lowers the overall hazard rate, and is called a 'frailty factor'. Models of this type are called 'frailty models'. We can now write the distribution W(π|Z), which we seek to parametrise, in terms of the original covariates as

W(π|Z) = ∫du P(u) δ[π − π(Z, u)]     (379)

with

P(u) = ∫dZ̃ P(Z̃) δ[u − β̃·Z̃]     (380)

π_1(t|Z, u) = λ_0(t) e^{u + β·Z},   π_{r≠1}(t|Z, u) = π_{r≠1}(t|Z) unspecified     (381)
The average ⟨u⟩ = ∫du P(u) u = β̃·⟨Z̃⟩ can always be absorbed into λ_0(t), via λ_0(t) → λ_0(t) e^{⟨u⟩}.
We may therefore always take P(u) to have zero average. We now simply parametrise P(u), and
each such parametrisation then gives a class of frailty models that are simple generalisations of
the standard Cox model. For instance, upon choosing P(u) = (σ√(2π))^{−1} e^{−u²/2σ²} (corresponding to
Gaussian distributed missing covariates Z̃) or P(u) = Σ_{ℓ=1}^L w_ℓ δ(u−u_ℓ) (corresponding to discrete
missing covariates Z̃), we would get

Gaussian:   W(π|Z, β, λ_0, σ) = ∫ (dz/√(2π)) e^{−z²/2} δ[π − π(Z, σz)],   θ = (β, λ_0, σ)     (382)

discrete:   W(π|Z, β, λ_0, {w_ℓ, u_ℓ}) = Σ_{ℓ=1}^L w_ℓ δ[π − π(Z, u_ℓ)],   θ = (β, λ_0, {w_ℓ, u_ℓ})     (383)
with the Cox-type hazard rates (381).
Since one does not specify π_{r≠1}(t|Z, u), we focus in frailty models on extracting the parameters
relating to the primary risk. This is possible if we choose a simple prior in (374) that factorises over
risks, viz. P(θ) = P(β, λ_0, σ) P(θ′), where θ′ refers to any parameters relating to the non-primary
risks r ≠ 1. For such factorising priors we find (374) reducing for the above frailty models to
L(θ|D) = Σ_{i=1}^N log ∫du P(u) P(X_i, Δ_i|π(Z^i, u)) + log P(θ) + constant

 = Σ_{i=1}^N Σ_r δ_{Δ_i,r} log ∫du P(u) π_r(X_i|Z^i, u) e^{−Σ_{r′} ∫_0^{X_i} dt π_{r′}(t|Z^i,u)} + log P(θ) + constant

 = Σ_{i=1}^N δ_{Δ_i,1} log ∫du P(u) π_1(X_i|Z^i, u) e^{−∫_0^{X_i} dt π_1(t|Z^i,u) − Σ_{r′≠1} ∫_0^{X_i} dt π_{r′}(t|Z^i)}
   + log P(θ) + terms independent of θ

 = Σ_{i=1}^N δ_{Δ_i,1} log ∫du P(u) λ_0(X_i) exp[u + β·Z^i − e^{u+β·Z^i} ∫_0^{X_i} dt λ_0(t)]
   + log P(θ) + terms independent of θ

 = Σ_{i=1}^N δ_{Δ_i,1} log λ_0(X_i) + Σ_{i=1}^N δ_{Δ_i,1} log ∫du P(u) e^u exp[−e^{u+β·Z^i} ∫_0^{X_i} dt λ_0(t)]
   + Σ_{i=1}^N δ_{Δ_i,1} β·Z^i + log P(θ) + terms independent of θ     (384)
For P(u) → δ(u) the above log-likelihood reverts back to that of the Cox model. For nontrivial
P(u), however, the maximisation of L(θ|D) is clearly more involved. Variation of λ_0(t) gives the
corresponding stationarity condition for the base hazard rate.
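For the Gaussian choice (382) the u-integral in (384) has no closed form, but it is a one-dimensional average over a zero-average Gaussian and hence well suited to Gauss-Hermite quadrature. The sketch below evaluates the resulting log-likelihood for given (β, λ_0, σ), under the simplifying assumptions of synthetic uncensored data and a constant base hazard rate λ_0, so that ∫_0^{X_i} dt λ_0(t) = λ_0 X_i:

import numpy as np

rng = np.random.default_rng(3)

# synthetic, uncensored data (Delta_i = 1 for all i), p = 3 covariates
N, p = 100, 3
Z = rng.normal(size=(N, p))
X = rng.exponential(1.0, size=N)

# Gauss-Hermite nodes/weights for averages over u ~ N(0,1), rescaled below
nodes, weights = np.polynomial.hermite_e.hermegauss(40)
weights = weights / np.sqrt(2.0 * np.pi)

def log_L(beta, lam0, sigma):
    # last line of (384), constant base hazard: int_0^X dt lambda_0(t) = lam0*X
    bZ = Z @ beta
    Lam = lam0 * X
    u = sigma * nodes
    inner = np.exp(u)[None, :] * np.exp(-np.exp(u[None, :] + bZ[:, None]) * Lam[:, None])
    frailty_int = inner @ weights            # <e^u exp(-e^{u+beta.Z} Lam)>_u
    return np.sum(np.log(lam0) + bZ + np.log(frailty_int))

print("L(beta=0, lam0=1, sigma=0.5) =", round(log_L(np.zeros(p), 1.0, 0.5), 2))

Maximising this function over (β, λ_0, σ) then replaces the usual Cox maximisation; for σ → 0 one recovers the Cox log-likelihood, as noted above.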
Example: new statistical methodology applied to data on 2047 prostate cancer patients. Standard Cox regression gives the following regression coefficients for prostate cancer (PC) mortality:

             BMI    selenium   leis act   work act   smoking
  PC         0.1    -0.1       0.2        -0.1       -0.1

The new explanation distinguishes two classes. Class 1 (52% of the cohort, low overall frailty):

             BMI    selenium   leis act   work act   smoking
  PC         1.5    -4.8       2.2        -1.4       4.1
  other      0.8    -0.2       -0.2       0.0        1.5

Class 2 (48% of the cohort, high overall frailty):

             BMI    selenium   leis act   work act   smoking
  PC         0.0    0.0        0.0        0.0        0.0
  other      0.0    -0.1       -0.1       0.0        0.2
10.3. Types of heterogeneity
10.4. Impact of heterogeneity on hazard ratios
10.5. False protectivity
10.6. Frailty models
10.7. Fine and Gray regression
10.8. Bayesian regression
10.9. Notes and examples
11. Further topics
11.1. Binary classification
Bayesian approach to gene signatures

• prognostic signatures:
  e.g. the 'typical' profile of patients with poor outcome,
  the 'typical' profile of patients with good outcome,
  the 'typical' profile of patients that respond to treatment;
  profile similar to signature → treat differently
• various heuristic definitions ...
• different normalisations of the signals ...
• what does 'similar' mean ... distance? covariance?
  [figure omitted: two scatter plots of patient profiles in the (x_1, x_2) plane]
  use 'good outcome' or 'poor outcome' signatures?
  how to map similarity to classification reliability?
Bayesian prediction of responders

medical trial data: D
classes: σ = 1, 0 (response yes/no)
fraction of responders: φ

The probability that a woman with profile x will respond:

p(1|x) = ∫dθ p(θ|D) [ φ p(x|1,θ) / ( φ p(x|1,θ) + (1−φ) p(x|0,θ) ) ]

with p(x|σ,θ) a parametrised distribution, and p(θ|D) given by an explicit formula.
• no ambiguities
• formulated in terms of response probabilities
• finds the cohort's profile characteristics
simplest case: Gaussian p(x|σ,θ), with class variances ∆_0, ∆_1.
[figure omitted: three panels showing the resulting decision boundaries for ∆_0 < ∆_1, ∆_0 = ∆_1 and ∆_0 > ∆_1]
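For this simplest case the formula for p(1|x) can be evaluated directly. The sketch below uses one-dimensional Gaussian class-conditional distributions with plug-in parameter values (an assumption made to keep the example short; the full Bayesian version averages the same expression over p(θ|D)):

import numpy as np
from scipy.stats import norm

phi = 0.3                          # assumed fraction of responders
mu0, s0 = 0.0, 1.0                 # non-responders: mean, standard deviation
mu1, s1 = 1.5, 0.7                 # responders: mean, standard deviation

def p_respond(x):
    # p(1|x) = phi p(x|1) / [phi p(x|1) + (1 - phi) p(x|0)]
    f1 = phi * norm.pdf(x, mu1, s1)
    f0 = (1.0 - phi) * norm.pdf(x, mu0, s0)
    return f1 / (f1 + f0)

for x in [-1.0, 0.0, 1.0, 2.0]:
    print(f"x = {x:4.1f}:  p(response|x) = {p_respond(x):.3f}")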
heterogeneous disease? e.g. one peak in the non-responders' gene statistics,
two peaks in the responders' gene statistics ...
[figure omitted: three panels contrasting heuristic prognostic signatures with the Bayesian p(1|x) for such a heterogeneous case]
Bayesian decision boundaries adapt to the shape of the gene profile statistics of the classes.
11.2. Multiple testing corrections
11.3. Log-rank test
Appendix A. The δ-distribution
Definition. We define the δ-distribution as the probability distribution δ(x) corresponding to a zero-average random variable x in the limit where the randomness in the variable vanishes. So

∫dx f(x) δ(x) = f(0)   for any function f

By the same token, the expression δ(x−a) will then represent the distribution for a random variable x with average a, in the limit where the randomness vanishes, since

∫dx f(x) δ(x−a) = ∫dx f(x+a) δ(x) = f(a)   for any function f
Formulas for the δ-distribution. A problem arises when we want to write down a formula for δ(x).
Intuitively one could propose to take a zero-average normal distribution and send its width to zero,
δ(x) = lim_{σ→0} p_σ(x),   p_σ(x) = (1/(σ√(2π))) e^{−x²/2σ²}     (A.1)
This is not a true function in a mathematical sense: δ(x) is zero for x ≠ 0 and δ(0) = ∞. However,
we realize that δ(x) only serves to calculate averages; it only has a meaning inside an integration. If
we adopt the convention that one should set σ → 0 in (A.1) only after performing the integration,
we can use (A.1) to derive the following properties (for sufficiently well-behaved functions f):

∫dx δ(x) f(x) = lim_{σ→0} ∫dx p_σ(x) f(x) = lim_{σ→0} ∫ (dx/√(2π)) e^{−x²/2} f(σx) = f(0)

∫dx δ′(x) f(x) = lim_{σ→0} ∫dx { (d/dx)[p_σ(x) f(x)] − p_σ(x) f′(x) } = lim_{σ→0} [p_σ(x) f(x)]_{−∞}^{∞} − f′(0) = −f′(0)
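The convention 'integrate first, then send σ → 0' is easily checked numerically; the sketch below does so for the arbitrary test function f(x) = cos(x) + x², for which the integral should converge to f(0) = 1:

import numpy as np
from scipy.integrate import quad

f = lambda x: np.cos(x) + x**2           # arbitrary smooth test function

for sigma in [1.0, 0.1, 0.01]:
    p_sigma = lambda x, s=sigma: np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
    val, _ = quad(lambda x: p_sigma(x) * f(x), -10 * sigma, 10 * sigma)
    print(f"sigma = {sigma:5.2f}:  integral = {val:.6f}   (f(0) = {f(0.0):.6f})")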
The following relation links the δ-distribution to the step function:

δ(x) = (d/dx) θ(x),   θ(x) = 1 if x > 0, θ(x) = 0 if x < 0     (A.2)
One proves this by showing that both sides of the equation have the same effect inside an integration:

∫dx [δ(x) − (d/dx)θ(x)] f(x) = f(0) − lim_{ε→0} ∫_{−ε}^{ε} dx { (d/dx)[θ(x)f(x)] − f′(x)θ(x) }
 = f(0) − lim_{ε→0} [f(ε) − 0] + lim_{ε→0} ∫_0^ε dx f′(x) = 0
Finally one can use the definitions of Fourier transforms and inverse Fourier transforms to obtain
the following integral representation of the δ-distribution:
δ(x) = ∫_{−∞}^{∞} (dk/2π) e^{ikx}     (A.3)
Appendix B. Steepest descent integration
Steepest descent (or ‘saddle-point’) integration is a method for dealing with integrals of the following
type, with x ∈ IR^p, continuous functions f(x) and g(x) of which f is bounded from below, and
with N ∈ IR positive and large:

I_N[f, g] = ∫_{IR^p} dx g(x) e^{−N f(x)}     (B.1)
We first take f(x) to be real-valued; this is the simplest case, for which finding the asymptotic
behaviour of (B.1) as N → ∞ goes back to Laplace. We assume that f(x) can be expanded in a
Taylor series around its minimum f(x*), which we assume to be unique, i.e.

f(x) = f(x*) + (1/2) Σ_{ij=1}^p A_{ij}(x_i − x*_i)(x_j − x*_j) + O(|x−x*|³),   A_{ij} = ∂²f/∂x_i∂x_j |_{x*}     (B.2)
If the integral (B.1) exists, inserting (B.2) into (B.1) followed by the transformation x = x* + y/√N gives

I_N[f, g] = e^{−N f(x*)} ∫_{IR^p} dx g(x) e^{−(N/2) Σ_{ij} (x_i−x*_i) A_{ij} (x_j−x*_j) + O(N|x−x*|³)}
 = N^{−p/2} e^{−N f(x*)} ∫_{IR^p} dy g(x* + y/√N) e^{−(1/2) Σ_{ij} y_i A_{ij} y_j + O(N^{−1/2}|y|³)}     (B.3)
From this latter expansion, and given the assumptions made, we can obtain two important identities:

− lim_{N→∞} (1/N) log ∫_{IR^p} dx e^{−N f(x)} = − lim_{N→∞} (1/N) log I_N[f, 1]
 = f(x*) + lim_{N→∞} [ (p log N)/(2N) − (1/N) log ∫_{IR^p} dy e^{−(1/2) Σ_{ij} y_i A_{ij} y_j + O(N^{−1/2}|y|³)} ]
 = f(x*) = min_{x∈IR^p} f(x)     (B.4)
and

lim_{N→∞} ∫dx g(x) e^{−N f(x)} / ∫dx e^{−N f(x)} = lim_{N→∞} I_N[f, g] / I_N[f, 1]
 = lim_{N→∞} [ ∫_{IR^p} dy g(x* + y/√N) e^{−(1/2) Σ_{ij} y_i A_{ij} y_j + O(N^{−1/2}|y|³)} ] / [ ∫_{IR^p} dy e^{−(1/2) Σ_{ij} y_i A_{ij} y_j + O(N^{−1/2}|y|³)} ]
 = [ g(x*) (2π)^{p/2}/√(Det A) ] / [ (2π)^{p/2}/√(Det A) ] = g(x*) = g(arg min_{x∈IR^p} f(x))     (B.5)
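A one-dimensional numerical check of (B.5), with the arbitrary choices f(x) = (x−1)²/2 and g(x) = cos(x), so that x* = 1 and the ratio of integrals should tend to g(x*) = cos(1) ≈ 0.5403 as N grows:

import numpy as np
from scipy.integrate import quad

f = lambda x: 0.5 * (x - 1.0)**2          # unique minimum at x* = 1
g = lambda x: np.cos(x)

for N in [10, 100, 1000]:
    num, _ = quad(lambda x: g(x) * np.exp(-N * f(x)), -10, 10, points=[1.0])
    den, _ = quad(lambda x: np.exp(-N * f(x)), -10, 10, points=[1.0])
    print(f"N = {N:5d}:  ratio = {num / den:.6f}")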
If f(x) is complex, the correct procedure to be followed is to deform the integration paths in the complex plane (using Cauchy's theorem) such that along the deformed path the imaginary part of the function f(x) is constant, and preferably (if possible) zero. One then proceeds using Laplace's argument and finds the leading order in N of our integral in the usual manner by extremization of the real part of f(x). In combination, our integrals will thus again be dominated by an extremum of the (complex) function f(x), but since f is complex this extremum need not be a minimum:

− lim_{N→∞} (1/N) log ∫_{IR^p} dx e^{−N f(x)} = extr_{x∈IR^p} f(x)     (B.6)

lim_{N→∞} ∫dx g(x) e^{−N f(x)} / ∫dx e^{−N f(x)} = g(arg extr_{x∈IR^p} f(x))     (B.7)
Appendix C. Maximum likelihood and Bayesian parameter estimation
To illustrate the procedures of maximum likelihood and Bayesian estimation of parameters from
data we consider the following problem. We are given a dice and want to know the true (but as yet
unknown) probabilities (π1, . . . , π6) of each possible throw. A fair dice would have πr = 1/6 for all
r. Note that Σ_{r=1}^6 π_r = 1. Our data from which to extract the information consists of the results
of N independent throws of the dice:

D = {X_1, X_2, . . . , X_N},   X_i ∈ {1, 2, . . . , 6} for each i     (C.1)
• Ad hoc estimators:
Our problem is sufficiently transparent for us to simply guess suitable estimators. It would be natural to choose for π_k the empirical frequency with which the throw X_i = k is observed:

(∀k = 1 . . . 6):   π̂_k = (1/N) Σ_i δ_{X_i,k}     (C.2)

This choice satisfies the constraint Σ_{k=1}^6 π̂_k = 1, and for N → ∞ the law of large numbers indeed gives lim_{N→∞} π̂_k = Σ_{r=1}^6 π_r δ_{rk} = π_k. So our π̂_k are proper estimators. The results of simulating this estimation process numerically for a loaded dice are shown in figure C1 (see also the numerical sketch at the end of this appendix). One clearly needs data sets of size N ∼ 2000 or more for (C.2) to approach the true values.
• Maximum likelihood estimators:
The maximum likelihood estimators are determined by maximizing over (π_1, . . . , π_6) the likelihood of the data D, given the values of (π_1, . . . , π_6). Here we have

log P(D|π_1, . . . , π_6) = log ∏_{i=1}^N P(X_i|π_1, . . . , π_6) = Σ_{i=1}^N log π_{X_i}     (C.3)

Let us maximize this quantity over (π_1, . . . , π_6), subject to the constraint Σ_{r=1}^6 π_r = 1, using the Lagrange formalism:

∂/∂π_k Σ_{i=1}^N log π_{X_i} = λ ∂/∂π_k Σ_{r=1}^6 π_r :   Σ_{i=1}^N (1/π_k) δ_{X_i,k} = λ,   hence π̂_k = (1/λ) Σ_{i=1}^N δ_{X_i,k}     (C.4)

Summation over k on both sides gives Σ_k π̂_k = N/λ, so our normalisation constraint tells us that λ = N. Hence the maximum likelihood estimator is identical to our estimator (C.2).
• Bayesian estimation:
Finally, when following the Bayesian route we calculate P(π_1, . . . , π_6|D), defined as

P(π_1, . . . , π_6|D) = P(D|π_1, . . . , π_6) P(π_1, . . . , π_6) / P(D)
 = P(D|π_1, . . . , π_6) P(π_1, . . . , π_6) / ∫_Ω dπ′_1 . . . dπ′_6 P(D|π′_1, . . . , π′_6) P(π′_1, . . . , π′_6)     (C.5)
[figure omitted]
Figure C1. The six empirical frequencies (or estimators) π̂_k = N^{−1} Σ_i δ_{X_i,k}, for each possible dice throw k = 1 . . . 6, versus the size N of the data set (N up to 10000; vertical axis 0 ≤ π̂_k ≤ 0.3). In this example of a loaded dice the actual probabilities are (π_1, . . . , π_6) = (0.16, 0.16, 0.16, 0.16, 0.16, 0.20).
Here Ω is the set of all parameters (π_1, . . . , π_6) that satisfy the relevant constraints, i.e. Ω = {(π_1, . . . , π_6) ∈ IR^6 | π_r ≥ 0 ∀r, Σ_{r≤6} π_r = 1}. Alternatively (and equivalently) we can integrate over IR^6 and implement the constraints via the prior, i.e. by defining P(π_1, . . . , π_6) = 0 if (π_1, . . . , π_6) ∉ Ω.
Next we need to determine the values of the prior P(π_1, . . . , π_6) for (π_1, . . . , π_6) ∈ Ω. Information theory tells us that if the only prior information available is our knowledge of the constraints, we should choose the prior that maximizes the Shannon entropy subject to these constraints. This is again done via the Lagrange method, where we now vary the entries of the prior P(π_1, . . . , π_6) (it turns out that non-negativity will be satisfied automatically, so we only impose the normalisation constraint):

δ/δP(π_1, . . . , π_6) ∫_Ω dπ′_1 . . . dπ′_6 P(π′_1, . . . , π′_6) log P(π′_1, . . . , π′_6) = Λ δ/δP(π_1, . . . , π_6) ∫_Ω dπ′_1 . . . dπ′_6 P(π′_1, . . . , π′_6)

∀(π_1, . . . , π_6) ∈ Ω:   1 + log P(π_1, . . . , π_6) = Λ     (C.6)
We see that the maximum entropy prior is flat over Ω, so P(π_1, . . . , π_6) = 1/|Ω|. Hence, upon insertion into (C.5) we get for (π_1, . . . , π_6) ∈ Ω an expression that can again be written in terms of the estimator (C.2):

P(π_1, . . . , π_6|D) = P(D|π_1, . . . , π_6) / ∫_Ω dπ′_1 . . . dπ′_6 P(D|π′_1, . . . , π′_6)
 = ∏_i π_{X_i} / ∫_Ω dπ′_1 . . . dπ′_6 ∏_i π′_{X_i}
 = e^{Σ_{k=1}^6 log π_k Σ_i δ_{k,X_i}} / ∫_Ω dπ′_1 . . . dπ′_6 e^{Σ_{k=1}^6 log π′_k Σ_i δ_{k,X_i}}
 = e^{N Σ_{k=1}^6 π̂_k log π_k} / ∫_Ω dπ′_1 . . . dπ′_6 e^{N Σ_{k=1}^6 π̂_k log π′_k}     (C.7)
Let us work out the denominator, using the standard integral representation δ(z) = (2π)^{−1} ∫_{−∞}^{∞} dx e^{ixz} for the delta-function:

(1/N) log Den = (1/N) log ∫_Ω dπ′_1 . . . dπ′_6 e^{N Σ_{k=1}^6 π̂_k log π′_k}
 = (1/N) log ∫_0^∞ dπ_1 . . . dπ_6 δ[1 − Σ_{k=1}^6 π_k] e^{N Σ_{k=1}^6 π̂_k log π_k}
 = (1/N) log ∫_{−∞}^{∞} (dx/2π) e^{ix} ∏_{k=1}^6 ∫_0^∞ dy e^{N π̂_k log y − ixy}
 = (1/N) log ∫_{−∞}^{∞} (dx/(2π/N)) e^{iNx} ∏_{k=1}^6 ∫_0^∞ dy e^{N[π̂_k log y − ixy]}     (C.8)
Focusing on the y integral, we note that for large N the dominant contribution comes from the saddle-point, i.e. after shifting the contour in the complex plane, from the solution of (d/dy)[π̂_k log y − ixy] = 0, giving y = −iπ̂_k/x. So steepest descent integration gives us (see Appendix B for an introduction to steepest descent integration), using log(−i) = i Arg(−i) = −iπ/2:

∫_0^∞ dy e^{N[π̂_k log y − ixy]} = e^{N[π̂_k log(−iπ̂_k/x) − π̂_k] + O(N^0)} = e^{N[π̂_k log π̂_k − (iπ/2)π̂_k − π̂_k log x − π̂_k] + O(N^0)}     (C.9)
Hence

(1/N) log Den = (1/N) log ∫_{−∞}^{∞} dx e^{N[ix + Σ_{k=1}^6 (π̂_k log π̂_k − (iπ/2)π̂_k − π̂_k log x − π̂_k)] + O(N^0)} + (log N)/N
 = (1/N) log ∫_{−∞}^{∞} dx e^{N[ix + Σ_{k=1}^6 π̂_k log π̂_k − iπ/2 − log x − 1] + O(N^0)} + (log N)/N
 = Σ_{k=1}^6 π̂_k log π̂_k − iπ/2 − 1 + (1/N) log ∫_{−∞}^{∞} dx e^{N(ix − log x)} + (log N)/N + O(1/N)     (C.10)
Steepest descent integration over x gives N^{−1} log ∫dx e^{N(ix−log x)} = 1 + (1/2)iπ + O(N^{−1}). Thus we get

(1/N) log Den = Σ_{k=1}^6 π̂_k log π̂_k + (log N)/N + O(N^{−1})

Den = N e^{N Σ_{k=1}^6 π̂_k log π̂_k + O(N^0)}     (C.11)
The end result is the following appealing large-N form of our formula (C.7):

P(π_1, . . . , π_6|D) = (1/N) e^{N Σ_{k=1}^6 π̂_k log π_k − N Σ_{k=1}^6 π̂_k log π̂_k + O(N^0)}
 = (1/N) e^{−N Σ_{k=1}^6 π̂_k log(π̂_k/π_k) + O(N^0)}     (C.12)
The leading order in the exponent, apart from the factor N, is the Kullback-Leibler distance between the estimated and true probability distributions {π̂_k} and {π_k}. The most probable values of the probabilities are therefore again seen to be the estimators (C.2), but now we know more: we have also quantified our uncertainty for large but finite N.
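The content of this appendix condenses into the short simulation below, which draws throws from the loaded dice of figure C1, forms the estimator (C.2), and evaluates the Kullback-Leibler term in the exponent of (C.12); the fact that N·KL remains of order one as N grows reflects the concentration of the posterior around the estimators:

import numpy as np

rng = np.random.default_rng(4)
pi_true = np.array([0.16, 0.16, 0.16, 0.16, 0.16, 0.20])   # loaded dice of fig. C1

for N in [100, 1000, 10000]:
    throws = rng.choice(6, size=N, p=pi_true)              # throws, coded 0..5
    pi_hat = np.bincount(throws, minlength=6) / N          # estimator (C.2)
    kl = np.sum(pi_hat * np.log(pi_hat / pi_true))         # exponent of (C.12)
    print(f"N = {N:6d}:  pi_hat = {np.round(pi_hat, 3)},  N*KL = {N * kl:.2f}")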
Maximum prediction accuracy with Cox regression
Assume we know our regression parameters exactly, the hazard rate is indeed of the Cox form, and
there is no censoring (the ideal scenario). We take all covariates to be independent, zero-average,
unit-variance Gaussian variables. The survival probability for risk 1 is then
S(t|Z) = exp(− Σ_{i=1}^N [e^{β̂·Z} θ(t−X_i)] / [Σ_{j=1}^N e^{β̂·Z^j} θ(X_j−X_i)])     (C.13)
and we would classify a patient with covariates Z at time t according to the most probable outcome:

σ(Z) = θ[S(t|Z) − 1/2] = θ[ log(2) − (1/N) Σ_{i=1}^N e^{β̂·Z} θ(t−X_i) / ((1/N) Σ_{j=1}^N e^{β̂·Z^j} θ(X_j−X_i)) ]     (C.14)
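A direct transcription of (C.13,C.14) into code, for synthetic data drawn from a Cox model with λ_0(t) = 1 and a known parameter vector (all illustrative assumptions, matching the ideal scenario above):

import numpy as np

rng = np.random.default_rng(5)

# synthetic uncensored data from a Cox model with lambda_0(t) = 1, known beta
N, p = 500, 4
beta = np.array([0.5, -0.5, 0.3, 0.0])             # assumed known exactly
Z = rng.normal(size=(N, p))
X = rng.exponential(np.exp(-Z @ beta))             # event times (scale = 1/rate)

risk = np.exp(Z @ beta)                            # e^{beta.Z^j} for all j
at_risk = (X[None, :] >= X[:, None]) @ risk        # sum_j e^{beta.Z^j} theta(X_j - X_i)

def S(t, z):
    # survival probability (C.13) for a new covariate vector z
    return np.exp(-np.exp(beta @ z) * np.sum((X <= t) / at_risk))

t, z_new = np.median(X), np.zeros(p)
print("S(t|z) =", round(S(t, z_new), 3),
      "-> classify sigma(Z) =", int(S(t, z_new) > 0.5))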
For N → ∞ there will be no difference between training and validation sets in terms of prediction accuracy, and the fraction predicted correctly will simply be

Q_t = ⟨ (1/2) + (1/2) sgn(X−t) sgn[ log(2) − (1/N) Σ_{i=1}^N e^{β̂·Z} θ(t−X_i) / ((1/N) Σ_{j=1}^N e^{β̂·Z^j} θ(X_j−X_i)) ] ⟩_{Z,X}
 = 1/2 + (1/2) ⟨ sgn(X−t) sgn[ log(2) − e^{β̂·Z} ⟨ θ(t−X′) / ⟨e^{β̂·Z″} θ(X″−X′)⟩_{Z″,X″} ⟩_{X′} ] ⟩_{Z,X}     (C.15)
We first do the average in the denominator:

⟨e^{β̂·Z″} θ(X″−X′)⟩_{Z″,X″} = ∫DZ e^{β̂·Z} ∫_{X′}^∞ ds P(s|Z)
 = ∫DZ e^{β̂·Z} ∫_{X′}^∞ ds π_1(s|Z) e^{−∫_0^s ds′ π_1(s′|Z)}
 = ∫DZ e^{β̂·Z} [−e^{−∫_0^s ds′ π_1(s′|Z)}]_{X′}^∞
 = ∫DZ e^{β̂·Z} ( e^{−∫_0^{X′} ds π_1(s|Z)} − e^{−∫_0^∞ ds π_1(s|Z)} )
 = ∫DZ e^{β̂·Z} ( S(X′|Z) − S(∞|Z) )     (C.16)