Principles of Survival Analysis
Version of July 22nd 2012

ACC Coolen
King's College London
Preface
When I first tried to learn about survival analysis I found there to be an unwelcome gap in
the literature. There are many textbooks and papers that give the formulas of survival analysis, explain
how one should use standard statistical software packages, and give examples of applications of
standard methods to real data. Then there are hardcore statistics papers that often focus on
very mathematical/technical questions and are often written in the language of measure theory.
I could not find good textbooks that sit in the middle, to explain in detail the conceptual and
mathematical basis of the formulas of survival analysis. Where do all these formulas come from?
What assumptions were made? How exactly are key quantities defined?
The traditional survival analysis methods such as proportional hazards regression and Kaplan-
Meier risk estimators were perfect for their time (the 1970s), when each university had just one
computer (that filled several rooms, and was probably slower than today’s average laptop) and
mathematical methods had to be simple in order to be applied to real data. However, nowadays
the use of this traditional methodology is increasingly inappropriate. In modern biomedicine we
have new problems and new ambitions: we want to use the wealth of new data for personalised
medicine, but face complex heterogeneous diseases and cohorts, and a vast dimensional mismatch
between the number of (e.g. genetic) covariates and the number of patients on which we have data.
I would say that at this moment the pressing problems in survival analysis are not measure-
theoretic; they are more basic. We need to develop new methods that do not deviate unnecessarily
from the traditional ones (and preferably include these in special simplifying limits), but can
handle the big questions of today – individualised prediction, cohort heterogeneity and dimensional
mismatch. To do this we need to understand and review in full detail the principles and
mathematical derivations of traditional survival analysis, and rebuild the edifice where this is needed
to accommodate the new questions that we want survival analysis to answer.
These lecture notes are written as an attempt to fill the above gap. I try to map out and explain
the definitions, assumptions and derivations of the main methods in survival analysis. The style is
that of the physicist who appreciates that there is a time and place for investigating mathematical
subtleties like Lebesgue measures, noncommuting limits, and distribution theory, but who first
wants to erect the building in terms of structure. Once the roof is on and the windows are in, we
can start thinking about the colour of the door handles. I also write with the benefit of hindsight.
In the 1970s maximum-likelihood estimation was the norm, and that is often the language in which
original derivations are given. Nowadays we prefer the Bayesian route (within which maximum
likelihood is but a special limit), and this makes derivations and subtleties a lot more transparent.
In these notes I will not give journal references. Below I simply list four textbooks on the
subject, which contain a wealth of references to research papers and other texts. With the possible
exception of Crowder, these books tend not to give full mathematical derivations to the extent that
I would have liked. The texts of Klein and Moeschberger and of Crowder I like best in terms of
subject coverage and writing style, but of course such assessments are subjective ...
P Hougaard: Analysis of multivariate survival data. Springer (2001).
JG Ibrahim, MH Chen and D Sinha: Bayesian survival analysis. Springer (2001).
JP Klein and ML Moeschberger: Survival analysis – techniques for censored and
truncated data. Springer (2005).
M Crowder: Multivariate survival analysis and competing risks. CRC Press (2012).
CONTENTS
1 Why probability and statistics are tricky
2 Definitions and basic properties in survival analysis
2.1 Notation, data and objective
2.2 Survival probability and cause-specific hazard rates
2.3 Examples
3 Event time correlations and the identifiability problem
3.1 Independently distributed event times
3.2 The (Tsiatis) identifiability problem
3.3 Examples
4 Incorporating cure as a possible outcome
4.1 The clean way to include cure
4.2 The quick and dirty way to include cure
4.3 Examples
5 Individual versus cohort level survival statistics
5.1 Population level survival functions
5.2 Population hazard rates and data likelihood
5.3 Examples
6 Survival prediction
6.1 Cause-specific survival functions
6.2 Estimation of cause-specific hazard rates
6.3 Derivation of the Kaplan-Meier estimator
6.4 Examples
7 Including covariates
7.1 Definition via covariate sub-cohorts
7.2 Definition by conditioning individual hazard rates on covariates
7.3 Connection between the conditioning picture and the sub-cohort picture
7.4 Conditionally homogeneous cohorts
7.5 Nonparametrised determination of covariates-to-risk connection
7.6 Examples
8 Proportional hazards (Cox) regression
8.1 Definitions, assumptions and regression equations
8.2 Uniqueness and p-values for regression parameters
8.3 Properties and limitations of Cox regression
8.4 Examples
9 Overfitting and p-values
9.1 What is overfitting?
9.2 Overfitting in binary classification
9.3 Overfitting in Cox regression
9.4 p-values for Kaplan-Meier curves
9.5 Notes and examples
10 Heterogeneous cohorts and competing risks
10.1 Population-level hazard rate correlations and competing risks
10.2 Rational parametrisations for heterogeneous population models
10.3 Types of heterogeneity
10.4 Impact of heterogeneity on hazard ratios
10.5 False protectivity
10.6 Frailty models
10.7 Fine and Gray regression
10.8 Bayesian regression
10.9 Notes and examples
11 Further topics
11.1 Binary classification
11.2 Multiple testing corrections
11.3 Log-rank test
Appendix A The δ-distribution
Appendix B Steepest descent integration
Appendix C Maximum likelihood and Bayesian parameter estimation
1. Why probability and statistics are tricky
Probability and statistics are probably the most tricky and most abused areas of mathematics.
In order to get some feeling for why this is so, let us start with some simple examples of
statistical/probabilistic questions that as yet have nothing to do with survival analysis.
Example 1: The Monty Hall problem
This problem is based loosely on the scenario played out at the end of many typical television
game shows of the 1970s. The archetypical show in the USA after which the problem was
named was called ‘Let’s Make a Deal’ (USA, 1963-1977) and was hosted by Monty Hall.
At the end of the show the winner faces a final challenge to claim a prize. There are three
closed doors; behind one of these is a big prize (large amount of money, a car, etc) , behind
the other two is something silly (e.g. goats or llamas). What happens next is:
• the winner is asked to choose one of the
three closed doors (randomly, as he/she
has no clue).
• Monty opens one of the remaining two
doors, behind which there is a goat/llama
(this is always possible, irrespective of
the winner’s selection, since only one
door has the true prize).
We are then left with two closed doors,
one of which was picked by the winner.
We still don’t know which of these leads to the prize.
• Monty then offers the winner the option to change his/her mind at the last minute and
switch from the initial selection to the other closed door.
The question then is: will it make a difference to the likelihood of winning the prize if he/she
were to switch at the last minute? Intuitively one is tempted to say no. It would seem that each
door simply has a 50% chance of leading to the prize, and switching would make no difference.
In fact the correct answer is yes. Careful analysis of all possible events and their probabilities
shows that switching at the last minute doubles one's likelihood of winning the prize: the initial
pick is correct with probability 1/3, so switching wins exactly when the initial pick was wrong,
i.e. with probability 2/3. A minimal Monte Carlo check of this is sketched below.
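The following simulation is my own illustration, not part of the original notes; the function
name play and the door-numbering convention are arbitrary choices:

import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)     # door hiding the prize
        pick = random.randrange(3)      # winner's initial (random) choice
        # Monty opens a door that is neither the pick nor the prize
        opened = next(d for d in range(3) if d != pick and d != prize)
        if switch:                      # move to the one remaining closed door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == prize)
    return wins / trials

print(play(switch=False))   # ~0.33: staying wins 1/3 of the time
print(play(switch=True))    # ~0.67: switching wins 2/3 of the time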
Example 2: share price statistics
Imagine one is asked to produce a statistical report on the typical behaviour of share prices
over a twenty-year period, of corporations that are listed on the London Stock Exchange. For
the sake of the argument, let us pretend that the most recent financial crisis hadn’t happened.
How would we go about this task? It would seem natural to proceed as follows:
• Make a list of all companies that have
been on the LSE since 1992.
• Find/buy the data that give the daily
share values of all companies on this
list over the last 20 years.
• Carry out a careful statistical analysis
of these data.
Yet in doing so we would make fundamental mistakes. In fact we are already in trouble from
the first step. By putting only those companies on our list that have been on the LSE for the
last twenty years, we are biasing our sample to those companies that are sufficiently healthy
to remain in business (and listed on the LSE) for at least twenty years. Irrespective of the
statistical analysis methods used, this will lead to a picture of share statistics that is too rosy.
Pitfalls and dangers in probability and statistics. Probability and statistics are in principle as
sound and unambiguous as any other area of mathematics. The pitfalls that so often get us
into trouble with statistics do not relate to the precision and consistency of the formal theory
or mathematical manipulations, but they tend to emerge when we apply statistics and probability
to practical scenarios and real-world problems. The main ones are related to:
• The meaning of uncertainty
Probabilities quantify uncertainty, but there are two types of uncertainty. Probabilities can
express our ignorance of:
(a) something that cannot be known, because it is still to happen and can still go either way
(e.g. the probability of finding a six for a dice that is still to be rolled ...)
(b) something that is known, but not by us
(e.g. the probability of finding a six for a dice that has been rolled inside a black box ...)
For instance, if we write Prob(phenotype) = Σ_genotypes Prob(genotype) × Prob(phenotype|genotype) in
biomedicine, then the uncertainty in an individual's genotype, expressed by Prob(genotype),
would be of type (b) (in principle written in stone, but we don't have the information), whereas
given the genotype we would still expect variability in the phenotype, described by
Prob(phenotype|genotype), that results at least partly from non-predictable events (e.g. cell
signalling, mutations), i.e. type (a) uncertainty. In medicine the difference can be quite relevant.
• Accidental conditioning
This is what happens when we inadvertently collect our information from a non-representative
subset of the events or individuals on which we seek to make statistical statements. This is
what happens in the Monty Hall problem (where Monty's decision of which door to open is
constrained or conditioned by the initial selection of the winner; this brings in subtle extra
information that is exploited when the winner switches at the last minute), and in the example
of share price analysis (where we condition our sample of companies on at least 20 years'
survival). Extra information B generally modifies the probability to observe A: we wish to
sample according to a prior P(A), but end up sampling according to a conditioned posterior
measure P(A|B), described by the Bayesian relation
$$\underbrace{P(A|B)}_{\rm posterior} = \frac{P(A,B)}{P(B)} = \frac{\overbrace{P(A)}^{\rm prior}\times P(B|A)}{P(B)}$$
• Limitations of our intuition
Possibly because of the evolutionary advantages of pattern detection, humans are obsessed with
patterns, and consequently very poor at judging likelihoods objectively. We struggle to accept
intuitively that even after we have thrown ten successive sixes with a fair dice (an unlikely
sequence of events, for which the a priori probability is around $1.7\times 10^{-8}$), the next roll
still carries a probability of 1/6 of giving yet another six (in spite of the fact that this would
lead to an even more remarkable sequence of eleven sixes in a row). By the same token most
humans struggle to generate sequences of random numbers 01001010001011010010...; it
is trivial to write a simple computer program that can predict the next digit in such human-
generated sequences correctly with a probability of some 60% (a minimal sketch of such a
predictor is given below). This failing statistical intuition explains why most would get the
answer to the Monty Hall question wrong, and partly explains the profitability of the gambling
industry ...
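The sketch below is my own construction (the context length k = 3 and the name predict_next
are arbitrary choices); it simply counts how often each short pattern of recent digits was
followed by a 0 or a 1, and guesses the majority continuation:

from collections import defaultdict

def predict_next(bits, k=3):
    # count, for each length-k context, how often it was followed by 0 or by 1
    counts = defaultdict(lambda: [0, 0])
    for i in range(len(bits) - k):
        counts[tuple(bits[i:i+k])][bits[i+k]] += 1
    zeros, ones = counts[tuple(bits[-k:])]
    return 0 if zeros >= ones else 1

seq = [0,1,0,0,1,0,1,0,0,0,1,0,1,1,0,1,0,0,1,0]
print(predict_next(seq))   # guess for the next digit of this sequence

On genuinely random input such a predictor scores 50%; on typical human-generated sequences
it does distinctly better, because humans avoid the long repeats that true randomness produces.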
• Assumptions behind methods
All statistical methods involve explicit or implicit assumptions, and many involve further
mathematical approximations. If nothing is assumed, nothing can be calculated. For instance,
in least-squares data fitting and in principal component analysis one assumes that the noise in
the data is Gaussian; in order to use the central limit theorem it is not sufficient just to have
a large sum of independent random variables (but we have to satisfy very specific criteria on
the distributions of the random variables, e.g. those quantified by Lindeberg’s condition), etc.
Obviously, the correctness of any outcome of such methods depends on the extent to which
these assumptions and approximations are reasonable in the context of the problem at hand.
It is vital that one has a basic understanding of what these assumptions and approximations
are, so that one can convince oneself that the chosen method can be used.
• Imprecise definitions
In statistics it is vital that we are very precise in defining quantities. When we speak about
the probability of getting a specific disease within a given time, do we mean the probability for
one individual? Or the probability for a randomly drawn individual from a population? Do
we include our ignorance of the previous two? (i.e. the probability of the probability) ...
2. Definitions and basic properties in survival analysis
2.1. Notation, data and objective
Notation and data. Imagine we have data on a cohort of N patients, labelled i = 1 . . . N . They
are subject to R distinct ‘hazards’ or ‘risks’, labelled r = 1 . . . R, which trigger irreversible events
such as onset of a given disease of interest, death due to causes other than the disease of interest,
etc. We also measure p characteristics of our patients (e.g. gender, blood serum counts, BMI,
socio-economic factors, genetic variables, etc), resulting for each patient i in a list of p numbers
(Zi1, . . . , Zip), the so-called ‘covariates’. Covariates can be discrete (e.g. gender) or real-valued (e.g.
BMI). We monitor our cohort during a trial of finite duration, and record for each patient when
the first event happened to them, and which event this was; the start of the trial is taken as time
zero. To label those patients that did not record any event during the trial (they could be lost to
follow-up along the way, or may have reached the end of the trial without experiencing any of the
R events), we introduce a further ‘risk’ r = 0 (which we will refer to simply as end-of-trial). Our
data thus take the following form. For each patient i we have
$Z_i = (Z_{i1},\dots,Z_{ip})$: values of the p covariates
$X_i \geq 0$: time at which the first event occurred
$\Delta_i \in \{0,\dots,R\}$: label indicating which event occurred at time $X_i$
A typical example of such data is shown in figure 1. The covariates (or ‘explanatory factors’) can
be divided into three qualitatively distinct groups:
• uncontrolled covariates: e.g. gender, genetic make-up, etc
• controlled covariates: e.g. medical treatment,
• modifiable covariates: e.g. smoking, drinking, nutrition, etc
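For concreteness, the data $(Z_i, X_i, \Delta_i)$ could be held in memory as follows (a sketch of
one possible layout, with values in the spirit of figure 1; nothing in these notes prescribes this
representation):

import numpy as np

# N = 3 patients, p = 2 covariates (here BMI and selenium), R risks plus r = 0
Z = np.array([[22.6, 105.0],
              [34.2,  65.0],
              [20.1,  72.0]])         # shape (N, p): covariate values
X = np.array([33.69, 24.81, 23.10])   # shape (N,): first-event times
Delta = np.array([0, 1, 2])           # shape (N,): event labels, 0 = end-of-trial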
Objectives of survival analysis. Survival analysis is the statistical discipline that deals with data
of the above type, and tries to extract patterns from these data to quantify the relations (if any)
between the covariates and the risks. Usually we are interested mainly in one particular risk,
traditionally chosen as r = 1, so the other risks $r\in\{0,2,3,\dots,R\}$ are unfortunate complications.
More specifically we would like to
• Evaluate the effects of covariates
• Predict event times from knowledge of the covariates
• Compare and validate different models with which to explain the data
Censoring. ‘Censoring’ means that the value of a measurement is only partially known, as is the
case here. An end-of-trial outcome ∆i = 0 means that all we know is that patient i is either ‘lost’
along the way, or will experience his/her first actual event from the risk set $\{1,\dots,R\}$ at some
time ti ≥ C, where C is the duration of our trial. This latter option is called ‘right censoring’.
Alternative types of censoring (which we will not deal with here) are ‘left censoring’, i.e. ti ≤ C for
some C, or ‘interval censoring’, i.e. ti ∈ [C1, C2] for some C1, C2. Here we label all the patients i
that are censored as ∆i = 0, and use Xi to denote the time where they left our trial.
Complications. The main complications in survival analysis are caused by (i) the statistical ‘noise’
caused by censoring, (ii) the fact that different risks prevent each other from happening (or from
being observed), e.g. if a patient dies we will never know whether and when he/she would have
got the disease of interest, (iii) possible correlations between the different risks, (iv) heterogeneity
in cohorts (in terms of covariates, and in terms of what covariates imply in terms of risks), and
(v) the fact that most studies are ‘underpowered’, i.e. we want to extract complicated statistical
patterns from data on relatively small patient cohorts, which brings the danger of overfitting
and non-reproducibility of results.
2.2. Survival probability and cause-specific hazard rates
Joint event times and survival function. Imagine the hypothetical situation where for each individual
i all events r = 0 . . . R could in principle be observed (irrespective of their nature and their order
in time), and let tr denote the time at which event r occurs. If we assume also that all events will
ultimately always happen (some earlier and some later, and some perhaps only at times so late as
to be practically irrelevant), we write the joint distribution for individual i of the event times as
$$P_i(t_0,\dots,t_R) \qquad (1)$$
Since we assume that each event will ultimately happen, this distribution must be normalised, so
$$\int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R) = 1 \qquad (2)$$
We can next define the integrated event time distribution for individual i:
$$S_i(t_0,\dots,t_R) = \int_{t_0}^\infty\!\!\cdots\!\int_{t_R}^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R) = \int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R)\prod_{r=0}^R \theta(s_r-t_r) \qquad (3)$$
with the step function defined as θ(z>0) = 1 and θ(z<0) = 0. $S_i(t_0,\dots,t_R)$ gives the probability
that for individual i event 0 occurs later than $t_0$, and event 1 occurs later than $t_1$, and ... etc. Note
that
$$S_i(0,\dots,0) = \int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R) = 1 \qquad (4)$$
We can now define the survival function $S_i(t)$ as the probability that for individual i all events
r = 0 ... R will happen later than time t:
$$S_i(t) = S_i(t_0,\dots,t_R)\Big|_{t_r=t\ \forall r} = S_i(t,t,\dots,t) \qquad (5)$$
Cause-specific hazard rates. We next want to characterise for each individual risk how likely it is to
trigger an event as a function of time, and how it impacts on the overall survival probability Si(t)
defined above. This is done via the so-called cause-specific hazard rates, defined as
$$\pi_\mu^i(t) = -\Big[\frac{\partial}{\partial t_\mu}\log S_i(t_0,\dots,t_R)\Big]_{t_r=t\ \forall r} \qquad (6)$$
Whenever we write log(.) we will mean the natural logarithm. Inserting the definition of $S_i(t)$ above,
and using $\frac{d}{dz}\theta(z) = \delta(z)$ (see Appendix A on the δ-distribution) allows us to work this out:
$$\pi_\mu^i(t) = \Bigg[\frac{\int_0^\infty\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R)\,\delta(s_\mu-t_\mu)\prod_{r\neq\mu}\theta(s_r-t_r)}{S_i(t_0,\dots,t_R)}\Bigg]_{t_r=t\ \forall r} = \frac{\int_t^\infty\!\cdots\!\int_t^\infty\big(\prod_{r\neq\mu}ds_r\big)\ P_i(s_0,\dots,s_{\mu-1},t,s_{\mu+1},\dots,s_R)}{S_i(t)} \qquad (7)$$
[Figure 1 appeared here: a data table with columns pat, BMI, SELENIUM, PHYS_ACT_LEIS,
PHYS_ACT_WORK, Smoking, Time and censoring label.]

Figure 1. Sample survival data from the ULSAM prostate cancer study. Column one:
patient label i. Columns two to six: values of five covariates $(Z_{i1},\dots,Z_{i5})$, of which four
are modifiable (BMI, leisure time physical activity, physical activity at work, smoking) and
one is uncontrolled (selenium level in the blood). Last two columns: event time and label
$(X_i,\Delta_i)$. Entries '-10' refer to missing data.
11
Hence $\pi_\mu^i(t)\,dt$ gives the probability for individual i that event µ happens in the time interval
$[t, t+dt)$, given that no event has happened to i yet prior to time t:
$$\pi_\mu^i(t)\,dt = {\rm Prob}\Big(t_\mu^i\in[t,t+dt)\ \Big|\ \mbox{$i$ had no events yet at time }t\Big) \qquad (dt\downarrow 0) \qquad (8)$$
Since $\pi_\mu^i(t)$ gives a probability of hazardous events per unit time, it is called a 'hazard rate'. It
depends on which risk µ we are discussing, hence it is 'cause-specific'. The subtlety is in the
conditioning: it is defined conditional on the individual still being event-free at the relevant time.
Survival function in terms of cause-specific hazard rates. It turns out that the overall survival
probability Si(t) can be written in terms of the cause-specific hazard rates, in a simple way. To see
this we calculate
$$\frac{d}{dt}\log S_i(t) = \frac{d}{dt}\log S_i(t,t,\dots,t) = \sum_{r=0}^R\Big[\frac{\partial}{\partial t_r}\log S_i(t_0,\dots,t_R)\Big]_{t_r=t\ \forall r} = -\sum_{r=0}^R \pi_r^i(t) \qquad (9)$$
Hence, using $S_i(0) = 1$,
$$\log S_i(t) = \log S_i(0) - \sum_{r=0}^R\int_0^t ds\ \pi_r^i(s) = -\sum_{r=0}^R\int_0^t ds\ \pi_r^i(s) \qquad (10)$$
so
$$S_i(t) = e^{-\sum_{r=0}^R\int_0^t ds\ \pi_r^i(s)} \qquad (11)$$
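As a quick numerical illustration of (11) (my own sketch, not from the notes): for a single risk
with hazard rate π(t) = 2t the cumulative hazard is t², so (11) predicts $S(t) = e^{-t^2}$; event
times with this hazard can be sampled by inverse transform as $T = \sqrt{-\log U}$ with U uniform
on (0,1).

import numpy as np

rng = np.random.default_rng(0)
T = np.sqrt(-np.log(rng.random(200_000)))   # event times with hazard pi(t) = 2t
for t in (0.5, 1.0, 1.5):
    print((T > t).mean(), np.exp(-t**2))    # empirical S(t) vs prediction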
Data likelihood in terms of cause-specific hazard rates. Similarly we can express also the likelihood
Pi(X,∆)dX to observe in our trial patient i reporting a first event of type ∆ at a time in the interval
[X,X + dX) (with dX ↓ 0) in terms of the cause specific hazard rates. To observe the above the
following three statements must be true:
• the time of the event is in [X,X + dX),
• the type of the event is ∆, and
• no events occurred prior to X.
This can all be written in terms of properties of the joint event times $(t_0,\dots,t_R)$ of individual i:
$$\theta(t_\Delta-X)\,\theta(X+dX-t_\Delta)\prod_{r\neq\Delta}\theta(t_r-X) = 1 \qquad (12)$$
and the likelihood $P_i(X,\Delta)$ can therefore be written as ‡
$$P_i(X,\Delta) = \lim_{dX\downarrow 0}\frac{1}{dX}\,{\rm Prob}_i\Big(\theta(t_\Delta-X)\,\theta(X+dX-t_\Delta)\prod_{r\neq\Delta}\theta(t_r-X) = 1\Big)$$
$$= \lim_{dX\downarrow 0}\frac{1}{dX}\int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R)\,\theta(t_\Delta-X)\,\theta(X+dX-t_\Delta)\prod_{r\neq\Delta}\theta(t_r-X)$$
$$= \int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R)\,\lim_{\varepsilon\downarrow 0}h_\varepsilon(t_\Delta-X)\prod_{r\neq\Delta}\theta(t_r-X) \qquad (13)$$
‡ Note that we implicitly assume that the joint event time distribution $P_i(t_0,\dots,t_R)$ is continuous and
smooth, so that the probability of seeing ties in the timing of events, i.e. $t_\mu = t_\nu$ for $\mu\neq\nu$, is negligible.
with
$$h_\varepsilon(z) = \varepsilon^{-1}\theta(z)\,\theta(\varepsilon-z) = \begin{cases} \varepsilon^{-1} & \mbox{for } z\in[0,\varepsilon]\\ 0 & \mbox{elsewhere}\end{cases} \qquad (14)$$
We note that the function $\lim_{\varepsilon\downarrow 0}h_\varepsilon(z)$ has all the properties that define the δ-function (see Appendix
A): $h_\varepsilon(z)\geq 0$ for all $\varepsilon>0$, $\int dz\ h_\varepsilon(z) = 1$ for all $\varepsilon>0$, $\lim_{\varepsilon\downarrow 0}h_\varepsilon(z) = 0$ for all $z\neq 0$, and
$\lim_{\varepsilon\downarrow 0}h_\varepsilon(0) = \infty$. So $\lim_{\varepsilon\downarrow 0}h_\varepsilon(z) = \delta(z)$, and we get
$$P_i(X,\Delta) = \int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R)\,\delta(t_\Delta-X)\prod_{r\neq\Delta}\theta(t_r-X) = S_i(X)\,\pi_\Delta^i(X) = \pi_\Delta^i(X)\,e^{-\sum_{r=0}^R\int_0^X ds\ \pi_r^i(s)} \qquad (15)$$
where we used (7) in the first step, and (11) in the second. So the survival probabilities Si(t) and the
data likelihoods Pi(X,∆) can both be written strictly in terms of the cause-specific hazard rates.
The final picture is as in the diagram below. We can therefore anticipate that in all our statistical
analyses of the data the cause-specific hazard rates will play a central role.
$$\underbrace{P_1(t_0,\dots,t_R)\ \ \dots\dots\ \ P_N(t_0,\dots,t_R)}_{\mbox{individual event time statistics}}$$
$$\Downarrow$$
$$\underbrace{(\pi_0^1(t),\dots,\pi_R^1(t))\ \ \dots\dots\ \ (\pi_0^N(t),\dots,\pi_R^N(t))}_{\mbox{individual hazard rates}}$$
$$\Downarrow$$
$$\underbrace{(X_1,\Delta_1)\ \ \dots\dots\ \ (X_N,\Delta_N)}_{\mbox{observed survival data}}$$
Starting our description from the distributions Pi(t0, . . . , tR) was useful in terms of understanding
how cause-specific hazard rates πir(t) emerge, but working directly at the level of these rates has
advantages. It avoids us having to think in terms of the event times (t0, . . . , tR) and their distribution
(which refers to a hypothetical situation where all event times could be observed - including e.g. the
onset of a disease after death). Secondly, we will see that upon using the hazard rates as a starting
point we can also deal in a transparent way with events that have a nonzero probability of never
happening; these we have so far ruled out as a result of starting with a normalised Pi(t0, . . . , tR).
Cause-specific hazard rates in terms of data probabilities. We have seen that the data probabilities
Pi(X,∆) can be written fully in terms of the cause specific hazard rates πir(t). It turns out that the
converse is also true, i.e. the cause-specific hazard rates can be written fully and explicitly in terms
of the data probabilities Pi(X,∆). To see this, let us first sum over ∆ in (15)
$$\sum_{\Delta=0}^R P_i(X,\Delta) = \Big(\sum_{\Delta=0}^R \pi_\Delta^i(X)\Big)\,e^{-\sum_{r=0}^R\int_0^X ds\ \pi_r^i(s)} = -\frac{d}{dX}\,e^{-\sum_{r=0}^R\int_0^X ds\ \pi_r^i(s)} \qquad (16)$$
Hence
$$e^{-\sum_{r=0}^R\int_0^X ds\ \pi_r^i(s)} = 1 - \int_0^X dt\sum_{r=0}^R P_i(t,r) = \sum_{r=0}^R\int_X^\infty dt\ P_i(t,r) \qquad (17)$$
Substituting this into the right-hand side of (15), followed by re-arranging in order to make the
hazard rate the subject of the equation, then immediately gives us
$$\pi_\Delta^i(X) = \frac{P_i(X,\Delta)}{\sum_{r=0}^R\int_X^\infty dt\ P_i(t,r)} \qquad (18)$$
So, if we wanted, we could build our theory entirely in the language of data probabilities $P_i(X,\Delta)$,
as opposed to the language of the cause-specific hazard rates. Summation over all ∆ on both sides of
(18) gives another transparent and useful identity, relating the cumulative hazard rate $\sum_{r=0}^R\pi_r^i(X)$
(i.e. the rate of events, irrespective of type, conditional on there not having been any events prior
to time X) to the distribution $P_i(X) = \sum_{r=0}^R P_i(X,r)$ of reported event times (of any type):
$$\sum_{r=0}^R\pi_r^i(X) = \frac{\sum_{r=0}^R P_i(X,r)}{\sum_{r=0}^R\int_X^\infty dt\ P_i(t,r)} = \frac{P_i(X)}{\int_X^\infty dt\ P_i(t)} \qquad (19)$$
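Identity (19) is easily checked numerically (my own sketch; the constant hazard 0.7 and the
binning are arbitrary): estimate the event time density by a histogram and the tail probability
by counting, and the ratio recovers the hazard rate in every bin.

import numpy as np

rng = np.random.default_rng(1)
pi_true = 0.7
T = rng.exponential(1/pi_true, 500_000)     # one risk, constant hazard pi_true

edges = np.linspace(0.0, 3.0, 31)
mids = (edges[:-1] + edges[1:]) / 2
dens = np.histogram(T, edges)[0] / (len(T) * np.diff(edges))   # P_i(X) per bin
tail = np.array([(T >= m).mean() for m in mids])               # tail probability
print(dens / tail)                          # each entry is ~0.7, cf. (19)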
Possible pitfalls and misconceptions. The most tricky aspect of survival analysis is its formulation
in terms of cause-specific hazard rates, which involve nontrivial conditioning at any time t on there
not having been any event prior to time t. This causes interpretation issues. For instance
• Expression (11) can be written in a form that factorises over the different risks, as
$S_i(t) = \prod_r \exp[-\int_0^t ds\ \pi_r^i(s)]$. Does this imply that the risks are uncorrelated? No. All
risks $r\neq\mu$ will generally contribute to each $\pi_\mu^i(t)$, since the alternative risks modify
the conditioning, i.e. the likelihood that nothing has happened yet prior to t. The risks
may well interact strongly with each other, but we can no longer see this after we have
calculated the rates $\pi_\mu^i(t)$ and forgotten about the times $(t_0,\dots,t_R)$.
• Starting from the survival function (11), do we get the survival function for the
hypothetical situation where risk µ is disabled by setting $\pi_\mu^i(t)$ to zero, i.e.
$S_i(t) \to \exp[-\sum_{r\neq\mu}\int_0^t ds\ \pi_r^i(s)]$? No. We would indeed have $\pi_\mu^i(t) = 0$ for all t, but that is not
all. If we disable a risk µ, all other risks will in principle be more likely to happen first,
and hence the removal of risk µ changes in principle also all hazard rates $\pi_r^i(t)$ with $r\neq\mu$.
2.3. Examples
Example 1: time-independent hazard rates
Here we have $\pi_r^i(t) = \pi_r^i$, independent of t, for all (r,i). Thus $\int_0^X ds\ \pi_\mu^i(s) = \pi_\mu^i X$. This gives
the following simple formulae for the survival function and the data likelihood:
$$S_i(t) = e^{-t\sum_{r=0}^R \pi_r^i}, \qquad P_i(X,\Delta) = \pi_\Delta^i\,e^{-X\sum_{r=0}^R \pi_r^i} \qquad (20)$$
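Formula (20) is easy to probe by simulation (my own sketch; the rate values are arbitrary):
give each risk an independent exponential time and record the smallest one together with its
label. The empirical survival curve and the event-type fractions then reproduce (20).

import numpy as np

rng = np.random.default_rng(2)
rates = np.array([0.5, 1.0, 1.5])                  # pi_r for r = 0, 1, 2
t_all = rng.exponential(1/rates, (200_000, 3))     # one candidate time per risk
X = t_all.min(axis=1)                              # first-event time
Delta = t_all.argmin(axis=1)                       # label of the first event

print((X > 0.4).mean(), np.exp(-0.4*rates.sum()))          # S(t) at t = 0.4
print(np.bincount(Delta)/len(Delta), rates/rates.sum())    # Prob(Delta)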
Example 2: a single risk r = 1
Suppose we have only one hazard, r = 1, and for each patient i a single hazard rate $\pi^i(t)$:
$$S_i(t) = e^{-\int_0^t ds\ \pi^i(s)}, \qquad P_i(X,\Delta) = \pi^i(X)\,e^{-\int_0^X ds\ \pi^i(s)}\,\delta_{\Delta,1} \qquad (21)$$
In this case we can write the event time distribution in terms of the hazard rate via (3):
$$P_i(t) = -\frac{d}{dt}S_i(t) = \pi^i(t)\,e^{-\int_0^t ds\ \pi^i(s)} \qquad (22)$$
which is of course no surprise in view of (15). Now it makes perfect sense to think in terms
of $P_i(t)$: there is only one risk, so there is nothing hypothetical about the event time t (as no other
events can prevent it from being observed).
If we have one risk only, and moreover a time-independent hazard rate (i.e. a combination of
the two examples discussed above), we obtain
$$S_i(t) = e^{-t\pi^i}, \qquad P_i(X,\Delta) = P_i(X)\,\delta_{\Delta,1}, \qquad P_i(t) = \pi^i e^{-t\pi^i} \qquad (23)$$
Example 3: the most probable event time distribution for R = 1, given the value of the average
Finally, let us show how the exponential distribution of event times in (21) can be seen as
the simplest natural choice for the case R = 1, in an information-theoretic sense. Suppose
the only knowledge we have of $P_i(t)$ is the value of the average event time $\langle t\rangle_i$. The
most probable distribution $P_i(t)$ with this average is found by maximising the Shannon
entropy $H_i = -\int_0^\infty dt\ P_i(t)\log P_i(t)$, subject to the two constraints $\int_0^\infty dt\ P_i(t) = 1$ and
$\int_0^\infty dt\ P_i(t)\,t = \langle t\rangle_i$ §. The maximum is found via the Lagrange method:
$$\frac{\delta}{\delta P_i(x)}\int_0^\infty ds\ P_i(s)\log P_i(s) = \frac{\delta}{\delta P_i(x)}\Big[\lambda_0\int_0^\infty ds\ P_i(s) + \lambda_1\int_0^\infty ds\ P_i(s)\,s\Big]$$
$$1 + \log P_i(x) = \lambda_0 + \lambda_1 x \qquad\mbox{so}\qquad P_i(t) = e^{\lambda_0-1+\lambda_1 t}$$
We note that $\lambda_1 < 0$ is required for $P_i(t)$ to be normalisable. Normalisation gives
$$1 = e^{\lambda_0-1}\int_0^\infty dt\ e^{\lambda_1 t} = -\lambda_1^{-1}e^{\lambda_0-1}$$
So $e^{\lambda_0-1} = -\lambda_1$, giving $P_i(t) = |\lambda_1|e^{-|\lambda_1|t}$. Finally we demand that the average time is $\langle t\rangle_i$:
$$\langle t\rangle_i = \int_0^\infty dt\ t\,|\lambda_1|e^{-|\lambda_1|t} = \frac{1}{|\lambda_1|}\int_0^\infty ds\ s\,e^{-s} = \frac{1}{|\lambda_1|}\Big(\big[-s\,e^{-s}\big]_0^\infty + \int_0^\infty ds\ e^{-s}\Big) = \frac{1}{|\lambda_1|}\Big(0 - \big[e^{-s}\big]_0^\infty\Big) = \frac{1}{|\lambda_1|}$$
Hence
$$P_i(t) = \pi^i\,e^{-t\pi^i} \qquad\mbox{with}\qquad \pi^i = 1/\langle t\rangle_i \qquad (24)$$
§ Strictly speaking we must demand also that $P_i(t)\geq 0$ for all $t\geq 0$, but it turns out that this latter demand
will be satisfied automatically.
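A quick numerical cross-check of this maximum-entropy property (my own sketch, using scipy's
entropy routines): among a few distributions on [0,∞) with the same mean ⟨t⟩ = 1, the
exponential indeed has the largest differential entropy.

from scipy import stats

candidates = [stats.expon(scale=1.0),             # exponential, mean 1
              stats.gamma(a=2.0, scale=0.5),      # gamma, mean 1
              stats.uniform(loc=0.0, scale=2.0)]  # uniform on [0,2], mean 1
for dist in candidates:
    print(dist.dist.name, float(dist.entropy()))
# expon ~1.000 > gamma ~0.884 > uniform ~0.693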
3. Event time correlations and the identifiability problem
We have seen that knowing the joint event time statistics Pi(t0, . . . , tR) of an individual i allows us
to calculate the cause-specific hazard rates πi0(t), . . . , πiR(t). We now ask the following: if we know
the cause-specific hazard rates, which we expect can be estimated from the data via (15), can we
deduce from this the distribution Pi(t0, . . . , tR)? In particular, can we deduce from the hazard rates
whether or not the event times of different risks are statistically independent? This will become
important when we turn to competing risks later.
3.1. Independently distributed event times
If the event times are all uncorrelated, i.e. if knowing one such time conveys no information on the
others, the joint distribution factorises by definition into the simple form
$$P_i(t_0,\dots,t_R) = \prod_{r=0}^R P_r^i(t_r) \qquad (25)$$
Via (3) we then get
$$S_i(t_0,\dots,t_R) = \prod_{r=0}^R\int_0^\infty ds_r\ P_r^i(s_r)\,\theta(s_r-t_r) = \prod_{r=0}^R S_r^i(t_r) \qquad (26)$$
$$S_r^i(t) = \int_t^\infty ds\ P_r^i(s) \qquad (27)$$
So the probability to observe that each event r happens at a time later than $t_r$ is just the product of
the individual survival probabilities $S_r^i(t_r)$ for the risks. The cause-specific hazard rates (6) become
$$\pi_\mu^i(t) = -\Big[\frac{\partial}{\partial t_\mu}\sum_{r=0}^R\log S_r^i(t_r)\Big]_{t_r=t\ \forall r} = -\Big[\frac{\partial}{\partial t_\mu}\log S_\mu^i(t_\mu)\Big]_{t_\mu=t} = -\frac{d}{dt}\log S_\mu^i(t) \qquad (28)$$
Hence, if we integrate both sides, and use $S_r^i(0) = 1$ (no events have occurred yet at time t = 0):
$$\log S_\mu^i(t) = \log S_\mu^i(0) - \int_0^t ds\ \pi_\mu^i(s) = -\int_0^t ds\ \pi_\mu^i(s) \qquad (29)$$
giving, as expected,
$$S_\mu^i(t) = e^{-\int_0^t ds\ \pi_\mu^i(s)} \qquad (30)$$
If we now differentiate (27) and use our formula for $S_r^i(t)$, we find that we can express the event
time probabilities for each risk in terms of the associated hazard rates. This results in the following
generalisation to multiple independent risks of formula (22):
$$P_r^i(t) = -\frac{d}{dt}S_r^i(t) = -\frac{d}{dt}e^{-\int_0^t ds\ \pi_r^i(s)} = \pi_r^i(t)\,e^{-\int_0^t ds\ \pi_r^i(s)} \qquad (31)$$
3.2. The (Tsiatis) identifiability problem
We have seen above that for the special case of statistically independent event times one can indeed
calculate the event time probabilities uniquely from the cause-specific hazard rates. However, we
can also deduce something else from the above derivation:
For any set of cause-specific hazard rates $\pi_0^i(t),\dots,\pi_R^i(t)$, including those that
correspond to statistically dependent event times, there always exists a distribution for
independent event times that will give exactly the same cause-specific hazard rates, namely
$$P_i(t_0,\dots,t_R) = \prod_{r=0}^R\Big[\pi_r^i(t_r)\,e^{-\int_0^{t_r}ds\ \pi_r^i(s)}\Big] \qquad (32)$$
It follows that knowledge of the cause-specific hazard rates (which is all we may ever hope to extract
from survival data alone) does not generally permit us to identify the underlying joint distribution
of event times; in particular, we cannot find out from survival data alone whether or not the event
times of the different risks are statistically independent. This is Tsiatis' identifiability problem.
Tsiatis’ result appears to have created some pessimism in the past as to what can be achieved
with statistical analyses, especially in the context of so-called ‘competing risks’. We will turn to
these in more detail later; for now let us just say that ‘competing risks’ describes the situation where
at the level of populations or trial cohorts the event times of different risks appear to be correlated.
The identifiability problem suggested to some that in the case of competing risks there is not much
that survival analysis can do. Let us counteract this with a few observations:
• If the cohort under study is homogeneous, then $P_i(t_0,\dots,t_R)$ will be identical for all
i and thus also describe the statistics of the cohort; here we do indeed have a problem.
However, if competing risks are due to population-level correlations of hazard rates in
an inhomogeneous population, then the identifiability problem doesn't arise. One could
imagine all individuals having independent event times, i.e. $P_i(t_0,\dots,t_R) = \prod_r P_r^i(t_r)$
for each i, but correlated hazard rates: those individuals with a higher hazard rate for
diabetes might for instance also have a higher hazard rate for pancreatic cancer. Here we
would at population level have $P(t_0,\dots,t_R) \neq \prod_r P_r(t_r)$ and $S(t) \neq \prod_r S_r(t)$. But since
the correlations are now generated at the level of hazard rates (which are in principle
accessible via data), this would represent a competing-risk problem that can be solved.
• Even if we do have correlated event times at the level of individuals, then still all is
not lost. If we only extract hazard rates from our data then we indeed cannot extract
from these the distribution Pi(t0, . . . , tR). But in Bayesian regression we do not require
uniqueness of explanations anyway. We would calculate the likelihood of each possible
explanation Pi(t0, . . . , tR) for the observed hazard rates, and find the most plausible one.
• Finally, it might well be that the 'statistically independent' event time explanation
above, which we can always construct for any set of observed hazard rates, has unwanted or
unlikely mathematical or interpretational features. For instance, to have mathematically
acceptable distributions $P_r^i(t)$ they need to be normalised, i.e. we must demand
$$1 = \int_0^\infty dt\ P_r^i(t) = \int_0^\infty dt\ \pi_r^i(t)\,e^{-\int_0^t ds\ \pi_r^i(s)} = \int_0^\infty dt\Big[-\frac{d}{dt}e^{-\int_0^t ds\ \pi_r^i(s)}\Big] = 1 - e^{-\int_0^\infty ds\ \pi_r^i(s)} \qquad (33)$$
Hence the independent-times explanation for the hazard rates requires that
$$\lim_{t\to\infty}\int_0^t ds\ \pi_r^i(s) = \infty \qquad (34)$$
This is just another way of saying that the probability of event r never occurring must be
zero. We will see in an example below that this is not always satisfied.
3.3. Examples
Example 1: true versus independent-times explanation for observed hazard rates
Let us inspect the following event time distribution for the times $t_1,t_2\geq 0$, with parameters
$a,b,\tau > 0$ and $\varepsilon\in[0,1]$, for a single individual:
$$P(t_1,t_2) = a\,e^{-at_2}\Big[\varepsilon\,\delta(t_1-t_2-\tau) + (1-\varepsilon)\,b\,e^{-bt_1}\Big] \qquad (35)$$
It has the form $P(t_1,t_2) = P(t_1|t_2)P(t_2)$, with
$$P(t_2) = a\,e^{-at_2}, \qquad P(t_1|t_2) = \varepsilon\,\delta(t_1-t_2-\tau) + (1-\varepsilon)\,b\,e^{-bt_1} \qquad (36)$$
$P(t_1,t_2)$ is clearly nonnegative and normalised, so it is a bona fide joint distribution. With
probability $1-\varepsilon$ the two times are statistically independent, and with probability ε event 1
happens precisely a duration τ later than event 2. The integrated distribution $S(t_1,t_2)$ is
$$S(t_1,t_2) = \int_{t_1}^\infty ds_1\int_{t_2}^\infty ds_2\ a\,e^{-as_2}\Big[\varepsilon\,\delta(s_1-s_2-\tau) + (1-\varepsilon)\,b\,e^{-bs_1}\Big]$$
$$= \varepsilon\int_{t_2}^\infty ds_2\ a\,e^{-as_2}\int_{t_1}^\infty ds_1\ \delta(s_1-s_2-\tau) + (1-\varepsilon)\int_{t_2}^\infty ds_2\ a\,e^{-as_2}\int_{t_1}^\infty ds_1\ b\,e^{-bs_1}$$
$$= \varepsilon\int_{t_2}^\infty ds_2\ a\,e^{-as_2}\,\theta(s_2+\tau-t_1) + (1-\varepsilon)\,e^{-bt_1}\int_{t_2}^\infty ds_2\ a\,e^{-as_2}$$
$$= \varepsilon\int_{\max(t_2,t_1-\tau)}^\infty ds_2\ a\,e^{-as_2} + (1-\varepsilon)\,e^{-bt_1-at_2}$$
$$= \varepsilon\,e^{-a\max(t_2,t_1-\tau)} + (1-\varepsilon)\,e^{-bt_1-at_2}$$
$$= \begin{cases} \varepsilon\,e^{-at_2} + (1-\varepsilon)\,e^{-bt_1-at_2} & \mbox{if } t_2 > t_1-\tau\\ \varepsilon\,e^{-a(t_1-\tau)} + (1-\varepsilon)\,e^{-bt_1-at_2} & \mbox{if } t_2 < t_1-\tau\end{cases} \qquad (37)$$
This gives for the survival function $S(t) = S(t,t)$:
$$S(t) = e^{-at}\big(\varepsilon + (1-\varepsilon)\,e^{-bt}\big) \qquad (38)$$
Next we calculate the cause-specific hazard rates for this example, via (6). Since after the
partial differentiations we must set $t_1,t_2\to t$, we need only use the formula that applies for
$t_2 > t_1-\tau$:
$$\pi_1(t) = -\frac{\partial}{\partial t_1}\log S(t_1,t_2)\Big|_{t_1=t_2=t} = -\frac{\partial}{\partial t_1}\log\Big[\varepsilon\,e^{-at_2} + (1-\varepsilon)\,e^{-bt_1-at_2}\Big]\Big|_{t_1=t_2=t}$$
$$= \frac{-\frac{\partial}{\partial t_1}\big[(1-\varepsilon)\,e^{-bt_1-at_2}\big]\big|_{t_1=t_2=t}}{\varepsilon\,e^{-at} + (1-\varepsilon)\,e^{-(a+b)t}} = \frac{b(1-\varepsilon)\,e^{-(a+b)t}}{\varepsilon\,e^{-at} + (1-\varepsilon)\,e^{-(a+b)t}} = \frac{b(1-\varepsilon)\,e^{-bt}}{\varepsilon + (1-\varepsilon)\,e^{-bt}} = b\Big(1 + \frac{\varepsilon}{1-\varepsilon}\,e^{bt}\Big)^{-1} \qquad (39)$$
$$\pi_2(t) = -\frac{\partial}{\partial t_2}\log S(t_1,t_2)\Big|_{t_1=t_2=t} = \frac{-\frac{\partial}{\partial t_2}\big[e^{-at_2}\big(\varepsilon + (1-\varepsilon)\,e^{-bt_1}\big)\big]\big|_{t_1=t_2=t}}{\varepsilon\,e^{-at} + (1-\varepsilon)\,e^{-(a+b)t}} = a \qquad (40)$$
The hazard rate for cause 1 decays monotonically, from the initial value $\pi_1(0) = b(1-\varepsilon)$
down to zero as $t\to\infty$. The hazard rate for cause 2 is independent of time.
We can now calculate the alternative 'independent times' explanation $P_{\rm indep}(t_1,t_2) = P_1(t_1)P_2(t_2)$
for the above cause-specific hazard rates, as given in (32). For this we first require
the time integrals over the hazard rates:
$$\int_0^t ds\ \pi_1(s) = \int_0^t ds\ \frac{b(1-\varepsilon)\,e^{-bs}}{\varepsilon + (1-\varepsilon)\,e^{-bs}} = -\Big[\log\big(\varepsilon + (1-\varepsilon)\,e^{-bs}\big)\Big]_0^t = -\log\big(\varepsilon + (1-\varepsilon)\,e^{-bt}\big) \qquad (41)$$
$$\int_0^t ds\ \pi_2(s) = at \qquad (42)$$
with which we obtain
$$P_1(t_1) = \pi_1(t_1)\,e^{-\int_0^{t_1}ds\ \pi_1(s)} = \pi_1(t_1)\big(\varepsilon + (1-\varepsilon)\,e^{-bt_1}\big) = b(1-\varepsilon)\,e^{-bt_1} \qquad (43)$$
$$P_2(t_2) = a\,e^{-at_2} \qquad (44)$$
However, $P_1(t_1)$ is not normalised to 1 as soon as $\varepsilon > 0$, which follows from the fact that here
condition (34) is violated:
$$\lim_{t\to\infty}\int_0^t ds\ \pi_1(s) = -\lim_{t\to\infty}\log\big(\varepsilon + (1-\varepsilon)\,e^{-bt}\big) = \log(1/\varepsilon) \qquad (45)$$
We also see this by explicit integration:
$$\int_0^\infty dt\ P_1(t) = \int_0^\infty dt\ \pi_1(t)\,e^{-\int_0^t ds\ \pi_1(s)} = \int_0^\infty dt\ \frac{d}{dt}\Big[-e^{-\int_0^t ds\ \pi_1(s)}\Big] = 1 - e^{-\int_0^\infty ds\ \pi_1(s)} = 1 - e^{-\log(1/\varepsilon)} = 1 - \varepsilon \qquad (46)$$
Hence, in the independent-times explanation for the cause-specific hazard rates we have a
probability ε that event 1 will never happen. If e.g. event 1 represents death and event 2 the
onset of some disease, then in the original correlated time distribution death will inevitably
occur, but in the independent-times explanation there is a probability ε of our individual being
immortal.
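The identifiability problem can be made tangible in simulation (my own sketch; parameter
values arbitrary): sampling survival data (X,∆) from the correlated model (35), and from the
independent-times explanation derived above, in which event 1 simply never happens with
probability ε, yields statistically indistinguishable observations.

import numpy as np

rng = np.random.default_rng(3)
a, b, eps, tau, n = 1.0, 2.0, 0.3, 0.5, 200_000

def observe(t1, t2):
    # what a trial records: the first event time and its label
    return np.minimum(t1, t2), np.where(t1 < t2, 1, 2)

# correlated model (35): with prob eps, t1 = t2 + tau, else t1 ~ Exp(b)
t2 = rng.exponential(1/a, n)
dep = rng.random(n) < eps
t1 = np.where(dep, t2 + tau, rng.exponential(1/b, n))
X_c, D_c = observe(t1, t2)

# independent-times explanation: with prob eps event 1 never happens at all
t2 = rng.exponential(1/a, n)
t1 = np.where(rng.random(n) < eps, np.inf, rng.exponential(1/b, n))
X_i, D_i = observe(t1, t2)

print((D_c == 1).mean(), (D_i == 1).mean())   # fraction of type-1 first events
print(X_c.mean(), X_i.mean())                 # mean first-event time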
Example 2: true versus independent-times explanation for observed hazard rates
Let us not think that the above always happens. Inspect the following event time distribution
for the times $t_1,t_2\geq 0$, with a parameter $a > 0$ and a normalisation constant $Z(a)$, again
referring to an individual:
$$P(t_1,t_2) = \frac{1}{Z(a)}\,e^{-a(t_1+t_2)-a^2t_1t_2} \qquad (47)$$
We will need the following function:
$$F(x) = \int_x^\infty ds\ \frac{e^{-s}}{s} \qquad (48)$$
It decreases monotonically, i.e. $F'(x) < 0$, from $F(0) = \infty$ down to $F(\infty) = 0$. Note that
$F'(x) = -e^{-x}/x$. Let us calculate for this example the joint survival probability $S(t_1,t_2)$:
$$S(t_1,t_2) = \int_{t_1}^\infty ds_1\int_{t_2}^\infty ds_2\ P(s_1,s_2) = \frac{1}{Z(a)}\int_{t_1}^\infty ds_1\int_{t_2}^\infty ds_2\ e^{-a(s_1+s_2)-a^2s_1s_2}$$
$$= \frac{1}{a^2Z(a)}\int_{at_1}^\infty ds_1\int_{at_2}^\infty ds_2\ e^{-s_1-s_2-s_1s_2}$$
$$= -\frac{1}{a^2Z(a)}\int_{at_1}^\infty ds_1\ \frac{e^{-s_1}}{1+s_1}\Big[e^{-s_2(1+s_1)}\Big]_{at_2}^\infty$$
$$= \frac{1}{a^2Z(a)}\int_{at_1}^\infty ds_1\ \frac{e^{-s_1-at_2(1+s_1)}}{1+s_1} = \frac{1}{a^2Z(a)}\int_{1+at_1}^\infty du\ \frac{e^{-(u-1)-at_2u}}{u}$$
$$= \frac{e}{a^2Z(a)}\int_{1+at_1}^\infty du\ \frac{e^{-u(1+at_2)}}{u} = \frac{e}{a^2Z(a)}\int_{(1+at_1)(1+at_2)}^\infty dx\ \frac{e^{-x}}{x}$$
$$= \frac{e}{a^2Z(a)}\,F\big((1+at_1)(1+at_2)\big) \qquad (49)$$
The normalisation factor $Z(a)$ follows from using $S(0,0) = 1$ (no events yet at time zero):
$$1 = \frac{e}{a^2Z(a)}\,F(1) \qquad\mbox{so}\qquad Z(a) = e\,F(1)/a^2 \qquad (50)$$
Hence
$$S(t_1,t_2) = F\big((1+at_1)(1+at_2)\big)\big/F(1) \qquad (51)$$
Next we calculate the cause-specific hazard rates, using $F'(x) = -e^{-x}/x$:
$$\pi_1(t) = -\Big[\frac{\partial}{\partial t_1}\log S(t_1,t_2)\Big]_{t_1=t_2=t} = -\Big[\frac{\partial}{\partial t_1}\log F\big((1+at_1)(1+at_2)\big)\Big]_{t_1=t_2=t}$$
$$= -\frac{a(1+at)\,F'\big((1+at)^2\big)}{F\big((1+at)^2\big)} = \frac{a}{1+at}\,\frac{e^{-(1+at)^2}}{F\big((1+at)^2\big)} \qquad (52)$$
and since $S(t_1,t_2)$ is a symmetric function of $(t_1,t_2)$ we get the same for risk 2, i.e. $\pi_2(t) = \pi_1(t)$.
We can rewrite both hazard rates as
$$\pi_r(t) = -\frac{1}{2}\,\frac{d}{dt}\log F\big((1+at)^2\big) \qquad (53)$$
and hence, using $\log F(\infty) = \log 0 = -\infty$,
$$\int_0^\infty dt\ \pi_r(t) = -\frac{1}{2}\int_0^\infty dt\Big[\frac{d}{dt}\log F\big((1+at)^2\big)\Big] = -\frac{1}{2}\Big[\log F\big((1+at)^2\big)\Big]_0^\infty = -\frac{1}{2}\Big(\log F(\infty) - \log F(1)\Big) = \infty \qquad (54)$$
We conclude that condition (34) is satisfied, and we can indeed have an independent-times
explanation for our cause-specific hazard rates with fully normalised event time distributions
(i.e. all events happen at finite times).
4. Incorporating cure as a possible outcome
4.1. The clean way to include cure
So far we have worked with properly normalised joint event time distributions $P_i(t_0,\dots,t_R)$, in which
all events will ultimately occur. We also saw that it is quite possible to define cause-specific hazard
rates for which the corresponding event has a finite probability of not happening at all. The question
here is how this can be incorporated in our formalism in a clean way. The natural solution is to
assign to each risk two random variables, i.e. replace $t_r \to (\tau_r,t_r)$, in which $t_r$ is an event time and
$\tau_r\in\{0,1\}$ tells us whether ($\tau_r=1$) or not ($\tau_r=0$) risk r will actually trigger an event at time $t_r$.
The more general starting point for any individual i would then have to be the distribution
$$P_i(t_0,\dots,t_R;\tau_0,\dots,\tau_R) \qquad (55)$$
which is now normalised according to
$$\sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R;\tau_0,\dots,\tau_R) = 1 \qquad (56)$$
The probability that for individual i all events will actually happen, for instance, would be
$$P_i(\tau_0\!=\!1,\dots,\tau_R\!=\!1) = \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R;\tau_0,\dots,\tau_R)\prod_r\delta_{\tau_r,1}$$
$$= \int_0^\infty\!\!\cdots\!\int_0^\infty dt_0\cdots dt_R\ P_i(t_0,\dots,t_R;1,1,\dots,1) \qquad (57)$$
We next define the integrated event time distribution for individual i, i.e. the probability that event
0 has not happened yet at time $t_0$, event 1 hasn't happened yet at time $t_1$, etc. The conditions for
this are now somewhat more involved: for each r we demand that either $\tau_r = 0$ or the event time
$s_r$ is later than $t_r$, i.e.
$$\prod_{r=0}^R\Big[\tau_r\,\theta(s_r-t_r) + (1-\tau_r)\Big] = 1 \qquad (58)$$
Hence
$$S_i(t_0,\dots,t_R) = \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R)\prod_{r=0}^R\Big[\tau_r\,\theta(s_r-t_r) + (1-\tau_r)\Big] \qquad (59)$$
As before, nothing is assumed to have happened yet at time zero, so
$$S_i(0,\dots,0) = \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R)\prod_{r=0}^R\Big[\tau_r\,\theta(s_r) + (1-\tau_r)\Big]$$
$$= \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R) = 1 \qquad (60)$$
The survival function $S_i(t)$, i.e. the probability that for individual i all events r = 0 ... R will
happen later than time t, now becomes
$$S_i(t) = S_i(t,t,\dots,t) = \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R)\prod_{r=0}^R\Big[\tau_r\,\theta(s_r-t) + (1-\tau_r)\Big] \qquad (61)$$
In contrast to our earlier formulation, we now need no longer find $S_i(\infty) = 0$. Here we get
$$S_i(\infty) = \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\int_0^\infty\!\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R)\prod_{r=0}^R(1-\tau_r)$$
$$= \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1 P_i(\tau_0,\dots,\tau_R)\prod_{r=0}^R(1-\tau_r) \qquad (62)$$
This is the probability that all variables $\tau_r$ are zero, i.e. that none of the events occur. In practice,
we normally include the end-of-trial risk r = 0 for the specific purpose of assigning an event to
each individual, so we would choose to have $P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R) = 0$ if $\tau_0\neq 1$; this ensures that
$S_i(\infty) = 0$ (even if none of the medical events happen, at least end-of-trial will always kick in).
Cause-specific hazard rates. At this stage matters seem to get a bit more tricky, but in fact
everything proceeds as before, just with slightly more complicated formulae. To keep our notation
compact we will henceforth use the following short-hands: $\sum_{\tau_0\dots\tau_R}\dots = \sum_{\tau_0=0}^1\cdots\sum_{\tau_R=0}^1\dots$ and
$\int ds_0\cdots ds_R\ \dots = \int_0^\infty\!\cdots\!\int_0^\infty ds_0\cdots ds_R\ \dots$. Our expressions also compactify if we use the identity
$$\tau_r\,\theta(s_r-t_r) + (1-\tau_r) = 1 - \tau_r\,\theta(t_r-s_r) \qquad (63)$$
We now define the usual cause-specific hazard rates
$$\pi_\mu^i(t) = -\Big[\frac{\partial}{\partial t_\mu}\log S_i(t_0,\dots,t_R)\Big]_{t_r=t\ \forall r} \qquad (64)$$
Working this out for our present function $S_i(t_0,\dots,t_R)$ gives
$$\pi_\mu^i(t) = \Bigg[\frac{\sum_{\tau_0\dots\tau_R}\tau_\mu\int ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R)\,\delta(s_\mu-t_\mu)\prod_{r\neq\mu}\big[1-\tau_r\theta(t_r-s_r)\big]}{S_i(t_0,\dots,t_R)}\Bigg]_{t_r=t\ \forall r}$$
$$= \frac{\sum_{\tau_0\dots\tau_R}\tau_\mu\int ds_0\cdots ds_R\ P_i(s_0,\dots,s_R;\tau_0,\dots,\tau_R)\,\delta(s_\mu-t)\prod_{r\neq\mu}\big[1-\tau_r\theta(t-s_r)\big]}{S_i(t)} \qquad (65)$$
Hence $\pi_\mu^i(t)\,dt$ still gives the probability for individual i that event µ happens in the time interval
$[t, t+dt)$, given that no event has happened to i yet prior to time t. What has changed is that
finding a nonzero value now requires having $\tau_\mu = 1$ (hence the new factor in the numerator), and
that the conditioning on the events other than µ has become somewhat more involved.
Survival function and data likelihood. Let us find out which of the properties involving the survival
function survive our generalisation to include cure as an outcome. Since definition
(64) is still valid, and since still $S_i(0) = 1$ (no events yet at time zero), our earlier simple expression
for the survival function in terms of hazard rates still holds:
$$\frac{d}{dt}\log S_i(t) = \frac{d}{dt}\log S_i(t,t,\dots,t) = \sum_{r=0}^R\Big[\frac{\partial}{\partial t_r}\log S_i(t_0,\dots,t_R)\Big]_{t_r=t\ \forall r} = -\sum_{r=0}^R \pi_r^i(t) \qquad (66)$$
and thus we continue to have
$$S_i(t) = e^{-\sum_{r=0}^R\int_0^t ds\ \pi_r^i(s)} \qquad (67)$$
To see event ∆ first, in time interval [X,X + dX), the following conditions need to be met:
• τ∆ = 1
• the time of event ∆ is in [X,X + dX)
• no events occurred prior to X
The combination can be written compactly in terms of $(t_0,\dots,t_R)$ and $(\tau_0,\dots,\tau_R)$ as
$$\tau_\Delta\,\theta(t_\Delta-X)\,\theta(X+dX-t_\Delta)\prod_{r\neq\Delta}\big[1-\tau_r\,\theta(X-t_r)\big] = 1 \qquad (68)$$
So the probability $P_i(X,\Delta)$ per unit time of this happening, for infinitesimally small time intervals
dX, becomes
$$P_i(X,\Delta) = \lim_{dX\downarrow 0}\frac{1}{dX}\sum_{\tau_0\dots\tau_R}\int dt_0\cdots dt_R\ P_i(t_0,\dots,t_R;\tau_0,\dots,\tau_R)\ \tau_\Delta\,\theta(t_\Delta-X)\,\theta(X+dX-t_\Delta)\prod_{r\neq\Delta}\big[1-\tau_r\,\theta(X-t_r)\big]$$
$$= \sum_{\tau_0\dots\tau_R}\tau_\Delta\int dt_0\cdots dt_R\ P_i(t_0,\dots,t_R;\tau_0,\dots,\tau_R)\,\delta(t_\Delta-X)\prod_{r\neq\Delta}\big[1-\tau_r\,\theta(X-t_r)\big]$$
$$= \pi_\Delta^i(X)\,S_i(X) = \pi_\Delta^i(X)\,e^{-\sum_{r=0}^R\int_0^X ds\ \pi_r^i(s)} \qquad (69)$$
So also this relation continues to hold. This is nice, since it implies that at the level of survival
functions and hazard rates we don’t need to change anything – we now know that if there are risks
with nonzero probability of the associated event never happening, then we can describe this also at
the level of event times if we want to.
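A minimal simulation sketch of the $(t_r,\tau_r)$ formulation, for a single risk with a cure
probability (my own illustration; parameter names arbitrary): individuals with τ = 0 never
trigger the event, so the empirical survival function levels off at the cure probability, giving
$S_i(\infty) > 0$ exactly as in (62).

import numpy as np

rng = np.random.default_rng(4)
n, cure, b = 100_000, 0.3, 2.0
tau = (rng.random(n) >= cure).astype(int)    # tau = 0: event never happens
t = rng.exponential(1/b, n)                  # event time, used only when tau = 1

def S_emp(u):
    # survival at u: either the event never happens, or it happens later than u
    return np.mean((tau == 0) | (t > u))

for u in (0.0, 1.0, 50.0):
    print(S_emp(u), cure + (1-cure)*np.exp(-b*u))   # empirical vs exact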
4.2. The quick and dirty way to include cure
An alternative way to include cure, often found in textbooks and papers (regretfully), is to extend
the time set $[0,\infty)$ and include $t_\mu = \infty$ in the event time distribution; events that don't happen are
said to happen at $t = \infty$. For instance, in the case where there is just one risk we would write the
symbolic expression
$$\mathcal{P}_i(t) = \varepsilon_i P_i(t) + (1-\varepsilon_i)\,\delta(t-\infty) \qquad (70)$$
with $\mathcal{P}_i(t)$ the extended distribution, and $P_i(t)$ an ordinary normalised distribution, describing
event time statistics for the case where the event does happen, which would be the case with
probability $\varepsilon_i = {\rm Prob}_i(\tau=1)$ (in terms of our previous set-up). We would then define
$\int_0^\infty dt\ \delta(t-\infty) = 1$, and find
$$\int_0^\infty dt\ \mathcal{P}_i(t) = \varepsilon_i + (1-\varepsilon_i)\int_0^\infty dt\ \delta(t-\infty) = 1 \qquad (71)$$
but
$$\lim_{X\to\infty}\int_0^X dt\ \mathcal{P}_i(t) = \varepsilon_i\lim_{X\to\infty}\int_0^X dt\ P_i(t) = \varepsilon_i \qquad (72)$$
The survival function and the hazard rate would at finite times become
$$S_i(t) = \int_t^\infty ds\ \mathcal{P}_i(s) = 1 - \int_0^t ds\ \mathcal{P}_i(s) = 1 - \varepsilon_i\int_0^t ds\ P_i(s) \qquad (73)$$
$$\pi^i(t) = -\frac{d}{dt}\log S_i(t) = \mathcal{P}_i(t)/S_i(t) = \varepsilon_i P_i(t)/S_i(t) \qquad (74)$$
And so we find
$$S_i(t) = e^{-\int_0^t ds\ \pi^i(s)}, \qquad \pi^i(t)\,e^{-\int_0^t ds\ \pi^i(s)} = \varepsilon_i P_i(t) \qquad (75)$$
We can now express $\varepsilon_i$ (the probability of the event actually happening) by integration of both
sides of the second identity over time, since $P_i(t)$ is normalised:
$$\varepsilon_i = \int_0^\infty dt\ \pi^i(t)\,e^{-\int_0^t ds\ \pi^i(s)} = -\int_0^\infty dt\Big[\frac{d}{dt}e^{-\int_0^t ds\ \pi^i(s)}\Big] = 1 - e^{-\int_0^\infty ds\ \pi^i(s)} \qquad (76)$$
This makes sense, since we know that $\int_0^\infty ds\ \pi^i(s) < \infty$ is indeed the condition for finding a finite
'no event' probability. We see that we can now also write the initial time distribution $\mathcal{P}_i(t)$ as
$$\mathcal{P}_i(t) = \pi^i(t)\,e^{-\int_0^t ds\ \pi^i(s)} + e^{-\int_0^\infty ds\ \pi^i(s)}\,\delta(t-\infty) \qquad (77)$$
It is in principle possible to analyse the situation this way, but it is mathematically somewhat
messy. For instance: we have had to give up the standard convention of calculus that
$\int_0^\infty ds\ G(s) = \lim_{z\to\infty}\int_0^z ds\ G(s)$, so we would always have to indicate whether we mean one or the other. It is
therefore more prone to mistakes. And as soon as we ask about the joint distribution $\mathcal{P}_i(t_0,\dots,t_R)$
it all becomes even worse ...
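Interestingly, in numerical work the same bookkeeping appears in a harmless form (my own
sketch): never-happening events can be represented by the event time np.inf, and the finite-time
statements (72,73) then come out directly.

import numpy as np

rng = np.random.default_rng(5)
n, eps, b = 100_000, 0.6, 1.5    # eps: probability that the event does happen
t = np.where(rng.random(n) < eps, rng.exponential(1/b, n), np.inf)

print(np.isfinite(t).mean())                             # ~eps, cf. (72)
print((t > 1.0).mean(), 1 - eps*(1 - np.exp(-b*1.0)))    # S(1), cf. (73)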
4.3. Examples
Let us return to an earlier example, where an independent-times explanation of cause-specific hazard
rates led to a risk with a nonzero probability of not generating events. We start with the following
cause-specific hazard rates, for an individual subject to two risks:
$$\pi_1(t) = \frac{b(1-\varepsilon)\,e^{-bt}}{\varepsilon + (1-\varepsilon)\,e^{-bt}}, \qquad \pi_2(t) = a \qquad (78)$$
We now try to construct the independent-times explanation $P(t_1,t_2;\tau_1,\tau_2) = P_1(t_1,\tau_1)P_2(t_2,\tau_2)$ for
these hazard rates, so $S(t_1,t_2) = S_1(t_1)S_2(t_2)$ with
$$S_1(t) = \sum_\tau\int_0^\infty ds\ P_1(s,\tau)\big[\tau\,\theta(s-t) + (1-\tau)\big] \qquad (79)$$
$$S_2(t) = \sum_\tau\int_0^\infty ds\ P_2(s,\tau)\big[\tau\,\theta(s-t) + (1-\tau)\big] \qquad (80)$$
We first calculate the two risk-specific survival probabilities:
$$S_1(t) = e^{-\int_0^t ds\ \pi_1(s)} = \exp\Big[-\int_0^t ds\ \frac{b(1-\varepsilon)\,e^{-bs}}{\varepsilon+(1-\varepsilon)\,e^{-bs}}\Big] = \exp\Big[\int_0^t ds\ \frac{d}{ds}\log\big(\varepsilon+(1-\varepsilon)\,e^{-bs}\big)\Big] = \varepsilon + (1-\varepsilon)\,e^{-bt} \qquad (81)$$
$$S_2(t) = e^{-\int_0^t ds\ \pi_2(s)} = e^{-at} \qquad (82)$$
Thus our equations (79,80), from which to calculate $P_1(t,\tau)$ and $P_2(t,\tau)$, become, after working out
the summations over τ:
$$\varepsilon + (1-\varepsilon)\,e^{-bt} = \int_0^\infty ds\ P_1(s,0) + \int_0^\infty ds\ P_1(s,1)\,\theta(s-t) \qquad (83)$$
$$e^{-at} = \int_0^\infty ds\ P_2(s,0) + \int_0^\infty ds\ P_2(s,1)\,\theta(s-t) \qquad (84)$$
The functions $P_{1,2}(s,0)$ are superfluous, since if τ = 0 (i.e. if the event doesn't happen) the associated
event time is not used. We use normalisation and write $\int_0^\infty ds\ P_{1,2}(s,0) = 1 - \int_0^\infty ds\ P_{1,2}(s,1)$, giving
$$\varepsilon + (1-\varepsilon)\,e^{-bt} = 1 - \int_0^\infty ds\ P_1(s,1)\big[1-\theta(s-t)\big] = 1 - \int_0^\infty ds\ P_1(s,1)\,\theta(t-s) = 1 - \int_0^t ds\ P_1(s,1) \qquad (85)$$
$$e^{-at} = 1 - \int_0^\infty ds\ P_2(s,1)\big[1-\theta(s-t)\big] = 1 - \int_0^\infty ds\ P_2(s,1)\,\theta(t-s) = 1 - \int_0^t ds\ P_2(s,1) \qquad (86)$$
Finally we differentiate both sides of both equations. This gives
$$P_1(t,1) = b(1-\varepsilon)\,e^{-bt}, \qquad P_2(t,1) = a\,e^{-at} \qquad (87)$$
This, in turn, means that
$$\int_0^\infty dt\ P_1(t,0) = 1 - \int_0^\infty dt\ P_1(t,1) = 1 - \int_0^\infty dt\ b(1-\varepsilon)\,e^{-bt} = 1 - (1-\varepsilon) = \varepsilon \qquad (88)$$
$$\int_0^\infty dt\ P_2(t,0) = 1 - \int_0^\infty dt\ P_2(t,1) = 1 - \int_0^\infty dt\ a\,e^{-at} = 1 - 1 = 0 \qquad (89)$$
We conclude that
$$P_1(t,\tau) = \varepsilon\,\delta_{\tau,0}P(t) + (1-\varepsilon)\,\delta_{\tau,1}\,b\,e^{-bt} \qquad (90)$$
$$P_2(t,\tau) = \delta_{\tau,1}\,a\,e^{-at} \qquad (91)$$
with some irrelevant normalised distribution P(t) (since for τ = 0 the event to which t refers will
by definition not materialise). This gives in combination:
$$P(t_1,t_2;\tau_1,\tau_2) = \delta_{\tau_2,1}\,a\,e^{-at_2}\Big[\varepsilon\,\delta_{\tau_1,0}P(t_1) + (1-\varepsilon)\,\delta_{\tau_1,1}\,b\,e^{-bt_1}\Big] \qquad (92)$$
This distribution, with a nonzero probability of event 1 never happening, would in terms of observed
survival data be indistinguishable from the original one, where both events always happen:
$$P(t_1,t_2) = a\,e^{-at_2}\Big[\varepsilon\,\delta(t_1-t_2-\tau) + (1-\varepsilon)\,b\,e^{-bt_1}\Big] \qquad (93)$$
Both (92) and (93) give exactly the same cause-specific hazard rates (78). The independent-event-times
explanation for the fraction ε of cases in (93), where in reality event 1 would not be reported
in a trial because it happens at a fixed time later than event 2, is to say that in the fraction ε of
cases event 1 simply does not happen (irrespective of risk 2).
5. Individual versus cohort level survival statistics
5.1. Population level survival functions
The quantities defined so far describe statistical features at the level of individuals. If we want to
characterize the cohort as a whole (or if perhaps we have no information at the level of individuals)
we would work instead with the cohort averages of these functions, i.e.
$$S(t) = \frac{1}{N}\sum_{i=1}^N S_i(t), \qquad P(t_0,\dots,t_R) = \frac{1}{N}\sum_{i=1}^N P_i(t_0,\dots,t_R) \qquad (94)$$
S(t) gives the probability that a randomly picked individual in the cohort will not have experienced
any event prior to time t, and $P(t_0,\dots,t_R)$ gives the probability density for a randomly picked
individual to have joint event times $(t_0,\dots,t_R)$. If we inspect the derivation of $S_i(t)$ from
$P_i(t_0,\dots,t_R)$ we note that we can simply insert $\frac{1}{N}\sum_{i=1}^N$ everywhere and get also
$$S(t) = S(t,t,\dots,t), \qquad S(t_0,\dots,t_R) = \int_{t_0}^\infty\!\!\cdots\!\int_{t_R}^\infty ds_0\cdots ds_R\ P(s_0,\dots,s_R) \qquad (95)$$
In fact, we could have started developing our previous theory fully at population level. This is what
most textbooks do. It would effectively have meant dropping the indices i from all identities in the
previous sections. We would have defined population-level cause-specific hazard rates $\pi_r(t)$, such
that the above population survival function would be written as
$$S(t) = e^{-\sum_{r=0}^R\int_0^t ds\ \pi_r(s)} \qquad (96)$$
However, one would not have $\pi_r(t) = N^{-1}\sum_i\pi_r^i(t)$, since $\log\frac{1}{N}\sum_i(\dots) \neq \frac{1}{N}\sum_i\log(\dots)$. We must
therefore always clarify whether we talk about individual or population functions. Failure to make
this distinction leads to confusion and mistakes. With the drive towards personalised medicine the
differences between cohort and individual survival statistics will become even more important.
Event time uncertainty versus hazard rate uncertainty. The description of survival statistics at the cohort level, via S(t), involves two sources of uncertainty, which one cannot easily disentangle: the uncertainty of event times, as described by the individual functions S_i(t), and the uncertainty of which individual we pick from the cohort, represented by the averaging N^{−1}Σ_i. To illustrate this, imagine we have just one risk, and we observe at population level what seems to be a simple exponentially decaying survival function
S(t) = e^{−πt}   (97)
This can arise in many ways. For instance, all individuals could be identical, and the uncertainty fully due to event time uncertainty at the individual level: the choice S_i(t) = e^{−πt} for all i would trivially give the above S(t). The opposite extreme would be the case where the individuals have no event time uncertainty at all, i.e. S_i(t) = θ[t_i^⋆ − t], so each i dies fully deterministically at some time t_i^⋆, but the pre-ordained times t_i^⋆ vary from one individual to another. This would mean
π_i(t) = −(d/dt) log S_i(t) = −(d/dt) log θ[t_i^⋆ − t]   (98)
Here we would find, with W(t^⋆) = N^{−1} Σ_i δ[t^⋆ − t_i^⋆] (the distribution of death times over the population):
S(t) = (1/N) Σ_i θ[t_i^⋆ − t] = ∫_0^∞ dt^⋆ W(t^⋆) θ(t^⋆ − t) = ∫_t^∞ dt^⋆ W(t^⋆)   (99)
It is easy to see that also here we can recover the above exponential form for S(t), if the predestined times of death are distributed exponentially over the population, according to W(t^⋆) = π e^{−πt^⋆}:
S(t) = ∫_t^∞ dt^⋆ π e^{−πt^⋆} = e^{−πt}   (100)
So in both cases we find the population survival function (97), but for very different reasons. In real patient data one would typically expect to have a combination of both types of uncertainty.
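A minimal simulation sketch in Python (assuming π = 1) makes the indistinguishability explicit: the observed death times produced by the two mechanisms are statistically identical, so survival data alone cannot tell them apart:

    import numpy as np

    rng = np.random.default_rng(1)
    pi, N = 1.0, 100_000

    # mechanism A: identical individuals, each with S_i(t) = exp(-pi*t)
    deaths_A = rng.exponential(1/pi, N)

    # mechanism B: each individual dies deterministically at a pre-ordained
    # time t*_i, with the t*_i drawn across the cohort from W(t*) = pi e^{-pi t*}
    deaths_B = rng.exponential(1/pi, N)   # these ARE the deterministic t*_i

    for t in (0.5, 1.0, 2.0):
        print(t, (deaths_A > t).mean(), (deaths_B > t).mean(), np.exp(-pi*t))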
5.2. Population hazard rates and data likelihood

Relation between population hazard rates and individual hazard rates. It will be instructive to express the population-level cause-specific hazard rates π_r(t) of (96) in terms of the individual cause-specific hazard rates π_r^i(t), by application of (7) to population-level functions:
π_µ(t) S(t) = ∫_t^∞ ... ∫_t^∞ (Π_{r≠µ} ds_r) P(s_0,...,s_{µ−1},t,s_{µ+1},...,s_R)
= (1/N) Σ_i ∫_t^∞ ... ∫_t^∞ (Π_{r≠µ} ds_r) P_i(s_0,...,s_{µ−1},t,s_{µ+1},...,s_R)
= (1/N) Σ_i π_µ^i(t) S_i(t)   (101)
So we find
π_µ(t) = Σ_i π_µ^i(t) S_i(t) / Σ_i S_i(t) = Σ_i π_µ^i(t) e^{−Σ_r ∫_0^t ds π_r^i(s)} / Σ_i e^{−Σ_r ∫_0^t ds π_r^i(s)}   (102)
It is clear that π_r(t) ≠ N^{−1} Σ_i π_r^i(t) as soon as the cohort is not strictly homogeneous, i.e. as soon as the hazard rates of different individuals i are not all identical.
In fact, we see that heterogeneity will give us a time-dependent population-level hazard rate even if all individuals in the population have time-independent hazard rates. Suppose π_µ^i(t) = π_µ^i for all (i,µ) and all t. We would then obtain
π_µ(t) = Σ_i π_µ^i(t) S_i(t) / Σ_i S_i(t) = Σ_i π_µ^i e^{−t Σ_r π_r^i} / Σ_i e^{−t Σ_r π_r^i}   (103)
For large times, the individuals with the lowest hazard rates contribute most to the average in (103):
π_µ(0) = (1/N) Σ_i π_µ^i,   lim_{t→∞} π_µ(t) = π_µ^{i⋆},   i⋆ = argmin_i (Σ_r π_r^i)   (104)
In Cox regression (see a later section) one indeed often observes population hazard ratios that
appear to decay over time; it is now clear that this should not necessarily be interpreted as a time
dependence at the level of individuals, as it could be due simply to cohort heterogeneity.
Data likelihood. In the same way we find that if we work at population level, with population hazard rates and population survival functions, we can no longer use (15) to quantify the likelihood of finding an individual reporting a first event of type ∆ at time X. Instead we would now use
P(X,∆) = π_∆(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r(s)}   (105)
It now follows from (101), upon inserting our formulae for S(t) and S_i(t), that in fact P(X,∆) = N^{−1} Σ_i P_i(X,∆). The final picture is therefore as in the diagram below:

  INDIVIDUAL LEVEL                                          POPULATION LEVEL

  P_1(t_0,...,t_R)  ...  P_N(t_0,...,t_R)                   P(t_0,...,t_R) = (1/N) Σ_i P_i(t_0,...,t_R)
  (individual event time statistics)
          ⇓
  π_0^1(X),...,π_R^1(X)  ...  π_0^N(X),...,π_R^N(X)         π_∆(X) = P(X,∆) / Σ_r ∫_X^∞ dt P(t,r)
  (individual hazard rates)
  P_1(X,0),...,P_1(X,R)  ...  P_N(X,0),...,P_N(X,R)         P(X,∆) = (1/N) Σ_i P_i(X,∆)
  (individual data likelihoods)                             P(X,∆) = π_∆(X) e^{−Σ_r ∫_0^X ds π_r(s)}
          ⇓
  (X_1,∆_1)  ...  (X_N,∆_N)
  (observed survival data)
5.3. Examples

Example 1: impact of heterogeneity on population hazard rates

Imagine a population of two distinct groups of individuals, A and B: {1,...,N} = A∪B. Let there be N_A = fN patients in group A and N_B = (1−f)N patients in group B. They are all subject to just one risk, and have time-independent individual hazard rates: π_i(t) = 1 if i ∈ A and π_i(t) = 3 if i ∈ B. At population level this would give the time-dependent hazard rate
π(t) = Σ_i π_i e^{−tπ_i} / Σ_i e^{−tπ_i} = (Σ_{i∈A} e^{−t} + Σ_{i∈B} 3e^{−3t}) / (Σ_{i∈A} e^{−t} + Σ_{i∈B} e^{−3t}) = (f + 3(1−f)e^{−2t}) / (f + (1−f)e^{−2t})   (106)
The result is shown in figure 2 for f ∈ {0, 1/4, 1/2, 3/4, 1}.
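A few lines of Python suffice to evaluate (106) numerically (the rates 1 and 3 are those of this example); one sees the decay from π(0) = 3−2f towards the lowest rate present in the cohort:

    import numpy as np

    # population-level hazard rate of eq (106), two-group cohort with
    # individual rates 1 (group A, fraction f) and 3 (group B, fraction 1-f)
    def pop_hazard(t, f):
        return (f + 3*(1 - f)*np.exp(-2*t)) / (f + (1 - f)*np.exp(-2*t))

    t = np.linspace(0, 3, 7)
    for f in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f, np.round(pop_hazard(t, f), 3))   # decays for 0 < f < 1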
Figure 2. The population-level hazard rate π(t) as given by (106), for different values of the relative sizes of the sub-classes in the cohort. The hazard rate decays over time as soon as there is heterogeneity in the cohort (i.e. for 0 < f < 1), in spite of all individuals of the cohort having strictly time-independent hazard rates.

Example 2: correlated population risks without correlated individual risks

Imagine again a population of two distinct groups of individuals, A and B: {1,...,N} = A∪B. Let there be N_A = fN patients in group A and N_B = (1−f)N patients in group B. They are
all subject to two risks r = 1 and r = 2. Assume that all have independent event times, and hence factorising survival functions as in (26), and constant hazard rates:
i ∈ A:   P_i(t_1,t_2) = (π_1^A e^{−t_1 π_1^A})(π_2^A e^{−t_2 π_2^A})   (107)
i ∈ B:   P_i(t_1,t_2) = (π_1^B e^{−t_1 π_1^B})(π_2^B e^{−t_2 π_2^B})   (108)
So
i ∈ A:   S_i(t_1,t_2) = S_{A1}(t_1) S_{A2}(t_2),   S_{A1}(t) = e^{−tπ_1^A},   S_{A2}(t) = e^{−tπ_2^A}   (109)
i ∈ B:   S_i(t_1,t_2) = S_{B1}(t_1) S_{B2}(t_2),   S_{B1}(t) = e^{−tπ_1^B},   S_{B2}(t) = e^{−tπ_2^B}   (110)
Within each group the two risks are clearly independent. At population level we find the overall survival function
S(t) = (1/N) Σ_{i=1}^N S_i(t) = (1/N) Σ_{i∈A} S_{A1}(t) S_{A2}(t) + (1/N) Σ_{i∈B} S_{B1}(t) S_{B2}(t)
= (N_A/N) S_{A1}(t) S_{A2}(t) + (N_B/N) S_{B1}(t) S_{B2}(t) = f e^{−t(π_1^A+π_2^A)} + (1−f) e^{−t(π_1^B+π_2^B)}   (111)
One would naively expect the population survival functions for the individual risks to be the cohort averages of the corresponding individual survival functions:
S_r(t) = (1/N) Σ_{i=1}^N S_{ir}(t) = (N_A/N) S_{Ar}(t) + (N_B/N) S_{Br}(t) = f e^{−tπ_r^A} + (1−f) e^{−tπ_r^B}   (112)
This gives for the product S_1(t)S_2(t):
S_1(t)S_2(t) = (f e^{−tπ_1^A} + (1−f) e^{−tπ_1^B})(f e^{−tπ_2^A} + (1−f) e^{−tπ_2^B})
= f² e^{−t(π_1^A+π_2^A)} + (1−f)² e^{−t(π_1^B+π_2^B)} + f(1−f)[e^{−t(π_1^A+π_2^B)} + e^{−t(π_2^A+π_1^B)}]
If we had population-level independence of risks (as is true at the level of the individual groups) we would have expected to find S(t) = S_1(t)S_2(t). Instead here we get
S(t) − S_1(t)S_2(t) = f(1−f)[e^{−t(π_1^A+π_2^A)} + e^{−t(π_1^B+π_2^B)} − e^{−t(π_1^A+π_2^B)} − e^{−t(π_2^A+π_1^B)}]
= f(1−f)[e^{−tπ_1^A} − e^{−tπ_1^B}][e^{−tπ_2^A} − e^{−tπ_2^B}]   (113)
We see that generally our two risks will be correlated at population level, i.e. S(t) ≠ Π_r S_r(t), except for f = 0, 1, or when either π_1^A = π_1^B or π_2^A = π_2^B. These are precisely the cases where the correlation C_{12} over the population of the hazard rates of the two risks vanishes:
C_{12} = ⟨π_1π_2⟩ − ⟨π_1⟩⟨π_2⟩ = (1/N) Σ_i π_1^i π_2^i − ((1/N) Σ_i π_1^i)((1/N) Σ_i π_2^i)
= f π_1^A π_2^A + (1−f) π_1^B π_2^B − (f π_1^A + (1−f) π_1^B)(f π_2^A + (1−f) π_2^B)
= f(1−f)[π_1^A π_2^A + π_1^B π_2^B − π_1^A π_2^B − π_1^B π_2^A]
= f(1−f)[π_1^A − π_1^B][π_2^A − π_2^B]   (114)
This illustrates how risk correlations at population level can emerge in a natural way as a result
of correlations of the cause-specific hazard rates of the individuals in a heterogeneous cohort,
in spite of each individual having independent event times.
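For readers who like numerical reassurance, the following Python sketch (with arbitrarily chosen rates) verifies identity (113) and evaluates the hazard rate correlation (114):

    import numpy as np

    # check of eqs (113,114) for a two-group cohort with independent
    # event times within each individual; rates chosen arbitrarily
    f, pA1, pA2, pB1, pB2 = 0.3, 1.0, 2.0, 3.0, 0.5
    t = 0.7

    S  = f*np.exp(-t*(pA1+pA2)) + (1-f)*np.exp(-t*(pB1+pB2))
    S1 = f*np.exp(-t*pA1) + (1-f)*np.exp(-t*pB1)
    S2 = f*np.exp(-t*pA2) + (1-f)*np.exp(-t*pB2)

    lhs = S - S1*S2
    rhs = f*(1-f)*(np.exp(-t*pA1)-np.exp(-t*pB1))*(np.exp(-t*pA2)-np.exp(-t*pB2))
    print(lhs, rhs)                          # identical: eq (113)
    print(f*(1-f)*(pA1-pB1)*(pA2-pB2))       # C_12 of eq (114), nonzero here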
6. Survival prediction

To predict survival we need to know (or estimate) the cause-specific hazard rates. The natural object to use would be the survival function S_i(t), which gives the probability that individual i does not experience any of the risk events prior to time t. However, sometimes we cannot use this (for instance if we only have information on the cause-specific hazard rates of the cohort as a whole, rather than those of the individual i), or we may wish to calculate different probabilities.
6.1. Cause-specific survival functions

Non-hypothetical survival probabilities. Instead of predicting overall survival, via S_i(t) or S(t), we will often be interested in other predictions. We have already seen P_i(X,µ), the probability density for seeing event µ reported first, in a small time interval located at time X (see (15)):
P_i(X,µ) = π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (115)
From this follows e.g. the so-called cumulative incidence function F_{iµ}(t), which is the probability that individual i 'fails' from cause µ at any time prior to t:
F_{iµ}(t) = ∫_0^t dX P_i(X,µ) = ∫_0^t dX π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (116)
or the population average F_µ(t) of this function, which gives the probability that a randomly drawn individual from our cohort 'fails' from cause µ at any time prior to t:
F_µ(t) = (1/N) Σ_i ∫_0^t dX π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (117)
Equivalently, in terms of the global cause-specific hazard rates:
F_µ(t) = ∫_0^t dX π_µ(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r(s)}   (118)
Again it is important to specify which of the above cumulative incidence functions one is referring to (unless the cohort consists of clones, in which case the difference between the two vanishes). An equivalent quantity is the cause-specific survival probability G_{iµ}(t), defined as the likelihood that at time t individual i has not yet failed from cause µ, either because he/she experienced another event prior to t, or because nothing has happened yet at time t:
G_{iµ}(t) = 1 − F_{iµ}(t) = 1 − ∫_0^t dX π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (119)
The likelihood that individual i will never report event µ would then be
G_{iµ}(∞) = 1 − ∫_0^∞ dX π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (120)
We see that F_{iµ}(t), F_µ(t), and G_{iµ}(t) all depend on all cause-specific hazard rates, not just that of risk µ, since the other risks influence how likely it is for risk µ to trigger an event first. Note that even in the case of statistically independent event times, where S_i(t) = Π_r S_{ir}(t), the function G_{iµ}(t) is not the same as S_{iµ}(t): both describe how likely it is for event µ not to have taken place yet at time t, but G_{iµ}(t) takes into account the likelihood that we haven't seen event µ because other events happened earlier, whereas S_{iµ}(t) does not.
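As a small illustration, the following Python sketch (assuming two arbitrarily chosen constant hazard rates) evaluates (116) and (120) by simple Riemann sums; for constant rates one finds F_µ(∞) = π_µ/Σ_r π_r:

    import numpy as np

    pi = np.array([0.5, 1.5])     # two risks with constant cause-specific rates
    dt, T = 0.001, 10.0
    t = np.arange(0.0, T, dt)
    S = np.exp(-pi.sum()*t)       # overall survival exp(-sum_r int_0^t pi_r)

    # cumulative incidence F_mu(t) of eq (116), one row per risk
    F = np.array([np.cumsum(p*S)*dt for p in pi])
    print(F[:, -1])               # F_mu(inf) -> pi_mu/sum(pi): 0.25 and 0.75
    print(1 - F[0, -1])           # G_mu(inf) of eq (120) for the first risk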
The effects of disabling risks on cause-specific hazard rates. We might also be interested in
hypothetical quantities, such as what would be the survival probabilities if one or more of the risks
could be eliminated. Often we wish to study one specific 'primary' risk, and would want to eliminate the obscuring effects of the others. If we denote the 'active' set of risks as A ⊆ {0,1,2,...,R}, then we have to disable all risks r ∉ A. We have already noted earlier that this does not simply mean setting π_r^i(t) or π_r(t) to zero for all r ∉ A, due to the conditioning in the definition of the cause-specific hazard rates. If we start from the general distribution P_i(t_0,...,t_R;τ_0,...,τ_R) and disable all risks other than those in the set A, we effectively change this distribution into
P_i′(t_0,...,t_R;τ_0,...,τ_R) = P_i(t_0,...,t_R;τ_0,...,τ_R) Π_{r∉A} δ_{τ_r,0} / Σ_{τ_0′...τ_R′} ∫ ds_0′...ds_R′ P_i(s_0′,...,s_R′;τ_0′,...,τ_R′) Π_{r∉A} δ_{τ_r′,0}   (121)
and the new cause-specific hazard rates become
π_µ^{i′}(t) = Σ_{τ_0...τ_R} τ_µ ∫ ds_0...ds_R P_i′(s_0,...,s_R;τ_0,...,τ_R) δ(s_µ−t) Π_{r≠µ}[1−τ_r θ(t−s_r)] / Σ_{τ_0...τ_R} ∫ ds_0...ds_R P_i′(s_0,...,s_R;τ_0,...,τ_R) Π_r[1−τ_r θ(t−s_r)]
= Σ_{τ_0...τ_R} τ_µ ∫ ds_0...ds_R P_i(s_0,...,s_R;τ_0,...,τ_R) δ(s_µ−t) Π_{r∉A} δ_{τ_r,0} Π_{r≠µ}[1−τ_r θ(t−s_r)] / Σ_{τ_0...τ_R} ∫ ds_0...ds_R P_i(s_0,...,s_R;τ_0,...,τ_R) Π_{r∉A} δ_{τ_r,0} Π_r[1−τ_r θ(t−s_r)]   (122)
If we started from events that always happen, i.e. a distribution of the form P_i(t_0,...,t_R), then we would have upon disabling the risks r ∉ A:
P_i′(t_0,...,t_R;τ_0,...,τ_R) = P_i(t_0,...,t_R) (Π_{r∈A} δ_{τ_r,1})(Π_{r∉A} δ_{τ_r,0})   (123)
and find
π_µ^{i′}(t) = Σ_{τ_0...τ_R} τ_µ ∫ ds_0...ds_R P_i′(s_0,...,s_R;τ_0,...,τ_R) δ(s_µ−t) Π_{r≠µ}[1−τ_r θ(t−s_r)] / Σ_{τ_0...τ_R} ∫ ds_0...ds_R P_i′(s_0,...,s_R;τ_0,...,τ_R) Π_r[1−τ_r θ(t−s_r)]
= ∫ ds_0...ds_R P_i(s_0,...,s_R) δ(s_µ−t) Σ_{τ_0...τ_R} τ_µ Π_{r∈A} δ_{τ_r,1} Π_{r∉A} δ_{τ_r,0} Π_{r≠µ}[1−τ_r θ(t−s_r)] / ∫ ds_0...ds_R P_i(s_0,...,s_R) Σ_{τ_0...τ_R} Π_{r∈A} δ_{τ_r,1} Π_{r∉A} δ_{τ_r,0} Π_{r∈A}[1−θ(t−s_r)]   (124)
As expected, if µ ∉ A (so risk µ is disabled) we always get π_µ^{i′}(t) = 0 for all t. If µ ∈ A (so risk µ is not disabled), then it is clear from the above that all cause-specific hazard rates of risks in the active set A are affected by our disabling of risks, and that we need to know the distribution P_i(s_0,...,s_R;τ_0,...,τ_R) to calculate the new rates. In the case of (124) and µ ∈ A we can simplify our formula for π_µ^{i′}(t) further to
π_µ^{i′}(t) = ∫ ds_0...ds_R P_i(s_0,...,s_R) δ(s_µ−t) Π_{r∈A\{µ}}[1−θ(t−s_r)] / ∫ ds_0...ds_R P_i(s_0,...,s_R) Π_{r∈A}[1−θ(t−s_r)]   (125)
which in effect involves only the marginal of Pi(s0, . . . , sR), obtained by integrating out all times
tr with r /∈ A. We now appreciate why the question of whether Pi(s0, . . . , sR; τ0, . . . , τR) can be
calculated was a relevant one. In particular, in view of the Tsiatis identifiability problem we cannot
expect to write the new hazard rates in terms of the old ones, since we cannot generally express
Pi(s0, . . . , sR; τ0, . . . , τR) in terms of the old hazard rates . . .
Hypothetical survival probabilities for uncorrelated event times. Only if the event times are known to be uncorrelated can we proceed to calculate the new hazard rates. In this case we may write P_i(s_0,...,s_R;τ_0,...,τ_R) = Π_{r=0}^R P_{ir}(s_r,τ_r), and simplify the above formula for the new cause-specific hazard rates for µ ∈ A to
π_µ^{i′}(t) = [Σ_{τ_µ} τ_µ ∫ ds_µ P_{iµ}(s_µ,τ_µ) δ(s_µ−t)] Π_{r∈A\{µ}}[Σ_{τ_r} ∫ ds_r P_{ir}(s_r,τ_r)(1−τ_r θ(t−s_r))] / Π_{r∈A}[Σ_{τ_r} ∫ ds_r P_{ir}(s_r,τ_r)(1−τ_r θ(t−s_r))]
= Σ_{τ_µ} τ_µ ∫ ds_µ P_{iµ}(s_µ,τ_µ) δ(s_µ−t) / Σ_{τ_µ} ∫ ds_µ P_{iµ}(s_µ,τ_µ)(1−τ_µ θ(t−s_µ)) = π_µ^i(t)   (126)
The reason that the last formula is simply the old rate π_µ^i(t) is that it no longer involves the active risk set A, so it must also be true for the choice A = {0,1,...,R} (i.e. none of the risks is disabled), for which we must recover the old hazard rate. In conclusion: if we disable risks, all cause-specific hazard rates will generally be affected in a complicated way, and to calculate their new values we need to know the joint event time distribution. Only if the event times are uncorrelated does disabling risks simply mean setting all cause-specific hazard rates of the disabled risks to zero.
For instance, if all risks except risk µ were eliminated (i.e. A = {µ}) and the event times are uncorrelated, then one would have π_r^{i′}(t) = 0 for all r ≠ µ, and find the following hypothetical cause-specific survival probability S_{iµ}(t), describing a world where only risk µ is active:
S_{iµ}(t) = 1 − ∫_0^t dX π_µ^i(X) e^{−∫_0^X ds π_µ^i(s)} = 1 + ∫_0^t dX (d/dX) e^{−∫_0^X ds π_µ^i(s)} = 1 + [e^{−∫_0^X ds π_µ^i(s)}]_{X=0}^{X=t} = e^{−∫_0^t ds π_µ^i(s)}   (127)
which is identical to the risk-specific survival factor S_{iµ}(t) in S_i(t) = Π_r S_{ir}(t) (due to the assumed event time independence).
6.2. Estimation of cause-specific hazard rates

Strategies for extracting hazard rate information from the data. The previous subsection dealt with how we can make predictions once we know the cause-specific hazard rates. In reality we must find these rates first. Suppose we have survival data D = {(X_1,∆_1),...,(X_N,∆_N)}, referring to N individuals who can be regarded as independently drawn random samples from a population characterised by an as yet unknown set of population-level cause-specific hazard rates {π_0,...,π_R}. There are two (connected) approaches to the question of how to get the rates from D.‖

The first (traditional) approach is to construct formulas for so-called estimators: expressions π̂_0,...,π̂_R written in terms of the data D, for which we can prove that in the limit N→∞ they converge to the true values π_0,...,π_R. Once these estimators are chosen and their properties verified, one then uses in prediction the estimators instead of the real (unknown) cause-specific hazard rates. There are two downsides to this approach. The first is the difficulty of constructing good (i.e. unbiased and fast-converging) candidates for these formulas, which is easy for trivial quantities but not for the present case of hazard rates. The second is that, especially when N is not very large, we know that estimators are not exact, yet we have no way of accounting for this imprecision in our predictions; we would just have to hope for the best. The second approach is to use Bayesian arguments, which not only take into account our residual uncertainty regarding the true hazard rates after having observed the data, but also lead us to systematic formulae for estimators. In Appendix C we work out and compare the different routes available for estimating model parameters from data for a simple example.

‖ The procedures described here apply more generally to the extraction of parameters from data, not just to the extraction of cause-specific hazard rates {π_0,...,π_R} from our survival data D.
The maximum likelihood estimator. Let us try to determine the most probable values of our hazard rates, given our observation of the data D, in the Bayesian way. This means finding the maximum over {π_0,...,π_R} of the distribution 𝒫({π_0,...,π_R}|D).¶ The standard Bayesian identity p(a|b)p(b) = p(b|a)p(a) allows us to express this distribution in terms of its counterpart, the data likelihood 𝒫(D|{π_0,...,π_R}) given the hazard rates:
𝒫({π_0,...,π_R}|D) = 𝒫(D|{π_0,...,π_R}) 𝒫({π_0,...,π_R}) / 𝒫(D)
= 𝒫(D|{π_0,...,π_R}) 𝒫({π_0,...,π_R}) / ∫ dπ_0′...dπ_R′ 𝒫(D|{π_0′,...,π_R′}) 𝒫({π_0′,...,π_R′})   (128)
Within the Bayesian framework one would choose for the prior the maximum-entropy distribution, subject to applicable constraints (such as π_r(t) ≥ 0 for all t). If, on the other hand, we choose a so-called flat prior, i.e. we take 𝒫({π_0,...,π_R}) to be independent of {π_0,...,π_R} (so we have no prior preference either way, beyond the constraints), then the most probable set {π_0,...,π_R} is the one that maximises 𝒫(D|{π_0,...,π_R}). This is called a maximum-likelihood estimator. Equivalently, we can maximise the logarithm of the data likelihood, viz. L(D|{π_0,...,π_R}) = log 𝒫(D|{π_0,...,π_R}), which gives slightly more compact equations. To proceed we need a formula for L(D|{π_0,...,π_R}), which, given our independence assumption and in view of (105), is
L(D|{π_0,...,π_R}) = log Π_i P(X_i,∆_i) = Σ_i log P(X_i,∆_i)
= Σ_i log π_{∆_i}(X_i) − Σ_i Σ_{r=0}^R ∫_0^{X_i} ds π_r(s)
= Σ_{r=0}^R ∫_0^∞ ds Σ_i {δ_{r,∆_i} δ(s−X_i) log π_r(s) − π_r(s) θ(X_i−s)}   (129)
Now we maximise this latter expression by variation of each of the functions π_r(t), giving us an estimator for each of the population-level cause-specific hazard rates. It is standard convention to write estimators with a 'hat' symbol, so after differentiation of (129) we get
(∀r)(∀t):   (1/π̂_r(t)) Σ_i δ_{r,∆_i} δ(t−X_i) = Σ_i θ(X_i−t)   (130)
(∀r)(∀t):   π̂_r(t) = Σ_i δ_{r,∆_i} δ(t−X_i) / Σ_i θ(X_i−t)   (131)
This latter result seems very sensible: at any time t, the rate for event r is estimated as the number of observed type-r failures per unit time, divided by the number of patients observed to be still 'at risk' at time t.
¶ We will use ordinary Roman capitals (e.g. P(..), W(..)) for distributions describing intrinsic survival statistics of individuals and groups, and calligraphic capitals (e.g. 𝒫(..)) for Bayesian probabilities, which quantify our confidence in having extracted information correctly from survival data.
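To see (131) in action, here is a minimal discretised sketch in Python (the time binning and the synthetic single-risk data are assumptions of the illustration, not part of the formula): in each bin, the estimated rate is the number of type-r events per unit time divided by the number still at risk.

    import numpy as np

    def ml_hazard(X, Delta, r, bins):
        # binned version of eq (131): events of type r per unit time,
        # divided by the number of individuals still at risk
        edges = np.linspace(0, X.max(), bins + 1)
        dt = edges[1] - edges[0]
        events = np.histogram(X[Delta == r], bins=edges)[0]
        at_risk = np.array([(X >= e).sum() for e in edges[:-1]])
        return edges[:-1], events / np.maximum(at_risk, 1) / dt

    rng = np.random.default_rng(2)
    X = rng.exponential(2.0, 5000)        # true constant hazard rate 0.5
    Delta = np.ones(5000, dtype=int)
    t, haz = ml_hazard(X, Delta, 1, 20)
    print(np.round(haz[:5], 3))           # hovers around 0.5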
6.3. Derivation of the Kaplan-Meier estimator

Estimator for the survival function in the presence of just one risk. From the estimator (131) we can construct estimators for the various survival functions defined earlier. In particular, provided we know (or may assume) that the event times are statistically independent, we can estimate the hypothetical survival function that would describe a situation where all risks except µ (the risk of interest, or the 'primary' risk) are eliminated. The latter is estimated by Ŝ_µ(t), where
log Ŝ_µ(t) = −∫_0^t ds π̂_µ(s) = −∫_0^t ds Σ_i δ_{µ,∆_i} δ(s−X_i) / Σ_i θ(X_i−s)
= −Σ_i ∫_0^t ds δ_{µ,∆_i} δ(s−X_i) / Σ_j θ(X_j−s) = −Σ_i δ_{µ,∆_i} θ(t−X_i) / Σ_j θ(X_j−X_i)   (132)
If we denote by Ω_µ ⊂ {1,...,N} the set of all patients that report event µ, this becomes
log Ŝ_µ(t) = −Σ_{i∈Ω_µ} θ(t−X_i) / Σ_j θ(X_j−X_i)   (133)
Note that R(X_i) = Σ_j θ(X_j−X_i) is the number of individuals still 'at risk' at time X_i. We may now write
Ŝ_µ(t) = Π_{i∈Ω_µ} e^{−θ(t−X_i)/R(X_i)}   (134)
Estimator for the overall survival function. The estimate for the overall survival function S(t) = exp[−Σ_µ ∫_0^t ds π_µ(s)] can be obtained by summing over all risks in the derivation above. We get
log Ŝ(t) = −∫_0^t ds Σ_µ π̂_µ(s) = −∫_0^t ds Σ_i δ(s−X_i) / Σ_i θ(X_i−s) = −Σ_i θ(t−X_i) / Σ_j θ(X_j−X_i)   (135)
Hence
Ŝ(t) = exp[−Σ_i θ(t−X_i) / Σ_j θ(X_j−X_i)] = Π_i e^{−θ(t−X_i)/R(X_i)}   (136)
Here there is no issue relating to independence of event times; this estimator is always valid. Even without the arguments leading to (136), one can easily convince oneself that for N→∞ the expression (136) will indeed converge to the true survival function. Upon defining the empirical distribution P_em(X) = N^{−1} Σ_i δ(X−X_i) we may write:
lim_{N→∞} Ŝ(t) = lim_{N→∞} exp[−Σ_i θ(t−X_i) / Σ_j θ(X_j−X_i)]
= lim_{N→∞} exp[−∫_0^∞ dX P_em(X) θ(t−X) / ∫_0^∞ dX′ P_em(X′) θ(X′−X)]
= lim_{N→∞} exp[−∫_0^t dX P_em(X) / ∫_X^∞ dX′ P_em(X′)]   (137)
For N→∞ we will have P_em(X) → P(X) (the true distribution of first event times corresponding to our cohort), and the fraction inside the above integral becomes identical to the right-hand side of the population-level version of (19), viz.
Σ_{r=0}^R π_r(X) = P(X) / ∫_X^∞ dt P(t)   (138)
Thus we find, provided limits and integrations commute (i.e. for non-pathological P(X)), that
lim_{N→∞} Ŝ(t) = e^{−Σ_{r=0}^R ∫_0^t dX π_r(X)} = S(t)   (139)
The Kaplan-Meier estimators. From equations (134) and (136) it is only a small step to the so-called Kaplan-Meier curves. We collect and order all distinct times at which one or more events are reported in our cohort, giving an ordered set of time points t_1 < t_2 < t_3 < ..., labelled by t_ℓ. This allows us to write (134) as
Ŝ_µ(t) = e^{−Σ_{i∈Ω_µ} θ(t−X_i)/R(X_i)} = e^{−Σ_ℓ Σ_{i∈Ω_µ, X_i=t_ℓ} θ(t−t_ℓ)/R(t_ℓ)} = e^{−Σ_ℓ θ(t−t_ℓ) R^{−1}(t_ℓ) Σ_{i∈Ω_µ, X_i=t_ℓ} 1}   (140)
We recognise that D_µ(t_ℓ) = Σ_{i∈Ω_µ, X_i=t_ℓ} 1 is the number of individuals that reported event µ at time t_ℓ, so
Ŝ_µ(t) = Π_{ℓ, t_ℓ≤t} e^{−D_µ(t_ℓ)/R(t_ℓ)} = Π_{ℓ, t_ℓ≤t} [1 − D_µ(t_ℓ)/R(t_ℓ) + O((D_µ(t_ℓ)/R(t_ℓ))²)]   (141)
If finally we truncate the expansion of the exponentials after the first two terms (which is valid until times become so large that the number R(t) of individuals at risk becomes of order one), we obtain what is known as the Kaplan-Meier estimator of the risk-specific survival function for the case of uncorrelated risks:
Ŝ_µ^{KM}(t) = Π_{ℓ, t_ℓ≤t} [1 − D_µ(t_ℓ)/R(t_ℓ)],   with R(t): nr at risk at time t, and D_µ(t): nr reporting event µ at time t   (142)
Similarly, since the only difference between (134) and (136) is whether or not we limit the contributing individuals i to those that report event µ, we can retrace the above argument with only minor adjustments, and obtain the Kaplan-Meier estimator of the overall survival function:
Ŝ^{KM}(t) = Π_{ℓ, t_ℓ≤t} [1 − D(t_ℓ)/R(t_ℓ)],   with R(t): nr at risk at time t, and D(t): nr reporting an event at time t   (143)
Examples of cause-specific and overall Kaplan-Meier survival curves are shown in figure 3.
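A minimal Python sketch of the estimators (142,143) (the synthetic single-risk data below are just for illustration; for the cause-specific curve (142) one would pass report = (∆_i = µ) instead of all-True):

    import bisect
    import numpy as np

    def kaplan_meier(X, report):
        # product-limit curve: S(t) = prod over event times t_l <= t
        # of [1 - D(t_l)/R(t_l)], cf. eqs (142,143)
        times = np.unique(X[report])
        S, curve = 1.0, []
        for tl in times:
            R = (X >= tl).sum()              # nr at risk at time t_l
            D = ((X == tl) & report).sum()   # nr reporting the event at t_l
            S *= 1 - D/R
            curve.append((tl, S))
        return curve

    rng = np.random.default_rng(3)
    X = rng.exponential(1.0, 1000)           # synthetic event times
    t, S = zip(*kaplan_meier(X, np.ones(1000, dtype=bool)))
    k = bisect.bisect_right(t, 1.0) - 1
    print(S[k], np.exp(-1.0))                # KM at t=1 versus true exp(-1)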
Some properties of KM curves. The formulae for Kaplan-Meier curves are simple and compact, but they have limitations. The main one is that Ŝ_µ^{KM}(t) only estimates the survival probability for risk µ correctly for uncorrelated risks. For correlated risks we can still use Ŝ^{KM}(t), but Ŝ_µ^{KM}(t) may bear no relation at all to the event statistics of the individual risks that would be found if the alternative risks were disabled. Also, the expansion used means in both cases that for small values of R(t_ℓ), i.e. for large times when only few patients are still event-free, the curves are no longer reliable.

Secondly, by definition KM curves have the shape of descending staircases, with steps at the times where events occurred in the cohort. The smaller the cohort size N, the smaller the number of steps and the larger the jumps involved. This jagged nature of the curves is an artifact of the maximum-likelihood procedure that was followed; since common sense dictates that the true survival curves are smooth, a better procedure would be to add a non-flat prior 𝒫({π_0,...,π_R}) to our derivation, which punishes non-smooth time dependencies of the cause-specific hazard rates. The only reason this is usually not done is that we would then get a more complicated equation than (130), from which π̂_r(t) can no longer be solved in explicit form.
Figure 3. Kaplan-Meier curves for a large prostate cancer data set (top row, N = 2047 patients, primary risk: onset of prostate cancer) and for a smaller breast cancer data set (bottom row, N = 70 patients). Left column: the KM estimator for the overall survival probability (including end-of-trial censoring events). Middle two columns: KM estimators of cause-specific survival probabilities (assuming independence of risks), for the primary risk 1 (cancer onset for PC patients monitored from age 50 onwards, cancer recurrence for BC patients following a primary tumour at time t = 0) and for risk 2 (other deaths). Right column: KM estimator of the survival probability of the end-of-trial risk, which tells us about the distribution of end-of-trial censoring times. We see that prostate cancer risk increases with time (the slope of Ŝ_1^{KM}(t) becomes more negative with age), and that the recurrence risk for breast cancer patients decreases with time (the slope of Ŝ_1^{KM}(t) gets less negative over time).
In view of the interpretation and the underlying assumptions of the KM curves, we should expect that Ŝ^{KM}(t) = Π_µ Ŝ_µ^{KM}(t). This is indeed true (within the orders of accuracy considered in the derivation of the formulae):
Π_µ Ŝ_µ^{KM}(t) = Π_µ Π_{ℓ, t_ℓ≤t} [1 − D_µ(t_ℓ)/R(t_ℓ)] = Π_{ℓ, t_ℓ≤t} Π_µ [1 − D_µ(t_ℓ)/R(t_ℓ)]   (144)
If there are no ties in timing, i.e. all events happen at distinct times, then D_µ(t_ℓ), D(t_ℓ) ∈ {0,1} and D_µ(t_ℓ) = D(t_ℓ) δ_{µ,µ(t_ℓ)}, with µ(t_ℓ) denoting the type of event observed at time t_ℓ. We then immediately get
Π_µ Ŝ_µ^{KM}(t) = Π_{ℓ, t_ℓ≤t} [1 − D(t_ℓ)/R(t_ℓ)] = Ŝ^{KM}(t)   (145)
If there are ties, the identity is true to the relevant orders, since
Π_µ Ŝ_µ^{KM}(t) = Π_{ℓ, t_ℓ≤t} Π_µ e^{−D_µ(t_ℓ)/R(t_ℓ) + O(D_µ²(t_ℓ)/R²(t_ℓ))} = Π_{ℓ, t_ℓ≤t} e^{−D(t_ℓ)/R(t_ℓ) + O(Σ_µ [D_µ(t_ℓ)/R(t_ℓ)]²)}
= Π_{ℓ, t_ℓ≤t} [1 − D(t_ℓ)/R(t_ℓ) + O(D²(t_ℓ)/R²(t_ℓ))] ≈ Ŝ^{KM}(t)   (146)
How to measure cause-specific risk in the presence of risk correlations. We have already seen that the cause-specific survival function Ŝ_µ(t) only estimates the true survival with respect to risk µ in the absence of the other risks if all risks are independent. The same is true for the cause-specific Kaplan-Meier curves, which are just approximations of the Ŝ_µ(t). How then do we measure the cause-specific survival prospects of a cohort when there are correlated risks? Unless we know the joint event time distribution, we have no choice but to return to non-hypothetical survival probabilities, such as the cause-specific incidence functions
F_{iµ}(t) = ∫_0^t dX π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (147)
F_µ(t) = (1/N) Σ_i ∫_0^t dX π_µ^i(X) e^{−Σ_{r=0}^R ∫_0^X ds π_r^i(s)}   (148)
The only limitation is that if we then compare two groups in terms of their risk-µ incidence, we can never be sure whether any differences are due to changes in the event time statistics of risk µ itself, or due to differences in the other risks r ≠ µ, which can influence F_{iµ}(t) or F_µ(t) via censoring (i.e. by changing the probability that type-µ events are the first to occur). This limitation is fundamental, as it involves the joint statistics of timings, which we know cannot be inferred from hazard rates, and therefore cannot be inferred from survival data.
6.4. Examples

Example 1: risk elimination

To get a better feel for the perhaps counter-intuitive statement that removal of one risk affects the hazard rates of the remaining risks, let us work out a simple example where two risks 1 and 2 are both consequences of a third event 3, with an exponentially distributed event time t_3 (which we do not observe, and which in itself has no negative direct consequences). We assume that always t_1 = t_3 + 10 and t_2 = t_3 + 20, so
P(t_1,t_2,t_3) = τ^{−1} e^{−t_3/τ} δ(t_1−t_3−10) δ(t_2−t_3−20)   (149)
Integration over the unknown t_3 gives the event time distribution for the observable events:
P(t_1,t_2) = τ^{−1} ∫_0^∞ dt_3 e^{−t_3/τ} δ(t_1−t_3−10) δ(t_2−t_3−20) = τ^{−1} e^{−(t_1−10)/τ} δ(t_2−t_1−10) θ(t_1−10)   (150)
From this we obtain
S(t_1,t_2) = ∫_{t_1}^∞ ∫_{t_2}^∞ ds_1 ds_2 τ^{−1} e^{−(s_1−10)/τ} δ(s_2−s_1−10) θ(s_1−10)
= τ^{−1} ∫_{max(t_1,10)}^∞ ds_1 e^{−(s_1−10)/τ} θ(s_1+10−t_2)
= τ^{−1} ∫_{max(t_1−10, 0, t_2−20)}^∞ ds e^{−s/τ} = e^{−max(t_1−10, 0, t_2−20)/τ}
= 1 if t_1 < 10 and t_2 < 20;  e^{−(t_1−10)/τ} if t_1 > 10 and t_1 > t_2−10;  e^{−(t_2−20)/τ} if t_2 > 20 and t_1 < t_2−10   (151)
To next calculate the hazard rates via (7) we need S(t_1,t_2) for |t_1−t_2| small, i.e. we need only the first two options in the result above:
t < 10:   π_1(t) = π_2(t) = 0   (152)
t > 10:   π_1(t) = −(∂/∂t) log e^{−(t−10)/τ} = 1/τ,   π_2(t) = 0   (153)
We can understand this: event 2 always happens after event 1, so it will never be observed. Event 1 happens with a constant rate (that of its cause, i.e. of event 3) as soon as t > 10.
Next we disable risk 1. Evidently this means that π_1′(t) = 0. However, event 2 still happens exactly at t_2 = t_3 + 20, but it is no longer preceded by the 'masking' event 1. So we must get
t < 20:   π_1′(t) = π_2′(t) = 0   (154)
t > 20:   π_1′(t) = 0,   π_2′(t) = 1/τ   (155)
Let us also calculate π_2′(t) via the formal route, i.e. formula (125):
π_2′(t) = ∫ ds_1 ds_2 P(s_1,s_2) δ(s_2−t) / ∫ ds_1 ds_2 P(s_1,s_2)[1−θ(t−s_2)] = P(t) / ∫_t^∞ ds P(s)   (156)
in which
P(t) = ∫_0^∞ dt_1 P(t_1,t) = ∫_{10}^∞ dt_1 τ^{−1} e^{−(t_1−10)/τ} δ(t−t_1−10) = τ^{−1} e^{−(t−20)/τ} θ(t−20)   (157)
and so insertion into our formula for π_2′(t) gives indeed
π_2′(t) = θ(t−20) e^{−(t−20)/τ} / ∫_t^∞ ds e^{−(s−20)/τ} θ(s−20) = θ(t−20) e^{−(t−20)/τ} / (τ e^{−(t−20)/τ}) = θ(t−20) τ^{−1}   (158)
Example 2: correlated risks and false protectivity

Let us inspect a previously used distribution for two event times t_1, t_2 ≥ 0, with parameters a, b, τ > 0 and ε ∈ [0,1], assumed to apply at population level:
P(t_1,t_2) = a e^{−at_2} [ε δ(t_1−t_2−τ) + (1−ε) b e^{−bt_1}]   (159)
We note that the first marginal of this distribution is
P(t_1) = ∫_0^∞ dt_2 P(t_1,t_2) = ε ∫_0^∞ dt_2 a e^{−at_2} δ(t_1−t_2−τ) + (1−ε) b e^{−bt_1} ∫_0^∞ dt_2 a e^{−at_2}
= ε θ(t_1−τ) a e^{−a(t_1−τ)} + (1−ε) b e^{−bt_1}   (160)
We also note that in the above distribution the two event times are generally correlated:
⟨t_1t_2⟩ − ⟨t_1⟩⟨t_2⟩ = ∫ dt_2 P(t_2) t_2 ∫ dt_1 P(t_1|t_2) t_1 − (∫ dt_2 P(t_2) ∫ dt_1 P(t_1|t_2) t_1)(∫ dt_2 P(t_2) t_2)
= ∫ dt_2 P(t_2) t_2 (ε(t_2+τ) + (1−ε)/b) − (1/a) ∫ dt_2 P(t_2) (ε(t_2+τ) + (1−ε)/b)
= ε(⟨t_2²⟩ + τ/a) + (1−ε)/(ab) − ε(1/a² + τ/a) − (1−ε)/(ab) = ε(⟨t_2²⟩ − 1/a²)   (161)
The remaining average is
⟨t_2²⟩ = ∫_0^∞ dt t² a e^{−at} = a (d²/da²) ∫_0^∞ dt e^{−at} = a (d²/da²)(1/a) = 2/a²   (162)
Hence
⟨t_1t_2⟩ − ⟨t_1⟩⟨t_2⟩ = ε/a²   (163)
We conclude that in the above distribution the two times are positively correlated as soon as ε > 0.
Let us calculate the actual cause-specific hazard rate π_1′(t) that would be found if risk 2 were disabled. This rate is given in (125), which here simplifies to
π_1′(t) = ∫ dt_1 dt_2 P(t_1,t_2) δ(t_1−t) / ∫ dt_1 dt_2 P(t_1,t_2) θ(t_1−t) = P(t) / ∫_t^∞ dt_1 P(t_1)
= [ε θ(t−τ) a e^{−a(t−τ)} + (1−ε) b e^{−bt}] / [ε ∫_{max{t,τ}}^∞ dt_1 a e^{−a(t_1−τ)} + (1−ε) ∫_t^∞ dt_1 b e^{−bt_1}]
= [ε θ(t−τ) a e^{−a(t−τ)} + (1−ε) b e^{−bt}] / [ε e^{−a max{t−τ,0}} + (1−ε) e^{−bt}]   (164)
Hence
t < τ:   π_1′(t) = (1−ε) b e^{−bt} / (ε + (1−ε)e^{−bt}) = −(d/dt) log[ε + (1−ε)e^{−bt}]   (165)
t > τ:   π_1′(t) = (ε a e^{−a(t−τ)} + (1−ε) b e^{−bt}) / (ε e^{−a(t−τ)} + (1−ε)e^{−bt}) = −(d/dt) log[ε e^{−a(t−τ)} + (1−ε)e^{−bt}]   (166)
We can now immediately read off the true survival function S_1(t) = exp[−∫_0^t ds π_1′(s)] for risk 1 that would correspond to a world where risk 2 was disabled:
t < τ:   S_1(t) = ε + (1−ε)e^{−bt}   (167)
t > τ:   S_1(t) = ε e^{−a(t−τ)} + (1−ε)e^{−bt}   (168)
Next we want to compare this result with the estimator Ŝ_1(t) in (134), of which the risk-1 KM curve Ŝ_1^{KM}(t) is an approximation: it aims to describe the survival statistics for risk 1 alone, but its derivation relied on assuming independence of the event times. To do this we generate N time pairs (t_1^i, t_2^i) from the distribution P(t_1,t_2), and define the corresponding survival data D = {(X_1,∆_1),...,(X_N,∆_N)}, where
(∀i = 1...N):   X_i = min{t_1^i, t_2^i},   ∆_i = 1 if t_1^i < t_2^i, and ∆_i = 2 if t_2^i < t_1^i   (169)
It is convenient to rewrite Ŝ_1(t) first as
Ŝ_1(t) = exp{−Σ_i δ_{∆_i,1} θ(t−X_i) / Σ_j θ(X_j−X_i)} = exp{−(1/N) Σ_i δ_{∆_i,1} θ(t−X_i) / [(1/N) Σ_j θ(X_j−X_i)]}
= exp{−Σ_{∆=1}^2 ∫_0^∞ dX P̂(X,∆) δ_{∆,1} θ(t−X) / Σ_{∆′=1}^2 ∫_0^∞ dX′ P̂(X′,∆′) θ(X′−X)}
= exp{−∫_0^t dX P̂(X,1) / [∫_X^∞ dX′ P̂(X′,1) + ∫_X^∞ dX′ P̂(X′,2)]}   (170)
Here P̂(X,∆) is the empirical joint distribution of reported event times and event types, i.e.
P̂(X,∆) = (1/N) Σ_i δ_{∆,∆_i} δ(X−X_i)   (171)
For sufficiently large cohorts, i.e. for N→∞, and given that our 'patients' were generated independently, the law of large numbers guarantees that P̂(X,∆) will converge to the true distribution P(X,∆) defined in (15):
P(X,∆) = π_∆(X) e^{−∫_0^X ds π_1(s) − ∫_0^X ds π_2(s)}   (172)
We have already calculated the cause-specific hazard rates π_{1,2}(t) and their time integrals for our present example, see (39,40), which resulted in
π_1(t) = b(1−ε)e^{−bt} / (ε + (1−ε)e^{−bt}),   ∫_0^t ds π_1(s) = −log(ε + (1−ε)e^{−bt})   (173)
π_2(t) = a,   ∫_0^t ds π_2(s) = at   (174)
Hence we find
P(X,1) = [b(1−ε)e^{−bX} / (ε + (1−ε)e^{−bX})] (ε + (1−ε)e^{−bX}) e^{−aX} = b(1−ε)e^{−(a+b)X}   (175)
P(X,2) = a(ε + (1−ε)e^{−bX}) e^{−aX} = aε e^{−aX} + a(1−ε)e^{−(a+b)X}   (176)
Hence for N→∞ our estimator Ŝ_1(t) will report
Ŝ_1(t) = exp{−∫_0^t dX P(X,1) / [∫_X^∞ dX′ P(X′,1) + ∫_X^∞ dX′ P(X′,2)]}
= exp{−∫_0^t dX b(1−ε)e^{−(a+b)X} / [(a+b)(1−ε) ∫_X^∞ ds e^{−(a+b)s} + aε ∫_X^∞ ds e^{−as}]}
= exp{−∫_0^t dX b(1−ε)e^{−(a+b)X} / [(1−ε)e^{−(a+b)X} + ε e^{−aX}]}
= exp{−∫_0^t dX b(1−ε)e^{−bX} / [(1−ε)e^{−bX} + ε]}
= e^{−∫_0^t dX π_1(X)} = ε + (1−ε)e^{−bt}   (177)

Figure 4. We compare the estimators Ŝ_1(t) (dashed, left) and Ŝ_1^{KM}(t) (dashed, right) for the survival function of risk 1 to the true survival function S_1(t) for risk 1 (solid) that would be found if risk 2 were disabled. The joint event times are assumed to be distributed according to the example (159), with parameter values a = b = τ = 1. The curves correspond to formulae (167,168) and (177,142). The three Kaplan-Meier curves on the right were calculated from N = 1000 synthetic patient data (X_i,∆_i), generated according to (169). Since the two risks in this example are positively correlated, the KM estimator Ŝ_1^{KM}(t) and its precursor Ŝ_1(t) (both of which assume there are no correlations between the two risks) underestimate the severity of risk 1. This effect is called 'false protectivity due to competing risks'.
Comparison with the true survival function (167,168) for risk 1, describing correctly the world where risk 2 is disabled, shows that our estimator is only correct for short times. See figure 4 for example curves, corresponding to ε ∈ {1/4, 1/2, 3/4}. As soon as t > τ (and provided ε > 0, so there are indeed event time correlations), the estimator Ŝ_1(t) and the Kaplan-Meier curve Ŝ_1^{KM}(t) both grossly over-estimate the survival probability for risk 1. This effect, called 'false protectivity', is entirely a consequence of the fact that KM-type estimators neglect risk correlations. In our present example the two times are positively correlated, so we know that high-risk individuals with respect to event type 2 tend also to be high-risk with respect to event type 1. The early events of type 2 are therefore more likely to 'filter out' those individuals that would also have given early type-1 events. Event-2 censoring thereby changes the composition of the population over time, increasing the fraction of individuals with lower type-1 risk.

An example of real medical data affected by competing risks are those of figure 1. The corresponding Kaplan-Meier curves are shown in figure 5, or rather the incidence estimator 1 − Ŝ_1^{KM}(t), which estimates the probability of having experienced event type 1 (here: prostate cancer) as a function of time. The curves are shown for different groups: smokers, ex-smokers, and non-smokers. The result suggests a preventative effect of smoking with respect to prostate cancer. In fact, a more careful analysis reveals that this effect is not real, but caused by the false protectivity effect of lung cancer: those of the smokers who did not get lung cancer by the time they reach the age of 75 are inherently more robust, and therefore also less likely to get prostate cancer.

Figure 5. The incidence estimator 1 − Ŝ_1^{KM}(t) for prostate cancer, calculated for three subgroups of a population of males. This illustrates the false protectivity effect. Here smoking seems to have a preventative effect on prostate cancer, but this is in fact caused by correlations between the risk of prostate cancer and the risk of lung cancer; in the presence of such correlations the Kaplan-Meier estimators can no longer be trusted.
Note that the deviation between the true S_1(t) and the estimator Ŝ_1(t) could also work in the opposite direction. If our two risks had been negatively correlated, then risk-2 events would have been more likely to filter out individuals with low type-1 risk; we would then have found our estimators Ŝ_1(t) and Ŝ_1^{KM}(t) under-estimating the risk-1 survival probability. Risk-specific Kaplan-Meier curves were not designed for, and should therefore not be used in, a context where different risks may be correlated.
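The false protectivity mechanism is easy to reproduce numerically. The Python sketch below (assuming a = b = τ = 1 and ε = 1/2) generates data according to (169) from the distribution (159), computes the risk-1 product-limit estimate, and compares it with the true disabled-risk-2 survival function (167,168):

    import numpy as np

    rng = np.random.default_rng(4)
    a = b = tau = 1.0; eps = 0.5; N = 100_000

    t2 = rng.exponential(1/a, N)
    dep = rng.random(N) < eps                      # correlated branch of (159)
    t1 = np.where(dep, t2 + tau, rng.exponential(1/b, N))
    X, D = np.minimum(t1, t2), np.where(t1 < t2, 1, 2)   # eq (169)

    def S1_true(t):                                # eqs (167,168)
        return (eps + (1-eps)*np.exp(-b*t) if t < tau
                else eps*np.exp(-a*(t-tau)) + (1-eps)*np.exp(-b*t))

    order = np.argsort(X)
    Xs, Ds = X[order], D[order]
    atrisk = N - np.arange(N)                      # nr with X >= Xs[i]
    factors = 1 - (Ds == 1)/atrisk                 # KM factor per event time
    for t in (0.5, 2.0, 4.0):
        km = np.prod(factors[Xs <= t])
        print(t, round(km, 3), round(float(S1_true(t)), 3))

For t > τ the product-limit estimate stays close to ε + (1−ε)e^{−bt}, as predicted by (177), while the true survival probability is much lower: the estimator is falsely protective.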
7. Including covariates

All survival probabilities and data likelihoods considered so far were dependent only upon the cause-specific hazard rates π = {π_0,...,π_R}. Let us emphasise this in our notation, and write
S(t|π) = e^{−Σ_r ∫_0^t ds π_r(s)}   (178)
P(X,∆|π) = π_∆(X) e^{−Σ_r ∫_0^X ds π_r(s)}   (179)
Note that with these conventions we can also write survival functions and data probabilities at the level of individuals simply as S_i(t) = S(t|π^i) and P_i(X,∆) = P(X,∆|π^i). If we want to predict survival for individuals on which we have further information, in the form of the values of covariates Z = (Z_1,...,Z_p), then we would want to use this information. There are two distinct ways to do this, both of which are valid, internally consistent and correct, but they differ in strategy.
7.1. Definition via covariate sub-cohorts

How to relate covariates to prediction. Let us assume for simplicity that our covariates are discrete. We can then define a sub-cohort Ω_Z ⊆ {1,...,N}, consisting of those individuals i that have covariates Z_i = Z, and apply the analysis in section 5 of the link between individual-level descriptions and population-level descriptions to this sub-cohort. Ω_Z will be characterised by some cohort-level cause-specific hazard rates π(Z), which are related to the individual cause-specific hazard rates via (102), which here takes the form
π_µ(t|Z) = Σ_{i∈Ω_Z} π_µ^i(t) e^{−Σ_r ∫_0^t ds π_r^i(s)} / Σ_{i∈Ω_Z} e^{−Σ_r ∫_0^t ds π_r^i(s)}   (180)
All our previous analysis linking cohort-level to individual-level quantities applies to Ω_Z, so we can immediately write down the formulas for the probability S(t|π(Z)) that a randomly drawn individual from Ω_Z will be alive at time t, and for the likelihood per unit time P(X,∆|π(Z)) that a randomly drawn individual from Ω_Z will report an event of type ∆ at time X:
S(t|π(Z)) = e^{−Σ_r ∫_0^t ds π_r(s|Z)}   (181)
P(X,∆|π(Z)) = π_∆(X|Z) e^{−Σ_r ∫_0^X ds π_r(s|Z)}   (182)
with the generic definitions (178) and (179). Similarly, we find identity (101) translated into
π_µ(t|Z) S(t|π(Z)) = (1/|Ω_Z|) Σ_{i∈Ω_Z} π_µ^i(t) S(t|π^i)   (183)
with |Ω_Z| = Σ_{i∈Ω_Z} 1. We think strictly in terms of cohort-level cause-specific hazard rates π(Z), which by definition depend solely and uniquely on Z, as opposed to individual-level cause-specific hazard rates π^i (which can and generally will vary within the sub-cohort Ω_Z).
Estimation of π(Z) from the data, and Bayesian prediction. Within the sub-cohort picture, the Bayesian estimation of π(Z) is straightforward. To distinguish between the π (cause-specific hazard rates, being functions of time but without limiting oneself to individuals with specific covariates) and the time- and covariate-dependent rates π(Z), let us write the latter as π^⋆ when used as arguments in probability distributions. In Bayesian estimation we would simply write
𝒫(π^⋆|D) = 𝒫(D|π^⋆) 𝒫(π^⋆) / ∫ dπ^{⋆′} 𝒫(D|π^{⋆′}) 𝒫(π^{⋆′})   (184)
𝒫(D|π^⋆) = Π_{i=1}^N P(X_i,∆_i|π^⋆(Z_i)) = Π_{i=1}^N π^⋆_{∆_i}(X_i|Z_i) e^{−Σ_r ∫_0^{X_i} dt π_r^⋆(t|Z_i)}   (185)
Here 𝒫(π^⋆) is a distribution that codes for any prior knowledge we have on the relation π^⋆(Z) (including applicable constraints). Fully Bayesian prediction, taking into account our limited certainty about whether we have extracted the correct π^⋆(Z) from the data D, would become
S(t|Z,D) = ∫ dπ^⋆ 𝒫(π^⋆|D) S(t|π^⋆(Z))   (186)
P(X,∆|Z,D) = ∫ dπ^⋆ 𝒫(π^⋆|D) P(X,∆|π^⋆(Z))   (187)
Most probable covariate-to-rates relation. The most probable function π^⋆(Z) is the one that maximises 𝒫(π^⋆|D) in (184), i.e. that maximises (up to an irrelevant π^⋆-independent constant)
log 𝒫(π^⋆|D) = log 𝒫(D|π^⋆) + log 𝒫(π^⋆)
= Σ_{i=1}^N log{π^⋆_{∆_i}(X_i|Z_i) e^{−Σ_r ∫_0^{X_i} dt π_r^⋆(t|Z_i)}} + log 𝒫(π^⋆)
= Σ_{i=1}^N log π^⋆_{∆_i}(X_i|Z_i) − Σ_{i=1}^N Σ_r ∫_0^{X_i} dt π_r^⋆(t|Z_i) + log 𝒫(π^⋆)
= Σ_r {Σ_{i=1}^N δ_{r,∆_i} log π_r^⋆(X_i|Z_i) − Σ_{i=1}^N ∫_0^{X_i} dt π_r^⋆(t|Z_i)} + log 𝒫(π^⋆)   (188)
Unless we have prior evidence that suggests we should couple risks, we should use the maximum entropy prior, which is of the form 𝒫(π^⋆) = Π_r 𝒫(π_r^⋆). In that case the posterior 𝒫(π^⋆|D) factorises fully over risks, and hence
log 𝒫(π^⋆|D) = Σ_r log 𝒫(π_r^⋆|D)   (189)
log 𝒫(π_r^⋆|D) = Σ_{i=1}^N δ_{r,∆_i} log π_r^⋆(X_i|Z_i) − Σ_{i=1}^N ∫_0^{X_i} dt π_r^⋆(t|Z_i) + log 𝒫(π_r^⋆)   (190)
We see that the functions π_r^⋆(t|Z) for different risks r are calculated from disconnected maximisation problems. However, this does not mean that we can simply forget about the other risks r ≠ 1: the risks could still be correlated, so eliminating competing risks can still impact upon the primary hazard rate π_1^⋆(t|Z). Only with the further assumption of uncorrelated risks can we take π_1^⋆(t|Z) as a correct measure of risk in a world where only risk 1 can materialise.
Information-theoretic interpretation. There is a nice interpretation of what the above formulae are effectively doing. To show this we first need to define the empirical covariate distribution and the empirical covariate-conditioned data distribution:
P̂(Z) = (1/N) Σ_i δ(Z−Z_i),   P̂(t,r|Z) = Σ_i δ(t−X_i) δ_{r,∆_i} δ(Z−Z_i) / Σ_i δ(Z−Z_i)   (191)
From (184,185) we obtain, using the definitions (191):
(1/N) log 𝒫(π^⋆|D) = (1/N) Σ_{i=1}^N log P(X_i,∆_i|π^⋆(Z_i)) + (1/N) log 𝒫(π^⋆) + constant
= ∫ dZ Σ_r {(1/N) Σ_{i=1}^N δ(Z−Z_i) δ_{r,∆_i} ∫_0^∞ dt δ(t−X_i) log P(t,r|π^⋆(Z))} + (1/N) log 𝒫(π^⋆) + constant
= ∫ dZ P̂(Z) Σ_r ∫_0^∞ dt P̂(t,r|Z) log P(t,r|π^⋆(Z)) + (1/N) log 𝒫(π^⋆) + constant
= −∫ dZ P̂(Z) Σ_r ∫_0^∞ dt P̂(t,r|Z) log[P̂(t,r|Z) / P(t,r|π^⋆(Z))] + (1/N) log 𝒫(π^⋆) + Constant   (192)
Apart from the regularising influence of the prior 𝒫(π^⋆), the most probable function π^⋆ is apparently the one that minimises the Z-averaged Kullback-Leibler distance between the empirical covariate-conditioned distribution P̂(t,r|Z) and its theoretical expectation P(t,r|π^⋆(Z)).
7.2. Definition by conditioning individual hazard rates on covariates

How to relate covariates to prediction. The second approach to bringing in covariate information is formulated in terms of the individual cause-specific hazard rates. We regard individual covariates as predictors of individual cause-specific hazard rates, which in turn predict individual survival:
Z_i → predicts → π^i → predicts → (X_i,∆_i)
The question then is how to formalise this. If we are given the values Z of the covariates of an individual, then the survival probability and data likelihood for that individual, conditional on knowing their covariates to be Z, can be written as
S(t|Z,W) = ∫ dπ W(π|Z) S(t|π)   (193)
P(X,∆|Z,W) = ∫ dπ W(π|Z) P(X,∆|π)   (194)
Here ∫dπ represents functional integration over the values of all hazard rates at all times, subject to π_r(t) ≥ 0 for all (t,r), and W(π|Z) gives the probability that a randomly drawn individual with covariates Z will have individual cause-specific hazard rates π.
The distribution W(π|Z) depends strictly on the degree to which Z is informative of π, i.e. on biochemistry. If Z is very informative, then W(π|Z) will be very narrow, and will point us to a very small set of cause-specific hazard rates compatible with observing covariates Z. One cannot conclude that any patterns linking Z to π, embodied in W(π|Z), are causal. For instance, π and (components of) Z could both be (partially) effects of a common cause Y. W(π|Z) only answers the question: if one knows Z for an individual, what does this tell us about his/her π?
Estimation of the covariates-to-risk connection. To proceed we need to estimate the unknown distribution W(π|Z), from analysis of the complete data D = {(X_1,∆_1;Z_1),...,(X_N,∆_N;Z_N)} (i.e. the survival data plus the covariates of all patients). For infinitely large cohorts we would expect W(π|Z) to become identical to the empirical frequency Ŵ(π|Z) with which the hazard rates π are observed among the individuals with covariates Z:
W(π|Z) = lim_{N→∞} Ŵ(π|Z)   (195)
Ŵ(π|Z) = Σ_i δ(π−π^i) δ(Z−Z_i) / Σ_i δ(Z−Z_i)   (196)
For finite data sets we will only be able to say how likely each possible function W(π|Z) is, in the light of the data D: we will calculate 𝒫(W|D), where W is the conditional distribution W(π|Z). If we assume that all patients in D are independently drawn from a given population, the standard Bayesian formula P(A|B) = P(B|A)P(A)/P(B) tells us that
𝒫(W|D) = 𝒫(D|W) 𝒫(W) / ∫ dW′ 𝒫(D|W′) 𝒫(W′)   (197)
with
𝒫(D|W) = Π_{i=1}^N P(X_i,∆_i|Z_i,W) = Π_{i=1}^N ∫ dπ W(π|Z_i) P(X_i,∆_i|π)   (198)
Here ∫dW denotes functional integration over all distributions W(π|Z), subject to the constraint ∫dπ W(π|Z) = 1 for all Z. The survival prediction formulae for an individual with covariates Z will then be
S(t|Z,D) = ∫ dW 𝒫(W|D) S(t|Z,W) = ∫ dW 𝒫(W|D) ∫ dπ W(π|Z) S(t|π)   (199)
(the factor 𝒫(W|D) giving the likelihood that W is right, and S(t|Z,W) the survival prediction given W), and
P(X,∆|Z,D) = ∫ dW 𝒫(W|D) P(X,∆|Z,W) = ∫ dW 𝒫(W|D) ∫ dπ W(π|Z) P(X,∆|π)   (200)
Equivalently:
S(t|Z,D) = ∫ dπ W(π|Z,D) S(t|π)   (201)
P(X,∆|Z,D) = ∫ dπ W(π|Z,D) P(X,∆|π)   (202)
with
W(π|Z,D) = ∫ dW 𝒫(W|D) W(π|Z)   (203)
The distribution W(π|Z,D) combines two sources of uncertainty: (i) uncertainty in the individual hazard rates π given an individual's covariates Z (coded in W(π|Z)), and (ii) our ignorance about which is the true relation W(π|Z), given the data D (described by 𝒫(W|D)). The first uncertainty can be reduced by using more informative covariates, the second by acquiring more data.
Information-theoretic interpretation. Again there exists a nice information-theoretic interpretation of our Bayesian formulae, since
(1/N) log 𝒫(W|D) = (1/N) Σ_{i=1}^N log ∫ dπ W(π|Z_i) P(X_i,∆_i|π) + (1/N) log 𝒫(W) + constant
= Σ_r (1/N) Σ_{i=1}^N ∫ dZ δ(Z−Z_i) ∫_0^∞ dt δ(t−X_i) δ_{r,∆_i} log ∫ dπ W(π|Z) P(t,r|π) + (1/N) log 𝒫(W) + constant
= ∫ dZ P̂(Z) Σ_r ∫_0^∞ dt P̂(t,r|Z) log ∫ dπ W(π|Z) P(t,r|π) + (1/N) log 𝒫(W) + constant   (204)
We can rewrite (204) in terms of a Kullback-Leibler distance:
(1/N) log 𝒫(W|D) = −∫ dZ P̂(Z) Σ_r ∫_0^∞ dt P̂(t,r|Z) log[P̂(t,r|Z) / ∫ dπ W(π|Z) P(t,r|π)] + (1/N) log 𝒫(W) + Constant   (205)
(in which Constant is a new constant that differs from the previous one by a further W-independent term, namely the Z-averaged Shannon entropy of P̂(t,r|Z)). So we see that, apart from the regularising influence of the prior 𝒫(W), the most probable function W(π|Z) is the one that minimises the Z-averaged Kullback-Leibler distance between the empirical distribution P̂(t,r|Z) and its theoretical expectation P(t,r|Z) = ∫ dπ W(π|Z) P(t,r|π).
7.3. Connection between the conditioning picture and the sub-cohort picture

Conditioned cohort-level hazard rates in terms of Ŵ. It is instructive to inspect the relation between the two routes for incorporating covariates into survival prediction in more detail. We note that (180) can be written in terms of the empirical estimator Ŵ(π|Z) given in (196):
π_µ(t|Z) = ∫ dπ Σ_i δ_{Z,Z_i} δ(π−π^i) π_µ(t) e^{−Σ_r ∫_0^t ds π_r(s)} / ∫ dπ Σ_i δ_{Z,Z_i} δ(π−π^i) e^{−Σ_r ∫_0^t ds π_r(s)}
= ∫ dπ Ŵ(π|Z) π_µ(t) e^{−Σ_r ∫_0^t ds π_r(s)} / ∫ dπ Ŵ(π|Z) e^{−Σ_r ∫_0^t ds π_r(s)}   (206)
Relation in terms of prediction. We note the difference between the definitions of S(t|π(Z)) and S(t|Z,W). The first gives the survival probability for a randomly drawn individual from the data set with covariates Z; the second gives the survival probability for a randomly drawn individual with covariates Z (not necessarily from the data set). Any finite-size imperfections of the data set will affect S(t|π(Z)). Within the sub-cohort picture we find, using (183) and (180):
S(t|π(Z)) = (1/|Ω_Z|) Σ_{i∈Ω_Z} π_µ^i(t) S(t|π^i) / π_µ(t|Z) = (1/|Ω_Z|) Σ_{i∈Ω_Z} π_µ^i(t) e^{−Σ_r ∫_0^t ds π_r^i(s)} / π_µ(t|Z)
= ((1/|Ω_Z|) Σ_{i∈Ω_Z} π_µ^i(t) e^{−Σ_r ∫_0^t ds π_r^i(s)}) × (Σ_{i∈Ω_Z} e^{−Σ_r ∫_0^t ds π_r^i(s)} / Σ_{i∈Ω_Z} π_µ^i(t) e^{−Σ_r ∫_0^t ds π_r^i(s)})
= (1/|Ω_Z|) Σ_{i∈Ω_Z} e^{−Σ_r ∫_0^t ds π_r^i(s)}
= ∫ dπ (Σ_{i∈Ω_Z} δ(π−π^i) / Σ_{i∈Ω_Z} 1) e^{−Σ_r ∫_0^t ds π_r(s)}
= ∫ dπ Ŵ(π|Z) S(t|π) = S(t|Z,Ŵ)   (207)
Similarly we can connect the expressions P(X,∆|π(Z)) and P(X,∆|Z,W). Starting from the sub-cohort picture we get
P(X,∆|π(Z)) = π_∆(X|Z) S(X|π(Z)) = (1/|Ω_Z|) Σ_{i∈Ω_Z} π_∆^i(X) S(X|π^i)
= ∫ dπ ((1/|Ω_Z|) Σ_{i∈Ω_Z} δ(π−π^i)) π_∆(X) S(X|π)
= ∫ dπ Ŵ(π|Z) π_∆(X) S(X|π) = P(X,∆|Z,Ŵ)   (208)
There is no contradiction between the two approaches; they just focus on different quantities. In the conditioning picture we capture the variability in the connection Z → π in a distribution W(π|Z), where π refers to the cause-specific hazard rates of individuals. In the sub-cohort picture we describe the variability in the connection Z → π via sub-cohort level cause-specific hazard rates π(Z). In both cases we still need to estimate this variability from the data.
7.4. Conditionally homogeneous cohorts

Trivial versus nontrivial heterogeneity. There are two types of cohort heterogeneity. The trivial one is heterogeneity in covariates, meaning that the Z_i are not identical for all individuals i. We always allow for this by default; it would be silly to include covariates and then assume they take identical values for all individuals (as they would then give no information). The nontrivial type of heterogeneity refers to the link between covariates and risks. A conditionally homogeneous cohort is one in which the cause-specific hazard rates are identical for all individuals i with identical covariates Z_i.

If we work within the sub-cohort picture, formulated in terms of the sub-cohort level cause-specific hazard rates π(Z) of individuals with covariates Z, we need not make any statements on the presence or absence of covariate-to-risk heterogeneity. In either case we simply calculate survival statistics and data likelihoods for any individual with covariates Z via
S(t|π(Z)) = e^{−Σ_r ∫_0^t ds π_r(s|Z)}   (209)
P(X,∆|π(Z)) = π_∆(X|Z) e^{−Σ_r ∫_0^X ds π_r(s|Z)}   (210)
We can get these equations also starting from the conditioning picture, but there we need conditional cohort homogeneity, i.e. W(π|Z) = δ[π−π(Z)], with π(Z) now representing the individual cause-specific hazard rates of individuals with covariates Z. The risk statistics of each individual with covariates Z are then described by the cause-specific hazard rates π(Z). Due to the conditional homogeneity these are trivially identical to the cohort-level hazard rates, and hence (209,210) again hold. Also the Bayesian estimation of π(Z), which involves evaluation of the quantity P(X,∆|π(Z)) for the available data points (X_i,∆_i), would proceed identically in both approaches, but there would be different interpretations of why we write all this and of what π(Z) means:

• conditioning picture: the cohort is taken to be homogeneous in the covariate-to-risk patterns, and we assume that all individuals with covariates Z have individual hazard rates π(Z) (capturing covariate-to-risk heterogeneity is not possible because we assumed there isn't any)
• sub-cohort picture: we make no assumptions regarding homogeneity or heterogeneity, but we assume that π(Z) represents sub-cohort level hazard rates (capturing covariate-to-risk heterogeneity is not possible because we lack the information)

Figure 6. Description of general and of conditionally homogeneous cohorts, within the framework where covariate information is used to condition the probability for individuals to have individual cause-specific hazard rates π. [The figure consists of two nested boxes: the outer box, 'all models', has covariate-to-risk connection W(π|Z), estimated from the data via 𝒫(W|D); the inner box, 'conditionally homogeneous cohorts', has W(π|Z) = δ[π−π^⋆(Z)], with covariate-to-risk connection π^⋆(Z), estimated from the data via 𝒫(π^⋆|D).] In conditionally homogeneous cohorts all cause-specific hazard rates are fully determined by the covariates; there are no 'hidden' covariates that impact upon risk. The remaining uncertainty lies only in our limited ability to infer the function π^⋆(Z) from the data (this uncertainty is described by 𝒫(π^⋆|D)).
The difference between the two interpretations will become relevant when we start thinking about how to capture covariate-to-risk heterogeneity. Then the most suitable starting point will be the conditioning picture, since W(π|Z) is defined in terms of individual cause-specific hazard rates.
Within the conditioning picture, the assumption of cohort homogeneity must be brought in via the prior 𝒫(W) in formula (197), by choosing
𝒫(W) = ∫ dπ^⋆ δ[W − W(π^⋆)] 𝒫(π^⋆)   (211)
Here W(π^⋆) is the covariates-to-rates distribution of a conditionally homogeneous cohort with W(π|Z) = δ[π − π^⋆(Z)], and formula (197) becomes
𝒫(W|D) = 𝒫(D|W) ∫ dπ^⋆ δ[W − W(π^⋆)] 𝒫(π^⋆) / ∫ dW′ 𝒫(D|W′) ∫ dπ^⋆ δ[W′ − W(π^⋆)] 𝒫(π^⋆)
= 𝒫(D|W(π^⋆)) ∫ dπ^⋆ δ[W − W(π^⋆)] 𝒫(π^⋆) / ∫ dπ^⋆ 𝒫(D|W(π^⋆)) 𝒫(π^⋆)
= ∫ dπ^⋆ δ[W − W(π^⋆)] 𝒫(π^⋆|D)   (212)
With (212) our earlier prediction formula for the conditioning picture simplifies to
W(π|Z,D) = ∫ dπ^⋆ 𝒫(π^⋆|D) δ[π − π^⋆(Z)]   (213)
which, in turn, leads us directly to (186,187).
7.5. Nonparametrised determination of covariates-to-risk connection
If we do not wish to take into account our uncertainty regarding the covariates-to-risk relations
π?(t|Z), we could turn to the simple recipe of the maximum likelihood estimator. We saw earlier
that this means finding the maximum of logP(π?|D), but with a flat prior P(π?); equivalently,
maximising logP(D|π?). A flat prior also factorises trivially over risks, so we find the disconnected
maximisation problems (190), with constant functions P(π?r ). Hence the maximum likelihood
estimator πr(t|Z) is found by maximising
L(D|π?r ) =N∑i=1
δr,∆i log π?r (Xi|Zi)−N∑i=1
∫ Xi
0dt π?r (t|Zi)
=N∑i=1
∫ ∞0
dtδ(t−Xi)δr,∆i log π?r (t|Zi)− θ(Xi−t)π?r (t|Zi)
=
∫ ∞0
dtN∑i=1
∫dZ δ(Z−Zi)
δ(t−Xi)δr,∆i log π?r (t|Z)− θ(Xi−t)π?r (t|Z)
=
∫ ∞0
dt
∫dZ
log π?r (t|Z)
N∑i=1
δ(Z−Zi)δ(t−Xi)δr,∆i
−π?r (t|Z)N∑i=1
δ(Z−Zi)θ(Xi−t)
(214)
Straigthforward functional differentiation of this latter expression gives:
(∀Z)(∀t ≥ 0) :1
π?r (t|Z)
N∑i=1
δ(Z−Zi)δ(t−Xi)δr,∆i =N∑i=1
δ(Z−Zi)θ(Xi−t) (215)
which gives us the maximum likelihood estimator
(∀r)(∀t) : πr(t|Z) =
∑i δ(Z−Zi)δr,∆iδ(t−Xi)∑i δ(Z−Zi)θ(Xi−t)
(216)
This is very similar to the earlier estimator (216), but now we find the sums over individuals
restricted to those with covariates Z. Although formally correct, expressions such as (216) are in
practice rather useless. The problem is that we are here estimating functions of p + 1 arguments.
Even if we we reduce our ambition and ask for just five or so points per dimension (a rather small
number), and we have e.g. p = 5 covariates (a modest number), we would still already need in
excess of 5p+1 = 15,625 data points to start covering the space of all (Z, t) combinations. If we
want in addition to estimate values of π?r (t|Z) with, say, 10% accuracy, we need to multiply the
number if data points needed further by a factor 100.
We conclude that, even for conditionally homogeneous cohorts, we have no choice but to find
suitable parametrisations of the functions π⋆_r(t|Z), i.e. we will propose a specific sensible formula for
π⋆_r(t|Z) with a modest number of free parameters, and use the data to estimate these parameters.
This is the idea behind Cox regression.
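Concretely, one can discretise (216): replace the δ-functions by event counts per small time bin among the individuals at risk who share the covariate value of interest. The sketch below (with an illustrative binary covariate and simulated exponential event times; all names and numbers are our own choices) shows both why the estimator is usable for a single discrete covariate and why it collapses in higher dimensions: with several real-valued covariates the selection `Z == z` would almost never match any individuals.

```python
import numpy as np

def raw_hazard_estimate(X, delta, Z, z, bins):
    """Binned version of estimator (216) for one cause: events per time bin
    over person-bins at risk, restricted to individuals with covariate z."""
    mask = (Z == z)
    rate = np.zeros(len(bins) - 1)
    for k in range(len(bins) - 1):
        t0, t1 = bins[k], bins[k + 1]
        events = np.sum(mask & (delta == 1) & (X >= t0) & (X < t1))
        at_risk = np.sum(mask & (X >= t0))      # theta(X_i - t) at bin start
        if at_risk > 0:
            rate[k] = events / (at_risk * (t1 - t0))
    return rate

# toy data: exponential event times, one binary covariate doubling the rate
rng = np.random.default_rng(0)
N = 500
Z = rng.integers(0, 2, N)
X = rng.exponential(1.0 / (1.0 + Z))
delta = np.ones(N, dtype=int)
print(raw_hazard_estimate(X, delta, Z, 1, np.linspace(0, 2, 9)))  # ~2 per bin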
7.6. Examples
Let us get intuition for the effect of Bayesian priors in regression. We saw that nonparametrised
maximisation of (190) with a flat prior P(π⋆_r) gives as the most probable hazard rate a ‘spiky’
estimator (216), with δ-functions at the times where the events in the data set occurred. One does
not expect the real hazard rate to have spikes; this knowledge can be coded into a prior of the form
P(π⋆_r) = (1/C(α)) exp[ −α ∫dZ W(Z) ∫_0^∞ dt ( dπ⋆_r(t|Z)/dt )² ]    (217)
with some normalisation constant C(α). This prior ‘punishes’ explanations with discontinuous
behaviour, while reducing to the flat prior for α→ 0. Let us choose the simplest example, with just
one binary covariate Z_i ∈ {0,1} and just one risk (i.e. ∆_i = 1 for all i, so we can drop the index r).
We make the most natural choice W(Z) = ½δ_{Z,0} + ½δ_{Z,1} in (217). Expression (190) then becomes,
apart from an irrelevant normalisation constant,
logP(π⋆|D) = ∑_{i=1}^N log π⋆(X_i|Z_i) − ∑_{i=1}^N ∫_0^{X_i} dt π⋆(t|Z_i) − ½α ∫_0^∞ dt [ (dπ⋆(t|0)/dt)² + (dπ⋆(t|1)/dt)² ]
= ∑_{i=1}^N δ_{Z_i,0} log π⋆(X_i|0) − ∑_{i=1}^N δ_{Z_i,0} ∫_0^{X_i} dt π⋆(t|0) − ½α ∫_0^∞ dt (dπ⋆(t|0)/dt)²
+ ∑_{i=1}^N δ_{Z_i,1} log π⋆(X_i|1) − ∑_{i=1}^N δ_{Z_i,1} ∫_0^{X_i} dt π⋆(t|1) − ½α ∫_0^∞ dt (dπ⋆(t|1)/dt)²    (218)
The quantity to be maximised has separated into independent expressions, one for π⋆(t|0) and one
for π⋆(t|1). For each Z ∈ {0,1} we have to maximise an expression of the form
L_Z(π) = ∑_{i=1}^N δ_{Z_i,Z} [ log π(X_i|Z) − ∫_0^{X_i} dt π(t|Z) ] − ½α ∫_0^∞ dt ( dπ(t|Z)/dt )²    (219)
To differentiate LZ(π) with respect to π(t|Z) we will use the following identity
δ/δf(t) ∫_0^∞ ds (f′(s))² = 2 ∫_0^∞ ds f′(s) δf′(s)/δf(t)
= 2 lim_{ε→0} (1/ε) ∫_0^∞ ds f′(s) δ/δf(t) [ f(s+ε) − f(s) ]
= 2 lim_{ε→0} (1/ε) [ f′(t−ε) − f′(t) ] = −2f″(t)    (220)
Application to LZ(π) tells us that the most probable function π(t|Z) is to be solved from
π(t|Z) = 0,   or   (1/π(t|Z)) ∑_{i=1}^N δ_{Z_i,Z} δ(t−X_i) − ∑_{i=1}^N δ_{Z_i,Z} θ(X_i−t) + α d²π(t|Z)/dt² = 0    (221)
This can be rewritten in terms of the maximum likelihood estimator
π̂(t|Z) = ∑_i δ_{Z_i,Z} δ(t−X_i) / ∑_i δ_{Z_i,Z} θ(X_i−t)    (222)
as
π(t|Z) = 0,   or   d²π(t|Z)/dt² = (1/α) ( 1 − π̂(t|Z)/π(t|Z) ) ∑_{i=1}^N δ_{Z_i,Z} θ(X_i−t)    (223)
For α→ 0 we recover the maximum-likelihood solutions. For α > 0 we will have jumps in the first
derivative of π(t|Z), but continuous (i.e. non-spiky) rates π(t|Z), as a consequence of the prior. To
Figure 7. Left: the maximum likelihood estimator π̂(t|Z) in (222) for Z = 0, calculated
from a data set with N = 71 patients and a binary covariate Z ∈ {0,1}, in which there
are many early and many late events (but with few at intermediate times). By definition
this estimator always consists of weighted delta-peaks (‘spikes’) at the observed event times.
Right: the most probable solution π⋆(t|Z) for Z = 0 within the Bayesian formalism, which
differs from the previous one in the addition of a ‘smoothness’ prior P(π). Here α = 50.
calculate the corresponding value for LZ we rewrite
L_Z(π) = ∑_{i=1}^N δ_{Z_i,Z} [ log π(X_i|Z) − ∫_0^{X_i} dt π(t|Z) ] − ½α [ π(t|Z) dπ(t|Z)/dt ]_0^∞ + ½α ∫_0^∞ dt π(t|Z) d²π(t|Z)/dt²
= ∑_{i=1}^N δ_{Z_i,Z} [ log π(X_i|Z) − ∫_0^{X_i} dt π(t|Z) ] + ½α ( π(0|Z)π′(0|Z) − π(∞|Z)π′(∞|Z) )
+ ½ ∫_0^∞ dt ( π(t|Z) − π̂(t|Z) ) ∑_{i=1}^N δ_{Z_i,Z} θ(X_i−t)
= ∑_{i=1}^N δ_{Z_i,Z} [ log π(X_i|Z) − ½ ∫_0^{X_i} dt π(t|Z) ] + ½α π(0|Z)π′(0|Z) − ½ ∑_{i=1}^N δ_{Z_i,Z}    (224)
Note that a finite nonzero derivative of π(t|Z) as t → ∞ is ruled out, as it would give either negative
or diverging hazard rates. It is also clear that the maximum must have π(X_i|Z) > 0 for all i, in view
of the term with log π(X_i|Z). Zero rates can only occur in between the data times X_1,...,X_N.
Let us inspect the shape of π(t|Z) when it is nonzero, and assume that there are no ties, i.e.
X_i ≠ X_j if i ≠ j. We can then order our individuals i such that X_0 < X_1 < X_2 < ... < X_{N−1} < X_N
(with the definition X_0 ≡ 0). At any time t ∉ {X_1,...,X_N} equation (223) simplifies considerably:
t < X_1:   d²π(t|Z)/dt² = γ_1(Z) = (1/α) ∑_{i=1}^N δ_{Z_i,Z}    (225)
t ∈ (X_ℓ, X_{ℓ+1}):   d²π(t|Z)/dt² = γ_{ℓ+1}(Z) = (1/α) ∑_{i=ℓ+1}^N δ_{Z_i,Z}    (226)
t ∈ (X_{N−1}, X_N):   d²π(t|Z)/dt² = γ_N(Z) = (1/α) δ_{Z_N,Z}    (227)
t > X_N:   d²π(t|Z)/dt² = 0    (228)
In each interval I` = (X`−1, X`) we apparently have a hazard rate in the shape of a local parabola:
t ∈ (X_{ℓ−1}, X_ℓ):   π(t|Z) = ½γ_ℓ (t − t_ℓ)² + δ_ℓ    (229)
t > X_N:   π(t|Z) = π(∞|Z)    (230)
We only need to determine the constants (t_ℓ, δ_ℓ) for each interval. The solutions in adjacent time
intervals are related by the continuity condition, i.e. lim_{ε↓0} π(X_ℓ+ε) = lim_{ε↓0} π(X_ℓ−ε), giving
ℓ < N:   ½γ_ℓ(X_ℓ − t_ℓ)² + δ_ℓ = ½γ_{ℓ+1}(X_ℓ − t_{ℓ+1})² + δ_{ℓ+1}    (231)
ℓ = N:   ½γ_N(X_N − t_N)² + δ_N = π(∞|Z)    (232)
The second identity which we can use relates to the first derivative of π(t|Z) near each X_ℓ.
Integration over both sides of (223), with ε > 0, gives:
π′(X_ℓ+ε|Z) − π′(X_ℓ−ε|Z) = (1/α) ∑_{i=1}^N δ_{Z_i,Z} ∫_{X_ℓ−ε}^{X_ℓ+ε} dt [ θ(X_i−t) − δ(t−X_i)/π(X_i|Z) ]
= (1/α) ∑_{i=1}^N δ_{Z_i,Z} [ (t−X_i)θ(X_i−t) − θ(t−X_i)/π(X_i|Z) ]_{X_ℓ−ε}^{X_ℓ+ε}
= −(1/α) ∑_{i=1}^N ( δ_{Z_i,Z}/π(X_i|Z) ) [ θ(X_ℓ+ε−X_i) − θ(X_ℓ−ε−X_i) ] = − δ_{Z_ℓ,Z} / ( α π(X_ℓ|Z) )    (233)
Thus we find
ℓ < N:   γ_ℓ(X_ℓ − t_ℓ) = γ_{ℓ+1}(X_ℓ − t_{ℓ+1}) + (1/α) δ_{Z_ℓ,Z} / [ ½γ_{ℓ+1}(X_ℓ − t_{ℓ+1})² + δ_{ℓ+1} ]    (234)
ℓ = N:   γ_N = 0   or   X_N − t_N = 1/π(∞|Z)    (235)
In combination, we end up with the following iteration for the unknown constants (t_ℓ, δ_ℓ), where we
note (and use) the fact that t_ℓ is irrelevant if γ_ℓ = 0:
t_ℓ = X_ℓ − (γ_{ℓ+1}/γ_ℓ)(X_ℓ − t_{ℓ+1}) − (1/(αγ_ℓ)) δ_{Z_ℓ,Z} / [ ½γ_{ℓ+1}(X_ℓ − t_{ℓ+1})² + δ_{ℓ+1} ]    (236)
δ_ℓ = δ_{ℓ+1} + ½γ_{ℓ+1}(X_ℓ − t_{ℓ+1})² − ½γ_ℓ(X_ℓ − t_ℓ)²    (237)
to be iterated downwards, starting with
t_N = X_N − 1/π(∞|Z),   δ_N = π(∞|Z) − γ_N/(2π²(∞|Z))    (238)
The only remaining freedom in our solution is the value chosen for π(∞|Z). This value is determined
by the requirement that our solution must maximise expression (224). For the present solution
π(t|Z) this expression reduces to
L_Z(π) = ∑_{ℓ=1}^N δ_{Z_ℓ,Z} log[ ½γ_ℓ(X_ℓ−t_ℓ)² + δ_ℓ ] − ½ ( ∑_{i=1}^N δ_{Z_i,Z} ) [ 1 + t_1(½γ_1t_1² + δ_1) ]
− ½ ∑_{i=1}^N δ_{Z_i,Z} ∑_{ℓ=1}^i ∫_{X_{ℓ−1}}^{X_ℓ} dt ( ½γ_ℓ(t−t_ℓ)² + δ_ℓ )
= ∑_{i=1}^N δ_{Z_i,Z} { log[ ½γ_i(X_i−t_i)² + δ_i ] − ½ [ 1 + t_1(½γ_1t_1² + δ_1)
+ ∑_{ℓ=1}^i ( (γ_ℓ/6)(X_ℓ−t_ℓ)³ − (γ_ℓ/6)(X_{ℓ−1}−t_ℓ)³ + δ_ℓ(X_ℓ−X_{ℓ−1}) ) ] }    (239)
The resulting solution π⋆(t|Z) is shown for an example data set in figure 7, for α = 50, together
with the ‘spiky’ maximum likelihood estimator π̂(t|Z). The new estimator combines evidence from
the data (the event times) with our prior belief that the true hazard rate should be smooth.
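Rather than implementing the exact piecewise-parabola recursion (236)–(238), a simpler numerical route to essentially the same maximum is to discretise time and maximise the penalised objective (219) for one covariate group directly on a grid. The sketch below is such an approximation: it parametrises π = e^u, which keeps the rate strictly positive (the exact solution may touch zero between events); grid size, α and the toy data are illustrative choices of our own.

```python
import numpy as np
from scipy.optimize import minimize

def smooth_hazard(X_events, alpha, T=8.0, K=160):
    """Discretised maximisation of (219): log-likelihood of the event times
    minus the smoothness penalty (alpha/2) * int (dpi/dt)^2 dt."""
    t = np.linspace(0.0, T, K)
    dt = t[1] - t[0]
    idx = np.minimum((X_events / dt).astype(int), K - 1)      # event bins
    at_risk = (X_events[None, :] >= t[:, None]).sum(axis=1)   # sum_i theta(X_i - t)

    def neg_LZ(u):
        pi = np.exp(u)
        loglik = u[idx].sum() - (at_risk * pi).sum() * dt
        penalty = 0.5 * alpha * (np.diff(pi) ** 2).sum() / dt
        return penalty - loglik

    res = minimize(neg_LZ, np.full(K, -1.0), method="L-BFGS-B")
    return t, np.exp(res.x)

rng = np.random.default_rng(1)
X = np.concatenate([rng.uniform(0, 1, 30), rng.uniform(6, 8, 30)])  # early + late events
t_grid, pi_smooth = smooth_hazard(X, alpha=50.0)  # qualitatively like figure 7 (right)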
8. Proportional hazards (Cox) regression
Most existing survival analysis protocols that aim to quantify the impact of covariate values on
risk, and/or predict survival outcome from an individual’s covariates, can be obtained from the
general Bayesian description in the previous section, upon implementing specific further complexity
reductions. These reductions are always of the following types (often in combination):
• Within the sub-cohort picture: assumptions on the form of π(Z)
These assumptions are designed to reduce the complexity of the mathematical formulas, and
are formulated via simple low-dimensional parametrisations.
• Within the conditioning picture: assumptions on the form of the distribution W (π|Z)
These aim again to reduce the complexity of the mathematical formulas, and are implemented
via the prior P(W) (which is set to zero for all W(π|Z) that are not of the assumed form).
• Assumptions on correlations between risks in the cohort
These relate to the interpretation of results. For instance, the assumption of independence is
needed if we want to interpret the cause specific hazard rates of the primary risk as indicative
of risk in a world where the other risks are eliminated.
• Mathematical approximations
These are short-cuts which in principle always induce some error. An example is limiting
oneself to the most probable value of a parameter, even if its distribution has finite width (e.g.
maximum likelihood estimation versus Bayesian regression).
8.1. Definitions, assumptions and regression equations
Definition of Cox regression. ‘Proportional hazards regression’ or ‘Cox regression’ (dating from
1972) is a formalism that is indeed obtained from the general Bayesian picture via several
simplifications of the type listed above. To appreciate its definition, let us inspect which formulas
we could in principle write for the hazard rate of the primary risk. We must demand π1(t|Z) ≥ 0
for all t ≥ 0 and all Z, so we can always write it in exponential form. If we then also expand the
exponent in powers of Z we see that any acceptable cause-specific hazard rate can be written as
π_1(t|Z) = π_1(t|0) exp[ ∑_{µ=1}^p β_µ(t)Z_µ + ½ ∑_{µ,ν=1}^p β_{µν}(t)Z_µZ_ν + O(Z³) ]    (240)
Cox regression boils down to an inspired simplification of this general expression, crucial at the time
of the method’s conception when computation resources were very limited (remember that in 1972
the average university would have just one big but slow computer):
We assume that (conditional on the covariates Z) all risks are statistically independent,
and that the cause-specific hazard rate of the primary risk for individuals with covariates
Z is a function of the following parametrized form:
π_1(t|Z) = λ_0(t) e^{β·Z}    (241)
Here β·Z = ∑_{µ=1}^p β_µZ_µ, with time-independent parameters β = (β_1,...,β_p). We then
focus on calculating the most probable β and the most probable function λ_0(t).
The function λ0(t) ≥ 0 is called the ‘base hazard rate’. It is the primary risk hazard rate one would
find for the trivial covariates Z = (0, 0, . . . , 0). The name ‘proportional hazards’ refers to the fact
that, due to the exponential form of (241), the effect of each covariate is multiplicative:
π_1(t) = λ_0(t) × e^{β_1Z_1} × ··· × e^{β_pZ_p}
with λ_0(t) the base hazard rate, and the exponential factors the ‘proportional hazards’.
The main implications of (241) are that the effects of the covariates are taken to be mutually
independent and independent of time. One effectively assumes that there exists a time-independent
hyper-plane in covariate space that separates high risk individuals from low risk individuals:
‘high-risk covariates’:   β_1Z_1 + ... + β_pZ_p large
‘low-risk covariates’:   β_1Z_1 + ... + β_pZ_p small
In addition we can now quantify the risk impact of each individual covariate µ in a single time-
independent number, the so-called ‘hazard ratio’
HR_µ = π_1(t|Z)|_{Z_µ=1} / π_1(t|Z)|_{Z_µ=0} = λ_0(t) e^{β_µ·1+∑_{ν≠µ}β_νZ_ν} / λ_0(t) e^{β_µ·0+∑_{ν≠µ}β_νZ_ν} = e^{β_µ}    (242)
Covariates with no impact on risk, i.e. with βµ = 0, would thus give HRµ = 1. Note that in
the more general case (240) the ratio π1(t|Z)|Zµ=1/π1(t|Z)|Zµ=0 would still have depended on the
remaining covariates Zν with ν 6= µ. The main virtue of the choice (241) is that it is the simplest
nontrivial definition to meet the main criteria that we need to build into any parametrisation of
cause-specific hazard rates (nonnegativity, possible dependence on time and on covariates) in which
we can effectively decouple the time variable from the variables relating to covariates.
Derivation of equations for regression parameters. Within Cox regression we seek to find the most
probable parameters β = (β1, . . . , βp) and the most probable function λ0(t) in (241). In Cox’s
original paper he did not in fact calculate λ0(t) explicitly, but instead focused on calculating β
using an argument (‘partial likelihood’) that avoids having to know the base hazard rate. Here we
use the benefit of hindsight and the fact that we have already done much of the preparatory work,
and calculate the most probable parameters directly from (190) (with a flat prior P(π⋆_1), where the
most probable Bayesian solution reduces to maximum likelihood estimation):
logP(β,λ_0|D) = ∑_{i=1}^N [ δ_{1,∆_i} log π_1(X_i|Z_i) − ∫_0^{X_i} dt π_1(t|Z_i) ] + constant
= ∑_{i=1}^N [ ∫_0^∞ dt log λ_0(t) δ_{1,∆_i}δ(t−X_i) + δ_{1,∆_i} β·Z_i − e^{β·Z_i} ∫_0^∞ dt θ(X_i−t)λ_0(t) ] + constant    (243)
Maximisation of this expression is done as always via the Lagrange formalism. Let us first maximise
over λ0(t), and define L(β|D) = maxλ0 logP(β, λ0|D). It will again turn out that the constraint
λ0(t) ≥ 0 will be met automatically, so the Lagrange equations from which to solve λ0(t) become
0 = δ logP(β,λ_0|D)/δλ_0(t) = (1/λ_0(t)) ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) − ∑_{i=1}^N e^{β·Z_i} θ(X_i−t)    (244)
It follows that the maximising function λ0(t), given β, is
λ_0(t|β) = ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) / ∑_{i=1}^N e^{β·Z_i} θ(X_i−t)    (245)
For β = 0 this expression reduces to the simple covariate-free maximum likelihood estimator of the
cause-specific hazard rate, as one would expect. Having calculated the most probable base hazard rate (245) in terms of
the regression parameters β, we are then left with the following function to be maximised over β:
L(β|D) = max_{λ_0} logP(β,λ_0|D)
= ∫_0^∞ dt ( ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) ) log( ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) ) + constant
− ∫_0^∞ dt ( ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) ) log( ∑_{i=1}^N e^{β·Z_i}θ(X_i−t) )
+ ∑_{i=1}^N δ_{1,∆_i} β·Z_i − ∑_{i=1}^N e^{β·Z_i} ∫_0^∞ dt θ(X_i−t) [ ∑_{j=1}^N δ_{1,∆_j}δ(t−X_j) / ∑_{j=1}^N e^{β·Z_j}θ(X_j−t) ]
= L(0|D) − ∫_0^∞ dt ( ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) ) log( ∑_{i=1}^N e^{β·Z_i}θ(X_i−t) )
+ ∑_{i=1}^N δ_{1,∆_i} β·Z_i − ∑_{i=1}^N e^{β·Z_i} ∫_0^∞ dt θ(X_i−t) ( ∑_{j=1}^N δ_{1,∆_j}δ(t−X_j) / ∑_{j=1}^N e^{β·Z_j}θ(X_j−t) )
+ ∫_0^∞ dt ( ∑_{i=1}^N δ_{1,∆_i}δ(t−X_i) ) log( ∑_{i=1}^N θ(X_i−t) ) + ∑_{i=1}^N ∫_0^∞ dt θ(X_i−t) ( ∑_{j=1}^N δ_{1,∆_j}δ(t−X_j) / ∑_{j=1}^N θ(X_j−t) )
= L(0|D) + ∑_{i=1}^N δ_{1,∆_i} β·Z_i − ∑_{i=1}^N δ_{1,∆_i} log( ∑_{j=1}^N e^{β·Z_j}θ(X_j−X_i) ) + ∑_{i=1}^N δ_{1,∆_i} log( ∑_{j=1}^N θ(X_j−X_i) )
= L(0|D) + ∑_{i=1}^N δ_{1,∆_i} β·Z_i − ∑_{i=1}^N δ_{1,∆_i} log[ ∑_{j=1}^N e^{β·Z_j}θ(X_j−X_i) / ∑_{j=1}^N θ(X_j−X_i) ]    (246)
From this result we obtain by differentiation the equations 0 = ∂L(β|D)/∂β_µ from
which to solve the most probable β, giving
for all µ:   ∑_{i=1}^N δ_{1,∆_i} [ Z_µ^i − ∑_{j=1}^N Z_µ^j e^{β·Z_j}θ(X_j−X_i) / ∑_{j=1}^N e^{β·Z_j}θ(X_j−X_i) ] = 0    (247)
This is a relatively simple set of coupled nonlinear equations for just p parameters (β1, . . . , βp),
which could indeed be analysed with the computing power of the 1970s. Once the parameters β
have been determined, the corresponding hazard ratios follow via (242), and the most probable base
hazard rate λ0(t) follows from (245).
Finally, once the most probable base hazard rate and the most probable regression parameters
β are known, the Cox formalism allows us to predict the survival time for any individual with
covariates Z via the primary risk-specific version of (181), which now reduces to
S_Cox(t|Z) = exp( −∫_0^t ds π_1(s|Z) ) = exp( −e^{β̂·Z} Λ_0(t|β̂) )    (248)
with, upon integrating (245) over time,
Λ_0(t|β̂) = ∫_0^t ds λ_0(s|β̂) = ∑_{i=1}^N δ_{1,∆_i} θ(t−X_i) / ∑_{j=1}^N e^{β̂·Z_j}θ(X_j−X_i)    (249)
(which is Breslow’s estimator, first given in the comments at the end of Cox’s 1972 paper). In
combination this gives
S_Cox(t|Z) = exp( − ∑_{i=1}^N δ_{1,∆_i} e^{β̂·Z} θ(t−X_i) / ∑_{j=1}^N e^{β̂·Z_j}θ(X_j−X_i) )    (250)
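The chain (246) → (247) → (249) → (250) can be checked numerically. The sketch below fits β by maximising the log partial likelihood with a generic optimiser rather than solving (247) directly, then evaluates Breslow's estimator and the predicted survival curve; the toy data, the O(N²) risk-set loops and the neglect of ties are simplifying assumptions of our own.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N, p = 200, 3
Z = rng.normal(size=(N, p))
beta_true = np.array([0.8, -0.5, 0.0])
T = rng.exponential(1.0 / np.exp(Z @ beta_true))   # true base hazard lambda_0 = 1
C = rng.exponential(2.0, N)                        # independent censoring
X, delta = np.minimum(T, C), (T <= C).astype(int)

def neg_log_partial_likelihood(beta):
    eta = Z @ beta
    ll = 0.0
    for i in range(N):
        if delta[i] == 1:                               # delta_{1,Delta_i}
            ll += eta[i] - np.log(np.exp(eta[X >= X[i]]).sum())
    return -ll

beta_hat = minimize(neg_log_partial_likelihood, np.zeros(p)).x

def Lambda0(t):          # Breslow's estimator (249)
    eta = Z @ beta_hat
    return sum(1.0 / np.exp(eta[X >= X[i]]).sum()
               for i in range(N) if delta[i] == 1 and X[i] <= t)

def S_cox(t, z):         # predicted survival, (248)/(250)
    return np.exp(-np.exp(z @ beta_hat) * Lambda0(t))

print(beta_hat)                  # should be close to beta_true for large N
print(S_cox(1.0, np.zeros(p)))   # baseline survival at t=1; true value e^{-1}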
8.2. Uniqueness and p-values for regression parameters
Curvature of L(β|D) and uniqueness. To find out whether there could be multiple solutions of
equation (247), it is helpful to inspect the second derivative (or curvature) of (246). We note that (247)
was derived from the first derivative of L(β|D):
∂L(β|D)/∂β_µ = ∑_i δ_{1,∆_i} [ Z_µ^i − ∑_j Z_µ^j e^{β·Z_j}θ(X_j−X_i) / ∑_j e^{β·Z_j}θ(X_j−X_i) ]    (251)
Hence, upon introducing the short-hand
〈u_j〉_i = ∑_j p_j(i) u_j,   p_j(i) = e^{β·Z_j}θ(X_j−X_i) / ∑_j e^{β·Z_j}θ(X_j−X_i)    (252)
we obtain by further differentiation
∂²L(β|D)/∂β_µ∂β_ν = −∑_i δ_{1,∆_i} [ 〈Z_µ^j Z_ν^j〉_i − 〈Z_µ^j〉_i 〈Z_ν^j〉_i ]
= −∑_i δ_{1,∆_i} 〈 (Z_µ^j − 〈Z_µ^j〉_i)(Z_ν^j − 〈Z_ν^j〉_i) 〉_i    (253)
Unless all event times are equal, the matrix of second derivatives is seen to be negative definite
everywhere, since for any vector y ∈ IR^p one has
∑_{µ,ν=1}^p y_µ ( ∂²L(β|D)/∂β_µ∂β_ν ) y_ν = − ∑_{i=1}^N δ_{1,∆_i} 〈 ( ∑_{µ=1}^p y_µ(Z_µ^j − 〈Z_µ^j〉_i) )² 〉_i < 0    (254)
Hence there will be only one extremal point of L(β|D) and therefore only one solution of (247), and
we know it will indeed be a maximum. From now on we will write the relevant point as β̂.
Shape of P(β, λ0|D) near the most probable point. We next write the entries of the curvature matrix
of L(β|D) at the most probable point β̂ as minus A_µν (where A is a positive definite matrix):
A_µν(β̂) = − ∂²L(β|D)/∂β_µ∂β_ν |_{β̂} = ∑_i δ_{1,∆_i} 〈 (Z_µ^j−〈Z_µ^j〉_i)(Z_ν^j−〈Z_ν^j〉_i) 〉_i |_{β̂}    (255)
We can now expand L(β|D) close to the maximum point:
L(β|D) = L(β̂|D) − ½ ∑_{µν} A_µν(β̂)(β_µ−β̂_µ)(β_ν−β̂_ν) + O(|β−β̂|³)    (256)
Given the definition of L(β|D), we can now also write
max_{λ_0} P(β,λ_0|D) = e^{ L(β̂|D) − ½∑_{µν}A_µν(β̂)(β_µ−β̂_µ)(β_ν−β̂_ν) + O(|β−β̂|³) }    (257)
It follows that, provided max_{λ_0}P(β,λ_0|D) is a narrow distribution around the most probable point
β̂, and provided we can disregard the uncertainty in the base hazard rate⁺, we can approximate
P(β|D) = max_{λ_0}P(β,λ_0|D) by a multi-variate Gaussian distribution, from which we can also obtain
the Gaussian marginals for individual regression parameters:
P(β_µ|D) ≈ (σ_µ√(2π))^{−1} e^{−½(β_µ−β̂_µ)²/σ_µ²},   σ_µ² = (A^{−1})_{µµ}(β̂)    (258)
p-values for Cox regression parameters and hazard ratios. The latter result (258) allows us to define
approximate p-values for Cox regression. We do this in the usual way: given the observed value β̂_µ
in regression, we define the p-value as the probability to observe |β_µ| ≥ |β̂_µ| in a ‘null model’.
The null model chosen here is the distribution (258) that corresponds to the trivial value β̂ = 0.
However, this is a further approximation, since one could also set only β̂_µ = 0 in the null model,
leaving the other regression parameters nonzero (in which case the variance in (258) could depend,
via the matrix A(β̂), on all other β_ν with ν ≠ µ). If we choose the null model β̂ = 0 we get
P_0(β_µ) = (σ_µ√(2π))^{−1} e^{−½β_µ²/σ_µ²},   σ_µ² = (A^{−1})_{µµ}(0)    (259)
and our p-value approximation will be
p-value = Prob( |β_µ| ≥ |β̂_µ| ) = 1 − (2/(σ_µ√(2π))) ∫_0^{|β̂_µ|} dβ e^{−½β²/σ_µ²}
= 1 − (2/√π) ∫_0^{|β̂_µ|/(σ_µ√2)} dx e^{−x²} = 1 − Erf( |β̂_µ|/(σ_µ√2) )    (260)
The ratio |β̂_µ|/σ_µ is called the z-score. Note that the approximations underlying this final simple
result (260) are quite drastic: (i) forget about uncertainty in the base hazard rate λ0(t), (ii)
approximate the posterior distribution for β by a Gaussian, (iii) assume a null model in which
all regression parameters are zero, and (iv) ignore all correlations between regression parameters of
different covariates. Note also that the p-values do not measure the possible error introduced by
overfitting. We will come back to overfitting later.
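In code, the recipe (260) is a one-liner; the numbers below are illustrative.

```python
from math import erf, sqrt

# p-value (260): p = 1 - Erf(z/sqrt(2)) with z = |beta_hat|/sigma,
# i.e. the usual two-sided Gaussian p-value.
def cox_p_value(beta_hat, sigma):
    z = abs(beta_hat) / sigma
    return 1.0 - erf(z / sqrt(2.0))

print(cox_p_value(0.40, 0.15))   # z ~ 2.67, p ~ 0.0077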
8.3. Properties and limitations of Cox regression
Normalisation of covariates. The optimal value one will find for the regression parameter vector
β will obviously depend on the units chosen for the covariates, since only the sums ∑_{µ=1}^p β_µZ_µ^i
appear in the parameter likelihood. For instance, a renormalisation Z_µ^i → ϱ_µZ_µ^i for all i would
simply rescale the most probable regression parameters via β̂_µ → β̂_µ/ϱ_µ. This implies that, unless
we prescribe a normalisation convention for the covariates, we cannot use the value of β̂_µ directly
as a quantitative measure of the impact of covariate µ on survival. In addition, the definition of the
hazard ratio given in (242) as yet makes sense only for binary covariates Z_µ ∈ {0,1}. To resolve
+ Note: this is an assumption for which we have no justification yet, but which is essential if in the Cox
approach we want to quantify regression uncertainty, since the whole point in Cox regression is to eliminate
the base hazard rate from the problem and formulate everything strictly in terms of β alone.
these problems we need a unified normalisation of the covariates. One natural convention is to
choose units for all covariates such that
covariate normalisation:   (1/N) ∑_i Z_µ^i = 0,   (1/N) ∑_i (Z_µ^i)² = 1    (261)
Unless a covariate takes the same value for all individuals, this can always be achieved by linear
rescaling. Upon adopting (261), different components β̂_µ can be compared meaningfully, with those
further away from zero implying a more prominent impact of covariates on survival.
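The convention (261) is the familiar per-covariate standardisation with the biased (1/N) variance; a minimal sketch:

```python
import numpy as np

# Enforce (261): zero empirical mean and unit empirical variance per
# covariate (columns of Z are covariates); fails if a column is constant.
def normalise_covariates(Z):
    Z = np.asarray(Z, dtype=float)
    return (Z - Z.mean(axis=0)) / Z.std(axis=0)

Zn = normalise_covariates(np.random.default_rng(3).normal(2.0, 5.0, size=(100, 4)))
assert np.allclose(Zn.mean(axis=0), 0) and np.allclose((Zn**2).mean(axis=0), 1)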
Hazard ratios for normalised covariates. With the normalisation (261) we can also generalise
our previous definition (242) of hazard ratios (which as yet applied to binary covariates only)
to include arbitrary (e.g. real-valued) covariates. In a balanced cohort and for binary covariates the
normalisation (261) would imply Z_µ ∈ {−1,1}, and consistency with our earlier definition (242) of
hazard ratios would then demand that we define
HR_µ = π_1(t|Z)|_{Z_µ=1} / π_1(t|Z)|_{Z_µ=−1} = λ_0(t) e^{∑_{ν≠µ}β_νZ_ν + β_µ} / λ_0(t) e^{∑_{ν≠µ}β_νZ_ν − β_µ} = e^{2β_µ}    (262)
This is the appropriate hazard ratio definition corresponding to the convention (261). The Gaussian
approximation (258) allows us to calculate for each covariate µ the so-called 95% confidence intervals
for hazard ratios:
[HR_µ^−, HR_µ^+],   HR_µ^± = e^{2(β̂_µ ± d_µ)}   (d_µ > 0)    (263)
such that Prob(β̂_µ−d_µ < β_µ < β̂_µ+d_µ) = 0.95. From (258) we can calculate the quantity d_µ:
0.95 = ∫_{β̂_µ−d_µ}^{β̂_µ+d_µ} ( dβ_µ/(σ_µ√(2π)) ) e^{−½(β_µ−β̂_µ)²/σ_µ²} = Erf( d_µ/(σ_µ√2) )    (264)
Hence we can express dµ via the inverse error function as
d_µ = √2 Erf^{−1}(0.95) σ_µ ≈ 1.96 σ_µ    (265)
To be on the safe side, since the above is still an approximation in view of the Gaussian assumption
for the βµ-distribution, many authors in fact use dµ = 2σµ. This gives the convention
95% confidence interval:   HR_µ ∈ [ e^{2(β̂_µ−2σ_µ)}, e^{2(β̂_µ+2σ_µ)} ]    (266)
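Given β̂_µ and σ_µ, the hazard ratio (262) and the conservative interval (266) follow immediately; a minimal sketch with illustrative numbers:

```python
import numpy as np

# Hazard ratio under convention (261), following (262), with the
# conservative 95% interval (266) using d_mu = 2 sigma_mu.
def hazard_ratio_ci(beta_hat, sigma):
    return (np.exp(2 * (beta_hat - 2 * sigma)),
            np.exp(2 * beta_hat),
            np.exp(2 * (beta_hat + 2 * sigma)))

lo, hr, hi = hazard_ratio_ci(0.3, 0.1)   # HR = e^0.6 ~ 1.82, CI [e^0.2, e^1.0]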
Univariate versus multivariate regression and correlated covariates. The proportional hazards
assumption in Cox regression, i.e. that each covariate contributes an independent multiplicative
factor to the primary risk hazard rate, will be violated as soon as covariates are correlated. We
should expect strictly uncorrelated covariates to be the exception rather than the norm. This has
implications. It can cause degeneracies in the parameter likelihood, such that there is no longer a
unique optimal regression vector. To see this just consider the extreme case where we have just two
covariates Z_{1,2} ∈ {−1,1}, and these two covariates contain exactly the same information:
Z_2 = Z_1:   π_1(t|Z) = λ_0(t) e^{β_1Z_1+β_2Z_2} = λ_0(t) e^{(β_1+β_2)Z_1}
In either case one can at best find an optimal linear combination of covariates, but no unique values;
any risk evidence in the covariates can be shared in arbitrary ratios among the two covariates.
A further consequence of covariate correlations is that there will now be a difference between
the regression parameters βµ that one would find in univariate regression, i.e. in Cox regression
using π_1(t|Z_µ) = λ_0(t)e^{β_µZ_µ} (with only one covariate µ included), and multivariate regression, where
π_1(t|Z) = λ_0(t)e^{∑_{ν=1}^p β_νZ_ν} with µ ∈ {1,...,p}. If covariate µ correlates with one or more other
covariates, then in multivariate regression the predictive evidence will generally be ‘shared’ among
the covariates, whereas in univariate regression it will not. Again, to appreciate this just consider
the previous example, with Z_1 = Z_2 = Z:
univariate regression:   π_1(t|Z_1) = λ_0(t)e^{β_1Z},   π_1(t|Z_2) = λ_0(t)e^{β_2Z}    (267)
bivariate regression:   π_1(t|Z_1,Z_2) = λ_0(t)e^{(β′_1+β′_2)Z}    (268)
Here (β_1,β_2) are the regression parameters found upon studying the impact of the two covariates
via univariate regression, and (β′_1,β′_2) are the regression parameters found via bivariate regression.
Since the data likelihood only depends on the cause-specific hazard rates, we will always find
β_1 = β_2 = β′_1 + β′_2. Hence, unless β_1 = 0 or β_2 = 0, one will inevitably have β_1 ≠ β′_1 or β_2 ≠ β′_2
(or both). In fact even for uncorrelated covariates and infinitely large data sets (where there are no
issues with finite size corrections and uncertainties) one will still generally find different regression
parameters when comparing univariate to multivariate regression – see example 1 at the end of this
chapter. The regression parameters (and hence also the hazard ratios) depend on the modelling
context, i.e. on exactly which other covariates were included; they are not objective quantitative
measures of the impact of individual covariates on risk.
Issues related to inclusion of treatment parameters as covariates. Often one includes treatment
parameters as covariates, with the objective of quantifying treatment effect on survival. As soon as
such treatment decisions involve human judgement based on observing covariates (as they usually
do), this may affect the outcome of regression for the initial covariates:
• Adding the treatment decision as a new covariate by definition turns the extended covariate
set into a correlated one, even if the initial set was not. For instance, assume a covariate
Z_1 ∈ {1,2,3} indicates the grade of a tumour, and the clinical protocol is to give a patient with
grade 2 or 3 chemotherapy; then we could indicate this decision with a variable Z_2 = θ(Z_1−3/2)
(with the step function), and the pair (Z1, Z2) will be strongly correlated.
• Provided they are medically effective, treatments will reduce or undo any patterns that connect
the other covariates to risk: if high-risk patients are correctly identified from covariates
(implying a significant predictive signal in the covariates), and selected for medical treatment,
then these individuals are thereby converted by this treatment to low-risk patients. This
removes the prior link between covariates and medical outcome.
Issues related to interpretation of regression parameters. Once the regression parameters β have
been calculated, and assuming there are no problems caused by risk correlations, there are still
pitfalls in the interpretation of these parameters. To name two:
• In heterogeneous cohorts one will typically find dependence of the regression parameters on the
duration of the trial, with these parameters moving closer to zero with longer trial durations.
This need not be due to a true time dependence, but may well reflect ‘cohort filtering’, similar
to the mechanism underlying figure 2.
• One should not interpret nonzero regression parameters as evidence for causal effects of the
associated covariates on risk. Finding βµ > 0 simply tells us that individuals with larger Zµ are
more likely to experience the primary hazard. Hazard and covariate could both be consequences
of a common cause, or perhaps the impending hazard could even cause the elevated covariate.
Imagine what we would find upon including the frequency of hospital visits as a covariate; we
would undoubtedly find a significantly large associated regression parameter, as individuals
who visit hospitals more are more likely to be ill. Naive interpretation of the outcome of
regression would then lead us to recommend that hospital visits should generally be avoided.
8.4. Examples
Example 1: multivariate versus univariate analysis
We explore further the differences between univariate Cox regression (one covariate included at
a time) and multivariate regression (with multiple covariates simultaneously), and the effects
of covariate correlations. Assume for simplicity that we have just one risk (so ∆i = 1 for all i)
and only two covariates, and start from expression (246), with β = (β_1,β_2) and Z_i = (Z_1^i, Z_2^i):
L(β) = ∑_{i=1}^N β·Z_i − ∑_{i=1}^N log( (1/N) ∑_{j=1}^N e^{β·Z_j}θ(X_j−X_i) ) + constant    (269)
Let us assume our data are of the form X_i = f(Z_1^i+Z_2^i), where f(x) is some monotonically
decreasing function (so those with larger values of Z_1^i+Z_2^i experience primary events earlier).
This implies that θ(X_j−X_i) = θ[Z_1^i+Z_2^i−Z_1^j−Z_2^j]. In addition we define L̄(β) = L(β)/N, and
assume that all Z_i are drawn randomly from a zero-average distribution P(Z). For very large
populations we will then find, using the law of large numbers:
lim_{N→∞} L̄(β) = lim_{N→∞} (1/N) ∑_{i=1}^N β·Z_i − lim_{N→∞} (1/N) ∑_{i=1}^N log( (1/N) ∑_{j=1}^N e^{β·Z_j} θ[Z_1^i+Z_2^i−Z_1^j−Z_2^j] ) + constant
= −〈 log〈 e^{β·Z′} θ[Z_1+Z_2−Z′_1−Z′_2] 〉_{Z′} 〉_Z + constant    (270)
in which 〈...〉_Z = ∫dZ P(Z)(...). Our different Cox regression versions are now:
multivariate:   find min_{β_1,β_2} 〈 log〈 e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] 〉_{Z′} 〉_Z    (271)
univariate:   find min_{β_1} 〈 log〈 e^{β_1Z′_1} θ[Z_1+Z_2−Z′_1−Z′_2] 〉_{Z′} 〉_Z    (272)
and find min_{β_2} 〈 log〈 e^{β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] 〉_{Z′} 〉_Z    (273)
Upon doing the required differentiations with respect to the regression parameters, we then
find the following equations from which to solve β_1 and β_2:
multivariate:   ∫dZ P(Z) [ ∫dZ′P(Z′) Z′_1 e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′P(Z′) e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] ] = 0    (274)
∫dZ P(Z) [ ∫dZ′P(Z′) Z′_2 e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′P(Z′) e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] ] = 0    (275)
univariate:   ∫dZ P(Z) [ ∫dZ′P(Z′) Z′_1 e^{β_1Z′_1} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′P(Z′) e^{β_1Z′_1} θ[Z_1+Z_2−Z′_1−Z′_2] ] = 0    (276)
∫dZ P(Z) [ ∫dZ′P(Z′) Z′_2 e^{β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′P(Z′) e^{β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] ] = 0    (277)
We next work out two choices for the covariate statistics P(Z), which we take to be zero-average
Gaussian. In the first choice the covariates are identical, P(Z_1,Z_2) = δ(Z_2−Z_1) e^{−½Z_1²}/√(2π).
In the second choice they are independent: P(Z_1,Z_2) = ( e^{−½Z_1²}/√(2π) )( e^{−½Z_2²}/√(2π) ).
• Correlated (identical) covariates:
Here our previous equations from which to solve (β1, β2) can now all be written in terms
of the following function:
F(u) = ∫dZ ( e^{−½Z²}/√(2π) ) [ ∫dY Y e^{−½Y²+uY} θ[Z−Y] / ∫dY e^{−½Y²+uY} θ[Z−Y] ]    (278)
To be specific, one finds
multivariate Cox regression:   F(β_1+β_2) = 0    (279)
univariate Cox regression:   F(β_1) = F(β_2) = 0    (280)
One can prove easily that the function F(u) is convex, i.e. F″(u) > 0 for all u, so the
equation F(u) = 0 has exactly one solution u⋆. Thus we find:
multivariate Cox regression:   β_1 + β_2 = u⋆    (281)
univariate Cox regression:   β_1 = β_2 = u⋆    (282)
As expected, multivariate regression and univariate regression do not lead to the same
regression parameters. In univariate regression each individual covariate provides the
same amount of evidence for survival outcome, quantified by the value u⋆ for each β_{1,2}.
In multivariate regression, in contrast, the evidence is shared between the two covariates.
• Uncorrelated covariates:
Here, with P(Z_1,Z_2) = e^{−½(Z_1²+Z_2²)}/2π, the calculations are slightly more involved. It will
be advantageous to first transform the variables Z and Z′ according to
X_1 = (Z_1+Z_2)/√2,   X_2 = (Z_1−Z_2)/√2,   P(X_1,X_2) = (1/2π) e^{−½(X_1²+X_2²)}    (283)
This will give θ[Z_1+Z_2−Z′_1−Z′_2] = θ[X_1−X′_1]. Let us start with the ratio of integrals
appearing in our expression for multivariate analysis:
∫dZ′P(Z′) Z′_1 e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′P(Z′) e^{β_1Z′_1+β_2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2]
= (1/√2) ∫dX′ e^{−½(X′_1²+X′_2²)} (X′_1+X′_2) e^{(1/√2)[X′_1(β_1+β_2)+X′_2(β_1−β_2)]} θ[X_1−X′_1] / ∫dX′ e^{−½(X′_1²+X′_2²)} e^{(1/√2)[X′_1(β_1+β_2)+X′_2(β_1−β_2)]} θ[X_1−X′_1]
= (1/√2) ∫dX′_1 X′_1 e^{−½X′_1²+(1/√2)X′_1(β_1+β_2)} θ[X_1−X′_1] / ∫dX′_1 e^{−½X′_1²+(1/√2)X′_1(β_1+β_2)} θ[X_1−X′_1]
+ (1/√2) ∫dX′_2 X′_2 e^{−½X′_2²+(1/√2)X′_2(β_1−β_2)} / ∫dX′_2 e^{−½X′_2²+(1/√2)X′_2(β_1−β_2)}
= (1/√2) ∫_{−∞}^{X_1} dx x e^{−½x²+(1/√2)x(β_1+β_2)} / ∫_{−∞}^{X_1} dx e^{−½x²+(1/√2)x(β_1+β_2)} + ½(β_1−β_2)
= ½(β_1+β_2) − (1/√2) ∫_{−∞}^{X_1} dx (d/dx) e^{−½[x−(β_1+β_2)/√2]²} / ∫_{−∞}^{X_1} dx e^{−½[x−(β_1+β_2)/√2]²} + ½(β_1−β_2)
= β_1 − (1/√2) e^{−½[X_1−(β_1+β_2)/√2]²} / ∫_{−∞}^{X_1−(β_1+β_2)/√2} dx e^{−½x²}
= β_1 − (1/√π) e^{−½[X_1−(β_1+β_2)/√2]²} / ( 1 + Erf[ (X_1−(β_1+β_2)/√2)/√2 ] )    (284)
After further averaging over X we then find that all equations for regression parameters
can now be written in terms of the function
G(u) = ∫dx ( e^{−½x²}/√(2π) ) (1/√π) e^{−½[x−u/√2]²} / ( 1 + Erf[(x−u/√2)/√2] )
= ∫ ( dx/(π√2) ) e^{−½x²−½[x+u/√2]²} / ( 1 + Erf(x/√2) )    (285)
To be specific, we find
multivariate Cox regression:   β_1 = G(β_1+β_2),   β_2 = G(β_1+β_2)    (286)
univariate Cox regression:   β_1 = G(β_1),   β_2 = G(β_2)    (287)
So for either version of Cox analysis we have β_1 = β_2 = β, but the equations from
which to solve β, and therefore the values found for β, are not identical in the two cases.
Even when covariates are not correlated, one can apparently still find different regression
parameters and hazard ratios when comparing univariate to multivariate regression:
multivariate Cox regression:   β = G(2β)    (288)
univariate Cox regression:   β = G(β)    (289)
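Equations (288) and (289) are easily solved numerically. The sketch below evaluates G(u) from the second form in (285) by quadrature, using the identity 1 + Erf(x/√2) = erfc(−x/√2) for numerical stability at x ≪ 0; the integration range and root brackets are choices made by inspection of G, and are assumptions of this sketch.

```python
import numpy as np
from math import sqrt, pi
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.special import erfc

def G(u):
    # second form of (285); erfc(-x/sqrt(2)) = 1 + Erf(x/sqrt(2))
    f = lambda x: np.exp(-0.5*x**2 - 0.5*(x + u/sqrt(2))**2) / erfc(-x/sqrt(2))
    return quad(f, -15, 15)[0] / (pi * sqrt(2))

beta_uni = brentq(lambda b: b - G(b), 0.0, 2.0)      # univariate: beta = G(beta)
beta_multi = brentq(lambda b: b - G(2*b), 0.0, 2.0)  # multivariate: beta = G(2*beta)
print(beta_uni, beta_multi)   # two different values, as claimed in the text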
Example 2: effect of duration of trial on regression parameters
Imagine we have data D on a cohort of size N , with one primary risk and the end-of-trial risk.
The trial is terminated at time τ > 0. All individuals i with a primary event prior to time τ
will have ∆i = 1, and all others will have ∆i = 0. This implies that if ti is the time at which
the primary event would occur for individual i, then the actually reported data will be
(X_i,∆_i) = (t_i, 1) if t_i < τ,   (τ, 0) if t_i ≥ τ    (290)
We measure one binary covariate Z_i ∈ {0,1}, and we assume that our cohort is heterogeneous,
involving two distinct event time distributions. We write the probability to find t_i = t as p_i(t):
i ≤ N/2:   p_i(t) = a e^{−at},    i > N/2:   p_i(t) = a(1+Z_i) e^{−a(1+Z_i)t}    (291)
with a > 0. Let us first determine the true cause-specific hazard rates for the above example.
The individual data distributions are
P_i(X,∆) = δ_{∆,1} θ(τ−X) p_i(X) + δ_{∆,0} δ(X−τ) ∫_τ^∞ dt p_i(t)    (292)
We note that
∫_X^∞ dt [ P_i(t,0)+P_i(t,1) ] = ∫_X^∞ dt [ θ(τ−t) p_i(t) + δ(t−τ) ∫_τ^∞ ds p_i(s) ] = ∫_X^∞ dt p_i(t)    (293)
Via (18) we can now calculate the individual cause-specific hazard rates:
π_0^i(X) = δ(X−τ) ∫_τ^∞ dt p_i(t) / ∫_X^∞ dt [ P_i(t,0)+P_i(t,1) ] = δ(X−τ) ∫_τ^∞ dt p_i(t) / ∫_X^∞ dt p_i(t) = δ(X−τ)    (294)
π_1^i(X) = θ(τ−X) p_i(X) / ∫_X^∞ dt [ P_i(t,0)+P_i(t,1) ] = θ(τ−X) p_i(X) / ∫_X^∞ dt p_i(t)    (295)
We note that all individuals in the cohort have exponential event time distributions for the
primary risk, so prior to the trial termination time τ all should have time-independent cause-
specific hazard rates for the primary risk. This indeed follows from the above formula:
X > τ, all i:   π_1^i(X) = 0    (296)
X < τ, i ≤ N/2:   π_1^i(X) = a e^{−aX} / ∫_X^∞ dt a e^{−at} = a    (297)
X < τ, i > N/2:   π_1^i(X) = a(1+Z_i) e^{−a(1+Z_i)X} / ∫_X^∞ dt a(1+Z_i) e^{−a(1+Z_i)t} = a(1+Z_i)    (298)
The covariate is positively associated with the risk, since a value Zi = 1 shortens the average
time to the primary event by a factor two for half of our cohort. We draw all Zi randomly and
independently from P(Z) = ½δ_{Z,1} + ½δ_{Z,0}. Note that we can write all individual hazard rates
above in the Cox form, with the same base hazard rate but with different regression parameters
for the two sub-groups:
i ≤ N/2:   π_1^i(t) = λ_0(t) e^{βZ_i},   λ_0(t) = aθ(τ−t),   β = 0    (299)
i > N/2:   π_1^i(t) = λ_0(t) e^{βZ_i},   λ_0(t) = aθ(τ−t),   β = ln 2 ≈ 0.693    (300)
From this, in turn, we can calculate the true sub-cohort primary risk hazard rate, via (180)
(in which we abbreviate βi = 0 if i ≤ N/2 and βi = ln(2) for i > N/2):
π_1(t|Z) = ∑_{i∈Ω_Z} π_1^i(t) e^{−∫_0^t ds [δ(τ−s)+aθ(τ−s)e^{β_iZ_i}]} / ∑_{i∈Ω_Z} e^{−∫_0^t ds [δ(τ−s)+aθ(τ−s)e^{β_iZ_i}]}
= ∑_{i∈Ω_Z} π_1^i(t) e^{−a e^{β_iZ_i} ∫_0^t ds θ(τ−s)} / ∑_{i∈Ω_Z} e^{−a e^{β_iZ_i} ∫_0^t ds θ(τ−s)}    (301)
We clearly always have π1(t|Z) = 0 for t > τ . For t < τ we find
π_1(t<τ|0) = ∑_i (1−Z_i) π_1^i(t) e^{−at} / ∑_i (1−Z_i) e^{−at} = a ∑_i(1−Z_i) / ∑_i(1−Z_i) = a    (302)
π_1(t<τ|1) = [ ∑_{i≤N/2} Z_i π_1^i(t) e^{−at} + ∑_{i>N/2} Z_i π_1^i(t) e^{−2at} ] / [ ∑_{i≤N/2} Z_i e^{−at} + ∑_{i>N/2} Z_i e^{−2at} ]
= a [ ∑_{i≤N/2} Z_i + 2 ∑_{i>N/2} Z_i e^{−at} ] / [ ∑_{i≤N/2} Z_i + ∑_{i>N/2} Z_i e^{−at} ]    (303)
For N → ∞ this would become
π_1(t<τ|0) = a,    π_1(t<τ|1) = a (1+2e^{−at})/(1+e^{−at})    (304)
So in combination we can write
π_1(t|Z) = a θ(τ−t) e^{β(t)Z},    β(t) = ln[ (1+2e^{−at})/(1+e^{−at}) ]    (305)
This shows that the effect identified earlier, of heterogeneous cohorts giving decaying hazard
rates even if all individual hazard rates are strictly time-independent, also impacts on regression
parameters. Here we find that the sub-cohort primary hazard rate is nearly of the Cox form,
but with a time-dependent regression parameter, which is not allowed in Cox regression.
According to (246) we want to maximise in Cox regression the following quantity over β (apart
from an irrelevant constant):
L(β|D) = (β/N) ∑_{i=1}^N δ_{1,∆_i} Z_i − (1/N) ∑_{i=1}^N δ_{1,∆_i} log[ (1/N) ∑_{j=1}^N e^{βZ_j} θ(X_j−X_i) ]
= (β/N) ∑_{i=1}^N θ(τ−t_i) Z_i − (1/N) ∑_{i=1}^N θ(τ−t_i) log[ (1/N) ∑_{j=1}^N e^{βZ_j} ( θ(τ−t_j)θ(t_j−t_i) + θ(t_j−τ)θ(τ−t_i) ) ]    (306)
We now inspect the case where our cohort is very large, so that we may send N → ∞. By the
law of large numbers we then obtain
lim_{N→∞} L(β|D) = β 〈 ½Z ∫_0^τ dt [ a e^{−at} + a(1+Z) e^{−a(1+Z)t} ] 〉_Z
− 〈 ½ ∫_0^τ dt [ a e^{−at} + a(1+Z) e^{−a(1+Z)t} ] log〈 ½ e^{βZ′} ∫_t^∞ dt′ ( a e^{−at′} + a(1+Z′) e^{−a(1+Z′)t′} ) 〉_{Z′} 〉_Z
= β 〈 ½Z ( 2 − e^{−aτ} − e^{−a(1+Z)τ} ) 〉_Z − 〈 ½ ∫_0^τ dt [ a e^{−at} + a(1+Z) e^{−a(1+Z)t} ] log〈 ½ e^{βZ′} ( e^{−at} + e^{−a(1+Z′)t} ) 〉_{Z′} 〉_Z
= ¼ β ( 2 − e^{−aτ} − e^{−2aτ} ) − 〈 ½ ∫_0^τ dt [ a e^{−at} + a(1+Z) e^{−a(1+Z)t} ] log[ ½ e^{−at} ] 〉_Z
− 〈 ½ ∫_0^τ dt [ a e^{−at} + a(1+Z) e^{−a(1+Z)t} ] log[ 1 + ½ e^{β}(1+e^{−at}) ] 〉_Z    (307)
We need to calculate the maximum with respect to β of this expression. Differentiation with
respect to β gives
4 (d/dβ) lim_{N→∞} L(β|D) = 2−e^{−aτ}−e^{−2aτ} − ∫_0^τ dt ( 3a e^{−at} + 2a e^{−2at} ) e^{β}(1+e^{−at}) / [ 2 + e^{β}(1+e^{−at}) ]
= 2−e^{−aτ}−e^{−2aτ} − ∫_0^τ dt ( 3a e^{−at} + 2a e^{−2at} ) + 2a ∫_0^τ dt ( 3e^{−at} + 2e^{−2at} ) / [ 2 + e^{β}(1+e^{−at}) ]
= 2 ∫_0^{aτ} ds ( 3e^{−s} + 2e^{−2s} ) / [ 2 + e^{β}(1+e^{−s}) ] − 2(1−e^{−aτ})    (308)
So β̂ is the unique solution of
∫_0^{aτ} ds ( 3e^{−s} + 2e^{−2s} ) / [ 2 + e^{β̂}(1+e^{−s}) ] = 1 − e^{−aτ}    (309)
Figure 8. The most probable parameter value β̂ in Cox regression, for the example data
(290,291), in the limit N → ∞. All individuals have strictly time-independent hazard rates,
but due to cohort filtering the cohort-level primary hazard rate becomes time dependent.
In Cox regression this is not allowed, and as a consequence one finds that the most probable
parameter β̂ becomes dependent on (and decays with) the duration of the trial.
which via the transformation x = e^{−s} can be rewritten as
1−e^{−aτ} = ∫_{e^{−aτ}}^1 dx (3+2x) / [ 2 + e^{β̂}(1+x) ] = 2e^{−β̂} ∫_{e^{−aτ}}^1 dx ( 3/2 + x ) / ( 2e^{−β̂}+1+x )
= 2e^{−β̂}(1−e^{−aτ}) + 2e^{−β̂}( ½ − 2e^{−β̂} ) ∫_{e^{−aτ}}^1 dx 1/( 2e^{−β̂}+1+x )
= 2e^{−β̂}(1−e^{−aτ}) + e^{−β̂}(1−4e^{−β̂}) log[ (2e^{−β̂}+2) / (2e^{−β̂}+1+e^{−aτ}) ]    (310)
Hence we get
(1−e^{−aτ})(1−2e^{−β̂}) = e^{−β̂}(1−4e^{−β̂}) log[ (2e^{−β̂}+2) / (2e^{−β̂}+1+e^{−aτ}) ]    (311)
For small τ we find
aτ(1−2e^{−β̂}) + O((aτ)²) = −e^{−β̂}(1−4e^{−β̂}) log( 1 − aτ/(2e^{−β̂}+2) )
(1−2e^{−β̂}) = e^{−β̂}(1−4e^{−β̂}) / (2e^{−β̂}+2) + O(aτ)
2(1−2e^{−β̂})(e^{−β̂}+1) = e^{−β̂}(1−4e^{−β̂}) + O(aτ)
2 = 3e^{−β̂} + O(aτ),   so   β̂ = ln(3/2) + O(aτ) ≈ 0.405 + O(aτ)    (312)
Numerical solution of β̂ from equation (311) for different values of the trial cut-off time τ results
in the curve of figure 8. Here the cohort ‘filtering’ results by definition in a time-independent
regression parameter β̂ (since this is what Cox regression allows for), but the value found for β̂
decays with increasing trial durations, in spite of the fact that at the level of individuals there
is not a single time-dependent risk parameter.
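The curve of figure 8 can be reproduced by solving (311) numerically for a range of trial durations; the root-finder bracket below is an assumption based on the range of β̂ values seen in figure 8.

```python
from math import exp, log
from scipy.optimize import brentq

# Solve (311) for beta_hat at given a*tau; beta_hat -> ln(3/2) ~ 0.405
# as a*tau -> 0, and decays with longer trials, as in figure 8.
def beta_hat(atau):
    def f(b):
        e, c = exp(-atau), exp(-b)
        return (1 - e) * (1 - 2*c) - c * (1 - 4*c) * log((2*c + 2) / (2*c + 1 + e))
    return brentq(f, 1e-6, 2.0)

for atau in [0.01, 0.5, 1.0, 2.0, 5.0]:
    print(atau, beta_hat(atau))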
Example 3: parametrisation of the base hazard rate
The derivation of Cox’s regression equations for the parameters β was based on first maximising
(243) over the base hazard rates, giving (245), which was then substituted into (243) and
subsequently maximised over β. However, it is clear that the sum over δ-peaks (245) is a
maximum-likelihood estimator which is only realistic for infinitely large cohorts; we expect the
true base hazard rate to be smooth in time. We have already seen earlier that in the Bayesian
formalism one could deal with this via a suitable smoothness prior (although this leads to
complicated equations). An alternative route for implementing smoothness of the base hazard
rate within the Cox formalism is to insert into (243) a simple parametrised form λ_0(t|θ):
logP(β,θ|D) = ∑_{i=1}^N δ_{1,∆_i} log λ_0(X_i|θ) + β·∑_{i=1}^N δ_{1,∆_i} Z_i − ∑_{i=1}^N e^{β·Z_i} ∫_0^{X_i} dt λ_0(t|θ)    (313)
We now maximise this expression over (β,θ) instead of (β,λ_0). One tends to look for
parametrisations for which the integral in the last term can be done analytically. For instance,
a popular parametrisation is
λ_0(t|y,τ) = (y/τ)(t/τ)^{y−1},   τ > 0, y > 0    (314)
which gives us the following quantity, to be maximised over (τ,y,β):
L(β,y,τ|D) = log(y/τ^y) ∑_i δ_{1,∆_i} + (y−1) ∑_i δ_{1,∆_i} log X_i + β·∑_i δ_{1,∆_i} Z_i − ∑_i e^{β·Z_i} (X_i/τ)^y    (315)
• We first maximise (315) over τ via
∂L(β,y,τ|D)/∂τ = (y/τ) [ ∑_{i=1}^N e^{β·Z_i}(X_i/τ)^y − ∑_{i=1}^N δ_{1,∆_i} ]    (316)
giving
τ^y = ∑_{i=1}^N e^{β·Z_i}(X_i)^y / ∑_{i=1}^N δ_{1,∆_i}    (317)
Upon substituting this optimal value for the time-scale parameter τ into (315), we are left
with the following function to be maximised over (y,β):
L(β,y|D) = log[ y ∑_i δ_{1,∆_i} / ∑_i e^{β·Z_i}(X_i)^y ] ( ∑_i δ_{1,∆_i} ) + (y−1) ∑_i δ_{1,∆_i} log X_i
+ β·∑_i δ_{1,∆_i} Z_i − ∑_i δ_{1,∆_i}    (318)
• We next maximise the expression L(β,y|D) over (y,β), via
∂L(β,y|D)/∂y = ( ∑_i δ_{1,∆_i} ) ∂/∂y log[ y ∑_i δ_{1,∆_i} / ∑_i e^{β·Z_i}(X_i)^y ] + ∑_i δ_{1,∆_i} log X_i
= ( ∑_i δ_{1,∆_i} ) [ 1/y − ∑_i e^{β·Z_i} log(X_i)(X_i)^y / ∑_i e^{β·Z_i}(X_i)^y ] + ∑_i δ_{1,∆_i} log X_i    (319)
and
∂L(β,y|D)/∂β_µ = ∑_i δ_{1,∆_i} Z_µ^i − ( ∑_i δ_{1,∆_i} ) ∂/∂β_µ log( ∑_i e^{β·Z_i}(X_i)^y )
= ∑_i δ_{1,∆_i} Z_µ^i − ( ∑_i δ_{1,∆_i} ) ∑_i e^{β·Z_i} Z_µ^i (X_i)^y / ∑_i e^{β·Z_i}(X_i)^y    (320)
Thus we find that the optimal y and β are to be solved simultaneously from the following two
equations:
1/y = ∑_i e^{β·Z_i} log(X_i)(X_i)^y / ∑_i e^{β·Z_i}(X_i)^y − ∑_i δ_{1,∆_i} log X_i / ∑_i δ_{1,∆_i}    (321)
∑_i δ_{1,∆_i} Z_µ^i / ∑_i δ_{1,∆_i} = ∑_i e^{β·Z_i} Z_µ^i (X_i)^y / ∑_i e^{β·Z_i}(X_i)^y    (322)
In contrast, the standard Cox equations for β are (247), which we can also write as
∑_i δ_{1,∆_i} Z_µ^i / ∑_i δ_{1,∆_i} = ∑_i e^{β·Z_i} Z_µ^i ∫_0^{X_i} dt λ_0(t|β) / ∑_i δ_{1,∆_i}    (323)
with the base hazard rate (245). It will be clear that the most probable values for β
corresponding to the choice of a parametrised base hazard rate will generally be different from
the standard Cox values that follow from (247). It follows from the above that they will only
be identical if
( ∑_i e^{β·Z_i}(X_i)^y ) ( ∑_i e^{β·Z_i} Z_µ^i ∫_0^{X_i} dt λ_0(t|β) ) = ( ∑_i δ_{1,∆_i} ) ( ∑_i e^{β·Z_i} Z_µ^i (X_i)^y )    (324)
In the simplest case of just one risk (i.e. ∆i = 1 for all i), for instance, this condition and the
equation for y reduce after some simple rewriting to
(1/N) ∑_i e^{β̂·Z_i} Z_µ^i [ (1/N) ∑_k θ(X_i−X_k) / ∑_j e^{β̂·Z_j}θ(X_j−X_k) − (X_i)^y / ∑_j e^{β̂·Z_j}(X_j)^y ] = 0    (325)
1/y = ∑_i e^{β̂·Z_i} log(X_i)(X_i)^y / ∑_i e^{β̂·Z_i}(X_i)^y − (1/N) ∑_i log X_i    (326)
This set of equations simplifies when written in terms of the new variables Y_i = X_i^y:
(1/N) ∑_i Z_µ^i [ (1/N) ∑_k e^{β̂·Z_i} θ(Y_i−Y_k) / ∑_j e^{β̂·Z_j}θ(Y_j−Y_k) − e^{β̂·Z_i} Y_i / ∑_j e^{β̂·Z_j} Y_j ] = 0    (327)
1 = ∑_i e^{β̂·Z_i} log(Y_i) Y_i / ∑_i e^{β̂·Z_i} Y_i − (1/N) ∑_i log Y_i    (328)
This will only be satisfied in very special cases. Generally, therefore, one should not use
parametrised base hazard rates in conjunction with the conventional formulae for the Cox
regression parameters.
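As a numerical illustration of this warning, one can maximise (315) directly over (β, y, τ) with a generic optimiser (using log-parametrisations to keep y, τ > 0) and compare the resulting β with a partial-likelihood fit of (247); the Weibull toy data below are our own construction.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, p = 300, 2
Z = rng.normal(size=(N, p))
# Weibull proportional-hazards data: S(t|Z) = exp(-e^{beta.Z} t^y), y = 1.5
X = rng.weibull(1.5, N) / np.exp((Z @ np.array([0.7, -0.3])) / 1.5)
d1 = np.ones(N)   # one risk: delta_{1,Delta_i} = 1 for all i

def neg_L(params):                 # params = (beta_1..beta_p, log y, log tau)
    beta, y, tau = params[:p], np.exp(params[p]), np.exp(params[p + 1])
    val = np.log(y / tau**y) * d1.sum() + (y - 1) * np.sum(np.log(X))  # (315)
    val += (Z @ beta).sum() - np.sum(np.exp(Z @ beta) * (X / tau)**y)
    return -val

res = minimize(neg_L, np.zeros(p + 2))
beta_w, y_w = res.x[:p], np.exp(res.x[p])
print(beta_w, y_w)   # generally differs from the partial-likelihood beta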
9. Overfitting and p-values
9.1. What is overfitting?
Extracting covariate-to-outcome patterns from examples. In survival analysis we are given examples
of covariates Zi and corresponding ‘outcome variables’ (Xi,∆i). We assumed that the outcomes are
not generated purely randomly, but drawn from a distribution P (X,∆|Z) that actually depends on
Z. We have no direct information on P (X,∆|Z), but have to infer it from the N input-outcome
combinations (Zi, Xi,∆i) in our data set D. This is the objective of regression or classification.
One uses the term regression when the outcome is a real variable, classification when it is discrete;
in survival analysis we generally have both.
Model complexity. The danger in extracting patterns from such data is that we may extract
regularities that describe perfectly the detailed realisation of the specific data in D, but not the larger
population from which the data set D was drawn. If the individuals in D were selected randomly
from a larger population, there will be sampling noise in this selection (exactly which individuals
were picked?); we are not interested in any information that pertains only to the individuals in D
but that is not representative of the population. Let us be more specific. Suppose we were to take
the estimator P(X,∆|Z) in (191) as our description of the true event statistics in the population,
i.e.
P(X,∆|Z) = ∑_i δ(X−X_i) δ_{∆,∆_i} δ(Z−Z_i) / ∑_i δ(Z−Z_i)    (329)
This would suggest that we truly believe that no events can ever occur at times other than those
observed in the data set D. This would be nonsense. At best we hope that for sufficiently large data
sets this expression gives a reasonable approximation of the survival statistics in a distributional
sense, but the precise locations of the individual δ-peaks reflect randomness specific to D that most
probably does not describe actual biochemistry in the population.
Overfitting happens when we use a model that is too complex for the amount of available
data, and that therefore describes not just generic regularities but also the ‘noise’ in the data at
hand. It is clear that when one uses a model with more adjustable parameters than the number of
available data points, one is simply fitting a curve through data, and one cannot expect this curve
to generalise to the wider population and make reproducible predictions.
The example in figure 7 (in section 7.6) illustrates the impact of complexity reduction. The maximum
likelihood estimator π̂(t|Z) on the left is a finite sum of δ-peaks, whose precise locations we do not
expect to be reproducible. We allowed π(t|Z) to take any shape, however irregular or discontinuous.
The Bayesian estimator on the right in the figure, in contrast, is constrained by the prior to describe
the data with a smooth function π(t|Z), and consequently focuses on the density of δ-peaks in each
time range, which we would hope to be a more realistic description of the population.
9.2. Overfitting in binary classification
Binary outcome prediction from binary expression data. Let us make a small detour and simplify
our problem to its core. We assume that (i) we have just one risk (so the label ∆i is obsolete), (ii) all
our covariates are binary, i.e. Z_µ^i ∈ {0,1} for all µ and all patients i (for instance we have covariates
giving gene expression levels, rounded off to the values Z = 1 ‘expressed’ or Z = 0 ‘non-expressed’).
Rather than predicting the event time X via P (X|Z) we try to classify our patients i simply into
good outcome and poor outcome ones, via a time cut-off τ that separates the classes: σi = 1 if
Xi > τ , and σi = 0 if Xi < τ . For example, if we have 4 patients and in each patient we measure
60 expression levels, our data could look like this:
outcome expression pattern
i = 1 : σ1 = 1 Z1 = (100101001010010101010010001010111001001001001001001000011111)
i = 2 : σ2 = 1 Z2 = (010001000010101001010101010010101000111100101001001010101000)
i = 3 : σ3 = 0 Z3 = (001010001110101101100100100111001110010100101010101000101010)
i = 4 : σ4 = 0 Z4 = (101011001010110010100100111100100101100111010111010001010010)
We would then try to find patterns in the Zi that enable us to predict the outcome σi; if our
covariates are gene expresison levels, such patterns are often called ‘gene signatures’. Here a
candidate signature gene would be a gene that takes the same values for all patients i in the
σi = 1 group, and opposite values for all patients in the σi = 0 group. Inspection reveals that there
are seven candidate signature genes in the above data, coloured in red (and marked with arrows):
outcome expression pattern ↓↓ ↓ ↓ ↓ ↓ ↓
i = 1 : σ1 = 1 Z1 = (100101001010010101010010001010111001001001001001001000011111)
i = 2 : σ2 = 1 Z2 = (010001000010101001010101010010101000111100101001001010101000)
i = 3 : σ3 = 0 Z3 = (001010001110101101100100100111001110010100101010101000101010)
i = 4 : σ4 = 0 Z4 = (101011001010110010100100111100100101100111010111010001010010)
However, the above data were in fact generated purely randomly, so these ‘patterns’ are not real,
but just random accidents. To illustrate this, let us randomize the outcome labels. This will again
reveal a set of alternative candidate ‘signature genes’ that appear to predict outcome:
outcome expression pattern ↓ ↓ ↓
i = 1 : σ1 = 1 Z1 = (100101001010010101010010001010111001001001001001001000011111)
i = 2 : σ2 = 0 Z2 = (010001000010101001010101010010101000111100101001001010101000)
i = 3 : σ3 = 1 Z3 = (001010001110101101100100100111001110010100101010101000101010)
i = 4 : σ4 = 0 Z4 = (101011001010110010100100111100100101100111010111010001010010)
Clearly, the larger the number p of covariates, the more frequent will be the accidental ‘signature
genes’. The numbers p = 60 and N = 4 chosen in the above example may be too small to be
realistic, but their ratio p/N = 15 is in fact similar to that in genetic data bases (where we may
well have values like N ≈ 1000 and p ≈ 15000).
Multiple-testing correction to p-values. Let us quantify all this in more detail. If we generate the
values Z_µ = (Z_µ^1,...,Z_µ^N) for each gene randomly, with Prob(Z=1) = Prob(Z=0) = ½ (as was
done for the above data), then for a given assignment of outcome variables (σ_1,...,σ_N) we find
Prob(µ is signature gene) = Prob[ Z_µ^i = σ_i for all i ] + Prob[ Z_µ^i = 1−σ_i for all i ] = 2·(½)^N = (½)^{N−1}    (330)
If we have p such random genes and N patients, we will find on average p(½)^{N−1} accidental signature
genes (giving an average of 60·(½)³ = 7.5 for our above example), and the probability to see k
accidental signature genes will be
P(k signature genes) = (p choose k) (½)^{k(N−1)} ( 1 − (½)^{N−1} )^{p−k}    (331)
In particular, the probability to find at least one candidate signature gene is
∑_{k=1}^p P(k signature genes) = 1 − P(0 signature genes) = 1 − ( 1 − (½)^{N−1} )^p    (332)
If we look for candidate signature genes in real data, and we find one or more of these, we should
calculate the p-value (i.e. the probability that our observation is just the result of chance) using as
our ‘null hypothesis’ the above situation of randomly generated expression values. This gives:
p-value of observed signature gene = 1 − ( 1 − (½)^{N−1} )^p    (333)
We cannot just use the naive probability (½)^{N−1} of an individual gene being a signature as our
p-value (unless we had indeed selected this particular gene beforehand and investigated only this
chosen gene, rather than looking for interesting genes in a list of p candidates). We must take into
account the number of genes we inspect for candidacy. This is called ‘multiple-testing correction’.
For the data above the probability for a given gene to be a signature gene would be
(½)^{N−1} = 0.125 (nearly a signal), but with multiple testing the burden of evidence is much higher,
and we find a multiple-testing p-value of 1 − (7/8)^{60} ≈ 0.9997 (no signal at all). If in our data set the
balance between expressed and non-expressed genes differs from 50%, or the balance between good
and poor outcome patients differs from 50%, then formula (333) is modified in a trivial way.
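The correction (333) is trivial to evaluate; for the toy example above:

```python
# Multiple-testing corrected p-value (333): p = 60 random binary genes,
# N = 4 patients, as in the example in the text.
N, p = 4, 60
single = 0.5 ** (N - 1)              # chance that one given gene is a 'signature'
corrected = 1 - (1 - single) ** p    # chance of >= 1 accidental signature gene
print(single, corrected)             # 0.125 and ~0.9997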
9.3. Overfitting in Cox regression
Using Cox regression for binary classification. Although Cox regression was not designed for binary
classification, it can be used for classifying patients into low risk (σ_i = 1 if X_i > τ) versus high
risk (σ_i = 0 if X_i < τ) classes. Once the Cox regression parameters β̂ have been calculated from
our data D = {(Z_1,X_1,∆_1),...,(Z_N,X_N,∆_N)}, we can use (250) to predict new classifications
σ ∈ {0,1} on the basis of covariate information Z alone. To emphasise the similarity with the
above example of gene signatures, we first rewrite S_Cox(τ|Z) as
S_Cox(τ|Z) = Φ(w·Z)    (334)
with w = −β̂, and where
Φ(u) = exp( −e^{−u} Λ_0(τ|β̂) )    (335)
with the integrated base hazard rate Λ_0(τ|β̂) as defined in (249). The function Φ(u) is monotonically
increasing, with Φ(−∞) = 0 (provided there is at least one i with X_i < τ and ∆_i = 1) and Φ(∞) = 1.
This definition allows us to write the class probability for an individual with covariates Z within
the Cox formalism as
P(σ|Z) = δ_{σ,1} Φ(w·Z) + δ_{σ,0} [ 1 − Φ(w·Z) ]    (336)
Figure 9. Classification performance Q_T on training sets and classification performance
Q_V on validation sets, in Cox regression used for binary classification, with breast cancer
tumour biomarkers as covariates. Classes were defined by cancer relapse before 8 years
(σ_i = 0) or after 8 years (σ_i = 1) following diagnosis of the primary tumour. Here N = 70
(with randomly selected training and validation sets of equal sizes N_V = N_T = 35), and
p = 1,...,18. The curves were generated iteratively, starting from Cox regression and
performance measurement for all p = 18 available covariates, followed by iterative removal of
the least important covariate (the one with the smallest average value of |β̂_µ|) and repetition
of the Cox regression and performance measurement, down to p = 1 (which leaves only the
most informative covariate). Classes are assigned according to (337), and the measures Q_V
and Q_T are defined in (339). Error bars indicate standard deviations over all divisions into
training and validation sets.
Expression (336) is of a form commonly used by all machine learning binary classification algorithms
that are based on linear separation. We can then allocate a class σ(Z) to each individual with
covariates Z, defined as the most probable class according to (336):
σ(Z) = 1 if P(1|Z) > ½,   σ(Z) = 0 if P(0|Z) > ½    (337)
Evidence of overfitting in classification performance on training and validation sets. We would like to know
whether overfitting also affects Cox regression. To answer this we need to identify a measurable
marker of overfitting. Overfitting in pattern detection algorithms and regression methods means
that the final model (i.e. a quantitative pattern linking covariates to outcome) predicts outcome
significantly better on the specific data set used in the analysis than on data from the wider population
from which the data were drawn, i.e. patterns which claim to predict outcome are insufficiently
reproducible beyond the data set from which they were extracted. The only unambiguous test for
overfitting in pattern detection algorithms and methods therefore requires having two data sets:
one set from which to extract the patterns (if any), the ‘training set’, and one set on which to test
these patterns, called the ‘validation set’. Starting from our full cohort {1,...,N} we would thus
separate this cohort in two, {1,...,N} = Ω_T ∪ Ω_V (the simplest division would be into two random
subsets of equal size N_T = N_V = ½N), where only data from Ω_T are used for finding classification
parameters. The fractions correctly classified for a given division are defined as
Q_T(Ω_T) = (1/N_T) ∑_{i∈Ω_T} δ_{σ_i,σ(Z_i)},   Q_V(Ω_V) = (1/N_V) ∑_{i∈Ω_V} δ_{σ_i,σ(Z_i)}    (338)
and the average performance measures are defined as the averages of the above values over all
possible divisions of {1,...,N} into equally large training and validation sets:
Q_T = 〈Q_T(Ω_T)〉_{Ω_T},   Q_V = 〈Q_V(Ω_V)〉_{Ω_V}    (339)
An example calculation of the quantities in (339) as a function of (iteratively reduced) values of p
(the number of covariates included) is shown in figure 9. Here the problem was to predict whether
breast cancer relapse occurs after 8 years (σ_i = 1) or before 8 years (σ_i = 0). This figure reveals
obvious overfitting as soon as p > 6 (where N_T/p < 6). For p ≤ 6 both training and validation
performance increase with increased model complexity, but for p > 7 we see the classic fingerprint
of overfitting: the performance on the training set continues to increase, but the performance
on the validation set deteriorates consistently (here the regression model is trying to predict the
‘noise’ in the training set, as opposed to a reproducible pattern). One finds very similar figures if
in this same N = 70 data set one uses clinical covariates (e.g. tumour size, tumour grade, lymph
node involvement, etc); also there overfitting sets in after inclusion of just a few covariates.
It is not possible to say beforehand which is the optimal ratio NT /p, i.e. the ratio where QT and
QV start to separate, in a given classification problem. This ratio must depend on various details of
the data, e.g. correlations amongst covariates, any imbalance between the sizes of the σ = 1 versus
σ = 0 classes, etc. For large training sets Ω_T with randomly generated Z_i ∈ {0,1}^p (with equal
probabilities) and randomly generated outcome variables σ_i ∈ {0,1} (with equal probabilities) an
old exact result by Cover on binary classifiers tells us that we need at least N_T/p > 2.
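The protocol (338)–(339) is easy to simulate. In the sketch below we replace the full Cox-based classifier (337) by a least-squares linear score, purely to keep the code short — an assumption on our part, but the overfitting fingerprint Q_T > Q_V appears in the same way. With pure-noise covariates and outcomes, Q_V should hover near ½ while Q_T is inflated by the p + 1 fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, n_splits = 70, 18, 200
Z = rng.normal(size=(N, p))
sigma = rng.integers(0, 2, N)     # pure-noise outcomes: nothing to learn

QT = QV = 0.0
for _ in range(n_splits):
    perm = rng.permutation(N)
    train, valid = perm[:N // 2], perm[N // 2:]
    A = np.c_[Z[train], np.ones(len(train))]          # intercept column
    w = np.linalg.lstsq(A, sigma[train], rcond=None)[0]
    pred = lambda idx: (np.c_[Z[idx], np.ones(len(idx))] @ w > 0.5).astype(int)
    QT += np.mean(pred(train) == sigma[train]) / n_splits   # (338), training half
    QV += np.mean(pred(valid) == sigma[valid]) / n_splits   # (338), validation half

print(QT, QV)   # e.g. QT well above 0.5, QV close to 0.5: overfitting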
Evidence of overfitting in regression parameters. We have seen that Cox regression suffers from
overfitting if the number p of covariates is too large relative to the size N of the data set. One also
observes this when calculating β for small data sets. The simplest case is to consider p = N = 2,
with just one risk, where we know we are in the overfitting regime. For reasons that will become clear
we add a weak Gaussian prior P(β) ∼ exp(−½αβ²) to the Bayesian description, with 0 < α ≪ 1.
Here we have D = {(Z_1^1,Z_2^1,X_1), (Z_1^2,Z_2^2,X_2)} and find (246) reducing to
L(β_1,β_2|D) = L(0,0|D) + β_1(Z_1^1+Z_1^2) + β_2(Z_2^1+Z_2^2) − ½α(β_1²+β_2²)
− log[ ( ½e^{β_1Z_1^1+β_2Z_2^1} + e^{β_1Z_1^2+β_2Z_2^2}θ(X_2−X_1) ) / ( ½ + θ(X_2−X_1) ) ]
− log[ ( e^{β_1Z_1^1+β_2Z_2^1}θ(X_1−X_2) + ½e^{β_1Z_1^2+β_2Z_2^2} ) / ( θ(X_1−X_2) + ½ ) ]    (340)
using θ(0) = ½. Without loss of generality we may take X_1 < X_2, giving
L(β_1,β_2|D) = L(0,0|D) + β_1Z_1^1 + β_2Z_2^1 − ½α(β_1²+β_2²) − log( ⅓e^{β_1Z_1^1+β_2Z_2^1} + ⅔e^{β_1Z_1^2+β_2Z_2^2} )    (341)
To find the maximum we differentiate, and find the equations from which to solve (β1, β2) and the
curvature matrix at the extremal point:
∂L(β_1,β_2|D)/∂β_1 = Z_1^1 − αβ_1 − [ Z_1^1 e^{β_1Z_1^1+β_2Z_2^1} + 2Z_1^2 e^{β_1Z_1^2+β_2Z_2^2} ] / [ e^{β_1Z_1^1+β_2Z_2^1} + 2e^{β_1Z_1^2+β_2Z_2^2} ]    (342)
= 2(Z_1^1−Z_1^2) e^{β_1Z_1^2+β_2Z_2^2} / [ e^{β_1Z_1^1+β_2Z_2^1} + 2e^{β_1Z_1^2+β_2Z_2^2} ] − αβ_1
= 2(Z_1^1−Z_1^2) / [ e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} + 2 ] − αβ_1    (343)
∂L(β_1,β_2|D)/∂β_2 = Z_2^1 − αβ_2 − [ Z_2^1 e^{β_1Z_1^1+β_2Z_2^1} + 2Z_2^2 e^{β_1Z_1^2+β_2Z_2^2} ] / [ e^{β_1Z_1^1+β_2Z_2^1} + 2e^{β_1Z_1^2+β_2Z_2^2} ]    (344)
= 2(Z_2^1−Z_2^2) e^{β_1Z_1^2+β_2Z_2^2} / [ e^{β_1Z_1^1+β_2Z_2^1} + 2e^{β_1Z_1^2+β_2Z_2^2} ] − αβ_2
= 2(Z_2^1−Z_2^2) / [ e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} + 2 ] − αβ_2    (345)
and
∂²L(β_1,β_2|D)/∂β_1² = −2(Z_1^1−Z_1^2)² e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} / [ e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} + 2 ]² − α    (346)
∂²L(β_1,β_2|D)/∂β_1∂β_2 = −2(Z_1^1−Z_1^2)(Z_2^1−Z_2^2) e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} / [ e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} + 2 ]²    (347)
∂²L(β_1,β_2|D)/∂β_2² = −2(Z_2^1−Z_2^2)² e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} / [ e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} + 2 ]² − α    (348)
If we write the first derivatives in terms of the quantity Ξ = ½e^{β_1(Z_1^1−Z_1^2)+β_2(Z_2^1−Z_2^2)} we find
∂L/∂β_1 = (Z_1^1−Z_1^2)/(Ξ+1) − αβ_1,   ∂L/∂β_2 = (Z_2^1−Z_2^2)/(Ξ+1) − αβ_2    (349)
The maximum is achieved for
β_1 = (Z_1^1−Z_1^2)/(α(Ξ+1)),   β_2 = (Z_2^1−Z_2^2)/(α(Ξ+1))    (350)
with Ξ(α) to be solved from the transcendental equation
(1+Ξ) log(2Ξ) = (Z^1−Z^2)²/α    (351)
Clearly Ξ(α) → ∞ for α → 0, such that limα→0 αΞ(α) = 0. Hence the regression parameters β1,2
both diverge as α→ 0, which is a clear sign of overfitting.
Although in the example above the regression parameters diverge, it turns out that the z-scores do not diverge, as a result of which we also find large p-values (260) for the hazard ratios HR_µ (equivalently: for the β_µ). The latter are given by

µ-th p-value = 1 − Erf(z_µ/√2),   z_µ = |β_µ|/σ_µ     (352)
The second derivatives of L are needed to calculate p-values, and must be evaluated for the null model β_1 = β_2 = 0, giving

∂²L/∂β_1²|_{β=0} = −(2/9)(Z^1_1−Z^2_1)² − α,   ∂²L/∂β_1∂β_2|_{β=0} = −(2/9)(Z^1_1−Z^2_1)(Z^1_2−Z^2_2)     (353)

∂²L/∂β_2²|_{β=0} = −(2/9)(Z^1_2−Z^2_2)² − α     (354)

For the matrix A(0) this implies

A_{µν}(0) = −∂²L/∂β_µ∂β_ν|_{β=0} = αδ_{µν} + Z_µ Z_ν,   with Z_µ = (√2/3)(Z^1_µ−Z^2_µ)     (355)
The normalised eigenvectors of A(0) are e_1 = (Z_1, Z_2)/√(Z_1²+Z_2²) and e_2 = (Z_2, −Z_1)/√(Z_1²+Z_2²), with respective eigenvalues:

λ_1 = e_1·A(0)e_1 = α + Z_1² + Z_2²     (356)
λ_2 = e_2·A(0)e_2 = α     (357)

This allows us to write down the entries of A^{-1}(0) in explicit form:

A^{-1}(0)_{µν} = (1/λ_1)(e_1)_µ(e_1)_ν + (1/λ_2)(e_2)_µ(e_2)_ν     (358)
In particular we find the two variances appearing in our formulae for p-values:

σ_1² = A^{-1}(0)_{11} = (e_1)_1²/λ_1 + (e_2)_1²/λ_2 = Z_1²/[λ_1(Z_1²+Z_2²)] + Z_2²/[λ_2(Z_1²+Z_2²)]
     = Z_1²/[(α + Z_1² + Z_2²)(Z_1²+Z_2²)] + Z_2²/[α(Z_1²+Z_2²)]     (359)

σ_2² = A^{-1}(0)_{22} = (e_1)_2²/λ_1 + (e_2)_2²/λ_2 = Z_2²/[λ_1(Z_1²+Z_2²)] + Z_1²/[λ_2(Z_1²+Z_2²)]
     = Z_2²/[(α + Z_1² + Z_2²)(Z_1²+Z_2²)] + Z_1²/[α(Z_1²+Z_2²)]     (360)
It follows that for small α we will have

σ_1 = √{(1/α)(1 + Z_1²/Z_2²)^{-1} + O(1)} = (1/√α)(1 + Z_1²/Z_2²)^{-1/2} + O(√α)     (361)

σ_2 = √{(1/α)(1 + Z_2²/Z_1²)^{-1} + O(1)} = (1/√α)(1 + Z_2²/Z_1²)^{-1/2} + O(√α)     (362)
For small α we thus obtain the following z-scores for our regression parameters β_{1,2}:

z_1 = |β_1|/σ_1 = [|Z^1_1−Z^2_1| / (√α (Ξ + 1))] (1 + Z_1²/Z_2²)^{1/2} (1 + O(α))     (363)

z_2 = |β_2|/σ_2 = [|Z^1_2−Z^2_2| / (√α (Ξ + 1))] (1 + Z_2²/Z_1²)^{1/2} (1 + O(α))     (364)
The remaining question is: how does [1 + Ξ(α)]√α scale as α → 0? Let us write Q = [1 + Ξ(α)]√α, so α = Q²/[1 + Ξ(α)]². Substitution into (351) gives

Q² log(2Ξ)/(1 + Ξ) = (Z^1−Z^2)²     (365)

From this we conclude that as α → 0 (where Ξ → ∞) we must have Q → ∞. Consequently [1 + Ξ(α)]√α → ∞ and therefore

lim_{α→0} z_1 = lim_{α→0} z_2 = 0     (366)
This, in turn, implies that for α→ 0 both p-values become 1, so here the individual diverging values
of the regression parameters β1,2 are recognised to be nonsignificant.
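This scenario can be reproduced in a few lines. The sketch below (Python; the two covariate vectors are arbitrary illustrative values) solves the transcendental equation (351) for Ξ(α) by root bracketing, and then evaluates the maximum (350), the small-α z-scores (363,364) and the p-values (352). As α → 0 the regression parameters grow without bound, while the z-scores shrink and both p-values approach 1:

import numpy as np
from scipy.optimize import brentq
from scipy.special import erf

# hypothetical covariate vectors of the two individuals (p = N = 2)
Z1 = np.array([1.0, -0.5])
Z2 = np.array([-0.3, 0.2])
d = Z1 - Z2                                    # Z^1 - Z^2
c = d @ d                                      # (Z^1 - Z^2)^2 in (351)

for alpha in [1e-1, 1e-2, 1e-3, 1e-4]:
    # solve (1 + Xi) log(2 Xi) = c/alpha for Xi, eq. (351)
    Xi = brentq(lambda x: (1 + x) * np.log(2 * x) - c / alpha, 0.51, 1e12)
    beta = d / (alpha * (Xi + 1))              # maximum, eq. (350)
    # small-alpha z-scores (363),(364); note Z_mu = (sqrt(2)/3) d_mu from (355)
    z1 = abs(d[0]) * np.sqrt(1 + d[0]**2 / d[1]**2) / (np.sqrt(alpha) * (Xi + 1))
    z2 = abs(d[1]) * np.sqrt(1 + d[1]**2 / d[0]**2) / (np.sqrt(alpha) * (Xi + 1))
    pv = 1 - erf(np.array([z1, z2]) / np.sqrt(2))      # p-values, eq. (352)
    print(f"alpha={alpha:.0e}  beta=({beta[0]:.2f},{beta[1]:.2f})  "
          f"z=({z1:.3f},{z2:.3f})  p=({pv[0]:.3f},{pv[1]:.3f})")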
Generally, if we carry out Cox regression and then ask whether any of the regression parameters
are significantly away from zero, we are once more doing multiple testing, and consequently need
to use a corrected p-value for the combined test. If we write the p-value for component βµ as pµ,
then the corrected p-value is 1 minus the probability that none of the parameters are significant:
p-value = 1 − ∏_{µ=1}^p (1 − p_µ) = 1 − ∏_{µ=1}^p Erf(z_µ/√2)     (367)
9.4. p-values for Kaplan-Meier curves
9.5. Notes and examples
10. Heterogeneous cohorts and competing risks
In the picture below we have rows representing a number of selected genes, and columns representing
patients of a breast cancer cohort. Each small square gives the expression level Z^i_µ of a specific gene
µ for a specific individual i (red: upregulated; green: downregulated). The patients are clustered
on the basis of similarity in their expression profiles. This gives a dendrogram (at the top), which
reveals the existence of breast cancer sub-types (at least in terms of expression profiles):
[figure omitted: clustered gene expression heatmap of the breast cancer cohort, with the subtype dendrogram at the top]
The clustering in the previous figure was done in the space IR^p of the covariates Z^i, so this maps
covariate heterogeneity. However, the individuals are all breast cancer patients, so relative to the
overall population we are looking at a high-risk group. If this cohort is homogeneous in terms of
the covariate-to-risk relation, we expect to see some ‘colour pattern’ that is common to all patients
(in contrast to the situation of generating such a figure for all members of the population, with low
and high breast cancer risk). We see in the figure that there is no such pattern: most genes are
consistently upregulated in some cancer sub-types, but consistently downregulated in others. If we
applied Cox regression to this full cohort, with all disease sub-types included, we would not
extract significant information. In Cox regression there is just one parameter for each covariate (an
upregulated gene is assumed to be either consistently good news for any patient, or consistently
bad news, but never good news for some and bad news for others), so the regression parameter
βµ of any gene µ that is not mostly of one colour would in our formulae be averaged out by the
different patient sub-groups, and take a value close to zero (note that the expression patterns of the
basal-like cancers are nearly opposite to those of the luminal subtype A).
In the data set at hand we are lucky: the covariate heterogeneity (which is directly measurable)
allows us to produce clusters which also describe the covariate-to-risk heterogeneity. This is not
always the case: covariate clusters could be absent, while there are still clusters in the covariate-to-risk
patterns. In this section we try to develop intuition and theory for handling such scenarios.
10.1. Population-level hazard rate correlations and competing risks
Competing risks caused by hazard rate heterogeneity. We have seen that most survival analysis methods assume different risks to be uncorrelated. Risk correlations are defined by a non-factorising joint event time distribution, i.e. P(t_0,...,t_R) ≠ ∏_r P(t_r). This could happen because at the level of individuals we have P_i(t_0,...,t_R) ≠ ∏_r P_i(t_r). A more plausible mechanism is for correlated risks to be caused by correlations in risk-specific susceptibilities. We could have P_i(t_0,...,t_R) = ∏_r P_i(t_r) for each i, and still find correlated risks in the population, since generally

P(t_0,...,t_R) = (1/N) Σ_i (∏_r P_i(t_r)) ≠ ∏_r P(t_r)     (368)
Let us illustrate this. Imagine we have a cohort of N = 1000 individuals, subject to two risks. At
the level of individuals the event times are not correlated, so P_i(t_1, t_2) = P_i(t_1) P_i(t_2) for all i, and
all individual hazard rates (π_{i1}, π_{i2}) are time-independent.
• If our cohort is homogeneous, all individuals have (π_{i1}, π_{i2}) = (π_1, π_2). There cannot be risk correlations since there is no risk variability. There can never be competing risk or false protectivity problems.
  [diagram: a single group of N = 1000 individuals, all with hazard rates (π_1, π_2)]
• Imagine that our cohort is heterogeneous, e.g. there are four subgroups (A,B,C,D) with distinct cause-specific hazard rates, but the subgroups are of equal size. There is risk variability, but there are still no risk correlations.
  [diagram: N_A = 250 with (π_1↑, π_2↑), N_B = 250 with (π_1↓, π_2↑), N_C = 250 with (π_1↑, π_2↓), N_D = 250 with (π_1↓, π_2↓)]
• Imagine that our cohort is heterogeneous, e.g. there are four subgroups (A,B,C,D) with distinct cause-specific hazard rates, and distinct sizes. Here there will be significant risk correlations: patients with elevated risk for event 1 tend to also have elevated risk for event 2.
  [diagram: N_A = 480 with (π_1↑, π_2↑), N_B = 20 with (π_1↓, π_2↑), N_C = 20 with (π_1↑, π_2↓), N_D = 480 with (π_1↓, π_2↓)]
In the third scenario risk 2 will exert a ‘false protectivity’ effect on risk 1: the type-2 censoring
events will tend to remove individuals from the population that have an elevated type-1 risk.
We have seen ‘cohort filtering’ already in previous examples. Failure to take heterogeneity and
competing risks into account (as in Kaplan-Meier estimation and Cox regression) can give serious
problems – it can underestimate risks and hazard rates (if competing risks correlate positively) or
overestimate them (if they correlate negatively). It can make harmful treatments appear beneficial
and beneficial treatments appear harmful. The point here is that finding competing risks does
not require risk correlations at the level of individuals; they could be generated at the level of the
statistics of cause-specific hazard rates over the cohort. But cause-specific hazard rates can be
estimated from survival data, so this problem is in principle solvable.
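A minimal simulation of the third scenario, assuming for illustration constant cause-specific hazard rates with 'up' = 0.05 and 'down' = 0.01 events per unit time, confirms this: each individual's two event times are drawn independently, yet their population-level correlation is clearly positive.

import numpy as np

rng = np.random.default_rng(1)

# third scenario: four subgroups with assumed constant hazard rates (pi_1, pi_2)
groups = [(0.05, 0.05, 480),   # A: both risks elevated
          (0.01, 0.05, 20),    # B: risk 1 lowered, risk 2 elevated
          (0.05, 0.01, 20),    # C: risk 1 elevated, risk 2 lowered
          (0.01, 0.01, 480)]   # D: both risks lowered

t1, t2 = [], []
for pi1, pi2, n in groups:
    # within each individual: P_i(t1,t2) = P_i(t1) P_i(t2), independent exponentials
    t1.append(rng.exponential(1.0 / pi1, n))
    t2.append(rng.exponential(1.0 / pi2, n))
t1, t2 = np.concatenate(t1), np.concatenate(t2)

# at population level the two event times are nevertheless correlated
print("corr(t1, t2) =", round(np.corrcoef(t1, t2)[0, 1], 3))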
Structure of regression models for heterogeneous cohorts. Any regression model that is to capture
the relation between covariates and risk in a heterogeneous cohort, where risks can compete, should
(i) be able to describe the distribution of the individual cause-specific hazard rates, and how this
distribution depends on covariates, and (ii) do so for all risks simultaneously, not just for the primary
risk. We are then led automatically to the conditioning picture of survival analysis with covariates
in subsection 7.2, which was formulated in terms of W (π|Z), the probability that a randomly drawn
individual with covariates Z will have individual cause-specific hazard rates π.
All prediction formulae are already given in subsection 7.2. However, just as in the sub-cohort
formulation we were forced (by computational limitations and the danger of overfitting) to use simple
parametrisations of π(Z), also here we need simple parametrisations of W(π|Z) for the same
reasons. Let us write our chosen parametrised form as W_θ = W(π|Z,θ), in which θ ∈ IR^q are the
parameters. We choose one-to-one parametrisations, so that W_θ = W_θ′ if and only if θ = θ′. The
parametrisation is built into the prior P(W), which is zero if W is not of the required form:

P(W) = ∫dθ P(θ) δ[W − W_θ]     (369)
in which P(θ) is a parameter prior. Insertion into (197,198) tells us that

log P(W|D) = Σ_{i=1}^N log ∫dπ W(π|Z^i) P(X_i, Δ_i|π) + log ∫dθ P(θ) δ[W − W_θ] + constant     (370)
So indeed P(W|D) > 0 only for those W that are of the parametrised form. All statistical predictions to be made with P(W|D) can now always be written as

Q = ∫dW P(W|D) ∫dπ W(π|Z) Q(π,Z)
  = [∫dW P(W) P(D|W) ∫dπ W(π|Z) Q(π,Z)] / [∫dW P(W) P(D|W)]
  = ∫dθ P(θ|D) ∫dπ W(π|Z,θ) Q(π,Z)     (371)
in which
P(θ|D) = P(θ) P(D|θ) / ∫dθ′ P(θ′) P(D|θ′)     (372)

P(D|θ) = ∏_{i=1}^N ∫dπ W(π|Z^i,θ) P(X_i, Δ_i|π)     (373)
The structure of the formalism has remained unchanged by the imposition of our parametrisation,
but our predictions are now written in terms of averages over a posterior distribution of θ (instead
of averages over all W ), with the usual contributions from a prior and from the evidence in the data.
The log-likelihood of the posterior parameter distribution, L(θ|D) = log P(θ|D), takes the form

L(θ|D) = Σ_{i=1}^N log ∫dπ W(π|Z^i,θ) P(X_i, Δ_i|π) + log P(θ) + constant     (374)
These formulae are still completely general. Any theory will have this form; we could even choose
as our parametrisation the set of all possible functions W(π|Z) and work our way from (374) back
to (197,198). Theories can differ only in the chosen parametrisation W(π|Z,θ) and prior P(θ).
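As a minimal numerical illustration of (372)-(374), the sketch below evaluates the posterior of θ on a grid for the simplest conceivable parametrisation (all of the following are illustrative assumptions, not a realistic model): one risk, no censoring, a single scalar covariate, and the homogeneous choice W(π|Z,θ) = δ[π − e^{θZ}] with time-independent individual hazard rates.

import numpy as np

rng = np.random.default_rng(2)

# synthetic data: constant individual hazard rate pi = exp(theta*Z), no censoring
theta_true = 0.8
Z = rng.normal(size=200)
X = rng.exponential(np.exp(-theta_true * Z))       # event times (scale = 1/rate)

def log_L(theta):
    # eq. (374) with a flat prior: sum_i [log pi_i - pi_i X_i]
    pi = np.exp(theta * Z)
    return np.sum(np.log(pi) - pi * X)

theta_grid = np.linspace(0.0, 2.0, 401)
logp = np.array([log_L(t) for t in theta_grid])
post = np.exp(logp - logp.max())                   # unnormalised posterior (372)
post /= np.trapz(post, theta_grid)
print("posterior mean of theta:", np.trapz(theta_grid * post, theta_grid))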
10.2. Rational parametrisations for heterogeneous population models
Heterogeneity caused by missing covariates in Cox-type models – frailty models. Let us try
to construct a parametrisation W (π|Z,θ) from the following reasonable (but not necessarily
true) assumption: any covariate-to-risk heterogeneity in our cohort is due to the fact that we
haven’t measured enough information-carrying covariates. It follows that for each heterogeneous
cohort with covariates Z = (Z_1, . . . , Z_p) there exists an expanded set of covariates (Z, Z̃) =
(Z_1, . . . , Z_p, Z̃_1, . . . , Z̃_n) such that upon measuring the expanded set the cohort would have been
homogeneous. Hence

W(π|Z, Z̃) = δ[π − π(Z, Z̃)]     (375)

for some cause-specific hazard rates π(Z, Z̃), and

W(π|Z) = ∫dZ̃ P(Z̃) W(π|Z, Z̃) = ∫dZ̃ P(Z̃) δ[π − π(Z, Z̃)]     (376)
In the spirit of Cox regression, we now focus only on the primary risk hazard rate, and choose the
full primary hazard rate to be of the Cox form:
π_1(t|Z, Z̃) = λ_0(t) e^{β·Z + β̃·Z̃} = π_1(t|Z) e^{β̃·Z̃}     (377)

with the conventional Cox formula π_1(t|Z) = λ_0(t) e^{β·Z}. This implies that we postulate that the
primary risk hazard rate of each individual i is of the form

π_{i1}(t) = λ_0(t) e^{u_i + β·Z^i}     (378)
The (to us) unknown variable u_i raises or lowers the overall hazard rate, and is called a 'frailty factor'. Models of this type are called 'frailty models'. We can now write the distribution W(π|Z), which we seek to parametrise, in terms of the original covariates as

W(π|Z) = ∫du P(u) δ[π − π(Z, u)]     (379)

with

P(u) = ∫dZ̃ P(Z̃) δ[u − β̃·Z̃]     (380)

π_1(t|Z, u) = λ_0(t) e^{u + β·Z},   π_{r≠1}(t|Z, u) = π_{r≠1}(t|Z) unspecified     (381)
The average ⟨u⟩ = ∫du P(u) u = β̃·⟨Z̃⟩ can always be absorbed into λ_0(t), via λ_0(t) → λ_0(t) e^{⟨u⟩}.
We may therefore always take P(u) to have zero average. We now simply parametrise P(u), and
each such parametrisation then gives a class of frailty models that are simple generalisations of
the standard Cox model. For instance, upon choosing P(u) = (σ√(2π))^{−1} e^{−u²/2σ²} (corresponding to
Gaussian distributed missing covariates Z̃) or P(u) = Σ_{ℓ=1}^L w_ℓ δ(u−u_ℓ) (corresponding to discrete
missing covariates Z̃), we would get

Gaussian:   W(π|Z, β, λ_0, σ) = ∫ (dz/√(2π)) e^{−z²/2} δ[π − π(Z, σz)],   θ = (β, λ_0, σ)     (382)

discrete:   W(π|Z, β, λ_0, {w_ℓ, u_ℓ}) = Σ_{ℓ=1}^L w_ℓ δ[π − π(Z, u_ℓ)],   θ = (β, λ_0, {w_ℓ, u_ℓ})     (383)
with the Cox-type hazard rates (381).
Since one does not specify π_{r≠1}(t|Z, u), we focus in frailty models on extracting the parameters
relating to the primary risk. This is possible if we choose a simple prior in (374) that factorises over
risks, viz. P(θ) = P(β, λ_0, σ) P(θ′), where θ′ refers to any parameters relating to the non-primary
risks r ≠ 1. For such factorising priors we find (374) reducing for the above frailty models to
L(θ|D) = Σ_{i=1}^N log ∫du P(u) P(X_i, Δ_i|π(Z^i, u)) + log P(θ) + constant

 = Σ_{i=1}^N Σ_r δ_{Δ_i,r} log ∫du P(u) π_r(X_i|Z^i, u) e^{−Σ_{r′} ∫_0^{X_i} dt π_{r′}(t|Z^i,u)} + log P(θ) + constant

 = Σ_{i=1}^N δ_{Δ_i,1} log ∫du P(u) π_1(X_i|Z^i, u) e^{−∫_0^{X_i} dt π_1(t|Z^i,u) − Σ_{r′≠1} ∫_0^{X_i} dt π_{r′}(t|Z^i)}
   + log P(θ) + terms independent of θ

 = Σ_{i=1}^N δ_{Δ_i,1} log ∫du P(u) λ_0(X_i) exp[u + β·Z^i − e^{u+β·Z^i} ∫_0^{X_i} dt λ_0(t)]
   + log P(θ) + terms independent of θ

 = Σ_{i=1}^N δ_{Δ_i,1} log λ_0(X_i) + Σ_{i=1}^N δ_{Δ_i,1} log ∫du P(u) e^u exp[−e^{u+β·Z^i} ∫_0^{X_i} dt λ_0(t)]
   + Σ_{i=1}^N δ_{Δ_i,1} β·Z^i + log P(θ) + terms independent of θ     (384)
For P(u) → δ(u) the above log-likelihood reverts back to that of the Cox model. For nontrivial
P(u), however, the maximisation of L(θ|D) is clearly more involved. Variation of λ_0(t) gives the
corresponding stationarity condition for the base hazard rate.
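For the Gaussian choice (382) the u-integral in (384) has no closed form, but it is a one-dimensional average over a zero-average Gaussian and hence well suited to Gauss-Hermite quadrature. The sketch below evaluates the resulting log-likelihood for given (β, λ_0, σ), under the simplifying assumptions of synthetic uncensored data and a constant base hazard rate λ_0, so that ∫_0^{X_i} dt λ_0(t) = λ_0 X_i:

import numpy as np

rng = np.random.default_rng(3)

# synthetic, uncensored data (Delta_i = 1 for all i), p = 3 covariates
N, p = 100, 3
Z = rng.normal(size=(N, p))
X = rng.exponential(1.0, size=N)

# Gauss-Hermite nodes/weights for averages over u ~ N(0,1), rescaled below
nodes, weights = np.polynomial.hermite_e.hermegauss(40)
weights = weights / np.sqrt(2.0 * np.pi)

def log_L(beta, lam0, sigma):
    # last line of (384), constant base hazard: int_0^X dt lambda_0(t) = lam0*X
    bZ = Z @ beta
    Lam = lam0 * X
    u = sigma * nodes
    inner = np.exp(u)[None, :] * np.exp(-np.exp(u[None, :] + bZ[:, None]) * Lam[:, None])
    frailty_int = inner @ weights            # <e^u exp(-e^{u+beta.Z} Lam)>_u
    return np.sum(np.log(lam0) + bZ + np.log(frailty_int))

print("L(beta=0, lam0=1, sigma=0.5) =", round(log_L(np.zeros(p), 1.0, 0.5), 2))

Maximising this function over (β, λ_0, σ) then replaces the usual Cox maximisation; for σ → 0 one recovers the Cox log-likelihood, as noted above.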
Example: new statistical methodology applied to data on 2047 prostate cancer patients. Standard Cox regression gives the following regression coefficients for prostate cancer (PC) mortality:

             BMI    selenium   leis act   work act   smoking
  PC         0.1    -0.1       0.2        -0.1       -0.1

The new explanation distinguishes two classes. Class 1 (52% of the cohort, low overall frailty):

             BMI    selenium   leis act   work act   smoking
  PC         1.5    -4.8       2.2        -1.4       4.1
  other      0.8    -0.2       -0.2       0.0        1.5

Class 2 (48% of the cohort, high overall frailty):

             BMI    selenium   leis act   work act   smoking
  PC         0.0    0.0        0.0        0.0        0.0
  other      0.0    -0.1       -0.1       0.0        0.2
10.3. Types of heterogeneity
10.4. Impact of heterogeneity on hazard ratios
10.5. False protectivity
10.6. Frailty models
10.7. Fine and Gray regression
10.8. Bayesian regression
10.9. Notes and examples
11. Further topics
11.1. Binary classification
Bayesian approach to gene signatures

• prognostic signatures:
  e.g. the 'typical' profile of patients with poor outcome,
  the 'typical' profile of patients with good outcome,
  the 'typical' profile of patients that respond to treatment;
  profile similar to signature → treat differently
• various heuristic definitions ...
• different normalisations of the signals ...
• what does 'similar' mean ... distance? covariance?
  [figure omitted: two scatter plots of patient profiles in the (x_1, x_2) plane]
  use 'good outcome' or 'poor outcome' signatures?
  how to map similarity to classification reliability?
Bayesian prediction of responders

medical trial data: D
classes: σ = 1, 0 (response yes/no)
fraction of responders: φ

The probability that a woman with profile x will respond:

p(1|x) = ∫dθ p(θ|D) [ φ p(x|1,θ) / ( φ p(x|1,θ) + (1−φ) p(x|0,θ) ) ]

with p(x|σ,θ) a parametrised distribution, and p(θ|D) given by an explicit formula.
• no ambiguities
• formulated in terms of response probabilities
• finds the cohort's profile characteristics
simplest case: Gaussian p(x|σ,θ), with class variances ∆_0, ∆_1.
[figure omitted: three panels showing the resulting decision boundaries for ∆_0 < ∆_1, ∆_0 = ∆_1 and ∆_0 > ∆_1]
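For this simplest case the formula for p(1|x) can be evaluated directly. The sketch below uses one-dimensional Gaussian class-conditional distributions with plug-in parameter values (an assumption made to keep the example short; the full Bayesian version averages the same expression over p(θ|D)):

import numpy as np
from scipy.stats import norm

phi = 0.3                          # assumed fraction of responders
mu0, s0 = 0.0, 1.0                 # non-responders: mean, standard deviation
mu1, s1 = 1.5, 0.7                 # responders: mean, standard deviation

def p_respond(x):
    # p(1|x) = phi p(x|1) / [phi p(x|1) + (1 - phi) p(x|0)]
    f1 = phi * norm.pdf(x, mu1, s1)
    f0 = (1.0 - phi) * norm.pdf(x, mu0, s0)
    return f1 / (f1 + f0)

for x in [-1.0, 0.0, 1.0, 2.0]:
    print(f"x = {x:4.1f}:  p(response|x) = {p_respond(x):.3f}")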
heterogeneous disease? e.g. one peak in the non-responders' gene statistics,
two peaks in the responders' gene statistics ...
[figure omitted: three panels contrasting heuristic prognostic signatures with the Bayesian p(1|x) for such a heterogeneous case]
Bayesian decision boundaries adapt to the shape of the gene profile statistics of the classes.
11.2. Multiple testing corrections
11.3. Log-rank test
Appendix A. The δ-distribution
Definition. We define the δ-distribution as the probability distribution δ(x) corresponding to a zero-average random variable x in the limit where the randomness in the variable vanishes. So

∫dx f(x) δ(x) = f(0)   for any function f

By the same token, the expression δ(x−a) will then represent the distribution for a random variable x with average a, in the limit where the randomness vanishes, since

∫dx f(x) δ(x−a) = ∫dx f(x+a) δ(x) = f(a)   for any function f
Formulas for the δ-distribution. A problem arises when we want to write down a formula for δ(x).
Intuitively one could propose to take a zero-average normal distribution and send its width to zero,
δ(x) = lim_{σ→0} p_σ(x),   p_σ(x) = (1/(σ√(2π))) e^{−x²/2σ²}     (A.1)
This is not a true function in a mathematical sense: δ(x) is zero for x ≠ 0 and δ(0) = ∞. However,
we realize that δ(x) only serves to calculate averages; it only has a meaning inside an integration. If
we adopt the convention that one should set σ → 0 in (A.1) only after performing the integration,
we can use (A.1) to derive the following properties (for sufficiently well-behaved functions f):

∫dx δ(x) f(x) = lim_{σ→0} ∫dx p_σ(x) f(x) = lim_{σ→0} ∫ (dx/√(2π)) e^{−x²/2} f(σx) = f(0)

∫dx δ′(x) f(x) = lim_{σ→0} ∫dx { (d/dx)[p_σ(x) f(x)] − p_σ(x) f′(x) } = lim_{σ→0} [p_σ(x) f(x)]_{−∞}^{∞} − f′(0) = −f′(0)
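The convention 'integrate first, then send σ → 0' is easily checked numerically; the sketch below does so for the arbitrary test function f(x) = cos(x) + x², for which the integral should converge to f(0) = 1:

import numpy as np
from scipy.integrate import quad

f = lambda x: np.cos(x) + x**2           # arbitrary smooth test function

for sigma in [1.0, 0.1, 0.01]:
    p_sigma = lambda x, s=sigma: np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
    val, _ = quad(lambda x: p_sigma(x) * f(x), -10 * sigma, 10 * sigma)
    print(f"sigma = {sigma:5.2f}:  integral = {val:.6f}   (f(0) = {f(0.0):.6f})")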
The following relation links the δ-distribution to the step function:

δ(x) = (d/dx) θ(x),   θ(x) = 1 if x > 0, θ(x) = 0 if x < 0     (A.2)
One proves this by showing that both sides of the equation have the same effect inside an integration:

∫dx [δ(x) − (d/dx)θ(x)] f(x) = f(0) − lim_{ε→0} ∫_{−ε}^{ε} dx { (d/dx)[θ(x)f(x)] − f′(x)θ(x) }
 = f(0) − lim_{ε→0} [f(ε) − 0] + lim_{ε→0} ∫_0^ε dx f′(x) = 0
Finally one can use the definitions of Fourier transforms and inverse Fourier transforms to obtain
the following integral representation of the δ-distribution:
δ(x) = ∫_{−∞}^{∞} (dk/2π) e^{ikx}     (A.3)
Appendix B. Steepest descent integration
Steepest descent (or ‘saddle-point’) integration is a method for dealing with integrals of the following
type, with x ∈ IR^p, continuous functions f(x) and g(x) of which f is bounded from below, and
with N ∈ IR positive and large:

I_N[f, g] = ∫_{IR^p} dx g(x) e^{−N f(x)}     (B.1)
We first take f(x) to be real-valued; this is the simplest case, for which finding the asymptotic
behaviour of (B.1) as N → ∞ goes back to Laplace. We assume that f(x) can be expanded in a
Taylor series around its minimum f(x*), which we assume to be unique, i.e.

f(x) = f(x*) + (1/2) Σ_{ij=1}^p A_{ij}(x_i − x*_i)(x_j − x*_j) + O(|x−x*|³),   A_{ij} = ∂²f/∂x_i∂x_j |_{x*}     (B.2)
If the integral (B.1) exists, inserting (B.2) into (B.1) followed by the transformation x = x* + y/√N gives

I_N[f, g] = e^{−N f(x*)} ∫_{IR^p} dx g(x) e^{−(N/2) Σ_{ij} (x_i−x*_i) A_{ij} (x_j−x*_j) + O(N|x−x*|³)}
 = N^{−p/2} e^{−N f(x*)} ∫_{IR^p} dy g(x* + y/√N) e^{−(1/2) Σ_{ij} y_i A_{ij} y_j + O(N^{−1/2}|y|³)}     (B.3)
From this latter expansion, and given the assumptions made, we can obtain two important identities:

− lim_{N→∞} (1/N) log ∫_{IR^p} dx e^{−N f(x)} = − lim_{N→∞} (1/N) log I_N[f, 1]
 = f(x*) + lim_{N→∞} [ (p log N)/(2N) − (1/N) log ∫_{IR^p} dy e^{−(1/2) Σ_{ij} y_i A_{ij} y_j + O(N^{−1/2}|y|³)} ]
 = f(x*) = min_{x∈IR^p} f(x)     (B.4)
and

lim_{N→∞} ∫dx g(x) e^{−N f(x)} / ∫dx e^{−N f(x)} = lim_{N→∞} I_N[f, g] / I_N[f, 1]
 = lim_{N→∞} [ ∫_{IR^p} dy g(x* + y/√N) e^{−(1/2) Σ_{ij} y_i A_{ij} y_j + O(N^{−1/2}|y|³)} ] / [ ∫_{IR^p} dy e^{−(1/2) Σ_{ij} y_i A_{ij} y_j + O(N^{−1/2}|y|³)} ]
 = [ g(x*) (2π)^{p/2}/√(Det A) ] / [ (2π)^{p/2}/√(Det A) ] = g(x*) = g(arg min_{x∈IR^p} f(x))     (B.5)
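A one-dimensional numerical check of (B.5), with the arbitrary choices f(x) = (x−1)²/2 and g(x) = cos(x), so that x* = 1 and the ratio of integrals should tend to g(x*) = cos(1) ≈ 0.5403 as N grows:

import numpy as np
from scipy.integrate import quad

f = lambda x: 0.5 * (x - 1.0)**2          # unique minimum at x* = 1
g = lambda x: np.cos(x)

for N in [10, 100, 1000]:
    num, _ = quad(lambda x: g(x) * np.exp(-N * f(x)), -10, 10, points=[1.0])
    den, _ = quad(lambda x: np.exp(-N * f(x)), -10, 10, points=[1.0])
    print(f"N = {N:5d}:  ratio = {num / den:.6f}")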
If f(x) is complex, the correct procedure to be followed is to deform the integration paths in the complex plane (using Cauchy's theorem) such that along the deformed path the imaginary part of the function f(x) is constant, and preferably (if possible) zero. One then proceeds using Laplace's argument and finds the leading order in N of our integral in the usual manner by extremization of the real part of f(x). In combination, our integrals will thus again be dominated by an extremum of the (complex) function f(x), but since f is complex this extremum need not be a minimum:

− lim_{N→∞} (1/N) log ∫_{IR^p} dx e^{−N f(x)} = extr_{x∈IR^p} f(x)     (B.6)

lim_{N→∞} ∫dx g(x) e^{−N f(x)} / ∫dx e^{−N f(x)} = g(arg extr_{x∈IR^p} f(x))     (B.7)
Appendix C. Maximum likelihood and Bayesian parameter estimation
To illustrate the procedures of maximum likelihood and Bayesian estimation of parameters from
data we consider the following problem. We are given a dice and want to know the true (but as yet
unknown) probabilities (π1, . . . , π6) of each possible throw. A fair dice would have πr = 1/6 for all
r. Note that Σ_{r=1}^6 π_r = 1. Our data from which to extract the information consists of the results
of N independent throws of the dice:

D = {X_1, X_2, . . . , X_N},   X_i ∈ {1, 2, . . . , 6} for each i     (C.1)
• Ad hoc estimators:
Our problem is sufficiently transparent for us to simply guess suitable estimators. It would be natural to choose for π_k the empirical frequency with which the throw X_i = k is observed:

(∀k = 1 . . . 6):   π̂_k = (1/N) Σ_i δ_{X_i,k}     (C.2)

This choice satisfies the constraint Σ_{k=1}^6 π̂_k = 1, and for N → ∞ the law of large numbers indeed gives lim_{N→∞} π̂_k = Σ_{r=1}^6 π_r δ_{rk} = π_k. So our π̂_k are proper estimators. The results of simulating this estimation process numerically for a loaded dice are shown in figure C1 (see also the numerical sketch at the end of this appendix). One clearly needs data sets of size N ∼ 2000 or more for (C.2) to approach the true values.
• Maximum likelihood estimators:
The maximum likelihood estimators are determined by maximizing over (π_1, . . . , π_6) the likelihood of the data D, given the values of (π_1, . . . , π_6). Here we have

log P(D|π_1, . . . , π_6) = log ∏_{i=1}^N P(X_i|π_1, . . . , π_6) = Σ_{i=1}^N log π_{X_i}     (C.3)

Let us maximize this quantity over (π_1, . . . , π_6), subject to the constraint Σ_{r=1}^6 π_r = 1, using the Lagrange formalism:

∂/∂π_k Σ_{i=1}^N log π_{X_i} = λ ∂/∂π_k Σ_{r=1}^6 π_r :   Σ_{i=1}^N (1/π_k) δ_{X_i,k} = λ,   hence π̂_k = (1/λ) Σ_{i=1}^N δ_{X_i,k}     (C.4)

Summation over k on both sides gives Σ_k π̂_k = N/λ, so our normalisation constraint tells us that λ = N. Hence the maximum likelihood estimator is identical to our estimator (C.2).
• Bayesian estimation:
Finally, when following the Bayesian route we calculate P(π_1, . . . , π_6|D), defined as

P(π_1, . . . , π_6|D) = P(D|π_1, . . . , π_6) P(π_1, . . . , π_6) / P(D)
 = P(D|π_1, . . . , π_6) P(π_1, . . . , π_6) / ∫_Ω dπ′_1 . . . dπ′_6 P(D|π′_1, . . . , π′_6) P(π′_1, . . . , π′_6)     (C.5)
[figure omitted]
Figure C1. The six empirical frequencies (or estimators) π̂_k = N^{−1} Σ_i δ_{X_i,k}, for each possible dice throw k = 1 . . . 6, versus the size N of the data set (N up to 10000; vertical axis 0 ≤ π̂_k ≤ 0.3). In this example of a loaded dice the actual probabilities are (π_1, . . . , π_6) = (0.16, 0.16, 0.16, 0.16, 0.16, 0.20).
Here Ω is the set of all parameters (π_1, . . . , π_6) that satisfy the relevant constraints, i.e. Ω = {(π_1, . . . , π_6) ∈ IR^6 | π_r ≥ 0 ∀r, Σ_{r≤6} π_r = 1}. Alternatively (and equivalently) we can integrate over IR^6 and implement the constraints via the prior, i.e. by defining P(π_1, . . . , π_6) = 0 if (π_1, . . . , π_6) ∉ Ω.
Next we need to determine the values of the prior P(π_1, . . . , π_6) for (π_1, . . . , π_6) ∈ Ω. Information theory tells us that if the only prior information available is our knowledge of the constraints, we should choose the prior that maximizes the Shannon entropy subject to these constraints. This is again done via the Lagrange method, where we now vary the entries of the prior P(π_1, . . . , π_6) (it turns out that non-negativity will be satisfied automatically, so we only impose the normalisation constraint):

δ/δP(π_1, . . . , π_6) ∫_Ω dπ′_1 . . . dπ′_6 P(π′_1, . . . , π′_6) log P(π′_1, . . . , π′_6) = Λ δ/δP(π_1, . . . , π_6) ∫_Ω dπ′_1 . . . dπ′_6 P(π′_1, . . . , π′_6)

∀(π_1, . . . , π_6) ∈ Ω:   1 + log P(π_1, . . . , π_6) = Λ     (C.6)
We see that the maximum entropy prior is flat over Ω, so P(π_1, . . . , π_6) = 1/|Ω|. Hence, upon insertion into (C.5) we get for (π_1, . . . , π_6) ∈ Ω an expression that can again be written in terms of the estimator (C.2):

P(π_1, . . . , π_6|D) = P(D|π_1, . . . , π_6) / ∫_Ω dπ′_1 . . . dπ′_6 P(D|π′_1, . . . , π′_6)
 = ∏_i π_{X_i} / ∫_Ω dπ′_1 . . . dπ′_6 ∏_i π′_{X_i}
 = e^{Σ_{k=1}^6 log π_k Σ_i δ_{k,X_i}} / ∫_Ω dπ′_1 . . . dπ′_6 e^{Σ_{k=1}^6 log π′_k Σ_i δ_{k,X_i}}
 = e^{N Σ_{k=1}^6 π̂_k log π_k} / ∫_Ω dπ′_1 . . . dπ′_6 e^{N Σ_{k=1}^6 π̂_k log π′_k}     (C.7)
Let us work out the denominator, using the standard integral representation δ(z) = (2π)^{−1} ∫_{−∞}^{∞} dx e^{ixz} for the delta-function:

(1/N) log Den = (1/N) log ∫_Ω dπ′_1 . . . dπ′_6 e^{N Σ_{k=1}^6 π̂_k log π′_k}
 = (1/N) log ∫_0^∞ dπ_1 . . . dπ_6 δ[1 − Σ_{k=1}^6 π_k] e^{N Σ_{k=1}^6 π̂_k log π_k}
 = (1/N) log ∫_{−∞}^{∞} (dx/2π) e^{ix} ∏_{k=1}^6 ∫_0^∞ dy e^{N π̂_k log y − ixy}
 = (1/N) log ∫_{−∞}^{∞} (dx/(2π/N)) e^{iNx} ∏_{k=1}^6 ∫_0^∞ dy e^{N[π̂_k log y − ixy]}     (C.8)
Focusing on the y integral, we note that for large N the dominant contribution comes from the saddle-point, i.e. after shifting the contour in the complex plane, from the solution of (d/dy)[π̂_k log y − ixy] = 0, giving y = −iπ̂_k/x. So steepest descent integration gives us (see Appendix B for an introduction to steepest descent integration), using log(−i) = i Arg(−i) = −iπ/2:

∫_0^∞ dy e^{N[π̂_k log y − ixy]} = e^{N[π̂_k log(−iπ̂_k/x) − π̂_k] + O(N^0)} = e^{N[π̂_k log π̂_k − (iπ/2)π̂_k − π̂_k log x − π̂_k] + O(N^0)}     (C.9)
Hence

(1/N) log Den = (1/N) log ∫_{−∞}^{∞} dx e^{N[ix + Σ_{k=1}^6 (π̂_k log π̂_k − (iπ/2)π̂_k − π̂_k log x − π̂_k)] + O(N^0)} + (log N)/N
 = (1/N) log ∫_{−∞}^{∞} dx e^{N[ix + Σ_{k=1}^6 π̂_k log π̂_k − iπ/2 − log x − 1] + O(N^0)} + (log N)/N
 = Σ_{k=1}^6 π̂_k log π̂_k − iπ/2 − 1 + (1/N) log ∫_{−∞}^{∞} dx e^{N(ix − log x)} + (log N)/N + O(1/N)     (C.10)
Steepest descent integration over x gives N^{−1} log ∫dx e^{N(ix−log x)} = 1 + (1/2)iπ + O(N^{−1}). Thus we get

(1/N) log Den = Σ_{k=1}^6 π̂_k log π̂_k + (log N)/N + O(N^{−1})

Den = N e^{N Σ_{k=1}^6 π̂_k log π̂_k + O(N^0)}     (C.11)
The end result is the following appealing large-N form of our formula (C.7):

P(π_1, . . . , π_6|D) = (1/N) e^{N Σ_{k=1}^6 π̂_k log π_k − N Σ_{k=1}^6 π̂_k log π̂_k + O(N^0)}
 = (1/N) e^{−N Σ_{k=1}^6 π̂_k log(π̂_k/π_k) + O(N^0)}     (C.12)
The leading order in the exponent, apart from the factor N, is the Kullback-Leibler distance between the estimated and true probability distributions {π̂_k} and {π_k}. The most probable values of the probabilities are therefore again seen to be the estimators (C.2), but now we know more: we have also quantified our uncertainty for large but finite N.
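The content of this appendix condenses into the short simulation below, which draws throws from the loaded dice of figure C1, forms the estimator (C.2), and evaluates the Kullback-Leibler term in the exponent of (C.12); the fact that N·KL remains of order one as N grows reflects the concentration of the posterior around the estimators:

import numpy as np

rng = np.random.default_rng(4)
pi_true = np.array([0.16, 0.16, 0.16, 0.16, 0.16, 0.20])   # loaded dice of fig. C1

for N in [100, 1000, 10000]:
    throws = rng.choice(6, size=N, p=pi_true)              # throws, coded 0..5
    pi_hat = np.bincount(throws, minlength=6) / N          # estimator (C.2)
    kl = np.sum(pi_hat * np.log(pi_hat / pi_true))         # exponent of (C.12)
    print(f"N = {N:6d}:  pi_hat = {np.round(pi_hat, 3)},  N*KL = {N * kl:.2f}")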
Maximum prediction accuracy with Cox regression
Assume we know our regression parameters exactly, the hazard rate is indeed of the Cox form, and
there is no censoring (the ideal scenario). We take all covariates to be independent, zero-average,
unit-variance Gaussian variables. The survival probability for risk 1 is then
S(t|Z) = exp(− Σ_{i=1}^N [e^{β̂·Z} θ(t−X_i)] / [Σ_{j=1}^N e^{β̂·Z^j} θ(X_j−X_i)])     (C.13)
and we would classify a patient with covariates Z at time t according to the most probable outcome:

σ(Z) = θ[S(t|Z) − 1/2] = θ[ log(2) − (1/N) Σ_{i=1}^N e^{β̂·Z} θ(t−X_i) / ((1/N) Σ_{j=1}^N e^{β̂·Z^j} θ(X_j−X_i)) ]     (C.14)
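A direct transcription of (C.13,C.14) into code, for synthetic data drawn from a Cox model with λ_0(t) = 1 and a known parameter vector (all illustrative assumptions, matching the ideal scenario above):

import numpy as np

rng = np.random.default_rng(5)

# synthetic uncensored data from a Cox model with lambda_0(t) = 1, known beta
N, p = 500, 4
beta = np.array([0.5, -0.5, 0.3, 0.0])             # assumed known exactly
Z = rng.normal(size=(N, p))
X = rng.exponential(np.exp(-Z @ beta))             # event times (scale = 1/rate)

risk = np.exp(Z @ beta)                            # e^{beta.Z^j} for all j
at_risk = (X[None, :] >= X[:, None]) @ risk        # sum_j e^{beta.Z^j} theta(X_j - X_i)

def S(t, z):
    # survival probability (C.13) for a new covariate vector z
    return np.exp(-np.exp(beta @ z) * np.sum((X <= t) / at_risk))

t, z_new = np.median(X), np.zeros(p)
print("S(t|z) =", round(S(t, z_new), 3),
      "-> classify sigma(Z) =", int(S(t, z_new) > 0.5))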
For N → ∞ there will be no difference between training and validation sets in terms of prediction accuracy, and the fraction predicted correctly will simply be

Q_t = ⟨ (1/2) + (1/2) sgn(X−t) sgn[ log(2) − (1/N) Σ_{i=1}^N e^{β̂·Z} θ(t−X_i) / ((1/N) Σ_{j=1}^N e^{β̂·Z^j} θ(X_j−X_i)) ] ⟩_{Z,X}
 = 1/2 + (1/2) ⟨ sgn(X−t) sgn[ log(2) − e^{β̂·Z} ⟨ θ(t−X′) / ⟨e^{β̂·Z″} θ(X″−X′)⟩_{Z″,X″} ⟩_{X′} ] ⟩_{Z,X}     (C.15)
We first do the average in the denominator:

⟨e^{β̂·Z″} θ(X″−X′)⟩_{Z″,X″} = ∫DZ e^{β̂·Z} ∫_{X′}^∞ ds P(s|Z)
 = ∫DZ e^{β̂·Z} ∫_{X′}^∞ ds π_1(s|Z) e^{−∫_0^s ds′ π_1(s′|Z)}
 = ∫DZ e^{β̂·Z} [−e^{−∫_0^s ds′ π_1(s′|Z)}]_{X′}^∞
 = ∫DZ e^{β̂·Z} ( e^{−∫_0^{X′} ds π_1(s|Z)} − e^{−∫_0^∞ ds π_1(s|Z)} )
 = ∫DZ e^{β̂·Z} ( S(X′|Z) − S(∞|Z) )     (C.16)