
SB3b Statistical Lifetime-Models

David Steinsaltz¹
University of Oxford
Based on early editions by Matthias Winkel and Mary Lunn

[Cover figure: transition probabilities P'(1,t), P'(2,t), P'(3,t) plotted against time and age.]

HT 2018

¹ University lecturer at the Department of Statistics, University of Oxford

SB3b Statistical Lifetime-Models

David Steinsaltz – 16 lectures, HT 2018
[email protected]

Prerequisites

Part A Probability, Part A Statistics and Part B Applied Probability are prerequisites.

Website: http://www.steinsaltz.me.uk/SB3b/SB3b.html

Aims

Statistical Lifetime-Models follows on from Applied Probability. Models introduced there are examined in the first part of the course more specifically in a life insurance context, where transitions typically model the passage from ‘alive’ to ‘dead’, possibly with intermediate stages like ‘loss of a limb’ or ‘critically ill’. The aim is to develop statistical methods to estimate transition rates and more specifically to construct life tables that form the basis in the calculation of life insurance premiums.

We will then move on to survival analysis, which is widely used in medical research, in addition to insurance, in which we consider the effect of covariates and of partially observed data. We also explain demographic concepts, and how life tables are adapted to the context of changing mortality rates.

Synopsis

Life tables: Basic notation, life expectancy and remaining life expectancy, curtate lifetimes. Survival models: general lifetime distributions, force of mortality (hazard rate), survival function, the single decrement model and mortality in mixed populations.

Estimation procedures for lifetime distributions: empirical lifetime distributions, censoring and truncation, Kaplan-Meier estimate, Nelson-Aalen estimate. Parametric models, accelerated life models including Gompertz, Weibull, log-normal, log-logistic. Plot-based methods for model selection. Cox regression. Proportional hazards, partial likelihood, semiparametric estimation of survival functions, use and overuse of proportional hazards in insurance calculations and epidemiology.

Two-state and multiple-state Markov models, with simplifying assumptions. Estimation of Markovian transition rates: Maximum likelihood estimators, time-varying transition rates, census approximation. Life expectancy and occupation times in Markov models: eigenvector formalism. Applications to reliability, medical statistics, ecology.

Model selection and diagnostics for survival models, including log-rank tests, serial-correlation tests, and Cox–Snell residuals.


Exercises and Computing

The scope of exercises goes significantly beyond that of exam questions in many cases, but understanding the exercises is essential to coping with the variety of exam questions that might come up. There is a great range of difficulty in the exercises, and most students should find at least some of the exercises very challenging. Try all of them, but don’t spend hours and hours on questions if you are not making any progress.

Try to start solving exercises when you get the problem sheet, not the day before you have to hand in your solutions. This allows you to have second attempts at exercises that you can’t solve straight away.

Lecture notes are meant to be useful when solving exercises. You may use any result from the lectures, except where the contrary is explicitly stated.

Some problem sheets will include optional computing exercises. Anyone who is interested in learning to use life-table and survival techniques on real problems should do these.

Reading

There are lots of good books on survival analysis. Look for one that suits you. Some pointers will be given in the lecture notes to readings that are connected, but look in the index to find topics that confuse you and/or interest you.

The actuarial material in the course is modeled on the CT4 Core Reading from the Institute of Actuaries.

CT4 Models Core Reading. Faculty & Institute of Actuaries

This was the core reading for the actuarial professional examination on survival models. In some places, the approach is more practically oriented and often placed in an insurance context, whereas the course is more academic and not only oriented towards insurance applications. All in all, this is the main reference for the first half of the course.

D.R. Cox and D. Oakes: Analysis of Survival Data. Chapman & Hall (1984)

This is the classical text on survival analysis. The presentation is concise, but gives a broad view of the subject. The text contains exercises. This is the main reference for about half the course. It also contains much more related material beyond the scope of the course.

H.U. Gerber: Life Insurance Mathematics. 3rd edition, Springer (1997)

The presentation is concise. Only three chapters are relevant. Chapter 2 gives an introduction to lifetime distributions, Chapter 7 discusses the multiple decrement model and Chapter 11 estimation procedures for lifetime distributions. The remainder combines the ideas with interest rate theory.

Klein & Moeschberger: Survival Analysis: Techniques for Censored and Truncated Data, 2nd edition, Springer (2003)

This is an excellent source for a lot of the survival analysis topics, particularly censoring and truncation, and the Kaplan-Meier and Nelson-Aalen estimators. Lots of terrific examples.

Contents

Glossary

1  Introduction: Survival Models
   1.1  Early life tables
   1.2  Basic statistical methods for lifetime distributions
        1.2.1  Plot the data
        1.2.2  Fit a model
        1.2.3  Significance test
   1.3  Goals of the course
   1.4  Survival function and hazard rate (force of mortality)
   1.5  Residual lifetimes
   1.6  Force of mortality

2  Lifetime distributions and life tables
   2.1  Defining mortality laws from hazards
   2.2  Curtate lifespan
   2.3  Mortality laws: Simple or Complex? Parametric or Nonparametric?
   2.4  Life Tables
   2.5  Notation for life tables
   2.6  Continuous and discrete models
   2.7  Crude estimation of life tables – discrete method

3  Estimating life tables
   3.1  Crude life table estimation – direct method for continuous data
   3.2  Are life tables continuous or discrete?
   3.3  Interpolation for non-integer ages
   3.4  An example: Fractional lifetimes can matter

4  CER and Census approximation
   4.1  The “continuous” method for life-table estimation
   4.2  Comparing continuous and discrete methods
   4.3  Central exposed to risk and the census approximation
        4.3.1  Insurance data
        4.3.2  Census approximation
   4.4  Lexis diagrams

5  Life expectancy and comparing life tables
   5.1  Life Expectancy
        5.1.1  What is life expectancy?
        5.1.2  Example
        5.1.3  Life expectancy and mortality
   5.2  An example of life-table computations
   5.3  Cohorts and period life tables

6  Comparing Life Tables
   6.1  The Poisson model
        6.1.1  Defining the Poisson model
        6.1.2  Intuitive justification for the Poisson model
   6.2  Testing hypotheses for qx and µx+1/2
        6.2.1  The tests
        6.2.2  An example

7  Graduation (smoothing life tables) and the single-decrements model
   7.1  Graduation
        7.1.1  Parametric models
        7.1.2  Reference to a standard table
        7.1.3  Nonparametric smoothing
        7.1.4  Methods of fitting
        7.1.5  Examples
   7.2  Single-decrement model
   7.3  Rates in the single-decrement model

8  Multiple-decrements model
   8.1  Introduction to the multiple-decrements model
        8.1.1  An introductory example
        8.1.2  Basic theory
        8.1.3  Multiple decrements – time-homogeneous rates
   8.2  Estimation for general multiple decrements
   8.3  Example: Workforce model
   8.4  The distribution of the endpoint
   8.5  Cohabitation–dissolution model

9  Continuous-time Markov chains
   9.1  General Markov chains
        9.1.1  Discrete time, estimation of Π-matrix
        9.1.2  Estimation of the Q-matrix
   9.2  The induced Poisson process (non-examinable)
   9.3  Estimation of Markov rates
   9.4  Parametric and time-dependent models
        9.4.1  Example: Marital status model
        9.4.2  The general birth-and-death process
        9.4.3  Lower-dimensional parametric models of simple birth-and-death processes
        9.4.4  Asymptotic behaviour of estimates
   9.5  Time-varying transition rates
        9.5.1  Maximum likelihood estimation
        9.5.2  Example
   9.6  Occupation times
        9.6.1  The multiple decrements model
        9.6.2  The illness model

10  Survival analysis: Introduction
    10.1  Censoring and truncation
    10.2  Likelihood and Right Censoring
          10.2.1  Random censoring
          10.2.2  Non-informative censoring
    10.3  Time on test
    10.4  Non-parametric survival estimation
          10.4.1  Review of basic concepts
          10.4.2  Kaplan–Meier estimator
          10.4.3  Nelson–Aalen estimator and new estimator of S
          10.4.4  Invented data set
    10.5  Example: The AML study

11  Confidence intervals and left truncation
    11.1  Greenwood’s formula
          11.1.1  The cumulative hazard
          11.1.2  The survival function and Greenwood’s formula
          11.1.3  Reminder of the δ method
    11.2  Left truncation
    11.3  AML study, continued
    11.4  Survival to ∞
          11.4.1  Example: Time to next birth
    11.5  Computing survival estimators in R (non-examinable)
          11.5.1  Survival objects with only right-censoring
          11.5.2  Other survival objects

12  Semiparametric models
    12.1  Introduction to semiparametric modeling
    12.2  Accelerated Life models
    12.3  Proportional Hazards models
          12.3.1  Plots
          12.3.2  Generalised linear survival models
    12.4  Simulation example

13  Relative-risk models
    13.1  The relative-risk regression model
    13.2  Partial likelihood
    13.3  Significance testing
    13.4  Estimating baseline hazard
          13.4.1  Breslow’s estimator
          13.4.2  Individual risk ratios

14  Relative risk regression, continued
    14.1  Dealing with ties
    14.2  The AML example
    14.3  The Cox model in R
    14.4  Graphical tests of the proportional hazards assumption
          14.4.1  Log cumulative hazard plot
          14.4.2  Andersen plot
          14.4.3  Arjas plot
          14.4.4  Leukaemia example

15  Testing Hypotheses
    15.1  Tests in the regression setting
    15.2  Non-parametric testing of survival between groups
          15.2.1  General principles
          15.2.2  Standard tests
    15.3  The AML example
          15.3.1  Kidney dialysis example
          15.3.2  Nonparametric tests in R (non-examinable)

16  Testing the fit of survival models
    16.1  General principles of model selection
          16.1.1  The idea of model diagnostics
          16.1.2  A simulated example
    16.2  Cox–Snell residuals
    16.3  Bone marrow transplantation example
    16.4  Testing the proportional hazards assumption: Schoenfeld residuals
    16.5  Martingale residuals
          16.5.1  Definition of martingale residuals
          16.5.2  Application of martingale residuals for estimating covariate transforms
    16.6  Outliers and influence
          16.6.1  Deviance residuals
          16.6.2  Delta–beta residuals
    16.7  Residuals in R (non-examinable)
          16.7.1  Dutch Cancer Institute (NKI) breast cancer data
          16.7.2  Complementary log-log plot
          16.7.3  Andersen plot
          16.7.4  Cox–Snell residuals
          16.7.5  Martingale residuals
          16.7.6  Schoenfeld residuals

A  Problem sheets
   A.1  Revision, lifetime distributions
   A.2  Estimation of lifetime distributions and Markov models
   A.3  Graduation, Markov models, basic survival analysis
   A.4  Markov models; Model testing; Proportional-hazards regression

B  Solutions
   B.1  Revision, lifetime distributions
   B.2  Estimation of lifetime distributions and Markov models
   B.3  Graduation, Markov models, basic survival analysis
   B.4  Markov models; Model testing; Proportional-hazards regression

Glossary

cdf  Cumulative distribution function.

census approximation  Method of estimating Central Exposed To Risk based on observations of curtate age at death.

Central Exposed To Risk  Total time that individuals are at risk. Under some circumstances, this is about the number of individuals at risk at the midpoint of the estimation period.

cohort  A group of individuals of equivalent age (in whatever sense relevant to the study), observed over a period of time.

cohort life table  Life table showing mortality of individuals born in the same year (or approximately the same year).

curtate lifetime  The integer part of a real-valued lifetime.

force of mortality  Same as mortality rate, but also used in a discrete context.

graduation  Smoothing for life tables.

hazard rate  Density divided by survival. Thus, the instantaneous probability of the event occurring, conditioned on survival to time t.

Initial Exposed To Risk  Number of individuals at risk at the start of the estimation period.

Maximum Likelihood Estimator  Estimator for a parameter, chosen to maximise the likelihood function.

mortality rate  Same as hazard rate, in a mortality context.

period life table  Life table showing mortality of individuals of a given age living in the same year (or approximately the same year).

Radix  The initial number of individuals in the nominal cohort described by a life table.

single-decrement model  Two-state Markov model with transient state ‘alive’ and absorbing state ‘dead’.

stopping time  A random time that does not depend on the future.

Lecture 1

Introduction: Survival Models

1.1 Early life tables

In one of the earliest treatises on probability George Leclerc Buffon considered the problem of finding the fundamental unit of risk, the smallest discernible probability. He wrote that “all fear or hope, whose probability equals that which produces the fear of death, in the moral realm may be taken as unity against which all other fears are to be measured.” [2, p. 56] In other words, because no healthy man in the prime of life (he argued) attends to the risk that he may die in the next twenty-four hours, Buffon considered that events with this probability could be treated as negligible; after all, “since the intensity of the fear of death is a good deal greater than the intensity of any other fear or hope,” any other risk of equivalent probability of a less troubling event — such as winning a lottery — would leave a person equally indifferent. He decided that the appropriate age to consider for a man to be in the prime of health was 56 years. But what is that probability, that a 56-year-old man dies in the next day?

To answer this, Buffon turned to mortality tables. A colleague (one M. Dupré of Saint-Maur) assembled the registers of 12 rural parishes and 3 parishes of Paris, in which 23,994 deaths were recorded. The ages at death were all recorded, so that he knew that 174 of the deaths were at age 56; that is, between the 56th and 57th birthdays.¹ Our naïve estimator for the probability of an event is

\[ \text{probability of occurrence} = \frac{\text{number of occurrences}}{\text{number of opportunities}}. \]

The number of occurrences of the event (death of an individual aged 56) is observed to be 174. But what about the denominator? The number of “opportunities” for this event is just the number of individuals in the population at the appropriate age. The most direct way to determine this number would be a time-consuming census. Buffon’s approach (and that of other 17th- and 18th-century creators of such life tables) depended upon the following implicit logic: Suppose the population is stable, so that the same number of people in each age group die each year. Since every person dies at some time (it is believed), the total number of people in the population who live to their 56th birthday will be exactly the same as the number of people observed to have died after their 56th birthday in the particular year under observation, which happens to be 5031. The probability of dying in one day may then be estimated as

\[ \frac{1}{365} \times \frac{174}{5031} \approx \frac{1}{10000}, \]

and Buffon proceeds to reason with this estimate.

¹ Actually, Buffon’s statistical procedure was a bit more complicated than this. The recorded numbers of deaths at ages 55, 56, 57, 58, 59, 60 were 280, 130, 129, 182, 90, 534 respectively. Buffon observed that the priests (“particularly the country priests”) were likely to record round numbers for the age at death, rather than the exact age — which they may not know anyway. He thus decided that it would make more sense to smooth (as statisticians would call the procedure today) or graduate (as actuaries call it) the data. We will learn about graduation in Lecture 7.

From this elementary exercise we see that:

• Mortality probabilities can be estimated as the ratio of the number of deaths to the number of individuals “at risk”.

• The numerator (the number of deaths) is usually straightforward to determine.

• The denominator (the number at risk) can be challenging.

• Mortality can serve as a model for thinking about risks (and opportunities) more generally, for events happening at random times.

• You don’t get very far in thinking about mortality and other risks without some sort of theoretical model.

The last claim may require a bit more elucidation. What would a naïve, empirical approach to life tables look like? Given a census of the population by age, and a list of the ages at death in the following year, we could compute the proportion of individuals aged x who died in the following year. This is merely a free-floating fact, which could be compared with other facts, such as the measured proportion of individuals aged x who died in a different year (or at a different age, or a different place, etc.) If you want to talk about a probability of dying in that year (for which the proportion would serve as an estimate), this is a theoretical construct, which can be modelled (as we will see) in different ways. Once you have a probability model, this allows you to pose (and perhaps answer) questions about the probability of dying in a given day, make predictions about past and future trends, and isolate the effect of certain medications or life-style changes on mortality.

There are many different kinds of problems for which the same survival analysis statistics may be applied. Some examples which we will consider at various points in this course are:

• Time to failure of a machine with multiple internal components.

• Time from infection until a subject shows signs of illness.

• Time from starting to try to conceive a baby until a woman is pregnant.

• Time until a person diagnosed with (and perhaps treated for) a disease has a recurrence.

• Time until an unmarried couple marries or separates.

Often, though, we will use the term “lifetime” to represent any waiting time, along with its attendant vocabulary: survival probability, mortality rate, cause of death, etc.


1.2 Basic statistical methods for lifetime distributions

In Table 1.1 we see the estimated ages at death for 103 tyrannosaurs, from four different species, as reported in [7]. Let us treat them here as a single population.

A. sarcophagus 2,4,6,8,9,11,12,13,14,14,15,15,16,17,17,18,19,19,20,21,23,28

T. rex 2,6,8,9,11,14,15,16,17,18,18,18,18,18,19,21,21,21,22,22,22,22,22,22,23,23,24,24,28

G. libratus 2,5,5,5,7,9,10,10,10,11,12,12,12,13,13,14,14,14,14,14,15,16,16,17,17,17,18,18,18,19,19,19,20,20,21,21,21,21,22

Daspletosaurus 3,9,10,17,18,21,21,22,23,24,26,26,26

Table 1.1: 103 estimated ages of death (in years) for four different tyrannosaur species.

In Part A Statistics you learned to do the following:

1.2.1 Plot the data

The most basic thing you can do with any data is to sort the observations into bins of some width δ, and plot the histogram, as in Figure 1.1. This does not presuppose any model.

[Figure 1.1: Histogram of tyrannosaur mortality data from Table 1.1. Panel (a): narrow bins; panel (b): wide bins. Axes: age (yrs) against frequency.]
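For anyone trying the optional computing exercises, the data in Table 1.1 can be entered and plotted in R roughly as follows. This is only a sketch: the object names are mine, and the bin widths are illustrative choices rather than the ones used for the printed figure.

    # Ages at death (years), transcribed from Table 1.1
    a.sarcophagus  <- c(2,4,6,8,9,11,12,13,14,14,15,15,16,17,17,18,19,19,20,21,23,28)
    t.rex          <- c(2,6,8,9,11,14,15,16,17,18,18,18,18,18,19,21,21,21,22,22,22,22,
                        22,22,23,23,24,24,28)
    g.libratus     <- c(2,5,5,5,7,9,10,10,10,11,12,12,12,13,13,14,14,14,14,14,15,16,16,
                        17,17,17,18,18,18,19,19,19,20,20,21,21,21,21,22)
    daspletosaurus <- c(3,9,10,17,18,21,21,22,23,24,26,26,26)
    tyr.ages <- c(a.sarcophagus, t.rex, g.libratus, daspletosaurus)   # pooled, length 103

    # Histograms with narrow and wide bins, in the spirit of Figure 1.1
    par(mfrow = c(1, 2))
    hist(tyr.ages, breaks = seq(0, 30, by = 2),  main = "Narrow bins", xlab = "age (yrs)")
    hist(tyr.ages, breaks = seq(0, 30, by = 10), main = "Wide bins",   xlab = "age (yrs)")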


1.2.2 Fit a model

Suppose we believe the list of lifetimes to be i.i.d. samples from a fixed (unknown) distribution. We can then use the data to determine which distribution it was that generated the samples.

In Part A statistics you learned parametric maximum likelihood estimation. Suppose the unknown distribution is believed to be one of a family of distributions that is indexed by a possibly multivariate (k-dimensional) parameter λ ∈ Λ ⊂ R^k. That is — taking just the case of data from a continuous distribution — the distribution of the independent observations has density f(T; λ) at the point T, if the true value of the parameter is λ. Suppose we have observed n independent lifetimes T1, . . . , Tn. We define the log-likelihood function to be the (natural) log of the density of the observations, considered as a function of the parameter. By the assumption of independence, this is

\[ \ell_{T_1,\dots,T_n}(\lambda) = \ell_{\mathbf{T}} := \sum_{i=1}^n \log f(T_i;\lambda). \tag{1.1} \]

(We use T to represent the vector (T1, . . . , Tn).) The maximum likelihood estimator (MLE) is simply the value of λ that makes this as large as possible:

\[ \hat\lambda = \hat\lambda(\mathbf{T}) = \hat\lambda(T_1,\dots,T_n) := \operatorname*{argmax}_{\lambda\in\Lambda} \prod_{i=1}^n f(T_i;\lambda). \tag{1.2} \]

Notice the nomenclature: max_{λ∈Λ} f(λ) picks the maximal value in the range of f; argmax_{λ∈Λ} f(λ) picks the λ-value in the domain of f for which this maximum is attained.

The most basic model for lifetimes is the exponential. This is the “memoryless” waiting-time distribution, meaning that the remaining waiting time always has the same distribution, conditioned on the event not having occurred up to any time t. This distribution has a single parameter (k = 1) µ, and density

\[ f(\mu; T) = \mu e^{-\mu T}. \]

The parameter µ is chosen from the domain Λ = (0,∞). If we observe independent lifetimes T1, . . . , Tn from the exponential distribution with parameter µ, and let $\bar T := n^{-1}\sum_{i=1}^n T_i$ be the average, the log likelihood is

\[ \ell_{\mathbf{T}}(\mu) = \sum_{i=1}^n \log\left(\mu e^{-\mu T_i}\right) = n\left(\log\mu - \bar T\,\mu\right), \]

which has maximum at $\hat\mu = 1/\bar T = n/\sum T_i$. This is an example of what we will see to be a general principle:

\[ \text{Estimated rate} = \frac{\#\text{ events}}{\text{total time at risk}}. \tag{1.3} \]

In some cases we will be thinking of the time as random, in other cases the number of events, but the formula (1.3) remains. The challenge will be to estimate the number of events and the total time in a way that they correspond to the same time period and the same population, since they are often estimated from different data sources and timed in different ways.

For large n, the estimator $\hat\lambda(T_1,\dots,T_n)$ is approximately normally distributed, under some regularity conditions, and it has some other optimality properties (finite-sample and asymptotic). This allows us to construct approximate confidence intervals/regions to indicate the precision of maximum likelihood estimates. Specifically, for

\[ \hat\lambda \sim N\!\left(\lambda,\,(I(\lambda))^{-1}\right), \quad\text{where}\quad I_{j_1 j_2}(\lambda) = -E\!\left[\frac{\partial^2}{\partial\lambda_{j_1}\partial\lambda_{j_2}} \sum_{i=1}^n \log f(T_i;\lambda)\right] = -E\!\left[\frac{\partial^2 \ell_{\mathbf{T}}(\lambda)}{\partial\lambda_{j_1}\partial\lambda_{j_2}}\right] \]


are the entries of the Fisher Information matrix. Of course, we generally don’t know what λ is — otherwise, we probably would not be bothering to estimate it! — so we may approximate the Information matrix by computing $I_{j_1 j_2}(\hat\lambda)$ instead. Furthermore, we may not be able to compute the expectation in any straightforward way; in that case, we use the principle of Monte Carlo estimation: We approximate the expectation of a random variable by the average of a sample of observations. We already have the sample T1, . . . , Tn from the correct distribution, so we define the observed information matrix

\[ J_{j_1 j_2}(\lambda, T_1,\dots,T_n) = -\frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \log f(T_i;\lambda)}{\partial\lambda_{j_1}\partial\lambda_{j_2}}. \]

Again, we may substitute $J_{j_1 j_2}(\hat\lambda, T_1,\dots,T_n)$, since the true value of λ is unknown. Thus, in the case of a one-dimensional parameter (where the covariance matrix is just the variance and the matrix inverse (I(λ))^{-1} is just the multiplicative inverse in R), we obtain

\[ \left[\hat\lambda - 1.96\sqrt{\frac{1}{I(\hat\lambda)}},\;\; \hat\lambda + 1.96\sqrt{\frac{1}{I(\hat\lambda)}}\right] \]

as an approximate 95% confidence interval for the unknown parameter λ.

In the case of the exponential model, we have

\[ \ell''_{\mathbf{T}}(\mu) = -\frac{n}{\mu^2}, \]

so that the standard error for $\hat\mu$ is µ/√n, which we estimate by $\hat\mu/\sqrt{n}$.

For the tyrannosaur data of Table 1.1, we have

T̄ = 16.03,
µ̂ = 0.062,
SE_µ̂ = 0.0061,
95% confidence interval for µ = (0.050, 0.074).
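These numbers are easy to reproduce; a minimal R sketch, assuming tyr.ages is the pooled vector of 103 ages entered in the snippet following Figure 1.1:

    n      <- length(tyr.ages)          # 103
    Tbar   <- mean(tyr.ages)            # about 16.03
    mu.hat <- 1 / Tbar                  # exponential MLE, about 0.062
    se     <- mu.hat / sqrt(n)          # estimated standard error, about 0.0061
    mu.hat + c(-1, 1) * 1.96 * se       # approximate 95% CI, about (0.050, 0.074)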

Aside: In the special case of exponential lifetimes, we can construct exact confidence intervals, since we know the distribution of $n/\hat\mu \sim \Gamma(n, \mu)$, so that $2n\mu/\hat\mu \sim \chi^2_{2n}$ allows us to use χ²-tables.

Is the fit any good? We have various standard methods of testing goodness of fit — we discuss an example in section 1.2.3 — but it’s pretty easy to see by eye that the histograms in Figure 1.1 aren’t going to fit an exponential distribution, which is a declining density, very well. In Figure 1.2 we show the empirical (observed) cumulative distribution of tyrannosaur deaths, together with the cdf of the best exponential fit, which is obviously not a very good fit at all.

We also show (in green) the fit to a distribution which is an example of a larger class that we will meet later, called the “Weibull” distributions. Instead of the exponential cdf F(t) = 1 − e^{−µt}, suppose we take F(t) = 1 − e^{−αt²}. Note that if we define $Y_i = T_i^2$, we have

\[ P(Y_i \le y) = P(T_i \le \sqrt{y}) = 1 - e^{-\alpha y}, \]

so $Y_i$ is actually exponentially distributed with parameter α. Thus, the MLE for α is

\[ \hat\alpha = \frac{n}{\sum T_i^2}. \]

We see in Figure 1.2 that this fits much better than the exponential distribution.
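A sketch of the corresponding computation, and a plot in the spirit of Figure 1.2, again assuming tyr.ages from the earlier snippet:

    n         <- length(tyr.ages)
    mu.hat    <- 1 / mean(tyr.ages)                # exponential MLE, as above
    alpha.hat <- n / sum(tyr.ages^2)               # MLE of alpha in F(t) = 1 - exp(-alpha*t^2)
    plot(ecdf(tyr.ages), xlab = "Age", ylab = "CDF", main = "")   # empirical cdf
    curve(1 - exp(-mu.hat * x),      add = TRUE, col = "red")     # exponential fit
    curve(1 - exp(-alpha.hat * x^2), add = TRUE, col = "green")   # Weibull-type fit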


[Figure 1.2: Empirical cumulative distribution of tyrannosaur deaths (circles), together with cdf of exponential fit (red) and Weibull fit (green). Axes: age against CDF.]

1.2.3 Significance test

The maximum-likelihood approach is optimal in many respects for picking the correct parameter based on the observed data, under the assumption that the observed data did actually come from a distribution in the appointed parametric family. But did they? We already looked at a plot, Figure 1.2, comparing the fitted cdf to the observed cdf. The Weibull fit was clearly better. But how much better?

One way to answer this question is to apply a significance test. We start with a set of distributions H1, such that we know that it includes the true distribution (for instance, the set of all distributions on (0,∞)), and a null hypothesis H0 ⊂ H1, and we wish to test how plausible the observations are as a sample from H0, rather than from the alternative hypothesis H1 \ H0. The standard parametric procedure is to use a χ² goodness of fit test, based on the statistic

\[ X^2 = \sum_{j=1}^{m} \frac{(O_j - E_j)^2}{E_j} \sim \chi^2_{m-k-1} \quad \text{approximately, under } H_0, \tag{1.4} \]

where m is the number of bins (e.g. from your histogram), but merged to satisfy size restrictions, and k the number of parameters estimated. O_j is the random variable modelling the number observed in bin j, E_j the number expected under maximum likelihood parameters. To justify the approximate distribution for the test statistic, we require that at most 20% of bins have E_j ≤ 5, and none E_j ≤ 1 (‘size restriction’).

We then obtain X² = 17.9 for the Weibull model, and X² = 92.2 for the exponential distribution. The latter produces a p-value on the order of 10⁻¹⁸, but the former has a p-value around 0.0013. Thus, while the data could not possibly have come from an exponential distribution, or anything like it, the Weibull distribution, while unlikely to have produced exactly these data, is a plausible candidate.


Age      Observed   Expected        Expected
                    (Exponential)   (Weibull)
0–4          8          22.7            5.4
5–9         13          21.5           19.3
10–14       22          15.7           25.3
15–19       39          11.5           22.7
20–24       25           8.4           15.7
25+         15          23.1           14.6

Table 1.2: χ² computation for fitting tyrannosaur data.

1.3 Goals of the course

Why do we need special statistical methods for lifetime data? Some reasons are:

• Large samples Other models, such as single-decrement models with time-varying transition rates, may be closer to the truth. We may have more elaborate multivariate parametric models for the transition rates, but they are unlikely to be precisely true. The problem then is that the parametric families will eventually be rejected, once the sample size is large enough — and since we may be concerned with statistical surveys of, for example, the entire population of the UK, the sample sizes will be very large indeed. Nonparametric or semiparametric methods will be better able to let the data speak for themselves.

• Small samples While nonparametric models allow the data to speak for themselves, sometimes we would prefer that they be somewhat muffled. When the number of observed deaths is small — which can be the case, even in a very large data set, when considering advanced ages, above 90, and certainly above 100, because of the small number of individuals who survive to be at risk, but also in children, because of the very low mortality rate — the estimates are less reliable, being subject to substantial random noise. Also, the mortality pattern changes over time, and we are often interested in future mortality, but only have historical data. A non-parametric estimate that precisely reflects the data at hand may reflect less well the underlying processes, and be ill-suited to projection into the future. Graduation (smoothing) and extrapolation methods have been developed to address these issues.

• Incomplete observations Some observations will be incomplete. We may not know the exact time of a death, but only that it occurred before a given time, or after a given time, or between two known times, a phenomenon called “censoring”. (When we are informed only of the year of a death, but not the day or time, this is a kind of censoring.) Or we may have observed only a sample of the population, with the sample being not entirely random, but chosen according to being alive at a certain date, or having died before a certain date, a phenomenon known as “truncation”. We need special techniques to make use of these partial observations. Since we are observing times, subjects who break off a study midway through provide partial information in a clearly structured way.

• Successive events A key fact about time is its sequence. A patient is infected, develops symptoms, has a diagnosis, a treatment, is cured or relapses, at some point dies. Some or all of these events may be considered as a progression, and we may want to model the sequence of random times. Some care is needed to carry out joint maximum likelihood estimation of all transition rates in the model, from one or several individuals observed. This can be combined with time-varying transition rates.

• Comparing lifetime distributions We may wish to compare the lifetime distributions of different groups (e.g., smokers and nonsmokers; those receiving a traditional cholesterol medication and those receiving the new drug) or the effect of a continuous parameter (e.g., weight) on the lifetime distribution.

• Changing rates Mortality rates are not static in time, creating disjunction between period measures — looking at a cross-section of the population by age as it exists at a given time — and cohort measures — looking at a group of individuals born at a given time, and following them through life.

1.4 Survival function and hazard rate (force of mortality)

As discussed in chapter 1, the simplest lifetime model is the single-decrement model: The individual is alive for some length of time L, at the end of which he/she becomes dead. This is a homogeneous Markov process if and only if L has an exponential distribution. In general, we may describe a lifetime distribution — which is simply the distribution of a nonnegative random variable — in several different ways:

cdf:                F(t) = P(L ≤ t);
survival function:  S(t) = F̄(t) = 1 − F(t) = P(L > t);
density function:   f(t) = dF/dt;
hazard rate:        λ(t) = f(t)/F̄(t).

The hazard rate is also called mortality rate in survival contexts. The traditional name in demography is force of mortality. This may be thought of as the instantaneous rate of dying per unit time, conditioned on having already survived. The exponential distribution with parameter λ ∈ (0,∞) is given by

cdf:                F(t) = 1 − e^{−λt};
survival function:  F̄(t) = e^{−λt};
density function:   f(t) = λe^{−λt};
hazard rate:        λ(t) = λ.

Thus, the exponential is the distribution with constant force of mortality, which is a formal statement of the “memoryless” property.

1.5 Residual lifetimes

Assume that there is an overall lifetime distribution, and every individual born has a random lifetime according to this distribution. Then, if we observe somebody now aged x, and we denote his residual lifetime T − x by T_x, then we have

\[ \bar F_{T_x}(t) = \bar F_{T-x \mid T>x}(t) = \frac{\bar F_T(x+t)}{\bar F_T(x)}, \qquad f_{T_x}(t) = f_{T-x \mid T>x}(t) = \frac{f_T(x+t)}{\bar F_T(x)}, \qquad t \ge 0. \tag{1.5} \]


So, any distribution of a full lifetime T is naturally associated with a family of conditional distributions of T given T > x.

1.6 Force of mortality

We now look more closely at the hazard rate, which may be defined as

\[ h_T(t) = \mu_t = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} P(T \le t+\varepsilon \mid T > t) = \lim_{\varepsilon \downarrow 0} \frac{\frac{1}{\varepsilon} P(t < T \le t+\varepsilon)}{P(T > t)} = \frac{f_T(t)}{\bar F_T(t)}. \tag{1.6} \]

The density f_T(t) is the (unconditional) infinitesimal probability to die at age t. The hazard rate h_T(t) is the (conditional) infinitesimal probability to die at age t of an individual known to be alive at age t. It may seem that the hazard rate is a more complicated quantity than the density, but it is very well suited to modelling mortality. Whereas the density has to integrate to one and the distribution function (survival function) has boundary values 0 and 1, the force of mortality has no constraints, other than being nonnegative — though if “death” is certain the force of mortality has to integrate to infinity. Also, we can read its definition as a differential equation and solve

\[ \bar F_T'(t) = -\mu_t \bar F_T(t), \quad \bar F_T(0) = 1 \;\Longrightarrow\; \bar F_T(t) = \exp\left\{-\int_0^t \mu_s\, ds\right\}, \qquad t \ge 0. \tag{1.7} \]

We can now express the distribution of Tx as

\[ \bar F_{T_x}(t) = \frac{\bar F_T(x+t)}{\bar F_T(x)} = \exp\left\{-\int_x^{x+t} \mu_s\, ds\right\} = \exp\left\{-\int_0^t \mu_{x+r}\, dr\right\}, \qquad t \ge 0. \tag{1.8} \]

Note that this implies that h_{T_x}(t) = h_T(x+t), so it is really associated with age x+t only, not with initial age x nor with time t after initial age. Also note that, given a measurable function µ : [0,∞) → R, F̄_{T_x}(0) = 1 always holds, and F̄_{T_x} is decreasing if and only if µ ≥ 0. F̄_{T_x}(∞) = 0 if and only if ∫₀^∞ µ_t dt = ∞. This leaves a lot of modelling freedom via the force of mortality.

Densities can now be obtained from the definition of the force of mortality (and consistency) as f_{T_x}(t) = µ_{x+t} F̄_{T_x}(t).
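As a quick numerical sanity check on (1.7), the survival function can be recovered from a hazard by numerical integration. A minimal R sketch, using a constant hazard of 0.1 (an arbitrary illustrative value) so that the answer is known in closed form:

    mu <- function(s) rep(0.1, length(s))   # constant hazard: the exponential case
    surv <- function(t, dt = 0.001) {
      s <- seq(0, t, by = dt)
      exp(-sum(mu(s)) * dt)                 # crude Riemann approximation of exp(-integral of mu)
    }
    surv(5)        # about 0.606
    exp(-0.1 * 5)  # exact value exp(-0.5), about 0.6065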

Lecture 2

Lifetime distributions and life tables

All the stochastic models in this course will be within the class of discrete state-space Markov processes, which may be time inhomogeneous. We will not be using the general form of these models, but will be simplifying and specialising them substantially. What unifies this course is the nature of the questions we will be asking. In the standard theory of Markov processes, we focus early on stationary processes. Our models will not be stationary, because they have absorbing states. The key questions will concern the absorbing states: when the process is absorbed (the “lifetime”), and, in some models, which state absorbs it.

We need to be careful to distinguish between representations of the population and representations of the individual. In the present context, the Markov process always represents an individual. The population consists of some number of independently running copies of the basic Markov process. In simple cases — for instance, exponential mortality — the population-level process (total population at time t) will also be a Markov process, a “pure-death” chain. This raises the complication that there are usually two different kinds of time running: the “internal” time of the individual process, which usually represents age in some way, and calendar time. The full implications of these interacting time-frames — also called the cohort and the period perspective — are a major topic in demography, and we will only touch on them in this course.

2.1 Defining mortality laws from hazards

We are now in the position to model mortality laws via their force of mortality. Clearly, the Exp(λ) distribution has a constant hazard rate µ_t ≡ λ, and the uniform distribution on [0, ω] has a hazard rate

\[ h_T(t) = \frac{1}{\omega - t}, \qquad 0 \le t < \omega. \tag{2.1} \]

Note that here ∫₀^ω h_T(t) dt = ∞ squares with F̄_T(ω) = 0 and forces the maximal age ω. This is a general phenomenon: distributions with compact support have a divergent force of mortality at the supremum of their support, and the singularity is not integrable.

The Gompertz distribution is given by µ_t = Be^{θt}. More generally, Makeham’s law is given by

\[ \mu_t = A + Be^{\theta t}, \qquad \bar F_{T_x}(t) = \exp\left\{-At - m\left(e^{\theta(x+t)} - e^{\theta x}\right)\right\}, \qquad x \ge 0,\ t \ge 0, \tag{2.2} \]

for parameters A > 0, B > 0, θ > 0; m = B/θ. Note that mortality grows exponentially. If θ is big enough, the effect is very close to introducing a maximal age ω, as the survival probabilities decrease very quickly. There are other parameterisations for this family of distributions.


[Figure 2.1: Canadian mortality data, 1995–7. Panel (a): mortality rates against age for males and females, on a logarithmic scale; panel (b): the corresponding survival functions.]

The Gompertz distribution is named for British actuary Benjamin Gompertz, who in 1825 first published his discovery [10] that human mortality rates over the middle part of life seemed to double at constant age intervals. It is unusual, among empirical discoveries, for having been confirmed rather than refuted as data have improved and conditions changed, and it (or Makeham’s modification) serves as a standard model for mortality rates not only in humans, but in a wide variety of organisms. As an example, see Figure 2.1, which shows Canadian mortality rates from life tables produced by Statistics Canada (available at http://www.statcan.ca:80/english/freepub/84-537-XIE/tables.htm). Notice how close to a perfect line the mid-life mortality rates for both males and females are, when plotted on a logarithmic scale, showing that the Gompertz model is a very good fit.

Figure 2.1(b) shows the corresponding survival curves. It is worth recognising how much more informative the mortality rates are. In Figure 2.1(a) we see that male mortality is regularly higher than female mortality at all ages (and by a fairly constant ratio), and we see several phases of mortality — early decline, jump in adolescence, then steady increase through midlife, and deceleration in extreme old age — whereas Figure 2.1(b) shows us only that mortality is accelerating overall, and that males have accumulated higher mortality by late life.

The Weibull distribution suggests a polynomial rather than exponential growth of mortality:

\[ \mu_t = \alpha \rho^\alpha t^{\alpha-1}, \qquad \bar F_{T_x}(t) = \exp\left\{-\rho^\alpha\left((x+t)^\alpha - x^\alpha\right)\right\}, \qquad x \ge 0,\ t \ge 0, \tag{2.3} \]

for rate parameter ρ > 0 and exponent α > 0. The Weibull model is commonly used in engineering contexts to represent the failure-time distribution for machines. The Weibull distribution arises naturally as the lifespan of a machine with n redundant components, each of which has constant failure rate, such that the machine fails only when all components have failed. Later in the course we will discuss how to fit Weibull and Gompertz models to data.
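The qualitative difference between the two hazards is easy to see in a plot on a logarithmic scale, echoing Figure 2.1(a). A minimal R sketch, where the parameter values are purely illustrative and not fitted to any data:

    B <- 1e-4; theta <- 0.09          # illustrative Gompertz parameters
    rho <- 0.012; alpha <- 5          # illustrative Weibull parameters
    t <- 1:100
    gomp <- B * exp(theta * t)                  # Gompertz hazard: exponential growth
    weib <- alpha * rho^alpha * t^(alpha - 1)   # Weibull hazard: polynomial growth
    plot(t, gomp, type = "l", log = "y", ylim = range(gomp, weib),
         xlab = "age", ylab = "hazard rate")    # a straight line on the log scale
    lines(t, weib, lty = 2)                     # curved on the log scale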

Another class of distributions is obtained by replacing the parameter λ in the exponential distribution by a (discrete or continuous) random variable M. Then the specification of exponential conditional densities

\[ f_{T \mid M=\lambda}(t) = \lambda e^{-\lambda t} \tag{2.4} \]

determines the unconditional density of T as

\[ f_T(t) = \int_0^\infty f_{T,M}(t,\lambda)\, d\lambda = \int_0^\infty \lambda e^{-\lambda t} f_M(\lambda)\, d\lambda \qquad\text{or}\qquad f_T(t) = \sum_{\lambda > 0} \lambda e^{-\lambda t}\, P(M = \lambda). \tag{2.5} \]

Various special cases of exponential mixtures and other extensions of the exponential distribution have been suggested in a life insurance context. Some of these will be presented later.

E.g., for M ∼ Geom(p), i.e. P(M = k) = p^{k−1}(1 − p), k ≥ 1, we obtain

\[ \bar F_T(t) = \int_t^\infty f_T(s)\, ds = \int_t^\infty \sum_{k=1}^\infty f_{T \mid M=k}(s)\, p^{k-1}(1-p)\, ds = \sum_{k=1}^\infty \int_t^\infty k e^{-ks}\, p^{k-1}(1-p)\, ds = \frac{(1-p)e^{-t}}{1-pe^{-t}} \]

and one easily deduces

\[ f_T(t) = \frac{(1-p)e^{-t}}{(1-pe^{-t})^2}, \qquad t \ge 0. \]

The corresponding hazard rate is

\[ h_T(t) = \frac{f_T(t)}{\bar F_T(t)} = \frac{1}{1-pe^{-t}}, \]

which is decreasing but bounded, falling from 1/(1 − p) at t = 0 towards 1 as t → ∞.

2.2 Curtate lifespan

We have implicitly assumed that the lifetime distribution is continuous. However, we can always pass from a continuous random variable T on [0,∞) to a discrete random variable K = [T], its integer part, on N. If T models a lifetime, then K is called the associated curtate lifetime.

2.3 Mortality laws: Simple or Complex? Parametric or Nonparametric?

Consider the data for Albertosaurus sarcophagus in Table 1.1. We see here the estimated ages at death for 22 members of this species. Let us assume, for the sake of discussion, that these estimates are correct, and that our skeleton collection represents a simple random sample of all Albertosaurs that ever lived. If we assume that there was a large population of these dinosaurs, and that they died independently (and not, say, in a Cretaceous suicide pact), then these are 22 independent samples T1, . . . , T22 of a random variable T whose distribution we would like to know. Consider the probabilities

\[ q_x := P(x \le T < x+1). \]

Then the number of individuals observed to have curtate lifespan x has binomial distribution Bin(22, q_x). The MLE for a binomial probability is just the naïve estimate $\hat q_x$ = # successes / # trials (where a “success”, in this case, is a death in the age interval under consideration). To compute $\hat q_2$, then, we observe that there were 22 Albertosaurs from our sample still alive on their second birthdays, of which one unfortunate met its maker in the following year: $\hat q_2$ = 1/22 ≈ 0.046. As for $\hat q_3$, on the other hand, there were 21 Albertosaurs observed alive on their third birthdays, and all of them arrived safely at their fourth, making $\hat q_3$ = 0/21. This leads us to the peculiar conclusion that our best estimate for the probability of an Albertosaur dying in its third year is 0.046, but that the probability drops to 0 in its fourth year, then becomes nonzero again in the fifth year, and so on. This violates our intuition that mortality rates should be fairly smooth as a function of age. This problem becomes even more extreme when we consider continuous lifetime models. With no constraints, the optimal estimator for the mortality distribution would put all the mass on just those moments when deaths were observed in the sample, and no mass elsewhere — in other words, infinite hazard rate at a finite set of points at which deaths have been observed, and 0 everywhere else.
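These naïve estimates are quick to compute; a minimal R sketch for the A. sarcophagus ages from Table 1.1 (the object names are mine):

    alberto <- c(2,4,6,8,9,11,12,13,14,14,15,15,16,17,17,18,19,19,20,21,23,28)
    x       <- 0:max(alberto)
    deaths  <- sapply(x, function(a) sum(alberto == a))   # deaths at curtate age a
    at.risk <- sapply(x, function(a) sum(alberto >= a))   # still alive at exact age a
    q.hat   <- deaths / at.risk
    round(q.hat[x %in% 2:5], 3)   # 0.045 0.000 0.048 0.000  (erratic, as described in the text)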

As we see from Figure 1.1, the mortality distribution for the tyrannosaurs becomes much smoother and less erratic when we use larger bins for the histogram. This is no surprise, since we are then sampling from a larger baseline, leading to less random fluctuation. The simplest way to impose our intuition of regularity upon the estimators is to increase the time-step and reduce the number of parameters to estimate. An extreme version of this, of course, is to impose a parametric model with a small number of parameters. This is part of the standard tradeoff in statistics: a free, nonparametric model is sensitive to random fluctuations, but constraining the model imposes preconceived notions onto the data.

Notation: When the hazard rate µ_x is being assumed constant over each year of life, the continuous mortality rate has been reduced to a discrete set of parameters. What do we call these parameters? By convention, the value of µ that is in effect for all ages in [x, x + 1) is identified with just one age, namely µ_{x+1/2}.

2.4 Life Tables

Life tables represent a discretised form of the hazard function for a population, often together with raw mortality data. Apart from an aggregate table subsuming the whole population (of the UK, say), such tables exist for various groups of people characterized by their sex, smoking habits, job type, insurance level etc. This immediately raises interesting questions concerning the interdependence of such tables, but we focus here on some fundamental issues, which are already present for the single aggregate table.

We begin with a naïve, empirical approach. In Table 2.1 we see mortality data for men in the UK, in the years 1990–2, as provided by the Office for National Statistics. In the column labelled Ex we see the number of years “exposed to risk” in age-class x. Since everyone alive is at risk of dying, this should be exactly the sum of the number of individuals alive in the age class in years 1990, 1991, and 1992. The 1991 number is obtained from the census of that year, and the other two years are estimated. The column dx shows the number of men of the given age known to have died during this three-year period. The final column is mx := dx/Ex.

Again, this is an empirical fact, but we find ourselves in a quandary when we try to interpret it. What is mx? If the number of deaths is reasonably stable from year to year, then mx should be close to the fraction of men aged x who died each year. How close? The number of men at risk changes constantly, with each birthday, each death, each immigration or emigration. We sense intuitively that the effect of these changes would be small, but how small? And what would we do to compensate for this in a smaller population, where the effects are not negligible? How do we make projections about future states of the population?


AGE x | Ex | dx | mx × 10⁵

0 1066867 8779 8231 1059343 661 622 1054256 403 383 1047298 319 304 1037973 251 245 1022032 229 226 1003486 201 207 989008 186 198 976049 180 189 981422 180 1810 988020 179 1811 984778 179 1812 950853 185 1913 909437 212 2314 891556 259 2915 913423 366 4016 954339 496 5217 1002077 758 7618 1057508 922 8719 1124668 930 8320 1163581 979 8421 1195366 1030 8622 1210521 1073 8923 1238979 1105 8924 1263313 1083 8625 1296300 1068 8226 1313794 1145 8727 1311662 1090 8328 1291017 1110 8629 1259644 1129 9030 1219278 1101 9031 1176120 1144 9732 1135091 1128 9933 1103162 1095 9934 1071474 1142 10735 1035587 1218 11836 1017422 1291 12737 1010544 1399 13838 1006929 1536 15339 1006500 1660 16540 1016727 1662 16341 1046632 1967 18842 1092927 2240 20543 1167798 2543 21844 1134652 2656 23445 1071729 2836 26546 974301 2930 30147 955329 3251 34048 914107 3354 36749 848419 3486 41150 815653 3836 47051 811134 4251 524

AGE x | Ex | dx | mx × 10⁵

52 827414 4781 57853 822603 5324 64754 810731 5723 70655 794930 6411 80656 775350 6925 89357 759747 7592 99958 755475 8477 112259 761913 9484 124560 764497 10735 140461 753706 11880 157662 736868 12871 174763 725679 14463 199364 721743 16094 223065 713576 17704 248166 700666 19097 272667 681977 20930 306968 676972 22507 332569 678157 25127 370570 684764 27159 396671 600343 26508 441572 504808 24443 484273 422817 22792 539174 422480 24921 589975 431321 27286 632676 422822 29712 702777 399257 30856 772878 365168 30744 841979 328386 30334 923780 293014 29788 1016681 260517 28483 1093382 229149 27399 1195783 197322 25697 1302384 165896 23717 1429685 136103 20930 1537886 110565 18689 1690387 87989 16370 1860588 68443 13571 1982889 52151 11284 2163790 40257 9061 2250891 29000 7032 2424892 20124 5405 2685893 13406 4057 3026394 9392 3069 3267795 6446 2219 3442496 4384 1578 3599597 2795 1091 3903498 1761 701 3980799 1059 489 46176100 624 292 46795101 359 178 49582102 216 118 54630103 107 63 58879

Table 2.1: Male mortality data for England and Wales, 1990–2. From [9] (available online).


AGE x | lx | qx | ex

0 100000 0.0082 73.41 99180 0.0006 73.02 99119 0.0004 72.13 99081 0.0003 71.14 99052 0.0002 70.15 99028 0.0002 69.16 99006 0.0002 68.27 98986 0.0002 67.28 98967 0.0002 66.29 98950 0.0002 65.210 98932 0.0002 64.211 98914 0.0002 63.212 98896 0.0002 62.213 98877 0.0002 61.214 98855 0.0003 60.215 98826 0.0004 59.316 98786 0.0005 58.317 98735 0.0008 57.318 98660 0.0009 56.419 98574 0.0008 55.420 98492 0.0008 54.521 98410 0.0009 53.522 98325 0.0009 52.523 98238 0.0009 51.624 98150 0.0009 50.625 98066 0.0008 49.726 97986 0.0009 48.727 97900 0.0008 47.828 97819 0.0009 46.829 97735 0.0009 45.830 97647 0.0009 44.931 97559 0.0010 43.932 97465 0.0010 43.033 97368 0.0010 42.034 97272 0.0011 41.035 97168 0.0012 40.136 97053 0.0013 39.137 96930 0.0014 38.238 96796 0.0015 37.239 96648 0.0016 36.340 96489 0.0016 35.441 96332 0.0019 34.442 96151 0.0021 33.543 95954 0.0022 32.544 95745 0.0023 31.645 95521 0.0027 30.746 95269 0.0030 29.847 94982 0.0034 28.948 94660 0.0037 28.049 94313 0.0041 27.150 93926 0.0047 26.251 93486 0.0052 25.3

AGE x `x qx ex

52 92997 0.0058 24.453 92461 0.0065 23.654 91865 0.0070 22.755 91219 0.0080 21.956 90486 0.0089 21.057 89682 0.0099 20.258 88791 0.0112 19.459 87800 0.0124 18.660 86714 0.0139 17.961 85505 0.0156 17.162 84168 0.0173 16.463 82710 0.0197 15.664 81078 0.0221 14.965 79290 0.0245 14.366 77347 0.0269 13.667 75267 0.0302 13.068 72992 0.0327 12.469 70605 0.0364 11.870 68037 0.0389 11.271 65391 0.0432 10.672 62567 0.0473 10.173 59610 0.0525 9.674 56481 0.0573 9.175 53246 0.0613 8.676 49982 0.0679 8.177 46590 0.0744 7.778 43125 0.0807 7.279 39643 0.0882 6.880 36145 0.0967 6.581 32651 0.1036 6.182 29270 0.1127 5.783 25971 0.1221 5.484 22800 0.1332 5.185 19763 0.1425 4.886 16946 0.1555 4.587 14310 0.1698 4.288 11881 0.1799 4.089 9744 0.1946 3.890 7848 0.2016 3.691 6266 0.2153 3.392 4917 0.2355 3.193 3759 0.2611 2.994 2777 0.2787 2.795 2003 0.2912 2.696 1420 0.3023 2.597 991 0.3232 2.398 670 0.3284 2.299 450 0.3698 2.1100 284 0.3737 2.0101 178 0.3909 1.9102 108 0.4209 1.8103 63 0.4450 1.7

Table 2.2: Life table for English men, computed from data in Table 2.1


2.5 Notation for life tables

q_x     Probability that an individual aged x dies before reaching age x + 1

p_x     Probability that an individual aged x survives to age x + 1

_t q_x   Probability that an individual aged x dies before reaching age x + t

_t p_x   Probability that an individual aged x survives to age x + t

l_x     Number of people who survive to age x. Note: this is based on starting with a fixed number l_0 of lives, called the radix; most commonly, for human populations, the radix is 100,000

d_x     Number of individuals who die aged x (from the standard population)

_t m_x   Mortality rate between exact age x and exact age x + t

e_x     Remaining life expectancy at age x

Note the following relationships:

\[ d_x = l_x - l_{x+1}; \qquad l_{x+1} = l_x p_x = l_x(1 - q_x); \qquad {}_t p_x = \prod_{i=0}^{t-1} p_{x+i}. \]

The quantities q_x may be thought of as the discrete analogue of the mortality rate — we will call it the discrete mortality rate or discrete hazard function — since it describes the probability of dying in the next unit of time, given survival up to age x. In Table 2.2 we show the life table computed from the raw data of Table 2.1. (It differs slightly from the official table, because the official table added some slight corrections. The differences are on the order of 1% in q_x, and much smaller in l_x.) The life table represents the effect of mortality on a nominal population starting with size l_0, called the radix, and commonly fixed at 100,000 for large-population life tables. Imagine 100,000 identical individuals — a cohort — born on 1 January, 1900. In the column q_x we give the estimates for the probability of an individual who is alive on his xth birthday dying in the next year, before his (x+1)st birthday. (We discuss these estimates later in the chapter.) Thus, we estimate that 820 of the 100,000 will die before their first birthday. The surviving l_1 = 99,180 on 1 January, 1901, face a mortality probability of 0.00062 in their next year, so that we expect 61 of them to die before their second birthday. Thus l_2 = 99,119. And so it goes. The final column of this table, labelled e_x, gives remaining life expectancy; we will discuss this in section 5.1.
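The recursion just described is easy to mechanise. The following is a minimal sketch, not part of the original notes: the function name is ours, and the q-values are simply the first few entries of Table 2.2, used for illustration only.

# Build l_x and d_x columns from one-year death probabilities q_x and a radix l_0.
def build_life_table(q, radix=100_000):
    """Return lists (l, d) with d[x] = l[x]*q[x] and l[x+1] = l[x] - d[x]."""
    l = [float(radix)]
    d = []
    for qx in q:
        dx = l[-1] * qx          # expected deaths aged x
        d.append(dx)
        l.append(l[-1] - dx)     # survivors to age x+1
    return l, d

if __name__ == "__main__":
    q = [0.0082, 0.0006, 0.0004, 0.0003, 0.0002]   # ages 0..4, from Table 2.2
    l, d = build_life_table(q)
    for x, (lx, dx) in enumerate(zip(l, d)):
        print(f"x={x}: l_x={lx:8.0f}  d_x={dx:6.0f}")

Running this reproduces l_1 = 99,180 and d_0 = 820, as in the text.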

2.6 Continuous and discrete models

The first decision that needs to be made in setting up a lifetime model is whether to model lifetimes as continuous or discrete random variables. On first consideration, the discrete approach may seem to recommend itself: after all, we are commonly concerned with mortality data given in whole years or, if not years, then whole numbers of months, weeks, or days. Real measurements are inevitably discrete multiples of some minimal unit of precision. In fact, though, discrete models for measured quantities are problematic because

• They tie the analysis to one unit of measurement. If you start by measuring lifespans in years, and restrict the model accordingly, you have no way of even posing a question about, for instance, the effect of shifting the reporting date within the year.


• Discrete methods are comfortable only when the numbers are small, whereas moving down to the smallest measurable unit turns the measurements into large whole numbers. Once you start measuring an average human lifespan as 30000 days (more or less), real numbers become easier to work with, as integrals are easier than sums.

• It is relatively straightforward to embed discrete measures within a continuous-time model, by considering the integer part of the continuous random lifetime, called the curtate lifetime in actuarial terminology.

(Compare this to the suggestion once made by the physicist Enrico Fermi, that lecturers might take their listeners' investment of time more seriously if they thought of the 50-minute span of a lecture as a "microcentury".) The discrete model, it is pointed out by A. S. Macdonald in [18] (and rewritten in [5, Unit 9]), "is not so easily generalised to settings with more than one decrement. Even the simplest case of two decrements gives rise to difficult problems," and involves the unnecessary complication of estimating an Initial Exposed To Risk. We will generally treat the continuous model as the fundamental object, and treat the discrete data as coarse representations of an underlying continuous lifetime. However, looking beyond the actuarial setting, there are models which really do not have an underlying continuous time parameter. For instance, in studies of human fertility, time is measured in menstrual cycles, and there simply are no intermediate chances to have the event occur.

2.7 Crude estimation of life tables – discrete method

Since their invention in the 17th century, the basic methodology for life tables has been to collect (from the church registry or whoever kept records of births and deaths) lifetimes, truncate to integer lifetimes, count the numbers d_x of deaths between ages x and x + 1, relate this to the numbers ℓ_x alive at age x — called the Initial Exposed to Risk — and use q^{(0)}_x = d_x/ℓ_x, or similar quantities, as an estimate for the one-year death probability q_x.

In our model, the deaths are Bernoulli events with probability q_x, so we know that the maximum likelihood estimator for q_x is q^{(0)}_x = # successes/# trials = d_x/ℓ_x for n = ℓ_0 independently observed curtate lifetimes k_1, ..., k_n, observed from random variables with common probability mass function (m(x))_{x ∈ ℕ} parameterised by (q_x)_{x ∈ ℕ}. If we denote m(x) = (1 − q_0) ⋯ (1 − q_{x−1}) q_x, the likelihood is

\[ \prod_{i=1}^n m(k_i) = \prod_{x \in \mathbb{N}} \big(m(x)\big)^{d_x} = \prod_{x \in \mathbb{N}} (1 - q_x)^{\ell_x - d_x}\, q_x^{d_x}, \]  (2.6)

where only max{k_1, ..., k_n} + 1 factors in the infinite product differ from 1, and

\[ d_x = d_x(k_1, \dots, k_n) = \#\{1 \le i \le n : k_i = x\}, \qquad \ell_x = \ell_x(k_1, \dots, k_n) = \#\{1 \le i \le n : k_i \ge x\}. \]

This product is maximized when its factors are maximal (the xth factor only depending on the parameter q_x). An elementary differentiation shows that q ↦ (1 − q)^{ℓ−d} q^d is maximal for q = d/ℓ, so that

\[ q^{(0)}_x = q^{(0)}_x(k_1, \dots, k_n) = \frac{d_x(k_1, \dots, k_n)}{\ell_x(k_1, \dots, k_n)}, \qquad 0 \le x \le \max\{k_1, \dots, k_n\}. \]

Note that for x = max{k_1, ..., k_n}, we have q^{(0)}_x = 1, so no survival beyond the highest age observed is possible under the maximum likelihood parameters, so that (q^{(0)}_x)_{0 ≤ x ≤ max{k_1,...,k_n}} specifies a unique distribution. (Varying the unspecified parameters q_x, x > max{k_1, ..., k_n}, has no effect.)
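As an illustrative sketch (not from the notes; the data and function name are invented), the estimator q^{(0)}_x = d_x/ℓ_x can be computed directly from a sample of curtate lifetimes:

from collections import Counter

def crude_qx(curtate_lifetimes):
    """MLE q_x^(0) = d_x / l_x from observed curtate lifetimes k_1,...,k_n.

    d_x = #{i : k_i = x},  l_x = #{i : k_i >= x}.
    """
    deaths = Counter(curtate_lifetimes)
    max_k = max(curtate_lifetimes)
    q = {}
    alive = len(curtate_lifetimes)           # l_0 = n
    for x in range(max_k + 1):
        dx = deaths.get(x, 0)
        q[x] = dx / alive
        alive -= dx                          # l_{x+1} = l_x - d_x
    return q

# toy data: five curtate lifetimes; note q at the highest observed age is 1
print(crude_qx([2, 2, 3, 5, 5]))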

Lecture 3

Estimating life tables

3.1 Crude life table estimation – direct method for continuous data

The estimator described in section 2.7 treats lifetimes as discrete numbers. We will discuss in this section how we translate between discrete data and a continuous model. If we have continuous lifetime data, though, then we can estimate a continuous model directly. Assume that you observe n = ℓ_0 independent lives t_1, ..., t_n. Then the likelihood function is

\[ \prod_{i=1}^n f_T(t_i) = \prod_{i=1}^n \mu_{t_i} \exp\Big\{-\int_0^{t_i} \mu_s\,ds\Big\}. \]  (3.1)

Now assume that the force of mortality µ_s is constant on [x, x + 1), x ∈ ℕ, and denote these values by

\[ \mu_{x+\frac12} = -\log(p_x) \qquad \Big(\text{remember } p_x = \exp\Big\{-\int_x^{x+1} \mu_s\,ds\Big\}\Big). \]  (3.2)

Then the likelihood takes the form
\[ \prod_{x \in \mathbb{N}} \mu_{x+\frac12}^{d_x} \exp\big\{-\mu_{x+\frac12}\,\tilde\ell_x\big\}, \]  (3.3)
where only max{t_1, ..., t_n} + 1 factors in the infinite product differ from 1, and
\[ d_x = d_x(t_1, \dots, t_n) = \#\{1 \le i \le n : [t_i] = x\}, \qquad \tilde\ell_x = \tilde\ell_x(t_1, \dots, t_n) = \sum_{i=1}^n \int_x^{x+1} 1_{\{t_i > s\}}\,ds. \]
Here \tilde\ell_x is called the total exposed to risk.

The quantities µ_{x+1/2}, x ∈ ℕ, are the parameters, and we can maximise the product by maximising each of the factors. An elementary differentiation shows that µ ↦ µ^d e^{−µℓ} has a unique maximum at µ = d/ℓ, so that

\[ \hat\mu_{x+\frac12} = \hat\mu_{x+\frac12}(t_1, \dots, t_n) = \frac{d_x(t_1, \dots, t_n)}{\tilde\ell_x(t_1, \dots, t_n)}, \qquad 0 \le x \le \max\{t_1, \dots, t_n\}. \]


Since maximum likelihood estimators are invariant under reparameterisation (the range of the likelihood function remains the same, and the unique parameter where the maximum is obtained can be traced through the reparameterisation), we obtain

\[ \hat q_x = \hat q_x(t_1, \dots, t_n) = 1 - \hat p_x = 1 - \exp\big\{-\hat\mu_{x+\frac12}\big\} = 1 - \exp\Big\{-\frac{d_x(t_1, \dots, t_n)}{\tilde\ell_x(t_1, \dots, t_n)}\Big\}. \]  (3.4)

For small d_x/\tilde\ell_x, this is close to d_x/\tilde\ell_x, and therefore also close to d_x/ℓ_x.
Note that under \hat q_x, x ∈ ℕ, there is a positive survival probability beyond the highest observed age, and the maximum likelihood method does not fully specify a lifetime distribution, leaving free choice beyond the highest observed age.
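A minimal sketch of this estimator, assuming exact (continuous) lifetimes and using invented data, is below. It accumulates d_x and the total exposed to risk \tilde\ell_x, then forms \hat\mu_{x+1/2} and \hat q_x; the function name is ours.

import math

def continuous_estimates(lifetimes):
    """MLE under piecewise-constant hazard: mu_{x+1/2} = d_x / L_x, where
    L_x is the total time lived in [x, x+1) (the total exposed to risk),
    and q_x = 1 - exp(-mu_{x+1/2})."""
    max_age = int(max(lifetimes))
    d = [0] * (max_age + 1)          # deaths with curtate age x
    L = [0.0] * (max_age + 1)        # time at risk in [x, x+1)
    for t in lifetimes:
        x = int(t)
        d[x] += 1
        for y in range(x):
            L[y] += 1.0              # a full year lived in [y, y+1)
        L[x] += t - x                # partial year in the year of death
    mu = [d[x] / L[x] if L[x] > 0 else float("nan") for x in range(max_age + 1)]
    q = [1 - math.exp(-m) for m in mu]
    return d, L, mu, q

d, L, mu, q = continuous_estimates([0.5, 1.25, 1.75, 2.5])   # toy data
print(d, L, mu, q)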

3.2 Are life tables continuous or discrete?

The standard approach to life tables mixes the continuous and discrete, in sometimes confusing ways. The data upon which life tables are based are measured in discrete units, but in most applications we assume that the risk is actually continuous. If we were to observe a fixed number of individuals for exactly one year, and count the number of deaths at the end of the year, and if the number of deaths during the year were a small fraction of the total number at risk, it would hardly matter whether we chose a discrete or continuous model. As we discuss in section 4.3.2, the distinction becomes significant to the extent that the number of individuals at risk changes substantially over a single time unit; then we need to distinguish among three traditional approaches:

• Discrete (or direct) method;

• Continuous method;

• Census method.

By direct method we mean that we treat the given lifetimes as being exact observations. We will discuss this approach in the context of lifetimes rounded to the nearest integer, but it is the same if the recording is to several decimal places; the crucial thing is simply that our model does not admit intermediate values. If lifetimes are reported equal, then they are treated as being exactly equal.

The connection between discrete and continuous laws is fairly straightforward, at least in one direction. Suppose T is a lifetime with hazard rate µ_x at age x, and q_x is the probability of dying on or after birthday x, and before the (x+1)st birthday. Then
\[ {}_t q_x = 1 - e^{-\int_x^{x+t} \mu_s\,ds}. \]

Another way of putting this is to say that the discrete model may be embedded in the continuous model, by considering the associated curtate lifespan K = ⌊T⌋. The remainder (fractional part) S = T − K = {T} can often be treated separately in a simplified way (see below). Clearly, the probability mass function of K on ℕ is given by

\[ P(K = n) = P(n \le T < n+1) = \int_n^{n+1} f_T(t)\,dt = \bar F_T(n) - \bar F_T(n+1) = \exp\Big\{-\int_0^n \mu_T(t)\,dt\Big\}\Big(1 - \exp\Big\{-\int_n^{n+1} \mu_T(t)\,dt\Big\}\Big), \]


and if we denote the one-year death probabilities (discrete hazard function) by

\[ q_k = P(K = k \mid K \ge k) = \frac{P(K = k)}{P(K \ge k)} = 1 - \exp\Big\{-\int_k^{k+1} \mu_T(t)\,dt\Big\} \]

and p_k = 1 − q_k, k ∈ ℕ, we obtain P(K = n) as the probability that the first success in a sequence of independent Bernoulli trials, with varying success probabilities q_k, occurs at trial n:

\[ P(K = n) = p_0 \cdots p_{n-1}\, q_n. \]

Note that q_k only depends on the hazard rate between ages k and k + 1. As a consequence, for K_x = ⌊T_x⌋ the probabilities
\[ P(K_x = n) = p_x \cdots p_{x+n-1}\, q_{x+n} \]
are also easily represented in terms of (q_k)_{k ∈ ℕ}.

3.3 Interpolation for non-integer ages

Suppose now that we have modeled the curtate lifetime K. The fractional part S of the lifetime is a random variable on the interval [0, 1], commonly modeled in one of the following ways:

Constant force of mortality The force of mortality is constant on the interval [k, k+1), and its value is called µ_T(k+½), or sometimes µ_{k+1/2} when T is clear from the context. Then
\[ {}_1 q_k = 1 - e^{-\mu_{k+\frac12}}; \qquad \mu_{k+\frac12} = -\log p_k. \]
S has the distribution of an exponential random variable conditioned on S < 1, so it has density
\[ f_S(s) = \frac{\mu_{k+\frac12}\, e^{-\mu_{k+\frac12} s}}{1 - e^{-\mu_{k+\frac12}}}. \]

This assumption thus implies decreasing density of the lifetime through the interval. We also have, for 0 ≤ s ≤ 1 and k an integer,

\[ {}_s p_k = P(T > k+s \mid T > k) = \exp\Big\{-\int_k^{k+s} \mu_t\,dt\Big\} = \exp\big\{-s\,\mu_{k+\frac12}\big\} = (1 - q_k)^s. \]

Note that K and S are not independent, under this assumption. We also may write

\[ \bar F_T(k+s) = \exp\Big\{-\sum_{i=0}^{k-1} \mu_{i+\frac12} - s\,\mu_{k+\frac12}\Big\}, \qquad f_T(k+s) = \bar F_T(k+s)\,\mu_{k+\frac12} = \bar F_T(k)\,\mu_{k+\frac12}\,e^{-s\mu_{k+\frac12}}. \]

Uniform If S is uniform on [0, 1), this implies that for s ∈ [0, 1),
\[ {}_s q_k = s \cdot {}_1 q_k, \qquad \bar F_T(k+s) = \bar F_T(k)\,\big(1 - {}_s q_k\big), \qquad f_T(k+s) = \bar F_T(k)\, q_k, \]
\[ \mu_T(k+s) = \frac{f_T(k+s)}{\bar F_T(k+s)} = \frac{q_k}{1 - s\, q_k}. \]


So this assumption implies that the force of mortality is increasing over the time unit. Note that µ is discontinuous at (some if not all) integer times unless q_0 = α = 1/n and q_{x+1} = q_x/(1 − q_x), i.e. q_k = α/(1 − kα), k = 1, ..., n − 1, with ω = n the maximal age. Usually, one accepts discontinuities.

Balducci Here {}_{1-s}q_{k+s} = (1 − s) q_k for s ∈ [0, 1), so that the probability of death in the remaining time 1 − s, having survived to k + s, is the product of the time left and the probability of death in [k, k+1). There is a trivial identity: the probability of surviving one time unit from time k is the probability of surviving s time units from time k, times the probability of surviving 1 − s time units from time k + s. Thus
\[ q_k = 1 - (1 - {}_s q_k)\,(1 - {}_{1-s}q_{k+s}) = 1 - (1 - {}_s q_k)\big(1 - (1-s)q_k\big), \]
so that
\[ {}_s q_k = 1 - \frac{1 - q_k}{1 - (1-s)q_k}. \]

This implies that

\[ \bar F_T(k+s) = \bar F_T(k)\,P\big(T > k+s \mid T > k\big) = \bar F_T(k)\,\frac{1 - q_k}{1 - q_k + s\,q_k}, \]
\[ f_T(k+s) = -\frac{d}{ds}\,\bar F_T(k+s) = \frac{\bar F_T(k)\,q_k(1 - q_k)}{(1 - q_k + s\,q_k)^2}, \qquad \mu_T(k+s) = \frac{f_T(k+s)}{\bar F_T(k+s)} = \frac{q_k}{1 - q_k + s\,q_k}. \]

So this assumption implies that the force of mortality is decreasing over the time unit.

Once we have made one of these assumptions, we can reconstruct the full distribution of a lifetime T from the entries (q_x)_{x ∈ ℕ} of a life table. When the force of mortality is small, these different assumptions are all equivalent to µ_{k+1/2} ≈ q_k. Notice again that the choice of a measurement unit for discretisation implies a certain level of smoothing in continuous nonparametric life-table computations. Taking the evidence at face value, we would have to say that we have observed zero mortality rate, except at the instants at which deaths were observed, where the mortality jumps to ∞. Of course, we average over a period of time, by imposing the constraint that mortality rates be step functions, constant over a single measurement unit (or multiple units, if we wish to impose additional smoothing, usually because the number of observations is small).

Moving in the other direction is not so straightforward. The continuous model cannot be embedded in the discrete model, for obvious reasons: within the framework of the discrete model, there is no such thing as a death midway through a time period. Traditionally, when the discrete nature of life-table data has been in the foreground, a model of the fractional part, such as one of those listed above, has been adjoined to the model. As described in section ??, this approach quickly collapses under the weight of unnecessary complications, which is why we will always treat the continuous lifetime as the fundamental object, except when the lifetime truly is measured only in discrete units.

3.4 An example: Fractional lifetimes can matter

Imagine an insurance company that insures valuable pieces of construction machinery, which we will call piddledonks. For safety reasons, piddledonks cannot be used more than 3 years, but they may fail before that time. The company has records on 1000 of these machines, summarised


in Table 3.1. That is, 100 failed in their first year (age 0), 400 in the second year, and 400 in the third year of operation. The last column shows the estimated failure probabilities.

Table 3.1: Life table for piddledonks.

age x   l_x    d_x   q_x
0       1000   100   0.10
1       900    400   0.44
2       500    400   0.80

Suppose the company sells one-year insurance policies that pay £1000 when a piddledonk fails. The fair price for such a contract will be £100 for a new-built piddledonk. (That is, the price equal to the expected value of the contract; obviously, a company that wants to cover its costs and even turn a profit needs to sell its insurance somewhat above the nominal fair price.) It will be £444 for a piddledonk on its first birthday, and £800 for a piddledonk on its second birthday. Suppose, though, someone comes with a piddledonk that is 18 months old, and wishes to buy insurance for the next half year. What would be the fair price?

We have no data on when in the year failure occurs. It is possible, in principle, that piddledonks fail only on their birthdays; if they survive that day, they're good for the rest of the year. In that case, the insurance could be free, since the probability of a failure in the second half year is 0. This seems implausible, though. Suppose we adopt the constant-hazard model. Calling the constant hazard µ, we see that p_1 = e^{−µ}, and
\[ p_1 = {}_{0.5}p_1 \cdot {}_{0.5}p_{1.5}. \]  (3.5)
Thus,
\[ {}_{0.5}p_{1.5} = {}_{0.5}p_1 = e^{-\mu/2} = \sqrt{p_1} = \sqrt{1 - q_1} = \sqrt{0.555} = 0.745, \]
and {}_{0.5}q_{1.5} = 0.255, and the fair price for the half year of insurance is £255. Suppose, on the other hand, we adopt the uniform model for S. We still have (3.5), but now
\[ {}_{0.5}p_1 = 1 - {}_{0.5}q_1 = 1 - \tfrac12\, {}_1 q_1, \]
so that
\[ {}_{0.5}p_{1.5} = \frac{p_1}{1 - \tfrac12\, {}_1 q_1} = \frac{0.555}{0.778} = 0.713, \]

implying that the fair price for this insurance would be £287.
If we make the Balducci assumption, the calculation becomes particularly simple: the probability of failure in the second half of year 1, given survival through the first half of year 1, is simply ½ q_1, so the premium should be exactly half, or £222. This is convenient, but it comes at the cost of an otherwise usually implausible assumption that failure rates are decreasing through the year.
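The three calculations can be reproduced in a few lines. The sketch below is ours, not part of the notes; it works with the exact fraction q_1 = 400/900, so the uniform-assumption premium comes out as £286 rather than the £287 obtained above from three-decimal rounding.

import math

q1 = 400 / 900                      # one-year failure probability at age 1
p1 = 1 - q1

# Constant force on [1, 2): 0.5_p_{1.5} = exp(-mu/2) = sqrt(p1)
q_const = 1 - math.sqrt(p1)

# Uniform deaths: p1 = (1 - 0.5*q1) * 0.5_p_{1.5}
q_unif = 1 - p1 / (1 - 0.5 * q1)

# Balducci: 0.5_q_{1.5} = 0.5 * q1
q_bald = 0.5 * q1

for name, q in [("constant hazard", q_const), ("uniform", q_unif), ("Balducci", q_bald)]:
    print(f"{name:16s}: half-year failure prob = {q:.3f}, fair premium = £{1000*q:.0f}")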

Lecture 4

Central Exposed to Risk and the Census Approximation

4.1 The “continuous” method for life-table estimation

Suppose we observe n identically distributed, independent lives aged x for exactly 1 year, and record the number d_x who die. Using the notation set up for the discrete model, a life dies with probability q_x within the year.

Hence D_x, the random variable representing the number dying in the year conditional on n alive at the beginning of the year, has distribution

\[ D_x \sim B(n, q_x), \]

giving a maximum likelihood estimator

\[ \hat q_x = \frac{D_x}{n}, \qquad \text{with } \operatorname{var}(\hat q_x) = \frac{q_x(1 - q_x)}{n}, \]

where, using previous notation, we have set l_x = n.
While attractively simple, this approach has significant problems. Ordinarily failures, deaths, and other events of interest happen continuously, even if we happen to observe or tabulate them at discrete intervals. While we get a perfectly valid estimate of q_x, the probability of an event happening in this time interval, we have no way of generalising to a question about how many individuals died in half a year, for example. And real data may be interval censored: that is, the life is not under observation during the entire year, but only during the interval of ages (x + a, x + b), where 0 ≤ a < b ≤ 1. If we write D^i_x for the indicator of the event that individual i is observed to die at (curtate) age x, we have
\[ P(D^i_x = 1) = {}_{b_i - a_i} q_{x + a_i}. \]

Hence

\[ E D_x = E\Big(\sum_{i=1}^n D^i_x\Big) = \sum_{i=1}^n {}_{b_i - a_i} q_{x + a_i}. \]

There is no way to analyse (or even describe) this intra-interval refinement within the framework of the binomial model.

Nonetheless, the simplicity and tradition of the binomial model have led actuaries to develop a kind of continuous prosthetic for the binomial model, in the form of a supplemental (and hidden) model for the unobserved continuous part of the lifetime. These have been discussed in section 3.3. In the end, these are applied through the terms Initial Exposed To Risk (E^0_x) and Central Exposed To Risk (E^c_x). These are defined more by their function than as a particular quantity: the Initial Exposed To Risk plays the role of n in a binomial model, and E^c_x plays the role of total time at risk in an exponential model. They are linked by the actuarial estimator

\[ E^0_x \approx E^c_x + \tfrac12 d_x. \]

This may be justified from any of our fractional-lifetime models if the number of deaths is small relative to the number at risk. Thus, the actuarial estimator for q_x is

\[ \tilde q_x = \frac{d_x}{E^c_x + \tfrac12 d_x}. \]

The denominator, E^c_x + ½ d_x, comprises the observed time at risk (also called the central exposed to risk) within the interval (x, x+1), added to half the number of deaths (assuming deaths are evenly spread over the interval). This is an estimator for E_x, the initial exposed to risk, which is what is required for the binomial model.

NB: the three fractional-age assumptions of section 3.3 collapse to essentially the same model (constant force) if µ_{x+1/2} is very small, since all of them then give {}_t q_x ≈ t µ_{x+1/2}, 0 < t < 1.

Definitions, within the year (x, x + 1):

a) E^c_x = observed total time (in years) at risk = central exposed to risk, with the approximation E^c_x ≈ E_x − ½ d_x, if required.

b) E^0_x (= E_x) = initial exposed to risk = number in the risk set at age x, with the approximation E_x ≈ E^c_x + ½ d_x, if required.
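As a small sketch (hypothetical numbers and our own function name), the actuarial estimator combines these two definitions:

def actuarial_qx(d_x, central_exposed):
    """Actuarial estimate of q_x: E^0_x ≈ E^c_x + d_x/2, then q_x ≈ d_x / E^0_x."""
    initial_exposed = central_exposed + 0.5 * d_x
    return d_x / initial_exposed

print(actuarial_qx(d_x=10, central_exposed=35.0))   # 10 / 40 = 0.25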

4.2 Comparing continuous and discrete methods

There appears to be a contradiction between the discrete life-table estimation of section 2.7 and the continuous life-table estimation of section 3.1. While the models are different, there are questions to which both offer an answer, and the answers are different. In the discrete model, we estimate

\[ P\big(T < x+1 \mid T \ge x\big) = q_x \approx \hat q_x = \frac{d_x}{E_x}. \]

The continuous model suggests that we estimate the same quantity by

\[ P\big(T < x+1 \mid T \ge x\big) = 1 - e^{-\mu_{x+\frac12}} \approx 1 - e^{-\hat\mu_{x+\frac12}} = 1 - e^{-d_x/E^c_x} \le \frac{d_x}{E^c_x}. \]  (4.1)

The direct discrete method treats the curtate lifetimes as the true lifetimes; then E_x is the same as E^c_x, so the continuous model gives a strictly smaller answer, unless d_x = 0. Why is that? The difference here is that the continuous model presumes that individuals are dying all through the year, making E^c_x somewhat smaller than E_x. In fact, the actuarial estimator is E^c_x ≈ E_x − d_x/2 (essentially presuming that those who died lived on average half a year), and substituting the Taylor series expansion into (4.1) shows that in the continuous model

\[ P\big(T < x+1 \mid T \ge x\big) = \frac{d_x}{E_x - d_x/2} - \frac{d_x^2}{2(E_x - d_x/2)^2} + O\bigg(\Big(\frac{d_x}{E_x - d_x/2}\Big)^3\bigg) = \frac{d_x}{E_x} + O\bigg(\Big(\frac{d_x}{E_x - d_x/2}\Big)^3\bigg). \]
That is, when the mortality fraction d_x/E_x is small, the estimates agree up to second order in d_x/E_x.
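A quick numerical check of this agreement, under the stated approximation E^c_x ≈ E_x − d_x/2, is sketched below with invented values of E_x and d_x:

import math

# Compare the binomial estimate d_x/E_x with the constant-hazard estimate
# 1 - exp(-d_x / E^c_x), using E^c_x ≈ E_x - d_x/2.
for Ex, dx in [(1000, 5), (1000, 50), (1000, 200)]:
    q_binom = dx / Ex
    q_cont = 1 - math.exp(-dx / (Ex - dx / 2))
    print(f"d/E = {q_binom:.3f}   continuous = {q_cont:.5f}   difference = {q_cont - q_binom:+.5f}")

The difference is negligible for small d_x/E_x and only becomes visible when the mortality fraction is large.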


4.3 Central exposed to risk and the census approximation

4.3.1 Insurance data

Insurance data have several special features. In the best of cases, we have full information from each person insured as follows; for a simple life assurance paying death benefits on death only, for individual m:

• date of birth bm

• date of entry into observation: policy start date xm

• reason for exit from observation (death Dm = 1, or expiry/withdrawal Dm = 0)

• date of exit from observation ym

This then easily translates into a likelihood

\[ 1_{\{D_m=1\}}\, f_{T_{x_m-b_m}}(y_m - x_m) + 1_{\{D_m=0\}}\, \bar F_{T_{x_m-b_m}}(y_m - x_m) = \mu_{y_m-b_m}^{D_m}\, \exp\Big\{-\int_{x_m-b_m}^{y_m-b_m} \mu_t\,dt\Big\}, \]  (4.2)

and it is clear how much time this individual was exposed to risk at age x, i.e. aged [x, x+1), for all x ∈ ℕ. We can calculate the central exposed to risk E^c_x as the aggregate quantity across all individuals exactly. We can also read off the number of deaths aged [x, x+1), d_x, and hence

\[ \hat\mu_{x+\frac12} = \frac{d_x}{E^c_x}. \]  (4.3)

This is the maximum likelihood estimator under the assumption of a constant force of mortality on [x, x+1). Note that this estimator conforms with the Principle of Correspondence, which states that

A life alive at time t should be included in the exposure at age x at time t if and only if, were that life to die immediately, he or she would be counted in the death data d_x at age x.

In practice, data are often not provided in this form and approximations are required. E.g., policy start and end dates may not be available; instead, only total numbers of policies per age group at annual census dates are provided, and there is ambiguity as to when individuals change age group between the census dates. The solution to the problem is called the census approximation.

The key point is that we can tolerate a substantial amount of uncertainty in the numerator and the denominator (number of events and total time at risk), but failing to satisfy the Principle of Correspondence can lead to serious error. For example, [19] analyses the “Hispanic Paradox,” the observation that Latin American immigrants in the USA seem to have substantially lower mortality rates than the native population, despite being generally poorer (which is usually associated with shorter lifespans). This difference is particularly pronounced at more advanced ages. Part of the explanation seems to be return migration: some older Hispanics return to their home countries when they become chronically ill or disabled. Thus, there are some members of this group who count as part of the US Hispanic population for most of their lives, but whose deaths are counted in their home-country statistics.


4.3.2 Census approximation

The task is to approximate E^c_x (and often also d_x) given census data. There are various forms of census data. The most common one is
\[ P_{x,k} = \text{number of policy holders aged } [x, x+1) \text{ at time } k, \qquad k = 0, \dots, n. \]

The problem is that we do not know policy start and end dates. The basic assumption of the census approximation is that the number of policies changes linearly between any two consecutive census dates. It is easy to see that

\[ E^c_x = \int_0^n P_{x,t}\,dt. \]  (4.4)

We only know the integrand at integer times, and the linearity approximation gives

\[ E^c_x \approx \sum_{k=1}^n \tfrac12\,\big(P_{x,k-1} + P_{x,k}\big). \]  (4.5)

This allows us to estimate µ_{x+1/2} if we also know d_x, the number of deaths aged x.
Now assume that, in fact, you are not given d_x but only calendar years of birth and death, leading to
\[ d'_x = \text{number of deaths aged } x \text{ on the birthday in the calendar year of death}. \]

Then some of the deaths counted in d'_x will be deaths aged x − 1, not x; in fact we should view d'_x as containing deaths aged in the interval (x − 1, x + 1), but not all of them. If we assume that birthdays are uniformly spread over the year, we can also specify that the proportion of deaths counted under d'_x changes linearly from 0 to 1 and back to 0 as the age at death increases from x − 1 to x and on to x + 1.

In order to estimate a force of mortality, we need to identify the corresponding (approximation to the) central exposed to risk. The Principle of Correspondence requires

\[ E^{c\prime}_x = \int_0^n P'_{x,t}\,dt, \]  (4.6)
where
\[ P'_{x,t} = \text{number of policy holders at time } t \text{ with } x\text{th birthday in calendar year } [t]. \]

Again, suppose we know the integrand at integer times. Here the linear approximation requires some care, since the policy holders do not change age group continuously, but only at census dates. Therefore, all continuing policy holders counted in P'_{x,k-1} will be counted in P'_{x,t} for all k − 1 ≤ t < k, but then in P'_{x+1,k} at the next census date. Therefore

\[ E^{c\prime}_x \approx \sum_{k=1}^n \tfrac12\,\big(P'_{x,k-1} + P'_{x+1,k}\big). \]  (4.7)

The ratio d'_x/E^{c\prime}_x gives a slightly smoothed (because of the wider age interval) estimate of µ_x (and not µ_{x+1/2}). Note, however, that it is not clear whether this estimate is a maximum likelihood estimate for µ_x under any suitable model assumptions, such as constancy of the force of mortality between half-integer ages.


4.4 Lexis diagrams

A graphical tool that helps in making sense of estimates like the census approximation is the Lexis diagram.¹ These reduce the three dimensions of demographic data — date, age, and moment of birth — to two, by an ingenious application of the diagonal.

Consider the diagram in Figure 4.1. The horizontal axis represents calendar time (which we will take to be in years), while the vertical axis represents age. Lines representing the lifetimes of individuals start at their birthdate on the horizontal axis, then ascend at a 45° angle, reflecting the fact that individuals age at the rate of one year (of age) per year (of calendar time). Events during an individual's life may be represented along the lifeline — for instance, the line might change colour when the individual buys an insurance policy — and the line ends at death. (Here we have marked the end with a black dot.) The collection of lifelines in a diagonal strip — individuals born at the same time (more or less broadly defined) — comprise what demographers call a “cohort”. They start out together and march out along the diagonal through life, exposed to similar (or at least simultaneous) experiences. (A “cohort” was originally a unit of a Roman legion.) Note that cohorts need not be birth cohorts, as the horizontal axis of the Lexis diagram need not represent literal birthdates. For instance, a study of marriage would start “lifelines” at the date of marriage, and would refer to the “marriage cohort of 2008”, while a study of student employment prospects would refer to the “student cohort of 2008”, the collection of all students who completed (or started) their studies in that year.

The census approximation involves making estimates for mortality rates in regions of the Lexis diagram. Vertical lines represent the state of the population, so a census may be represented by counting (and describing) the lifelines that cross a given vertical line. The goal is to estimate the hazard rate for a region (in age-time space) by
\[ \frac{\#\text{ events}}{\text{total time at risk}}. \]
The total time at risk is the total length of lifelines intersecting the region (or, to be geometric about it, the total length divided by √2), while the number of events is a count of the number of dots. The problem is that we do not know the exact total time at risk. Our censuses do tell us, though, the number of individuals at risk.

The count d_x described in section 4.3.2 tells us the number of deaths of individuals aged between x and x + 1 (for integer x), so it is counting events in horizontal strips, such as we have shown in Figure 4.3. We are trying to estimate the central exposed to risk E^c_x := ∫_0^T P_{x,t} dt, where P_{x,t} is the number of individuals alive at time t whose curtate age is x. We can represent this as

\[ E^c_x = \int_0^T P_{x,t}\,dt = \sum_{k=0}^{T-1} \bar P_{x,k}, \]  (4.8)
where \bar P_{x,k} is defined to be the average of P_{x,t} over t in the interval [k, k+1). If we assume that P_{x,t} is approximately linear over such an interval, we may approximate this average by ½(P_{x,k} + P_{x,k+1}). Then we get the approximation

\[ E^c_x = \sum_{k=0}^{T-1} \bar P_{x,k} \approx \tfrac12 P_{x,0} + \sum_{k=1}^{T-1} P_{x,k} + \tfrac12 P_{x,T}. \]

¹These diagrams are named for Wilhelm Lexis, a 19th-century statistician and demographer of many accomplishments, none of which was the invention of these diagrams, in keeping with Stigler's law of eponymy, which states that “No scientific discovery is named after its original discoverer.” (cf. Christophe Vanderschrick, “The Lexis diagram, a misnomer”, Demographic Research 4:3, pp. 97–124, http://www.demographic-research.org/Volumes/Vol4/3/.)


Figure 4.1: A Lexis diagram. (Horizontal axis: Time, 1–4; vertical axis: Age, 1–4.)

Note that this is just the trapezoid rule for approximating the integral (4.8).
Is this assumption of linearity reasonable? What does it imply? Consider first the individuals whose lifelines cross a box with lower corner (k, x). (Note that, unfortunately, the order of the age and time coordinates is reversed in the notation when we go to the geometric picture. This has no significance except sloppiness which needs to be cleaned up.) They may enter either the left or the lower border. In the former case (corresponding to individuals born in year k − x) they will be counted in P_{x,k}; in the latter case (born in year k − x + 1) in P_{x,k+1}. If the births in year k − x + 1 differ from those in year k − x by a constant (that is, the difference between January 1 births in the two years is the same as the difference between February 12 births, and so on), then on average the births in the two years on a given date will contribute 1/2 year to the central years at risk, and will be counted once in the sum P_{x,k} + P_{x,k+1}. Important to note:

• This does not actually require that births be evenly distributed through the year.

• When we say births, we mean births that survive to age x. If those born in, say, December of one year had substantially lowered survival probability relative to a “normal” December, this would throw the calculation off.

• These assumptions are not about births and deaths in general, but rather about births and deaths of the population of interest: those who buy insurance, those who join the clinical trial, etc.


Figure 4.2: Census at time 2 represented by open circles. The population consists of 7 individuals: 4 are between ages 1 and 2, and 3 are between ages 0 and 1. (Axes: Time 1–4, Age 1–4.)

If mortality levels are low, this will suffice, since nearly all lifelines will be counted among those that cross the box. If mortality rates are high, though, we need to consider the contribution of years at risk due to those lifelines which end in the box. In this case, we do need to assume that births and deaths are evenly spread through the year. This assumption implies that, conditioned on a death occurring in a box, it is uniformly distributed through the box. On the one hand, that implies that it contributes (on average) 1/4 year to the years at risk in the box. On the other hand, it implies that the probability of it having been counted in our average ½(P_{x,k} + P_{x,k+1}) is ½, since it is counted only if it is in the upper left triangle of the box. On average, then, these should balance.
What happens when we count births and deaths only by calendar year? Note that P'_{x,k} = P_{x,k} for integers k and x. One difference is that the regions in question, which are parallelograms, follow the same lifelines from the beginning of the year to the end. This makes the analysis more straightforward. Lifelines that pass through the region are counted on both ends. The other difference is that the region that begins with the census value P_{x,k} ends not with P_{x,k+1}, but with P_{x+1,k+1}. Thus all the lifelines passing through the region will be counted in P_{x,k} and in P_{x+1,k+1}, hence also in their average. This requires no further assumptions. For the lifelines that end in the region to be counted appropriately, on the other hand, requires that the deaths be evenly distributed throughout the year. (Other, slightly less restrictive assumptions are also possible.) In this case, each death will contribute exactly 1/2 to the estimate ½(P_{x,k} + P_{x+1,k+1})


Figure 4.3: Census approximation when events are counted by actual curtate age. The vertical segments represent census counts. (Axes: Time 1–4, Age 1–4; strips labelled P(0,t) and P(1,t).)

(since it is counted only in P_{x,k}), and it contributes on average 1/2 year of time at risk.


Figure 4.4: Census approximation when events are counted by calendar year of birth and death. Vertical segments bounding the coloured regions represent census counts. (Axes: Time 1–4, Age 1–4; regions labelled P'(1,t), P'(2,t), P'(3,t).)

Lecture 5

Life expectancy and comparing life tables

5.1 Life Expectancy

5.1.1 What is life expectancy?

One of the most interesting (and most discussed) features of life tables is the life expectancy. It has an intuitive meaning — the average length of life — and is commonly used as a summary of the life table, to compare mortality between countries, regions, and subpopulations. For instance, Table 5.1 shows the estimated life expectancy in some rich and poor countries, ranging from 37.2 years for a man in Angola to 85.6 years for a woman in Japan. The UK is in between (though, of course, much closer to Japan), with 76.5 years for men and 81.6 years for women.

Table 5.1: 2009 life expectancy at birth (LE) in years and infant mortality rate per thousand live births (IMR) in selected countries, by sex. Data from US Census Bureau, International Database, available at http://www.census.gov/ipc/www/idb/idbprint.html

Country          IMR    IMR male   IMR female   LE     LE male   LE female
Angola           180    192        168          38.2   37.2      39.2
France           3.33   3.66       2.99         81.0   77.8      84.3
India            30.1   34.6       25.2         69.9   67.5      72.6
Japan            2.79   2.99       2.58         82.1   78.8      85.6
Russia           10.6   12.1       8.9          66.0   59.3      73.1
South Africa     44.4   48.7       40.1         49.0   49.8      48.1
United Kingdom   4.85   5.40       4.28         79.0   76.5      81.6
United States    6.26   6.94       5.55         78.1   75.7      80.7

Life expectancies can vary significantly, even within the same country. For example, the UK Office of National Statistics has published estimates of life expectancy for 432 local areas in the UK. We see there that, for the period 2005–7, men in Kensington and Chelsea had a life expectancy of 83.7 years, and women 87.8 years; whereas in Glasgow (the worst-performing area) the corresponding figures were 70.8 and 77.1 years. Overall, English men live 2.7 years longer on average than Scottish men, and English women 2.0 years longer.


When we think of lifetimes as random variables, the life expectancy is simply the mathematical expectation E[T]. By definition,

\[ E[T] = \int_0^\infty t\, f_T(t)\,dt. \]

Integration by parts, using the fact that f_T = −\bar F'_T, turns this into a much more useful form,

\[ E[T] = -t\,\bar F_T(t)\Big|_0^\infty + \int_0^\infty \bar F_T(t)\,dt = \int_0^\infty \bar F_T(t)\,dt = \int_0^\infty e^{-\int_0^t \mu_s\,ds}\,dt. \]  (5.1)

That is, the life expectancy may be computed simply by integrating the survival function. The discrete form of this is

\[ E[K] = \sum_{k=0}^\infty k\,P(K = k) = \sum_{k=0}^\infty P(K > k). \]  (5.2)

Applying this to life tables, we see that the expected curtate lifetime is

\[ E[K] = \sum_{k=0}^\infty P(K > k) = \sum_{k=1}^\infty \frac{l_k}{l_0} = \sum_{k=1}^\infty p_0 \cdots p_{k-1}. \]

Note that expected future lifetimes can be expressed as

\[ \mathring e_x := E[T_x] = \int_x^\infty \exp\Big\{-\int_x^t \mu_s\,ds\Big\}\,dt \qquad\text{and}\qquad e_x := E[K_x] = \sum_{k \in \mathbb{N}} p_x \cdots p_{x+k} = \sum_{k=x+1}^{\omega} \frac{l_k}{l_x}. \]

(Recall that T_x is the remaining lifetime at age x. It is defined on the event {T ≥ x}, and has the value T − x. Its distribution is understood to be the conditional distribution of T − x on {T ≥ x}.) We see that e_x ≤ \mathring e_x < e_x + 1. For sufficiently smooth lifetime distributions, \mathring e_x ≈ e_x + ½ will be a good approximation.
For variances, formulas in terms of y ↦ µ_y and (p_x)_{x ≥ 0} can be written down, but do not simplify as neatly. Also, the approximation Var(T_x) ≈ Var(K_x) + 1/12 requires rougher arguments: this follows, e.g., if we assume that S_x = T_x − K_x is independent of K_x and uniformly distributed on [0, 1].
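As a sketch (the survivorship column below is invented, and the function name is ours), the curtate expectations e_x and the approximation \mathring e_x ≈ e_x + ½ can be computed directly from an l_x column:

def curtate_expectations(l):
    """e_x = sum_{k >= x+1} l_k / l_x, computed from a survivorship column l."""
    n = len(l)
    return [sum(l[k] for k in range(x + 1, n)) / l[x] for x in range(n)]

l = [1000, 900, 750, 500, 200]          # toy l_x column
for x, ex in enumerate(curtate_expectations(l)):
    print(f"e_{x} = {ex:.2f},   complete expectation ≈ {ex + 0.5:.2f}")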

5.1.2 Example

Table 5.2 shows a life table based on the mortality data for tyrannosaurs from Table 1.1. Notice that the life expectancy at birth e_0 = 16.0 years is exactly what we obtain by averaging all the ages at death in Table 5.2.

age  0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
d_x  0    0    3    1    1    3    2    1    2    4    4    3    4    3    8
l_x  103  103  103  100  99   98   95   93   92   90   86   82   79   75   72
q_x  0.00 0.00 0.03 0.01 0.01 0.03 0.02 0.01 0.02 0.04 0.05 0.04 0.05 0.04 0.11
e_x  16.0 15.0 14.0 13.5 12.6 11.7 11.1 10.3 9.4  8.7  8.1  7.5  6.7  6.1  5.4

age  15   16   17   18   19   20   21   22   23   24   25   26   27   28
d_x  4    4    7    10   6    3    10   8    4    3    0    3    0    2
l_x  64   60   56   49   39   33   30   20   12   8    5    5    2    2
q_x  0.06 0.07 0.12 0.20 0.15 0.09 0.33 0.40 0.33 0.38 0.00 0.60 0.00 1.00
e_x  5.0  4.4  3.7  3.2  3.0  2.6  1.8  1.7  1.8  1.8  1.8  0.8  1.0  0.00

Table 5.2: Life table for tyrannosaurs, based on data from Table 1.1.


5.1.3 Life expectancy and mortality

The connection between life expectancy and mortality is somewhat subtle. It is well known that life expectancy at birth — e_0 — has been rising for well over a century. For males it is 73.4 years on the 1990–2 UK life table, but was only 44.1 years on the life table a century before. However, it would be a mistake to suppose this means that a typical man was dying at an age that we now consider active middle age. This becomes clearer when we look at the remaining life expectancy at age 44. In 1990 it was 31.6 years; in 1890 it was 22.1 years. Less, to be sure, but still a substantial number of years remaining. The low average length of life in 1890 was determined in large part by the number of zeroes being included in the average.

Imagine a population in which everyone dies at exactly age 75. The expectation of life remaining at age x would then be exactly 75 − x. While that is not, of course, our true situation, mortality in much of the developed world today is quite close to this extreme: there is almost no randomness, as witnessed by the fact that the remaining life expectancy column of the life table marches monotonously down by one year per year lived. The only exception is at the beginning — the newborn has only lost about 0.4 remaining years for the year it has lived. This is because the mortality in the first year is fairly high, so that overcoming that hurdle gives a significant boost to one's remaining life expectancy. We can compute

e0 = p0(1 + e1).

This follows either from (5.2), or directly from observing that if someone survives the first year (which happens with probability p_0) he will have lived one year, and have (on average) e_1 years remaining. Thus,

\[ q_0 = 1 - \frac{e_0}{1 + e_1} = \frac{1 + e_1 - e_0}{1 + e_1} = \frac{0.6}{74.4} = 0.008, \]

which is approximately right. On the 1890 life table we see that the life expectancy of a newborn was 44.1 years, but this rose to 52.3 years for a boy on his first birthday. This can only mean that a substantial portion of the children died in infancy. We compute the first-year mortality as q_0 = (1 + 52.3 − 44.1)/53.3 = 0.17, so about one in six.

How much would life expectancy have been increased simply by eliminating infant mortality — that is, mortality in the first year of life? In that case, all newborns would have reached their first birthday, at which point they would have had 52.3 years remaining on average — thus, 53.3 years in total. Today, with infant mortality almost eliminated, there is only a potential 0.6 years remaining to be achieved from further reductions.

5.2 An example of life-table computations

Suppose we are studying a population of creatures that live a maximum of 4 years. For simplicity, we will assume that births all occur on 1 January. (The complications of births going on throughout the year were addressed briefly in section 4.3.2.) The entire population is under observation, and all deaths are recorded. We make the following observations:

Year 1: 300 born, 100 die.

Year 2: 350 born, 150 die. 20 1-year-olds die.

Year 3: 400 born, 100 die. 40 1-year-olds die. 90 2-year-olds die.

Year 4: 300 born, 50 die. 75 1-year-olds die. 100 2-year-olds die. 90 3-year-olds die.


(a) Cohort 1 life table

x   d_x   ℓ_x   q_x     e_x
0   100   300   0.333   1.57
1   20    200   0.10    1.35
2   90    180   0.50    0.50
3   90    90    1.0     0

(b) Cohort 2 life table

x   d_x   ℓ_x   q_x     e_x
0   150   350   0.43    1.2
1   40    200   0.20    1.1
2   100   160   0.625   0.375
3   60    60    1.0     0

(c) Period life table for year 4

x   q_x     ℓ_x    d_x   e_x
0   0.167   1000   167   1.69
1   0.25    833    208   1.03
2   0.625   625    391   0.375
3   1.0     235    235   0

Table 5.3: Alternative life tables from the same data.

In Table 5.3 we compute different life tables from these data. The two cohort life tables (Tables 5.3(a) and 5.3(b)) are fairly straightforward: we start by writing down ℓ_0 (the number of births in that cohort) and then, in the d_x column, the number of deaths in each year from that cohort. Subtracting those successively from ℓ_0 yields the number of survivors in each age class ℓ_x, and q_x = d_x/ℓ_x. Finally, we compute the remaining life expectancies:

\[ e_0 = \frac{\ell_1}{\ell_0} + \frac{\ell_2}{\ell_0} + \frac{\ell_3}{\ell_0}, \qquad e_1 = \frac{\ell_2}{\ell_1} + \frac{\ell_3}{\ell_1}, \qquad e_2 = \frac{\ell_3}{\ell_2}. \]

The period life table is computed quite differently. We start with the q_x numbers, which come from different cohorts:

q_0 comes from cohort 4 newborn deaths;
q_1 comes from cohort 3 age 1 deaths;
q_2 comes from cohort 2 age 2 deaths;
q_3 comes from cohort 1 age 3 deaths.

We then write in the radix ℓ_0 = 1000. Of 1000 individuals born, with q_0 = 0.167, we expect 167 to die, giving us our d_0. Subtracting that from ℓ_0 tells us that ℓ_1 = 833 of the 1000 newborns live to their first birthday. And so it continues. The life expectancies are computed by the same formula as before, but now the interpretation is somewhat different. The cohort remaining life expectancies were the same as the actual average number of (whole) years remaining for the population of individuals from that cohort who reached the given age. The period remaining life expectancies are fictional, telling us the average number of (whole) years that would have remained at each age in a notional cohort of 1000 that experienced at each age the same mortality rates that were in effect for the population in year 4.

5.3 Cohorts and period life tables

You may have noticed a logical fallacy in the arguments of sections 2.7 and 3.1. The life expectancy at birth should be the average length of life of individuals born in that year. Of course, we would have to go back to about 1890 to find a birth year whose cohort — the individuals born in that year — have completed their lives, so that the average lifespan can be computed as an average.

Consider, for instance, the discrete-time non-homogeneous model. “Time” in the model is individual age: an individual starts out at age 0, then progresses to age 1 if she survives, and so on. We estimate the probability of dying aged x by dividing the number of deaths observed at age x by the number of individuals observed to have been at that age.

In our life tables, called period life tables, these numbers came from a census of the individuals alive at one particular time, and the count of those who died in the same year, or period of a few years. No individual experiences those mortality rates. Those born in 2009 will experience the mortality rates for age 10 in 2019, and the mortality rates for age 80 in 2089. Putting together those mortality rates would give us a cohort life table. (Actually, this is not precisely true. You might think about why not. The answer is given in a footnote.¹) If, as has been the case for the past 150 years, mortality rates decline in the interval, that means that the survival rates will be higher than we see in the period table.

We show in Figure 5.1 a picture of how a cohort life table for the 1890 cohort would be related to the sequence of period life tables from the 1890s through the 2000s. The mortality rates for ages 0 through 9 (thus _1q_0, _4q_1, _5q_5)² are on the 1890s period life table, while their mortality rates for ages 10 through 19 are on the 1900–1909 period life table, and so on. Note that the mortality rates for the 1890s period life table yield a life expectancy at birth e_0 = 44.2 years. That is the average length of life that babies born in those years would have had, if their mortality in each year of their lives had corresponded to the mortality rates which were realised for the whole population in the year of their birth. Instead, though, those that survived their early years entered the period of late-life high mortality in the mid- to late 20th century, when mortality rates were much lower. It may seem surprising, then, that the life expectancy for the cohort life table only goes up to 44.7 years. Is it true that this cohort only gained 6 months of life on average, from all the medical and economic progress that took place during their lives?

Yes and no. If we look more carefully at the period and cohort life tables in Table 5.4, we see an interesting story. First of all, a substantial fraction of potential lifespan is lost in the first year, due to the 17% infant mortality, which is obviously the same for the cohort and period life tables. 25% died before age 5. If mortality to age 5 had been reduced to modern levels — close to zero — the period and cohort life expectancies would both be increased by about 14 years. Second, notice that the difference in life expectancies jumps to over 5 years at age 30. Why is that? For the 1890 cohort, age 30 was 1920 — after World War I, and after the flu pandemic. The male mortality rate in this age class was around 0.005 in 1900–9, and less than 0.004 in 1920–9. Averaged over

¹The main difference between a cohort life table and the life table constructed from the corresponding age classes of successive period life tables is immigration: the cohort life table for 1890 should include, in the row for (let us say) ages 60–4, the mortality rates of those born in 1890 in the relevant region — England and Wales in this case — who are still alive at age 60. But these are not identical to the 60-year-old men living in England and Wales in 1950. Some of the original cohort have moved away, and some residing in the country were not born there.

²Actually, we have given µ_x for the intervals [0, 1), [1, 5), and [5, 10). We compute _1q_0 = 1 − e^{−µ_0}, _4q_1 = 1 − e^{−4µ_1}, _5q_5 = 1 − e^{−5µ_5}.


the intervening decade, though, male mortality was close to 0.02. (Most of the effect is due to the war, as we see from the fact that it is seen almost exclusively in the male mortality; female mortality in the same period shows a slight tick upward, but it is on the order of 0.001.) One way of measuring the horrible cost of that war is to see that for the generation of men born in the 1890s, which was most directly affected, the advances of the 20th century procured them on average about 4 years of additional life, relative to what might have been expected from the mortality rates in the year of their birth. Of these 4 years, 3½ were lost in the war. Another way of putting this is to see that the approximately 4.5 million boys born in the UK between 1885 and 1895 lost cumulatively about 16 million years of potential life in the war.

(a) Period life table for men in England and Wales, 1890–9

x     µ_x     ℓ_x      d_x     e_x
0     0.187   100000   17022   44.2
1     0.025   82978    7923    51.7
5     0.004   75055    1655    52.8
10    0.002   73400    908     48.9
15    0.004   72492    1379    44.5
20    0.005   71113    1766    40.3
25    0.006   69347    2087    36.2
30    0.008   67260    2550    32.2
35    0.010   64710    3229    28.4
40    0.013   61481    3970    24.7
45    0.017   57511    4703    21.2
50    0.022   52808    5515    17.8
55    0.030   47293    6508    14.6
60    0.042   40785    7710    11.6
65    0.061   33075    8636    8.9
70    0.086   24439    8511    6.5
75    0.122   15928    7281    4.5
80    0.193   8647     5346    2.7
85    0.262   3301     2410    1.7
90    0.358   891      742     0.9
95    0.477   149      135     0.5
100   0.590   14       13      0.3
105   0.695   1        1       0.2
110   0.772   0        0       0.0

(b) Cohort life table for the 1890 cohort of men in England and Wales

x     µ_x     ℓ_x      d_x     e_x
0     0.187   100000   17022   44.7
1     0.025   82978    7923    52.3
5     0.004   75055    1655    53.5
10    0.002   73400    774     49.6
15    0.003   72626    1167    45.1
20    0.020   71459    6749    40.8
25    0.017   64710    5219    39.7
30    0.004   59491    1257    37.9
35    0.006   58234    1608    33.6
40    0.006   56626    1671    29.5
45    0.009   54955    2384    25.3
50    0.012   52571    3027    21.3
55    0.019   49544    4388    17.5
60    0.028   45156    5956    14.0
65    0.044   39200    7760    10.8
70    0.067   31440    8985    8.0
75    0.102   22455    8940    5.7
80    0.146   13515    6997    3.8
85    0.215   6518     4294    2.3
90    0.288   2224     1697    1.4
95    0.395   527      454     0.8
100   0.516   73       67      0.4
105   0.645   6        6       0.2
110   0.733   0        0       0.0

Table 5.4: Period and cohort tables for England and Wales. The period table is taken directly from the Human Mortality Database. The cohort table is taken from the period tables of the HMD, not copied from their cohort tables.

There are, in a sense, three basic kinds of life tables:

1. Cohort life tables describing a real population. These make most sense in a biological context, where there is a small and short-lived population. The ℓ_x numbers are actual counts of individuals alive at each time, and the rest of the table is simply calculated from these, giving an alternative description of survival and mortality.


Figure 5.1: Decade period life tables, from 1890–1899 through 2000–, with the pieces joined that would make up a cohort life table for individuals born in 1890.


2. Period life tables, which describe a notional cohort (usually starting with radix ℓ_0 being a nice round number) that passes through its lifetime with mortality rates given by the q_x. These q_x are estimated from data such as those of Table 2.1, giving the number of individuals alive in the age class during the period (or number of years lived in the age class) and the number of deaths.

3. Synthetic cohort life tables. These take the q_x numbers from a real cohort, but express them in terms of survival ℓ_x starting from a rounded radix.

Lecture 6

Comparing Life Tables

6.1 The Poisson model

6.1.1 Defining the Poisson model

Under the assumption of a constant hazard rate (force of mortality) µ_{x+1/2} over the year (x, x+1], we may view the estimation problem as a chain of separate hazard rate estimation problems, one for each year of life. Each individual lives some portion of a year in the age interval (x, x+1], the portion being 0 (if he dies before birthday x), 1 (if he dies after birthday x+1), or between 0 and 1 if he dies between the two birthdays. Suppose now we lay these intervals end to end, with a mark at the end of an interval where an individual died. It is not hard to see that what results is a Poisson process on the interval [0, E^c_x], where E^c_x is the total observed years at risk.

Suppose we treat E^c_x as though it were a constant. Then, if D_x represents the number dying in the year, the model uses

\[ P(D_x = k) = \frac{\big(\mu_{x+\frac12} E^c_x\big)^k\, e^{-\mu_{x+\frac12} E^c_x}}{k!}, \qquad k = 0, 1, 2, \dots \]

The estimator for the constant force of mortality over the year is

\[ \tilde\mu_{x+\frac12} = \frac{D_x}{E^c_x}, \qquad \text{with observed value } \frac{d_x}{E^c_x}. \]

Under the Poisson model we therefore have that

\[ \operatorname{var}\,\tilde\mu_{x+\frac12} = \frac{\mu_{x+\frac12}\, E^c_x}{(E^c_x)^2} = \frac{\mu_{x+\frac12}}{E^c_x}. \]

So the estimate will be
\[ \operatorname{var}\,\tilde\mu_{x+\frac12} \approx \frac{d_x}{(E^c_x)^2}. \]

If we compare with the 2-state stochastic model over the year (x, x+1), assuming constant µ = µ_{x+1/2}, then the likelihood is

\[ L = \prod_{i=1}^n \mu^{\delta_i}\, e^{-\mu t_i}, \]


where δ_i = 1 if life i dies and t_i = b_i − a_i in previous terminology (see the binomial model). Hence

\[ L = \mu^{d_x}\, e^{-\mu E^c_x} \]

and so
\[ \hat\mu_{x+\frac12} = \frac{d_x}{E^c_x}. \]

The estimator is exactly the same as for the Poisson model, except that both D_x and E^c_x are random variables. Using asymptotic likelihood theory, we see that the estimate for the variance is

\[ \operatorname{var}\,\hat\mu \approx \frac{\mu^2}{d_x} \approx \frac{d_x}{(E^c_x)^2}. \]
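As a sketch (toy values, our own function name), the point estimate and its approximate standard error under the Poisson/2-state model are:

import math

def mu_hat_with_se(d_x, central_exposed):
    """Hazard estimate mu-hat = d_x / E^c_x with var ≈ d_x / (E^c_x)^2."""
    mu = d_x / central_exposed
    se = math.sqrt(d_x) / central_exposed
    return mu, se

mu, se = mu_hat_with_se(d_x=8, central_exposed=31.0)
print(f"mu-hat = {mu:.3f}, approximate s.e. = {se:.3f}")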

6.1.2 Intuitive justification for the Poisson model

Are we justified in treating E^c_x as though it were fixed? Certainly it's not exactly the same: the numerator and denominator are both random, and they are not even independent. One way of looking at this is to ask how different our estimate would have been in a given realisation, had we fixed the total time under observation in advance. If we observe m lives from the start of year x, we see that D_x is approximately normal with mean m q_x and variance m q_x(1 − q_x), while E^c_x is approximately normal with mean m − m q_x(1 − e*), where e* is the expected remaining length of a life starting from age x, conditioned on its being less than 1, and variance mσ², where σ² is the variance of the time under observation of a single life. (If µ_x is not very large, e* is close to ½.) Looking at the first-order Taylor series expansion, we see that the ratio D_x/E^c_x varies only by a normal error term times m^{−1/2}, plus a bias of order m^{−1}. For large m, then, the estimate on the basis of fixed m (number of individuals) is almost the same as the estimate we would have made from observing the Poisson model for the fixed total time at risk m − m q_x(1 − e*).

We will return to the Poisson model in Lectures 8 and 9, when we look at estimation for general Markov processes.

6.2 Testing hypotheses for q_x and µ_{x+1/2}

We note the following normal approximations:

(i) Binomial model:
\[ D_x \sim B(E_x, q_x) \;\Longrightarrow\; D_x \approx N\big(E_x q_x,\; E_x q_x(1 - q_x)\big), \qquad \hat q_x = \frac{D_x}{E_x} \approx N\Big(q_x,\; \frac{q_x(1 - q_x)}{E_x}\Big). \]

(ii) Poisson model or 2-state model:
\[ D_x \approx N\big(E^c_x\,\mu_{x+\frac12},\; E^c_x\,\mu_{x+\frac12}\big), \qquad \hat\mu_{x+\frac12} \approx N\Big(\mu_{x+\frac12},\; \frac{\mu_{x+\frac12}}{E^c_x}\Big). \]

Tests are often done using comparisons with a published standard life table. These can be from national tables for England and Wales published every 10 years, or insurance company data collected by the Continuous Mortality Investigation Bureau, or from other sources.


A superscript “s” denotes “from a standard table”, as in q^s_x and µ^s_{x+1/2}.

Test statistics are generally obtained from the following:

Binomial:
\[ z_x = \frac{d_x - E_x q^s_x}{\sqrt{E_x q^s_x (1 - q^s_x)}} \qquad \Big(\approx \frac{O - E}{\sqrt{V}}\Big) \]
Poisson/2-state:
\[ z_x = \frac{d_x - E^c_x\,\mu^s_{x+\frac12}}{\sqrt{E^c_x\,\mu^s_{x+\frac12}}} \qquad \Big(\approx \frac{O - E}{\sqrt{V}}\Big). \]

Both of these are denoted as zx since under a null hypothesis that the standard table isadequate Zx ⇠ N(0, 1) approximately.

6.2.1 The tests

�2 test

We takeX =

X

all ages x

z2x

This gives the sum of squares of standard normal random variables under the null hypothesisand so is a sum of �2

(1). Therefore

X ⇠ �2(m) , if m = # years of study.

H0 : there is no difference between the standard table and the data,HA : they are not the same.It is normal to use 5% significance and so the test fails if X > �2

(m)0.95.It tests large deviations from the standard table.Disadvantages:1. There may be a few large deviations offset by substantial agreement over part of the table.

The test will not pick this up.2. There might be bias, that is, although not necessarily large, all the deviations may be of

the same sign.3. There could be significant groups of consecutive deviations of the same sign, even if not

overall.

Signs test

Test statistic X is given byX = #{zx > 0}

Under the null hypothesis X ⇠ Binom(m, 12), since the probability of a positive sign should be1/2. This should be administered as a two-tailed test. It is under-powered since it ignores thesize of the deviations but it will pick up small deviations of consistent sign, positive or negative,and so it addresses point 2 above.

LECTURE 6. COMPARING LIFE TABLES 44

Cumulative deviations test

This again addresses point 2 and essentially looks very similar to the logrank test between twosurvival curves. If instead of squaring dx � Exqsx or dx � Ec

xµsx+ 1

2

, we simply sum then

P(dx � Exqsx)pPExqsx (1� qsx)

⇠ N(0, 1), approximately

andP✓

dx � Ecxµ

sx+ 1

2

qPEc

xµsx+ 1

2

⇠ N(0, 1) approximately.

H0 : there is no biasHA : there is a bias.

This test addresses point 2 again, which is that the chi-square test does not test for consistentbias.

Other tests

There are tests to deal with consecutive bias/runs of same sign. These are called the groups ofsigns test and the serial correlations test. Again a very large number of years, m, are required torender these tests useful.

6.2.2 An example

Table 6.1 presents imaginary data for men aged 90 to 95. The column `x lists the initial atrisk, the number of men in the population on the census date, and dx is the number of deathsfrom this initial population over the course of the year. Ec

x is the central at risk, estimated as`x � dx/2. Standard male British mortality for these ages is listed in column µs

x. (The columnµx is a graduated estimate, which will be discussed in section 7.1.

age `x dx Ecx µx+ 1

2µsx zx µx

90 40 10 35 0.29 0.202 1.1 0.2591 35 8 31 0.258 0.215 0.52 0.2892 22 4 18 0.20 0.236 �0.33 0.33593 14 6 11 0.545 0.261 1.85 0.4094 11 4 9 0.444 0.279 0.94 0.4595 7 3 5.5 0.545 0.291 1.11 0.48

Table 6.1: Table of mortality rates for an imaginary old-people’s home, with standard Britishmale mortality given as µs

x, and graduated estimate µx.

We note substantial differences between the estimates µx and the standard mortality µsx, but

none of them is extremely large relative to the standard error: The largest zx is 1.85. We test thetwo-sided alternative hypothesis, that the mortality rates in the old-people’s home are differentfrom the standard mortality rates, with a �2 test, adding up the z2x. The observed X2 is 7.1,corresponding to an observed significance level p = 0.31. (Remember that we have 6 degrees offreedom, not 5, because these zx are independent. This is not an incidence table.)

Lecture 7

Graduation (smoothing life tables) and

the single-decrements model

7.1 GraduationGraduation is what statisticians would call "smoothing". Suppose that a company has collectedits own data, producing estimates for either qx or µx+ 1

2. The estimates may be rather irregular

from year to year and this could be an artefact of the population the company happens to havein a particular scheme. The underlying model should probably (but not necessarily) be smootherthan the raw estimates. If it is to be considered for future predictions, then smoothing should beconsidered. This is called graduation.

There is always a tradeoff in smoothing procedures. Without smoothing, real patterns getlost in the random noise. Too much smoothing, though, can swamp the data in the model, sothat the final estimate reflects more our choice of model than any truth gleaned from the data.

7.1.1 Parametric models

We may fit a formula to the data. Possible examples are

µx = µ (Exponential);

µx = Be✓x (Gompertz);

µx = A+Be✓x (Makeham)

The Gompertz can be a good model for a population of middle older age groups. The Makehammodel has an extra additive constant which is sometimes used to model “intrinsic mortality”,which is supposed to be independent of age. We could use more complicated formulae putting inpolynomials in x.

7.1.2 Reference to a standard table

Here q0x, µ0x represent the graduated estimates. We could have a linear dependence

q0x = a+ bqsx, µ0x = a+ bµs

x

or possibly a translation of years

q0x = qsx+k, µ0x = µs

x+k

In general there will be some assigned functional dependence of the graduated estimate onthe standard table value. These are connected with the notions of accelerated lifetimes andproportional hazards, which will be central topics in the second part of the course.

45

LECTURE 7. GRADUATION (SMOOTHING LIFE TABLES) AND THESINGLE-DECREMENTS MODEL 46

7.1.3 Nonparametric smoothing

We effectively smooth our data when we impose the assumption that mortality rates are constantover a year. We may tune the strength of smoothing by requiring rates to be constant over longerintervals. This is a form of local averaging, and there are more and less sophisticated versions ofthis. In Matlab or R the methods available include kernel smoothing, orthogonal polynomials,cubic splines, and LOESS. These are beyond the scope of this course.

In Figure 7.1 we show a very simple example. The mortality rates are estimated by individualyears or by lumping the data in five year intervals. The green line shows a moving average of theone-year estimates, in a window of width five years.

● ●

● ●

● ● ●

●●

● ● ● ●

0 5 10 15 20 25

0.00.2

0.40.6

0.81.0

Estimates of A. sarcophagus mortality (based on Erickson et al.)

Age (years)

Estim

ated m

ortali

ty pro

babil

ity

● yearly estimatecontinuous (5−year groupings)5−year moving average

Figure 7.1: Different smooothings for A. sarcophagus mortality from Table 1.1.

7.1.4 Methods of fitting

1. In any of the models (binomial, Poisson, 2-state) set (say) qx = a+ bqsx in the likelihoodand use maximum likelihood estimators for the unknown parameters a, b and similarly forµx and other functional relationships with the standard values.

2. Use weighted least squares and minimise

X

all ages x

wx

⇣bqx � q0x

⌘2or

X

all ages x

wx

✓bµx+ 1

2� µ0

x+ 12

◆2

LECTURE 7. GRADUATION (SMOOTHING LIFE TABLES) AND THESINGLE-DECREMENTS MODEL 47

0 5 10 15 20 25

0.010.02

0.050.100.20

0.501.002.00

Tyrannosaur mortality rates

Age (yrs)

Mo

rta

lity r

ate

Estimates from dataConstant hazard fitGompertz hazard fit

Figure 7.2: Estimated tyrannosaurus mortality rates from Table 5.2, together with exponentialand Gompertz fits.

as appropriate. For the weights suitable choices are either Ex or Ecx respectively. Alterna-

tively we can use 1/var, where the variance is estimated for bqx or bµx+ 12, respectively.

The hypothesis tests we have already covered above can be used to test the graduation fit tothe data, replacing qsx, µ

sx+ 1

2

by the graduated estimates. Note that in the �2 test we must

reduce the degrees of freedom of the �2 distribution by the number of parametersestimated in the model for the graduation. For example if q0x = a+ bqsx, then we reducethe degrees of freedom by 2 as the parameters a, b are estimated.

7.1.5 Examples

Standard life table

We graduate the estimates in Table 6.1, based on the standard mortality rates listed in thecolumn µs

x, using the parametric model µx = a+ bµsx. The log likelihood is

` =X

dx log µx+ 12� µx+ 1

2Ec

x.

LECTURE 7. GRADUATION (SMOOTHING LIFE TABLES) AND THESINGLE-DECREMENTS MODEL 48

We maximise by solving the equations

0 =@`

@a=

X0

B@dx

a+ bµsx+ 1

2

� Ecx

1

CA

0 =@`

@b=

X0

@dxµs

x+ 12

a+ bµsx

� µsx+ 1

2Ec

x

1

A .

We can solve these equations numerically, to obtain a = �0.279 and b = 2.6. This yields thegraduated estimates µ tabulated in the final column of Table 6.1. Note that these estimates havethe virtue of being, on the one hand, closer to the observed data than the standard mortalityrates; on the other hand smoothly and monotonically increasing.

If we had used ordinary least squares to fit the mortality rates, we would have obtainedvery different estimates: a = �0.472 and b = 3.44, because we would be trying to minimise theerrors in all classes equally, regardless of the nimber of observations. Weighted least squares,with weights proportional to Ec

x (inverse variance) solves this problem, more or less, and gives usestimates a⇤ = �0.313 and b⇤ = 2.75 very close to the MLE.

In Figure 7.2 we plot the mortality rate estimates for the complete population of tyrannosaursdescribed in Table 1.1, on a logarithmic scale, together with two parametric model fits: theexponential model, with one parameter µ estimated by

µ =1

t=

n

t1 + · · ·+ tn⇡ n

k1 + · · ·+ kn + n/2= 0.058,

where t1, . . . , tn are the n lifetimes observed, and ki = btic the curtate lifetimes; and the Gompertzmodel µs = Be✓s, estimated by

✓ solvesQ0

(✓)

Q(✓)� 1� 1

✓= t,

B :=✓

Q(✓)� 1,

where Q(✓) :=1

n

Xe✓ti .

This yields ✓ = 0.17 and B = 0.0070. It seems apparent to the eye that the exponential fit isquite poor, while the Gompertz fit might be pretty good. It is hard to judge the fit by eye,though, since the quality of the fit depends in part on the number of individuals at risk that gointo the individual mortality-rate estimates, something which does not appear in the plot.

To test the hypothesis, we compute the predicted number of deaths in each age classd(exp)x = lx · q(exp)x if there is a constant µx = µ = 0.058, meaning that q(exp)x = 0.057, andd(Gom)x = lx · q(Gom)

x if

qx = q(Gom)x := 1� exp

(�B

✓e✓x⇣e✓ � 1

⌘),

which is obtained by integrating the Gompertz hazard.

LECTURE 7. GRADUATION (SMOOTHING LIFE TABLES) AND THESINGLE-DECREMENTS MODEL 49

age lx dx q(exp)x d(exp)x z(exp)x q(Gom)x d(Gom)

x z(Gom)x

0 103 0 0.007 5.87 -2.50 0.008 0.78 -0.891 103 0 0.008 5.87 -2.50 0.009 0.93 -0.972 103 3 0.010 5.87 -1.23 0.011 1.10 1.843 100 1 0.012 5.70 -2.03 0.013 1.26 -0.244 99 1 0.014 5.64 -2.02 0.015 1.48 -0.405 98 3 0.016 5.59 -1.14 0.018 1.73 0.986 95 2 0.019 5.42 -1.52 0.021 1.99 0.017 93 1 0.023 5.30 -1.93 0.025 2.30 -0.878 92 2 0.027 5.24 -1.47 0.029 2.69 -0.439 90 4 0.032 5.13 -0.52 0.035 3.12 0.5210 86 4 0.038 4.90 -0.42 0.041 3.52 0.2711 82 3 0.044 4.67 -0.80 0.048 3.96 -0.5012 79 4 0.052 4.50 -0.25 0.057 4.50 -0.2513 75 3 0.062 4.28 -0.64 0.067 5.04 -0.9514 72 8 0.073 4.10 2.04 0.079 5.70 1.0315 64 4 0.086 3.65 0.19 0.093 5.96 -0.8616 60 4 0.101 3.42 0.33 0.109 6.56 -1.0817 56 7 0.118 3.19 2.27 0.128 7.18 -0.0818 49 10 0.139 2.79 4.69 0.150 7.36 1.1119 39 6 0.162 2.22 2.72 0.175 6.84 -0.3720 33 3 0.189 1.88 0.86 0.204 6.74 -1.6521 30 10 0.220 1.71 7.15 0.237 7.12 1.3522 20 8 0.255 1.14 7.40 0.275 5.49 1.4023 12 4 0.295 0.68 4.52 0.317 3.80 0.1424 8 3 0.339 0.46 4.30 0.363 2.91 0.0825 5 0 0.388 0.29 -0.55 0.414 2.07 -1.8826 5 3 0.441 0.29 6.26 0.470 2.35 0.7027 2 0 0.498 0.11 -0.35 0.528 1.06 -1.5028 2 2 0.558 0.11 8.13 0.590 1.18 1.67

Table 7.1: Life table for tyrannosaurs, with fit to exponential and Gompertz models, and basedon data from Table 1.1.

It matters little how we choose to interpret the deviations in the column z(exp)x — with valuesgoing up as high as 8.13, it is clear that these could not have come from a normal distribution,and we must reject the null hypothesis that these lifetimes came from an exponential distribution.

As for the Gompertz model, the deviations are all quite moderate. We computeP

z2x = 28.7.There are 29 categories, but we have estimated 2 parameters, so this needs to be compared tothe �2 distribution with 27 degrees of freedom. (You have learned this rule of thumb of reducingthe degrees of freedom by 1 for every parameter estimated in the multinomial setting. What weare doing here is slightly different, but the principle is the same.) The cutoff for a test at the0.05 level is 40.1, so we do not reject the null hypothesis.

There are several problems with this �2 test. As you have already learned, the �2 approxi-mation doesn’t work very well when the expected numbers in some categories are too low. Thisis certainly the case in this case, with dGom

x as low as 0.78. (That is, we are using a normalapproximation with mean 0.78 and SD 0.88 for a quantity that takes on nonnegative integer

LECTURE 7. GRADUATION (SMOOTHING LIFE TABLES) AND THESINGLE-DECREMENTS MODEL 50

values. That obviously cannot be right.) A better version would lump categories together. Ifwe replace the first 10 years by a single category, it will have an expected number of deathsequal to 0.168 · 103 = 17.3, as compared with exactly 17 observed deaths, producing a z value of0.08. Similarly, we cut off the final three years (since collectively they correspond to the certainevent that all remaining individuals die), leaving us with

Pz2x = 11 on 14 degrees of freedom.

Again, this is a perfectly ordinary value of the �2 variable, and we do not reject the hypothesisof Gompertz mortality.

7.2 Single-decrement modelThe exponential model may also be represented as a Markov process. Let S = {0, 1} be our statespace, with interpretation 0=‘alive’ and 1=‘dead’, and consider the Q-matrix

Q =

�µ µ0 0

!. (7.1)

Then a continuous-time Markov chain X = (Xt)t�0 with X0 = 0 and Q-matrix Q will have aholding time T ⇠ exp(µ) in state 0 before a transition to 1, where it is absorbed, i.e.

Xt =

(0 if 0 t < T1 if t � T

. (7.2)

The transition matrix is

Pt = etQ =

e�µt

1� e�µt

0 1

!.

It seems that this is an overly elaborate description of a simple model (diagrammed in Figure7.3), but this viewpoint will be useful for generalisations. Also, the ‘rate parameter’ µ has amore concrete meaning, and the lack of memory property of the exponential distribution is alsoreflected in the Markov property: given that the chain is still in state 0 at time t (i.e. givenT > t), the residual holding time (i.e. T � t) has conditional distribution Exp(µ).

Alive Deadμ

Figure 7.3: The single-decrement model.

This model may be generalised by allowing the transition rate µ to become an age-dependentrate function t 7! µ(t). This may be seen as a very special kind of inhomogeneous Markov process,or as a special kind of renewal process (one with only one transition). The general two-statemodel with transient state ‘alive’ and absorbing state ‘dead’, is called the ‘single-decrementmodel’.

LECTURE 7. GRADUATION (SMOOTHING LIFE TABLES) AND THESINGLE-DECREMENTS MODEL 51

7.3 Rates in the single-decrement modelThe rate parameter in the two-state Markov model with L ⇠ Exp(�) has the infinitesimalinterpretation

P(Xt+" = 1|Xt = 0) = P(L t+ "|L > t) = 1� e��"= �"+ o("), (7.3)

and for a general L with right-continuous density and hence right-continuous force of mortalityt 7! µt, we have

P(Xt+" = 1|Xt = 0) = P(L t+ "|L > t) = 1� exp

(�Z t+"

tµsds

)= µt"+ o("), (7.4)

since, by l’Hôpital’s rule

lim"#0

1

"

0

@1� exp

(�Z t+"

tµsds

)1

A = lim"#0

µt+" exp

(�Z t+"

tµsds

)= µt. (7.5)

It is therefore natural to express the two-state model by a time-dependent Q-matrix

Q(t) =

��(t) �(t)

0 0

!, where �(t) = µt = hL(t). (7.6)

For estimation purposes, it has been convenient to add as additional assumption that�(t) = µt = µx+ 1

2= �(x+

12) is constant on x t < x+ 1, x 2 N.

We have expressed the process X = (Xt)t�0 as Xt = 0 for 0 t < L and Xt = 1 for t > L,where L is the transition time. Given the observed transition times t1, . . . , tn of n independentcopies of X (corresponding to n different ‘individuals’), we have constructed two different sets ofmaximum likelihood estimates

µ(0)

x+ 12

(t1, . . . , tn) = � log

⇣q(0)x (t1, . . . , tn)

⌘= log

✓1� dx(t1, . . . , tn)

`x(t1, . . . , tn)

◆,

µx+ 12(t1, . . . , tn) =

dx(t1, . . . , tn)˜x(t1, . . . , tn)

, 0 x max{t1, . . . , tn}.

If we furthermore assume that �(t) ⌘ � for all t � 0, then the maximum likelihood estimatoris simply

� =n

t1 + . . .+ tn=

d0 + . . .+ d[max{t1,...,tn}]˜0 + . . .+ ˜

[max{t1,...,tn}]. (7.7)

Lecture 8

Multiple-decrements model

8.1 Introduction to the multiple-decrements modelThe simplest (and most immediately fruitful) way to generalise the single-decrements model is toallow transitions to multiple absorbing states. Of course, as demographer Kenneth Wachter hasput it, it may seem peculiar to introduce multiple “dead” states into our models since there isonly one way of being dead; but (as he continues), there are many ways of getting there. Further,there are many other settings which can be modelled by a single nonabsorbing state transitioninginto one of several possible absorbing states. Some examples are

• A working population insured for disability might transition into multiple different possiblecauses of disability, which may be associated with different costs.

• Workers may leave a company through retirement, resignation, or death.

• A model of unmarried cohabitations, which may end either by separation or marriage.

• Unemployed individuals may leave that state either by finding a job, or by giving up lookingfor work and so becoming “long-term unemployed”.

An important common element is that calling the states “absorbing” does not have to mean thatit is a deathlike state, from which nothing more happens. Rather, it simply means that ourmodel does not follow any further developments.

8.1.1 An introductory example

This example is taken from section 8.2 of [25].According to United Nations statistics, the probability of dying for men in Zimbabwe in

2000 was 5q30 = 0.1134, with AIDS accounting for approximately 4/5 of the deaths in this agegroup. Suppose we wish to answer the question: what would be the effect on mortality rates of acomplete cure for AIDS?

One might immediately be inclined to think that the mortality rate would be reduced to 1/5of its current rate, so that what the probability of dying of some other cause in the absence ofAIDS, which we might write as 5qOTHER⇤

30 , would be 0.02268. On further reflection, though, itseems that this is too low: This is the proportion of people aged 30 who currently die of causesother than AIDS. If AIDS were eliminated, surely some of the people who now die of AIDSwould instead die of something else.

Of course, this is not yet a well-defined mathematical problem. To make it such, we need toimpose extra conditions. In particular, we impose the competing risks assumption: Individual

52

LECTURE 8. MULTIPLE-DECREMENTS MODEL 53

causes of death are assumed to act independently. You might imagine an individual drawing lotsfrom multiple urns, labelled “AIDS”, “Stroke”, “Plane crash”, to determine whether he will die ofthis cause in the next year. The fraction of black lots among the white is precisely qx, when theindividual has age x. If he gets no black lot, he survives the year. If he draws two or more, weonly get to see the one drawn first, since he can only die once. The probability of surviving isthen the product of the survival probabilities:

tqx = 1��1� tq

CAUSE1x

��1� tq

CAUSE2x

�· · · (8.1)

What is the fraction of deaths due to a given cause? Assuming constant mortality rate over thetime interval due to each cause, we have

1� tqCAUSE1x = e�t�CAUSE1

x .

Given a death, the probability of it being due to a given cause, is proportional to the associatedhazard rate. Consequently,

�CAUSE1x = fraction of deaths due to CAUSE 1 ⇥ �x,

which implies that

tqCAUSE1x = 1� (1� tqx)

fraction of deaths due to CAUSE 1 .

(Note that this is the same formula that we use for changing lengths of time intervals: tqx =

1� (1� 1qx)t.) This tells us the probability of dying from cause 1 in the absence of any othercause. The probability of dying of any cause at all is then given by (8.1).

Applying this to our Zimbabwe AIDS example, treating the causes as being either AIDS orOTHER, we see that the probability of dying of AIDS in the absence of any other cause is

5qAIDS⇤30 = 1� (1� 5q30)

4/5= 1� 0.88664/5 = 0.0918,

while the probability of dying of any other cause, in the absence of AIDS, is

5qOTHER⇤30 = 1� (1� 5q30)

4/5= 1� 0.88661/5 = 0.0238.

Appropriately, we have the total cause of death 0.1138 = 1� (1� 0.0918)(1� 0.0238).Is the competing risks assumption reasonable? Another way of putting this is to ask, what

circumstances would cause the assumption to be violated? The answer is: Competing risksis violated when a subpopulation is at higher than average risk for multiple causes of deathsimultaneously; or conversely, when those at higher than average risk for one cause of death areprotected from another cause of death. For example, smokers have more than 10 times the risk ofdying from lung cancer that nonsmokers have; but they also have substantially higher mortalityfrom other cancers, heart disease, stroke, and so on. If a perfect cure for lung cancer were tobe found, it would not save nearly as many lives as one might suppose, from a competing-riskscalculation like the one above, because the lives that would be saved would be almost all thoseof smokers, and they would be more likely to die of something else than an equivalent number ofsaved lives from the general population.

LECTURE 8. MULTIPLE-DECREMENTS MODEL 54

8.1.2 Basic theory

We consider here more general 1 +m-state Markov models with state space S = {0, . . . ,m} thatonly have one transition from 0 to j, for some 1 j m, with absorption in j. We can writedown an – in general time-dependet – Q-matrix

Q(t) =

0

BBBB@

��+(t) �1(t) · · · �m(t)0 0 · · · 0

...... . . . ...

0 0 · · · 0

1

CCCCA, where �+(t) = �1(t) + . . .+ �m(t). (8.2)

Such models occur naturally where insurance policies provide different benefits for differentcauses of death, or distinguish death and disability, possibly in various different strengths orforms. This is also clearly a building block (one transition only) for general Markov models,where states j = 1, . . . ,m may not all be absorbing.

Such a model depends upon the assumption that different causes of death act independently— that is, the probability of dying is the product of what might be understood as the probabilityof dying from each individual cause acting alone.

8.1.3 Multiple decrements – time-homogeneous rates

In the time-homogenous case, we can think of the multiple decrement model as m exponentialclocks Cj with parameters �j , 1 j m, and when the first clock goes off, say, clock j, theonly transition takes place, and leads to state j. Alternatively, we can describe the model asconsisting of one L ⇠ Exp(�+) holding time in state 0, after which the new state j is chosenindependently with probability �j/�+, 1 j m. The likelihood for a sample of size 1 consistsof two ingredients, the density �+e�t�+ of the exponential time, and the probability �j/�+ ofthe transition observed. This gives �je�t�+ , or, for a sample of size n of lifetimes ti and statesji, 1 i n,

nY

i=1

�jie�ti�+ =

mY

j=1

�nj

j e��j(t1+...+tn), (8.3)

where nj is the number of transitions to j. Again, this can be solved factor by factor to give

�j =nj

t1 + . . .+ tn, 1 j m. (8.4)

In particular, we find again �+ = n/(t1 + . . .+ tn), since n1 + . . .+ nm = n.In the competing-clocks description, we can interpret the likelihood as consisting of m

ingredients, namely the density �je��jt of clock j to go off at time t, and probabilities e��kt ofclocks Ck, k 6= j, to go off after time t.

8.2 Estimation for general multiple decrementsWe can deduce from either description in the previous section that the likelihood for a sample ofn independent lifetimes t1, . . . , tn and respective new states j1, . . . , jn, each (ti, ji) sampled from(L, J), is given by

nY

i=1

�ji(ti) exp

(�Z ti

0�+(t)dt

). (8.5)

LECTURE 8. MULTIPLE-DECREMENTS MODEL 55

Let us assume that the forces of decrement �j(t) = �j(x+12) are constant on x t < x+ 1, for

all x 2 N and 1 j m. Then the likelihood can be given as

Y

x2N

mY

j=1

⇣�j(x+

12)

⌘dj,xexp

n�˜

x�j(x+12)

o, (8.6)

where dj,x is the number of decrements to state j between ages x and x+ 1, and ˜x is the total

time spent alive between ages x and x+ 1.Now the parameters are �j(x+

12), x 2 N, 1 j m, and they are again well separated to

deduce�j(x+

12) =

dj,x˜x

, 1 j m, 0 x max{L1, . . . , Ln}. (8.7)

Similarly, we can try to adapt the method to get maximum likelihood estimators from the curtatelifetimes. We can write down the likelihood as

nY

i=1

p(J,K)(ji, [ti]) =Y

x2N(1� qx)

`x�dxmY

j=1

qdj,xj,x , (8.8)

but 1� qx = 1� q1,x � . . .� qm,x does not factorise, so we have to maximise simultaneously forall 1 j m expressions of the form

(1� q1 � . . .� qm)`�d1�...�dm

mY

j=1

qdjj . (8.9)

(We suppress the indices x.) A zero derivative with respect to qj amounts to

(`� d1 � . . .� dm)qj = dj(1� q1 � . . .� qm), 1 j m, (8.10)

and summing over j gives

(`� d)q = d(1� q) ) q =d

`. (8.11)

and then(`� d)qj = dj(1� q) ) qj =

dj(1� q)

`� d=

dj`

(8.12)

so that, if we display the suppressed indices x again,

q(0)j,x = q(0)j,x(t1, j1, . . . , tn, jn) =dj,x`x

. (8.13)

Now we’ve done essentially all maximum likelihood calculations. This one was the only one thatwas not totally trivial. At repeated occurrences of the same factors, we have been and will beless explicit about these calculations. We’ll derive likelihood functions, note that they factoriseand identify the factors as being of one of the three forms

(1� q)`�dqd ) q = d/`

µde�µ` ) µ = d/`

(1� q1 � . . .� qm)`�d1�...�dm

mY

j=1

qdjj ) qj = dj/`, j = 1, . . . ,m.

and deduce the estimates.

LECTURE 8. MULTIPLE-DECREMENTS MODEL 56

8.3 Example: Workforce modelA company is modelling its workforce using the model

Q(t) =

0

BBB@

��(t)� �(t)� µ(t) �(t) �(t) µ(t)0 0 0 0

0 0 0 0

0 0 0 0

1

CCCA(8.14)

with four states S = {W,V, I,�}, where W =‘working’, V =’left the company voluntarily’,I =’left the company involuntarily’ and � =’left the company through death’.

If we observe nx people aged x, then

�x+ 12=

dx,V˜x

, �x+ 12=

dx,I˜x

, µx+ 12=

dx,�˜x

(8.15)

where ˜x is the total amount of time spent working aged x, dx,V is the total number of workers

who left the company voluntarily aged x, dx,I is the total number of workers who left the companyinvoluntarily aged x, dx,� is the total number of workers dying aged x.

8.4 The distribution of the endpointThe time-homogeneous multiple-decrement model makes a transition at the minimum of mexponential clocks as opposed to one clock in the single decrement model. In the same way, wecan construct the time-inhomogeneous multiple-decrement model from m independent clocks Cj

with hazard function �j(t), 1 j m. Then the likelihood for a transition at time t to state jis the product of fCj (t) and FCk(t).

By Exercise A.1.2, the hazard function of L = min{C1, . . . , Cm} is given by hL(t) = hC1(t) +. . .+ hCm(t) = �1(t) + . . .+ �m(t) = �+(t), and we can also calculate

P�L = Cj

��L = t = lim

"#0

P�t Cj < t+ ", min{Ck : k 6= j} � Cj

P�t L < t+ "

8>>>><

>>>>:

� lim"#0

1"P�t Cj < t+ ", min{Ck : k 6= j} � t+ "

1"P�t L < t+ "

lim"#0

1"P�t Cj < t+ ", min{Ck : k 6= j} � t

1"P�t L < t+ "

=fCj (t)

Qk 6=j FCk(t)

fL(t)=

hCj (t)

hL(t)=

�j(t)

�+(t),

and we obtain

P(L = Cj) =

Z 1

0P(L = Cj |L = t)fL(t)dt =

Z 1

0�j(t)FL(t)dt = E(⇤j(L)), (8.16)

where ⇤j(t) =R t0 �j(s)ds is the integrated hazard function. (For the last step we used that

E(g(L)) =R10 g0(t)FL(t)dt for all increasing differentiable g : [0,1) ! [0,1) with g(0) = 0.)

The discrete (curtate) lifetime model: We can also split the curtate lifetime K = [L]according to the type of decrement J (J = j if L = Tj) and define

qj,x = P(L < x+ 1, J = j|L > x), 1 j m, x 2 N, (8.17)

LECTURE 8. MULTIPLE-DECREMENTS MODEL 57

then clearly for x 2 Nq1,x + . . .+ qm,x = qx (8.18)

and, for 1 j m,

p(J,K)(j, x) = P(J = j,K = x) = P(L x+ 1, J = j|L > x)P(L > x) = p0 . . . px�1qj,x. (8.19)

Note that this bivariate probability mass function is simple, whereas the joint distribution of(L, J) is conceptually more demanding since L is continuous and J is discrete. We chose toexpress the marginal probability density function of L and the conditional probability massfunction of J given L = t. In the assignment questions, you will see an alternative descriptionin terms of sub-probability densities gj(t) =

ddtP(L t, J = j), which you can normalise –

gj(t)/P(J = j) is the conditional density of L given J = j.

8.5 Cohabitation–dissolution modelThere has been considerable interest in the influence of nonmarital birth on the likelihood of achild growing up without one of its parents. In the paper [16] relevant data are given for ninedifferent western European countries. We give a summary of some of the UK data in Table 8.1.We represent the data in terms of a multiple decrement model in which the one nonabsorbingstate is cohabitation, and this leads to the two absorbing states, which are marriage or separation.(Of course, there is a third absorbing state, corresponding to the death of one of the partners,but this did not appear in the data. And of course, the marriage state is not actually absorbing,except in fairy tales. A more complete analysis would splice this model onto a model of the fate ofmarriages. There are data in the article on rates of separation after marriage, for those interestedin filling out the model.) Time, in this model, begins with the birth of the first child. Becauseof the way the data are given, we treat the hazard rates as constant in the time intervals [0, 1],[1, 3], and [3, 5]. There are no data about what happens after 5 years. We write dMx and dSx forthe number of individuals marrying and separating, respectively, and similar for the estimationof hazard rates. (For simplicity, we have divided the separation data, which were actually onlygiven for the periods [0, 3] and [3, 5], as though there were a count for separations in [0, 1].)

Table 8.1: Data from [16] on rates of conversion of cohabitations into marriage or separation, byyears since birth of first child

(a) % cohabiting couples remaining together(from among those who did not marry)

n after 3 years after 5 years

106 61 48

(b) % of cohabiting couples who marrywithin stated time.

n 1 year 3 years 5 years

150 18 30 39

Translating the data in Table 8.1 into a multiple-decrement life table requires some interpretivework.

1. There are only 106 individuals given for the data on separation; this is both because theindividuals who eventually married were excluded from this tabulation, and because thetwo tabulations were based on slightly different samples.

2. The data are given in percentages.

LECTURE 8. MULTIPLE-DECREMENTS MODEL 58

3. There is no count of separations in the first year.

4. Note that separations are given by survival percentages, while marriages are given by losspercentages.

We now construct a combined life table from the data in Table 8.1. The purpose of thismodel is to integrate information from the two data sets. This requires some assumptions, towit, that the transition rates to the two different absorbing states are the same for everyone, andthat they are constant over the periods 0–1, 1–3, 3–5 (and constant over 0–3 for separation).

The procedure is essentially the same as the construction of the single-decrement life table,except that the survival is decremented by both loss counts dMx and dSx ; and the estimation ofyears at risk ˜

x now depends on both decrements, so is

˜x = `x0(x0 � x) + (dMx + dSx )

x0 � x

2,

where x0 is the next age on the life table. Thus, for example, ˜1, which is the number of years atrisk from age 1 to 3, is 64 · 2 + 41 · 1 = 169.

One of the challenges is that we observe transitions to Separation conditional on never beingMarried, but the transitions to Married are unconditioned. The data for marriage are morestraightforward: These are absolute decrements, rather than conditional ones. If we set up a lifetable an a radix of 1000, we know that the decrements due to marriage should be exactly thepercentages given in Table 8.1(b); that is, 180, 120, and 90. We begin by putting these into ourmultiple-decrements life table, Table 8.2.

Of these nominal 1000 individuals, there are 610 who did not marry. Multiplying by thepercentages in Table 8.1(a) we estimate 238 separations in the first 3 years, and 79 in the next 2years.

Table 8.2: Multiple decrement life table for survival of cohabiting relationships, from time ofbirth of first child, computed from data in Table 8.1.

x `x dMx dSx ˜x µM

x µSx

0–1 1000 180 ? ? ? ?1–3 ? 120 ? ? ? ?3–5 462 90 79 755 ? ?

At this point, the only thing preventing us from completing the life table is that we don’tknow how to allocate the 238 separations between the first two rows, which makes it impossibleto compute the total years at risk for each of these intervals. The easiest way is to begin bymaking a first approximation by assuming the marriage rate is the unchanged over the first threeyears, and then coming back to correct it. Our first step, then, is the multiple-decrement tablein Table 8.3. According to our usual approximation, we estimate the years at risk for the firstperiod as 1000 · 3� 538 · 1.5 = 2193; and in the second period as 462 · 2� 169 · 1 = 755. The twodecrement rates are then estimated by dividing the number of events by the number of years atrisk.

What correction do we need? We need to estimate the number of separations in the first year.Our model says that we have two independent times S and M , the former with constant hazard

LECTURE 8. MULTIPLE-DECREMENTS MODEL 59

Table 8.3: First approximation multiple decrement table.

x `x dMx dSx ˜x µM

x µSx

0–3 1000 300 238 2193 0.137 0.1093–5 462 90 79 755 0.119 0.105

rate µS1.5 on [0, 3], the other with rate µM

12

on [0, 1] and µM2 on [1, 3]. Of the 238 separations in

[0, 3], we expect the fraction in [0, 1] to be about

P�S < 1

��S < min{3,M} =

P{min{S,M} < 1&S < M}P{min{S,M} < 1&S < M}+ P{1 min{S,M} < 3&S < M} .

Since min{S,M} is exponential with rate µM12

+ µS1.5 on [0, 1] and rate µM

2 + µS1.5 on [1, 3], and

independent of {S < M}, we have

P{min{S,M} < 1&S < M} =µS1.5

µM12

+ µS1.5

✓1� e

�µM12�µS

1.5

◆;

P{1 < min{S,M} < 3&S < M} =µS1.5

µM2 + µS

1.5

e�µM

12�µS

1.5⇣1� e�2µM

2 �2µS1.5

⌘.

In Table 8.3 we have taken both values µM to be equal, the estimate being 0.137; were we to doa second round of correction we could take the estimates to be distinct. We thus obtain

P{min{S,M} < 1&S < M} ⇡ 0.109

0.137 + 0.109

⇣1� e�0.137�0.109

⌘= 0.0966;

P{1 < min{S,M} < 3&S < M} ⇡ 0.109

0.137 + 0.109e�0.137�0.109

⇣1� e�2·0.137�2·0.109

⌘= 0.135.

Thus, we allocate 0.418 ⇤ 238 = 99.4 of the separations to the first period, and 138.6 to thesecond. This way we can complete the life table by computing the total years at risk for eachperiod, and hence the hazard rate estimates.

In principle, we could do a second round of approximation, based on the updated hazard rateestimates. In fact, there is no real need to do this. The approximations are not very sensitive tothe exact value of the hazard rate. If we do substitute the new values in, the estimated fractionof separations occurring in the first year will shift only from 0.418 to 0.419.

Table 8.4: Second approximation Multiple decrement life table.

x `x dMx dSx ˜x µM

x µSx

0–1 1000 180 99.4 860.2 0.209 0.1161–3 720.6 120 138.6 1183 0.101 0.1163–5 462 90 79 755 0.119 0.105

We are now in a position to use the model to draw some potentially interesting conclusions.For instance, we may be interested to know the probability that a cohabitation with children will

LECTURE 8. MULTIPLE-DECREMENTS MODEL 60

end in separation. We need to decide what to do with the lack of observations after 5 years. Forsimplicity, let us assume that rates remain constant after that point, so that all cohabitationswould eventually end in one of these fates. Applying the formula (8.16), we see that

P�separate

=

Z 1

0µSx F (x)dx.

We have then

P�separate

= 0.116

Z 1

0e�0.325xdx+ 0.116

Z 3

1e�0.325�0.217(x�1)dx+ 0.105

Z 1

3e�0.759�0.224(x�3)dx

=0.116

0.325

h1� e�0.325

i+

0.116

0.217

he�0.325 � e�0.759

i+

0.105

0.224

he�0.759

i

= 0.099 + 0.136 + 0.219

= 0.454.

Lecture 9

Continuous-time Markov chains

9.1 General Markov chainsLet S be a countable state space, and M = (Mn)n�0 or X = (Xt)t�0 a time-homogeneous Markovchain with unknown transition probabilities/rates. In this lecture, we will develop methodsto construct maximum likelihood estimators for the transition probabilities/ rates. You maythink of a population model where birth rates and death rates may depend on the populationsize, and also multiple births (or maybe immigration) and multiple deaths (accidents, disasters,emigration) are allowed.

9.1.1 Discrete time, estimation of ⇧-matrix

Suppose we start a Markov chain at M0 = i0 and then observe (M0, . . . ,Mn) = (x0, . . . , xn).The general transition matrix P = (pxy)x,y2S contains our parameters pxy, x, y 2 S, and we canwrite down the likelihood (probability mass function) for our observations

nY

k=1

pxk�1,xk =

Y

x2S

Y

y2Spnxyxy , (9.1)

where nxy is the number of transitions from x to y. Now note that pxx = 1�P

y 6=x pxy so that,as before, the maximum likelihood estimators are

pxy =nxy

nx, provided nx =

X

y2Snxy = #{0 k n� 1 : xk = x} > 0. (9.2)

If nx = 0, then the likelihood is the same for all pxy, y 2 S.

9.1.2 Estimation of the Q-matrix

Suppose we start a continuous-time Markov chain at X0 = i0 and observe (Xs)0sTn whereTn is the nth transition time and record the data as successive holding times ⌧j := Tj � Tj�1

and sequence of states of the jump chain (x0, . . . , xn). The general Q-matrix (qxy)x,y2S containsthe parameters qxy, x, y 2 S, and the likelihood, as a product of densities of holding times andtransition probabilities, is given by

nY

k=1

�xk�1 exp{��xk�1⌧k}qxk�1,xk

�xk�1

=

Y

x2S

Y

y 6=x

qnxyxy exp

��exqxy

, (9.3)

61

LECTURE 9. CONTINUOUS-TIME MARKOV CHAINS 62

where �x = �qxx =P

y 6=x qxy, nxy is the number of transitions from x to y and ex is the totaltime spent in x. This is maximised factor by factor by

qxy = qxy(x0, ⌧0, x1, . . . , ⌧n�1, xn) =nxy

ex, x 6= y. (9.4)

In fact, the holding times and transitions may come from several chains (with the same unknownQ-matrix) without affecting the form of the likelihood, if we define

nxy = n(1)xy + . . .+ n(r)

xy and ex = e(1)x + . . .+ e(r)x (9.5)

for observed chains (X(k)s )

0sT(k)nk

, 1 k r. This is useful to save time by simultaneousobservation, and to reach areas of the state space not previously visited (e.g. for reducible ortransient chains).

9.2 The induced Poisson process (non-examinable)In order to derive a rigorous estimation theory for a general finite-state Markov process, weconsider the embedded Poisson processes that comes from considering the process only when itis in a given state. You might imagine a “state-x estimator” that is tasked with estimating thetransition rates to all other states from state x; its clock runs while the process is in state x, andit counts the transitions, but it slumbers when the process is in any other state.

Suppose we run infinitely many copies of the Markov process, started in states X(1)0 , X(2)

0 , . . .for some lengths of time S(1), S(2), . . . . Suppose that the times S(i) are stopping times: That is,they do not depend upon knowing the future of the process. (For a precise technical definition,see any basic text on stochastic processes, such as [15].) The realisation i makes transitions attimes T (i)

1 , . . . , T (i)Ni

to states X(i)1 , . . . , X(i)

Ni. (We do not wish to rule out the simple possibility

that there is only one run of a positive recurrent Markov process. To admit that alternative,though, would complicate the notation. Instead, we may suppose that the single infinite runis broken into infinitely many pieces, for instance by stopping after each full unit of time, andrestarting with X(i+1)

0 = X(i)Ni

.) We assume that the realisations are independent, except thatthe starting state of a realisation may be dependent on

Consider some fixed state x. Suppose all K realisations visit state x a total of Mx times, andlet ⌧x(j) be the length of the j-th sojourn in x — so that Ex := ⌧x(1) + · · · ⌧x(Mx) is the totalof all the time intervals when the process is in state x. Of the sojourns in x, some end with atransition to a new state, and some end because a stopping time S(i) intervened; define

�x(j) =

(1 if sojourn j ends with a transition;0 if sojourn j ends with a stopping time.

Consider now the random interval [0, Ex], and the set of points

Sx :=�⌧x(1) + · · ·+ ⌧x(j) : s.t. �x(j) = 1

.

The idea is that we take the events that occur only while the process is waiting at state xout, and stitch them together. Theorem 2 tells us that we obtain thereby a Poisson process. Anillustration may be found in Figure 9.1. We start with a strong restatement of the “memoryless”property of the exponential distribution.

LECTURE 9. CONTINUOUS-TIME MARKOV CHAINS 63

Run 1

Run 2

Run 3

Run 4

T1

(1)T

2

(1)T

3

(1)T

4

(1)τ

T1

(2)T

2

(2)T

3

(2)T

4

(2)T

5

(2)

τ

T1

(3)T

2

(3)T

3

(3)T

4

(3)T

5

(3)

τ

T1

(4)T

2

(4)T

3

(4)

τ

Figure 9.1: Illustration of the “stitching-together” construction, by which the process confinedto a particular state generates a marked Poisson process. The Markov process has three states,represented by red, green, and black. We are estimating the transition rates from the red state.The diamond shapes represent the colour to which transitions are made. Stars represent censoredobservations; that is, times at which a realisation of the process was ended (at the time ⌧) instate R, without having transitioned out. The estimates based on these observations would beqRG = 5/E and qRB = 1/E, where E is the total length of the red line at the bottom.

LECTURE 9. CONTINUOUS-TIME MARKOV CHAINS 64

Lemma 1. Suppose T1, T2, . . . is an i.i.d. sequence of exponential random variables with param-eter �, and S1, S2, . . . independent random variables such that each Si is a stopping time withrespect to Ti. That is, Ti � t is independent of Si on the event {Si t}, for any fixed t. LetK = min{k : Tk Sk}. Then

T⇤ := TK +

K�1X

i=1

Si

is exponential with parameter �.

Proof. The stopping-time property tells us that (Ti � Si) is independent of Si on the event{Ti > Si}. Consequently, conditioned on {Ti > Si = s}, (Ti � Si) has the distribution of (Ti � s)conditioned on {Ti > s}, which is exponential with parameter �; and conditioned on {Ti s}, Ti

has exponential distribution with parameter �. Conditioned on {K = 1}, then, it is immediatelytrue that T⇤ = T1 has the correct distribution. Suppose now that T⇤ has the correct distribution,conditioned on {K = k}. Then conditioned on {K = k + 1}, T⇤ := Tk+1 + Sk +

Pk�1i=1 Si. Note

that Sk + Tk+1 conditioned on {K = k + 1} has the same distribution as Tk conditioned on{K = k} (by the induction hypothesis). Since either of these is independent of

Pk�1i=1 Si, the

distribution is the same T⇤ conditioned on {K = k+1} is identical to the distribution conditionedon {K = k}, which completes the induction.

Theorem 2. The random set Sx is a Poisson process with rate qx :=P

y 6=x qxy, and the totaltime Ex is a stopping time for the process. If we condition on (Ex : x 2 X), the processescorresponding to different states are independent. Finally, conditioned on Sx, the transitions thattake place at the times Sx are independent, with the probability of transitioning to y being qxy/qx.

Proof. Consider the interarrival time between two points of Sx. By Lemma 1 it is exponentialwith parameter qx. By the Markov property, the interarrival times are all independent. Hence,these are independent Poisson processes. The independence of the transitions from the waitingtimes is standard.

Define Nx(s) to be the number of visits to state x up to total time s (where total time ismeasured by stitching together the processes X(1)

· , followed by X(2)· , and so on) which end in a

transition (as opposed to ending in a stopping time, and shift to a new realisation of the process).Let Nxy(s) be the number of visits to state x up to total time s which end in a transition tostate y. Let Ex(s) be the total amount of time spent in state x up to total time s. (Thus,P

x2XEx(s) = s identically.)

9.3 Estimation of Markov ratesConsequences of Theorem 2 are:

MLE The maximum likelihood estimator for the rate of a Poisson process is # events/total time.Thus, if we observe realisations of the process which add up to total time S (where S maybe a random stopping time), the MLE for qxy is

qxy(S) =Nxy(S)

Ex(S). (9.6)

Consistency lims!1Nxy(S)Ex(S)

= qxy, on the event {lims!1Ex(s) = 1}. (Question to consider: Howwould it be possible to arrange the realisations of the process so that the condition{lims!1Ex(s) = 1} does not have probability 1?)

LECTURE 9. CONTINUOUS-TIME MARKOV CHAINS 65

Sampling dist. Suppose we run realisations of the process until a random time S that we will callSx(t) := inf{s : Ex(s) = t}; that is, we run the process (in its various successive realisations)until such time as the total time spent in x is exactly t. Then the estimator qxy(S) is equalto Nxy(S)/Ex(S) = Nxy(S)/t. Since Nxy(S) has Poisson distribution with parameter tqxy,this tells us the distribution of qxy(S). Its expectation is qxy, and its variance is qxy/t. Ast ! 1, qxy(Sx(t)) converges to a normal distribution.

The sampling distribution is a complicated matter. If we have observed up to time Sx(t) thenwe know the exact distribution of qxy, and we can compute approximate 100(1� ↵)% confidenceintervals as qxy ± z1�↵/2

pqxy/t, where z is the appropriate quantile of the normal distribution.

Even then, we do not have the exact distribution even for the estimators of transition ratesstarting from any other state.

Can this be a serious problem? Suppose we observe instead up to a constant total time s. Isthe distribution substantially different? In some respects it is, particularly in the tails. Supposewe decompose the estimate by the number of visits to x.

qxy(s) =Nxy(s)

Ex(s)=

1X

n=0

1{Nx(s)=n}Nxy(s)

Ex(s).

The summand corresponding to n = 0 is 0/0, which is problematic. Moving on to n = 1, we havewith probability qxy/qx the expression 1/E, where E is exponential with parameter qx. This hasexpectation 1. Consequently, qxy(s) also has infinite expectation. Other choices of a randomtime S at which to observe the process can similarly distort the distribution.

On the other hand, this is only a problem with the expectation and variance, not with themain bulk of the distribution. That is, as long as S is chosen so that there will be, with highprobability, a large number of visits to x, the normal approximation should be fairly accurate.

The general rule for estimating transition rates is

qxy =# transitions x ! y

total time spent in state x

Var(qxy) ⇡qxy

total time spent in state x

We compute an approximate 100(1� ↵)% confidence interval as qxy ± z↵/2pVar(qxy).

If the number of transitions is not very large, we may do better estimating qxy from the Poissonparameter estimated by Nxy. Exact confidence intervals may be computed, using the identity

P�k Ns

= P

�Tk s

= P

�2kµTk 2kµs

= ↵ for µ = �2k,↵/(2ks).

This is carried out in the exercises.

9.4 Parametric and time-dependent modelsSo far, we have assumed that the Q-matrix was completely arbitrary, i.e. with entries qij � 0,j 6= i, and qii = �

Pj:j 6=i qij > �1. We estimated all “parameters” qij by observing a Markov

chain with that unknown Q-matrix (or several independent Markov chains with the same unknownQ-matrix).

LECTURE 9. CONTINUOUS-TIME MARKOV CHAINS 66

On the other hand, we studied the multiple decrement model, which we can view as acontinuous-time Markov chain, where the Q-matrix contains lots of zeros, namely everywhereexcept in the first row. Here, the maximum likelihood estimates that we derived were of thesame form

numbers of transitions from i to j

total time spent in i, (9.7)

except that we did not specify the zero rows as estimates but as model assumptions.Here, we will merge ideas and study more systematically Markov chain models where the

Q-matrices are not completely unknown. Instead, we assume/know some structure, e.g. certainzero entries and/or that certain transition rates are the same or stand in a known relationship toeach other.

Secondly, we will incorporate time-dependent transition rates (as we had in the multipledecrement model) into the general Markov model.

9.4.1 Example: Marital status model

A natural model for marital status consists of five states “unmarried” (U), “married” (M),“widowed” (W ), “divorced” (D) and “dead” (�).

We can set up a model with 9 parameters corresponding to the 9 possible transitions (U ! M ,U ! �, M ! W , M ! D, M ! �, W ! M , W ! �, D ! M , D ! �). Note that also state� is absorbing and there is no reason to continue to observe chains that have run into this state.This means that we agree that four entries vanish:

q�U = q�M = q�W = q�D = 0. (9.8)

Furthermore, it is also impossible to have direct transitions between U , W and D or indeed togo from M to U , so we also know in advance that

qUD = qDU = qUW = qWU = qDW = qWD = qMU = 0. (9.9)

With states arranged in the above order, this gives a Q-matrix

Q =

0

BBBBB@

�↵� µU ↵ 0 0 µU

0 �⌫ � � � µM ⌫ � µM

0 � �� � µW 0 µW

0 ⇢ 0 �⇢� µD µD

0 0 0 0 0

1

CCCCCA. (9.10)

Alternatively, we can assume that the death rate does not depend on the current state U , M , Wor D, so that the Q-matrix only contains 6 parameters as

Q =

0

BBBBB@

�↵� µ ↵ 0 0 µ0 �⌫ � � � µ ⌫ � µ0 � �� � µ 0 µ0 ⇢ 0 �⇢� µ µ0 0 0 0 0

1

CCCCCA. (9.11)

Finally, we can allow age-varying transition rates by having Q(t), where now the parameters↵(t), µ(t), ⌫(t), �(t), �(t) and ⇢(t) depend on age t.

LECTURE 9. CONTINUOUS-TIME MARKOV CHAINS 67

9.4.2 The general birth-and-death process

The general birth-and-death process is a continuous-time Markov chain with Q-matrix

Q =

0

BBBB@

��0 �0 0 0 0 · · ·µ1 ��1 � µ1 �1 0 0 · · ·0 µ2 ��2 � µ2 �2 0 · · ·... . . . . . . . . . . . . . . .

1

CCCCA, (9.12)

where birth rates �j and death rates µj are in arbitrary (unknown) dependence on j. Note thatthis model has an infinite number of parameters, but just as with unspecified maximal ages inthe single decrement model, we would only estimate (�0, µ1, . . . , µmax,�max), where indeed �max

may be left unspecified or equal to zero for the highest observed population size.This model has the same maximum likelihood estimators as the general Q-matrix with all

entries unknown, since that likelihood factorises completely, and given samples from simplebirth-and-death processes, there are no multiple births and deaths, so the maximum likelihoodestimates of the multiple birth and death rates (jumps of sizes two or higher) are zero. The(remaining) likelihood is

nY

k=1

qXTk�1,XTk

exp{�X

j:j 6=XTk�1

qXTk�1,j(Tk � Tk�1)} =

Y

i2N

Y

|j�i|=1

qNij

ij exp��Eiqij

, (9.13)

where now qi,i+1 = �i and qi,i�1 = µi, i � 1. Nij is the number of transitions from i to j and Ei

is the total time spent in i. Note that we have taken the liberty and written the likelihood interms of the underlying random variables, not the realisations.

9.4.3 Lower-dimensional parametric models of simple birth-and-death pro-

cesses

Often population models come with some additional structure. The simplest structure is thatof independent individuals each giving birth repeatedly at rate � until their death at rate µ.Here �j = j� and µj = jµ, j 2 N, are all expressed in terms of two parameters � and µ. Thelikelihood in this model is the same as in the general model, but has to be factorised as

Y

i2N

Y

|j�i|=1

qNij

ij exp��Eiqij

=

0

@Y

i2N(i�)Ni,i+1 exp {�Eii�}

1

A

0

@Y

i2N(iµ)Ni,i�1 exp {�Eiiµ}

1

A

(9.14)to separate the two parameters. This can best be maximised via the log likelihood, which for theµ-factor is

1X

i=1

�Ni,i�1(log(i) + log(µ))� Eiiµ

�. (9.15)

Differentiation leads to the maximum likelihood estimator µ = D/W where D =P

iNi,i�1 isthe total number of deaths and W =

Pi iEi is the weighted sum of exposure times at population

sizes i 2 N. This quantity has again the interpretation as the total time spent at risk since instate i there are i individuals at risk to die. Similarly, � = B/W , where B is the total number ofbirths. (W is also the total time spent at risk to give birth.)

Note that the state 0 is absorbing, so an nth transition may never come. This can be helpedby running several Markov chains. The likelihood function of the given sample is the one given,if we understand that Nij are aggregate counts and Ei are aggregate waiting times, i, j 2 N.

LECTURE 9. CONTINUOUS-TIME MARKOV CHAINS 68

9.4.4 Asymptotic behaviour of estimates

Let us take up the study of asymptotic properties for the example of Section 9.4.3. Since theestimates are � = B/W and µ = D/W , where W =

Pi iEi, it is worth considering W to be

constant. This can be achieved by observing realisations of the chains until the aggregate waitingtime W reaches a fixed threshold w, i.e. until

⌧w = inf

8<

:t � 0 :

1X

i=1

iEi(t) � w

9=

; . (9.16)

SinceP

i iEi(t) remains constant after absorption in zero, and any observation of holding timesat zero is pointless, we start a new population (think of bacteria colonies, e.g.) as soon as one isextinct, possibly keeping a fixed number > 1 at any given time.

It is not hard to convince oneself (a bit tricky to write down formally, though), that theprocess

Zw = X⌧w (9.17)

is a continuous-time Markov chain with the same jump chain as X, but with i.i.d. Exp(µ+ �)holding times, since t 7! Wt has slope n when X is in state n, so w 7! ⌧w has slope 1/n and anyshort holding time Y ⇠ Exp(n(µ+ �)) of X is expanded to a holding time nY ⇠ Exp(µ+ �) ofZ.

Now, observing X until ⌧w is the same as observing Z until w. Now, Z has transitionstransitions at the times of a Poisson process with rate µ+ �, i.e. a Poi((µ+ �)w) number beforetime w, and each one of them is a birth with probability �/(µ+ �), a death with probabilityµ/(µ+ �). This thinning of the Poisson process results in two independent Poisson processescounting births and deaths, i.e. B ⇠ Poi(�w) and D ⇠ Poi(µw) are independent. This allows toconstruct confidence intervals as in the non-parametric case, also two-dimensional confidenceregions using independence.

9.5 Time-varying transition rates9.5.1 Maximum likelihood estimation

We take up the setting of a general Markov model (Xt)t�0, but have the Q-matrix depend on t,Q(t). For simplicity think of t as age. Denote the finite or countably infinite state space by S.Given we have reached state i 2 S aged exactly y 2 [0,1), the situation is exactly as for themultiple decrement model, there are competing hazards qij(y + t), t � 0, j 6= i, and the totalholding time in state i has hazard rate

� qii(y + t) =X

j:j 6=i

qij(y + t), t � 0. (9.18)

Given the holding time is Z = t, the transition is from i to j with probability

P(Xy+Z = j|Xy = i, Z = t) =qij(y + t)P

j:j 6=i qij(y + t). (9.19)

To be able to estimate time-varying transition rates, we require more than one realisation of X,say realisations (X(m)

t )0tT

(m)nm

, m = 1, . . . , r, where Tnm is the nmth transition time of X(m).

LECTURE 9. CONTINUOUS-TIME MARKOV CHAINS 69

Then the likelihood is given by
$$\prod_{m=1}^{r} \left( \prod_{k=1}^{n_m} q_{X^{(m)}_{T^{(m)}_{k-1}},\, X^{(m)}_{T^{(m)}_{k}}}\bigl(T^{(m)}_k\bigr) \exp\left\{ -\int_{T^{(m)}_{k-1}}^{T^{(m)}_{k}} \sum_{j : j \ne X^{(m)}_{T^{(m)}_{k-1}}} q_{X^{(m)}_{T^{(m)}_{k-1}},\, j}(s)\, ds \right\} \right). \qquad (9.20)$$
If we also postulate simplifying assumptions such as piecewise constant transition rates $q_{ij}(t) = q_{ij}(x+\tfrac12)$ for $x \le t < x+1$, $x \in \mathbb{N}$, we can re-express this in a factorised form as
$$\prod_{x \in \mathbb{N}} \prod_{i \in S} \prod_{j \ne i} q_{ij}(x+\tfrac12)^{N_{ij}(x)} \exp\left\{ -E_i(x)\, q_{ij}(x+\tfrac12) \right\}, \qquad (9.21)$$
where $N_{ij}(x)$ is the number of transitions from $i$ to $j$ at age $x$, i.e. aged $t$ with $x \le t < x+1$, and $E_i(x)$ is the total time spent in state $i$ while aged $x$.

We read off the maximum likelihood estimators, for all $x \in \mathbb{N}$ and $i \in S$ with $E_i(x) > 0$:
$$\hat q_{ij}(x+\tfrac12) = \frac{N_{ij}(x)}{E_i(x)}. \qquad (9.22)$$

9.5.2 Example

Clearly, a reasonably complete set of reasonably reliable estimates can only be obtained if the state space $S$ is small and the number of observations is very large, e.g., in the illness-death model with three states H=able, S=sick and D=dead, with age-dependent sickness rates $\sigma_x$ from H to S, recovery rates $\rho_x$ from S to H, and death rates $\gamma_x$ from H to D and $\delta_x$ from S to D.

Suppose we observe $r$ individuals over their whole lives $[0, \tau^{(m)}_d]$; then we get estimates
$$\hat\gamma_{x+\frac12} = \frac{d_x}{v_x}, \qquad \hat\delta_{x+\frac12} = \frac{c_x}{w_x}, \qquad \hat\sigma_{x+\frac12} = \frac{s_x}{v_x}, \qquad \hat\rho_{x+\frac12} = \frac{r_x}{w_x}, \qquad (9.23)$$
where $v_x$ ($w_x$) is the total waiting time of lives aged $x$ in the able (ill) state, and $d_x, c_x, s_x, r_x$ are the aggregate counts of the respective transitions at age $x$.
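In practice (9.23) amounts to dividing transition counts by the matching exposure, age band by age band. A minimal R sketch, with invented aggregate data:

# Sketch: estimates (9.23) for the illness-death model, with invented aggregate data.
x  <- 60:64                       # age bands [x, x+1)
vx <- c(950, 930, 905, 880, 850)  # person-years spent able while aged x
wx <- c( 40,  45,  52,  60,  70)  # person-years spent ill while aged x
dx <- c(  8,   9,  11,  13,  15)  # deaths from the able state
cx <- c(  5,   6,   8,  10,  13)  # deaths from the ill state
sx <- c( 30,  33,  37,  41,  46)  # able -> ill transitions
rx <- c( 12,  13,  14,  14,  15)  # ill -> able recoveries

estimates <- data.frame(age   = x + 0.5,
                        gamma = dx / vx,   # death rate from able
                        delta = cx / wx,   # death rate from ill
                        sigma = sx / vx,   # sickness rate
                        rho   = rx / wx)   # recovery rate
round(estimates, 4)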

9.6 Occupation times

The last topic we consider is the generalisation of the formulas that we had earlier for life expectancy. For a lifetime with constant mortality rate $\mu$, the expected lifetime is $\mu^{-1}$. Consider now the illness model described in Section 9.5.2.

We say that a matrix is negative-definite if all of its eigenvalues are negative. Define
$$T_x(t) := \text{total time spent in state } x \text{ up to time } t,$$
and ${}_yE_x(t) := \mathbb{E}_y[T_x(t)]$, where $\mathbb{E}_y$ represents the expectation given that the process starts in state $y$.

Theorem 3. Let $Q$ be the $(m+k)\times(m+k)$ transition matrix for a continuous-time discrete-space Markov process with $m$ absorbing states and $k$ non-absorbing states. Let $Q_*$ be the $k \times k$ submatrix consisting of transition rates among the non-absorbing states. If $Q_*$ is irreducible and some row has negative sum, then $Q_*$ is negative definite, and the process is eventually absorbed with probability 1. Then
$${}_yE_x(t) = \left[ Q_*^{-1}\left( e^{tQ_*} - I \right) \right](y, x).$$
The limit ${}_yE_x := \lim_{t\to\infty} {}_yE_x(t)$ is finite and given by the $(y,x)$ entry of $-Q_*^{-1}$.


[Figure 9.2: Diagram of the illness model, with states H and S and transition rates σ, ρ, γ.]

Proof. The matrix of transition probabilities at time $t$ is given by $P_t = e^{tQ}$. Then
$$\begin{aligned}
{}_yE_x(t) &= \mathbb{E}_y\left[ \int_0^t \mathbf{1}_{\{X_s = x\}}\, ds \right] \\
&= \int_0^t P_s(y, x)\, ds \\
&= \left[ \int_0^t e^{sQ}\, ds \right](y, x) \\
&= \left[ \int_0^t e^{sQ_*}\, ds \right](y, x) \quad \text{because the other states are absorbing} \\
&= \left[ Q_*^{-1}\left( e^{tQ_*} - I \right) \right](y, x).
\end{aligned}$$
By negative-definiteness of $Q_*$ we have $\lim_{t\to\infty} e^{tQ_*} = 0$, so this converges to $-Q_*^{-1}$.

We are also interested in the state that the process ends up in. We state the following result in terms of the diagonalisation of $Q$. Of course, in the special case where $Q$ is not diagonalisable, we can carry out the same construction with the Jordan normal form.

Theorem 4. Let $v_1, \ldots, v_m$ be the right-eigenvectors (columns) of $Q$ with eigenvalue 0 (in other words, a basis for the kernel of $Q$), such that $v_i$ has a 1 in coordinate $k+i$ and 0 for the remainder of the absorbing states. Then
$$\mathbb{P}_j\bigl(\text{absorbed in state } k+i\bigr) = (v_i)_j.$$

Proof. Fix some $i$, and let $v_j = \mathbb{P}_j\bigl(\text{absorbed in state } k+i\bigr)$. Let $X_t$ be the position of the process at time $t$. Then $\mathbb{E}_{j_0}[v_{X_t}] = P^t(j_0, \cdot)\, v$. Note that this function is constant in time (by the Chapman–Kolmogorov equation). By the Forward Equation, for all $t > 0$
$$0 = \frac{d}{dt}\, P^t(j_0, \cdot)\, v = P^t(j_0, \cdot)\, Q v,$$
which implies that $Qv = 0$, since $\lim_{t\downarrow 0} P^t$ is the identity matrix. Obviously $v$ has the stated values on the absorbing states.


9.6.1 The multiple decrements model

The simplest application of this formula is to the multiple decrements model. In that case, we have just a single non-absorbing state 0, and absorbing states $1, 2, \ldots, m$. Then $Q_* = (-\lambda_+)$, so that the expected time spent in state 0 is $1/\lambda_+$, which we already knew.

The only non-trivial eigenvector is
$$\left( -1, \ \frac{\lambda_1}{\lambda_+}, \ \cdots, \ \frac{\lambda_m}{\lambda_+} \right).$$
Thus
$$R^{-1} = \begin{pmatrix}
-1 & \lambda_1/\lambda_+ & \lambda_2/\lambda_+ & \cdots & \lambda_m/\lambda_+ \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}$$
Thus, the probability of ending up in state $i$ is $\lambda_i/\lambda_+$.

9.6.2 The illness model

Consider now the illness model, with $\sigma = 0.1$, $\gamma = 0.1$, $\delta = 0.5$, and $\rho = 0$. The generator (taking the states in the order H, S, D) is
$$Q = \begin{pmatrix} -0.2 & 0.1 & 0.1 \\ 0 & -0.5 & 0.5 \\ 0 & 0 & 0 \end{pmatrix}, \qquad Q_* = \begin{pmatrix} -0.2 & 0.1 \\ 0 & -0.5 \end{pmatrix}.$$
We calculate
$$Q_*^{-1} = \begin{pmatrix} -5 & -1 \\ 0 & -2 \end{pmatrix}.$$
Thus, a sick individual survives on average 2 years. A healthy individual survives on average 6 years, of which 1 year, on average, is spent sick.

There is only one absorbing state. If we want to study the state that individuals died from (sick or healthy), one approach is to make two absorbing states, one corresponding to death after being healthy, the other after being sick. The core $Q_*$ matrix stays the same, but now $Q$ becomes
$$Q = \begin{pmatrix} -0.2 & 0.1 & 0.1 & 0 \\ 0 & -0.5 & 0 & 0.5 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.$$
The eigenvectors with eigenvalue 0 are
$$\begin{pmatrix} 0.5 \\ 0 \\ 1 \\ 0 \end{pmatrix}, \qquad \begin{pmatrix} 0.5 \\ 1 \\ 0 \\ 1 \end{pmatrix}.$$
These tell us that someone who starts sick will with certainty die from the sick state (since there is no recovery in this model), while an initially healthy individual will have probability 1/2 of dying from the healthy or the sick state.


Suppose the recovery rate $\rho$ now becomes 1. Then
$$Q = \begin{pmatrix} -0.2 & 0.1 & 0.1 & 0 \\ 1 & -1.5 & 0 & 0.5 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \qquad Q_* = \begin{pmatrix} -0.2 & 0.1 \\ 1 & -1.5 \end{pmatrix}, \qquad Q_*^{-1} = \begin{pmatrix} -7.5 & -0.5 \\ -5 & -1 \end{pmatrix}.$$
Thus, a healthy individual will now live, on average, 8 years, of which only 0.5 will be sick, and someone who is sick will have 6 years on average, with 1 of those sick.

The eigenvectors with eigenvalue 0 are now
$$\begin{pmatrix} 0.75 \\ 0.5 \\ 1 \\ 0 \end{pmatrix}, \qquad \begin{pmatrix} 0.25 \\ 0.5 \\ 0 \\ 1 \end{pmatrix}.$$
Thus we see that when starting from state H, the probability of transitioning to D from H has gone up to 3/4. Starting from S, the probability is now 1/2 of transitioning to D from S, and 1/2 of transitioning from H. This is consistent with the observation we make from the jump chain, that the healthy person transitions to sick or to dead with equal probabilities. Thus,
$$\mathbb{P}_H(\text{last state H}) = \frac12 + \frac12\, \mathbb{P}_S(\text{last state H}) = \frac12 + \frac14 = \frac34.$$
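These occupation times and absorption probabilities are easy to reproduce numerically. A minimal R sketch (illustration only; the layout of Q follows the four-state example above):

# Sketch: occupation times and absorption probabilities for the illness model
# with sickness rate 0.1, death rates 0.1 (from H) and 0.5 (from S), recovery rate 1.
# States ordered H, S, D_H, D_S (death after healthy / after sick).
Q <- rbind(c(-0.2,  0.1, 0.1, 0  ),
           c( 1.0, -1.5, 0,   0.5),
           c( 0,    0,   0,   0  ),
           c( 0,    0,   0,   0  ))
Qstar <- Q[1:2, 1:2]

# Expected occupation times: entry (y, x) of -Qstar^{-1}
occupation <- -solve(Qstar)
occupation            # row H: 7.5 years healthy, 0.5 sick; row S: 5 and 1
rowSums(occupation)   # total expected survival: 8 from H, 6 from S

# Absorption probabilities: solving Q v = 0 with v fixed to an indicator on the
# absorbing coordinates gives v_transient = -Qstar^{-1} %*% Q[1:2, 3:4]
absorb <- -solve(Qstar) %*% Q[1:2, 3:4]
absorb                # row H: 0.75 (die from H), 0.25 (die from S); row S: 0.5, 0.5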

Lecture 10

Survival analysis: Introduction

10.1 Incomplete observations: Censoring and truncation

Why do we need a special statistical theory of event times? If we observe a collection of random times $T_1, \ldots, T_n$, we may study their distribution just as we study any other random variables. The problem is, event times are often incompletely observed, and the incompleteness of the data is connected with the one-dimensional structure of time.

The most common incomplete observation for event times is right censoring. Right censoring occurs when a subject leaves the study before an event occurs, or the study ends before the event has occurred. For example, we consider patients in a clinical trial to study the effect of treatments on stroke occurrence. The study ends after 5 years. Those patients who have had no strokes by the end of the study are censored. If the patient leaves the study at time $t_e$, then the event occurs in $(t_e, \infty)$.

Left censoring is when the event of interest has already occurred before enrolment. This is very rarely encountered.

Truncation is deliberate and due to study design.

Right truncation occurs when the entire study population has already experienced the event of interest (for example: a historical survey of patients on a cancer registry).

Left truncation occurs when the subjects have been at risk before entering the study (for example: life insurance policy holders where the study starts on a fixed date, and the event of interest is age at death).

Generally we deal with right censoring and left truncation.

Three types of independent right censoring:

Type I: All subjects start and end the study at the same fixed time. All are censored at the end of the study. If individuals have different but fixed censoring times, this is called progressive type I censoring.

Type II: The study ends when a fixed number of events amongst the subjects has occurred.

Type III or random censoring: Individuals drop out or are lost to follow-up at times that are random, rather than predetermined. We generally assume that the censoring times are independent of the event times, in which case there is no need to distinguish this from Type I.

Skeptical question: Why do we need special techniques to cope with incomplete observations? Aren't all observations incomplete? After all, we never see all possible samples from the distribution. If we did, we wouldn't need any sophisticated statistical analysis.

The point is that most of the basic techniques that you have learned assume that the observed values are interchangeable with the unobserved values. The fact that a value has been observed does not tell us anything about what the value is. In the case of censoring or truncation, there is


dependence between the event of observation and the value that is observed. In right-censoring, for instance, the fact of observing a time implies that it occurred before the censoring time. The distribution of a time conditioned on its being observed is thus different from the distribution of the times that were censored.

There are different levels of independence, of course. In the case of random (type III) censoring, the censoring time itself is independent of the (potentially) observed time. In Type II censoring, the censoring time depends in a complicated way on all the observation times.

10.2 Likelihood and Right Censoring

10.2.1 Random censoring

Suppose that $\widetilde T$ is the time to event and that $C$ is the time to the censoring event. Assume that all subjects may have an event or be censored. That is, we get to observe for each individual $i$ only $t_i := \min\{\tilde t_i, c_i\}$, and the censoring indicator
$$\delta_i := \begin{cases} 1 & \text{if } \tilde t_i \le c_i, \\ 0 & \text{if } \tilde t_i > c_i. \end{cases}$$
It was shown by [23] that it is impossible to reconstruct the joint distribution of $(\widetilde T, C)$ from these data. It is possible to reconstruct the marginal distributions under the assumption that event times $\widetilde T$ and censoring times $C$ are independent.

Define
$f(t)$ and $f_C(t)$ to be the densities,
$S(t)$ and $S_C(t)$ to be the survival functions,
$h(t)$ and $h_C(t)$ to be the hazard functions
of the event time and the censoring time respectively. Then the likelihood is
$$L = \prod_{\delta_i = 1} f(\tilde t_i)\, S_C(\tilde t_i) \prod_{\delta_i = 0} S(c_i)\, f_C(c_i)
= \prod_i h(t_i)^{\delta_i} S(t_i) \times \prod_i h_C(t_i)^{1-\delta_i} S_C(t_i).$$
Since this factors into an expression involving the distribution of $\widetilde T$ and one involving the distribution of $C$, we may perform likelihood inference on the event-time distribution without reference to the censoring-time distribution.
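To make the factorisation concrete, here is a small R sketch fitting a simple exponential event-time model to right-censored data by maximising only the event-time part $\prod_i h(t_i)^{\delta_i} S(t_i)$; for the exponential, the maximiser has the closed form $\hat\lambda = \sum_i \delta_i / \sum_i t_i$. The data below are invented.

# Sketch: MLE of an exponential rate under independent right censoring.
t     <- c(2, 5, 5, 6, 7, 7, 12, 14, 14, 14)   # observed times (event or censoring)
delta <- c(1, 1, 1, 0, 1, 0, 1,  0,  0,  0)    # 1 = event, 0 = censored

# Log-likelihood of the event-time part only: sum(delta*log(lambda) - lambda*t)
loglik <- function(lambda) sum(delta * log(lambda) - lambda * t)

lambda.hat <- sum(delta) / sum(t)                      # closed-form maximiser
lambda.hat
optimize(loglik, c(1e-6, 1), maximum = TRUE)$maximum   # numerical check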

10.2.2 Non-informative censoring

As stated above, the random censoring assumption is both too strong and too weak. It is too strong because it makes highly restrictive assumptions about the nature of the censoring process. Note that the factorisation above depends upon the assumption of independent censoring. In fact, the same conclusion follows from a weaker assumption, called non-informative censoring.

We generally want to retain the assumption that the event times $\widetilde T_i$ are jointly independent. The censoring process is called non-informative if it satisfies the following conditions:

1. For each fixed $t$, and each $i$, the distribution of $(T_i - t)\mathbf{1}_{\{T_i > t\}}$ is independent of $\{C_i > t\}$. That is, knowing that an individual was or was not censored by time $t$ gives no information about the lifetime remaining after time $t$, if they were still alive at time $t$.

2. The event-time distribution and the censoring distribution do not depend on the same parameters.

The latter condition cannot be formalised in a frequentist framework, since we would be taking the distributions as fixed (but unknown), allowing no formal way to define "dependence". In a parametric setting this is simply the assertion that we maximise the likelihood in $\widetilde T$ without regard to the distribution of $C$. (Note that in the absence of an independence assumption, such as the first condition above, it would not be possible to define a joint distribution in which the distribution of $\widetilde T$ varies in a way that the distribution of $C$ ignores.) In a Bayesian context we are asserting that the prior distribution on the distribution of $(\widetilde T, C)$ makes the distribution of $\widetilde T$ independent of the distribution of $C$.

Note that this is a convenient modelling assumption, not an assumption about the underlying process. We are free to impose this assumption. If the assumption has been imposed incorrectly, the main effect will be only to reduce the efficiency of estimation.

The first assumption, on the other hand — or some substitute assumption — is absolutely necessary for the procedures we describe — or, indeed, any such procedure — to yield unbiased estimates. In a regression setting where we consider the covariates $x_i$ to be random, we weaken the independence assumptions to be merely independence conditional on the observed covariates.

As an example, Type II censoring is clearly not independent, but it is non-informative.

Warning: The definitions of random and non-informative censoring are not entirely standardised.

10.3 Time on test

Our models include a "time" parameter, whose interpretation can vary. First of all, in population-level models (for instance, a birth-death model of population growth, where the state represents the number of individuals) the time is true calendar time, while in individual-level models (such as our multiple-decrement model of death due to competing risks, or the healthy-sick-dead process, where there is a single model run for each individual) the time parameter is more likely to represent individual age. Within the individual category, the time to event can literally be the age, for instance in a life insurance policy. In a clinical trial it will more typically be time from admission to the trial, also called "time on test".

For example, consider the following data from a Sydney hospital pilot study, concerning the treatment of bladder cancer:

Time to cancer   Time to recurrence   Time between   Recurrence status
 0.00             4.97                 4.97           1
21.02            22.99                 1.97           1
45.03            61.09                16.05           0
52.17            55.03                 2.86           1
48.06            65.03                16.97           0

All times are in months. Each patient has their own zero time, the time at which the patient entered the study (accrual time). For each patient we record the time to the event of interest or the censoring time, whichever is smaller, and the status, $\delta = 1$ if the event occurs and $\delta = 0$ if the patient is censored. If it is the recurrence that is of interest, then the relevant time, the "time between", is measured relative to the zero time that is the onset of cancer.


10.4 Non-parametric survival estimation

10.4.1 Review of basic concepts

Consider random variables $X_1, \ldots, X_n$ which represent independent observations from a distribution with cdf $F$. Given a class $\mathcal{F}$ of possibilities for $F$, an estimator is a choice of the "best" one, on the basis of the data. That is, it is a function from $\mathbb{R}^n_+$ to $\mathcal{F}$ that maps a collection of observations $(x_1, \ldots, x_n)$ to an element $\hat F$ of $\mathcal{F}$.

Estimators for distribution functions may be either parametric or non-parametric, depending on the nature of the class $\mathcal{F}$. The distinction is not always clear-cut. A parametric estimator is one for which the class $\mathcal{F}$ depends on some collection of parameters. For example, it might be the two-dimensional family of all gamma distributions. A non-parametric estimator is one that does not impose any such parametric assumptions, but allows the data to "speak for themselves". There are intermediate non-parametric approaches as well, where an element of $\mathcal{F}$ is not defined by any small number of parameters, but is still subject to some constraint. For example, $\mathcal{F}$ might be the class of distributions with smooth hazard rate, or it might be the class of log-concave distribution functions (equivalent to having increasing hazard rate). We will also be concerned with semi-parametric estimators, where an underlying infinite-dimensional class of distributions is modified by one or two parameters of special interest.

The disadvantage of parametrisation is always that it distorts the observations; the advantage is that it allows the data from different observations to be combined into a single parameter estimate. (Of course, if the data are known to come from some distribution in the parametric family, the "distortion" is also an advantage, because the real distortion was in the data, due to random sampling.)

We start by considering nonparametric estimators of the cdf. These have the advantage of limiting the assumptions imposed upon the data, but the disadvantage of being too strictly limited by the data. That is, taken literally, the estimator we obtain from a sample of observed times will imply that only exactly those times actually observed are possible.

If there are observations $x_1, \ldots, x_n$ from a random sample then we define the empirical distribution function
$$\hat F(x) = \frac{1}{n}\, \#\{x_i : x_i \le x\}.$$
This is an appropriate non-parametric estimator for the cdf — it is consistent, for example, and $\hat F(x)$ is an unbiased estimator for $F(x)$ for each fixed $x$ — in the absence of censoring. It needs to be modified when the data are censored.

Suppose that the observations are $(T_i, \delta_i)$ for $i = 1, 2, \ldots, n$. We consider the component of the likelihood that relates to the distribution of the survival times:
$$L = \prod_i f(T_i)^{\delta_i} S(T_i)^{1-\delta_i} = \prod_i f(T_i)^{\delta_i} \bigl(1 - F(T_i)\bigr)^{1-\delta_i}.$$

10.4.2 Kaplan–Meier estimator

The first nonparametric estimator for the survival function in the presence of non-informative right censoring was described by Kaplan and Meier in their 1958 paper [14]. The work has been cited more than 50 thousand times, which probably underestimates its influence, since the Kaplan–Meier estimator has become so ubiquitous in survival analysis that most applications do not bother to cite a source.


The basic idea is the following: There is no way to estimate a hazard rate from data without some kind of smoothing, so the most direct representation of the data comes from estimating the survival function directly. On any interval $[a, b)$ on which no events have been observed, it is natural to estimate $S(b) - S(a) = 0$. That is, the estimated probability of an event occurring in this interval is the empirical probability 0. This leads us to estimate the survival function by a step function whose jumps are at the points $t_j$ where events have been observed.

We conventionally list the event times (not the censoring times) in order as $(t_0 = 0 <)\ t_1 < \cdots < t_j < \cdots$. If there are ties — several individuals $i$ with $T_i = t_j$ and $\delta_i = 1$ — we represent this by a count $d_j$, being the number of individuals whose event was at $t_j$. (Thus all $d_j$ are at least 1.)

At the point $t_j$ there is a drop in $S$. (Like cdfs, survival functions are taken to be right-continuous: $S(t) = \mathbb{P}\{\widetilde T > t\}$.) The size of the drop at an event time $t_j$ is
$$S(t_j-) - S(t_j) = \mathbb{P}\{\widetilde T = t_j\}$$
since $S(t_j-) = \mathbb{P}\{\widetilde T \ge t_j\}$. We may rewrite this as
$$S(t_j) = S(t_j-)\left( 1 - \frac{\mathbb{P}\{\widetilde T = t_j\}}{S(t_j-)} \right) = S(t_j-)\Bigl( 1 - \mathbb{P}\bigl\{\widetilde T = t_j \,\big|\, \widetilde T \ge t_j \bigr\} \Bigr).$$
Estimating the change in $S$ is then a matter of estimating this conditional probability, which is also known as the discrete hazard $h_j$. We have a natural estimator (which is also the MLE) for this conditional probability, namely the empirical fraction of those observed to meet the condition — that is, they survived at least to time $t_j$ — who died at time $t_j$. This will be $d_j/n_j$, where $n_j$ is the number of individuals at risk at time $t_j$ — that is, individuals $i$ such that $T_i \ge t_j$, meaning that they have neither been censored, nor have they had their event, before time $t_j$.

The Kaplan-Meier estimator is the result of this process:
$$\hat S(t) = \prod_{t_j \le t} \bigl(1 - \hat h_j\bigr) = \prod_{t_j \le t} \left( 1 - \frac{d_j}{n_j} \right)$$
where
$$n_j = \#\{\text{in risk set at } t_j\}, \qquad d_j = \#\{\text{events at } t_j\}.$$
Note
$$n_{j+1} + c_j + d_j = n_j$$
where $c_j = \#\{\text{censored in } [t_j, t_{j+1})\}$. If there are no censored observations before the first failure time then $n_0 = n_1 = \#\{\text{in study}\}$. Generally we assume $t_0 = 0$.

10.4.3 Nelson–Aalen estimator and new estimator of S

The Nelson–Aalen estimator for the cumulative hazard function is
$$\hat H(t) = \sum_{t_j \le t} \frac{d_j}{n_j} \quad \left( = \sum_{t_j \le t} \hat h_j \right).$$


This is natural for a discrete estimator, as we have simply summed the estimates of the hazards at each time, instead of integrating, to get the cumulative hazard. This correspondingly gives an estimator of $S$ of the form
$$\tilde S(t) = \exp\bigl( -\hat H(t) \bigr) = \exp\left( -\sum_{t_i \le t} \frac{d_i}{n_i} \right).$$
It is not difficult to show, by comparing the functions $1 - x$ and $\exp(-x)$ on the interval $0 \le x \le 1$, that $\tilde S(t) \ge \hat S(t)$.

10.4.4 Invented data set

Suppose that we have 10 observations in the data set with failure times as follows:

2, 5, 5, 6+, 7, 7+, 12, 14+, 14+, 14+ (10.1)

Here + indicates a censored observation. Then we can calculate both estimators for $S(t)$ at all time points. It is considered unsafe to extrapolate much beyond the last time point, 14, even with a large data set.

Table 10.1: Computations of survival estimates for invented data set (10.1)

tj   dj   nj   hj     Ŝ(tj)   S̃(tj)
 2    1   10   0.10   0.90    0.90
 5    2    9   0.22   0.70    0.72
 7    1    6   0.17   0.58    0.63
12    1    4   0.25   0.44    0.54
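The numbers in Table 10.1 can be recomputed with a few lines of R. The sketch below builds the Kaplan–Meier and $\tilde S = e^{-\hat H}$ columns from the raw observation times; small discrepancies with the printed $\tilde S$ values can arise from rounding.

# Sketch: hand-rolled Kaplan-Meier and Nelson-Aalen estimates for data set (10.1).
time  <- c(2, 5, 5, 6, 7, 7, 12, 14, 14, 14)
event <- c(1, 1, 1, 0, 1, 0, 1,  0,  0,  0)    # 1 = event, 0 = censored

tj <- sort(unique(time[event == 1]))            # distinct event times
dj <- sapply(tj, function(t) sum(time == t & event == 1))  # events at tj
nj <- sapply(tj, function(t) sum(time >= t))    # number at risk at tj

hj   <- dj / nj                                 # discrete hazard estimates
S.KM <- cumprod(1 - hj)                         # Kaplan-Meier
S.NA <- exp(-cumsum(hj))                        # exp(-Nelson-Aalen)

data.frame(tj, dj, nj, hj = round(hj, 2),
           S.KM = round(S.KM, 2), S.NA = round(S.NA, 2))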

10.5 Example: The AML study

In the 1970s it was known that individuals who had gone into remission after chemotherapy for acute lymphatic leukemia would benefit — by longer remission times — from a course of continuing "maintenance" chemotherapy. A study [6] pointed out that "Despite a lack of conclusive evidence, it has been assumed that maintenance chemotherapy is useful in the management of acute myelogenous leukemia (AML)." The study set out to test this assumption, comparing the duration of remission between an experimental group that received the additional chemotherapy, and a control group that did not. (This analysis is based on the discussion in [20].)

The data are from a preliminary analysis, before completion of the study. The duration of complete remission in weeks was given for each patient (11 maintained, 12 non-maintained controls); those who were still in remission at the time of the analysis are censored observations. The data are given in Table 10.2. They are included in the survival package of R, under the name aml.

The first thing we do is to estimate the survival curves. The summary data and computations are given in Table 10.3. The Kaplan-Meier survival curves are shown in Figure 10.1.

In Lecture 11 we describe how to compute confidence intervals for the survival probabilities.


Table 10.2: Times of complete remission for preliminary analysis of AML data, in weeks. Censored observations denoted by +.

maintained 9 13 13+ 18 23 28+ 31 34 45+ 48 161+

non-maintained 5 5 8 8 12 16+ 23 27 30 33 43 45

Table 10.3: Computations for the Kaplan–Meier and Nelson–Aalen survival curve estimates of the AML data.

                  Maintenance                       Non-Maintenance (control)
ti   ni  di   hi    Ŝ(ti)   Hi    S̃(ti)    ni  di   hi    Ŝ(ti)   Hi    S̃(ti)
 5   11   0  0.00   1.00   0.00   1.00     12   2  0.17   0.83   0.17   0.85
 8   11   0  0.00   1.00   0.00   1.00     10   2  0.20   0.67   0.37   0.69
 9   11   1  0.09   0.91   0.09   0.91      8   0  0.00   0.67   0.37   0.69
12   10   0  0.00   0.91   0.09   0.91      8   1  0.12   0.58   0.49   0.61
13   10   1  0.10   0.82   0.19   0.83      7   0  0.00   0.58   0.49   0.61
18    8   1  0.12   0.72   0.32   0.73      6   0  0.00   0.58   0.49   0.61
23    7   1  0.14   0.61   0.46   0.63      6   1  0.17   0.49   0.66   0.52
27    6   0  0.00   0.61   0.46   0.63      5   1  0.20   0.39   0.86   0.42
30    5   0  0.00   0.61   0.46   0.63      4   1  0.25   0.29   1.11   0.33
31    5   1  0.20   0.49   0.66   0.52      3   0  0.00   0.29   1.11   0.33
33    4   0  0.00   0.49   0.66   0.52      3   1  0.33   0.19   1.44   0.24
34    4   1  0.25   0.37   0.91   0.40      2   0  0.00   0.19   1.44   0.24
43    3   0  0.00   0.37   0.91   0.40      2   1  0.50   0.10   1.94   0.14
45    3   0  0.00   0.37   0.91   0.40      1   1  1.00   0.00   2.94   0.05
48    2   1  0.50   0.18   1.41   0.24      0   0


[Figure 10.1: Kaplan-Meier estimates of survival in maintenance (black) and non-maintenance groups in the AML study. Time in weeks (0 to 50) on the horizontal axis, survival from 0 to 1 on the vertical axis.]

Lecture 11

Confidence intervals and left truncation

11.1 Greenwood's formula

We need to find confidence intervals (pointwise) for the estimators of $S(t)$ at each time point. The estimators are both based on the discrete hazard $h_j$ at event time $t_j$, which we estimate by $\hat h_j = d_j/n_j$:
$$\text{Kaplan–Meier:}\quad \hat S(t) = \prod_{t_j \le t} \bigl(1 - \hat h_j\bigr), \qquad \text{Nelson–Aalen:}\quad \tilde S(t) = \exp\Bigl\{ -\sum_{t_j \le t} \hat h_j \Bigr\}.$$

11.1.1 The cumulative hazard

Suppose the survival function were genuinely discrete, with conditional probability $h_j$ (the "discrete hazard") of an event occurring at $t_j$ for any individual $i$ with $T_i \ge t_j$. Then $\hat h_j = d_j/n_j$ is an unbiased estimator for $h_j$. The $d_j$ are binomially distributed with parameters $(n_j, h_j)$, hence variance $n_j h_j(1-h_j)$. Furthermore, while the numbers of events $d_j$ are not independent — the underlying parameter $n_j$, the number of individuals at risk, depends on the outcomes of all the survival events preceding $t_j$ — the expected value of $d_j/n_j$ is always $h_j$. If the $n_j$ are large, then, we may write
$$\hat h_j = h_j + Z_j \sqrt{\frac{h_j(1-h_j)}{n_j}},$$
where the $Z_j$ are approximately independent standard normal, and approximately independent of the $n_j$. If we take this to be real independence, then conditioned on the $n_j$ the variance of the sum is the sum of the variances, so

$$\hat H(t) - H(t) = \sum_{t_j \le t} (\hat h_j - h_j) = \left( \sum_{t_j \le t} \frac{h_j(1-h_j)}{n_j} \right)^{1/2} Z,$$
where $Z$ is approximately standard normal. Substituting, as usual, $\hat h_j$ for the unknown $h_j$ we get the approximation
$$\hat H(t) - H(t) \approx \left( \sum_{t_j \le t} \frac{d_j(n_j - d_j)}{n_j^3} \right)^{1/2} Z. \qquad (11.1)$$


This yields
$$\sigma^2(t) := \sum_{t_j \le t} \frac{d_j(n_j - d_j)}{n_j^3} \qquad (11.2)$$
as an estimator for the variance of $\hat H(t)$. So an approximate $100(1-\alpha)\%$ confidence interval for $H(t)$ is
$$\hat H(t) \pm z_{1-\alpha/2}\, \sigma(t). \qquad (11.3)$$
In fact, we don't want to model the cumulative hazard as a step function with fixed jump times. We could imagine interpolating extra "jump times", when no events happen to have been observed — hence with discrete hazard estimate 0 — and note that the estimate of $H$ and its variance is unchanged, allowing us to approximate a continuous cumulative hazard to any accuracy we wish. To see a more mathematically satisfying derivation for $\sigma(t)$, that uses martingale methods to avoid discretisation, see [1].

11.1.2 The survival function and Greenwood’s formula

We can apply (11.3) directly to obtain a confidence interval for $S(t) = e^{-H(t)}$:
$$\left( \exp\bigl\{ -\hat H(t) - z_{1-\alpha/2}\sigma(t) \bigr\},\ \exp\bigl\{ -\hat H(t) + z_{1-\alpha/2}\sigma(t) \bigr\} \right)
= \left( \tilde S(t)\, e^{-z_{1-\alpha/2}\sigma(t)},\ \tilde S(t)\, e^{z_{1-\alpha/2}\sigma(t)} \right). \qquad (11.4)$$

While the confidence interval (11.4) is preferable in some respects, we also need a confidence interval centred on the traditional Kaplan–Meier estimator. To derive an estimator for the variance we start with the formula
$$\log \hat S(t) = \sum_{t_j \le t} \log\bigl(1 - \hat h_j\bigr).$$
The delta method gives us the variance of the increments to be approximately
$$(1 - h_j)^{-2}\operatorname{Var}(\hat h_j) = \frac{h_j}{n_j(1 - h_j)} \approx \frac{d_j}{n_j(n_j - d_j)}.$$
By the same reasoning as in Section 11.1.1 we get an estimator
$$\sum_{t_j \le t} \frac{d_j}{n_j(n_j - d_j)}$$
for the variance of $\log \hat S(t)$. We may use this exactly as in (11.4) to produce confidence intervals for $\log S(t)$, and hence for $S(t)$. Traditionally we apply the delta method again to produce the estimator
$$\sigma_G^2(t) := \hat S(t)^2 \sum_{t_j \le t} \frac{d_j}{n_j(n_j - d_j)}$$
for the variance of $\hat S(t)$. This is called Greenwood's formula.

A remark on the variance estimators: the quantity that $\sigma^2$ of (11.2) is estimating is actually a conditional variance, conditioned on the $n_j$. The true variance of $\hat H(t)$ is the expected value of this quantity. This is much like the distinction between observed and expected information. And as in that situation you are free to take one of two perspectives: you could think of the "observed variance" as a good-enough approximation to the expected variance, given that the expected variance is hard to compute — and actually includes a non-zero probability of a 0/0. But more appropriate is to refer to (11.1), to see that the observed variance is the thing you really want, since the error of this observation is approximately normal with SD $\sigma(t)$.
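As a quick illustration, the following R sketch computes the Greenwood standard errors and log-scale confidence intervals by hand, and then fits the same data with survfit from the survival package for comparison; the invented data set (10.1) is reused.

# Sketch: Greenwood's formula computed by hand and checked against survfit().
library(survival)
time  <- c(2, 5, 5, 6, 7, 7, 12, 14, 14, 14)
event <- c(1, 1, 1, 0, 1, 0, 1,  0,  0,  0)

tj <- sort(unique(time[event == 1]))
dj <- sapply(tj, function(t) sum(time == t & event == 1))
nj <- sapply(tj, function(t) sum(time >= t))

S.KM     <- cumprod(1 - dj / nj)
var.logS <- cumsum(dj / (nj * (nj - dj)))        # variance estimate for log S-hat
se.green <- S.KM * sqrt(var.logS)                # Greenwood SE of S-hat
z        <- qnorm(0.975)
lower    <- exp(log(S.KM) - z * sqrt(var.logS))  # CI on the log-survival scale
upper    <- pmin(1, exp(log(S.KM) + z * sqrt(var.logS)))

cbind(tj, S.KM, se.green, lower, upper)

fit <- survfit(Surv(time, event) ~ 1)            # the same quantities from the package
summary(fit)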


11.1.3 Reminder of the delta method

If the random variation of $Y$ around $\mu$ is small (for example if $\mu$ is the mean of $Y$ and $\operatorname{var} Y$ has order $\frac1n$), we use:
$$g(Y) \approx g(\mu) + (Y - \mu)\, g'(\mu) + \frac12 (Y-\mu)^2\, g''(\mu) + \cdots$$
Taking expectations,
$$\mathbb{E}\bigl(g(Y)\bigr) = g(\mu) + O\left(\frac1n\right), \qquad \operatorname{var}\bigl(g(Y)\bigr) = g'(\mu)^2 \operatorname{var} Y + o\left(\frac1n\right).$$
More generally, if $Y_1, \ldots, Y_k$ are highly concentrated around their means, and approximately multivariate normal with covariances $\Sigma_{ij}$ (where $\Sigma_{ii} = \operatorname{Var} Y_i$), we may write
$$Y_i \approx \mu_i + \sum_j A_{ij} Z_j$$
where $A$ is the symmetric square-root of $\Sigma$ (that is, $A$ is symmetric and $A^2 = \Sigma$) and $Z$ is a vector of independent standard normal random variables. If $X = g(Y_1, \ldots, Y_k)$ is any function that is well-behaved near $\mu$, we may use the first-order Taylor expansion to write
$$g(Y) \approx g(\mu) + \sum_{i,j} A_{ij} \frac{\partial g}{\partial y_i}(\mu)\, Z_j.$$
To this order,
$$\mathbb{E}\bigl[ g(Y_1,\ldots,Y_k) \bigr] \approx g(\mu_1, \ldots, \mu_k), \qquad
\operatorname{Var}\bigl( g(Y_1, \ldots, Y_k) \bigr) \approx \sum_{i,j} \frac{\partial g}{\partial y_i}(\mu)\, \frac{\partial g}{\partial y_j}(\mu)\, \Sigma_{ij}. \qquad (11.5)$$
In the special case $k = 1$ this yields the familiar form
$$\mathbb{E}\bigl[ g(Y) \bigr] \approx g(\mu), \qquad \operatorname{Var}\bigl( g(Y) \bigr) \approx g'(\mu)^2 \sigma^2. \qquad (11.6)$$

When $k = 2$ it becomes
$$\mathbb{E}\bigl[ g(Y_1, Y_2) \bigr] \approx g(\mu_1, \mu_2),$$
$$\operatorname{Var}\bigl( g(Y_1, Y_2) \bigr) \approx \left( \frac{\partial g}{\partial y_1}(\mu) \right)^2 \sigma_1^2 + \left( \frac{\partial g}{\partial y_2}(\mu) \right)^2 \sigma_2^2 + 2 \left( \frac{\partial g}{\partial y_1}(\mu) \right)\left( \frac{\partial g}{\partial y_2}(\mu) \right) \operatorname{Cov}(Y_1, Y_2). \qquad (11.7)$$
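A quick simulation check of (11.6) can be reassuring. The R sketch below (illustration only) compares the delta-method approximation for the variance of $\log \hat p$, where $\hat p$ is a binomial proportion, with the variance observed over repeated samples.

# Sketch: delta method for g(p-hat) = log(p-hat), with p-hat a binomial proportion.
set.seed(1)
n <- 200; p <- 0.3
phat <- rbinom(10000, n, p) / n        # 10,000 simulated proportions

# Delta method: Var(log phat) ~ g'(p)^2 Var(phat) = (1/p)^2 * p(1-p)/n
delta.approx <- (1 - p) / (p * n)
empirical    <- var(log(phat))

c(delta = delta.approx, simulated = empirical)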

11.2 Left truncation

Left truncation is easily dealt with in the context of nonparametric survival estimation. Suppose the invented data set comes from the following hidden process: There is an event time, and an independent censoring time, and, in addition, a truncation time, which is the time when that individual becomes available to be studied. For example, suppose this were a nursing home population, and the time being studied is the number of years after age 80 when the patient first shows signs of dementia. The censoring time might be the time when the person dies or moves away, or when the study ends. The study population consists of those who have entered the nursing home free of dementia. The truncation time would be the age at which the individual moves into the nursing home.


Table 11.1: Invented data illustrating left truncation. Event times after the censoring time may be purely nominal, since they may not have occurred at all; these are marked with *. The row Observation shows what has actually been observed. When the event time comes before the truncation time the individual is not included in the study; this is marked by a −.

Patient ID        5    2    9    0    1    3    7    6    4    8
Event time        2    5    5    *    7    *   12    *    *    *
Censoring time   10    8    7    8   11    7   14   14   14   14
Truncation time  −2    3    6    0    1    0    6    6   −5    1
Observation       2    5    −   8+    7   7+   12  14+  14+  14+

Table 11.2: Computations of survival estimates for invented data set of Table 11.1.

tj   dj   nj   hj     Ŝ(tj)   S̃(tj)
 2    1    6   0.17   0.83    0.85
 5    1    6   0.17   0.69    0.72
 7    1    7   0.14   0.58    0.62
12    1    4   0.25   0.45    0.48

We give a version of these data in Table 11.1. Note that patient number 9 was truncated at time 6 (i.e., entered the nursing home at age 86) but her event was at time 5 (i.e., she had already suffered from dementia since age 85), hence was not included in the study. In Table 11.2 we give the computations for the Kaplan–Meier estimate of the survival function. The computations are exactly the same as those of Section 10.4.4, except for one important change: The number at risk $n_j$ is not simply the number $n - \sum_{t_j < t} d_j - \sum_{t_j < t} c_j$ of individuals who have not yet had their event or censoring time. Rather, an individual is at risk at time $t$ if her event time and censoring time are both $\ge t$, and if the truncation time is $\le t$. (As usual, we assume that individuals who have their event or are censored in a given year were at risk during that year. We are similarly assuming that those who entered the study at age $x$ are at risk during that year.) At the start of our invented study there are only 6 individuals at risk, so the estimated hazard for the event at age 2 becomes 1/6.

In the most common cases of truncation we need do nothing at all, other than be careful in interpreting the results. For instance, suppose we were simply studying the age after 80 at which individuals develop dementia by a longitudinal design, where 100 healthy individuals 80 years old are recruited and followed for a period of time. Those who are already impaired at age 80 are truncated. All this means is that we have to understand (as we surely would) that the results are conditional on the individual not suffering from dementia until age 80.
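Left truncation is handled in R with the counting-process form of Surv (entry time, exit time, status), described in Section 11.5.2 below. A minimal sketch using the data of Table 11.1 (patient 9, whose event precedes her truncation time, is omitted; negative entry times are assumed to be acceptable to the package):

# Sketch: left-truncated Kaplan-Meier for the invented data of Table 11.1.
library(survival)
entry  <- c(-2, 3, 0, 1, 0, 6, 6, -5, 1)     # truncation times (patient 9 excluded)
exit   <- c( 2, 5, 8, 7, 7, 12, 14, 14, 14)  # observation times
status <- c( 1, 1, 0, 1, 0, 1,  0,  0,  0)   # 1 = event, 0 = censored

fit <- survfit(Surv(entry, exit, status) ~ 1)
summary(fit)   # n.risk at time 2 should be 6, as in Table 11.2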


We can compute variances for the Kaplan-Meier and Nelson-Aalen estimators using Greenwood's formula exactly as before, only taking care to use the reinterpreted number at risk. The one problem that arises is that individuals may enter into the study slowly, yielding a small number at risk, and hence very wide error bounds, which of course will carry through to the end.

11.3 AML study, continued

In Table 11.3 we show the computations for confidence intervals just for the Kaplan-Meier curve of the maintenance group. The confidence intervals are based on the logarithm of survival, using the variance estimator for $\log \hat S(t)$ from Section 11.1.2 directly. That is, the bounds on the confidence interval are

$$\exp\left\{ \log \hat S(t) \pm z \sqrt{\sum_{t_j \le t} \frac{d_j}{n_j(n_j - d_j)}} \right\},$$
where $z$ is the appropriate quantile of the normal distribution. Note that the approximation cannot be assumed to be very good in this case, since the number of individuals at risk is too small for the asymptotics to be reliable. We show the confidence intervals in Figure 11.1.

[Figure 11.1: Greenwood's estimate of 95% confidence intervals for survival in maintenance group of the AML study. Time in weeks (0 to 50) on the horizontal axis, survival on the vertical axis.]

Table 11.3: Computations for Greenwood's estimate of the standard error of the Kaplan-Meier survival curve from the maintenance population in the AML data. "lower" and "upper" are bounds for 95% confidence intervals, based on the log-normal distribution.

tj   nj   dj   dj/(nj(nj−dj))   Var(log Ŝ(tj))   lower   upper
 9   11    1   0.009            0.009            0.754   1.000
13   10    1   0.011            0.020            0.619   1.000
18    8    1   0.018            0.038            0.488   1.000
23    7    1   0.024            0.062            0.377   0.999
31    5    1   0.050            0.112            0.255   0.946
34    4    1   0.083            0.195            0.155   0.875
48    2    1   0.500            0.695            0.036   0.944

Important: The estimate of the variance is more generally reliable than the assumption of normality, particularly for small numbers of events. Thus, the first line in Table 11.3 indicates that the estimate of $\log \hat S(9)$ is associated with a variance of 0.009. The error in this estimate is on the order of $n^{-3}$, so it's potentially about 10%. On the other hand, the number of events observed has binomial distribution, with parameters around (11, 0.909), so it's very far from a normal distribution. We could improve our confidence interval by using the Poisson confidence intervals worked out in Problem Sheet 3, question 2, or a binomial confidence interval. We will not go into the details in this course.

11.4 Survival to ∞

Let $T$ be a survival time, and define the conditional survival function
$$S_0(t) := \mathbb{P}\bigl( T > t \,\big|\, T < \infty \bigr);$$
that is, the probability of surviving past time $t$ given that the event eventually does occur. We have
$$S_0(t) = \frac{\mathbb{P}\{\infty > T > t\}}{\mathbb{P}\{\infty > T\}}. \qquad (11.8)$$
How can we estimate $S_0$? Nelson–Aalen estimators will never reach $\infty$ (which would mean 0 survival); Kaplan–Meier estimators will reach 0 if and only if the last individual at risk actually has an observed event. In either case, there is no mathematical principle for distinguishing between the actual survival to $\infty$ — that is, the probability that the event never occurs — and simply running out of data. Nonetheless, in many cases there can be good reasons for thinking that there is a time $t_\partial$ such that the event will never happen if it hasn't happened by that time. In that case we may use the fact that $\{T < \infty\} = \{T < t_\partial\}$ to estimate
$$S_0(t) = \frac{S(t) - S(t_\partial)}{1 - S(t_\partial)}. \qquad (11.9)$$

In this case, assuming that $S(t)$ is constant after $t = t_\partial$, we need to estimate the variance of $\hat S_0(t)$. To compute a confidence interval for $S_0(t)$ we apply the delta method again, in a slightly more complicated form. Suppose we set $Y_1 = \hat S(t)$ and $Y_2 = \hat S(t_\partial)/\hat S(t)$. The variance of $Y_1$ is approximated just by Greenwood's formula
$$\sigma_1^2 := \operatorname{Var}\bigl( \hat S(t) \bigr) \approx \sigma_G^2(t) = \hat S(t)^2 \sum_{t_j \le t} \frac{d_j}{n_j(n_j - d_j)},$$


while $Y_2$ is just the estimated survival from $t$ to $t_\partial$, so Greenwood's formula yields
$$\sigma_2^2 := \operatorname{Var}\left( \frac{\hat S(t_\partial)}{\hat S(t)} \right) \approx \left( \frac{\hat S(t_\partial)}{\hat S(t)} \right)^2 \sum_{t < t_j \le t_\partial} \frac{d_j}{n_j(n_j - d_j)}.$$
Since $Y_1$ and $Y_2$ depend on distinct survival events they are uncorrelated. We may apply the delta method (11.7) to the two-variable function $g(y_1, y_2) = y_1(1 - y_2)/(1 - y_1 y_2)$, obtaining
$$\begin{aligned}
\sigma_0^2(t) := \operatorname{Var} \hat S_0(t) = \operatorname{Var}\bigl( g(Y_1, Y_2) \bigr)
&\approx \left( \frac{\partial g}{\partial y_1}(Y_1, Y_2) \right)^2 \sigma_1^2 + \left( \frac{\partial g}{\partial y_2}(Y_1, Y_2) \right)^2 \sigma_2^2 \\
&= \frac{\bigl( \hat S(t) - \hat S(t_\partial) \bigr)^2}{\bigl( 1 - \hat S(t_\partial) \bigr)^4}
\left[ \sum_{t_j \le t} \frac{d_j}{n_j(n_j - d_j)} + \bigl( 1 - \hat S(t) \bigr)^2 \sum_{t < t_j \le t_\partial} \frac{d_j}{n_j(n_j - d_j)} \right]. \qquad (11.10)
\end{aligned}$$

If there is no a priori reason to choose a value of $t_\partial$, we may estimate it from the data as $\max\{t_i\}$, as long as there is a significant length of time during which there is a significant number of individuals under observation, when an event could have been observed.

11.4.1 Example: Time to next birth

This is an example discussed repeatedly in [1]. It has the advantage of being a large data set, where the asymptotic assumptions may be assumed to hold; it has the corresponding disadvantage that we cannot write down the data or perform calculations by hand.

The data set at http://folk.uio.no/borgan/abg-2008/data/second_births.txt lists, for 53,558 women listed in Norway's birth registry, the time (in days) from first to second birth. (Obviously, many women do not have a second birth, and the observations for these women will be treated as censored.)

In Figure 11.2(a) we show the Kaplan–Meier estimator computed and automatically plotted by the survfit command. Figure 11.2(b) shows a crude estimate for the distribution of time-to-second-birth for those women who actually had a second birth. We see that the last birth time recorded in the registry was 3677, after which time none of the remaining 131 women had a recorded second birth. Thus, the second curve is simply the same as the first curve, rescaled to go between 1 and 0, rather than between 1 and 0.293 as the original curve does.

The code used to generate the plots is in Code 1.


Code 1 Code for Second Birth Figure

library('survival')
sb = read.table('second_births.dat', header=TRUE)
attach(sb)
sb.surv = Surv(time, status)

sb.fit1 = survfit(sb.surv ~ rep(1, 53558))

plot(sb.fit1, mark.time=FALSE, xlab='Time (days)',
     main='Norwegian birth registry time to second birth')

# Condition on last event. Directly changing survival object.
cle = function(SF, alpha=.05){
  z = qnorm(1 - alpha/2)
  minsurv = min(SF$surv)
  maxSE = max(SF$std.err)   ## SE at final event time
  newSE = ((SF$surv - minsurv)^2 / (1 - minsurv)^4) *
          (SF$std.err * (2*SF$surv - SF$surv^2) + (1 - SF$surv)^2 * minsurv)
  SF$std.err = newSE
  SF$surv = (SF$surv - minsurv) / (1 - minsurv)
  SF$upper = SF$surv + z*newSE
  SF$lower = SF$surv - z*newSE
  SF
}

sb.fit2 = cle(sb.fit1)

plot(sb.fit2, mark.time=FALSE, xlab='Time (days)',
     main='Time to second birth conditioned on occurrence')


[Figure 11.2: Time (in days) between first and second birth from Norwegian registry data. (a) Original Kaplan–Meier curve ("Norwegian birth registry time to second birth"). (b) Kaplan–Meier curve conditioned on second birth occurring ("Time to second birth conditioned on occurrence").]


11.5 Computing survival estimators in R (non-examinable)

The main package for doing survival analysis in R is survival. Once the package is installed on your computer, you include library(survival) at the start of your R code. The package works with "survival objects", which are created by the Surv command with the following syntax:

Surv(time, event)   or   Surv(time, time2, event, type)

11.5.1 Survival objects with only right-censoring

We begin by discussing the first version, which may be applied to right-censored (or uncensored) survival data. The individual times (whether censoring or events) are entered as the vector time. The vector event (of the same length) has values 0 or 1, depending on whether the time is a censoring time or an event time respectively. These alternatives may also be labelled 1 and 2, or FALSE and TRUE.

For an example, we turn our tiny simulated data set

21, 47, 47, 58+, 71, 71+, 125, 143+, 143+, 143+ (11.11)

into a survival object with

sim.surv = Surv(c(21, 47, 47, 58, 71, 71, 125, 143, 143, 143), c(1, 1, 1, 0, 1, 0, 1, 0, 0, 0)).

Fitting models is done with the survfit command. This is designed for comparing distributions, so we need to put in some sort of covariate. Then we can write

sim.fit = survfit(sim.surv ~ 1, conf.int=.99)

and then plot(sim.fit), or

plot(sim.fit, main='Kaplan-Meier for simulated data set', xlab='Time', ylab='Survival')

to plot the Kaplan–Meier estimator of the survival function, as in Figure 11.3. The dashed lines are the Greenwood estimator of a 99% confidence interval. (The default for conf.int is 0.95.)

The Nelson–Aalen estimator can also be computed with survfit. The associated survival estimator $\tilde S = e^{-\hat A}$ is called the Fleming–Harrington estimator, and it may be estimated with fit = survfit(formula, type='fleming-harrington'). The cumulative hazard — minus the log of the survival estimator — may be plotted with plot(fit, fun='cumhaz').

If you want to compute it more directly, you can extract the information in the survfit object. If you want to see what's inside an R object, you can use the str command. The output is shown in Code 2.

We can then compute the Nelson–Aalen estimator with a function such as the one in Code 3. This is plotted together with the Kaplan–Meier estimator in Figure 11.4. As you can see, the two estimators are similar, and the Nelson–Aalen survival is always higher than the KM.

11.5.2 Other survival objects

Left-censored data are represented with

Surv(time, event, type='left').

Here event can be 0/1 or 1/2 or TRUE/FALSE for alive/dead, i.e., censored/not censored.

Left-truncation is represented with

Surv(time, time2, event).

event is as before. The type is 'right' by default.


Code 2 Example of structure of a survfit object.

> str(sim.fit)
List of 13
 $ n         : int 10
 $ time      : num [1:6] 21 47 58 71 125 143
 $ n.risk    : num [1:6] 10 9 7 6 4 3
 $ n.event   : num [1:6] 1 2 0 1 1 0
 $ n.censor  : num [1:6] 0 0 1 1 0 3
 $ surv      : num [1:6] 0.9 0.7 0.7 0.583 0.438 ...
 $ type      : chr "right"
 $ std.err   : num [1:6] 0.105 0.207 0.207 0.276 0.399 ...
 $ upper     : num [1:6] 1 1 1 1 0.957 ...
 $ lower     : num [1:6] 0.732 0.467 0.467 0.34 0.2 ...
 $ conf.type : chr "log"
 $ conf.int  : num 0.95
 $ call      : language survfit(formula = sim.surv ~ type, conf.int = 0.95)
 - attr(*, "class")= chr "survfit"

Code 3 Function to compute Nelson–Aalen estimator.

# SF is output of survfit
NAest = function(SF){
  times = SF$time[SF$n.event > 0]
  events = SF$n.event[SF$n.event > 0]
  nrisk = SF$n.risk[SF$n.event > 0]
  increment = sapply(seq(length(nrisk)), function(i)
    sum(1 / seq(nrisk[i], nrisk[i] - events[i] + 1)))
  varianceincrement = sapply(seq(length(nrisk)), function(i)
    sum(1 / seq(nrisk[i], nrisk[i] - events[i] + 1)^2))
  hazard = cumsum(increment)
  variance = cumsum(varianceincrement)
  list(time=times, Hazard=hazard, Var=variance)
}


[Figure 11.3: Plot of Kaplan–Meier estimates from data in (11.11), titled "Kaplan-Meier for simulated data set"; Time (0 to 140) on the horizontal axis, Survival on the vertical axis. Dashed lines are the 95% confidence interval from Greenwood's estimate.]

Interval censoring also takes time and time2, with type='interval'. In this case, the event can be 0 (right-censored), 1 (event at time), 2 (left-censored), or 3 (interval-censored).


[Figure 11.4: Plot of Kaplan–Meier (black) and Nelson–Aalen (red) estimates from data in (11.11), titled "Estimators for simulated data set"; Time on the horizontal axis, Survival on the vertical axis. Dashed lines are pointwise 95% confidence intervals.]

Lecture 12

Semiparametric models: accelerated life, proportional hazards

12.1 Introduction to semiparametric modeling

We learned in section 6.2 how to compare observed mortality to a standard life table. In many settings, though, we are interested to compare observed mortality (or more general event times) between groups, or between individuals with different values of a quantitative covariate, and in the presence of censoring — for example, comparing survival between a treatment group and a control group in a clinical trial.

Often we are interested to compare two (or more) different lifetime distributions. An approach that has been found to be effective is to think of there being a "standard" lifetime which may be modified in various simple ways to produce the lifetimes of the subpopulations. The standard lifetime is commonly estimated nonparametrically, while the modifications — usually the characteristic of primary interest — are reduced to one or a few parameters. The modifications may either involve a discrete collection of parameters — one parameter for each of a small number of subpopulations — or a regression-type parameter multiplied by a continuous covariate.

Examples of the former type would be clinical trials, where we compare survival time between treatment and control groups, or an observational study where we compare survival rates of smokers and non-smokers. An example of the second type would be testing time to appearance of full-blown AIDS symptoms as a function of measured T-cell counts.

There are two popular general classes of model: Accelerated life (AL, also called Accelerated failure models) and Proportional hazards (PH, also called Relative risk models).

12.2 Accelerated Life models

Suppose there are (several) groups, labelled by index $i$. The accelerated life model has a survival curve for each group defined by
$$S_i(t) = S_0(\rho_i t)$$
where $S_0(t)$ is some baseline survival curve and $\rho_i$ is a constant specific to group $i$.

If we plot $S_i$ against $\log t$, $i = 1, 2, \ldots, k$, then we expect to see a horizontal shift as
$$S_i(t) = S_0\bigl( e^{\log \rho_i + \log t} \bigr).$$


12.3 Proportional Hazards models

In this model we assume that the hazards in the various groups are proportional, so that
$$h_i(t) = \rho_i h_0(t)$$
where $h_0(t)$ is the baseline hazard. Hence we see that
$$S_i(t) = S_0(t)^{\rho_i}.$$
Taking logs twice we get
$$\log\bigl( -\log S_i(t) \bigr) = \log \rho_i + \log\bigl( -\log S_0(t) \bigr).$$
So if we plot the RHS of the above equation against either $t$ or $\log t$ we expect to see a vertical shift between groups.

12.3.1 Plots

Taking both models together it is clear that we could plot
$$\log\bigl( -\log \hat S_i(t) \bigr) \quad \text{against} \quad \log t,$$
as then we can check for AL and PH in one plot. Generally $\hat S_i$ will be calculated as the Kaplan–Meier estimator for group $i$, and the survival function estimator for both groups will be plotted on the same graph.

• If the accelerated life model is plausible we expect to see a horizontal shift between groups.

• If the proportional hazards model is plausible we expect to see a vertical shift between groups.

Of course, if the data came from a Weibull distribution, with differences in the $\rho$ parameter, it is simultaneously AL and PH. We see that
$$\log\bigl( -\log S_i(t) \bigr) = \log \rho_i + \alpha \log t.$$
Thus, survival curve estimates for different groups should appear approximately as parallel lines, which of course may be viewed as vertical or as horizontal shifts of one another.
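Such a diagnostic plot is easy to produce from Kaplan–Meier fits. A minimal R sketch (illustration only, using the aml data mentioned in Lecture 10, with its grouping variable x):

# Sketch: log(-log S) against log t for two groups, to check AL / PH visually.
library(survival)
fit <- survfit(Surv(time, status) ~ x, data = aml)   # groups "Maintained" / "Nonmaintained"
s   <- summary(fit)

grp <- s$strata
ok  <- s$surv > 0 & s$surv < 1      # drop points where the transform is undefined
plot(log(s$time[ok]), log(-log(s$surv[ok])), type = "n",
     xlab = "log t", ylab = "log(-log S(t))")
for (g in levels(grp)) {
  sel <- ok & grp == g
  lines(log(s$time[sel]), log(-log(s$surv[sel])), type = "s")
}
# Roughly constant vertical separation suggests PH; constant horizontal separation suggests AL.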

In section 12.4 we illustrate this with two simulations of populations with Gompertz mortality.

12.3.2 Generalised linear survival models

Any parametric model may be turned into an AL model by replacing $t$ by $\rho_i t$. And it may be turned into a PH model by replacing the hazard $h(t)$ by $\rho_i h(t)$.

There remains then the question of how to link $\rho_i$ to the explanatory factors such as age, smoking status, blood pressure and so on that have been observed for the individual.

We need to incorporate these into a model using some sort of generalised regression. It is usual to do so by making $\rho$ a function of the explanatory variables. For each observation (say, individual in a clinical trial) we set the scale parameter $\rho = \rho(\beta \cdot \mathbf{x})$, where $\beta \cdot \mathbf{x}$ is a linear predictor composed of a vector $\mathbf{x}$ of known explanatory variables (covariates) and an unknown vector $\beta$ of parameters which will be estimated. The most common link function is
$$\log \rho = \beta \cdot \mathbf{x}, \quad \text{equivalently} \quad \rho = e^{\beta\cdot\mathbf{x}}.$$
The shape parameter $\alpha$ is assumed to be the same for each observation in the study.

There are often very many covariates measured for each subject in a study. A row of data will include


• response
  – event time $t_i$;
  – status $\delta_i$ (=1 if failure, =0 if censored);
  – possibly a left-truncation time.

• covariates, often a mix of quantitative and categorical variables such as
  – age;
  – sex;
  – systolic blood pressure;
  – treatment group.

As an example, suppose we think the Weibull distribution is generally a good fit. Then
$$S_i(t) = e^{-(\rho t)^\alpha} \quad \text{and} \quad \rho = e^{\beta\cdot\mathbf{x}},$$
$$\beta \cdot \mathbf{x} = \beta_0 + \beta_1\, \mathrm{age}_i + \beta_2\, \mathrm{sex}_i + \beta_3\, \mathrm{sbp}_i + \beta_4\, \mathrm{trt}_i,$$
where $\beta_0$ is the intercept and all regression coefficients $\beta_j$ are to be estimated, as well as estimating $\alpha$. Note this model assumes that $\alpha$ is the same for each subject. We have not shown interaction terms such as $x_{\mathrm{age}} * x_{\mathrm{trt}}$, but these could be added as well. This would allow a different effect of age according to treatment group.

Suppose subject $i$ has covariate vector $\mathbf{x}_i$, and so scale parameter
$$\rho_i = e^{\beta\cdot\mathbf{x}_i}.$$

This gives a likelihood
$$L(\alpha, \beta) = \prod_i \bigl( \alpha \rho_i^\alpha t_i^{\alpha-1} \bigr)^{\delta_i} e^{-(\rho_i t_i)^\alpha}
= \prod_i \bigl( \alpha e^{\alpha \beta\cdot\mathbf{x}_i}\, t_i^{\alpha-1} \bigr)^{\delta_i} e^{-(e^{\beta\cdot\mathbf{x}_i} t_i)^\alpha}.$$
We can now compute MLEs for $\alpha$ and all components of the vector $\beta$, using numerical optimisation, giving estimators $\hat\alpha, \hat\beta$ together with their standard errors $\bigl(\sqrt{\operatorname{Var}\hat\alpha},\ \sqrt{\operatorname{Var}\hat\beta_j}\bigr)$, estimated from the observed information matrix. Of course, the same could have been done for another parametric model instead of the Weibull.

In general, if we have a parametric model with cumulative hazard $H_\alpha(t)$ (where $\alpha$ represents the parameters of the model that do not vary between individuals), and $h_\alpha(t) = H'_\alpha(t)$ is the hazard function, we have the likelihood
$$L(\alpha, \beta) = \prod_i \Bigl( e^{\beta\cdot\mathbf{x}_i}\, h_\alpha\bigl( e^{\beta\cdot\mathbf{x}_i} t_i \bigr) \Bigr)^{\delta_i} e^{-H_\alpha( e^{\beta\cdot\mathbf{x}_i} t_i )}.$$
If observations are left-truncated, say at a time $s_i$, the survival term includes only the cumulative hazard from $s_i$ to $t_i$:
$$L(\alpha, \beta) = \prod_i \Bigl( e^{\beta\cdot\mathbf{x}_i}\, h_\alpha\bigl( e^{\beta\cdot\mathbf{x}_i} t_i \bigr) \Bigr)^{\delta_i} e^{-H_\alpha( e^{\beta\cdot\mathbf{x}_i} t_i ) + H_\alpha( e^{\beta\cdot\mathbf{x}_i} s_i )}.$$

We can test for $\alpha = 1$ using
$$2 \log \hat L_{\mathrm{weib}} - 2 \log \hat L_{\mathrm{exp}} \sim \chi^2(1), \quad \text{asymptotically}.$$
Packages allow for Weibull, log-logistic and log-normal models, sometimes others.
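In R, accelerated-life (accelerated failure time) models of this kind are fitted by survreg in the survival package. A minimal sketch (illustration only; the aml data have only the group covariate x, standing in for the richer covariate list above):

# Sketch: parametric accelerated-life fits with survreg, and a test of alpha = 1
# (Weibull against exponential) by comparing maximised log-likelihoods.
library(survival)
fit.weib <- survreg(Surv(time, status) ~ x, data = aml, dist = "weibull")
fit.exp  <- survreg(Surv(time, status) ~ x, data = aml, dist = "exponential")

summary(fit.weib)          # regression coefficients and the (log) scale parameter

lr <- 2 * (logLik(fit.weib) - logLik(fit.exp))
pchisq(as.numeric(lr), df = 1, lower.tail = FALSE)   # approximate p-value

# Other available distributions include "lognormal" and "loglogistic".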


12.4 Simulation example

The Gompertz hazard fits naturally into either the AL or the PH framework. If individual $i$ has hazard rate $B_i e^{\theta x}$ at age $x$, where $\theta$ is always the same, this is a PH model. If the hazard rate is $B e^{\theta_i x}$, this is an AL model. We can use this to illustrate the graphical approach to identifying AL or PH variation between different groups.

In Figure 12.1 we show the results of two different simulations where we have simulated 1000 survival times for each of two groups, one of which has Gompertz mortality rates approximating those of UK women, and the other somewhat higher rates, which we think of as "men". All are conditioned on survival to age 20 (since human survival ages before maturity are very different from Gompertz). In each case we plot the log estimated cumulative hazard of each group against age on a log scale, and draw the horizontal and vertical differences at various levels.

In Figure 12.1(a) the $(B, \theta)$ parameters are $(2 \times 10^{-5}, 0.11)$ and $(4 \times 10^{-5}, 0.11)$. We see, as expected, very nearly constant vertical differences, but large variation in the horizontal gap between the curves. Figure 12.1(b), on the other hand, shows the AL version, with parameters $(1 \times 10^{-5}, 0.11)$ and $(1 \times 10^{-5}, 0.14)$. The horizontal gaps are now similar, while the vertical differences change.
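The simulations themselves only require inverting the Gompertz cumulative hazard. A sketch of how such survival times might be generated in R (not the code used for Figure 12.1; sample sizes, seeds and the plotting call are illustrative):

# Sketch: simulate Gompertz lifetimes with hazard B*exp(theta*x),
# conditioned on survival to age 20, by inverting the cumulative hazard.
rgompertz20 <- function(n, B, theta) {
  u <- runif(n)
  # Solve H(x) - H(20) = -log(u), with H(x) = (B/theta)*(exp(theta*x) - 1)
  (1/theta) * log(exp(20*theta) - (theta/B) * log(u))
}

set.seed(42)
x.f <- rgompertz20(1000, B = 2e-5, theta = 0.11)   # "female" group
x.m <- rgompertz20(1000, B = 4e-5, theta = 0.11)   # PH-type "male" group

library(survival)
fit <- survfit(Surv(c(x.f, x.m), rep(1, 2000)) ~ rep(c("F", "M"), each = 1000))
plot(fit, fun = "cloglog", xlab = "Age (log scale)",
     ylab = "log cumulative mortality")   # cloglog plots log(-log S) against log t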


[Figure 12.1: Simulations to show effect of PH and AL differences in Gompertz populations. Each panel plots log cumulative mortality against age (20 to 100) for the Female and Male groups, with horizontal and vertical gaps between the curves marked at several levels. (a) PH Gompertz difference: vertical gaps roughly 0.64, 0.64, 0.61; horizontal gaps 0.23, 0.06. (b) AL Gompertz difference: vertical gaps roughly 0.96, 1.45, 2.17; horizontal gaps 0.18, 0.22.]

Figure 12.1: Simulations to show effect of PH and AL differences in Gompertz populations.

Lecture 13

Relative-risk models

It is possible to estimate AL models in a completely semi-parametric way — that is, baselinesurvival function S0 is modelled non-parametrically, with no assumptions about its form, andeach subject has time ti scaled to ⇢iti. This model is beyond the scope of this course.

A simpler model — and by far the most popular survival model — is the proportional hazardssemi-parametric model with log link function, generally known as Cox regression, after DavidCox, who introduced the model in 1972. (Cox’s paper [4] was listed in a recent survey by Natureas the 24th most frequently cited paper of all time, over all sciences.)

13.1 The relative-risk regression model

In the regression setting, the most mathematically tractable models are the relative-risk models. These model the hazard by
$$h_i(t) = h\bigl( t \,\big|\, \mathbf{x}_i \bigr) = h_0(t)\, r\bigl( \beta, \mathbf{x}_i(t); t \bigr). \qquad (13.1)$$
Here $\mathbf{x}_i$ is a vector of (possibly time-varying) covariates belonging to individual $i$, and $\beta$ is a vector of parameters. Since this model has a nonparametric piece $h_0$ and a parametric piece $\beta$, it is called semiparametric. In this lecture we will generally be assuming that $r$ is a function only of $\beta$ and $\mathbf{x}$ (no direct dependence on $t$), and in that case we will drop $t$ from the notation, writing $r(\beta, \mathbf{x})$ or $r(\beta, \mathbf{x}(t))$.

This model is called relative-risk or proportional hazards because there is an unchanging ratio of the hazard rate (or risk of the event) between individuals with parameter values $\mathbf{x}_i$ and $\mathbf{x}_j$. (Of course, if the covariates themselves are time-varying, this looks more complicated when we look at individual hazard rates over time.)

Different choices for the relative risk function $r$ are possible. We will focus mainly on the choice
$$r(\beta, \mathbf{x}) = e^{\beta^T \mathbf{x}} = e^{\sum_j \beta_j x_j}, \qquad (13.2)$$
which assigns a constant proportional change to the hazard rate to each unit change in the covariate. The regression model with this risk function, which is by far the most commonly used survival model in medical applications, is called the Cox proportional hazards regression model.

One might naively suppose that the choice of model should be guided by a belief about what is "really true" about the survival process. But, as George Box famously said, "All models are wrong, but some are useful." A wrong model will still summarise the data in a potentially non-misleading way. (On the other hand, the model can't be too wrong. We will discuss model diagnostics later on.)


For example, in medical statistics the excess relative risk of a category is defined as
$$\frac{\text{observed rate} - \text{expected rate}}{\text{expected rate}}.$$
If we have a single parameter that is supposed to summarise the excess relative risk created per unit of covariate, we would take the risk function
$$r(\beta, x) = 1 + \beta x.$$
In the multidimensional-covariate setting we can generalise this to the excess relative risk model (taking $p$ to be the dimension of the covariate)
$$r(\beta, \mathbf{x}) = \prod_{j=1}^p \bigl( 1 + \beta_j x_j \bigr). \qquad (13.3)$$
This allows each covariate to contribute its own excess relative risk, independent of the others. Alternatively, we can define the linear relative risk function
$$r(\beta, \mathbf{x}) = 1 + \sum_{j=1}^p \beta_j x_j. \qquad (13.4)$$

13.2 Partial likelihood

A precise mathematical treatment of this material may be found in section 4.1.1 of [1].

We represent the data in the following form:

1. A list of event times $t_1 < t_2 < \cdots$. (We are assuming no ties, for the moment.)

2. The identity $i_j$ of the individual whose event is at time $t_j$.

3. The values of all individuals' covariates (at times $t_j$, if they are varying).

4. The process $Y_i(t)$ for all subjects. These may be summarised in risk sets $R_j = \{i : Y_i(t_j) = 1\}$, the set of individuals who are at risk at time $t_j$.

The most common way to fit a relative-risk model is to split the likelihood into two pieces: the likelihood of the event times, and the conditional likelihood of the choice of subjects given the event times. The first piece is assumed to contain relatively little information about the parameters, and its dependence on the parameters is quite complicated.

We use the second piece to estimate $\beta$. Conditioned on some event happening at time $t$, the probability that it is individual $i$ is
$$\pi_i\bigl( \beta \,\big|\, t \bigr) := \frac{h_i(t)}{h(t)} = \frac{r(\beta, \mathbf{x}_i(t); t)\, \mathbf{1}_{\{i \in R(t)\}}}{\sum_{j \in R(t)} r(\beta, \mathbf{x}_j(t); t)},$$
where $R(t)$ is the risk set, the set of individuals at risk at time $t$. We have the partial likelihood
$$L_P(\beta) = \prod_{t_j} \pi_{i_j}\bigl( \beta \,\big|\, t_j \bigr) = \prod_{t_j} \frac{r(\beta, \mathbf{x}_{i_j}(t_j); t_j)}{\sum_{l \in R_j} r(\beta, \mathbf{x}_l(t_j); t_j)}. \qquad (13.5)$$

The partial likelihood is useful because it involves only the parameters $\beta$, isolating them from the nonparametric (and often less interesting) $h_0$. In section ?? we derive a special case of the asymptotic behaviour of the maximum partial likelihood estimator. As stated here, it has the same essential properties as the MLE.
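To make (13.5) concrete, here is a minimal R sketch (our own code, not from any package) that evaluates the Cox log partial likelihood for a single constant covariate, with no ties, and maximises it numerically over $\beta$:

# x: covariate, time: event/censoring times, status: 1 = event, 0 = censored
log_plik <- function(beta, time, status, x) {
  ev <- which(status == 1)
  sum(sapply(ev, function(i) {
    risk_set <- which(time >= time[i])          # individuals still at risk at time[i]
    beta * x[i] - log(sum(exp(beta * x[risk_set])))
  }))
}

# toy data
time   <- c(2, 5, 7, 9, 12)
status <- c(1, 1, 0, 1, 1)
x      <- c(0, 1, 1, 0, 1)
optimize(log_plik, c(-5, 5), maximum = TRUE,
         time = time, status = status, x = x)$maximum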


Theorem 13.1. Let $\widehat\beta$ maximise $L_P$, as given in (13.5). Then $\widehat\beta$ is a consistent estimator of the true parameter $\beta_0$, and $\sqrt{n}(\widehat\beta - \beta_0)$ converges to a multivariate normal distribution with mean 0 and covariance matrix consistently approximated by $J(\widehat\beta)^{-1}$, where $J(\beta)$ is the observed information matrix, with $(i,j)$ component given by
\[
-\frac{\partial^2}{\partial\beta_i\,\partial\beta_j}\log L_P(\beta).
\]

13.3 Significance testing

As with the MLE, when $\beta$ is $p$-dimensional there are three (asymptotically equivalent) conventional test statistics used to test the null hypothesis $\beta = \beta_0$:

• Wald statistic: $\xi^2_W := (\widehat\beta - \beta_0)^T J(\beta_0)(\widehat\beta - \beta_0)$;

• Score statistic: $\xi^2_{SC} = U(\beta_0)^T J(\beta_0)^{-1} U(\beta_0)$;

• Likelihood ratio statistic: $\xi^2_{LR} = 2\bigl[\ell_P(\widehat\beta) - \ell_P(\beta_0)\bigr]$, where $\ell_P := \log L_P$.

Under the null hypothesis these are all asymptotically chi-squared distributed with $p$ degrees of freedom. Here $J(\beta)$ is the observed Fisher partial information matrix. There is a computable estimate for this, which is fairly straightforward, but notationally slightly tricky in the general (multivariate) case, so we do not include it here. (See equations (4.46) and (4.48) of [1] if you are interested.) As usual, we can approximate the expected information by the observed information.

Here $U(\beta)$ is the vector of score functions $\partial\ell_P/\partial\beta_j$.

13.4 Estimating baseline hazard

In relative-risk regression we are usually primarily interested in estimating the coefficients $\beta$ — the difference between groups. Thus, if we are performing a clinical trial, we are interested to know whether the treatment group had better survival than the control. How long the groups actually survived in an absolute sense is irrelevant to the experimental outcome.

But that doesn't mean that the hazard rates are merely a nuisance. And even when survival rates are the primary concern, if hazard is affected by a known covariate there is no way to estimate the appropriate survival function for individuals without using a regression model to pool the data over different values of the covariate. Even when the covariate has only a small number of categories, so that pooling within categories is possible, it may still be worth using a regression model — as long as it is plausibly accurate — which allows us effectively to pool more individuals together to estimate a single survival curve. Thus variance is reduced, at the cost of potentially introducing a bias from modelling error (because the regression assumptions won't hold exactly).

13.4.1 Breslow's estimator

Suppose the baseline survival is given by
\[
\widehat S_0(t) = e^{-\widehat H_0(t)}, \qquad \widehat H_0(t) = \sum_{t_j\le t}\widehat h_0(t_j),
\]
where the discrete hazard estimates $\widehat h_0(t_j)$ are given by Breslow's estimator
\[
\widehat h_0(t_j) = \frac{1}{\sum_{i\in R_j} r\bigl(\widehat\beta, \mathbf{x}_i(t_j)\bigr)}. \tag{13.6}
\]
For the special case of Cox regression
\[
\widehat h_0(t_j) = \frac{1}{\sum_{i\in R_j} e^{\widehat\beta\cdot\mathbf{x}_i(t_j)}}. \tag{13.7}
\]
In some sense the discrete estimates for $\widehat h_0(t_j)$ can be thought of as the maximum likelihood estimators from the full likelihood, provided we assume that the hazard distribution is discrete (which of course it generally is not). When $\widehat\beta = 0$, or when the covariates are all 0, this reduces simply to the Nelson–Aalen estimator. Otherwise, we see that this is equivalent to a modified Nelson–Aalen estimator, where the size of the risk set is weighted by the relative risks of the individuals. In other words, the estimate of $\widehat h_0$ is equivalent to the standard estimate $\#$ events/time at risk, but now time at risk is weighted by the relative risk.
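As a minimal sketch (our own helper name, constant covariates, no ties), the Breslow baseline hazard increments (13.7) can be computed by hand from a fitted coefficient:

# beta_hat: fitted Cox coefficient; time, status, x as in the earlier sketch
breslow_h0 <- function(beta_hat, time, status, x) {
  ev <- which(status == 1)
  ev <- ev[order(time[ev])]                      # event times in increasing order
  sapply(ev, function(i) {
    risk_set <- which(time >= time[i])
    1 / sum(exp(beta_hat * x[risk_set]))         # "time at risk" weighted by relative risk
  })
}
# cumulative baseline hazard: cumsum(breslow_h0(...)); baseline survival: exp(-cumsum(...))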

The estimator may be loosely derived as follows:
\[
\begin{aligned}
\ell(h) &= \sum_{t_j}\Bigl[\log\bigl(1 - e^{-h_{i_j}(t_j)}\bigr) - \sum_{\substack{k\in R_j\\ k\ne i_j}} h_k(t_j)\Bigr] \\
&= \sum_{t_j}\Bigl[\log\bigl(1 - e^{-r_{i_j} h_0(t_j)}\bigr) - \sum_{\substack{i\in R_j\\ i\ne i_j}} r_i\, h_0(t_j)\Bigr].
\end{aligned}
\]
We estimate $h_0(t_j)$ by setting the derivative with respect to $h_0(t_j)$ equal to zero:
\[
\begin{aligned}
0 &= \frac{r_{i_j} e^{-r_{i_j}\widehat h_0(t_j)}}{1 - e^{-r_{i_j}\widehat h_0(t_j)}} - \sum_{\substack{k\in R_j\\ k\ne i_j}} r_k \\
&\approx \frac{r_{i_j}\bigl(1 - r_{i_j}\widehat h_0(t_j)\bigr)}{r_{i_j}\widehat h_0(t_j)} - \sum_{\substack{k\in R_j\\ k\ne i_j}} r_k \\
&= \frac{1 - r_{i_j}\widehat h_0(t_j)}{\widehat h_0(t_j)} - \sum_{\substack{k\in R_j\\ k\ne i_j}} r_k.
\end{aligned}
\]
(In the second line we have assumed $\widehat h_0(t_j)$ to be small.) Thus
\[
1 \approx \widehat h_0(t_j)\Bigl(\sum_{i\in R_j} r_i\Bigr),
\]
which is the same as (13.6).


13.4.2 Individual risk ratios

A benefit of relative risk models is that they give a straightforward interpretation to the question of individual risk. In the case where covariates are constant in time, an individual with covariates $\mathbf{x}$ has a relative risk of $r(\beta,\mathbf{x})$ compared with individuals with baseline covariates (with $r(\beta,\mathbf{x}) = 1$; typically corresponding to $\mathbf{x} = 0$). Thus, their cumulative hazard to age $t$ is approximated by
\[
H(t\mid\mathbf{x}) = r(\beta,\mathbf{x})\, H_0(t). \tag{13.8}
\]
In case of time-varying covariates we have an individual cumulative hazard
\[
H(t\mid\mathbf{x}) = \int_0^t r\bigl(\beta,\mathbf{x}(u)\bigr)\, h_0(u)\, du,
\]
which we approximate by
\[
\widehat H(t\mid\mathbf{x}) = \sum_{t_j\le t}\frac{r\bigl(\widehat\beta,\mathbf{x}(t_j)\bigr)}{\sum_{i\in R_j} r\bigl(\widehat\beta,\mathbf{x}_i(t_j)\bigr)}. \tag{13.9}
\]
In the special case of Cox regression we have
\[
\widehat H(t\mid\mathbf{x}) = \sum_{t_j\le t}\frac{e^{\widehat\beta\cdot\mathbf{x}(t_j)}}{\sum_{i\in R_j} e^{\widehat\beta\cdot\mathbf{x}_i(t_j)}}. \tag{13.10}
\]

Lecture 14

Relative risk regression, continued

14.1 Dealing with ties

We discuss here the methods of dealing with tied event times for the Cox model.

Until now in this section we have been assuming that the times of events are all distinct. In situations where event times are equal, we can carry out the same computations for Cox regression, only using a modified version of the partial likelihood. Suppose $R_j$ is the set of individuals at risk at time $t_j$, and $D_j$ the set of individuals who have their event at that time. We assume that the ties are not real ties, but only the result of discreteness in the observation. Then the probability of having precisely those individuals at time $t_j$ will depend on the order in which they actually occurred. For example, suppose there are 5 individuals at risk at the start, and two of them have their events at time $t_1$. If the relative risks were $\{r_1,\ldots,r_5\}$, then the first term in the partial likelihood would be
\[
\frac{r_1}{r_1+r_2+r_3+r_4+r_5}\cdot\frac{r_2}{r_2+r_3+r_4+r_5} + \frac{r_2}{r_1+r_2+r_3+r_4+r_5}\cdot\frac{r_1}{r_1+r_3+r_4+r_5}.
\]
The number of terms is $d_j!$, so it is easy to see that this computation quickly becomes intractable. A very good alternative — accurate and easy to compute — was proposed by B. Efron.

Observe that the terms differ in the denominator merely by a small change due to the individuals lost from the risk set. If the deaths at time $t_j$ are not a large proportion of the risk set, then we can approximate this by deducting the average of the risks that depart. In other words, in the above example, the first contribution to the partial likelihood becomes
\[
\frac{r_1 r_2}{(r_1+r_2+r_3+r_4+r_5)\bigl(\tfrac12(r_1+r_2)+r_3+r_4+r_5\bigr)}.
\]

More generally, the log partial likelihood becomes
\[
\ell_P(\beta) = \sum_{t_j}\left[\sum_{i\in D_j}\log r\bigl(\beta,\mathbf{x}_i(t_j)\bigr)
 - \sum_{k=0}^{d_j-1}\log\left(\sum_{i\in R_j} r\bigl(\beta,\mathbf{x}_i(t_j)\bigr) - \frac{k}{d_j}\sum_{i\in D_j} r\bigl(\beta,\mathbf{x}_i(t_j)\bigr)\right)\right].
\]

We take the same approach to estimating the baseline cumulative hazard:
\[
\widehat H_0(t) = \sum_{t_j\le t}\,\sum_{k=0}^{d_j-1}\left(\sum_{i\in R_j} r\bigl(\widehat\beta,\mathbf{x}_i(t_j)\bigr) - \frac{k}{d_j}\sum_{i\in D_j} r\bigl(\widehat\beta,\mathbf{x}_i(t_j)\bigr)\right)^{-1}.
\]


An alternative approach, due to Breslow, makes no correction for the progressive loss of risk in the denominator:
\[
\ell_P^{\mathrm{Breslow}}(\beta) = \sum_{t_j}\left[\sum_{i\in D_j}\log r\bigl(\beta,\mathbf{x}_i(t_j)\bigr) - d_j\log\sum_{i\in R_j} r\bigl(\beta,\mathbf{x}_i(t_j)\bigr)\right].
\]
This approximation is always too small, and tends to shift the estimates of $\beta$ toward 0. It is widely used as a default in software packages (SAS, not R!) for purely historical reasons.
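In R's survival package the tie-handling method is selected with the ties argument of coxph (Efron is the default). A minimal sketch, using the aml data set that ships with the package:

require('survival')
fit_efron   <- coxph(Surv(time, status) ~ x, data = aml)                    # ties = "efron" (default)
fit_breslow <- coxph(Surv(time, status) ~ x, data = aml, ties = "breslow")
c(efron = coef(fit_efron), breslow = coef(fit_breslow))   # compare the two log relative-risk estimates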

14.2 The AML example

We continue looking at the leukemia study that we started to consider in section 15.3. First, in Figure 14.1 we plot the iterated logarithm of survival against time, to test the proportional hazards assumption. The PH assumption corresponds to the two curves differing by a vertical shift. The result makes this assumption at least credible. (This method, and others, for examining the PH assumption are discussed in detail in chapter 16.)

Figure 14.1: Iterated log plot (log(−log(Survival)) against age) of survival of two populations in AML study, to test proportional hazards assumption.

We code the data with covariate $x = 0$ for the maintained group, and $x = 1$ for the non-maintained group. Thus, the baseline hazard will correspond to the maintained group, and $e^\beta$ will be the relative risk of the non-maintained group. From Table 10.3 we see that the partial likelihood is given by
\[
\begin{aligned}
L_P(\beta) ={}& \left(\frac{e^{2\beta}}{(12e^{\beta}+11)(11e^{\beta}+11)}\right)
               \left(\frac{e^{2\beta}}{(10e^{\beta}+11)(9e^{\beta}+11)}\right)
               \left(\frac{1}{8e^{\beta}+11}\right)
               \left(\frac{e^{\beta}}{8e^{\beta}+10}\right)
               \left(\frac{1}{7e^{\beta}+10}\right) \\
&\times \left(\frac{1}{6e^{\beta}+8}\right)
        \left(\frac{e^{\beta}\cdot 1}{(6e^{\beta}+7)(5.5e^{\beta}+6.5)}\right)
        \left(\frac{e^{\beta}}{5e^{\beta}+6}\right)
        \left(\frac{e^{\beta}}{4e^{\beta}+5}\right)
        \left(\frac{1}{3e^{\beta}+5}\right) \\
&\times \left(\frac{e^{\beta}}{3e^{\beta}+4}\right)
        \left(\frac{1}{2e^{\beta}+4}\right)
        \left(\frac{e^{\beta}}{2e^{\beta}+3}\right)
        \left(\frac{e^{\beta}}{e^{\beta}+3}\right)
        \left(\frac{1}{2}\right).
\end{aligned}
\tag{14.1}
\]

A plot of $L_P(\beta)$ is shown in Figure 14.2.

Figure 14.2: A plot of the partial likelihood $L_P(\beta)$ from (14.1). Dashed line is at $\beta = 0.9155$.

In the one-dimensional setting it is straightforward to estimate $\beta$ by direct computation. We see the maximum at $\beta = 0.9155$ in the plot of Figure 14.2. In more complicated settings, there are good maximisation algorithms built in to the coxph function in the survival package of R. Applying this to the current problem, we obtain:

Table 14.1: Output of the coxph function run on the aml data set.

coxph(formula = Surv(time, status) ~ x, data = aml)
                 coef exp(coef) se(coef)    z     p
xNonmaintained  0.916       2.5    0.512 1.79 0.074

Likelihood ratio test=3.38 on 1 df, p=0.0658  n= 23

The z is simply the Z-statistic for testing the hypothesis that $\beta = 0$, so $z = \widehat\beta/\mathrm{SE}(\widehat\beta)$. We see that $z = 1.79$ corresponds to a p-value of 0.074, so we would not reject the null hypothesis at level 0.05.
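As a check on the value $\widehat\beta = 0.9155$, here is a minimal R sketch (our own, not from the notes' code) that maximises the partial likelihood (14.1) directly:

lp <- function(b) {          # the partial likelihood (14.1) as a function of beta
  e <- exp(b)
  e^2/((12*e+11)*(11*e+11)) * e^2/((10*e+11)*(9*e+11)) *
  1/(8*e+11) * e/(8*e+10) * 1/(7*e+10) * 1/(6*e+8) *
  e/((6*e+7)*(5.5*e+6.5)) * e/(5*e+6) * e/(4*e+5) *
  1/(3*e+5) * e/(3*e+4) * 1/(2*e+4) * e/(2*e+3) * e/(e+3) * 1/2
}
optimize(lp, c(0, 2), maximum = TRUE)$maximum   # maximum near beta = 0.9155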

We show the estimated baseline hazard in Figure 14.3; the relevant numbers are given in Table 14.2. For example, the first hazard, corresponding to $t_1 = 5$, is given by
\[
\widehat h_0(5) = \frac{1}{12e^{\widehat\beta}+11} + \frac{1}{11e^{\widehat\beta}+11} = 0.050,
\]
substituting in $\widehat\beta = 0.9155$.

Table 14.2: Computations for the baseline hazard MLE for the AML data, in the proportional hazards model, with maintained group as baseline, and relative risk $e^{\widehat\beta} = 2.498$.

      Maintenance        Non-Maintenance      Baseline (control)
 ti   Y_M(ti)  d_Mi      Y_N(ti)  d_Ni        h0(ti)   A0(ti)   S0(ti)
  5     11      0          12      2          0.050    0.050    0.951
  8     11      0          10      2          0.058    0.108    0.898
  9     11      1           8      0          0.032    0.140    0.869
 12     10      0           8      1          0.033    0.174    0.841
 13     10      1           7      0          0.036    0.210    0.811
 18      8      1           6      0          0.043    0.254    0.776
 23      7      1           6      1          0.095    0.348    0.706
 27      6      0           5      1          0.054    0.403    0.669
 30      5      0           4      1          0.067    0.469    0.625
 31      5      1           3      0          0.080    0.549    0.577
 33      4      0           3      1          0.087    0.636    0.529
 34      4      1           2      0          0.111    0.747    0.474
 43      3      0           2      1          0.125    0.872    0.418
 45      3      0           1      1          0.182    1.054    0.348
 48      2      1           0      0          0.500    1.554    0.211


Figure 14.3: Estimated baseline hazard under the PH assumption. The purple circles show the baseline hazard; blue crosses show the baseline hazard shifted up proportionally by a multiple of $e^{\widehat\beta} = 2.5$. The dashed green line shows the estimated survival rate for the mixed population (mixing the two estimates by their proportions in the initial population).

Figure 14.4: Comparing the estimated population survival under the PH assumption (green dashed line) with the estimated survival for the combined population (blue dashed line), found by applying the Nelson–Aalen estimator to the population, ignoring the covariate.


14.3 The Cox model in R

The survival package includes a function coxph that computes Cox regression models. To illustrate this, we simulate 100 individuals with hazard rate $t\,e^{x_i}$, where the $x_i$ are normal with mean 0 and variance 0.25. We also right-censor observations at constant rate 0.5. The simulations may be carried out with the following commands:

require('survival')

n=100
censrate=0.5
covmean=0
covsd=0.5
beta=1

x=rnorm(n,covmean,covsd)
# Cumulative hazard is exp(beta*x)*t^2/2, so invert it at an Exp(1) variable
T=sqrt(rexp(n)*2*exp(-beta*x))

# Censoring times
C=rexp(n,censrate)

t=pmin(C,T)
delta=1*(T<C)

Then the Cox model may be fit with the command

cfit=coxph(Surv(t,delta)~x)

> summary(cfit)

Call:

coxph(formula = Surv(t, delta) ~ x)

n= 100, number of events= 50

coef exp(coef) se(coef) z Pr(>|z|)

x 0.8998 2.4590 0.3172 2.836 0.00457 **

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

exp(coef) exp(-coef) lower .95 upper .95

x 2.459 0.4067 1.32 4.579

Concordance= 0.601 (se = 0.049 )

Rsquare= 0.079 (max possible= 0.966 )

Likelihood ratio test= 8.19 on 1 df, p=0.004203

Wald test = 8.04 on 1 df, p=0.004565

Score (logrank) test = 8.18 on 1 df, p=0.004225

If we give a coxph object to survfit it will automatically compute the hazard estimate, using the Efron method as default. We need to give it a data frame of “newdata”, with one column for each covariate. It will output one survival curve for each row of the data frame. In particular, inputting the new data x = 0 we get the baseline hazard. If we plot this object it comes by default with a 95% confidence interval. We show the plot in Figure 14.5.


plot(survfit(cfit,data.frame(x=0)),main=’Cox example’)

tt=(0:300)/100

lines(tt,exp(-tt^2/2),col=2)

legend(.1,.2,c(’baseline survival estimate’,’true baseline’),col=1:2,lwd=2)

Figure 14.5: Survival estimated from 100 individuals simulated from the Cox proportional hazards model. True baseline survival $e^{-t^2/2}$ is plotted in red.

Suppose now we have a categorical variable — for example, three different treatment groups, labelled 0, 1, 2 — with relative risk 1, 2, 3, let us say. If we were to use the command

cfit=coxph(Surv(t,delta)~x)

we would get the wrong result:

coef exp(coef) se(coef) z p

x 0.335 1.4 0.0865 3.88 0.00011

Likelihood ratio test=15 on 1 df, p=0.000107 n= 300, number of events= 191

The problem is that this assumes the effects are monotonic: Group 1 has relative risk $e^\beta$ and Group 2 has relative risk $e^{2\beta}$, which is simply wrong.

The survival for the three different groups may be estimated correctly by defining new covariates x1=(x==1) and x2=(x==2) — that is, they are indicator variables for x being 1 or 2 respectively — and we would then use the command

cfit2=coxph(Surv(t,delta)~x1+x2).

Even easier is to use the factor command, which tells R to treat the vector as non-numeric. If we give the command

cfit2=coxph(Surv(t,delta)~factor(x))


it produces the output

coef exp(coef) se(coef) z p

factor(x)2 0.801 2.23 0.187 4.29 1.8e-05

factor(x)3 0.707 2.03 0.185 3.82 1.3e-04

Likelihood ratio test=23 on 2 df, p=1.01e-05 n= 300, number of events= 191

This comes close to estimating correctly the relative risks 2 and 2.5, which in the first version were estimated as 1.4 and $1.4^2 = 1.96$.

If you want to include time-varying covariates, this may be done crudely by having multiple time intervals for each individual, with all having a start and a stop time, and all but (perhaps) the last being right-censored. (Of course, if individuals have multiple events, then there may be multiple intervals that end with an event.) This allows the covariate to be changed stepwise.
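A minimal sketch of this counting-process (start, stop] format, with made-up numbers for a single subject whose covariate trt changes at time 3 (the data frame and its column names are ours, purely for illustration):

# one subject observed on (0,3] with trt=0 and on (3,7] with trt=1; event at t=7
tvdata <- data.frame(id    = c(1, 1),
                     start = c(0, 3),
                     stop  = c(3, 7),
                     event = c(0, 1),   # the first interval ends without an event
                     trt   = c(0, 1))
# with a full data set one would then fit, e.g.,
# coxph(Surv(start, stop, event) ~ trt + cluster(id), data = tvdata)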

14.4 Graphical tests of the proportional hazards assumption

We describe here several plots that can be made from survival data to determine whether the proportional hazards assumption is reasonable.

14.4.1 Log cumulative hazard plot

The simplest graphical tests require that the covariate take on a few discrete values, with a substantial number of subjects observed in each category. If the covariate is continuous we stratify it, defining a new categorical covariate by the original covariate being in some fixed region.

The first approach is to consider, for categories $1,\ldots,m$, the Nelson–Aalen estimators $\widehat H_i(t)$ of the cumulative hazard for individuals in category $i$. If any relative-risk model holds then $\widehat H_i(t) \approx r(\beta,i)\,\widehat H_0(t)$, so that
\[
\log\widehat H_i(t) - \log\widehat H_j(t) \approx \log r(\beta,i) - \log r(\beta,j)
\]
should be approximately constant.
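A minimal R sketch of this plot for a two-group comparison, using the aml data (fun="cloglog" in plot.survfit gives log cumulative hazard against log time; this is our own illustration, not the notes' code):

require('survival')
fit <- survfit(Surv(time, status) ~ x, data = aml)
plot(fit, fun = "cloglog", col = 1:2,
     xlab = 'time (log scale)', ylab = 'log cumulative hazard')
# under proportional hazards the two curves should differ by a constant vertical shift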

14.4.2 Andersen plot

In the Andersen plot we plot all the pairs $(\widehat H_i(t), \widehat H_j(t))$. If the proportional hazards assumption holds then each pair $(i,j)$ should produce (approximately) a straight line through the origin. It is known (cf. section 11.4 of [17]) that when the ratio of hazard rates $h_i(t)/h_j(t)$ is increasing, the corresponding Andersen plot is a convex function; decreasing ratios produce concave Andersen plots.
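One way to produce an Andersen plot in R for two groups (a sketch of ours: we evaluate both Nelson–Aalen estimators on a common grid of times with stepfun):

require('survival')
fit1 <- survfit(Surv(time, status) ~ 1, data = subset(aml, x == "Maintained"))
fit2 <- survfit(Surv(time, status) ~ 1, data = subset(aml, x == "Nonmaintained"))
grid <- sort(unique(aml$time))
H1 <- stepfun(fit1$time, c(0, fit1$cumhaz))(grid)   # Nelson-Aalen cumulative hazards
H2 <- stepfun(fit2$time, c(0, fit2$cumhaz))(grid)
plot(H1, H2, type = "b", xlab = "cumulative hazard, group 1",
     ylab = "cumulative hazard, group 2")
abline(0, 1, lty = 2)   # reference line through the origin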

14.4.3 Arjas plot

The Arjas plot is a more sophisticated graphical test, capable of testing the proportional hazards assumption for a single categorical covariate, within a model that includes other covariates (which might follow a different sort of model).

Suppose we have fit the model $h_i(t) = h(t\mid\mathbf{x}_i)$ for the hazard rate of individual $i$ — for example, it might be $h_i(t) = e^{\beta\cdot\mathbf{x}_i} h_0(t)$, but it might be something else — and we are interested to decide whether an additional (categorical) covariate $z_i$ (taking on values 0 and 1) ought to be included as well. For each individual we have an estimated cumulative hazard $\widehat H(t\mid\mathbf{x}_i)$. Define


the weighted time on test for individual $i$ at event time $t_j$ as $\widehat H(t_j\wedge T_i\mid\mathbf{x}_i)$, and the total time on test for level $g$ (of the covariate $z$) as
\[
\mathrm{TOT}_g(t_j) = \sum_{i:\,z_i=g}\widehat H(t_j\wedge T_i\mid\mathbf{x}_i);
\]
the number of events at level $g$ will be
\[
N_g(t_j) = \sum_{i:\,z_i=g}\delta_i\,\mathbf{1}_{\{T_i\le t_j\}}.
\]
The idea is that if the covariate $z$ has no effect, the difference $N_g(t_j) - \mathrm{TOT}_g(t_j)$ is a martingale, so a plot of $N_g$ against $\mathrm{TOT}_g$ would lie close to a straight line with $45^\circ$ slope. If levels of $z$ have proportional hazards effects, we expect to see lines of different slopes. If the effects are not proportional, we expect to see curves that are not lines.

We give an example in Figure 14.6. We have simulated a population of 100 males and 100 females, whose hazard rates are $h_i(t) = e^{x_i + \beta_{\mathrm{male}} I_{\mathrm{male}}}\, t$, where $x_i \sim N(0, 0.25)$. In Figure 14.6(a) we show the Arjas plot in the case $\beta_{\mathrm{male}} = 0$; in Figure 14.6(b) we show the plot for the case $\beta_{\mathrm{male}} = 1$.

Figure 14.6: Arjas plots for simulated data: (a) $\beta_{\mathrm{male}} = 0$; (b) $\beta_{\mathrm{male}} = 1$. Each panel plots cumulative hazard (total time on test) against number of events, separately for males and females.

require('survival')

n=100
#n=1000
censrate=0.5
covmean=0
covsd=0.5
maleeff=1
beta=1

xmale=rnorm(n,covmean,covsd)
xfem=rnorm(n,covmean,covsd)
x=c(xmale,xfem)

# Event times: cumulative hazard is exp(beta*x + maleeff*I_male)*t^2/2
Tmale=sqrt(rexp(n)*2*exp(-beta*xmale-maleeff))
Tfem=sqrt(rexp(n)*2*exp(-beta*xfem))

T=c(Tmale,Tfem)

# Censoring times
C=rexp(2*n,censrate)

t=pmin(C,T)

delta=1*(T<C)
sex=factor(c(rep('M',n),rep('F',n)))

cfit=coxph(Surv(t,delta)~x)

# Make a plot to see how close the estimated baseline survival is to the correct one
plot(survfit(cfit,data.frame(x=0)))
tt=(0:300)/100
lines(tt,exp(-tt^2/2),col=2)

beta=as.numeric(cfit$coef)
relrisk=exp(beta*x)*exp(maleeff*c(rep(1,n),rep(0,n)))

eventord=order(t)

et=t[eventord]
es=sex[eventord]
ec=x[eventord]
ed=delta[eventord]

cumhaz=-log(survfit(cfit,data.frame(x=ec))$surv)
haztrunc=sapply(1:(2*n),function(i) pmin(cumhaz[i,i],cumhaz[,i]))
haztruncmale=haztrunc[,es=='M']
haztruncfem=haztrunc[,es=='F']

# Maximum cumulative hazard comes for individual i when we've gotten
# to the row corresponding to eventord[i]

TOTmale=apply(haztruncmale,1,sum)
TOTfem=apply(haztruncfem,1,sum)

# Number of (uncensored) events in each group, in event-time order
Nmale=cumsum(ed*(es=='M'))
Nfem=cumsum(ed*(es=='F'))

plot(Nmale,TOTmale,xlab='Number of events',ylab='Cumulative hazard',type='l',col=2,
     lwd=2,main=paste('Male effect=',maleeff))
abline(0,1)
lines(Nfem,TOTfem,col=3,lwd=2)
legend(.1*n,.4*n,c('male','female'),col=2:3,lwd=2)

14.4.4 Leukaemia example

Section 1.9 of [17] describes data on 101 leukaemia patients, comparing the disease-free survival time between 50 who received allogenic bone marrow transplants and 51 who received autogenic transplants. We follow section 11.4 of that book in testing the proportional hazards model with two graphical tests. The data are available in the object alloauto in KMsurv.

Both plots show that the data are badly suited to a proportional hazards model. The Andersen plot looks clearly convex, suggesting that the hazard ratio of autogenic to allogenic is increasing. This is also what one would infer from the crossing of the two log cumulative hazard curves.

Figure 14.7: Graphical tests for proportional hazards assumption in alloauto data: (a) log cumulative hazard plot (log cumulative hazard against time in weeks); (b) Andersen plot (log cumulative autologous hazard against log cumulative allogenic hazard).

Lecture 15

Testing Hypotheses

A common question is whether two (or more) samples of survival times may be considered to have been drawn from the same distribution: that is, whether the populations under observation are subject to the same hazard rate.

15.1 Tests in the regression setting

1) A package will produce a test of whether or not a regression coefficient is 0. It uses properties of MLEs. Let the coefficient of interest be $b$, say. Then the null hypothesis is $H_0 : b = 0$ and the alternative is $H_A : b \ne 0$. At the 5% significance level, $H_0$ will be accepted if the p-value $p > 0.05$, and rejected otherwise.

2) In an AL parametric model, if $\alpha$ is the shape parameter then we can test $H_0 : \log\alpha = 0$ against the alternative $H_A : \log\alpha \ne 0$. Again MLE properties are used and a p-value is produced as above. In the case of the Weibull, if we accept $\log\alpha = 0$ then we have the simpler exponential distribution (with $\alpha = 1$).

3) We have already mentioned that, to test Weibull v. exponential with null hypothesis $H_0$: exponential is an acceptable fit, we can use
\[
2\log\widehat L_{\mathrm{weib}} - 2\log\widehat L_{\mathrm{exp}} \sim \chi^2(1), \quad\text{asymptotically.}
\]

15.2 Non-parametric testing of survival between groups

15.2.1 General principles

We will consider only the case where the data splits into two groups. There is a relatively easy extension to $k > 2$ groups.

We define the following notation. Event times are $0 < t_1 < t_2 < \cdots < t_m$. For $i = 1, 2, \ldots, m$, and $j = 1, 2$,

$d_{ij}$ = # events at $t_i$ in group $j$,
$n_{ij}$ = # in risk set at $t_i$ from group $j$,
$d_i$ = # events at $t_i$,
$n_i$ = # in risk set at $t_i$.

Thus, when the number of groups $k = 2$, we have $d_i = d_{i1} + d_{i2}$ and $n_i = n_{i1} + n_{i2}$. Generally we are interested in testing the null hypothesis $H_0$, that there is no difference between the hazard rates of the two groups, against the two-sided alternative that there is a difference in the hazard rates. The guiding principle is quite elementary, quite similar to our approach to the proportional hazards model: we treat each event time $t_i$ as a new and independent experiment. Under the null hypothesis, the next event is simply a random sample from the risk set. Thus, the probability of the death at time $t_i$ being from group 1 is $n_{i1}/n_i$, and the probability of it being from group 2 is $n_{i2}/n_i$.

This describes only the setting where the events all occur at distinct times: that is, the $d_i$ are all exactly 1. More generally, the null hypothesis predicts that the group identities of the individuals whose events are at time $t_i$ are like a sample of size $d_i$ without replacement from a collection of $n_{i1}$ ‘1’s and $n_{i2}$ ‘2’s. The distribution of $d_{i1}$ under such sampling is called the hypergeometric distribution. It has
\[
\text{expectation} = d_i\frac{n_{i1}}{n_i}, \qquad\text{and}\qquad \text{variance} =: \sigma_i^2 = \frac{n_{i1}n_{i2}(n_i-d_i)d_i}{n_i^2(n_i-1)}.
\]
Note that if $d_i$ is negligible with respect to $n_i$, this variance formula reduces to $d_i\bigl(\frac{n_{i1}}{n_i}\bigr)\bigl(\frac{n_{i2}}{n_i}\bigr)$, which is just the variance of a binomial distribution.

Conditioned on all the events up to time $t_i$ (hence on $n_i, n_{i1}, n_{i2}$) and on $d_i$, the random variable $d_{i1} - n_{i1}\frac{d_i}{n_i}$ has expectation 0 and variance $\sigma_i^2$. If we multiply it by an arbitrary weight $W(t_i)$, determined by the data up to time $t_i$, we still have $W(t_i)\bigl(d_{i1} - n_{i1}\frac{d_i}{n_i}\bigr)$ being a random variable with (conditional) expectation 0, but now (conditional) variance $W(t_i)^2\sigma_i^2$. This means that if we define, for $k = 1,\ldots,m$,
\[
M_k := \sum_{j=1}^{k} W(t_j)\Bigl(d_{j1} - n_{j1}\frac{d_j}{n_j}\Bigr),
\]
these will be random variables with expectation 0 and variance $\sum_{i=1}^{k} W(t_i)^2\sigma_i^2$. While the increments are not independent, we may still apply a version of the Central Limit Theorem to show that $M_k$ is approximately normal when the sample size is large enough. (In technical terms, the sequence of random variables $M_k$ is a martingale, and the appropriate theorem is the Martingale Central Limit Theorem. See [11] for more details.) We then base our tests on the statistic

\[
Z := \frac{\sum_{j=1}^{m} W(t_j)\bigl(d_{j1} - n_{j1}\frac{d_j}{n_j}\bigr)}{\sqrt{\sum_{j=1}^{m} W(t_j)^2\,\dfrac{n_{j1}n_{j2}(n_j-d_j)d_j}{n_j^2(n_j-1)}}},
\]
which should have a standard normal distribution under the null hypothesis. Note that, as in the Cox regression setting, right censoring and left truncation are automatically taken care of, by appropriate choice of the risk sets.

15.2.2 Standard tests

Any choice of weights $W(t_i)$ defines a valid test. Why do we need weights? Since any choice of weights produces a correct test, there is no canonical choice. Changing the weights changes the power with respect to different alternatives. Which alternative you choose — hence, which weights you choose — should depend on what deviations from equality you are most interested in detecting. As always, the test should be chosen beforehand. Multiple testing makes the interpretation of test results problematic.

Some common choices are:


1. $W(t_i) = 1$ for all $i$. This is the log rank test, and is the test in most common use. The log rank test is aimed at detecting a consistent difference between hazards in the two groups and is best placed to consider this alternative when the proportional hazard assumption applies. It is maximally asymptotically efficient in the proportional hazards context; in fact, it is equivalent to the score test for the Cox regression parameter being 0, hence is asymptotically equivalent to the likelihood ratio test. A criticism is that it can give too much weight to the later event times when numbers in the risk sets may be relatively small.

2. R. Peto and J. Peto [21] proposed a test that emphasises deviations that occur early on, when there are more individuals under observation. Petos' test uses a weight dependent on a modified estimated survival function, estimated for the whole study. The modified estimator is
\[
\tilde S(t) = \prod_{t_i\le t}\frac{n_i+1-d_i}{n_i+1}
\]
and the suggested weight is then
\[
W(t_i) = \tilde S(t_{i-1})\frac{n_i}{n_i+1}.
\]
This has the advantage of giving more weight to the early events and less to the later ones where the population remaining is smaller. Much like the cumulative-deviations test, this avoids giving extra weight to excess deaths that come later.

3. $W(t_i) = n_i$ has also been suggested (Gehan, Breslow). This again downgrades the effect of the later times.

4. D. Harrington and T. Fleming [12] proposed a class of tests that include Petos' test and the logrank test as special cases. The Fleming–Harrington tests use
\[
W(t_j) = \bigl(\widehat S(t_{j-1})\bigr)^p\bigl(1-\widehat S(t_{j-1})\bigr)^q,
\]
where $\widehat S$ is the Kaplan-Meier survival function, estimated for all the data. Then $p = q = 0$ gives the logrank test and $p = 1$, $q = 0$ gives a test very close to Peto's test. If we were to set $p = 0$, $q > 0$ this would emphasise the later event times if needed for some reason. This might be the case if we were testing the effect of a medication that we expected to need some time after initiation of treatment to become fully effective.

All of these tests may be written in the form
\[
\frac{\sum_i (O_{i1} - E_{i1})W_i}{\sqrt{\sum_i \sigma_{i1}^2 W_i^2}},
\]
where $O_{i1}$ and $E_{i1}$ are observed and expected numbers of events. Consequently, positive and negative fluctuations can cancel each other out. This could conceal a substantial difference between hazard rates which is not of the proportional hazards form, but where the hazard rates (for instance) cross over, with group 1 having (say) the higher hazard early, and the lower hazard later. One way to detect such an effect is with a test statistic to which fluctuations contribute only their absolute values. For instance, we could use the standard $\chi^2$ statistic
\[
X := \sum_{i=1}^{m}\sum_{j=1}^{k}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}.
\]


Asymptotically, this should have the $\chi^2$ distribution with $(k-1)m$ degrees of freedom. Of course, if the number of groups $k = 2$, this is the same as
\[
X := \sum_{i=1}^{m}\frac{(O_{i1}-E_{i1})^2}{d_i\frac{n_{i1}}{n_i}\bigl(1-\frac{n_{i1}}{n_i}\bigr)}.
\]

15.3 The AML example

We can use these tests to compare the survival of the two groups in the AML experiment discussed in section 10.5. The relevant quantities are tabulated in Table 15.1.

 tj   nj1  nj2  dj1  dj2   sigma^2_j   Peto weight
  5    11   12    0    2     0.476        0.958
  8    11   10    0    2     0.474        0.875
  9    11    8    1    0     0.244        0.792
 12    10    8    0    1     0.247        0.750
 13    10    7    1    0     0.242        0.708
 18     8    6    1    0     0.245        0.661
 23     7    6    1    1     0.456        0.614
 27     6    5    0    1     0.248        0.519
 30     5    4    0    1     0.247        0.467
 31     5    3    1    0     0.234        0.416
 33     4    3    0    1     0.245        0.364
 34     4    2    1    0     0.222        0.312
 43     3    2    0    1     0.240        0.260
 45     3    1    0    1     0.188        0.208

Table 15.1: Data for testing equality of survival in AML experiment.

When the weights are all taken equal, we compute $Z = -1.84$, whereas the Peto weights — which reduce the influence of later observations — give us $Z = -1.67$. This yields one-sided p-values of 0.033 and 0.048 respectively — a marginally significant difference — or two-sided p-values of 0.065 and 0.096.

Applying the $\chi^2$ test yields $X = 16.86$, which needs to be compared to $\chi^2$ with 14 degrees of freedom. The resulting p-value is 0.24, which is not at all significant. This should not be seen as surprising: the differences between the two survival curves are clearly mostly in the same direction, so we lose power when applying a test that ignores the direction of the difference.

15.3.1 Kidney dialysis example

This is example 7.2 from [17]. The data are from a clinical trial of alternative methods of placing catheters in kidney dialysis patients. The event time is the first occurrence of an exit-site infection. The data are in the KMsurv package, in the object kidney. (Note: there is a different data set with the same name in the survival package. To make sure you get the right one, enter data(kidney,package='KMsurv').) The Kaplan–Meier estimator is shown in Figure 15.1. There are two survival curves, corresponding to the two different methods.

We show the calculations for the nonparametric test of equality of distributions in Table 15.2. The log-rank test — obtained by simply dividing the sum of all the deviations by the square root of the sum of terms in the $\sigma^2_i$ column — is only 1.59, so not significant.


Figure 15.1: Plot of Kaplan–Meier survival curves for time to infection (months) of dialysis patients, based on data described in section 1.4 of [17]. The black curve represents 43 patients with surgically-placed catheter; the red curve 76 patients with percutaneously placed catheter.

With the Peto weights the statistic is only 1.12. This is not surprising, because the survival curves are close together (and actually cross) early on. On the other hand, they diverge later, suggesting that weighting the later times more heavily would yield a significant result. It would not be responsible statistical practice to choose a different test after seeing the data. On the other hand, if we had started with the belief that the benefits of the percutaneous method are cumulative, so that it would make sense to expect the improved survival to appear later on, we might have planned from the beginning to use the Harrington–Fleming weights with, for example, $p = 0$, $q = 1$, tabulated in the last column. Applying these weights gives us a test statistic $Z_{FH} = 3.11$, implying a highly significant difference.

15.3.2 Nonparametric tests in R (non-examinable)

The survival package includes a command survdiff that carries out the tests described in this lecture. The following code carries out the log-rank test and the Petos' test (actually, the Harrington–Fleming (1,0) test).

> kS=Surv(kidney$time, kidney$delta)
> survdiff(kS~kidney$type)   # log-rank test
Call:
survdiff(formula = kS ~ kidney$type)

                N Observed Expected (O-E)^2/E (O-E)^2/V
kidney$type=1  43       15       11      1.42      2.53
kidney$type=2  76       11       15      1.05      2.53

 Chisq= 2.5  on 1 degrees of freedom, p= 0.112
> survdiff(kS~kidney$type, rho=1)   # Petos test
Call:
survdiff(formula = kS ~ kidney$type, rho = 1)

                N Observed Expected (O-E)^2/E (O-E)^2/V
kidney$type=1  43     12.0     9.48     0.686      1.39
kidney$type=2  76     10.4    12.98     0.501      1.39

 Chisq= 1.4  on 1 degrees of freedom, p= 0.239

  tj   nj1  nj2  dj1  dj2   sigma^2_j   Peto wt.   H-F(0,1) wt.
 0.5    43   76    0    6     1.326       0.992       0.000
 1.5    43   60    1    0     0.243       0.941       0.050
 2.5    42   56    0    2     0.485       0.931       0.059
 3.5    40   49    1    1     0.489       0.912       0.078
 4.5    36   43    2    0     0.490       0.890       0.099
 5.5    33   40    1    0     0.248       0.867       0.121
 6.5    31   35    0    1     0.249       0.854       0.133
 8.5    25   30    2    0     0.487       0.839       0.146
 9.5    22   27    1    0     0.247       0.807       0.176
10.5    20   25    1    0     0.247       0.790       0.193
11.5    18   22    1    0     0.247       0.770       0.210
15.5    11   14    1    1     0.472       0.741       0.230
16.5    10   13    1    0     0.246       0.681       0.289
18.5     9   11    1    0     0.247       0.649       0.319
23.5     4    3    1    0     0.245       0.568       0.351
26.5     2    3    1    0     0.240       0.473       0.432

Table 15.2: Data for kidney dialysis study.

Note that the output of survdiff reports a chi-squared statistic on 1 degree of freedom: for the log-rank test this is just the square of the Z statistic computed above ($1.59^2 \approx 2.5$). We conclude with the R code used to generate Figure 15.1 and Table 15.2.

require('KMsurv')
require('xtable')   # needed for the xtable() call below
data(kidney, package='KMsurv')
attach(kidney)
kid.surv=Surv(time, delta)
kid.fit=survfit(kid.surv~type)
plot(kid.fit, col=1:2, xlab='Time to infection (months)',
     main='Kaplan-Meier plot for kidney dialysis data')

t=sort(unique(time))
d=lapply(1:2, function(i) sapply(t, function(T) sum((time[type==i]==T)&(delta[type==i]==1))))
n=lapply(1:2, function(i) sapply(t, function(T) sum(time[type==i]>=T)))

keep=(n[[1]]*n[[2]]>0)
t=t[keep]
d=lapply(d, function(D) D[keep])
n=lapply(n, function(D) D[keep])
ddot=d[[1]]+d[[2]]
ndot=n[[1]]+n[[2]]

si=ddot*n[[1]]*n[[2]]*(ndot-ddot)/ndot/ndot/(ndot-1)
petos=cumprod((ndot-ddot)/(ndot))
petow=c(1, petos[1:(length(petos)-1)])*ndot/(ndot+1)
fhw=c(0, (1-petos[1:(length(petos)-1)]))
ei=ddot*n[[1]]/ndot
wk=rep(1, length(ei))

zLR=sum(wk*(d[[1]]-ei))/sqrt(sum(wk^2*si))
zP=sum(petow*(d[[1]]-ei))/sqrt(sum(petow^2*si))
zFH=sum(fhw*(d[[1]]-ei))/sqrt(sum(fhw^2*si))

xt=xtable(cbind(t, n[[1]], n[[2]], d[[1]], d[[2]], si, petow, fhw),
          display=c('d','f','d','d','d','d','f','f','f'),
          digits=c(0,1,0,0,0,0,3,3,3))
print(xt, include.rownames=FALSE, include.colnames=FALSE)

Lecture 16

Testing the fit of survival models

In section 14.4 we looked at graphical plots that can be used to test the appropriateness of the proportional hazards model. These graphical methods work when we have large groups of individuals whose survival curves may be estimated separately, and then compared with respect to the proportional-hazards property.

In this lecture we look more generally at the problem of testing model assumptions — mainly, but not exclusively, the assumptions of the Cox regression model — and describe a suite of tools that may be used for models with quantitative as well as categorical covariates.

16.1 General principles of model selection

16.1.1 The idea of model diagnostics

A statistical model is a family of probability distributions, whose realisations are data of the sort that we are trying to analyse. Fitting a model means that we find the member of the family that is closest to the data in terms of some criterion. If the model is good, then this best-fit model will be a reasonable proxy — for all sorts of inference purposes — for the process that originally generated the data. That means that other properties of the data that were not used in choosing the representative member of the family should also be close to the corresponding properties of the model fit.

(An alternative approach, that we won't discuss here, is model averaging, where we accept up front that no model is really correct, and so give up the search for the “one best”. Instead, we draw our statistical inferences from all the models in the family simultaneously, appropriately weighted for how well they suit the data.)

The idea, then, is to look at some deviations of the data from the best-fit model — the residuals — which may be represented in terms of test statistics or graphical plots whose properties are known under the null hypothesis that the data came from the family of distributions in question, and then evaluate the performance. Often, this is done not in the sense of formal hypothesis testing — after all, we don't expect the data to have come exactly from the model, so the difference between rejecting and not rejecting the null hypothesis is really just a matter of sample size — but of evaluating whether the deviations from the model seem sufficiently crass to invalidate the analysis. In addition, residuals may show not merely that the model does not fit the data adequately, but also what the systematic difference is, pointing the way to an improved model. This is the main application of martingale residuals, which we will discuss in section 16.5. Alternatively, it may show that the failure is confined to a few individuals. Together with other information, this may lead us to analyse these outliers as a separate group, or to discover inconsistencies in the data collection that would make it appropriate to analyse the remaining data without these few outliers. The main tool for detecting outliers is the deviance residual, which we discuss in section 16.6.1.

16.1.2 A simulated example

Suppose we have data simulated from a very simple survival model, where individual $i$ has constant hazard $1 + x_i$, where $x_i$ is an observed positive covariate, with independent right censoring at constant rate 0.2. Now suppose we choose to fit the data to a proportional hazards model. What would go wrong? Not surprisingly, for such a simple model, the main conclusion — that the covariate has a positive effect on the hazard — would still be qualitatively accurate. But what about the estimate of baseline hazard?

We simulated this process with 1000 individuals, where the covariates were the absolute values of independent normal random variables. We must first recognise that it is not entirely clear what it even means to evaluate the accuracy of fit of such a misspecified model. If we plug the simulated data into the Cox model, we necessarily get an exponential parameter out, telling us that the hazard rate corresponding to covariate $x$ is proportional to $e^{\beta x}$. Since the hazard is actually $1 + x$, it is not clear what it would mean to say that the parameter was well estimated. Certainly, the positive effect of the covariate should be translated into a positive $\beta$. Similarly, the baseline hazard of the Cox model agrees with the baseline hazard of the additive-hazards model, in that both are supposed to be the hazard rates for an individual with covariates 0, but their roles in the two models are sufficiently different that any comparison is on uncertain ground.

Still, the fitted Cox model makes a prediction about the hazard rate of an individual whose covariates are all 0, and that prediction is wrong. When we fit the data from 1000 individuals to the Cox model, we get this output:

coef exp(coef) se(coef) z p

x 0.581 1.79 0.0559 10.4 0

Likelihood ratio test=98.6 on 1 df, p=0 n= 1000, number of events= 882

In Figure 16.1(a) we see the baseline survival curve estimated from these data by the Cox model. The confidence region is quite narrow, but we see that the true survival curve — the red curve — is nowhere near it. In Figure 16.1(b) we have a smoothed version of the hazard estimate, and we see that forcing the data into a misspecified Cox model has turned the constant baseline hazard into an increasing hazard.


Figure 16.1: (a) Baseline survival and (b) smoothed hazard estimated from the Cox proportional hazards model for data simulated from the additive hazards model. Red is the true baseline hazard.

16.2 Cox–Snell residuals

Given a linear regression model $Y_i = \beta X_i + \epsilon_i$, we stress-test the model by looking at the residuals $Y_i - \widehat\beta X_i$. If the model is reasonably appropriate, the residuals should look something like a sample from the distribution posited for the errors $\epsilon_i$. There are many ways the residuals can fail to have the “right” distribution — wrong tails, dependence between different residuals, changing distributions over time or depending on $X_i$ — and consequently many ways to test them, ranging from formal test statistics to graphical plots that are evaluated by eye.

In evaluating regression models for survival we are doing something similar, except that the connection between the individual observations and the parameters being estimated is much more indirect, thus demanding more ingenuity to even define the residuals, and then to evaluate them. There is a large family of different residuals that have been defined for survival models, each of which is useful for different parts of the task of model diagnostics, including:

• Generally evaluating the appropriateness of a regression model (such as Cox proportional hazards or additive hazards);

• Specifically evaluating assumptions of the regression model (such as the proportional hazards assumption, or the log-linear action of the covariates);

• Finding specific outlier individuals in an otherwise reasonably well specified model.

The most basic version is called the Cox–Snell residual. It is based on the observation that if $T$ is a sample from a distribution with cumulative hazard function $H$, then $H(T)$ has an exponential distribution with parameter 1.

Given a parametric model $H(t, \theta)$ we would then generate and evaluate Cox–Snell residuals as follows:

1. We use the samples $(T_i, \delta_i)$ to estimate a best fit $\widehat\theta$;

2. Compute the residuals $r_i := H(T_i, \widehat\theta)$;


3. If the model is well specified — a good fit to the data — then $(r_i, \delta_i)$ should be like a right-censored sample from a distribution with constant hazard 1.

4. A standard way of evaluating the residuals is to compute and plot a Nelson–Aalen estimator for the cumulative hazard rate of the residuals. The null hypothesis — that the data came from the parametric model under consideration — would predict that this plot should lie close to the line $y = x$.

Of course, things aren't quite so straightforward when evaluating a semi-parametric model (such as the Cox model) or nonparametric model (such as the Aalen additive hazards model). We describe here the standard procedure for computing Cox–Snell residuals for the Cox proportional hazards model (the original application):

1. We use the samples $(T_i, \mathbf{x}_i, \delta_i)$ to estimate a best fit $\widehat\beta$;

2. We compute the Breslow estimator $\widehat H_0(t)$ for the baseline hazard;

3. Compute the residuals $r_i := e^{\widehat\beta^T\mathbf{x}_i}\,\widehat H_0(T_i)$.

After this, we proceed as above. Of course, there is nothing special about the Cox log-linear form of the relative risk. Given any relative risk function $r(\beta,\mathbf{x})$, we may define residuals $r_i := r(\widehat\beta,\mathbf{x}_i)\,\widehat H_0(T_i)$.
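A minimal R sketch of this recipe (our own illustration, reusing cfit and delta from the simulation in section 14.3):

mart <- residuals(cfit, type = "martingale")   # delta_i minus the Cox-Snell residual (section 16.5)
cs   <- delta - mart                           # Cox-Snell residuals r_i
csfit <- survfit(Surv(cs, delta) ~ 1)          # Nelson-Aalen estimate of their cumulative hazard
plot(csfit$time, csfit$cumhaz, type = "s",
     xlab = "Cox-Snell residual", ylab = "estimated cumulative hazard")
abline(0, 1, lty = 2)   # a well-specified model should stay close to the line y = x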

16.3 Bone marrow transplantation example

(This is based on Example 11.1 of [17].)

The data here are in the object bmt of the KMsurv package. They describe the disease-free survival time of leukaemia patients following a bone marrow transplant. We consider the model with proportional effects for the following variables:

Z1 = patient age at transplant (centred at 28 years);
Z2 = donor age at transplant (centred at 28 years);
Z1 × Z2;
Z3 = patient sex (1 = male, 0 = female);
Z7 = waiting time to transplant.

We fit the model in R as follows:

> bmcox=coxph(Surv(t2,d3)~z1+z2+z1*z2+z3+z7, data=bmt)
> summary(bmcox)
Call:
coxph(formula = Surv(t2, d3) ~ z1 + z2 + z1 * z2 + z3 + z7, data = bmt)

  n= 137, number of events= 83

            coef  exp(coef)  se(coef)      z Pr(>|z|)
z1    -0.1142944  0.8919953 0.0356292 -3.208  0.00134 **
z2    -0.0859570  0.9176336 0.0302892 -2.838  0.00454 **
z3    -0.3062033  0.7362370 0.2298515 -1.332  0.18280
z7     0.0001135  1.0001135 0.0003274  0.347  0.72875
z1:z2  0.0036177  1.0036243 0.0009203  3.931 8.46e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      exp(coef) exp(-coef) lower .95 upper .95
z1       0.8920     1.1211    0.8318    0.9565
z2       0.9176     1.0898    0.8647    0.9738
z3       0.7362     1.3583    0.4692    1.1552
z7       1.0001     0.9999    0.9995    1.0008
z1:z2    1.0036     0.9964    1.0018    1.0054

Concordance= 0.605  (se = 0.033 )
Rsquare= 0.104   (max possible= 0.996 )
Likelihood ratio test= 15.06  on 5 df,   p=0.01012
Wald test            = 18.02  on 5 df,   p=0.00292
Score (logrank) test = 18.9   on 5 df,   p=0.002009

So we see that patient age and donor age both seem to have strong effects on disease-free survival time (with increasing age acting negatively on the hazard — that is, increasing the length of disease-free survival — as we might expect if we consider that many forms of cancer progress more rapidly in younger patients). Somewhat surprisingly, the effect of a year of donor age is almost as strong as the effect of a year of patient age. There is also a strong positive interaction term, suggesting that the prognosis for an old patient receiving a transplant from an old donor is not as favourable as we would expect from simply adding their effects. Thus, for example, the oldest patient was 80 years old, while the oldest donor was 84. The youngest patient was 35, the youngest donor just 30. The model suggests that the 80-year-old patient should have a hazard rate for relapse that is a factor of $e^{-0.1143\times 45} = 0.006$ that of the youngest. Indeed, the youngest patient relapsed after just 43 days, while the oldest died after 363 days without recurrence of the disease. The patient with the oldest donor would be predicted to have his or her hazard rate of recurrence reduced by a factor of $e^{-0.0860\times 54} = 0.0096$. Indeed, the youngest patient did have the youngest donor, and the oldest patient nearly had the oldest. Had that been the case, the ratio of hazard rate for the oldest to that of the youngest would have been $0.0096\times 0.006 = 0.00006$. Taking account of the interaction term we see that the actual proportionality factor predicted by the model is
\[
\exp\bigl(-0.1143\times 45 - 0.0860\times 54 + 0.0036177\,(45\times 54)\bigr) \approx 0.37.
\]

16.4 Testing the proportional hazards assumption: Schoenfeld residuals

At least as important as checking the log-linearity of a covariate's influence for the Cox proportional hazards model is checking that the influence is constant in time. We already described how to make a graphical test of the proportional hazards assumption between two groups of subjects. It is less obvious how to test the proportional hazards assumption associated with a relative-risk regression model.

For this purpose the standard tool is the Schoenfeld residuals. The formal derivation is left as an exercise, but the definition of the $j$-th Schoenfeld residual for parameter $\beta_k$ is
\[
S_{kj}(t_j) := X_{i_j k}(t_j) - \bar X_k(t_j),
\]
where as usual $i_j$ is the individual with event at time $t_j$, and
\[
\bar X_k(t) = \frac{\sum_{i=1}^{n} Y_i(t)\,X_{ik}(t)\,e^{\widehat\beta^T X_i(t)}}{\sum_{i=1}^{n} Y_i(t)\,e^{\widehat\beta^T X_i(t)}}
\]
is the weighted mean of covariate $X_k$ at time $t$. Thus the Schoenfeld residual measures the difference between the covariate at time $t$ and the average covariate at time $t$. If $\beta_k$ is constant this has expected value 0. If the effect of $X_k$ is increasing we expect the estimated parameter $\widehat\beta_k$ to be an overestimate early on — so the individuals with events then have lower $X_k$ than we would have expected, producing negative Schoenfeld residuals; at later times the residuals would tend to be positive. Thus, increasing effect is associated with increasing Schoenfeld residuals. Likewise decreasing effect is associated with decreasing Schoenfeld residuals. As with the martingale residuals, we typically make a smoothed plot of the Schoenfeld residuals, to get a general picture of the time trend.

We can also make a formal test of the hypothesis that $\beta_k$ is constant by fitting a linear regression line to the Schoenfeld residuals as a function of time, and testing the null hypothesis of zero slope, against the alternative of nonzero slope. Of course, such a test will have little or no power to detect nonlinear deviations from the hypothesis of constant effect — for instance, threshold effects, or changing direction.
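In R, the scaled Schoenfeld residuals and the associated zero-slope test are available through the cox.zph function in the survival package; a minimal sketch, again reusing cfit from section 14.3:

zp <- cox.zph(cfit)   # test of the proportional hazards assumption for each covariate
print(zp)             # chi-squared test of zero slope for the scaled Schoenfeld residuals
plot(zp)              # smoothed residuals against time; a flat curve supports constant effect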

16.5 Martingale residuals

16.5.1 Definition of martingale residuals

A close approximation to the residuals that we use in the linear-regression setting is given by the martingale residuals. (A good source for the practicalities of martingale residuals is chapter 11 of [17]. The mathematical details are best presented in section 4.5 of [8].)

Intuitively, the martingale residual for an individual is the difference between the number of observed events for the individual and the expected number under the model. The expected number of events from time 0 to $t$ for individual $i$ is $H_i(t)$, so in principle this is
\[
\delta_i - \int_0^{T_i} h_0(s)\,e^{\beta\cdot\mathbf{x}_i(s)}\,ds.
\]
We turn this into a residual — something we can compute from the data — by replacing the integral with respect to the hazard by the differences in the estimated cumulative hazard
\[
\begin{aligned}
\widetilde M_i &:= \delta_i - \widehat H_i(T_i) \\
&= \delta_i - \sum_{t_j\le T_i} e^{\widehat\beta\cdot\mathbf{x}_i(t_j)}\,\widehat h_0(t_j) \\
&= \delta_i - \sum_{t_j\le T_i} e^{\widehat\beta\cdot\mathbf{x}_i(t_j)}\frac{1}{\sum_{\ell\in R_j} e^{\widehat\beta\cdot\mathbf{x}_\ell(t_j)}}.
\end{aligned}
\tag{16.1}
\]
It differs from the (negative of the) Cox–Snell residual only by the addition of $\delta_i$. When the covariates are constant in time,
\[
\widetilde M_i = \delta_i - e^{\widehat\beta^T\mathbf{x}_i}\,\widehat H_0(T_i).
\]
An individual martingale residual is a very crude measure of deviation, since for any given $t$ the only observation here that is relevant to the survival model is $\delta_i$, which is a binary observation.

If there are no ties, or if we use the Breslow method for resolving ties, the sum of all the martingale residuals is 0.


16.5.2 Application of martingale residuals for estimating covariate transforms

Martingale residuals are not very useful in the way that linear-regression residuals are, because there is no natural distribution to compare them to. The main application is to estimate appropriate modifications to the proportional hazards model by way of covariate transformation: instead of a relative risk of $e^{\beta x}$, it might be $e^{f(x)}$, where $f(x)$ could be $\mathbf{1}_{\{x<x_0\}}$ or $\sqrt{x}$, or something else.

We assume that in the population $\mathbf{x}_i$ and $z_i$ are independent. This won't be exactly true in reality, but obviously a strong correlation between two variables complicates efforts to disentangle their effects through a regression model. We derive the formula under the assumption of independence, understanding that the results will be less reliable the more intertwined the variable $z_i$ is with the others.

Suppose the data $(T_i, \mathbf{x}_i(\cdot), z_i, \delta_i)$ are sampled from a relative-risk model with two covariates: a vector $\mathbf{x}_i$ and an additional one-dimensional covariate $z_i$, with
\[
\log r(\beta,\mathbf{x},z) = \beta^T\mathbf{x} + f(z);
\]
that is, the Cox regression model holds except with regard to the last covariate, which acts as $h(z) := e^{f(z)}$. Let $\widehat\beta$ be the $p$-dimensional vector corresponding to the Cox model fit without the covariate $z$, and let $\widetilde M_i$ be the corresponding martingale residuals.

Another complication is that we use $\widehat\beta$ instead of $\beta$ (which we don't know). We will derive the relationship under the assumption that they are equal; again, errors in estimating $\beta$ will make the conclusions less correct. For large $n$ we may assume that $\widehat\beta$ and $\beta$ are close.

Let
\[
\bar h(s,\mathbf{x}) := \frac{E\bigl[\mathbf{1}_{\{\text{at risk at time } s\}}\,e^{f(z)}\mid\mathbf{x}\bigr]}{P\{\text{at risk at time } s\mid\mathbf{x}\}} = E\bigl[h(z)\,\big|\,\text{at risk at time } s;\,\mathbf{x}\bigr];
\qquad
\bar h(s) := E\bigl[\bar h(s,\mathbf{x})\bigr].
\]
We assume that $\bar h(s,\mathbf{x})$ is approximately the constant $\bar h(s)$.

Fact.
\[
E\bigl[\widetilde M\mid z\bigr] \approx P\{\delta_i = 1\}\,\bigl(f(z) - \log\bar h(1)\bigr). \tag{16.2}
\]

Thus, we may estimate $f(z_0)$ by estimating the local average of $\widetilde M$, averaged over $z$ close to $z_0$. For instance, we can compute a LOESS smooth curve fit to the scatterplot of points $(z_i,\widetilde M_i)$.

The basic idea is that the martingale residuals measure the excess events, the difference between the observed and expected number of events. If we compute the expectation without taking account of $z$, then individuals whose $z$ value has $f(z)$ large positive will seem to have a large number of excess events; and those whose $f(z)$ is large negative will seem to have fewer events than expected.

A reasonably formal proof may be found in [22], and also reproduced in chapter 4 of [8].
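A minimal R sketch of this diagnostic (fit a model that omits the covariate, then smooth the martingale residuals against it; fit0 and z are hypothetical names for the reduced fit and the omitted covariate):

# fit0: a coxph fit that omits the covariate z; z: the omitted covariate vector
mart0 <- residuals(fit0, type = "martingale")
plot(z, mart0, xlab = "z", ylab = "martingale residual")
lines(lowess(z, mart0), col = 2, lwd = 2)   # the smooth approximates f(z) up to a constant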


16.6 Outliers and influence

We don't expect a model to be exactly right, but what does it mean for it to be “close enough”? An important way a model can go badly wrong is if the model fit is dominated by a very small number of the observations. This can happen if either there are extreme values of the covariates, or if an outcome (the time in a survival model) is exceptionally far off of the predicted value. The former are called high-leverage observations, the latter are outliers. We need tools for identifying these observations, and determining how much influence they exert over the model fit.

16.6.1 Deviance residuals

The martingale residual for an individual is
\[
\text{observed } \#\text{ events} - \text{expected } \#\text{ events}
\]
for individual $i$. In principle, large values indicate outliers — results that individually are unexpected if the model is true. The problem is, these are highly-skewed variables — they take values between $-\infty$ and 1 — and it is hard to determine what their distribution should look like.

We define the deviance residual for individual $i$ as
\[
d_i := \operatorname{sgn}(\widetilde M_i)\Bigl\{-2\bigl[\widetilde M_i + \delta_i\log(\delta_i - \widetilde M_i)\bigr]\Bigr\}^{1/2}. \tag{16.3}
\]

This choice of scaling is inspired by the intuition that each $d_i$ should represent the contribution of that individual to the total model deviance.¹ Whereas the martingale residuals are between $-\infty$ and 1, the deviance residuals should have a similar range to a standard normal random variable. Thus, we treat values outside a range of about $-2.5$ to $+2.5$ as potential outliers.

16.6.2 Delta–beta residuals

Recall that the leverage of an individual observation is a measure of the impact of that observation on the parameters of interest. The “delta–beta” residual for parameter $\beta_k$ and subject $i$ is defined as
\[
\Delta\beta_{ki} := \widehat\beta_k - \widehat\beta_{k(i)},
\]
where $\widehat\beta_{k(i)}$ is the estimate of $\beta_k$ with individual $i$ removed. In principle we may compute this exactly, by recalculating the model $n$ times for $n$ subjects, but this is computationally expensive. We can approximate it by a combination of the Fisher information and the Schoenfeld residual (actually, a variant of the Schoenfeld residual called the score residual). We do not give the formula here, but it may be found in [3, Section 4.3]. These estimates may be called up easily in R, as described in section 16.7.

Individuals with high values of $\Delta\beta$ should be looked at more closely. They may reveal evidence of data-entry errors, interactions between different parameters, or just the influence of extreme values of the covariate. You should particularly be worried if there are high-influence individuals pushing the parameters of interest in one direction.

¹Recall that deviance is defined as
\[
D = 2\bigl[\text{log likelihood(saturated)} - \text{log likelihood}(\widehat\beta)\bigr].
\]
Applied to the Cox model, the saturated model is the one where each individual has an individual parameter $\beta^*_i$. It is possible to derive then that the deviance is $\sum_{i=1}^{n} d_i^2$.


16.7 Residuals in R (non-examinable)

As an object-oriented language, R has functions in place for any well-written package to produce standard kinds of outputs in an appropriate way. In particular, if fit is the output of a model-fitting procedure, resid(fit) should produce sensible residuals. If fit is an object of type coxph, the residuals can be "martingale", "deviance", "score", "schoenfeld", "dfbeta", "dfbetas", and "scaledsch" (scaled Schoenfeld).
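For example, a minimal sketch reusing cfit from section 14.3:

dev <- resid(cfit, type = "deviance")
which(abs(dev) > 2.5)    # individuals flagged as potential outliers
dfb <- resid(cfit, type = "dfbeta")
head(dfb)                # approximate change in each coefficient when a subject is dropped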

16.7.1 Dutch Cancer Institute (NKI) breast cancer data

We apply these methods to the data on survival of Dutch breast cancer patients collected in the nki data set, and discussed in [13]. The data are available in the dynpred package in R. The most interesting thing about the study that these data come from is that it was one of the first to relate survival of breast cancer patients to gene expression data. The data and some results are described in [24].
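A minimal sketch of loading the data (assuming the survival and dynpred packages are installed; the data() call follows the usual convention for packaged data sets):

  library(survival)
  library(dynpred)   # provides the nki breast-cancer data frame
  data(nki)
  str(nki)           # inspect the covariates listed below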

The data frame includes the following covariates:

patnr Patient identification number

d Survival status; 1 = death; 0 = censored

tyears Time in years until death or last follow-up

diameter Diameter of the primary tumor

posnodes Number of positive lymph nodes

age Age of the patient

mlratio Oestrogen-receptor gene expression level

chemotherapy Chemotherapy used (yes/no)

hormonaltherapy Hormonal therapy used (yes/no)

typesurgery Type of surgery (excision or mastectomy)

histolgrade Histological grade (Intermediate, poorly, or well differentiated)

vasc.invasion Vascular invasion (-, +, or +/-)

We begin by fitting the Cox model, including all the potentially significant covariates:

  nki.surv=with(nki, Surv(tyears, d))
  nki.cox=with(nki, coxph(nki.surv~posnodes+chemotherapy+hormonaltherapy
               +histolgrade+age+mlratio+diameter+posnodes+vasc.invasion+typesurgery))
  > summary(nki.cox)
  Call:
  coxph(formula = nki.surv ~ posnodes + chemotherapy + hormonaltherapy +
      histolgrade + age + mlratio + diameter + posnodes + vasc.invasion +
      typesurgery)

    n= 295, number of events= 79

                             coef exp(coef) se(coef)      z Pr(>|z|)
  posnodes                0.07443   1.07727  0.05284  1.409 0.158971
  chemotherapyYes        -0.42295   0.65511  0.29769 -1.421 0.155381
  hormonaltherapyYes     -0.17160   0.84232  0.44233 -0.388 0.698062
  histolgradePoorly diff  0.26550   1.30409  0.28059  0.946 0.344030
  histolgradeWell diff   -1.30782   0.27041  0.54782 -2.387 0.016972 *
  age                    -0.03937   0.96139  0.01953 -2.016 0.043816 *
  mlratio                -0.75031   0.47222  0.21138 -3.550 0.000386 ***
  diameter                0.01976   1.01996  0.01323  1.493 0.135334
  vasc.invasion+          0.60286   1.82733  0.25315  2.381 0.017245 *
  vasc.invasion+/-       -0.14580   0.86433  0.49104 -0.297 0.766530
  typesurgerymastectomy   0.15043   1.16233  0.24864  0.605 0.545183

We could remove the insignificant covariates stepwise, or use AIC, or some other model-selection method; suppose we have reduced it to the model including just histological grade, vascular invasion, age, and mlratio (the crucial measure of oestrogen-receptor gene expression).
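A direct refit of this reduced model might look like the following sketch (the call is written to match the Call: line in the summary below; an automated search such as MASS::stepAIC, which accepts coxph objects, would be one alternative):

  nki.cox=with(nki, coxph(nki.surv~histolgrade+age+mlratio+vasc.invasion))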

We then get

  > summary(nki.cox)
  Call:
  coxph(formula = nki.surv ~ histolgrade + age + mlratio + vasc.invasion)

    n= 295, number of events= 79

                             coef exp(coef) se(coef)      z Pr(>|z|)
  histolgradePoorly diff  0.42559   1.53050  0.26914  1.581 0.113810
  histolgradeWell diff   -1.31213   0.26925  0.54782 -2.395 0.016611 *
  age                    -0.04302   0.95790  0.01961 -2.194 0.028271 *
  mlratio                -0.72960   0.48210  0.20754 -3.515 0.000439 ***
  vasc.invasion+          0.64816   1.91203  0.24205  2.678 0.007410 **
  vasc.invasion+/-       -0.04951   0.95169  0.48593 -0.102 0.918840

16.7.2 Complementary log-log plot

The first model diagnostic is to plot the cumulative hazard on a log scale for different values of the covariate. We show this in Figure 16.2. Figure 16.2(a) shows the cumulative hazard calculated from the Cox model fit for the 10th, 50th, and 90th percentiles of mlratio. Since this is calculated from the model, the curves are exact vertical shifts of each other; this is what the plot would look like in the ideal case. Figure 16.2(b) shows the Nelson–Aalen estimator for the three sub-populations formed from the upper, middle, and lower tertiles of mlratio. The fit is not perfect, but it is not obviously wrong to see the curves as vertical shifts of one another. Note that the second and third tertiles are very similar; the big difference is between the lower tertile and the rest. Figure 16.3 shows a similar plot where the population has been stratified by vascular invasion status. We see here that the effect of the vascular invasion variable — reflected in the gap between the two log cumulative hazard curves — seems to increase over time. The code to generate this plot is

  nki.km3=survfit(nki.surv~factor(nki$vasc.invasion))
  plot(nki.km3, mark.time=FALSE, xlab='Time (Years)', ylab='log(cum hazard)',
       main='Vascular invasion', conf.int=FALSE, col=c(2,1,3), fun=myfun, firstx=1)
  legend(10, -3, c('-', '+', '+/-'), col=c(2,1,3), title='Vascular invasion', lwd=2)
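The transformation myfun is defined earlier in the notes; a definition consistent with the 'log(cum hazard)' axis label would be the complementary log–log transform (this is an assumption, as is the equivalent use of the built-in fun="cloglog" option of plot.survfit):

  myfun=function(s) log(-log(s))   # log cumulative hazard, since H(t) = -log S(t)
  # plot(nki.km3, fun='cloglog', ...) would plot log(-log(S)) against log(t) directly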


Figure 16.2: Log cumulative hazards of NKI data. (a) Model fit for different mlratio levels; (b) Nelson–Aalen estimator for tertiles of mlratio. [Both panels plot log(cum hazard) against Time (Years).]

16.7.3 Andersen plot

We show an Andersen plot for the vascular invasion covariate in Figure 16.4. We have already observed that the effect seems to increase with time, and this is reflected in the convex shape of the curve.
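A minimal sketch of how such a plot could be constructed (assuming the variable names above, and using −log of the Kaplan–Meier estimate as the cumulative-hazard estimator — any consistent estimator would do):

  tt=sort(unique(nki$tyears[nki$d == 1]))
  cumhaz=function(sub) {
    f=survfit(Surv(tyears, d)~1, data=nki, subset=sub)
    -log(summary(f, times=tt, extend=TRUE)$surv)   # cumulative hazard on a common time grid
  }
  plot(cumhaz(nki$vasc.invasion == '-'), cumhaz(nki$vasc.invasion == '+'), type='s',
       xlab='cumulative hazard -', ylab='cumulative hazard +')
  abline(0, 1, lty=2)   # under proportional hazards the plot is a straight line through the origin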

16.7.4 Cox–Snell residuals

We compute the Cox–Snell residuals by simply extracting the martingale residuals automatically computed by the resid function, and subtracting them from the censoring indicator:

  nki.mart=residuals(nki.cox)
  nki.CS=nki$d-nki.mart

Recall that if the model is correct, the Cox–Snell residuals behave like a censored sample from a standard exponential distribution, so their cumulative hazard should lie along the line y = x. We therefore test the goodness of fit by computing a Nelson–Aalen estimator for the residuals, and plotting the line y = x for comparison.

  nki.CStest=survfit(Surv(nki.CS, nki$d)~1)
  plot(nki.CStest, fun='cumhaz', mark.time=FALSE, xmax=.8,
       main='Cox-Snell residual plot for NKI data')
  abline(0, 1, col=2)


Figure 16.3: Log cumulative hazards of NKI data for subpopulations stratified by vascular invasion status (−, +, +/−). [Plot of log(cum hazard) against Time (Years).]

16.7.5 Martingale residuals

We know that the effect of the gene activity covariate mlratio is highly significant. But does it fit the linear (log-hazard) form assumed by the Cox model? We test this by fitting the model without mlratio, and then plotting the martingale residuals against mlratio, using a local smoother to get a picture of how the martingale residuals change, on average, with the value of this covariate. The code is:

  nki.nogen=with(nki, coxph(nki.surv~histolgrade+age+vasc.invasion))
  nki.NGmart=resid(nki.nogen, type='martingale')
  ord=order(nki$mlratio)
  m1=nki.NGmart[ord]
  m2=nki$mlratio[ord]
  plot(nki$mlratio, nki.NGmart, xlab='mlratio', ylab='Martingale residual')
  lines(lowess(m1~m2), lwd=2, col=2)

This produces the result in Figure 16.7. We see that the effect declines up to a level of about −0.5, and is essentially 0 after that. The original paper [24] did not introduce mlratio as a log-linear effect, but created a binary variable, distinguishing between oestrogen-receptor-negative tumours (mlratio < −0.65) and oestrogen-receptor-positive tumours. The receptor-negative subjects are about 23% of the total. Our analysis suggests that this is a reasonable approach. It also fits well with our observation in Section 16.7.2 that there was a clear difference between the subjects in the lowest tertile of mlratio and all the others, but hardly any difference between the upper two tertiles.
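As an illustration (not part of the original analysis), the dichotomised covariate could be created and the model refitted along these lines; the cut-off −0.65 is the one quoted above, and the variable name ERneg is made up for this sketch:

  nki$ERneg=as.numeric(nki$mlratio < -0.65)   # 1 = oestrogen-receptor-negative tumour
  nki.bin=with(nki, coxph(nki.surv~histolgrade+age+ERneg+vasc.invasion))
  summary(nki.bin)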

16.7.6 Schoenfeld residuals

Schoenfeld residuals are produced in the survival package by the cox.zph command applied to the output of coxph. The command z=cox.zph(nki.cox) produces the output

                             rho  chisq     p
  histolgradePoorly diff -0.0661 0.3629 0.547
  histolgradeWell diff    0.1686 2.3944 0.122
  age                     0.0313 0.0919 0.762
  mlratio                 0.1357 1.5480 0.213
  vasc.invasion+          0.0459 0.1664 0.683
  vasc.invasion+/-        0.1312 1.3028 0.254
  GLOBAL                      NA 9.7256 0.137

Figure 16.4: Andersen plot of NKI data for subpopulations stratified by vascular invasion status. [Plot of cumulative hazard for the + group against cumulative hazard for the − group.]

The column rho gives the correlation of the scaled Schoenfeld residual with time. (That is, each Schoenfeld residual is scaled by an estimator of its standard deviation.) The chisq and p columns are for a test of the null hypothesis that the correlation is zero, meaning that the proportional hazards condition holds (with constant β). GLOBAL gives the result of a chi-squared test of the hypothesis that all of the coefficients are constant. Thus, we may accept the hypothesis that all of the proportionality parameters are constant in time.

Plotting the output of cox.zph gives a smoothed picture of the scaled Schoenfeld residuals as a function of time. For example, plot(z[4]) gives the output in Figure 16.8, showing an estimate of the parameter for mlratio as a function of time. We see that the plot is not perfectly constant, but taking into account the uncertainty indicated by the confidence intervals, it is plausible that the parameter is constant. In particular, the covariate histolgradeWell diff produces the smallest p-value, but from the plot (Figure 16.5(a)) it is clear that the apparent positive slope is purely an artefact of two points with very high leverage right at the end, where the variance is particularly high. (The estimation of ρ is quite crude, being simply an ordinary least-squares fit, so it does not take account of the increased uncertainty at the end.)


Figure 16.5: Schoenfeld residuals for some covariates in the Cox model fit for the NKI data: (a) histolgradeWell diff; (b) mlratio. [Both panels plot Beta(t) against Time.]

Figure 16.6: Nelson–Aalen estimator for the Cox–Snell residuals for the NKI data, together with the line y = x.


Figure 16.7: Martingale residuals for the model without mlratio, plotted against mlratio, with a LOWESS smoother in red.


Figure 16.8: Scaled Schoenfeld residuals (Beta(t)) for the mlratio parameter as a function of time.

Bibliography

[1] Odd O. Aalen et al. Survival and Event History Analysis: A Process Point of View. Springer Verlag, 2008.

[2] George Leclerc Buffon. Essai d'arithmétique morale. 1777.

[3] David Collett. Modelling Survival Data in Medical Research. 3rd ed. Chapman & Hall/CRC, 2015.

[4] David R. Cox. "Regression Models and Life-Tables". In: Journal of the Royal Statistical Society, Series B (Methodological) 34.2 (1972), pp. 187–220.

[5] CT4: Models Core Reading. Faculty & Institute of Actuaries, 2006.

[6] Stephen H. Embury et al. "Remission Maintenance Therapy in Acute Myelogenous Leukemia". In: The Western Journal of Medicine 126 (Apr. 1977), pp. 267–72.

[7] Gregory M. Erickson et al. "Tyrannosaur Life Tables: An example of nonavian dinosaur population biology". In: Science 313 (2006), pp. 213–7.

[8] Thomas R. Fleming and David P. Harrington. Counting Processes and Survival Analysis. Wiley, 1991.

[9] A. J. Fox. English Life Tables No. 15. Office of National Statistics. London, 1997.

[10] Benjamin Gompertz. "On the Nature of the Function Expressive of the Law of Human Mortality and on a New Mode of Determining Life Contingencies". In: Philosophical Transactions of the Royal Society of London 115 (1825), pp. 513–85.

[11] Peter Hall and Christopher C. Heyde. Martingale Limit Theory and Its Application. New York, London: Academic Press, 1980.

[12] David P. Harrington and Thomas R. Fleming. "A Class of Rank Test Procedures for Censored Survival Data". In: Biometrika 69.3 (Dec. 1982), pp. 553–66.

[13] Hans C. van Houwelingen and Theo Stijnen. "Cox Regression Model". In: Handbook of Survival Analysis. Ed. by John P. Klein et al. CRC Press, 2014. Chap. 1, pp. 5–26.

[14] Edward L. Kaplan and Paul Meier. "Nonparametric Estimation from Incomplete Observations". In: Journal of the American Statistical Association 53.282 (1958), pp. 457–481.

[15] Samuel Karlin and Howard M. Taylor. A Second Course in Stochastic Processes. Academic Press, 1981.

[16] Kathleen Kiernan. "The Rise of Cohabitation and Childbearing Outside Marriage in Western Europe". In: International Journal of Law, Policy and the Family 15 (2001), pp. 1–21.

[17] John P. Klein and Melvin L. Moeschberger. Survival Analysis: Techniques for Censored and Truncated Data. 2nd ed. Springer, 2003.

[18] A. S. Macdonald. "An Actuarial Survey of Statistical Models for Decrement and Transition Data. I: Multiple State, Poisson and Binomial Models". In: British Actuarial Journal 2.1 (1996), pp. 129–55.

[19] Kyriakos S. Markides and Karl Eschbach. "Aging, Migration, and Mortality: Current Status of Research on the Hispanic Paradox". In: Journals of Gerontology: Series B 60B (2005), pp. 68–75.

[20] Rupert G. Miller et al. Survival Analysis. Wiley, 2001.

[21] Richard Peto and Julian Peto. "Asymptotically Efficient Rank Invariant Test Procedures". In: Journal of the Royal Statistical Society, Series A (General) 135.2 (1972), pp. 185–207.

[22] Terry M. Therneau et al. "Martingale-Based Residuals for Survival Models". In: Biometrika 77.1 (1990), pp. 147–60.

[23] Anastasios Tsiatis. "A Nonidentifiability Aspect of the Problem of Competing Risks". In: Proceedings of the National Academy of Sciences 72.1 (1975), pp. 20–22.

[24] Marc J. van de Vijver et al. "A Gene-Expression Signature as a Predictor of Survival in Breast Cancer". In: New England Journal of Medicine 347.25 (2002), pp. 1999–2009.

[25] Kenneth W. Wachter. Essential Demographic Methods. Harvard University Press, 2014.