Fitting models of mortality with generalized linear and non-linear models.

Running headline: Fitting models of mortality

Iain D Currie
Department of Actuarial Mathematics and Statistics, and the Maxwell Institute for Mathematical Sciences, Heriot-Watt University, Edinburgh, EH14 4AS, UK.
Email: [email protected], Tel: +44 (0)131 451 3208, Fax: +44 (0)131 451 3249

Abstract: Many models of mortality can be expressed compactly in the language of either generalized linear models (GLMs) or generalized non-linear models (GNMs). The R language provides a description of these models which parallels the usual algebraic definitions but has the advantage of a transparent and flexible model specification. We compare Poisson models for the force of mortality and binomial models for mortality rates for a wide range of mortality models with data from six countries; in general we find that the binomial models give a better fit.

Key words: Constraints, generalized linear models, identifiability, mortality, R language.

File name: .../GNM/Paper/Paper.tex: 3 July 2013

1 Introduction

The provision of pensions and care for the elderly are the subject of much public debate and government legislation in the developed world. Pensionable ages are increasing and health budgets are under ever increasing pressure. Central to the provision of these services is the increase in life expectancy. For example, life expectancy in the UK has increased from 68.1 years to 78.4 for males and 73.9 to 82.4 for females from 1960 to 2010, i.e., roughly ten years in fifty years. Other developed countries show very similar increases. Furthermore, there is no sign of the rate of increase levelling off (Human Mortality Database, 2012). It is no surprise then that the problem of forecasting the future course of mortality has been a subject of great interest both to governments and to pension and annuity providers.

Lee and Carter (1992) was an early paper in this area. In their paper, Lee and Carter used mortality data classified by age of death and year of death, and then modelled the force of mortality in terms of these two variables; forecasts were obtained by treating the year of death or period parameters as a time series, and then forecasting the estimates of these parameters. Cairns et al. (2009) in a comprehensive and influential paper considered eight models. Seven of these models used the same basic approach as Lee and Carter with the important addition that some of these new models included terms for year of birth or cohort. The addition of cohort terms greatly improved the fit of the models to the data. Forecasts were again obtained by treating the estimates of the period and cohort parameters as time series, and then using time series methods to forecast them. The other model considered by Cairns et al. (2009) was very different in nature and used penalties both to smooth and to forecast the mortality surface.

The models in Cairns et al. (2009) can be fitted with LifeMetrics (Cairns, 2007), software written in R (R Core Team, 2013) and available for download. The approach followed there is the same as Brouhns et al. (2002) and Renshaw and Haberman (2006), for example: obtain and solve the maximum likelihood equations for each model. While this approach can be applied to a very wide range of models, it does suffer from some disadvantages.

• The associated code is difficult for a third party to follow.

• It is not easy to make a simple change to the model. For example, we may want to switch from the existing Lee-Carter model for the force of mortality, $\mu$, with Poisson error to the Lee-Carter model for the rate of mortality, $q$, with a binomial error.

• The structure of the model is obscured by the nature of the code.

It is the purpose of the present paper to describe the models in Cairns et al. (2009) in the standard model terminology of generalized linear models (GLMs) or generalized non-linear models (GNMs); in the case of the smooth models we use the theory of GLMs to ease the fitting process. This approach makes model specification and fitting transparent, and greatly expands the range of models available. Each model has an associated model matrix and once this is defined fitting often requires only a single line of code. Of course, there are many other models of mortality; see Booth and Tickle (2008) for a comprehensive review of mortality modelling and forecasting. Many of these models can be fitted by the methods presented here but we hope that the models in Cairns et al. (2009) provide a sufficient illustration of our methods.

Cairns et al. (2009) considered two classes of models: four models targeted the log of the force of mortality and four the logit of the mortality rate. One advantage of our approach is that it is easy to compare these two classes of model and we present a comprehensive comparison of such models across six countries.

The plan of our paper is as follows. In section 2 we describe our data and our notation. In section 3 we consider the classical Gompertz model for data classified by age of death for a single year. This simple model illustrates our approach to the fitting of more complex models. In section 4 we give the usual algebraic model specification of the eight models in Cairns et al. (2009). Section 5 describes the fitting of the five models with the GLM structure; we pay particular attention to the problem of identifiability, and show how we deal with parameter constraints. Section 6 uses the generalized non-linear models of Turner and Firth (2012) to fit the Lee-Carter and Renshaw-Haberman models. Section 7 considers the 2-d smooth model of Currie et al. (2004). Finally, in section 8 we present some conclusions and venture some comments on the use of these models for forecasting.

2 Data and notation

We use data on male deaths and exposures from the Human Mortality Database (2012). The data are available by country for (a) single ages from age 0 to age 109 with all deaths at higher ages gathered into a category denoted by 110+, and (b) single years. The periods for which data are available vary by country and we standardize on the common period 1960 to 2009. Further, we consider the restricted age range 50 to 90, the range of greatest interest to providers of pensions and annuities. For this data range, we will report results for six countries: USA, UK, Japan, Australia, Sweden and France.

Let calendar year $t$ run from exact time $t$ to exact time $t+1$ and let $d_{x,t}$ be the number of deaths aged $x$ last birthday in calendar year $t$. We suppose that the data on deaths are arranged in a matrix $\mathbf{D} = (d_{x,t})$. In a similar way, the data on exposure are arranged in a matrix $\mathbf{E}_c = (e_{x,t})$ where $e_{x,t}$ is a measure of the average population size aged $x$ last birthday in calendar year $t$, the so-called central exposed to risk. We suppose that $(d_{x,t})$ and $(e_{x,t})$ are each $n_a \times n_y$ so that we have $n_a$ ages and $n_y$ years; let $n_c = n_a + n_y - 1$ denote the number of cohorts. In our examples, we have $n_a = 41$, $n_y = 50$ and so $n_c = 90$. Let $n = n_a n_y$ be the number of observations.

We denote the force of mortality at exact time $t$ for lives with exact age $x$ by $\mu_{x,t}$. The force of mortality $\mu_{x,t}$ can be thought of as an instantaneous death rate and the probability that a life subject to a force of mortality $\mu_{x,t}$ dies in the interval of time $(t, t+dt)$ is approximately $\mu_{x,t}\,dt$ where $dt$ is small. We also define the mortality rate, $q_{x,t}$, to be the probability that a life aged exactly $x$ at exact time $t$ dies in the following year, i.e., between $t$ and $t+1$.

The force of mortality $\mu_{x,t}$ for human populations varies slowly and, one supposes, smoothly in both $x$ and $t$, and a standard assumption (Cairns et al., 2009) is that $\mu_{x,t}$ is constant over each year of age, i.e., from exact age $x$ to exact age $x+1$, and over each calendar year, i.e., from exact time $t$ to exact time $t+1$. Thus,

$$\mu_{x+u,t+v} \approx \mu_{x,t}, \quad 0 \le u, v < 1, \tag{1}$$

and so $\mu_{x,t}$ approximates the force of mortality at exact age $x + \frac{1}{2}$ and exact time $t + \frac{1}{2}$. The assumption (1) has two important consequences. First, we have the following simple relation between the force of mortality and the mortality rate:

$$q_{x,t} \approx 1 - \exp(-\mu_{x,t}). \tag{2}$$

Second, if we treat $d_{x,t}$ as a random variable, $D_{x,t}$, and the central exposure, $e_{x,t}$, as fixed, then $D_{x,t}$ has a Poisson distribution

$$D_{x,t} \sim \mathcal{P}(e_{x,t}\,\mu_{x,t}). \tag{3}$$

We have a corresponding distributional result for the mortality rate, $q_{x,t}$. We approximate the initial exposed to risk $E_{x,t} \approx e_{x,t} + \frac{1}{2} d_{x,t}$ and let $\mathbf{E} = (E_{x,t})$ denote the matrix of initial exposures. Then $D_{x,t}$ has a binomial distribution

$$D_{x,t} \sim \mathcal{B}(E_{x,t}, q_{x,t}). \tag{4}$$

Equations (3) and (4) lead immediately to maximum likelihood estimates (MLEs) of $\mu_{x,t}$ and $q_{x,t}$:

$$\hat{\mu}_{x,t} = \frac{d_{x,t}}{e_{x,t}} \tag{5}$$

$$\hat{q}_{x,t} = \frac{d_{x,t}}{E_{x,t}} \tag{6}$$

and a simple application of Taylor's Theorem shows that

$$-\log(1 - \hat{q}_{x,t}) - \hat{\mu}_{x,t} \approx \tfrac{1}{12}\hat{\mu}_{x,t}^3, \tag{7}$$

a very small quantity. Thus, to a very good approximation, the MLEs $\hat{\mu}_{x,t}$ under (3) and $\hat{q}_{x,t}$ under (4) satisfy (2).
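The size of the discrepancy in (7) is easy to check numerically. The following R sketch uses made-up counts for a single cell (the values of `d` and `e` are illustrative assumptions, not data from the paper) and compares the two sides of (7).

```r
# Hypothetical counts for one age-year cell (illustrative values only)
d <- 250          # deaths
e <- 10000        # central exposed to risk
E <- e + d / 2    # approximate initial exposed to risk

mu_hat <- d / e   # MLE of the force of mortality, equation (5)
q_hat  <- d / E   # MLE of the mortality rate, equation (6)

lhs <- -log(1 - q_hat) - mu_hat   # left-hand side of (7)
rhs <- mu_hat^3 / 12              # leading term of the Taylor expansion

c(lhs = lhs, rhs = rhs)
```

With a force of mortality of 0.025 both quantities are of order $10^{-6}$, which is negligible next to $\hat{\mu}$ itself.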

3 Gompertz models for µ and q

Gompertz (1825) observed that the force of mortality when plotted on the log scale was approximately linear in age over most of adult life. The left panel of Figure 1 shows such a plot for US male data in 1960. The simple Gompertz model appears to fit the data rather well, at least for 1960. Actuaries will be familiar with the Gompertz Law in the form

$$\mu_x = e^{\theta_0 + \theta_1 x} \tag{8}$$

Figure 1: Left: observed and fitted log(µ) for Gompertz model with Poisson errors; right: observed and fitted logit(q) for Gompertz model with binomial errors. Data: USA males in 1960.

for the force of mortality at age $x$ (Neill, 1977, p27; Forfar, 2004, p1140); (we have simplified notation for the moment and dropped the suffix $t$). We prefer to emphasize the linear nature of (8) and write $\log \mu_x = \theta_0 + \theta_1 x$. If we add the Poisson assumption (3) then we have

$$\eta_x = \log \mathrm{E}[D_x] = \log e_x + \log \mu_x = \log e_x + \theta_0 + \theta_1 x. \tag{9}$$

We have defined a GLM with dependent variable $d_x$, Poisson error, log link and linear predictor $\eta_x = \log e_x + \theta_0 + \theta_1 x$; the term $\log e_x$ is known as an offset in the model. There are three reasons for the choice of the log as link function: first, and most importantly, the data support the log function (see Figure 1); second, the log function maps $\mu$, $0 < \mu < \infty$, onto $-\infty < \log \mu < \infty$, the natural scale for regression; third, the log link simplifies the solution of the maximum likelihood equations for GLMs with Poisson errors. The log function is known as the canonical link for a GLM with Poisson errors; see McCullagh and Nelder (1989, Chap 2) for a detailed discussion of these ideas.

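We can write (9) in matrix/vector form. Let $\boldsymbol{\eta} = (\eta_1, \ldots, \eta_{n_a})'$, $\mathbf{e} = (e_1, \ldots, e_{n_a})'$, $\mathbf{x} = (x_1, \ldots, x_{n_a})'$ and $\boldsymbol{\theta} = (\theta_0, \theta_1)'$. Then (9) can be written

$$\boldsymbol{\eta} = \log \mathbf{e} + \mathbf{X}\boldsymbol{\theta}, \quad \mathbf{X} = [\mathbf{1}_{n_a} : \mathbf{x}], \tag{10}$$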

where $\mathbf{1}_s$ is a vector of 1's of length $s$. We refer to $\mathbf{X}$ as the model matrix. The model matrix captures the linear structure of the Gompertz law, and more generally a GLM is defined by its model matrix, its link function and its error distribution. The GLM structure is particularly clear in the R language. Let Death, Off and X be the R variables which correspond to the vector of observed deaths, $\mathbf{d}$, the vector of log central exposures, $\log \mathbf{e}$, and the model matrix, $\mathbf{X}$, respectively. We use the glm function to fit the Gompertz model, (8), with Poisson errors, as follows:

    glm(Death ~ -1 + X + offset(Off), family = poisson(link = "log"))    (11)

R fits an intercept term by default, but we can safely remove this default intercept with the model term -1 since the intercept is included in our model matrix, X, in (10).
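To make (11) concrete, the following R sketch fits the Gompertz model to simulated data. The exposures, the parameter values in `theta` and the random seed are assumptions chosen purely for illustration; with real data, `Death` and `Off` would be built from the observed deaths and central exposures for one calendar year.

```r
set.seed(1)
Age   <- 50:90
e     <- rep(10000, length(Age))   # hypothetical central exposures
theta <- c(-10, 0.1)               # hypothetical Gompertz parameters

# Simulate deaths from the Poisson model (3) with mu given by (8)
Death <- rpois(length(Age), e * exp(theta[1] + theta[2] * Age))

X   <- cbind(1, Age)               # model matrix [1 : x] of (10)
Off <- log(e)                      # offset: log central exposure

# Equation (11): Poisson errors, log link, no default intercept
fit.P <- glm(Death ~ -1 + X + offset(Off), family = poisson(link = "log"))
coef(fit.P)                        # estimates of theta_0 and theta_1
```

The estimated coefficients should lie close to the values in `theta` used to simulate the data.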

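We can express the Gompertz law with Poisson errors in terms of the binomial distribution, (4). We have

$$\begin{aligned}
q_x &\approx 1 - \exp(-\mu_x), && \text{by (2)} \\
    &= 1 - \exp(-\exp(\theta_0 + \theta_1 x)), && \text{by (8)} \\
\Rightarrow \log(-\log(1 - q_x)) &\approx \theta_0 + \theta_1 x.
\end{aligned} \tag{12}$$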

The function $\operatorname{cloglog} q = \log(-\log(1 - q))$, $0 < q < 1$, is known as the complementary log-log function. Now let $D_x \sim \mathcal{B}(E_x, q_x)$ have the binomial distribution as in (4) and let $Q_x = D_x/E_x$ be the random variable corresponding to $q_x$; then under (12) we have

$$\eta_x = \operatorname{cloglog} \mathrm{E}[Q_x] = \operatorname{cloglog} q_x = \theta_0 + \theta_1 x. \tag{13}$$

We have defined a GLM with dependent variable $q_x = d_x/E_x$, binomial error, complementary log-log link and linear predictor $\eta_x = \theta_0 + \theta_1 x$. Again, the R language makes the structure of this GLM very clear: we have

    glm(Rate ~ -1 + X, family = binomial(link = "cloglog"), weight = Initial)    (14)

where Rate, X and Initial contain the observed death rates, $\mathbf{d}/\mathbf{e}^*$, the model matrix, $\mathbf{X}$, and the initial exposures, $\mathbf{e}^* = (E_1, \ldots, E_{n_a})'$, respectively. We note that (a) the exposures are introduced into the Poisson model as an offset, but into the binomial model as a weight, and (b) the model matrix, $\mathbf{X}$, is the same for both models and is given by (10). We conclude that the GLM structure allows us to define the three components of our model, namely the model matrix, the link function and the error distribution, independently of each other. We will see in section 5 that five of the models in Cairns et al. (2009) are examples of GLMs and so these models are easily fitted with (11) for Poisson models and (14) for binomial models once the model matrix has been specified.

We would expect these two models, i.e., the Poisson model for the force of mortality, $\mu_x$, with log link and the binomial model for the mortality rate, $q_x$, with complementary log-log link, to give very similar fitted values. Let $\hat{\mu}_P$ and $\hat{\mu}_{B_c}$ be the fitted forces of mortality under the Poisson model with log link and the binomial model with complementary log-log link respectively. For our US data in 1960 we find $\max|\hat{\mu}_P - \hat{\mu}_{B_c}| \approx 0.0005$; the agreement is indeed very good.
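The comparison of the two fits can be sketched in R as follows. The data are simulated (exposures, Gompertz parameters and seed are illustrative assumptions), so the size of the discrepancy will differ from the value quoted for the US data; fitted mortality rates from the binomial model are converted to the force-of-mortality scale with (2).

```r
set.seed(2)
Age     <- 50:90
e       <- rep(10000, length(Age))           # hypothetical central exposures
Death   <- rpois(length(Age), e * exp(-10 + 0.1 * Age))
Initial <- e + Death / 2                     # approximate initial exposures
Rate    <- Death / Initial                   # observed mortality rates
X       <- cbind(1, Age)                     # model matrix of (10)
Off     <- log(e)

# Poisson model with log link, as in (11)
fit.P  <- glm(Death ~ -1 + X + offset(Off), family = poisson(link = "log"))
# Binomial model with complementary log-log link, as in (14)
fit.Bc <- glm(Rate ~ -1 + X, family = binomial(link = "cloglog"),
              weights = Initial)

mu.P  <- fitted(fit.P) / e                   # fitted force of mortality
mu.Bc <- -log(1 - fitted(fit.Bc))            # fitted q converted with (2)
max(abs(mu.P - mu.Bc))
```

The block spells out the argument name `weights` in full; the abbreviated `weight` in (14) is resolved by R's partial matching of argument names.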

Other link functions are possible. The complementary log-log link is not the canonical link in a GLM with binomial error; this is the logit function where $\operatorname{logit} q = \log(q/(1 - q))$, $0 < q < 1$. Actuaries may be more familiar with the logit link in the form

$$q_x = \frac{e^{\theta_0 + \theta_1 x}}{1 + e^{\theta_0 + \theta_1 x}}. \tag{15}$$

Cairns et al. (2009) target (i) the force of mortality with the log link in four of their models (M1, M2, M3 and M4) and (ii) the mortality rate with the logit link in the remaining four (M5, M6, M7 and M8); in all eight models they assume a Poisson distribution for the number of deaths. Our approach coincides with Cairns et al. (2009) for M1-M4 but is different for M5-M8 where the GLM setting suggests to us that it is more natural to use the logit link with the binomial error. We fit the binomial model with logit link for the mortality rate with

    glm(Rate ~ -1 + X, family = binomial(link = "logit"), weight = Initial)    (16)

and the only change in (14) is the specification link = "logit".

We have the same three reasons for the choice of the logit as link function for the binomial distribution as we had for the choice of the log function for the Poisson distribution: the right panel of Figure 1 supports this choice, the logit function maps $q$, $0 < q < 1$, onto $-\infty < \operatorname{logit} q < \infty$, and the solution of the maximum likelihood equations simplifies. However, we would expect a rather different fit with the logit link than we found above with the complementary log-log link. Let $\hat{\mu}_{B_l}$ be the fitted forces of mortality under the binomial model with logit link. The choice of link function does make a difference: we find that $\max|\hat{\mu}_{B_l} - \hat{\mu}_{B_c}| \approx 0.01$.

Figure 1 shows the fitted Gompertz lines for both the Poisson model with log link and the binomial model with logit link with the US data in 1960; both fits look satisfactory. We also have the binomial model with complementary log-log link. Which approach gives the better fitting model? The standard measure of fit in a GLM is the deviance; see McCullagh and Nelder (1989, Chap 2) for a discussion of deviance. For the Poisson model the deviance is

$$\mathrm{Dev}_P = 2\sum_x \{d_x \log(d_x/\hat{d}_x) - (d_x - \hat{d}_x)\} \tag{17}$$

and for the binomial model it is

$$\mathrm{Dev}_B = 2\sum_x \{d_x \log(d_x/\hat{d}_x) + (E_x - d_x)\log[(E_x - d_x)/(E_x - \hat{d}_x)]\} \tag{18}$$

where $\hat{d}_x$ is the fitted number of deaths at age $x$, from a Poisson model in (17) and from a binomial model in (18).

The fitted deaths from the Poisson model with log link give $\mathrm{Dev}_P = 1060$, those from the binomial model with logit link give $\mathrm{Dev}_P = 1134$, and those from the binomial model with complementary log-log link give $\mathrm{Dev}_P = 1010$. For these data, the binomial model for $q$ with complementary log-log link gives the best fit. We have the same conclusion if we compare $\mathrm{Dev}_B$ from the three fits. The difference in the three fits arises largely from the different link functions. We will use the comparison of deviances to investigate the choice of link function for the models M1-M8 in the rest of the paper.
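The deviances (17) and (18) depend only on the observed and fitted deaths (and, for the binomial case, the initial exposures), so either can be evaluated for any fitted model. A minimal R sketch follows; the function names `dev_poisson` and `dev_binomial` are our own illustrative choices, and the data are simulated.

```r
# Poisson deviance (17); the term d * log(d / d_hat) is taken as 0 when d = 0
dev_poisson <- function(d, d_hat) {
  term <- ifelse(d > 0, d * log(d / d_hat), 0)
  2 * sum(term - (d - d_hat))
}

# Binomial deviance (18), with E the initial exposures
dev_binomial <- function(d, d_hat, E) {
  t1 <- ifelse(d > 0, d * log(d / d_hat), 0)
  t2 <- ifelse(E - d > 0, (E - d) * log((E - d) / (E - d_hat)), 0)
  2 * sum(t1 + t2)
}

# Hypothetical data: Gompertz mortality with constant exposure
set.seed(5)
Age   <- 50:90
e     <- rep(10000, length(Age))
Death <- rpois(length(Age), e * exp(-10 + 0.1 * Age))
fit.P <- glm(Death ~ Age + offset(log(e)), family = poisson)
d_hat <- fitted(fit.P)                     # fitted deaths

dev_poisson(Death, d_hat)                  # agrees with deviance(fit.P)
dev_binomial(Death, d_hat, e + Death / 2)  # binomial deviance of the same fit
```

Because both functions take fitted deaths as input, fitted deaths from a binomial fit (initial exposure times fitted rate) can be passed to `dev_poisson` in the same way, which is how a common deviance can be compared across models.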

Other choices of link function are possible. For example, the probit link, $\operatorname{probit} q = \Phi^{-1}(q)$, $0 < q < 1$, is widely used in many branches of statistics; here, $\Phi^{-1}(\cdot)$ is the inverse normal distribution function. However, we found that the binomial model for $q$ with probit link fits mortality data very poorly and we do not consider it further. Similarly, we found that many

Table 1: Model structures for mortality models, M1-M8. In column three $i = j + k$ indicates that a total of $i$ constraints are needed for identifiability, of which $j$ are location constraints and $k$ are scale constraints.

Model                          Structure                                                                                                        Constraints
M1: Lee-Carter (LC)            $\beta^{(1)}_x + \beta^{(2)}_x \kappa^{(2)}_t$                                                                   2 = 1 + 1
M2: Renshaw-Haberman (RH)      $\beta^{(1)}_x + \beta^{(2)}_x \kappa^{(2)}_t + \beta^{(3)}_x \gamma^{(3)}_{t-x}$                                4 = 2 + 2
M3: Age-period-cohort (APC)    $\beta^{(1)}_x + \kappa^{(2)}_t + \gamma^{(3)}_{t-x}$                                                            3 = 3 + 0
M4: 2-d P-spline (2-d AP)      $\sum_{i,j} \theta_{i,j} B^{ay}_{i,j}(x,t)$                                                                      0 = 0 + 0
M5: Cairns-Blake-Dowd (CBD)    $\kappa^{(1)}_t + \kappa^{(2)}_t (x - \bar{x})$                                                                  0 = 0 + 0
M6: M5+C (CBD(C))              $\kappa^{(1)}_t + \kappa^{(2)}_t (x - \bar{x}) + \gamma^{(3)}_{t-x}$                                             2 = 2 + 0
M7: M5+Q+C (CBD(QC))           $\kappa^{(1)}_t + \kappa^{(2)}_t (x - \bar{x}) + \kappa^{(3)}_t [(x - \bar{x})^2 - \hat{\sigma}^2_x] + \gamma^{(3)}_{t-x}$   3 = 3 + 0
M8: M5+Cδ (CBD(Cδ))            $\kappa^{(1)}_t + \kappa^{(2)}_t (x - \bar{x}) + \gamma^{(3)}_{t-x} (\delta - x)$                                1 = 1 + 0

other standard link functions are not suitable for mortality data: in this paper we restrict ourselves to the three models discussed above, i.e., the Poisson model for $\mu$ with log link, and the binomial models for $q$ with logit and complementary log-log links.

The code in (11), (14) and (16) for the Gompertz model provides a template for the rest of the paper and can be used without alteration for any GLM once the appropriate model matrix X has been defined, i.e., in five of our eight models.

4 Summary of models in Cairns et al. (2009)

Cairns et al. (2009) described eight models of mortality which they labelled M1-M8. In their paper the model structures for M1-M4 are defined in terms of $\log \mu$ and for M5-M8 they are defined in terms of $\operatorname{logit} q$. There is an important difference between their approach and ours: all their models assume a Poisson error whereas we assume a Poisson model when we model $\mu$ and a binomial model when we model $q$. Further, we consider models for both $\mu$ and $q$ for all eight structures.

We think of M1-M8 in the framework of GLMs and in this section we specify only the model structures; these are identical to Cairns et al. (2009). Model fitting is described in section 5 for GLMs, section 6 for GNMs and section 7 for smooth models. The model structures are summarized in standard mathematical notation in Table 1; the notation is close to, but not identical to, Cairns et al. (2009). Central to our approach is the model matrix, i.e., the expression of these structures in equivalent matrix terms.

5 Generalized linear models

Four of the models, M3, M5, M6 and M7, are examples of GLMs, and so are readily fitted with standard software, such as R's glm function. Further, conditional on the value of $\delta$, M8 is also a GLM. In this section we show how to define the model matrices for these five models; this enables these models to be fitted directly with (11), (14) or (16) as appropriate. However, only M5 does not have identifiability problems, and we use an extended discussion of M3, the age-period-cohort model, to describe how we obtain parameter estimates subject to particular constraints from the parameter estimates given by the glm function.

We note here that M1 and M2 do not fit into the GLM structure since both contain multiplicative terms in their model structures. Model M4 involves penalization so it too does not fit directly into the GLM structure; however, we shall see in section 7 that the theory of GLMs requires only a simple modification to be applied to M4.

Let $\mathbf{d} = \operatorname{vec}(d_{x,t})$ and $\mathbf{e} = \operatorname{vec}(e_{x,t})$ be the vectors of observed deaths and central exposures; here, the vec operator stacks the columns of a matrix in column order on top of each other. We note that with this definition the age suffix varies faster than the year suffix in $\mathbf{d}$ and $\mathbf{e}$.
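In R the vec operation is simply the conversion of a matrix to a vector, since R stores matrices in column order. The small table below is made up for illustration; it also shows how matching age and year labels can be generated in the same order.

```r
# Hypothetical 3 x 2 table of deaths: rows are ages, columns are years
D <- matrix(c(10, 12, 15, 11, 13, 16), nrow = 3,
            dimnames = list(age  = as.character(50:52),
                            year = as.character(1960:1961)))

d <- as.vector(D)   # vec(D): columns stacked, so age varies fastest

# Age and year labels in the same order as d
age  <- rep(as.numeric(rownames(D)), times = ncol(D))
year <- rep(as.numeric(colnames(D)), each  = nrow(D))
cbind(age, year, d)
```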

5.1 M3: Age-period-cohort models

The APC model, as defined in Table 1, is specified by its model matrix, $\mathbf{X} = [\mathbf{X}_a : \mathbf{X}_y : \mathbf{X}_c]$ where $\mathbf{X}_a$, $n \times n_a$, defines the age effects, $\mathbf{X}_y$, $n \times n_y$, the period effects, and $\mathbf{X}_c$, $n \times n_c$, the cohort effects; thus $\mathbf{X}$ is $n \times (n_a + n_y + n_c)$. Recalling that the age suffix in the vector of deaths varies faster than the year suffix, we see that $\mathbf{X}_a$ consists of $n_y$ copies of $\mathbf{I}_{n_a}$ stacked on top of each other where $\mathbf{I}_s$ denotes the identity matrix of size $s$. The Kronecker product is an economical way of writing matrices with a row-column structure and here we have $\mathbf{X}_a = \mathbf{1}_{n_y} \otimes \mathbf{I}_{n_a}$; Searle (1982, Chap 10) provides a good discussion of Kronecker products. In a similar way we have $\mathbf{X}_y = \mathbf{I}_{n_y} \otimes \mathbf{1}_{n_a}$. There is no simple way of writing $\mathbf{X}_c$ but it is easily obtained by noting that row $i$ of $\mathbf{X}_c$ consists of 0's except for a single 1 which occurs in column $c$ if the data point corresponding to row $i$ belongs to cohort $c$, $c = 1, \ldots, n_c$. The model matrix for the APC model can be written

$$\mathbf{X} = [\mathbf{1}_{n_y} \otimes \mathbf{I}_{n_a} : \mathbf{I}_{n_y} \otimes \mathbf{1}_{n_a} : \mathbf{X}_c]. \tag{19}$$

It is now a simple matter to check that the rank of X is three less than its number of columns; for example, in R, qr(X)$rank returns the rank of the matrix X. Thus, the model is not identifiable.
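The construction of (19) and the rank check can be sketched in R as follows. The dimensions are deliberately tiny and hypothetical; the cohort index `coh` is one possible numbering, chosen so that cohort 1 is the youngest and cohort $n_c$ the oldest.

```r
na <- 4; ny <- 5; nc <- na + ny - 1     # small hypothetical dimensions

# Age and year index of each observation; age varies fastest, as in vec()
age <- rep(1:na, times = ny)
yr  <- rep(1:ny, each  = na)

Xa <- matrix(1, ny, 1) %x% diag(na)     # 1_ny (x) I_na
Xy <- diag(ny) %x% matrix(1, na, 1)     # I_ny (x) 1_na

# Cohort index: 1 = youngest (lowest age, last year), nc = oldest
coh <- age - yr + ny
Xc  <- outer(coh, 1:nc, "==") * 1       # one indicator column per cohort

X <- cbind(Xa, Xy, Xc)                  # model matrix (19)
c(columns = ncol(X), rank = qr(X)$rank)
```

Here `%x%` is R's operator form of `kronecker()`. The rank falls short of the number of columns by three, which matches the three constraints needed for M3.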

We denote the coefficients corresponding to $\mathbf{X}_a$ by $\boldsymbol{\beta}$, corresponding to $\mathbf{X}_y$ by $\boldsymbol{\kappa}$ and corresponding to $\mathbf{X}_c$ by $\boldsymbol{\gamma}$, as in Table 1; let $\boldsymbol{\theta} = (\boldsymbol{\beta}', \boldsymbol{\kappa}', \boldsymbol{\gamma}')'$. In order to solve the maximum likelihood equations we need three constraints on $\boldsymbol{\theta}$. There is no unique way of choosing these constraints but whatever choice we make we always obtain the same table of fitted forces of mortality, $\hat{\mu}_{x,t}$, or fitted mortality rates, $\hat{q}_{x,t}$; the deviance is also invariant with respect to the choice of constraints. One possible set of constraints is

$$\sum_{t=1}^{n_y} \kappa_t = 0 \tag{20}$$

$$\sum_{c=1}^{n_c} \gamma_c = 0 \tag{21}$$

$$\sum_{c=1}^{n_c} c\,\gamma_c = 0 \tag{22}$$

where $c$ runs from 1 (youngest cohort) to $n_c$ (oldest cohort). The constraints (21) and (22) are different from those used in Cairns et al. (2009) for M3, but are the same as the cohort constraints used in that paper for M6 and M7. Cairns et al. (2009) point out that these constraints have some intuitive appeal since, "if we use least squares to fit a linear function $\phi_1 + \phi_2 c$ to $\boldsymbol{\gamma}$" then $\hat{\phi}_1 = \hat{\phi}_2 = 0$ and $\hat{\boldsymbol{\gamma}}$ fluctuates around 0 with "no discernible linear trend".

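We express these constraints in the form $\mathbf{H}\boldsymbol{\theta} = \mathbf{0}$ where $\mathbf{H}$ is defined as follows:

$$\mathbf{h}_1 = (\mathbf{0}'_{n_a}, \mathbf{1}'_{n_y}, \mathbf{0}'_{n_c})' \tag{23}$$

$$\mathbf{h}_2 = (\mathbf{0}'_{n_a+n_y}, \mathbf{1}'_{n_c})' \tag{24}$$

$$\mathbf{h}_3 = (\mathbf{0}'_{n_a+n_y}, \mathbf{n}'_c)' \tag{25}$$

$$\mathbf{H} = (\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3)'; \tag{26}$$

here $\mathbf{0}_s$ is a vector of 0's of the indicated length and $\mathbf{n}_c = (1, \ldots, n_c)'$. We refer to $\mathbf{H}$, $3 \times (n_a + n_y + n_c)$, as the constraints matrix. The matrix

$$\mathbf{X}_{aug} = \begin{pmatrix} \mathbf{X} \\ \mathbf{H} \end{pmatrix} \tag{27}$$

is known as the augmented matrix; the maximum likelihood equations have a unique solution, $\hat{\boldsymbol{\theta}}$, subject to the constraint $\mathbf{H}\boldsymbol{\theta} = \mathbf{0}$, if $\mathbf{X}_{aug}$ is of full column rank. This gives a simple way of checking whether a set of constraints does lead to a unique solution.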

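The glm function in R uses a different method of obtaining a solution. In effect, R sets $\kappa_1 = \gamma_1 = \gamma_{n_c} = 0$. These are also linear constraints and so can similarly be expressed in terms of a constraints matrix, $\mathbf{H}^R$, say, where

$$\mathbf{h}^R_1 = (\mathbf{0}'_{n_a}, 1, \mathbf{0}'_{n_y+n_c-1})' \tag{28}$$

$$\mathbf{h}^R_2 = (\mathbf{0}'_{n_a+n_y}, 1, \mathbf{0}'_{n_c-1})' \tag{29}$$

$$\mathbf{h}^R_3 = (\mathbf{0}'_{n_a+n_y+n_c-1}, 1)' \tag{30}$$

$$\mathbf{H}^R = (\mathbf{h}^R_1, \mathbf{h}^R_2, \mathbf{h}^R_3)'. \tag{31}$$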

The R package (and many other statistical packages) gives a very simple method of fitting generalized linear models such as the APC model but it is not so easy to make R use a

Table 2: Model M3 or APC. Poisson deviances, DevP, for Poisson model for µ with log link, binomial models for q with logit and complementary log-log links. Minimum DevP marked with *. Age: 50-90, Period: 1960-2009, Degrees of freedom: 178.

                       DevP
Country      log µ, P    logit q, B    cloglog q, B
USA          22789       20001*        20951
UK           7015        5754*         6385
Japan        11760       9600*         10871
Australia    3845        3392*         3564
Sweden       3690        3231*         3421
France       12920       10297*        11717

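particular set of constraints. Suppose $\hat{\boldsymbol{\theta}}$ and $\hat{\boldsymbol{\theta}}_R$ are the MLEs of $\boldsymbol{\theta}$ under our preferred constraints $\mathbf{H}$ and R's constraints $\mathbf{H}^R$ respectively. We have simple access to $\hat{\boldsymbol{\theta}}_R$. A natural question is: can we obtain $\hat{\boldsymbol{\theta}}$ from $\hat{\boldsymbol{\theta}}_R$?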

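We know that $\mathbf{X}\hat{\boldsymbol{\theta}} = \mathbf{X}\hat{\boldsymbol{\theta}}_R$ since the fitted $\hat{\mu}_{x,t}$, or $\hat{q}_{x,t}$ as the case may be, are invariant with respect to the choice of constraints. Further, $\mathbf{H}\hat{\boldsymbol{\theta}} = \mathbf{H}^R\hat{\boldsymbol{\theta}}_R = \mathbf{0}$. Hence

$$\mathbf{X}_{aug}\hat{\boldsymbol{\theta}} = \begin{pmatrix} \mathbf{X} \\ \mathbf{H} \end{pmatrix}\hat{\boldsymbol{\theta}} = \begin{pmatrix} \mathbf{X}\hat{\boldsymbol{\theta}} \\ \mathbf{H}\hat{\boldsymbol{\theta}} \end{pmatrix} = \begin{pmatrix} \mathbf{X}\hat{\boldsymbol{\theta}}_R \\ \mathbf{H}^R\hat{\boldsymbol{\theta}}_R \end{pmatrix} = \begin{pmatrix} \mathbf{X} \\ \mathbf{H}^R \end{pmatrix}\hat{\boldsymbol{\theta}}_R. \tag{32}$$

Now $\mathbf{X}_{aug}$ in (27) has full column rank so $\mathbf{X}'_{aug}\mathbf{X}_{aug} = \mathbf{X}'\mathbf{X} + \mathbf{H}'\mathbf{H}$ is positive definite and hence non-singular. Thus multiplying both sides of (32) by $\mathbf{X}'_{aug} = [\mathbf{X}' : \mathbf{H}']$ gives

$$\hat{\boldsymbol{\theta}} = (\mathbf{X}'\mathbf{X} + \mathbf{H}'\mathbf{H})^{-1}(\mathbf{X}'\mathbf{X} + \mathbf{H}'\mathbf{H}^R)\hat{\boldsymbol{\theta}}_R \tag{33}$$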

and we have expressed $\hat{\boldsymbol{\theta}}$ in terms of $\hat{\boldsymbol{\theta}}_R$, as required. We see two uses for (33): first, we can use it to check the results of specialized software designed to fit models with specified constraints, such as M3; second, in some cases, we may be able to avoid the use of specialized software altogether.
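The transformation (33) can be sketched in R for a small APC model. Everything about the data here is an assumption made for illustration (dimensions, exposures, the rates used to simulate deaths). Rather than assuming which coefficients glm sets to zero, the sketch reads the aliased coefficients off the fit (they are returned as NA) and builds R's constraints matrix from them.

```r
set.seed(3)
na <- 5; ny <- 6; nc <- na + ny - 1; n <- na * ny
age <- rep(1:na, times = ny); yr <- rep(1:ny, each = na)
coh <- age - yr + ny                       # 1 = youngest, nc = oldest cohort

X <- cbind(matrix(1, ny, 1) %x% diag(na),  # age effects
           diag(ny) %x% matrix(1, na, 1),  # period effects
           outer(coh, 1:nc, "==") * 1)     # cohort effects

# Constraints (20)-(22) as the rows of H, equations (23)-(26)
H <- rbind(c(rep(0, na), rep(1, ny), rep(0, nc)),
           c(rep(0, na + ny), rep(1, nc)),
           c(rep(0, na + ny), 1:nc))
stopifnot(qr(rbind(X, H))$rank == ncol(X)) # augmented matrix has full rank

# Hypothetical deaths and central exposures
e     <- rep(10000, n)
Death <- rpois(n, e * exp(-5 + 0.3 * age - 0.05 * yr))
fit   <- glm(Death ~ -1 + X + offset(log(e)), family = poisson)

# R's solution: aliased coefficients are NA, i.e. constrained to zero
theta.R <- coef(fit)
aliased <- is.na(theta.R)
theta.R[aliased] <- 0
HR <- diag(ncol(X))[aliased, , drop = FALSE]   # R's constraints matrix

# Equation (33)
XtX   <- crossprod(X)
theta <- solve(XtX + crossprod(H), (XtX + t(H) %*% HR) %*% theta.R)

max(abs(H %*% theta))                      # constraints hold (numerically 0)
max(abs(X %*% theta - X %*% theta.R))      # linear predictor unchanged
```

Both checks at the end should be zero up to rounding error: the transformed estimates satisfy (20)-(22) and give the same fitted values as the glm solution.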

With X defined in R our three models can be fitted immediately with (11), (14) and (16). We compare the fit of the models through the Poisson deviance; similar conclusions can be drawn if we use the binomial deviance so these results are omitted. The results for our six countries are given in Table 2 and strongly suggest that the APC model structure is best combined with a binomial error and a logit link acting on $q$.

5.2 M5: Original CBD models

The original CBD model (Cairns et al., 2006), as defined in Table 1, has model matrix $\mathbf{X} = [\mathbf{X}_1 : \mathbf{X}_2]$ where $\mathbf{X}_1$, $n \times n_y$, defines the regression matrix corresponding to $\boldsymbol{\kappa}^{(1)} = (\kappa^{(1)}_1, \ldots, \kappa^{(1)}_{n_y})'$, and $\mathbf{X}_2$, $n \times n_y$, defines the regression matrix corresponding to $\boldsymbol{\kappa}^{(2)} = (\kappa^{(2)}_1, \ldots, \kappa^{(2)}_{n_y})'$; thus $\mathbf{X}$ is $n \times 2n_y$. We have $\mathbf{X}_1 = \mathbf{I}_{n_y} \otimes \mathbf{1}_{n_a}$ and $\mathbf{X}_2 = \mathbf{I}_{n_y} \otimes \mathbf{x}_m$ where $\mathbf{x}_m = (x_1 - \bar{x}, \ldots, x_{n_a} - \bar{x})'$ is the vector of centred ages. The model matrix

$$\mathbf{X} = [\mathbf{I}_{n_y} \otimes \mathbf{1}_{n_a} : \mathbf{I}_{n_y} \otimes \mathbf{x}_m] \tag{34}$$

has full column rank and the model is identifiable. Again, the R code in (11), (14) and (16) can be used to fit our three variants of the CBD model. Table 3 shows that a binomial model for $q$ is preferred in all cases, but twice with a logit link and four times with a complementary log-log link.

Table 3: Model M5 or CBD. Poisson deviances, DevP, for Poisson model for µ with log link, binomial models for q with logit and complementary log-log links. Minimum DevP shown in bold. Age: 50-90, Period: 1960-2009, Degrees of freedom: 100.
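The model matrix (34) takes two Kronecker products in R. The sketch below uses the dimensions of our data (ages 50 to 90, 50 years) and confirms that the matrix has full column rank.

```r
Age <- 50:90
na  <- length(Age)                    # 41 ages
ny  <- 50                             # 50 years, 1960 to 2009

xm <- Age - mean(Age)                 # centred ages

X1 <- diag(ny) %x% matrix(1,  na, 1)  # I_ny (x) 1_na
X2 <- diag(ny) %x% matrix(xm, na, 1)  # I_ny (x) x_m
X  <- cbind(X1, X2)                   # model matrix (34)

dim(X)                                # n x 2 ny
qr(X)$rank                            # equals the number of columns
```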

5.3 M6: CBD models with cohort effects

Model M6 is the original CBD model with added cohort effects. Its model matrix is

$$\mathbf{X} = [\mathbf{I}_{n_y} \otimes \mathbf{1}_{n_a} : \mathbf{I}_{n_y} \otimes \mathbf{x}_m : \mathbf{X}_c] \tag{35}$$

and has rank two less than its column dimension. The parameter vector is $\boldsymbol{\theta} = (\boldsymbol{\kappa}^{(1)\prime}, \boldsymbol{\kappa}^{(2)\prime}, \boldsymbol{\gamma}')'$ and has length $2n_y + n_c$. The constraints (21) and (22), as used by Cairns et al. (2009), ensure identifiability, and the constraints matrix, $\mathbf{H}$, is

$$\mathbf{h}_1 = (\mathbf{0}'_{2n_y}, \mathbf{1}'_{n_c})' \tag{36}$$

$$\mathbf{h}_2 = (\mathbf{0}'_{2n_y}, \mathbf{n}'_c)' \tag{37}$$

$$\mathbf{H} = (\mathbf{h}_1, \mathbf{h}_2)' \tag{38}$$

with $\mathbf{n}_c$ defined below (26). We note that although the same constraints (21) and (22) are used in both M3 and M6 the corresponding rows of their constraints matrices are different.

The model with $\mathbf{X}$ given by (35) may be fitted in R with the glm function and (11), (14) or (16); in all three cases the implied constraints matrix, $\mathbf{H}^R$, is

$$\mathbf{h}^R_1 = (\mathbf{0}'_{2n_y+n_c-2}, 1, 0)' \tag{39}$$

$$\mathbf{h}^R_2 = (\mathbf{0}'_{2n_y+n_c-1}, 1)' \tag{40}$$

$$\mathbf{H}^R = (\mathbf{h}^R_1, \mathbf{h}^R_2)' \tag{41}$$

since R in effect sets $\gamma_{n_c-1} = \gamma_{n_c} = 0$. The maximum likelihood solution subject to the constraints (21) and (22) is obtained by transforming the glm solution with (33).

Table 4: Model M6 or CBD(C). Poisson deviances, DevP, for Poisson model for µ with log link, binomial models for q with logit and complementary log-log links. Minimum DevP shown in bold. Age: 50-90, Period: 1960-2009, Degrees of freedom: 188.

As in the case of the original CBD model we see from Table 4 that the binomial model for $q$ is preferred in all six cases; the complementary log-log link gives the lowest deviance in four of these.

5.4 M7: Quadratic CBD models with cohort effects

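Model M7 is M6 with an added quadratic age effect. From Table 1, we let $\sigma^2_x = \sum (x - \bar{x})^2/n_a$ and $\mathbf{x}_q = ((x_1 - \bar{x})^2 - \sigma^2_x, \ldots, (x_{n_a} - \bar{x})^2 - \sigma^2_x)'$. Adding the quadratic effect in age to the model matrix (35) for M6, we obtain the model matrix for M7 as

$$\mathbf{X} = [\mathbf{I}_{n_y} \otimes \mathbf{1}_{n_a} : \mathbf{I}_{n_y} \otimes \mathbf{x}_m : \mathbf{I}_{n_y} \otimes \mathbf{x}_q : \mathbf{X}_c]. \tag{42}$$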

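Let $\boldsymbol{\theta} = (\boldsymbol{\kappa}^{(1)\prime}, \boldsymbol{\kappa}^{(2)\prime}, \boldsymbol{\kappa}^{(3)\prime}, \boldsymbol{\gamma}')'$ be the corresponding parameter set. The model matrix $\mathbf{X}$ has $3n_y + n_c$ columns but rank $3n_y + n_c - 3$. We use the same three constraints as Cairns et al. (2009), namely constraints (21) and (22) together with

$$\sum_{c=1}^{n_c} c^2 \gamma_c = 0. \tag{43}$$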

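These three constraints correspond to the constraints matrix

$$\mathbf{h}_1 = (\mathbf{0}'_{3n_y}, \mathbf{1}'_{n_c})' \tag{44}$$

$$\mathbf{h}_2 = (\mathbf{0}'_{3n_y}, \mathbf{n}'_c)' \tag{45}$$

$$\mathbf{h}_3 = (\mathbf{0}'_{3n_y}, \mathbf{n}^{2\prime}_c)' \tag{46}$$

$$\mathbf{H} = (\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3)' \tag{47}$$

where $\mathbf{n}_c = (1, \ldots, n_c)'$ and $\mathbf{n}^2_c = \mathbf{n}_c * \mathbf{n}_c$, the element-by-element product. The R function glm in effect sets $\gamma_{n_c-2} = \gamma_{n_c-1} = \gamma_{n_c} = 0$ which corresponds to the constraints matrix $\mathbf{H}^R$

$$\mathbf{h}^R_1 = (\mathbf{0}'_{3n_y+n_c-3}, 1, 0, 0)' \tag{48}$$

$$\mathbf{h}^R_2 = (\mathbf{0}'_{3n_y+n_c-2}, 1, 0)' \tag{49}$$

$$\mathbf{h}^R_3 = (\mathbf{0}'_{3n_y+n_c-1}, 1)' \tag{50}$$

$$\mathbf{H}^R = (\mathbf{h}^R_1, \mathbf{h}^R_2, \mathbf{h}^R_3)'. \tag{51}$$

Table 5: Model M7 or CBD(QC). Poisson deviances, DevP, for Poisson model for µ with log link, binomial models for q with logit and complementary log-log links. Minimum DevP shown in bold. Age: 50-90, Period: 1960-2009, Degrees of freedom: 237.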

Again, the maximum likelihood solution subject to the constraints (21), (22) and (43) is obtained by transforming the glm solution with (33). The results for our six test countries are given in Table 5; there is a clear preference for the binomial model with logit link for model M7.

5.5 M8: CBD models with age modulated cohort effects

The model structure for M8 is

$$\kappa^{(1)}_t + \kappa^{(2)}_t (x - \bar{x}) + \gamma^{(3)}_{t-x}(\delta - x). \tag{52}$$

The multiplicative nature of the final model term means that M8 does not fit immediately into the GLM structure. However, if we condition on $\delta$ then we have a GLM. The model matrix is

$$\mathbf{X} = [\mathbf{I}_{n_y} \otimes \mathbf{1}_{n_a} : \mathbf{I}_{n_y} \otimes \mathbf{x}_m : \mathbf{X}_c(\delta)]. \tag{53}$$

The age and period terms are the same as M6 and M7 but the cohort term is now a function of $\delta$. Let $\mathbf{v}(\delta)$ be the vector $\mathbf{1}_{n_y} \otimes (\delta \mathbf{1}_{n_a} - \mathbf{x})$; then

$$\mathbf{X}_c(\delta) = \mathbf{v}(\delta) * \mathbf{X}_c \tag{54}$$

where $\mathbf{X}_c$ is the cohort portion of the model matrix for M3, M6 or M7. Some explanation of (54) is in order: we note that the length of $\mathbf{v}(\delta)$ is equal to the number of rows of $\mathbf{X}_c$; this allows us to interpret the multiplication symbol $*$ to mean that the vector $\mathbf{v}(\delta)$ is multiplied element-by-element onto column $i$ of $\mathbf{X}_c$, $i = 1, \ldots, n_c$. This convenient convention coincides with how R treats pre-multiplication of a matrix by a column vector.
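The convention in (54) can be illustrated with a very small example in R. The ages, the number of years and the value of `delta` below are hypothetical; the point is only that `v * Xc` scales row $i$ of `Xc` by element $i$ of `v`.

```r
Age <- 50:53; na <- length(Age); ny <- 3; nc <- na + ny - 1
age <- rep(1:na, times = ny); yr <- rep(1:ny, each = na)

# Cohort indicator matrix, as for M3, M6 and M7
Xc <- outer(age - yr + ny, 1:nc, "==") * 1

delta <- 60                           # hypothetical value of delta
v <- rep(delta - Age, times = ny)     # 1_ny (x) (delta 1_na - x)

# Equation (54): v is recycled down each column of Xc
Xc.delta <- v * Xc

# The same result written as an explicit matrix product
all.equal(Xc.delta, diag(v) %*% Xc)
```

Each row of `Xc.delta` has a single non-zero entry, equal to $\delta - x$ for the age of that observation, in the column of its cohort.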

With this definition of the model matrix $\mathbf{X}$ we can fit M8 for given $\delta$ with any of (11), (14) or (16). Let $\mathrm{Dev}(\delta)$ be the deviance for given $\delta$. In general we found that $\mathrm{Dev}(\delta)$ was not a well-behaved function of $\delta$ and rather than report results for all six of our test countries we use the data from the USA to illustrate the kind of problems that arise. Similar difficulties arise with the other five countries.

In general, the model matrix $\mathbf{X}$ is $n \times (2n_y + n_c)$ and has rank $2n_y + n_c - 1$; however, at $\delta = 90$, the maximum age, we found that $\mathbf{X}$ has rank $2n_y + n_c - 2$. The first difficulty for model selection is that the deviance function, $\mathrm{Dev}(\delta)$, is not continuous at $\delta = 90$. Figure 2 is a plot of $\mathrm{Dev}(\delta)$ against $\delta$ for $\delta$ in three different ranges. For $\delta$ lying between the minimum and maximum age we have a local minimum at $\delta = 61.23$ with a further minimum at $\delta = 90^-$, i.e., for values of $\delta$ just less than 90. We find $\mathrm{Dev}(90^-) = 29693$ while $\mathrm{Dev}(90) = 29712$. For values of $\delta$ larger than 90 we find a further minimum at around $\delta = 6.2 \times 10^6$. The global minimum occurs at $\delta = -14.66$.

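Cairns et al. (2009) impose identifiability on M8 with the constraint $\sum w_c \gamma_c = 0$ where $w_c$ is the number of times that cohort $c$ occurs and $\gamma_c$ is the corresponding parameter. As with M3, the APC model, R in effect sets $\gamma_{n_c} = 0$. Let $\hat\kappa_{0,R}$ and $\hat\kappa_{1,R}$ be the estimates of $\kappa_0$ and $\kappa_1$ returned by R and let $\hat\gamma_R$ be the estimate of $\gamma$ returned by R augmented by a 0 in the last place. Let $w_c$ be the cohort weights and let $\bar\gamma = \sum w_c \hat\gamma_{c,R} / \sum w_c$ be the weighted mean of the cohort estimates returned by R. Then for given δ we recover the estimates of $\kappa_0$, $\kappa_1$ and $\gamma$ subject to the constraint $\sum w_c \gamma_c = 0$ as follows:

$$\begin{aligned}
\hat\gamma &= \hat\gamma_R - \bar\gamma 1_{n_c} \\
\hat\kappa_1 &= \hat\kappa_{1,R} - \bar\gamma 1_{n_y} \\
\hat\kappa_0 &= \hat\kappa_{0,R} + \bar\gamma(\delta - \bar{x}) 1_{n_y}.
\end{aligned} \tag{55}$$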
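A short check of (55) may help. The block below is a worked identity, written under the assumption that the linear predictor has the structure (52) with the two period terms labelled $\kappa_0$ and $\kappa_1$ as in this paragraph. It uses only the decomposition $\delta - x = (\delta - \bar{x}) - (x - \bar{x})$.

```latex
\begin{align*}
\kappa_{0,t} + \kappa_{1,t}(x - \bar{x}) + \gamma_c(\delta - x)
  &= \kappa_{0,t} + \kappa_{1,t}(x - \bar{x})
     + (\gamma_c - \bar{\gamma})(\delta - x) + \bar{\gamma}(\delta - x) \\
  &= \{\kappa_{0,t} + \bar{\gamma}(\delta - \bar{x})\}
     + \{\kappa_{1,t} - \bar{\gamma}\}(x - \bar{x})
     + (\gamma_c - \bar{\gamma})(\delta - x).
\end{align*}
```

Subtracting $\bar\gamma$ from every cohort parameter therefore leaves the fitted values unchanged provided $\bar\gamma(\delta - \bar{x})$ is added to $\kappa_0$ and $\bar\gamma$ is subtracted from $\kappa_1$, which is the transformation in (55).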

Figure 2 also shows a plot of the different estimates of κ0, κ1 and γ for two of the three values of δ corresponding to the local minima of Dev(δ); the estimates of κ0, κ1 and γ are all highly dependent on δ. The parameter estimates corresponding to $\delta = 6.2 \times 10^6$ are very extreme and are omitted from Figure 2. We will comment on the possibilities for forecasting with M8 in our concluding remarks.

6 Generalized non-linear models

The Lee-Carter (Lee and Carter, 1992) and Renshaw-Haberman models (Renshaw and Haberman, 2006) are denoted M1 and M2 respectively in Cairns et al. (2009); their model structures are given in Table 1. The multiplicative nature of terms like $\beta^{(2)}_x \kappa^{(2)}_t$ in these models means that they do not fit into the GLM structure. However, the R-package gnm (Turner and Firth, 2012) does allow the direct fitting of such models with a model specification that is very similar to that of the glm function. We use the gnm function in the package gnm to examine the choice of link function for M1 and M2.

[Figure 2 appears here: six panels. Panels 1 to 3 plot the deviance against the constant δ (panel 2 on a log10 δ scale); panels 4 and 5 plot κ0 and κ1 against year; panel 6 plots γ against year of birth. Panels 4 to 6 show curves for δ = 61 and δ = −14.7.]

Figure 2: Panels 1, 2 and 3: deviance as a function of δ for $\delta \in [40, 90]$, $\delta \in [10^5, 10^7]$ and $\delta \in [-100, 20]$; panels 4, 5 and 6: estimates of κ0, κ1 and γ for δ = 61, −14.7.

6.1 M1: Lee-Carter models

Model specification with the gnm function is best done with the factor function in R. Let d and e be the vectors of deaths and central exposures as in Section 5. If $x = (x_1, \ldots, x_{n_a})'$ is the vector of ages as in (10) then $1_{n_y} \otimes x$ is the vector of age suffixes for d; in a similar way, if $y = (y_1, \ldots, y_{n_y})'$ is the vector of years then $y \otimes 1_{n_a}$ is the vector of year suffixes for d. Let Age, Year and Off be the R variables which contain $1_{n_y} \otimes x$, $y \otimes 1_{n_a}$ and the offset log e respectively. We convert Age and Year to factors and fit model M1 for the force of mortality with a log link and Poisson errors with

    Age.F = factor(Age); Year.F = factor(Year)
    gnm(Dth ~ -1 + Age.F + Mult(Age.F, Year.F) + offset(Off),    # (56)
        family = poisson(link = "log"))

The Mult function allows the product of predictors to appear in the model specification. The two binomial models for q are fitted by modifying (56) as indicated in the code in (14) and (16). The results for M1 are given in Table 6; the binomial model for q with logit link is preferred in all cases.

The code (56) does not return the usual parameter estimates in the Lee-Carter model and in order to recover these estimates we first write the Lee-Carter specification in Table 1 in the simpler form $\beta^{(1)}_x \to \alpha_x$, $\beta^{(2)}_x \to \beta_x$ and $\kappa^{(2)}_t \to \kappa_t$. With this notation the structure of the model is $\alpha_x + \beta_x \kappa_t$. The model is generally fitted with the location constraint $\sum \kappa_t = 0$ and the scale constraint $\sum \beta_x = 1$ (Lee and Carter, 1992; Cairns et al., 2009). The gnm function returns estimates of α, β and κ with a random parameterization (Turner and Firth, 2012); thus two calls of (56) result in different parameter estimates, although model invariants such as the deviance and the fitted values are the same. Let $\hat\alpha_R$, $\hat\beta_R$ and $\hat\kappa_R$ be any estimates returned by the gnm function and $\hat\alpha$, $\hat\beta$ and $\hat\kappa$ be the estimates subject to the usual constraints. Let $\bar\kappa_R = \sum_t \hat\kappa_{t,R}/n_y$ and $\bar\beta_R = \sum_x \hat\beta_{x,R}/n_a$. The usual estimates are found as follows:

$$\hat\alpha = \hat\alpha_R + \bar\kappa_R \hat\beta_R \tag{57}$$

$$\hat\kappa = n_a \bar\beta_R (\hat\kappa_R - \bar\kappa_R 1_{n_y}) \tag{58}$$

$$\hat\beta = \hat\beta_R/(n_a \bar\beta_R). \tag{59}$$
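The transformation (57) to (59) can be checked numerically. The sketch below is an illustration in Python (standard library only) rather than the paper's R code: it draws an arbitrary parameterization at random, which stands in for the output of gnm, and confirms that the transformed estimates satisfy both constraints while the linear predictor is unchanged. The function names and the toy dimensions are hypothetical.

```python
import random


def constrain_lee_carter(alpha_r, beta_r, kappa_r):
    """Map any (alpha, beta, kappa) to the version with sum(kappa) = 0, sum(beta) = 1."""
    kappa_bar = sum(kappa_r) / len(kappa_r)
    beta_sum = sum(beta_r)  # equals n_a * beta_bar_R in the notation of the text
    alpha = [a + kappa_bar * b for a, b in zip(alpha_r, beta_r)]   # (57)
    kappa = [beta_sum * (k - kappa_bar) for k in kappa_r]          # (58)
    beta = [b / beta_sum for b in beta_r]                          # (59)
    return alpha, beta, kappa


def linear_predictor(alpha, beta, kappa):
    """Table of alpha_x + beta_x * kappa_t, one row per age."""
    return [[a + b * k for k in kappa] for a, b in zip(alpha, beta)]


random.seed(1)
n_a, n_y = 5, 4  # toy dimensions, not those of the data in the paper
alpha_r = [random.uniform(-6, -2) for _ in range(n_a)]
beta_r = [random.uniform(0.5, 1.5) for _ in range(n_a)]
kappa_r = [random.uniform(-3, 3) for _ in range(n_y)]

alpha, beta, kappa = constrain_lee_carter(alpha_r, beta_r, kappa_r)
eta_r = linear_predictor(alpha_r, beta_r, kappa_r)
eta = linear_predictor(alpha, beta, kappa)
max_diff = max(abs(u - v) for ru, rv in zip(eta_r, eta) for u, v in zip(ru, rv))
print(sum(kappa), sum(beta), max_diff)
```

The sketch requires the sum of the returned β values to be non-zero, which is also what (59) requires.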

6.2 M2: Renshaw-Haberman models

We fit M2 (Renshaw and Haberman, 2006) by extending the code in (56). Let z be the vector of cohorts, ie, $z_i = c$ if the $i$th data point, $i = 1, \ldots, n$, contains cohort $c$, $c = 1, \ldots, n_c$; let Cohort be the R variable corresponding to z. We fit M2 with

    Cohort.F = factor(Cohort)
    gnm(Dth ~ -1 + Age.F + Mult(Age.F, Year.F) + Mult(Age.F, Cohort.F) +    # (60)
        offset(Off), family = poisson(link = "log"))

Table 6: Model M1 or LC. Poisson deviances, $Dev_P$, for Poisson model for µ with log link, binomial models for q with logit and complementary log-log links. Minimum $Dev_P$ shown in bold. Age: 50-90, Period: 1960-2009, Degrees of freedom: 130.


Table 7: Model M2 or RH. Poisson deviances, $Dev_P$, for Poisson model for µ with log link, binomial models for q with logit and complementary log-log links. Minimum $Dev_P$ shown in bold. Age: 50-90, Period: 1960-2009, Degrees of freedom: 259.


Table 7 shows that the binomial model for q is preferred in all cases, four times with the logit link and twice with the complementary log-log link.

We examine the fitting of the RH model in some detail. First, we write the model specification in Table 1 in the form $\beta^{(1)}_x \to \alpha_x$, $\beta^{(2)}_x \to \beta_{2,x}$, $\beta^{(3)}_x \to \beta_{3,x}$, $\kappa^{(2)}_t \to \kappa_t$ and $\gamma^{(3)}_{t-x} \to \gamma_{t-x}$ with vector equivalents $\alpha$, $\beta_2$, $\beta_3$, $\kappa$ and $\gamma$. With this notation the structure of the model is $\alpha_x + \beta_{2,x}\kappa_t + \beta_{3,x}\gamma_{t-x}$. We ensure model identification with two location constraints ((20) and (21) as used in the APC model) and two scale constraints, namely,

$$\sum_{t=1}^{n_y} \kappa_t = \sum_{c=1}^{n_c} \gamma_c = 0, \qquad \sum_{x=1}^{n_a} \beta_{2,x} = \sum_{x=1}^{n_a} \beta_{3,x} = 1. \tag{61}$$

Fitting the RH model is not entirely straightforward. Renshaw and Haberman (2006) in their original paper fixed the values of $\alpha_x = n_y^{-1} \sum_t \log(d_{x,t}/e_{x,t})$ and then estimated the remaining parameters conditional on this estimate of α. Cairns et al. (2009) used this estimate of α only as a starting value in a full iterative scheme, but reported very slow convergence. Our experience with the gnm function is reported in Table 8. The function gnm does not know about any particular constraints and uses a random parameterization. We found that for a particular parameterization convergence could be very fast, very slow or even fail completely. For example, in 25 simulations, ie, 25 calls of (60) with the UK data we found (a) convergence in 14 cases in an average number of iterations of 53, (b) non-convergence in 6 cases (after 500 iterations) and (c) model failure in 5 cases, ie, gnm was unable to fit the model.

We have not discussed convergence criteria so far. We used the glm function to fit M3, M5, M6, M7 and M8 with convergence when the relative change in the deviance is small, ie, when

$$\frac{|\mathrm{Dev} - \mathrm{Dev.Old}|}{|\mathrm{Dev}| + 0.1} < 10^{-8}. \tag{62}$$

The convergence criterion of the gnm function is in terms of the score vector

$$S(\theta) = \frac{\partial \ell}{\partial \theta} \tag{63}$$

where θ is the vector of parameters. It is well known (see, for example, Wood (2006, pp 103-104)) that at the maximum likelihood solution the score vector has mean 0 and variance equal to the Fisher Information

$$I(\theta) \approx -\frac{\partial^2 \ell}{\partial \theta^2}. \tag{64}$$

The gnm function sets its convergence criterion in terms of how small S(θ) is relative to its estimated standard deviation, ie, when

$$\frac{S(\theta)^2}{\mathrm{diag}\{I(\theta)\}} < 10^{-12} \tag{65}$$

where the inequality is interpreted element-by-element. Both (62) and (65) are stringent criteria.
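The two criteria can behave quite differently when the log likelihood is nearly flat in some direction. The following toy sketch in Python (standard library only) applies (62) and (65) to a one-parameter quadratic deviance; the numbers are hypothetical and chosen only to make the contrast visible, and the function names are not part of glm or gnm.

```python
def deviance_converged(dev, dev_old, eps=1e-8):
    """Criterion (62): relative change in the deviance."""
    return abs(dev - dev_old) / (abs(dev) + 0.1) < eps


def score_converged(score, info_diag, eps=1e-12):
    """Criterion (65): squared score relative to the diagonal information, element-wise."""
    return all(s * s / i < eps for s, i in zip(score, info_diag))


# Toy deviance with one nearly flat direction (hypothetical numbers):
# Dev(theta) = base + h * theta**2, so the log likelihood is -Dev/2,
# the score is -h * theta and the information is h.
base, h = 8550.0, 1e-6


def deviance(theta):
    return base + h * theta ** 2


theta_old, theta_new = 1000.0, 999.9999  # a very small step, far from the optimum at 0
dev_ok = deviance_converged(deviance(theta_new), deviance(theta_old))
score_ok = score_converged([-h * theta_new], [h])
print(dev_ok, score_ok)
```

In this constructed case the deviance criterion reports convergence although the parameter is still far from its optimum, while the score criterion does not.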

An alternative strategy to using a random parameterization with gnm is to fix the parameterization by supplying initial values for the parameters. For example, we could use the estimates from the APC model. Let $\hat\alpha_{M3}$, $\hat\kappa_{M3}$ and $\hat\gamma_{M3}$ be the maximum likelihood estimates of α, κ and γ in the APC model of section 5.1 subject to the constraints (20), (21) and (22). Let Alpha, Kappa and Gamma be the R variables which contain $\hat\alpha_{M3}$, $\hat\kappa_{M3}$ and $\hat\gamma_{M3}$ respectively. Let Beta2 and Beta3 be the R variables of length $n_a$ with each component equal to $1/n_a$ (so that $\sum_x \beta_{2,x} = \sum_x \beta_{3,x} = 1$). Finally, let Start be the R variable defined by

    Start = c(Alpha, Beta2, n.a * Kappa, Beta3, n.a * Gamma)    # (66)

The factor n.a offsets the components $1/n_a$ of Beta2 and Beta3, so that the initial linear predictor coincides with the APC fit. Then we may fit the RH model by extending the third line of (60) to

    offset(Off), family = poisson(link = "log"), start = Start)    # (67)

This strategy produces fast convergence in all cases except the USA, where the method fails, as can be seen in the final column of Table 8.

Table 8: Performance of gnm in fitting M2 for Poisson model for µ with log link: (a) 25 simulations with random initial values; (b) APC initial values. (a) Number of simulations to (i) converge to MLE, (ii) not converge in 500 iterations, (iii) fail; (iv) mean number of iterations for converged simulations. (b) Number of iterations to convergence with APC initial values.

               (a) Simulation                                        (b) Initialization
Country        Converged   Not converged   Failed   Mean iterations   Iterations
USA                4             8           13           57              -
UK                14             6            5           53             36
Japan             22             0            3           38             33
Australia          4            10           11           65             30
Sweden             4             6           15           45             49
France            11             8            6           33             28

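We can obtain some insight into the nature of the RH model by examining the case of the USA in more detail. First, we recover the parameter estimates in the RH model subject to the constraints (61). Let $\hat\alpha_R$, $\hat\beta_{2,R}$, $\hat\kappa_R$, $\hat\beta_{3,R}$ and $\hat\gamma_R$ be any estimates returned by the gnm function and let $\hat\alpha$, $\hat\beta_2$, $\hat\kappa$, $\hat\beta_3$ and $\hat\gamma$ be the estimates subject to the constraints (61). Let $\bar\beta_{2,R} = \sum_x \hat\beta_{2,x,R}/n_a$, $\bar\beta_{3,R} = \sum_x \hat\beta_{3,x,R}/n_a$, $\bar\kappa_R = \sum_t \hat\kappa_{t,R}/n_y$ and $\bar\gamma_R = \sum_c \hat\gamma_{c,R}/n_c$. Then

$$\hat\beta_2 = \hat\beta_{2,R}/(n_a \bar\beta_{2,R}) \tag{68}$$

$$\hat\kappa = n_a \bar\beta_{2,R} (\hat\kappa_R - \bar\kappa_R 1_{n_y}) \tag{69}$$

$$\hat\beta_3 = \hat\beta_{3,R}/(n_a \bar\beta_{3,R}) \tag{70}$$

$$\hat\gamma = (n_a \bar\beta_{3,R})(\hat\gamma_R - \bar\gamma_R 1_{n_c}) \tag{71}$$

$$\hat\alpha = \hat\alpha_R + \bar\kappa_R \hat\beta_{2,R} + \bar\gamma_R \hat\beta_{3,R}. \tag{72}$$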
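As a check on (68) to (72), the following worked identity substitutes the transformed estimates into the model structure $\alpha_x + \beta_{2,x}\kappa_t + \beta_{3,x}\gamma_c$, with $c$ denoting the cohort $t - x$. It is a sketch that assumes $n_a\bar\beta_{2,R}$ and $n_a\bar\beta_{3,R}$ are non-zero, so that the scale factors cancel.

```latex
\begin{align*}
\hat\alpha_x + \hat\beta_{2,x}\hat\kappa_t + \hat\beta_{3,x}\hat\gamma_c
  &= \hat\alpha_{x,R} + \bar\kappa_R\hat\beta_{2,x,R} + \bar\gamma_R\hat\beta_{3,x,R}
     + \hat\beta_{2,x,R}(\hat\kappa_{t,R} - \bar\kappa_R)
     + \hat\beta_{3,x,R}(\hat\gamma_{c,R} - \bar\gamma_R) \\
  &= \hat\alpha_{x,R} + \hat\beta_{2,x,R}\hat\kappa_{t,R} + \hat\beta_{3,x,R}\hat\gamma_{c,R}.
\end{align*}
```

The fitted values are therefore unchanged. The identity also shows that centring κ and γ is paid for by the shift $\bar\kappa_R\hat\beta_{2,R} + \bar\gamma_R\hat\beta_{3,R}$ in α, so the location constraints cannot be imposed without altering the age term.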

We consider three fitting strategies with the USA data.

(a) We use the code (60) with random parameterizations. We identify which of these random parameterizations lead to fast convergence. All such parameterizations give the same deviance, the same table of fitted log mortality, and, after transforming the parameter estimates with (68) to (72), the same maximum likelihood estimates subject to the constraints (61).

(b) We use the code (60) modified by (67). We know this method fails but the behaviour of the parameter estimates is instructive. We will compare the estimated coefficients after 50, 1000 and 5000 iterations of the gnm algorithm.

(c) We make an initial estimate of α, say $\alpha_0$, and fix this in the RH model, ie, we consider the model structure $\alpha_{0,x} + \beta_{2,x}\kappa_t + \beta_{3,x}\gamma_{t-x}$. This is closely related to the fitting strategy used in Renshaw and Haberman's original paper (2006).

The R-code to fit the modified RH model in (c) is similar to (60) as modified by (67). The initial estimates, $\alpha_0$, are the estimated coefficients in the simple age model:

    gnm(Dth ~ -1 + Age.F + offset(Off), family = poisson(link = "log"))    # (73)

Now let Off.Alpha be the R-variable which contains $1_{n_y} \otimes \alpha_0$ and Start.No.Alpha be the R-variable Start as in (66) with the first $n_a$ terms omitted, ie, omitting the values for α. We fit the model in (c) with

    gnm(Dth ~ -1 + Mult(Age.F, Year.F) + Mult(Age.F, Cohort.F) +
        offset(Off + Off.Alpha), family = poisson(link = "log"), start = Start.No.Alpha)

This model gives a deviance of 10827 but we find ourselves in a dilemma. The estimated coefficients, ie, $\beta_2$, $\beta_3$, $\kappa$ and $\gamma$, do not satisfy the constraints (61). There is no difficulty in imposing the scale constraints $\sum \beta_{2,x} = \sum \beta_{3,x} = 1$ but if we impose the location constraints (20) and (21) on κ and γ we distort the assumed fixed value of $\alpha_0$. In other words, we cannot estimate period and cohort effects independently of age effects. We will see further examples of this with the other approaches to fitting the RH model.

Table 9 shows deviances for the various approaches described in (a), (b) and (c) above. The optimal fit is obtained with fast convergence and is a clear improvement over the other approaches. The plots of the estimates of α and κ in Figure 3 repay scrutiny. The upper left panel of Figure 3 shows the estimates of α. The estimates of α, as estimated in (i) the APC model and used in Start and (ii) the model for age only in (73), conform to what we feel is the effect of age. But the other estimates, including the optimal value of α, are very different. Goldstein (1979) remarks in a discussion of the APC model: “The fact that parameters can be estimated does not imply that they can sensibly be interpreted”, and the RH model with USA data provides a particularly striking example of this maxim. The slow convergence reported by Cairns et al. (2009) is illustrated by the behaviour of the estimates of α and κ when the APC estimates are used as starting values. The deviance hardly changes between 1000 and 5000 iterations but the coefficients show no sign of settling down. The gnm convergence criterion (65) will deal with this situation whereas a criterion such as (62) based purely on the deviance (or equivalently the log likelihood) will not.

The plot of κ in the upper right panel of Figure 3 is particularly disconcerting if we attempt to interpret κ as a pure period effect, ie, independent of the age and cohort effects: the estimate of κ is increasing in the optimal fit. Figure 3 also shows the estimates of β2, β3, γ and the fitted log mortality at age 70.

Figure 3 suggests why the RH model is challenging to fit. For data sets such as the USA the “natural” initial values from the APC model do not provide good starting values for the true optimal values. The log likelihood changes very slowly in the region of the initial values and the algorithm is unable to “escape” to the region of the true maximum.

The challenging nature of the RH model has also prompted us to investigate our fitting method with six further countries from the Human Mortality Database. The method that specifies starting values from the APC model works for Canada, West Germany, Denmark and Norway but fails, like the USA, for Belgium and Netherlands. For these two countries the random parameterization of gnm does provide a solution.

[Figure 3 appears here: six panels. Panels 1, 3 and 4 plot α, β2 and β3 against age; panels 2 and 6 plot κ and log(mortality) against year; panel 5 plots γ against year of birth. Legend for the parameter panels: Initial, Optimal, Iter = 50, Iter = 1000, Iter = 5000, Fixed Alpha. Legend for panel 6: Observed, Optimal, Fixed Alpha.]

Figure 3: Parameter estimates for RH model with log link and Poisson errors: panel 1: α; panel 2: κ; panel 3: β2; panel 4: β3; panel 5: γ; panel 6: observed and fitted log(mortality) for age 70. Data: USA males, age 50-90, period 1960-2009.

Table 9: Table of deviances for models specified in (a), (b) and (c) above.

Model        Deviance   Iterations
APC            22789          3
Fixed α0       10827         25
Start           9144         50
Start           8551       1000
Start           8550       5000
Optimal         8536         52

7 Smooth 2-d P-spline models

It is not the purpose of this paper to describe the 2-dimensional P-spline model in any detail. For this the reader should consult the original paper on P-splines (Eilers and Marx, 1996) and the extension of P-splines to 2-dimensions in Currie et al. (2004). This second paper laid particular emphasis on the smoothing and forecasting of mortality data. Richards et al. (2006) described the method from an actuarial perspective.

The function Mort2Dsmooth in the R-package MortalitySmooth (Camarda, 2012) gives simple fitting of the 2-d P-spline model

    Mort2Dsmooth(Age, Year, Death.Mat, offset = log(Exposure.Mat))

where Age, Year, Death.Mat and Exposure.Mat are the R-variables which contain the vectors of ages and years, and the matrices of deaths and central exposures respectively. Forecasting is part of the package and overdispersion (see below) can be included, but the package is limited to Poisson models for µ with a log link.

The theory of P-splines depends closely on the theory of GLMs and this connection enables us to extend P-splines to binomial models for q with both a logit and complementary log-log link. We set up some notation. Let $x_a = (x_{a,1}, \ldots, x_{a,n_a})'$ and $x_y = (x_{y,1}, \ldots, x_{y,n_y})'$ be the vectors of ages and years respectively, and let $\mathcal{B}_a = \{B_{a,1}, \ldots, B_{a,c_a}\}$ and $\mathcal{B}_y = \{B_{y,1}, \ldots, B_{y,c_y}\}$ be cubic B-spline bases of dimensions $c_a$ and $c_y$ spanning age and year respectively. Let $B_a = (B_{a,j}(x_{a,i}))$, $n_a \times c_a$, and $B_y = (B_{y,j}(x_{y,i}))$, $n_y \times c_y$, be the resulting regression matrices in age and year. Then

$$X = B_y \otimes B_a \tag{74}$$

is the regression matrix for the 2-d P-spline model M4. So far we have not introduced a penalty function. Nevertheless, even without a penalty function, the regression matrix in (74) will result in some smoothing with smaller values of the dimensions $c_a$ and $c_y$ giving smoother surfaces while larger values give rougher surfaces. We can take X as the model matrix in a GLM, and any of our basic R-code, (11) for Poisson models for µ with log link, and (14) and (16) for both binomial models for q, can be used to fit the model. The Newton-Raphson algorithm for solving any maximum likelihood equations (MLEs) can be written

$$\hat\theta = \tilde\theta + I^{-1}(\tilde\theta) S(\tilde\theta) \tag{75}$$

where $S(\tilde\theta)$ is the score function in (63) and $I(\tilde\theta)$ is the Fisher Information in (64); here the tilde, as in $\tilde\theta$, represents the current estimate and the hat, as in $\hat\theta$, represents the improved estimate. In the case of a GLM with canonical link (75) can be written compactly as

$$X'\tilde{W}X\hat\theta = X'\tilde{W}\tilde{z} \tag{76}$$

where $\tilde{W}$ is a diagonal matrix of weights and $\tilde{z}$ is the working variable; the precise form of $\tilde{W}$ and $\tilde{z}$ depends on the model being fitted. For models with a non-canonical link (76) still holds provided the observed information $I(\tilde\theta)$ in the full Newton-Raphson algorithm is replaced with the expected information; in this case the algorithm is known as the scoring algorithm. The two algorithms coincide in the case of a canonical link. Table 10 gives formulae for $\tilde{W}$ and $\tilde{z}$ for the three GLMs considered here.

Table 10: Weights $\tilde{W}$ and working variables $\tilde{z}$. Notation: $d$ and $\tilde{d}$ are observed and current fitted deaths; $q$ and $\tilde{q}$ are observed and current fitted mortality rates; $e^*$ is initial exposed to risk.

Distribution   Link      Weight, $\tilde{W}$                                                  Working variable, $\tilde{z}$
Poisson        log       $\mathrm{diag}\{\tilde{d}\}$                                         $X\tilde\theta + d/\tilde{d} - 1$
Binomial       logit     $\mathrm{diag}\{e^*\tilde{q}(1 - \tilde{q})\}$                       $X\tilde\theta + (q - \tilde{q})/[\tilde{q}(1 - \tilde{q})]$
Binomial       cloglog   $\mathrm{diag}\{e^*(1 - \tilde{q})[\log(1 - \tilde{q})]^2/\tilde{q}\}$   $X\tilde\theta - (q - \tilde{q})/[(1 - \tilde{q})\log(1 - \tilde{q})]$
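The entries of Table 10 follow from the standard scoring formulae for a GLM (McCullagh and Nelder, 1989). The block below is a worked sketch rather than part of the original derivation: it writes $g$ for the link function, assumes that the observed rate $q$ has variance $q(1-q)/e^*$ under the binomial model, and takes the working variable net of any offset.

```latex
\begin{align*}
\text{General form:}\quad
  & \tilde{z} = \tilde\eta + (y - \tilde\mu)\, g'(\tilde\mu), \qquad
    \tilde{W} = \mathrm{diag}\left\{ \frac{1}{\mathrm{Var}(y)\, g'(\tilde\mu)^2} \right\}. \\[4pt]
\text{Poisson, log:}\quad
  & g'(\tilde{d}) = \frac{1}{\tilde{d}}, \quad \mathrm{Var}(d) = \tilde{d}
    \;\Rightarrow\; \tilde{W} = \mathrm{diag}\{\tilde{d}\}, \quad
    \tilde{z} = X\tilde\theta + \frac{d - \tilde{d}}{\tilde{d}}. \\[4pt]
\text{Binomial, logit:}\quad
  & g'(\tilde{q}) = \frac{1}{\tilde{q}(1 - \tilde{q})}
    \;\Rightarrow\; \tilde{W} = \mathrm{diag}\{e^* \tilde{q}(1 - \tilde{q})\}, \quad
    \tilde{z} = X\tilde\theta + \frac{q - \tilde{q}}{\tilde{q}(1 - \tilde{q})}. \\[4pt]
\text{Binomial, cloglog:}\quad
  & g(q) = \log\{-\log(1 - q)\}, \quad
    g'(\tilde{q}) = \frac{-1}{(1 - \tilde{q})\log(1 - \tilde{q})} \\
  & \;\Rightarrow\; \tilde{W} = \mathrm{diag}\left\{ e^* \frac{(1 - \tilde{q})[\log(1 - \tilde{q})]^2}{\tilde{q}} \right\}, \quad
    \tilde{z} = X\tilde\theta - \frac{q - \tilde{q}}{(1 - \tilde{q})\log(1 - \tilde{q})}.
\end{align*}
```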

One of the attractive features of P-splines is the straightforward way in which smoothing is incorporated into the GLM methodology. A penalty matrix, P, is defined which acts on the regression coefficients, θ. In 2-d it is helpful to think of θ, $c_a c_y \times 1$, as arranged in a matrix, Θ, $c_a \times c_y$. Then we have

$$P = P(\lambda_a, \lambda_y) = \lambda_a I_{c_y} \otimes D_a'D_a + \lambda_y D_y'D_y \otimes I_{c_a}, \tag{77}$$

where $\lambda_a$ and $\lambda_y$ are the smoothing parameters in age and year respectively, and $D_a$, $(c_a - 2) \times c_a$, and $D_y$, $(c_y - 2) \times c_y$, are second order difference matrices acting on the columns and rows of Θ respectively. A full discussion of these ideas can be found in Eilers and Marx (1996) and Currie et al. (2004, 2006). The estimating equations, (76), for a GLM become

$$(X'\tilde{W}X + P)\hat\theta = X'\tilde{W}\tilde{z} \tag{78}$$

and all the GLM machinery is immediately available; for example, Table 10 applies. Table 11 shows the results of fitting the 2-d AP model, M4; the smoothing parameters have been chosen by minimizing the Bayesian Information Criterion

$$\mathrm{BIC} = \mathrm{Dev} + \log(n)\,\mathrm{ED} \tag{79}$$

where ED is the effective dimension of the model. In all cases, the binomial model for q with either the logit or complementary log-log link is superior to the Poisson model for µ with log link; there is little to choose between the two link functions for the binomial model.

Table 11: Model M4 or 2-d AP. Poisson deviances, $Dev_P$, for Poisson model for µ with log link, binomial models for q with logit and complementary log-log links. Minimum $Dev_P$ shown in bold. Age: 50-90, Period: 1960-2009. Knot spacing is 5 years for both age and period. Note that the effective dimension, ED, is approximate and given to the nearest whole number. The overdispersion parameter, ϕ, is for the Poisson model; other models give very similar values.

               $Dev_P$
Country        log µ, P   logit q, B   cloglog q, B   DF (ED)     ϕ
USA              17860       16883         16891         121     9.3
UK                9865        9112          9109          99     5.0
Japan             9790        9159          9157         103     5.0
Australia         4031        3760          3756          52     2.0
Sweden            2498        2340          2342          30     1.2
France           10131        9434          9430          71     5.1
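The structure of the penalty (77) can be made concrete with a small example. The sketch below is an illustration in Python (standard library only), not the R code used for the results in the paper. It builds the second order difference matrices and the Kronecker sums, and checks that a coefficient matrix Θ that is linear in both directions incurs no penalty. The function names, the toy dimensions and the values of the smoothing parameters are hypothetical.

```python
def diff_matrix(c, order=2):
    """Difference matrix of the given order acting on a vector of length c."""
    D = [[1 if i == j else 0 for j in range(c)] for i in range(c)]
    for _ in range(order):
        D = [[nxt[j] - cur[j] for j in range(c)] for cur, nxt in zip(D, D[1:])]
    return D


def gram(D):
    """D'D for a matrix D held as a list of rows."""
    cols = list(zip(*D))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def kron(A, B):
    """Kronecker product of two matrices held as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def penalty(c_a, c_y, lam_a, lam_y):
    """P = lam_a * I_{c_y} (x) Da'Da + lam_y * Dy'Dy (x) I_{c_a}, as in (77)."""
    P_a = kron(identity(c_y), gram(diff_matrix(c_a)))
    P_y = kron(gram(diff_matrix(c_y)), identity(c_a))
    return [[lam_a * pa + lam_y * py for pa, py in zip(ra, ry)]
            for ra, ry in zip(P_a, P_y)]


def roughness(P, theta):
    """Quadratic form theta' P theta."""
    return sum(t_i * sum(p * t_j for p, t_j in zip(row, theta))
               for t_i, row in zip(theta, P))


c_a, c_y = 5, 4  # toy basis dimensions
P = penalty(c_a, c_y, lam_a=10.0, lam_y=0.5)

# theta = vec(Theta) with the age index varying fastest, matching X = B_y (x) B_a.
flat = [2 + 3 * i - j for j in range(c_y) for i in range(c_a)]          # linear surface
bumpy = [(-1) ** (i + j) for j in range(c_y) for i in range(c_a)]       # alternating signs
print(roughness(P, flat), roughness(P, bumpy))
```

A linear surface lies in the null space of both difference penalties, whereas the alternating surface is penalized heavily; larger values of the smoothing parameters pull the fit towards the former.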

Currie et al. (2004, 2006) and Cairns et al. (2009) did not consider the effect of overdispersion on the smoothing process and we too have taken this approach in this paper. However, recent work by Djeundje and Currie (2011) shows how to allow for overdispersion in the smoothing process. When ϕ = 1 the Poisson model holds but when ϕ > 1, as is usually the case with mortality data, both the estimating equations (78) and the selection criterion (79) should be modified; in (78) we replace P with ϕP and in (79) we replace the weight log(n) which is applied to the effective dimension by ϕ log(n). Thus, the greater the overdispersion, the greater the smoothing. We see from Table 11 that overdispersion should be allowed for in all cases with the possible exception of Sweden.

8 Concluding remarks

Most statistical software allows the fitting of GLMs in a simple fashion. Such software gives immediate access to some standard models in mortality modelling and forecasting. These models are usually not identifiable and we have shown how particular constraints can be accommodated in the fitting process. For models such as the Lee-Carter and Renshaw-Haberman, more specialized software (Turner and Firth, 2012) allows direct fitting. Smooth models can often be fitted with the R-package mgcv (Wood, 2011). We have taken a more fundamental approach and used the theory of penalized GLMs to fit the 2-dimensional P-spline model.

We have used these methods to compare Poisson models for the force of mortality and binomial models for the rate of mortality over six countries. The conclusion is clear: the binomial model generally outperforms the Poisson model. Further, we have discussed the performance of two link functions with the binomial error: the logit and complementary log-log link. There is no clear preference here: with some countries the logit function is preferred, while for others the complementary log-log link gives a better fit.

So far, none of our work is controversial: a model is proposed, fitted and the resulting fit is evaluated. We now come to the question of using the model for forecasting. The method used with the Lee-Carter model is typical: fit the model, treat the estimated period parameters, $\kappa_t$, as a time series and forecast them with an appropriate Box-Jenkins or ARIMA model. The estimated age parameters, $\alpha_x$ and $\beta_x$, are assumed invariant over time and forecast mortality results. This last assumption is certainly an approximation but the method has been very thoroughly tested, eg, Booth et al. (2005), and found to work well.

The addition of cohort parameters leads to a much improved fitting model and it is tempting to think that an improved forecast will result. Booth and Tickle (2008) remark: “The APC model has been usefully applied in describing the past, but has been considered less useful in forecasting” while Goldstein (1979) expresses the problem as follows: “The fact that parameters can be estimated does not imply that they can sensibly be interpreted”. In other words, although the cohort parameters are defined in terms of the year of birth (through the cohort matrix $X_c$ for example) the fitted parameters cannot be interpreted as measuring the cohort effects; similar remarks apply to the age and period parameters. Once the interpretation is lost it is less clear how forecasting should proceed. At the very least, the assumption that the period and cohort parameters can be treated as independent is a very strong assumption. For example, Currie (2012) computed the canonical correlations between the estimates of β, κ and γ in the APC model and found for the data in that paper: $\rho(\hat\beta, \hat\kappa) = 0.442$, $\rho(\hat\beta, \hat\gamma) = 0.555$ and $\rho(\hat\kappa, \hat\gamma) = 0.659$ where ρ denotes the first canonical correlation.

The three variables age of death, year of death and year of birth are mathematically or logically confounded: ie, we can calculate one from the knowledge of the other two. This alone should raise a question concerning treating the estimates of the period and the cohort parameters as independent. The correlation between the variables age, year and cohort is also a useful measure in the discussion of forecasting. Let $x_a$, $x_y$ and $x_c$ be the vectors of length $n = n_a n_y$ containing the ages of death, the years of death and the years of birth. Then for our data, we have $r(x_a, x_y) = 0$, but $r(x_a, x_c) = -0.63$ and $r(x_y, x_c) = 0.77$, where r denotes the usual correlation coefficient. The zero correlation between $x_a$ and $x_y$ says that age and period are orthogonal in the data; this is one explanation of why the Lee-Carter model has been found to result in stable and reliable estimates. The non-zero correlation between $x_a$ and $x_c$, and between $x_y$ and $x_c$ also raises difficulties for forecasting with the APC model. Finally, we quote Clayton and Schifflers (1987): “In recent years, there have been several attempts to use an APC model fitted to past data to forecast rates. It should come as no surprise to a reader of this paper that we should doubt the wisdom of this course.”
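The correlations quoted above depend only on the layout of the data grid and can be reproduced directly. The following sketch in Python (standard library only) assumes ages 50 to 90, years 1960 to 2009, and year of birth computed as year of death minus age at death; the function name `correlation` is hypothetical.

```python
from math import sqrt


def correlation(u, v):
    """Usual (Pearson) correlation coefficient of two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)


ages = list(range(50, 91))        # n_a = 41
years = list(range(1960, 2010))   # n_y = 50

x_a = [x for _ in years for x in ages]      # ages of death, age varying fastest
x_y = [t for t in years for _ in ages]      # years of death
x_c = [t - x for t, x in zip(x_y, x_a)]     # years of birth (assumed: year - age)

r_ay = correlation(x_a, x_y)
r_ac = correlation(x_a, x_c)
r_yc = correlation(x_y, x_c)
print(round(r_ay, 2), round(r_ac, 2), round(r_yc, 2))
```

On a complete rectangular grid the age and year columns are orthogonal by construction, while the cohort column inherits correlation from both.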

Model risk is ever present when forecasting the future course of mortality and it is therefore essential to have easy and flexible methods available for fitting a wide range of models. We have shown that many models of mortality can be fitted with high-level code; for example, (11), (14) and (16) are typical of our approach. Such code should enable actuaries to examine the financial consequences of forecasting with different models and hence to come to an informed assessment of the impact of longevity risk on the portfolios in their care.

References

Booth, H., Tickle, L. and Smith, L. (2005) Evaluation of the Variants of the Lee-Carter Method of Forecasting Mortality: A Multi-Country Comparison. New Zealand Population Review, 31, 13-34.

Booth, H. and Tickle, L. (2008) Mortality modelling and forecasting: a review of methods. Annals of Actuarial Science, 3, 3-43.

Brouhns, N., Denuit, M. and Vermunt, J.K. (2002) A Poisson log-bilinear regression approach to the construction of projected lifetables. Insurance: Mathematics and Economics, 31, 373-393.

Cairns, A.J.G. (2007) LifeMetrics, (http://www.lifemetrics.com).

Cairns, A.J.G., Blake, D., Dowd, K., Coughlan, G.D. et al. (2009) A quantitative comparison of stochastic mortality models using data from England and Wales and the United States. North American Actuarial Journal, 13, 1-35.

Camarda, C.G. (2012) MortalitySmooth: An R Package for Smoothing Poisson Counts with P-Splines. Journal of Statistical Software, 50, 1-24. URL http://www.jstatsoft.org/v50/i01/.

Clayton, D. and Schifflers, E. (1987a) Models for temporal variation in cancer rates. I: Age-period and age-cohort models. Statistics in Medicine, 6, 449-467.

Clayton, D. and Schifflers, E. (1987b) Models for temporal variation in cancer rates. II: Age-period-cohort models. Statistics in Medicine, 6, 469-481.

Currie, I.D., Durban, M. and Eilers, P.H.C. (2004) Smoothing and forecasting mortality rates. Statistical Modelling, 4, 279-298.

Currie, I.D., Durban, M. and Eilers, P.H.C. (2006) Generalized linear array models with applications to multidimensional smoothing. Journal of the Royal Statistical Society, Series B, 68, 259-280.

Currie, I.D. (2012) Forecasting with the age-period-cohort model? Proceedings of 27th International Workshop on Statistical Modelling, Prague, 87-92.

Eilers, P.H.C. and Marx, B.D. (1996) Flexible smoothing with B-splines and penalties. Statistical Science, 11, 89-121.

Forfar, D.O. (2004) In Encyclopedia of Actuarial Science, Eds: Teugels, J.L. and Sundt, B. Chichester: Wiley.

Goldstein, H. (1979) Age, period and cohort effects - a confounded confusion. Journal of Applied Statistics, 6, 19-24.

Gompertz, B. (1825) On the nature of the function expressive of the law of human mortality, and on a new mode of determining the value of life contingencies. Philosophical Transactions of the Royal Society of London, 115, 513-583.

Human Mortality Database. University of California, Berkeley (USA), and Max Planck Institute for Demographic Research (Germany). Available at www.mortality.org or www.humanmortality.de (data downloaded November, 2012).

Lee, R.D. and Carter, L.R. (1992) Modeling and forecasting U.S. mortality. Journal of the American Statistical Association, 87, 659-675.

McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. London: Chapman and Hall.

Neill, A. (1977) Life Contingencies. London: Heinemann.

Nelder, J.A. and Wedderburn, R.W.M. (1972) Generalized linear models. Journal of the Royal Statistical Society, Series A, 135, 370-384.

R Core Team (2013) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. (http://www.R-project.org).

Renshaw, A.E. and Haberman, S. (2006) A cohort-based extension to the Lee-Carter model for mortality reduction factors. Insurance: Mathematics and Economics, 38, 556-570.

Searle, S.R. (1982) Matrix Algebra Useful for Statistics. New York: Wiley.

Turner, H. and Firth, D. (2012) Generalized nonlinear models in R: An overview of the gnm package. (R package version 1.0-6). (http://CRAN.R-project.org/package=gnm).

Wood, S.N. (2006) Generalized Additive Models. London: Chapman & Hall/CRC.

Wood, S.N. (2011) Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society, Series B, 73, 3-36.
