allmmaa emmaatterr –ssttuuddiioorruumm o

AAllmmaa MMaatteerr SSttuuddiioorruumm –– UUnniivveerrssiittàà ddii BBoollooggnnaa

DOTTORATO DI RICERCA IN

Scienze Statistiche

Ciclo XXVIII

Settore Concorsuale di afferenza: 13/ D1

Settore Scientifico disciplinare: SECS-S/ D1

THE ANALYSIS OF SURVIVAL AND LONGITUDINAL DATA FROM LIFE-

SPAN CARCINOGENICITY BIOASSAYS ON SPRAGUE-DAWLEY RATS

Presentata da: Daria Sgargi

Coordinatore Dottorato Relatore

Alessandra Luati Rossella Miglio

Esame finale anno 2018

III

Abstract

Carcinogenicity bioassay are among the best instruments to strengthen the

evidence on which regulatory agencies vase their decision to classify harmful

agents as human carcinogens, so they are fundamental to protect public health.

The statistical analysis is fundamental to validate the results from carcinogenicity

bioassay. This work aims to propose and illustrate some methodologies for the

analysis of non-cancer outcomes, in particular for the analysis of time-to-death

and of longitudinal measurements of body weights. The data from an old

experiment were used for this purpose: 4 experiments aimed at testing the

carcinogenic potential of Coca-Cola on Sprague-Dawley rats of different ages

(randomized males and females of 7, 30, 39, 55 weeks of age, and their non-

randomized offspring, observed since birth) were re-analysed.

Survival analysis aimed to verify the influence of the treatment, controlling for

possible differences due to sex, age at beginning of observation and age of the

dams at pregnancy. It was performed using Cox proportional hazards models for

the rats of second generation, and accelerated failure-times models for those of

first generation; the use of frailty terms was evaluated (univariate gamma frailty to

account for unobserved heterogeneity applied to data from breeders; shared

gamma frailty at the litter level applied to data from offspring).

The analysis of longitudinal body weights of the offspring was aimed at verifying

the relevance of treatment, controlling for physiological differences due to sex and

age of the dams at gestation. It was performed using linear and nonlinear mixed-

effects models to handle the hierarchical structure of the data. Linear models were

fitted using log-transformation of time and polynomial terms of order 3; nonlinear

models consisted of growth models, in particular the Berkey-Reed model, that is

usually used to analyse human growth during infancy, was applied.

V

Contents

Abstract ............................................................................................................ III

Contents ............................................................................................................. V

List of Figures ................................................................................................. VII

List of Tables .................................................................................................... IX

Chapter 1 Introduction ........................................................................................ 1

Chapter 2 The Coca-Cola study ......................................................................... 6

2.1 The experimental design .................................................................... 7

2.2 Results....................................................................................................... 9

Chapter 3 The analysis of survival data ........................................................... 11

3.1 The analysis of time-to-event data according to the guidelines ............. 12

3.2 Other methods for survival analysis: .................................................. 14

3.2.1 The Proportional Hazard model ...................................................... 14

3.2.2 Accelerated failure-times models..................................................... 16

3.2.3 Unobservable heterogeneity and familiarity problem: univariate and

shared frailty terms ........................................................................................ 17

3.3 Analysis of time-to-event data from the Coca-Cola study ..................... 23

3.3.1 Descriptive analysis ......................................................................... 23

3.3.2 Analysis of time-to-event for the first-generation’s rats .................. 25

3.3.3 Analysis of time-to-event for the second-generation’s rats ............. 30

Chapter 4 The analysis of longitudinal data ..................................................... 35

4.1 The analysis of longitudinal data according to the guidelines ................ 36

4.2 Other methods for longitudinal data ....................................................... 40

4.2.1 Linear mixed-effects models ............................................................ 40

4.2.2 Nonlinear mixed effects models ...................................................... 46

VI

4.2.3 Growth models ................................................................................. 49

4.3 The analysis of body weights ................................................................. 52

Chapter 5 Conclusions ...................................................................................... 71

References ....................................................................................................... 76

Appendix 1: Analysis of residuals for survival analysis .................................. 79

1.1 Cox Proportional Hazard model, breeders .......................................... 79

1.2 Accelerated failure-time model, generalized Gamma distribution,

breeder 81

1.3 Cox Proportional Hazard model, offspring......................................... 82

1.4 Cox Proportional Hazard model with shared Gamma frailty, offspring

84

VII

List of Figures

Figure 1: Age by experimental variables, Box-and-Whiskers plots ................................................ 24

Figure 2: Kaplan-Meier survival estimates for experimental groups ............................................. 25

Figure 3: Estimated survivor function from the PH model, breeders ............................................. 26

Figure 4: Residuals to evaluate the assumptions underlying PH model, breeders ......................... 27

Figure 5: Cox-Snell residuals for the evaluation of the best distribution for AFT models. ............ 28

Figure 6:Residuals to evaluate the assumptions underlying AFT model, breeders ........................ 30

Figure 7: Estimated survivor functions by sex, treat and age of dams, offspring ........................... 31

Figure 8: Residuals for the evaluation of the assumptions underlying PH model. ......................... 31

Figure 9: Estimated survivor functions from PH model with a shared frailty term ........................ 32

Figure 10: Residuals for the evaluation of the assumptions underlying PH model with Gamma

frailty ............................................................................................................................................... 33

Figure 11- OECD statistical decision tree summarizing procedures for the analysis of continuous

data ................................................................................................................................................. 38

Figure 12: Average weight for females and males by treatment and age of dams at start of

gestation .......................................................................................................................................... 54

Figure 13: Growth trajectory of body weights from a random sample of II generation rats. ......... 55

Figure 14: Plot of data and estimated fixed effects, model 1 .......................................................... 58

Figure 15: Data and linear prediction with fixed and random portion in four selected litters,

model 1. ........................................................................................................................................... 59

Figure 16: Plot of data and estimated fixed effects, model 2. ......................................................... 59

Figure 17: Data and linear prediction with fixed and random portion in four selected litters,

model 2. ........................................................................................................................................... 60

Figure 18: Evaluation of linearity: comparison between growth trajectory and linear regression in

three random samples ..................................................................................................................... 61

Figure 19: Evaluation of normality and homoscedasticity of residuals using Normal probability

plots and standardized residuals. .................................................................................................... 62

Figure 20: Plot of data and estimated average predictors, cubic model. ....................................... 64

Figure 21: Data and linear prediction with fixed and random effects in four selected litters, cubic

model. .............................................................................................................................................. 64

Figure 22: Evaluation of the assumption of normality and homoscedasticity of residuals, cubic

model. .............................................................................................................................................. 65

Figure 23: Estimates from Berkey- Reed model, offspring .............................................................. 67

VIII

Figure 24:Plot of data and estimated average predictions, Berkey-Reed growth model ................ 68

Figure 25: Plot of data and linear predictions with fixed and random part in four randomly

selected litters, Berkey-Reed growth model .................................................................................... 68

Figure 26: Evaluation of the assumptions of normality and homoscedasticity of the model using

residuals. ......................................................................................................................................... 69

IX

List of Tables

Table 1: Experimental plan of the four bioassays performed for the project .................................... 7

Table 2: Variables for survival analysis; “setting” variables in italic, experimental variables in

normal font. ..................................................................................................................................... 23

Table 3: Composition of experimental groups; breeders and their offspring are listed sequentially.

......................................................................................................................................................... 24

Table 4: Estimates and goodness-of-fit measures for different distributions of AFT models.......... 27

Table 5: Results of PH and AFT models for breeders, and measures of goodness of fit................. 29

Table 6: Proportional hazard model, offspring .............................................................................. 30

Table 7: Estimates and goodness of fit of PH model with a shared Gamma frailty ........................ 33

Table 8: Rats by sex, treatment and age of dams at start of gestation ............................................ 53

Table 9: Lost to follow-up in absolute number and percentage at selected measurement occasions

......................................................................................................................................................... 53

Table 10: Multilevel mixed effects models using transformed variables; results of regression of

weight on ln(time, centered) and experimental variables. .............................................................. 58

Table 11 : Multilevel mixed effects models using polynomial terms; results of regression of body

weights on second and third degrees polynomial terms for time and experimental variables ........ 63

Table 12: Correspondence of one human year with rat days at different phases of life, from

Sengupta (2013) .............................................................................................................................. 66

1

Chapter 1 Introduction

Cancer is one of the leading causes of death worldwide, and despite the

encouraging progress achieved during the last decades both in prevention and

treatment of some types, it remains a major threat to public health. The burden of

the disease is substantial and rising, due to increasing incidence rates, the growth

and ageing of the population, and the spreading prevalence of risk factors like

pollution, smoking, alcohol consumption, obesity and hypertension also in

developing countries: these factors reverse the habit to see cancer as a problem

regarding economically developed countries, and make it a global issue (Global

Burden of Disease Cancer Collaboration, 2015). Given this framework, the

importance of prevention is clear.

If we think about prevention in terms of public health, one of the most

important issues is to identify the human carcinogens and to regulate their use.

Carcinogens are all the biologic or synthetic substances, composites, technologies,

occupational exposures and even lifestyles that may have a carcinogenic potential.

It is necessary to identify the hazard, that is the capability of causing neoplastic

effects under some circumstances; to assess the risk, defined as the actual

carcinogenic effect expected if exposed to a cancer hazard, and understand the

mechanisms of action.

Experimental and epidemiological studies are the best currently available

instruments to test and verify these effects; in particular long term and life-span

carcinogenicity bioassays are the main experimental tools when carcinogenicity is

2

under exam. They usually employ rodents, in particular female and male mice and

rats, treated with one or more exposure concentration of the tested substance and

compared with untreated controls; the main route of administration to choose

between, according to the characteristic of the agent, are inhalation, dermal and

the most common, oral. The observation usually begins around 7 weeks of age,

when the subjects have ended the weaning period and are completely independent,

and lasts 104 weeks, or for the whole life; if the human pattern of exposure

requests so, perinatal exposures can be evaluated, to verify the effects during the

critical developmental phases of gestation and lactation.

The amount of information that a well-designed and well-conducted study can

offer is impressive: their primary purpose is to characterise the carcinogenic

properties of the agent,verifying if it might increase the age-specific incidence of

malignant tumours, reduce its latency, intensify its severity or multiplicity (IARC,

2006) or, less directly, trigger a harmful effect in another agent or act in a

combined way with it; they can also establish the existence of a dose-response

relationship, identify target organs, and help to set a benchmark dose or a no-

observed adverse effect level. Broad, strong and scientifically sound evidence is

the base on which the organizations in charge (like the International Agency for

Research on Cancer, the European Chemicals Agency, or the Environmental

Protection Agency, the Food and Drugs Administration or the National

Toxicologic Program for the United States) can classify dangerous agents as

human carcinogens. The codification can in turn boost the action from regulatory

agencies, that may decide to regulate or ban the use of these substances in order to

protect and promote human health.

Since the early 1980s most of the agencies involved in the classification and

regulation of dangerous substances collected the most up-to-date and agreed-on

options and procedures in experimental design, conduct of the study, reporting

and analysis of the results, and formalized them into guidelines and regulations,

that have been enriched and reviewed in the light of scientific progress, advance

of procedures and consideration for the animal’s welfare (OECD, Test No. 451:

Carcinogenicity Studies 2009). The main purposes are to grant and even out the

3

quality of the bioassays they base their knowledge on, and to assure the

comparability of results. Even if today almost every agency has specific protocols,

a common reference can be recognized in the guidelines drafted by the

Environment, Health and Safety Program and the Working Group on Testing from

the Organisation for Economic Cooperation and Development. Their Guidances

for testing the chemicals and the related documents represent the most

comprehensive collection of procedures, and among the very few to explicitly

treat in detail and depth the statistical analysis of data (Hothorn 2014).

Statistical analysis is universally recognised to be integral part of studies: an

appropriate experimental design is the cornerstone to answer the research

question, and statistical evaluation of the data is the necessary complement to

establish and quantify whether the exposure to the selected agent is associated

with adverse effects. Between the different “schools of thought”, the classical

frequentist approach and the concept of hypothesis testing have been chosen to

maintain coherence with most of the work done in toxicology.

One of OECD’s guidelines (OECD 2012) is partly devoted to illustrate and

explain in depth how to design and conduct the appropriate statistical tests based

on the kind of experiment, its objective and the type of data; it also helps to

interpret the results and to understand their real meaning and relative importance.

The methods are systematically organized into a flowchart: according to the

nature of the data, the appropriate “branch” of the tree is chosen, and omnibus

tests to highlight overall differences between groups or tests for linear trend are

suggested, preceded by several tests to verify the respect of their assumptions. In

some cases when these are not met and it is possible to transform the data, a

circular path is proposed and the tests repeated.

What stands out is that consolidate and advanced methodologies exists and are

routinely used for the analysis of tumor incidence, which is for the direct

assessment of carcinogenicity. As it was previously highlighted, nevertheless, a

good designed and performed study produces a great amount of additional

information such as the body weights, food and beverages consumption, or the

time of survival in life-span studies. Their importance to have a picture of the

4

health status of the animals involved and the good progress of the experiment is

clear, but this potential at the moment is not fully exploited, since even the most

detailed guideline proposes quite superficial analysis like those illustrated in the

OECD flowchart.

Today many statistical methodologies exist to treat data of this kind, and their

application to carcinogenicity bioassays may allow to fully use all available

information and to integrate them, to obtain a richer knowledge of the effects of

the tested compound. The idea for this work started from the will to capitalise the

potential of the dataapplying different methods to experimental studies, to verify

if they can shed a new light on data and contribute to understand and establish the

effect of the tested substances on health.

The research was promoted by a local research centre needing help to perform

routine statistical analysis and looking for ways to exploit the potetial coming

from their experience in carcinogenicity studies. The Ramazzini Institute was

founded as a social cooperative in 1987 but is active since the early ‘70s thanks to

the work of Prof. Cesare Maltoni, who gathered the link between living and

working environment and cancer, and performed life-span bioassays in the

laboratories in Bentivoglio (Bo) to understand the mechanisms of action of the

disease, and identify and quantify on experimental base the toxic and carinogenic

potential of widespread substances. During the decades dozens of experiments

contributed to the identifications of many carcinogens.

This work takes up an already published experiment again, the Coca-Cola

experiment, that didn’t show ashtonishing results in carcinogenicity tout court, but

could present some interesting features in other measures that were not analysed

in depth at the time; in the following chapter the experimental design of this

bioassay and its results, as were published, will be presented, together with a brief

exploratory analysis. The third and the fourth chapters are dedicated to specific

topics, and represent the core of the research: survival analysis and the analysis of

longitudinal recordings of body weights, respectively. Each will start with a

recognition of the literature, then present the data, the models and analysis that

5

were performed, and the results obtained. The research will end with a discussion

of the work done, and some conclusion will be drawn.

6

Chapter 2 The Coca-Cola study

Diet can become one of those life habits to ease the occurrence of cancer in

different ways: some food components have the direct ability to induce the

disease; some contain potentially dangerous additives to improve their stability,

preservation or even just the appearance; or in general, the excessive caloric

intake can lead to overweight and obesity, that has been identified as a risk factor

for some types of cancer (Calle E.E. 2003) (RAPP 2005). Given the impossibility

to study diet as a whole, experimental studies have been used to evaluate the

impact of single nutrients or food components.

Coca-Cola is a widespread product among the population of any age, socio-

economic status and country, its concentration of sugar and the caloric power are

very high: for these reasons, the “Cesare Maltoni” Cancer Research Centre of the

the Ramazzini Institute1 decided to evaluate its possible association with tumour

incidence in rodents. Starting from 1986 several sub-studies were performed, each

involving rats or different ages, males and females, randomized in two groups, the

treated and the controls.

The design and the conduction of the bioassay will be now illustrated along

with the results, as they were originally published. In order to have a clearer

picture of the available data, the last part of the chapter will contain also a first

exploratory analysis.

1 At the time named European Ramazzini Foundation

7

2.1 The experimental design2

To verify the long-term influence of strong Coca-Cola consumption on the

development of tumours, the bioassay planned it to be administered to Sprague-

Dawley rats as a substitute of drinking water for the whole life span, until

spontaneous death of the animals.

Treatment Age at start M F

Coca-Cola 7 weeks 80 80

Drinking water 7 weeks 100 100



Coca-Cola Offspring (55) 28 24

Drinking water Offspring (55) 32 24









Table 1: Experimental plan of the four bioassays performed for the project

The experimental plan is summarized above in Table 1: it wanted to account

for possible differences in the metabolism and the mechanism of action linked to

different ages, therefore four different experiments were conducted, involving

respectively female and male breeder rats of 30, 39 and 55 weeks of age and all

their offspring born in all litters, treated since prenatal life, and female and male

non-breeding rats of 7 weeks of age. All animals were bred from the internally

grown colony, formed at the beginning of the laboratories’ activity in the ‘70s.

All breeders were identified, separated by sex and assigned to each

experimental group, so that no more than one female and one male belonging to

2 All information and data in this and the following paragraph come from the original publication

of Belpoggi F. et al. Specificata fonte non valida..

8

the same litter were in the same one; they were housed in groups of five in

makrolon cages until scheduled time of mating and starting of the treatment, then

placed in breeding cages for a week. Females were housed individually for the

whole period of the pregnancy and the weaning, and then went back to the regular

housing for the rest of the experiment; all pups from every litter after the weaning

period were identified, separated by sex and assigned to the same treatment group

as their breeders. Cages belonging to both treatments were kept in the same room,

under the same living conditions and with the same diet. The treatment consisted

in substituting drinking water with ad libitum Coca-Cola, that was supplied every

2 weeks by an Italian retailer and was mechanically shaken before subministration

to eliminate CO2, and it lasted until spontaneous death.

During the observation, the animals were checked daily for physical and

behavioural problems; the individual body weight and mean beverages and food

consumption for cage were measured weakly for the first 13 weeks of

observation, every 2 weeks until week 104, and every 8 weeks later, while

complete examinations to check and report everything about the health status

were performed weekly then every 2 weeks for the whole life.

After death, complete necropsy was done on every subject, all pathological

lesions and all organs and systems3 underwent histopathology, were preserved in

70% ethyl alcohol (bones were fixed in 10% formalin and decalcified), trimmed

and processed as paraffin blocks. 3-5 µm sections of every block were sliced and

coloured with haematoxylin-eosin, and finally examined microscopically by a

group of pathologists; a senior pathologist supervised and reviewed all tumours

and lesions of neoplastic interest.

The statistical analysis was performed on tumour incidences only using a χ2

test to evaluate the significance of differences between treated and controls.

3Skin and subcutaneous tissue, the brain, pituitary gland, Zymbal glands, salivary glands,

Harderian glands, cranium (with oral and nasal cavities and external and internal ear ducts, 5

levels), tongue, thyroid and parathyroid, pharynx, larynx, thymus and mediastinal lymph nodes,

trachea, lung and mainstem bronchi, heart, diaphragm, liver, spleen, pancreas, kidneys and adrenal

glands, oesophagus, stomach (fore and glandular), intestine (4 levels), bladder, prostate, uterus,

gonads, interscapular fat pad, subcutaneous and mesenteric lymph nodes, and any other organ or

tissue with pathological lesions.

9

2.2 Results

It was chosen to aggregate data from all breeders, and to consider all offspring

together (animals who started treatment since embryonic life and the group

observed since the age of 7 weeks).

Fluid consumption differed markedly between treated and control animals, the

former drinking twice the amount of the latter; conversely, food consumption was

almost 40% minor among the Coca-Cola consumers. Body weight was higher

both in breeders and offspring, for both sexes; no significant difference in survival

was observed, just a slight decrease in female offspring.

The following differences were observed in tumour occurrence:

1. A slight increase in malignant tumour incidence;

2. A statistically significant (P < 0.01) increase in malignant mammary

tumours and total mammary tumours, both in breeders and offspring

females;

3. A statistically significant increase in adenomas of the exocrine component

of pancreas in male (P < 0.01) and female (P < 0.05) breeders, and in male

(P < 0.01) and female (P < 0.01) offspring; No exocrine carcinomas were

observed.

4. An increase in islet cell carcinomas was observed between female breeders

and offspring; although it’s not statistically significant, it should be noted

that looking at the historical controls, only one (0.04%) islet cell carcinoma

out of 2274 untreated females was observed.

They finished concluding that the significant rise in the occurrence of

mammary gland tumours could confirm a correlation between higher body

weight and increased risk of mammary cancer, and that the relatively high

number of pancreatic islet cell carcinoma compared to the historical controls

shall not be underestimated, even if not significant. Therefore, even if the

human consumption is rarely as extreme as the one designed in the experiment,

it was confirmed that an abuse of high caloric/ high sugar beverages like soft

10

drinks can ease overweight and obesity, that in turn is a risk factor for human

health and cancer.

11

Chapter 3 The analysis of survival data

When referring to “survival data” in carcinogenicity studies, the narrower

interpretation is the correct one, since they literally indicate the time until death.

In more common 2-years or chronic experiments, we find under this label the

duration of life of the animals deceased before the official time of cessation, that

will be those who experienced the “event of interest”, while the surviving ones

will be treated as censored; in way less usual life-span studies the meaning is the

same, except that there will be no censored observations, since the experiment

lasts until the spontaneous death of the last animal.

The duration of life is often under scrutiny in epidemiological studies and in

clinical trials, but is rarely the outcome of primary interest in carcinogenicity

bioassays, that are aimed to test the incidence of neoplastic lesions under different

doses of treatment. Still, these data are collected and analysed in any kind of study

since they allow to keep the general trend of the experiment under control during

its execution, verifying that the tested compound has not such a high toxic

potential to reduce too much the treated groups. The excessive decrease of sample

size can threaten the sensitivity of the analysis, and increase the likelihood of

obtaining false positives and false negatives (WHO 1987).

When studying cancer these data acquire even more relevance for the nature of

the phenomenon, since the probability of developing neoplastic lesions increases

with age. Moreover, if the tested substance has the power to affect the survival for

toxicity not related to tumour, statistical analysis of incidence may result biased: if

the treatment reduces excessively the duration of life, it can underestimate the

12

carcinogenic potential; conversely, if it increases the longevity, the effect on

cancer development may be overestimated (OECD 2012).

These considerations led regulatory authorities and journals to request an

evaluation of the differences in survival for every study to be considered, and to

include indications about how to perform these analyses in the guidelines and the

protocols.

3.1 The analysis of time-to-event data according to the guidelines

As it was mentioned in the introduction, guidelines drafted from the most

authoritative national and international agencies have become an essential

instrument for research centres, enhancing the possibility of producing quality

research that can contribute to form the scientific base for the regulation of

harmful agents. The documents (OECD 2012) (OECD 2009) from the Guidelines

for testing the Chemicals Series drafted by Organization for Economic

Cooperation and Development will again be the main reference here, since they

are the most detailed and comprehensive in the identification of statistical

procedures and methodologies.

The first indication they give, is that the differences in survival need to be

explored to guarantee the presence of a sufficient number of individuals to

preserve the power of the tests to be performed, and to verify whether the analysis

of incidence will have to be corrected. Survival data represent times, in this

context usually days since the starting of the observation or weeks of age, and

have some particular features: their distribution is not symmetric but tends to be

positively skewed, and they are often censored - mostly right censored in the

context of randomized bioassays. These peculiarities prevent the use of standard

methods for continuous data.

The suggested steps to identify differences or dose-response relationships in

survival are totally based on non-parametric tests like the Mantel-Cox test, the

generalized Wilcoxon or Kruskal-Wallis test, and the Tarone trend test. The

preferred methodology to compare the duration of lives among the experimental

13

groups consists in estimating the survivor function using the product-limit

method, proposed by Kaplan-Meier in 1958. It is based on the construction of a

series of time intervals, such that each contains at least one event (that is assumed

to occur independently from all others), and the event time is taken as beginning

of the interval; the probability of survival is calculated for each interval as the

ratio of subjects surviving and subjects at risk. The estimate is the product of the

probabilities calculated on all time intervals, and it can be represented graphically

as a step function, with the estimate probabilities remaining constant between

adjacent death times, and decreasing at each event (Collett 2003). Different curves

can be estimated for experimental groups, and differences are to be formally

tested using the log-rank test.

The non-parametric approach suggested in the guidelines is very appealing,

given its simplicity and the possibility of representing the estimates graphically,

that allows to catch many information about the data at a glance. This is useful,

and may be sufficient in this kind of experiments, where the subjects under study

are randomized and share every condition (type of caging, room, temperature,

diet, number and scheduled timing of checking, …) except for treatment. In some

situations, nevertheless, we might be interested in obtaining an estimate of the

treatment’s effect on survival, or in the effect of other factors: here this univariate

approach is a useful exploratory tool, but more sophisticated methods are

necessary.

Let’s take the Coca-Cola study as an example: the consumption of this

compound is so widespread that we likely would not expect a strong, direct

influence on survival. However, for its characteristic (highly sweetened, highly

caloric) we may expect it to have an influence on body weight and mass

composition, so that a heavy consumption like the one outlined in this experiment

might ease overweight or even obesity on treated rats. These conditions may

reasonably affect survival, but we are not able to account for and estimate their

effect.

Another feature of this experiments is that they involved adult rats and their

offspring, to verify effects of perinatal exposure: all alive pups from all litters

14

were included, each in the same treatment regime as their breeders, so

randomization was avoided in the second generation of exposed. This poses a

problem of “familiarity”: individuals are more likely to share characteristics and

propensity for diseases, so we expect to easily find similar paths among animals

from the same litter.

Finally, when plotting the Kaplan-Meier estimates of the survivor or the hazard

functions, we may obtain crossing step functions: very close functions with

similar paths might mean that relevant differences among groups do not exist,

while non-parallel curves of the logarithm of the hazard functions might pose a

doubt on the proportionality of the hazard. When there are reasons to question the

assumption of proportional hazard, the Wilcoxon test might be more suitable than

the log-rank test to assess differences among groups, but a modelling approach

could be much more informative and appropriate.

3.2 Other methods for survival analysis:

3.2.1 The Proportional Hazard model

To take advantage of all information and to give thorough answers to the

research questions, we need to leave the hypothesis testing approach and choose

the modelling approach.

Survival data can be modelled to explore the risk of death at any time after the

beginning of the study: the hazard function is the object of interest, representing

the instantaneous probability of death at the time, given that the subject has

survived until that time. The aim is to determine which conditions affect the form

of the hazard function, and to obtain the estimate of the function itself for the

individuals. The most commonly used multivariate approach is the proportional

hazard model, proposed by Cox (1972), that has lately become quite popular also

in carcinogenicity studies thanks to its flexibility and the diffusion of statistical

packages of easy use. The Cox model is necessary for the use and understanding

of many of the methodologies that will be proposed later, but is not the main topic

of this section, so it will be just briefly introduced here.

15

The proportional hazard model explains the hazard of death at any time t as

depending on the values x1, x2, …xp of p explanatory variables recorded at the

time origin of the study, X1, X2, … Xp:

ℎ𝑖(𝑡) = exp(𝜂𝑖) ℎ0(𝑡), 𝜂𝑖 = ∑ 𝛽𝑗𝑥𝑗𝑖𝑝𝑗=1

where ℎ0(𝑡) is the baseline hazard function, representing the hazard at time t

for a subject for whom all values of the covariates are 0, and𝜂𝑖 is the linear

component of the model, containing the explanatory variables 𝑥𝑝and the

respective coefficients 𝛽𝑝. The estimated effect can be interpreted in terms of

hazard ratios (𝑒𝑥𝑝(𝛽𝑖) ): if greater than one, it indicates that the hazard of the

event is positively associated with the corresponding covariate. This is one of the

most important advantages of the modelling approach: it allows to consider the

influence of several risk factors at one time. These are typically categorical,

ordinal or continuous covariates, whose values is recorded at the beginning of the

observation, but several expansions have been studied in time, so that a better

representation of the phenomena under study could be reached.

The essential features of this model are that the baseline hazard is estimated

non-parametrically using the maximum likelihood method, so no assumption is

made on the distribution of survival times (thus the model is defined semi-

parametric), and that its formulation assumes the proportionality of the hazards:

since the covariates act multiplicatively on the hazard at any time, the hazard in

any “group” is a constant multiple of the hazard in any other, and therefore the

survival curves should never cross. It is fundamental to assess the adequacy of

the model: this includes to verify whether all and only the appropriate explanatory

variables have been included in the model, and that the correct functional form

has been used; to check for the presence and the nature of extreme values; and of

course, to verify the assumption of proportional hazard. Visual inspection of the

data is very useful, but must be supported by diagnostic of the residuals.

16

3.2.2 Accelerated failure-times models

The analysis of graphical representation of the data and the diagnostic of the

residuals are fundamental and should never be neglected, to assess the validity of

the models built. One of the aspects that should always be evaluated is the

tenability of the condition of proportionality of hazards, since it is one of the

assumptions on which the Cox model and other parametric models for survival are

based. Very often indeed real data do not behave this way: in these cases,

alternatives are to be considered.

Models that do not require hazard to remain constant are the accelerated failure

time models: the basic idea is that the effect of covariates or risk factors is

constant in time and multiplicative on the time scale (instead that on the hazard

scale, as in PH models), accelerating or decelerating the event of interest.

In the most generic specification, the survivor function can be written as

𝑆(𝑡) = 𝑆0 (𝜑𝑡)

where 𝑆0 (𝑡) is as usual the baseline hazard function, and 𝜑 represents the

acceleration factor, dependent on the covariates:

𝜑 = exp {∑ 𝛽𝑗 𝑋𝑖𝑗

𝑝

𝑗=1}

The model is often expressed in its logarithmic form with respect to time,

log(𝑇𝑖) = 𝛽0 ∑ 𝛽𝑗𝑋𝑖𝑗 + 𝜎 휀𝑖

𝑝

𝑗=1

where 𝛽0 is an intercept, 𝛽𝑗 are the coefficients of the p covariates X, 𝜎 is a scale

parameter and 휀𝑖 is a random variable that models the deviations of log(𝑇𝑖) from

the linear part of the model; to the distribution of 휀𝑖 corresponds a distribution of

the survival times 𝑇𝑖(Bradburn, 2003). The most used specification of the AFT

model is indeed parametric, and many distributions can be used to best represent

the time to event: exponential, Weibull (that can be used both as parametrization

of the PH model and of AFT model), the log-logistic, log-normal or generalized

gamma. Non-parametric specifications also exist, the most known being the

Buckley and James method (Buckley I. V. 1979), but it is rarely used in real data

17

analysis for its problems in theoretical justification and robustness underlined by

Wei (1992) and for the complexity of the computations.

In AFT models, the effect of the covariate is measured on the survival times:

this makes the interpretation of the results more immediate. The magnitude of the

effect of covariates is not given in terms of hazard ratio, but of time ratio: if we

divide the population in groups according to the level of a variable of interest, the

time ratio stands for the ratio of the expected survival of a group with respect of

the reference group. When >1, the covariate “slows down” the appearance of the

event, while if it’s <1, it accelerates death. The model is fitted using the maximum

likelihood method.

Accelerated failure time models represent a valuable alternative to the Cox

model when the assumption of proportional hazard is not realistic; nevertheless,

they are not yet extremely diffused in medical and biological research as they are

in the engineering field, where they are widely applied for reliability studies.

Some of its characteristic make them appealing: as already mentioned, they are in

many contexts more realistic, the parameters are more immediate to interpret, and

choosing the wrong distribution affects the estimates less (Lambert, 2004)

(Keiding, 1997)

3.2.3 Unobservable heterogeneity and familiarity problem:

univariate and shared frailty terms

The methods for survival we have revised so far implicitly assume that

individuals in the population are homogeneous; the analysis are performed to

verify if differences exist and what determines diverse hazards. It can be the risk

factor of interests, such as the tested compound, sex, age at the beginning of the

treatment, but it is likely that other factors have an impact on the duration of life.

Unlike in linear regression, it is very important to be sure that all relevant

covariates are included in survival models in order to have unbiased coefficients

and hazard rate. This is because the hazard rate is time dependent: suppose a

characteristic divides the population in a low-risk group and a high-risk group. If

18

the characteristic is observed and modelled, two constant risks functions are

estimated, one for each group; but if this variable is omitted, we will have one,

common hazard function, decreasing in time since individuals from the high-risk

group fail before than the others. When other covariates are included in the model,

the effect of the omitted one will therefore alter the estimates of the parameters,

because its distribution for each value of the included covariates varies with time.

This was demonstrated in, among others, Gail et al. (1984), Schumacher et al.

(1987), Bretagnolle and Huber-Carol (1988), Hougaard et al. (1994) Schmoor and

Schumacher, 1997; Chastang et al, 1988. Bretagnolle and Huber-Carol also

investigated the direction of the bias: their simulations showed that, no matter the

number of covariates included in and omitted from the model, the effect of the

observed variables will be underestimated, and the asymptotic bias changes the

risk for confidence intervals from 5 to 50%.

Very rarely nevertheless it is possible to observe all risk factors that affect

survival. Experiments such as carcinogenicity bioassay have less problems than

observational studies, thanks to randomization and careful experimental design,

but still there are some sources of unobserved heterogeneity: sometimes it can be

too expansive in terms of time or resources to measure and report all relevant risk

factors with precision and completeness. This is the case with data from the Coca-

Cola experiments: basic information on the health status of the animals was

observed and registered on a weekly basis on paper supports (we must remember

that the experiment was carried out during the 80s): these data are still available,

but they were not translated on informatic files since they are not usually used for

analysis, so the information regarding the health of animals is essentially lost.

This is the reason for the use of mixed effects in survival analysis: the idea is to

decompose the variability of survival in two sources: a predictable part, measured

by the coefficients of observed risk factors, and an unknown part, not directly

observable, described by a frailty term. The concept was first elaborated by

Greenwood and Yule in 1920, and developed after decades in Clayton (1978) and

Vaupel et al (1979): individuals have different not observable characteristics that

affect their survival probability, defined frailties, the more frail will survive less

19

than the less frail, creating a kind of selection that can give a distorted picture of

the underlying process.

Univariate frailties are random variables whose distribution reflect the nature

of the relationship between the subject’s and population’s survival:

ℎ(𝑡, 𝑍) = 𝑍 ℎ0(𝑡)

They are time independent and act multiplicatively on the baseline hazard

function, describing the unobservable heterogeneity among individuals (the

variance of the frailty distribution determines the level of heterogeneity in the

population: 𝐸(𝑍) = 1and 𝑉(𝑍) = 𝜎2, if 𝜎2 is small Z tends to 1 and the

population is quite homogeneous).

The survivor function is defined as

𝑆(𝑡 | 𝑍) = exp {− ∫ ℎ (𝑠, 𝑍) 𝑑𝑠𝑡

0

} = exp {−𝑍 ∫ ℎ0(𝑠) 𝑑𝑠𝑡

0

} = exp{−𝑍 𝐻0(𝑠)}

Frailties can of course be applied to the classic proportional hazard model

containing other covariates, obtaining, for each subject, a hazard function of the

form

ℎ(𝑡, 𝑍, 𝑋) = 𝑍 ℎ0(𝑡) exp {∑ 𝛽𝑗𝑥𝑗𝑖

𝑝

𝑗=1}

Hougaard (1984, 1986a, 1986b) demonstrated that the survivor and density

functions for the whole population, the only observable entities, and the mean and

variance of the frailty are characterized using the Laplace transform of the frailty

distribution as follows:

𝑆(𝑡) = 𝐸 𝑆(𝑡|𝑍) = 𝐸 exp{−𝑍 𝐻0 (𝑡)} = 𝐿(𝐻0(𝑡))

𝑓(𝑡) = −ℎ0(𝑡)𝐿′(ℎ0(𝑡))

𝐸 𝑍 = −𝐿′(0)

𝑉(𝑍) = 𝐿′′(0) − (𝐿′(0))2

These formulations clarify why it is important to choose a distribution for the

frailty that has an explicit Laplace transform, and that maximum likelihood

methods can be used for the estimation of regression parameters. The hazard

function for the population becomes

20

ℎ(𝑡) = 𝐸 ( ℎ(𝑡, 𝑍)|𝑇 > 𝑡)

= ∫ ℎ(𝑡, 𝑍) 𝑓(𝑧 |𝑇 > 𝑡) 𝑑𝑧 = ∞

0

ℎ0(𝑡) ∫ 𝑧 𝑓(𝑧 |𝑇 > 𝑡) 𝑑𝑧 ∞

0

or, taking into account the presence of covariates,

ℎ(𝑡 | 𝑋) = ℎ0(𝑡) exp {∑ 𝛽𝑗𝑥𝑗𝑖

𝑝

𝑗=1} 𝐸 (𝑍 | 𝑇 ≥ 𝑡, 𝑋)

This explains that, since the frailer subjects die earlier, the average frailty of the

survivors is not constant, but will decrease in time.

Many types of distribution are possible for univariate frailties, usually

parametric, like the Gamma distribution, the most widely diffused thanks to its

mathematical properties: it’s always positive and has simple Laplace transform

that allows to easily derive in closed forms the measures of interest, and most of

all it is flexible, since it can take various shapes. Gamma-distributed frailty terms

have been repeatedly used to model univariate and multivariate frailties, also in

the analysis of survival in elderly population, so they have been considered for the

analysis of these life-span experiments. Other possible parametric specifications

are the positive stable, the inverse gaussian, the extended family of power

variance function distributions, the lognormal and the compound Poisson, not

suitable here since it creates a subgroup with Z= 0 that does not experience the

event. If no information is available on the trait influencing the hazard among

groups, the discrete specification is possible, both binary or finite discrete. The

choice should be based on the knowledge of the phenomenon under study, but

often simple mathematical convenience guides the selection.

The Coca-Cola experiments have another peculiarity that can be addressed

using frailty terms: young and adult rats were randomized using systematic

sampling and assigned to treatment or control groups. After one week, they were

also assigned for mating, and all the pups from all litters were included in the

experiment, continuing the regime they were given during pregnancy and weaning

through their dams.

21

Frailties are extremely useful also in this case, since they can be used to model

dependence in cases where the implicit assumption of independence of each

subject ad their failure times does not hold. This can happen when analysing

recurring events, in studies where one patient act as his own paired control, or in

studies where familiar groups are involved. The simplest option to overcome these

independence issues are shared frailties: the idea here is that some of the

unobservable characteristics that may affect the risk of failure are common among

individuals belonging to the same cluster, like siblings, and they can influence the

effect of observed covariates on survival. Here the random variable Zis constant in

time and is associated to the group, instead of individuals, accounting for

dependence among the subjects.

The formalization of shared frailty models is due to Clayton (1978), and can be

found also in Therneau and Grambish (2000) or Hougaard (2000). The survival

times are independent conditional on the frailties, and the hazard takes the form

ℎ(𝑡) = 𝑍𝑖ℎ0𝑗(𝑡) exp {∑ 𝛽𝑗𝑥𝑗𝑖

𝑝

𝑗=1}

where ℎ0𝑗(𝑡) is as usual the baseline hazard, that can be derived semi-

parametrically or parametrically, 𝛽𝑗are the regression parameters linked to the

fixed effect of observable covariates and the frailties are defined by the

parameters of their distribution, are independent between clusters and identically

distributed, and shared by the individuals belonging to the same cluster. This way,

the survivals are conditionally independent with respect to the frailties, and

dependent inside the group, thanks to the frailty term. Supposing we had just two

clusters, we would get

𝑆(𝑡1, 𝑡2|𝑍) = 𝑆1(𝑡1)2𝑆2(𝑡2)2

= exp{−𝑍 𝐻01𝑡1} exp{−𝑍 𝐻02𝑡2} = exp {−𝑍 ∑ 𝐻0𝑖𝑡𝑖

2

𝑖=1}

where 𝐻0𝑖(𝑡) = ∫ ℎ0𝑖(𝑠) 𝑑𝑠𝑡

0. Again, the Laplace transform is extremely relevant

to obtain the marginal survivor function:

𝑆(𝑡1, 𝑡2) = 𝐿(𝐻01(𝑡1) + 𝐻02(𝑡2))

22

As anticipated before, the standard distribution assumed for shared frailties is

the Gamma, with mean 1 and variance 𝜎2, so the marginal survivor is denoted as

𝑆(𝑡1, 𝑡2) = 𝐿(𝐻01(𝑡1) + 𝐻02(𝑡2))

= (1 + 𝜎2 (𝐻01(𝑡1) + 𝐻02(𝑡2)))−

1

𝜎2

= (𝑆1(𝑡1)−𝜎2+ 𝑆2(𝑡2)−𝜎2

− 1)−

1

𝜎2

The extension to multivariate case, with more than two clusters, is attributable to

Cook and Johnson (1981), that defined the survivor function as

𝑆(𝑡1, … , 𝑡𝑝 = (∑ 𝑆𝑖

𝑝

𝑖=1(𝑡𝑖)

−𝜎2− 𝑝 + 1)

−1

𝜎2

with equal correlation among the survival of the individual belonging to the same

group. This specification seems appropriate in such a case, where multiple litters

are included, each composed by more than two siblings.

An estimate of the frailty term within each litter can be found (Nielsen et al.

1992) as

𝑧�̂� = 1 𝜎2 + ∑ 𝛿𝑖𝑗

𝑛𝑖𝑗=1

⁄

1 𝜎2 + ∑ exp {∑ 𝛽𝑗𝑥𝑗𝑖𝑝𝑗=1 } 𝐻( 𝑡𝑖𝑗)

𝑛𝑖𝑗=1⁄

The properties of shared frailties survival models were demonstrated, both

adding other covariates or not, in several works: Murphy (1994) showed its

consistency and asymptotical normality (1995), Giddens (1999) tested for gamma

distributions with the semiparametric survival function, and Cui and Sun (2004)

demonstrated graphically and with numerical methods the adequacy of the

Gamma distribution.

Of course, the shared frailty approach has its limitations, and is not sufficiently

flexible in several contexts: for example, it assumes that unobservable risk factors

are equal among all individuals in each group, and that the association among

subjects is positive in most of the cases (Xue and Bookmeyer, 1996). Since the

characteristics of rats belonging to the same litter are extremely similar, there will

be no need to consider more complex specifications for the multivariate frailty,

like for example the correlated frailty model (where within each group a frailty

term is associated to each individual: these random variables are positively or

23

negatively associated jointly distributed, and different distribution can be chosen

to model different situations, like again the gamma, the log-normal or the

compound Poisson).

3.3 Analysis of time-to-event data from the Coca-Cola study

As it was presented in the previous chapter, the Coca-Cola study involved

randomized Sprague-Dawley rats of different ages, as well as their offspring, that

was included in the observation in toto and without randomization. For all

subjects the event of interest is spontaneous death, since these are life-span

carcinogenicity studies and no terminal sacrifice was planned; subsequently, no

censored observation exists, all subject experience the event. The outcome, then,

is time to death, measured in weeks of age of each subject. All information about

the identification of individuals were recorded at the beginning of the study: they

are presented in the following table, and they represent the variables of interests,

and the main other possible risk factors that we want to control:

Variables Description Values

Entry Weeks of age at the beginning of the observation 7 weeks; 30 weeks; 39

weeks; 55 weeks

Age Duration of life (in weeks)

Event Spontaneous death 1 deceased

Treatment Experimental regime 0 drinking water (control);

1 Coca- Cola (treated)

Sex Sex 0 female; 1 male

Momstart Age of the dam at pregnancy (only available for

offspring)

30 weeks; 39 weeks; 55

weeks

Famid Identifier for litters, common for all siblings born from

the same breeders

1-98

Table 2: Variables for survival analysis; “setting” variables in italic,

experimental variables in normal font.

3.3.1 Descriptive analysis

Here a brief summary of the data and some really basic descriptive statistics

are reported:

24

Treatment Age at start M F















Table 3: Composition of experimental groups; breeders and their offspring are

listed sequentially.

The experimental groups of breeders were formed to be balanced in terms of

sex and treatment regime, but not in terms of ages groups. The offspring, since

had not been randomized, present slightly less homogeneous characteristics, as it

might be expected the pups born form older dams are fewer. A marginal analysis

was used to highlight associations between the outcome and explanatory variables

(Box-and-Whiskers plot and density plots are reported in Figure1).

Figure 1: Age by experimental variables, Box-and-Whiskers plots

The Box-and-Whisker plots don’t show significant alterations in the survival

depending on sex, and the exposition to treatment seems to have a negligible

25

effect, too; the age at beginning of the observation and the age of the dams at

pregnancy, on the other hand, determine more differences; these are confirmed

when estimating the Kaplan Meier survival curves. Log-rank and Wilcoxon test

were performed to verify the equality of survivor functions; they are not reported

here for brevity, but they confirm the statistical significance of the differences

among the curves calculated for animals entering in the experiment at different

ages, for different ages of the dams, and surprisingly between rats belonging to

treated and control group.

Figure 2: Kaplan-Meier survival estimates for experimental groups

According to the guidelines, the statistical analysis of these data could have

stopped here. However, applying the methods that have been presented so far

might give interesting insights on the problem, and more appropriate, robust and

accurate results.

3.3.2 Analysis of time-to-event for the first-generation’s rats

We are dealing with two generations of animals, that, as we said, differ mainly

because the first was randomized, while the second was not, so entire litters are

present among the offspring. For this reason, it was chosen to separate the

analysis and to apply to each generation the most suitable methodologies.

26

The analysis of survival times for the breeders started with the classic semi-

parametric Proportional hazard model, that was used to evaluate the influence of

sex, treatment and their interaction. The variables were tested in univariate and

multivariate models, and, coherently with the results of explorative analysis,

treatment resulted the only experimental condition to represent an additional risk

factor, increasing the probability of the event of about 11% for the subjects from

the treated groups. As the plot of the estimated survivor function shows, the

differences are not really marked, anyway.

Figure 3: Estimated survivor function from the PH model, breeders

The exploration of the residuals showed that the overall fit of the model is

very good, except for the last part, where the decreasing number of observation

causes expectable issues; the plot of deviance and martingale residuals showed

that some individuals could be too influential in determining the estimates, and

the df betas and loglikelihood displacement helped individuate them. A careful

examination of data confirmed that they don’t correspond to measurement or

recording errors, and they fall in the range of plausible values, so they shall not be

considered as outliers and excluded from the dataset. Schoenfeld residuals, “log-

log” plots and tests made to verify that the log hazard-ratio function is constant

over time showed that the survival functions are not so different among the groups

27

we are considering, but that the assumption of proportionality of hazard is not

tenable with respect to variable “sex”.

Figure 4: Residuals to evaluate the assumptions underlying PH model, breeders

Since the fundamental assumption of this approach is in doubt, it was decided

for the use of the accelerated failure-time metric, and the evaluation of the effect

of a univariate frailty has been included in the parametric AFT models only. The

lognormal, loglogistic and generalized gamma distributions were tested in order to

find the best parametrization for the model:

Variables Lognormal Loglogistic Generalized gamma

1.sex 0.005 0.003 0.021*

1.treat -0.029* -0.034** -0.025**

Constant 4.610*** 4.640*** 4.729***

Ancillary -1.283*** -1.902*** -1.533***

Kappa

1.093***

Log-likelihood -157.843 -112.815 -20.483

AIC 323.686 233.630 50.966

BIC 344.3665 254.311 76.817

*** p<0.01, ** p<0.05, * p<0.1

Table 4: Estimates and goodness-of-fit measures for different distributions of AFT

models

The comparison of the log-likelihood and AIC of each fit, together with the

graphic representation of Cox-Snell residuals, showed that the generalized gamma

is the best choices for the data.

28

Figure 5: Cox-Snell residuals for the evaluation of the best distribution for AFT

models.

No univariate frailty could be introduced in the Generalized gamma AFT

model for computational problems of the software; it was tried therefore to fit a

model with a univariate frailty term using the “second best” option, the

Loglogistic distribution, that still showed a reasonable capacity to model the data.

This compromise proved useless in the end, since the term to evaluate unobserved

heterogeneity at the individual level was far from significant, and did not improve

the fit of the model in any way.

We can conclude that the survival experience of the animals belonging to the

“first generation” group is best analysed using an accelerate failure-time model,

and the most suitable distribution to represent their hazard function is a

generalized gamma.

29

PH AFT AFT

Generalized gamma Loglogistic

Variables Coef Coef Coef

1.sex -0.106* 0.021* 0.003

(0.056) (0.012) (0.014)

1.treat 0.109** -0.025** -0.033**

(0.056) (0.012) (0.014)

Constant

4.729*** 4.642***

(0.012) (0.012)

Ancillary

.216*** -1.915***

(.005) (.0232)

Kappa

1.093

(.074)

Theta (gamma frailty)

5.42e-09

Log-likelihood -8009.919 -20.48326 -117.862

AIC 16023.84 50.96653 245.724

BIC 16034.18 76.81712 271.5746

Standard errors in parentheses, *** p<0.01, ** p<0.05, * p<0.1

Table 5: Results of PH and AFT models for breeders, and measures of goodness

of fit.

The estimate coefficients confirm that the differences in survival among sexes are

not relevant, while the treatment is responsible for a small but significant

acceleration of the risk of the event (or, to be more straightforward looking at the

negative coefficient, that the treatment causes the failure to happen more rapidly,

so the expected time-to-event decreases). Residuals have been analysed mostly

graphically, and they showed that the model is overall a reasonable representation

of the data, and that there shouldn’t be major issues regarding the respect of the

assumptions.

30

Figure 6:Residuals to evaluate the assumptions underlying AFT model, breeders

3.3.3 Analysis of time-to-event for the second-generation’s rats

The procedure that was followed to analyse the group of the offspring was

essentially the same followed before: it started with the fitting and interpretation

of a Cox model, whose assumptions were evaluated in order to assure its validity.

The models were fitted to assess the effect of treatment on the risk of death,

controlling for a possible effect of sex and of one more variable, the age of the

dams at the beginning of gestation, that can take the values of 30, 39 or 55 weeks

of age (in the dataset it is identified as “momstart”).

Model 1- Proportional Hazard

Coef Std. Err. z P>|z| [95% Conf. Interval]

Observations 699

1.sex 0.044 (0.076) 0.571 0.568 -0.106 - 0.193

1.treat 0.182 (0.077) 2.373 0.018 0.032 - 0.332

39.momstart 0.026 (0.085) 0.307 0.759 -0.140 - 0.192

55.momstart 0.553 (0.112) 4.953 0.000 0.334 - 0.772

Log-likelihood -3870

AIC 7748.019

BIC 7766.218

Table 6: Proportional hazard model, offspring

As before, sex was not found to increase the risk of the event in a statistically

significant way, while treatment has a relevant effect, so we can say that there is

sufficient evidence to link a heavy consumption of Coca-Cola since prenatal life

with an increase in the hazard of death; as expected, the later a dam undergoes

mating, the higher becomes the risk of death for the pups: the risk for the pups of

dams of 39 weeks is similar to those of 30, while for those bred by the oldest

dams the risk increases sensibly.

31

Figure 7: Estimated survivor functions by sex, treat and age of dams, offspring

The results showed that for this dataset the assumption of proportionality of

hazard is realistic: even if Schoenfeld and scaled Schoenfeld residuals show some

irregular trend, a formal test for the equality of log hazard-ratio over time

excluded important violations. No observations were found to have an excessive

influence on the estimates or on likelihood of the model, and globally it performs

quite good for these data.

Figure 8: Residuals for the evaluation of the assumptions underlying PH model.

A frailty term was than introduced in the model, this time representing a proper

shared latent random effect, considering each litter a cluster: its effect represents

the common characteristics that individuals belonging to the same cluster

verisimilarly share. It accounts for the possible association existing among them,

32

that would make unreasonable the assumption of independence of the subjects.

Frailty terms are assumed here to be gamma-distributed with mean 1 and variance

θ, and they affect the hazard multiplicatively; a significant shared frailty would

confirm that correlation among members of the same litter exists (and the degree

of the correlation is measured by θ), and is relevant in explaining survival

patterns.

Figure 9: Estimated survivor functions from PH model with a shared frailty term

The term proved significant for the dataset of offspring: the effect of shared

characteristics on survival is relevant and has to be considered. On the other side,

it must be noticed that treatment loose of importance if this new variable is

considered.

33

Model 2- Proportional hazard + shared Gamma frailty

Coef Std. Err. z P>|z| [95% Conf. Interval]

Observations 699

Number of groups 98

1.sex 0.048 (0.081) 0.587 0.557 -0.112 - 0.207

1.treat 0.229 (0.125) 1.830 0.067 -0.016 - 0.474

39.momstart 0.095 (0.138) 0.684 0.494 -0.176 - 0.365

55.momstart 0.688 (0.172) 4.004 0.000 0.351 - 1.024

theta .166 (.046)

0.000

Log-likelihood -3850

AIC 7708.57

BIC 7726.768

Table 7: Estimates and goodness of fit of PH model with a shared Gamma frailty

The post-estimation analysis for model validity and goodness of fit do not

show major problems or violations in the assumptions underlying the model,

neither regarding the presence of outliers, nor the proportionality of hazards, so

for the second-generation group it was not necessary to look for a better

specification using different metrics.

Figure 10: Residuals for the evaluation of the assumptions underlying PH model

with Gamma frailty

Regarding the analysis of survival times, we can draw some conclusions: first,

it is important to go beyond the plain comparison of nonparametric estimates.

Indeed, mostly in case of complex data, or atypical ones, like those that were

analysed here, where we had different generations and members of the same litters

in experimental groups that were not created using randomization, the hazard of

34

death could be attributed to factors that are not of major relevance in reality. A

good example comes from the survival experience of the second-generation rats:

had we not included the shared frailty term, that accounts for unobservable

common features of individuals (it can be a genetic predisposition for some kind

of disease or tumours, the fact of belonging to a very numerous litter, where

individuals are necessarily smaller and therefore possibly weaker, and so on), we

might have attributed the observed differences in the times to death entirely to the

treatment.

Another caveat is to always bear in mind the importance of verifying that the

methods that are chosen to analyse data are suitable for them, and that it is neither

granted nor banal that all assumptions on which models are based, are always met.

Here, the hypothesis of proportionality of hazard did not hold in the breeders’

dataset, and that emerged only after the examination of the residuals; the

importance of model checking is clear for statisticians, but the same can’t be

assumed for all researchers from different fields that follow the whole experiment,

included data analysis.

35

Chapter 4 The analysis of longitudinal data

Data collected measuring one or more variables of interest in repeated

occasions over the same set of subjects, that we assume to constitute a random

subsample of the population of interest, are often referred to as repeated measures

data. When repeated measures are ordered in time or space, and we cannot assume

that the observations within each individual has been assigned randomly, we call

these data longitudinal. When they are available, individual patterns of change

can be observed: longitudinal data can thus provide richer information, but request

particular attention, since the drop out of the individuals under study represent in

some applications an important issue (clearly not in our case), and correlation

rises among observations registered for the same subject.

Long-term and life-span carcinogenicity experiments often request several

types of longitudinal data: measurements of body weights and mean consumption

of food and beverages for cage are routinely collected with quite dense schedule,

since they are an important track of the animal well-being and of the good course

of the experiment itself (OECD 2002). They should be monitored jointly, since

changes in body weight can be related to alterations caused by interactions

between nutrients and the tested substance, or by a different palatability of foods.

Sensitive changes (in particular losses) may be important signals of health

problems, and should be always regarded with attention. These indicators also

gain importance per se because they are heavily related to metabolic, hormonal

and homeostatic functions, growth and sexual maturation. In this study, where the

36

tested compound is a highly caloric and sugary beverage, body weight in

particular should be of primary interest, in association with classic tumour

incidence, since it is well established that overweight and obesity are positively

associated with the increase of risk of many types of cancer. Based on a

systematic review of the published scientific literature, IARC assessed that the

absence of excess body fat has a preventive effect in humans for cancers of the

colon and rectus, oesophagus, kidneys, breast and endometrium, and, with

sufficient evidence, for cancers of the gastric cardia, liver, gallbladder, pancreas,

ovaries, thyroid, meninges, and on multiple myeloma (Lauby-Secretan B 2016).

4.1 The analysis of longitudinal data according to the guidelines

In usual carcinogenicity studies, nevertheless, body weight variations and food

consumption do not directly represent the primary outcome of interest, so their

analysis is, again, not particularly thorough. Guidelines (OECD 2012) suggest to

represent graphically group means, to promptly identify unexpected trends; this is

most important in during the early to middle part of life, while after approximately

80 weeks of age the rodents enter the geriatric phase, and are more prone to

weight losses due to ageing, diseases or tumours; also, it was noted that lighter

rats tend to live longer than heavier ones, so means might be biased downwards.

Formal analysis of data always refers to body weights and food consumptions by

timepoint of interest and averaged on groups, since no methodology for repeated

measures is explicitly advised. It should start with a test to identify the presence of

outliers, like the Dixon and Massey test, followed by a test to evaluate the

assumption of normality (either the Kolmogorov-Smirnov or the Shapiro-Wilk

test). In case of not-normally distributed data, a logarithmic transformation is

suggested, and the test repeated. If data fall back into normality assumption,

another test for outliers such as the Extreme Studentized Deviate statistics could

be carried out; then the F-test, or the Levene’s or Bartlett’s tests can be performed

to evaluate the homogeneity of variance, respectively if the experiments has two

or more groups. If the F-test highlights that variances are homogeneous, a

Student’s t-test can be used to evaluate differences between groups, while if they

37

are heterogeneous the comparison is performed with the modified t-test, using

Satterthwaite’s method. In presence of more than two groups having

homogeneous variances, the comparison between all groups can be performed

using one-way ANOVA followed by Duncan’s multiple range test or Tukey’s

Honest Significant Difference test, and by Dunnett’s test for pair-wise

comparisons between the control and the dosed groups. If the assumption of

normality is not tenable even after transformation and analysis of outliers, or if the

Levene’s test shows that variances are heterogeneous, some alternative methods

are suggested: the Kolmogorov-Smirnov’s test or the Wilcoxon rank sum test can

be used to compare two groups, while global differences among more

experimental groups can be tested with a Kruskal-Wallis ANOVA by ranks, or a

Jonckheere’s test. If significant differences are found, multiple comparisons are

possible using distribution-free methods like Dunn’s or Shirley’s tests. A concise

representation of the suggested analysis can be seen in Figure 1.

38

Figure 11- OECD statistical decision tree summarizing procedures for the

analysis of continuous data

The approach suggested in the guidelines is one of the many that have been

developed based on the analysis of variance since the beginning of XX century:

the foundation goes back to the work of the astronomer G. Biddel Airy, and was

later formalized by R. A. Fisher (Fitzmaurice G 2008).Several approaches took

the moves from the ANOVA paradigm, and the one just described is probably one

of the simplest, since it consists in summarizing the collection of measurements

for each individual in a single or a set of values, that are than compared using the

39

ANOVA method. Means are a very immediate measure, and the area under the

curve (AUC) is also quite common. Appealing for its simplicity, this approach has

clear drawbacks: the loss of information, and the fact that completely different

series of data can produce the same summary measure; the impossibility to

evaluate the effect of time-varying covariates; the violation of the assumption of

homogeneity of variances at the basis of ANOVA, that is very likely when

measurements are irregular, not equally spaced or have missing data.

Another possible method, one of the earliest, was the mixed-effect ANOVA,

also referred to as the univariate repeated-measures ANOVA: since the structure

of longitudinal data shows some similarities to that of data from randomized block

designs, for which ANOVA had been developed, the methodology was applied to

repeated measures data, regarding the individuals as the blocks. The model can be

written as

𝑌𝑖𝑗 = 𝑋′𝑖𝑗 𝛽 + 𝑏𝑖 + 𝑒𝑖𝑗

with i = 1, …, N individuals and j = 1, …n measurements, where 𝑋′𝑖𝑗 is a design

vector, 𝛽 a vector of regression parameters, 𝑒𝑖𝑗~N(0, 𝜎𝑒2)and 𝑏𝑖~N(0, 𝜎𝑏

2) doesn’t

stands as the proper block effect anymore, but as a random subject effect

representing the unobserved or unmeasured characteristics that cause differences

in the outcome in each subject. This factor accounts for positive correlation

among subsequent measurements within-subject, but is forced to follow a quite

restrictive structure of the covariance, maintaining constant variance and

covariance ( 𝑉𝑎𝑟(𝑌𝑖𝑗) = 𝜎𝑏2+ 𝜎𝑒

2, and 𝐶𝑜𝑣(𝑌𝑖𝑗𝑌𝑖𝑘) = 𝜎𝑏2). The assumed

compound symmetric structure of the covariance matrix is not the best for

longitudinal data since correlation is likely to decrease as separation in time

increase, and sometimes the variance does not hold constant in time. Some

shortcomings of the model were addressed in time: Greenhouse and Gassier

(1959) proposed an adjustment to handle more general covariance structure, while

Henderson (1963) developed a method for unbalanced data. Anyway, the idea of

allowing for random differences among subjects is the basis of various subsequent

40

regression methods, so the univariate repeated-measures ANOVA can be

considered as a precursor of the regression methods that will be addressed later.

A slightly more complex method was also used for the analysis of longitudinal

data, the repeated-measures analysis by MANOVA, or repeated-measures

ANOVA, a special case of MANOVA that handle multivariate, correlate data, and

allows less stringent assumptions on the structure of the covariance matrix

(covariances are only assumed homogeneous across subjects). This method, too,

has some limitations, since it can’t handle time-unbalanced designs and missing

data, leading to inefficiencies and possibly biased results.

These methods can be used for longitudinal data after verifying their

suitability, but their limitations must be kept in mind; more sophisticated and

flexible methods are available. They will be exposed and applied to the Coca-Cola

experiment data, and their usefulness evaluated.

4.2 Other methods for longitudinal data

4.2.1 Linear mixed-effects models

Mixed effects models are likely the most widely used tool for continuous

outcomes whose residuals distribute normally but are not independent or

homoscedastic: these are characteristics of “grouped” data, where grouping may

arise from clustering (measuring outcome on pups from the same litter for

example), or from repeated measurements or longitudinal studies, where

individuals are assessed repeatedly over time or under different experimental

conditions. The correlation between measurements arising from the feature of the

data is conveniently addressed in these models thanks to the possibility to choose

between several types of parsimonious covariance structures.

Furthermore, the basic idea of having a common functional form for all

individuals, with parameters that vary among subjects is quite intuitive and can be

appropriate in numerous situations. Indeed, these models are called “mixed” since

they can incorporate explicative variables as fixed or random effects: fixed effects

are associated to continuous or discrete covariates, and represent unknown

41

parameters that are constant in the population, for each level or value of the

associated independent variable. When the levels of a categorical variable can be

considered as random realizations of a sample space, and they are not of particular

interest per se, they can be modelled as random effects.

Its foundations lie in the ANOVA paradigm, as can be seen in the works of

Scheffé (1959) and Harville (1977), but also in the idea of allowing for random

differences across individuals when analysing growth curves, as in Wishart

(1938), Box (1950), Rao (1958), Potthoff and Roy (1964), and Grizzle and Allen

(1969). Another contribution in the development of mixed-effect model was the

two-stage approach, adopted and made popular by the US National Institute of

Health. According to this approach, the distribution of the repeatedly measured

outcome is the same for all subjects and is characterized in the first stage, but the

parameters can vary randomly over the units (so they also can be referred to as

random effects), and their distribution constitutes the second stage of the model.

Laird and Ware (1982) were the first to propose a flexible class of mixed

models for longitudinal data: it includes bot growth and repeated-measures

models as special cases, and introduces population parameters, individual effects

and within-subject variation, as well as between-subject variation.

In their representation, the 𝑛 ∗ 1 vector of responsesfor the ith subject can be

modelled as

𝒚𝑖 = 𝑿𝑖𝜷 + 𝒁𝑖𝒃𝑖 + 𝜺𝑖 𝑖 = 1, … 𝑁

where

• 𝑿𝑖𝑗 is a 𝑛𝑖 ∗ 𝑝 design matrix of explanatory variables or fixed factors;

• 𝜷 is a 𝑝 ∗ 1 vector of unknown population parameters, or fixed effects

coefficients, describing the relationships between the outcome and the

explanatory variables for group defined by levels of a fixed factor (for

example, describing the contrast between males and females);

• 𝒁𝑖 is a 𝑛𝑖 ∗ 𝑞 design matrix of variables orrandom factors;

• 𝒃𝑖 isa𝑞 ∗ 1vector of unknown random effects, specifically referred to a

given level of a random factor, usually representingthe deviations from the

relationships described by fixed effects. Random effects can be set as

42

random intercepts (random deviations for an individual or cluster from the

overall fixed intercept), or as random coefficients (random deviations for

an individual or cluster from the overall fixed effects). They are assumed

to follow a multivariate normal distribution (~ 𝑁(𝟎, 𝑫)), with𝐷being a

𝑞 ∗ 𝑞symmetric, positive definite variance-covariance matrix;

• 𝜺𝑖 is a𝑛𝑖 ∗ 1vector of errorsfor the ith subject for each measurement

occasion, whose terms do not need to be independent but can be correlated

within individual. The residuals for each subject follow, again, a

multivariate normal distribution (~ 𝑁(𝟎, 𝑹𝒊 )),with 0 mean and a positive

definite 𝑞 ∗ 𝑞 variance-covariance matrix,𝑹𝑖 .

Several covariance structures can be specified both for D and for 𝑹𝒊 4. For D, a

very common definition is the unstructured, where no additional restriction is

assumed on the value of its elements apart from positive- definiteness and

symmetry, so the variance components to be estimated are 𝑞 ∗ (𝑞 − 1) 2⁄ (the

variance 𝜎𝑏2 for each of the 𝑞 random effects and the covariance 𝜎𝑏𝑘,𝑏𝑙 for each

couple). Other more parsimonious covariance structures are possible, but they

require more constraints: an example isthe diagonal matrix, where only the

variances are estimated, while the covariances are set to 0. The simplest

specification for 𝑹𝒊 is the diagonal structure, that require the estimation of one

single parameter for the variance component, 𝜎𝑏2, since the residuals are assumed

to be uncorrelated within individual, and to have common variance. The

covariance matrix can alternatively take a compound symmetry structure, that

assumes equal variances and equal covariances among observations within

subject, and is suitable for example for repeated assessments under the same

experimental conditions, when equal correlation of the residuals is plausible.

Another quite common structure is the 1st order autoregressive or AR(1), under

which the variance components vector only contains two parameters, the variance

𝜎𝑏2, that is assumed constant and always positive, and a correlation parameter ρ,

whose values go from -1 to +1; the covariance is calculated as 𝜎𝑏2𝜌𝑤, where

4The following presentation of Linear Mixed effect models is based on the work of West et al. (2010)

43

𝑤 represents the lag between observations within subjects (so for adjacent

observations is higher, and tends to 0 for very far observations): this makes it

suitable for equally spaced experimental designs. Other structures, with less

constraints, are possible, but are less parsimonious, and require the estimation of

more parameters of the variance components. Mixed models even allow to assume

that subjects characterized by different levels of variables may share the same

structure of the covariance matrix, but have different values for the parameters of

the vector of variance components of D and 𝑹𝒊 .

An alternative specification of the mixed effect model, referred to all

individuals together, is given as

Y = Xβ + Z u + ε, where u∼ N (0, G) and ε∼ N (0, R)

Here Y, u and ε are vectors obtained from “stacking” respectively the Yi, ui and

εi vectors for all subjects vertically, the n × p design matrix X is obtained by

stacking all Xi matrices vertically and the Z matrix is a block-diagonal matrix,

with blocks on the diagonal defined by the Zi matrices. The G matrix is therefore a

block-diagonal matrix containing the variance-covariance matrix for all random

effects, with blocks on the diagonal defined by the D matrices, while the n × n

matrix R is a block-diagonal matrix of variance-covariance for all residuals, where

the Ri individual matrices define the blocks on the diagonal.

Since it is assumed that the random effects and the error terms follow a normal

distribution, the whole model can be written marginally as

𝒚𝑖 = 𝑿𝑖𝜷 + 𝜺𝑖

where휀𝑖 ~ 𝑁(0, 𝑉𝑖 )and 𝑉𝑖 = 𝑅𝑖 + 𝑍𝑖 𝐷 𝑍𝑖′ , with the covariances of the

observations (the off-diagonal elements) allowed to be correlated, so different

from 0. Consequently, the marginal distribution of the vector of responses is

defined as 𝑦𝑖 ~ 𝑁(𝑋𝑖 𝛽, 𝑅𝑖 + 𝑍𝑖 𝐷 𝑍𝑖′ ). This representation is worth being

reminded since it’s in this framework that are estimated the fixed effects and the

variance components in most statistical software, and lies at the basis of the

likelihood approach for estimation.

44

Inference on the parameters estimates can be based on least squares and

maximum likelihood methods, or, formulating the model the appropriate way,

using an empirical Bayesian method. Under the classical frequentist approach, 𝛽

and 𝜃 (where 𝜃 is a 𝑞 ∗ 1 vector containing the variances and covariances from 𝑅𝑖

and 𝐷) can be obtained maximizing the likelihood function, as usual: the

likelihood is set as a function of the parameters of the hypothesized model, taking

all assumptions into account: for the mixed effect model, the marginal

specification of the model is used. The values of the parameters that, given all

assumptions, make the observed values of the outcome most likely, are the

maximum likelihood estimates for the parameters. The likelihood function 𝐿(𝛽, 𝜃)

is given by the product of all individual likelihood functions, and the log-

likelihood is found as usual, as the natural logarithm of the likelihood function,

resulting in

𝑙𝑀𝐿(𝛽, 𝜃 | 𝑦) = −1

2[𝑛 ∙ log(2𝜋) + log(𝑽𝑖) + ( 𝑦𝑖 − 𝑋𝑖 𝛽)𝑽𝑖

−1( 𝑦𝑖 − 𝑋𝑖 𝛽)]

The estimates of the covariance parameters are obtained using iterative

procedures, until the reaching of convergence; once the ML estimates of θ are

found, they are used to compute (directly) the estimated value of 𝑽𝑖, and finally to

calculate the generalized least square for the regression parameters β. Since an

estimate of the values in 𝑽𝑖are used, the βs are also called the empirical best

linear unbiased parameters. The main problem with maximum likelihood

estimates of the variance components, is that they do not account for the degrees

of freedom used to estimate the parameters of β, so they are biased, as discussed

in Verbeke & Molenberghs (2000). To overcome this issue, the Reduced

Maximum Likelihood estimates are more often used. REML was proposed

initially to overcome the problem of the estimation of variance components from

unbalanced or incomplete block design, and was then adopted as an alternative

tool and even preferred to classical ML method since the estimates are not biased,

because the degrees of freedom used for the estimation of the fixed effects are

considered. This allows to obtain estimates of the values in the 𝑽𝑖 matrix; then the

parameters of the β can be found using the methods from ML approach, using

45

generalized least squares. The estimates obtained with the REML and ML method

are different: for β, ML produces biased results, while REML does not, as already

mentioned; for the variances of β, both methods give results that are biased

downwards, because neither compensates the uncertainty coming from the use of

empirical estimates of 𝑽𝑖. Usually, this problem is overcome trying to find the

best possible estimates of the 𝑽𝑖, using REML to fit models with alternative

structures for D and 𝑹𝒊 .

The estimation of the covariance parameters (that is, the maximization of the

log-likelihood function under the assumption of positive-definite matrices D and

𝑹𝒊 )is usually perform using the Expectation-maximization algorithm, the

Newton-Raphson procedure or the Fisher scoring algorithm. In most statistical

software, first appropriate starting values for the parameters are estimated with the

EM algorithm; these are then used for the following estimations, using the

preferred algorithm. The most common method is the Newton-Raphson, both for

ML and REML.

Mixed effects models have also proved to be robust in the analysis of

unbalanced data when compared to the General Linear Model framework

(Pinheiro and Bates, 2000): individuals with incomplete observations can still be

included in the analysis, and with complete data, too, mixed effects models

provide advantages over the GLM.

So far, the response variable for individuals was assumed to follow a linear and

continuous trend, but often it can present discontinuities, or show a nonlinear

path, making reasonable the assumption that the likelihood function may depend

on the parameters in a non-linear way. It is the case of several physical

phenomena, and growth is the main example. Several adaptations may be adopted

to cope with these situations, the most common being the splitting of the time of

analysis into subperiods, so that the linearity assumption underlying the model is

reasonable within each one. The drawback of this simple solution is that it

prevents from analysing the phenomenon under study in its richness, treating

discontinuities and nonlinearities as problems to overcome to fit the data to the

model, rather than as the characteristics of the process itself.

46

Two approaches can be tried in this situation (Singer J. D. 2003): the first is

more empirical, based on the observation of individual trajectories and the

manipulation of data. One possibility is to identify a suitable transformation of the

outcome or the time scale that can lead to a linearization of the trajectories for

each individual, so that the assumption of linearity holds, and the simpler model

illustrated so far can be fitted. Many ways to transform data of course exist, and

the most suitable must be found every time through several tries5.Another

possibility is to introduce in the model different predictors that all together

characterise time, that is, to represent time as a polynomial function. So, a second-

order polynomial may be fitted to account for trajectories that resemble quadratic

change with one stationary point; a third-order polynomial will account for cubic

change, and so on. The main issues with this device is the increased complexity of

the model and the interpretability of the regression parameters. The second

approach is more theory-based, and requests to identify a reasonable functional

form underlying the trend of subject’s data, that will be used in the model to

describe the relationship between the response and explicative variables. As we

will show in the next paragraphs, both strategies have been applied to the body

weights data of the second generation of rats from the Coca -Cola study.

4.2.2 Nonlinear mixed effects models

The general model presented in the previous paragraph and all the expedients

that allow to represent growth trajectories of many shapes share a characteristic:

they imply an association linear in the parameters between the outcome and the

explicative variables, because the model is composed by individual growth

parameters, that are linked in a linear way. Often however, the likelihood function

depends on the parameters in a non-linear way: it is the case of several physical

phenomena, and growth is the main example. In such cases, the use of nonlinear

model is justified by the possibility to obtain a more interpretable model, and to

5A useful tool to have an idea of the possibilities is given in Mosteller and Tuckey (1977), where

the ladder of transformation is associated to the so-called rule of the bulge (the idea is to associate

the approximate shape of the plots to a suitable transformation of the outcome or time, to be

applied to all individuals)

47

use a smaller number of parameters, mostly if we compare them with high-order

polynomial models.

Lindstrom and Bates (1990)were the first to present a general, nonlinear mixed

effects model for data in which the assumption of the normality of residual holds,

but the expectation function is nonlinear. The model can be written as

𝑦𝑖𝑗 = 𝑓(𝑥𝑖𝑗 , 𝛽, 𝑢𝑖) + 𝜺𝑖𝑗, 𝑖 = 1, … , 𝑁

where𝑓 is a real valued function, 𝑥𝑖𝑗is a vector of covariates containing both

within- and between-subjects covariates, 𝛽 is a 𝑞 ∗ 1 vector of unknown

parameters of fixed effects, 𝑢𝑖 is a vector of unobservable subjective random

parameters following a multivariate normal distribution with 0 mean and

variance-covariance matrix 𝛴, and휀𝑖 is the usual error vector of dimension 𝑛𝑖 ∗ 1,

following a multivariate normal distribution with 0 mean and variance-covariance

matrix 𝜎2𝛬.

It is useful to adopt the two-stages representation of the model6, since it helps

clarifying how the non-linear function is used to express the individual trajectory

of change at level 1

𝑦𝑖𝑗 = 𝑚 (𝑥𝑖𝑗𝑤, 𝜑𝑖) + 휀𝑖𝑗,

where m describes the behaviour of the individual growth as depending on

individual-specific parameters 𝜑𝑖 and the vector of within-subject covariates 𝑥𝑖𝑗𝑤,

while the inter-individual variability can be expressed using a regular linear

relationship at level 2:

𝜑𝑖 = 𝑑 (𝑥𝑖𝑗𝑏 , 𝛽, 𝑢𝑖)

where d is a vector function that explains the variation of individual-specific

parameters between subjects, and incorporates 𝛽, the vector of parameters for the

population, and 𝑥𝑖𝑗𝑏 , the set of between-subjects covariates.

The assumptions underlying the non-linear mixed effects model are that the

random effects 𝑢𝑖and the error terms 휀𝑖 are independent between each other and

across individuals, that 𝜎2 > 0 and that matrix Σ is definite nonnegative.

6The following presentation of nonlinear mixed effect models is mainly based on the work of Demidenko (2013)

48

An important part of the modelling process regards the choice of which

parameters to consider random, so individual-specific, and which can be regarded

as population-averaged, so fixed.

A specific feature of NLME models is that the random parameters 𝜑𝑖 appear in

a nonlinear function, implying that the expected value of the response can’t be

expressed in closed terms, in terms of population averaged parameters. This

makes it is very difficult to establish the statistical properties of the model in small

samples; indeed, even the condition of a number of subjects N tending to infinity

while the number of observations for each individual 𝑛𝑖remainis finite, that is a

sufficient condition for linear mixed effect models to be consistent, asymptotically

normal and efficient, is enough only for maximum likelihood estimation.

Maximum likelihood estimation, yet, requires the integration of the unobservable

individual-specific parameters, and this leads to the presence of a

multidimensional integral. To avoid this problem, various approximation methods

for the estimation of the model exist. They can be grouped into two categories, the

two-stage methods, and those that approximate the NLME function to a linear

function, to reduce the model to a nonlinear marginal mixed effects model. In the

former methods, nonlinear least squares are used to estimate individually the

subject-specific parameters from the first stage; then these are used as

observations for the second stage model. These methods can be proficiently used

to when 𝜎2 is relatively small, and the number of observations for each cluster

relatively large. The latter methods allow to obtain good approximations when an

estimate of the random effects from the penalized nonlinear least square estimator

is used. All methods, anyway, become equivalent when the number of clusters

and of observations for each cluster tend to infinity. As already mentioned, the

ML estimator is consistent when the number of subjects tends to infinity and the

number of observations for each individual remains finite, but this characteristic

can be lost if the distribution of the random effects is mis-specified (however, if

also the number of observations per cluster tend to infinity, we can consider it

consistent even if the distribution of the random effects is not appropriate). The

ML estimator also returns correct standard error for all estimates.

49

Choosing the correct functional form for the model is very important: the

choice of a not suitable model may prevent convergence, produce negative

variances for the random effects, computational difficulties and so on. The

observation of individual growth paths can ease the choice, but a theory-driven

approach is possible, as it will be illustrated in the next paragraph. The growth and

maturation processes have been long studied in humans, and several models have

been proposed to formalize the growth patterns during infancy, childhood and

adolescence. Since some similarities can be recognized in the growing process of

very young humans and rats, some of these models have been “borrowed” and

applied here.

4.2.3 Growth models

The analysis of the progress of growth and maturation in humans has a long

history. It covers the study of variations of stature and weight, and of the velocity

of growth standardized relative to the passage of time, as well as of the acquisition

of secondary sexual characteristics, to investigate the progressive achievement of

adult status. Describing the normal progress of childhood growth is important per

se, and even more interesting is the study of any variation from the regular

pattern: differences in size, velocity and timing of maturation can give indications

on health problems at different levels. In normal conditions, indeed, growth is

determined by the personal genetic complement and ruled by several hormonal

systems, but many and various environmental factors, first of all nutrition,

represent a fundamental constraint.

Auxological studies always looked at mathematics to formalize and model the

normal growth phases7, starting from the observation and measurement of

individuals under study. They usually are of two types, with different goals: cross-

sectional studies usually aim at obtaining age-dependent reference curves, while

longitudinal studies have the purpose to understand the growth process over some

period.

7The illustration of growth models is mainly based on the work of Hauspie et al, 2011.

50

Longitudinal data and the will to understand the growth process and highlight

possible differences due to the treatment regime is what we have in the context of

this study on rats, too, so these methods were applied to this unusual field.

It is well established that the pattern of growth of body dimensions of the

“general type” (to be distinguished from those of lymphoid, neural and genital

type) is increasing and S-shaped, from birth to the adult age, since it progresses

rapidly in the first years, then slows down, and accelerates again around the so-

called pubertal spurt, to become a plateau when the approximate adult size is

reached. Simple linear regression is therefore clearly not the best tool to manage

this kind of data. As already mentioned in the previous sections, polynomial

models are sometimes used to represent longitudinal growth data, since they are

computationally convenient, but they are preferred for small periods, and their

major drawback is the difficult interpretation of the higher-order terms.

Another interesting alternative might be the use of smoothing methods, that can

help highlighting the shape of growth form “noisy” data, estimating a curve

without an a priori fixed parametric model. The most used techniques are

smoothing splines, kernel estimators, or local polynomials. They may represent a

good alternative in this case, where a lot of observations were registered at quite

close measurements, so data might be prone to measurements errors and short

terms variations.

Regression models based on an adequate parametric function would be the best

alternative, since they are estimated using fewer parameters that have some

biological interpretation, but they are quite insidious: the choice of the appropriate

functional form is difficult, it is necessary to consider all the facets of the growth

process to reduce the risk of obtaining biased results, and their estimation is

computationally quite demanding.

Several models have been developed in time to represent phases of growth:

most structural models are monotonously increasing functions, that were designed

to describe the growth of dimensions for which only rising values are possible,

and only during the initial part of life (mostly infancy, more rarely adolescence),

so their appropriateness to model body weights during the whole lifespan must be

51

carefully evaluated; here only a small selection will be presented and applied to

data.

All will be illustrated as if they were referred to the individual: all functions,

indeed, can be substituted to the generic function at the level 1 of the nonlinear

multilevel mixed effect model that was illustrated before, since they describe the

behaviour of each individual.

The first parametric model was elaborated already in 1937 by Jenss and

Bayley: it was developed to describe data from birth to approximately 8 years

using 4 parameters combined in a function with a linear and an exponential part,

accounting for growth and its decreasing rate:

𝑦 = 𝑎 + 𝑏 𝑡 − 𝑒𝑐+𝑑 𝑡

Another option is the Count model proposed in 1942, that uses only 3

parameters combined in a linear way

𝑦 = 𝑎 + 𝑏 𝑡 + 𝑐 ln(𝑡 + 1)

This model proved to perform slightly worse than the Jenss and Bayley, but

both share the quality of remaining robust relative to the choice of starting values

for the parameters. The Count model was slightly modified in 1973 by Berkey and

Reed in a way that maintained the simple, linear structure but added one or two

parameters

1𝑠𝑡 𝑜𝑟𝑑𝑒𝑟: 𝑦 = 𝑎 + 𝑏 𝑡 + 𝑐 ln(𝑡 + 1) +𝑑

𝑡

2𝑛𝑑 𝑜𝑟𝑑𝑒𝑟: 𝑦 = 𝑎 + 𝑏 𝑡 + 𝑐 ln(𝑡 + 1) +𝑑1

𝑡+

𝑑2

𝑡2

accommodating for one or two additional inflection points (and so allowing to

consider periods of acceleration), and leading to a better fit compared to the

previous alternatives.

Adolescence growth was initially analysed using the logistic function

𝑦 = 𝑝 +𝑘

1 + 𝑒𝑎−𝑏 𝑡

and the Gompertz function

𝑦 = 𝑝 + 𝑘 𝑒−𝑒𝑎−𝑏 𝑡

52

because they well reflected its main features of a sigmoid trend, starting from

a lower “asymptote”, experiencing a sharp increase in velocity and, after an

inflection point, a decline in the growth rate until the approximation to the upper

asymptote, the adult size. Both models proved to perform quite good, but they

required the lower age bound (the cut-off between childhood and adolescence) to

be determined arbitrarily, so their use became limited in time, in favour of

alternatives, such as the Preece-Baines model. Several models were also proposed

for the analysis of growth from birth or early childhood until adult age form the

‘80s, all sharing the fundamental idea of combining different (at least two or

three) functions, so that each component could describe the type of growth typical

of a smaller period of life. The first were the double- and triple- logistic functions,

composed by the sum of separate logistic functions, each to model infancy, mid-

childhood and adolescent phases. These were modified by Bock and colleagues in

1994 replacing each logistic with a generalized logistic function; more options

were provided, between other, by Shohoji- Sasaki who proposed the Count-

Gompertz function in 1987, or by Jolicoeur and colleagues, that created the JPPS,

the JPA-1 and JPA-2 models. Despite all these methods would be extremely

interesting to analyse more in depth, they will not be further treated here, since

they were created specifically to model height/length instead of weight, and they

were formulated quite specifically to render the human progress from infancy,

trough childhood, the so-called take off, the whole adolescence spurt, until

adulthood. Our data hardly share these characteristics, so simpler models will be

preferred.

4.3 The analysis of body weights

The analysis of body weights in the first generation of rats, those originally

randomized and involved in the experiments, was not performed here, but it was

chosen to study only the rats belonging to the second generation, since their

characteristic make them more interesting, challenging, and overall more worth a

thorough analysis.

53

We can say that their treatment started in the pre-natal phase, because their

dams underwent mating and started gestation one week after the beginning of the

experiment, when they were already exposed to the treatment or control regime;

moreover, they continued the exposure during the whole period of the pregnancy

and the weaning. The observation of the second generation of rats started at 8

weeks of age8, when all pups have certainly reached a complete independence

from their dams.

Age of dams Males Females

Treated Control Treated Control

30 weeks 74 110 73 98

39 weeks 67 49 65 55

55 weeks 28 32 24 24

Total 169 191 162 177

Table 8: Rats by sex, treatment and age of dams at start of gestation

Male and females from treated and control groups are quite balanced, while

there is an important decrease as the age of the dams at the beginning of gestation

increases, as we would physiologically expect.

Measurement Missing

n %

Week 9 (1 of observation) 0 0

Week 30 13 0,018

Week 70 78 0,111

Week 112 348 0,498

Week 114 367 0,525

Week 122 639 0,914

Week 130 688 0,984

Week 146 698 0,998

Table 9: Lost to follow-up in absolute number and percentage at selected

measurement occasions

The inspection of individual and mean weights per experimental group and age

of the dams at pregnancy showed some important features that are to be

considered in the following analysis:

8The “time” variable will be centered in the analysis, so that it will not represent the weeks of age

but the weeks since the beginning of observation and treatment. The equivalence between these

two measures is extremely straightforward (ctime= time-8).

54

• observations for all animals are present only for the first occasions, then they

start to experience mortality; by week 114, when rats are in their elderly age

(and measurements of body weight start being performed every 8 weeks) half

of the original population remains; since later assessments happens more

sporadically, they involve fewer and fewer subjects, as from table 3;

Figure 12: Average weight for females and males by treatment and age of dams at

start of gestation

• as expected, males are heavier than females, the treatment proportionally

affects body weight, and the age of dams at pregnancy doesn’t seem relevant

for rats from the control groups, while it corresponds to sensibly different

paths for the treated;

• the weights sharply increase during the first weeks of observation, until

around the age of 13 weeks, then the growth stops or continue at a slower pace

once the adult size is reached;

55

Figure 13: Growth trajectory of body weights from a random sample of II

generation rats.

• variability is very important, both within subjects, and among them. Most

growth lines present indeed a lot of erraticism that is usually attributable to

small variations in the health status of the subjects, that are not relevant but,

since the measurements are very frequent, are registered. All trends,

nevertheless, are quite similar in shape during the first period of growth, while

during the adult/elderly period peculiar patterns appear: very often, weights

decrease in the very last part of life, because of diseases, or the physiologic

ageing process. Some animals, in contrary, experience an extreme (in

magnitude and velocity) increase of weight that is likely due to the presence of

neoplastic masses.

The conclusions that can be drown from this exploration of data is that models

will probably suffer from this pronounced variability, that it is important to

consider all the variables that characterize the data (sex, treatment and the age of

dams at gestation), and that linear models can’t be directly fitted to data, so

several expedients will be applied, and their use and suitability evaluated. These

56

can be resumed in (I) mathematically manipulating variables so that the linear

assumption appears reasonable, (II) using polynomial functions to represent time,

so that a non-linear trend can be modelled using a linear function and, finally, (III)

fitting nonlinear “human” growth models. Furthermore, since the observations in

the last occasions of measurement are drastically reduced and represent very old

rats, that likely experience weight variations according to their health status more

than to the treatment regime, the observations registered after week 114 will not

be considered.

The first step to analyse body weight was fitting linear mixed effects models;

to do so, some mathematical manipulation of data was necessary, so that the

assumption of linearity remained reasonable; after several tries9,the log-

transformation of the time variable was chosen.

It was chosen to compare the estimates and performance of two models that

could be justifiable and meaningful for these data: one only allows for random

intercept and slopes for each litter, since, at least in the first part of life, the

weights of each pup within the same litter tend to be quite similar, but the

differences among litters can be remarkable; they tend to level-up with time, but a

great variability at individual level remains. The alternative possibility was to

include a random intercept only at the litter level, and to include the subject level

in the analysis, allowing every subject within each litter to have its own random

intercept and slope. Information criteria and likelihood-ratio tests favoured this

last option, since such a setting would probably help explaining the peculiar

growth paths in adult/elderly life we highlighted before, refining the fit of the

models and reducing the unexplained variability; on the other hand, it probably

unnecessary weights the model down.

Fixed effects were assessed in a very straightforward way, since they consist of

experimental conditions: the relevance of sex, treatment received, and the age of

the dams at the beginning of gestation was evaluated, and the first two were found

to have a relevant influence on body weights. Variables and model selection were

performed according to statistical significance of the estimates, to the results of

9Weight raised at the power of 2, 2.5, 3, 3.5, 4; natural logarithm of time, time1/2, 1/time, 1/time2.

57

likelihood ratio tests in case of nested models, or considering information criteria

as AIC and BIC, and, as usual, pairing all this with the knowledge of the

phenomenon under study and biological meaning. The last consideration regards

the fact that this type of repeated measurements can’t be independent between

subsequent assessments, but each value highly depends on the previous ones, with

correlation decreasing the more the observations become far apart in time. To

account for this, the possibility of shaping the structure of the residual errors at the

individual level was evaluated. The variance-covariance structure was built to

have serially correlated residuals using an exponential specification, where

𝑐𝑜𝑣(휀𝑖𝑡, 휀𝑖𝑡′) = 𝜎𝜀 2 𝑒𝑥𝑝{−𝛾(𝑡′ − 𝑡)}

which tends to the variance the closer are the measurements, and exponentially

decreases to zero as they become more distant in time.

Observations 36,606 Model 1 Model 2

Number of groups 98 Coef s.e. Coef s.e.

Fixed part

Time (ln(centered

time))

77.688*** (1.980) 76.405*** (0.914)

Treatment

12.170** (5.290) 17.858*** (4.837)

Sex

158.158*** (0.543) 66.573*** (3.144)

Constant

67.011*** (4.513) 100.786*** (3.810)

Random part

Litter: Unstructured (1) sd(logct) 19.401 (1.430)

Identity (2) sd(_cons) 35.886 (3.193) 17.480 (2.325)

corr(logct,_cons) -.707 (.061)

Subject: Independent

(2) sd(logct)

18.329 (.782)

sd(_cons)

1.49e-07 .

Residual: Exponential

(2) rho(e)

.395 (.015)

sd(Residual) 47.552 (.176) 51.360 ( 1.007)

Goodness of fit

Log restricted-

likelihood

-193723.6 -155124.1

AIC

387463.1 310264.3

BIC

387531.2 310332.4

*** p<0.01, ** p<0.05,

* p<0.1

58

Table 10: Multilevel mixed effects models using transformed variables; results of

regression of weight on ln(time, centered) and experimental variables.

The models give quite similar results in terms of significance and type of effect of

each variable. If we consider only random differences at the litter level, we find

out that female rats weight approximately 70 g at their first assessment at 8 weeks

of age while males are sensibly heavier, of around 160 g; treatment is responsible

of a smaller, but still relevant, increase in weight, of around 12 g. The

physiological growth that all animals experience in time is around 77 g for

females of the control group, per unit of time (which is on logarithmic scale).

The graphical representation of the fixed effects in figure 4 and6, and of the fixed

and random effects for some selected litters in figure 5may help clarify these

findings.

Figure 14: Plot of data and estimated fixed effects, model 1

59

Figure 15: Data and linear prediction with fixed and random portion in four

selected litters, model 1.

Figure 16: Plot of data and estimated fixed effects, model 2.

60

Figure 17: Data and linear prediction with fixed and random portion in four

selected litters, model 2.

61

Residuals were analysed to verify whether the underlying basic assumptions

(linearity of the relationship between the outcome and the regressors, normality

and homoscedasticity of residuals) were met, and they highlighted several

problems.

Figure 18: Evaluation of linearity: comparison between growth trajectory and

linear regression in three random samples

As it was clear from the beginning, linearity is hardly respected even when a

transformation of the variables is used: the main problem clearly arises in the

relation between growth of body weights and time, since even if a linear trend can

approximate reasonably well the growth in the first period of life, the

unpredictable variability in the last phase is often too high.

The representation of normal probability plots and standardized residuals

estimated for each level of the models shows more issues that can probably be

attributed, again, to the extreme variability of body weights among rats: too many

observations present extreme values both in the upper and in the lower tail of the

distribution.

62

Figure 19: Evaluation of normality and homoscedasticity of residuals using

Normal probability plots and standardized residuals.

Even the assumption of homoscedasticity raises some doubts: coherently with

what was highlighted so far, body weights can grow to quite extreme values, or

fall sharply in relatively short time, mostly in adult and aged rats. The fact that

standardized residuals tend to increase with time is therefore not so surprising.

These analyses show that indeed linear mixed effect models are a powerful tool to

represent the multilevel nature of data, to account for the fact that they are

repeated measurements of the same physical trait, and to handle the lack of

randomization of this particular dataset, too. The great variability in the possible

growth paths and its non-linear trend in time doesn’t allow, nevertheless, to

consider those illustrated above as the optimal models for this type of analysis,

and to take the estimates as definitive. Given these drawbacks, a different

approach was tried: the same models were fitted using polynomial terms to

represent time, so that the trajectory of growth could be better represented.

The models were built considering again sex, experimental treatment and age

of each dams at the beginning of gestation as covariates to include in the structural

part; it was chosen to include a random effect depending on the variable “time” at

the familiar level.

63

Observations 36,606 Quadratic Cubic

Number of groups 98 Coeff s.e. Coeff s.e.

Fixed Part .

.

Time

6.906*** (0.178) 11.871*** (0.266)

Time 2

-0.048*** (0.002) -0.174*** (0.006)

Time 3 .0008*** (.00003)

Treatment

9.039* (5.224) 11.270** (5.221)

Sex

158.225*** (0.613) 158.184*** (0.569)

Constant

162.674*** (3.790) 124.929*** (3.975)

Random Part

Litter:

Unstructured sd(ctime) 1.715 '(.102) 2.456 .

sd(ctime2) .015 (.001) .049 .

sd(ctime3)

.0003 .

sd(_cons) 27.184 (2.188) 29.202 .

corr(ctime,ctime2) -.901 . -.883 .

corr(ctime,ctime3)

.752 .

corr(ctime,_cons) -.209 . -.401 .

corr(ctime2,ctime3)

-.962 .

corr(ctime2,_cons) .342 (.023) .438 .

corr(ctime3,_cons)

-.368 .

sd(Residual) 53.517 (.198) 49.662 .

Goodness of Fit

Log restricted likelihood -198178.5 -195553.3

AIC 396377 391118.7

BIC

396462.1 391169.7

*** p<0.01, ** p<0.05, * p<0.1

Table 11 : Multilevel mixed effects models using polynomial terms; results of

regression of body weights on second and third degrees polynomial terms for time

and experimental variables

Quadratic and cubic models were evaluated adding the second and the third

power of variable “time” in the model equation, and the cubic was found to be the

better for these data. The results are not so dissimilar to those obtained before,

most of the differences among body weights can be explained by sex, but the

consumption of Coca Cola ad libitum is a relevant factor, as well.

64

Figure 20: Plot of data and estimated average predictors, cubic model.

Differences among litters are relevant, too, so it’s worthy adding random

effects at this level.

Figure 21: Data and linear prediction with fixed and random effects in four

selected litters, cubic model.

65

The display of residuals highlights that the problems arising from the extremes

behaviour of weights of elderly or ill rats remains quite evident: residuals can’t

really be said to distribute normally, and appear quite heteroscedastic in relation

to time, too, as can be seen in figures 12.

Figure 22: Evaluation of the assumption of normality and homoscedasticity of

residuals, cubic model.

Polynomial terms have proven to be a worthy expedient to be able to model curve

trajectories remaining in the framework of linear mixed effects models, but the

noisy nature of data continues to create issues that this model can hardly handle

and that may prevent to consider the results reliable.

A third option was attempted, just in an explorative way: a generalized additive

mixed effect model was fitted, where time was not included directly, but trough a

smoothed function. It was then added to the other regressors, always keeping a

multilevel structure with random effects at litter level. This attempt proved quite

effective in taking into account time and model its irregular trend in quite a simple

way. It may be proficiently used to explore data, and eventually to model them,

but the thorough evaluation of this method is left for the next future.

In this occasion it was chosen to explore more in depth the models for human

growth, and to try to apply them to animal growth. The choice of which to use

was based on the possibility to adapt them to the available data in a relatively easy

way, that means that the models had not to be too strictly bonded to

characteristics, rhythms and timings typical of human growth only. It was indeed

showed (Sengupta 2013) that rat and human life are correlated, taking the whole

life span into account, so that one human year equals 13.2 rat days; it is better,

66

nevertheless, to consider different phases separately, since while humans develop

very slowly, rats have a very accelerated pace, and reach sexual maturity around 6

weeks of age: according to this equivalence, the rats that we are observing could

be considered adolescents, but as becomes clear from Table 5, it is impossible to

draw an exact and constant equivalence between rats’ age and humans’ age.

Period Correspondence to 1 human year

Entire life span 13.2 rat days

Weaning period 42.4 rat days

Pre-pubertal period 3.3 rat days

Adolescent period 10.5 rat days

Adulthood 11.8 rat days

Aged phase 17.1 rat days

Average 16.4 rat days

Table 12: Correspondence of one human year with rat days at different phases of

life, from Sengupta (2013)

This is the reason behind the choice of not considering those parametric,

nonlinear models that were designed to account for the specific features and

mechanisms of human growth, but to focus on those that could be used to describe

a path, similar for intensity and velocity to that of humans during young age.

Some of the models that might answer to these needs are the Jenns and Bayley,

the Count and the 1st order Berkey and Reed model, that are usually employed to

analyse growth during adolescence, and are therefore built to describe a steep

growth during the first part of the curve, that slows down at the reaching of the

approximate adult size. They were fitted to data to verify which could be the best,

and the choice fell on the Berkey and Reed model, basing the decision on the

graphical examination of the curves and on the comparison of the measures of

goodness of fit. The model uses four parameters to describe the specific functional

form of the individual growth curve

1𝑠𝑡 𝑜𝑟𝑑𝑒𝑟: 𝑦 = 𝑎 + 𝑏 𝑡 + 𝑐 ln(𝑡 + 1) +𝑑

𝑡

that may represent the starting point, growth rate, acceleration and deceleration of

growth, so that the trajectory is allowed to be curvilinear and to have one or more

67

inflection points. Here, the function was built so that each of the parameters

would be dependent on age10, sex and treatment received; a random effect at the

litter level was introduced and evaluated for all of them, so that each litter is not

constrained to have the same intercept, or, for example inflection point.

Coef. Std. Err. z P>|z| [95% Conf. Interval]

a sex 565.789 (33.214) 17.035 (0.000) 500.691 - 630.887

treat -380.528 (33.304) -11.426 (0.000) -445.803 - -315.252

_cons 655.279 (28.321) 23.137 (0.000) 599.770 - 710.788

b sex -0.150 (0.122) -1.224 (0.221) -0.390 - 0.090

treat -0.846 (0.196) -4.310 (0.000) -1.230 - -0.461

_cons 1.567 (0.147) 10.639 (0.000) 1.279 - 1.856

c sex 78.149 (9.415) 8.301 (0.000) 59.696 - 96.601

treat -118.262 (9.444) -12.522 (0.000) -136.773 - -99.751

_cons 89.059 (8.016) 11.110 (0.000) 73.347 - 104.770

d sex -3,381.295 (137.819) -24.534 (0.000) -3,651.416 - -3,111.175

treat 1,330.012 (138.127) 9.629 (0.000) 1,059.288 - 1,600.736

_cons -3,125.875 (117.836) -26.527 (0.000) -3,356.830 - - 2,894.921

Litter: Identity var(U0) '.567 '.081

.427 - .752

var(Residual)

1966.77 14.557

1938.44 – 1995.51

Figure 23: Estimates from Berkey- Reed model, offspring

The estimated parameters are not easy to interpret; what is relevant is that the

treatment is responsible for significant differences in all parameters. This becomes

very clear if we represent the curves graphically: the plot for some randomly

selected litters shows that the model fits most of the growth trajectories quite

good, while the average curves for treated and controls, separated by sex, show

that both these covariates affect sensibly body weights.

10 This time was considered without centering or transformation, since this model requires time to

be expressed as “age”.

68

Figure 24:Plot of data and estimated average predictions, Berkey-Reed growth

model

Figure 25: Plot of data and linear predictions with fixed and random part in four

randomly selected litters, Berkey-Reed growth model

69

The examination of residuals points out the same problems that emerged with

all other analysis: the residuals can hardly be considered to respect all the

assumptions that underlie the model.

Figure 26: Evaluation of the assumptions of normality and homoscedasticity of

the model using residuals.

At this point it is useful to draw some conclusions: these data have some very

peculiar characteristics, the most relevant for their analysis are non-linearity and

the fact that they can take unexpected, extreme turns upwards or downwards,

mostly when rats are close to the end of life, reflecting the presence of important

neoplastic masses or a worsening of the health conditions. These features should

discourage the use of methods based on the comparison of measures of synthesis

like the group means, because they could be heavily affected by the atypical

recordings, giving an unrealistic picture of the situation and possibly preventing to

detect differences caused by experimental factors.

The use of multilevel mixed effects models is surely to be encouraged, since it

allows to analyse directly the recording of each subject, without concerning about

the differences in the duration of the recordings. They are also a precious tool in

case of clustered data, and in this rather peculiar experimental design, when no

randomization was performed on a whole cohort of rats. The most straightforward

specifications of such models using a proper linear function is not the best option,

because it requires a transformation of the variables, so that the advantage of a

simple functional forms is counterbalanced by the difficult interpretation of

transformed variables. Even the introduction of polynomial terms, that allow to

represent a curve trajectory remaining in the frame of a linear function, slightly

70

reduces the ease of the interpretation, but it can still be acceptable if it allows a

more faithful representation of the growth trajectories; as their plot showed,

nevertheless, it’s not always the case. The methods that we feel to choose as

preferred are the nonlinear growth models that were “borrowed” from human

studies: a wide variety exists, so one can select the most appropriate every time,

according to the characteristics of data. In this case, too, the model parameters are

not easy to interpret, but a clear indication is provided about the statistical

significance of each covariate and its ability to affect the outcome.

71

Chapter 5 Conclusions

The research centre that asked for this collaboration, that concretized in this

research, is active since decades in testing the carcinogenic potential of substances

and compounds that the population is frequently exposed to, for occupational or

lifestyle reasons, and was in several occasions the first to outline the health risks

connected to chemicals that are today recognized as human carcinogens (see (S.

C. Maltoni C. 1979) and (C. B. Maltoni C. 1983) for the example of benzene, or

(Belpoggi F 1995) for Methyl-tert-Butyl Ether). Nevertheless, in some occasions,

the results obtained from its experiments failed to be recognized by the regulatory

organs in charge because they were not sufficiently strong and universally

accepted from a methodological point of view, rather than because of a weak

scientific evidence.

Deepening the knowledge and understanding of the methods used for the

statistical analysis of their results is one of the strategies to enhance the quality of

the research, and it is the ground on which a strong scientific base can be built.

This work is an attempt in this direction: it aims to go beyond the techniques that

are performed routinely, to explore the characteristics of the data and to try to

understand the mechanisms that determined them, and, in this framework, to

propose some methodologies that manage to answer the research questions in a

more comprehensive way.

The data were chosen among those that were published long before, did not

contain any sensitive or reserved information and resented some features that

72

made them particularly challenging for the researchers to analyse. The choice fell

on the Coca- Cola study, a series of 4 experiments that started in the late ‘80s and

whose results were not found to be particularly controversial or worrying, and

were indeed published only in 2006. They presented the particularity of involving

two generations of subjects, the first randomized, the second included in blocks

into the experiment, to evaluate the effect of exposing pups in a particular window

of susceptibility, the perinatal period. They were also one of the few cases where

the exposure determined an important increase in body weight, so that this

measure needed to be considered with more attention than usual.

This work focused on two topics that are not the main focus of the interest in

carcinogenicity studies, but deserve a thorough analysis for the reasons that were

illustrated at the beginning of each of the two chapters dedicated to data analysis.

For the analysis of survival, the time-to-death of each individual was

examined, to evaluate the effect of the experimental regime received; two separate

analyses were performed, one involving breeders, the other offspring only,

because we assume the former observations to be independent, while the latter are

taken on siblings, so independence can’t be assumed. The effect of the exposure

was evaluated, while it was necessary to control for the influence of other likely

risk factors: this was possible thanks to the modelling approach. Proportional

hazard models were fitted, and the Accelerated failure-time model, parametrized

using a generalized gamma distribution, was used when the assumption of PH

didn’t hold. The use of frailties was evaluated: a univariate term was introduced

for the breeders, at the individual level, to control for possible unobserved

heterogeneity; a shared frailty was used to account for the lack of independence

among subjects from the second generation, that contained many siblings. The

results say that the treatment appears relevant in accelerating death among the

individuals exposed from adult life; the frailty could not be introduced in the best-

fitting model because of computational problems, so its influence remains to be

verified. The models estimated for the rats of second generation initially

confirmed the role of the treatment in increasing the risk of death, but when a term

for a common, unobserved source of heterogeneity was introduced to consider the

73

effect of familiarity among members of the same litters, it became the only

relevant variable.

These findings underline some very important principles: the first is the

importance of properly choosing the methods and specifying the models, where

properly means in a data-and-experience-driven way. A thorough knowledge of

the data and of the dynamics that contribute to determine them is always a good

starting point to build plausible, really representative and meaningful models.

Another crucial point is the fact that model checking and verification of the

respect of the assumptions that lie at the foundations of any method should

become a routine embedded in every analysis, while it remains not yet so common

(or, at least, not always explicitly reported) in the literature regarding

carcinogenicity bioassays.

The methods for the analysis of time-to-event are really flexible: here they

were use for the most “literal” application, to evaluate the general survival of the

population under study. They would be suitable for several other applications, if

the collection of additional information was possible: it would be interesting for

example to build a classification of tumours based on their lethality or

incidentality; being able to attribute a cause of death to each animal, or to register

the time of onset for those (rare) type of lesions that are detectable while the

animals are still alive. The collection of such additional information would greatly

enhance the possibilities for analysis, making possible to consider cause-specific

mortality or methods to handle competing risks. Another interesting possibility

that can immediately be evaluated and implemented with the information

available so far is the joint modelling of longitudinal measures (such as body

weights) with survival data.

The second part of this work was dedicated to the analysis of the longitudinal

measurements of body weights from the group of rats of second generation. The

main purpose was to find the best modelling approach to adequately describe the

process of growth of these animals, from their youth to the reaching of the adult

size; once this is assessed, the influence of the exposure to Coca-Cola and the

effect of other covariates has been also evaluated.

74

The data consisted of weekly recordings of body weights of young rats,

followed from the age of 8 weeks until spontaneous death: the observations are

very numerous and were collected regularly; they present a high variability among

individuals that is at least partially expected, because males and females can have

very different sizes, and sometimes a very high variability within individuals as

well, that is due to quite extreme weight increases or decreases due to health

issues or mortality.

All analyses were made in the framework of mixed- effects models, because

they are the best approach to treat longitudinal data and allow to include

additional level of grouping, that in this case consists of litters. It is therefore

possible to account for the structural effect of covariates, that we expect to act the

same way on all individuals, and to add random effects that introduce a

correlation among the subjects within the same group (here: within the pups of the

same litter.

Mixed effects can be included in a variety of models: here several were

explored, and their suitability has been evaluated. The first was a linear

formulation, which is reasonable only if variables undergo a transformation to

linearize the trend of body weights, that increase very rapidly in the first period

and gradually slow down. The log-transformation of the time variable was used.

The second was to build a linear function with polynomial terms for time, to allow

the trajectory to assume a curve trend; the cubic function of time, that allows for

two inflections points, was found to be the best option. Finally, growth models

were fitted: they are non-linear models that were specifically created to study

human development. We selected those that could easily be adapted to the animal

context, since they were less sophisticated to catch characteristics specific of

human and infant growth. The Berkey and Reed model proved the best for this

purpose.

Even if it's not possible to directly compare the results of these models using

the measures of information criteria, we can still draw some conclusions. Growth

models represent an interesting way to analyse the body weights of young rats,

since not only they allow to evaluate the differences among experimental groups,

75

but also give information about the growth process. Choosing the appropriate

formulation for the data allows to represent quite precisely the shape of the

trajectories; they may become useful in those studies where the development and

sexual maturation of the animals are investigated, because they can be used to

obtain an estimate of the start and development of the adolescent period.

They were not explicitly considered here, but these analyses of course are

suitable for adult rats as well; for this purpose, a linear model or a quadratic

polynomial of time should be the best options, because rarely growth models are

built to model the weight trajectories during the whole lifespan of individuals, so a

simpler and more efficient alternative is to be preferred.

Some issues remain open, like the problem of how to consider and treat the

extreme trends that some individual experience in the last part of their life; it is an

interesting feature, that is usually associated with the worsening of the health

conditions, but it can also affect heavily the estimates and the overall likelihood of

the models. It could be useful to this regard to further study and develop the

application of general additive mixed models to these data, that were only briefly

explored during this analysis but seemed quite promising. Allowing to include a

nonparametric, smoothed function of time in the classic linear regression function,

this approach can handle variables with an irregular trend like these body weights,

and at the same time maintain a simple and understandable interpretation for the

experimental variables.

76

References

Belpoggi F, Soffritti M, Maltoni C. “Methyl-tert-Butyl Ether (MTBE) – a gasoline

additive – causes testicular and lymphohaematopoietic cancers in rats.”

Toxicol Ind Health 11 (1995): 119-149.

Bretagnolle J., Huber-Carol C. “Effects of omitting covariates in Cox’s model for

survival data.” Scandinavian Journal of Statistic 15 (1988): 125–138.

Buckley I. V., James I. “Linear regression with censored data.” Biometrika, 1979:

429–436.

Calle E.E., Rodriguez C., Walker-Thurmond K., Thun M.J. “Overweight, obesity,

and mortality from cancer in a prospectively studied cohort of U.S.

Adults.” New English Journal of Medicine, 2003: 1625-38.

Cancer, International Agency for Research on. “Preamble.” In IARC monographs

on the Evaluation of Carcinogenic Risks to Humans. Lyon: WHO- IARC,

2006, last update September 2015.

Clayton, D.G. “A model for association in bivariate life tables and its application

in epidemiological studies of familial tendency in chronic disease

incidence.” Biometrika 65 (1978): 141 - 151.

Clayton, D.G. “A model for association in bivariate life tables and its application

in epidemiological studies of familial tendency in chronic disease

incidence.” Biometrika 65 (1978): 141-151.

Collaboration for Global Burden of Disease Cancer. “The global burden of cancer

2013.” JAMA Oncology 1, no. 4 (2015): 505-527.

Collett, D. Modelling survival data in medical research. 2nd edition. Boca Raton:

Chapman &Hall/CRC, 2003.

Cook R.D., Johnson M.E. “A family of distributions for modeling non-elliptically

symmetric multivariate data.” The Journal of the Royal Statistical Society

B, no. 43 (1981): 210 - 218.

Cox, D. R. “Regression models and life tables (with discussion).” Journal of the

Royal Statistical Society B, no. 74 (1972): 187-220.

77

Fitzmaurice G, Molenberghs G. “Advances in longitudinal data analysis: an

historical perspective.” In Longitudinal data analysis: a Handbook of

modern statistical methods, by Davidian M, Verbeke G, Molenberghs G.

Fitzmaurice G, 3-30. London: Chapman & Hall/CRC Press, 2008.

Hothorn, L. “Statistical evaluation of toxicological bioassays- a review.” Toxicol

Res 4 (2014): 418.

Hougaard, P. Analysis of multivariate survival data. New York: Springer, 2000.

Laird N M, Ware J H. “Random effects models for longitudinal data.” Biometrics

38 (1982): 963–974.

Lauby-Secretan B, Scoccianti C, Loomis D, Grosse Y, Bianchini F, Straif K.

“Body fatness and cancer – viewpoint of the IARC Working Group.” The

New England Journal of Medicine, 2016: published online, 25 August

2016.

Lindstrom M., Bates D. “Nonlinear mxed effects models for repeated measures

data.” Biometrics 46, no. 3 (September 1990): 673-668.

Maltoni C., Conti B., Cotti G. “Benzene: A multipotential carcinogen. Results of

long-term bioassays performed at the Bologna Institute of oncology. .. , ),

pp. .,.” American Journal of Industrial Medicine 4, no. 5 (1983): 589-630.

Maltoni C., Scarnato C. “First experimental demonstration of the carcinogenic

effects of benzene. Long-term bioassays on Sprague-Dawley rats by oral

administration.” Medicina del Lavoro 70 , no. 5 (1979): 352-357.

OECD. “Guidance document 116 on the conduct and design of chronic toxicity

and carcinogenicity studies, supporting Test Guidelines 451, 452 and 453,

2nd edition.” Paris: OECD, 13 April 2012.

OECD. Guidance Notes for Analysis and Evaluation of Chronic Toxicity and

Carcinogenicity Studies. Organisation for Economic Co-operation and

Development, 2002.

OECD. Test No. 451: Carcinogenicity Studies. Paris: OECD Publishing, 2009.

RAPP, K. et al. “Obesity and incidence of cancer: a large cohort study of over

145000 adults in Austria.” British Journal of Cancer 93 (2005): 1062–

1106.

78

Sengupta, P. “The laboratory rat: Relating its age with human’s.” International

Journal of Preventive Medicine 4 (2013): 624-30.

Singer J. D., Willet J. B. Applied longitudinal data analysis. Modeling change and

event occurrence. Oxford: Oxford University Press, 2003.

Therneau T.M., Grambsch P.M. Modeling Survival Data. New York: Springer,

2000.

Vaupel J.W., Manton K.G., Stallard E. “The impact of heterogeneity in individual

frailty on the dynamics of mortality.” Demography 16 (1979): 439–454.

Wei, L.J. “The accelerated failure time model: A useful alternative to the cox

regression model in survival analysis.” Statistics in Medicine 11, no. 14-15

(1992): 1871–1879.

WHO. IPCS Environmental Health Criteria 70: Principles for the safety

assessment of food additives and contaminants in food. Geneva: World

Health Organization, 1987.

79

Appendix 1: Analysis of residuals for survival analysis

1.1 Cox Proportional Hazard model, breeders

80

Test of proportional-hazards assumption Time: Time ---------------------------------------------------------------- | rho chi2 df Prob>chi2 ------------+--------------------------------------------------- 0b.sex | . . 1 . 1.sex | -0.05423 3.84 1 0.0499 0b.treat | . . 1 . 1.treat | -0.02560 0.85 1 0.3570 ------------+--------------------------------------------------- global test | 4.78 2 0.0915 ---------------------------------------------------------------- Time: Log(t) ---------------------------------------------------------------- | rho chi2 df Prob>chi2 ------------+--------------------------------------------------- 0b.sex | . . 1 . 1.sex | -0.05282 3.65 1 0.0562 0b.treat | . . 1 . 1.treat | -0.02284 0.68 1 0.4113 ------------+--------------------------------------------------- global test | 4.40 2 0.1107 ----------------------------------------------------------------

81

Time: Kaplan-Meier ---------------------------------------------------------------- | rho chi2 df Prob>chi2 ------------+--------------------------------------------------- 0b.sex | . . 1 . 1.sex | -0.05994 4.70 1 0.0302 0b.treat | . . 1 . 1.treat | -0.03101 1.24 1 0.2645 ------------+--------------------------------------------------- global test | 6.06 2 0.0483 ----------------------------------------------------------------

1.2 Accelerated failure-time model, generalized Gamma distribution, breeder

82

1.3 Cox Proportional Hazard model, offspring

83

Test of proportional-hazards assumption Time: Time ---------------------------------------------------------------- | rho chi2 df Prob>chi2 ------------+--------------------------------------------------- 0b.sex | . . 1 . 1.sex | -0.05304 2.01 1 0.1568 0b.treat | . . 1 . 1.treat | -0.00458 0.01 1 0.9029 30b.momstart| . . 1 . 39.momstart | -0.07824 4.30 1 0.0382 55.momstart | -0.07424 3.86 1 0.0494 ------------+--------------------------------------------------- global test | 7.81 4 0.0986 ---------------------------------------------------------------- Time: Kaplan-Meier ---------------------------------------------------------------- | rho chi2 df Prob>chi2 ------------+--------------------------------------------------- 0b.sex | . . 1 . 1.sex | -0.03200 0.73 1 0.3928 0b.treat | . . 1 . 1.treat | -0.02653 0.50 1 0.4796 30b.momstart| . . 1 . 39.momstart | -0.07956 4.44 1 0.0350 55.momstart | -0.05326 1.99 1 0.1585 ------------+--------------------------------------------------- global test | 6.25 4 0.1815 ----------------------------------------------------------------

84

Time: Rank(t) ---------------------------------------------------------------- | rho chi2 df Prob>chi2 ------------+--------------------------------------------------- 0b.sex | . . 1 . 1.sex | -0.03189 0.72 1 0.3945 0b.treat | . . 1 . 1.treat | -0.02669 0.51 1 0.4769 30b.momstart| . . 1 . 39.momstart | -0.07962 4.45 1 0.0349 55.momstart | -0.05328 1.99 1 0.1584 ------------+--------------------------------------------------- global test | 6.25 4 0.1809 ----------------------------------------------------------------

1.4 Cox Proportional Hazard model with shared Gamma frailty, offspring

86

Test of proportional-hazards assumption Time: Time ---------------------------------------------------------------- | rho chi2 df Prob>chi2 ------------+--------------------------------------------------- 0b.sex | . . 1 . 1.sex | -0.04712 1.78 1 0.1826 0b.treat | . . 1 . 1.treat | 0.00100 0.00 1 0.9659 30b.momstart| . . 1 . 39.momstart | -0.03689 2.48 1 0.1151 55.momstart | -0.04387 3.17 1 0.0752 ------------+--------------------------------------------------- global test | 5.85 4 0.2108 ----------------------------------------------------------------

Time: Kaplan-Meier ---------------------------------------------------------------- | rho chi2 df Prob>chi2 ------------+--------------------------------------------------- 0b.sex | . . 1 . 1.sex | -0.02686 0.58 1 0.4474 0b.treat | . . 1 . 1.treat | -0.01614 0.48 1 0.4887 30b.momstart| . . 1 . 39.momstart | -0.03295 1.98 1 0.1593 55.momstart | -0.02387 0.94 1 0.3329 ------------+--------------------------------------------------- global test | 3.64 4 0.4572 ---------------------------------------------------------------- Time: Rank(t) ---------------------------------------------------------------- | rho chi2 df Prob>chi2 ------------+--------------------------------------------------- 0b.sex | . . 1 . 1.sex | -0.02673 0.57 1 0.4495 0b.treat | . . 1 . 1.treat | -0.01628 0.49 1 0.4850 30b.momstart| . . 1 . 39.momstart | -0.03300 1.99 1 0.1587 55.momstart | -0.02392 0.94 1 0.3320 ------------+--------------------------------------------------- global test | 3.65 4 0.4551 ----------------------------------------------------------------

allmmaa emmaatterr –ssttuuddiioorruumm o

Documents