Methods to improve the efficiency of confirmatory clinical trials Ruud Boessen


The development of new drugs is increasingly costly and ineffective. Most of the time and money goes to late-stage clinical trials that aim to confirm the safety and efficacy of the investigated drug. To ensure the continued arrival of new and affordable therapies, it is therefore essential to optimize the efficiency and success rates of these trials. This thesis discusses innovative trial methodology that contributes to this goal.

The research presented in this thesis was performed at the Julius Center for Health Sciences and Primary Care of the University Medical Center Utrecht, and is part of the Dutch Top Institute Pharma's Escher project.


Invitation

to attend the public defense of the thesis

Methods to improve the efficiency of confirmatory clinical trials

by

Ruud Boessen

on Thursday 28 March 2013 at 10:30 in the Senaatszaal of the Academiegebouw of Utrecht University, Domplein 29, Utrecht

Reception following the defense in the aula of the Academiegebouw

Ruud Boessen
Pearsonlaan 99
3527 CB Utrecht
[email protected]

Paranymphs:
Niels Schenk
[email protected]
+31 6 1683 6126

Niek [email protected]
+31 6 4391 0252


Methods to improve the efficiency of confirmatory clinical trials

PhD thesis, Utrecht University, the Netherlands, with a summary in Dutch

ISBN 978-90-5335-660-9
Author: Ruud Boessen
Cover by Marc Wehkamp
Lay-out by Nikki Vermeulen, Ridderprint BV, the Netherlands
Printed by Ridderprint BV, Ridderkerk, the Netherlands

© 2012 Ruud Boessen, Utrecht

Methods to improve the efficiency of confirmatory clinical trials

Methoden ter bevordering van de efficiëntie van conformatief klinisch onderzoek (met een samenvatting in het Nederlands)

Thesis

submitted in fulfillment of the requirements for the degree of Doctor at Utrecht University, by authority of the Rector Magnificus, Prof. dr. G.J. van der Zwaan, pursuant to the decision of the Board for Doctorates, to be defended in public on Thursday 28 March 2013 at 10:30 a.m.

by

Ruud Boessen

born on 19 October 1983 in Weert

Supervisors: Prof. dr. D.E. Grobbee, Prof. dr. K.C.B. Roes
Co-supervisors: Dr. M.J. Knol, Dr. R.H.H. Groenwold

The following parties are gratefully acknowledged for financially supporting the publication of this thesis: Chipsoft, GlaxoSmithKline, TNO, PRA and the Nederlandse Bijwerkingen Fonds.

The studies presented in this thesis were performed in the context of the Escher project (T6-202), a project of the Dutch Top Institute Pharma.

Table of Contents

Chapter 1   General introduction
Chapter 2.1 Clinical trial simulation in late-stage drug development
Chapter 2.2 Validation and predictive performance assessment of clinical trial simulation models
Chapter 3.1 Increasing trial efficiency by early reallocation of placebo non-responders in sequential parallel comparison designs: application to antidepressants
Chapter 3.2 Optimizing trial design in pharmacogenetics research; comparing a fixed parallel group, group sequential and adaptive selection design on sample size requirements
Chapter 3.3 Improving clinical trial efficiency by biomarker-guided patient selection
Chapter 4.1 Classifying responders and nonresponders; does it help when there is evidence of differentially responding patient groups?
Chapter 4.2 Comparing HAMD17 and HAMD subscales on sensitivity to antidepressant drug effects in placebo-controlled trials
Chapter 5   General discussion
Summary
Samenvatting
Dankwoord
Curriculum Vitae

Chapter 1
General introduction


Trends in drug development

In 1975, pharmaceutical companies spent on average about 140 million in today's US dollars on the research and development (R&D) of every drug that was authorized for marketing by the US Food and Drug Administration (FDA). By 1987, that number had increased to 320 million, and by 2005 it was estimated at over 1.5 billion [1-3]. Despite this sharp rise in R&D spending, the number of newly approved novel drugs (i.e. new molecular entities; NMEs) decreased (figure 1) [4-6]. In fact, 2010 represented a 20-year low, with only 15 NMEs receiving marketing approval [1]. More expensive and less effective drug development is likely to affect the industry's future revenues [7]. More importantly, these trends will also affect patient care, since higher R&D costs slow down innovative research and translate into fewer new and affordable drug therapies. In addition, high development costs are an incentive for pharmaceutical companies to focus on compounds with a large potential to recoup investments, which could restrain the development of cures for rare and third-world diseases [8].

[Figure 1 appeared here: a bar chart of the number of NMEs approved per year from 1996 to 2011, overlaid with total R&D spending in US$ billions.]

Figure 1: Number of approved new molecular entities (NMEs) and total research and development (R&D) spending by the leading pharmaceutical companies in the US. Source: 2011 profile: pharmaceutical industry.

In most countries, the law requires that a compound intended for human use goes through adequate and well-controlled clinical testing before it is submitted for marketing approval. In the US and Europe, this requirement has led to a highly regulated and standardized three-stage development trajectory for any compound that has emerged from basic research and animal testing. These stages begin with phase I clinical trials that test an experimental treatment for the first time in a small group of typically healthy volunteers (n<50) to evaluate its safety, determine a safe dosage range, and identify potential side effects. If the compound passes these tests, it moves on to phase II. In this stage, it is given to a larger group of patients (n=50-300) to establish its effect in treating a particular condition, symptom or illness (i.e. to show 'proof-of-concept'), and to further evaluate its safety. Only after these stages does a drug candidate move on to phase III. In this stage, the drug is tested against placebo or currently available treatments in a large number of patients to confirm its efficacy, monitor side effects and collect information that allows it to be used safely in medical practice. The large number of patients in phase III clinical trials is necessary to protect against statistical issues that may occur with smaller samples, and serves to provide a comprehensive and reliable picture of the drug's benefits and risks [9,10]. Large sample sizes also enable the detection of side effects and complications that affect only a small portion of the patient population and would go unnoticed in smaller trials.

Nowadays, the drug development process is also increasingly divided into early-stage and late-stage, or exploratory and confirmatory, studies. The early learning stage (roughly corresponding to phase I and early phase II) aims to accumulate knowledge on the drug candidate, to maximize its medical potential and to determine whether continued investment is justified. Late-stage confirmatory trials (late phase II and phase III), by contrast, focus specifically on testing the hypothesis that the drug is effective in treating the target indication in a specified patient population.

Overall, late-stage confirmatory clinical trials are estimated to represent about 50 percent of total R&D expenditures [2,11,12]. However, these estimates are based on all pharmaceutical candidates that companies test, and include many compounds that never even reach phase III. When the analysis is confined to drugs that eventually obtain marketing approval, confirmatory studies represent a much larger percentage of the total development costs. Reducing the costs and failure rate of confirmatory clinical trials is therefore a crucial step in the effort to improve the affordability and output of drug development [13,14]. This thesis discusses a number of approaches that could help to achieve this goal.
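The dependence of confirmatory sample sizes on the expected treatment effect can be made concrete with the standard normal-approximation sample-size formula. This sketch is not from the thesis; the numbers (two-sided alpha of 0.05, 80% power, a standardized effect of 0.3 SD) are illustrative assumptions.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Patients per arm for a two-sample z-test of means (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # controls two-sided type I error
    z_beta = NormalDist().inv_cdf(power)            # controls type II error
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# A small standardized effect of 0.3 SD (an assumed, illustrative value)
# already requires a few hundred patients per arm:
print(sample_size_per_arm(delta=0.3, sd=1.0))  # -> 175
```

Halving the detectable effect quadruples the required sample size, which is why confirmatory trials of drugs with modest effects are so large.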

Outline of this thesis

The topics discussed in this thesis can be divided into three sections, and chapters are numbered accordingly. The first section is about clinical trial simulation (CTS), a statistical simulation technique that makes it possible to mimic the conduct of a trial in order to anticipate problems and suggest probable solutions before the trial is started. Chapter 2.1 provides a review of recent developments in the use and methodology of CTS, based on published literature. Chapter 2.2 is a letter in response to an article on CTS and focuses specifically on the importance of validation and predictive performance assessment of CTS models.


The second section is about clinical trial designs that allow design aspects of an ongoing trial to be modified based on incoming data, and hence provide the flexibility to respond to emerging knowledge as the trial progresses. Chapter 3.1 includes an evaluation of the sequential parallel comparison (SPC) design in antidepressant clinical trials. The SPC design was proposed to reduce the impact of placebo response, and is expected to require fewer patients to establish treatment efficacy than a conventional parallel group design. Chapter 3.2 includes a comparison of several trial designs for situations where there is some, but inconclusive, evidence of effect modification by a genomic marker. Two different two-phase designs are evaluated that allow early stopping for efficacy or futility, or in addition allow the inclusion criteria to be modified after an interim analysis to comprise only patients from the most promising patient subgroup. These designs are compared to a conventional parallel group design on sample size requirements. Chapter 3.3 addresses the situation where a patient's baseline value on a (bio)marker, or short-term (bio)marker changes in response to brief treatment exposure, predicts long-term drug response in a clinical trial. We evaluate study designs that use this information to select specific patients for randomization, and compare the performance characteristics of these designs to those of a conventional parallel group design.

The third section is about design aspects relevant to every clinical trial: endpoint selection and statistical analysis. Chapter 4.1 discusses the implications of dichotomization when continuous outcomes are bimodally distributed, and includes an example based on empirical data from antidepressant clinical trials. Chapter 4.2 involves a comparison of several established depression rating scales on their ability to differentiate between patients on active treatment and placebo in antidepressant clinical trials.


Reference list
(1) PhRMA. 2011 profile: pharmaceutical industry. 2011. <http://www.phrma.org/sites/default/files/159/phrma_profile_2011_final.pdf> (April 2011).
(2) DiMasi JA, Hansen RW, Grabowski HG. The price of innovation: new estimates of drug development costs. Journal of Health Economics 2003; 22(2):151-185.
(3) DiMasi JA. Risks in new drug development: approval success rates for investigational drugs. Clinical Pharmacology and Therapeutics 2001; 69(5):297-307.
(4) Bunnage ME. Getting pharmaceutical R&D back on target. Nature Chemical Biology 2011; 7(6):335-339.
(5) Woodcock J, Woosley R. The FDA critical path initiative and its influence on new drug development. Annual Review of Medicine 2008; 59:1-12.
(6) Booth B, Zemmel R. Prospects for productivity. Nature Reviews Drug Discovery 2004; 3(5):451-456.
(7) Gilbert J, Henske P, Singh A. Rebuilding big pharma's business model. In Vivo 2003; 21:1-4.
(8) Trouiller P, Olliaro P, Torreele E, Orbinskij J, Laing R, Ford N. Drug development for neglected diseases: a deficient market and a public-health policy failure. Lancet 2002; 359(9324):2188-2194.
(9) Pocock SJ. Clinical trials: a practical approach, 1983.
(10) Senn S. Statistical issues in drug development, 2008.
(11) Paul SM, Mytelka DS, Dunwiddie CT, Persinger CC, Munos BH, Lindborg SR, Schacht AL. How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nature Reviews Drug Discovery 2010; 9(3):203-214.
(12) Dickson M, Gagnon JP. Key factors in the rising cost of new drug discovery and development. Nature Reviews Drug Discovery 2004; 3(5):417-429.
(13) Rawlins MD. Cutting the cost of drug development? Nature Reviews Drug Discovery 2004; 3(4):360-364.
(14) Kola I, Landis J. Can the pharmaceutical industry reduce attrition rates? Nature Reviews Drug Discovery 2004; 3(8):711-715.

Chapter 2.1
Clinical trial simulation in late-stage drug development

Boessen R, Knol MJ, Roes KCB


Abstract

The development of new drugs is increasingly risky and expensive. This is largely due to more frequent late-stage failures and the rising costs of phase II and III clinical trials. The success rate of these trials can be improved with clinical trial simulation (CTS), a technique that makes it possible to mimic the conduct of a trial in order to identify and resolve potential shortcomings in the study protocol before the trial is started.

This review discusses published CTS studies that addressed pertinent questions related to the design and analysis of late-stage clinical trials. Inclusion was confined to studies that evaluated trial designs on relevant decision-making metrics (e.g. power or sample size requirements to establish treatment efficacy, expected costs of the trial, etc.). Key characteristics regarding the objectives, simulation models and analytic methods were derived and discussed.

Most studies performed CTS retrospectively in order to investigate its utility. Other studies performed CTS prospectively to inform the planning of future trials. It was found that model-building and validation procedures were often inadequate for proper assessment and optimization of predictive performance. This is an important issue for the use of CTS in practice, when models are used to extrapolate beyond their source data to predict future trial results.

Overall, the review indicates that CTS is a valuable tool to evaluate the degree and impact of uncertainty at the planning stage, and makes it possible to evaluate and compare study designs on relevant metrics. In addition, CTS compels the integration of preclinical and early clinical data, ensuring a comprehensive synopsis of available knowledge about the drug before a trial is started. Given its potential and the urgent need for more effective and efficient drug development, the review promotes a larger role for CTS in the planning stage of future trials.


Introduction

The costs of drug development are rising, while the number of newly approved drugs is going down [1-3]. This trend is likely to influence the industry's future revenues, but will also affect patient care, since increased research and development (R&D) spending slows down innovative research, and fewer new drugs translate into fewer new and affordable therapies. In addition, higher R&D costs could motivate companies to focus their research on compounds with a high potential to recoup investments, which would restrain the development of products for rare or third-world diseases.

Most of the time and money invested in pharmaceutical R&D is spent on late-stage confirmatory trials [1]. The high costs associated with failure of these trials have led sponsors to rely on conservative approaches to trial design and analysis, which has held back the innovations needed to address and reverse current challenges in drug research.

In March 2004, the US Food and Drug Administration (FDA) launched the Critical Path initiative and issued a report entitled 'Innovation or Stagnation: Challenge and Opportunity on the Critical Path to New Medical Products', which identified ways to accelerate the pace and reduce the costs of drug development. This report advocated more extensive use of clinical trial simulation (CTS), which was described as "a technique aimed at simulating trial conduct to predict study outcomes, anticipate potential problems and conceive probable resolutions before initiation of the actual trial" [2]. CTS was presented as a powerful technique to facilitate trial design and offer 'an important approach to improve drug development knowledge management and development decision making'.

A CTS model combines information on the investigated compound, the targeted disease and patient characteristics related to drug and disease model parameters [4]. CTS models are often based on available preclinical and clinical data, sometimes complemented with knowledge from similar compounds or from the literature. Once the model is defined, the trial design and anticipated protocol deviations (e.g. dropout, partial compliance) are specified, and stochastic simulations are performed to generate virtual trial outcomes. Assessment of these data makes it possible to test whether the design is adequate and effective in meeting the trial's objectives. In addition, alternative designs can be explored to identify the most appropriate study design based on specified criteria such as, for instance, the maximum likelihood of finding a significant treatment effect or minimal costs.

Its assumed potential to reduce costs and increase the efficiency of clinical trials has generated interest from both industry and academia [4-6]. This review provides an overview of published CTS studies that addressed pertinent questions related to the design of late-stage confirmatory trials. It discusses the pursued objectives, the employed models and methodologies, and the analyses of model performance and simulated outcomes.
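The stochastic simulation loop described above can be sketched in a few lines. This is a minimal illustration, not a model from the reviewed studies: it assumes a normally distributed endpoint, a two-arm parallel design, and a known-variance z-test as the primary analysis, and estimates power as the fraction of virtual trials reaching significance.

```python
import random
from statistics import NormalDist, mean

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided alpha = 0.05

def simulate_trial(n_per_arm, true_effect, sd, rng):
    """One virtual parallel-group trial; True if the z-test is significant."""
    placebo = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
    active = [rng.gauss(true_effect, sd) for _ in range(n_per_arm)]
    se = sd * (2.0 / n_per_arm) ** 0.5  # standard error of the mean difference
    return abs(mean(active) - mean(placebo)) / se > Z_CRIT

def estimated_power(n_per_arm, true_effect, sd=1.0, n_replications=2000, seed=1):
    """Fraction of replicated virtual trials that reach significance."""
    rng = random.Random(seed)
    hits = sum(simulate_trial(n_per_arm, true_effect, sd, rng)
               for _ in range(n_replications))
    return hits / n_replications

print(estimated_power(n_per_arm=175, true_effect=0.3))  # near the nominal 0.80
```

Replacing the Gaussian draw with a richer input-output model, and the z-test with the planned primary analysis, turns this skeleton into a full CTS experiment.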


Methods

Search strategy

Studies included in this review were identified through literature searches in PubMed performed in August 2011. The search terms used were 'clinical' and 'trial' and 'simulation*' in title or abstract, limited to studies written in English. Based on title and keywords, abstracts from potentially relevant articles were selected for further review. When there was uncertainty about the relevance of the article after assessment of the abstract, the full text was read to determine whether the study met our inclusion criteria. Also, the reference lists of the selected articles were checked for additional publications.

Inclusion criteria

Inclusion was confined to simulation studies that 1) used models based on preclinical and clinical trial data, 2) assessed the properties and performance of late-stage confirmatory trial designs (i.e. phase IIb and phase III) and 3) evaluated a single design, or compared multiple designs, on informative criteria for development decision making, such as statistical power, required sample size or estimated clinical effect size.

Since the interest was in the simulation of late-stage confirmatory trials, the review only included studies in which the simulated outcome matched the clinical outcome for which the drug was studied. Simulation studies that focused on a biomarker with unresolved relevance to a meaningful clinical endpoint were excluded.

Data extraction

The key features of the selected studies were extracted from the articles and presented in table I. The layout and terminology of the table were partially adopted from a previously published review and a guidance document on CTS [4,7]. The information derived from the articles included: the first author, the year of publication, the compound under study, the target indication and the study objectives as phrased in the original publication. Also obtained were details on the model structure, where, in accordance with previous publications, a distinction was made between 1) the input-output (IO) model, 2) the covariate distribution model, and 3) the trial execution model. Furthermore, parameters were obtained defining the evaluated design(s), along with details on the methods used to assess model performance and analyze the simulated trial outcomes.

Input-output model

The IO model refers to the model that is used to predict trial outcomes given certain inputs and baseline covariate values. It generally incorporates multiple components related to factors such as the effect of the drug (e.g. its PK/PD profile), the state and progression of the disease, and the clinical response in the treatment and control groups. Typically, the IO model contains both fixed components and stochastic components that account for between- and within-subject variability and measurement error. However, the structure of the IO model may vary considerably between studies, depending on characteristics of the drug and the disease, the amount of available information and the objectives of the study.
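As a concrete, hypothetical example of such an IO model, the sketch below combines a fixed Emax dose-response component with a stochastic between-subject term on Emax and additive residual error; every parameter value is invented for illustration and does not come from the reviewed studies.

```python
import random

def io_model(dose, rng, emax=10.0, ed50=50.0, bsv_sd=0.3, resid_sd=1.5):
    """Simulated response of one subject at a given dose.

    Fixed component: Emax dose-response curve.
    Stochastic components: lognormal between-subject variability on Emax,
    plus additive residual (within-subject / measurement) error.
    """
    subject_emax = emax * rng.lognormvariate(0.0, bsv_sd)  # between-subject
    structural = subject_emax * dose / (ed50 + dose)       # fixed component
    return structural + rng.gauss(0.0, resid_sd)           # residual error

rng = random.Random(42)
responses = [io_model(dose=100.0, rng=rng) for _ in range(5000)]
print(sum(responses) / len(responses))  # mean response near 7 under these assumptions
```

Swapping the Emax curve for a disease-progression or clinical-score-kinetics component changes the structural part while leaving the stochastic machinery intact.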

Covariate distribution model

The covariate distribution model defines the distribution of relevant covariates in the trial population and their relation with inter-individual differences in drug- and disease-related model parameters. It serves to account for patient-specific features that are associated with systematic differences in drug response between patients (e.g. age, weight, metabolic functioning).
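A covariate distribution model can be as simple as drawing plausible patient characteristics and mapping them onto model parameters. The distributions and the allometric exponent below are illustrative assumptions, not values taken from the reviewed studies.

```python
import random

def sample_covariates(rng):
    """Draw one virtual patient's covariates from assumed distributions."""
    age = min(max(rng.gauss(45.0, 12.0), 18.0), 80.0)  # years, truncated to 18-80
    weight = rng.lognormvariate(4.3, 0.2)              # kg, median around 74
    return age, weight

def clearance(weight, cl_typical=10.0, ref_weight=70.0):
    """Assumed allometric weight effect: heavier patients clear drug faster."""
    return cl_typical * (weight / ref_weight) ** 0.75

rng = random.Random(7)
for _ in range(3):
    age, weight = sample_covariates(rng)
    print(f"age={age:.0f} y, weight={weight:.1f} kg, CL={clearance(weight):.2f} L/h")
```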

Trial execution model

The trial execution model describes the characteristics of the trial design(s) under investigation. It provides a representation of the study design according to the protocol (i.e. the nominal design), and is often supplemented with a description of protocol deviations that are expected to occur as the study progresses (e.g. patient dropout, partial compliance). It also includes the methods used to analyze the (simulated) trial outcomes (i.e. the primary analysis).

Key features of the various models that were applied in each of the selected studies are presented in table I, along with the data and additional information on which these models were based (i.e. the source data). Also presented is the software used to derive the structure and parameter values of the models.
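Protocol deviations can be layered onto the nominal design in the same stochastic fashion. The sketch below censors a subject's scheduled visits under a constant per-week withdrawal probability; the 5% weekly rate is an assumed value for illustration.

```python
import random

def observed_visits(scheduled_weeks, rng, weekly_dropout=0.05):
    """Visits actually observed when a subject may withdraw each week."""
    observed = []
    for week in scheduled_weeks:
        if rng.random() < weekly_dropout:
            break  # subject withdraws; all later visits are missing
        observed.append(week)
    return observed

rng = random.Random(3)
n_subjects = 10_000
completers = sum(
    len(observed_visits(range(8), rng)) == 8 for _ in range(n_subjects)
)
print(completers / n_subjects)  # close to 0.95 ** 8, i.e. about 0.66
```

Even a modest weekly dropout rate leaves only about two-thirds of subjects with complete 8-week data, which is why dropout assumptions materially affect simulated power.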

Model validation and predictive performance assessment

CTS models rely on assumptions that are often based on empirical data, which are subject to various sources of variation. It is therefore essential to conduct considerable model-checking before accepting a model for CTS purposes [8]. Table I presents information on the model-checking procedures described in the selected publications. These include both internal validity checks (i.e. how well the model fits its source data) and assessments of the model's true predictive performance (i.e. how well model predictions agree with external data).

Simulation experiments and data analysis

Typical CTS studies are experiments in which factors related to the design and analysis of the trial (e.g. dosing regimen, sample size, trial duration, number and frequency of assessments) are varied to evaluate their impact on simulated trial outcomes. For each of the selected studies, these factors are presented in table I. Also presented are the number of replications, the simulated endpoint(s), and a description of the outcome(s) used to assess the evaluated design(s).
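Such an experiment is typically organized as a full factorial grid over the design factors, with each cell then evaluated by many replicated virtual trials. A minimal sketch of the bookkeeping, with invented factor levels:

```python
import itertools

# Design factors and illustrative levels to be crossed in the experiment
sample_sizes = [100, 200, 400]      # patients per arm
doses = [10, 50, 100]               # mg
durations = [6, 8]                  # weeks

scenarios = list(itertools.product(sample_sizes, doses, durations))
print(len(scenarios))  # 3 * 3 * 2 = 18 scenarios

# Each scenario would be handed to a trial simulator and replicated
# (e.g. 100 times) to estimate power or expected effect size per cell.
for n, dose, weeks in scenarios[:3]:
    print(f"n={n}/arm, dose={dose} mg, duration={weeks} wk")
```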

Results

Search results

The literature search in PubMed resulted in 972 articles. After a first screening on title and keywords, 78 studies were identified that used simulations to evaluate or compare trial designs. Careful review of the abstract and, if necessary, the full text of these articles resulted in a selection of 13 studies that met our inclusion criteria. All these studies were published within the last ten years. No additional articles were included based on literature references, confirming the sensitivity of our search strategy.

Compounds and indications

CTS has been applied to a variety of compounds in a range of therapeutic areas, including cancer [9,10], neurovascular [11], cardiovascular [12] and respiratory disease [13], and central nervous system (CNS) disorders [14-16]. Five studies involved compounds that were still in the pre-approval phase when the simulations were conducted [12,15-17]. Two studies made use of models applicable to a particular drug class (i.e. antidepressants) rather than to a specific drug compound [18,19]. One study did not provide details on the evaluated compound [20].

Study objectives

Three studies determined the overlap between simulated study results and the actual outcomes from previously conducted trials, with the sole intent to investigate the feasibility and utility of CTS in late-stage clinical development [11,14,20]. Other studies focused on the evaluation and optimization of clinical study designs. These studies usually varied one or multiple design parameters to assess the impact on primary trial outcomes and identify the most favorable set of parameters with regard to a prespecified evaluation criterion, such as the likelihood of finding a statistically significant difference between treatment and control, or the expected costs or duration of the trial. Three studies explicitly stated that the findings from the CTS experiment directly influenced decision making about the development of the drug [13,16,17].

Several studies did not have design evaluation or optimization as their main purpose, but focused instead on balancing the safety and efficacy consequences of a dose modification strategy for a specific patient subpopulation [9,10], evaluating the added effect of a drug-facilitator drug combination as compared to the drug alone [18], or predicting the effectiveness of an alternative drug formulation [17,21].


Input-output models

The type of models used in CTS depended on drug- and disease-related factors, the study objectives and the amount of available preclinical and clinical knowledge. The majority of the selected studies made use of PK/PD models to characterize drug response. PK/PD models describe the time course of effect intensity in response to administration of the drug. Such models are a useful point of departure for CTS when the drug under study has a consistent and well-defined dose-response profile. However, when the dose-response characteristics of the drug vary greatly between patients and over time, as for many CNS drugs, PK/PD models may be less suitable for simulation purposes, and other approaches need to be considered. Examples are included in table I; these studies applied models that characterized clinical score kinetics in response to antidepressant treatment, without relying on an explicit model of the drug response [18,19].

Aside from a characterization of the drug response profile, some studies also incorporated model components that accounted for variations in disease state [9,14,16]. An example is a study that compared multiple trial designs on the power to detect a pre-specified treatment effect of an M1 muscarinic agonist for the treatment of Alzheimer's disease, and modeled the time course of disease progression [16]. The inclusion of a comprehensive disease model was exceptional. Most studies included covariates related to inter-individual differences in disease state or progression (e.g. baseline disease severity).

Data source

Most frequently, models were based on data from preclinical research or early clinical trials. Some studies also used information from the literature and other sources (e.g. expert opinion, past experience) [10,12,16]. When drug-specific information was insufficiently available, several studies used data from related compounds to leverage model construction [11,13]. Sometimes, models were updated as new data on the target drug became available [11,21].

Covariate model

Eight studies specified covariates to account for patient-specific features (e.g. age, body weight, renal function) associated with systematic differences in drug response [9,10,12-16,21]. However, the way in which covariates were defined and incorporated into the IO model was not always clear and varied substantially between studies.

Trial execution model

All of the selected examples simulated parallel group designs in which one or more doses of the treatment drug were compared with placebo or an active comparator. One study compared a parallel design with multiple cross-over designs that varied in the number and length of treatment periods [16].


Deviations from the nominal trial protocol (e.g. patient withdrawal, non-compliance) may affect the quality and validity of trial outcomes. Reliable prediction of study results in CTS thus requires accurate descriptions of the protocol deviations that are likely to occur as the study progresses. Eight out of the twelve studies took the anticipated patient dropout into account [9,13-17,19]. Two studies also considered dosing adjustments in response to safety events [9,10].

Simulation experiment

Most of the selected studies evaluated the effects of controllable design parameters (e.g. dosing regimen, sample size, assessment schedule, randomization ratio, treatment duration) to decide on the adequacy of the tested design and explore the performance of alternative designs. Of primary interest was often the establishment of an appropriate dosing regimen, which is a challenging and important aspect of protocol development. However, CTS was also used to examine the combined effect of a set of design parameters [11,15-17,19,21], predict the impact of expected protocol deviations (e.g. dropout mechanisms) [19], or determine whether the choice of a certain endpoint affected the likelihood of reaching a statistically significant comparison between treatment and placebo [18].

The number of simulation replications varied between studies, and ranged from 50 to 1000 per simulated scenario. Seven studies carried out 100 replications per scenario. None of the selected studies provided a rationale for the chosen number of simulation replications.
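The missing rationale matters because a simulated power is a binomial proportion, so its Monte Carlo standard error follows directly from the replication count. A quick back-of-the-envelope check (the numbers are illustrative, not from the review):

```python
import math

def mc_standard_error(power, n_replications):
    """Monte Carlo SE of a simulated power estimate (binomial proportion)."""
    return math.sqrt(power * (1.0 - power) / n_replications)

# With only 100 replications, a reported power of 0.80 carries an
# uncertainty of roughly +/- 0.08 (two standard errors):
for n in (50, 100, 1000):
    print(n, round(mc_standard_error(0.80, n), 3))
```

At 100 replications, designs whose true powers differ by several percentage points are effectively indistinguishable; tightening the comparison to about one percentage point requires on the order of a thousand replications.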

Model validation and predictive performance assessment
Most studies assessed model performance by applying basic internal validation methods (e.g. goodness-of-fit plots) or predictive check procedures that quantified or visualized the degree of correspondence between model predictions and the source data. In two cases these analyses were supplemented with sensitivity and uncertainty analyses to evaluate the impact of deviations from parameter assumptions [12,13]. Three studies compared model predictions with external data (i.e. data not used in model development) [11,17,20]. Two studies did not mention or describe any model-checking procedures [15,16].
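A predictive check of the kind most studies applied can be sketched as follows: simulate many replicate datasets from the fitted model and ask whether a summary statistic of the observed source data falls inside the simulated distribution. All numbers below (the fitted model parameters and the observed mean) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical fitted IO model: individual responses ~ Normal(mu_hat, sd_hat)
mu_hat, sd_hat, n_obs = -8.0, 6.0, 120
observed_mean = -7.1  # summary statistic of the (hypothetical) source data

# Simulate 1000 replicate datasets and collect the same statistic from each
sim_means = rng.normal(mu_hat, sd_hat, size=(1000, n_obs)).mean(axis=1)
lo, hi = np.percentile(sim_means, [2.5, 97.5])

print(f"95% predictive interval for the mean: ({lo:.2f}, {hi:.2f})")
print("observed statistic consistent with the model:",
      bool(lo <= observed_mean <= hi))
```

A visual predictive check extends the same comparison across time points or dose levels, overlaying observed data on simulated percentile bands rather than checking a single statistic.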

Analysis of design performance
The studies in the current review were selected on the outcome used to evaluate trial designs. This was usually the assumed statistical power (i.e. the probability of establishing a statistically significant difference between the treated group and the control group when such a difference is present), but could also be the required number of patients [11,12] or the expected effect size [13,14].

CTS in late-stage drug development


Table I. Overview of the selected studies. For each study, the table reports the reference (author, year), compound and objectives; the input-output model (structural and stochastic components); the covariate model; the trial execution model (nominal design, protocol deviations); the simulation experiment (design or analysis factors, number of replications); the data sources; the simulated patient outcome(s); the measure and analysis of design performance; the application of results; and the software used. In brief:

- Kimko et al. (2000): quetiapine fumarate, schizophrenia; parallel, placebo-controlled multiple-dose phase III trial with 50 patients per arm; 100 trial replications; power (NONMEM, ACSL Biomed, SAS).
- Veyrat-Follet et al. (2000): docetaxel, non-small-cell lung cancer; parallel phase III trial comparing standard vs. increased dosage with 200 patients per arm; 100 trial replications; power (NONMEM, ACSL BioMed, SAS).
- Nestorov et al. (2001): naratriptan, migraine; parallel, placebo-controlled multiple-dose phase IIb trial; 200 replications per scenario; power and required sample size (NONMEM, BMDP, MATLAB, Microsoft Excel).
- Chabaud et al. (2002): ivabradine, angina pectoris; parallel, placebo-controlled phase III trial with 100 patients per arm; 100 replications per trial; power and required sample size (NONMEM, SAS, S-Plus).
- Lockwood et al. (2003): pregabalin, neuropathic pain; parallel, placebo-controlled multiple-dose phase II trials with 80 patients per arm; 500 trial replications; proportion of replications with a minimum effective dose (MED) estimate within a specified range of the true value (NONMEM, Pharsight Trial Simulator, S-Plus).
- De Ridder (2005): unspecified compound and indication; parallel, placebo-controlled multiple-dose phase III trial with 5000 patients per arm; 1000 trial replications; effect size and power (SAS).
- Lockwood et al. (2006): CI-1017, an M1 muscarinic agonist, Alzheimer's disease; placebo-controlled multiple-dose designs with ±60 subjects and 12-week duration; 100 trial replications; power (NONMEM, Pharsight Trial Simulator).
- Gruwez et al. (2007): clomipramine with or without lithium, depression; parallel trial with 100 patients per arm; 100 trial replications; power (NONMEM, SPSS).
- Kowalski et al. (2008): SC-75416, a selective COX-2 inhibitor, pain; seven parallel, placebo- and active-controlled multiple-dose designs varying in dosing regimen and sample size; 1000 replications per tested design; power (NONMEM, SAS).
- Putnam et al. (2008): HAE1, a high-affinity anti-IgE monoclonal antibody, asthma; parallel, placebo-controlled phase II trial; fifty bootstrap iterations with resampling of 100 subjects; effect size (NONMEM).
- Ozawa et al. (2009): docetaxel, cancer; parallel phase III trial comparing standard vs. reduced dosage with 200 patients per arm; 200 trial replications; power (Pharsight Trial Simulator, SPSS, S-Plus, PASS).
- Santen et al. (2009): antidepressant drug, depression; parallel clinical trial; 100 replications per scenario; power and type I error (WinBUGS, R, SAS).
- Krishna et al. (2011): anacetrapib, primary hypercholesterolemia and mixed hyperlipidemia; parallel, placebo-controlled multiple-dose (7) phase III trial with 45 patients per arm; 1000 replications per scenario (NONMEM, S-Plus).

Discussion
This review provides an overview of published studies that used CTS to address pertinent questions related to the design of late-stage confirmatory trials. We did not intend to provide a full account of all aspects and procedures that need to be considered before and while conducting CTS; recommendations and guidance have been published elsewhere [5,7,21]. Instead, we discussed the questions that were addressed, the data, models and software that were used, the types of experiments that were conducted, and the analyses that were performed to assess model performance, simulated outcomes and features of the evaluated design(s).

Advances in the use and methodology of CTS have been discussed previously in two reviews by Holford and colleagues. The first was published over a decade ago and served as the basis and framework for our own analysis [4]. The second was carried out simultaneously with, and independently of, our own study, and was published only recently [23]. It included most of the studies also discussed here, and considered some of the same issues, including the possibilities, challenges and pitfalls of CTS and its role in the current drug development process. However, its focus was primarily on lessons learned from CTS studies and the potential contribution of CTS to the drug development process, whereas our study focused more specifically on the methodological considerations when planning or conducting CTS.

We found that the models used to perform CTS varied widely in structure, size and complexity. The employed IO models, for example, ranged from simple characterizations of the outcome kinetics to much more elaborate descriptions of drug response and disease progression. IO models may be divided into models of the data (i.e. empirical models) and models of the underlying data-generating mechanisms (i.e. mechanistic models). The former describe the input-output relationship observed in data from empirical (e.g. preclinical and clinical) studies, while the latter characterize the actual pharmacological and physiological mechanisms that gave rise to these data. The use of mechanistic models in CTS is often encouraged, since such models are expected to extrapolate to new situations better than empirical models [5]. However, mechanistic models require knowledge that is often not available or easy to obtain when a CTS study is initiated.

Aside from differences in the nature of the IO models, studies also varied in whether they accounted for inter-individual differences in features that were associated with systematic differences in drug response between patients. The inclusion of such patient-specific covariates (i.e. a covariate distribution model) is especially important when meaningful differences in drug response can be expected between the patient subgroups included in the trial.

The extensiveness of the trial execution model varied as well, ranging from a simple definition of the treatment groups and sample sizes to a detailed account of design parameters such as dosing regimen, patient allocation, dropout and (non-)compliance.
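A minimal covariate distribution model on top of an empirical IO model might look like the following sketch. The choice of covariates, the Emax dose-response form and every numerical value are illustrative assumptions, not parameters from any of the reviewed studies.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_arm(n, dose):
    """Sample patient-specific covariates, then let them shift the
    dose-response predicted by a simple empirical Emax IO model."""
    age = rng.normal(60, 10, n)              # years (illustrative)
    crcl = rng.normal(90, 20, n)             # creatinine clearance, mL/min
    # covariate model: lower clearance -> higher effective exposure
    exposure = dose * 100.0 / np.clip(crcl, 30, None)
    emax, ed50 = 10.0, 50.0                  # IO model parameters (assumed)
    effect = emax * exposure / (ed50 + exposure)
    # second covariate effect: response declines slightly with age
    effect *= 1.0 - 0.005 * (age - 60.0)
    return effect + rng.normal(0.0, 1.5, n)  # residual variability

responses = simulate_arm(200, dose=75.0)
print(f"mean response {responses.mean():.2f}, SD {responses.std():.2f}")
```

Dropping the two covariate lines collapses this to a single population dose-response curve, which is the simplification a covariate distribution model is meant to avoid when subgroups are expected to respond differently.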


In itself, the structure and complexity of the CTS model does not determine its adequacy. More important is whether the model is suitable for its intended purpose; that is, to predict trial outcomes in order to inform the design of future trials. A guidance document on the performance of CTS recommended that "The complexity of the models and simulation procedures should be no more than necessary to meet the objectives of the simulation project" [7].

Typically, models used in CTS are based on available preclinical and clinical trial data. These data are subject to various sources of variation, and it is therefore essential to conduct extensive model-checking before accepting a model for CTS. Most studies included in the review applied basic internal validation methods (e.g. goodness-of-fit plots) or predictive check procedures. These methods account for the model's fit to its source data, but do not address the reliability of extrapolations beyond the observed data, which is relevant for simulating future trials [8]. Methods adopted from diagnostic and prognostic statistical and epidemiological research can be used to assess and improve the predictive performance of models used in CTS; these include cross-validation, adjusted parameter estimates (shrinkage) and different information criteria and operating characteristics [24-26]. At both the model building and validation stages, these methods allow for rigorous quantitative ranking of different models according to their capability to predict the operating characteristics of future trials. Validation against independent external data would be the optimal approach to assess a model's predictive performance, although such data may not always be available at the time a CTS study is undertaken.

CTS fits well within the learning-confirming framework introduced by Lewis Sheiner [27]. This framework was posed as an alternative to the traditional, phased approach to clinical drug development, and has become increasingly influential since its introduction. It recognizes that at different phases of the drug development process the focus is on distinctly different objectives. During the learning phase, the goal is to accumulate knowledge on the new drug candidate (e.g. its PK/PD profile), maximize its medical potential (e.g. optimize the dosing regimen) and determine whether continued investment is justified. In the confirmatory phase, the focus is specifically on testing the hypothesis that the drug is effective in treating the target indication for a specified patient group.

The studies discussed in this review provided examples of how CTS could serve as a bridge between preclinical and early clinical learning studies on the one hand, and late-stage confirmatory trials on the other. The CTS models all relied on data that were acquired during early development (i.e. the learning phase), and were used to optimize the design of late-stage (i.e. confirmatory) clinical trials.

A number of major pharmaceutical companies have expressed growing interest in CTS [28,29], but regulatory agencies have also advocated more frequent use of CTS to streamline drug development. In fact, the FDA has occasionally encouraged, or even specifically requested


sponsors to perform trial simulations as part of the development process or to support a new drug application [34]. The FDA's 'Critical Path' report referred to CTS as an important tool to improve development decision making, and other FDA guidance documents have also acknowledged a role for CTS in the evaluation of trial designs [31,32]. The EMA has been less explicit in its stance towards CTS, but is known to support the use of modeling and simulation approaches throughout the drug development process [33].

In addition, since the acceptance of the FDA Modernization Act (FDAMA) in 1997, sponsors can file a request for approval that is supported by data from a single adequate and well-controlled phase III clinical trial supplemented with confirmatory evidence (section 115a) [34]. Some authors have suggested that this supportive evidence could be obtained from a carefully conducted CTS [35], which would enable faster and more economical registration of new drugs.

The number of published CTS examples in late-stage development is limited, which suggests that at present the procedure is not commonly applied (or at least not frequently reported on) by pharmaceutical companies. It may be that these studies are considered sensitive in terms of intellectual property and are only published once a product is marketed. There is at present no obligation to publicly register such studies, as they do not involve actual clinical trials. Also, CTS takes time to conduct, which companies may fear will reduce the duration of patent protection once the drug is on the market. In addition, CTS may require information and expertise that are unavailable.

This review suggests that CTS offers a valuable approach to evaluate and compare trial designs, and could hence ensure thoughtful use of limited resources and optimize the likelihood of a successful trial. In addition, it compels the integration of data from the different development phases (e.g. PK/PD and dose-finding studies, early clinical trials, literature and related compounds) to assure a comprehensive synopsis of all the available information before a large (and expensive) clinical trial is embarked upon. Overall, we expect that this will streamline late-stage drug development, and we advocate a larger role for CTS in the planning of future trials.


Reference List(1) DiMasi JA, Hansen RW, Grabowski HG. The price of innovation: new estimates of

drug development costs. Journal of Health Economics 2003; 22(2):151-185.(2) US Department of Health and Human Services, Food and Drug Administration.

Innovation or stagnation? Challenge and opportunity on the critical path to new medical products. 2004.

(3) Eichler HG, Aronsson B, Abadie E, Salmonson T. New drug approval success rate in Europe in 2009. Nature Reviews Drug Discovery 2010; 9(5):355-356.

(4) Holford NH, Kimko HC, Monteleone JP, Peck CC. Simulation of clinical trials. Annual Review of Pharmacology and Toxicology 2000; 40:209-234.

(5) Bonate PL. Clinical trial simulation in drug development. Pharmaceutical Research 2000; 17(3):252-256.

(6) Girard P. Clinical trial simulation: a tool for understanding study failures and preventing them. Basic & Clinical Pharmacology & Toxicology 2005; 96(3):228-234.

(7) Holford NHG, Hale M, Ko HC, et al. Simulation in Drug Development; Good Practices, 1999.

(8) Boessen R, Knol MJ, Groenwold RH, Roes KC. Validation and predictive performance assessment of clinical trial simulation models. Clinical Pharmacology and Therapeutics 2011; 89(4):487-488.

(9) Veyrat-Follet C, Bruno R, Olivares R, Rhodes GR, Chaikin P. Clinical trial simulation of docetaxel in patients with cancer as a tool for dosage optimization. Clinical Pharmacology andTherapeutics 2000; 68(6):677-687.

(10) Ozawa K, Minami H, Sato H. Clinical trial simulations for dosage optimization of docetaxel in patients with liver dysfunction, based on a log-binominal regression for febrile neutropenia. Yakugaku Zasshi 2009; 129(6):749-757.

(11) Nestorov I, Graham G, Duffull S, Aarons L, Fuseau E, Coates P. Modeling and stimulation for clinical trial design involving a categorical response: a phase II case study with naratriptan. Pharmaceutical Research 2001; 18(8):1210-1219.

(12) Chabaud S, Girard P, Nony P, Boissel JP. Clinical trial simulation using therapeutic effect modeling: application to ivabradine effi cacy in patients with angina pectoris. Journal of Pharmacokinetics and Pharmacodynamics 2002; 29(4):339-363.

(13) Putnam WS, Li J, Haggstrom J et al. Use of quantitative pharmacology in the development of HAE1, a high-affi nity anti-IgE monoclonal antibody. The AAPS Journal 2008; 10(2):425-430.

(14) Kimko HC, Reele SS, Holford NH, Peck CC. Prediction of the outcome of a phase 3 clinical trial of an antischizophrenic agent (quetiapine fumarate) by simulation with a population pharmacokinetic and pharmacodynamic model. Clinical Pharmacology and Therapeutics 2000; 68(5):568-577.

(15) Lockwood PA, Cook JA, Ewy WE, Mandema JW. The use of clinical trial simulation to support dose selection: application to development of a new treatment for chronic neuropathic pain. Pharmaceutical Research 2003; 20(11):1752-1759.

Chapter 2.1


(16) Lockwood P, Ewy W, Hermann D, Holford N. Application of clinical trial simulation to compare proof-of-concept study designs for drugs with a slow onset of effect; an example in Alzheimer’s disease. Pharmaceutical Research 2006; 23(9):2050-2059.

(17) Kowalski KG, Olson S, Remmers AE, Hutmacher MM. Modeling and simulation to support dose selection and clinical development of SC-75416, a selective COX-2 inhibitor for the treatment of acute and chronic pain. Clinical Pharmacology and Therapeutics 2008; 83(6):857-866.

(18) Gruwez B, Poirier MF, Dauphin A, Olie JP, Tod M. A kinetic-pharmacodynamic model for clinical trial simulation of antidepressant action: application to clomipramine-lithium interaction. Contemporary Clinical Trials 2007; 28(3):276-287.

(19) Santen G, Horrigan J, Danhof M, Della PO. From trial and error to trial simulation. Part 2: an appraisal of current beliefs in the design and analysis of clinical trials for antidepressant drugs. Clinical Pharmacology and Therapeutics 2009; 86(3):255-262.

(20) De Ridder F. Predicting the outcome of phase III trials using phase II data: a case study of clinical trial simulation in late stage drug development. Basic & Clinical Pharmacology & Toxicology 2005; 96(3):235-241.

(21) Krishna R, Bergman AJ, Green M, Dockendorf MF, Wagner JA, Dykstra K. Model-based development of anacetrapib, a novel cholesteryl ester transfer protein inhibitor. The AAPS Journal 2011; 13(2):179-190.

(22) Smith MK, Marshall A. Importance of protocols for simulation studies in clinical drug development. Statistical Methods in Medical Research 2011; 20(6):613-622.

(23) Holford N, Ma SC, Ploeger BA. Clinical Trial Simulation: A Review. Clinical Pharmacology and Therapeutics 2010; 88(2):166–182.

(24) Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996; 15(4):361-387.

(25) Brendel K, Dartois C, Comets E et al. Are population pharmacokinetic and/or pharmacodynamic models adequately evaluated? A survey of the literature from 2002 to 2004. Clinical Pharmacokinetics 2007; 46(3):221-234.

(26) Konig IR, Malley JD, Weimar C, Diener HC, Ziegler A. Practical experiences on the necessity of external validation. Statistics in Medicine 2007; 26(30):5499-5511.

(27) Sheiner LB. Learning versus confirming in clinical drug development. Clinical Pharmacology and Therapeutics 1997; 61(3):275-291.

(28) Lalonde RL, Kowalski KG, Hutmacher MM et al. Model-based drug development. Clinical Pharmacology and Therapeutics 2007; 82(1):21-32.

(29) Rajman I. PK/PD modelling and simulations: utility in drug development. Drug Discovery Today 2008; 13(7-8):341-346.

(30) Lee H, Yim DS, Zhou H, Peck CC. Evidence of effectiveness: how much can we extrapolate from existing studies? The AAPS Journal 2005; 7(2):E467-E474.

(31) Center for Drug Evaluation and Research FDA. Guidance for Industry. End-of-Phase 2A Meetings. 2009. Food and Drug Administration.

CTS in late-stage drug development


(32) Center for Drug Evaluation and Research FDA. Guidance for Industry. Adaptive Design Clinical Trials for Drugs and Biologics. 2010. Food and Drug Administration.

(33) European Medicines Agency. Innovative Drug Development Approaches. Final report from the EMEA/CHMP - think tank on innovative drug development. 2007.

(34) The US Congress. The Food and Drug Administration Modernization Act of 1997. Pub. L. No. 105-115, 111 Stat. 2295. 1997.

(35) Peck CC, Rubin DB, Sheiner LB. Hypothesis: a single clinical trial plus causal evidence of effectiveness is sufficient for drug approval. Clinical Pharmacology and Therapeutics 2003; 73(6):481-490.

Chapter 2.2

Validation and predictive performance assessment of Clinical Trial Simulation models

Boessen R, Knol MJ, Groenwold RHH, Roes KCB

Clinical Pharmacology and Therapeutics 2011;89:487-8


In the August issue of this journal, Holford and colleagues reviewed recent developments in the use and methodology of Clinical Trial Simulation (CTS) [1]. Based on our own assessment of the underlying publications, we consider it essential to add validation and performance assessment of CTS models as important areas of attention.

The main purpose of the CTS studies discussed in the review is to generate distributions of likely trial outcomes for a set of underlying assumptions, in order to inform the design of future trials. The development of CTS models usually relies on empirical data, which are subject to multiple sources of variation. It is therefore essential to conduct extensive model-checking before accepting a model for CTS purposes [2]. Most of the studies reviewed by Holford et al. applied basic internal validation methods (e.g. goodness-of-fit plots) or predictive check procedures. These methods can be used to assess a model's fit to its source data, but they do not address the reliability of extrapolations beyond the observed data, which is what matters when designing future trials.

There is an abundance of established methods, applicable to CTS, that can assess and improve the predictive performance of statistical prognostic models, including cross-validation, adjusted parameter estimates (shrinkage), and various information criteria and operating characteristics [3,4,5]. At both the model-building and validation stages, these methods allow for a rigorous quantitative ranking of different models according to their capability to predict the operating characteristics of future trials. Surprisingly, only a few of the studies reviewed by Holford et al. appear to have employed such methods systematically.

In total, six of the reviewed studies compared simulated data with independent external data. Overall, these studies showed discrepancies between predicted and observed outcomes (Table), raising the concern that the predictive performance of CTS models is often overestimated when judged on the basis of basic internal validation techniques alone. We conducted a search to retrieve reports of trials based on CTS results, so as to compare the predicted and actual outcomes. Two trials were identified, but direct comparison was impossible because of differences in reported outcomes.

In conclusion, we consider the current model-building and validation practices in CTS inadequate for a proper assessment and optimization of predictive performance. We advocate a stronger emphasis on evaluating the predictive performance of a model with methods adopted from diagnostic and prognostic statistical and epidemiological research, and on validation against independent external data whenever possible.
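To make the distinction between internal fit and predictive performance concrete, here is a minimal sketch (synthetic data, unrelated to the reviewed studies, and in Python rather than any tool used in the thesis): an over-flexible model shows a flattering in-sample error, while five-fold cross-validation exposes its larger error on held-out data.

```python
import numpy as np

# Synthetic illustration: the in-sample error of an over-flexible model
# understates its error on unseen data; k-fold cross-validation does not.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 60)
y = 2 * x + rng.normal(0, 0.3, x.size)  # truly linear signal plus noise

def poly_mse(x_tr, y_tr, x_te, y_te, deg):
    # Fit a degree-`deg` polynomial on the training data and return the
    # mean squared error on the test data.
    coef = np.polyfit(x_tr, y_tr, deg)
    return float(np.mean((np.polyval(coef, x_te) - y_te) ** 2))

deg = 9  # deliberately over-flexible for a linear signal
in_sample = poly_mse(x, y, x, y, deg)

# Five-fold cross-validation: each fold is held out once.
folds = np.array_split(rng.permutation(x.size), 5)
cv = float(np.mean([
    poly_mse(np.delete(x, f), np.delete(y, f), x[f], y[f], deg) for f in folds
]))
print(in_sample < cv)  # the honest (cross-validated) error estimate is larger
```

The same logic extends to external validation: the further the evaluation data are from the model-building data, the more honest the estimate of predictive performance.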


Table: Predictive performance assessment in the six reviewed studies that compared simulated data with independent external data. For each study we list the source figures or tables (with the corresponding table entry in Holford et al., 2010), the method used to assess predictive model performance, the qualitative assessment of predictive model performance, and a quantitative interpretation based on figures and tables from the original publication.

1. Fig. 3 and 4, p. 1215 (underlying table entry 2 from Holford et al., 2010)
Method: comparison between phase I model predictions and actual phase IIa data.
Qualitative assessment: the general trend of the predicted profile is correct but systematically underpredicts the observed data.
Quantitative interpretation: about half of the actual PK observations fell outside the 95% prediction interval.

2. Fig. 7, p. 239 (underlying table entry 5 from Holford et al., 2010)
Method: comparison between predictive distributions derived from phase II model simulations and actual outcomes of three phase III trials.
Qualitative assessment: good overall agreement, but observed response was higher than expected for 2 mg in one trial and consistently lower than expected for 4 mg.
Quantitative interpretation: for one trial the 2 mg effect fell outside the 95% prediction interval; the 4 mg effect was consistently underpredicted but within the 95% prediction interval.

3. Table V and VIII, p. 800-801 (underlying table entry 10 from Holford et al., 2010)
Method: comparison between simulations with different assumptions on effect and washout and observed trial data.
Qualitative assessment: observed results were similar to the model predictions assuming a protective effect and slow washout.
Quantitative interpretation: predictions from the best performing model deviated almost two standard errors from the observed effect for placebo and low dose.

4. Fig. 1 and 2, p. 180-181 (underlying table entry 11 from Holford et al., 2010)
Method: comparison between simulated and observed plasma concentration-time curves and antitussive effect-time profiles.
Qualitative assessment: model predictions recovered the plasma concentration-time profile with reasonable accuracy.
Quantitative interpretation: underprediction of the concentration-time curves; good agreement between predicted and observed effect-time profiles.

5. Table 4, p. 863 (underlying table entry 13 from Holford et al., 2010)
Method: comparison between model predictions and the outcomes from a subsequent efficacy trial.
Qualitative assessment: the observed means for the active treatment groups are within 1 point of the model predictions, confirming the predictive performance of the model used in the CTS.
Quantitative interpretation: good agreement between predicted and observed outcomes for all four treatment groups; overprediction for the placebo group.

6. Fig. 1a,b, p. 669 (underlying table entry 16 from Holford et al., 2010)
Method: graphical comparison between simulation predictions and external data.
Qualitative assessment: no further discussion.
Quantitative interpretation: about half of the actual observations fell outside the 95% prediction interval.


Reference List

(1) Holford N, Ma SC, Ploeger BA. Clinical trial simulation: a review. Clinical Pharmacology and Therapeutics 2010; 88(2):166-182.

(2) Holford NHG, Hale M, Ko HC, Steimer JL, Sheiner LB, Peck CC. Simulation in Drug Development: Good Practices. <http://bts.ucsf.edu/cdds/research/sddgpreport.php> (1999). Accessed 1 September 2010.

(3) Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996; 15(4):361-387.

(4) Brendel K et al. Are population pharmacokinetic and/or pharmacodynamic models adequately evaluated? A survey of the literature from 2002 to 2004. Clinical Pharmacokinetics 2007; 46(3):221-234.

(5) König IR, Malley JD, Weimar C, Diener HC, Ziegler A; German Stroke Study Collaboration. Practical experiences on the necessity of external validation. Statistics in Medicine 2007; 26(30):5499-5511.

Chapter 3.1

Increasing trial efficiency by early reallocation of placebo non-responders in sequential parallel comparison designs: application to antidepressants

Boessen R, Knol MJ, Groenwold RHH, Grobbee DE, Roes KCB

Clinical Trials 2012;9(5):578-87.


Abstract

Background: The sequential parallel comparison (SPC) design was proposed to improve the efficiency of psychiatric clinical trials by reducing the impact of placebo response. It consists of two consecutive placebo-controlled comparisons, of which the second is only entered by placebo non-responders from the first. Previous studies suggest that in antidepressant trials, non-response to placebo can already be predicted after two weeks of follow-up. This would allow the first phase of the SPC design to be shortened, further increasing its efficiency.

Purpose: To compare the sample size requirements of an eight-week randomized controlled trial (RCT8) and alternative SPC designs with equal or longer total follow-up duration (SPC2+6, SPC4+4 and SPC6+6).

Methods: Scenarios for response and dropout rates were defined, and the sample sizes needed to achieve 80 percent power were determined for the various designs. Three treatment functions assumed either a smaller, equal, or larger effect at the early stage of the trial as compared to the end. Two dropout models described either predominantly early or linearly increasing dropout, and dropout was considered as non-response. The relative efficiency of the different designs was evaluated across these scenarios and for a specific scenario based on empirical antidepressant trial data.

Results: The different SPC designs (i.e. SPC2+6, SPC4+4 and SPC6+6) were generally more efficient than the RCT8 design when the treatment effect at the early stage of the trial was equal to or larger than the effect at the end. In this case, the advantage of the SPC designs increased in the presence of dropout. The SPC2+6 design was usually more efficient than the SPC4+4 design, and relatively less affected by dropout when it occurred predominantly early. For the scenario based on antidepressant trial data, the SPC2+6 and SPC4+4 designs required 51 and 53 percent fewer patients, respectively, than the RCT8 design.

Limitations: A limited variety of scenarios was evaluated. Parameter values resembled those observed in antidepressant trials.

Conclusions: This study suggests that SPC designs are highly efficient alternatives to a conventional RCT in indications where placebo response is high and substantial treatment effects are established after a relatively short follow-up period (i.e. at the end of the first SPC design phase). We conclude that SPC designs can reduce the sample size requirements and increase the success-rates of antidepressant trials.

Early reallocation of placebo non-responders in SPC designs


Introduction

Many late-stage clinical trials for new psychiatric drugs fail to detect a statistically significant difference in efficacy between treatment and placebo [1]. The failure-rate of confirmatory antidepressant trials, for example, is estimated at over 50 percent, even for treatments registered by the US Food and Drug Administration (FDA) [2]. As late-stage trial failures are expensive and delay product development, improving the success-rate of clinical trials could substantially decrease the costs and time required to bring new psychiatric drugs to the market [3].

High placebo response is common in psychiatric trials and is considered among the main reasons for trial failure [4,5]. If patients randomized to placebo demonstrate large improvements, it is difficult for patients on active treatment to show substantial additional improvement beyond this effect [6].

To address the excessive trial failure-rates in psychiatry, Fava et al. proposed the sequential parallel comparison (SPC) design: a study design aimed at reducing the overall impact of placebo response and thereby increasing the efficiency of the trial [4]. The SPC design consists of two consecutive, placebo-controlled comparisons. In the first, the majority of patients is randomized to placebo. In the second, placebo non-responders go on to receive either active treatment or placebo, while patients on active treatment and placebo responders leave the study. The analysis combines the data from both study phases in order to maximize power and reduce sample size requirements [7].

To examine the efficacy of antidepressant treatments, previous publications suggested a six-week duration for both phases of the SPC design (SPC6+6) [4,8]. In comparison, the follow-up duration of a conventional antidepressant trial is usually only six to eight weeks in total. This means that a SPC6+6 trial would take about twice as long to complete as a conventional trial. As a result, a SPC6+6 trial will also be more strongly affected by patient dropout, and is likely to require additional assessments per patient. A more attractive alternative was recently presented [9], where both phases of the SPC design were only four weeks in duration (SPC4+4). This design may resolve some of the potential downsides of the SPC6+6 design.

Studies that investigated the predictability of trial response based on early symptom relief in antidepressant trials concluded that the absence of improvement in the first two weeks is a good predictor of non-response at the end of the trial, for patients on both treatment and placebo [10-12]. This suggests that it may be possible to identify and reallocate placebo non-responders already after two weeks (SPC2+6), to further increase the efficiency of the design.

This study evaluates the efficiency of the SPC2+6 and SPC4+4 designs as compared to a conventional randomized controlled trial of equal follow-up duration (RCT8). In addition, the comparative efficiency of the SPC6+6 design is evaluated to assess whether the reduction in overall placebo response outweighs the increase in total study duration (i.e. larger dropout). To this end, sample size requirements were determined under a range of scenarios, including a scenario that was based on empirical data from antidepressant trials.


Methods

SPC design
The SPC design consists of two consecutive comparisons between treatment and placebo (figure 1). Before the study starts, patients are randomized into one of three groups, according to a prespecified randomization fraction a. The first two groups receive placebo in the first phase and each include a·n patients. The third group starts on treatment and includes (1-2·a)·n patients. After the first study phase, non-responders from group one continue on placebo while non-responders from group two switch to treatment. Placebo responders and patients from group three (i.e. those initially on treatment) do not continue into the second study phase, but may be retained in the study to obtain additional tolerability data [8].

Figure 1: Representation of the sequential parallel comparison (SPC) design. The SPC design combines two consecutive, placebo-controlled comparisons. Before the study starts, all patients (n) are randomized into one of three groups according to a predefined randomization fraction (a). Patients in the first two groups receive placebo while those in the third group receive treatment. After the first phase, only non-responders from groups one and two go on to receive placebo or treatment in the second phase. p1, q1, p2 and q2 denote the response rates to treatment and placebo in the first and second study phase. All patients can be assigned to one of thirteen possible outcome categories based on their exposure and outcome (response, non-response or dropout) in the first and second study phase (see also table 1).


Response rates on treatment and placebo in the first study phase are denoted by p1 and q1, while p2 and q2 denote the response rates on treatment and placebo in the second phase. Because the second phase includes only placebo non-responders from the first, q2 will likely be smaller than q1. All patients that participate in the SPC design end up in one of thirteen possible outcome categories, based on their exposure and outcome in the first and second study phase (figure 1). The overall treatment effect is a weighted average of the estimated differences in response rates over the two study phases and can be written as: h = w·(p1-q1) + (1-w)·(p2-q2), where w is the weight assigned to the first study phase, which needs to be chosen before the study starts.
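As a worked numerical example (with hypothetical response rates and weight, not values from the thesis), the weighted overall effect can be computed as:

```python
# Overall SPC treatment effect: h = w*(p1 - q1) + (1 - w)*(p2 - q2).
# All values below are hypothetical, chosen only to illustrate the formula.
def spc_effect(p1, q1, p2, q2, w):
    return w * (p1 - q1) + (1 - w) * (p2 - q2)

# First phase: 50% response on treatment vs 40% on placebo; second phase
# (placebo non-responders only): 30% vs 20%; weight w = 0.6 on phase one.
h = spc_effect(p1=0.5, q1=0.4, p2=0.3, q2=0.2, w=0.6)
print(round(h, 3))  # 0.1
```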

Simulation scenarios
The three designs (i.e. SPC2+6, SPC4+4 and RCT8) were compared with respect to sample size requirements across scenarios that were defined by: 1) the eight-week response rates to treatment and placebo in the overall study population (p1 and q1) and the placebo non-responder population (p2 and q2), and 2) the dropout rate after week eight (d). In total, 32 scenarios were evaluated, assuming: q1 = 0.3, 0.4, 0.5 or 0.6; p1-q1 = p2-q2 = 0.1 or 0.2; q1-q2 = 0.1 or 0.2; and d = 0.0 or 0.4.

Treatment functions and dropout models
Three different treatment functions were used to characterize the progression of response rates in the overall and placebo non-responder population (figure 2a-c). These functions were used to derive response rates for time-points prior to week eight. All treatment functions assumed the same asymptotic increase in placebo response. Progression of treatment response was characterized so that early in the trial the effect was smaller (treatment function 1), equal (treatment function 2) or larger (treatment function 3) than the effect at the end of week eight.

Two dropout models were used to characterize dropout over time and to obtain estimates of the dropout rate at the end of the first and second phase of the SPC4+4 and SPC2+6 designs (figure 2d). The first model assumed a linear increase in dropout, while the second assumed dropout to occur predominantly during the first weeks of the follow-up period. The SPC6+6 design was also compared with the RCT8 design. In this comparison, dropout in the RCT8 design differed from that used in the comparisons with the SPC4+4 and SPC2+6 designs, because the dropout model was modified to cover twelve instead of eight weeks.


[Figure 2 appears here. Panels a-c ("Treatment function 1-3") plot response rate against time in weeks (0-8) for treatment (TRT) and placebo (PLC) in the first and second study phase; panel d ("Dropout models") plots dropout rate over time for dropout models 1 and 2.]

Figure 2: Response functions and dropout models. Figures 2a-c show the three different treatment functions that were used to characterize response rates on treatment (squares) and placebo (circles) in the overall (filled symbols) and placebo non-responder (open symbols) population. The various treatment functions assumed the early effect (after two or four weeks) to be smaller (treatment function 1), equal (treatment function 2) or larger (treatment function 3) than the effect after eight weeks. Figure 2d shows the two dropout models that were used to characterize dropout over time. The first dropout model assumed a linear increase in dropout (filled squares), while the second assumed dropout to occur predominantly early (open squares).


Effect size estimate in the SPC design
The estimate of the overall treatment effect in the SPC design, h, is a weighted average of the effects observed in the first and second study phase, and can be expressed as a function of the thirteen different outcome categories (figure 1). The probability of ending up in each of these outcome categories depends on the exposure and outcome in the first and second study phase (table 1).

Table 1: Proportion of patients in each of the 13 outcome categories

Category  1st phase                 2nd phase                  Probability
1         placebo - dropout         -                          a·d1
2         placebo - response        -                          a·(1-d1)·q1
3         placebo - non-response    placebo - dropout          a·(1-d1)·(1-q1)·d2
4         placebo - non-response    placebo - response         a·(1-d1)·(1-q1)·(1-d2)·q2
5         placebo - non-response    placebo - non-response     a·(1-d1)·(1-q1)·(1-d2)·(1-q2)
6         placebo - dropout         -                          a·d1
7         placebo - response        -                          a·(1-d1)·q1
8         placebo - non-response    treatment - dropout        a·(1-d1)·(1-q1)·d2
9         placebo - non-response    treatment - response       a·(1-d1)·(1-q1)·(1-d2)·p2
10        placebo - non-response    treatment - non-response   a·(1-d1)·(1-q1)·(1-d2)·(1-p2)
11        treatment - dropout       -                          (1-2·a)·d1
12        treatment - response      -                          (1-2·a)·(1-d1)·p1
13        treatment - non-response  -                          (1-2·a)·(1-d1)·(1-p1)

The proportion of patients that ends up in each of the 13 possible outcome categories, expressed as a function of exposure (i.e. placebo or treatment) and outcome (response, non-response or dropout) in the first and second study phase. Response and dropout rates in the first phase are denoted by p1, q1 and d1, whereas p2, q2 and d2 denote response and dropout rates in the second phase. a corresponds to the predefined randomization fraction, i.e. the proportion of patients allocated to each of the two placebo groups.
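As a quick consistency check on the category probabilities (a sketch with hypothetical parameter values, in Python rather than the R used in the thesis), the thirteen probabilities must sum to one for any valid choice of randomization fraction, response rates and dropout rates:

```python
# Probabilities of the 13 outcome categories from Table 1, as a function of
# the randomization fraction a, response rates (p1, q1, p2, q2) and
# phase-specific dropout rates (d1, d2). Parameter values are hypothetical.
def category_probs(a, p1, q1, p2, q2, d1, d2):
    plc_drop = a * d1                  # categories 1 and 6: first-phase dropout
    plc_resp = a * (1 - d1) * q1       # categories 2 and 7: first-phase placebo response
    nonresp = a * (1 - d1) * (1 - q1)  # first-phase placebo non-responders (per group)
    return [
        plc_drop,                           # 1  placebo, dropout
        plc_resp,                           # 2  placebo, response
        nonresp * d2,                       # 3  placebo -> placebo, dropout
        nonresp * (1 - d2) * q2,            # 4  placebo -> placebo, response
        nonresp * (1 - d2) * (1 - q2),      # 5  placebo -> placebo, non-response
        plc_drop,                           # 6  placebo, dropout
        plc_resp,                           # 7  placebo, response
        nonresp * d2,                       # 8  placebo -> treatment, dropout
        nonresp * (1 - d2) * p2,            # 9  placebo -> treatment, response
        nonresp * (1 - d2) * (1 - p2),      # 10 placebo -> treatment, non-response
        (1 - 2 * a) * d1,                   # 11 treatment, dropout
        (1 - 2 * a) * (1 - d1) * p1,        # 12 treatment, response
        (1 - 2 * a) * (1 - d1) * (1 - p1),  # 13 treatment, non-response
    ]

probs = category_probs(a=0.3, p1=0.5, q1=0.4, p2=0.3, q2=0.2, d1=0.1, d2=0.1)
print(abs(sum(probs) - 1.0) < 1e-12)  # True
```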

Dropout was considered as non-response (intention-to-treat analysis), and h was therefore defined as:

h = w·(N12/((1-2·a)·n) - (N2+N7)/(2·a·n)) + (1-w)·(N9/(N8+N9+N10) - N4/(N3+N4+N5))

where n is the total sample size, a is the randomization fraction, w is the weighting of the first study phase, and Ni refers to the number of patients in the ith outcome category.

Because dropout was considered as non-response, the actual response rates were lower than those initially assumed for scenarios that included dropout. When, for example, the treatment effect was fixed at p1-q1, the actual response rates on treatment and placebo were, due to dropout, equal to p1·(1-d) and q1·(1-d), respectively, and hence the observed treatment effect was (p1-q1)·(1-d).

The variance of h is equal to D'VD, where D is the column vector of partial derivatives of h with respect to the expected values of N1 to N13, and V is the variance-covariance matrix of N1 to N13 [8]. Because h is asymptotically normally distributed, the standardized overall effect estimate z can be compared to the standard normal distribution, and the power of that test is Φ(z-1.96), where Φ is the cumulative distribution function of the standard normal distribution.
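The dropout deflation can be illustrated by evaluating a phase-wise SPC estimator at the expected category counts from Table 1 (a sketch in Python rather than the R used in the thesis; parameter values are hypothetical, and the estimator here simply contrasts the response proportions within each phase). Without dropout it recovers w·(p1-q1)+(1-w)·(p2-q2); with dropout counted as non-response, the phase-specific effects are deflated by (1-d1) and (1-d2):

```python
# Evaluate a phase-wise SPC effect estimate at the expected category counts
# from Table 1. Hypothetical parameter values; a sketch, not the thesis code.
def expected_h(n, a, w, p1, q1, p2, q2, d1=0.0, d2=0.0):
    nr = a * (1 - d1) * (1 - q1)  # first-phase placebo non-responders (per group)
    N = {2: a * (1 - d1) * q1, 3: nr * d2, 4: nr * (1 - d2) * q2,
         5: nr * (1 - d2) * (1 - q2), 7: a * (1 - d1) * q1, 8: nr * d2,
         9: nr * (1 - d2) * p2, 10: nr * (1 - d2) * (1 - p2),
         12: (1 - 2 * a) * (1 - d1) * p1}
    N = {k: v * n for k, v in N.items()}  # expected numbers of patients
    # First-phase effect: treatment vs pooled placebo response proportions.
    eff1 = N[12] / ((1 - 2 * a) * n) - (N[2] + N[7]) / (2 * a * n)
    # Second-phase effect among continuing placebo non-responders.
    eff2 = N[9] / (N[8] + N[9] + N[10]) - N[4] / (N[3] + N[4] + N[5])
    return w * eff1 + (1 - w) * eff2

print(round(expected_h(600, 0.3, 0.6, 0.5, 0.4, 0.3, 0.2), 3))            # 0.1
print(round(expected_h(600, 0.3, 0.6, 0.5, 0.4, 0.3, 0.2, 0.4, 0.4), 3))  # 0.06
```

With d1 = d2 = 0.4, both phase effects shrink by a factor 0.6, so the weighted effect drops from 0.1 to 0.06, matching the (1-d) deflation described above.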

Sample size calculation
The sample size required to attain 80 percent statistical power with the SPC design was determined as follows: for sample sizes up to 5,000 patients, the standardized overall effect estimate z was calculated for all possible combinations of a and w, where a ranged from 0.10 to 0.45 and w ranged from 0.20 to 0.90. Next, the minimal sample size that corresponded to the desired power was selected.

For the RCT8 design, the sample size was derived for a standard chi-square test with p1·(1-d) and q1·(1-d) as the respective response rates for treatment and placebo, to account for dropout by means of an intention-to-treat analysis. All tests were two-sided with an overall statistical significance level of 0.05. All simulations and analyses were carried out in R (version 2.14.0, http://www.r-project.org). The simulation script is available upon request.
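For intuition about the RCT8 comparator, the textbook normal-approximation sample size formula for two proportions can be applied to the dropout-deflated rates (a simplified stand-in for the exact chi-square computation in the thesis, in Python rather than R; the rates below are hypothetical):

```python
from math import ceil, sqrt

# Normal-approximation sample size per arm for a two-sided two-proportion
# test (alpha = 0.05, power = 80 percent). A textbook approximation, not
# the thesis's exact chi-square computation.
def rct_n_per_group(p, q, z_alpha=1.96, z_beta=0.8416):
    pbar = (p + q) / 2
    num = (z_alpha * sqrt(2 * pbar * (1 - pbar))
           + z_beta * sqrt(p * (1 - p) + q * (1 - q))) ** 2
    return ceil(num / (p - q) ** 2)

# Dropout d deflates both response rates when dropout counts as non-response,
# shrinking the detectable difference and inflating the required sample size.
p1, q1, d = 0.5, 0.4, 0.4
print(rct_n_per_group(p1 * (1 - d), q1 * (1 - d)))  # per-arm n for 0.30 vs 0.24
```

Note how the same absolute dropout rate hurts more when it shrinks an already small difference: the deflated comparison (0.30 vs 0.24) needs roughly twice the per-arm sample size of the undeflated one (0.50 vs 0.40).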

Antidepressant data
The sample size requirements for the three designs were also estimated for a realistic scenario where dropout and response rates were derived from empirical antidepressant trial data. These data came from four phase IIb/III confirmatory trials that compared mirtazapine with placebo and reported a statistically significant treatment effect (three of these studies have been published [13-15]). For 349 patients with major depressive disorder, the relative change from baseline (rCFB) on the Hamilton depression rating scale was calculated for all available weekly assessments, and corrected for investigational center. The binary response criterion was set at a corrected rCFB ≥ 50 percent, which is a common efficacy criterion in antidepressant trials. Response rates were calculated as the number of responders divided by the total number of patients still in the study at the time of assessment. The dropout rate was defined as the proportion of patients lost to follow-up.

The four available studies were all six weeks in duration. To match the eight-week designs investigated, we extrapolated the observed dropout and response rates beyond six weeks. As in the simulated scenarios, the treatment effect (i.e. the difference between treatment and placebo) after eight weeks of exposure was assumed to be equal in the overall and placebo non-responder population. The shift in placebo response was set to ten percent. Response rates for week eight were extrapolated based on data from the first six weeks, using an exponential, non-linear least squares model (with the nls() function in R). Likewise, dropout rates from the first six weeks were extrapolated to derive an estimate for the dropout rate at week eight. Based on the resulting parameter values, we estimated the sample size requirements for the three different eight-week designs to establish a significant treatment effect with 80 percent power.
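The week-eight extrapolation step can be sketched as follows (Python with scipy's curve_fit standing in for R's nls(); the weekly response rates below are invented for illustration and are not the mirtazapine trial data):

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit an exponential (asymptotic) response model to weekly response rates
# observed up to week six, then read off the extrapolated week-eight value.
def asymptotic(t, rmax, k):
    # Response rises toward the asymptote rmax at rate k per week.
    return rmax * (1 - np.exp(-k * t))

weeks = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
resp = np.array([0.10, 0.22, 0.30, 0.36, 0.40, 0.43])  # hypothetical rates

(rmax, k), _ = curve_fit(asymptotic, weeks, resp, p0=(0.5, 0.3))
week8 = float(asymptotic(8.0, rmax, k))
print(round(week8, 2))  # extrapolated week-eight response rate
```

The same model form, fitted to the cumulative dropout proportions instead of the response rates, would give the week-eight dropout estimate.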

Results

SPC4+4 versus RCT8

Results from the simulations comparing the SPC4+4 and RCT8 designs are presented in figure 3. The comparative performance of the SPC4+4 design was greatly dependent on the assumed treatment function. For scenarios with no dropout, the SPC4+4 design was more efficient than the RCT8 design under treatment function 2 (i.e. equal effect size after four and eight weeks; 8 - 34 percent fewer patients) and treatment function 3 (i.e. larger effect after four than after eight weeks; over 60 percent fewer patients). Conversely, under treatment function 1 (i.e. smaller effect after four than after eight weeks), the SPC4+4 design was much less efficient, requiring up to three times as many patients.

The presence of dropout further increased the relative inefficiency of the SPC4+4 design under treatment function 1, especially when dropout followed dropout model 2 (i.e. predominantly early dropout). On the other hand, it increased the comparative advantage of the SPC4+4 design when the effect after four weeks was equal to or larger than the effect after eight weeks (i.e. treatment functions 2 and 3, respectively). More specifically, under treatment function 2 the gain in sample size was largest when dropout followed dropout model 1 (i.e. linearly increasing dropout; about 50 percent fewer patients), but it was also substantial under dropout model 2 (36 - 39 percent fewer patients). Under treatment function 3, dropout increased the comparative efficiency of the SPC4+4 design to at least 60 percent fewer patients, with little difference between dropout models 1 and 2.

SPC2+6 versus SPC4+4 and RCT8

Results of the comparison between the SPC2+6 and the RCT8 design are presented in figure 4, and show the same general trends as observed in figure 3. In the absence of dropout, the SPC2+6 design generally required fewer patients than the RCT8 design under treatment function 2 (0 - 34 percent fewer patients) and treatment function 3 (around 70 percent fewer patients), but was inefficient under treatment function 1 (up to about three times as many patients). In the presence of dropout, the relative efficiency of the SPC2+6 design increased under treatment functions 2 and 3, while under treatment function 1 the already inefficient SPC2+6 design performed even worse.


[Figure 3 appears here. Three panels (a: no dropout; b: dropout model 1; c: dropout model 2) plot sample size (0-1200) against the first-phase placebo response rate (q1 = 0.3-0.6) for the RCT8 design and the SPC4+4 design under treatment functions 1-3.]

Figure 3: Sample size requirements for the SPC4+4 and RCT8 design. Sample size requirements for the SPC4+4 and RCT8 design when no dropout is assumed (3a), and when dropout is characterized by dropout model 1 (3b) and dropout model 2 (3c). All figures correspond to scenarios assuming equal effect size p1-q1 = p2-q2 = 0.20 and a shift from first- to second-phase placebo response rate q1-q2 = 0.10. As a result, the first-phase treatment response rate (p1) and the second-phase response rates to treatment (p2) and placebo (q2) can be derived from the first-phase placebo response rate (q1) on the x-axis.

A comparison between the SPC2+6 and the SPC4+4 design reveals that, in the absence of dropout, the SPC2+6 design was more efficient than the SPC4+4 design under treatment function 1 (18 - 23 percent fewer patients) and treatment function 3 (28 percent fewer patients), while approximately equally efficient under treatment function 2. Under treatment function 1, the vast majority of patients in the SPC2+6 design were randomized to placebo, and the highest possible weight was assigned to the second study phase (appendix 2), causing the SPC2+6 design to resemble a RCT design preceded by a placebo lead-in phase.

Early reallocation of placebo non-responders in SPC designs


Under dropout model 1, the SPC2+6 design outperformed the SPC4+4 design under treatment function 2 (2–14 percent fewer patients) and treatment function 3 (47–52 percent fewer patients), but not under treatment function 1 (0–14 percent more patients). Under dropout model 2, the SPC2+6 design was more efficient under all treatment functions: around 20 percent, 8–13 percent, and 41–51 percent fewer patients under treatment functions 1, 2 and 3, respectively.

[Figure 4, panels a–c: sample size (y-axis, 0–1200) plotted against the first-phase placebo response rate q1 (x-axis, 0.3–0.6) for the RCT8 design and the SPC2+6 design under treatment functions 1–3; panel a: no dropout, panel b: dropout model 1, panel c: dropout model 2.]

Figure 4: Sample size requirements for the SPC2+6 and RCT8 design. Sample size requirements for the SPC2+6 and RCT8 design when no dropout is assumed (4a), and when dropout is characterized by dropout model 1 (4b) and dropout model 2 (4c).


The results presented in figures 3 and 4 correspond to scenarios that assumed an effect size of 20 percent and a 10 percent reduction in placebo response from the first to the second study phase. Obviously, other assumptions produced different results; a smaller treatment effect raised sample size requirements for all designs, and a larger reduction in placebo response increased the comparative efficiency of the SPC4+4 and SPC2+6 designs. However, the main conclusions remained unchanged. The complete simulation output, along with the optimal randomization ratios (a) and weightings of the first study phase (w), is presented in appendices 1-3. Notice that the optimal values for a and w were relatively insensitive to the introduction of dropout, which was also previously reported by Tamura and Huang [8].
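The simulation machinery behind such estimates can be illustrated with a simplified Monte Carlo sketch. This is not the thesis's actual code, and several details are assumptions: it takes a to be the fraction of first-phase patients randomized to active treatment, re-randomizes first-phase placebo non-responders 1:1 in the second phase, combines the two phase-wise z statistics as (w·Z1 + (1−w)·Z2)/√(w² + (1−w)²), and ignores dropout.

```python
import random
from math import sqrt

def z_two_proportions(r_t, n_t, r_c, n_c):
    """Pooled two-sample z statistic for a difference in response proportions."""
    if min(n_t, n_c) == 0:
        return 0.0
    p = (r_t + r_c) / (n_t + n_c)
    se = sqrt(p * (1 - p) * (1 / n_t + 1 / n_c))
    return ((r_t / n_t) - (r_c / n_c)) / se if se > 0 else 0.0

def spc_power(n, a, w, p1, q1, p2, q2, sims=500, z_crit=1.96, seed=7):
    """Monte Carlo power of a two-phase SPC trial with total sample size n."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        # Phase 1: randomize a:(1-a) to active treatment and placebo.
        n_t1 = round(n * a)
        n_c1 = n - n_t1
        r_t1 = sum(rng.random() < p1 for _ in range(n_t1))
        r_c1 = sum(rng.random() < q1 for _ in range(n_c1))
        z1 = z_two_proportions(r_t1, n_t1, r_c1, n_c1)
        # Phase 2: placebo non-responders are re-randomized 1:1.
        nonresp = n_c1 - r_c1
        n_t2, n_c2 = nonresp // 2, nonresp - nonresp // 2
        r_t2 = sum(rng.random() < p2 for _ in range(n_t2))
        r_c2 = sum(rng.random() < q2 for _ in range(n_c2))
        z2 = z_two_proportions(r_t2, n_t2, r_c2, n_c2)
        # Weighted combination of the two phase-wise statistics.
        z = (w * z1 + (1 - w) * z2) / sqrt(w ** 2 + (1 - w) ** 2)
        rejections += z > z_crit
    return rejections / sims

# Illustrative scenario with q1 = 0.3, p1 = 0.4, q2 = 0.2, p2 = 0.3:
power_alt = spc_power(450, a=0.27, w=0.68, p1=0.4, q1=0.3, p2=0.3, q2=0.2)
power_null = spc_power(450, a=0.27, w=0.68, p1=0.3, q1=0.3, p2=0.2, q2=0.2)
```

Searching a grid of (a, w) values for the smallest n that reaches 80 percent power, scenario by scenario, yields tables like those in appendices 1-3.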

SPC6+6 versus RCT8

For scenarios without dropout, the SPC6+6 design was generally more efficient compared to the RCT8 design under treatment function 2 (16–33 percent fewer patients) and treatment function 3 (around 43 percent fewer patients), but required more patients under treatment function 1 (12–55 percent more patients) (appendix 3). With linearly increasing dropout, the SPC6+6 design was 8–20 percent, 45–57 percent, and 54–70 percent more efficient under treatment functions 1, 2 and 3, respectively. However, when dropout occurred predominantly early, the SPC6+6 design was more efficient under treatment functions 2 (34–44 percent fewer patients) and 3 (44–62 percent fewer patients), but not under treatment function 1 (around 10 percent more patients).

Antidepressant data

Data-derived dropout and response rates on treatment and placebo are presented in figure 5. The placebo response rate was 9 percent after two weeks and 20 percent after four weeks. In the treatment group, this was 26 and 43 percent, respectively. The dropout rate was 5 percent after two weeks and 13 percent after four weeks. Extrapolation to eight weeks yielded an estimated response rate of 43 percent for the placebo group and 68 percent for the treatment group. The extrapolated dropout rate after eight weeks was 35 percent. Based on these values, the RCT8 design required in total 278 patients to establish a significant treatment effect with 80 percent statistical power. The SPC4+4 design required 130 patients with optimal a = 0.24 and w = 0.79. The SPC2+6 design required a sample size of 136, with a = 0.21 and w = 0.80. This means that relative to the RCT8 design, the sample size gain associated with the SPC4+4 and the SPC2+6 design was 53 and 51 percent, respectively.
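For orientation, the conventional-design requirement can be approximated with the standard normal-approximation formula for comparing two proportions, inflated for dropout. This sketch assumes a two-sided alpha of 0.05 and 80 percent power; it is not the thesis's simulation-based method and will not reproduce the 278 reported above.

```python
from math import ceil
from statistics import NormalDist

def two_proportion_n(p_t, p_c, alpha=0.05, power=0.80, dropout=0.0):
    """Total sample size (both arms, 1:1 randomization) for a two-sided
    z-test of two proportions, inflated so the retained sample keeps power."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    var = p_t * (1 - p_t) + p_c * (1 - p_c)
    n_per_arm = (z_a + z_b) ** 2 * var / (p_t - p_c) ** 2
    return ceil(2 * n_per_arm / (1 - dropout))

# Extrapolated eight-week rates from the antidepressant data,
# with 35 percent dropout:
two_proportion_n(0.68, 0.43, dropout=0.35)  # → 179
```

The crude inflation factor gives a smaller number than the 278 reported above, because the thesis's estimate comes from simulation under an explicit dropout model rather than from simple attrition adjustment.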


[Figure 5, panels a–b: panel a plots the response rate (0–0.8) and panel b the dropout rate (0–0.4) against time in weeks (0–8).]

Figure 5: Response and dropout rates from empirical antidepressant trial data. Figure 5a shows the average response rates to treatment (filled squares) and placebo (open circles) as derived from four significant confirmatory antidepressant trials (n=349). Figure 5b shows the corresponding dropout rates.

Discussion

Our simulations indicated that the SPC2+6 design and the SPC4+4 design were both efficient alternatives to the RCT8 design when a substantial treatment effect was already established after a short follow-up period. In this circumstance, their comparative advantage further increased in the presence of dropout. Conversely, when the effect was established slowly, both SPC designs required more patients than the RCT8 design, and their relative performance worsened in the presence of dropout. The SPC2+6 design was usually more efficient than the SPC4+4 design, even when the effect size after two weeks was smaller than after four. Moreover, dropout had a larger effect on the relative efficiency of the SPC4+4 design when it occurred predominantly early than when it increased linearly over time. Comparing the SPC6+6 and the RCT8 design revealed that (as for the other SPC designs) the former was more efficient when a meaningful effect was established early on. Dropout increased the relative efficiency of the SPC6+6 design, especially when it occurred predominantly early, before week eight.


Some of these findings merit a more detailed explanation. We observed that dropout increased the relative efficiency of the SPC design regardless of the duration of the first study phase. This is because the impact of dropout on the RCT8 design is independent of the dropout model, yet dropout models that reach their maximum only after eight weeks have less impact on treatment functions that show a larger effect after two or four weeks than after eight.

The observation that the SPC2+6 design was more efficient than the SPC4+4 design when the effect was smaller after two weeks than after four can be explained by the fact that the duration, and therefore the effect size, of the second phase of the SPC2+6 design is larger than in the SPC4+4 design. When this is true (and expected a priori), the majority of patients will be randomized to placebo and the largest weight will be assigned to the second study phase.

One of the main conclusions from our simulations is that the SPC design is a particularly efficient alternative to a conventional RCT design with equal total follow-up duration when a meaningful treatment effect is established early on (i.e. at the end of the first SPC study phase). Several recent evaluations of the progression of treatment response in antidepressant clinical trials support this presumption [16-20]. Our own analysis of data from four positive confirmatory antidepressant trials suggested a somewhat smaller effect size after two weeks than after eight. Still, the early effect observed in these data was sufficiently large to result in a sample size advantage for both versions of the SPC design. This was also due to the fact that additional dropout after two and four weeks was relatively large, and a lot of weight was assigned to the first study phase.

Previously, Tamura and Huang compared the efficiency of the SPC and RCT design, and concluded that the SPC design required 20 to 25 percent fewer patients under a wide range of underlying assumptions [8]. However, this study did not adequately account for dropout and different time-dependent treatment functions. Our simulations did, and showed that the advantage of the SPC design over the RCT design was highly dependent on the progression of dropout and treatment effect over time.

Our simulations were limited in the number of parameters and scenarios that were evaluated. Scenarios were defined to include dropout and response rates typically observed in antidepressant clinical trials. In addition, we evaluated only dichotomous outcomes (i.e. response vs. non-response). These choices may limit the generalizability of our results.

Like previous studies [4,8], this study focused exclusively on sample size requirements, since patient recruitment usually takes up the largest portion of the total time and money needed to conduct a trial. However, with the SPC6+6 design, a reduction in sample size comes with a longer trial duration and possibly additional clinical assessments per patient. If such assessments are costly, a longer trial duration may substantially increase the total cost of the trial. This should be considered in the planning stage.

Other issues also need to be considered when planning an SPC trial or interpreting its results. First, the patient populations in both phases of the study differ, since the second phase includes only a selection of the patients from the first. As a result, the interpretation of the overall effect estimate is complicated, and generalizability limited [21,22]. However, in clinical drug development, early clinical trials often serve primarily to provide proof-of-concept rather than to obtain a conclusive effect size estimate for the entire target population. The SPC design would be well suited for this purpose, as it is more sensitive to detect a treatment effect in the presence of substantial placebo response. If, under such circumstances, no effect is observed in an SPC design study, it will be unlikely that efficacy can be demonstrated in subsequent, conventionally designed, confirmatory trials. Moreover, if the aim is to provide proof-of-concept rather than a generalizable effect size estimate, the performance of the SPC design may be further improved by choosing (before the study is initiated) a response criterion that provides an optimal balance between a small placebo response in the first phase and a sizable treatment effect in the second.

A second issue to consider is that the efficiency of the SPC design strongly depends on the randomization ratio (a) and the weighting of the first study phase (w). Both values were optimized to acquire the presented sample size estimates, but need to be selected a priori when conducting an SPC trial in practice. Our simulations showed that the optimal values for a and w are relatively insensitive to the introduction of dropout. Adequate values for a and w should be established using simulations like those described in our study.

Thirdly, apparent similarities can be noted between the SPC2+6 design and a conventional parallel group design with a placebo lead-in phase. Both designs include a short initial study phase aimed to reduce the overall placebo response by identifying placebo responders and excluding them from the study.
However, placebo lead-in phases are typically only single-blinded, with clinicians being aware of treatment allocation. This may introduce bias when, for example, improvements during lead-in are underestimated, or a low expectation of improvement is implicitly communicated to patients. Moreover, the SPC design combines outcomes from both design phases, whereas data from a placebo lead-in phase are typically disregarded. Hence, the SPC design makes more efficient use of the available patient data.

This study suggests that both the SPC2+6 and the SPC4+4 design are highly efficient alternatives to the RCT8 design for trials in indications where placebo response is typically high and substantial treatment effects are established after a relatively short follow-up period. However, it was also observed that both designs are inefficient when the early effect is small. It is therefore important to have a clear and well-informed prospect of the progression of response rates and effect size over the course of the trial when considering the use of these designs in practice. Previous studies and our own evaluation of response rates in antidepressant trials suggest that SPC designs can considerably reduce trial sample size requirements in this field.


Reference List

(1) Kola I. The state of innovation in drug development. Clinical Pharmacology and Therapeutics 2008; 83(2):227-230.
(2) Khan A, Khan SR, Walens G, et al. Frequency of positive studies among fixed and flexible dose antidepressant clinical trials: an analysis of the food and drug administration summary basis of approval reports. Neuropsychopharmacology 2003; 28(3):552-557.
(3) Gordian M, Singh N, Zemmel R, Elias T. Why products fail in phase III. In Vivo 2010; 24:49-54.
(4) Fava M, Evins AE, Dorer DJ, Schoenfeld DA. The problem of the placebo response in clinical trials for psychiatric disorders: culprits, possible remedies, and a novel study design approach. Psychotherapy and Psychosomatics 2003; 72(3):115-127.
(5) Robinson DS, Rickels K. Concerns about clinical drug trials. Journal of Clinical Psychopharmacology 2000; 20(6):593-596.
(6) Dworkin RH, Katz J, Gitlin MJ. Placebo response in clinical trials of depression and its implications for research on chronic neuropathic pain. Neurology 2005; 65(12 Suppl 4):S7-19.
(7) Grandi S. The sequential parallel comparison model: a revolution in the design of clinical trials. Psychotherapy and Psychosomatics 2003; 72(3):113-114.
(8) Tamura RN, Huang X. An examination of the efficiency of the sequential parallel design in psychiatric clinical trials. Clinical Trials 2007; 4(4):309-317.
(9) Papakostas G, Shelton R, Zajecka J. P01-588 - L-methylfolate augmentation of selective serotonin reuptake inhibitors (SSRIs) for major depressive disorder: results of two randomized, double-blind trials. European Psychiatry 2011; 26(Suppl 1):592.
(10) Gomeni R, Merlo-Pich E. Bayesian modelling and ROC analysis to predict placebo responders using clinical score measured in the initial weeks of treatment in depression trials. British Journal of Clinical Pharmacology 2007; 63(5):595-613.
(11) Henkel V, Seemuller F, Obermeier M, et al. Does early improvement triggered by antidepressants predict response/remission? Analysis of data from a naturalistic study on a large sample of inpatients with major depression. Journal of Affective Disorders 2009; 115(3):439-449.
(12) Szegedi A, Jansen WT, van Willigenburg, et al. Early improvement in the first 2 weeks as a predictor of treatment outcome in patients with major depressive disorder: a meta-analysis including 6562 patients. Journal of Clinical Psychiatry 2009; 70(3):344-353.
(13) Claghorn JL, Lesem MD. A double-blind placebo-controlled study of Org 3770 in depressed outpatients. Journal of Affective Disorders 1995; 34(3):165-171.
(14) Bremner JD. A double-blind comparison of Org 3770, amitriptyline and placebo in major depression. Journal of Clinical Psychiatry 1995; 56(11):519-525.
(15) Smith NIT, Glaudin V, Panagides J, et al. Mirtazapine vs. amitriptyline vs. placebo in the treatment of major depressive disorder. Psychopharmacology Bulletin 1990; 26(2):191-196.
(16) van Calker D, Zobel I, Dykierek P, et al. Time course of response to antidepressants: predictive value of early improvement and effect of additional psychotherapy. Journal of Affective Disorders 2009; 114(1-3):243-253.
(17) Katz MM, Tekell JL, Bowden CL, et al. Onset and early behavioral effects of pharmacologically different antidepressants and placebo in depression. Neuropsychopharmacology 2004; 29(3):566-579.
(18) Nierenberg AA, Farabaugh AH, Alpert JE, et al. Timing of onset of antidepressant response with fluoxetine treatment. American Journal of Psychiatry 2000; 157(9):1423-1428.
(19) Szegedi A, Muller MJ, Anghelescu I, et al. Early improvement under mirtazapine and paroxetine predicts later stable response and remission with high sensitivity in patients with major depression. Journal of Clinical Psychiatry 2003; 64(4):413-420.
(20) Stassen HH, Angst J, Hell D, et al. Is there a common resilience mechanism underlying antidepressant drug response? Evidence from 2848 patients. Journal of Clinical Psychiatry 2007; 68(8):1195-1205.
(21) Temple R. FDA perspective on trials with interim efficacy evaluations. Statistics in Medicine 2006; 25(19):3245-3249.
(22) Wang SJ, Hung HM, O'Neill RT. Adaptive patient enrichment designs in therapeutic trials. Biometrical Journal 2009; 51(2):358-374.


Appendix 1: Simulation results for the SPC4+4 design

Simulation parameters | RCT8 | SPC4+4 - trt. func. 1 | SPC4+4 - trt. func. 2 | SPC4+4 - trt. func. 3
q1 p1 q2 p2 d | n | a w n | a w n | a w n

No dropout
0.3 0.4 0.2 0.3 0 | 712 | 0.32 0.55 1426 | 0.28 0.66 458 | 0.27 0.68 206
0.3 0.4 0.1 0.2 0 | 712 | 0.4 0.28 925 | 0.32 0.52 402 | 0.27 0.62 200
0.3 0.5 0.2 0.4 0 | 186 | 0.31 0.59 392 | 0.27 0.66 123 | 0.26 0.69 77
0.3 0.5 0.1 0.3 0 | 186 | 0.37 0.37 288 | 0.28 0.59 113 | 0.27 0.64 74
0.4 0.5 0.3 0.4 0 | 776 | 0.31 0.63 1850 | 0.28 0.69 549 | 0.26 0.74 192
0.4 0.5 0.2 0.3 0 | 776 | 0.34 0.5 1496 | 0.29 0.65 522 | 0.26 0.72 194
0.4 0.6 0.3 0.5 0 | 194 | 0.29 0.66 486 | 0.28 0.7 140 | 0.28 0.72 74
0.4 0.6 0.2 0.4 0 | 194 | 0.32 0.55 412 | 0.27 0.68 136 | 0.26 0.7 75
0.5 0.6 0.4 0.5 0 | 776 | 0.3 0.69 2162 | 0.28 0.74 609 | 0.27 0.76 167
0.5 0.6 0.3 0.4 0 | 776 | 0.31 0.63 1919 | 0.29 0.71 598 | 0.28 0.76 171
0.5 0.7 0.4 0.6 0 | 186 | 0.29 0.7 551 | 0.27 0.75 150 | 0.27 0.75 67
0.5 0.7 0.3 0.5 0 | 186 | 0.3 0.65 501 | 0.29 0.73 148 | 0.27 0.75 68
0.6 0.7 0.5 0.6 0 | 712 | 0.29 0.75 2349 | 0.28 0.77 637 | 0.27 0.79 138
0.6 0.7 0.4 0.5 0 | 712 | 0.3 0.71 2195 | 0.28 0.76 636 | 0.27 0.81 142
0.6 0.8 0.5 0.7 0 | 164 | 0.28 0.76 585 | 0.28 0.77 151 | 0.29 0.79 55
0.6 0.8 0.4 0.6 0 | 164 | 0.29 0.72 554 | 0.29 0.78 151 | 0.27 0.81 57

Linear dropout
0.3 0.4 0.2 0.3 0.4 | 1446 | 0.3 0.65 2584 | 0.28 0.72 794 | 0.26 0.74 352
0.3 0.4 0.1 0.2 0.4 | 1446 | 0.36 0.43 1851 | 0.29 0.62 722 | 0.27 0.7 343
0.3 0.5 0.2 0.4 0.4 | 396 | 0.29 0.68 702 | 0.26 0.73 212 | 0.25 0.75 132
0.3 0.5 0.1 0.3 0.4 | 396 | 0.32 0.53 556 | 0.27 0.67 199 | 0.26 0.7 128
0.4 0.5 0.3 0.4 0.4 | 1718 | 0.29 0.72 3257 | 0.27 0.77 938 | 0.26 0.78 323
0.4 0.5 0.2 0.3 0.4 | 1718 | 0.31 0.62 2775 | 0.28 0.72 904 | 0.27 0.77 325
0.4 0.6 0.3 0.5 0.4 | 456 | 0.28 0.73 849 | 0.27 0.76 239 | 0.26 0.77 126
0.4 0.6 0.2 0.4 0.4 | 456 | 0.3 0.64 751 | 0.27 0.74 233 | 0.26 0.76 126
0.5 0.6 0.4 0.5 0.4 | 1926 | 0.28 0.77 3727 | 0.27 0.8 1028 | 0.26 0.82 279
0.5 0.6 0.3 0.4 0.4 | 1926 | 0.29 0.72 3412 | 0.27 0.78 1015 | 0.26 0.82 284
0.5 0.7 0.4 0.6 0.4 | 500 | 0.28 0.77 946 | 0.26 0.81 253 | 0.26 0.81 112
0.5 0.7 0.3 0.5 0.4 | 500 | 0.29 0.73 881 | 0.27 0.78 251 | 0.27 0.8 113
0.6 0.7 0.5 0.6 0.4 | 2074 | 0.28 0.8 3980 | 0.27 0.83 1062 | 0.27 0.83 228
0.6 0.7 0.4 0.5 0.4 | 2074 | 0.28 0.78 3786 | 0.27 0.82 1061 | 0.27 0.84 233
0.6 0.8 0.5 0.7 0.4 | 530 | 0.28 0.81 988 | 0.28 0.82 251 | 0.29 0.84 91
0.6 0.8 0.4 0.6 0.4 | 530 | 0.28 0.79 949 | 0.27 0.83 252 | 0.28 0.85 93

Predominantly early dropout
0.3 0.4 0.2 0.3 0.4 | 1446 | 0.35 0.53 2858 | 0.31 0.64 949 | 0.29 0.68 433
0.3 0.4 0.1 0.2 0.4 | 1446 | 0.44 0.2 1722 | 0.34 0.51 812 | 0.3 0.61 416
0.3 0.5 0.2 0.4 0.4 | 396 | 0.33 0.57 794 | 0.3 0.66 255 | 0.29 0.68 161
0.3 0.5 0.1 0.3 0.4 | 396 | 0.4 0.33 550 | 0.32 0.56 230 | 0.29 0.62 154
0.4 0.5 0.3 0.4 0.4 | 1718 | 0.32 0.64 3796 | 0.3 0.7 1152 | 0.29 0.73 406
0.4 0.5 0.2 0.3 0.4 | 1718 | 0.37 0.47 2946 | 0.31 0.64 1085 | 0.28 0.72 412
0.4 0.6 0.3 0.5 0.4 | 456 | 0.32 0.65 1001 | 0.29 0.71 295 | 0.29 0.72 157
0.4 0.6 0.2 0.4 0.4 | 456 | 0.35 0.53 822 | 0.3 0.67 283 | 0.29 0.71 157
0.5 0.6 0.4 0.5 0.4 | 1926 | 0.31 0.7 4512 | 0.29 0.74 1294 | 0.28 0.77 358
0.5 0.6 0.3 0.4 0.4 | 1926 | 0.33 0.62 3910 | 0.3 0.72 1265 | 0.28 0.77 369
0.5 0.7 0.4 0.6 0.4 | 500 | 0.3 0.72 1153 | 0.29 0.75 318 | 0.29 0.76 142
0.5 0.7 0.3 0.5 0.4 | 500 | 0.32 0.65 1028 | 0.29 0.73 315 | 0.29 0.77 145
0.6 0.7 0.5 0.6 0.4 | 2074 | 0.3 0.75 4976 | 0.29 0.78 1366 | 0.29 0.8 297
0.6 0.7 0.4 0.5 0.4 | 2074 | 0.31 0.72 4581 | 0.29 0.78 1362 | 0.28 0.81 309
0.6 0.8 0.5 0.7 0.4 | 530 | 0.3 0.76 1240 | 0.29 0.79 323 | 0.29 0.8 119
0.6 0.8 0.4 0.6 0.4 | 530 | 0.31 0.73 1160 | 0.29 0.78 325 | 0.29 0.81 123


Appendix 2: Simulation results for the SPC2+6 design

Simulation parameters | RCT8 | SPC2+6 - trt. func. 1 | SPC2+6 - trt. func. 2 | SPC2+6 - trt. func. 3
q1 p1 q2 p2 d | n | a w n | a w n | a w n

No dropout
0.3 0.4 0.2 0.3 0 | 712 | 0.45 0.2 1078 | 0.28 0.67 436 | 0.21 0.83 115
0.3 0.4 0.1 0.2 0 | 712 | 0.44 0.2 675 | 0.32 0.52 371 | 0.22 0.77 111
0.3 0.5 0.2 0.4 0 | 186 | 0.45 0.2 300 | 0.27 0.66 122 | 0.21 0.79 55
0.3 0.5 0.1 0.3 0 | 186 | 0.44 0.2 209 | 0.29 0.57 109 | 0.21 0.75 53
0.4 0.5 0.3 0.4 0 | 776 | 0.45 0.2 1445 | 0.29 0.68 530 | 0.21 0.87 96
0.4 0.5 0.2 0.3 0 | 776 | 0.45 0.2 1093 | 0.3 0.62 495 | 0.22 0.83 95
0.4 0.6 0.3 0.5 0 | 194 | 0.45 0.2 381 | 0.27 0.68 142 | 0.22 0.81 49
0.4 0.6 0.2 0.4 0 | 194 | 0.45 0.2 305 | 0.28 0.64 135 | 0.23 0.81 48
0.5 0.6 0.4 0.5 0 | 776 | 0.45 0.2 1739 | 0.29 0.69 603 | 0.22 0.87 78
0.5 0.6 0.3 0.4 0 | 776 | 0.45 0.2 1467 | 0.29 0.67 587 | 0.22 0.87 78
0.5 0.7 0.4 0.6 0 | 186 | 0.45 0.2 443 | 0.28 0.67 156 | 0.22 0.82 42
0.5 0.7 0.3 0.5 0 | 186 | 0.45 0.2 388 | 0.28 0.66 154 | 0.22 0.82 42
0.6 0.7 0.5 0.6 0 | 712 | 0.45 0.2 1942 | 0.29 0.69 655 | 0.23 0.88 62
0.6 0.7 0.4 0.5 0 | 712 | 0.45 0.2 1775 | 0.29 0.69 654 | 0.23 0.9 62
0.6 0.8 0.5 0.7 0 | 164 | 0.45 0.2 479 | 0.29 0.67 164 | 0.24 0.83 34
0.6 0.8 0.4 0.6 0 | 164 | 0.45 0.2 453 | 0.29 0.67 166 | 0.24 0.86 34

Linear dropout
0.3 0.4 0.2 0.3 0.4 | 1446 | 0.45 0.21 2536 | 0.25 0.78 654 | 0.21 0.86 149
0.3 0.4 0.1 0.2 0.4 | 1446 | 0.45 0.2 1587 | 0.27 0.69 599 | 0.22 0.83 146
0.3 0.5 0.2 0.4 0.4 | 396 | 0.45 0.2 705 | 0.25 0.78 182 | 0.21 0.86 73
0.3 0.5 0.1 0.3 0.4 | 396 | 0.44 0.2 490 | 0.25 0.71 172 | 0.2 0.83 72
0.4 0.5 0.3 0.4 0.4 | 1718 | 0.44 0.26 3392 | 0.26 0.79 785 | 0.22 0.9 122
0.4 0.5 0.2 0.3 0.4 | 1718 | 0.45 0.2 2573 | 0.27 0.74 757 | 0.22 0.87 122
0.4 0.6 0.3 0.5 0.4 | 456 | 0.45 0.22 894 | 0.24 0.78 211 | 0.22 0.87 64
0.4 0.6 0.2 0.4 0.4 | 456 | 0.45 0.2 718 | 0.25 0.76 205 | 0.21 0.86 64
0.5 0.6 0.4 0.5 0.4 | 1926 | 0.43 0.3 4073 | 0.26 0.79 890 | 0.22 0.9 99
0.5 0.6 0.3 0.4 0.4 | 1926 | 0.45 0.21 3455 | 0.27 0.77 876 | 0.22 0.9 99
0.5 0.7 0.4 0.6 0.4 | 500 | 0.44 0.26 1037 | 0.26 0.78 231 | 0.23 0.87 54
0.5 0.7 0.3 0.5 0.4 | 500 | 0.45 0.2 914 | 0.25 0.77 230 | 0.23 0.86 54
0.6 0.7 0.5 0.6 0.4 | 2074 | 0.43 0.3 4543 | 0.27 0.79 966 | 0.24 0.9 78
0.6 0.7 0.4 0.5 0.4 | 2074 | 0.45 0.23 4174 | 0.26 0.8 966 | 0.23 0.89 79
0.6 0.8 0.5 0.7 0.4 | 530 | 0.44 0.26 1123 | 0.26 0.77 245 | 0.23 0.88 44
0.6 0.8 0.4 0.6 0.4 | 530 | 0.45 0.21 1065 | 0.26 0.79 246 | 0.23 0.9 44

Predominantly early dropout
0.3 0.4 0.2 0.3 0.4 | 1446 | 0.45 0.2 2203 | 0.27 0.73 790 | 0.21 0.87 197
0.3 0.4 0.1 0.2 0.4 | 1446 | 0.45 0.2 1366 | 0.3 0.6 691 | 0.22 0.8 192
0.3 0.5 0.2 0.4 0.4 | 396 | 0.45 0.2 612 | 0.26 0.72 221 | 0.22 0.82 95
0.3 0.5 0.1 0.3 0.4 | 396 | 0.45 0.2 423 | 0.28 0.64 201 | 0.22 0.78 92
0.4 0.5 0.3 0.4 0.4 | 1718 | 0.45 0.2 2957 | 0.28 0.74 956 | 0.23 0.88 163
0.4 0.5 0.2 0.3 0.4 | 1718 | 0.45 0.2 2227 | 0.29 0.69 903 | 0.23 0.86 162
0.4 0.6 0.3 0.5 0.4 | 456 | 0.45 0.2 779 | 0.26 0.75 256 | 0.23 0.84 84
0.4 0.6 0.2 0.4 0.4 | 456 | 0.45 0.2 622 | 0.27 0.7 246 | 0.24 0.83 83
0.5 0.6 0.4 0.5 0.4 | 1926 | 0.45 0.21 3562 | 0.28 0.75 1086 | 0.24 0.89 132
0.5 0.6 0.3 0.4 0.4 | 1926 | 0.45 0.2 2998 | 0.28 0.73 1062 | 0.24 0.89 132
0.5 0.7 0.4 0.6 0.4 | 500 | 0.45 0.2 906 | 0.28 0.73 281 | 0.24 0.86 71
0.5 0.7 0.3 0.5 0.4 | 500 | 0.45 0.2 794 | 0.27 0.73 278 | 0.24 0.86 71
0.6 0.7 0.5 0.6 0.4 | 2074 | 0.45 0.21 3978 | 0.28 0.75 1180 | 0.24 0.9 105
0.6 0.7 0.4 0.5 0.4 | 2074 | 0.45 0.2 3635 | 0.28 0.75 1178 | 0.25 0.9 105
0.6 0.8 0.5 0.7 0.4 | 530 | 0.45 0.2 981 | 0.28 0.72 297 | 0.24 0.87 58
0.6 0.8 0.4 0.6 0.4 | 530 | 0.45 0.2 927 | 0.27 0.75 300 | 0.25 0.87 58


Appendix 3: Simulation results for the SPC6+6 design

Simulation parameters | RCT8 | SPC6+6 - trt. func. 1 | SPC6+6 - trt. func. 2 | SPC6+6 - trt. func. 3
q1 p1 q2 p2 d | n | a w n | a w n | a w n

No dropout
0.3 0.4 0.2 0.3 0 | 712 | 0.31 0.61 777 | 0.29 0.64 475 | 0.28 0.66 353
0.3 0.4 0.1 0.2 0 | 712 | 0.34 0.45 601 | 0.31 0.53 409 | 0.3 0.57 318
0.3 0.5 0.2 0.4 0 | 186 | 0.29 0.64 208 | 0.28 0.66 125 | 0.28 0.68 106
0.3 0.5 0.1 0.3 0 | 186 | 0.32 0.51 174 | 0.3 0.58 113 | 0.29 0.6 98
0.4 0.5 0.3 0.4 0 | 776 | 0.29 0.69 952 | 0.28 0.72 562 | 0.28 0.71 374
0.4 0.5 0.2 0.3 0 | 776 | 0.31 0.61 851 | 0.29 0.67 531 | 0.28 0.69 364
0.4 0.6 0.3 0.5 0 | 194 | 0.28 0.7 243 | 0.27 0.72 141 | 0.28 0.71 112
0.4 0.6 0.2 0.4 0 | 194 | 0.3 0.65 223 | 0.29 0.69 135 | 0.29 0.7 109
0.5 0.6 0.4 0.5 0 | 776 | 0.28 0.75 1053 | 0.28 0.76 608 | 0.28 0.76 362
0.5 0.6 0.3 0.4 0 | 776 | 0.29 0.72 999 | 0.28 0.75 596 | 0.28 0.77 362
0.5 0.7 0.4 0.6 0 | 186 | 0.28 0.75 259 | 0.29 0.76 145 | 0.28 0.76 108
0.5 0.7 0.3 0.5 0 | 186 | 0.29 0.72 249 | 0.28 0.75 144 | 0.28 0.76 108
0.6 0.7 0.5 0.6 0 | 712 | 0.28 0.8 1072 | 0.28 0.81 607 | 0.28 0.81 323
0.6 0.7 0.4 0.5 0 | 712 | 0.28 0.79 1051 | 0.28 0.8 607 | 0.27 0.83 328
0.6 0.8 0.5 0.7 0 | 164 | 0.28 0.8 254 | 0.28 0.81 138 | 0.29 0.81 94
0.6 0.8 0.4 0.6 0 | 164 | 0.28 0.79 251 | 0.28 0.8 139 | 0.28 0.82 96

Linear dropout
0.3 0.4 0.2 0.3 0.4 | 1446 | 0.28 0.71 1371 | 0.28 0.72 823 | 0.27 0.74 607
0.3 0.4 0.1 0.2 0.4 | 1446 | 0.31 0.57 1129 | 0.29 0.62 737 | 0.28 0.66 563
0.3 0.5 0.2 0.4 0.4 | 396 | 0.27 0.73 364 | 0.26 0.74 216 | 0.26 0.74 183
0.3 0.5 0.1 0.3 0.4 | 396 | 0.3 0.62 318 | 0.28 0.68 200 | 0.28 0.68 172
0.4 0.5 0.3 0.4 0.4 | 1718 | 0.28 0.76 1638 | 0.28 0.77 957 | 0.27 0.79 632
0.4 0.5 0.2 0.3 0.4 | 1718 | 0.29 0.7 1507 | 0.28 0.74 918 | 0.28 0.75 620
0.4 0.6 0.3 0.5 0.4 | 456 | 0.27 0.77 416 | 0.27 0.77 239 | 0.26 0.79 190
0.4 0.6 0.2 0.4 0.4 | 456 | 0.28 0.73 391 | 0.27 0.76 232 | 0.27 0.77 186
0.5 0.6 0.4 0.5 0.4 | 1926 | 0.28 0.8 1778 | 0.27 0.82 1019 | 0.27 0.82 604
0.5 0.6 0.3 0.4 0.4 | 1926 | 0.28 0.78 1711 | 0.27 0.8 1005 | 0.27 0.82 604
0.5 0.7 0.4 0.6 0.4 | 500 | 0.27 0.81 436 | 0.28 0.81 243 | 0.27 0.83 180
0.5 0.7 0.3 0.5 0.4 | 500 | 0.28 0.78 424 | 0.27 0.8 242 | 0.28 0.81 180
0.6 0.7 0.5 0.6 0.4 | 2074 | 0.27 0.85 1782 | 0.27 0.86 1004 | 0.27 0.86 532
0.6 0.7 0.4 0.5 0.4 | 2074 | 0.27 0.84 1755 | 0.27 0.85 1004 | 0.27 0.87 537
0.6 0.8 0.5 0.7 0.4 | 530 | 0.28 0.84 421 | 0.27 0.86 228 | 0.29 0.85 155
0.6 0.8 0.4 0.6 0.4 | 530 | 0.28 0.84 417 | 0.29 0.86 228 | 0.28 0.86 157

Predominantly early dropout
0.3 0.4 0.2 0.3 0.4 | 1446 | 0.34 0.58 1660 | 0.32 0.62 1037 | 0.31 0.64 779
0.3 0.4 0.1 0.2 0.4 | 1446 | 0.4 0.35 1174 | 0.36 0.47 844 | 0.33 0.54 676
0.3 0.5 0.2 0.4 0.4 | 396 | 0.33 0.6 449 | 0.31 0.64 275 | 0.3 0.67 235
0.3 0.5 0.1 0.3 0.4 | 396 | 0.37 0.43 352 | 0.33 0.54 240 | 0.33 0.56 210
0.4 0.5 0.3 0.4 0.4 | 1718 | 0.32 0.67 2106 | 0.31 0.69 1263 | 0.3 0.71 848
0.4 0.5 0.2 0.3 0.4 | 1718 | 0.34 0.58 1806 | 0.32 0.63 1168 | 0.31 0.67 817
0.4 0.6 0.3 0.5 0.4 | 456 | 0.31 0.69 540 | 0.3 0.71 317 | 0.29 0.73 254
0.4 0.6 0.2 0.4 0.4 | 456 | 0.32 0.63 482 | 0.3 0.69 301 | 0.31 0.68 245
0.5 0.6 0.4 0.5 0.4 | 1926 | 0.3 0.74 2398 | 0.3 0.76 1399 | 0.29 0.77 840
0.5 0.6 0.3 0.4 0.4 | 1926 | 0.32 0.69 2229 | 0.3 0.73 1362 | 0.3 0.76 841
0.5 0.7 0.4 0.6 0.4 | 500 | 0.3 0.75 591 | 0.3 0.75 335 | 0.29 0.77 250
0.5 0.7 0.3 0.5 0.4 | 500 | 0.31 0.72 560 | 0.3 0.76 330 | 0.3 0.76 250
0.6 0.7 0.5 0.6 0.4 | 2074 | 0.29 0.8 2506 | 0.29 0.81 1430 | 0.29 0.82 765
0.6 0.7 0.4 0.5 0.4 | 2074 | 0.3 0.78 2433 | 0.29 0.8 1429 | 0.29 0.82 781
0.6 0.8 0.5 0.7 0.4 | 530 | 0.29 0.81 594 | 0.3 0.81 324 | 0.31 0.81 222
0.6 0.8 0.4 0.6 0.4 | 530 | 0.3 0.79 583 | 0.3 0.81 327 | 0.3 0.82 227

Chapter 3.2

Optimizing trial design in pharmacogenetics research; comparing a fixed parallel group, group sequential and adaptive selection design on sample size requirements

Boessen R, van der Baan FH, Groenwold RHH, Egberts ACG, Klungel OH,Grobbee DE, Knol MJ, Roes KCB

Pharmaceutical Statistics (submitted)


Abstract

Two-phase clinical trial designs may be efficient in pharmacogenetics research when there is some, but inconclusive, evidence of effect modification by a genomic marker. Two-phase designs allow early stopping for efficacy or futility, and can offer the additional opportunity to enrich the study population to a specific patient subgroup after the interim analysis. This study compared sample size requirements for a fixed parallel group, a group sequential and an adaptive selection design with equal overall power and control of the family-wise type-I error rate. The designs were evaluated across scenarios that defined the effect sizes in the marker positive and marker negative subgroups, and the prevalence of marker positive patients in the overall study population. Effect sizes were chosen to reflect realistic planning scenarios, where at least some effect is present in the marker negative subgroup. In addition, scenarios were considered in which the assumed 'true' subgroup effects (i.e. the postulated effects) differed from those hypothesized at the planning stage. As expected, both two-phase designs generally required fewer patients than a fixed parallel group design, and the advantage increased as the difference between subgroups increased. The adaptive selection design added limited further reduction in sample size, as compared to the group sequential design, when the postulated effect sizes were equal to those hypothesized at the planning stage. However, when the postulated effects deviated strongly in favor of enrichment, the comparative advantage of the adaptive selection design increased, which precisely reflects the adaptive nature of the design.

Two-stage designs in pharmacogenetic research


Introduction

The costs of bringing a new drug to the market are estimated at over 1.5 billion US dollars [1,2]. Late-stage failures and the rising costs of phase II and III clinical trials are key contributors to this sum. In order to make trials as efficient as possible, in terms of time, money and sample size requirements, adaptive trial designs are increasingly explored. Adaptive trial designs make it possible to modify design aspects of an ongoing trial based on accumulating data, without undermining the validity and integrity of the study [3-5]. Possible adaptations include early stopping for futility or efficacy, and restricting patient enrolment after the interim analysis to the most promising patient subgroup [3,5,6]. Adaptive trial designs offer the potential for major improvement; they can increase the likelihood of a successful trial and lower the number of patients exposed to an inferior or harmful treatment [5].

Pharmacogenetic research investigates the role of genetic variability as a causal explanation of inter-individual differences in response to treatment. In a trial with a fixed parallel group design, the study population is defined beforehand, and may be unselected or restricted (e.g. to only marker positive patients). If there is a priori evidence that a genomic marker is a treatment effect modifier, but the evidence is inconclusive, the choice whether to include an unselected or restricted study population can be complicated. In this situation, an adaptive selection design would allow researchers to start with an unselected patient population and decide after the interim analysis whether to continue recruitment from the entire population or enrich to the marker positive patient subgroup [7,8].

An adaptive selection trial is most appropriate when there is evidence that the marker positive patient subgroup benefits more from treatment than its complement (i.e. the marker negative subgroup), but the evidence is not strong enough upfront to justify a clinical trial exclusively focused on the marker positive subgroup. Realistic and ethically acceptable assumptions for trial design and sample size planning in this situation would include a clearly positive treatment effect in the marker positive population and a much smaller (but still clinically relevant) effect in the marker negative population. Establishing the design and sample size on the expectation that the effect in the marker negative subgroup is absent or even harmful would call into question the ethics of the design. In that case, a clinical trial restricted to only the marker positive population would be the preferred alternative.

Previous studies have shown that adaptive design trials often have superior statistical power as compared to fixed design trials with the same total sample size [9-11]. This study takes a slightly different perspective and focuses instead on the number of patients needed to attain a certain power. More specifically, it compares a fixed parallel group, a group sequential and an adaptive selection design on sample size requirements across scenarios that defined differential effect sizes for the marker positive and marker negative subgroups, as well as the ratio of both subgroups within the total study population. The three designs addressed the same family of hypotheses; i.e. whether there is an effect in the overall population, the marker positive subgroup, or both. In addition, the designs were evaluated across scenarios where the assumed 'true' subgroup effects (i.e. the postulated effects) differed from those initially hypothesized at the planning stage and used for sample size estimation. These scenarios provide insight into the performance characteristics of the different designs when a priori assumptions about the subgroup effects are wrong. It is especially in this situation that the ability to modify design aspects while the trial is ongoing is attractive.

Methods

Comparison of study designs
The sample size requirements for the three designs were estimated under equal overall power and control of the family-wise Type-I error rate. The first design was a conventional parallel group design in which patients from an unselected patient population are randomized to treatment or control, and the difference between the treatment groups is evaluated after follow-up. The second was a group sequential design that consists of two consecutive study phases divided by an interim analysis at time t, and that allows for early stopping when efficacy or futility is established after the first study phase. The third was an adaptive selection design that also allows for early stopping after the first phase, but in addition offers the opportunity to enrich the study population to a pre-specified patient subgroup in the second phase. Adaptive selection is applied when the interim effect in the overall population is mainly accounted for by patients from the subgroup.

Study population
The unselected study population was denoted by G0, G1 was the prespecified patient subgroup (e.g. marker positive), and G2 was its complement (e.g. marker negative). f denoted the fraction of G1 subjects in G0. Δi referred to the average treatment effect size in patient cohort i (i = 0, 1, 2). The test statistic of each hypothesis was assumed to be (approximately) normally distributed and expressed as a standardized z-score Zi. Adopting the z-scale benefited the generalizability of our results.

We considered a 1:1 randomization to treatment and control in both G1 and G2. Also, we assumed Δ1 > Δ2, as it is in this situation that adaptive selection will be most sensible and attractive. The true effect in the unselected patient population was calculated by:

Δ0 = f · Δ1 + (1 − f) · Δ2    (1)
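As a quick numeric check of the weighted-average relation Δ0 = f · Δ1 + (1 − f) · Δ2 (a Python sketch; the thesis itself used R), the values of simulation scenario IA give an overall effect of 0.4:

```python
# Overall effect as a prevalence-weighted average of the subgroup effects.
# Values taken from simulation scenario IA of this chapter.
f = 0.5                    # prevalence of the marker positive subgroup G1 in G0
delta1, delta2 = 0.5, 0.3  # standardized effects in G1 and G2

delta0 = f * delta1 + (1 - f) * delta2
print(delta0)  # ≈ 0.4
```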

Tested were the null hypotheses H00: Δ0 = 0 versus H10: Δ0 > 0 and H01: Δ1 = 0 versus H11: Δ1 > 0. Both null hypotheses were considered equally important, meaning that there was no preference regarding the population in which treatment efficacy was established (i.e. G0 and/or G1). A list of abbreviations is presented at the end of the chapter.

Two-stage designs in pharmacogenetic research


Parallel group design
For the parallel group design (i.e. a single-phase trial with patients from G0), we estimated the sample size NPG required to achieve 80% statistical power, using an iterative optimization method. Power was defined as the probability of rejecting H00, H01 or both. Hypotheses were tested one-sided with the Hochberg multiple testing procedure to control for inflation of the Type-I error rate [12]. The procedure entailed that both H00 and H01 were rejected when both corresponding p-values were smaller than the nominal significance level. If this was not the case, but one of the p-values was less than half the nominal significance level, then only the corresponding null hypothesis was rejected.
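The two-hypothesis Hochberg step described above can be sketched as follows (a Python illustration; the function name is ours, not from the thesis, and the thesis simulations were written in R):

```python
def hochberg_two(p00, p01, alpha=0.05):
    """Hochberg procedure for the two null hypotheses H00 and H01.

    Returns (reject_H00, reject_H01). Both hypotheses are rejected when
    both p-values fall below the nominal level alpha; otherwise only a
    hypothesis whose p-value falls below alpha / 2 is rejected.
    """
    if p00 <= alpha and p01 <= alpha:
        return True, True
    return p00 <= alpha / 2, p01 <= alpha / 2

# Both p-values below alpha: both hypotheses rejected.
print(hochberg_two(0.04, 0.03))  # (True, True)
# Only one small p-value: it must pass the stricter alpha / 2 bound.
print(hochberg_two(0.04, 0.20))  # (False, False), since 0.04 > 0.025
print(hochberg_two(0.01, 0.20))  # (True, False)
```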

Group sequential design
The two phases of the group sequential design were separated by an interim analysis at time t, where t represented the proportion of the total sample size for which outcomes were available at the interim analysis. Denoted by zij was the standardized test statistic for patient cohort i (i = 0, 1, 2) in phase j (j = 1, 2). The overall test statistic for patient cohort i was a weighted average of zi1 and zi2, calculated as:

zi = √t · zi1 + √(1 − t) · zi2    (2)

The decision to stop the trial after the interim analysis was based on z11 and z01. If both values were below a predefined lower threshold, the trial was stopped for futility. We concluded futility when z11 < 0 and z01 < 0 (though different thresholds could also be considered, see for example [11]). Alternatively, if both values exceeded a predefined upper threshold, the trial was stopped for efficacy. To assure control of the family-wise Type-I error rate, a combination of the Hochberg multiple testing procedure [12] and the O'Brien-Fleming alpha spending function [13] was used to arrive at the interim efficacy thresholds. With a nominal significance level set to α and a single interim analysis at t, the O'Brien-Fleming significance level α1 at interim was:

α1 = 2 − 2Φ(z1−α/2 / √t),    (3)

where Φ is the standard normal distribution function and z1−α/2 its (1 − α/2) quantile.

The Hochberg procedure asserted that both z11 and z01 were significantly different from null when they both exceeded the critical z-value that corresponds to α1. Otherwise, the largest of both z-statistics was only significantly different from null if it exceeded the critical z-value that corresponds to α1/2.
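As a numerical check, equation 3 combined with the Hochberg split reproduces the interim critical values 2.538 and 2.772 that appear in the decision schemes of Box 1 (a Python sketch using only the standard library; the thesis itself used R):

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()  # standard normal distribution
alpha, t = 0.05, 0.5

# O'Brien-Fleming significance level spent at the interim look (equation 3).
alpha1 = 2 - 2 * nd.cdf(nd.inv_cdf(1 - alpha / 2) / sqrt(t))

# Hochberg split of alpha1 at interim: both hypotheses rejected when both
# z-statistics exceed z(alpha1); a single hypothesis rejected when its
# z-statistic exceeds the stricter z(alpha1 / 2).
z_both = nd.inv_cdf(1 - alpha1)        # ≈ 2.538
z_single = nd.inv_cdf(1 - alpha1 / 2)  # ≈ 2.772

print(round(z_both, 3), round(z_single, 3))  # 2.538 2.772
```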

We iteratively estimated the planned sample size for the group sequential design NGS as the average sample size needed to arrive at an 80% probability to significantly establish a treatment effect in either G0, G1 or both. We denoted the probability to stop early by p1.


Taking early stopping into account, the average realized sample size ŇGS was estimated from NGS using:

ŇGS = p1 · t · NGS + (1 − p1) · NGS    (4)
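Equation 4 simply mixes the interim sample size t · NGS (reached when the trial stops early, with probability p1) with the full NGS. As an illustration in Python (the thesis used R), plugging in the scenario IA figures reported later (NGS ≈ 166.7, t = 0.5, and an early stopping probability of about 25% per the Results) lands close to the ŇGS ≈ 145.7 reported in Table 2; the small remaining gap reflects that p1 = 0.25 is only approximate:

```python
# Expected (average realized) sample size of the group sequential design
# (equation 4): with probability p1 the trial stops at the interim and
# uses only t * N_GS patients; otherwise the full N_GS is needed.
N_GS = 166.7  # planned sample size (scenario IA, Table 2)
t = 0.5       # interim analysis at half the planned sample size
p1 = 0.25     # approximate early stopping probability (see Results)

N_avg = p1 * t * N_GS + (1 - p1) * N_GS
print(round(N_avg, 1))  # ≈ 145.9, close to the 145.7 in Table 2
```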

Adaptive selection design
The adaptive selection design considered the same early stopping rules as the group sequential design, in order to enable a one-to-one comparison between the two designs and to evaluate the influence on sample size associated with the option to enrich the study population after the interim analysis. Adaptive selection is most appropriate when the interim effect in G1 is relatively large compared to that in G2. We applied adaptive selection when z11 > z21 + x, where x = 0.5 · (Δ1 − Δ2). This value of x optimally separates the distributions of z11 and z21 under the hypothesized alternative, but different decision rules could also be considered [11,14]. It should be noted that x is a design parameter, and its value should be specified before the trial is initiated. From NAS we calculated the average realized sample size ŇAS for the adaptive selection design in the same manner as was used to calculate ŇGS.

Figure 1. Interim decision as a function of z11 and z21. Scatterplots of interim values z11 and z21, with the four possible interim decisions in different shades of grey: 1) stop for futility, 2) stop for efficacy, 3) continue with G0 and 4) continue with (enrich to) G1. The left figure represents the interim values (of 1,000 replications) of z11 and z21 under H0 (Δ1 = Δ2 = 0.0) and with f = 0.5 and t = 0.5. The right figure represents the interim values of z11 and z21 when Δ1 = 0.5, Δ2 = 0.3, f = 0.5 and t = 0.5.


Figure 1 shows how the interim decisions determined the course of the adaptive selection design as a function of z11 and z21, given that tests were one-sided with α = 0.05. Changes in Δ1 and Δ2 shifted the cloud of likely values for z11 and z21, consequently changing the probabilities of arriving at one of the four possible interim decisions (i.e. early stopping for efficacy or futility, or continuation with G0 or G1). Box 1 shows the decision schemes for both two-phase designs when critical values were based on α = 0.05.

Group sequential design
1: Stop the trial for futility if: z01 < 0.00 AND z11 < 0.00
2: Stop the trial for efficacy if: (z01 > 2.538 AND z11 > 2.538) OR (z01 > 2.772 AND z11 < 2.538) OR (z01 < 2.538 AND z11 > 2.772)
3: Else, continue the trial with G0.

Adaptive selection design
1: Stop the trial for futility if: z01 < 0.00 AND z11 < 0.00
2: Stop the trial for efficacy if: (z01 > 2.538 AND z11 > 2.538) OR (z01 > 2.772 AND z11 < 2.538) OR (z01 < 2.538 AND z11 > 2.772)
3: Continue and enrich to G1 if: z11 > z21 + x
4: Else, continue the trial with G0.

Box 1. The decision schemes for both two-stage designs when the nominal α = 0.05.
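The interim logic of the adaptive selection design in Box 1 can be written out as a single function (a Python sketch; the thesis used R, and the efficacy conditions are collapsed into the equivalent form "both exceed 2.538, or either exceeds 2.772"):

```python
def interim_decision(z01, z11, z21, x, z_both=2.538, z_single=2.772):
    """Interim decision of the adaptive selection design (Box 1).

    z01, z11, z21 are the interim z-statistics for G0, G1 and G2;
    x is the pre-specified enrichment margin, x = 0.5 * (delta1 - delta2).
    """
    if z01 < 0.0 and z11 < 0.0:
        return "stop for futility"
    # Equivalent to the three OR-ed efficacy clauses in Box 1.
    if (z01 > z_both and z11 > z_both) or z01 > z_single or z11 > z_single:
        return "stop for efficacy"
    if z11 > z21 + x:
        return "continue with G1 (enrich)"
    return "continue with G0"

x = 0.5 * (0.5 - 0.3)  # scenario IA: delta1 = 0.5, delta2 = 0.3
print(interim_decision(1.9, 2.1, 1.2, x))    # continue with G1 (enrich)
print(interim_decision(2.6, 2.6, 2.0, x))    # stop for efficacy
print(interim_decision(-0.2, -0.5, 0.3, x))  # stop for futility
```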

When the interim decision was to continue with G0, both H00 and H01 were tested. As both hypotheses were considered equally important, continuation with G0 also yielded a positive trial result if efficacy was eventually established only in G1. If, however, the interim decision was to enrich to G1, then only H01 was tested. In this case, the overall test statistic was an average of z11 and z12, weighted according to the planned number of patients from G1 in the first and second study phase. More specifically, the overall test statistic for G1 was calculated using equation 2 with t replaced by t', given by:

t' = (t · f) / (t · f + (1 − t))    (5)
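Equation 5 is simply the fraction of all G1 patients that was recruited in the first phase. A numeric check in Python (the thesis used R): with an interim at t = 0.5 and prevalence f = 0.5, t' = 0.25 / 0.75 = 1/3, so after enrichment the first-phase G1 patients carry one third of the weight:

```python
# Adjusted interim weight t' for the enriched analysis (equation 5):
# first phase contributes t * N * f patients from G1, the enriched
# second phase contributes (1 - t) * N patients, all from G1.
t, f = 0.5, 0.5
t_prime = (t * f) / (t * f + (1 - t))
print(t_prime)  # ≈ 0.333...
```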

Simulations
We evaluated the different designs across various scenarios. This included a scenario with Δ1 = Δ2 = 0 (i.e. the null hypothesis) to check control of the family-wise Type-I error rate. In addition, we considered scenarios that differed in: 1) the hypothesized treatment effect sizes for G1 and G2 (i.e. Δ1 and Δ2), 2) the prevalence f of G1 in G0, and 3) the significance level for statistical tests. An overview of the evaluated scenarios is presented in table 1. Interim analyses took place at t = 0.5. Scenarios were defined that justified consideration of an adaptive selection design. To evaluate the influence of a more conservative decision scheme, we also simulated a selection of the scenarios with x' = (Δ1 - Δ2) + 2.

In scenarios I–III, we specified treatment effect sizes for G1 and G2. In scenario IV, we evaluated the performance of the designs when the postulated treatment effects differed from those initially hypothesized and designed for. More specifically, we estimated the sample size requirements based on incorrectly hypothesized effects and then evaluated the performance characteristics of the designs under the postulated effects. Such situations, where the initial assumptions are incorrect, are most likely to benefit from the flexibility that is offered by adaptive designs. All scenarios were replicated 100,000 times, and scenario I served as the reference scenario. Simulations and analyses were performed using R (version 2.14.0, http://www.r-project.org). The simulation script is available upon request.

Table 1: Simulation scenarios

I f = 0.5, α = 0.05 A ∆1 = 0.5, ∆2 = 0.3

B ∆1 = 0.6, ∆2 = 0.2

C ∆1 = 0.7, ∆2 = 0.1

II f = 0.25, α = 0.05 A ∆1 = 0.5, ∆2 = 0.3

B ∆1 = 0.6, ∆2 = 0.2

C ∆1 = 0.7, ∆2 = 0.1

III f = 0.5, α = 0.025 A ∆1 = 0.5, ∆2 = 0.3

B ∆1 = 0.6, ∆2 = 0.2

C ∆1 = 0.7, ∆2 = 0.1

IV f = 0.5, α = 0.05 A ∆1 = 0.4, ∆2 = 0.4 postulated

B (REF) ∆1 = 0.5, ∆2 = 0.3 hypothesized

C ∆1 = 0.6, ∆2 = 0.2 postulated

D ∆1 = 0.7, ∆2 = 0.1 postulated

E ∆1 = 0.8, ∆2 = 0.0 postulated

F ∆1 = 1.0, ∆2 = -0.2 postulated

f is the prevalence of G1 in G0, α is the (one-sided) Type-I error rate, Δ1 is the standardized mean treatment effect in G1 and Δ2 is the standardized mean treatment effect in G2. The interim analysis is performed at the end of phase 1, on the outcomes of 50% of the planned sample size. In scenario IV, the postulated treatment effects of the patient subgroups are different from those hypothesized at the planning stage.
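A minimal Monte Carlo sketch of the interim stage illustrates how the decision probabilities reported below arise. This is a Python re-implementation under simplifying assumptions, not the thesis's R script: interim z-statistics are taken as normal with mean Δi · √(ni/4) for a 1:1 two-sample comparison, and the pooled statistic as z01 = √f · z11 + √(1 − f) · z21. Under the null, the futility probability is then the orthant probability P(z01 < 0, z11 < 0) with correlation √f, which for f = 0.5 equals 1/4 + arcsin(√0.5)/(2π) = 0.375:

```python
import random
from math import sqrt

random.seed(7)

def interim_fractions(d1, d2, f, t, N, reps=50_000,
                      z_both=2.538, z_single=2.772):
    """Fractions of simulated trials hitting each interim decision of Box 1."""
    x = 0.5 * (d1 - d2)                  # pre-specified enrichment margin
    n1, n2 = t * N * f, t * N * (1 - f)  # interim cohort sizes in G1 and G2
    counts = {"futility": 0, "efficacy": 0, "enrich": 0, "continue": 0}
    for _ in range(reps):
        # z-statistic of a 1:1 two-sample comparison: mean d * sqrt(n / 4).
        z11 = d1 * sqrt(n1 / 4) + random.gauss(0, 1)
        z21 = d2 * sqrt(n2 / 4) + random.gauss(0, 1)
        z01 = sqrt(f) * z11 + sqrt(1 - f) * z21  # pooled statistic
        if z01 < 0 and z11 < 0:
            counts["futility"] += 1
        elif (z01 > z_both and z11 > z_both) or z01 > z_single or z11 > z_single:
            counts["efficacy"] += 1
        elif z11 > z21 + x:
            counts["enrich"] += 1
        else:
            counts["continue"] += 1
    return {k: v / reps for k, v in counts.items()}

# Null scenario (delta1 = delta2 = 0), f = t = 0.5, N as in scenario IA.
null = interim_fractions(0.0, 0.0, f=0.5, t=0.5, N=176)
print(null["futility"])  # close to the theoretical 0.375
```

The same function can be called with scenario parameters (e.g. d1=0.5, d2=0.3) to explore how the decision probabilities shift with the subgroup effects.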

Simulation outcomes of interest included: 1) the difference in sample size requirements for the three designs, where the comparative performance of the two-phase designs was expressed as the proportional difference in sample size relative to the parallel group design, 2) the percentage of two-phase trials that were terminated at interim analysis (either for efficacy or futility) or continued with G0 or G1, and 3) the percentage of trial replicates that reached significance for G0, G1 or both.

Results

Control of Type-I error rate in two-phase designs
Both two-phase designs were evaluated under the null hypothesis to verify control of the Type-I error rate. For this purpose, the designs were simulated without the option to stop for futility, as early stopping for futility would lower the Type-I error probability. With one-sided testing at α = 0.05, 5.0% of the replications of both the group sequential design and the adaptive selection design rejected H00, H01 or both. Hence, the simulations confirmed control of the family-wise Type-I error rate across the two hypotheses for both designs.

Reference scenario
Table 2 presents the sample size requirements for the three designs across the various simulated scenarios. Scenario I served as the reference scenario to which other scenarios will be compared in order to examine the influence of the different simulation parameters.

Because sample size estimation was based on significance tests for both G0 and G1, NPG differed over scenarios IA–C, even though these scenarios were equal in Δ0. As Δ1 increased, NPG decreased. This was also true for both ŇGS and ŇAS. In general, the group sequential design required fewer patients than the parallel group design. This was due to trials that stopped early and consequently required only half of the planned sample size NGS. With larger Δ1, the probability of a significant interim result for G1, and hence the likelihood of stopping early for efficacy, increased. On the other hand, a larger Δ1 also lowered the planned sample size, which eventually reduced the probability of stopping early for futility. These conflicting effects resulted in an optimal sample size benefit of the group sequential design as compared to the parallel group design under scenario IB.

The adaptive selection design required slightly more patients than the group sequential design when the difference in subgroup effects was small (i.e. scenario IA). However, when the difference in subgroup effects increased (i.e. scenario IC), the advantage of the adaptive selection design increased as well.


Figure 2. Proportion of simulated trial replicates resulting in the various possible interim decisions. Column charts of the group sequential design (A) and the adaptive selection design (B), presenting the proportion of simulated trial replicates that stop for futility, stop for efficacy, continue with G0 and continue with G1. The corresponding simulation scenarios are listed in table 1.

The stacked columns in figure 2 represent the percentage of group sequential (2A) and adaptive selection trial replicates (2B) that stopped early for futility or efficacy, or continued with either G0 or G1. For scenarios IA–C, the probability to stop early (p1) was fairly stable at about 25% for both designs. Of the trial replications that stopped early, fewer than 2.5% did so for futility. These results indicate that as the difference in subgroup effects increased, the lower total sample size, and the ensuing reduced chance to find a significant interim result for G0, was counterbalanced by the increased effect and higher likelihood of establishing significance at interim for G1.

Table 2: Estimated sample size requirements to establish treatment efficacy with 80% statistical power

              Parallel group        Group sequential                     Adaptive selection
Scenario           NPG          NGS     ŇGS   (NPG−ŇGS)/NPG (%)     NAS     ŇAS   (NPG−ŇAS)/NPG (%)
I   A             161.9        166.7   145.7        10.0           176.1   152.8         5.7
    B             136.3        140.3   121.7        10.7           137.6   120.0        11.9
    C             112.2        114.8    99.8        11.0           106.8    93.7        16.5
II  A             218.8        225.5   197.2         9.8           263.5   225.6        -3.1
    B             232.1        241.3   210.4         9.4           262.6   227.7         1.9
    C             217.9        227.3   198.3         9.0           227.6   199.1         8.6
III A             202.3        203.5   184.6         8.7           220.4   196.9         2.6
    B             173.8        174.5   157.5         9.4           173.3   156.6         9.9
    C             140.2        141.9   127.7         8.9           135.6   123.0        12.2
IV  A             161.9        166.7   147.5         8.9           176.1   153.9         5.0
    B (REF)       161.9        166.7   144.9        10.5           176.1   151.9         6.2
    C             161.9        166.7   141.5        12.6           176.1   147.8         8.7
    D             161.9        166.7   135.4        16.4           176.1   141.2        12.8
    E             161.9        166.7   128.5        20.6           176.1   132.4        18.2
    F             161.9        166.7   111.0        31.5           176.1   114.0        29.6

NPG is the sample size for a fixed parallel group design. NGS is the planned sample size for a group sequential design. ŇGS is the average realized sample size for a group sequential design. NAS is the planned sample size for an adaptive selection design. ŇAS is the average realized sample size for an adaptive selection design. All sample size estimates correspond to 80% statistical power.

As expected, the probability to enrich in the adaptive selection design increased with a larger difference in subgroup effects. For scenarios IA–C, the proportion of trial replicates that continued after interim analysis with G1 was 46.6%, 54.2% and 59.3%, respectively.

The pie charts on the left side of figure 3 show the probability of establishing significance in G0, G1 or both under scenario IA. For all three designs, 20% of trial replications did not reach significance in either G0 or G1, which follows from the fact that the sample sizes for the various designs were estimated to provide 80% statistical power. Despite equal power, the probability of reaching significance for both G0 and G1 within the same replication was smaller in the group sequential design than in the parallel group design. This was due to the option to stop early for efficacy in one of the two populations. The option to enrich further reduced the chance of reaching significance for both G0 and G1, and increased the chance of a significant result in only G1.

Figure 3. Proportion of simulated trial replicates resulting in the various possible trial outcomes. Pie charts of the proportion of simulated trial replicates with a significant result for G0, for G1, for both, or for neither G0 nor G1, under simulation scenarios IA (A) and IVE (B).

Alternative decision scheme
With a more conservative decision scheme, the decision to continue with G0 is made more often. This was demonstrated by evaluating the adaptive selection design for scenarios IA–C with x' = (Δ1 - Δ2) + 2, as compared to x = 0.5 · (Δ1 - Δ2) for the other scenarios. The relative sample size savings were comparable under the more conservative decision scheme, although less pronounced when the difference in subgroup effects increased (10.2%, 11.8% and 11.4% under x' versus 5.7%, 11.9% and 16.5% for scenarios IA–C, respectively). As expected, the percentage of trials that continued with G0 was higher (67%, 65% and 63% under x' versus 27%, 20% and 16% for scenarios IA–C, respectively), which led to a larger proportion of trial replications that found a significant result for both G0 and G1 (52%, 52% and 48% under x' versus 27%, 23% and 19% for scenarios IA–C, respectively).

Lower prevalence of G1 in G0
Scenario II was characterized by a lower prevalence of G1 in the study population G0. Since Δ0 was a weighted average of Δ1 and Δ2, Δ0 was no longer constant over scenarios IIA–C, nor equal to Δ0 in scenarios IA–C. With f = 0.25, the required sample sizes increased for all designs, as the subgroup with the largest effect was less prominently represented (table 2). The chance to stop after interim analysis remained stable at about 25% for both two-phase designs (figure 2). As a consequence, the advantage of the group sequential design over the parallel group design decreased. A potential for sample size gain remained as a result of the option to enrich. However, with a small prevalence of G1 in G0, a large fraction of the patient population from the first phase of the design does not contribute to the overall test statistic when the interim decision is to enrich to G1. This obviously reduced the efficiency of the adaptive selection design and made it appealing only when the difference in subgroup effects was large (i.e. scenario IIC).

Testing with alpha 2.5% one-sided
Scenario III used a lower significance level for statistical tests than scenario I (i.e. 2.5% vs. 5%, one-sided). Obviously, more conservative testing resulted in higher sample size requirements for all three designs. Moreover, the smaller significance level reduced the percentage of trials that stopped at interim (figure 2). Overall, more conservative testing did not affect the comparative advantage of the two-phase designs over the parallel group design.

Underestimating the difference in treatment effect between the subgroups
Scenario IV represents situations where there is a discrepancy between the hypothesized subgroup effects at the planning stage and the postulated effects as actually present in the population that entered the trial. The required sample sizes were estimated based on the parameters of scenario IVB, which was equal to scenario IA (i.e. the hypothesized scenario). Scenario IVA represents the situation in which a difference between the subgroups was expected at the planning stage and the trial was planned accordingly, but the interim results did not provide evidence of such a difference. Results for this scenario show that in this case the decision to enrich is made most frequently, while the gain in sample size was small. Here, the two-phase designs are of limited value in terms of efficiency (although not disadvantageous). The results of scenarios IVC–F show that both two-phase designs are more efficient than the parallel group design when the difference in treatment effects was larger than initially hypothesized (table 2). The probability to stop the trial at interim increased with an increasing difference in subgroup effects, since the more this difference was initially underestimated, the more the planned sample size was overestimated. As a result, the planned interim analysis included a relatively large number of patients, increasing the likelihood of establishing efficacy at interim analysis. This reduced the average realized sample size of the two-phase designs in comparison to the sample size required for the parallel group design. The adaptive selection design did not result in an additional sample size benefit over the group sequential design, since a potential benefit can only be achieved when a lower sample size is estimated beforehand, based on the possibility to enrich. In scenario IVF, where the actual treatment effect in G2 was negative, only 0.4% of the adaptive selection trials continued with G0. This illustrates an important advantage of the adaptive selection design: it offers the opportunity to recognize and terminate enrolment of patients who do not benefit from treatment. The overall power under scenarios IVA–F increased equally for the three designs. This is illustrated for scenario IVE in the diagrams on the right of figure 3. The realized power in this scenario was 95% across the designs. The chance of reaching significance in both G0 and G1 was smallest for the adaptive selection design.

Discussion
We compared a parallel group, a group sequential and an adaptive selection design on sample size requirements when differences in subgroup effects were expected at the design stage of the trial. Given equal overall power for the three designs and control of the Type-I error rate, we showed that the two-phase designs generally required fewer patients. The largest reduction in sample size was achieved through the possibility to stop the trial after interim analysis. This was especially true when the postulated difference between subgroup effects was larger than initially hypothesized. The option to enrich the study population only resulted in an additional sample size reduction when the difference in subgroup effects was large.

Our study shows that two-phase designs can be an efficient design alternative when there is inconclusive evidence of effect modification by a genomic marker. However, if large differences in subgroup effects are expected at the planning stage, an adaptive selection design may not be a sensible option. In that case, a parallel group design with only patients from the most promising patient subgroup would be the preferred choice from both a practical and an ethical perspective.

If the differences in subgroup effects are initially underestimated, two-phase designs can offer a substantial reduction in sample size requirements as a result of the option to stop early. An additional advantage of the adaptive selection design is the ability to select and continue with only the most responsive patient subgroup, and hence limit the inclusion of patients who are unlikely to benefit from treatment. However, this also means that the probability of reaching significance for the overall population decreases with the option to enrich.


An example of a parallel group design where a large difference in subgroup effects was observed (while not expected) is the IPASS trial, which evaluated the efficacy and safety of gefitinib for the treatment of pulmonary adenocarcinoma [15]. Patients who were positive for the epidermal growth factor receptor gene (EGFR) mutation showed a hazard ratio (HR) for progression or death of 0.48 [95% CI: 0.36–0.64], while patients without the mutation had a HR of 2.85 [95% CI: 2.05–3.98]. If these subgroup effects had been expected at the planning stage, a study with only marker positive patients would have been the most sensible and ethical design alternative. The IPASS trial resembled simulation scenario IVF of the current study, for which we concluded that an adaptive selection design was more efficient and ethically preferable.

Although we considered patient subgroups that were characterized by the presence or absence of a genomic marker, our conclusions are not limited to pharmacogenetic studies, but generalizable to other situations where differentially responding patient subgroups are expected at the trial's planning stage. As an example, consider patients with high baseline depression severity who benefit more from antidepressant treatment than those with lower baseline disease severity [16]. Mehta et al. provide an example where an intervention may particularly benefit specific patients with comorbid conditions, based on the pathophysiology of their disease and the mode of action of the investigated drug [10].

In our simulations we found relatively small differences in sample size requirements between the group sequential design and the adaptive selection design. For the evaluated scenarios, the added benefit of enrichment was limited, whereas in practice enrichment may require additional logistical and analytical considerations [17]. We expect that enrichment is more advantageous when applied in conjunction with sample size re-estimation at interim analysis.
Methods for sample size re-estimation have been proposed [18–21], but may give rise to concerns about the Type-I error rate of the primary analysis when applied in combination with other adaptive design modifications [17].

Previous studies that evaluated the performance of two-phase clinical trial designs often focused on the power to detect a statistically significant difference between treatment groups [8–11]. For example, Wang et al. compared an adaptive selection design and a fixed design approach, and showed that regardless of whether the established effect was applicable to the unselected study population or only to a nested patient subgroup, it was possible to devise an adaptive selection design with superior statistical power as compared to a fixed design with an equal number of patients [9]. Mehta et al. evaluated several design adaptations (i.e. early stopping, sample size re-estimation and population enrichment) and also showed that these adaptations increased power relative to a fixed sample size design when the sample size was equal [10]. Rosenblum and Van der Laan demonstrated that under a range of scenarios an adaptive selection design was more powerful than two fixed parallel group designs with equal sample size [11]. In the present study the issue was approached from a different angle, in that the primary focus was on the sample size for a given power and not the other way around.


A possible limitation of the current evaluation is that it was considered of equal importance whether a significant treatment effect was established in the unselected (i.e. total) study population or in the selected patient subgroup. This may not always be a valid assumption in practice, since a significant treatment effect in the unselected population may lead to market authorization for a broader population, which is usually more attractive from a commercial perspective. In addition, we considered only two-phase designs, whereas more than one interim analysis is also possible [10], but not within the scope of this paper. Another limitation of the current study (and one inherent to simulation experiments) is that only a limited number of scenarios was evaluated.

To conclude, both two-phase designs were generally more efficient than a fixed parallel group design, and the comparative advantage increased with an increase in the difference between the assumed subgroup effects. The adaptive selection design added little further reduction in sample size as compared to the group sequential design when the postulated effect sizes were equal to those hypothesized at the planning stage. However, when the postulated effect sizes differed strongly in favor of enrichment (i.e. in case of a larger difference in subgroup effects than initially assumed), the comparative efficiency of the adaptive selection design increased, which precisely reflects the adaptive nature of the design and provides the design with an ethical advantage.

Two-stage designs in pharmacogenetic research


Reference List

(1) DiMasi JA, Hansen RW, Grabowski HG. The price of innovation: new estimates of drug development costs. Journal of Health Economics 2003; 22(2):151-85.

(2) Adams CP, Brantner VV. Estimating the cost of new drug development: is it really 802 million dollars? Health Affairs (Millwood) 2006; 25(2):420-8.

(3) Hung HM, O'Neill RT, Wang SJ, Lawrence J. A regulatory view on adaptive/flexible clinical trial design. Biometrical Journal 2006; 48(4):565-73.

(4) Gallo P, Chuang-Stein C, Dragalin V, Gaydos B, Krams M, Pinheiro J. Adaptive designs in clinical drug development--an Executive Summary of the PhRMA Working Group. Journal of Biopharmaceutical Statistics 2006; 16(3):275-83.

(5) Bretz F, Koenig F, Brannath W, Glimm E, Posch M. Adaptive designs for confirmatory clinical trials. Statistics in Medicine 2009; 28(8):1181-217.

(6) Orloff J, Douglas F, Pinheiro J, Levinson S, Branson M, Chaturvedi P et al. The future of drug development: advancing clinical trial design. Nature Reviews Drug Discovery 2009; 8(12):949-57.

(7) van der Baan F, Knol M, Klungel OH, Egberts AC, Grobbee DE, Roes KCB. Potential of adaptive trial designs in pharmacogenetic research. Accepted for publication in Pharmacogenomics 2012.

(8) Jenkins M, Stone A, Jennison C. An adaptive seamless phase II/III design for oncology trials with subpopulation selection using correlated survival endpoints. Pharmaceutical Statistics 2010.

(9) Wang SJ, Hung HM, O'Neill RT. Adaptive patient enrichment designs in therapeutic trials. Biometrical Journal 2009; 51(2):358-74.

(10) Mehta C, Gao P, Bhatt DL, Harrington RA, Skerjanec S, Ware JH. Optimizing trial design: sequential, adaptive, and enrichment strategies. Circulation 2009; 119(4):597-605.

(11) Rosenblum M, van der Laan MJ. Optimizing randomized trial designs to distinguish which subpopulations benefit from treatment. Biometrika 2011; 98(4):845-60.

(12) Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 1988; 75(4):800-2.

(13) O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979; 35(3):549-56.

(14) Bauer P, Köhne K. Evaluation of experiments with adaptive interim analyses. Biometrics 1994; 50:1029-41.

(15) Mok TS, Wu YL, Thongprasert S, Yang CH, Chu DT, Saijo N et al. Gefitinib or carboplatin-paclitaxel in pulmonary adenocarcinoma. New England Journal of Medicine 2009; 361(10):947-57.

(16) Kirsch I, Deacon BJ, Huedo-Medina TB, Scoboria A, Moore TJ, Johnson BT. Initial severity and antidepressant benefits: a meta-analysis of data submitted to the Food and Drug Administration. PLoS Medicine 2008; 5(2):e45.

(17) Adaptive Design Clinical Trials for Drugs and Biologics, Draft Guidance for Industry, February 2010. http://www.fda.gov/downloads/Drugs/.../Guidances/ucm201790.pdf

(18) Denne JS. Sample size recalculation using conditional power. Statistics in Medicine 2001; 20(17-18):2645-60.

(19) Gao P, Ware JH, Mehta C. Sample size re-estimation for adaptive sequential design in clinical trials. Journal of Biopharmaceutical Statistics 2008; 18(6):1184-96.

(20) Cui L, Hung HMJ, Wang SJ. Modification of sample size in group sequential trials. Biometrics 1999; 55(3):853-7.

(21) Lehmacher W, Wassmer G. Adaptive sample size calculations in group sequential trials. Biometrics 1999; 55(4):1286-90.

List of abbreviations

Parameter Definition

G0 Total study population

G1 Patient subgroup with genomic marker

G2 Patient subgroup without genomic marker

Δ0 Standardized mean treatment effect in G0

Δ1 Standardized mean treatment effect in G1

Δ2 Standardized mean treatment effect in G2

f Prevalence of G1 in G0

NPG Sample size for the fixed parallel group design

NGS Planned sample size for the sequential group design

ŇGS Average realized sample size for the sequential group design

NAS Planned sample size for the adaptive selection design

ŇAS Average realized sample size for the adaptive selection design

z01 z-statistic for G0 at interim analysis

z11 z-statistic for G1 at interim analysis

z21 z-statistic for G2 at interim analysis

z02 z-statistic for G0 in the second phase of the trial

z12 z-statistic for G1 in the second phase of the trial

z22 z-statistic for G2 in the second phase of the trial

Z0 Overall z-statistic for G0

Z1 Overall z-statistic for G1

Z2 Overall z-statistic for G2

H00: Δ0 = 0 Null hypothesis for G0

H01: Δ1 = 0 Null hypothesis for G1

p1 Probability to stop the trial after interim analysis

Chapter 3.3
Improving clinical trial efficiency by biomarker-guided patient selection

Boessen R, Lambers Heerspink HJ, De Zeeuw D, Grobbee DE, Groenwold RHH, Roes KCB

Clinical Pharmacology and Therapeutics (submitted)


Abstract
Predictive (bio)markers or short-term changes following treatment can be used to select patients most likely to benefit from treatment in the longer term. Using predictive markers to guide inclusion in randomized clinical trials could yield more targeted trials with improved efficiency. This study compared three trial designs on sample size requirements across realistic scenarios. The reference design is a parallel group design with no selection. The alternative designs applied selection on a baseline characteristic or on short-term improvement after active run-in. When short-term improvement on the marker reliably predicted treatment response, both active run-in and baseline selection designs could reduce sample size requirements as compared to the parallel group design (by up to 65% and 25%, respectively). For other scenarios, the efficiency gain of both designs was smaller or nonexistent. The active run-in design reduced sample size requirements in most scenarios, but generalizability issues limit the applicability of these designs in practice.

Improving trial efficiency by patient selection


Introduction
Clinical trials are increasingly extensive and complex [1,2]. They account for the bulk of investments in drug development, both in terms of time and money [3-5]. To assure the efficient and timely arrival of new and affordable drugs, it is therefore essential to explore and implement innovative approaches to the design of clinical trials [6,7]. In many therapeutic areas, prognostic research has identified patient characteristics that are predictive of future clinical outcomes or of favorable or unfavorable treatment response [8,9]. These characteristics can be categorized as prognostic or predictive markers [10,11]. Prognostic markers are associated with future clinical outcomes, irrespective of treatment status, while predictive markers predict differential response to drug treatment. Baseline albuminuria is an example of a prognostic marker in trials with angiotensin receptor blockers, since it is associated with renal and cardiovascular outcomes but unrelated to the size of treatment response [12]. On the other hand, early reduction in albuminuria after a relatively brief exposure to treatment is a predictive marker that was shown to be associated with differential treatment response on renal and cardiovascular endpoints in randomized clinical trials [13-15]. Information on baseline markers or short-term changes that are associated with a better long-term treatment outcome could be of value to improve the efficiency of randomized clinical trials [16]. Predictive markers allow for selection of patient subgroups for which the expected effect of treatment (as compared to control) on a clinical endpoint is larger than in the unselected population.
Restricting randomization to this subgroup would yield a larger effect size estimate in the selected stratum, and smaller sample size requirements to significantly establish treatment efficacy. In this paper, we make a distinction between selection based on baseline marker values and selection based on short-term marker changes in response to treatment exposure. The former can be described as a baseline selection design (BSD), since only a selection of the recruited population (i.e. those with a predefined marker value) is randomized at baseline. The latter gives rise to an active run-in design (ARD), where all the recruited patients initially receive treatment and only those with a predefined minimum improvement on the marker are selected for random allocation to treatment or control. In this case, improvement on the marker during active run-in is used as a predictive marker to guide patient selection for the randomized study stage. This design has some resemblance to a randomized withdrawal design, in which all patients are treated with an experimental treatment until response or recovery, are subsequently randomized to treatment or control, and are then followed for a clinical outcome (relapse). However, in the present run-in design the initial period is much shorter and only needed to observe a (minimum) response on a marker to guide selection. Both the BSD and ARD exclude part of the recruited population. Although selection may reduce the number of patients needed to establish a significant treatment benefit in the


randomized study stage, it also entails that part of the recruited population is excluded after screening. Thus, the number to recruit and screen at baseline or after brief exposure to treatment may be much larger than the number needed to randomize. Hence, it is not evident whether a BSD and ARD are overall more efficient than a conventional parallel group design (PGD) where no selection is applied. Moreover, when selection is based on a predictive marker, as in the BSD and ARD, it restricts the population to which the study results apply, and thus limits the generalizability of the trial's results. This study uses statistical simulations to compare the PGD, BSD and ARD on sample size requirements and the generalizability of the study results. Simulations were performed across a range of realistic scenarios, including a representative scenario based on empirical data from two clinical trials that evaluated the efficacy of antihypertensive treatments in diabetic patients.

Methods
Study designs
The PGD, BSD and ARD were compared on the number of patients required to significantly establish a treatment benefit with 80 percent statistical power and a two-sided 5 percent nominal type-I error rate. In both the PGD and the BSD, patients are randomized to treatment or placebo at baseline, with follow-up until either the clinical endpoint or the end of the study occurs. In the PGD, the study population comprises a representation of the general (i.e. unselected) patient population, whereas in the BSD, the study population is restricted to patients with a baseline marker level that exceeds a predefined cutoff value. Hence, in the BSD only a fraction of the patients who could have been enrolled in the PGD are actually randomized. In the ARD, all patients start the study on active treatment, and only those in whom improvement on the marker outcome after run-in exceeds a predefined minimal cutoff value are randomized to treatment or placebo in the second study stage and followed up until either the clinical endpoint or the end of the study occurs. In the BSD and the ARD, the proportion of the enrolled population that continues into the randomized study stage can be denoted by p, and depends on a selection criterion c. Suppose that both a higher baseline marker level and greater improvement (i.e. reduction) on the marker during active run-in predict a larger treatment benefit in the randomized study stage. In that case, when the absolute value of c is large, p is small since few patients meet the selection criterion. Furthermore, the observed effect of treatment on the clinical endpoint will be relatively large in the randomized population, but the fraction of the total population to which these findings apply is reduced and generalizability hence limited. Conversely, when c is smaller, p is larger and the observed effect in the randomized population is smaller, but generalizability improves.
It is important to note that the value of c is a design parameter that should always be defined before the study starts to control the type-I error rate of subsequent


tests. In the PGD, there is no selection criterion and all the recruited patients are actually randomized for follow-up.

Simulations
We conducted a simulation study to assess the sample size requirements of the PGD (NPGD), the BSD (NBSD), and the ARD (NARD). In all designs, patients were randomized in a 1:1 control-treatment ratio, either at baseline (PGD and BSD) or after the first study stage (ARD). A single, large dataset (representing 100,000 patients) was generated. Included in this dataset were the treatment status in the first (before randomization) and second (after randomization) study stage (T1 and T2, respectively, with 0 representing placebo and 1 active treatment), and the marker level at baseline (A0) and at the end of the run-in stage (A1). Treatment status was independent of marker levels. For the PGD and BSD, the treatment status of a patient was the same in the first and second study stage (i.e. T1=T2). For the ARD, all subjects were treated in the first stage (T1=1) and randomized in the second. A0 and A1 were generated as follows: first, two series were generated from a multivariate normal distribution (~N(3,1)) with correlation r. The first series represented A0, and A1 was derived as the second series minus Δ (i.e. the assumed mean first-stage improvement on the marker, depending on treatment arm). We assumed no average first-stage improvement on the marker among patients on placebo, so for this group Δ = 0. Endpoint-free survival times were generated using the method described by Bender et al. [17]. First, a linear predictor was defined by:

lp = β1·T2 + β2·A0 + β3·T2·A0 + β4·T2·(A0 − A1)    (1)

In which β1 represented the direct effect of treatment in the second stage (i.e. not mediated through the marker), β2 the main effect of baseline marker level (the prognostic part of the baseline marker), β3 the treatment status by baseline marker level interaction (the predictive part of the baseline marker), and β4 the treatment status by first-stage marker improvement interaction (the predictive part of change on the marker). Equation 1 can be rewritten as:

lp = [β1 + β3·A0 + β4·(A0 − A1)]·T2 + β2·A0    (2)

to show that the effect of treatment status in the second stage is a combined function of β1, β3 and β4. Based on the linear predictor, endpoint-free survival times (S) were generated using:

S = −log(U) / (λ0·exp(lp))

where λ0 is the baseline hazard, and U is a random number from the uniform distribution U(0,1).
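The data-generating steps above can be sketched in code. The thesis simulations were run in R; the Python sketch below is purely illustrative, with made-up parameter values and a function name (`simulate_patients`) of my own. It assumes every patient is on active treatment during run-in (T1=1, as in the ARD), generates correlated N(3,1) markers, builds the linear predictor of equation 1, and inverts the survival function as in Bender et al.

```python
import math
import random

def simulate_patients(n, r=0.7, delta=0.5, beta=(0.0, 0.0, -0.1, -0.6),
                      lam0=0.01, seed=1):
    """Sketch of the data-generating process (illustrative values only).

    A0 and A1 are correlated N(3, 1) marker levels; all patients are on
    active treatment during run-in (T1 = 1, ARD-style), improving by
    `delta` on average.  Survival times follow the inversion method of
    Bender et al.: S = -log(U) / (lam0 * exp(lp)).
    """
    b1, b2, b3, b4 = beta
    rng = random.Random(seed)
    patients = []
    for i in range(n):
        t2 = i % 2                       # 1:1 allocation in the second stage
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        a0 = 3 + z1                                          # baseline marker
        a1 = 3 + r * z1 + math.sqrt(1 - r * r) * z2 - delta  # after run-in
        # Equation (1): lp = b1*T2 + b2*A0 + b3*T2*A0 + b4*T2*(A0 - A1)
        lp = b1 * t2 + b2 * a0 + b3 * t2 * a0 + b4 * t2 * (a0 - a1)
        s = -math.log(rng.random()) / (lam0 * math.exp(lp))  # survival time
        patients.append((t2, a0, a1, s))
    return patients
```

Selection for the BSD or ARD then amounts to filtering this list on a0 or on a0 − a1 before the stage-two comparison.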


The total follow-up duration was truncated at 100 units of time for all three designs. The run-in period was set to comprise 12 percent of the total study duration (corresponding to about 3 months in a study with a 2-year total follow-up duration). Hence, while the total duration of the different designs was equal, the duration of the randomized study stage was 12 percent shorter for the ARD than for the PGD and BSD. Subjects who experienced an event during the active run-in stage in the ARD were excluded from analysis, which in principle reduces the efficiency of the ARD in terms of sample size as compared to the other designs. In practical applications the number of patients excluded this way would typically be small, since the active run-in phase is relatively short. The performance of the three designs was evaluated across three sets of scenarios, which are summarized in Table 1. Within each set, multiple combinations of β3 and β4 (i.e. the predictive parts of the marker and of the short-term change in the marker) were defined. The sets differed in the values for β1 and β2 (i.e. the direct treatment effect and the prognostic part of the marker), and in the value of the baseline hazard (λ0), which was chosen to result in equal event rates for the PGD placebo arm across the different scenarios.

Table 1: Evaluated scenarios

Scenario set    r     Δ     β1     β2    β3          β4
I               0.7   0.5    0.0   0.0   0.0, -0.1   0.0, -0.6
II              0.7   0.5   -0.3   0.0   0.0, -0.1   0.0, -0.6
III             0.7   0.5    0.0   0.5   0.0, -0.1   0.0, -0.6

For every scenario a separate dataset was generated. From this dataset, only patients with A0 > cA0 (i.e. patients whose baseline marker level exceeded a predefined threshold cA0) were randomized in the BSD, and only those with A0 − A1 > cA0−A1 (i.e. patients whose marker improvement exceeded a predefined threshold cA0−A1) were randomized in the ARD. The values for cA0 and cA0−A1 were chosen such that p represented a designated percentile: 100, 90, and down to 10 percent of the total patient population. The cutoff values were based on the entire unselected population. Obviously, no patient selection was applied in the PGD (p=1.0). For every value of p (1.0 to 0.1), we estimated the sample size required to significantly establish treatment efficacy in the corresponding patient stratum, based on a log-rank test with 80 percent statistical power and a nominal two-sided type-I error rate of 5 percent (using the ssizeCT.default() function from the powerSurvEpi package in R). This process of generating a dataset and estimating the required sample size for the various patient strata was repeated 100 times, and estimates were averaged across replications to reduce random simulation error. The resulting value represented the number of patients to be randomized, and was multiplied by the inflation factor 1/p to obtain the number of patients to be recruited. For the parallel group design the same steps were performed, but since no patient selection was applied, the number of patients to be randomized equaled the number to be recruited.
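The thesis computed log-rank sample sizes with ssizeCT.default() from the R package powerSurvEpi. As a rough stand-in, the sketch below uses Schoenfeld's well-known approximation for the required number of events in a two-arm 1:1 trial, and applies the 1/p inflation factor described above; the function names and example values are mine, not the thesis code.

```python
import math
from statistics import NormalDist

def schoenfeld_n(hr, p_event, alpha=0.05, power=0.80):
    """Approximate total sample size for a two-arm log-rank comparison
    (Schoenfeld's formula, 1:1 allocation).  `p_event` is the overall
    probability of observing the endpoint during follow-up."""
    za = NormalDist().inv_cdf(1 - alpha / 2)     # two-sided alpha
    zb = NormalDist().inv_cdf(power)
    events = 4 * (za + zb) ** 2 / math.log(hr) ** 2
    return math.ceil(events / p_event)

def number_to_recruit(n_randomized, p):
    """Inflate the number to randomize by 1/p to obtain the number to
    recruit and screen in a selection design (BSD or ARD)."""
    return math.ceil(n_randomized / p)
```

For instance, a hazard ratio of 0.7 with an overall event probability of 0.5 requires roughly 494 patients to randomize; if only half the screened population is selected (p = 0.5), about twice that number must be recruited.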


The data simulated for patients in a selected stratum can also be used to derive an (extrapolated) overall effect size estimate (i.e. hazard ratio) for the entire unselected patient population, based on the regression model with parameters estimated from the data in that particular trial. This can be done for each realized simulated trial, and simulates the situation that a particular selection design was actually executed, and a treatment effect for the full population was estimated from that trial. Obviously, this overall effect size estimate becomes less precise when derived from an increasingly restricted stratum. We evaluated the precision of these estimates for the various patient strata in the ARD and the BSD. This was done by deriving the effect size estimate for the unselected population from each stratum, and replicating this process 1,000 times.

Empirical example of antihypertensive trials with diabetic patients
We also evaluated the three designs for a scenario derived from data of two empirical studies. The Reduction in End Points in NIDDM with the Angiotensin II Antagonist Losartan (RENAAL) study and the Irbesartan Diabetic Nephropathy Trial (IDNT) were both multinational, randomized, double-blind trials with a renal endpoint (i.e. development of end-stage renal disease, or death from any cause), conducted in patients in advanced stages of diabetic nephropathy [18, 19]. The RENAAL and IDNT trials involved 1513 and 1715 patients, respectively. In the RENAAL trial, patients received losartan (either 50 or 100 mg/day) or placebo. In the IDNT trial, patients received irbesartan (300 mg/day), amlodipine (10 mg/day) or matched placebo. Both trials were designed to compare an Angiotensin Receptor Blocker based antihypertensive regimen with a conventional blood pressure lowering regimen. To this end, blood pressure was targeted to achieve a goal of less than 140/90 mmHg. If the blood pressure target was not achieved, additional antihypertensive agents (but not ACE-inhibitors or Angiotensin Receptor Blockers) were allowed throughout the study. The average follow-up time was 3.4 years for the RENAAL study and 2.6 years for IDNT. Several clinical and laboratory characteristics were assessed at regular intervals during the trials, including a measurement of albuminuria at baseline and after 3 months of follow-up. For the present study, the data from the losartan and irbesartan arms were pooled into a single active treatment group and compared to the pooled data from both placebo arms; the amlodipine arm (567 subjects) of the IDNT trial was excluded from analysis. The distribution of albuminuria levels was first normalized by applying a log-transformation and then standardized and shifted three units upward, in order to allow for comparisons with the results from our simulations.
Follow-up was truncated at 750 days, i.e. approximately 2 years, again to allow for comparisons. The data were fitted to the model represented by equation 1, in order to estimate the effects of treatment, biomarker levels, and their interactions on the outcome. The other parameters that were used in the simulations (r, Δ, and λ0) were also derived from the empirical data. The resulting scenario was evaluated using the simulation approach described above. This allowed us to determine the number of patients to recruit for the PGD and for the various strata in the ARD and BSD for this particular empirical example.
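The albuminuria preprocessing described above (log-transform, standardize, shift three units upward) takes only a few lines of code. The helper below and its input values are illustrative, not the RENAAL/IDNT data.

```python
import math
from statistics import fmean, pstdev

def normalize_marker(values, shift=3.0):
    """Log-transform, standardize (mean 0, SD 1), and shift upward,
    mirroring the albuminuria preprocessing described in the text.
    `values` must be positive."""
    logs = [math.log(v) for v in values]
    mu, sd = fmean(logs), pstdev(logs)
    return [(x - mu) / sd + shift for x in logs]
```

After this transformation the marker has mean 3 and unit standard deviation, matching the ~N(3,1) scale used in the simulations.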


Results
The results of the simulations for the first set of scenarios are presented in figure 1. These scenarios included neither a direct effect of treatment unrelated to baseline marker level or short-term marker improvement (β1=0), nor an association between baseline marker level and outcome incidence (β2=0). In other words, baseline marker level was not a prognostic factor for endpoint-free survival, and the full effect of treatment compared to placebo was predictable from baseline marker levels or early marker improvements during the run-in stage.

[Figure 1 appears here: three panels (A: β3=0.0 & β4=−0.6; B: β3=−0.1 & β4=0.0; C: β3=−0.1 & β4=−0.6) plotting sample size gain as compared to the PGD (%, −60 to 60) against p (1.0 to 0.1) for the ARD and BSD.]

Figure 1: Sample size gain for ARD and BSD vs. PGD for scenarios IA-C. Figures show the gain in the number to recruit for the active run-in design (ARD) and the baseline selection design (BSD) as compared to the parallel group design (PGD) for the scenarios in the first set (i.e., IA-C; in reading direction). Positive gain indicates smaller sample sizes (and thus higher efficiency). Gain is expressed as a percentage relative to the PGD. The sample sizes for the PGD were 1,386, 800 and 318 for scenarios IA-C, respectively.


In scenario IA, the effect of treatment was fully expressed as part of the interaction with early marker improvement (β3=0, β4≠0). In this case, both the ARD and the BSD had the potential to reduce sample size requirements in comparison to the PGD. For the ARD in particular, the increase in treatment effect from the unselected population to more restricted patient strata outweighed the loss of efficiency due to the exclusion of patients after selection. At the optimal degree of population restriction (p=0.5), the ARD required a little under one third of the sample size for the PGD. Further restriction reduced the comparative efficiency of the ARD, since the further increase in treatment effect no longer outweighed the increasing exclusion of patients after the run-in stage. Note that when p=1, the ARD required slightly more patients than either the PGD or the BSD (a general picture seen in all the evaluated scenarios), because events during the active run-in stage were excluded from the analysis, and the total event rate was therefore smaller in the ARD. The BSD increased efficiency (as compared to the PGD) because higher baseline marker levels were correlated with larger marker improvement. As a result, patients in the more restricted strata generally displayed a larger marker improvement and hence a stronger effect of treatment. At the optimal level of restriction (p=0.7), the BSD was about 30 percent more efficient than the PGD, but still much less efficient than the ARD. When the effect of treatment was fully expressed as part of the interaction with baseline marker levels (β3≠0, β4=0; scenario IB), neither the BSD nor the ARD was (for any value of p) more efficient than the PGD. In this case, the larger treatment effect in more restricted strata was cancelled out by the increasing proportion of patients that were excluded after selection.
The ARD was less efficient than the BSD since improvement on the marker during run-in was only partly related to baseline marker level, and hence to an increase in effect size with more restricted strata. In general, patients with a larger short-term improvement had a higher baseline marker level to start with, and while improvement in the marker was unrelated to survival in this scenario (β4=0), baseline marker level was not (β3≠0). When the treatment effect was related to both the baseline marker level and short-term marker improvement (i.e. β3≠0, β4≠0; scenario IC), the ARD and BSD had some potential to increase efficiency compared to the PGD, but considerably less than when the treatment effect was only included as part of the interaction with marker improvement. For the ARD, the increased treatment effect in more restrictive strata still outweighed the exclusion of patients after run-in, but only up to a certain degree of restriction. Further restriction (p<0.2) caused the ARD to become substantially less efficient than the PGD. The largest increase in efficiency of the ARD corresponded to approximately 25 percent fewer patients than the PGD. For the BSD, the maximum efficiency gain was only 10 percent. Figure 2 shows the results for the second set of scenarios, which all included a direct effect of treatment (β1≠0), meaning that part of the effect was unrelated to baseline marker level or marker improvement. These scenarios did not include an association between baseline


marker level and outcome (β2=0). In this case, sample sizes were reduced overall, since the total effect of treatment was a combined function of β1, β3 and β4 (see equation 2), and therefore larger than in the scenarios of the first set. In general, the results showed the same patterns as observed for the first set, but the potential for an efficiency gain (for the ARD and BSD as compared to the PGD) was reduced over the whole range of p. This results from the fact that the relative difference in treatment effect associated with an increase in β3 and/or β4 was smaller.

[Figure 2 appears here: three panels (A: β3=0.0 & β4=−0.6; B: β3=−0.1 & β4=0.0; C: β3=−0.1 & β4=−0.6) plotting sample size gain as compared to the PGD (%, −60 to 60) against p (1.0 to 0.1) for the ARD and BSD.]

Figure 2: Sample size gain for ARD and BSD vs. PGD for scenarios IIA-C. Figures show the gain in the number to recruit for the active run-in design (ARD) and the baseline selection design (BSD) as compared to the parallel group design (PGD) for the scenarios in the second set (i.e., IIA-C; in reading direction). Positive gain indicates smaller sample sizes (and thus higher efficiency). Gain is expressed as a percentage relative to the PGD. The sample sizes for the PGD were 298, 228 and 146 for scenarios IIA-C, respectively.


Figure 3 presents the results for the third set of scenarios. These scenarios included an association between baseline biomarker level and outcomes (β2≠0), but no direct effect of treatment independent of the interaction with baseline biomarker level or biomarker improvement (β1=0). Equation 2 shows that β2 had no influence on the size of the treatment effect (i.e. it represents the prognostic value of baseline marker level), and instead resulted only in an overall increase in event rate (for both the control and the treatment group), which was corrected for by lowering the baseline hazard.

[Figure 3 appears here: three panels (A: β3=0.0 & β4=−0.6; B: β3=−0.1 & β4=0.0; C: β3=−0.1 & β4=−0.6) plotting sample size gain as compared to the PGD (%, −60 to 60) against p (1.0 to 0.1) for the ARD and BSD.]

Figure 3: Sample size gain for ARD and BSD vs. PGD for scenarios IIIA-C. Figures show the gain in the number to recruit for the active run-in design (ARD) and the baseline selection design (BSD) as compared to the parallel group design (PGD) for the scenarios in the third set (i.e., IIIA-C; in reading direction). Positive gain indicates smaller sample sizes (and thus higher efficiency). Gain is expressed as a percentage relative to the PGD. The sample sizes for the PGD were 900, 692 and 236 for scenarios IIIA-C, respectively.


[Figure 4 appears here: two panels (ARD and BSD) plotting the estimated hazard ratio for the unselected population (0.0 to 3.0) against p (1.0 to 0.1).]

Figure 4: Estimated hazard ratios for the unselected population. The figure shows the estimates of the hazard ratio in the unselected patient population as derived from the various patient strata in the ARD and BSD. Each circle represents a single trial replication and the red dot is the average estimate over all (=1,000) replications.


Figure 4 shows the estimates of the effect size (i.e. hazard ratio) in the unselected patient population as derived from the various patient strata in the ARD and BSD. As expected, the imprecision of these effect size estimates (i.e. the variation in the estimates observed between replications) increased when they were derived from increasingly restricted (and thus smaller) patient strata. In other words, there is an increased risk that the extrapolated effect size estimate for the unselected population deviates substantially from the true value when it is derived from increasingly restricted strata. The imprecision was larger for the BSD than for the ARD, because the time between patient selection and the end of follow-up was shorter for the ARD.

Finally, figure 5 shows the results for the scenario that was derived from the empirical data. This scenario was characterized by the following parameters: β1 = -0.25 (p-value = 0.51), β2 = 1.10 (p-value < 0.01), β3 = 0.05 (p-value = 0.61), β4 = -0.49 (p-value < 0.01), r = 0.79, Δ = 0.32, and λ0 = 7.7e-5. These results indicate that baseline albuminuria is a strong prognostic factor for endpoint-free survival but not significantly associated with differential response to treatment, while early improvement in albuminuria is significantly associated with differential treatment response.

[Figure 5 appears here: the number of patients to recruit (0 to 2500) against p (1.0 to 0.1) for the ARD, BSD and PGD.]

Figure 5: Sample size gain as compared to the PGD for the empirical example. The fi gure shows the sample size that needs to be recruited in the active run-in design (ARD), the baseline selection design (BSD) and the parallel group design (PGD) to signifi cantly establish treatment effi cacy with a signifi cance level of 5 percent and 80 percent statistical power.

Chapter 3.3


In addition, the direct effect of treatment was small and not significant. Care should be taken in interpreting these estimates, as confounding cannot be excluded; i.e. treatment may independently affect endpoint-free survival and short-term biomarker change, while the latter does not predict the former.

The results from the simulation indicate that the BSD did not have much potential to increase efficiency in comparison to the PGD. In contrast, the ARD did have the potential to increase efficiency. With an unselected population (p=1.0) the ARD required about 20 percent more patients than the PGD, but with optimal restriction (p=0.5) the advantage was about 35 percent (figure 5).

Discussion

This study evaluated the sample size requirements for three study designs across scenarios in which the effect of treatment on endpoint-free survival was characterized as a combination of an interaction with baseline marker levels and short-term marker improvements and a direct effect (i.e. unrelated to baseline marker level or short-term marker improvement). The designs were: 1) a parallel group design with an unselected patient population (i.e. parallel group design; PGD), 2) a parallel group design with patients selected on their baseline marker level (i.e. baseline selection design; BSD) and 3) an active run-in design with patients selected on their improvement on the marker during the active run-in stage (i.e. active run-in design; ARD).

The ARD has the potential to reduce sample size requirements (i.e. increase efficiency) in comparison to the PGD and BSD when the effect of treatment is predominantly expressed through early improvements on the marker. When the interaction with baseline marker level or the direct effect of treatment increases, the efficiency advantage of the ARD decreases.

The BSD only improves efficiency as compared to the PGD in case of a strong interaction between short-term marker improvements and treatment status, since higher baseline marker levels are correlated with larger short-term improvements. When baseline marker level in itself is the only predictive factor for treatment efficacy, the larger effect in the more restricted strata is cancelled out by the increased proportion of patients that is excluded after selection. In this case, there is no advantage of baseline selection.

The current study also evaluated a scenario based on empirical data where baseline albuminuria was a prognostic biomarker for endpoint-free survival, and short-term reduction in albuminuria after treatment was predictive of differential treatment response in the longer term (i.e. larger short-term marker reduction predicted better endpoint-free survival). For this scenario, we found that particularly the ARD has the potential to reduce sample size requirements as compared to the PGD (up to 35 percent). This advantage was achieved by


randomizing 50 percent of the recruited study population with the largest early reduction in albuminuria.

While in this example we selected patients on a predictive biomarker, selection could also be based on predictive genomic markers (at baseline) [20] or risk estimates from a prognostic model [21]. Examples include the use of trastuzumab in breast cancer patients with positive HER-2 status [22], imatinib targeted according to BCR-ABL mutation status in patients with chronic myeloid leukaemia [23], and gefitinib in pulmonary adenocarcinoma patients with epidermal growth factor receptor (EGFR) mutations [24]. An example of predictive short-term change with treatment is the lowering of LDL cholesterol and the effect of lipid-modifying therapies on the risk for cardiovascular events [25].

Both the BSD and the ARD apply selection based on an individual patient measure that is associated with differential treatment response, obtained either at baseline (BSD) or after a relatively short exposure to treatment (ARD). In statistical terms, there is an interaction between the measure and the effect of treatment on the outcome, and in biological terms there may be an underlying mechanism explaining the interaction. As a result, every conceivable gain in efficiency (with the BSD or the ARD) is associated with a reduction in the generalizability of the trial results, since the selected and randomly allocated patients are unrepresentative of the entire patient population. In addition, more stringent selection reduces the precision of extrapolated estimates for the overall (i.e. unselected) population. Hence, BSD and ARD trial results specifically apply to the selected subgroup and cannot be extrapolated to broader populations without running the risk of introducing bias, as was shown in figure 4. This has significant implications for the use of these designs in confirmatory trials, and may ultimately affect the label of the investigated drug. This potentially limits their application.

In a sense, the selection designs generate a substantial amount of "missing data": information on (longer term) efficacy and safety will be missing for the de-selected population. In this respect, the active run-in design has the advantage that there will be substantial short-term information on the experimental treatment from the run-in phase. The ARD and BSD are particularly attractive when a relevant treatment effect in the full population is unlikely, but valid markers are available to reliably identify patient subgroups with an increased likelihood to benefit from treatment.

The practicality of the BSD and ARD obviously depends on the availability of a predictive marker that can identify the patient subgroup that is most likely to benefit from treatment. The most appropriate marker for selection is one in the causal path from treatment to effect, preferably established in earlier research. If the investigated treatment actually benefits all patients regardless of marker status, then enrolling only marker-positive patients may slow trial accrual, increase expense, and unnecessarily limit the size of the indicated patient population. If the targeted therapy truly benefits subgroups of patients, but the marker used for selection does not correctly identify that group, then a beneficial therapy could mistakenly


be abandoned. It is, however, difficult to identify genuine predictors of differential response from single trials, as these are usually insufficiently large to reliably assess whether a marker is truly predictive of treatment response as a primary objective. As a result, evidence on predictive markers accumulates slowly, from secondary analyses of existing trials and their meta-analysis.

In the current study, simulation data was generated using a fairly straightforward model that divided the total effect of treatment on endpoint-free survival into three components (i.e. a direct treatment effect, an interaction between treatment and baseline marker level, and an interaction between treatment and short-term marker changes), and also included a main (prognostic) effect of baseline marker level on survival.

When the treatment effect was fully accounted for by its interaction with baseline marker levels (i.e. when there was no direct effect and no interaction with short-term marker changes), the increase in efficiency that resulted from larger effect sizes in more restricted strata was compensated by the exclusion of patients after selection. In this case, the BSD and ARD had very limited potential to increase efficiency as compared to the PGD. This resulted from the fact that baseline marker levels were linearly associated with the linear predictor (eq.1) and exponentially associated with the treatment vs. control hazard ratio (eq.3). Consequently, the additional reduction in the number to randomize decreased exponentially with further restriction of the indicated population. When restriction was more extreme, the exclusion of patients after selection started to outweigh the advantage of larger effects in the more restricted strata. Under these circumstances, the BSD only increased efficiency (as compared to the PGD) in case of a very strong (i.e. unrealistic) interaction effect between baseline marker level and treatment status, which was not included in the evaluated scenarios.

When interpreting our findings, some considerations regarding our model should be mentioned. An increase in the main effect of baseline marker level increased the treatment effect size and hence reduced sample size requirements, despite adjustment of the baseline hazard to yield the same control-group event rate as observed in the other scenarios. When we simulated data based on a model that included a main effect of baseline marker level on survival, there was more variation in the individual hazards. As a result, some patients were almost certain to develop the endpoint early in the trial, irrespective of their treatment status, whereas others were very unlikely to experience the endpoint during the trial. The net effect was that the probability of the outcome after a certain period of follow-up was lower among treated subjects than among the treated subjects in the simulations based on the model without the main effect of baseline marker level on survival.

The model used is fairly flexible and fitted the experimental data well. However, results and conclusions may not hold to the same extent if a substantially different model applies. This will particularly be the case if the underlying proportional hazards assumption is inappropriate.
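A model of this form can be sketched with the standard inverse-transform method for generating survival times under proportional hazards with a constant (exponential) baseline hazard [17]. The sketch below is illustrative only, not the study's actual simulation code: the default coefficients are the empirical-scenario estimates quoted earlier (β1 through β4 and λ0), the function and variable names are hypothetical, and marker_change is assumed to be coded so that larger values mean a larger early improvement.

```python
import math
import random

def simulate_survival_time(treated, baseline_marker, marker_change, rng,
                           b1=-0.25, b2=1.10, b3=0.05, b4=-0.49,
                           lambda0=7.7e-5):
    """Draw one endpoint-free survival time from a Cox model with a
    constant baseline hazard, via inverse-transform sampling:
    T = -ln(U) / (lambda0 * exp(lp)).

    The linear predictor lp combines a direct treatment effect (b1),
    a prognostic main effect of the baseline marker (b2), and the two
    treatment interactions (b3 with baseline level, b4 with short-term
    marker change).  Illustrative sketch, not the study's code.
    """
    lp = (b1 * treated
          + b2 * baseline_marker
          + b3 * treated * baseline_marker
          + b4 * treated * marker_change)
    u = rng.random()
    return -math.log(u) / (lambda0 * math.exp(lp))

rng = random.Random(1)
# With b4 < 0, treated patients with a larger early marker improvement
# have a lower hazard, hence longer endpoint-free survival on average.
responders = [simulate_survival_time(1, 1.0, 1.5, rng) for _ in range(5000)]
non_responders = [simulate_survival_time(1, 1.0, 0.0, rng) for _ in range(5000)]
assert sum(responders) / 5000 > sum(non_responders) / 5000
```

Restricting randomization to the stratum with the largest early marker improvement then amounts to keeping only the draws with high marker_change, which is exactly where the hazard ratio is most favourable under this model.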


It should also be noted that the improvement in efficiency with the BSD and ARD as reported in this study is relative to a PGD that disregards baseline marker level and early changes in the marker in response to treatment as covariates in the analysis of the data. Including these factors as covariates in the final drug efficacy analysis would improve the efficiency of the PGD and hence reduce the comparative advantage of the ARD and BSD.

In summary, our results suggest that an ARD can potentially reduce the number of patients to recruit in a clinical trial when the short-term improvement on the marker during run-in is a strong and reliable predictor of differential treatment response. Under these conditions, the BSD was also potentially more efficient than the PGD, but always less efficient than the ARD given equally restricted strata. For all the other scenarios we evaluated, no meaningful advantage was observed for the BSD. Generalizability issues may limit the applicability of the ARD and BSD in practice. In addition, valid markers must be available to reliably identify patient subgroups with an increased likelihood to eventually benefit from investigational treatment.


Reference list

1) DiMasi JA, Feldman L, Seckler A, Wilson A. Trends in Risks Associated With New Drug Development: Success Rates for Investigational Drugs. Clinical Pharmacology and Therapeutics 2010; 87:272-277.

2) Thiers FA, Sinskey AJ, Berndt ER. Trends in the globalization of clinical trials. Nature Reviews Drug Discovery 2008; 7:13-14.

3) Woodcock J, Woosley R. The FDA critical path initiative and its influence on new drug development. Annual Review of Medicine 2008; 59:1-12.

4) Dickson M, Gagnon JP. Key factors in the rising cost of new drug discovery and development. Nature Reviews Drug Discovery 2004; 3(5):417-429.

5) Kaitin KI, DiMasi JA. Pharmaceutical innovation in the 21st century: new drug approvals in the first decade, 2000-2009. Clinical Pharmacology and Therapeutics 2011; 89(2):183-188.

6) US Department of Health and Human Services, Food and Drug Administration. Innovation or stagnation? Challenge and opportunity on the critical path to new medical products. 2004.

7) Bretz F, Koenig F, Brannath W, Glimm E, Posch M. Adaptive designs for confirmatory clinical trials. Statistics in Medicine 2009; 28(8):1181-1217.

8) Riley RD, Hayden JA, Steyerberg EW, Moons KGM, Abrams K, Kyzas PA, et al. Prognosis research strategy (PROGRESS) series 2: Prognostic factor research. British Medical Journal 2012.

9) Atkinson AJ, Colburn WA, DeGruttola VG, DeMets DL, Downing GJ, Hoth DF, Oates JA, Peck CC, Schooley RT, Spilker BA, Woodcock J, Zeger SL. Biomarkers and surrogate endpoints: Preferred definitions and conceptual framework. Clinical Pharmacology and Therapeutics 2001; 69:89-95.

10) Bakhtiar R. Biomarkers in drug discovery and development. Journal of Pharmacological and Toxicological Methods 2008; 57(2):85-91.

11) Freidlin B, McShane LM, Korn EL. Randomized Clinical Trials With Biomarkers: Design Issues. Journal of the National Cancer Institute 2010; 102:152–160.

12) Keane WF, Brenner BM, De Zeeuw D, Grunfeld JP, McGill J, Mitch WE, et al. The risk of developing end-stage renal disease in patients with type 2 diabetes and nephropathy: the RENAAL study. Kidney International 2003; 63(4):1499-1507.

13) De Zeeuw D, Remuzzi G, Parving HH, Keane WF, Zhang Z, Shahinfar S, et al. Albuminuria, a therapeutic target for cardiovascular protection in type 2 diabetic patients with nephropathy. Circulation 2004; 110(8):921-927.

14) De Zeeuw D. Albuminuria, not only a cardiovascular/renal risk marker, but also a target for treatment? Kidney International Suppl 2004; (92):S2-S6.

15) Holtkamp FA, De Zeeuw D, De Graeff PA, Laverman GD, Berl T, Remuzzi G, et al. Albuminuria and blood pressure, independent targets for cardioprotective therapy in patients with diabetes and nephropathy: a post hoc analysis of the combined RENAAL and IDNT trials. European Heart Journal 2011; 32(12):1493-1499.


16) Simon R, Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research 2004; 10(20):6759-6763.

17) Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine 2005; 24(11):1713–1723.

18) Brenner BM, Cooper ME, De Zeeuw D, Keane WF, Mitch WE, Parving HH, et al. Effects of losartan on renal and cardiovascular outcomes in patients with type 2 diabetes and nephropathy. New England Journal of Medicine 2001; 345(12):861-869.

19) Lewis EJ, Hunsicker LG, Clarke WR, Berl T, Pohl MA, Lewis JB, et al. Renoprotective effect of the angiotensin-receptor antagonist irbesartan in patients with nephropathy due to type 2 diabetes. New England Journal of Medicine 2001; 345(12):851-860.

20) Simon R. The use of genomics in clinical trial design. Clinical Cancer Research 2008; 14(19):5984–5993.

21) Hingorani AD, Hemingway H. How should we balance individual and population benefi ts of statins for preventing cardiovascular disease? British Medical Journal 2011; 342:c6244.

22) Hudis CA. Trastuzumab--mechanism of action and use in clinical practice. New England Journal of Medicine 2007; 357(1):39-51.

23) Capdeville R, Buchdunger E, Zimmermann J, Matter A. Glivec (STI571, imatinib), a rationally developed, targeted anticancer drug. Nature Reviews Drug Discovery 2002; 1(7):493-502.

24) Mok TS, Wu YL, Thongprasert S, Yang CH, Chu DT, Saijo N, et al. Gefitinib or carboplatin-paclitaxel in pulmonary adenocarcinoma. New England Journal of Medicine 2009; 361(10):947-57.

25) Cholesterol Treatment Trialists' (CTT) Collaborators. Efficacy and safety of cholesterol-lowering treatment: prospective meta-analysis of data from 90 056 participants in 14 randomised trials of statins. Lancet 2005; 366:1267-78.

Chapter 4.1

Classifying responders and nonresponders; does it help when there is evidence of differentially responding patient groups?

Boessen R, Groenwold RHH, Knol MJ, Grobbee DE, Roes KCB

Journal of Psychiatric Research 2012;46(9):1169-73


Abstract

Introduction: Continuous trial outcomes are often dichotomized into 'response' and 'non-response' categories prior to statistical analysis. This facilitates the interpretation of results, but generally reduces statistical power. Exceptions may occur when response in the study population is heterogeneous, and outcomes are bimodally distributed. We explore whether bimodality is present in antidepressant trial data and whether dichotomizing then indeed yields more powerful analyses.

Methods: The distributions of relative changes from baseline (rCFB) on the Hamilton depression rating scale (HAM-D) were estimated using pooled data from nine antidepressant trials. T-tests on rCFB scores and chi-square tests on dichotomized outcomes were compared to assess the consequences of dichotomization, using both the commonly applied cutoff (i.e. rCFB>50%) and an estimated cutoff that provided optimal separation of the mixture of two normal distributions that best fitted the pooled placebo outcomes. The power of both tests was also evaluated for simulated scenarios that varied the degree of bimodality as well as treatment effect and sample size.

Results: Placebo and treatment groups showed evidence of bimodality. The estimated cutoff closely matched the commonly applied cutoff. Nevertheless, t-tests generally yielded smaller p-values than chi-square tests. Simulations showed that dichotomization only provides superior power when bimodality is considerably more marked than observed in the empirical data.

Conclusion: Antidepressant trial outcomes showed bimodality, suggesting differential response among patient groups. This heterogeneity in outcome distributions should be reported more often, since a comparison of means does not adequately summarize the differences between treatment groups. However, simply dichotomizing outcomes is not an appropriate alternative, as it reduces statistical power.

Dichotomizing bimodally distributed outcomes


Introduction

Patients that participate in medical research are often assigned to one of two outcome categories (e.g. responders or non-responders) based on whether their score on a continuous outcome variable is larger or smaller than a predefined cutoff. In antidepressant clinical trials, for example, patients with at least 50 percent improvement on the Hamilton depression rating scale are classified as responders and those with less improvement as non-responders [1]. Often, no clinical or statistical rationale is provided for choosing a certain cutoff.

Many researchers and clinicians find tests on dichotomized outcomes easier to perform and interpret than tests on the original, continuous data [2,3]. However, because dichotomized outcomes do not allow for differentiation between differently scoring individuals within the same outcome category (e.g., those just and those far exceeding the cutoff), dichotomization leads to loss of information and usually reduces the statistical power of subsequent tests. For this reason, it is generally recommended not to dichotomize [2,4-7]. However, exceptions may occur when continuous outcomes are bimodally distributed (i.e., as a mixture of two normal distributions) [8]. In this case, dichotomization could provide a more accurate representation of the underlying construct and increase the power of subsequent tests [2]. Bimodal outcome distributions arise when there is differential response among patients; e.g., when one group of patients shows meaningful improvements following treatment, while another group remains (relatively) unaffected.

Such a distinction (i.e., affected vs. unaffected patients) may be expected among participants of antidepressant clinical trials, since it has repeatedly been shown that a substantial portion of patients on antidepressants are unaffected by treatment [9-11]. Factors that predict treatment resistance have been identified in a primary care setting. These factors include the presence of a comorbid psychiatric or general medical disorder [12], which is often an exclusion criterion for antidepressant trials. Other predictors of treatment resistance include older age, female gender, family history, early or late age of onset, greater disease severity, and chronicity of depression [13]. As a result of differential treatment response, the distribution of trial outcomes may deviate from normality, and become multimodal (e.g., bimodal) instead. The usual sample sizes for clinical trials in depression do not allow for extensive exploration of multimodality.

To realistically assess the consequences of differential treatment response, this study first examines whether bimodality is present in antidepressant clinical trial outcomes, based on pooled data from multiple trials. Given such bimodality, it is then evaluated whether dichotomization indeed results in more powerful statistical tests. Also, additional simulations are performed to enable generalization to a broader range of scenarios. This will further clarify under what conditions it may be justified to dichotomize.


Methods

Empirical data

We analyzed the continuous and dichotomized outcomes from nine confirmatory antidepressant trials of six week duration. These trials compared either mirtazapine with placebo (5 trials), or mirtazapine with amitriptyline and placebo (4 trials). In total, the dataset therefore included outcomes from 13 individual comparisons between active treatment (i.e., mirtazapine or amitriptyline) and placebo. The study population of these trials consisted of outpatients suffering from major depressive disorder. The continuous outcome considered was the relative change from baseline (rCFB) on the 17-item Hamilton depression rating scale (HAM-D), which was calculated by dividing the change from baseline at the end of the trial by the score at baseline. Because reliably estimating bimodality requires sufficiently large sample sizes, data from the individual studies were pooled, after appropriately correcting for investigational center (and hence for between-study differences). For patients that dropped out (29.3 percent in total; 27.4 percent of those on placebo and 32.7 percent of those on active treatment), the last observation was carried forward.

The rCFB outcomes were dichotomized using either the commonly applied cutoff (rCFB > 0.5), or an estimated, data-derived cutoff that separated the mixture of the two normal curves that best fitted the pooled placebo outcomes. The estimated cutoff was assumed to result in optimal differentiation between the distinct patient groups that underlie outcome bimodality.

Continuous rCFB outcomes from the active vs. placebo comparisons were analyzed with an independent samples t-test. Dichotomized outcomes were analyzed with a chi-square test. Test performances were compared on the resulting p-values.
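An estimated, data-derived cutoff of this kind is the point where the two fitted normal components intersect, i.e. where a score is equally likely under either latent response group. For equal component standard deviations this has a closed form; the parameter values in the sketch below are illustrative assumptions only (the fitted values themselves are not reported here beyond the resulting cutoff), and the function name is hypothetical.

```python
import math

def mixture_cutoff(mu1, mu2, sigma, w):
    """Intersection point of the two equal-variance normal components
    w * N(mu1, sigma) and (1 - w) * N(mu2, sigma).

    Setting the two weighted densities equal and solving for x gives
    x = (mu1 + mu2) / 2 + sigma^2 * ln(w / (1 - w)) / (mu2 - mu1).
    """
    return (mu1 + mu2) / 2 + sigma ** 2 * math.log(w / (1 - w)) / (mu2 - mu1)

# With equal weights the cutoff is simply the midpoint of the two means.
assert abs(mixture_cutoff(0.2, 0.7, 0.25, 0.5) - 0.45) < 1e-12

# Illustrative values only: a heavier low-response component (w = 0.6)
# shifts the cutoff toward the high-response mode.
print(round(mixture_cutoff(0.2, 0.7, 0.25, 0.6), 3))  # 0.501
```

Note that the cutoff depends on the component weights as well as the means, so a cutoff tuned to one population's mixture need not be optimal for another.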

Simulations

Simulations were performed to evaluate the impact of dichotomization under varying degrees of bimodality. For two treatment groups (groups A and B) outcomes were drawn from a mixture of two normal distributions (i.e. a normal mixture) with the same fixed means (μ1 and μ2) and equal standard deviations (σ) (Figure 1). The outcome distributions of both groups differed only in the proportion of scores that originated from each of the distributions in the mixture. As a result, the expected mean outcome for groups A and B could be calculated using:

μA = μ1 + (1 − wA) · (μ2 − μ1)

and

μB = μ1 + (1 − wB) · (μ2 − μ1),

where wA and wB denoted the relative contribution of the left (i.e. lower) distribution for group A and group B, respectively. This is in agreement with the notion that bimodality results from differentially responding patient groups, and effective treatment causes some


patients to move from one group to another rather than shifting the underlying distributions.

The outcome distributions for groups A and B were composed of normal distributions with fixed means μ1 = 0.2 and μ2 = 0.7. The overall mean for group A was set to 0.30, by selecting wA = 0.8. The treatment effect size (d) was defined as the relative difference in the overall means between groups A and B, and fixed at 1.0 (no effect), 1.2, 1.3 or 1.4, by lowering wB. The degree of bimodality was influenced by increasing σ from 0.05 (i.e. extreme bimodality) to 0.30 (i.e. slight bimodality). The sample size per group (n) was either 75 or 150. Each scenario was replicated 100,000 times. Parameter values were chosen to resemble values observed in the actual trial data.
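The scenario parameters above can be checked with a few lines of arithmetic: the expected group means follow directly from the mixture weights, and the wB needed for a given effect size d can be solved for. The wB value printed below is derived here for illustration; the text itself only states that wB was lowered.

```python
def mixture_mean(w, mu1=0.2, mu2=0.7):
    """Expected mean of the mixture when w is the weight of the lower
    (mu1) component: mu = mu1 + (1 - w) * (mu2 - mu1)."""
    return mu1 + (1 - w) * (mu2 - mu1)

def weight_for_effect(d, w_a=0.8, mu1=0.2, mu2=0.7):
    """Solve for w_B so that group B's overall mean is d times group A's
    overall mean (d is the relative effect size used in the simulations)."""
    target = d * mixture_mean(w_a, mu1, mu2)
    return 1 - (target - mu1) / (mu2 - mu1)

# Reproduces the stated scenario: w_A = 0.8 gives a group-A mean of 0.30.
assert abs(mixture_mean(0.8) - 0.30) < 1e-9
print(round(weight_for_effect(1.2), 2))  # 0.68
```

So an effect size of d = 1.2 corresponds, under these parameters, to moving 12 percent of group B's patients from the low-response to the high-response component.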

Figure 1: An example of the distributions of the relative change from baseline (rCFB) outcomes from groups A and B. Both were composed of two normal distributions with fixed means (μ1 and μ2) and equal standard deviation. (x-axis: rCFB on HAM−D, from −0.5 to 1.0.)

Group differences on the continuous outcomes were analyzed with an independent samples t-test. In addition, the outcomes were dichotomized with a cutoff (c) equal to 0.5, and analyzed with a chi-square test. For non-zero treatment effects (i.e. d > 1.0), the statistical power (i.e. the probability of detecting a treatment effect that is truly present) was defined as the proportion of replications that arrived at a significant difference between groups A and B.
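A minimal version of this simulation can be written with only the standard library. The sketch below runs a single illustrative cell of the grid (σ = 0.15, d ≈ 1.3, n = 75, and a reduced number of replications), not the full study; the normal-approximation critical values (1.96 for the t statistic, 3.84 for the chi-square statistic) and all function names are simplifying assumptions.

```python
import math
import random

def draw_mixture(n, w, mu1, mu2, sigma, rng):
    """n draws from the mixture w*N(mu1, sigma) + (1-w)*N(mu2, sigma)."""
    return [rng.gauss(mu1 if rng.random() < w else mu2, sigma)
            for _ in range(n)]

def welch_t(x, y):
    """Welch's two-sample t statistic."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    return (my - mx) / math.sqrt(vx / len(x) + vy / len(y))

def chi2_2x2(x, y, cutoff=0.5):
    """Pearson chi-square statistic for responder (> cutoff) counts."""
    a, c = sum(v > cutoff for v in x), sum(v > cutoff for v in y)
    b, d = len(x) - a, len(y) - c
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

rng = random.Random(7)
mu1, mu2, sigma, n, reps = 0.2, 0.7, 0.15, 75, 500
w_a, w_b = 0.8, 0.62          # w_b chosen so the ratio of means is ~1.3
t_hits = chi_hits = 0
for _ in range(reps):
    x = draw_mixture(n, w_a, mu1, mu2, sigma, rng)
    y = draw_mixture(n, w_b, mu1, mu2, sigma, rng)
    t_hits += abs(welch_t(x, y)) > 1.96   # normal-approx. critical value
    chi_hits += chi2_2x2(x, y) > 3.84     # chi-square(1) at alpha = 0.05
t_power, chi_power = t_hits / reps, chi_hits / reps
print(t_power, chi_power)
```

Under this fairly marked bimodality both tests have moderate power and the chi-square test is competitive; increasing σ flattens the modes and, consistent with the results reported below, tips the comparison toward the t-test.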

Results

Empirical data

Characteristics of the included trials are presented at the end of the chapter. Figure 2 shows the estimated probability densities of the pooled rCFB outcomes as well as the normal mixtures


that best fitted these data. Most data originated from the placebo and mirtazapine groups. For these groups, the estimated outcome densities and best-fitting normal mixtures clearly indicated bimodality. For the amitriptyline group, the number of observations was limited, and bimodality was less apparent.

Figure 2: Estimated probability densities and best-fitting mixtures of normal distributions for the pooled and corrected relative change from baseline (rCFB) on the Hamilton depression rating scale (HAM-D) scores from placebo, mirtazapine and amitriptyline. (x-axis of each panel: rCFB on HAM−D, from −0.5 to 1.0.)

The fitted normal mixtures for placebo and active treatments differed primarily in the position and the weight of the distribution associated with low response. This suggests that treatment caused a portion of patients to move from one group to the other, without causing a more general shift in the underlying distributions. The estimated cutoff was 0.48, hence close to the cutoff of 0.5 commonly used in antidepressant trials.

Figure 3 presents the probability densities from the 13 individual active vs. placebo comparisons. The estimated densities from many of the trial arms indicated bimodality, despite the fact that the sample size was often limited. Nevertheless, test results in table 1 show that, for all comparisons, the t-test provided a smaller p-value than the chi-square test. In one instance, both tests were significant at p<0.01.

Simulation results

Figure 4 shows simulation estimates of the statistical power of the t-test and the chi-square test for all simulated scenarios. When the outcome distributions of groups A and B were similar, i.e. no treatment effect (d = 1.0), both tests were significant in 5% of the simulation runs. This means that, under bimodality, the type-I error rate of both tests was controlled.


Figure 3: Estimated probability densities for relative change from baseline (rCFB) outcomes from 13 separate comparisons between active treatment and placebo (for additional information, see table 1). (Panels 1-9 show the placebo, mirtazapine and, where included, amitriptyline densities per study; x-axis of each panel: rCFB on HAM−D, from −0.5 to 1.0.)


Table 1: Data from nine placebo-controlled antidepressant (i.e. mirtazapine) trials, four of which included active control (i.e. amitriptyline) (studies 1-4), i.e. 13 separate comparisons between active treatment and placebo.

Mirtazapine

| Study ID | N placebo | N treatment | Mean (SD) rCFB placebo | Mean (SD) rCFB treatment | Response rate placebo | Response rate treatment | t | p-value | χ2 | p-value |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 37 | 39 | 0.20 (0.22) | 0.40 (0.27) | 0.14 | 0.44 | 3.4 | <0.01 | 6.95 | 0.01 |
| 2 | 48 | 44 | 0.40 (0.35) | 0.48 (0.28) | 0.44 | 0.45 | 1.28 | 0.20 | | 0.96 |
| 3 | 50 | 49 | 0.27 (0.29) | 0.49 (0.32) | 0.22 | 0.57 | 3.62 | <0.01 | 11.37 | <0.01 |
| 4 | 48 | 48 | 0.27 (0.36) | 0.47 (0.33) | 0.29 | 0.5 | 2.88 | <0.01 | 3.53 | 0.06 |
| 5 | 44 | 44 | 0.20 (0.37) | 0.47 (0.37) | 0.23 | 0.48 | 3.41 | <0.01 | 4.98 | 0.03 |
| 6 | 45 | 42 | 0.34 (0.33) | 0.42 (0.32) | 0.38 | 0.43 | 1.05 | 0.30 | 0.07 | 0.79 |
| 7 | 65 | 200 | 0.33 (0.31) | 0.41 (0.29) | 0.29 | 0.40 | 1.86 | 0.07 | 2.19 | 0.14 |
| 8 | 43 | 41 | 0.33 (0.33) | 0.44 (0.35) | 0.33 | 0.44 | 1.47 | 0.15 | 0.71 | 0.40 |
| 9 | 61 | 63 | 0.41 (0.36) | 0.52 (0.30) | 0.44 | 0.51 | 1.89 | 0.06 | 0.30 | 0.58 |

Amitriptyline

| Study ID | N placebo | N treatment | Mean (SD) rCFB placebo | Mean (SD) rCFB treatment | Response rate placebo | Response rate treatment | t | p-value | χ2 | p-value |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 37 | 38 | 0.20 (0.22) | 0.42 (0.28) | 0.14 | 0.37 | 3.77 | <0.01 | 4.23 | 0.04 |
| 2 | 48 | 47 | 0.40 (0.35) | 0.56 (0.25) | 0.44 | 0.66 | 2.55 | 0.01 | 3.87 | 0.05 |
| 3 | 50 | 49 | 0.27 (0.29) | 0.46 (0.26) | 0.22 | 0.43 | 3.40 | <0.01 | 4.01 | 0.05 |
| 4 | 48 | 48 | 0.27 (0.36) | 0.52 (0.30) | 0.29 | 0.54 | 3.66 | <0.01 | 5.19 | 0.02 |

The table presents the mean and standard deviation of the relative change from baseline (rCFB) outcomes for treatment and placebo groups, as well as response rates, where response was defined as rCFB>0.5. Also presented are the test statistic and p-value for the corresponding t-test and chi-square test.


Figure 4: Power for the t-test and the chi-square test. Simulated outcomes for both groups were derived from mixtures of normal distributions with the same fixed means and variable standard deviations (σ) (see figure 1). Sample size (n) and treatment effect size (d) were varied as well. (Panels: sample size = 75 and sample size = 150; x-axis: σ from 0.05 to 0.30; y-axis: power from 0.0 to 1.0; curves for d = 1.0, 1.2, 1.3 and 1.4, for the t-test and the χ2-test.)


The chi-square test improved power over the t-test when the outcome mixtures were composed of two normal distributions with small standard deviations. In this case, both distributions were peaked and scarcely overlapping on either side of the cutoff (i.e. extreme bimodality). When the two distributions in the mixtures were already virtually non-overlapping, a further reduction of σ did not improve the performance of the chi-square test, since both distributions could already be distinguished perfectly. It did, however, decrease the variance of the normal mixtures as a whole, and hence improved the power of the t-test, causing the comparative advantage of the chi-square test to diminish.

With larger σ, the overlap of the distributions increased, which reduced the degree of bimodality and the advantage of the chi-square test over the t-test. The scenarios that showed superior statistical power for the chi-square test were characterized by more extreme bimodality than what was observed in the empirical antidepressant data. This is consistent with the observation that the t-tests on these data generally provided smaller p-values than the chi-square tests.

Not surprisingly, larger sample sizes and treatment effects increased the statistical power of both tests. The relative difference in power between the tests was, however, only modestly affected by sample size and treatment effect. For the scenarios we evaluated, the greatest loss in statistical power associated with dichotomization (when σ = 0.30) was close to 10 percent.

Discussion

This study indicated that empirical antidepressant clinical trial outcomes show heterogeneity that is consistent with a bimodal distribution, and that the cutoff commonly used in practice to dichotomize these data is close to optimal in terms of statistical power. Nevertheless, it also showed that t-tests on the continuous outcomes generally provided smaller p-values than chi-square tests on the dichotomized results. Additional simulations revealed that dichotomization only yields more powerful statistical tests when outcome bimodality is extreme and more pronounced than observed in the empirical data.

Bimodality of trial outcomes suggests differentially responding patient groups in antidepressant trials. Previous studies have shown that a substantial portion of patients are unaffected by antidepressant treatment. These patients can potentially be traced back to a set of effect-modifying covariates that include comorbid psychiatric or general medical disorder, older age, female gender, family history, early or late age of onset, greater disease severity and chronicity of course [12,13]. Treatment resistance has been studied mostly within a primary care setting. It remains difficult to exactly identify treatment-resistant patients in antidepressant trials, because the separation in the bimodal distributions is not that strict.

We estimated the cutoff that optimally differentiated between groups of patients that are affected by or resistant to antidepressant treatment, and found that it was very similar to the cutoff that is commonly used in practice. This cutoff can therefore be expected to provide, on average, nearly optimal power for subsequent analyses of dichotomized relative changes on the HAM-D scale. Obviously, however, this does not provide any rationale for using the same cutoff for other depression severity measures or in other clinical fields.

Dichotomization is often applied to antidepressant trial outcomes. We show that despite bimodality of these outcomes, and the use of a nearly optimal cutoff value, the t-test on the continuous rCFB outcomes generally provided smaller p-values than the chi-square test on the dichotomized ‘response’ vs. ‘non-response’ outcomes.

Simulations indicated that dichotomization only provides superior power when bimodality is extreme and the two groups that result from dichotomization coincide with the grouping (i.e., modes) in the underlying distribution. In that case, distinct groups of patients (those affected by and those resistant to treatment) can be readily identified. The advantage of dichotomization recedes as bimodality decreases and distributions become increasingly unimodal. Also, in situations where further refined heterogeneity results in outcome distributions with more than two underlying groups (i.e., multimodality), dichotomization is unlikely to increase statistical power. In this situation, dichotomization will not summarize treatment response in the most meaningful way.

It should be noted that dichotomization may raise concerns when results from different studies are compared, or combined in meta-analysis. Even when the same cutoff value is commonly used across studies, differences in the shape or position of the outcome distributions can introduce systematic variation, which complicates the interpretation of comparisons between studies [14].

In this study, missing values were imputed using last observation carried forward (LOCF). It is known that this method may yield biased effect estimates [15].
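The thesis estimated the near-optimal cutoff from the empirical data. A simpler theoretical version of the same idea is the density-crossing rule for an equal-variance two-component normal mixture, where the misclassification-minimizing cutoff has a closed form; the numbers in the example below are hypothetical, not the thesis's estimates.

```python
from math import log

def optimal_cutoff(mu1, mu2, sigma, w1=0.5, w2=0.5):
    # Point where the two weighted component densities cross, i.e. the cutoff
    # minimizing misclassification between the two modes (equal-variance case):
    # solve w1 * phi((x - mu1) / sigma) = w2 * phi((x - mu2) / sigma) for x.
    return (mu1 + mu2) / 2 + sigma**2 * log(w2 / w1) / (mu1 - mu2)
```

With equal mixing weights the rule reduces to the midpoint of the two component means; a heavier component pushes the boundary toward the lighter one, since misclassifying the common group costs more.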
Nevertheless, we applied LOCF, because our focus was on bimodality of the outcome distributions, rather than the effects of active treatment as compared to placebo. We also explored bimodality at earlier time-points, which are less affected by dropout. Treatment efficacy was apparent from three weeks onwards. Moderate bimodality was seen, but not as evident as at endpoint. This was partly due to smaller sample sizes, as the different studies had different assessment schemes and not all studies included an assessment at week three.

Most methodological studies that discuss dichotomization point to its drawbacks when outcomes are normally distributed [5,6,14]. This study considered its implications when outcomes are bimodally distributed, and concludes that under these conditions, too, dichotomization is unlikely to be justifiable in real-life situations. Researchers should be aware that dichotomization generally reduces power and is unlikely to optimally address the actual heterogeneity that is present in the data. We advocate more frequent reporting of the full outcome distributions in individual trials and individual patient data meta-analyses, to reveal possible heterogeneity and corresponding grouping of patients. In the presence of substantial heterogeneity, a simple comparison of means does not adequately summarize the differences between groups.
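For reference, LOCF as applied here is a per-patient rule over the scheduled visits; a minimal sketch (visit scores as a list, with `None` marking a missed assessment):

```python
def locf(visits):
    # Last observation carried forward: each missing value takes the most
    # recent observed value; gaps before the first observation stay missing.
    filled, last = [], None
    for v in visits:
        if v is not None:
            last = v
        filled.append(last)
    return filled
```

A patient who drops out after week three keeps the week-three score at every later visit, which is exactly why LOCF can bias effect estimates when dropout differs between treatment arms [15].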


Reference List

(1) Cipriani A, Furukawa TA, Salanti G, Geddes JR, Higgins JP, Churchill R, et al.

Comparative efficacy and acceptability of 12 new-generation antidepressants: a multiple-treatments meta-analysis. Lancet 2009; 373:746-758.

(2) MacCallum RC, Zhang S, Preacher KJ, Rucker DD. On the practice of dichotomization of quantitative variables. Psychological Methods 2002; 7:19-40.

(3) Decoster J, Iselin AM, Gallucci M. A conceptual and empirical examination of justifications for dichotomization. Psychological Methods 2009; 14:349-366.

(4) Cohen J. The cost of dichotomization. Applied Psychological Measurement 1983; 3:249-253.

(5) Altman DG, Royston P. The cost of dichotomising continuous variables. British Medical Journal 2006; 332:1080.

(6) Senn S. Disappointing dichotomies. Pharmaceutical Statistics 2003; 4:239-240.

(7) Naggara O, Raymond J, Guilbert F, Roy D, Weill A, Altman DG. Analysis by

categorizing or dichotomizing continuous variables is inadvisable: an example from the natural history of unruptured aneurysms. American Journal of Neuroradiology 2011; 32:437-440.

(8) Fedorov V, Mannino F, Zhang R. Consequences of dichotomization. Pharmaceutical Statistics 2009; 8:50-61.

(9) Souery D, Amsterdam J, de Montigny C, Lecrubier Y, Montgomery S, Lipp O, et al. Treatment resistant depression: methodological overview and operational criteria. European Neuropsychopharmacology 1999; 9(1-2):83–91.

(10) Fava M, Davidson KG. Definition and epidemiology of treatment-resistant depression. Psychiatric Clinics of North America 1996; 19(2):179-200.

(11) Fava M. Diagnosis and definition of treatment-resistant depression. Biological Psychiatry 2003; 53(8):649-659.

(12) Keitner GI, Ryan CE, Miller IW, et al. 12-month outcome of patients with major depression and comorbid psychiatric or medical illness (compound depression). American Journal of Psychiatry 1991; 148:345-350.

(13) Kornstein SG, Schneider RK. Clinical features of treatment-resistant depression. Journal of Clinical Psychiatry 2001; 62(suppl6):18-25.

(14) Ragland DR. Dichotomizing continuous outcome variables: dependence of the magnitude of association and statistical power on the cutpoint. Epidemiology 1992; 3:434-440.

(15) Molenberghs G, Thijs H, Jansen I, Beunckens C, Kenward MG, Mallinckrodt C, et al. Analyzing incomplete longitudinal clinical trial data. Biostatistics 2004; 5:445-464.

Chapter 4.2

Comparing HAMD17 and HAMD subscales on sensitivity to

antidepressant drug effects in placebo-controlled trials

Boessen R, Groenwold RHH, Knol MJ, Grobbee, DE, Roes, KCB

Journal of Affective Disorders 2013;145:363-369


Abstract

Background: The 17-item Hamilton Depression Rating Scale (HAMD17) is the standard efficacy outcome in antidepressant clinical trials. It is criticized for multidimensionality and for poorly discriminating treatment from placebo. HAMD subscales may overcome these limitations and reduce the sample size of clinical trials. This study compared the discriminative performance of the HAMD17 and three established HAMD subscales (Bech, Maier-Philipp, Gibbons) across a range of antidepressants with different mechanisms of action.

Methods: We analyzed data from 24 clinical trials including 3,692 patients randomized to tricyclic or tetracyclic antidepressants (TCAs or TeCAs), selective serotonin reuptake inhibitors (SSRIs) or placebo. Data were analyzed using a mixed model for repeated measurements (MMRM). Standardized effect sizes for the HAMD17 and subscales were derived for every time-point, and their effect on sample size was evaluated.

Results: For TCAs and TeCAs versus placebo, the HAMD17 consistently provided the highest standardized effects. The sample size to establish efficacy at week six was >25 percent smaller than for any of the subscales. However, for SSRIs versus placebo, the HAMD17 provided slightly smaller standardized effects and was the least efficient outcome. There were no relevant differences between the subscales.

Limitations: Data were derived exclusively from mirtazapine trials. Conclusions are restricted to clinical trial settings.

Conclusions: Comparative performance of the HAMD17 and the various subscales strongly depends on the type of antidepressant. Results support using the HAMD17 as primary endpoint in clinical trials, but it will be beneficial to pro-actively include subscales as additional endpoints to successfully establish treatment effects of new antidepressants.


Introduction

The 17-item Hamilton depression rating scale (HAMD17) [1,2] is a standard efficacy outcome in antidepressant clinical trials [3]. The HAMD17 was derived from the 21-item version of the scale. It is a multidimensional rating scale that covers a range of clinical features associated with major depressive disorder (MDD) [4-7]. Among these features are the affective symptoms of the disease (e.g. depressed mood, anxiety) and additional symptoms such as insomnia, hypochondriasis and loss of weight. Depending on the mechanism of action, different antidepressant treatments may have differential effects on the various features of MDD. For example, tricyclic antidepressants appear to have a larger positive effect on symptoms of insomnia as compared to SSRIs [8]. As a result, antidepressant treatments vary in the extent to which they affect specific items of the HAMD17, and not all items are equally informative to measure the effect of a particular antidepressant drug. Consequently, the total score may not reflect the antidepressant effect equally well for all treatments.

HAMD subscales may overcome this limitation. These subscales were designed to be unidimensional and measure specific core symptoms of MDD. As a result, subscales may be more sensitive to detect an effect of antidepressant treatment [9-13]. Examples of unidimensional HAMD subscales are the Bech Melancholia scale [9], the Maier and Philipp Severity subscale [14] and the Gibbons Global Depression Severity scale [5].

Higher internal consistency of the HAMD subscales could decrease the variability in outcomes and increase the power of subsequent tests. This may in turn reduce the sample size requirements and costs of antidepressant trials [13,15,16]. Also, better specificity of the subscales could reduce the impact of unintended effects in the comparison with placebo [17].
On the other hand, when the items that are not included in a subscale are (weakly) associated with the subscale scores, inclusion of these items reduces the variability in total score between patients and over time. In this case, the parsimonious, unidimensional subscale is statistically less efficient to establish an effect of treatment than the more extensive full HAMD17.

Previous comparisons between the HAMD17 and the various HAMD subscales yielded inconclusive results; several studies reported larger treatment effect sizes with the subscales [13,15,16,18], while others did not [3,12,17,19]. Many of these studies were limited in sample size and in the range of included treatments. In this study we use a large dataset with multiple treatment arms to compare the various scales on their ability to differentiate active treatment from placebo in antidepressant clinical trials. In addition, we examine how the choice of a specific HAMD (sub)scale would affect the sample size requirements of the trial.


Methods

HAMD17 and HAMD subscales

The HAMD17 is a rating scale designed to rate the severity of symptoms observed in MDD (table 1). A trained medical professional scores each of the 17 items by conducting a structured interview with the patient and by observing the patient’s symptoms. Every item is scored on a range of 0-2 or 0-4, where a larger value indicates more severe presence of the symptom. A HAMD17 total score of 0-7 is considered normal, and a minimum score of 16 (i.e. moderate MDD severity) is usually required for entry into antidepressant clinical trials [20]. Administration of the HAMD17 takes about 20-30 minutes.

Table 1: The 17 individual items of the HAMD17, their maximum score and the items that are included in the different HAMD subscales.

HAMD17 item                              Max. score  Bech  Maier-Philipp  Gibbons  Santen  McIntyre
1. Depressed Mood                        4           x     x              x        x       x
2. Feelings of Guilt                     4           x     x              x        x       x
3. Suicide                               4                                x        x       x
4. Insomnia Early                        2
5. Insomnia Middle                       2
6. Insomnia Late                         2
7. Work and Activities                   4           x     x              x        x       x
8. Retardation                           4           x     x                       x
9. Agitation                             4                 x              x
10. Anxiety/Psychic                      4           x     x              x        x       x
11. Anxiety (somatic)                    4                                x                x
12. Somatic Symptoms (Gastrointestinal)  2
13. Somatic Symptoms (General)           2           x                             x       x
14. Genital Symptoms                     4                                x
15. Hypochondriasis                      4
16. Loss of Weight                       2
17. Insight                              2

Various methods have been used to identify HAMD17 items that measure the core symptoms of MDD. This has resulted in a number of HAMD subscales, of which the Bech Melancholia scale (Bech) [9], the Maier and Philipp Severity subscale (Maier-Philipp) [14] and the Gibbons Global Depression Severity scale (Gibbons) [5] are among the most well-known (table 1). More recently, two additional scales were proposed that overlap substantially with the older subscales. The scale by Santen et al. [13] includes all the items of the Bech scale plus the suicide item. The scale by McIntyre et al. [12] is similar to the Gibbons scale, but without the agitation and genital symptoms items, and with the general somatic symptoms item. Our analyses and discussion will focus primarily on the most frequently reported subscales, i.e. Bech, Maier-Philipp, and Gibbons.
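Scoring the (sub)scales amounts to summing item subsets. The item memberships below are transcribed from table 1; the function names themselves are illustrative, not from the thesis.

```python
# HAMD17 item numbers (1-17) included in each subscale, per table 1.
SUBSCALES = {
    "Bech":          {1, 2, 7, 8, 10, 13},
    "Maier-Philipp": {1, 2, 7, 8, 9, 10},
    "Gibbons":       {1, 2, 3, 7, 9, 10, 11, 14},
    "Santen":        {1, 2, 3, 7, 8, 10, 13},
    "McIntyre":      {1, 2, 3, 7, 10, 11, 13},
}

def subscale_score(item_scores, scale):
    # item_scores: dict mapping HAMD17 item number (1-17) to the rated score.
    return sum(item_scores[i] for i in SUBSCALES[scale])

def hamd17_total(item_scores):
    # Full-scale total: the sum over all 17 items.
    return sum(item_scores.values())
```

The relations described in the text hold by construction: the Santen scale is the Bech scale plus the suicide item (3), and the McIntyre scale is the Gibbons scale without the agitation (9) and genital symptoms (14) items but with the general somatic symptoms item (13).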

Study data

To evaluate the performance of the HAMD17 and its subscales, we analyzed data from 24 randomized controlled trials (RCTs) that investigated the efficacy of mirtazapine for the treatment of MDD (see appendix). These trials were performed within phase IIb (5 trials), III (13 trials) and IV (6 trials) of mirtazapine’s clinical development, and carried out between 1983 and 2002 in North America, Europe and Asia. The time between the baseline and final assessment was five (2 trials), six (21 trials) or eight (1 trial) weeks. Patients were recruited from a single (9 trials) or multiple investigational centers (15 trials). Trials were either two-armed versus placebo (5 trials), two-armed versus an active comparator (14 trials) or three-armed versus placebo and an active comparator (5 trials). Time-points at which assessments took place varied between the trials. All trials used the HAMD17 as the primary outcome for efficacy. Studies were carried out in accordance with the latest version of the Declaration of Helsinki. Study designs and procedures were reviewed by appropriate ethics committees, and informed consent of the participants was obtained after the nature of the study was fully explained.

All patients in the database (n=3,692) were diagnosed with MDD by a psychiatrist, according to the DSM-III criteria. In addition, all patients met the following inclusion criteria: at least 18 years old, at least moderate disease severity (HAMD17 ≥ 16) [20], non-smoking, non-suicidal and at least one post-baseline assessment. Trials included both inpatients and outpatients. In general, concomitant use of psychotropic medication and/or psychotherapy was an exclusion criterion, and appropriate wash-out periods were applied to avoid carry-over effects. Within trials, patients were randomly assigned to mirtazapine (n=1,819), placebo (n=481) and/or one of several active controls (i.e.
amitriptyline (n=605), fluoxetine (n=124), fluvoxamine (n=207), clomipramine (n=87), doxepin (n=78), maprotiline (n=74) or paroxetine (n=217)), as applicable to the specific trial. Randomization was stratified by center in multicenter trials.

Data analysis

Scores on the HAMD17 and the three different subscales were calculated for each patient and every available time-point by adding up the scores on the individual items that were included in the corresponding scale. The antidepressive treatments were categorized according to treatment type: (1) tetracyclic antidepressants (TeCAs; mirtazapine and maprotiline; n=1,893), (2) tricyclic antidepressants (TCAs; amitriptyline, clomipramine and doxepin; n=770) and (3) selective serotonin reuptake inhibitors (SSRIs; fluoxetine, paroxetine and fluvoxamine; n=548).


Total scores on the different (sub)scales were analyzed using a mixed-effects model for repeated measurements (MMRM) [21]. The model included patient score at baseline, time of visit (week 1-6), investigational center, the treatment*time interaction, and a random effect for patient. Inclusion of center in the model ensured that study differences were properly accounted for. We specified an autoregressive correlation structure, which assumes that the correlation between any two adjacent time-points is the same and that the correlation decreases as a power of the number of time intervals between assessments [22].

The treatment*time interaction provided estimates for the effect of TeCAs, TCAs and SSRIs relative to placebo at every available time-point after the baseline assessment. These estimates were divided by the root mean squared error of the model to obtain the standardized effect size, which is independent of sample size and thus allows for meaningful interpretation across scales and treatment classes with a wide variation in sample size. No data imputation (i.e. last observation carried forward (LOCF) or otherwise) was performed. The linear mixed model adequately accounts for missing data if missing at random can be assumed to hold conditional on all outcome measurements observed before the missing data point. Previous work has shown that linear mixed models are the preferred method to analyze empirical (antidepressant) trial data with repeated assessments [23].

To assess the impact of the various (sub)scales on sample size, we also estimated the sample size required for a trial with two equal groups and 80 percent statistical power, presuming that the standardized six-week effect size estimate from the MMRM model was the target effect size to demonstrate efficacy. This was done separately for the three types of treatment.
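The autoregressive working correlation just described (often called AR(1)) sets the correlation between visits i and j to ρ^|i−j|; a one-line sketch:

```python
def ar1_corr(n_times, rho):
    # AR(1) working correlation matrix: corr(visit i, visit j) = rho**|i - j|,
    # so adjacent visits share the same correlation and it decays with lag.
    return [[rho ** abs(i - j) for j in range(n_times)] for i in range(n_times)]
```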
In addition, we calculated the average scores on the individual HAMD17 items for each time-point and every available treatment type, after scores were corrected for investigational center and missing values were omitted. All statistical analyses were carried out in R (version 2.14.0, www.r-project.org).

Results

Table 2 shows population characteristics and dropout rates for the different treatment groups. All groups included more males than females. However, since the gender distribution was equally skewed in the various treatment groups, it did not have a confounding effect on the results. Age and baseline disease severity were comparable across groups; 96.5 percent of the patient population was below 65 years of age. As expected, dropout was more frequent in the placebo group than in any of the active treatment groups. These numbers are fairly typical of confirmatory antidepressant trials. The observed dropout patterns did not complicate statistical analysis.


Table 2: Baseline characteristics for patients included in the trials and randomized to placebo, tetracyclic antidepressants (TeCAs), tricyclic antidepressants (TCAs) and selective serotonin reuptake inhibitors (SSRIs). The mean and standard error are presented for age and baseline HAMD17.

                      Placebo        TeCAs          TCAs           SSRIs
Male (%)              61.7           64.2           69.4           59.4
Age                   42.06 (11.81)  44.89 (12.13)  45.46 (11.98)  44.14 (13.26)
Baseline HAMD17       23.66 (4.24)   24.48 (4.64)   24.74 (4.69)   24.04 (4.47)
Dropout (%)
  week 1              1.5            1.2            1.7            0.2
  week 2              7.3            4.9            4.0            8.7
  week 3              14.5           12.8           11.8           14.1
  week 4              18.0           12.7           13.5           14.1
  week 6              34.5           19.7           18.7           20.6

The MMRM model yielded standardized effect sizes for the active treatment groups relative to placebo. These estimates are plotted in figure 1. All the (sub)scales showed an increasing effect size over time for the three different types of treatment. The six-week treatment versus placebo differences were highly significant (p<0.001) for every treatment type and on every (sub)scale. For the groups treated with TeCAs and TCAs, the largest effect was measured on the HAMD17, and the advantage of the HAMD17 was consistent over time. After six weeks of treatment the standardized effect for the TeCA group was 0.69 (CI95%: 0.56-0.83) when measured on the HAMD17, and 0.59 (CI95%: 0.45-0.73) when measured on the best performing subscale (i.e. Gibbons). For patients treated with TCAs the standardized effect on the HAMD17 was 0.85 (CI95%: 0.70-1.00), as compared to 0.73 (CI95%: 0.57-0.88) on the best performing subscale (i.e. Maier-Philipp). No systematic differences were observed between the subscales, in that none of the subscales performed consistently better or worse than another.

The standardized effects for SSRIs versus placebo were comparable across the various (sub)scales. The effect after six weeks of treatment was 0.52 (CI95%: 0.34-0.69) on the HAMD17 and 0.57 (CI95%: 0.40-0.75) on the best performing subscale (i.e. Maier-Philipp). Again, no systematic differences were observed between the various subscales. The scales by Santen et al. [13] and McIntyre et al. [12] showed results similar to those observed for the other subscales (data not shown).

For TeCAs versus placebo, 34 patients per treatment arm were required to establish the estimated week-six treatment effect on the HAMD17 with 80 percent power. For the Bech, Maier-Philipp and Gibbons subscales this was 52, 47 and 46 patients, respectively. 23 patients per group were required to establish efficacy of TCAs on the HAMD17 total score, as compared to 32, 31 and 32 for the various subscales (in the same order).
For SSRIs this was 60, 60, 49 and 52. Hence, when comparing T(e)CAs with placebo, the HAMD17 total score was the most efficient measure of efficacy. However, when comparing SSRIs with placebo, the HAMD17 total score was slightly less efficient than the various HAMD subscales.
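The per-arm sample sizes above follow, up to a small t-distribution correction of roughly one patient, from the standard normal-approximation formula n = 2(z_{1-α/2} + z_{1-β})²/d². A stdlib-only sketch:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(d, alpha=0.05, power=0.80):
    # Normal-approximation sample size per arm for a two-sample comparison of
    # means with standardized effect size d, two-sided test at level alpha.
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)
```

For the six-week effects reported above this gives 33, 22 and 59 patients per arm for d = 0.69, 0.85 and 0.52; the 34, 23 and 60 quoted in the text reflect the slightly more conservative t-based calculation.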

[Figure 1: three panels (a: TeCAs, b: TCAs, c: SSRIs); x-axis: time (weeks, 0-6); y-axis: standardized effect size (-0.1 to 0.9); lines for HAMD17, Bech, Maier-Philipp and Gibbons.]

Figure 1: Standardized effect sizes over time as measured on the HAMD17 and the various HAMD subscales. The different plots represent tetracyclic antidepressants (TeCAs) versus placebo (a), tricyclic antidepressants (TCAs) versus placebo (b) and selective serotonin reuptake inhibitors (SSRIs) versus placebo (c). Many of the included trials had no assessment in the fifth week (see also appendix), therefore the effect size estimates were extrapolated from week four to week six (indicated with the dashed line).

The graphs in figure 2 show the average scores on the individual HAMD17 items across time and separately for placebo, TeCAs, TCAs and SSRIs.

[Figure 2: a grid of 17 panels, one per HAMD17 item, each plotting average item score (0-4) against time (weeks, 0-6), with lines for placebo, TeCAs, TCAs and SSRIs.]

Figure 2: Average scores on the 17 individual HAMD17 items for every weekly assessment, separately for the different treatment classes, i.e. placebo, tetracyclic antidepressants (TeCAs), tricyclic antidepressants (TCAs) and selective serotonin reuptake inhibitors (SSRIs). Although the scaling of the y-axis is the same for all plots, some items have a maximum score of 4, while others have a maximum score of 2 (see table 1).


Several items showed relatively large improvements with treatment as compared to placebo (e.g. depressed mood, feelings of guilt, suicide, insomnia (early, middle and late), work and activities, anxiety psychic/somatic, somatic symptoms general), while other items showed only marginal differences between treatment and placebo (e.g. retardation, genital symptoms, hypochondriasis, loss of weight, insight). In addition, some items showed differential improvement across the various treatment types. This was most apparent for the three insomnia items, which improved more with T(e)CAs than with SSRIs. After six weeks of treatment, the standardized effect size of TeCAs was 34.4 percent larger when measured on the full HAMD17 as compared to a version of the HAMD17 without the insomnia items. For TCAs this difference was 32.6 percent, while for SSRIs it was only 23.0 percent.

Discussion

This study evaluated the performance of the full HAMD17 rating scale and three HAMD subscales on their ability to discriminate antidepressant treatment from placebo in randomized clinical trials. It showed that the performance of the different (sub)scales depends on the mechanism of action of the investigated drug. For comparisons between T(e)CAs and placebo, the HAMD17 yielded the largest effect size on every available post-baseline assessment, but for comparisons between SSRIs and placebo, differences were only marginal. The subscales generally performed about equally well.

Earlier studies also compared the various (sub)scales on their discriminative performance. Two of these studies used particularly large datasets. The first, by Faries et al. [15], described two separate meta-analyses on data from eight placebo-controlled trials with an SSRI (n>1,600) and four with TCAs (n>1,200). Both analyses suggested a modest but consistent advantage of the subscales (Bech, Maier-Philipp, Gibbons) over the HAMD17 total score, while the differences between the subscales were small. The study concluded that the use of subscales would reduce the sample size of antidepressant trials by approximately one-third. The other study, by Entsuah et al. [16], performed similar analyses on pooled data from eight RCTs (n=2,045) with placebo, venlafaxine (i.e. a serotonin-norepinephrine reuptake inhibitor) and several SSRIs. This study also concluded that the subscales provided larger effect sizes and would reduce sample size requirements.

Other studies compared the various (sub)scales as measures of the within-group difference between baseline and post-treatment assessments [3,12,17,19]. All these studies had a relatively small sample size (n<500). They considered data from a placebo-controlled trial with fluoxetine [3] or from studies that evaluated the effectiveness of less restricted treatment strategies (e.g. stepwise psychopharmacological treatment, care as usual) [12,17,19]. None of these studies reported a significant difference between the full HAMD17 and the various HAMD subscales.


Santen et al. [13] assessed the (sub)scales’ sensitivity to differentiate treatment responders (i.e. patients with a relative change from baseline ≥ 50 percent at any point during the trial) from non-responders, using an MMRM model on data from 2 placebo-controlled trials with paroxetine (an SSRI) (n=765). They reported superior performance of the subscales as compared to the HAMD17, and non-significant differences between the various subscales.

These studies all evaluated the (sub)scales’ sensitivity to treatment response (although the definition of treatment response differed between studies), and reached inconsistent conclusions. The current study is the first to include a wide range of different treatments and analyze the entire dataset with one comprehensive model. Our results indicate that the HAMD17 total score is more sensitive than any of the evaluated subscales to detect an effect of T(e)CAs as compared to placebo. Moreover, the HAMD17 required at least 25 percent fewer patients to establish efficacy of these treatments after six weeks of follow-up. Interestingly, the HAMD17 and the various HAMD subscales did equally well in detecting an effect of SSRIs. In terms of sample size, the Maier-Philipp and Gibbons subscales required about 20 percent fewer patients to establish treatment efficacy as compared to the full scale.

Studies that reported superior performance of the HAMD subscales often based their conclusion on analyses of data from trials with SSRIs. It has been suggested before that the HAMD17 systematically favors T(e)CAs over SSRIs [24]. Although this study was not designed to appropriately address this hypothesis, it did show that certain items (most notably those related to insomnia) showed larger improvement on T(e)CAs than on SSRIs (see also: [8]). This may explain why the current study found an advantage of the HAMD17 for T(e)CAs, but not for SSRIs.
However, it should be noted that with the present design and analyses, the hypothesis that the HAMD17 selectively favors T(e)CAs cannot be distinguished from superior efficacy of T(e)CAs as compared to SSRIs.

Our use of a single comprehensive MMRM model allowed for consistent estimation of the effect size at every available time-point. Moreover, MMRM models make optimal use of the available data and provide unbiased estimates in the presence of missing data if missing at random can be assumed to hold conditional on all outcome measurements observed before the missing data point. Our MMRM approach is in contrast to the widely used repeated measures ANOVA, which is limited to patients with complete follow-up. Before repeated measures ANOVA is applied, missing values are often imputed, e.g. by LOCF, as was done by Faries et al. [15] and Entsuah et al. [16]. Particularly when comparing performance across treatments with different mechanisms of action, LOCF may introduce bias that is difficult to interpret, because of different unintended effects and the associated drop-out [23,25]. MMRM models have been shown to reduce such bias in antidepressant trials as compared to imputation approaches such as LOCF [23]. This may further explain why some studies reached conclusions that differed from those reported in the present study.

Chapter 4.2


A limitation of our analysis lies in the composition of our dataset. The mirtazapine treatment arm was present in all 25 trials, and placebo in 10 (of which 5 were two-armed trials and 5 included an additional active comparator). This led to unequal sample sizes for the various treatment types. Treatment effect estimates relied on indirect comparisons across different trials, with mirtazapine as the most important link, present in all trials. As center was included in the model, correction for study differences was applied at the level of randomization. Given the composition of the database, consistency of direct and indirect estimates of treatment effects could realistically only be assessed for mirtazapine and amitriptyline, for which it was confirmed.

Our conclusions regarding the (sub)scales’ performance specifically apply to their use as efficacy outcomes in antidepressant clinical trials. The HAMD17 was originally designed, and is often used, to quantify MDD severity in clinical practice. For this purpose it may be particularly important to measure all relevant features of the disease, rather than only the symptoms that are responsive to antidepressive treatment. In addition, the data that were used came from a heterogeneous sample of trial-eligible patients that may not be representative of the entire population of patients affected with MDD. This complicates generalization of our findings to settings other than clinical trials. The distribution of males and females was skewed and different from that observed in the general MDD patient population. However, additional analyses showed that the main conclusions were similar for both sexes (data not shown).

In summary, this study found that the comparative performance of the (sub)scales depends on the mechanism of action of the antidepressant.
For T(e)CAs versus placebo, the HAMD17 outperformed the subscales, as it provided larger standardized effects and reduced the sample size required to establish efficacy after six weeks of follow-up. For SSRIs versus placebo, differences in the standardized effect sizes were substantially less pronounced, and the subscales had an advantage in terms of sample size. These results suggest that the use of the HAMD subscales may increase the power and efficiency of trials with SSRIs, but not of trials with TeCAs or TCAs. The results support the use of the HAMD17 as primary endpoint in clinical trials for new antidepressants, but demonstrate the benefit of proactively including subscales as additional endpoints to successfully establish treatment effects of new antidepressants.
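The sample size differences quoted above follow directly from the inverse-square relation between the standardized effect size and the number of patients needed. A minimal sketch of that relation, using the standard normal-approximation sample size formula with purely illustrative effect sizes (not the estimates from this study):

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(d, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sample comparison of means
    with standardized effect size d (normal approximation)."""
    z = NormalDist().inv_cdf
    return ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 / d ** 2)

# Illustrative (hypothetical) standardized effects for a full scale vs. a subscale:
print(n_per_arm(0.40))  # e.g. a scale with standardized effect 0.40
print(n_per_arm(0.35))  # e.g. a scale with standardized effect 0.35
# Because n scales with 1/d^2, a scale with effect 0.35 instead of 0.40 needs
# (0.40/0.35)^2 ~ 1.3 times as many patients for the same power.
```

This is why seemingly modest differences in a scale's sensitivity translate into substantial differences in trial cost.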

Comparing HAMD17 and HAMD subscales


Reference list
(1) Hamilton M. A rating scale for depression. Journal of Neurology, Neurosurgery and Psychiatry 1960; 23:56-62.
(2) Hamilton M. Development of a rating scale for primary depressive illness. British Journal of Social & Clinical Psychology 1967; 6:278-296.
(3) O’Sullivan RL, Fava M, Agustin C, Baer L, Rosenbaum JF. Sensitivity of the six-item Hamilton Depression Rating Scale. Acta Psychiatrica Scandinavica 1997; 95:379-384.
(4) Bech P, Allerup P, Gram LF, Reisby N, Rosenberg R, Jacobsen O, Nagy A. The Hamilton depression scale. Evaluation of objectivity using logistic models. Acta Psychiatrica Scandinavica 1981; 63:290-299.
(5) Gibbons RD, Clark DC, Kupfer DJ. Exactly what does the Hamilton Depression Rating Scale measure? Journal of Psychiatric Research 1993; 27:259-273.
(6) Bagby RM, Ryder AG, Schuller DR, Marshall MB. The Hamilton Depression Rating Scale: has the gold standard become a lead weight? American Journal of Psychiatry 2004; 161:2163-2177.
(7) Kennedy SH. Core symptoms of major depressive disorder: relevance to diagnosis and treatment. Dialogues in Clinical Neuroscience 2008; 10:271-277.
(8) Winokur A, Gary KA, Rodner S, Rae-Red C, Fernando AT, Szuba MP. Depression, sleep physiology, and antidepressant drugs. Depression and Anxiety 2001; 1:19-28.
(9) Bech P, Gram LF, Dein E, Jacobsen O, Vitger J, Bolwig TG. Quantitative rating of depressive states. Acta Psychiatrica Scandinavica 1975; 51:161-170.
(10) Montgomery SA, Asberg M. A new depression scale designed to be sensitive to change. British Journal of Psychiatry 1979; 134:382-389.
(11) Evans KR, Sills T, DeBrota DJ, Gelwicks S, Engelhardt N, Santor D. An item response analysis of the Hamilton Depression Rating Scale using shared data from two pharmaceutical companies. Journal of Psychiatric Research 2004; 38:275-284.
(12) McIntyre RS, Konarski JZ, Mancini DA, Fulton KA, Parikh SV, Grigoriadis S, Grupp LA, Bakish D, Filteau M-J, Gorman C, Nemeroff CB, Kennedy SH. Measuring the severity of depression and remission in primary care: validation of the HAMD-7 scale. CMAJ 2005; 173:1327-1334.
(13) Santen G, Gomeni R, Danhof M, Della Pasqua O. Sensitivity of the individual items of the Hamilton depression rating scale to response and its consequences for the assessment of efficacy. Journal of Psychiatric Research 2008; 42:1000-1009.
(14) Maier W, Philipp M. Comparative analysis of observer depression scales. Acta Psychiatrica Scandinavica 1985; 72:239-245.
(15) Faries D, Herrera J, Rayamajhi J, DeBrota D, Demitrack M, Potter WZ. The responsiveness of the Hamilton Depression Rating Scale. Journal of Psychiatric Research 2000; 34:3-10.
(16) Entsuah R, Shaffer M, Zhang J. A critical examination of the sensitivity of unidimensional subscales derived from the Hamilton Depression Rating Scale to antidepressant drug effects. Journal of Psychiatric Research 2002; 36:437-448.
(17) Ruhe HG, Dekker JJ, Peen J, Holman R, de Jonghe F. Clinical use of the Hamilton Depression Rating Scale: is increased efficiency possible? A post hoc comparison of Hamilton Depression Rating Scale, Maier and Bech subscales, Clinical Global Impression, and Symptom Checklist-90 scores. Comprehensive Psychiatry 2005; 46:417-427.
(18) Helmreich I, Wagner S, Mergl R, Allgaier AK, Hautzinger M, Henkel V, Hegerl U, Tadić A. Sensitivity to changes during antidepressant treatment: a comparison of unidimensional subscales of the Inventory of Depressive Symptomatology (IDS-C) and the Hamilton Depression Rating Scale (HAMD) in patients with mild major, minor or subsyndromal depression. European Archives of Psychiatry and Clinical Neuroscience 2011; 262:291-304.
(19) Ballesteros J, Bobes J, Bulbena A, Luque A, Dal-Re R, Ibarra N, Güemes I. Sensitivity to change, discriminative performance, and cutoff criteria to define remission for embedded short scales of the Hamilton depression rating scale (HAMD). Journal of Affective Disorders 2007; 102:93-99.
(20) Bech P, Kastrup M, Rafaelsen OJ. Mini-compendium of rating scales for states of anxiety, depression, mania, schizophrenia with corresponding DSM-III syndromes. Acta Psychiatrica Scandinavica 1986; 326:1-37.
(21) Mallinckrodt CH, Clark WS, David SR. Accounting for dropout bias using mixed-effects models. Journal of Biopharmaceutical Statistics 2001; 11:9-21.
(22) Mallinckrodt CH, Kaiser CJ, Watkin JG, Molenberghs G, Carroll RJ. The effect of correlation structure on treatment contrasts estimated from incomplete clinical trial data with likelihood-based repeated measures compared with last observation carried forward ANOVA. Clinical Trials 2004; 1:477-489.
(23) Siddiqui O, Hung HM, O’Neill R. MMRM vs. LOCF: a comprehensive comparison based on simulation study and 25 NDA datasets. Journal of Biopharmaceutical Statistics 2009; 19:227-246.
(24) Hughes JR, O’Hara MW, Rehm LP. Measurement of depression in clinical trials: an overview. Journal of Clinical Psychiatry 1982; 43:85-88.
(25) Molenberghs G, Thijs H, Jansen I, Beunckens C, Kenward MG, Mallinckrodt C, Carroll RJ. Analyzing incomplete longitudinal clinical trial data. Biostatistics 2004; 5:445-464.


Appendix: Key characteristics of the selected trials, including: study number, design, drug, assessment scheme, number of patients per treatment group, and average baseline and end score on the HAMD17 and the three different HAMD subscales (i.e. Bech, Maier-Philipp, Gibbons).

[Table omitted: per study arm, the appendix lists the design (single-/multicenter RCT), phase (IIb-IV), drug, assessment weeks, number of patients, and mean baseline and end scores on the HAMD17 and the Bech, Maier-Philipp and Gibbons subscales.]

1 SC / MC = single-/multicenter, RCT = randomized controlled trial
2 Treatment type: mirtazapine and maprotiline (TeCAs); amitriptyline, clomipramine and doxepin (TCAs); fluoxetine, paroxetine and fluvoxamine (SSRIs)

Chapter 5
General discussion


Clinical trials drive the cost of drug development

Over the past few decades, drug development has become increasingly costly and ineffective [1,2]. Currently, it takes an estimated 1.5 billion US dollars (including the cost of failures and so-called opportunity costs) and about 12 to 15 years of research to develop a drug until it is ready to be submitted for marketing approval [3]. Most of this time and money is accounted for by late-stage clinical trials [1,4,5]. These trials aim to confirm the safety and efficacy of the drug to allow its use in medical practice. They typically require effective collaboration between multiple research centers around the world and the recruitment and repeated assessment of numerous patients. Failure of these trials is often reason to suspend the drug’s development program, which also entails the loss of preceding investments in preclinical and early clinical research. To assure the continued arrival of new and affordable drugs, it is therefore essential to optimize the efficiency and success-rates of confirmatory trials [6,7]. As recognized by the US Food and Drug Administration (FDA) and the European Medicines Agency (EMA), the exploration and implementation of innovative clinical trial methodology is of key importance to achieve this goal.

Clinical trials are scientific experiments that involve humans, and comprise a major source of information upon which it is decided whether the investigated drug is allowed to enter the market. It is therefore essential that clinical trials are carefully planned and rigorously conducted, and produce data that meet the highest scientific standards. To accomplish this, the trial’s design, objectives and procedures are described in a detailed study protocol. The protocol serves as the trial’s ‘operating manual’ and prescribes exactly which procedures are to be performed, when, and to whom.
It includes details on eligibility criteria, the randomization procedure, the treatment regimens of the different study groups, the total sample size, planned schedules for patient assessment, efficacy and safety endpoints, analytic methods, etc. Before the trial can start, the study protocol usually requires the approval of an authorized ethics committee. Once approval is obtained and the study is initiated, deviating from the protocol is generally considered an unacceptable violation of the trial’s ethical and scientific rigor.

However, at the time the study protocol is drafted, much is often still unknown about the action and effects of the drug. This uncertainty can give rise to inaccurate protocol assumptions, which in turn can result in an improperly designed trial with an increased risk of failing its objectives. For example, when there is uncertainty about the drug’s dose-response profile at the planning stage, it may inadvertently be decided to expose the treated patient group to an ineffective or unsafe dosing regimen, causing the trial results to suggest unwarranted concerns about the efficacy or safety of the drug.


Clinical Trial Simulation for more informed clinical trial designs

Inaccurate protocol assumptions can result in improperly designed trials with an increased risk of failure. It is therefore important to explore strategies that could help to reduce the degree and impact of uncertainty at the planning stage. One increasingly popular approach with the promising potential to do so is Clinical Trial Simulation (CTS), discussed in chapter 2. CTS is a statistical simulation technique that makes it possible to mimic the course of a trial in order to identify and resolve potential shortcomings in the study protocol (e.g. suboptimal dosing regimens, an insufficient number of patients, etc.) before the trial is started [8]. It involves the generation of virtual trial outcomes based on given inputs, a comprehensive input-output (IO) model that uses these inputs to generate outcomes, and the subsequent analysis of these data [10-12]. The IO models generally incorporate multiple parameters related to characteristics of the investigated drug (e.g. its dose-response profile, its pharmacokinetic and pharmacodynamic properties, etc.), the targeted disease (e.g. its symptoms and natural course), the indicated patient population (i.e. distributions of prognostic covariates) and the evaluated design (e.g. treatment regimens, included sample size, follow-up duration, etc.). Assumptions for these parameters are typically based on data from preclinical and early clinical studies, related compounds or scientific knowledge. The various IO model components thus rely on empirical data, which are subject to all kinds of variation. As discussed in chapter 2.2, adequate model validation is therefore essential to ensure the model’s ability to reliably predict future trial outcomes and evaluate different trial designs [13].
We concluded that the current model-building and validation practices in CTS are inadequate, and we advocated more frequent use of methods from diagnostic and prognostic statistical and epidemiological research [14-16], as well as validation against independent external data when such data are available. With carefully constructed and validated models, CTS could improve the probability of a successful trial by allowing planners to ask ‘what if’-type questions [11]. For instance, what if the non-compliance rate of patients is 10 percent larger than expected? What if the effect of the drug is in reality much smaller than assumed? How much smaller can it be before the power of the study is too low to reliably detect a treatment benefit? Or how will a change of inclusion criteria affect the study outcomes? In doing so, CTS offers a valuable approach to evaluate design parameters and allows for comparison between different design alternatives. In addition, CTS compels the integration of data from various sources and development phases (e.g. PK/PD and dose-finding studies, early clinical trials, etc.), so as to assure a comprehensive summary of available information before a new (and expensive) trial is started. On the other hand, conducting CTS obviously requires time, and especially in drug development, time equals money (in part because longer development reduces the duration of patent protection once the drug is on the market). Moreover, CTS requires specific expertise (e.g. in statistical modeling and simulation methodology), which may not always be at hand.
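As an illustration of how such ‘what if’ questions can be answered, the sketch below runs a deliberately simple CTS: the IO model is nothing more than a normally distributed outcome with unit standard deviation, and Monte Carlo replication of a two-arm trial estimates the power under progressively smaller true effects. All numbers (effect sizes, 64 patients per arm, number of replications) are illustrative assumptions, not values from any chapter of this thesis.

```python
import numpy as np
from scipy import stats

def simulated_power(effect, n_per_arm, n_trials=2000, alpha=0.05, seed=1):
    """Monte Carlo power of a two-arm trial analyzed with a two-sample
    t-test, under a toy IO model: normal outcomes with unit SD."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        control = rng.normal(0.0, 1.0, n_per_arm)
        treated = rng.normal(effect, 1.0, n_per_arm)
        if stats.ttest_ind(treated, control).pvalue < alpha:
            hits += 1
    return hits / n_trials

# 'What if' the true effect is smaller than the planning assumption of d = 0.5?
for d in (0.5, 0.4, 0.3):
    print(d, simulated_power(d, n_per_arm=64))
```

Richer IO models (dose-response curves, drop-out processes, covariate distributions) slot into the same replicate-and-analyze loop, which is what makes the approach attractive for comparing design alternatives before committing to one.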


So far, the number of published examples of CTS in late-stage development is rather small, but interest is increasing [17]. CTS was described as ‘an important approach to improve drug development knowledge management and development decision making’ in the FDA’s ‘Critical Path’ report [8]. Over the years, modeling and simulation techniques have already become an integral part of drug discovery research and early clinical development. Given its potential, as discussed in chapter 2, we also expect a larger role for CTS in the planning of future late-stage trials.

Adaptive trial designs to enable protocol adjustments in an ongoing trial

CTS makes it possible to address uncertainty at the planning stage, but cannot ensure a perfect design, as there is typically still much unknown about the investigated drug at the time the trial is planned. Adaptive (or flexible) designs may offer a solution to this issue. In essence, adaptive design trials include preplanned opportunities to modify one or several aspects of the study protocol based on the analysis of preliminary (i.e. interim) data from patients in the trial [18-20]. As a result, adaptive designs provide the opportunity to modify the trial design to optimally line up with emerging knowledge when the initial planning assumptions were wrong. The term preplanned means that adaptations need to be planned in advance of the trial and described in the protocol [21]. As with every pivotal trial intended to support marketing approval, implementation of unplanned protocol modifications raises major concerns about the study’s integrity and the reliability of its results.

In chapter 3.2 of this thesis, we discussed an adaptive design trial for situations where, prior to the start of the trial, there is uncertainty about whether the investigated treatment is effective in the entire patient population or only in a patient subgroup with a particular baseline characteristic (e.g. a genomic marker). If the results of the interim data analysis suggest the latter, the design allows eligibility criteria to be adapted so that only patients from the subgroup are included after the interim analysis. But if patients without the characteristic also benefit from treatment, eligibility criteria remain as they were in the pre-interim study phase. As a result, the design optimizes the likelihood of establishing a significant treatment effect (either in the overall population or only in the selected patient subgroup).
In addition, the design offers the opportunity to stop the trial early if there is sufficient support at the interim analysis to already conclude efficacy or futility. A design like this, which can ascertain when further data collection for a particular group of patients is useless, and thereby lead to discontinuation of data collection for that group, may decrease the costs and duration of the study without decreasing the amount of relevant information it provides. Examples of other design adaptations include expansion of the sample size when the interim treatment effect is smaller than projected, or dropping one of several treatment arms after the interim analysis when it is not as effective (or safe) as assumed [20].
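The interim decision logic of such an adaptive selection design can be sketched as a simple rule. The thresholds and decision branches below are arbitrary placeholders chosen for illustration, not the procedure studied in chapter 3.2; a real design would derive its boundaries so that the overall Type-I error rate remains controlled.

```python
def interim_decision(p_overall, p_subgroup, p_futility=0.5, p_efficacy=0.001):
    """Toy interim decision rule for an adaptive enrichment design.
    p_overall / p_subgroup: interim one-sided p-values for the treatment
    effect in the full population and in the marker-positive subgroup.
    All thresholds are illustrative assumptions."""
    if min(p_overall, p_subgroup) < p_efficacy:
        return "stop early for efficacy"
    if p_overall > p_futility and p_subgroup > p_futility:
        return "stop early for futility"
    if p_subgroup < p_overall and p_overall > p_futility:
        return "continue, restrict eligibility to the marker-positive subgroup"
    return "continue with the full population"

# Effect apparent only in the subgroup -> enrich; both arms promising -> keep all.
print(interim_decision(0.60, 0.04))
print(interim_decision(0.03, 0.02))
```

The point of the sketch is structural: every branch (efficacy stop, futility stop, enrichment, continuation) must be specified in the protocol before the trial starts, because it is precisely this pre-specification that keeps the adaptation from becoming an unplanned protocol violation.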


Adaptive design trials can ensure thoughtful use of limited resources, reduce patient exposure to ineffective or poorly tolerated treatments, and lead to the recruitment of patients who, on the basis of their baseline characteristics, are most likely to respond to treatment or have the most favorable risk/benefit ratio. However, in planning an adaptive design clinical trial, there are two principal methodological issues that require careful consideration [20]. Both have been addressed repeatedly throughout this thesis. The first is whether the adaptation process has led to design, analysis or conduct flaws that increased the chance of a false conclusion that the treatment is effective (i.e. a Type-I error). The second is whether the adaptation process has led to trial results that are difficult to interpret.

Inflation of the Type-I error rate results from the opportunity to choose between several design adaptations (e.g. doses, population subgroups, endpoints) at one or multiple time-points during the trial, based on unblinded examination of the interim data [22]. These adaptation choices create multiple opportunities to succeed in showing a treatment benefit, with a greater likelihood of doing so than when there are no adaptation opportunities. Because confirmatory trials are typically used to support a claim for approval, adaptive features should only be used when doing so will not increase the Type-I error rate. Fortunately, for many adaptive designs, methods have been developed to keep the Type-I error rate in check. An example of an effective control strategy is discussed in chapter 3.2.
However, in the case of some of the more recently developed, and more complex, adaptive methods, Type-I error inflation may be more difficult to understand and account for with statistical procedures.

Difficulties with the interpretation of trial results may arise when the design includes adaptations that, during the course of the study, change the nature or type of data used in the primary analysis [23]. This is, for instance, what happens in the sequential parallel comparison (SPC) design discussed in chapter 3.1. The SPC design was proposed to reduce the impact of placebo response in psychiatric clinical trials (an important issue in such trials, since excessive placebo response makes it more difficult to demonstrate an added benefit of treatment [24,25]). It consists of two consecutive placebo-controlled comparisons, of which the second is only entered by placebo non-responders from the first [26-28]. The primary analysis combines data from both design phases, and hence from two different patient populations (i.e. the initial population and the placebo non-responders), which obviously raises the question to whom the final conclusions from the study apply.

A related concern arises with designs that involve a selection period to identify specific patient subgroups for inclusion in the randomized study phase. This approach could complicate the generalizability of the trial’s results [20]. Consider, for example, the active run-in design discussed in chapter 3.3. In this design, all recruited patients are initially assigned to treatment, and only those with a predefined minimum improvement on a predictive marker are assigned to treatment or control in the randomized study phase. In this case, the results from the randomized phase provide an unbiased estimate of the effect size for the selected population, but may give rise to bias when extrapolated to a broader population.
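How quickly repeated unblinded looks at the data inflate the Type-I error rate can be demonstrated with a small simulation under the null hypothesis. This sketch (with assumed arm sizes, look schedule and replication count) tests the cumulative data at each look at an unadjusted 5 percent level and counts a 'positive' trial whenever any look is significant:

```python
import numpy as np
from scipy import stats

def rejection_rate(n_looks, n_per_arm=100, alpha=0.05, n_trials=4000, seed=7):
    """Under the null (no treatment effect), run `n_looks` equally spaced
    interim analyses, each at unadjusted level alpha, and reject if ANY
    look is significant. With one look this recovers ~alpha; with more
    looks the overall Type-I error rate inflates."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        control = rng.normal(0, 1, n_per_arm)
        treated = rng.normal(0, 1, n_per_arm)
        for look in range(1, n_looks + 1):
            n = n_per_arm * look // n_looks
            if stats.ttest_ind(treated[:n], control[:n]).pvalue < alpha:
                rejections += 1
                break
    return rejections / n_trials

print(rejection_rate(1))  # close to the nominal 0.05
print(rejection_rate(4))  # noticeably above 0.05 without adjustment
```

Group-sequential boundaries and related alpha-spending methods are exactly the kind of control strategy, referred to above, that bring this overall rate back down to the nominal level.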


Despite the FDA’s publication of a draft guidance document on the conduct and analysis of adaptive design trials, both issues remain a serious regulatory concern [18,20]. Other relevant issues involve early stopping for efficacy, or reducing the study sample size after the interim analysis [29,30]. In both cases, the trial’s conclusion is based on data from fewer patients than originally planned for to obtain a certain power. This increases the probability of an incorrect conclusion resulting from high outcome variability early in the trial. Obviously, early stopping or drastically reducing the sample size also decreases the amount of safety data obtainable from the study.

Another relevant issue involves the ethical aspects of adaptive designs [31]. They can reduce the number of patients exposed to inferior treatment, but may also disturb clinical equipoise when the interim analyses strongly favor a certain treatment (regimen). Additional practical issues could be raised as well, e.g. the necessity to appoint an independent data monitoring committee to perform the interim analyses, assuring adequate blinding with regard to the design adaptations, etc. [20] Despite these concerns, it is widely agreed that, when carefully planned and conducted, and well understood in advance, adaptive design trials can provide more and better information on the safety and benefits of drugs, in potentially shorter time frames, while exposing fewer patients to ineffective or harmful treatments.

The choice of endpoint and statistical analysis

Other design aspects that affect the efficiency of a trial include the choice of (efficacy) endpoint and the statistical analysis. An example of the first is presented in chapter 4.2. This study compared the full 17-item Hamilton depression rating scale (HAMD17) with several well-established shorter HAMD subscales on their ability to differentiate antidepressant treatment from placebo in a randomized trial [32]. The subscales were proposed to measure only the core symptoms of depression, and suggested to reduce the overall outcome variability and sample size requirements of antidepressant trials [33-35]. Our findings suggested that the comparative performance of the different (sub)scales depends on the type of antidepressant treatment; for comparing selective serotonin reuptake inhibitors with placebo, the difference between the scales was marginal, but for comparing tetracyclic or tricyclic antidepressants with placebo, the full HAMD17 was most efficient and reduced the sample size requirements by over 25 percent. This study illustrates that the choice of a trial’s endpoint can substantially affect the sample size requirements, and hence the costs of a trial. Obviously, aside from good discriminative properties, an efficacy endpoint should also provide an adequate and reliable measure of disease state or severity and improvement.

The analysis of the study results is another opportunity to affect the trial’s efficiency and its likelihood of success, for example in the choices whether and how to categorize outcomes, how to account for influential covariates, how to model repeated measurement data and how to deal with missing outcomes [36-38]. A number of these topics are briefly discussed in chapters 4.1 and 4.2. Chapter 4.1 specifically addresses the categorization of outcomes prior to statistical analysis, and shows that it generally reduces statistical power, also for bimodally distributed outcomes [39]. This study showed how a common (and seemingly trivial) procedure can have a profound effect on the power, efficiency and costs of a trial.
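The power cost of categorizing a continuous outcome can be illustrated with another small simulation. The outcome model, cut-off and sample sizes below are assumptions chosen for demonstration, not the scenarios studied in chapter 4.1; the sketch simply contrasts a t-test on the continuous outcome with a chi-square test after a responder/non-responder split:

```python
import numpy as np
from scipy import stats

def compare_power(effect=0.4, n=100, cutoff=0.0, n_trials=3000, seed=3):
    """Monte Carlo power of (a) a two-sample t-test on the continuous
    outcome and (b) a chi-square test after dichotomizing at `cutoff`
    ('responder' vs 'non-responder'), under a toy normal outcome model."""
    rng = np.random.default_rng(seed)
    t_hits = chi_hits = 0
    for _ in range(n_trials):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(effect, 1.0, n)
        if stats.ttest_ind(treated, control).pvalue < 0.05:
            t_hits += 1
        table = [[(treated > cutoff).sum(), (treated <= cutoff).sum()],
                 [(control > cutoff).sum(), (control <= cutoff).sum()]]
        chi2, p, dof, _ = stats.chi2_contingency(table)
        if p < 0.05:
            chi_hits += 1
    return t_hits / n_trials, chi_hits / n_trials

print(compare_power())  # the dichotomized analysis has clearly lower power
```

Throwing away the within-category information is what costs the power; the same trial would need substantially more patients to recover it after dichotomization.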

Conclusion Obviously, there are many more factors that affect the quality and effi ciency of a clinical trial than only those discussed in this thesis. An adequate design and analysis is crucial, but more practical aspects (e.g., access to research centers and trained personnel, availability of resources and infrastructure, etc.) are of critical importance as well. Also, there is no general strategy that guarantees an optimal trial as much depends on context. Consider, for example, the SPC design discussed in chapter 3.1. This design can be of great value to establish proof-of-concept in indications with excessive placebo response, but is of limited use when placebo response is small or absent. The adaptive selection design discussed in chapter 3.2 is attractive in the presence of differentially responding patient subgroups, but inadequate and potentially ineffi cient otherwise. Moreover, chapter 4.2 showed that dichotomization can increase the statistical power of subsequent tests, but is much more likely to do exactly the opposite. These examples illustrate that methods that improve the trial’s effi ciency in one situation, may be highly ineffective in another. However, they also illustrate that under specifi c circumstances it pays off to deviate from conventional methods, and implement more innovative approaches instead.Traditionally, sponsors have been rather cautious and conservative in adopting innovative approaches (e.g. adaptive trial designs) when it comes to late-stage confi rmatory trials. The costs and delays associated with the failure of such trials has led them to rely on established methods instead. In part, this restrain may be due to an incomplete understanding of the risks and potential benefi ts of novel methods, or a lack of expertise to implement or evaluate them. In addition, the perceived caution of regulators may have stopped sponsors from applying such methods. 
Given the current trends in pharmaceutical development (i.e., increasing costs and decreasing output) and the urgent need for more effective and efficient trials, we think it is increasingly important to abandon a one-size-fits-all approach and be more appreciative of the sort of methods that have been discussed in this thesis.

Comparing HAMD17 and HAMD subscales


Reference list

(1) DiMasi JA, Hansen RW, Grabowski HG. The price of innovation: new estimates of drug development costs. Journal of Health Economics 2003; 22(2):151-185.
(2) Kola I, Landis J. Can the pharmaceutical industry reduce attrition rates? Nature Reviews Drug Discovery 2004; 3(8):711-715.
(3) PhRMA. 2011 profile: pharmaceutical industry. 2011. http://www.phrma.org/sites/default/files/159/phrma_profile_2011_final.pdf (April 2011).
(4) Paul SM, Mytelka DS, Dunwiddie CT, Persinger CC, Munos BH, Lindborg SR, et al. How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nature Reviews Drug Discovery 2010; 9(3):203-214.
(5) Dickson M, Gagnon JP. Key factors in the rising cost of new drug discovery and development. Nature Reviews Drug Discovery 2004; 3(5):417-429.
(6) Orloff J, Douglas F, Pinheiro J, Levinson S, Branson M, Chaturvedi P, et al. The future of drug development: advancing clinical trial design. Nature Reviews Drug Discovery 2009; 8(12):949-957.
(7) Rawlins MD. Cutting the cost of drug development? Nature Reviews Drug Discovery 2004; 3(4):360-364.
(8) US Food and Drug Administration. 2007. Critical Path Opportunities Initiated During 2006. http://www.fda.gov/oc/initiatives/criticalpath/opportunities06.html
(9) European Medicines Agency. 2011. Road map to 2015. http://www.ema.europa.eu/docs/en_GB/document_library/Report/2011/01/WC500101373.pdf
(10) Holford NH, Kimko HC, Monteleone JP, Peck CC. Simulation of clinical trials. Annual Review of Pharmacology and Toxicology 2000; 40:209-234.
(11) Bonate PL. Clinical trial simulation in drug development. Pharmaceutical Research 2000; 17(3):252-256.
(12) Girard P. Clinical trial simulation: a tool for understanding study failures and preventing them. Basic & Clinical Pharmacology & Toxicology 2005; 96(3):228-234.
(13) Holford NHG, Hale M, Ko HC, Steimer J-L, Sheiner LB, Peck CC. Simulation in Drug Development: Good Practices. 1999. http://bts.ucsf.edu/cdds/research/sddgp.php
(14) Boessen R, Knol MJ, Groenwold RH, Roes KCB. Validation and predictive performance assessment of clinical trial simulation models. Clinical Pharmacology and Therapeutics 2011; 89(4):487-488.
(15) Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996; 15:361-387.
(16) Brendel K, Dartois C, Comets E, Lemenuel-Diot A, Laveille C, Tranchand B, et al. Are population pharmacokinetic and/or pharmacodynamic models adequately evaluated? A survey of the literature from 2002 to 2004. Clinical Pharmacokinetics 2007; 46:221-234.
(17) Holford N, Ma SC, Ploeger BA. Clinical trial simulation: a review. Clinical Pharmacology and Therapeutics 2010; 88(2):166-182.
(18) Hung HM, O'Neill RT, Wang SJ, Lawrence J. A regulatory view on adaptive/flexible clinical trial design. Biometrical Journal 2006; 48(4):565-573.
(19) Gallo P, Chuang-Stein C, Dragalin V, Gaydos B, Krams M, Pinheiro J. Adaptive designs in clinical drug development: an executive summary of the PhRMA Working Group. Journal of Biopharmaceutical Statistics 2006; 16(3):275-283.
(20) US Food and Drug Administration. Adaptive Design Clinical Trials for Drugs and Biologics: Draft Guidance for Industry. February 2010. http://www.fda.gov/downloads/Drugs/.../Guidances/ucm201790.pdf
(21) Bretz F, Koenig F, Brannath W, Glimm E, Posch M. Adaptive designs for confirmatory clinical trials. Statistics in Medicine 2009; 28(8):1181-1217.
(22) Wang SJ, Hung HM, O'Neill RT. Adaptive patient enrichment designs in therapeutic trials. Biometrical Journal 2009; 51(2):358-374.
(23) Temple R. FDA perspective on trials with interim efficacy evaluations. Statistics in Medicine 2006; 25(19):3245-3249.
(24) Khan A, Khan SR, Walens G, Kolts R, Giller EL. Frequency of positive studies among fixed and flexible dose antidepressant clinical trials: an analysis of the Food and Drug Administration summary basis of approval reports. Neuropsychopharmacology 2003; 28(3):552-557.
(25) Gordian M, Singh N, Zemmel R, Elias T. Why products fail in phase III. In Vivo 2010; 24:49-54.
(26) Fava M, Evins AE, Dorer DJ, Schoenfeld DA. The problem of the placebo response in clinical trials for psychiatric disorders: culprits, possible remedies, and a novel study design approach. Psychotherapy and Psychosomatics 2003; 72(3):115-127.
(27) Grandi S. The sequential parallel comparison model: a revolution in the design of clinical trials. Psychotherapy and Psychosomatics 2003; 72(3):113-114.
(28) Tamura RN, Huang X. An examination of the efficiency of the sequential parallel design in psychiatric clinical trials. Clinical Trials 2007; 4(4):309-317.
(29) Gao P, Ware JH, Mehta C. Sample size re-estimation for adaptive sequential design in clinical trials. Journal of Biopharmaceutical Statistics 2008; 18(6):1184-1196.
(30) Cui L, Hung HMJ, Wang SJ. Modification of sample size in group sequential clinical trials. Biometrics 1999; 55(3):853-857.
(31) Palmer CR, Rosenberger WF. Ethics and practice: alternative designs for phase III randomized clinical trials. Controlled Clinical Trials 1999; 20:172-186.
(32) Boessen R, Groenwold RHH, Knol MJ, Grobbee DE, Roes KCB. Comparing HAMD17 and HAMD subscales on sensitivity to antidepressant drug effects in placebo-controlled trials. Journal of Affective Disorders 2012.
(33) O'Sullivan RL, Fava M, Agustin C, Baer L, Rosenbaum JF. Sensitivity of the six-item Hamilton Depression Rating Scale. Acta Psychiatrica Scandinavica 1997; 95:379-384.
(34) Kennedy SH. Core symptoms of major depressive disorder: relevance to diagnosis and treatment. Dialogues in Clinical Neuroscience 2008; 10:271-277.
(35) Bech P, Allerup P, Gram LF, Reisby N, Rosenberg R, Jacobsen O, Nagy A. The Hamilton depression scale: evaluation of objectivity using logistic models. Acta Psychiatrica Scandinavica 1981; 63:290-299.
(36) Mallinckrodt CH, Clark WS, David SR. Accounting for dropout bias using mixed-effects models. Journal of Biopharmaceutical Statistics 2001; 11:9-21.
(37) Siddiqui O, Hung HM, O'Neill R. MMRM vs. LOCF: a comprehensive comparison based on simulation study and 25 NDA datasets. Journal of Biopharmaceutical Statistics 2009; 19:227-246.
(38) Molenberghs G, Thijs H, Jansen I, Beunckens C, Kenward MG, Mallinckrodt C, et al. Analyzing incomplete longitudinal clinical trial data. Biostatistics 2004; 5:445-464.
(39) Boessen R, Groenwold RHH, Knol MJ, Grobbee DE, Roes KCB. Classifying responders and nonresponders: does it help when there is evidence of differentially responding patient groups? Journal of Psychiatric Research 2012; 46(9):1169-1173.

Summary
Samenvatting

Dankwoord
Curriculum Vitae

Summary


The development of new drug therapies is an extremely costly and time-consuming process. Most of this time and money is accounted for by late-stage clinical trials. These trials aim to confirm the drug's safety and efficacy to allow its use in medical practice. They typically require effective collaboration between multiple research centers around the world, and the recruitment and repeated assessment of numerous patients. Failure of a late-stage trial is often a reason to suspend the drug's development program, which also entails the loss of preceding investments in preclinical and early clinical research. To ensure the continued arrival of new and affordable drugs, it is therefore essential to contain the costs and failure rates of confirmatory clinical trials. This thesis discusses a number of approaches that will help to achieve this goal, and is divided into three separate sections that address distinct (but interrelated) topics.

Chapter 2 is dedicated to clinical trial simulation (CTS): a statistical simulation technique that makes it possible to mimic the conduct of a clinical trial in order to anticipate shortcomings in the study protocol (e.g. suboptimal dosing regimens, insufficient sample size, etc.) and suggest probable solutions before the trial is started. Chapter 2.1 reviews published CTS studies that address pertinent questions related to the design and analysis of late-stage clinical trials, and discusses key characteristics regarding the objective(s), simulation models and analytic methods that were used in these studies. Most of the reviewed studies performed CTS retrospectively, in order to investigate its utility for arriving at a study design with optimal performance characteristics under a given scenario. Other studies employed CTS prospectively to inform the planning of future trials. Overall, the review indicates that CTS is a valuable tool to evaluate the degree and impact of uncertainty at the trial's planning stage, and that it makes it possible to evaluate and compare study designs on relevant decision-making metrics (e.g. power or sample size requirements to establish treatment efficacy, expected costs of the trial, etc.). In addition, CTS compels the integration of available data about the investigated drug, ensuring a comprehensive synopsis of available knowledge before a new trial is started. Given this potential and the urgent need for more effective and efficient drug development, the review promotes a larger role for CTS in the planning stage of future trials.

However, the review also indicates that the model-building and validation procedures currently used in CTS are often inadequate for proper validation and performance assessment of CTS models. This is an important issue with regard to the use of CTS in practice.
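The sort of decision-making metric mentioned above can be sketched with a toy simulation. All inputs below (effect size, dropout rate, candidate sample sizes) are invented for illustration; a real CTS model would be far richer.

```python
# Toy CTS-style question (all assumptions invented): across candidate
# sample sizes, which is the smallest that reaches adequate simulated power?
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
delta, dropout, n_sim = 0.4, 0.15, 1000   # assumed effect (SD units) and dropout

def simulated_power(n_per_arm):
    hits = 0
    for _ in range(n_sim):
        # completers-only analysis, as a crude model of dropout
        n_obs = rng.binomial(n_per_arm, 1 - dropout)
        m_obs = rng.binomial(n_per_arm, 1 - dropout)
        a = rng.normal(delta, 1.0, n_obs)
        p = rng.normal(0.0, 1.0, m_obs)
        hits += stats.ttest_ind(a, p).pvalue < 0.05
    return hits / n_sim

for n in (80, 100, 120, 140):
    print(f"n = {n:3d} per arm: simulated power {simulated_power(n):.2f}")
```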
The development of CTS models typically relies on empirical data, which are subject to multiple sources of variation, and it is therefore essential to conduct extensive model-checking to ensure the models' ability to extrapolate beyond their source data and reliably predict future trial results. Chapter 2.2 is a letter in response to a published review on CTS, and discusses this subject in more detail. It also points to methods from diagnostic and prognostic statistical and epidemiological research that could be used to improve model validation, and advocates more frequent use of these methods in CTS.
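As a minimal illustration of such methods, the sketch below (assumed model and data, not taken from chapter 2.2) applies two standard checks from prognostic research to a previously "fitted" risk model evaluated on new data: the c-statistic for discrimination, and calibration-in-the-large for systematic miscalibration.

```python
# External-validation sketch (invented model and data): discrimination
# (c-statistic) and calibration-in-the-large of a fixed risk model
# evaluated on data the model was not built on.
import numpy as np

rng = np.random.default_rng(7)

def predict_response(x):
    # hypothetical previously fitted model: logit(p) = -0.5 + 1.0 * x
    return 1.0 / (1.0 + np.exp(0.5 - x))

# simulated "new trial" in which the true intercept has drifted to -0.9
x_new = rng.normal(size=500)
p_true = 1.0 / (1.0 + np.exp(0.9 - x_new))
y_new = rng.binomial(1, p_true)

p_hat = predict_response(x_new)

# c-statistic: probability that a responder outranks a non-responder
pos, neg = p_hat[y_new == 1], p_hat[y_new == 0]
c_stat = (pos[:, None] > neg[None, :]).mean()

# calibration-in-the-large: observed minus mean predicted response rate
citl = y_new.mean() - p_hat.mean()

print(f"c-statistic: {c_stat:.2f}")
print(f"calibration-in-the-large: {citl:+.3f}")
```

Here discrimination remains adequate, but the negative calibration-in-the-large exposes the intercept drift: the model systematically over-predicts response in the new data.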

Chapter 3 discusses innovative clinical trial designs that make it possible to modify design aspects of an ongoing trial based on incoming data as the trial progresses. Such designs are receiving increasing attention from academia, industry and regulatory bodies alike, and have the potential to bring major improvement by increasing the likelihood of a successful trial and lowering the number of patients exposed to inferior or harmful treatment. Chapter 3.1 evaluates the sequential parallel comparison (SPC) design, which was proposed to improve the efficiency of placebo-controlled trials in depression. These trials are typically characterized by excessive placebo response, which reduces the opportunity to establish treatment efficacy, as it becomes more difficult to demonstrate additional treatment benefit when response to placebo is already substantial. The SPC design reduces the impact of placebo response by combining two consecutive placebo-controlled comparisons, of which the second is only entered by placebo non-responders from the first. The analysis then combines the data from both design phases to maximize statistical power and make optimal use of the collected data. The originally proposed SPC design involved two design phases of four weeks' duration (SPC4+4). This design is compared to an eight-week parallel group design and to alternative SPC designs with equal or longer total follow-up (SPC2+6 and SPC6+6), across different scenarios, on their sample size requirements to establish treatment efficacy. The simulation results indicate that all the SPC designs are highly efficient in comparison to the parallel group design when placebo response is high and a substantial effect of treatment (as contrasted to placebo) can already be demonstrated after a relatively short follow-up period.
For a realistic scenario derived from empirical antidepressant trial data, the SPC2+6 and SPC4+4 designs required 51 and 53 percent fewer patients, respectively, than the parallel group design with equal total follow-up. It is therefore concluded that SPC designs can significantly reduce the sample size requirements, and increase the success rates, of antidepressant clinical trials.
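The combination step of an SPC analysis can be sketched as follows. The response rates, allocation ratio and weight below are illustrative assumptions, and the weighted-z combination shown is one of the pooling schemes discussed in the SPC and adaptive-design literature [28, 30], not necessarily the exact analysis used in chapter 3.1.

```python
# Minimal sketch (illustrative rates and weights) of one SPC trial:
# two placebo-controlled comparisons whose z-statistics are pooled with
# a prespecified weight so the combination is still N(0,1) under H0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
w = 0.6                            # prespecified weight for phase 1
p_placebo, p_drug = 0.40, 0.55     # assumed phase-1 response rates
p_placebo2, p_drug2 = 0.20, 0.40   # assumed rates among placebo non-responders

def z_two_prop(r1, n1, r0, n0):
    # z-statistic for a difference in response proportions (pooled SE)
    p1, p0 = r1 / n1, r0 / n0
    pooled = (r1 + r0) / (n1 + n0)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    return (p1 - p0) / se

n_pla, n_drug = 200, 100                   # phase 1 favors placebo allocation
resp_pla = rng.binomial(n_pla, p_placebo)
resp_drug = rng.binomial(n_drug, p_drug)
z1 = z_two_prop(resp_drug, n_drug, resp_pla, n_pla)

# phase 2: placebo non-responders are rerandomized 1:1
n2 = n_pla - resp_pla
n2_drug, n2_pla = n2 // 2, n2 - n2 // 2
z2 = z_two_prop(rng.binomial(n2_drug, p_drug2), n2_drug,
                rng.binomial(n2_pla, p_placebo2), n2_pla)

z = w * z1 + np.sqrt(1 - w**2) * z2        # weights squared sum to one
print(f"combined z = {z:.2f}, one-sided p = {1 - stats.norm.cdf(z):.4f}")
```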

Chapter 3.2 evaluates several two-stage clinical trial designs for situations where there is some, but inconclusive, evidence suggesting a larger treatment benefit for patients with a certain baseline characteristic (e.g. a genetic marker) than for those without it. A group sequential design makes it possible to stop the trial early for efficacy or futility when the interim data support this conclusion. Alternatively, an adaptive selection design also allows for early stopping, but offers the additional opportunity to restrict inclusion after the interim analysis to a particular patient subgroup when this subgroup is much more likely to respond to treatment than the overall population. This chapter compares the sample size requirements for a fixed parallel group design, a group sequential design and an adaptive selection design with equal overall power and control of the family-wise type-I error rate. The designs are evaluated across scenarios that vary the average effect sizes in the marker-positive and marker-negative patient subgroups, and the prevalence of marker-positive patients in the overall study population. The effect sizes in the subgroups are chosen to reflect realistic planning scenarios, where at least some effect is present in the marker-negative subgroup (otherwise it would be illogical, and possibly even unethical, to conduct an adaptive selection design). Additional scenarios were considered in which the actual subgroup effects were assumed to differ from those hypothesized at the planning stage. It was found that both two-stage designs generally require fewer patients than the fixed parallel group design, and that the advantage of the two-stage designs increases with an increasing difference between the subgroup effects. Adaptive selection added little further reduction in sample size requirements compared to the group sequential design when the actual interim effects in the subgroups were equal to those assumed at the planning stage. However, when the actual interim effects deviated strongly in favor of enrichment, the comparative efficiency of the adaptive selection design increased, reflecting the adaptive nature of the design. In this case, the adaptive selection design also makes it possible to continue with only the most responsive patient subgroup, thus limiting the inclusion of patients who are unlikely to benefit from treatment.
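The interim logic of such a design can be made concrete with a small sketch. The boundaries, effect estimates and the enrichment rule below are invented for illustration and are not the decision rules evaluated in chapter 3.2.

```python
# Hypothetical sketch of the interim decision in an adaptive selection
# design; thresholds and the enrichment margin are illustrative only.
import numpy as np

def interim_decision(z_full, z_pos, eff_bound=2.6, fut_bound=0.3,
                     enrich_margin=1.0):
    """Choose the second-stage course from first-stage z-statistics."""
    if z_full >= eff_bound:
        return "stop for efficacy"
    if max(z_full, z_pos) < fut_bound:
        return "stop for futility"
    if z_pos - z_full > enrich_margin:
        return "continue in marker-positive subgroup only"
    return "continue in full population"

def z_stat(x_active, x_placebo):
    # normal-approximation z for a difference in means (SD assumed 1)
    n1, n0 = len(x_active), len(x_placebo)
    return (x_active.mean() - x_placebo.mean()) / np.sqrt(1 / n1 + 1 / n0)

rng = np.random.default_rng(5)
# simulated stage-1 data: 0.5 SD effect in marker-positives, none in negatives
pos_a, pos_p = rng.normal(0.5, 1, 60), rng.normal(0, 1, 60)
neg_a, neg_p = rng.normal(0.0, 1, 60), rng.normal(0, 1, 60)

z_pos = z_stat(pos_a, pos_p)
z_full = z_stat(np.concatenate([pos_a, neg_a]),
                np.concatenate([pos_p, neg_p]))
print(interim_decision(z_full, z_pos))
```

A real design would of course couple this rule to boundaries that preserve the family-wise type-I error rate, as the chapter requires.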

Chapter 3.3 also considers the situation where distinct patient subgroups respond differentially to treatment. In many therapeutic areas, individual-patient markers have been identified that predict long-term treatment response at an early stage. These markers include both baseline characteristics and short-term marker changes that follow from brief exposure to treatment. Using such predictive markers to select patients for inclusion in a randomized clinical trial could result in more targeted trials, with larger effect sizes and improved efficiency. This chapter compares a parallel group design without selection to alternative designs that apply selection based on a baseline characteristic (the baseline selection design) or on short-term improvement during active run-in (the active run-in design). In both designs, selection restricts generalizability (i.e. the target population to which the study results apply). It may nonetheless be attractive to apply selection if a proportion of the overall patient population hardly benefits from treatment and the reduction in the number of patients to randomize is substantial. In this chapter, sample size requirements for the three designs are estimated across realistic scenarios that characterize the effect of treatment on endpoint-free survival as a combination of a direct effect (unrelated to the patient's predictive marker values) and an interaction with baseline marker levels and short-term marker improvements after run-in. An additional scenario was derived from empirical trial data. The simulation results show that the active run-in design has substantial potential to reduce the number of patients that need to be recruited (and screened) when early marker improvement after run-in is a reliable predictor of long-term treatment response. In this situation, the baseline selection design could also lower sample size requirements compared to the parallel group design, but it always required more patients than the comparable (i.e. equally restrictive) active run-in design. For all other scenarios, no efficiency benefit was observed for the baseline selection design. In these conditions, the larger treatment effect in more restricted patient strata is cancelled out by the increasing number of patients who are excluded after screening (and thus not included in the randomized study stage) because they do not meet the selection criterion. It is concluded that the application of patient selection in randomized clinical trials should be limited to situations where prior evidence indicates that a valid treatment effect across the full population is unlikely, and reliable predictive markers are available to identify patient subgroups with an increased likelihood of improving with treatment.
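The trade-off described above can be illustrated with a back-of-the-envelope calculation. The mapping from eligible fraction to effect size below is invented; only the direction of the trade-off is the point: stricter selection shrinks the randomized sample but inflates the number to screen.

```python
# Back-of-the-envelope sketch (assumed effect model) of the screening
# trade-off behind baseline selection.
import math

def n_per_arm(effect_size, alpha=0.05, power=0.8):
    # standard two-sample normal approximation (two-sided alpha, SD 1)
    z_a, z_b = 1.96, 0.8416
    return math.ceil(2 * (z_a + z_b) ** 2 / effect_size ** 2)

# hypothetical: effect size among selected patients as a function of the
# eligible fraction q (stronger selection -> larger effect among eligibles)
for q, d in [(1.0, 0.30), (0.5, 0.40), (0.25, 0.45)]:
    n_rand = 2 * n_per_arm(d)
    n_screen = math.ceil(n_rand / q)
    print(f"eligible fraction {q:.2f}: randomize {n_rand}, screen {n_screen}")
```

Under these invented numbers the randomized sample shrinks with stricter selection while the screened sample grows, which is exactly the tension the chapter describes.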

Chapter 4.1 evaluates the statistical implications of dichotomization when continuous outcomes are bimodally distributed. Continuous trial outcomes are often dichotomized into 'response' and 'non-response' categories prior to statistical analysis. For many researchers and clinicians this facilitates the analysis and interpretation of study results. In general, however, dichotomization reduces statistical power. Exceptions occur when response in the study population is heterogeneous, and outcomes are bimodally distributed as a result. This chapter explores whether bimodality is present in antidepressant clinical trial data, and whether dichotomization then indeed yields more powerful tests. The distributions of relative changes from baseline (rCFB) on the Hamilton depression rating scale (HAMD17) were estimated from pooled data from nine antidepressant trials. Next, t-tests were performed on the continuous rCFB scores, and chi-square tests on the dichotomized outcomes, using both the commonly applied cutoff (i.e. rCFB>50%) and an estimated cutoff that provided optimal separation of the mixture of two normal distributions that best fitted the pooled placebo results. In addition, the power of a t-test and of a chi-square test was evaluated for simulated scenarios that varied the degree of bimodality as well as the treatment effect and sample size. Both the placebo and the active treatment groups showed evidence of bimodality, and the estimated (i.e. optimal) cutoff closely matched the cutoff commonly used in practice. Nevertheless, t-tests generally yielded smaller p-values than chi-square tests. The simulations showed that dichotomization only results in superior statistical power when bimodality is considerably more pronounced than observed in the empirical data. The chapter concludes that antidepressant trial outcomes show bimodality, which suggests differential treatment response among patient subgroups.
We advocate more frequent reporting of this heterogeneity (by presenting the outcome distributions), since a simple comparison of means may not adequately summarize the differences between treatment groups. Dichotomizing outcomes, on the other hand, is not an appropriate alternative, as it reduces statistical power.
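For readers interested in the cutoff estimation step, the sketch below finds the "optimal" cutoff for an assumed two-component normal mixture as the point where the weighted component densities intersect (i.e. where posterior class membership flips). The mixture parameters are invented and are not the values fitted to the pooled placebo data.

```python
# Cutoff sketch for an assumed two-component normal mixture: the optimal
# split is where the weighted component densities cross.
import numpy as np
from scipy import stats
from scipy.optimize import brentq

# invented mixture: non-responders around 20% improvement, responders ~65%
w1, mu1, sd1 = 0.55, 0.20, 0.15
w2, mu2, sd2 = 0.45, 0.65, 0.15

def density_gap(x):
    return (w1 * stats.norm.pdf(x, mu1, sd1)
            - w2 * stats.norm.pdf(x, mu2, sd2))

cutoff = brentq(density_gap, mu1, mu2)   # the crossing lies between the means
print(f"optimal rCFB cutoff: {cutoff:.2f}")
```

With equal component standard deviations the crossing has a closed form, but the root-finding version generalizes to unequal variances.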


Chapter 4.2 compares the full, 17-item HAMD17 with several shorter HAMD subscales (the Bech melancholia scale, the Maier-Philipp depression subscale and the Gibbons depression scale) on their ability to discriminate between antidepressant treatment and placebo in randomized clinical trials. The HAMD17 is the preferred efficacy outcome in antidepressant trials, but it is also widely criticized for multidimensionality and poor discriminative properties. The various HAMD subscales were proposed to provide a more unidimensional measure of depressive state by largely ignoring behavioral and somatic symptoms. It has been suggested that the use of these subscales would reduce the outcome variability and sample size requirements of antidepressant trials. This chapter discusses analyses of data from 24 actual antidepressant trials with a total of 3,692 patients randomized to tricyclic or tetracyclic antidepressants (TCAs and TeCAs, respectively), selective serotonin reuptake inhibitors (SSRIs) or placebo. The data were analyzed with a mixed model for repeated measurements (MMRM). Standardized effect sizes on the HAMD17 and the various HAMD subscales were derived from the model, and the implications for sample size requirements were evaluated. For TCAs and TeCAs versus placebo, the HAMD17 consistently provided the largest standardized effect at every available time-point. For these treatments, the sample size needed to significantly establish the observed effect of treatment (as compared to placebo) after six weeks of treatment was about 25 percent smaller than with any of the subscales. For SSRIs versus placebo, on the other hand, the HAMD17 yielded slightly smaller effects and was the least efficient outcome. It is concluded that the comparative performance of the HAMD17 and the various shorter subscales strongly depends on the type of antidepressant treatment.
The results support the use of the HAMD17 as the primary endpoint in clinical trials of antidepressant treatments, while it is still considered beneficial to pro-actively include HAMD subscales as additional endpoints to successfully establish treatment effects of new antidepressants.
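The link between standardized effect size and sample size that underlies the 25 percent figure is the usual 1/d² relationship. In the sketch below the two effect sizes are invented placeholders chosen only to show the mechanics, not the values estimated in this chapter.

```python
# Sketch of how standardized effect sizes translate into per-arm sample
# sizes via n proportional to 1/d^2; the effect values are placeholders.
import math

def n_per_arm(d, alpha=0.05, power=0.9):
    # two-sided normal approximation; z for the requested power
    z = {0.9: 1.2816, 0.8: 0.8416}[power]
    return math.ceil(2 * (1.96 + z) ** 2 / d ** 2)

d_hamd17, d_subscale = 0.52, 0.45     # hypothetical standardized effects
n17, nsub = n_per_arm(d_hamd17), n_per_arm(d_subscale)
print(f"HAMD17: {n17} per arm; subscale: {nsub} per arm "
      f"({100 * (1 - n17 / nsub):.0f}% fewer with HAMD17)")
```

Because n scales with 1/d², even a modest advantage in standardized effect size translates into a sizeable saving in patients.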

Samenvatting


De ontwikkeling van een nieuw geneesmiddel is een tijdrovend en kostbaar proces. Het meeste geld en tijd wordt besteed aan confirmatieve klinische trials. Het doel van deze trials is om de effectiviteit en veiligheid van het nieuwe medicijn te bevestigen zodat het kan worden toegepast in de medische praktijk. Doorgaans vergen klinische trials een effectieve samenwerking tussen meerdere onderzoekscentra in verschillende landen, en de inclusie en herhaalde meting van een groot aantal patiënten. Het falen van een confirmatieve klinische trial (d.w.z. het niet kunnen aantonen van de veiligheid en werkzaamheid van het middel) is vaak reden om de verdere ontwikkeling van het medicijn te staken, wat tevens het verlies betekent van investeringen in preklinisch en vroeg klinisch onderzoek. Om de toevoer van nieuwe en betaalbare geneesmiddelen te garanderen is het daarom van essentieel belang om de kosten en de kans op falen van klinische trials te reduceren. Dit proefschrift gaat in op een aantal methoden die helpen om dit doel te bereiken, en is opgedeeld in drie secties met verschillende (maar aan elkaar gerelateerde) thema's.

Hoofdstuk 2 is gewijd aan klinische trial simulatie (CTS); een statistische simulatietechniek die het mogelijk maakt om het studieverloop van een trial te simuleren om zo tekortkomingen in het onderzoeksprotocol (bv. suboptimale dosering, ontoereikende groepsgrootte) te identificeren en te voorkomen voordat de trial wordt gestart. Hoofdstuk 2.1 geeft een overzicht van gepubliceerde CTS studies die ingaan op relevante vragen rondom het design en de analyse van confirmatieve klinische trials, en bespreekt de doelen, simulatiemodellen en analytische methoden die in deze studies worden gebruikt. De meeste studies in het review hebben CTS retrospectief toegepast om te evalueren hoe CTS kan helpen bij het vinden van het optimale onderzoeksdesign voor een specifieke situatie. Andere studies gebruikten CTS prospectief en als leidraad in de planning van toekomstige trials. Het review laat zien dat CTS een waardevolle techniek is om de mate en invloed van onzekerheid te bepalen tijdens de planning van een trial, en dat CTS kan worden gebruikt voor de beoordeling en vergelijking van verschillende onderzoeksdesigns op relevante criteria (bv. de statistische power of benodigde groepsgrootte voor het aantonen van een behandelingseffect, of de verwachte kosten van de studie). Daarnaast dwingt CTS tot de integratie van beschikbare informatie over het onderzochte medicijn voordat een nieuwe trial wordt gestart. Gegeven dit potentieel, en de sterke behoefte aan effectiever en efficiënter geneesmiddelenonderzoek, pleit dit review voor een grotere rol voor CTS bij de planning van klinische trials. Het review laat echter ook zien dat de modelconstructie- en validatieprocedures die veel worden gebruikt in CTS vaak ontoereikend zijn voor een terdege validatie en beoordeling van CTS modellen. Dit is van belang voor het gebruik van CTS in de praktijk. De constructie van CTS modellen is doorgaans gebaseerd op empirische data die onderhevig zijn aan allerlei bronnen van variantie.
Het is daarom essentieel dat modellen worden geëvalueerd op hun vermogen om toekomstige trialresultaten betrouwbaar te voorspellen. Hoofdstuk 2.2 is een reactie op een gepubliceerd review over CTS, en gaat nader in op deze kwestie. Het wijst ook op methoden uit diagnostisch en prognostisch statistisch en epidemiologisch onderzoek die kunnen worden ingezet om de validatie van CTS modellen te verbeteren, en pleit voor een algemener gebruik van deze methoden in CTS.

Hoofdstuk 3 bespreekt innovatieve klinische trial designs die het mogelijk maken om design aspecten van een lopende studie aan te passen op basis van resultaten die beschikbaar komen al naargelang de studie vordert. Dit soort designs krijgt steeds meer aandacht van academische, industriële en beleidsmakende partijen, en heeft veel potentie voor verbetering door de kans op een succesvolle trial te verhogen en het aantal patiënten dat wordt blootgesteld aan ineffectieve of schadelijke behandelingen te beperken. Hoofdstuk 3.1 evalueert het sequentiële parallelle groep design (SPC) dat is voorgesteld om de efficiëntie van placebo-gecontroleerde antidepressiva studies te verbeteren. Antidepressiva trials (en psychiatrische trials in het algemeen) worden gekenmerkt door excessieve placebo response die het moeilijk maakt om een additioneel effect van behandeling aan te tonen (bovenop het toch al grote effect van placebo). Het SPC design vermindert de impact van placebo response door de combinatie van twee achtereenvolgende, placebo-gecontroleerde vergelijkingen waarvan de tweede alleen wordt doorlopen door patiënten die niet reageerden op placebo in de eerste. De analyse combineert vervolgens resultaten van beide fasen om zo de statistische power te maximaliseren en optimaal gebruik te maken van de verzamelde data. Het SPC design is oorspronkelijk voorgesteld als bestaande uit twee fasen met elk een duur van 4 weken (SPC4+4). Dit design wordt hier vergeleken met een acht weken durend conventioneel parallelle groep design en alternatieve SPC designs met een gelijke of langere totale looptijd (SPC2+6 en SPC6+6), over verschillende scenario's en op het aantal patiënten dat nodig is om een significant effect van behandeling aan te tonen.
De resultaten van de simulaties laten zien dat alle SPC designs efficiënter zijn dan het conventionele parallelle groep design wanneer placebo response hoog is, en een substantieel effect van behandeling (in vergelijking met placebo) relatief vroeg kan worden aangetoond. Voor een realistisch scenario dat was afgeleid van empirische antidepressiva trial data, werd gevonden dat het SPC2+6 en het SPC4+4 design respectievelijk 51 en 53 procent efficiënter waren dan een conventioneel parallelle groep design met dezelfde totale looptijd. De conclusie is dan ook dat SPC designs een bijdrage kunnen leveren aan het verminderen van de benodigde groepsgrootte en het verbeteren van de slagingskans van klinische antidepressiva trials.

Hoofdstuk 3.2 evalueert verschillende twee-fasen designs voor de situatie waarbij er reden is om te vermoeden dat het effect van behandeling groter is in patiënten met een bepaalde baseline eigenschap (bv. de aanwezigheid van een genetische marker) dan in patiënten zonder die eigenschap. Het groep-sequentiële design biedt de mogelijkheid om de trial vroeg te stoppen vanwege futiliteit of effectiviteit als de interim resultaten deze conclusie ondersteunen. Het adaptieve selectie design biedt ook de mogelijkheid om vroeg te stoppen en stelt tevens in staat om inclusie na interim analyse te beperken tot een specifieke patiëntengroep wanneer de interim resultaten suggereren dat deze groep beter reageert op behandeling dan de volledige, ongeselecteerde populatie. Dit hoofdstuk vergelijkt de benodigde groepsgrootte voor het conventionele parallelle groep design, het groep-sequentiële design en het adaptieve selectie design met gelijke statistische power en controle van de kans op een type-I fout. De designs worden vergeleken over scenario's met variërende aannames over het gemiddelde behandelingseffect in de marker positieve en marker negatieve patiëntensubgroep, en de proportie marker positieve patiënten in de volledige onderzoekspopulatie. De behandelingseffecten voor deze twee subgroepen zijn realistisch gekozen, waarbij er in ieder geval enig effect werd verondersteld in de marker negatieve subgroep (anders zou het onlogisch en mogelijk zelfs onethisch zijn om een adaptief selectie design uit te voeren). Daarnaast worden er scenario's bekeken waarbij de ware subgroepeffecten worden verondersteld af te wijken van de effecten zoals verwacht voor aanvang van de trial. Beide twee-fasen designs vereisten over het algemeen minder patiënten dan het parallelle groep design, en dit voordeel nam toe naarmate het verschil in de subgroepeffecten groter werd. Het adaptieve selectie design leidde tot weinig verdere reductie in benodigde groepsgrootte ten opzichte van het groep-sequentiële design wanneer de ware effecten overeenkwamen met de effecten zoals die werden verwacht voorafgaand aan de studie. Echter, wanneer de ware effecten afweken ten gunste van verrijking (d.w.z.
wanneer het verschil tussen de subgroepen groter was dan verwacht) nam de efficiëntie van het adaptieve selectie design verder toe, hetgeen overeenkomt met de adaptieve aard van het design. In dit geval biedt het adaptieve selectie design de mogelijkheid om na interim analyse alleen door te gaan met de meest veelbelovende subgroep, en kan het dus de inclusie beperken van patiënten die niet of minder van behandeling profiteren.

Hoofdstuk 3.3 gaat ook in op de situatie waarbij verschillende subgroepen anders reageren op behandeling. In veel therapeutische gebieden zijn vroeg te detecteren markers geïdentificeerd die voorspellend zijn voor lange-termijn respons op behandeling. Dit kunnen zowel baseline eigenschappen zijn als korte-termijn veranderingen als gevolg van kortstondige blootstelling aan het middel. Het gebruik van deze voorspellende markers voor selectie van patiënten in een klinische trial kan leiden tot beter toegespitste trials, met grotere behandelingseffecten en een kleinere benodigde groepsgrootte. Dit hoofdstuk vergelijkt een conventioneel parallelle groep design zonder selectie met twee alternatieve designs waarbij selectie wordt toegepast op basis van een baseline eigenschap (het baseline selectie design) of op basis van snel te detecteren veranderingen na actieve run-in (het actieve run-in design). In beide designs beperkt de selectie van patiënten de generaliseerbaarheid van de studie (omdat de doelpopulatie waarop de studieresultaten van toepassing zijn specifieker wordt). Het kan desondanks aantrekkelijk zijn om selectie toe te passen als een deel van de algehele populatie niet of nauwelijks reageert op behandeling, en selectie leidt tot een substantiële afname van het aantal patiënten dat dient te worden gerandomiseerd om een effect van behandeling te kunnen aantonen. In dit hoofdstuk wordt de benodigde groepsgrootte geschat voor scenario's waarbij het effect van behandeling op het eindpunt is gekarakteriseerd als een combinatie van een direct effect (ongerelateerd aan de waarde van de patiënt op de voorspellende marker) en een interactie met baseline marker waarden en korte-termijn veranderingen in de marker na actieve run-in. Een additioneel scenario is afgeleid van empirische trial data. Resultaten van de simulatie laten zien dat het actieve run-in design het aantal te rekruteren (en te screenen) patiënten substantieel kan verlagen ten opzichte van het parallelle groep design, indien vroege veranderingen na run-in een betrouwbare voorspeller zijn voor lange-termijn respons op behandeling. In deze situatie kan het baseline selectie design het benodigde aantal patiënten ook verlagen ten opzichte van het parallelle groep design, maar niet ten opzichte van een vergelijkbaar actieve run-in design (d.w.z. een actieve run-in design met dezelfde mate van selectie). Voor alle andere scenario's was er geen voordeel voor het baseline selectie design. In deze gevallen werd het grotere behandelingseffect bij strengere selectie opgeheven door het toenemende aantal patiënten dat werd geëxcludeerd na screening (en dus niet werd geïncludeerd in de gerandomiseerde fase van de studie) omdat ze niet voldeden aan de minimale inclusiecriteria.
This chapter concludes that patient selection in clinical trials should remain limited to situations in which a relevant treatment effect in the overall population is unlikely, and reliable predictors are available to identify patients with an increased probability of responding to treatment.
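The trade-off between stricter selection and the number of patients that must be screened can be made concrete with a deliberately simple toy model (all numbers below are illustrative assumptions, not estimates from the thesis): a standard-normal baseline marker is assumed to modify the standardized treatment effect linearly, and only the top fraction q of screened patients is enrolled, so the expected marker value among the selected equals the truncated-normal mean φ(z₁₋q)/q.

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()

def n_randomized(delta, alpha=0.05, power=0.80):
    # patients to randomize per arm for a standardized effect `delta`
    za, zb = Z.inv_cdf(1 - alpha / 2), Z.inv_cdf(power)
    return ceil(2 * ((za + zb) / delta) ** 2)

def enrichment_tradeoff(q, base=0.10, slope=0.30):
    # Toy model (assumed, not from the thesis): the standardized effect is
    # base + slope * (mean marker value among the selected top fraction q),
    # where that mean is the truncated-normal mean phi(cut) / q.
    cut = Z.inv_cdf(1 - q)
    effect = base + slope * Z.pdf(cut) / q
    n_rand = n_randomized(effect)
    return n_rand, ceil(n_rand / q)   # (randomized per arm, screened per arm)

print(enrichment_tradeoff(0.50))  # moderate selection  -> (137, 274)
print(enrichment_tradeoff(0.25))  # stricter selection -> (68, 272)
```

Under this particular model, the stricter design randomizes half as many patients yet screens roughly the same number: the larger effect obtained with stricter selection is offset by the growing number of screened-but-excluded patients, mirroring the chapter's conclusion.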

Chapter 4.1 evaluates the statistical implications of dichotomization in the case of bimodally distributed continuous outcomes. Continuous outcomes are often dichotomized into 'response' and 'non-response' categories prior to statistical analysis. This simplifies the analysis and interpretation of the study results, but usually lowers the power of statistical tests. Exceptions are situations with differential response within the study population (i.e., when there are subgroups that respond differently to treatment), causing the outcomes to be bimodally distributed. This chapter investigates whether bimodality is present in data from antidepressant trials, and whether dichotomization does indeed lead to tests with greater statistical power in that case. The distributions of relative change from baseline (rCFB) on the Hamilton depression rating scale (HAMD17) are estimated from the pooled data of nine antidepressant trials. Subsequently, t-tests are performed on the continuous rCFB scores, and chi-square tests on the dichotomized outcomes, using either the conventional cut-off (rCFB > 50%) or an optimal cut-off derived from the empirical data. In addition, the power of a t-test and a chi-square test is evaluated for simulated scenarios in which the degree of bimodality is varied, as well as the size of the treatment effect and the sample size. The scores of both the placebo-group and the treated-group participants showed bimodality, and the optimal cut-off for dichotomization corresponded well to the value typically used in practice. Nevertheless, the t-tests almost always resulted in smaller p-values than the chi-square tests on the dichotomized outcomes. The simulations showed that dichotomization only yields higher power when the bimodality is more pronounced than was observed in the empirical data. This chapter therefore concludes that data from antidepressant trials are bimodally distributed and that this should be reported more often (by showing the distributions), because a simple comparison of means does not adequately summarize the differences between groups in this case, but also that dichotomization is not a suitable alternative because it leads to lower statistical power.
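The power comparison at the heart of this chapter can be sketched with a small Monte Carlo simulation. The mixture parameters, sample sizes, and cut-off below are invented for illustration (they are not the thesis's estimates): outcomes are drawn from a two-component normal mixture, the treatment shifts the whole distribution, and each simulated trial is analyzed both with a large-sample test on the continuous means and with a two-proportion (1-df chi-square) test on the dichotomized outcome.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(7)
Z = NormalDist()

def simulate_arm(n, shift):
    # 50/50 mixture of two normal "response modes" (hypothetical rCFB-like
    # scores); the treatment effect is modeled as a pure shift of size `shift`
    low = rng.normal(0.2, 0.3, n)
    high = rng.normal(0.8, 0.3, n)
    return np.where(rng.random(n) < 0.5, low, high) + shift

def p_continuous(x, y):
    # large-sample (normal-approximation) test on the difference in means
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return 2 * (1 - Z.cdf(abs(x.mean() - y.mean()) / se))

def p_dichotomized(x, y, cutoff=0.5):
    # two-proportion z-test, equivalent to a 1-df chi-square test on the
    # 2x2 table of responders (score > cutoff) vs non-responders
    rx, ry = (x > cutoff).mean(), (y > cutoff).mean()
    pool = ((x > cutoff).sum() + (y > cutoff).sum()) / (len(x) + len(y))
    se = np.sqrt(pool * (1 - pool) * (1 / len(x) + 1 / len(y)))
    return 2 * (1 - Z.cdf(abs(rx - ry) / se))

def power(nsim=400, n=100, shift=0.15, alpha=0.05):
    hits_t = hits_c = 0
    for _ in range(nsim):
        placebo, active = simulate_arm(n, 0.0), simulate_arm(n, shift)
        hits_t += p_continuous(placebo, active) < alpha
        hits_c += p_dichotomized(placebo, active) < alpha
    return hits_t / nsim, hits_c / nsim

pow_t, pow_chi = power()
print(round(pow_t, 2), round(pow_chi, 2))  # continuous analysis retains more power
```

With these assumed settings the continuous analysis is clearly more powerful; making the treatment work by moving patients between the modes, rather than shifting everyone, is the kind of scenario in which dichotomization can catch up, as the chapter's simulations explore.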

Chapter 4.2 compares the full 17-item HAMD17 with several shorter HAMD subscales (the Bech melancholia scale, the Maier-Philipp depression subscale, and the Gibbons depression scale) on their ability to discriminate between antidepressant treatment and placebo in randomized clinical trials. The HAMD17 is the standard outcome measure in antidepressant trials, but is often criticized for its multidimensionality and limited discriminative ability. The various HAMD subscales have been proposed as unidimensional measures of depressive state and largely ignore the behavioral and somatic symptoms of depression. It has been suggested that using the subscales can reduce the variance in outcomes and thereby lower the required sample size of antidepressant trials. This chapter discusses the analysis of 24 antidepressant trials in which a total of 3692 patients were randomized to tricyclic or tetracyclic antidepressants (TCAs and TeCAs, respectively), SSRIs, or placebo. The data are analyzed with a mixed model for repeated measures (MMRM). Standardized effects on the HAMD17 and the various HAMD subscales are derived from the model, and the implications for the required sample size are evaluated. For TCAs and TeCAs versus placebo, the HAMD17 consistently gave the largest effect estimates at all available measurement occasions. To demonstrate a significant effect of these treatments (compared with placebo) after six weeks, the sample size could be reduced by 25 percent by using the HAMD17 instead of the subscales. For SSRIs versus placebo, by contrast, the HAMD17 gave the smallest effect estimates and was the least efficient outcome. In this case, too, there were no notable differences between the subscales. The conclusion, therefore, is that the performance of the HAMD17 and the various HAMD subscales strongly depends on the type of antidepressant. The results support the use of the HAMD17 as the primary outcome in clinical trials of antidepressants, although it may nevertheless be beneficial to also include the subscales as additional endpoints to demonstrate the efficacy of new antidepressants.
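The link between a standardized effect estimate and the required sample size follows from the standard two-sample formula n = 2((z₁₋α/₂ + z₁₋β)/d)² per arm, so a modestly larger standardized effect translates into a disproportionately smaller trial. A minimal sketch (the effect sizes 0.40 and 0.35 are invented for illustration and are not the estimates reported in the thesis):

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(d, alpha=0.05, power=0.80):
    # required patients per arm for a two-sample comparison of means,
    # given standardized effect size d (Cohen's d)
    z = NormalDist()
    za, zb = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    return ceil(2 * ((za + zb) / d) ** 2)

# hypothetical scales: one yields d = 0.40, a less sensitive one d = 0.35
n_sensitive, n_other = n_per_arm(0.40), n_per_arm(0.35)
print(n_sensitive, n_other, round(1 - n_sensitive / n_other, 2))  # 99 129 0.23
```

A roughly 14% larger standardized effect thus cuts the required sample size by about a quarter, the order of magnitude of the 25 percent reduction reported for the HAMD17 relative to its subscales in the TCA/TeCA comparison.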

Dankwoord


It was sometimes a struggle, but with the end in sight I look back mainly on an enjoyable and instructive PhD journey. These acknowledgements are addressed to everyone involved, for their contributions, interest, and good company. First of all, my supervisors and co-supervisors.

Prof. dr. K.C.B. Roes, dear Kit, despite your full agenda you were closely involved in this project and always available for an urgent consult. That accessibility was a luxury for which I am very grateful. Your impressive insight and knowledge of statistics have been of great value to this thesis. You were also a fantastic supervisor on a personal level. I enjoyed our collaboration and look forward to a possible sequel in the future.

Prof. dr. D.E. Grobbee, dear Rick, your role was above all that of a flawless quality check in the early and final stages of virtually every study in this thesis. If you approved of an idea or a text, that was a guarantee to me that it was good. Thank you for your valuable advice and suggestions.

Dr. M.J. Knol, dear Mirjam, you were my daily supervisor until your departure at the end of 2011. In those two and a half years I learned a great deal from you, among other things to write concisely and precisely, to work in a more structured and planned way, and to get more out of our group meetings. For those lessons, and for all your effort and substantive input, I am very grateful.

Dr. R.H.H. Groenwold, dear Rolf, you gradually became more closely involved in this project and eventually took over the role of sole co-supervisor after Mirjam's departure. Especially in that last (and busiest) year we got a lot of work done together. Despite your many commitments you were always available and willing to think along when I was stuck. Our meetings were pleasant and insightful. Without your help this thesis would have been far from finished. Thanks for everything!

I am furthermore grateful to everyone with whom I collaborated on the studies in this thesis. In particular Frederieke van der Baan and Hiddo Lambers Heerspink: thank you for your good ideas, and for your dedication and enthusiasm.

Nienke, Vincent, Nanne, Rose, Guðrún, Francisco, Grace, Michelle, Laura, Gerdien, Thomas, Paulien, Ewout, Maarten, Stavros, Julien, Susan, Henrike, Loes, Paula, Yvonne, Marjolein, Ilona, Marie, Florianne, Lisette, Louise, Noor and all my other (former) colleagues from the Julius Center and the Escher project, thank you for your good spirits and gezelligheid during lunches, coffee breaks and hallway discussions. Sonja, thank you for the fun and your exciting stories. Thanks to all of you, there was never a day I went to 'work' reluctantly.

My roommates and friends, Sanne, Liselotte and Carla: sharing an office with you meant that, besides all the work, there was always something to laugh about. I enjoyed your company and look back with great pleasure on that final year in the writing room. Let's stay in touch!

My family and my friends from Utrecht and Weert: there is much I am grateful to you for, but I will limit myself here to your involvement and interest, and the much-needed relaxation outside of work. Marc, thank you so much for your help with the cover; from a photo and a vague idea you created a beautiful design that I am very proud of!

Niels and Niek, choosing you as my paranymphs was self-evident to me, and was made long ago. You are both doing PhDs yourselves and know the ups and downs of the journey. It was good to be able to share those experiences with you. More than ten years ago the three of us started our academic careers in Utrecht, and I think it is fantastic to be able to involve you in the grand finale as well.

Finally, my parents, dear Dad and Mom, you, more than anyone, had to listen to me when things were not going well, but you were always interested and involved. I am immensely grateful for your support and patience, your love and your confidence. You often assured me that it would turn out fine, and so it has: the thesis is finished!

Curriculum Vitae


Ruud Boessen was born in Weert, the Netherlands, on October 19, 1983. He graduated from high school in 2002. After completing a Bachelor of Science in Psychology (2006) and a two-year research Master of Philosophy in Neuroscience and Cognition (2008) at Utrecht University, he worked as a PhD student at the Julius Center for Health Sciences and Primary Care of the University Medical Center Utrecht between 2009 and 2012. His PhD research was supervised by prof. dr. Diederick E. Grobbee, prof. dr. Kit C.B. Roes, dr. Mirjam J. Knol and dr. Rolf H.H. Groenwold, and resulted in the studies presented in this thesis. In 2012, Ruud completed a Master of Science in Epidemiology. As of January 2013, he works as a researcher and statistician at TNO, Earth, Environmental and Life Sciences in Zeist.
