STATISTICS IN MEDICINE, VOL. 12, 583-600 (1993)

PANEL DISCUSSION: SCIENTIFIC ISSUES IN DATA MONITORING

Dr. Richard Simon: The purpose of this panel is to focus on some of the difficult issues and decisions involved in interim monitoring of clinical trials, and to share ways we have developed to do a more effective job. We have in this room an amazing collection of expertise and experience and so I hope we can get broad participation in these discussions. I’d only ask that we focus on specifics, particularly our specific experiences and specific concrete suggestions for how to deal with difficult issues. The first topic I listed on the programme was that of monitoring boundaries. We did not want to get into the technical issues of one type of sequential monitoring boundary relative to another, but I think there are important general issues. One is, is it useful or important to specify at the outset, either in the protocol or by the data monitoring committee, statistical monitoring boundaries? Should the monitoring committee see comparative efficacy data at times other than the specifically planned interim analyses? If they do, should those ever be considered ‘cost free’ interim analyses from the viewpoint of type I error?

Dr. Armitage: I’d like to make a general point about whether one should predetermine a sequential or group sequential stopping rule and effectively stick to that, or whether one should be more relaxed about the procedure that is to be followed. In the large-scale multicentre long-term trial we’ve been mainly talking about, I’m happier not to be too specific at the outset about the stopping procedures that will be followed. The reasons are first of all that there are a number of administrative aspects that are difficult to predict: accrual rates, event rates, final determination points, and so on. Secondly, the actual decision to stop, which the investigators may have to make, will depend not only on whether a particular line has crossed a boundary, but also on lots of other things, like the position on several endpoints and adverse effects, external evidence, and so on. This may seem all too vague. I’d just like to illustrate this by telling a little about what happened in a recent British trial, the MRC vitamin study that was published last year. This is on the question of the possible effectiveness of folic acid and/or multivitamins as dietary supplementation at the time of conception to reduce the risk of neural tube defects in women who are at high risk because they’ve already had an infant so affected in the past. This was a 2 by 2 factorial, with folic acid versus placebo as one factor, and other vitamins versus placebo as the other factor. This was a very high profile study, it caused a great deal of controversy and media publicity, and so on. The protocol had a vague statement about the stopping procedure to the effect that the trial results will be monitored and the effect of this will be taken into consideration - something like, it is unlikely that the trial will be recommended to stop unless a difference of more than about 23 standard errors appeared. Now, the data monitoring committee more or less followed that rule; they were looking to see whether a difference of 23 standard errors had appeared, which it didn’t do for a long time. They were faced also with a sudden change on the administrative side in that the event rate was not much more than half of the predicted level, so that although the accrual continued to be good, the actual rate of accrual of events was a good deal smaller. The original power considerations were not wholly relevant. When eventually the DMC decided to recommend stopping, there was a fairly marked difference in favour of folic acid.

© 1993 by John Wiley & Sons, Ltd.

The data monitoring committee and the investigators thought it would be sensible to make some statement about the effect of repeated looks at the data. They took the unusual step of trying out a sort of retrospectively defined sequential stopping rule. That is to say, they looked back at what they had been doing and said we’ve more or less been following this sort of procedure, and we know looking to the future how long the trial would have gone on at the most. It’s then possible to work out a P-value that takes that into account. This was actually reported in the paper quite explicitly as a rough approximation to the effect of the optional stopping. So I can merely report this as one approach to the question of formality or informality in the use of stopping rules.
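
A rough idea of what such a retrospective adjustment involves: given the looks that were, in effect, taken, one asks how probable it is under the null hypothesis of ever seeing a difference as extreme as the one observed. The sketch below does this by simulation under a normal approximation, with purely hypothetical look times and an assumed observed z-value; it illustrates the idea and is not the calculation actually reported for the MRC study.

```python
import numpy as np

rng = np.random.default_rng(0)

def adjusted_p(z_obs, info_fractions, n_sim=200_000):
    """Probability, under the null, that a monitored one-sided z-statistic
    ever reaches z_obs at any of the given looks."""
    t = np.asarray(info_fractions, dtype=float)
    # Score-statistic increments between looks are independent normals whose
    # variance equals the added information fraction.
    incr_sd = np.sqrt(np.diff(np.concatenate(([0.0], t))))
    increments = rng.standard_normal((n_sim, t.size)) * incr_sd
    score = np.cumsum(increments, axis=1)   # Brownian motion at the look times
    z = score / np.sqrt(t)                  # z-statistic at each look
    return float(np.mean((z >= z_obs).any(axis=1)))

# Hypothetical example: five roughly evenly spaced looks, observed z = 2.6 at the end.
print(adjusted_p(2.6, [0.2, 0.4, 0.6, 0.8, 1.0]))
```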

Dr. Harrington: Those were nice points about avoiding being too formal. I’m going to disagree a bit. There is an interesting backdrop to my disagreement that I’ll share. The Eastern Cooperative Oncology Group where I work is planning a trial to be co-ordinated with the MRC in England, a transplant study in acute lymphocytic leukaemia. In our elaborate, and probably overdetailed, American way we wrote a design that specified very carefully to the nth degree what the stopping boundaries were, sent it to our collaborators in England, and got back a rather less formal, slightly more vague, presentation of what the stopping boundaries would be. We did some iteration and finally the statistician working on the trial showed me on the plane today that the last transmission from England included a paper by Peter Armitage explaining why they supported this less formal idea. I thought ‘How can one argue with that?’ And then here I find myself on this panel with the opportunity to argue with that. While I agree that there can be a tyranny to stopping boundaries, our sense in the cancer co-operative group has been that the choice of statistical rules for interim monitoring will generate a fair bit of heat, some light, and lots of argument. We have found it best to finish that discussion before the trial starts accruing patients, because if we delay that discussion into the midst of the trial, very close to the actual presentation of data, then we find it is very hard to be sure that the ultimate choice of a monitoring rule isn’t to some extent data driven. So, we worry about having a specification in the protocol that doesn’t give us a guide for some very tough discussions. I do agree with Peter (Armitage) that most of us who have worked in the methodology of interim stopping boundaries, and there are many people in this room who have, will often say that we acknowledge that the interim stopping boundaries are a guideline. Of course we have the same guideline in P < 0.05 that’s used in the statistical literature. But in the clinical literature, it has become much more than a guideline: it has become essentially a litmus test for whether something is published. I am also very nervous about specifications that statisticians understand how to balance and treat, but that for others with less experience in interpreting randomness become more than a guideline, almost a lab assay for measuring whether a trial was done properly.

Dr. David Bristow: As a clinician, I’m eager to see the guidelines set quite early in the course of a trial. I’ve sat through arguments, as a trial has gone on, about the level of probability that might ultimately be defined for acceptance of significance, one-sided versus two-sided analysis, and so on. Those things carry the risk of fitting the analysis to the kind of data that one is getting, and I really would like to avoid that, perhaps out of ignorance of some of the statistical nuances. Having said that, I would like to emphasize, again as a clinician, that I look to such stopping rules and boundary definitions as purely advisory. I believe my primary role on such boards, in addition to the obvious, is to recognize unexpected problems that surface, trends that appear in subgroups that have not been previously defined or even known about, so that I provide a clinical alertness quite different from just arbitrarily accepting whether a line has been crossed. But again, I feel much more comfortable in the broad sense about having the stopping boundaries defined quite early, if not at the very beginning.

Dr. Michael Walker: From a neurologic disease point of view, I think we have to consider how these committees operate. They meet maybe twice a year, for about 4 hours to a day, and therefore they have a jam-packed agenda, with many things to address. During the first several such meetings, concern is usually about accrual, understanding the study, possible toxicities, and so forth. As the study goes on, toxicity becomes of higher importance, and then finally the concern about efficacy. It does give the clinicians on these committees a great sense of comfort to have an alerting rule pre-established that is not going to be forgotten. If something should happen along the way they will at least be alerted so that time can be given during the meeting to the serious discussions that need to be had. In addition, the statistician for the study needs to have some idea as to what to bring to a meeting that may only happen every 6 months. Alerting rules put a sense of order to the thing and make sure that nothing gets forgotten on the way.

Dr. Michael Gent: Most of us want the study group and the external safety data monitoring committees to be independent. One important thing is to make sure at the start of the study that there’s congruence between what the external safety and efficacy monitoring committee thinks they should be doing and what the study group want to have done. One advantage of the boundaries themselves is that while they shouldn’t be the sole decision-making procedure, they do reflect a mutual position. Issues like symmetrical boundaries or asymmetrical boundaries reflect a particular strategy in your study. You need to agree upon that. You also need to agree on how stringent or how relaxed you’re going to be in setting up P-values in the various interim analyses. If you’re trying to evaluate a new drug against a standard drug, then the two groups need to be quite clear on what the strategy is for stopping. For example, if you’re trying to find a new drug and you want it to be better than the one presently in use, you may well set a rule that, given the two drugs are equally safe, the new drug should be x per cent better than the existing one.

Dr. Simon: I’d like to follow up on that point, Mike. Has anybody here been involved in the data monitoring committee in which accrual to the trial was continued beyond the point where the null hypothesis of equality with regard to the primary endpoint could be rejected? If so, was informed consent changed? Was such continuation something that was planned at the outset? How did the committee arrive at the decision to continue accrual in that circumstance?

Dr. Meier: Peter Armitage can equally well deal with ISIS 2 because he was also on the data monitoring committee. That study was continued. It was a subset issue. The first subset most likely to benefit from the use of streptokinase consisted of patients who were treated relatively early. The remainder of the picture was not clear. In terms of the total primary endpoint, you could say there was a conclusion. The study was not stopped in toto. I don’t know if that meets the criterion you’re talking about or not.

Dr. Simon: You’re saying, in toto there was a significant effect, but it was continued in order to clarify issues of specificity of the effect?

Dr. Meier: It was, in effect, discontinued for the predefined subgroup in which a benefit had been shown.

Dr. Armitage: I don’t wish to say more about ISIS, but on a general point, there would be many situations where you would advocate continuation when the null hypothesis of zero effect had been contradicted, for the sort of reason that Mike Gent has indicated, that very often that in itself would not indicate the need to change treatment. The physician would be looking for something much bigger than a zero effect before regarding it as unethical to use that particular treatment. This is also tied up with the point you made about adverse effects. A decision as to whether it is ethical to continue would depend on the balance of efficacy and adverse effects. You may feel certain that there is a non-zero efficacy difference, but be worried about the adverse effects and be prepared to continue until it’s clarified.

Dr. Harrington: I was involved in one trial directly and one trial indirectly in which study investigators simply refused to believe the results. There were evident treatment effects of a cancer therapy to an extent that no one expected, and in fact I was asked very directly as a statistician by a knowledgeable clinician ‘Is this possibly one of those type I errors that you tell us about? Is this one of the 5 per cent times when things may be looking quite significant, and it’s due to chance?’ I couldn’t say no, because my clinical colleagues said they had no strong scientific basis for the outcome and they in fact lobbied very strongly for a confirmatory trial while keeping the results of the first trial blinded. That was a very difficult decision, because the issues of ethics that were raised earlier and the issues of study design, which we were violating to a certain extent, came into conflict. We mounted the confirmatory trial.

Dr. Yusuf: Just rejecting the null hypothesis at the 5 per cent level does not guarantee that you have a clinically important difference. The estimates are quite imprecise at that stage. Second, you may not have proof beyond reasonable doubt that the results are real or persuasive. While monitoring guidelines are helpful, ultimately what’s being taken into account by any committee is whether there is proof beyond reasonable doubt that there is a clinically worthwhile effect. One can define ‘clinically worthwhile effect’ in many different ways. One way is to require that the lower confidence limit guarantees an important effect. That is one extreme way of doing it, but there are other ways.
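
A purely illustrative sketch of the confidence-limit criterion Yusuf describes (all numbers hypothetical): a result can exclude no effect at the 0.05 level and yet fail the stricter requirement that even the least favourable limit of the interval corresponds to a clinically worthwhile effect.

```python
import math

# Hypothetical interim summary: estimated log hazard ratio and its standard error.
log_hr, se = math.log(0.75), 0.12        # an observed 25% relative risk reduction
z_crit = 1.96                            # two-sided 95% interval

lo, hi = math.exp(log_hr - z_crit * se), math.exp(log_hr + z_crit * se)
print(f"95% CI for hazard ratio: ({lo:.2f}, {hi:.2f})")   # excludes 1, so 'significant'

# Stricter rule: stop only if even the limit closest to no effect (the upper limit,
# since a hazard ratio below 1 is benefit) still represents a worthwhile reduction.
worthwhile_hr = 0.90                     # hypothetical minimal clinically worthwhile effect
print("clinically worthwhile by this rule:", hi < worthwhile_hr)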

Dr. Simon: I have two questions on that. One is whether that’s a decision the data monitoring committee should take, to continue in that circumstance when that wasn’t the way the trial was designed. Second, the question is whether patients should be informed at that point that one treatment is better than the other with regard to a primary endpoint, but the degree of improvement is unclear, and would they under those circumstances wish to continue to enter into the trial.

Dr. Yusuf: I think that has to be a case-by-case decision.

Dr. Bristow: I also think it has to remain an open question. Another issue is durability of the results. We sometimes learn things about a therapy as the trial progresses that we really didn’t know at the beginning. It could remain an open issue if the therapeutic benefit or the harm might be in some way time-limited. For example, if there is a surgical trial with an operative mortality, in a sense, at day 1, the two groups are quite different. Yet, you might postulate that the durability of the surgical therapy is going to outlast its harm on day 1. It seems to me that it would be very difficult to answer this question in a rigid way. It would be, as Salim (Yusuf) said, on a case-by-case basis.

Dr. S. Ellenberg: I think it’s important to come back to the point about the original design of the study that you just mentioned, Rich (Simon). I think that if the study committee and the data monitoring board together believe that the appropriate goal is to determine whether or not a treatment is not just better, but a certain amount better than another, then that should be incorporated in the design of the monitoring plans. It wouldn’t make any sense in that kind of a situation to develop typical stopping boundaries based on the usual null hypothesis test at the 0.05 level, because they would not reflect the intent of the study.

Dr. Michael Proschan: I think it’s essential to have monitoring boundaries set up in advance, because I was involved in a trial in which there were no boundaries in advance and the co-ordinating centre said if we had used Pocock we would have just barely crossed, if we used O’Brien-Fleming we wouldn’t have, if we had used stochastic curtailing … and it became completely meaningless.
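
Proschan’s example is easy to reproduce: the same interim z-statistic can cross one set of group sequential boundaries and not another. A small sketch using the commonly cited two-sided 0.05 constants for five equally spaced looks (rounded values; the interim z-statistics are hypothetical):

```python
# Approximate two-sided 0.05 critical values for 5 equally spaced looks
# (standard published constants, rounded to two decimals).
pocock = [2.41, 2.41, 2.41, 2.41, 2.41]
obrien_fleming = [4.56, 3.23, 2.63, 2.28, 2.04]

z_interim = [1.1, 1.9, 2.5]   # hypothetical z-statistics at the first three looks

for look, z in enumerate(z_interim, start=1):
    crossed_p = abs(z) >= pocock[look - 1]
    crossed_of = abs(z) >= obrien_fleming[look - 1]
    print(f"look {look}: z={z:+.2f}  Pocock crossed={crossed_p}  O'Brien-Fleming crossed={crossed_of}")
```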

Dr. Harrington: I want to go back to a point that Susan (Ellenberg) made, which I think is important, that we would rarely find ourselves in the position of contradicting our own study design. When you have a large ongoing research programme with trials going on around the world, there are instances where the original intent of the trial can be contradicted by results that are appearing elsewhere. In fact, in a trial where we had a statistically significant result, which according to the design would dictate stopping, there were other trials that contradicted that result. I think that only highlights the fact that the monitoring committees have a very complicated business on their hands. Marc Buyse and others have raised the question of how much they should use external data. While we would like to think of these designs in the old Fisherian sense, as experiments on agricultural plots that were largely independent of what was going on in the rest of the world, they really are being carried out in an enormously complicated social context that makes their evaluation sometimes much tougher.

Dr. William Friedewald: I think it’s very important that the guidelines be set out in advance. I’ve also been involved in a study where we began to argue about what boundaries had and hadn’t been crossed. We were using two monitoring procedures simultaneously. I thought the argument, and this subsequently proved right, was not really over those bounds, but was based on the desire of some to keep the study going for other reasons. They used the statistical argument to muddy the waters. To the extent possible, I think the investigators ought to lay out the monitoring plan in advance. Secondly, my experience is that even with a carefully defined monitoring procedure, it still serves as only a guideline. But the point I wanted to get to was one that Salim (Yusuf) mentioned. He used two words, ‘real’ or ‘persuasive’. In my judgement, if the committee has decided that it’s a real finding, which means the drug really works, and that brings in the statistical and all the other issues, it seems to me the board has to seriously consider stopping at that point and not going on to be more persuasive. I feel strongly about that.

Dr. Stuart Pocock: I think one problem that Dave (Harrington) raised is interesting. If you see a difference that indicates strong evidence to stop the trial, you’re probably on a ‘random high’ in terms of the magnitude of the treatment difference. You can either incorporate this in a Bayesian analysis or you can say that if the trial continues we expect that estimate to decrease. I think that problem of overestimation in trials that stop early is important for data monitoring committees to consider.
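
Pocock’s ‘random high’ can be shown directly by simulation: if a trial stops as soon as a boundary is crossed, the estimate reported at stopping tends to overstate the true effect. A minimal sketch under an assumed true effect, assumed look sizes, and an assumed constant z ≥ 3 rule (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

true_delta = 0.2                      # assumed true standardized effect
looks = [100, 200, 300, 400, 500]     # cumulative sample size at each look
boundary_z = 3.0                      # assumed constant stopping boundary
early_estimates = []

for _ in range(20_000):
    data = rng.normal(true_delta, 1.0, looks[-1])
    for n in looks:
        est = data[:n].mean()
        z = est * np.sqrt(n)          # z-statistic for a one-sample mean, unit variance
        if z >= boundary_z:
            if n < looks[-1]:         # boundary crossed at an interim look
                early_estimates.append(est)
            break

print("true effect:", true_delta)
print("mean estimate in trials stopped at an interim look:",
      round(float(np.mean(early_estimates)), 3))   # noticeably above the true effect
```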

Dr. Tognoni: In practical terms, we are facing something which suggests a combination of approaches. On the one hand, to have formally stipulated boundaries. On the other hand, the data monitoring committee must take into account during the course of the trial what is going on outside, and interpret the boundaries flexibly as an expression of clinical relevance. The second observation is that we lack clear documentation of the real decisions taken by many monitoring committees. I think we are discussing theoretical issues, when we have so many people making decisions not based on theoretical principles but on context-dependent conditions. If we take the case of thrombolysis, which I know best, it would be interesting to see the many decisions, not only the ISIS 2 decision, which is one of the most explicit decisions taken, but those in the many other trials. For the problem of consent, I would separate the issue. It is difficult for any patient to understand the sophisticated statistical reasoning we are discussing here, where we don’t even agree. To mandate that we change the informed consent because we are introducing some difference in boundaries is based on an illusion that we can communicate our statistics to the patient.

Dr. Gent: I want to go back a bit and give three examples of how I think strategy would differ. The example I gave about a new drug against a standard was one where the new drug had to beat the standard. Another example would be comparing standard heparin against low molecular weight heparin in the prevention of deep vein thrombosis. In situations like that we have two questions: we want to know whether standard heparin is better than low molecular weight heparin, and we are equally interested in the alternative question. So to me we need a symmetric boundary. I think the strategy for the carotid endarterectomy and extracranial/intracranial bypass studies again had to be very different. Here, you have an established surgical procedure. You want to make sure that certain things actually took place. If that procedure really worked, you have to have a particular strategy. If you really thought it didn’t work you have to have a strategy that would convince surgeons who depend upon it for their career. A third example is a study in progress of digitalis in heart failure. The strategy should not be the same as for a drug which is brand new. Digitalis has been in use for a fair amount of time. There is a lot of acceptance of it. Obviously, there are questions, that’s why the study is being done. I believe that the digitalis study should not use the same combination of boundaries as for a brand-new drug.

Dr. Simon: Let me add another aspect to this discussion. Many trials have multiple endpoints and one of the things I thought might be useful to discuss is whether the trials you’re familiar with have a designated single primary endpoint. If not, should they? Wouldn’t that improve communications between the trial organizers and the data monitoring committee as to what they really want to accomplish with that trial, rather than leaving it up to the judgment of the monitoring committee, and wouldn’t that make decision making easier and actually make it more feasible to use the boundaries in a less subjective way?

Dr. Friedewald: It depends on the question. I could think of five or six diverse endpoints that might not be closely related and that might make for a nightmare of a data monitoring session. On the other hand if you had myocardial infarction and cardiac-related death, they might be legitimate primary endpoints. It seems to depend on the nature of the question, the nature of the illness, as to whether one should look at multiple endpoints.

Dr. Simon: In the 5FU-levamisole trial that Tom Fleming described, the protocol said clearly that survival was the primary endpoint. Whereas disease-free survival or time to recurrence may be related to that, the objective of that study was to determine whether there was an effect of treatment on survival. The results became statistically significant for time to recurrence earlier than they did for survival but the fact that the protocol said survival was the primary endpoint made it feasible for the monitoring committee to continue the study at that point.

Dr. Friedewald: There should be differentiation between a combined endpoint, two or three events that together would be the primary endpoint, as opposed to possible multiple different things that one would look at. But I am part of a board where, a year and a half after we convened, the study investigators still didn’t know what the primary question was. I know it sounds silly but it’s absolutely true. We kept coming together as a board and saying, what’s the primary endpoint? What are you trying to do? It’s sort of like the old example of the person who comes to the statistician and asks ‘Does the drug work?’ The statistician spends the next several hours trying to refine the question. It’s a very important exercise, and it’s also important here. To answer your question directly, it’s really important, if only for clarity of thinking, that one define a primary endpoint, which could be a combined endpoint. That doesn’t preclude monitoring many things, and if something else pops up, you may well stop. I think it is critical for purposes of design and thinking that you clearly specify a primary endpoint.

Dr. Harrington: I agree completely that whenever possible you should have a designated primary endpoint. I am sceptical about combined endpoints because I simply don’t believe that I know how to interpret them. I don’t know how to put initial response to therapy, time to disease recurrence, and survival into a pot and scale them in some way that gives me a single measure of efficacy. Perhaps that is statistical ego roaring through, but I am more comfortable looking at the several outcomes in the context of having designated one as the primary outcome. I get nervous about scaling and scoring and things which combine largely disparate measures.

Dr. Laurence Freedman: It might be uncomfortable to have to do that, but there may be situations where you absolutely have to. In prevention trials, particularly, we’re now discussing interventions such as tamoxifen or oestrogen replacement therapy in women, which may at the same time affect breast cancer, endometrial cancer, heart disease and osteoporosis. These are all primary endpoints in a way, and they have to be looked at together. I agree it is a difficult problem, but it is one we have to tackle.

Dr. Harrington: Do you have your algorithm yet for combining them?

Dr. Freedman: No, we’re going to have to work on it.

Dr. Bristow: Can I press Bill Friedewald a bit? Let’s suppose we’re going to design a trial to deal with coronary disease and we adopt a single endpoint, myocardial infarction and cardiovascular death. Suppose there is a profound reduction in non-fatal myocardial infarction or in unstable angina. Just intuitively it would seem better to plan for that sort of analysis from the beginning and say ‘this is an important part of this question, shouldn’t I be prepared to deal with it?’ rather than deal with it as a secondary issue 1, 2, or 3 years down the line.

Dr. Friedewald: I’m not sure I understand your question.

Dr. Bristow: Well, you said we had to have a single primary endpoint and the data that are really going to turn out to be significant in my hypothetical trial are not included in that primary endpoint, and yet they are something we might predict. Why not simply have two endpoints?

Dr. Friedewald: I would prefer, and I hope this is not just semantics, to have a primary endpoint and have secondary questions. Very frequently this is the case in many trials, where you have a primary endpoint that is driving the sample size and the primary analysis, but you also prespecify other hypotheses that are related. I thought you were referring to the concern that a combined endpoint ought to be based on one common mechanism. If it’s atherosclerotic coronary disease, for example, and you have multiple endpoints related to that, I don’t have as much problem with that as a combined endpoint as long as you can measure them and put them together.

Dr. Simon: I found that one good motivation for reducing the number of endpoints is when the FDA tells you you need to divide the 0.05 by the number of endpoints.

Dr. Geller: I have a terrible example. This trial is still ongoing, so I can’t reveal too much about it. It is a heart trial involving infants, and the protocol specified, as I counted them when I got involved late in the trial, nine primary endpoints. I did establish that those outcome measures were going to be evaluated at 1 year after this intervention. There was no stopping rule in this trial. Unfortunately there were some adverse events in one arm. This really did pose a problem for the data monitoring committee. I think it would be nice if you could have one single outcome measure, but I think this was an example where you simply couldn’t.

Dr. Jean-Pierre Boissel: Should the primary outcome for efficacy monitoring be the same as the primary endpoint of the trial? For instance, if you have a trial which is assessing efficacy on the rate of fatal myocardial infarction, should the monitoring committee not take total mortality as the outcome for efficacy monitoring?

Dr. Friedewald: It depends on what the question is. The LRC study used coronary heart disease, non-fatal and fatal, and came to an outcome which they said was significant, but there was no difference in total mortality. As you are well aware, some criticism of that study has been that they didn’t use total mortality. A primary question is whether every trial of this nature should use total mortality as a primary endpoint. In the LRC study, the trialists said that lowering cholesterol shouldn’t have any impact on other events and therefore they stuck to the coronary events, but you have to decide what you want to do going into the trial.

Dr. Gent: If the study group has a primary outcome for efficacy, that surely has to be the same outcome that the committee monitors for efficacy. They may well observe something else in the course of the trial. To take David’s (Bristow) example, you may find you are doing a trial where the primary outcome is ischemic stroke, MI, or vascular death. You may find in the course of the study that there is a marked difference in the incidence of unstable angina. I certainly wouldn’t want a study stopped because of that, because that to me is not of the same sort of clinical importance as the outcome that was defined. On the other hand, I was involved in a trial, on a safety committee, where we did stop the study because of another outcome. But it was a much more important outcome as it turned out. We were monitoring a study of percutaneous transluminal coronary angioplasty (PTCA) and the principal outcome was re-stenosis. We observed during the course of the study that there was a marked difference between the groups in the early ischemic events, including myocardial infarction. They were much more significant outcomes than the ones specified. Because there was a real difference in those, we recommended stopping the study and that to me seemed to be an appropriate stopping.

Dr. Armitage: In most trials there are going to be several endpoints that you really must look at, not only in the final but also in the interim analysis. To some extent, it’s a semantic question as to what you call primary and secondary and so on. On the whole, I’d rather try to avoid questions of amalgamation. I’d rather keep things separate. It is after all a multidimensional problem, and to put things in one dimension is missing a lot of possibly important information. Could I slightly move on to a different aspect of multiple endpoints? As well as what we call primary endpoints there are probably a lot of other things we want to look at. Secondary endpoints, if you will. The question often arises, do you adjust for this multiplicity in the statement of P-values and so on? I am in favour of not adjusting, but not because I don’t think it’s important. I think it’s very difficult to adjust in any sensible way for multiplicity, not only because it’s arbitrary as to how many of these endpoints you actually have under consideration, but also because they are probably always intercorrelated and it’s rather difficult to examine the effect of this intercorrelation. So, on the whole I don’t like adjusting for multiple endpoints.

Dr. Harrington: I concur with that completely. The issue of possible intercorrelation worries me when we think about adjusting, and I wonder whether we may do more harm to the investigators and their experiments than we do good in protecting the presumed consumer of the trial, the people who will use the significant results.
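
The intercorrelation point can be made concrete: when endpoints are positively correlated, the family-wise error of unadjusted testing falls well below its value under independence, so any simple correction becomes increasingly conservative. A small simulation sketch with two hypothetical correlated endpoint z-statistics under the null:

```python
import numpy as np

rng = np.random.default_rng(2)
n_sim = 200_000
z_crit = 1.959964                      # two-sided 0.05 critical value

for rho in (0.0, 0.5, 0.9):
    # two endpoint z-statistics under the null, correlated with coefficient rho
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n_sim)
    fwe = np.mean((np.abs(z) >= z_crit).any(axis=1))
    print(f"rho={rho:.1f}  family-wise error without adjustment ~ {fwe:.3f}")
```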

Dr. Yusuf: I agree with a lot of what was said about multiple endpoints. I can see that there are various situations where you do combine endpoints, for instance, where the same mechanism applies or where the two components are indistinguishable - for example, you may not be able to distinguish a stroke from a sudden death. There are other situations where there are two totally independent outcomes which may be affected in the same direction that is favourable by an agent but the mechanisms are completely different. What comes to mind are oestrogens. In primary prevention, they could prevent heart disease, osteoporosis and fractures. In designing the Women’s Health Trial, coronary heart disease was considered to be the primary endpoint. We do expect to have an effect on, let’s say, major fractures, including hip fractures, which is not inconsequential. Let’s say half-way into the trial we have an interesting trend on the primary outcome, nowhere near our monitoring boundaries, but a vastly overwhelming effect on our secondary, but clinically very important, outcome. I just want some discussion on whether it is worth formalizing monitoring boundaries for important secondary outcomes. The other issue I wanted to raise was situations, for instance, tamoxifen for primary prevention of breast cancer, where you have secondary endpoints such as CHD prevention and fractures that are also ascertained. You might find you haven’t crossed boundaries or even had persuasive P-values on any one of these components. But for some reason total mortality is quite clearly significant; it’s a more severe endpoint that seems to have produced a very clear result, but the components of mortality, that is, causes of death and non-fatal events don’t yet show that very clear trend.

Dr. Friedewald: In the women’s trial that you described, I think the data board would have to address this issue with everybody involved in the study. One could easily anticipate the outcome you described because you’d have a much higher probability of reaching something in osteoporosis than in coronary disease. You’d have to decide what was important and if you thought it was important enough to stop on fractures, you ought to confront that before starting the trial. My own feeling is you wouldn’t need to stop based on fractures alone. You’d address this in the protocol and say we may see an early positive outcome on fractures, we’re interested in this outcome, but we’re going to keep the study going based on our primary endpoint, whatever it may be.

Dr. Simon: I would like to enlarge the horizon of this discussion a bit to include multiple subsets of patients, which is another aspect of multiplicity. Generally trials are planned large enough to be able to detect a certain size difference overall for the population entered. Subset analyses, even at the end of the trial, are done with poor statistical power for detecting real differences but inflated type I errors for finding spurious effects. There are many examples in which we wind up with results that are not confirmable in other trials. When you then think of doing subset analyses at interim points and multiply the number of subsets by the number of interim analyses you are doing, I wonder whether we’re not to some extent kidding ourselves when we think that a data monitoring committee can effectively distinguish real subset effects at an interim point from spurious fluctuations in the data.
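
A rough back-of-the-envelope version of Simon’s arithmetic, treating each subset-by-look test as an approximately independent 0.05-level test (positive correlation across looks pulls the true figure down somewhat, but the order of magnitude is the point):

```python
# Chance of at least one nominally "significant" subset finding under the null,
# under an independence approximation across subsets and interim looks.
alpha = 0.05
for n_subsets in (5, 10):
    for n_looks in (1, 3, 5):
        k = n_subsets * n_looks
        p_any = 1 - (1 - alpha) ** k
        print(f"{n_subsets} subsets x {n_looks} looks: "
              f"P(at least one spurious finding) ~ {p_any:.2f}")
```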

Dr. Bristow: Surely, we always ought to try to identify subsets ahead of time and plan statistically for dealing with them, but you can’t always. It seems to me that one of my primary responsibilities on such boards is to be looking for a subset that is surfacing that was previously either undefined or unrecognized to have an important relationship to the treatment. The difference that one would have to identify would be quite a bit larger, I should think, than for a straightforward primary endpoint analysis. I think back to the Veterans Administration co-operative trial in coronary surgery with the left main coronary disease issue. It was not planned for ahead of time, it was an amazingly large difference and it radically changed the practice of cardiology in the United States. It would have been unfortunate had they not identified the subset that was not planned for. On the other hand, it was brilliant that they did, and influential, perhaps even beyond the rest of the trial. Again, it seems to me that we have to be watching for subsets. If we think there is a difference, it ought to be quite a bit larger than what is tested for as a primary endpoint. We can’t be oblivious to that as a responsibility for data monitoring boards.

Dr. Armitage: Picking out subsets is difficult even in a final analysis, and it’s even more difficult in an interim analysis where the data are smaller. I would be unhappy about trying to identify a subgroup merely because there was a significant difference there and not in other subgroups. You’d need very strong evidence of an overall interaction before you actually took any steps on that. There are situations where some groups are identified in the protocol and it is quite reasonable to look at those separately and perhaps take separate action. In the ETDRS (Early Treatment Diabetic Retinopathy Study) done by the Eye Institute, there are effectively separate protocols for different subgroups of patients. A relatively early decision was taken on one of these subsets, patients with macular oedema, because focal treatment had a clear effect on reducing visual loss there. Another example is the ACTG 019 study, which was designed to look separately at patients with less than or greater than a certain CD4 count. The result, which was eventually publicized, related to the patients who had low CD4 counts. The British group doing a similar study took note of this and suggested that patients with consistently low CD4 counts be considered for AZT treatment and taken out of the trial.

Dr. Walker: That exact thing happened about a year ago in the carotid study. The investigators had the good sense to have two distinct groups, severe stenosis, and that was 70-99 per cent, versus moderate stenosis, which was 30-69 per cent, all patients being measured by a common neuroradiologic group. At the monitoring committee last year, it was clear that the patients who had severe stenosis were benefiting from surgery. So the effect became apparent very quickly, was very robust, and held up under a whole series of subset analyses. There was really no choice, as far as the clinicians were concerned or as far as we were concerned: that part of the study had to be opened up. It was. It resulted in a decision by the Institute director that a clinical alert was appropriate, because it affected so many patients. And the rest of the study goes on. This was mentioned yesterday as a parallel finding with our British colleagues, who within about the same week ran into exactly the same thing and made essentially the same decision. The impact of that on the rest of the study has been interesting, because immediately after opening it up, everybody generalized that if it’s good for the severes it probably would be relatively good for the moderates and therefore maybe we could generalize down a bit into the moderates and start to operate on some of them. It had an impact on accrual, which has been gradually recovering over the course of the year. So I think the second part of the answer we’ll be able to get in a reasonable period of time. But it has a real impact on the management of the rest of the study whenever you do have a subset like that.

Dr. Simon: The problem with subsets though, Mike (Walker), is that they’re always striking and the question is whether they are real. You never really know unless you have a confirmatory study.

Dr. Walker: Which we did with the British study and that was just by sheer good luck.

Dr. Pocock: I agree with Peter’s (Armitage) comment about caution in making decisions about subgroups of patients, particularly on the efficacy side. However, with evidence of harm by a new treatment in a subgroup one may be less cautious. One example is the PACK trial for treatment of intermittent claudication. It was found that patients on ketanserin who were also treated with a particular type of diuretic seemed to experience an excess of sudden deaths. Even though one might not reach the same stringent criteria for significance normally required on the efficacy side, it was felt there was an ethical need to stop the ketanserin treatment in that subgroup of patients.

Dr. Gent: I’m not a believer in doing subgroup analyses for many reasons. But there are times when I think we have to do them, not so much to show differences, but to show consistency. We’re about to launch a study where we have three clearly defined subsets of patients, and part of our charge in the final analysis will be to show consistency in the findings. This is a study of a new anti-platelet agent, and again the outcome is ischemic stroke, myocardial infarction, or vascular death. Conventionally, these studies have been done in patients with ischemic cerebrovascular disease, or patients with myocardial infarction. We propose in this study to examine a broader spectrum of patients who are at risk of ischemic stroke, myocardial infarction, and vascular death. We are deliberately including patients such that about a third will have a recent ischemic stroke, about a third will have recent myocardial infarction, and a third will have symptomatic peripheral arterial disease. We’re going to have to show that the findings are consistent across these three different groups for the overall concept to be acceptable.

Dr. Simon: Let me shift a little bit. One issue that came up earlier in the meeting but did not get any extensive discussion was increasing the target sample size in an ongoing trial. Sometimes it’s proposed to increase the target sample size because compliance is poorer than expected, or the event rate in the control group is lower than expected, or there was a change in the primary endpoint, or a reduction in the size of difference to be detected, or the desire to analyse a subset separately. Under what conditions do you think this is really appropriate? If I haven’t forgotten some of my elementary statistics, from a frequentist’s viewpoint many of these things would be strictly forbidden.

Dr. Harrington: Let me play devil’s advocate and answer the other side of this question and mention situations where it is acceptable to do it and perhaps unsound not to. One interesting thing about violations of statistical principle, I think, is that there probably isn’t a statistician in the room who has never extended a trial beyond its original design phase. Let me get the good things out of the way, initially, that we’re supposed to say. Any reasons that might cause extension of a trial that could be anticipated in advance of a trial, such as low event rates in the control group, or poor compliance issues, should be mentioned in the protocol. There should be at least some guidance, if not a specification, of what the effect would be on the ultimate power of the design if those things start to fall apart, or of ways to fix those by extending the sample size. The reason I do not take a hard line - although if you’re not careful you can find yourself in an indefensible position about extending sample sizes - is that I view the clinical trials that I work with in two ways. They are on the one hand carefully designed experiments to prove or disprove a hypothesis. On the other hand, they are a laboratory in which in large measure we don’t know what is going on. So, we often learn things in the course of a trial that cause us to revise, sometimes radically, our whole sense of the science of a given situation. It’s especially true in some of the cancer trials that I’ve worked with, where the agents being studied are biologic agents, and the whole way in which one measures efficacy there can change. Making a pronouncement that one should never extend sample sizes makes me nervous because many of the trials mentioned here are large expensive trials that a government funding agency or pharmaceutical house and a scientific investigator have put a great deal of time into starting. Many of those trials will never be replicated. Sometimes one can’t even afford to confirm them. I wrestle very hard with the decision about whether to extend the trial. There are situations where even in the face of unanticipated things - compliance rates lower than anticipated when investigators told me that compliance would not be a problem, or event rates in a control group much lower than anticipated even when I was assured that the control group would behave in such and such a way - where I will consider extending a trial. The one thing I stay away from is extending a trial to make sure that the sample size is large enough that the difference that you are seeing now becomes significant. I think that’s probably the one place we can all agree it’s very dangerous to extend.

Dr. Armitage: If there is a good case on scientific and medical grounds for having a longer trial, you certainly ought to do it, and you shouldn’t really parade your statistical conscience too obviously. It’s really a minor point. I doubt whether there is really a statistical problem. It depends what attitude you take to inference. If you are a Bayesian then you don’t worry at all about this. Even if you believe in P-values adjusted for sequential stopping rules and so on, I think you still don’t worry, because merely extending the boundaries to the right, as it were, makes no difference to the P-value for any particular point so far achieved. That is, the P-value is essentially the probability of hitting the boundary to the left of where you’ve got to, and so that’s not affected by extending the boundaries to the right.

Dr. Simon: Dr. Armitage, what if you say, I have a trend here that’s not significant, I’ve reached my target accrual, therefore I want to extend the sample size somewhat and calculate a P-value again. The type I error for the whole experiment presumably would be in excess of 0.05.

Dr. Armitage: The type I error for the whole experiment is certainly affected. I was talking about the P-value for a particular observed result.
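
Simon’s concern can be quantified by simulation: if a trial that just misses significance at its planned size is allowed to add patients and test again at the nominal level, the overall type I error exceeds 0.05. A minimal sketch under the null, with hypothetical sample sizes and a hypothetical ‘extend on a trend’ rule:

```python
import numpy as np

rng = np.random.default_rng(3)
n_sim = 200_000
n_planned, n_extra = 400, 200          # hypothetical planned size and extension
z_crit = 1.959964                      # nominal two-sided 0.05 critical value

z1 = rng.standard_normal(n_sim)        # z-statistic at the planned analysis (null true)
w1 = np.sqrt(n_planned / (n_planned + n_extra))
w2 = np.sqrt(n_extra / (n_planned + n_extra))
z2 = w1 * z1 + w2 * rng.standard_normal(n_sim)   # z after adding the extra patients

reject_planned = np.abs(z1) >= z_crit
# data-driven rule: extend only when there is a nonsignificant "trend"
extend = (~reject_planned) & (np.abs(z1) >= 1.0)
reject_after_extension = extend & (np.abs(z2) >= z_crit)

print("overall type I error:",
      round(float(np.mean(reject_planned | reject_after_extension)), 4))   # above 0.05
```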

Dr. Pocock: I’m thinking of a way here of keeping the statistical purist happy (if he exists!). When you are thinking of extending a trial you have two choices: either to continue the existing trial, or to start a new confirmatory trial. With the second option you can combine the results in a meta-analysis. So, couldn’t you do the same with the first option, by treating the first half of the current trial as one entity, and the second new set of patients as another entity? Meta-analysis is okay, we’re told, especially if they had exactly the same protocol and you combine them into a single P-value. Is there anything wrong with this?

Dr. Harrington: In practice we do something quite similar to that, at least in spirit. When we make a major design change in ECOG we usually do a stratified analysis, which stratifies on the time at which the change was made. It’s not quite as fancy as a meta-analysis. Those were two independent studies, but it somewhat reduces the problem.
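
Both suggestions amount to combining two independent comparisons rather than pooling the raw data as if the design had never changed. A minimal sketch of a fixed-effect (inverse-variance) combination of two stage-specific estimates, with hypothetical numbers:

```python
import math

# Hypothetical stage-specific estimates (e.g. log hazard ratios) and standard errors.
stages = [(-0.25, 0.15),     # patients accrued under the original design
          (-0.10, 0.20)]     # patients accrued after the extension

weights = [1 / se**2 for _, se in stages]                      # inverse-variance weights
combined = sum(w * est for (est, _), w in zip(stages, weights)) / sum(weights)
combined_se = math.sqrt(1 / sum(weights))
z = combined / combined_se
p_two_sided = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"combined estimate {combined:.3f}  SE {combined_se:.3f}  z {z:.2f}  p {p_two_sided:.3f}")
```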

Dr. Friedewald: It seems to me it’s seldom the issue that the control group didn’t behave the way you wanted it to. That’s usually the excuse, because you aren’t seeing the therapeutic difference you anticipated. When you said you should never use that as the reason to extend, I would suggest that it’s always the reason you extend. You use some other explanation for why you aren’t seeing it, but the reason you’re doing it is because you’re not seeing the effect. I didn’t understand your comment about why you would never extend for that reason.

Dr. Harrington: I was getting at a point that Rich raised earlier. I most worry about the type I error of a study when observational data in the middle of a study cause an incremental redesign of the study, and then all of the study data are pooled as if it had been a single design at the outset. When I talked about rates in the control group, I meant situations where you weren’t seeing enough events overall to get the right precision.

Dr. Joseph Pater: I’d like to return to a question, which I don’t think was answered, that is of immediate practical interest to me: whether it’s appropriate to show relative efficacy data to a monitoring committee more frequently than specified in terms of formal interim analyses, and if so, do you have to pay some form of statistical penalty for having done it?

Dr. Simon: First of all, in many cancer groups, it is common practice for the monitoring committee to see relative efficacy data at times other than the official interim analysis times. Is that common practice in other areas?

Dr. Bristow: No, at least in the trials that I’ve dealt with it’s a very serious matter to decide to do an additional analysis that is unplanned, and a plan is made to pay a penalty for it. We’ve debated at length whether to take a peek, so to speak, because we’re going to use up the alpha.

Dr. Simon: Let me make sure I correctly understand what you just said. If the monitoring committee meets every 6 months in a cardiovascular trial, in many of those meetings you will not see relative efficacy data. Is that correct?

Dr. Bristow: First of all, I can’t speak for all cardiovascular trials by any means, but in the trials with which I’ve been concerned, there is a plan for a certain number of looks at certain specified times, and if we begin to see changes occurring that worry us then we’ll debate whether to take a premature look, so to speak, an additional look, and then we’re asked to pay a penalty.

Dr. Harrington: Where I work, there is an interesting distinction between a monitoring committee and the study statistician. Even in the cancer groups that have a slightly less structured setting than NHLBI, one can say that we present formal interim analyses to a monitoring committee only at the times at which they were preplanned. However, the study statistician, seeing an emerging difference, will look almost daily, and I think that’s not always an exaggeration, because you are potentially sitting on a time bomb. I think the difficult issue for the study statistician, if they were to notice a severe change, probably would be to call the monitoring committee. So, in fact, the monitoring committee is looking at those data fairly frequently, but just doesn’t know it. They’re having their time screened for them by a very confident, very conscientious statistician. I don’t know the answer to Rich’s (Simon) question about whether one should pay for all those analyses in a statistical sense. I know that we have discussions that sound rather silly: one of our working statisticians will come in to me and say ‘Do you remember the analysis of that melanoma study back in January?’ I’ll say yes. ‘Did that count?’ And they will be very serious about wondering whether the subsequent P-values should be adjusted for that. I share Peter’s (Armitage) view on this. I don’t worry a lot about it, because my sense is that the stopping guidelines, especially the very conservative ones, are largely guarding against too much disruption in the type I error of the experiment, even with those unplanned interim looks. But I don’t think I have the methodology to really defend that. Other people in the room may.

Dr. Bristow: But, if I came to you with a request in a trial that had planned comparisons every 6 or 12 months - let’s say planned and rigidly spelled out - and I said ‘I’d like a monthly one sent to me individually. I won’t tell the rest of the board, but I don’t sleep well and I really would like to know. Could you send it to me?’ Do we pay a price for that or not?

Dr. Simon: I think there is a theory that if your monitoring boundary is extreme enough, you can take many looks and it really doesn’t have much influence over your type I error. So, maybe that’s a good reason for using such boundaries.

Dr. Harrington: The point I was trying to make is that the statistician plays a critical role here. Not asking the statistician to give you those monthly looks doesn’t mean the statistician isn’t looking monthly. The monitoring board’s dilemma is not necessarily resolved in that setting.

Dr. Geller: I think Rich (Simon) made the point I was going to make. If you make the monitoring boundaries extremely stringent, like a z value of 4 at the beginning of the trial, and then you move it down to 3, you can pretty much analyse as much as you want. That’s one way that data monitoring committees could look at efficacy every 6 months, and that’s actually being implemented in the digitalis trial right now.
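
A minimal simulation sketch illustrates why such stringent boundaries spend very little type I error even with frequent looks; the 24 monthly looks and the 4-then-3 boundary below are assumptions chosen purely for illustration, not the design of any trial mentioned here.

    import numpy as np

    # Sketch: a z-statistic monitored at 24 equally spaced looks, rejecting only
    # if |z| >= 4 in the first half of the trial or |z| >= 3 thereafter.  Under
    # the null the interim z-statistics behave like standardized Brownian motion.
    rng = np.random.default_rng(0)
    n_sims, n_looks = 100_000, 24
    t = np.arange(1, n_looks + 1) / n_looks            # information fractions
    boundary = np.where(t <= 0.5, 4.0, 3.0)            # stringent early, 3 later

    increments = rng.normal(size=(n_sims, n_looks)) * np.sqrt(1.0 / n_looks)
    z = np.cumsum(increments, axis=1) / np.sqrt(t)     # z-statistic at each look

    crossed = (np.abs(z) >= boundary).any(axis=1)
    print(f"empirical two-sided type I error: {crossed.mean():.4f}")
    # Comes out well below 0.05 despite 24 looks, which is the sense in which
    # extreme boundaries make additional looks almost cost free.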

Dr. Simon: For a point of clarification, in the NHLBI trials, basically the situation is that relative efficacy data are only presented to data monitoring committees at times of planned analysis and those numbers of analyses are accounted for in the monitoring boundaries.

Dr. Geller: I think so. I think there are others in this room who are much more experienced in this than I.

Dr. Wittes: In every NHLBI trial with which I am familiar, the co-ordinating centre presents efficacy data at each monitoring board meeting even if there are no prespecified monitoring boundaries or if the meeting is taking place at a time other than the prespecified ‘looks’.

Dr. Armitage: Could I just follow up on David’s (Harrington) remark? When I mentioned the retrospective, rather sloppy sort of analysis for the MRC vitamin study, in fact, we decided it was appropriate then to use a continuous sequential boundary on precisely the grounds mentioned. We knew perfectly well, if anything had been happening, suddenly, in between our scheduled meetings, we should have known about that.

Dr. Gent: A safety and efficacy monitoring committee should look at the data for efficacy on two grounds. One, the prespecified formal interim analyses of efficacy defined beforehand. The other time they should look at efficacy data is when they are developing a point of concern about differences in adverse effects between two treatments, and they need to see whether there is any net harm or net benefit. That’s not a formal analysis of efficacy, but the assessment of net benefit or net harm to the patient. I can’t see how looking at that second one has any effect at all on P-values.

Dr. Yusuf: I want to raise an issue related to extending trials, the issue of open-ended trials. Often when you design a trial, you have only a certain amount of resources. You think recruitment is only going to run at 100 or 200 a month. As you run the trial, you may find it’s easier to recruit than you expected it to be. Obviously, in your original sample size calculations you say ‘If I recruited 5000 I can pick up X per cent difference at such and such a level.’ But you may find you really are able to get 10,000 people, and that would have enabled you to pick up a smaller difference, which is still considered clinically important. From a medical point of view, that is desirable, unless you reach overwhelming results early. I see no problem with having fairly extreme boundaries for monitoring for early termination, which protect your type I error in a trial that is open-ended. And you go until you run out of funds, or the investigators collapse from exhaustion or boredom. I think many trials are designed like that. For instance, the ISIS trials follow essentially that concept. There were three trials that Marc Buyse put up yesterday, when he said open-ended trials are a real entity.

Dr. Freedman: Perhaps this whole discussion will support the principle, which I hope will become accepted in time, that looking at the data cannot destroy information. Aside from that, I have a completely different question. I think that many of us on data monitoring committees have been in a situation where new results have arisen while our own trial is going on, and these results have been published or are otherwise extremely reliable, and they refer to exactly the question we are trying to answer. Do the panel members or the audience members have any views on whether such information should be formally incorporated into the future stopping rules? And if so, how should that be done?

Dr. Harrington: From my own experience, I’ve never been involved in a trial that wasn’t closed when extremely reliable results from another study were published. Partly from the scientific perspective and also from the practical perspective, once the results from a reliable trial are publicly available it becomes very difficult to continue a trial with a very similar design, in the face of public pressure, even if there is a very legitimate disagreement among scientists.

Dr. Gent: That is not a decision for the safety and efficacy monitoring committee. That’s really a decision of the study group. We ourselves have closed studies down for that very reason.

Dr. Fleming: I want to just backstep to the issue of flexibility. I think the committee has very clearly and appropriately made the point that designs that are in place are really guidelines to be used, and not rules. There is some very important work that’s been done by Gordon Lan and Dave DeMets and by Stuart Pocock. Dave DeMets and Gordon Lan have done research that shows the considerable flexibility one has in implementing these types of designs. Essentially, a use function is put into place. These guidelines are flexible, as Lan and DeMets have shown, in the way they are implemented. One has the flexibility to modify the frequency of looking at the data. I’m also greatly influenced by Stuart Pocock’s work, which reveals that only a small gain in efficiency is obtained by looking more often than three or four times during the trial. Obviously, there are ethical issues and a lot of complicated considerations that could motivate one to look more often. I don’t object to that, but we probably gain less than we think by continually eyeing the efficacy results. Another reason for reducing the frequency of looks is that any time you consider early trial termination, you need accurate and complete information.
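
For readers who want the mechanics, a sketch of a Lan-DeMets use (alpha spending) function of the O’Brien-Fleming type follows; the information fractions are hypothetical, and the interim boundaries themselves would still have to be computed from the joint distribution of the interim statistics, which is not shown here.

    from scipy.stats import norm

    def obrien_fleming_spend(t, alpha=0.05):
        """O'Brien-Fleming-type spending function (two-sided), as in Lan & DeMets."""
        return 2.0 * (1.0 - norm.cdf(norm.ppf(1.0 - alpha / 2.0) / t ** 0.5))

    # Hypothetical looks at whatever information fractions actually occur; the
    # alpha 'spent' at each look is the increment of the spending function, so
    # neither the number nor the timing of looks needs to be fixed in advance.
    spent = 0.0
    for t in [0.20, 0.45, 0.70, 1.00]:
        total = obrien_fleming_spend(t)
        print(f"t = {t:.2f}: cumulative alpha = {total:.5f}, this look = {total - spent:.5f}")
        spent = total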

Dr. Simon: I know we’ve a couple of discussions going on here simultaneously. Salim, I wanted to return to your point. My own reaction to the idea of open-endedness was negative. At least in the cancer field, we’ve had misleading results that resulted from informal multiple repeated analyses of clinical trials. There has also been too much informality in terms of target sample sizes and initiating trials where you’re not really sure whether you are going to be able to accrue enough patients in a reasonable period of time. So, I tend to lean towards a more formal approach to it. Even though technically one could handle it from a theoretical perspective, I believe that there are problems with an open-ended type of approach.

Dr. Yusuf: In two scenarios that I’m aware of, this has been done much more formally. For instance, in the ISIS group, the data are not seen by the investigators, and the investigators go as long as feasible given funds and given an accrual of patients. So, the decision to go on, as far as they can, is not data dependent; that takes away some of the concerns. The other is the CLASP study of aspirin in women with pre-eclampsia, being conducted in Britain. The protocol did say up front that, depending on the rate of recruitment and whether 5000 or 10,000 patients were recruited, the study would answer specific questions. The investigators stated that they were uncertain of the recruitment rate. Again, the decision to extend is made by the investigators, independent of what the difference in the trial is or what the event rates are.

Dr. Simon: When you have a viable study and you find that you can substantially increase the sample size within a reasonable period of time, then I don’t have any basic problem with that. I do have a problem with people starting their own study with undocumented accrual potential when they really ought to be doing a co-operative study with somebody else.

Dr. Herson: There have been several things said in this discussion about looking at the data in different ways and then deciding what has been destroyed by doing so, and how to make up for it at the end. I wonder if a way out of this is to regard the purists not as people who worry about P-values but as those who worry about the likelihood function. Take the case of Professor Armitage’s examples, the first one he gave this morning, where people tested several times and then retrospectively went back and said ‘If we had applied an O’Brien-Fleming rule, this is where we’d be.’ I have been there several times too, and the approach I have taken is precisely the same as Armitage’s, but this approach has never been received with tremendous enthusiasm. What I always fall back on is the principle that the likelihood function does not depend on how many times you look at the data, so if things fall apart, we can always go back to the likelihood function and say that, regardless of how many times I looked at the data, there is a lot of evidence against the null hypothesis. But when you get into adjusting the sample size based on a look at the data that was not planned, and the sample size suddenly becomes larger, I guess I’d go back to first principles again, and ask whether I can write the likelihood for this. I don’t think I can. It seems to me that we can’t just multiply the likelihoods together. They are not independent of one another. There is probably some kind of messy indicator function in the middle. When I get into a situation like that I become a purist and say ‘I don’t see any principle behind this, so we shouldn’t allow it.’
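
The principle being invoked here can be written out as a standard likelihood argument (the notation is illustrative, not taken from the discussion): if the observations are independent with density f_theta and the decision to continue after each look depends only on the data seen so far, then

    L(\theta \mid x_1, \ldots, x_N)
      \;\propto\; \prod_{i=1}^{N} f_\theta(x_i)
      \;\times\; \prod_{k} \mathbf{1}\{\text{continued past look } k\}(x_1, \ldots, x_{n_k})
      \;\propto\; \prod_{i=1}^{N} f_\theta(x_i).

Because the continuation indicators depend on the data alone and not on theta, they cancel from every likelihood ratio; in this sense the weight of evidence does not depend on how many times the accumulating data were examined.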

Dr. Sylvan Green: Returning to the issue of stopping a trial early, there are really two issues a data monitoring committee confronts: when to stop accrual and when to announce the result. I think there are ways for the data monitoring committee to have multiple looks on the question of when to stop accrual that won’t impact on the analysis.

Dr. Simon: Has anybody ever done that? Is that practical for a monitoring committee to say we are going to terminate accrual early, but we are not going to release the results until after additional follow-up?

Dr. Harrington: It has been done, but I support Rich’s (Simon) implication that it is exceptionally hard to do in practice without essentially telegraphing what’s going on in the monitoring committee. We do it routinely in early phase studies, where we stop early and suspend and re-evaluate and perhaps reopen accrual. We have found, at least in the cancer groups, much to my dismay, that it is not practically possible to suspend accrual in a randomized trial for a little while, possibly reopening it later if the early interim results, which suggested that there was going to be a difference, don’t pan out.

Dr. Simon: I think that’s something we should bear in mind. You may prematurely stop accrual because of something you see, but it may not be as much evidence as you want for publicly releasing that and so those are two different decisions - the decision to stop accrual, and the decision to release the information.

Dr. Mitchell Gail: Larry Rubinstein and I have done work on this issue. The approach can be considered for certain trials in which the endpoints are non-lethal or in which the treatments must take place near the point of randomization. For example, consider a trial comparing adjuvant treatment of patients with resected lung cancer with immunotherapy (BCG) versus treatment with placebo, given at the time of thoracotomy. At a later point in the trial, when the decision to stop accrual is entertained, it is no longer feasible to give initial BCG to the control arm, and it is not clear that delayed use of BCG would be beneficial, even if there are indications that initial use of BCG is beneficial. In this circumstance, one can stop accrual and wait for additional information to come in as events accrue until one has achieved a prespecified degree of precision on the estimated treatment effect. Then one can publish the results, and simple methods of analysis for fixed sample sizes are appropriate. We used this approach successfully in several trials of lung cancer in which adjuvant chemotherapies were given shortly after randomization. This design would not be appropriate for a chronically administered treatment.
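
A rough numerical sketch of the ‘prespecified degree of precision’ idea (illustrative numbers only, using the common approximation that the standard error of a log hazard ratio under 1:1 randomization is about 2 divided by the square root of the total number of events) shows how the waiting period after accrual stops could be defined.

    import math

    def events_for_precision(target_se):
        """Approximate events needed so that se(log hazard ratio) <= target_se,
        using the common 1:1-randomization approximation se ~ 2 / sqrt(d)."""
        return math.ceil((2.0 / target_se) ** 2)

    # Hypothetical target: se(log HR) <= 0.15, i.e. a 95% CI for the hazard
    # ratio of roughly a factor of exp(1.96 * 0.15) ~ 1.34 on either side.
    d = events_for_precision(0.15)
    print(f"continue follow-up until about {d} events, then do a fixed-sample analysis")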

Dr. Edmund Gehan: I’d like to put a practical issue on the table that may not have had enough attention until now. I’ve read many cancer therapy protocols and the section on objectives almost always has a statement such as ‘we’re interested in finding out whether treatment A is better than treatment B.’ A primary endpoint for the evaluation of therapies may have been chosen. However, the statement of sample size or the difference in outcomes to be detected between therapies is nearly always relegated to the section on statistical considerations. In some draft versions of protocols, this section may be left blank or have a statement such as ‘to be filled in by statistician’. When the section has finally been completed, usually by the statistician, the target difference to be detected between treatments is based on the available patient resources over the period of the trial. Much of the discussion today has concerned fairly precise specifications of null and alternative hypotheses. Yet, I think that the specification of the alternative hypothesis is not often based on the ‘real objective’, but rather the kinds of difference that can be detected with available patient resources. Often, clinicians may really be interested in small to moderate differences between treatments, whereas the available sample size may be appropriate only for detecting much larger differences. The choice of alternative hypothesis also has an effect on the monitoring boundaries of the clinical trial and issues related to multiple looks at the data. There would be less of a problem, as I thought that Laurence Freedman was going to say, if a Bayesian approach was taken to analysing the data, since it would be simpler to calculate posterior estimates of differences between treatments based on the real objectives in the trial. Perhaps too much attention and too many of the statistical characteristics of clinical trials depend on the specific statement of null and alternative hypotheses that appears in the protocol, when these statements might not be the ‘real objectives’.
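
The point about detectable differences being driven by available patients is easy to quantify with the usual normal-approximation formula for comparing two proportions; the accrual figure and control event rate below are hypothetical, not drawn from any protocol.

    from math import sqrt
    from scipy.stats import norm

    def detectable_difference(n_per_arm, p_control, alpha=0.05, power=0.80):
        """Smallest absolute difference in proportions detectable with the given
        power, by a crude normal approximation with variance taken at p_control."""
        z_a, z_b = norm.ppf(1.0 - alpha / 2.0), norm.ppf(power)
        return (z_a + z_b) * sqrt(2.0 * p_control * (1.0 - p_control) / n_per_arm)

    # Hypothetical resources: 200 patients per arm, 40% event rate on control.
    delta = detectable_difference(200, 0.40)
    print(f"detectable absolute difference: {delta:.3f}")
    # About 0.14, i.e. 14 percentage points -- far larger than the small to
    # moderate differences clinicians often say they care about.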

Dr. Simon: For major multicentre cancer clinical trials, I think that happens much less than it used to. The targets for such trials are driven somewhat by practicality but also by what’s important to detect.

Dr. Bristow: In the cardiovascular world clinicians are beginning to look a lot harder at the distinction between a statistically significant difference and a clinically significant difference. A 50 per cent reduction in a very low event rate has less meaning than it used to have in the clinical world. I think you are going to encounter on data boards a more rigorous, more searching attitude on the part of clinicians when we define an important difference.

Dr. Buyse: To continue on this issue of sample size, I have to take issue with a statement made earlier that a lot of confusion had been created in cancer because of sloppy trials and trials that had not been conducted properly. A lot of the confusion was due to insufficient sample sizes much more than to sloppy designs. The other thing I disagree with is that some people tend to equate open-ended trials with sloppy trials. It’s much more useful to look at open-ended trials in the way that David Harrington described, with two views of a clinical trial. One is the scientific experiment, in which you want to test the hypothesis under prespecified conditions of power and confidence. The other is the strictly clinical view, in which the trial is just a tool for the clinicians to do something when they have absolutely no idea what to do with their patients. It seems to me that this latter viewpoint provides the rationale for open-ended trials.

Dr. Pocock: I think there hasn’t been enough discussion of what happens when the pressure is really on, after the data monitoring committee has reached the point where the data it sees merit stopping the trial. I want to ask two questions. First, the data monitoring committee is usually faced with the problem that the data it’s looking at are not 100 per cent up to date. How realistic is it to do a very rapid update of the data, before we actually decide to stop, so that their quality is 100 per cent checked and their quantity is as big as possible? The second point concerns the role of the data monitoring committee, once that decision has been taken. Is its job over? I’m disappointed in some trials if that’s the case, because they have more experience of understanding the data than anybody else. I’m not sure we have realistically laid down principles yet as to how the data monitoring committee should interface with the steering committee and others in the eventual release of information publicly.

Dr. Harrington: They are both difficult and important issues. The second one you raised is the easier one for me. I favour a rather circumscribed role for the data monitoring committee in publication and presentation of the results. I worry that, while the data monitoring committee has reached a level of expertise with that study in the process of monitoring, giving it the opportunity to influence the scientific presentation of those results removes an important part of the responsibility of the original investigator for that study. I regard these studies as a pact between the funding agency, which represents the public, and the scientist who has launched the study, and I feel rather strongly that it is that scientist’s responsibility to put the information learned from that trial into the public domain. I will be the first to admit they aren’t always the best at doing it, nor are they always the best qualified. But the sponsor, having awarded them the opportunity to do that trial, must saddle them with the responsibility to report it. On the first issue there are two things we struggle with. One is the practicality of doing a quick update in a multicentre trial that may be international. The other is the possibility of bias when you put out a call for a rapid review of records. You hear about the relapses but not the responders, because you have a number of responders who haven’t been out long enough to verify the response.

Dr. Simon: Your first point is a very common and important issue that confronts data monitoring committees. I’ve seen many situations where the most important thing to do was to make this call to bring patient follow-up up to date before making any decisions.

Dr. Gent: The way to keep the data up to date without alarming the investigators is to have an efficient way of running the study. It’s very straightforward to collect key data. Most key data come in on case report forms. In all of our trials, and we have several going now, we have a system where we get key information from every centre every week. We have a particular form that has to be faxed back to us, that contains patients randomized that week, patients who’ve had follow-up assessments that week, events that have occurred that week, etc. That works on a weekly basis, so that part of our database is complete. What you do is establish a fast-track database that gets pulled into the main one to be fed to the external safety and efficacy monitoring committee at the regular period, so that what you get to them is your best data. It may well be that some of the outcome events, if you have a validation committee, may not yet be validated, but you are still giving your best data. There are no other data that you can give them at that point in time.

Dr. Walker: That exact question was faced by the SPAF (Stroke Prevention in Atrial Fibrillation) study a couple of years ago, in which the statistician in his routine look noted that increased cardiovascular events were being seen in the placebo group. He alerted the chairman of the monitoring committee. It took 6 weeks to get everybody from their busy schedules to meet all together at the right place and the right time. That provided the time for him to do this national sweep that you are talking about. During the course of that time, they picked up a whole series of additional endpoints, from the cardiovascular point of view as well as the stroke point of view, which made it abundantly clear that the placebo arm had to be stopped.

Dr. Hawkins: I’d like to agree with Dr. Gent’s comment that how you run a study has a big impact on how up to date and complete your data are. I’d like to comment on Dr. Pocock’s other point regarding the role of the data monitoring committee after the decision is made. I can think of a specific trial carried out among premature babies in which toxic results were found, and yet a year and a half later the manuscript was still being written. In many cases I think there is a responsibility for the data monitoring committee to see that the results get into the published literature, that they don’t just sit around because the investigators are surprised or unhappy with the findings. If the investigators have reservations about the decision of the committee, the rationale for the decision may require discussion with the investigative group. I do not believe the job ends just because you’ve reached a decision. I think the data monitoring committee’s responsibilities to future patients extend to seeing that information from the trial is made known.

Dr. Green: Tom Fleming made a very important point this morning, that the data monitoring committee should support the design of the trial and be comfortable with it. Therefore, if there is going to be a totally separate, independent data monitoring committee, it should be involved right at the beginning, when the protocol is being written. Many of the things we’ve talked about can best be set up if the data monitoring committee plays a role at the beginning of the study and helps work on the design of the protocol.

Dr. Maureen Myers: All the trials that I’ve been involved in where DSMBs recommended early termination, certainly in AIDS, did involve that last sweep for data. The currency of the data, the quality of the data, and the speed with which that information is going to be disseminated, sometimes through the lay press, are very critical, and I think we need to look at that some more.