BEHAVIOR THERAPY 15, 550-556 (1984)

Sample Size in Clinical Outcome Research: The Case of Behavioral Weight Control

RENA R. WING
University of Pittsburgh School of Medicine

ROBERT W. JEFFERY
University of Minnesota School of Public Health

This paper develops the thesis that the sample sizes which are commonly used in clinical outcome research are not sufficient to detect meaningful differences between treatments. Behavioral weight control is used to exemplify this problem. The sample sizes needed to statistically detect a difference between treatment conditions of 5, 10, and 15 pounds have been computed based on the attrition and the variability of treatment effects reported in the literature. It is demonstrated that sample sizes used in behavioral weight control studies are usually too small to detect any but the largest differences between conditions. With usual sample sizes, a 10-pound difference between conditions at the end of treatment and a 15-pound difference at follow-up (effect size of 1.2-1.3) would be required to assure statistical significance. Recommendations are made for (a) greater attention to sample size calculation in study design, (b) attempts to reduce between-subject variability, and (c) consideration of relaxing standard criteria for statistical significance in exploratory studies.

Clinical outcome research has stressed the importance of avoiding Type I errors, that is, concluding that there is a difference between two treatments when, in fact, there is no difference. Far less attention has been directed to power analysis, or Type II errors, that is, concluding that a difference does not exist when in fact one does (Cohen, 1977). It is the contention of the present paper that the sample sizes used in many studies are too small to detect important differences between treatments. Behavioral weight control research is used to illustrate this point.

Preparation of this manuscript was supported, in part, by Grant AM 29757-02 to Dr. Rena R. Wing from the National Institute of Arthritis, Metabolism and Digestive Diseases and, in part, by Grant AM 26542-03 to Dr. Robert W. Jeffery from the National Institute of Arthritis, Metabolism and Digestive Diseases. Requests for reprints should be sent to Rena R. Wing, Western Psychiatric Institute and Clinic, University of Pittsburgh School of Medicine, 3811 O'Hara Street, Pittsburgh, PA 15213.

0005-7894/84/0550-0556$1.00/0 Copyright 1984 by Association for Advancement of Behavior Therapy. All rights of reproduction in any form reserved.




Extensive research has been conducted on behavioral treatments for obesity. While these studies have clearly established that behavioral programs are more effective than no-treatment controls, it is not clear which specific behavioral treatment strategies or combinations of strategies contribute most to weight loss. Most studies comparing two or more active treatment conditions have failed to find statistically significant differences between conditions, a difficulty particularly apparent in studies with extended follow-up (Wilson & Brownell, 1980).

While it may be that meaningful differences do not exist between behavioral treatment programs, it also seems possible that these studies lack sufficient power to detect meaningful differences between treatments. The present paper reviews behavioral weight control studies with long-term follow-up to determine the rate of attrition and variability in treatment outcome. Using these estimates of attrition and variability, the sample sizes required to detect differences of 5, 10, and 15 pounds between treatment conditions were calculated.

METHOD AND RESULTS

There are several factors which determine the sample size which should be used in a clinical outcome study. First is the risk of Type I and Type II errors that the investigator is willing to tolerate. A 1 in 20 chance of Type I error (alpha or p of .05) is customary. Similarly, a 2 in 10 chance of Type II error (beta of .20) or a power of .80 (1 - beta) is, by convention, considered acceptable (Welkowitz, Ewen, & Cohen, 1976; Young, Bresnitz, & Strom, 1983).

A second factor that enters into sample size calculation is the size of the difference between the experimental conditions which the investigator wishes to be able to detect (i.e., what treatment effect is "meaningful"). In the behavioral weight control research, there has been little discussion of what constitutes a "clinically important" or "meaningful" difference between treatments. In a study which used weight loss to reduce blood pressure and ultimately to reduce mortality from coronary heart disease, a "meaningful" weight loss would depend on the relationship between weight loss and blood pressure reduction, the initial blood pressure level of study participants, and the likelihood of endpoint events associated with blood pressure. If the clinical target was reduction in serum glucose levels, a completely different set of relationships would require consideration. However, most behavioral weight control studies do not have clinical targets. They are designed, instead, to compare various treatment strategies to determine which strategies produce the best short-term and long-term weight losses. A designation of a "meaningful" difference is somewhat arbitrary in these studies. Thus, we have determined the sample sizes that would be needed to detect either a 5, 10, or 15 pound difference between treatment conditions.

A final factor which affects sample size is the nature of the outcome variables (whether dichotomous or continuous) and, in the case of continuous variables, the variability of the measure. To determine the variability of weight loss, we reviewed the published literature on behavioral weight control. Nine studies which reported attrition and the mean and standard deviation of weight loss at the end of treatment and at a follow-up interval of 1 year were identified (Table 1). These studies included 32 treatment groups. Attrition rates and within-condition variability were averaged across these 32 groups for purposes of sample size estimation.

TABLE 1
SAMPLE SIZE, ATTRITION, TREATMENT EFFECT, AND VARIABILITY IN BEHAVIORAL WEIGHT LOSS PROGRAMS WITH 1-YEAR FOLLOW-UP

                                          Start   End of treatment      1-year follow-up
Study/Treatment group                     n       n     Loss, lbs ± SD  n     Loss, lbs ± SD

Ashby and Wilson (1977)
  Replication 1
    Behavioral-2 wk                       8       6     9.8 ± 7.1       6     4.3 ± 6.7
    Behavioral-4 wk                       8       8     9.4 ± 3.7       8     7.1 ± 5.1
    Nonspecific-2 wk                      8       7     11.1 ± 5.1      7     7.9 ± 12.9
    Nonspecific-4 wk                      8       8     10.4 ± 3.9      8     7.8 ± 5.9
    Control                               8       8     11.8 ± 8.4      8     5.1 ± 6.0
  Replication 2
    Behavioral-2 wk                       8       7     7.9 ± 5.3       7     7.2 ± 8.8
    Behavioral-4 wk                       8       7     8.9 ± 2.3       7     12.1 ± 17.9
    Nonspecific-2 wk                      8       7     8.6 ± 7.1       7     10.7 ± 10.2
    Nonspecific-4 wk                      8       7     6.4 ± 5.2       7     8.5 ± 10.7
    Control                               8       6     7.9 ± 4.4       6     12.3 ± 12.9

Beneke et al. (1978)
    Food management                       27*     20    18.5 ± 7.5      17    13.9 ± 9.7
    Behavioral self-control               27      21    16.1 ± 4.9      20    12.8 ± 14.1

Black and Scherba (1983)
    Behavioral skills                     7       7     6.4 ± 4.6       6     5.5 ± 6.7
    Problem solving                       7       7     12.3 ± 4.8      6     15.2 ± 8.2

Brownell and Stunkard (1981)
    Couples training                      41*     36    19.8 ± 11.8     33    9.9 ± 15.2
    Cooperative spouse, S alone           41      36    17.4 ± 13.2     36    9.0 ± 13.2
    Uncooperative spouse                  41      40    18.9 ± 12.5     39    9.2 ± 17.9

Craighead et al. (1981)
    Behavior therapy                      39      32    24.0 ± 12.4     31    19.8 ± 16.2
    Pharmacotherapy                       32      25    31.9 ± 12.1     25    13.9 ± 16.5
    Behavior therapy and pharmacotherapy  31      23    33.7 ± 12.7     23    10.1 ± 16.9

Murphy et al. (1982)
    Alone-1 party                         13      8     15.6 ± 10.5     4     7.0 ± 11.5
    Alone-2 party                         13      7     15.1 ± 7.8      6     7.7 ± 14.8
    Couple-1 party                        13      5     18.0 ± 3.2      4     12.0 ± 10.6
    Couple-2 party                        12      8     16.8 ± 8.3      8     19.3 ± 13.9
    Supportive                            11      6     16.5 ± 9.3      6     9.2 ± 14.8

Ost and Gotestam (1976)
    Behavior therapy                      15      11    20.7 ± 9.9      11    10.1 ± 13.6
    Fenfluramine                          15      11    11.4 ± 9.9      11    1.8 ± 9.0
    Waiting list                          15      11    7.7 ± 8.8       11    5.3 ± 11.7

Wing et al. (1981)
    Weight loss/Attendance                18      14    15.8 ± 5.8      12    23.0 ± 14.5
    Attendance/Weight loss                20      18    11.8 ± 7.7      11    18.9 ± 13.5

Wing et al. (1982)
    Scarsdale                             20      19    11.2 ± 8.5      15    15.5 ± 15.4
    Behavioral                            19      15    12.0 ± 6.0      14    12.0 ± 9.3

* Initial n's per group were estimated based on total sample and the assumption of equal distribution among conditions.

In the nine studies reviewed, the average rate of attrition from pretreatment to posttreatment was 20%; attrition from pretreatment to follow-up was 25%. The average weight loss at the end of treatment was 14.5 pounds and at follow-up was 10.8 pounds. Average standard deviations for weight change over the same intervals were 7.7 pounds and 12.0 pounds, respectively.
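The averaging step can be sketched in a few lines of Python. This is only an illustration: it uses the three Ost and Gotestam (1976) groups from Table 1 rather than all 32 groups, the name `summarize` is hypothetical, and the unweighted mean across groups is an assumption, since the paper does not say whether groups were weighted by size.

```python
from statistics import mean

# (start n, end-of-treatment n, follow-up n, posttreatment SD, follow-up SD)
# for the three Ost and Gotestam (1976) groups listed in Table 1.
GROUPS = [
    (15, 11, 11, 9.9, 13.6),  # Behavior therapy
    (15, 11, 11, 9.9, 9.0),   # Fenfluramine
    (15, 11, 11, 8.8, 11.7),  # Waiting list
]


def summarize(groups):
    """Unweighted averages of attrition and SD across groups (an assumption)."""
    return {
        "attrition_post": mean(1 - end / start for start, end, _, _, _ in groups),
        "attrition_followup": mean(1 - fu / start for start, _, fu, _, _ in groups),
        "sd_post": mean(sd_post for *_, sd_post, _ in groups),
        "sd_followup": mean(sd_fu for *_, sd_fu in groups),
    }


summary = summarize(GROUPS)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Extending `GROUPS` to every row of Table 1 would give the inputs for the sample size calculation, subject to the weighting assumption noted above.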

These estimates of attrition and variability were then used to calculate sample size, with the standard statistical formula (Dixon & Massey, 1969):

$$N = \left[\frac{Z_{1-\alpha} + Z_{1-\beta}}{d'}\right]^{2}, \qquad d' = \frac{M_1 - M_2}{SD\sqrt{2}}$$

where N is the sample size needed in each group for a two-tailed t test, with an expected mean difference between groups of $M_1 - M_2$ and a within-group standard deviation of SD. Alpha was set at .05 and beta at .20. Calculations were performed for expected differences between groups of 5, 10, and 15 pounds or for effect sizes of .65, 1.3, and 1.95 at posttreatment and .42, .83, and 1.25 at follow-up.
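The calculation can be sketched in Python as follows. Several choices here are assumptions rather than details given in the paper: the normal quantile for alpha is taken at 1 - alpha/2 (the two-tailed critical value, about 1.96), the denominator of d' is read as SD times the square root of 2, and the attrition adjustment is taken to be multiplication by (1 + attrition rate). The function names are hypothetical.

```python
import math
from statistics import NormalDist


def n_completers(diff_lb, sd, alpha=0.05, beta=0.20):
    """Subjects needed per group at analysis: N = [(z_alpha + z_beta) / d']^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # assumed two-tailed critical value
    z_beta = NormalDist().inv_cdf(1 - beta)
    d_prime = diff_lb / (sd * math.sqrt(2))        # assumed reading of d'
    return ((z_alpha + z_beta) / d_prime) ** 2


def n_initial(diff_lb, sd, attrition, alpha=0.05, beta=0.20):
    """Starting sample per group, inflated for attrition (assumed: N * (1 + rate))."""
    return n_completers(diff_lb, sd, alpha, beta) * (1 + attrition)


for diff in (5, 10, 15):
    post = n_initial(diff, sd=7.7, attrition=0.20)
    follow_up = n_initial(diff, sd=12.0, attrition=0.25)
    print(f"{diff} lb: posttreatment {post:.1f}, follow-up {follow_up:.1f}")
```

Under these assumptions the results land within roughly one subject of the Table 2 entries; the small gaps presumably reflect rounding choices that the paper does not describe.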

Table 2 shows sample sizes that would be needed in each treatment condition to detect either a 5, 10, or 15 pound difference between group means at the end of treatment or at follow-up.

TABLE 2
INITIAL SAMPLE SIZE NEEDED IN EACH TREATMENT GROUP FOR STATISTICAL SIGNIFICANCE IN A TWO-TAILED t TEST WITH α = .05 AND β = .20*

                              End of treatment             1-year follow-up
Difference between groups     Effect size   Sample size    Effect size   Sample size

5 lb                          .65           45             .42           114
10 lb                         1.30          11             .83           28
15 lb                         1.95          5              1.25          13

* Assumptions: SD posttreatment = 7.7; SD follow-up = 12.0. Initial sample sizes have been adjusted to allow for 20% attrition at posttreatment and 25% attrition at follow-up.

DISCUSSION

The estimates of attrition and the variability in treatment outcome obtained in the present study (Table 1) are similar to those reported in several comprehensive reviews of behavioral weight control programs (Wilson & Brownell, 1980; Wing & Jeffery, 1979). The standard deviation for weight loss is often as large as the mean, and variability tends to increase over time. As shown in Table 2, the assessment of differences between treatment strategies thus requires either a very large difference in treatment effect or very large sample size.

Comparing Table 2 with the sample sizes typically used in behavioral weight control studies (10-20 subjects per condition), it would appear that few studies have used large enough samples to reliably detect 5 pound differences in treatment effectiveness. Most could detect a 10 pound or greater difference at posttreatment (effect size of 1.3), but a mean difference of about 15 pounds would probably be needed to assure statistical significance at follow-up. The differences observed between active treatment conditions in behavioral weight control studies are seldom in this range. In the nine studies reviewed, the most extreme difference between any two treatment conditions exceeded 10 pounds in only one study at posttreatment and one study at follow-up. Behavioral treatments usually result in weight losses of 10-15 pounds. Thus, a procedure which reliably increased treatment effectiveness by 5 pounds (to the 15-20 pound range) would constitute a 30-50% increase in effectiveness. While it would be helpful to be able to detect such improvements in treatment effectiveness, the sample sizes required would be formidable.
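The comparison with typical sample sizes can be illustrated by inverting the sample size formula to ask what difference a given group size can detect. The sketch below assumes that n is the number of subjects per group who complete the assessment, that the two-tailed critical value is used for alpha, and that the SDs are the 7.7 and 12.0 pound averages reported above; `detectable_difference` is a hypothetical name.

```python
import math
from statistics import NormalDist


def detectable_difference(n_per_group, sd, alpha=0.05, beta=0.20):
    """Smallest mean difference (lb) detectable with power 1 - beta.

    Obtained by solving N = 2 * ((z_alpha + z_beta) * SD / diff)^2 for diff.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # assumed two-tailed critical value
    z_beta = NormalDist().inv_cdf(1 - beta)
    return (z_alpha + z_beta) * sd * math.sqrt(2 / n_per_group)


for n in (10, 20):
    post = detectable_difference(n, sd=7.7)
    follow_up = detectable_difference(n, sd=12.0)
    print(f"n = {n}: posttreatment {post:.1f} lb, follow-up {follow_up:.1f} lb")
```

With 10 to 20 completers per group, this gives detectable differences of roughly 7 to 10 pounds at posttreatment and 11 to 15 pounds at follow-up, in line with the figures cited in the text.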

The sample size problem outlined here is a major obstacle to assessing the relative effectiveness of different treatment strategies for weight loss. Following are some suggestions for dealing with this problem:

1. Increase sample size: The 45 subjects per cell needed to detect a 5-pound difference at posttreatment in a two-group comparison should be within the realm of possibility for most investigators, but the sample sizes needed for long-term follow-up (114 per cell) are more distressing. Perhaps collaborative studies involving several treatment centers would be helpful. Multicenter collaboration is relatively common in other areas of research where variable physiological end points are at issue (e.g., blood pressure).


2. Reduce variability: Researchers have already explored the possibility of using various weight reduction indices and covariates to reduce the variability in obesity treatment outcome. So far, these efforts have met with little success, but deserve further study. In addition, it may be helpful to reduce the variability at the front end, by using samples which are more homogeneous with respect to age, sex, pretreatment weight, diet history, and other variables known to predict success at weight reduction. Statistical procedures that attenuate variability caused by extreme values (e.g., Winsorizing) should also be considered.

3. Change the level of significance: To protect against prematurely abandoning promising new intervention strategies, it may be helpful to reconsider the value of conventional levels of significance. Perhaps in exploratory research we should set the alpha level at .1, rather than .05, especially for follow-up analyses. While increasing the probability of Type I errors, this strategy would allow us sufficient power to detect potentially promising techniques without inordinately increasing our sample size. This is a controversial suggestion, but deserving of some consideration.
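The effect of the third suggestion can be sketched numerically. The snippet below compares the per-group sample size at alpha = .10 with that at alpha = .05 for a 5-pound difference at follow-up (SD = 12.0), using the normal-approximation formula with a two-tailed critical value; both that reading of the formula and the function name are assumptions.

```python
from statistics import NormalDist


def n_per_group(diff_lb, sd, alpha, beta=0.20):
    """Per-group N for a two-tailed comparison of two means (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # assumed two-tailed critical value
    z_beta = NormalDist().inv_cdf(1 - beta)
    return 2 * ((z_alpha + z_beta) * sd / diff_lb) ** 2


n_conventional = n_per_group(5, 12.0, alpha=0.05)
n_relaxed = n_per_group(5, 12.0, alpha=0.10)
ratio = n_relaxed / n_conventional
print(f"alpha .05: {n_conventional:.0f}, alpha .10: {n_relaxed:.0f}, ratio: {ratio:.2f}")
```

Under these assumptions, relaxing alpha from .05 to .10 reduces the required sample by about one fifth. The ratio does not depend on the SD or on the size of the difference, since both cancel.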

The argument about sample size was developed here using weight control as an example. The issue, however, is not unique to, nor particularly troublesome in, this field. Other areas of behavioral outcome research are likewise hampered by sample size problems. For example, in smoking intervention, where the outcome is a dichotomous variable (cessation or noncessation) and a substantial number of patients stop smoking with placebo treatment (Neaton, Broste, Cohen, Fishman, Kjelsberg, & Schoenberger, 1981), the necessary sample sizes are even larger than in weight control research.
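To illustrate the dichotomous case, the sketch below uses a common normal-approximation formula for comparing two proportions. This formula is not taken from the paper, and the cessation rates of 20% and 30% are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist


def n_per_group_proportions(p1, p2, alpha=0.05, beta=0.20):
    """Per-group N for comparing two proportions (unpooled normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(1 - beta)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2


# Hypothetical cessation rates: 20% with placebo, 30% with active treatment.
n_smoking = n_per_group_proportions(0.20, 0.30)
print(f"about {n_smoking:.0f} completers per group")
```

With these hypothetical rates the requirement is roughly 290 completers per group, several times the 45 per cell (including attrition) that Table 2 gives for a 5-pound difference at posttreatment.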

It is suggested that the issue of statistical power be considered more carefully in designing research and in interpreting negative findings reported in clinical studies. A simple nomogram for sample size calculation has recently been published (Young et al., 1983), which should facilitate the consideration of sample size adequacy; Cohen's (1977) classic book on this topic is also strongly recommended.

REFERENCES

Ashby, W. A., & Wilson, G. T. (1977). Behavior therapy for obesity: Booster sessions and long-term maintenance of weight loss. Behaviour Research and Therapy, 15, 451-463.

Beneke, W. M., Paulson, B., McReynolds, W. T., Lutz, R. N., & Kohrs, M. B. (1978). Long term results of two behavior modification weight loss programs using nutritionists as therapists. Behavior Therapy, 9, 501-507.

Black, D. R., & Scherba, D. S. (1983). Contracting to problem solve versus contracting to practice behavioral weight loss skills. Behavior Therapy, 14, 100-109.

Brownell, K. D., & Stunkard, A. J. (1981). Couples training, pharmacotherapy, and behavior therapy in the treatment of obesity. Archives of General Psychiatry, 38, 1224-1229.

Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic.

Craighead, L. W., Stunkard, A. J., & O'Brien, R. M. (1981). Behavior therapy and pharmacotherapy for obesity. Archives of General Psychiatry, 38, 763-768.

Dixon, W. J., & Massey, F. J. (1969). Introduction to statistical analysis. New York: McGraw-Hill.

Murphy, J., Williamson, D., Buxton, A., Moody, S., Absher, N., & Warner, M. (1982). The long-term effects of spouse involvement upon weight loss and maintenance. Behavior Therapy, 13, 681-693.

Neaton, J. D., Broste, S., Cohen, L., Fishman, E. L., Kjelsberg, M. O., & Schoenberger, J. (1981). The Multiple Risk Factor Intervention Trial (MRFIT) VII. A comparison of risk factor changes between the two study groups. Preventive Medicine, 10, 519-543.

Ost, L., & Gotestam, K. (1976). Behavioral and pharmacological treatments for obesity: An experimental comparison. Addictive Behaviors, 1, 331-338.

Welkowitz, J., Ewen, R. B., & Cohen, J. (1976). Introductory statistics for the behavioral sciences (2nd ed.). New York: Academic.

Wilson, G. T., & Brownell, K. D. (1980). Behavior therapy for obesity: An evaluation of treatment outcome. Advances in Behaviour Research and Therapy, 3, 49-86.

Wing, R. R., Epstein, L. H., Marcus, M., & Shapira, B. (1981). Strong monetary contingencies for weight loss during treatment and maintenance. Behavior Therapy, 12, 702-710.

Wing, R. R., Epstein, L. H., & Shapira, B. (1982). The effect of increasing initial weight loss with the Scarsdale Diet on subsequent weight loss in a behavioral weight control program. Journal of Consulting and Clinical Psychology, 50, 446-447.

Wing, R. R., & Jeffery, R. W. (1979). Outpatient treatment of obesity: A comparison of methodology and clinical results. International Journal of Obesity, 3, 261-279.

Young, M. J., Bresnitz, E. A., & Strom, B. (1983). Sample size nomograms for interpreting negative clinical studies. Annals of Internal Medicine, 99, 248-251.

RECEIVED: February 27, 1984 FINAL ACCEPTANCE: May 18, 1984