

J Clin Epidemiol Vol. 50, No. 1, pp. 79-93, 1997 Copyright © 1997 Elsevier Science Inc.

ELSEVIER

0895-4356/97/$17.00 PII S0895-4356(96)00296-X

Evaluating Changes in Health Status: Reliability and Responsiveness of Five Generic Health Status Measures in Workers with Musculoskeletal Disorders

Dorcas E. Beaton,1,2 Sheilah Hogg-Johnson,1,3 and Claire Bombardier1,4

1INSTITUTE FOR WORK & HEALTH, 2DEPARTMENT OF OCCUPATIONAL THERAPY, 3DEPARTMENT OF PREVENTIVE MEDICINE AND BIOSTATISTICS, 4DEPARTMENT OF MEDICINE AND CLINICAL EPIDEMIOLOGY AND HEALTH CARE RESEARCH PROGRAM, UNIVERSITY OF TORONTO, TORONTO, ONTARIO, CANADA M4W 1E6

ABSTRACT. Objectives: To compare the measurement properties over time of five generic health status assessment techniques. Methods: Five health status measures were completed on two occasions by a sample of workers with musculoskeletal disorders. They included the SF-36, Nottingham Health Profile, Health Status Section of the Ontario Health Survey (OHS), Duke Health Profile, the Sickness Impact Profile and a self-report of change in health between tests. Setting: Subjects were accrued from a work site (within one week of injury) (n = 53), physiotherapy clinics (four weeks after injury) (n = 34), and a tertiary level rehabilitation center (more than four weeks after injury) (n = 40). Analysis: Intraclass correlation coefficients (ICC) derived from nonparametric one-way analysis of variance were used for test-retest reliability in those who had not changed (n = 49). Various responsiveness statistics were used to evaluate responsiveness in those who claimed they had a positive change in health (n = 45) and in those who would have been expected to have a positive change (n = 79). Results: Of the 127 subjects recruited, 114 completed both questionnaires (89.8%). In the subjects who reported no change in health, analysis of targeted dimensions (overall scores, physical function, and pain) demonstrated acceptable to excellent test-retest reliability in all but the Duke Health Profile. In subjects with change in health, the SF-36 was the most responsive measure (moderate to large effect sizes [0.55-0.97] and standardized response means ranging between 0.81 and 1.13). Conclusions: The results suggest that the SF-36 was the most appropriate questionnaire to measure health changes in the population studied. The selection of a health status measure must be context-specific, taking into account the purpose and population of the planned research.

Copyright © 1997 Elsevier Science Inc. J CLIN EPIDEMIOL 50;1:79-93, 1997.

KEYWORDS. Health status indicators, quality of life, reproducibility of results

INTRODUCTION

Health-related quality-of-life (HRQOL) measures are increasingly being used in clinical trials and outcome studies [1-6]. As primary outcome measures, HRQOL assessments can document the efficacy of a treatment, guide the allocation of rationed health care dollars [7-9], and serve as the basis for estimates of effect sizes for sample size calculations [10,11].

The clinician or researcher wishing to use such a tool is faced with selecting one from among the many available. Guidelines have been proposed to help in making this decision [3,12-16]; they include practical approaches to judge the questionnaires’ feasibility, content, and face validity. Quantitative techniques are also available to measure their reliability and their validity using data collected from subjects.

Address for correspondence: Dorcas E. Beaton, Institute for Work & Health, 250 Bloor St. E., Ste. 702, Toronto, Ontario, Canada M4W 1E6.

Accepted for publication on 5 August 1996.

The clinician seeking a tool to measure change over time should look for evidence of responsiveness, i.e., sensitivity of the questionnaire to clinically relevant change in health. To be of value, a tool should also be stable when no change occurs (test-retest reliability) [5,6,17,18]. Finally, it is important to select a tool that has shown these properties in the sample and under the conditions to be studied [3,16,19-21].

The literature supports the use of both generic and disease-specific HRQOL measures [5,8,18,22,23]. Generic tools often provide a broad picture of health across a range of conditions, whereas disease or domain-specific [5] measures are more sensitive to the disorder under consideration, and are therefore more likely to reflect clinical changes [15,18,24-26]. However, it may be necessary to use generic measures to assess clinical change when comparisons are being made across conditions. These comparisons will become more common in the competition for health care resources.

Clinicians and researchers should seek generic instruments that are sensitive to the type of condition and clinical change under investigation in terms of content and scaling. By doing so, they can help ensure that the effectiveness of the treatment (or impact of the condition) under study is likely to be revealed accurately, and that the measurement tool is not itself a source of bias [27].

The purpose of the present study was to compare the measurement properties over time of five generic HRQOL assessment tools [10,17,28-31]. The specific goal was to select an instrument for a larger work-site based study of musculoskeletal soft tissue injuries measuring health changes over time. The tools selected for this study (through literature review and on the recommendations of other researchers) were the SF-36-Acute, the Nottingham Health Profile, the Duke Health Profile, the Sickness Impact Profile, and the Health Status Section of the Ontario Health Survey. Each is a recognized generic measure of HRQOL across populations of healthy and ill subjects.

METHODS

Patient Selection

Subjects eligible for the study were those individuals receiving workers’ compensation for a soft tissue injury to the lower back or upper extremity (including neck). Those with any comorbidity that might affect recovery (such as fractures, lacerations, and head injuries) were excluded.

Subjects were enrolled with the two components of the study in mind: reliability (requiring stable health) and responsiveness (requiring improvement in health).

Site Selection and Procedure at Each Site

Three sites were selected to give access to workers at three different stages of illness: acutely injured workers (most of whom would improve within a short period); those attending an active rehabilitation program and still off work approximately four weeks after injury; and those attending a rehabilitation clinic with longer-term disabilities and therefore not expected to change between testing (see Fig. 1 and Table 1). This sampling strategy parallels our knowledge of the nature of recovery of low back pain [32-35] or other musculoskeletal disorders [36], which suggests that the first two sites should sample people likely to improve, while people at the rehabilitation center would more likely be stable in health.

WORKSITE (STEEL MILL). Workers were identified and asked to participate in the study by the occupational health staff at the time of reporting their injury. A log of newly reported injuries was kept at the work site and reviewed daily. A trained interviewer contacted eligible workers and made arrangements to meet them within one week of injury. At that meeting, signed consent was obtained, the first questionnaire package was completed, and a second package was left for completion in three weeks. Three weeks was selected to sample those who recovered within that time of injury [6,29,33,36,37]. After the three weeks had elapsed, subjects were contacted by telephone to remind them to complete and return the second package. Follow-up calls were made to ensure that completion rates exceeded 80%.

[Figure 1 not reproduced. Horizontal axis: time off work (3 weeks, 8 weeks, > 3 months). Surviving labels: Acute Stage, Natural healing, Worksite.]

FIGURE 1. Hypothetical recovery curve following work-related soft tissue disorders: rationale for site and sample selection. The curve represents the proportion of people still off work at a given point after the onset of the disorder. The early stages show rapid recovery, with a plateau in the latter stages. S(t) represents proportion still with pain at time t.

COMMUNITY CLINICS. The second site consisted of six Workers’ Compensation Board-sponsored community physiotherapy clinics. These units operate on the philosophy of early active intervention and education, and provide individualized programs. Workers generally enter four weeks after their injury and are treated in a four- to six-week program.

The study was presented to the staff of the six participating clinics. One key person at each clinic was identified to act as coordinator and contact person. A log of all new clients was kept and those eligible for the study were asked to participate. To detect health changes occurring over the course of treatment, subjects were allotted time to complete questionnaires both in the early days of treatment and within three days of discharge. At the end of the study, completed questionnaire packages were gathered by the researchers.

REHABILITATION CENTER. The third source of subjects was a tertiary level rehabilitation center sponsored by the Ontario Workers’ Compensation Board. The center specializes in hand injuries and pain management. Workers who attend are usually in the later stages of their disease experience. Presentations on the study were given to center staff who then screened patients based on the inclusion criteria and brought eligible, willing workers to the researchers. Sessions were arranged for patients to complete the package. Subjects completed the first package on the site and either returned one week later for the second or returned it by mail. A one-week interval was selected to be long enough to avoid a recall bias and yet short enough to avoid problems related to the inherent instability of health. Follow-up was continued by telephone to obtain an 80% return rate.

TABLE 1. Site and sample selection: Rationale for sampling and testing strategy for the study

| Site | Sample: type of injured worker at site | Sample: time since injury at baseline testing | Second testing | Goals for accrual from this site |
|---|---|---|---|---|
| Worksite (steel mill) | Acutely injured worker | On average, within one week of injury | Three weeks after initial testing | Improvement in HRQOL between the testing times due to natural history of acute soft tissue injury [29,37] |
| Early active rehabilitation (community clinics) | Workers attending early active rehabilitation program, still off work | Admission to clinic | Four to six weeks after initial testing | Improvement in HRQOL between admission and discharge from a rehabilitation program |
| Longer-term rehabilitation (rehabilitation center) | Longer-term rehabilitation | Wide variation; all > 3 months | One week after initial testing | Minimal or no change in HRQOL between testings in workers with longer-term disability |

The Questionnaire Package: Health-Related Quality-of-Life Measures

Each instrument used is reviewed below as to its general format, its content, and the literature supporting its use. The results presented in this paper will be for those dimensions of HRQOL most likely to be affected by a musculoskeletal injury, i.e., pain, physical function, and the overall score (Table 2). Analysis for other dimensions is included in Appendix A.

SF-36-ACUTE. The SF-36 is a 36-item questionnaire that takes between 7 and 10 minutes to complete. Results of 35 of the 36 items are aggregated into eight dimensions: physical function, role function (physical), role function (emotional), pain, social functioning, mental health, energy, and general health perceptions. In the SF-36-Acute, the questions are framed over a one-week period, as opposed to the four-week period in the regular SF-36. The responses vary from dichotomous (yes/no) to six-point verbal rating scales (ordinal) depending on the original source of the questions. No global score is recommended. The questions used in the SF-36 were selected from the Rand Health Insurance long form fielded in the Rand Health Insurance Experiment [38-40]. The SF-36 has gained popularity across North America and is now used by many Patient Outcome Research Teams (PORTs). The literature supports its content and construct validity [38-40] and its use in general health surveys [41,42], and has encouraged research to examine the issue of responsiveness [38].

NOTTINGHAM HEALTH PROFILE. The Nottingham Health Profile is a 37-item questionnaire that takes approximately 5-7 minutes to complete. Responses are compiled into six dimensions: energy, emotional reaction, social isolation, pain, physical disability, and sleep. The responses are dichotomous; respondents answer “yes” if they are currently experiencing the described behavior due to their health problems. Positive responses are weighted based on scoring derived from a Thurstone paired comparison technique [43,44]. Weighting has also been developed for use in Sweden [41,43]. No summary score is recommended for this tool.

The Nottingham Health Profile was designed for use in general populations and has also gained popularity in primary care and clinical trials [16,45-48]. It has had some application in musculoskeletal conditions including hip arthroplasty [24] and a mixed sample including low back pain [49].

SICKNESS IMPACT PROFILE. The Sickness Impact Profile (SIP) is widely accepted as a generic measure of health status. Like the Nottingham, it measures “illness behaviors” by asking respondents to indicate which of the 136 items they are experiencing. Only those behaviors experienced are marked on the page. The SIP takes up to 30 minutes to complete. Weighting of the items was developed using a Thurstone paired comparison technique [50-52]. Scores are summarized in 12 subscales, some of which are combined to form the Physical Dimension and Psychosocial Dimension; all are combined to derive the overall score. Five subscales are not included in either the Physical or Psychosocial Dimension, but are included in the overall score.


TABLE 2. Target dimensions used in the analysis^a and other dimensions in the questionnaires fielded

| Questionnaire | Target dimension^a: physical function | Target dimension^a: pain | Target dimension^a: overall score | Other dimensions (analysis presented in Appendix) |
|---|---|---|---|---|
| SF-36 | Physical Function | Pain | (Unweighted mean across all dimensions)^b | Role-Physical, Role-Emotional, Social function, Mental functioning, Energy, General health |
| Nottingham Health Profile (NHP) | Physical Mobility | Pain | (Unweighted mean across all dimensions)^b | Energy, Emotional reaction, Social isolation, Sleep |
| Duke Health Profile (Duke) | Physical Health | Pain | General health | Mental health, Social health, Perceived health, Self-esteem, Anxiety, Depression, Disability |
| Ontario Health Survey (OHS) | Mobility | Pain | Overall score | Cognitive, Emotional, Self-care, Senses |
| Sickness Impact Profile (SIP) | Physical | (Overall) (proxy for pain; SIP does not have pain dimension) | Overall score | Ambulation, Sleep and rest, Emotional behavior, Body care & movement, Mobility, Home management, Social interaction, Alertness, Recreation & pastimes, Work, [Eating] not included, [Communication] not included, Psychosocial |

^a The dimensions most likely to be affected by a musculoskeletal injury.
^b This type of aggregation not recommended by developers.

The SIP has been studied in patients with low back pain and other musculoskeletal disorders. It has been shown to be sensitive to change [17,29,53-55], though not as sensitive as a disease-specific measure derived from it (Roland Scale) [17,29,56].

In the present study, the subscales of eating and communications that contained items not felt to be relevant to workers with musculoskeletal injuries were deleted to reduce the respondent burden [6]. The originators of the scale state that, unlike individual items, subscales may be removed, though they do not recommend it (unpublished user’s manual). Scoring was carried out assuming that there would not have been any affirmative responses in the dimensions we excluded. These dimensions contributed only to the overall score.

DUKE HEALTH PROFILE. The Duke Health Profile (Duke) is a 17-item revised form of the longer (63-item) Duke-UNC health profile [57-59]. The Duke takes between 5 and 7 minutes to complete. The 17 items are combined into 10 summary scores: physical health, mental health, social health, perceived health, self-esteem, anxiety, depression, pain, and disability. A general health score which combines the first three subscales is used as an overall score. Some items contribute to several summary scores; for example, number 12 “getting tired easily” contributes to the physical health, anxiety, and depression scales. Responses are made on a three-point scale.

The Duke Health Profile reflects a revision and reconceptualization of the Duke-UNC Health Profile. It has been shown to correlate with the Medical Outcome Study (MOS) Short Form [58]. The Duke was developed for use in primary care. Some of the sample on which it was tested included patients with low back pain; however, there was no stratification to show exactly how it functioned in this group [57].

MODIFIED ONTARIO HEALTH SURVEY HEALTH STATUS QUESTIONS (OHS). This health status questionnaire was part of the 1990 Ontario Health Survey, an interviewer-administered survey. For the current study, the format was modified to make it self-administered. It should be noted that these changes may alter the measurement properties of the instrument. The health status section of the Ontario Health Survey includes up to 31 questions on health in the areas of cognition, emotions, mobility, pain, self-care (dexterity), and senses (sight, hearing). The scaling for individual questions varies from dichotomous to four-point ordinal scales. Extensive skip patterns are present as the questions in each dimension are hierarchical in nature.

Uniquely, this instrument provides a utility score. An individual is classified into one of a finite number of health states within each dimension depending on his or her responses. All dimensional health states have preassigned utility weights, which are multiplied across dimensions to obtain an overall score [60].

The developers who provided the utility weights emphasized that this is a provisional system and should be used with caution. They are currently funded to validate the scoring mechanism of the tool further. Utility weights in this provisional scoring were derived using standard gamble techniques in a sample of parents of school-aged children in preparation for a clinical trial [61].

Subsequent to the present study, the developers have introduced a self-completed format that is currently undergoing psychometric testing. The content and layout of this new format differ from the self-completed questionnaire used here. Thus, the present results may differ from those that would be achieved with either the interviewer-administered tool or this newer self-completed format.

GLOBAL QUESTIONNAIRE. A transitional index was included that identifies the change in health status between testing times on a five-point ordinal scale (“much worse,” “somewhat worse,” “no change,” “somewhat better,” and “much better”). This change in health status is used, as it has been by others, as a global measure of change in subjects’ clinical health state to assess the psychometric properties of the HRQOL tools [17,53,62,63].

Ethical Review

This study was reviewed and approved by the ethical review board at the University of Toronto, and the research board at one site (Downsview Rehabilitation Centre). Signed informed consent was obtained from subjects before participation in the study.

STATISTICAL ANALYSIS

Sample Size Calculation

TEST-RETEST RELIABILITY. Sample size calculation for examining test-retest reliability was based on work by Kraemer and Korner [64] and Donner and Eliasziw [65]. Alpha was set to 0.05, and beta to 0.20. Rho(0) (minimal standard) and Rho(1) (reliability coefficient anticipated in this study) were set to 0.60 and 0.85, respectively. This sample would therefore have the power to detect a significant difference between our minimal standard (0.60) and the expected level of 0.85. A sample size of 42.1 was calculated. Oversampling of 15% was done to allow for incomplete questionnaires.

RESPONSIVENESS. No literature was found to describe the calculation of a sample size for a study to explore responsiveness. For this study it was estimated that a sample at least the same size as for the reliability study should be used. It was decided to sample 45 people who stated they had a positive change in health between the two testing times. Post hoc testing using the methods of Cohen [11] suggested that this was an adequate sample.

Adjustment of Questionnaires

To allow figures on change in health to be easily compared, all dimension scores were adjusted so that higher scores indicated better health and a perfect (positive health state) score was equal to 100. This involved reversing scores on the Nottingham and SIP and altering the Duke and the Ontario Health Survey. Although not recommended by the developers, a proxy for an overall aggregate score for the NHP and SF-36 was created using the unweighted mean across dimensions [3,45,47,66]. Missing values were corrected as suggested by the developers of each instrument.
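The score adjustment described above can be illustrated with a minimal Python sketch. It assumes, for illustration only, that a reversed instrument reports each dimension on a 0 to 100 scale where higher means worse health; the function names and the sample values are hypothetical and are not taken from the study.

```python
def reverse_score(score, max_score=100.0):
    """Flip a 0..max_score scale on which higher = worse health,
    so that higher = better and max_score is a perfect health state."""
    return max_score - score


def unweighted_overall(dimension_scores):
    """Unweighted mean across dimension scores, used here as a proxy
    overall score (an aggregation the developers do not recommend)."""
    values = list(dimension_scores.values())
    return sum(values) / len(values)


# Hypothetical raw dimension scores for one respondent (higher = worse).
raw = {"energy": 20.0, "pain": 40.0, "physical_mobility": 30.0, "sleep": 10.0}

adjusted = {name: reverse_score(value) for name, value in raw.items()}
overall = unweighted_overall(adjusted)
print(adjusted, overall)
```

Putting every instrument on the same direction and range before differencing is what allows change scores to be compared across questionnaires.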

As stated, the analysis presented in this paper represents three dimensions of health determined a priori (by the authors) to be most likely to be affected by a musculoskeletal disorder: physical functioning, pain, and the overall score. The results from the other dimensions are presented as an appendix.

Test-Retest Reliability

Subjects from the work site or rehabilitation center who felt that they had not changed in health (on the transition index) were analyzed for test-retest reliability. One-way analysis of variance was calculated on each dimension using nonparametric (rank) methods. Nonparametric, ranked methods were used for two reasons. First, one developer (NHP) recommended nonparametric methods for their instrument and we wished to be consistent across instruments. Secondly, when reliability coefficients were calculated using parametric methods, the residuals were not normally distributed for one instrument (Health status section of the Ontario Health Survey). Although this would not affect the validity of the intraclass correlation coefficient (ICC) itself, it might affect our ability to use the F distribution for confidence intervals around the ICC. Again nonparametric methods were used for all questionnaires for comparability of results. Post hoc comparisons of parametric and nonparametric methods revealed differences in coefficients only in one questionnaire (OHS). ICCs were calculated according to the methods of Shrout and Fleiss 1979, and Bartko 1966 [67,68]. The results were expected to be lower than the ICC calculated by Deyo, who used two-way analysis of variance [6]. Two-way analysis of variance would eliminate the error attributed to a systematic shift between testing times [68]. We wished to include the possibility of this shift because it could represent a learning effect, and therefore a bias in using the questionnaire to measure change over time. One-way analysis of variance was used so that any systematic shift would lead to a lower reliability coefficient (error attributed to test time is in the denominator [69]). Confidence intervals were created for the intraclass correlation coefficient using the method of Stratford for a one-way analysis of variance [70].
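As a rough illustration of the one-way approach, the following Python sketch computes the Shrout and Fleiss ICC(1,1) from the between-subject and within-subject mean squares, optionally after replacing all scores with their pooled ranks. Ranking the pooled observations is only one plausible reading of the "nonparametric (rank) methods" mentioned above and is an assumption; the function names and data are hypothetical.

```python
def average_ranks(values):
    """Rank values 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def icc_one_way(scores):
    """ICC(1,1): scores is a list of per-subject lists of k measurements."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    means = [sum(row) / k for row in scores]
    # Between-subject and within-subject mean squares from one-way ANOVA.
    bms = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    wms = sum((x - m) ** 2 for row, m in zip(scores, means) for x in row) / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)


def rank_icc(scores):
    """ICC(1,1) on pooled ranks (assumed form of the rank-based analysis)."""
    k = len(scores[0])
    ranks = average_ranks([x for row in scores for x in row])
    return icc_one_way([ranks[i * k:(i + 1) * k] for i in range(len(scores))])


# Hypothetical test/retest scores: every subject shifts up by 10 points.
shifted = [[10, 20], [20, 30], [30, 40]]
print(icc_one_way(shifted), rank_icc(shifted))
```

In the hypothetical data the ordering of subjects is perfectly preserved, yet the one-way coefficient falls below 1 because the uniform shift is counted as within-subject error, which is the behavior the text gives as the reason for preferring the one-way model.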

Responsiveness/Sensitivity to Change

Responsiveness requires a standard outside the questionnaires to indicate clinical change. Traditionally, this has been the transitional index of health status [6,17,53]. Those subjects from the work site or the physiotherapy clinics who indicated that they had improved in health (either “somewhat better” or “much better”) between tests were considered to have had a “clinically estimated improvement” in health [6].

We also used another criterion for change: those expected to have a positive change between testing. Our knowledge of the natural history of these conditions suggested that the work site and the physiotherapy clinic samples would experience changes in health. This entire sample was expected to have a positive change and we therefore repeated the responsiveness analysis for this entire group. This allowed comparisons of the responsiveness without using the transitional health index as a “gold standard” of change in health.

Many summary statistics describe the magnitude of a positive change in health in subjects who experience one, each emphasizing a slightly different aspect of this concept [2,6,14,17,29,54,55,66,71]. Most reflect a standardized ratio of “signal” (observed change) to “noise” (some measure of variance). In the present study, three different ways of expressing responsiveness are reported: change scores, standardized effect size (SES), and standardized response mean (SRM). Other methods such as relative change, paired t-statistic, and relative efficiency only demonstrate variations on the selected statistics because of the closeness of the formulae. Guyatt’s Responsiveness is an alternative measure of responsiveness [6,62]. In the methods described by Guyatt [62] the same sample was used to measure test-retest (for the variance in the denominator) and a minimal clinically important difference for the numerator. The present study was not designed to allow for this. Others have made adaptations to this statistic, for example by using mean change rather than determining the minimally clinically important difference [6,24]. This “minimally clinically important difference” has yet to be determined for these questionnaires [6,66,71,72] and therefore we do not report the results for this statistic.

I. Observed Change [29]

MEAN (TEST_2 - TEST_1). The observed change mean (test_2 - test_1) scores were calculated as a measure of the extent to which the questionnaire measured the clinical change experienced by the subjects. Responsiveness was also examined by determining whether the change score took the form of a gradient across those people who felt they had “no change,” were “somewhat better,” or were “much better” on the transitional index.

II. Standardized Effect Size (SES) [71]

OBSERVED CHANGE / STANDARD DEVIATION_baseline. The effect size is calculated by taking the observed change scores and dividing the result by the standard deviation of the baseline scores. Kazis explains that this statistic aids in the understanding of the magnitude of the change, rather than its statistical significance [10,71]. It provides a standardized metric for the change scores.

III. Standardized Response Mean (SRMI f54,661

OBSERVED CHANGE/STANDARD DEVIATIONdiffrrrncer. Like the SES, the SRM uses the mean observed change as the numerator, but divides it by the standard deviation of the difference scores. This reflects the concept of a signal : noise ratio more effectively than the SES. It is very similar to a paired t-test; however, because it avoids the use of standard error of the mean in the denominator, it is less influenced by the sample size. Confidence intervals (95%) were calcu- lated for this statistic under the assumption that difference scores followed a normal distribution and therefore the SRM distribution could be approximated by a normal distri- bution with mean of zero and standard deviation of one over the square root of the sample size (n).
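The three statistics defined above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' program: the function name and the four pairs of scores are hypothetical, the sample (n - 1) standard deviation is assumed, and the confidence interval is taken as SRM plus or minus 1.96 divided by the square root of n, following the approximation described in the text.

```python
import math
from statistics import mean, stdev


def responsiveness(test1, test2, z=1.96):
    """Return observed change, SES, SRM and an approximate 95% CI for the SRM."""
    if len(test1) != len(test2) or len(test1) < 2:
        raise ValueError("need paired scores for at least two subjects")
    diffs = [after - before for before, after in zip(test1, test2)]
    change = mean(diffs)                    # I.   mean (test 2 - test 1)
    ses = change / stdev(test1)             # II.  change / SD of baseline scores
    srm = change / stdev(diffs)             # III. change / SD of difference scores
    half_width = z / math.sqrt(len(diffs))  # SD of the SRM taken as 1 / sqrt(n)
    return {"change": change, "ses": ses, "srm": srm,
            "srm_ci": (srm - half_width, srm + half_width)}


# Hypothetical scores for four subjects (0-100 scale, 100 = good health).
baseline = [40, 50, 60, 70]
follow_up = [50, 65, 65, 90]
result = responsiveness(baseline, follow_up)
print(result)
```

The paired t-statistic equals the SRM multiplied by the square root of n, which is why the SRM is described as less influenced by sample size: the same mean change and spread give the same SRM whatever the number of subjects.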

RESULTS

Sample

A total of 127 workers agreed to participate in the study. Subjects were accrued evenly across the three sites. Forty-five subjects reported a positive change in health and 49 had no change in health (Table 3). An a priori decision was made to exclude subjects who reported unexpected changes in health from the analysis of responsiveness based on the transitional health status index. This included those with no change at discharge from active therapy (n = 3), and a positive change in one week when sampled from the long-term rehabilitation centre (n = 5).

Test-Retest Reliability

The ICCs and their 95% confidence intervals for the targeted dimensions are presented in the first column of Table 4 (for other dimensions see Appendix). In Fig. 2, the ICCs for the target dimensions in the five tools are ordered from


TABLE 3. Number of subjects by sampling site, self-reported change in health status and percent completing both questionnaires

Sampling site                                        Total in   Positive change   No change   Decline in   Percent completing
                                                     study      in health         in health   health       both questionnaires^a
Worksite (steel mill)                                53         23                22          7            98.1% (52)
Early active rehabilitation (community clinics)      34         22                [3]^b       2            79.4% (27)
Longer-term rehabilitation (rehabilitation center)   40         [5]^b             27          2            87.5% (35)
Total                                                127        45                49          12           89.8% (114)

^a Those not completing the second questionnaire included loss to follow up.
^b Not included in the analysis based on an a priori decision of expected direction of change in this component of the sample.

the highest coefficient to the lowest. The shaded area indicates a target coefficient of R = 0.75, reflecting adequate to good test-retest reliability [68,69]. The Nottingham Health Profile and the Sickness Impact Profile had the highest reliability coefficients (0.76-0.95). The SF-36 and OHS ICCs vary between 0.74 and 0.85. The mobility section of the OHS had little variance (only one person changed in health between testings) and a meaningful ICC could not be calculated, despite 98% agreement. This has been described as the paradox of the Kappa’s [73,74]. The ICCs for the Duke

TABLE 4. Reliability and responsiveness statistics for the targeted dimensions of pain, physical functioning, and overall scores

Reliability^a (n = 49): ICC, test-retest (95% confidence intervals). Responsiveness^b (n = 45): observed change = mean (Test 2 - Test 1); effect size = mean (Test 2 - Test 1)/Std Dev of Test 1; standardized response mean (SRM) = mean (Test 2 - Test 1)/Std Dev of observed change (95% confidence intervals).

Dimension                 ICC (95% CI)          Observed change   Effect size   SRM (95% CI)
SF-36 Overall             0.85 (0.74, 0.92)     12.44             0.67          1.09 (0.77, 1.40)
SF-36 Pain                0.74 (0.58, 0.84)     23.48             0.99          1.13 (0.88, 1.43)
SF-36 Physical function   0.74 (0.58, 0.85)     15.48             0.55          0.81 (0.51, 1.11)
NHP Overall               0.95 (0.91, 0.97)     10.2              0.52          0.66 (0.34, 0.98)
NHP Pain                  0.87 (0.77, 0.98)     21.14             0.57          0.70 (0.39, 1.01)
NHP Physical function     0.76 (0.61, 0.86)     8.6               0.48          0.64 (0.34, 0.95)
Duke Overall              0.59 (0.35, 0.75)     5.58              0.34          0.48 (0.15, 0.83)
Duke Pain                 0.57 (0.34, 0.73)     19.7              0.74          0.73 (0.43, 1.03)
Duke Physical function    0.69 (0.51, 0.82)     12.68             0.51          0.62 (0.31, 0.92)
OHS Overall               0.78 (0.68, 0.89)     7.60              0.40          0.57 (0.27, 0.87)
OHS Pain                  0.70 (0.54, 0.83)     7.30              0.45          0.54 (0.25, 0.85)
OHS Physical function     NA^c                  0                 0             0 (-0.29, 0.29)
SIP Overall               0.93 (0.88, 0.96)     5.24              0.42          0.66 (0.39, 0.93)
SIP Physical function     0.94 (0.89, 0.96)     3.79              0.39          0.51 (0.22, 0.81)

^a Reliability statistics reported in subjects who reported no change in health between testings.
^b Responsiveness statistics reported in subjects who reported positive change in health between testings.
^c The ICC for the OHS mobility dimension was not valid. The sample had 98% agreement, due to having only one person change from the 100% ceiling effect at baseline. This extreme lack of variability in responses leads to an ICC which is an artifact of the data, rather than a reflection of the concordance.


FIGURE 2. Test-retest reliability coefficients. The magnitude of the nonparametric, one-way analysis of variance ICC for each of the target dimensions based on the sample of 49 subjects with no change in health between testings. The shaded area represents an “acceptable” level of 0.75 (see text for discussion). NA = not available. [Bar chart; horizontal axis: ICC value, 0 to 1; vertical axis: dimension, ordered from highest to lowest ICC: NHP Avg, SIP Physical, SIP Overall, NHP Pain, SF36 Avg, OHS self care, OHS overall, NHP Phys, SF36 Phys, SF36 Pain, OHS Pain, Duke Phys, Duke Gen H., Duke Pain, OHS Mobility.]

Health Profile were lower than the target, ranging between 0.57 and 0.69.
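The OHS mobility result, high agreement but no meaningful ICC, can be reproduced with a small sketch. The code below uses the standard parametric one-way analysis of variance ICC for two testings, which is an assumption made for illustration; the study reports a nonparametric variant, and the function name and both data sets are hypothetical.

```python
def icc_oneway(pairs):
    """One-way ANOVA intraclass correlation for two testings per subject.

    pairs: list of (test1_score, test2_score), one tuple per subject.
    """
    n, k = len(pairs), 2
    grand_mean = sum(a + b for a, b in pairs) / (n * k)
    subject_means = [(a + b) / 2 for a, b in pairs]
    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_within = sum((a - m) ** 2 + (b - m) ** 2
                    for (a, b), m in zip(pairs, subject_means))
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("no variance in the scores; the ICC is undefined")
    return (ms_between - ms_within) / denominator


# Hypothetical stable sample with a wide spread of scores between subjects.
spread = [(40, 42), (55, 53), (70, 72), (85, 83), (100, 98)]
print(icc_oneway(spread))

# Hypothetical ceiling effect: 48 of 49 subjects score 100 at both testings,
# and one subject drops from 100 to 75.
ceiling = [(100, 100)] * 48 + [(100, 75)]
agreement = sum(a == b for a, b in ceiling) / len(ceiling)
print(agreement, icc_oneway(ceiling))
```

In the second data set about 98% of subjects give identical answers at both testings, yet the only variance present is the single disagreement, so the between-subject and within-subject mean squares are equal and the coefficient falls to zero. The ICC therefore says little about concordance when almost all subjects sit at the ceiling.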

Responsiveness

I. USING THE RESPONSE TO THE TRANSITIONAL INDEX AS CRITERION OF CHANGE. Figure 3 shows the mean change scores for each of the self-reported levels of change in health in the three target dimensions: overall, physical functioning, and pain. Hypothetically, a tool sensitive to clinical change should also show increasing change scores for increasing levels of improvement in self-perceived health. In those with no change, one would expect a change score close to zero. When examining these figures two questions should be asked: Did the tool detect the change in health? And, did the tool detect the gradient from “no change” to “somewhat better” to “much better”? The SF-36, Nottingham, and Duke detected the biggest difference, with the most obvious gradient in the SF-36.

In those people who said they had a positive change in health, the observed change score ranged from zero for OHS mobility to 23.5 for the SF-36 pain score (Table 4).

FIGURE 3. Mean change score in targeted dimensions of physical function, overall health, and pain for each level of self-perceived change in health in each instrument. All tools scaled to 100 (100 = good health). Abbreviations: SF36 = Short Form-36, NHP = Nottingham Health Profile, Duke = Duke Health Profile, SIP = Sickness Impact Profile, OHS = Ontario Health Survey (modified). [Line charts; three panels: Mean Change in Physical Function Score, Mean Change in Overall Score, Mean Change in Pain Score (SIP has no pain score); horizontal axis: Same, Slightly better, Much better.]


FIGURE 4. Statistical measures of responsiveness: Standardized Response Mean (SRM) and Standardized Effect Size (SES) for each of the target dimensions in each instrument in those participants who stated that they had improved between testings (n = 45). [Bar charts; one panel for each statistic, each grouped by Overall Score, Pain, and Physical function; legend: SF-36, Nottingham Health Profile, Duke Health Profile, Health Status-OHS, Sickness Impact Profile.]

The Nottingham showed change scores ranging from 8.7 to 21.1, and the Duke from 5.6 to 12.7. The SIP and the OHS had the smallest change scores (0-7.6). The results of the SES and SRM are numerically summarized in Table 4. According to Cohen, SESs can be considered large (>0.8), moderate (0.5-0.8), or small (0.2-0.5) [11]. The effect sizes calculated for the SF-36 were large to moderate. For the Nottingham and the Duke, they were moderate to small. The SIP and the OHS showed only small effect sizes in the targeted dimensions. Similar guidelines for interpreting the magnitude of the SRM could be followed. Comparison of the point estimates of the SRMs of the tools show a trend in favor of the SF-36 (Fig. 4), though the differences were not statistically significant as shown by the overlapping confidence intervals (Table 4). SF-36 SRMs ranged between 0.81 and 1.13; the next highest were the Duke and Nottingham pain dimensions (0.73 and 0.70, respectively); other tools were in the 0.4-0.6 range (except the OHS mobility, which was zero).
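As a minimal sketch, Cohen's bands and the overlap check used above can be written as two small functions. The function names are hypothetical, the handling of values that fall exactly on 0.5 or 0.8 is an assumption (the text does not say which band includes the boundary), and the numbers passed in are taken from Table 4.

```python
def cohen_label(effect_size):
    """Label an effect size using the bands quoted from Cohen [11]."""
    size = abs(effect_size)
    if size > 0.8:
        return "large"
    if size >= 0.5:      # assumption: 0.5 itself is counted as moderate
        return "moderate"
    if size >= 0.2:
        return "small"
    return "below small"


def intervals_overlap(ci_a, ci_b):
    """True when two (lower, upper) confidence intervals share any value."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Effect sizes from Table 4.
for name, ses in [("SF-36 Overall", 0.67), ("SF-36 Pain", 0.99),
                  ("SIP Overall", 0.42), ("OHS Physical function", 0.0)]:
    print(name, cohen_label(ses))

# SRM confidence intervals from Table 4: SF-36 Overall versus SIP Overall.
print(intervals_overlap((0.77, 1.40), (0.39, 0.93)))
```

Overlapping intervals are used here only as the informal indication of non-significance that the text relies on; a formal comparison of two SRMs from the same subjects would need a test that accounts for their correlation.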

Although the three measures of responsiveness (change scores, SES, SRM) shown here resulted in different rankings of the instruments by dimension, the SF-36 consistently ranked first for all dimensions (physical function, pain, and overall) using all three responsiveness measures.

II. USING THE SAMPLE OF THOSE EXPECTED TO CHANGE AS A CRITERION FOR CHANGE. The same statistical methods were used on the entire sample from the work site and the physiotherapy clinics in an attempt to examine responsiveness without depending on the one “gold standard” transitional health status question (Table 5). The results of this analysis suggest that although the magnitude of the statistics is smaller, the same pattern emerges. The SF-36 remains most responsive to the clinical change hypothesized through the sampling strategy. The SRM values for the SF-36 ranged between 0.59 and 0.67, with the next closest being the Duke (0.28-0.48). The others were between 0.07 (OHS physical functioning) and 0.41 (NHP physical functioning). Again, no statistically significant differences were found between instruments.

TABLE 5. Responsiveness statistics for the sample expected to improve between testings

Observed change = mean (Test 2 - Test 1); effect size = mean (Test 2 - Test 1)/Std Dev of Test 1; standardized response mean (SRM) = mean (Test 2 - Test 1)/Std Dev of observed change (95% confidence intervals).

Dimension                 Observed change   Effect size   SRM (95% CI)
SF-36 Overall             7.9               0.40          0.65 (0.42, 0.89)
SF-36 Pain                14.5              0.60          0.67 (0.45, 0.90)
SF-36 Physical function   10.5              0.37          0.59 (0.37, 0.83)
NHP Overall               4.2               0.20          0.30 (0.07, 0.54)
NHP Pain                  9.7               0.26          0.35 (0.13, 0.58)
NHP Physical function     5.2               0.29          0.41 (0.18, 0.64)
Duke Overall              3.0               0.23          0.28 (0.03, 0.53)
Duke Pain                 13.8              0.45          0.48 (0.25, 0.70)
Duke Physical function    -8.2              0.31          0.44 (0.22, 0.67)
OHS Overall               4.3               0.23          0.29 (0.07, 0.52)
OHS Pain                  4.2               0.28          0.22 (0.08, 0.53)
OHS Physical function     -0.04             0             0.07 (-0.16, 0.29)
SIP Overall               2.87              0.24          0.39 (0.17, 0.61)
SIP Physical function     2.19              0.26          0.36 (0.14, 0.58)

All subjects from the work site and from the physiotherapy clinics (n = 79). Results are presented for the targeted dimensions of pain, physical function, and overall scores.

DISCUSSION

The choice of a measure of health status for use in a study should be based on the measurement properties of the tool for its intended use in its intended population. In this study, we undertook a concurrent comparison over time of five generic measures. To detect change over time a tool must be both reliable (consistent responses when no change has occurred) and responsive to clinical change (when change has occurred).

Concurrent comparison of different tools in the same study can help investigators determine the relative merits of each in certain applications. Unfortunately, there are very few such comparisons in the literature [41,54,55,58] and even fewer articles explore responsiveness in head-to-head comparisons [2,55,66].

The selection of subjects for analysis in each of the reliability and responsiveness sections of the present study required an external criterion on which to base the judgment of change between testing sessions. The transitional scale of health (“much better” to “much worse”) has had support in previous psychometric studies [14,17,53,62]. However, the true validity of this technique may warrant further investigation, particularly to explore the influence of expectations (internal or external) on the individual reporting change (i.e., the expectation of improvement after a course of physiotherapy). Subjects were expected to change in health in two sites: either due to the natural history of the disorder (i.e., workers enrolled from a work site at time of injury) or due to the expected improvement with involvement in physiotherapy. Repeating the responsiveness analysis on the entire sample expected to change between testings did not change the relative strengths of the different tools in terms of responsiveness.

Instrument Selection

The purpose of this study was to guide the selection of an instrument to measure change in health in injured workers with work-related musculoskeletal disorders.

The selection of an evaluative tool must be based on a balance of reliability and responsiveness [6,62]. The Duke Health Profile did relatively well in the responsiveness analysis, though it was unstable in test-retest reliability. On the other hand, the SIP had excellent reliability, but little responsiveness to clinical change. The SF-36 had adequate test-retest reliability, and tended to be stronger in several statistical measures of responsiveness. We have therefore selected the SF-36 as a tool to measure change in our planned study of injured workers.

At the time of the analysis, no summary scores for the SF-36 were available. Using eight dimensions as the principal outcome in a clinical trial raises issues of adjustments to account for the multiple comparisons and the need to increase the sample size. Aggregate scores for the SF-36 have now been developed summarizing the dimensions into mental health and physical health scores [5,38]. Like others [3,66], we have formulated an unweighted mean across the dimensions as an aggregate score. Although the developers do not recommend this summary score, it performed well when compared with other recommended aggregate scores in other tools (SIP, Duke, and OHS). An unweighted mean across dimensions is based on the assumption that each dimension has equal weighting in the overall health of an individual. This assumption may need to be validated, and does not reflect the method of aggregation recommended by the developers.

Test-Retest Reliability

Guidelines for acceptable ICC values vary. Streiner and Norman [69] suggest that a tool with good reliability when studying groups of people should have an ICC exceeding 0.85; Fleiss [75] and Tammemagi [76] cite R > 0.75 to be acceptable. Donner and Eliasziw [65] suggest that R > 0.60 is a minimal standard. McHorney and Tarlov [77] reinforce that higher standards are required if using the questionnaire to interpret individual data (such as change in a score in one patient) rather than group data (change in a score of a cohort of patients). They suggest that coefficients greater than 0.90 are required for use of the questionnaire in individuals. Our test-retest analysis rarely produced values greater than 0.85, although they did exceed 0.70. As in McHorney and Tarlov’s work [77], this may suggest some difficulty in interpreting score changes in individuals, but may mean the instruments are reliable for interpreting grouped data. The lower reliability could be due to instability in the tools, or instability in the subjects. The latter effect might reflect the inherent instability of health, even when deliberately sampling subjects who are experiencing disability relevant to the target population (musculoskeletal soft tissue disorders in injured workers) at a later stage of their disease experience.

Higher coefficients would probably have been achieved by decreasing the time between testing [69]; however, this could have increased recall bias between testing, and would have failed to reflect the likely time between testing in the proposed study for which this comparison was made. Guyatt [72,78], among others, has debated the need to have a high reliability coefficient when seeking a tool that is to be used for evaluative functions. The results from our study suggest that the Duke Health Profile was least stable in our sample (ICC between 0.349 and 0.6935). Other tools demonstrated adequate to good test-retest reliability for group comparisons in the targeted dimensions.

Responsiveness

There is no “gold standard” for summarizing responsiveness, although some consensus is needed to compare results in published studies [54,66]. The most efficient responsiveness statistic is still a matter of debate [11,54,66,71]. Like other researchers [2,24,29], we have looked for a trend across different recommended statistics. The literature demonstrates inconsistency in the methods used for calculating responsiveness statistics [2,24,54] and readers must be cautioned to examine the formulae and adaptations made to the different statistics.

The results of the present study show the SF-36 as most responsive to the clinical change experienced by the subjects investigated, whether estimated according to the change in health status, or the sampling strategy. The magnitudes of our statistics are comparable to those reported by Katz in 1992 [66]. Like Katz [66] we did not have an adequate sample size to find a statistically significant difference between the questionnaires. We also found differences in point estimates that might be clinically important and this would make a difference in sample size calculation for a clinical trial.
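To illustrate how differences in point estimates feed into a sample size calculation, the sketch below uses the usual normal-approximation formula for a single-group before-after comparison, n = ((z for alpha/2 + z for power) / SRM) squared. This is an assumption chosen for illustration and not a calculation reported in the study; the function name, the 5% two-sided significance level, and the 80% power are hypothetical choices, and the SRM values are the overall scores from Table 4.

```python
import math
from statistics import NormalDist


def paired_sample_size(srm, alpha=0.05, power=0.80):
    """Approximate number of subjects needed to detect a mean change whose
    standardized response mean is `srm` (two-sided test, normal approximation)."""
    if srm <= 0:
        raise ValueError("srm must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / srm) ** 2)


# Overall-score SRMs from Table 4.
for name, srm in [("SF-36", 1.09), ("NHP", 0.66), ("Duke", 0.48)]:
    print(name, paired_sample_size(srm))
```

Under these assumptions the required number of subjects rises from 7 for an SRM of 1.09 to 19 for 0.66 and 35 for 0.48, so differences between instruments that are not statistically significant here could still matter when planning a trial. A calculation for a real study would use the t distribution and the design actually planned.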

The study’s weaknesses lie in not having the power to determine whether the responsiveness was as good in the different conditions within the sample as overall. For example, would the tool be equally responsive in upper extremity conditions as in low back pain patients? Exploratory analysis of the data suggests that the SF-36 is still the most responsive of the five tools investigated, but that effect sizes were not the same for conditions in different body parts (upper limb versus low back pain). There is therefore a need for further research into the responsiveness across different conditions and understanding the meaning of this difference (an insensitive instrument versus less disabled or less responsive sample of subjects).

Although sensitivity to clinical change is traditionally felt to be within the domain of disease-specific measures of HRQOL, the researcher must be aware of the need for a generic tool also to be sensitive to clinical change. Since disease-specific measures may not allow a comparison between diseases, injuries, and conditions, the clinical trials incorporating generic measures might be used as indicators of the severity of the disease state concerned, the apparent need for health services, and the response to those services. Because generic tools will be used across conditions, they will be called upon to determine the relative effectiveness of different treatments across different conditions in the health care system. The researcher selecting a generic health status measure should seek an instrument responsive to the clinical change expected in their study. This study demonstrated that, compared to four other generic health status measures, the SF-36 Acute was the most responsive to clinical improvement in injured workers with musculoskeletal disorders. Further research is required to discover the most responsive instrument for other populations and other conditions.

The authors wish to acknowledge the cooperation of staff and study participants at the sites used in this study: Dofasco Steel, Hamilton; Downsview Rehabilitation Centre, Downsview; and the six physiotherapy clinics: Canadian Back Institute (Brampton and York St.), Canadian Injury Recovery Clinic, College Street Physiotherapy, Physiotherapy One, and the Sports Rehabilitation Institute. They would also like to thank Dr. G. Torrance, Dr. S. Stansfield, Dr. F. Guilleman, and Ms. Charmaine Heath for their thoughtful feedback and assistance with the manuscript. Special thanks are due to the people involved in the data collection: Sue Ferrier, Rose Lee, Theresa Ghan, and Bernadine Ayres. Thanks also to Michael Mondloch for his assistance with the scoring program for the Ontario Health Survey. Ms. D. Beaton was supported by the Institute for Work & Health Clinical Research Fellowship for this project.

References

1. Aaronson NK. Quality of life assessment in clinical trials: Methodological issues. Con Clin Trial 1989; 10: 195S-208S.
2. Bombardier C, Raboud J. A comparison of health-related quality-of-life measures for rheumatoid arthritis research. Con Clin Trial 1991; 12: 243S-256S.
3. Cox DR, Fitzpatrick R, Fletcher AE, Gore SM, Spiegelhalter D, Jones DR. Quality of life assessment: Can we keep it simple? J R Statis Soc 1992; 155: 353-393.
4. Bergner M. Quality of life, health status, and clinical research. Med Care 1989; 27(3 Suppl.): S148-S156.
5. Patrick DL, Erickson P. Health Status and Health Policy: Allocating Resources to Health Care. Oxford: Oxford University Press; 1993.
6. Deyo RA, Diehr P, Patrick DL. Reproducibility and responsiveness of health status measures. Con Clin Trial 1991; 12: 142S-158S.
7. Spiegelhalter D, Gore SM, Fitzpatrick R, Fletcher AE, Jones DR, Cox DR. Quality of life measures in health care: III. Resource allocation. Br Med J 1992; 305: 1205-1209.
8. Fletcher A, Gore S, Jones D, Fitzpatrick R, Spiegelhalter D, Cox D. Quality of life measures in health care: II. Design, analysis, and interpretation. Br Med J 1992; 305: 1145-1148.
9. Deyo RA, Patrick DL. Barriers to the use of health status measures in clinical investigation, patient care, and policy research. Medical Care 1989; 27(3 Suppl.): S254-S268.
10. Guyatt GH, Townsend M, Berman LB, Keller J. A comparison of Likert and visual analogue scales for measuring change in function. J Chronic Dis 1987; 40: 1129-1133.
11. Cohen J. Statistical Power Analysis for the Behavioral Sciences, 2nd Edition. New Jersey: Lawrence Erlbaum Associates; 1988.
12. Bombardier C, Tugwell PX. Methodological considerations in functional assessment. J Rheumatol 1987; 14(Suppl. 15): 6-10.
13. Ware JE Jr., Brook RH, Davies AR, Lohr KN. Choosing measures of health status for individuals in general populations. Am J Pub Health 1981; 71: 620-625.
14. Deyo RA, Inui TS. Toward clinical applications of health status measures: Sensitivity of scales to clinically important changes. HSR 1984; 19: 275-289.
15. Bergner M. Health status measures: an overview and guide for selection. Annu Rev Publ Health 1987; 8: 191-210.
16. Fitzpatrick R, Fletcher A, Gore S, Jones D, Spiegelhalter D, Cox D. Quality of life measures in health care: I. Applications and issues in assessment. Br Med J 1992; 305: 1074-1077.
17. Deyo RA, Centor RM. Assessing the responsiveness of functional scales to clinical change: An analogy to diagnostic test performance. J Chronic Dis 1986; 39: 897-906.
18. Patrick DL, Deyo RA. Generic and disease-specific measures in assessing health status and quality of life. Medical Care 1989; 27 Suppl.: S217-S232.
19. Hunt SM, McKenna SP, Williams J. Reliability of a population survey tool for measuring perceived health problems: A study of patients with osteoarthrosis. J Epi Comm Health 1981; 35: 297-300.
20. Ware JE. Measuring patients’ views: The optimum outcome measure. Br Med J 1993; 306: 1429-1430.
21. Feinstein AR. Benefits and obstacles for development of health status assessment measures in clinical settings. Medical Care 1992; 30(5 Suppl.): MS50-MS56.
22. Patrick DL, Bergner M. Measurement of health status in the 1990s. Annu Rev Publ Health 1990; 11: 165-183.
23. Guyatt GH, Feeny DH, Patrick DL. Measuring health-related quality of life. Ann Intern Med 1993; 118: 622-629.
24. Wiklund I, Karlberg J. Evaluation of quality of life in clinical trials: Selecting quality of life measures. Con Clin Trial 1991; 12: 204S-216S.
25. Guyatt GH, Bombardier C, Tugwell PX. Measuring disease-specific quality of life in clinical trials. CMAJ 1986; 134: 889-895.
26. Guyatt GH, Van Zanten SJOV, Feeny DH, Patrick DL. Measuring quality of life in clinical trials: a taxonomy and review. CMAJ 1989; 140: 1441-1448.
27. Bindman AB, Keane D, Lurie N. Measuring health changes among severely ill patients. Medical Care 1990; 28: 1142-1152.
28. Kirshner B, Guyatt GH. A methodological framework for assessing health indices. J Chronic Dis 1985; 38: 27-36.
29. Deyo RA. Comparative validity of the sickness impact profile and shorter scales for functional assessment in low-back pain. Spine 1986; 11: 951-954.


30. Jacobson HG. Magnetic resonance imaging of the head and neck region. Present status and future potential. JAMA 1988; 260: 3313-3326.
31. Lankhorst GJ, Van de Stadt RJ, Vogelaar TW, Van der Korst JK, Prevo AJH. The effect of the Swedish back school in chronic idiopathic low back pain: A prospective controlled study. Scand J Rehab Med 1983; 15: 141-145.
32. Waddell G. A new clinical model for the treatment of low-back pain. Spine 1987; 12: 632-644.
33. Spitzer WO, LeBlanc FE, Dupuis M, et al. Scientific approach to the assessment and management of activity-related spinal disorders: a monograph for clinicians. Report of the Quebec task force on spinal disorder. Spine 1987; 12: S4-S55.
34. Frymoyer JW, Ducker TB, Hadler NM, Kostuik JP, Weinstein JN, Whitecloud TS, III. The Adult Spine: Principles and Practice. New York: Raven Press; 1991.
35. Roland M, Morris R. A study of the natural history of low-back pain. Part II. Development of guidelines for trials of treatment on primary care. Spine 1983; 8: 145-150.
36. Cheadle A, Franklin G, Wolfhagen C, et al. Factors influencing the duration of work-related disability: A population-based study of Washington state workers’ compensation. AJPH 1994; 84: 190-196.
37. Dillane JB, Fry J, Kalton G. Acute back syndrome: A study from general practice. Br Med J 1966; 2: 82-84.
38. Ware JE, Jr, Snow KK, Kosinski M, Gandek B. SF-36 Health Survey Manual and Interpretation Guide. Boston: The Health Institute; 1993.
39. McHorney CA, Ware JE, Jr, Rogers W, Raczek AE, Lu JFR. The validity and relative precision of MOS short- and long-form health status scales and Dartmouth COOP charts: Results from the medical outcomes study. Medical Care 1992; 30(5 Suppl.): MS253-MS265.
40. McHorney CA, Ware JE, Jr, Rogers W, Raczek AE. The MOS 36-Item Short-Form Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Medical Care 1993; 31: 247-263.
41. Brazier JE, Harper R, Jones NMB, et al. Validating the SF-36 health survey questionnaire: new outcome measure for primary care. Br Med J 1992; 305: 160-164.
42. Garratt AM, Ruta DA, Abdalla MI, Buckingham JK, Russell IT. The SF 36 health survey questionnaire: An outcome measure suitable for routine use within the NHS? Br Med J 1993; 306: 1440-1444.
43. Hunt SM, Wiklund I. Cross-cultural variation in the weighting of health statements: A comparison of English and Swedish valuations. Health Policy 1987; 8: 227-235.
44. Jenkinson C, Fitzpatrick R, Argyle M. The Nottingham health profile: An analysis of its sensitivity in differing illness groups. Soc Sci Med 1988; 27: 1411-1414.
45. Ebrahim S, Barer D, Nouri F. Use of the Nottingham health profile with patients after a stroke. J Epi Comm Health 1986; 40: 166-169.
46. Barkun JS, Barkun AN, Sampalis JS, et al. Randomised controlled trial of laparoscopic versus mini cholecystectomy. Lancet 1992; 340: 1116-1119.
47. Cox IM, Campbell MJ, Dowson D. Red blood cell magnesium and chronic fatigue syndrome. Lancet 1991; 337: 757-760.
48. Lagro-Janssen TLM, Smits AJA, van Weel C. Women with urinary incontinence: Self-perceived worries and general practitioners’ knowledge of problem. Br J Gen Practice 1990; 40: 331-334.
49. McKenna SP, Hunt SM, McEwen J. Absence from work and perceived health among mine rescue workers. J Soc Occup Med 1981; 31: 151-157.
50. Bergner M, Bobbitt RA, Carter WB, Gilson BS. The sickness impact profile: Development and final revision of a health status measure. Medical Care 1981; 19: 787-805. (Abstract)
51. Gilson BS, Gilson JS, Bergner M, et al. The sickness impact profile. Development of an outcome measure of health care. Can J Pub Health 1975; 65: 1304-1310.
52. Rossignol M. Planning preventive occupational health services at the community level. Can J Pub Health 1991; 82: 115-119.
53. MacKenzie CR, Charlson ME, DiGioia D, Kelley K. Can the sickness impact profile measure change? An example of scale assessment. J Chronic Dis 1986; 39: 429-438.
54. Liang MH, Fossel AH, Larson MG. Comparisons of five health status instruments for orthopedic evaluation. Medical Care 1990; 28: 632-642.
55. Liang MH, Larson MG, Cullen KE, Schwartz JA. Comparative measurement efficiency and sensitivity of five health status instruments for arthritis research. Arthritis Rheum 1985; 28: 542-547.
56. Roland M, Morris R. A study of the natural history of back pain. Part I. Development of a reliable and sensitive measure of disability in low-back pain. Spine 1983; 8: 141-144.
57. Parkerson GR, Broadhead WE, Tse C-KJ. The Duke health profile: A 17-item measure of health and dysfunction. Medical Care 1990; 28: 1056-1072.
58. Parkerson GR, Broadhead WE, Tse C-KJ. Comparison of the Duke health profile and the MOS short-form in healthy young adults. Medical Care 1991; 29: 679-683.
59. Parkerson GR, Gehlbach SH, Wagner EH, James SA, Clapp NE, Muhlbaier LH. The Duke-UNC Health Profile: An adult health status instrument for primary care. Medical Care 1981; XIX: 806-823. (Abstract)
60. Torrance GW, Zhang Y, Feeny D, Furlong W, Barr R. Multi-Attribute Preference Functions for a Comprehensive Health Status Classification System. Hamilton: McMaster University; 1992.
61. Barr R, Pai MKR, Weitzman S, et al. A multi-attribute approach to health status measurement and clinical management: Illustrated by an application to brain tumors in childhood. Int J Oncology 1994; 4: 639-648.
62. Guyatt GH, Walter SD, Norman GR. Measuring change over time: Assessing the usefulness of evaluative instruments. J Chron Dis 1987; 40: 171-178.
63. Singer J, Gilbert JR, Hutton T, Taylor DW. Predicting outcome in acute low-back pain. Can Fam Physician 1987; 33: 655-659.
64. Kraemer HC, Korner AF. Statistical alternatives in assessing reliability, consistency, and individual differences for quantitative measures: Application to behavioral measures of neonates. Psych Bulletin 1976; 83: 914-921.
65. Donner AP, Eliasziw M. Sample size requirements for reliability studies. Stat Med 1987; 6: 441-448.
66. Katz JN, Larson MG, Phillips CB, Fossel AH, Liang MH. Comparative measurement sensitivity of short and longer health status instruments. Medical Care 1992; 30: 917-925.
67. Bartko JJ. The intraclass correlation coefficient as a measure of reliability. Psychol Rep 1966; 19: 3-11.
68. Shrout PE, Fleiss JL. Intraclass correlations: Uses in assessing rater reliability. Psych Bulletin 1979; 86: 420-428.
69. Streiner DL, Norman GR. Health Measurement Scales. A Practical Guide to Their Development and Use. Oxford, New York, Tokyo: Oxford University Press; 1989.
70. Stratford PW. Confidence limits for your ICC. Phys Ther 1989; 69: 237-238.
71. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Medical Care 1989; 27(3 Suppl.): S178-S189.


72. Jaeschke R, Singer J, Guyatt GH. Measurement of health status. Ascertaining the minimal clinically important difference. Con Clin Trial 1989; 10: 407-415.

73. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 1990; 43: 543-549.

74. Feinstein AR, Cicchetti DV. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990; 43: 551-558.

75. Fleiss JL. The Design and Analysis of Clinical Experiments. New York: John Wiley & Sons; 1986.

76. Tammemagi MC, Frank JW, Leblanc M, Artsob H, Streiner DL. Methodological issues in assessing reproducibility - A comparative study of various indices of reproducibility applied to repeat ELISA serological tests for Lyme Disease. J Clin Epidemiol 1995; 48: 1123-1132.

77. McHorney CA, Tarlov AR. Individual-patient monitoring in clinical practice: Are available health status surveys adequate? Qual Life Res 1995; 4: 293-307.

78. Guyatt GH, Kirshner B, Jaeschke R. Measuring health status: What are the necessary measurement properties? J Clin Epidemiol 1992; 45: 1341-1345.

APPENDIX. Results of the responsiveness analysis for all dimensions in the five instruments

Reliability (n = 49) a: ICC, test-retest.
Responsiveness (n = 45) b: Observed change = Mean(test2 - test1); Effect size = Mean(test2 - test1) / Std Dev(test1); Standardized response mean (SRM) = Mean(test2 - test1) / Std Dev(test2 - test1).

Dimension                     ICC          Observed change  Effect size  SRM

SF-36 Physical function       0.74         15.48            0.55         0.81
SF-36 Role-physical           0.65         25               0.76         0.62
SF-36 Role-emotional          0.64         10.23            0.36         0.36
SF-36 Pain                    0.74         23.48            0.99         1.13
SF-36 Social                  0.79         14.73            0.45         0.66
SF-36 Mental                  0.81         1.3              0.06         0.08
SF-36 Energy                  0.74         10.23            0.45         0.54
SF-36 General health          0.86         3.45             0.18         0.23
SF-36 Average                 0.85         12.44            0.67         1.09

OHS Cognitive                 0.71         0.6              0.16         0.26
OHS Emotional                 0.86         0                0            0
OHS Mobility                  [illegible]  0                0            0
OHS Pain                      [illegible]  7.30             0.45         0.54
OHS Self-care                 0.79         0.30             0.07         0.17
OHS Senses                    0.71         -0.1             -0.03        -0.02
OHS Overall                   0.78         7.60             0.40         0.57

Duke Physical health          0.69         12.68            0.51         0.62
Duke Mental health            0.42         4.8              0.25         0.358
Duke Social health            0.35         -1.4             -0.09        -0.077
Duke General health           0.59         5.58             0.34         0.48
Duke Perceived health         0.61         8.3              0.29         0.29
Duke Self-esteem              0.47         -0.03            -0.19        -0.23
Duke Anxiety                  0.68         5.2              0.32         0.36
Duke Depression               0.50         7.03             0.42         0.56
Duke Pain                     0.57         19.7             0.74         0.73
Duke Disability               0.68         17.4             0.46         0.45

NHP Energy                    0.86         12.3             0.42         0.41
NHP Emotional reaction        0.83         3.69             0.18         0.25
NHP Social isolation          0.81         7.67             0.30         0.49
NHP Pain                      0.87         21.14            0.57         0.70
NHP Physical mobility         0.76         8.6              0.48         0.64
NHP Sleep                     0.83         10.5             0.32         0.39
NHP Average                   0.95         10.2             0.52         0.66

SIP Ambulation                0.84         6.62             0.30         0.35
SIP Sleep and rest            0.63         7.38             0.40         0.55
SIP Emotional behavior        0.88         5.29             0.29         0.51
SIP Body care & movement      0.89         3.56             0.35         0.43
SIP Mobility                  0.84         4.31             0.37         0.41
SIP Home management           0.81         7.52             0.38         0.38
SIP Social interaction        0.93         11.16            0.57         0.362
SIP Alertness                 0.88         3.91             0.39         0.475
SIP Recreation & pastimes     0.87         6.48             0.32         0.324
SIP Work                      0.63         10.27            0.33         0.379
SIP Physical                  0.94         3.79             0.39         0.51
SIP Psychosocial              0.92         5.05             0.30         0.51
SIP OVERALL                   0.93         5.24             0.42         0.66

a Reliability statistics reported in subjects who reported no change in health between testings.
b Responsiveness statistics reported in subjects who reported positive change in health between testings.
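
The responsiveness statistics defined in the appendix column headings can be illustrated with a minimal sketch. It assumes paired scores from the two test occasions and the sample (n - 1) standard deviation; the source does not state which standard deviation estimator was used. The function name `responsiveness` and the scores are hypothetical and are not data from the study.

```python
from statistics import mean, stdev


def responsiveness(test1, test2):
    """Return (observed change, effect size, standardized response mean).

    observed change = Mean(test2 - test1)
    effect size     = Mean(test2 - test1) / Std Dev(test1)
    SRM             = Mean(test2 - test1) / Std Dev(test2 - test1)
    Assumption: sample (n - 1) standard deviations are used throughout.
    """
    if len(test1) != len(test2) or len(test1) < 2:
        raise ValueError("need paired scores for at least two subjects")
    # Per-subject change between the two occasions.
    changes = [second - first for first, second in zip(test1, test2)]
    observed_change = mean(changes)
    effect_size = observed_change / stdev(test1)
    srm = observed_change / stdev(changes)
    return observed_change, effect_size, srm


# Hypothetical scores for five subjects (illustrative only).
test1 = [40.0, 55.0, 60.0, 45.0, 50.0]
test2 = [55.0, 60.0, 80.0, 50.0, 70.0]
observed_change, effect_size, srm = responsiveness(test1, test2)
print(f"change={observed_change:.2f} ES={effect_size:.2f} SRM={srm:.2f}")
```

The two ratios share a numerator and differ only in the denominator: the effect size scales change by the spread of baseline scores, whereas the standardized response mean scales it by the spread of the change scores themselves. Either ratio is undefined when its denominator has zero variance.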