
Regular Article

Validation of computer-administered clinical rating scale: Hamilton Depression Rating Scale assessment with Interactive Voice Response technology – Japanese version

Hiroshi Kunugi, MD, PhD,1* Norie Koga, MA,1 Miyako Hashikura, MA,1 Takamasa Noda, MD,2 Yu Shimizu, MA,2 Takayuki Kobayashi, B. Pharm,3 Jun Yamanaka, M. Pharm,3 Noriaki Kanemoto, M. HSc4 and Teruhiko Higuchi, MD, PhD5

1 Department of Mental Disorder Research, National Institute of Neuroscience, National Center of Neurology and Psychiatry, 2 National Center of Neurology and Psychiatry Hospital, 3 Clinical Research Department, Development and Medical Affairs Division, 4 Biomedical Data Sciences Department, Development and Medical Affairs Division, GlaxoSmithKline K.K. and 5 National Center of Neurology and Psychiatry, Tokyo, Japan

Aim: The aim of this study was to examine the reliability and validity of the Interactive Voice Response (IVR) program to rate the 17-item Hamilton Rating Scale for Depression (HAM-D) score in Japanese depressive patients.

Methods: Depression severity was assessed in 60 patients by a clinician and psychologists using HAM-D. Scoring by the IVR program was conducted on the same and the following days. Test–retest reliability, internal consistency, and concurrent validity for total HAM-D scores were examined by calculating intraclass correlation coefficient, Cronbach’s alpha, and Pearson’s correlation coefficient. Inter-rater consistency for each HAM-D item was examined by Cohen’s kappa.

Results: Test–retest reliability of the IVR program was high (intraclass correlation coefficient: 0.93). Internal consistency of each total score obtained by the clinician, psychologists, and IVR program (Days 1 and 2) was high (Cronbach’s alpha: 0.77, 0.79, 0.78, and 0.83). Regarding concurrent validity, correlation coefficients between total scores obtained by the clinician versus IVR and by the clinician versus psychologists were high (0.81 and 0.93). The HAM-D total score rated by the clinician was 3 points lower than that of IVR. Inter-rater consistency for each HAM-D item evaluated by the clinician versus IVR was estimated to be fair (Cohen’s kappa coefficient: 0.02–0.50).

Conclusion: Our results suggest that the Japanese IVR HAM-D program is reliable and valid to assess 17-item HAM-D total score in Japanese depressive patients. However, the current program tends to overestimate depression severity, and the score of each item did not always show high agreement with the clinician’s rating, which warrants further improvement in the program.

Key words: Hamilton Rating Scale for Depression, interactive voice response program, Japanese, reliability, validity.

*Correspondence: Hiroshi Kunugi, MD, PhD, Department of Mental Disorder Research, National Institute of Neuroscience, National Center of Neurology and Psychiatry, 4-1-1, Ogawahigashi, Kodaira, Tokyo 187-8502, Japan. Email: [email protected]

Received 9 August 2012; revised 27 December 2012; accepted 9 January 2013.

Psychiatry and Clinical Neurosciences 2013; 67: 253–258 doi:10.1111/pcn.12048

© 2013 The Authors. Psychiatry and Clinical Neurosciences © 2013 Japanese Society of Psychiatry and Neurology

Treatment outcomes in antidepressant medication trials are usually assessed by clinician-administered rating scales, such as the Hamilton Rating Scale for Depression (HAM-D) [1]. In addition to the use of a structured interview guide for such scales (e.g. Williams, 1988) [2], training raters to administer such outcome measures reliably and validly has become a critical concern for clinical research over the past decade [3–7]. During this period, structured, procedurally invariant computer algorithms to obtain comparable electronic patient reported outcomes (ePRO) using interactive voice response (IVR) technology have been developed and validated in the USA [8–11]. In light of mounting evidence of systemic bias in clinician-administered rating scales in randomized clinical trials [12,13] and evidence of psychometric equivalence between the clinician-rated and IVR-rated HAM-D assessments (clinician HAM-D; IVR HAM-D) [8–11], the Food and Drug Administration (FDA) Division of Psychiatric Products announced that ePRO assessments, such as the IVR HAM-D assessment, would be acceptable as primary outcome measures to establish the efficacy of antidepressant medications in outpatient randomized clinical trials (presented by T. Laughren from the FDA at the New Clinical Drug Evaluation Unit at Phoenix, AZ on 30 May 2008). Although the clinical validity and usefulness of the IVR HAM-D assessment in clinical trials have been established and accepted in the USA, no studies have specifically demonstrated comparability between clinician and IVR HAM-D assessments in Japanese or any other Asian populations. This study aimed to examine the reliability and validity of a Japanese IVR HAM-D (17 items) program.

METHODS

Subjects

Subjects were 60 Japanese patients with major depressive episode or dysthymia who had a wide range of depression severity. They were under treatment and were recruited at the National Center of Neurology and Psychiatry (NCNP) Hospital (Tokyo, Japan) from July 2010 through October 2011. Diagnosis was made according to the DSM-IV-TR [14] based on a structured interview (the Mini-International Neuropsychiatric Interview [15]), an additional unstructured interview and information from medical charts. Subjects who had functional disturbance of recognition, delirium or other conditions that would compromise insight were excluded from the study. The study protocol was approved by the ethics committees at NCNP and GlaxoSmithKline K.K. (GSKKK). After the study had been described, written informed consent was obtained from every subject. Research was conducted in accordance with the Helsinki Declaration as revised in 2008.

Assessment procedure

The IVR HAM-D program [9–11] was provided by eResearch Technology (ERT; Philadelphia, PA, USA). The Japanese HAM-D script (pre-recorded by a native Japanese speaker) built into the IVR program was linguistically validated against the original English script by ERT. The original English script was developed by Healthcare Technology Systems (Philadelphia, PA, USA) (http://www.healthtechsys.com/). The IVR HAM-D assessment was conducted by telephone. The patients called the IVR HAM-D center and entered their ID numbers and passwords by pushing buttons on the phone, which initiated automated voice-based questions about symptoms according to the script. The patients answered all questions by pushing buttons on the phone. Based on the answers, the program provided a score for each HAM-D item.

On Day 1, each patient first completed the IVR HAM-D assessment and then underwent the clinician HAM-D and psychologist-rated HAM-D (psychologist HAM-D) assessments. Assistance by a coordinator on how to operate the phone was available on Day 1. The clinician and the psychologists, who did not know the details of the script and algorithm of the IVR program, rated the HAM-D on the basis of the Japanese version of the Structured Interview Guide for the Hamilton Depression Rating Scale (SIGH-D) [16]. The clinician rater was a psychiatrist (H.K.), a fellow member of the Japanese Society of Psychiatry and Neurology, who had 25 years of clinical and research experience. Two psychologists rated the subjects (the first 10 subjects were rated by M.H. and the remaining 50 by N.K.). The patients were also instructed to self-administer the IVR HAM-D (i.e. without attendance of the coordinator) 1 or 2 days after their initial assessment at their own place (Day 2). They were instructed to do this second IVR HAM-D assessment at the same time of day as the first assessment in order to avoid the possible effect of circadian change of depression severity. On Day 1, a well-trained psychologist was present at the interview at which the clinician administered the HAM-D, and provided an independent rating for the same subject. The ratings from the clinician and the psychologist were handled independently throughout the study.

Statistical analyses

To assess the test–retest reliability of the IVR program, the intraclass correlation coefficient (ICC) of IVR HAM-D total scores was calculated using data obtained on Days 1 and 2. To investigate internal consistency and construct validity of HAM-D total score, Cronbach’s alpha coefficient was calculated for clinician HAM-D (Day 1), psychologist HAM-D (Day 1) and IVR HAM-D assessments (Days 1 and 2). Scores of the two psychologists were combined in the analysis. To assess concurrent validity of HAM-D total scores, Pearson’s correlation coefficient was calculated for clinician HAM-D and IVR HAM-D assessments (Day 1) and for clinician and psychologist HAM-D assessments (Day 1). To investigate inter-rater consistency, Cohen’s kappa coefficient was calculated for each HAM-D item for clinician HAM-D and IVR HAM-D assessments (Day 1), IVR HAM-D assessments (Days 1 and 2), and clinician and psychologist HAM-D assessments (Day 1). A paired t-test was used to compare total HAM-D scores rated by IVR (Day 1), clinician, and psychologist. Statistical analyses were performed using SAS 9.1.3 (http://www.sas.com/).
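As an illustration of two of the statistics named above, the following minimal Python sketch computes Cronbach’s alpha and an intraclass correlation coefficient. It is not the authors’ SAS code. The article does not state which ICC model was used, so the one-way random-effects form, ICC(1,1), is an assumption here, and the function names are hypothetical.

```python
from statistics import mean, variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for rows of per-subject item scores (one list per subject)."""
    k = len(item_scores[0])                      # number of items (17 for the HAM-D)
    columns = list(zip(*item_scores))            # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)


def icc_oneway(measurements):
    """One-way random-effects ICC(1,1); each row holds one subject's repeated scores.

    Assumption: the source does not say which ICC form was computed.
    """
    n = len(measurements)
    k = len(measurements[0])                     # 2 for a Day 1 / Day 2 retest
    grand = mean(x for row in measurements for x in row)
    row_means = [mean(row) for row in measurements]
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(measurements, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Made-up example data, not study data.
toy_items = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
toy_retest = [(10, 10), (20, 20), (30, 30)]
print(cronbach_alpha(toy_items), icc_oneway(toy_retest))
```

With real data, each row passed to `cronbach_alpha` would hold one patient’s 17 item scores, and each row passed to `icc_oneway` would hold that patient’s Day 1 and Day 2 IVR total scores.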

RESULTS

The patients consisted of 40 men and 20 women aged 22–69 years (mean = 40.7; SD = 11.1). There were 11 patients with major depressive disorder (MDD) single episode, nine with MDD single episode and dysthymia (double depression), 27 with MDD recurrent, five with MDD recurrent and dysthymia, two with dysthymia alone, five with bipolar II, and one with bipolar I. One patient failed to conduct the second IVR assessment; however, the remaining 59 patients completed the full procedure.

The clinician HAM-D score ranged between 2 and 33 points (mean = 15.1; SD = 6.7). The ICC of IVR HAM-D assessments between Day 1 and Day 2 was 0.93, indicating that test–retest reliability was high (Fig. 1). Internal consistency of the total scores obtained by the clinician (Day 1), psychologists (Day 1) and IVR program (Days 1 and 2) was acceptable to good in every case (Cronbach’s alpha: 0.77, 0.79, 0.78, and 0.83, respectively).

Regarding concurrent validity, Pearson’s correlation coefficient of HAM-D total scores on Day 1 was 0.81 (P < 0.0001) for clinician HAM-D versus IVR HAM-D and 0.93 (P < 0.0001) for clinician HAM-D versus psychologist HAM-D (Fig. 2), indicating strong correlations and high concurrent validity. Mean HAM-D total scores for clinician, psychologist, and IVR (Days 1 and 2) are shown in Table 1. With respect to the differences in HAM-D total scores, the mean IVR HAM-D score (Day 1) was approximately 3 and 4 points higher than the mean clinician HAM-D (t = 5.2, d.f. = 59, P < 0.001, paired t-test) and psychologist HAM-D (t = 6.0, d.f. = 59, P < 0.001) scores, respectively, indicating that the IVR program tends to overestimate depression severity compared with the clinician and psychologists.
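The paired comparison reported above uses the standard paired t statistic, the mean of the within-patient differences divided by its standard error, with n − 1 degrees of freedom (59 for 60 patients). The Python sketch below shows the calculation on made-up scores, not study data. It also includes a rough back-calculation of the standard deviation of the differences implied by the reported t value and the rounded means in Table 1; that figure is an approximation derived here and is not reported in the article. All names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Return (t, degrees of freedom) for a paired t-test on two equal-length lists."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))   # mean difference over its standard error
    return t, n - 1


def implied_sd_of_differences(mean_diff, t, n):
    """Invert t = mean_diff / (sd / sqrt(n)) to recover the SD of the differences."""
    return mean_diff * sqrt(n) / t


# Made-up IVR and clinician totals for four hypothetical patients.
t_value, dof = paired_t([18, 20, 15, 22], [15, 16, 13, 18])
print(f"t = {t_value:.2f}, d.f. = {dof}")

# Rough check against the reported values (means rounded to one decimal in Table 1).
print(f"{implied_sd_of_differences(18.1 - 15.1, 5.2, 60):.1f}")
```

Because the published means are rounded, the back-calculated value is only indicative of the spread of individual IVR minus clinician differences.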

The inter-rater consistency of each HAM-D item determined by different assessments (i.e. IVR, clinician and psychologist HAM-D) was examined using Cohen’s kappa coefficient (Table 2). The obtained coefficients between IVR (Day 1) versus IVR (Day 2) HAM-D scores and those between clinician versus psychologist showed moderate and substantial agreement, respectively (mean: 0.55 [SD 0.13] and 0.62 [SD 0.18]). However, relatively lower values were obtained for coefficients between clinician versus IVR (Day 1) HAM-D scores (mean 0.25 [SD 0.13]), indicating that observer agreement between clinician and IVR was ‘fair’ according to the criteria by Landis and Koch [17].
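The classification above can be retraced with a short Python sketch. It shows an unweighted Cohen’s kappa (the article does not say whether a weighted variant was used, so this is an assumption), the Landis and Koch agreement bands, and a recomputation of the column means from the per-item values in Table 2. Function and variable names are hypothetical.

```python
from collections import Counter
from statistics import mean


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical ratings."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of both raters' marginal proportions per category.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    return (observed - expected) / (1 - expected)  # undefined when expected == 1


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch (1977) agreement label."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Per-item kappa values transcribed from Table 2 (items 01 to 17).
IVR_VS_CLINICIAN = [0.33, 0.08, 0.50, 0.26, 0.23, 0.21, 0.11, 0.18, 0.13,
                    0.02, 0.19, 0.33, 0.39, 0.49, 0.25, 0.27, 0.23]
IVR_DAY1_VS_DAY2 = [0.56, 0.55, 0.64, 0.74, 0.71, 0.49, 0.54, 0.36, 0.39,
                    0.39, 0.44, 0.70, 0.52, 0.78, 0.65, 0.52, 0.41]
CLINICIAN_VS_PSYCHOLOGIST = [0.52, 0.69, 0.79, 0.70, 0.60, 0.78, 0.53, 0.25, 0.32,
                             0.48, 0.57, 0.85, 0.44, 0.77, 0.66, 0.93, 0.66]

for name, column in [("IVR vs clinician", IVR_VS_CLINICIAN),
                     ("IVR Day 1 vs Day 2", IVR_DAY1_VS_DAY2),
                     ("Clinician vs psychologist", CLINICIAN_VS_PSYCHOLOGIST)]:
    print(f"{name}: mean kappa = {mean(column):.2f} ({landis_koch(mean(column))})")
```

The recomputed means agree with the reported 0.25, 0.55 and 0.62, which fall in the fair, moderate and substantial bands, respectively.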

Figure 1. Scatter plots of Interactive Voice Response (IVR) Hamilton Rating Scale for Depression (HAM-D) total scores on Days 1 and 2 (n = 59). Horizontal axis: IVR HAM-D total score (Day 1); vertical axis: IVR HAM-D total score (Day 2); both axes run from 0 to 50. ICC: 0.93. ICC, intraclass correlation coefficient.

DISCUSSION

We examined, for the first time, the reliability and validity of the Japanese version of the IVR program to score the HAM-D scale in 60 patients with depressive disorder. The observed ICC of 0.93 between IVR HAM-D scores on Day 1 and Day 2 indicates that the test–retest reliability of the IVR HAM-D program for the assessment of depression severity is quite high. The computed Cronbach’s alpha coefficients (0.77–0.83) showed acceptable to good internal consistency whether the HAM-D was completed by the patients using the IVR program or rated by the clinician or psychologists in a clinical interview, suggesting that construct validity of the IVR program is acceptable to good as well.

With regard to the concurrent validity of overall depression severity assessment (i.e. total HAM-D score) using the IVR HAM-D program, Pearson’s correlation coefficient of 0.81 between clinician HAM-D and IVR HAM-D supports its validity. The high coefficient (0.93) for clinician and psychologist assessments indicates a very high inter-rater reliability in their assessments. All our data suggest that the Japanese IVR HAM-D program is reliable and valid to assess HAM-D total score in patients with depressive disorders. Although the HAM-D total score obtained by the IVR program and that by the clinician showed a strong correlation (0.81), the former was three points higher than the latter on average, suggesting that adjustment is required when the program is introduced and used in regular clinical practice. A similar difference between IVR and clinician was reported in a previous study conducted in the USA [11].

Figure 2. Scatter plots of Hamilton Rating Scale for Depression (HAM-D) total scores on Day 1 (n = 60). (a) Plots of clinician HAM-D versus Interactive Voice Response (IVR) HAM-D, Pearson’s r = 0.81. (b) Plots of clinician HAM-D versus psychologist HAM-D, r = 0.93. Horizontal axis in both panels: clinician HAM-D total score; vertical axis: IVR HAM-D total score (a) or psychologist HAM-D total score (b); all axes run from 0 to 50.

Table 1. Mean HAM-D total score determined by different assessments on Days 1 and 2

Assessment            Day   n    HAM-D total score, mean (SD)
IVR HAM-D             1     60   18.1 (7.5)
IVR HAM-D             2     59   17.6 (8.3)
Clinician HAM-D       1     60   15.1 (6.7)
Psychologist HAM-D    1     60   14.2 (6.2)

IVR, Interactive Voice Response; HAM-D, Hamilton Rating Scale for Depression.

The computed Cohen’s kappa coefficient for each HAM-D item assessed using the IVR program and by the clinician seemed to follow different trends depending on the item (Table 2). The obtained coefficients between IVR (Day 1) versus IVR (Day 2) HAM-D scores and those between clinician versus psychologist showed moderate and substantial agreement, respectively. However, relatively lower values were obtained for coefficients between clinician versus IVR (Day 1) HAM-D scores, indicating that observer agreement between clinician and IVR was fair [17]. Accordingly, caution is required when using individual HAM-D item scores from the IVR program in clinical practice. In particular, Cohen’s kappa coefficients between clinician and IVR scores for feelings of guilt (0.08), work and activities (0.11), retardation (0.18), agitation (0.13) and mental anxiety (0.02) items were very low (see Table 2). When these items were scrutinized, a significant correlation between clinician and IVR scores (nominal P < 0.05) was obtained for feelings of guilt (r = 0.44, P < 0.001), work and activities (r = 0.39, P = 0.002), and mental anxiety (r = 0.27, P = 0.036), but not for retardation (r = 0.25, P = 0.056) or agitation (r = 0.02, P = 0.86). Because agitation and retardation on the HAM-D scale must be scored on the basis of objective observation in the clinical setting, the low correlations and low kappa values for these items are not surprising. Taken together, adjustment or improvement of the IVR program is required to obtain higher kappa values for some items (e.g. feelings of guilt, work and activities, and mental anxiety), and a face-to-face interview might be required to obtain accurate scores for retardation and agitation, which might be a limitation of the current IVR program.
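An item can show a significant correlation and still have a low kappa because the two statistics measure different things: Pearson’s r rewards scores that rise and fall together, whereas Cohen’s kappa counts only exact category matches. The Python sketch below illustrates this with made-up item ratings in which one rater scores every patient exactly one point higher than the other. It is an illustration of the statistics only, not an analysis of the study data or a claim about why agreement was low in this study, and all names are hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: chance-corrected exact agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical item scores for eight patients; the second rater is always one point higher.
clinician_item = [0, 1, 2, 1, 0, 2, 1, 0]
ivr_item = [score + 1 for score in clinician_item]

print(f"Pearson r = {pearson_r(clinician_item, ivr_item):.2f}")
print(f"Cohen kappa = {cohen_kappa(clinician_item, ivr_item):.2f}")
```

In this toy case the correlation is perfect while kappa is below zero, because the two raters never give the same score to the same patient.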

There are several limitations in the study. First, this study included a relatively small number of subjects with severe depression; there were only five subjects whose clinician HAM-D score was 24 or more. Second, the imbalance of the subjects’ sexes (40 men and 20 women) and the fact that the study was performed at only one site should also be taken into account. Third, the psychologists rated the HAM-D by attending and referring to the clinician’s interview, not by interviewing the patients themselves. Therefore, the psychologists’ rating is not entirely independent, which may have biased the inter-rater reliability towards the observed high correlation (r = 0.93) between clinician and psychologist HAM-D scores. Clinicians and psychologists should take these limitations and findings into careful consideration when assessing depression severity in clinical practice.

Table 2. Inter-rater consistency of each HAM-D item by different assessments on Days 1 and 2

Cohen’s kappa coefficient
HAM-D item                               IVR vs clinician    IVR (Day 1) vs IVR    Clinician vs
                                         (Day 1) (n = 60)    (Day 2) (n = 59)      psychologist (n = 60)
01 Depressed mood                        0.33                0.56                  0.52
02 Feelings of guilt                     0.08                0.55                  0.69
03 Suicide                               0.50                0.64                  0.79
04 Early insomnia                        0.26                0.74                  0.70
05 Middle insomnia                       0.23                0.71                  0.60
06 Late insomnia                         0.21                0.49                  0.78
07 Work and activities                   0.11                0.54                  0.53
08 Retardation                           0.18                0.36                  0.25
09 Agitation                             0.13                0.39                  0.32
10 Mental anxiety                        0.02                0.39                  0.48
11 Somatic anxiety                       0.19                0.44                  0.57
12 Somatic symptoms (gastrointestinal)   0.33                0.70                  0.85
13 Somatic symptoms (general)            0.39                0.52                  0.44
14 Genital symptoms                      0.49                0.78                  0.77
15 Hypochondriasis                       0.25                0.65                  0.66
16 Loss of weight by history             0.27                0.52                  0.93
17 Insight                               0.23                0.41                  0.66
Average (SD)                             0.25 (0.13)         0.55 (0.13)           0.62 (0.18)

IVR, Interactive Voice Response; HAM-D, Hamilton Rating Scale for Depression.

In conclusion, despite the limitations of this study, our results confirm that the Japanese version of the IVR HAM-D program is a reliable and valid method to assess overall depression severity in clinical practice. Although the total score was reliably and validly obtained, the score of each item did not always show high agreement. Further research in a larger number of subjects will provide Japanese clinicians more detailed information regarding the use of the IVR HAM-D program.

ACKNOWLEDGMENTS

We acknowledge the various contributions of all the participants in the research. Hiroshi Kunugi designed the study, rated the subjects (clinician HAM-D), and wrote the manuscript. Miyako Hashikura and Norie Koga assessed the subjects as psychologists using the simultaneous interview method. Takamasa Noda and Yu Shimizu supported subject recruitment and provided valuable comments on the manuscript of the study. Takayuki Kobayashi, who is a full-time employee of GSKKK, designed the study protocol in collaboration with the first author. Jun Yamanaka (another full-time employee of GSKKK) coordinated with ERT, and supported publication creation. Noriaki Kanemoto (another full-time employee of GSKKK) advised the lead author concerning statistical analyses. Teruhiko Higuchi supervised the study conduct and participated in the preparation of the manuscript. We would also like to express our appreciation to ERT, the service provider of the IVR HAM-D program used in the study. Funding for this research was provided by GSKKK.

REFERENCES

1. Hamilton M. A rating scale for depression. J. Neurol. Neurosurg. Psychiatry 1960; 23: 56–62.

2. Williams JB. A structured interview guide for the Hamilton Depression rating scale. Arch. Gen. Psychiatry 1988; 45: 742–747.

3. Bagby RM, Ryder AG, Schuller DR, Marshall MB. The Hamilton Depression Rating Scale: Has the gold standard become a lead weight? Am. J. Psychiatry 2004; 161: 2163–2177.

4. Demitrack MA, Faries D, Herrera JM, DeBrota DJ, Potter WZ. The problem of measurement error in multisite clinical trials. Psychopharmacol. Bull. 1998; 34: 19–24.

5. Greist J, Mundt J, Jefferson J, Katzelnick D. Comments on ‘Why do clinical trials fail? The problem of measurement error in clinical trials: Time to test new paradigms?’ J. Clin. Psychopharmacol. 2007; 27: 535–537.

6. Kobak KA, Brown B, Sharp I et al. Sources of unreliability in depression ratings. J. Clin. Psychopharmacol. 2009; 29: 82–85.

7. Kobak KA, Lipsitz J, Williams JB, Engelhardt N, Jeglic E, Bellew KM. Are the effects of rater training sustainable? Results from a multicenter clinical trial. J. Clin. Psychopharmacol. 2007; 27: 534–535.

8. Kobak KA, Mundt JC, Greist JH, Katzelnick DJ, Jefferson WJ. Computer assessment of depression: Automating the Hamilton depression rating scale. Drug Inf. J. 2000; 34: 145–156.

9. Kobak KA, Greist JH, Jefferson JW, Mundt JC, Katzelnick DJ. Computerized assessment of depression and anxiety over the telephone using interactive voice response. MD Comput. 1999; 16: 64–68.

10. Moore HK, Mundt JC, Modell JG et al. An examination of 26,168 Hamilton depression rating scale scores administered via interactive voice response across 17 randomized clinical trials. J. Clin. Psychopharmacol. 2006; 26: 321–324.

11. Mundt JC, Kobak KA, Taylor LV et al. Administration of the Hamilton Depression Rating Scale using interactive voice response technology. MD Comput. 1998; 15: 31–39.

12. Kobak KA, Kane JM, Thase ME, Nierenberg AA. Why do clinical trials fail? The problem of measurement error in clinical trials: Time to test new paradigms? J. Clin. Psychopharmacol. 2007; 27: 1–5.

13. Mundt JC, Greist JH, Jefferson JW et al. Is it easier to find what you are looking for if you think you know what it looks like? J. Clin. Psychopharmacol. 2007; 27: 121–125.

14. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders, 4th edn, Text Revision (DSM-IV-TR). APA, Washington, DC, 2000.

15. Sheehan DV, Lecrubier Y, Sheehan KH et al. The Mini-International Neuropsychiatric Interview (M.I.N.I.): The development and validation of a structured diagnostic psychiatric interview for DSM-IV and ICD-10. J. Clin. Psychiatry 1998; 59 (Suppl. 20): 22–57.

16. Nakane Y, Williams JB. Japanese Version of Structured Interview Guide for the Hamilton Depression Rating Scale (SIGH-D). Seiwa Shoten, Tokyo, 2004 (in Japanese).

17. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–174.
