agreement in human interpretation of analog thallium myocardial perfusion images

JE Atwood, D Jensen, V Froelicher, K Witztum, K Gerber, E Gilpin and W Ashburn
Agreement in human interpretation of analog thallium myocardial perfusion images
Circulation 1981, 64:601-609
Circulation is published by the American Heart Association. 7272 Greenville Avenue, Dallas, TX 72514
Copyright © 1981 American Heart Association. All rights reserved. Print ISSN: 0009-7322. Online ISSN: 1524-4539
The online version of this article, along with updated information and services, is located on the World Wide Web at: http://circ.ahajournals.org/content/64/3/601
Subscriptions: Information about subscribing to Circulation is online at http://circ.ahajournals.org//subscriptions/
Permissions: Permissions & Rights Desk, Lippincott Williams & Wilkins, a division of Wolters Kluwer Health, 351 West Camden Street, Baltimore, MD 21202-2436. Phone: 410-528-4050. Fax: 410-528-8550. E-mail: [email protected]
Reprints: Information about reprints can be found online at http://www.lww.com/reprints
Downloaded from http://circ.ahajournals.org/ by guest on July 12, 2011



TI INTERPRETATION AGREEMENT/Atwood et al.


Agreement in Human Interpretation of Analog Thallium Myocardial Perfusion Images

J. EDWIN ATWOOD, M.D., DAVID JENSEN, M.S., VICTOR FROELICHER, M.D.,

KATHRYN WITZTUM, M.D., KENNETH GERBER, M.D., ELIZABETH GILPIN, M.S.,

AND WILLIAM ASHBURN, M.D.

SUMMARY To assess the agreement of human interpretation of analog thallium myocardial perfusion images, four experienced interpreters evaluated 100 images on two occasions using a form designed to limit reader variability. A high intraobserver agreement (agreement by the same observer at separate times) of 89-93% was found when films were interpreted as normal or abnormal (a dichotomous decision). Interobserver agreement for a majority grouping of observers (three or four) was 75% for an abnormal and 68% for a normal interpretation. However, agreement ranged from 11-79% when interpreters were asked to read the anatomic location of defects. Posterior and lateral wall defects were interpreted with the least amount of agreement. These results indicate that caution must be taken when interpreting defect location. Using a scale of 1-10 to grade the severity of a defect, correlations of 0.82-0.86 were found when reading defects in the lateral and anterior projections. Higher correlations, from 0.86-0.94, were found in left anterior oblique views. Use of reporting forms with specific criteria, multiple observers at one occasion, and/or computer processing may improve agreement. A brief review of the agreement of cardiology testing procedures is also presented.

THE SUBJECTIVE NATURE of medical diagnoses requires questioning of the results of most diagnostic methods, not only in regard to accuracy or validity, but also agreement. Attempts to describe or assess agreement have been variable. Numerous terms are used: agreement, variability, consistency, within-observer correlation coefficients of disagreement, and many others. This report is restricted to as few terms as possible. Agreement has two subgroupings1: intraobserver, referring to agreement of the individual observer on two readings, and interobserver, referring to agreement among two or more observers. Agreement studies have been done in several branches of medicine, including pathology,2 roentgenology3 and clinical diagnosis.'

From the Division of Cardiology, Department of Medicine, and the Division of Nuclear Medicine, Department of Radiology, University of California, San Diego, California.

Supported by SCOR in Ischemic Heart Disease grant HL 17682, NHLBI, NIH.

Dr. Atwood was a research cardiologist, NIH training grant PHS HL 07444; he is currently a clinical cardiology fellow at the University of Utah, Salt Lake City, Utah.

Address for correspondence: Victor F. Froelicher, M.D., Director, Cardiac Rehabilitation and Exercise Testing, University Hospital, 225 Dickinson Street, San Diego, California 92103.

Received July 1, 1980; revision accepted December 29, 1980.

Circulation 64, No. 3, 1981.

Almost all diagnostic methods have been scrutinized for agreement, including the clinical examination,' ECG,6-7 exercise ECG," and echocardiogram.§ Even coronary angiography has been examined for observer variability.10-13

Nuclear cardiology is an example of the difficulty with agreement because interpretation of the results is inherently subjective. This is particularly so for technetium-99m pyrophosphate infarct imagingl" and thallium-201 myocardial perfusion imaging,16 which are visualized as gray-scale images or 16-scale color images. The purpose of this investigation was to assess intra- and interobserver agreement in reading thallium images at our institution and to propose a method of improving agreement.

Methods

Treadmill testing was performed using a modified Balke-Ware protocol with a speed of either 2.0 or 3.3 mph with the grade increased 5% every 2 or 3 minutes.16 Before exercise, a catheter was placed in an antecubital or hand vein. One and one-half to 2.0 mCi of thallium-201, followed by 10-15 ml of saline, were injected approximately 1 minute before symptom-limited maximal effort. Exercise was terminated at volitional fatigue, definite anginal pain, or significant


VOL 64, No 3, SEPTEMBER 1981

electrocardiographic abnormalities (more than 0.2 mV horizontal or downsloping ST displacement or dangerous dysrhythmias). The patient's heart rate, blood pressure, 12-lead ECG and signs and symptoms were carefully monitored.

After exercise, the patient immediately lay down, the precordial electrodes were removed, and imaging was begun within 5 minutes. Imaging was begun as soon as possible to avoid missing transient acute ischemia. Images were obtained in four views: anterior, 45° and 60° left anterior oblique (LAO), and right lateral decubitus. A 37-photomultiplier-tube, 25-cm field-of-view camera with a ¼-inch crystal (Picker-Dynamo with a micro-Z processor) was used for image recording. A general-purpose, parallel-hole collimator was used with a 30% window centered on the 80-keV mercury x-ray associated with the radioactive decay of thallium-201. Imaging in each view was continued until a 2000-count information density (counts/cm²) was reached in the most normal portion of the left ventricular myocardium. This was approximately equivalent to 300,000-400,000 total counts and usually required 7-10 minutes of acquisition time per view. All analog scintigraphic images were unprocessed and recorded on x-ray transparent film. If a defect was noted on any of these postexercise images, the patient was asked to return 3-4 hours later for delayed imaging.

Film Selection

One hundred scintigrams were interpreted from consecutive patients undergoing research studies. Immediate or delayed films with one or more views were randomly numbered 1-100 and then renumbered for the second reading. The patient and study date were unknown to the readers, although films with fewer than four images were suspected to be delayed views because our protocol for delayed imaging at that time required taking only suspicious views. The films from 10 normal subjects along with 26 patients either before or after an exercise training program were included. These 26 patients had coronary artery disease as documented by angina, an abnormal treadmill test, history of myocardial infarction, or cardiac catheterization. Five patients had undergone coronary artery bypass surgery.

Four readers (one cardiologist and three nuclear radiologists) were blinded as to patient identification and individually read the randomly numbered films. Most readers required 3-4 hours to review the 100 films. Each reader interpreted the films for a second time 3-4 weeks later after rerandomization and renumbering. For each film the reader completed a form (fig. 1) that included (1) a choice of abnormal, normal or uninterpretable because of poor quality; (2) a written description of the anatomic location of the defect; (3) pencilling in the defect on a schematic representation of a normal thallium image; and (4) grading the defect as to size and intensity. Relative defect size was graded on a scale of 1-5, in which 1 = 10% or more of the total myocardial area, 2 = 20% or more, 3 = 30% or greater, 4 = 40% or greater, and 5 = greater than 50%. Intensity of myocardial accumulation in the abnormal region was graded on a scale of 1-4, in which 1 = normal uptake, 2 = just less than normal, 3 = just greater than background, and 4 = equivalent to background. Severity of a lesion was a composite of size and intensity graded on a scale of 1-10 determined for each view (table 1). This scale was empirically developed by the readers, based on their experience.

Statistics

The thallium image interpretation data were analyzed in several ways in an effort to describe all aspects of observer variability and agreement. Analyses included means, correlation coefficients, coefficient of concordance and percentages. The interobserver variability was calculated by comparing the first readings of the four observers.

In analyzing the severity of a defect on a scale of 1-10 using size and intensity as determinants of severity, we used a nonparametric rank-correlation analysis as described by Fried17 for determining the coefficient of concordance. The kappa test was used to test the hypothesis that agreement was greater than chance alone.18
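As a rough illustration of the coefficient of concordance mentioned above, the following Python sketch computes Kendall's W with a tie correction. It assumes that the coefficient used here is Kendall's W (the text cites Fried but does not give the formula). The function names and the example grades are hypothetical and are not data from the study.

```python
from typing import Sequence


def average_ranks(scores: Sequence[float]) -> list[float]:
    """Rank scores from 1..n, giving tied scores the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def kendalls_w(ratings: Sequence[Sequence[float]]) -> float:
    """Kendall's W for m raters (rows) grading the same n items (columns).

    Undefined (division by zero) if every rater gives all items the same grade.
    """
    m, n = len(ratings), len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    totals = [sum(row[j] for row in ranked) for j in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    # Tie correction: sum of (t^3 - t) over each group of tied ranks, per rater.
    tie_term = 0.0
    for row in ranked:
        counts: dict[float, int] = {}
        for r in row:
            counts[r] = counts.get(r, 0) + 1
        tie_term += sum(c ** 3 - c for c in counts.values())
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * tie_term)


if __name__ == "__main__":
    # Hypothetical severity grades (1-10) from three readers for five films.
    grades = [
        [1, 3, 6, 8, 10],
        [2, 3, 5, 9, 10],
        [3, 1, 6, 9, 8],
    ]
    print(f"W = {kendalls_w(grades):.2f}")
```

A value of 1 means the readers rank the films identically, and a value near 0 means their rankings are unrelated.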

Results

The observers had a high intraobserver agreement when reading the scintigrams as either being abnormal or normal (table 2). They gave the same interpretation at the two readings 89-93% of the time. The fourth column lists the inconsistent readings and shows both the total number of readings and the order in which the inconsistencies occurred. Not all film totals equaled 100 for this analysis because films considered uninterpretable by each reader were eliminated. Elimination of a greater number of films did not increase the agreement. Observer B, who

TABLE 1. Grading Scale for the Severity of a "Cold-spot" Lesion, Considering the Size and Intensity of the Defect

Severity   Intensity paired with size
1          1,1; 1,2; 1,3; 1,4; 1,5
2          2,1
3          2,2; 3,1
4          2,3
5          3,2; 4,1
6          2,4; 3,3
7          4,2
8          2,5; 3,4; 4,3
9          3,5; 4,4
10         4,5

Intensity was graded 1-4: 1 = normal myocardial activity; 2 = just less than normal; 3 = just greater than background activity; and 4 = equal to background activity.

Size was graded 1-5: 1 = a lesion of at least 10% of the total heart size; 2 = more than 20%; 3 = more than 30%; 4 = more than 40%; and 5 = more than 50%.
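The grading scale in table 1 amounts to a lookup from an (intensity, size) pair to a severity grade. A minimal Python sketch of that lookup follows. The dictionary is transcribed from the table, while the names `SEVERITY` and `severity_grade` are illustrative only.

```python
# Severity grade (1-10) keyed by (intensity, size), transcribed from table 1.
SEVERITY = {
    (1, 1): 1, (1, 2): 1, (1, 3): 1, (1, 4): 1, (1, 5): 1,
    (2, 1): 2, (2, 2): 3, (2, 3): 4, (2, 4): 6, (2, 5): 8,
    (3, 1): 3, (3, 2): 5, (3, 3): 6, (3, 4): 8, (3, 5): 9,
    (4, 1): 5, (4, 2): 7, (4, 3): 8, (4, 4): 9, (4, 5): 10,
}


def severity_grade(intensity: int, size: int) -> int:
    """Return the composite severity for an intensity (1-4) and size (1-5) grade."""
    try:
        return SEVERITY[(intensity, size)]
    except KeyError:
        raise ValueError(
            f"intensity must be 1-4 and size 1-5, got ({intensity}, {size})"
        ) from None


if __name__ == "__main__":
    # A defect equal to background (intensity 4) covering more than 30% (size 3).
    print(severity_grade(4, 3))
```

Any defect with normal uptake (intensity 1) grades as severity 1 regardless of size, which is why the first row of the table lists all five sizes.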


[Figure 1: reporting form. Recoverable fields: patient name, age and date; pretest clinical history (MI, history of abnormal treadmill test, chest pain, other); medications (beta blockers, nitrates); exercise data (maximal effort, angina, test endpoint, rest and maximum heart rate and blood pressure); background (lung uptake, visceral uptake and right ventricle, each normal or increased); heart (chamber size: normal, small or enlarged; wall thickness: normal, thin or enlarged); defect size and defect intensity for the ANT, LAO 45°-50°, LAO 60°-70° and LEFT LATERAL views, each at exercise and delay; comments; final assessment.]

Size: 1 = 10% of total myocardial area; 2 = 20%; 3 = 30%; 4 = 40%; 5 = 50%
Intensity: 1 = Normal; 2 = Just less than normal; 3 = Just greater than background; 4 = Background

FIGURE 1. A proposed thallium myocardial perfusion imaging form to improve observer reliability. The illustrations were used in our study to evaluate reliability.

eliminated the most films (100 films: 94 read, six uninterpretable), achieved a 91% agreement. The kappa test showed that the agreement was significantly greater than could occur by chance alone.

Table 3 shows that four of four observers agreed on abnormal scans an average of 64% for readings 1 and 2. When readings in which three of four readers agreed are included with those in which four of four readers agreed, an average of 75% of the scans labeled abnormal by at least one reader were interpreted as abnormal on both readings, whereas an average of 16% of the scans were interpreted abnormal by only one observer. The total number of films, again, is not 100 because inadequate studies were eliminated and not all films were abnormal or interpreted as such. Agreement was significantly greater than could occur by chance alone (table 4).

In interpreting normal scans, all four readers agreed

TABLE 2. Intraobserver Agreement in Reading Abnormal and Normal Scintigrams

Reader   Abnormal          Normal            Inconsistent (one time      Number of films   % of films read the     Kappa ± SEM
         (both readings)   (both readings)   normal, one time abnormal)  read twice        same on both readings
A        62                27                7 (4-3)                     96                93%                     0.83 ± 0.06*
B        58                34                7 (6-1)                     99                93%                     0.85 ± 0.05*
C        64                24                11 (9-2)                    99                89%                     0.73 ± 0.08*
D        58                27                9 (8-1)                     94                91%                     0.86 ± 0.06*

*p < 0.05.
Films felt to be of inadequate quality by individual readers have been eliminated. The kappa test was used to determine the probability of agreement being greater than chance alone.
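To make the kappa column concrete, the sketch below computes the standard Cohen's kappa for one reader's two readings from the counts in table 2. It assumes the study's kappa test corresponds to this standard two-category formula, which the text does not spell out. The function name is illustrative, and the way the inconsistent readings for reader A are split into "4" and "3" by direction is also an assumption.

```python
def cohen_kappa_2x2(both_abnormal: int, both_normal: int,
                    abnormal_then_normal: int, normal_then_abnormal: int) -> float:
    """Cohen's kappa for two binary readings of the same set of films."""
    n = both_abnormal + both_normal + abnormal_then_normal + normal_then_abnormal
    observed = (both_abnormal + both_normal) / n
    # Proportion of films called abnormal at each reading.
    first_abnormal = (both_abnormal + abnormal_then_normal) / n
    second_abnormal = (both_abnormal + normal_then_abnormal) / n
    # Agreement expected by chance from the two marginal proportions.
    expected = (first_abnormal * second_abnormal
                + (1 - first_abnormal) * (1 - second_abnormal))
    return (observed - expected) / (1 - expected)


# Reader A in table 2: 62 abnormal twice, 27 normal twice, 7 inconsistent (4-3).
# Which direction the "4" and the "3" refer to is assumed; kappa is the same
# either way because the formula is symmetric in the two inconsistent cells.
kappa_a = cohen_kappa_2x2(62, 27, 4, 3)
print(f"percent agreement = {100 * (62 + 27) / 96:.0f}%, kappa = {kappa_a:.2f}")
```

For reader A this gives 93% agreement and a kappa of about 0.83, in line with the table. Other rows need not be reproduced exactly by this formula, since the kappa variant and the rounding used in the study are not specified.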


TABLE 3. The Number and Percent of Films Interpreted as Abnormal or Normal by All Four Readers, by Three of Four Readers, by Two of Four Readers or One of Four Readers

                        All 4 readers   3 of 4 readers  2 of 4 readers  1 reader    Total
1st reading  Abnormal   55 (67%)        8 (10%)         8 (10%)         11 (13%)    82 (100%)
             Normal     15 (36%)        12 (29%)        7 (16%)         8 (19%)     42 (100%)
2nd reading  Abnormal   51 (64%)        9 (12%)         4 (5%)          15 (19%)    79 (100%)
             Normal     20 (40%)        15 (31%)        6 (13%)         8 (16%)     49 (100%)

Total = total films read as abnormal or normal by any classification of reader agreement.

The percentage of films in the above table represents the number of films read as abnormal or normal in each classification of reader agreement (four of four, three of four, two of four or one of four) divided by the total films read as abnormal or normal by any number of readers times 100. Discrepancies in the totals are due to differences in the number of films considered uninterpretable.
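The majority-agreement figures quoted in the text can be recomputed from the counts in table 3. The short Python sketch below does so. The counts are taken from the table, while the names `TABLE_3` and `majority_share` are illustrative. Recomputed percentages may differ by a point from the printed ones, which appear to have been rounded so that each row sums to 100%.

```python
# Counts from table 3, ordered as: all 4 readers, 3 of 4, 2 of 4, 1 reader.
TABLE_3 = {
    ("1st", "abnormal"): [55, 8, 8, 11],
    ("1st", "normal"): [15, 12, 7, 8],
    ("2nd", "abnormal"): [51, 9, 4, 15],
    ("2nd", "normal"): [20, 15, 6, 8],
}


def majority_share(counts: list[int]) -> float:
    """Percent of films on which at least three of four readers agreed."""
    return 100 * (counts[0] + counts[1]) / sum(counts)


for (reading, label), counts in TABLE_3.items():
    print(f"{reading} reading, {label}: "
          f"{majority_share(counts):.1f}% of {sum(counts)} films")

# Average over the two readings for each interpretation.
for label in ("abnormal", "normal"):
    mean = (majority_share(TABLE_3[("1st", label)])
            + majority_share(TABLE_3[("2nd", label)])) / 2
    print(f"average, {label}: {mean:.0f}%")
```

Averaging the two readings this way gives roughly 76% for abnormal and 68% for normal interpretations.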

TABLE 4. Results of the Kappa Test Applied to Testing the Agreement Between Observers for Calling an Image Normal or Abnormal During the First Readings

Observers   Kappa ± SEM
A vs B      0.74 ± 0.07*
A vs C      0.60 ± 0.09*
A vs D      0.70 ± 0.08*
B vs C      0.73 ± 0.08*
B vs D      0.63 ± 0.09*
C vs D      0.56 ± 0.10*

*p < 0.01.

in an average of only 40% of the cases labeled normal by at least one reader. In contrast to the abnormal interpretations, a much larger percentage (30% vs 11%) of the normal scans had three-of-four-observer agreement. However, when at least three of four observers agree, there is a similar average total (75% vs 68%).

The ability to interpret scans reliably with respect to defect location is seen in tables 5 and 6 and figure 2. Intraobserver agreement shows wide ranges of agreement in the apical, inferior, lateral and posterior areas (35-76%, 30-61%, 33-58%, and 11-78%, respectively) in table 5. However, the greatest intraobserver agreement as to defect location was in the septal region, with a range of 65-81% and an average for all readers of 75%. The average intraobserver agreement for all areas was highest in reader D and lowest in reader C. There was a definite tendency to describe defects in the apical region more frequently than in others. Results of interobserver agreement (i.e., agreement of two or more observers) with respect to defect location are listed in table 6. The greatest agreement was in the septal area, for which 74% of the scans were interpreted the same by three of four or four of four observers. The highest percentage of films in which only one of four readers found a defect occurred in the posterior (67%) and lateral (69%) regions. At least three of four readers agreed on only 12% of the films in these regions.

Using the grading scale of severity of defect described in the Methods section, we correlated observer scores between readings one and two as seen in different views. The highest correlation was in the LAO 45° and 60° views and in readers A and B (table 7). Interobserver correlations were generally high, except when comparing observer D with either observer B or C in the lateral view (table 8). The coefficient of concordance supports these generally high values. In table 8, the column listed as average is generally higher than the first and second reading because the average grade for each scan was first determined and the coefficient of concordance was then computed, and averaging reduces variability.

Discussion

Most techniques in medicine require observer interpretation and thus are subject to observer subjectivity. Validity measures the proximity of a test result to a certain standard of accuracy. One can also measure the amount of agreement between individuals or the individual to himself. Terms such as inter- or intraobserver variability, variance, reliability or consistency express a measurement of this agreement.

TABLE 5. Intraobserver Agreement in Interpreting Anatomic Location of Defects

                                Apex                Inferior            Anterior
Observer                        A    B    C    D    A    B    C    D    A    B    C    D
No. of readings                 63   42   31   49   41   33   35   36   41   21   29   36
% of films with same location
  identified on second reading  76%  58%  35%  76%  61%  30%  34%  56%  63%  61%  72%  69%
Average for each area by
  all readers                   65%                 48%                 67%


[Figure 2: bar chart titled "Interobserver Reliability". Categories on the horizontal axis: abnormal, normal, septal, anterior, apex, inferior, lateral, posterior.]

FIGURE 2. A histographic representation of the interobserver reliability in interpreting scintigrams as abnormal or normal and the defect location.

However, the statistical expression of agreement is complex and variable.

In one of the few studies of agreement concerning the cardiac physical examination, two cardiologists had excellent agreement on heart size and murmurs (greater than 94% agreement), but agreement on extra sounds was 72-92%.' The best agreement occurred between cardiologists as opposed to three nonspecialists. In a study assessing agreement of three physicians as to the presence or absence of tibial or dorsalis pedis pulses, interobserver agreement was found in approximately 70% and 80%, respectively, and intraobserver agreement ranged from 73-87%.19

The echocardiogram has been criticized for its technical reproducibility and its lack of observer agreement in interpretation. The echocardiogram may be used for studying large groups, but inter- and intraobserver reliability for measuring left-heart dimensions must be assessed." Using correlation coefficients, observer intraclass correlation coefficient (another method of describing intraobserver agreement) ranged from 0.87-0.98 for various dimensions and interobserver coefficient ranged from 0.86-0.98, except for the left ventricular posterior wall measurement, which had a coefficient of 0.57. Crawford and colleagues20 found their standards more accurate than those recommended by the American Society of Echocardiography.21 Experimental factors have been evaluated in interpretation and in testing protocol (subject gender, day and time done, subject position) that might be involved in the variability of echocardiographic measurements.22 Using a variance component model, the relative contribution of various factors to the total variability in measurement was demonstrated. As expected, the subject variance was a major component in all measurements. However, the interpreter variance component was significant, particularly in measuring interventricular septal and posterior left ventricular wall thicknesses. To assure more agreement, they stressed the necessity of reading echocardiograms either using one interpreter on two separate occasions or by using two interpreters.

The ECG and vectorcardiogram have a low level of observer agreement (or high reader variability). A group of physicians interpreted 100 ECGs as normal, as showing an old myocardial infarction, or as showing nonspecific abnormalities." Although only paired interobserver correlation was calculated, 70% or more of the 20 readers could agree on only 77 of the 100 ECGs. Using the same reporting categories, nine experienced electrocardiographers interpreted 100 ECGs on two occasions at least 2 weeks apart.6 At least two-thirds agreed in 78% of the ECGs and complete agreement occurred in only 29% of the cases. Interobserver correlations of readings were not calculated, but intraobserver agreement was determined for each of the nine experienced readers and showed an intraobserver agreement of 81-93%. Acheson noted a 90% intraobserver agreement in one reader and an overall agreement in all five readings of 60%.23

Simonson et al. evaluated 10 observers who interpreted 114 vectorcardiograms and 105 ECGs.7 A wide variation was found in making the correct diagnosis for five conditions, as evidenced by wide standard deviations for the average correct or incorrect diagnoses.

TABLE 5. (Continued)

                                Lateral             Posterior           Septal              Averages for all areas by each reader
Observer                        A    B    C    D    A    B    C    D    A    B    C    D    A    B    C    D
No. of readings                 25   12   9    15   21   18   9    6    33   31   31   38   224  157  144  180
% of films with same location
  identified on second reading  40%  58%  44%  33%  38%  11%  78%  33%  79%  81%  65%  76%  64%  54%  52%  66%
Average for each area by
  all readers                   42%                 35%                 75%

TABLE 6. Interobserver Agreement as to Anatomic Location of Defect in Thallium-201 Scintigrams Summing Both Reading Occasions (Number of Readings with Agreement as to Location)

Reading agreement           Apex         Inferior     Anterior     Lateral      Posterior    Septal
4 of 4                      17 (13%)     19 (18%)     24 (32%)     0 (0%)       2 (4%)       37 (48%)
3 of 4                      46 (37%)     18 (17%)     25 (46%)     8 (12%)      4 (8%)       20 (26%)
2 of 4                      36 (29%)     27 (26%)     13 (17%)     13 (19%)     11 (21%)     6 (8%)
1 reader                    27 (27%)     40 (38%)     14 (18%)     46 (69%)     34 (67%)     14 (18%)
Total films with defect
  read by any one observer  126 (100%)   104 (100%)   76 (100%)    67 (100%)    51 (100%)    77 (100%)

Blackburn had 14 observers from seven institutions interpret 38 individual exercise ECG tests as normal, abnormal or borderline. Five readers repeated the readings.8 In only nine of the 38 exercise ECGs (24%) was there complete (14 of 14) agreement and when at least two-thirds agreed (nine of 14 readers), only 22 (58%) were read in agreement. This value of 58% of the exercise ECGs in which nine of 14 readers agreed is much lower than our value of 76% for at least three-of-four-reader agreement. However, Blackburn's study did not allow a dichotomous decision, as there was a third interpretation of borderline. Intraobserver agreement had a wide range, from 58-92%, and an average still less than ours for a dichotomous decision. Blackburn attributed this wide variation in both inter- and intraobserver agreement to the absence of defined criteria, technical problems such as noise, and differences in opinion as to ST-segment upsloping. Strict criteria, such as the Minnesota code,24 and computer analysis have been recommended as ways of increasing agreement in electrocardiography.

Detre and colleagues studied observer agreement in detecting a 50% or greater lesion in 13 coronary angiograms reviewed by 22 readers on two occasions.10 She found results midway between chance expectation and 100% agreement. This study demonstrated considerable inter- and intraobserver variability (low agreement) and found the lowest interobserver agreement among those who demonstrated the lowest intraobserver agreement. There was also a strong correlation between observer experience and intraobserver consistency. That is, experienced observers usually agreed on two readings.

Four readers, including two radiologists and two cardiologists, assessed coronary artery stenoses and wall motion abnormalities in 20 patients.11 Interobserver variability was striking, particularly when interpreting arterial or ventricular wall segments with the highest percentage of positive findings. Interobserver variability in interpreting coronary angiograms has been correlated with postmortem pathologic findings.12 Despite the presence of coronary artery disease, angiographic interpretations of significant lesions (at least 50% angiographic diameter occlusion) were noted in only approximately 80% of the arteries with such occlusions on pathologic inspection. In addition, when the majority opinion at angiographic interpretation was used, it added little to accuracy.

The measurement of wall motion and ejection fraction using both contrast angiography and radionuclides has been investigated for observer agreement. Chaitman and co-workers looked at both subjective (estimation) and objective (measurements from frame tracings) evaluation of angiographic film for volume, ejection fraction and wall motion.18 Greater agreement was demonstrated between the objective observers than between the subjective observers, particularly with respect to volume measurements. There was even less agreement when comparing objective observer measurements with those of the subjective observers in all areas. The best inter- and intraobserver agreement occurred in ejection fraction measurements, particularly in the objective intraobserver measurement of ejection fraction, in which a 0.99 correlation was noted.

TABLE 7. Intraobserver Correlation of First and Second Readings as to Severity of Defect When Using the Grading Scale Combining Intensity and Size as Shown in Table 1

                         View
Reader                   Anterior  LAO 45°   LAO 60°   Lateral   Average for all views
A                        0.86      0.89      0.94      0.87      0.89
B                        0.87      0.91      0.89      0.85      0.88
C                        0.82      0.89      0.87      0.87      0.86
D                        0.84      0.92      0.86      0.82      0.86
Average for all readers  0.85      0.90      0.89      0.85

Abbreviation: LAO = left anterior oblique.

TABLE 8. Interobserver Variability When Using Both Size and Intensity in Grading the Severity of a Defect on a Scale of 1-10 as Shown in Table 1

                      Anterior          LAO 45°           LAO 60°           Lateral
Readers               1st   2nd   Avg   1st   2nd   Avg   1st   2nd   Avg   1st   2nd   Avg
A, B                  0.84  0.86  0.89  0.83  0.84  0.87  0.84  0.87  0.89  0.78  0.85  0.83
A, C                  0.81  0.80  0.86  0.83  0.83  0.84  0.85  0.80  0.87  0.76  0.88  0.82
A, D                  0.69  0.74  0.75  0.78  0.81  0.83  0.75  0.82  0.83  0.70  0.64  0.72
B, C                  0.82  0.83  0.87  0.82  0.88  0.87  0.87  0.85  0.89  0.86  0.87  0.84
B, D                  0.77  0.76  0.83  0.83  0.86  0.88  0.81  0.84  0.88  0.66  0.66  0.67
C, D                  0.81  0.75  0.85  0.85  0.82  0.85  0.81  0.80  0.84  0.66  0.71  0.70
Average               0.79  0.79  0.84  0.82  0.84  0.86  0.82  0.83  0.87  0.74  0.77  0.76
Coefficient of
  concordance         0.84  0.84  0.89  0.86  0.88  0.90  0.86  0.87  0.90  0.80  0.83  0.82

Abbreviations: LAO = left anterior oblique; 1st = first reading; 2nd = second reading; Avg = average.

Slutsky et al. assessed reproducibility of ejection fraction and ventricular volumes by gated radionuclide angiography and included a small section on inter- and intraobserver agreement.25 They found minimal variations of 0.02 ± 0.02 and 0.03 ± 0.02 ejection fraction units (EF units) for intra- and interobserver variability. Okada et al. found similar or superior interobserver variance for wall motion by radionuclide angiography than by contrast angiography, except for the septal wall, in which contrast angiography was superior.26

Imaging of myocardial perfusion using cationic radionuclide tracers has become popular in diagnosing exercise-induced myocardial ischemia as well as myocardial scar at rest. Because thallium has excellent myocardial uptake, with a drop in blood levels to 3% within a few minutes after injection, and an effective half-life of 7 hours, it is the most commonly used agent. It is a noninvasive method of diagnosing coronary artery disease, and studies with angiographic correlation demonstrate sensitivities of 68-92% and specificities of 85-100%.27

McLaughlin and colleagues studied reproducibility of the thallium-201 technique, testing a group of subjects twice.28 They found only six nonreproducible segments out of 76. In terms of observer agreement, they noted total agreement of three observers in 60 of 76 studies, but did not report intraobserver agreement when defect location was correlated with coronary artery stenoses. In studies concerned with the interpretation or reading of thallium scans, evaluation of agreement has considered primarily interobserver and not intraobserver agreement. Trobaugh and colleagues reported an interinstitutional study of observer variability (another term describing interobserver agreement)." Two readers from each of two different institutions interpreted 100 resting studies (50 from each institution) as normal, borderline and abnormal. Exact agreement by all four observers occurred in 44% of the studies and in an additional 35% of the studies three of the four observers agreed and the fourth observer differed by one grade of abnormality. Hence, in 79% of the studies at least three of four observers agreed. The percentage of interobserver agreement for abnormal-normal is similar to ours and other studies (Bailey et al.30 - 90% agreement; Verani et al.31 - 82%; Blood et al.32 - 90%; Ritchie et al.33 - 80%; and Lenaers34 - 80%).

Our study had the greatest intraobserver agreement when the interpreters were asked for only an abnormal or normal response (a dichotomous interpretation). Intraobserver agreement averaged 91% and interobserver agreement of at least three readers for abnormal scintigrams occurred in 77% of the studies. This increased consistency in dichotomous judgments is intuitively pleasing and has been found in other studies.' The first readings were used to calculate interobserver variability because this simulates the usual clinical situation, where only one reading is done.

The poorest intra- or interobserver agreement was noted in defect location, particularly in areas in which overlap of several areas occurs, most notably the apical area. The apical area is wherever the reader puts an imaginary line demarcating inferior from true apex from anterolateral portions of the heart. Hence, we call this the area of "anyone's best guess." Part of the lack of observer agreement may also be due to inadequate anatomic definition as well as normal variation. In several cases, the drawn (shaded areas) lesions matched in the first and second reading, but the written location did not. This result is not unique.28' 9>9 A reporting form that contains the usual written description of the defect and its anatomic location as well as a shaded area or drawn lesion on schematic representations of the heart might be helpful.


VOL 64, No 3, SEPTEMBER 1981

Although used primarily as a diagnostic test indicating the presence or absence of coronary artery disease (a dichotomous decision), thallium scintigraphy has also been used to localize the site of coronary occlusion. Studies correlating thallium segmental defect location to actual major arterial lesions have shown a high degree of specificity (greater than 90%) but low sensitivity in three segments: anteroseptal defects suggest left anterior descending disease, inferior defects suggest right coronary artery lesions, and posterolateral defects suggest left circumflex lesions.35-40 Two of these studies noted high inter- and intraobserver agreement - as high as 90% intraobserver (only one observer) and 87% interobserver (two readers). Both studies used standardized regions for reporting defect location, which may have contributed to such high agreement.

Our study indicates that agreement in thallium interpretation of anatomic location of defect is poor and could play a major role in lowering not only the sensitivity, but also the specificity of defect location correlated to major artery lesions. We used more observers, more films, and possibly a longer interval between reinterpretations to eliminate the possibility of reader recall of previous readings than did these other studies. In addition, these other studies used standard criteria for reporting defect location, limiting the role of "anyone's best guess" in describing a defect.', 40
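For readers less familiar with the terms, the sensitivity and specificity of defect location against angiographic findings can be sketched as follows. The function name and the counts are hypothetical, chosen only to show a pattern of high specificity with low sensitivity; they are not taken from the cited studies.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) from the counts of a 2x2 table."""
    diseased = true_pos + false_neg
    healthy = true_neg + false_pos
    if diseased == 0 or healthy == 0:
        raise ValueError("need at least one diseased and one healthy case")
    sensitivity = true_pos / diseased   # lesions correctly localized
    specificity = true_neg / healthy    # lesion-free vessels correctly cleared
    return sensitivity, specificity


# Hypothetical counts for one segment (for example, inferior defects
# compared with right coronary artery lesions at angiography).
sens, spec = sensitivity_specificity(true_pos=12, false_neg=18,
                                     true_neg=47, false_pos=3)
print(sens, spec)
```

If readers disagree on where a defect lies, both the true-positive and false-positive counts shift from reading to reading, which is how poor agreement on location can lower both measures.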

Thallium imaging has been used to evaluate therapy such as coronary artery bypass surgery,41 exercise training42, 43 and percutaneous transluminal angioplasty.44 In comparing images before and after an intervention, one may subjectively say better, worse or no change when trying to quantitate the improvement or change in the intensity of the defect. To place an objective grading on severity of a defect that could be used for comparative purposes, we set up a scale of 1-10 to represent severity in terms of the combination of size and intensity. The intraobserver correlation of the first and second readings was fairly high, 0.82-0.94. The best average inter- and intraobserver agreement was seen in the two readers with the greatest experience, as was found by Detre et al.10 In our study, experience had no effect on agreement when interpreting either abnormal or normal, but seemed to improve interpretation of the anatomic location of the defects. The highest agreements were for the LAO view, which allows the best anatomic view of the septal area and total myocardium.
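As an illustration of how a correlation between first and second severity readings might be obtained, the sketch below computes a Pearson correlation coefficient on the 1-10 severity scale. This is an assumption for illustration: the passage above does not state which correlation coefficient was used, and the scores shown are hypothetical rather than the study's data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical severity scores (1-10) assigned by one reader to six
# scintigrams on two separate occasions.
first_scores = [2, 5, 7, 3, 9, 6]
second_scores = [3, 5, 6, 3, 8, 7]

print(round(pearson_r(first_scores, second_scores), 2))
```

A high correlation shows that the two readings rank defects similarly; it does not by itself show that the scores are identical, since a reader who consistently scored one point higher on the second reading would still correlate perfectly.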

In summary, we found the best observer agreement when describing the images using dichotomous judgments and the least agreement with more complex descriptions, such as anatomic location of defects, and in overlapping areas, such as the apical region. The review of the literature and our study suggest several possible modes for improvement, including (1) simple dichotomous decisions; (2) standardized report forms such as the one used in this study; (3) multiple observers or one very experienced reader; (4) multiple blinded or unbiased interpretations; and (5) computer analysis.'

References

1. Koran LM: The reliability of clinical methods, data and judgments. Parts I and II. N Engl J Med 293: 642, 695, 1975

2. Feinstein AR, Nelson AG, Yesner R, Auerbach O, Hackel DB, Pratt PC: Observer variability in the histopathologic diagnosis of lung cancer. Am Rev Respir Dis 101: 671, 1970

3. Yerushalmy J: The statistical assessment of the variability in observer perception and description of roentgenographic pulmonary shadows. Roentgenol Clin North Am 7: 381, 1969

4. Raftery EB, Holland WW: Examination of the heart: an investigation into variation. Am J Epidemiol 85: 438, 1967

5. Segall HN: The electrocardiogram and its interpretation: a study of reports by 20 physicians on a set of 100 electrocardiograms. Can Med Assoc J 82: 2, 1960

6. Davies LG: Observer variation in reports on electrocardiograms. Br Heart J 20: 153, 1958

7. Simonson E, Tuna N, Okamoto N, Toshima H: Diagnostic accuracy of the vectorcardiogram and electrocardiogram. Am J Cardiol 17: 829, 1966

8. Blackburn H: The exercise electrocardiogram: differences in interpretation. Am J Cardiol 21: 871, 1968

9. Schieken RM, Clarke WR, Mahoney LT, Lauer RM: Measurement criteria for group echocardiographic studies. Am J Epidemiol 110: 504, 1979

10. Detre KM, Wright E, Murphy ML, Takaro T: Observer agreement in evaluating coronary angiograms. Circulation 52: 979, 1975

11. Zir LM, Miller SW, Dinsmore RE, Gilbert JP, Harthorne JW: Interobserver variability in coronary angiography. Circulation 53: 627, 1976

12. Galbraith JE, Murphy ML, de Soyza N: Coronary angiogram interpretation. JAMA 240: 2053, 1978

13. Chaitman BR, DeMots H, Bristow JD, Rosch J, Rahimtoola SH: Objective and subjective analysis of left ventricular angiograms. Circulation 52: 420, 1975

14. Mason JW, Myers RW, Goris ML, Doherty P, Alderman EL, Kriss JP: Reliability and reproducibility of interpretation of 99mtechnetium pyrophosphate myocardial scintigrams. Clin Cardiol 2: 446, 1979

15. Botvinick EH, Dunn RF, Hattner RS, Massie BM: A consideration of factors affecting the diagnostic accuracy of Thallium-201 myocardial perfusion scintigraphy in detecting coronary artery disease. Semin Nucl Med 10: 157, 1980

16. Wolthuis RA, Froelicher VF, Fischer J, Noquera I, David B, Stewart AJ, Triebwasser JH: New practical treadmill protocol for clinical use. Am J Cardiol 39: 697, 1977

17. Fried R: Correlation II: measures of association. In Introduction To Statistics. Selected Procedures for the Behavioral Sciences. New York, Oxford University Press, 1969, pp 201-206

18. Cohen J: A coefficient of agreement for nominal scales. Educ Psychol Measurement 20: 37, 1960

19. Meade TW, Gardner MJ, Cannon P: Observer variability in recording the peripheral pulses. Br Heart J 30: 661, 1968

20. Crawford MH, Grant D, O'Rourke RA, Starling MR, Groves BM: Accuracy and reproducibility of new M-mode echocardiographic recommendations for measuring left ventricular dimensions. Circulation 61: 137, 1980

21. Sahn DJ, DeMaria A, Kisslo J, Weyman A: Recommendations regarding quantitation in M-mode echocardiography: results of a survey of echocardiographic measurements. Circulation 58: 1072, 1978

22. Felner JM, Blumenstein BA, Schlant RC, Carter AD, Alimurung BN, Johnson MJ, Sherman SW, Klicpera MW, Kutner MA, Drucker LW: Sources of variability in echocardiographic measurements. Am J Cardiol 45: 995, 1980

23. Acheson RM: Observer error and variation in the interpretation of electrocardiograms in an epidemiological study of coronary heart disease. Br J Prev Soc Med 14: 99, 1960

24. Rose GA, Blackburn H: Minnesota Code for resting electrocardiograms. In Cardiovascular Survey Methods. Belgium, World Health Organization, 1968, pp 137-154

25. Slutsky R, Karliner J, Battler A, Pfisterer M, Swanson S, Ashburn W: Reproducibility of ejection fraction and ventricular volume by gated radionuclide angiography after myocardial infarction. Radiology 132: 155, 1979

26. Okada RD, Kirshenbaum HD, Kushner FG, Strauss HW, Dinsmore RE, Newell JB, Boucher CA, Block PC, Pohost GM: Observer variance in the qualitative evaluation of left ventricular wall motion and the quantitation of left ventricular ejection fraction using rest and exercise multigated blood pool imaging. Circulation 61: 128, 1980

27. Pohost GM, Alpert NM, Ingwall JS, Strauss HW: Thallium redistribution: mechanisms and clinical utility. Semin Nucl Med 10: 70, 1980

28. McLaughlin PR, Martin RP, Doherty P, Daspit S, Goris M, Haskell W, Lewis S, Kriss JP, Harrison DC: Reproducibility of thallium-201 myocardial imaging. Circulation 55: 497, 1977

29. Trobaugh GB, Wackers FJ, Sokole EB, DeRouen TA, Ritchie JL, Hamilton GW: Thallium-201 myocardial imaging: an interinstitutional study of observer variability. J Nucl Med 19: 359, 1978

30. Bailey IK, Griffith LS, Rouleau J, Strauss W, Pitt B: Thallium-201 myocardial perfusion imaging at rest and during exercise. Circulation 55: 79, 1977

31. Verani MS, Marcus ML, Razzak MA, Ehrhardt JC: Sensitivity and specificity of thallium-201 perfusion scintigrams under exercise in the diagnosis of coronary artery disease. J Nucl Med 19: 773, 1978

32. Blood DK, McCarthy DM, Sciacca RR, Cannon PJ: Comparison of single-dose and double-dose thallium-201 myocardial perfusion scintigraphy for the detection of coronary artery disease and prior myocardial infarction. Circulation 58: 777, 1978

33. Ritchie JL, Trobaugh GB, Hamilton GW, Gould KL, Narahara KA, Murray JA, Williams DL: Myocardial imaging with thallium-201 at rest and during exercise. Circulation 56: 66, 1977

34. Lenaers A: Thallium-201 myocardial perfusion scintigraphy during rest and exercise. Cardiovasc Radiol 2: 195, 1979

35. Verani MS, Marcus ML, Spoto G, Rossi NP, Ehrhardt JC, Razzak MA: Thallium-201 myocardial perfusion scintigrams in the evaluation of aorto-coronary saphenous bypass surgery. J Nucl Med 19: 765, 1978

36. Botvinick EH, Taradash MR, Shames DM, Parmley WW: Thallium-201 myocardial perfusion scintigraphy for the clinical clarification of normal, abnormal and equivocal electrocardiographic stress tests. Am J Cardiol 41: 43, 1978

37. Wackers FJ, Sokole EB, Samson G, van der Schoot JB: Anatomy of the normal myocardial image. In Thallium-201 Myocardial Imaging, edited by Ritchie JL, Hamilton GW, Wackers FJT. New York, Raven Press, 1978, p 50

38. Lenaers A, Block P, vanThiel E, Lebedelle M, Becquevort P, Erbsmann R, Ermans AM: Segmental analysis of Tl-201 stress myocardial scintigraphy. J Nucl Med 18: 509, 1977

39. Massie BM, Botvinick EH, Brundage BH: Correlation of thallium-201 scintigrams with coronary anatomy: factors affecting region by region sensitivity. Am J Cardiol 44: 616, 1979

40. Rigo PR, Bailey IK, Griffith LSC, Pitt B, Burow RD, Wagner HN, Becker LC: Value and limitations of segmental analysis of stress thallium myocardial imaging for localization of coronary artery disease. Circulation 61: 973, 1980

41. Verani MS, Marcus ML, Spoto G, Rossi NP, Ehrhardt JC, Razzuk MA: Thallium-201 myocardial perfusion scintigrams in the evaluation of aorto-coronary saphenous bypass surgery. J Nucl Med 19: 765, 1978

42. Froelicher VF, Jensen DG, Atwood E, McKirnan D, Gerber K, Slutsky R, Battler A, Ashburn W, Ross J: Evidence for improvement in myocardial perfusion and function after cardiac rehabilitation. Arch Phys Med Rehabil 61: 517, 1980

43. Atwood E, Jensen D, Froelicher V, Gerber K, Witztum K, Slutsky R, Ashburn W: Radionuclide perfusion images before and after cardiac rehabilitation. Aviat Space Environ Med 51: 892, 1980

44. Hirzel H, Gruentzig A, Neusch K, Krayenbuhl HP, Horst W: Thallium-201 imaging for the evaluation of myocardial perfusion after percutaneous transluminal angioplasty of coronary artery stenosis. (abstr) Circulation 58 (suppl II): II-180, 1978

45. Tubau J, Witzam K, Jensen D, Atwood E, Froelicher V, Ashburn W: Changes in myocardial perfusion and left ventricular function after physical conditioning. (abstr) J Nucl Med 22: P24, 1981
