

Eur Radiol (2008) 18: 1134–1143

DOI 10.1007/s00330-008-0878-0

BREAST

Observer variability in screen-film mammography versus full-field digital mammography with soft-copy reading

Per Skaane, Felix Diekmann, Corinne Balleyguier, Susanne Diekmann, Jean-Charles Piguet, Kari Young, Michael Abdelnoor, Loren Niklason

Received: 19 September 2007. Revised: 20 November 2007. Accepted: 24 December 2007. Published online: 27 February 2008. © European Society of Radiology 2008

Abstract Full-field digital mammography (FFDM) with soft-copy reading is more complex than screen-film mammography (SFM) with hard-copy reading. The aim of this study was to compare inter- and intraobserver variability in SFM versus FFDM of paired mammograms from a breast cancer screening program. Six radiologists interpreted mammograms of 232 cases obtained with both techniques, including 46 cancers, 88 benign lesions, and 98 normals. Image interpretation included BI-RADS categories. A case consisted of standard two-view mammograms of one breast. Images were scored in two sessions separated by 5 weeks. Observer variability was substantial for SFM as well as for FFDM, but overall there was no significant difference between the observer variability at SFM and FFDM. Mean kappa values were lower, indicating less agreement, for microcalcifications compared with masses. The lower observer agreement for microcalcifications, and especially the low intraobserver concordance between the two imaging techniques for three readers, was noticeable. The level of observer agreement might be an indicator of radiologist performance and could confound studies designed to separate diagnostic differences between the two imaging techniques. The results of our study confirm the need for proper training for radiologists starting FFDM with soft-copy reading in breast cancer screening.

Keywords Breast neoplasms · Radiography · Breast radiography · Comparative studies · Cancer screening · Full-field digital mammography · Interobserver variation

Presented at ECR, Wien 2006.

P. Skaane (*) · K. Young, Department of Radiology, Breast Imaging Center, Ullevaal University Hospital, Kirkeveien 166, 0407 Oslo, Norway. e-mail: [email protected], Tel.: +47-22-119411, Fax: +47-23-016535

F. Diekmann · S. Diekmann, Department of Diagnostic Radiology, University Charite, Berlin, Germany

C. Balleyguier, Institut Gustave Roussy, Villejuif, France

J.-C. Piguet, Institut Imagerive, Geneva, Switzerland

M. Abdelnoor, Center for Clinical Research, Section of Epidemiology and Biostatistics, Ullevaal University Hospital, Oslo, Norway

L. Niklason, Hologic Inc., Hillsborough, NC, USA

Introduction

The success of screening mammography depends upon the detection of small preclinical cancers. The depiction of fine microcalcifications and subtle soft tissue masses and densities on high quality mammography is key to the detection of these early breast cancers. Conventional screen-film mammography (SFM) has some distinct advantages, including hard-copy images conveniently displayed on motorized alternators in a rather simple way. Full-field digital mammography (FFDM) offers several advantages in mammography screening [1]. The true flexibility and benefit of digital technology is primarily realized in a soft-copy display of the images and consequently in soft-copy reading. Experimental clinical studies comparing SFM with FFDM hard-copy [2–5] or soft-copy [6] reading have demonstrated comparable or slightly better results for FFDM regarding lesion detection and characterization.

Interobserver variability is a serious problem in breast imaging, and radiologists differ substantially in their interpretation of mammograms [7–10]. Differences in mammogram interpretations among radiologists can influence the number of detected cancers and, consequently, the effect of screening mammography on breast cancer mortality. Inter- and intraobserver variability using the Breast Imaging Reporting and Data System (BI-RADS) is substantial for both mammographic feature analysis and final assessment recommendations [11]. The challenge of interobserver variation is best demonstrated by the relatively low agreement on assessment, even when the BI-RADS categories are dichotomized into a binary outcome [11].

FFDM soft-copy reading is more complex than SFM hard-copy reading. Radiologists may differ in their use of FFDM postprocessing, both in the extent of “roam-and-zoom” and in which steps they include in their soft-copy reading. Little is known about observer variability in FFDM soft-copy reading versus SFM hard-copy reading. The purpose of this study was to analyze observer variability in SFM and FFDM of paired mammograms from a population-based breast cancer screening program.

Materials and methods

Study population

All cases (one case=one breast) included two standard views [craniocaudal (CC) and mediolateral oblique (MLO)] of the breast, and were selected from the Norwegian Breast Cancer Screening Program (NBCSP). Most cases were obtained from the Oslo I study, a paired study in which the women underwent a two-view examination of each breast with SFM and FFDM on the same day [12]. Some cases were collected from the Oslo II study [13], and for these cases the time interval between the examinations with SFM and FFDM was less than three weeks. All women were informed about the study beforehand, and their participation was voluntary. Each woman enrolled in the two studies signed a written consent form, and both studies were approved by the Regional Ethical Committee.

A total of 250 cases having two standard views acquired with both SFM and FFDM were selected from the database by a nonradiologist. The selected study material included initially 100 normal cases, 100 abnormal but benign cases, and 50 cancers.

All 100 normal cases were randomly collected from the Oslo I study, which included independent double reading for SFM as well as FFDM. The inclusion criteria for a normal examination were that all four radiologists had scored the examination as normal (category “1” on the five-point rating scale for probability of cancer used in the NBCSP), that the woman did not present with interval cancer, and that the case was scored as normal in the subsequent screening round. Among the 100 normal cases selected by the nonradiologist, two women with large breasts and large format images (24 by 30 cm) on SFM and more than the four standard images on FFDM were excluded from analysis.

Of the 100 benign cases, 85 cases were selected from the Oslo I and 15 cases from the Oslo II study. A case was defined as benign if call-back assessment confirmed a benign abnormality, and the woman did not present with an interval cancer or with a cancer in the subsequent screening round. Twelve cases among the selected 100 benign lesions were excluded from further analysis since no abnormality was confirmed on diagnostic work-up and the abnormality suspected on screening mammograms proved to be superimposed glandular tissue and no true abnormality.

Fifty screen-detected cancers, including both ductal carcinoma in-situ (DCIS) and invasive cancers, were randomly selected from the Oslo I study (n=27) or the Oslo II study (n=23). All cancers were confirmed by histology. The histological type of the cancers, the size of the cancers, and the breast parenchyma density were not known by the person who randomly selected the cases from the files. Four cancers were excluded from analysis: one cancer proved to be mammographically occult; one cancer proved to be a positioning failure at FFDM; and two cancers proved to have latero-medial views instead of MLO at diagnostic work-up so that comparison was inappropriate.

Thus, the final study population consisted of 232 cases: 98 normals (mean age 56.2 years, range 49–67 years), 88 benign cases (mean age 56.4 years, range 45–68 years), and 46 cancers (mean age 59.2 years, range 51–70 years). Eighteen breasts were categorized as BI-RADS density 1 and 12 cases as BI-RADS density 4. The distribution of the four density groups among the 232 cases has been described in a previous study [14].
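The case counts reported above can be checked with a few lines of arithmetic. The following is only a bookkeeping sketch; the dictionary names are illustrative and are not part of the study.

```python
# Cases initially selected and later excluded, as reported in the text.
selected = {"normal": 100, "benign": 100, "cancer": 50}
excluded = {"normal": 2, "benign": 12, "cancer": 4}

# Final number of cases per group and in total.
final = {group: selected[group] - excluded[group] for group in selected}
total = sum(final.values())

print(final, total)
```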

Imaging

SFM examinations were acquired on a Mammomat 300 (Siemens, Erlangen, Germany) with Kodak Min-R 2000 film and Min-R 2190 screens (Kodak Health Imaging, Rochester, N.Y.), using molybdenum target and a molybdenum filter at 29 kV. The physicists selected 29 kV in accordance with the recommendations of the NBCSP, and the SFM images had optical density in compliance with the NBCSP requirements [15].

FFDM images were acquired on a Senographe 2000D (GE Medical Systems, Milwaukee, Wis.), which uses a CsI-amorphous silicon detector. The unit is equipped with an AOP (automatic optimization of parameters), in which anode-filter combination, kV and mAs are selected automatically after analysis of a short pre-exposure.

Image interpretation

Six radiologists (A–F) from four European countries participated in the study. The readers’ experience in SFM varied from 4 to 24 years and in FFDM soft-copy reading from 2 to 4 years. The number of screening mammograms interpreted by each radiologist in their own practice varied from 2,500 to 12,000 per year.

Soft-copy reading was carried out in a darkened room, and a darkened room with high luminance view boxes was used for hard-copy SFM. The reading conditions were identical for all readers. The mammograms were made anonymous. No clinical information was available. The radiologists scored the images in two sessions separated by 5 weeks such that the same case was not seen twice in any session. Each session included six interpretation “rounds” alternating between SFM and FFDM images with a time limit of 60 min for about 40 cases. A magnifying glass was offered for SFM interpretation. FFDM images were interpreted using soft-copy reading on the GE review workstation, which included two high-resolution 2K×2.5K monitors and a dedicated keypad. Postprocessing (window-level adjustments, zooming, inversion) of the images was optional but recommended.

The readers marked the localization of an abnormality (if present) on a sheet. For cases in which more than one lesion was suspected, the abnormality with the highest suspicion was considered. BI-RADS scores of 1–5 were given for all cases: category 1=negative finding; category 2=benign finding with no mammographic evidence of malignancy; category 3=probably benign finding and short-term follow-up would have been suggested in daily practice; category 4=suspicious abnormality with biopsy being considered; category 5=highly suggestive of malignancy. BI-RADS category 0 was omitted.

Breast parenchyma density for each case was retrospectively determined by two readers (P.S. and K.Y.) using the BI-RADS classification (category 1=fatty; category 2=scattered dense; category 3=heterogeneously dense; category 4=extremely dense).

Statistical analysis

Observer variability was evaluated using observed agreement and Cohen’s kappa statistic, including linear and quadratic weighting [16]. A preliminary power analysis was performed to estimate the sample size required to determine whether a kappa of a given magnitude is significantly higher than 0.6. This was done taking into consideration the proportion of “Yes” observations of the two observers, the type 1 error and the power [17]. The estimated sample size is the number of cases needed to determine whether the lower 95% confidence limit of a given kappa exceeds 0.6. For our study we considered a power of 80%, a type 1 error of 5%, and an expected value of kappa of 0.8 with a frequency of “Yes” of 35%. We would then need a sample size of 138 patients.

Weighted kappa statistic was used for the ordinal data [18]. We used quadratic weighting for the five-point BI-RADS scale. Additionally, we applied linear weighting for a three-level scale of the collapsed five-point BI-RADS scale for evaluation of concordance between the two imaging modalities for each reader. For this evaluation, the scores 1 and 2 were grouped together and the scores 4 and 5 were grouped together since their difference has little consequence on decision-making in daily practice. A kappa value <0.20 was considered poor, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 good, and >0.81 very good observer agreement. Kappa values were compared using the U-test.
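The weighted kappa described above can be illustrated with a minimal sketch. It assumes the standard definition of Cohen's weighted kappa with disagreement weights |i−j|/(k−1) (linear) or its square (quadratic); the function names, the example ratings, and the handling of values that fall exactly on a band boundary are illustrative assumptions and are not taken from the study or its software.

```python
from collections import Counter


def weighted_kappa(ratings_a, ratings_b, categories, weighting="quadratic"):
    """Cohen's weighted kappa for two equal-length lists of ordinal ratings."""
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")
    k = len(categories)
    n = len(ratings_a)
    index = {c: i for i, c in enumerate(categories)}
    # Observed joint counts and the marginal counts of each rater.
    joint = Counter((index[a], index[b]) for a, b in zip(ratings_a, ratings_b))
    row = Counter(index[a] for a in ratings_a)
    col = Counter(index[b] for b in ratings_b)
    power = 2 if weighting == "quadratic" else 1
    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power  # disagreement weight
            observed += weight * joint[(i, j)] / n
            expected += weight * row[i] * col[j] / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed / expected


def collapse(score):
    """Collapse BI-RADS 1-5 to three levels: {1,2} -> 1, 3 -> 2, {4,5} -> 3."""
    return 1 if score <= 2 else 2 if score == 3 else 3


def agreement_band(kappa):
    """Verbal band from the text; boundary handling is an assumption."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Hypothetical BI-RADS scores of one reader for the same eight cases.
sfm = [1, 2, 3, 4, 5, 1, 2, 4]
ffdm = [1, 3, 3, 5, 5, 2, 2, 4]
kappa5 = weighted_kappa(sfm, ffdm, [1, 2, 3, 4, 5], "quadratic")
kappa3 = weighted_kappa([collapse(s) for s in sfm],
                        [collapse(s) for s in ffdm], [1, 2, 3], "linear")
print(kappa5, kappa3, agreement_band(kappa3))
```

Quadratic weighting penalizes a two-category disagreement four times as heavily as a one-category disagreement, whereas linear weighting penalizes it twice as heavily.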

Two-by-two table analysis was used for comparing SFM and FFDM interpretations based on BI-RADS categories for all lesions and subgroups of the study population. For binary outcome with two-by-two table analysis, a cut-off between BI-RADS 2 and 3 was used for positive versus negative score; i.e., for the cancer cases a BI-RADS score of 3 or higher was defined as true positive. McNemar test (P value less than 0.05 considered statistically significant) was used to compare the discordant pairs in the two-by-two table analysis (Epi Info, Version 6, Centers for Disease Control and Prevention, Atlanta, Ga.).
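The dichotomization and the comparison of discordant pairs can be sketched as follows. This is a hedged illustration, not the Epi Info procedure: it uses the exact (binomial) form of McNemar's test, which is an assumption because the text does not state which variant was applied, and the function name and example scores are hypothetical.

```python
from math import comb


def mcnemar_exact(sfm_scores, ffdm_scores, cutoff=3):
    """Exact two-sided McNemar test on paired BI-RADS scores.

    A score >= cutoff counts as a positive test result.
    Returns (SFM-only positives, FFDM-only positives, P value).
    """
    sfm_only = ffdm_only = 0
    for s, f in zip(sfm_scores, ffdm_scores):
        sfm_pos, ffdm_pos = s >= cutoff, f >= cutoff
        if sfm_pos and not ffdm_pos:
            sfm_only += 1
        elif ffdm_pos and not sfm_pos:
            ffdm_only += 1
    n = sfm_only + ffdm_only  # only discordant pairs carry information
    if n == 0:
        return sfm_only, ffdm_only, 1.0
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(sfm_only, ffdm_only) + 1)) / 2 ** n
    return sfm_only, ffdm_only, min(1.0, 2 * tail)


# Hypothetical scores for six cases read with both techniques.
print(mcnemar_exact([1, 2, 3, 4, 5, 2], [3, 2, 1, 4, 5, 4]))
```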

The Wilcoxon signed-rank test for matched pairs was used to compare the scores for cancers at SFM and FFDM for the individual readers. Receiver operating characteristic (ROC) analysis was used for calculating the diagnostic performance of the individual reader (ROCKIT program, Macintosh PPC version 0.9.1 Beta, Charles E. Metz, University of Chicago, Ill.). For comparing the mean area under the curve (Az) values for SFM and FFDM, a P value less than 0.05 was considered to show statistical significance.
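As a rough illustration of what an area under the ROC curve measures for five-point BI-RADS scores, the sketch below computes the nonparametric (empirical) area. This is a deliberately swapped-in technique: ROCKIT fits a parametric binormal model, so the Az values reported in this paper would generally differ from the empirical area. The function name and the example scores are hypothetical.

```python
def empirical_auc(cancer_scores, noncancer_scores):
    """Probability that a cancer case is scored higher than a non-cancer case.

    Ties count as one half; this equals the trapezoidal area under the
    empirical ROC curve.
    """
    wins = 0.0
    for c in cancer_scores:
        for n in noncancer_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_scores) * len(noncancer_scores))


# Hypothetical BI-RADS scores from one reader.
print(empirical_auc([5, 4, 3], [1, 2, 3]))
```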

Results

All cases

Overall diagnostic performance for all cases for the six readers using ROC analysis revealed that five readers performed better at FFDM, of which one reader demonstrated a statistically significantly higher performance at FFDM and one reader showed borderline significance in favor of FFDM (Table 1). The ROC analyses performed for a fixed single false-positive fraction (FPF) and consequently a given true-positive fraction (TPF) revealed that two readers (B and E) had a statistically significantly higher performance at FFDM compared with SFM (P value 0.01 and 0.03, respectively).


Interobserver agreement for FFDM versus SFM for all pairs of readers showed a slightly higher kappa value (quadratic weighting) for SFM in ten and a slightly higher kappa value for FFDM in five of the 15 pairs of readers (Fig. 1). The mean weighted kappa score for SFM was 0.74 (range, 0.68–0.81) and for FFDM 0.71 (range, 0.61–0.82). The mean kappa values for each reader are summarized in Table 2.

Concordance for SFM versus FFDM for the six readers for all lesions using kappa with linear weighting (three-level scale of the collapsed five-point BI-RADS scale) showed a mean kappa value of 0.54 (range, 0.45–0.62). The individual results were: reader A, 0.51; reader B, 0.53; reader C, 0.60; reader D, 0.51; reader E, 0.62; and reader F, 0.45 (Fig. 2).

Comparison of the discordant pairs for SFM versus FFDM in a two-by-two table analysis (BI-RADS scores 1 and 2 negative and 3–5 positive test result) using McNemar’s test for paired proportions revealed no significant difference for any of the readers.

Subgroup: normal cases

Concordance between the two imaging techniques for normal cases (n=98) as expressed by observed agreement ranged among the six radiologists from 57% to 86%: reader A, 57%; reader B, 57%; reader C, 70%; reader D, 62%; reader E, 86%; and reader F, 72%. The observed agreement using the three-level scale of the collapsed five-point BI-RADS scores showed the following values: reader A, 72%; reader B, 77%; reader C, 85%; reader D, 77%; reader E, 99%; and reader F, 87%. Kappa statistic with quadratic weighting was not calculated because the observed concordance was smaller than mean-chance concordance for some tables.

Fig. 1 Observer agreement for all pairs of the six readers for all cases (n=232) presented as weighted kappa with quadratic weighting based on the five-point BI-RADS scale [bar chart; x-axis: reader pairs AB to EF; y-axis: kappa; series: SFM, FFDM]

Table 1 Comparison of the area under the ROC curve (Az) for FFDM and SFM for the six readers (A–F) using the BI-RADS classification (scores 1–5). The Az values are given for all cases (n=232), for masses (n=71), and for microcalcifications (n=57)

| Reader | All cases (n=232): FFDM | SFM | P value | Masses (n=71): FFDM | SFM | P value | Calc only (n=57): FFDM | SFM | P value |
|---|---|---|---|---|---|---|---|---|---|
| A | 0.85 | 0.88 | 0.47 | 0.72 | 0.81 | 0.15 | 0.88 | 0.77 | 0.26 |
| B | 0.96 | 0.91 | 0.09 | 0.94 | 0.90 | 0.38 | n.a. (a) | n.a. (a) | n.a. (a) |
| C | 0.92 | 0.90 | 0.32 | 0.85 | 0.82 | 0.66 | n.a. (a) | n.a. (a) | n.a. (a) |
| D | 0.93 | 0.89 | 0.11 | 0.92 | 0.90 | 0.81 | 0.82 | 0.75 | 0.45 |
| E | 0.93 | 0.86 | 0.04 | 0.90 | 0.78 | 0.08 | 0.88 | 0.82 | 0.37 |
| F | 0.90 | 0.89 | 0.66 | 0.83 | 0.80 | 0.66 | 0.79 | 0.81 | 0.71 |

(a) Nonapplicable; data degenerate and expelled from analysis (either clustered or tied values)

The number of false-positive interpretations for normal cases (defined as BI-RADS score 3–5) for the six radiologists was in total 48 of 588 (8.2%) for SFM (range among readers, 1–14) versus 70 of 588 (11.9%) for FFDM (range among readers, 0–17). Three normal cases were given BI-RADS score 5 at SFM compared with 0 cases at FFDM (Fig. 3). The number of false-positive scores for both imaging modalities among the six radiologists ranged from 1 to 31 (Fig. 3).
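The false-positive percentages for normal cases follow directly from the counts in the text: six readers each interpreted 98 normal cases, giving 588 readings per technique. A minimal check, with illustrative variable names:

```python
readers = 6
normal_cases = 98
readings = readers * normal_cases  # readings of normal cases per technique

# False-positive interpretations (BI-RADS 3-5) as reported in the text.
false_positives = {"SFM": 48, "FFDM": 70}
rates = {tech: 100 * count / readings for tech, count in false_positives.items()}

for tech, rate in rates.items():
    print(f"{tech}: {false_positives[tech]} of {readings} = {rate:.1f}%")
```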

Subgroup: masses and densities

ROC analysis for masses and densities (n=71: 45 benign and 26 malignant masses; six cancers manifesting as mass with microcalcifications excluded from this analysis) showed for FFDM Az values ranging from 0.72 to 0.94 and for SFM Az values from 0.78 to 0.90. Five of the six readers showed a higher Az value for FFDM. There was no statistically significant difference among the readers at FFDM versus SFM for masses, but the difference approached borderline significance in favor of FFDM for one reader (Table 1).

Observer agreement for FFDM versus SFM for pairs of readers for masses showed for SFM a mean kappa score (quadratic weighting) of 0.66 (range, 0.50–0.80) and for FFDM 0.64 (range, 0.54–0.77). The mean kappa values for each reader are summarized in Table 2.

Concordance between the two imaging techniques for the individual readers using the kappa statistic (linear weighting based on the three-level scale of the collapsed five-point BI-RADS scale) showed, for masses, a mean kappa value of 0.40 for the six readers. Values for the individual readers were: reader A, 0.38; reader B, 0.56; reader C, 0.43; reader D, 0.45; reader E, 0.34; and reader F, 0.21 (Fig. 2).

Comparison of discordant pairs in two-by-two table analysis using McNemar’s test for paired proportions (BI-RADS scores 1 and 2 negative and the BI-RADS scores 3–5 grouped as a positive test result) did not show any statistically significant difference for the readers.

Table 2 Mean kappa scores with quadratic weighting (the mean value of agreement for the five comparisons of each reader with the others) based on the five-point BI-RADS scores 1–5 for the six radiologists (A–F) at SFM and FFDM for all cases (n=232), masses (n=71), and for microcalcifications (n=57). The overall mean kappa value is based on all 30 possible pairs of comparisons for the six radiologists

| Reader | All cases (n=232): SFM | FFDM | Masses (densities) (n=71): SFM | FFDM | Microcalcifications (n=57): SFM | FFDM |
|---|---|---|---|---|---|---|
| A | 0.74 | 0.66 | 0.72 | 0.58 | 0.51 | 0.45 |
| B | 0.76 | 0.73 | 0.71 | 0.67 | 0.54 | 0.51 |
| C | 0.75 | 0.74 | 0.66 | 0.68 | 0.56 | 0.54 |
| D | 0.69 | 0.68 | 0.61 | 0.63 | 0.45 | 0.46 |
| E | 0.76 | 0.73 | 0.65 | 0.67 | 0.62 | 0.56 |
| F | 0.73 | 0.73 | 0.61 | 0.63 | 0.54 | 0.53 |
| Overall mean | 0.74 | 0.71 | 0.66 | 0.64 | 0.54 | 0.51 |
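The overall means in Table 2 can be reproduced from the per-reader means. Because each of the 15 reader pairs contributes to exactly two reader means, averaging the six reader means is equivalent to averaging over the 30 directed comparisons. The sketch below performs this consistency check on the published, already rounded values; the variable names are illustrative.

```python
from statistics import mean

# Per-reader mean kappa (readers A-F) from Table 2.
reader_means = {
    ("all cases", "SFM"): [0.74, 0.76, 0.75, 0.69, 0.76, 0.73],
    ("all cases", "FFDM"): [0.66, 0.73, 0.74, 0.68, 0.73, 0.73],
    ("masses", "SFM"): [0.72, 0.71, 0.66, 0.61, 0.65, 0.61],
    ("masses", "FFDM"): [0.58, 0.67, 0.68, 0.63, 0.67, 0.63],
    ("microcalcifications", "SFM"): [0.51, 0.54, 0.56, 0.45, 0.62, 0.54],
    ("microcalcifications", "FFDM"): [0.45, 0.51, 0.54, 0.46, 0.56, 0.53],
}

# Mean of the six reader means, rounded as in the table.
overall = {key: round(mean(values), 2) for key, values in reader_means.items()}

for (group, technique), value in overall.items():
    print(f"{group:20s} {technique:5s} {value:.2f}")
```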

Fig. 2 Concordance at SFM versus FFDM for the six readers for all lesions (n=232), masses (densities) only (n=71), and microcalcifications only (n=57). Concordance is presented as kappa values (linear weighting) based on the three-level scale of the collapsed five-point BI-RADS scores [bar chart; x-axis: readers A to F; y-axis: kappa; series: all lesions, densities, calcifications]


Subgroup: microcalcifications

ROC analysis for microcalcifications only (n=57: 43 benign and 14 malignant cases; six cancers manifesting as mass with microcalcifications were excluded from analysis) showed higher Az value for FFDM compared with SFM for three of four readers. The data were “degenerate” (either “clustered” with most data representing just a few scores without enough range, or “tied values” with the same score for many analogue and digital interpretations) for two readers (B and C) and these data were expelled from analysis by the computer program. None of the differences between FFDM and SFM for the four readers were statistically significant (Table 1).

Observer agreement for FFDM versus SFM for pairs of readers using the kappa statistic (quadratic weighting) showed a mean weighted kappa for SFM of 0.54 (range, 0.40–0.68) and for FFDM 0.51 (range, 0.40–0.75) (Table 2). There was a higher kappa value at SFM for nine of the 15 pairs and a higher kappa value at FFDM for five of the 15 pairs of radiologists. For one pair (reader A versus D) the kappa value was identical (Fig. 4). The difference in the mean kappa value (quadratic weighting) for SFM versus FFDM was not statistically significant (U-test, P=0.89). However, interobserver variation was more marked for microcalcifications compared with all cases (Figs. 1, 4).

Fig. 3 Normal cases with false-positive interpretations (BI-RADS score 3 or higher) for the six readers (A–F). a SFM: a total of 48 false positives. b FFDM: a total of 70 false positives. There was no false positive BI-RADS 5 score at FFDM [two stacked bar charts, panels a and b; x-axis: readers A to F; y-axis: number; series: BI-RADS 3, 4, and 5]

Concordance at SFM versus FFDM for the six readers for microcalcifications only (n=57) showed a mean kappa value (linear weighting based on the three-level scale of the collapsed five-point BI-RADS scale) of 0.36. The values for the six readers were: reader A, 0.29; reader B, 0.18; reader C, 0.58; reader D, 0.25; reader E, 0.54; and reader F, 0.33. For three readers (readers A, B and D) these kappa values were lower than for masses, whereas the corresponding values were higher for the three other readers, C, E, and F (Fig. 2).
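The reader-by-reader comparison above can be verified from the concordance values quoted in the text for masses and for microcalcifications. The following is a small illustrative check; the variable names are not from the study.

```python
from statistics import mean

# SFM-versus-FFDM concordance (linear weighted kappa) per reader, from the text.
masses = {"A": 0.38, "B": 0.56, "C": 0.43, "D": 0.45, "E": 0.34, "F": 0.21}
calcs = {"A": 0.29, "B": 0.18, "C": 0.58, "D": 0.25, "E": 0.54, "F": 0.33}

# Readers whose concordance is lower (or higher) for microcalcifications.
lower_for_calcs = sorted(r for r in calcs if calcs[r] < masses[r])
higher_for_calcs = sorted(r for r in calcs if calcs[r] > masses[r])
mean_calcs = round(mean(list(calcs.values())), 2)

print(lower_for_calcs, higher_for_calcs, mean_calcs)
```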

Comparison of discordant pairs in two-by-two table analysis (BI-RADS 1 and 2 negative and BI-RADS score 3–5 positive test) using McNemar’s test for paired proportions again did not show any significant difference among the readers.

Comparison of diagnostic performance at SFM versus FFDM showed remarkable differences among readers for masses versus microcalcifications. The Az values for all lesions for reader A and reader F at FFDM versus SFM were comparable (Table 1). Comparison of the Az values for the subgroups masses and microcalcifications, however, showed noteworthy differences between the two readers: reader A had problems with masses at FFDM (Az for FFDM 0.72 versus 0.81 for SFM), but showed higher performance for microcalcifications at FFDM (Az for FFDM 0.88 versus 0.77 for SFM). Reader F on the other hand had comparable results at the two imaging techniques for both subgroups (Az for masses 0.83 versus 0.80 and for calcifications 0.79 versus 0.81, respectively) (Fig. 5).

Subgroup: cancers

Since the kappa statistic (observer agreement) does not reflect the conspicuity for cancer and consequently not the level of the BI-RADS scores, we compared the interpretations for SFM and FFDM for cancers (n=46) for the individual reader using the Wilcoxon signed-rank test for matched pairs. The mean score at SFM and FFDM and the Wilcoxon P value for the six readers were, respectively: reader A, 4.04–3.96, P=0.76; reader B, 4.35–4.37, P=0.98; reader C, 3.87–4.11, P=0.14; reader D, 4.02–4.20, P=0.16; reader E, 3.74–4.09, P=0.08; and reader F, 4.11–4.20, P=0.93. The mean BI-RADS score for cancers was slightly higher at FFDM for five of the six readers, but none of the differences were statistically significant.

Discussion

A high level of agreement between SFM and FFDM has been reported in a diagnostic evaluation of breast cancer, and significant disagreement affecting treatment approaches between the two imaging techniques was reported in only 4% [19]. Generalization of results from experimental clinical studies on observer variation is problematic, and results from such studies are difficult to compare with prospective interpretations performed in daily practice. Three large-scale trials with a paired study design comparing SFM and FFDM in a screening setting showed a noticeably high number of cancer cases detected only at one of the two imaging techniques [1, 12, 20, 21]. Observed agreement for cancers based on a two-by-two table analysis was 52%, 68%, and 66% for the Colorado–Massachusetts study, the Oslo I study, and the DMIST trial, respectively [12, 20, 21]. In a recently published paper it was concluded that using two mammograms, one screen-film and one digital, would significantly increase the detection of breast cancer [22].

Fig. 4 Observer agreement for all pairs of readers for cases presenting as microcalcifications only (n=57). Interobserver agreement is presented as weighted kappa (quadratic weighting) based on the five-point BI-RADS scale. The difference between the mean kappa value for SFM and FFDM was not statistically significant (U-test, P=0.89) [bar chart; x-axis: reader pairs AB to EF; y-axis: kappa; series: SFM, FFDM]

In our study the diagnostic performance for all lesions was higher at FFDM compared with SFM for five of six readers, of which the difference in favor of FFDM was statistically significant for one reader (Table 1). Observer agreement among the readers for all cases was comparable at SFM and FFDM (Fig. 1). Concordance for the BI-RADS scores with the two imaging techniques was, however, only moderate for all cases (Fig. 2).

Detection and characterization of microcalcifications at FFDM with soft-copy reading has been a hot topic in the past. Studies have shown that, despite lower spatial resolution at FFDM compared with SFM, the diagnostic accuracy of FFDM is comparable to or even higher than that of SFM for microcalcifications [2–4, 23]. Higher numeric values of specificity in favor of FFDM for microcalcifications have been reported, although the differences were not statistically significant [6]. The observer performance for detecting microcalcifications of breast cancers is comparable between hard-copy film and five-megapixel soft-copy monitor reading [24], and monitor zooming of the FFDM images was reported to be equivalent to direct magnification in a previous study [25]. Thus, the higher contrast resolution and the ability to do image postprocessing with FFDM using soft-copy reading compensate for the limitations in spatial resolution.

An interesting and important finding of our study was the low observer agreement for microcalcifications. Observer agreement for pairs of readers was lower for calcifications than for all cases (Figs. 1, 4), and the mean kappa values were lower for microcalcifications compared with all cases and densities (Table 2). The mean kappa scores (quadratic weighting) at SFM and FFDM were, however, not statistically different. The concordance at SFM versus FFDM for microcalcifications was noticeably low for three of the six readers (Fig. 2). Studies on interobserver variability have shown lower kappa values for calcification descriptions than for other mammographic features [11, 26]. Characterization of microcalcifications is difficult, and BI-RADS training may improve the interobserver agreement for microcalcification morphology [27].

Observer agreement for masses (densities) was lower in our study than for all cases but higher than for microcalcifications. Concordance at SFM versus FFDM for masses was noticeably low for only one of the six readers, but the concordance varied considerably among the readers (Fig. 2). Diagnostic performance for all cases was nearly the same for readers A and F, but the difference between the two subgroups was substantial, although these differences were not statistically significant (Fig. 5).

Fig. 5 ROC curves for readers A and F. Diagnostic performance (Az values) for all lesions was comparable for the two readers (0.88–0.85 and 0.89–0.90 at SFM and FFDM, respectively). ROC curves (Az values) for masses and microcalcifications are, however, quite different, although the differences in Az values were not statistically significant [four panels: Reader A masses, Reader A microcalcifications, Reader F masses, Reader F microcalcifications; x-axis: FPF; y-axis: TPF; curves: FFDM, SFM]

The number of false-positive scores at SFM and FFDM for normal cases (Fig. 3) indicates that at least five of the six readers probably did not interpret the images as they would have done in daily practice (“expectancy bias”). The higher number of false-positive interpretations at FFDM compared with SFM (Fig. 3) seems to support the results of the Oslo I and II studies, which showed a higher recall rate for FFDM [12, 13]. The higher number of false positives at FFDM for normal cases could also reflect a learning curve effect. The six readers were all more experienced in SFM than FFDM, and the experience with FFDM soft-copy reading also varied among the readers. Close attention must be paid to proper reader training and systematic use of image display protocols when introducing FFDM with soft-copy reading, especially in order to avoid missing small cancers in mammography screening [28].

In conclusion, our study has shown substantial observer variability for SFM hard-copy reading as well as for FFDM soft-copy reading in breast cancer screening cases. Overall, there was no statistically significant difference between observer variability at SFM and FFDM. An important finding was the low observer agreement (low kappa values) for calcifications, and especially the low concordance for three of six readers between the two imaging techniques for microcalcifications. We suggest that a learning curve effect might have contributed to this finding. Consequently, our results confirm the need for proper training for radiologists starting FFDM with soft-copy reading in breast cancer screening. The level of observer agreement might be a useful indicator of radiologist performance in the absence of accuracy evaluation. The variability in mammography interpretation at SFM and FFDM may confound observer studies, and it is important to be aware of this problem when designing studies to evaluate the diagnostic differences between the two imaging techniques.

References

1. Bick U, Diekmann F (2007) Digital mammography: what do we and what don’t we know? Eur Radiol 17:1931–1942

2. Diekmann S, Bick U, von Heyden H, Diekmann F (2003) Visualization of microcalcifications on mammographies obtained by digital full-field mammography in comparison to conventional film-screen mammography (in German). Rofo 175:775–779

3. Fischer U, Baum F, Obenauer S, Luftner-Nagel S, von Heyden D, Vosshenrich R, Grabbe E (2002) Comparative study in patients with microcalcifications: full-field digital mammography vs screen-film mammography. Eur Radiol 12:2679–2683

4. Obenauer S, Luftner-Nagel S, von Heyden D, Munzel U, Baum F, Grabbe E (2002) Screen film vs full-field digital mammography: image quality, detectability and characterization of lesions. Eur Radiol 12:1697–1702

5. Obenauer S, Hermann KP, Marten K, Luftner-Nagel S, von Heyden D, Skaane P, Grabbe E (2003) Soft copy versus hard copy reading in digital mammography. J Digit Imaging 16:341–344

6. Kim HH, Pisano ED, Cole EB, Jiroutek MR, Muller KE, Zheng Y, Kuzmiak CM, Koomen MA (2006) Comparison of calcification specificity in digital mammography using soft-copy display versus screen-film mammography. AJR Am J Roentgenol 187:47–50

7. Beam CA, Layde PM, Sullivan DC (1996) Variability in the interpretation of screening mammograms by US radiologists. Arch Intern Med 156:209–213

8. Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR (1994) Variability in radiologists’ interpretations of mammograms. N Engl J Med 331:1493–1499

9. Elmore JG, Nakano CY, Koepsell TD, Desnick LM, D’Orsi CJ, Ransohoff DF (2003) International variation in screening mammography interpretations in community-based programs. J Natl Cancer Inst 95:1384–1393

10. Skaane P, Engedal K, Skjennald A (1997) Interobserver variation in the interpretation of breast imaging. Comparison of mammography, ultrasonography, and both combined in the interpretation of palpable noncalcified breast masses. Acta Radiol 38:497–502

11. Berg WA, Campassi C, Langenberg P, Sexton MJ (2000) Breast imaging reporting and data system: inter- and intraobserver variability in feature analysis and final assessment. AJR Am J Roentgenol 174:1769–1777

12. Skaane P, Young K, Skjennald A (2003) Population-based mammography screening: comparison of screen-film and full-field digital mammography with soft-copy reading—Oslo I study. Radiology 229:877–884

13. Skaane P, Skjennald A (2004) Screen-film mammography versus full-field digital mammography with soft-copy reading: randomized trial in a population-based screening program—The Oslo II study. Radiology 232:197–204

14. Skaane P, Balleyguier C, Diekmann F, Diekmann S, Piguet JC, Young K, Niklason LT (2005) Breast lesion detection and classification: comparison of screen-film mammography and full-field digital mammography with soft-copy reading—observer performance study. Radiology 237:37–44

15. Pedersen K, Nordanger J (2002) Quality control of the physical and technical aspects of mammography in the Norwegian breast-screening programme. Eur Radiol 12:463–470

16. Cohen J (1968) Weighted kappa. Nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 70:213–220

17. Donner A, Eliasziw M (1992) A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance-testing and sample size estimation. Stat Med 11:1511–1519

18. Cicchetti DV (1976) Assessing inter-rater reliability for rating scales: resolving some basic issues. Br J Psychiatr 129:452–456

19. Venta LA, Hendrick RE, Adler YT, DeLeon P, Mengoni PM, Scharl AM, Comstock CE, Hansen L, Kay N, Coveler A, Cutter G (2001) Rates and causes of disagreement in interpretation of full-field digital mammography and film-screen mammography in a diagnostic setting. AJR Am J Roentgenol 176:1241–1248

20. Lewin JM, Hendrick RE, D’Orsi CJ, Isaacs PK, Moss LJ, Karellas A, Sisney GA, Kuni CC, Cutter GR (2001) Comparison of full-field digital mammography with screen-film mammography for cancer detection: results of 4,945 paired examinations. Radiology 218:873–880


21. Pisano ED, Gatsonis C, Hendrick E, Yaffe M, Baum JK, Acharyya S, Conant EF, Fajardo LL, Bassett L, D’Orsi C, Jong R, Rebner M (2005) Diagnostic performance of digital versus film mammography for breast-cancer screening. N Engl J Med 353:1773–1783

22. Glueck DH, Lamb MM, Lewin JM, Pisano ED (2007) Two-modality mammography may confer an advantage over either full-field digital mammography or screen-film mammography. Acad Radiol 14:670–676

23. Vigeland E, Klaasen H, Klingen TA, Hofvind S, Skaane P (2008) Full-field digital mammography with flat-panel selenium detectors in a population-based screening programme: The Vestfold County Study. Eur Radiol 18:183–191

24. Kamitani T, Yabuuchi H, Soeda H, Matsuo Y, Okafuji T, Sakai S, Furuya A, Hatakenaka M, Ishii N, Hona H (2007) Detection of masses and microcalcifications of breast cancer on digital mammograms: comparison among hard-copy film, 3-megapixel liquid crystal display (LCD) monitors and 5-megapixel LCD monitors: an observer performance study. Eur Radiol 17:1365–1371

25. Fischer U, Baum F, Obenauer S, Funke M, Hermann KP, Grabbe E (2002) Full-field digital mammography (FFDM): intraindividual comparison of direct magnification versus monitor zooming in patients with microcalcifications (in German). Radiologe 42:261–264

26. Lazarus E, Mainiero MB, Schepps B, Koelliker SL, Livingston LS (2006) BI-RADS lexicon for US and mammography: interobserver variability and positive predictive value. Radiology 239:385–391

27. Berg WA, D’Orsi CJ, Jackson VP, Bassett LW, Beam CA, Lewis RS, Crewson PE (2002) Does training in the breast imaging reporting and data system (BI-RADS) improve biopsy recommendations or feature analysis agreement with experienced breast imagers at mammography? Radiology 224:871–880

28. Skaane P, Skjennald A, Young K, Egge E, Jebsen I, Sager EM, Scheel B, Søvik E, Ertzaas AK, Hofvind S, Abdelnoor M (2005) Follow-up and final results of the Oslo I study comparing screen-film mammography and full-field digital mammography with soft-copy reading. Acta Radiol 46:679–689
