
Volumetric evaluation of hepatic tumors: multi-vendor, multi-reader liver phantom study

Meghan G. Lubner, B. Dustin Pooler, Alejandro Munoz del Rio, Ben Durkee, Perry J. Pickhardt

Department of Radiology, University of Wisconsin School of Medicine and Public Health, E3/311 Clinical Science Center, 600 Highland Ave., Madison, WI 53792-3252, USA

Abstract

Purpose: To compare liver lesion volume measurement on multiple 3D software platforms using a liver phantom.

Methods: An anthropomorphic phantom constructed with ten liver lesions of varying size, attenuation, and shape with known volume and long-axis measurement was scanned (120 kVp, 80–440 smart mA, NI 12). DICOM data were uploaded to five commercially available 3D visualization systems and manual tumor volume was obtained by three independent readers. Accuracy and reproducibility of linear and volume measurements were compared. The two most promising systems were then compared with an additional prototype system by two readers using both manual and semi-automated measurement, with a similar comparison between linear and volume measures. Measurements were performed on 5- and 1.25-mm data sets. Inter- and intra-observer variability was also assessed.

Results: Overall mean % volume error on the five commercially available software systems (averaging all ten liver lesions among all three readers) was 8.0% ± 7.5%, 13.7% ± 11.2%, 14.2% ± 15.2%, 16.4% ± 14.8%, and 16.9% ± 13.8%, varying almost twofold across vendor. Moderate inter-observer variability was present. Volume measurement was slightly more accurate than linear measurement, but linear measurement was more reproducible across readers and systems. On the two "best" systems, the manual measurement method was more accurate than the automated method (p = 0.001). The prototype system demonstrated superior semi-automated assessment, with a mean % volume error of 5.3% ± 4.1% (vs. 17.8% ± 11.1% and 31.5% ± 19.7%, p < 0.001), with improved inter- and intra-observer variability.

Conclusions: Accuracy and reproducibility of volume assessment of liver lesions varies significantly by vendor, which has important implications for clinical use.

Key words: Tumor—Volume—3D—Phantom—CT—Lesion—Attenuation

Various components of this study were presented at ESGAR 2011 and SGR 2012.

Correspondence to: Meghan G. Lubner; email: [email protected]

© Springer Science+Business Media New York 2014
Published online: 4 February 2014
Abdom Imaging (2014) 39:488–496
DOI: 10.1007/s00261-014-0079-z

There has been growing interest in quantitative imaging biomarkers in recent years, particularly in the evaluation of cancer response to therapy. However, as our knowledge of the molecular signature of different tumor types improves and therapies become increasingly varied and individualized, assessment of response to therapy becomes increasingly complex and in many cases needs to be tailored to the specific tumor and therapy type. As a result, a profusion of imaging biomarkers has emerged using advanced imaging techniques, many of which are still being validated. However, for patients receiving cytotoxic chemotherapy, tumor response is still assessed mainly by change in tumor size, and cumulative data have shown that agents which can cause a decrease in tumor size have a reasonable chance of improving survival [1–4]. Although anatomic assessment of tumor size may not tell the whole story of tumor response, it still plays a major role in the assessment of many tumor types and chemotherapeutic regimens. The most widely applied system in current use is RECIST (Response Evaluation Criteria in Solid Tumors) 1.1, which relies on the use of uni-dimensional measures for overall evaluation of tumor burden [5, 6].

Three-dimensional (3D) volume measurements of tumor size offer several theoretical advantages over uni- and bi-dimensional measurements. First, they provide an objective estimate of the absolute bulk of the tumor, which has been proven to be useful in the prediction of treatment failure and/or prognosis in certain neoplasms [7–9]. For example, recent studies have shown that early tumor volume reduction rate in treatment of rectal cancer may be useful in predicting treatment outcome [10]. Second, assessments based on uni-dimensional measurements presume that tumors change in a uniform, symmetric manner. Volume measurements improve the ability to assess irregular masses changing in a non-symmetric fashion or masses that invaginate around normal structures, and are able to compensate for actual tumor morphology. Similarly, it makes intuitive sense that small uni-dimensional size changes would be amplified in a volumetric assessment, and this has been confirmed in several studies [11–13]. Therefore, it has been suggested that volumetric assessment of tumor size may be a more accurate reflection of the tumor burden, and in turn, a better or earlier predictor of response and survival [6, 14, 15].

The accuracy and reproducibility of volumetric measurement of lung nodules has been well documented [16–23]. Multiple studies have also looked at using volumes to assess tumor burden in the abdomen and pelvis, particularly in the liver and the lymph nodes, both in phantom models and in patients [24–33]. These measurement targets in the abdomen and pelvis may be slightly more challenging, given that they are often closer in attenuation to surrounding structures and may have irregular or infiltrating margins. However, these studies have shown fairly promising results, with relatively accurate and reproducible measurement of tumors using manual, semi-automated, or fully automated software systems over a variety of imaging parameters (e.g., changes in dose, slice thickness).

As a result, volume endpoints have become increasingly adopted by oncology groups for clinical trials, which are often multicenter. In addition, a variety of advanced visualization platforms have become available for this task. Although several of the different software products have been evaluated in isolation, different vendors have not been extensively compared [29–32, 34, 35]. Recent studies looking at more complicated imaging biomarkers like CT and MRI perfusion have demonstrated substantial variability in these measures across vendors [36, 37]. The purpose of this study was to compare liver lesion volume measurement on multiple different commercially available 3D software platforms using a specifically constructed liver phantom.

Materials and methods

Phantom

A custom anthropomorphic abdominal imaging phantom with ten embedded liver lesions of known dimensions was constructed (based on Model 057 Triple Modality 3D Custom Phantom, customized to our specifications, CIRS, Inc., Norfolk, Virginia). The lesions varied in size, shape, and attenuation (n = 5 high attenuation relative to liver, 5 low attenuation relative to liver). The background liver attenuation was 70–77 Hounsfield units (HU). High attenuation lesions were 110–120 HU and low attenuation lesions were 27–32 HU, a difference of approximately 40–50 HU compared to background liver in both cases. These two types of lesions were meant to simulate both hypervascular (e.g., hepatocellular carcinoma, neuroendocrine tumors) and hypovascular (e.g., metastatic GI adenocarcinoma) liver lesions. Lesions were well-circumscribed and uniform in density. Lesions varied in shape from round to ovoid to gently lobulated. Volumes of the lesions were known based on water displacement prior to insertion in the phantom, ranging from 0.55 to 4.69 cc. Long-axis uni-dimensional measures of each lesion were also obtained prior to insertion in the phantom, varying from 0.7 to 2.7 cm.

Initial vendor assessment

The phantom was scanned on a 16-detector-row scanner (GE LightSpeed Xtra, GE Medical, Waukesha, WI) at 120 kVp using Smart-mA with the noise index (NI) set at 12 and an 80–440 mA range. This protocol represents a standard abdomen protocol used on many of our patients with hepatic metastatic disease. The data were reconstructed at 1.25- and 5-mm intervals. The DICOM data were then uploaded onto five commercially available advanced visualization software systems (Fig. 1). The software systems (specific vendor blinded according to results) included: Ziosoft v2.0.0.2 (Redwood City, CA), GE Medical v4.4 (Waukesha, WI), Philips EBW v4.5.2 (Best, The Netherlands), Viatronix V3D Explorer v3.2.3 (Stony Brook, NY), and Vitrea/Vital Images v5.1 (Minnetonka, MN). Results for each commercial system are reported generically (i.e., systems A–E), without mention of specific vendor. The liver lesions were manually segmented on four of the five systems using liver window settings (W: 270 HU, L: 125 HU) (Fig. 1). The manual volume assessment was performed by hand drawing the region of interest around the lesion at its margin on every other slice from the top to the bottom of the lesion. The Philips system was semi-automated; therefore, an ROI was drawn around the lesion and the system tried to detect the lesion margin. The borders were then adjusted by the readers. Given that not all systems had a truly automated method, this was difficult to reliably evaluate across the systems in this initial assessment.
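
For orientation only, a manual volume of this kind can be approximated by simple slab summation of the traced ROI areas. The R sketch below (R being the environment used for the statistical analysis in this study) assumes that contours drawn on every other slice each represent a slab two slice intervals thick; it is not any vendor's algorithm, and the commercial platforms' actual interpolation methods are proprietary and may differ.

# Minimal sketch (not any vendor's algorithm): approximate a lesion volume by
# slab summation of hand-traced ROI areas.
# areas_cm2: ROI area on each traced slice (cm^2); slice_mm: reconstructed slice
# interval (mm); skip = 2 when every other slice is traced, as in this study.
slab_volume_cc <- function(areas_cm2, slice_mm, skip = 2) {
  slab_thickness_cm <- (slice_mm * skip) / 10   # thickness represented by each traced slice
  sum(areas_cm2) * slab_thickness_cm            # volume in cc (cm^3)
}

# Example: five areas traced on every other 5-mm image
slab_volume_cc(c(0.8, 1.6, 2.1, 1.5, 0.7), slice_mm = 5, skip = 2)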

In addition to volumes, orthogonal linear measurements (long axis, short axis, craniocaudal dimension) were also recorded.

The manual volume measurements obtained from the images were then compared to the known volumes of each lesion. Mean absolute error (|Vol_observed - Vol_actual|) and percent error ([(Vol_observed - Vol_actual)/Vol_actual] × 100) were calculated and compared by system. For comparison purposes, measured uni-dimensional long-axis dimensions were compared to the known uni-dimensional long-axis measurements of the lesions. Mean absolute and percent error were also calculated for long-axis uni-dimensional measurements by system.
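
These two definitions translate directly into a pair of one-line functions. The R sketch below is illustrative only (the variable names are ours, not the study's); it takes the absolute value of the percent error so that positive and negative errors do not cancel, as described under Statistical analysis.

# Sketch of the error metrics defined above (illustrative, not the study's code).
abs_error <- function(vol_obs, vol_actual) abs(vol_obs - vol_actual)                      # same units as volume (cc)
pct_error <- function(vol_obs, vol_actual) abs(vol_obs - vol_actual) / vol_actual * 100   # absolute percent error

# Example: three lesions measured on one system (made-up values)
vol_actual <- c(0.55, 1.20, 4.69)
vol_obs    <- c(0.60, 1.05, 4.30)
mean(pct_error(vol_obs, vol_actual))   # mean percent volume error for that system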

Readers

Three readers performed all 50 volume measurements (ten lesions, five systems) independently. Readers consisted of a medical student (Reader 2), a post-graduate trainee (Reader 1, 6 months experience), and a fellowship-trained abdominal radiologist (Reader 3, 7 years experience). Systems were assessed by all three readers with at least a 1-week washout period between each system. Each reader recorded interpretation time for each lesion across the systems. In addition, each reader subjectively ranked the five systems according to overall ease of use on a five-point scale (favored system = 1, least favored = 5).

Mean absolute and percent volume error were also analyzed by reader and compared across readers. Mean relative percent differences between readers for linear size and for volume measurements were assessed to evaluate interobserver variability.

Lesions

In addition, a per-lesion analysis was performed, comparing error by reader and by system across each individual lesion. Lesions were also grouped based on lesion characteristics (large vs. small, high vs. low attenuation) and comparisons of absolute and mean percent error were performed by lesion characteristic.

Advanced vendor assessment

Based on this initial assessment, the two "best" commercially available platforms in terms of accuracy and reproducibility across readers (Systems A and C) were further evaluated and compared with a prototype tumor assessment platform, M (Visia, MeVis, Bremen, Germany). None of the authors have a financial relationship with this vendor. Measurements were performed using both 1.25- and 5-mm slice thickness by two readers (the two more experienced readers, Readers 1 and 3 as above) using both manual and semi-automated techniques. The manual technique again entailed hand-drawing regions of interest around the tumor on multiple slices and extrapolating the volume. The semi-automated technique in each case involves dropping two points along a line delineating the lesion; the system then attempts to detect the lesion using attenuation and edge detection algorithms (often proprietary to the vendor). As before, mean absolute and percent volume error were analyzed by system, reader, slice thickness, and measurement technique. Reader 3 repeated all measurements at least 30 days later. Inter- and intra-observer variability was assessed.

Fig. 1. CT images (A–C) of liver tumor phantom, including examples of both high and low attenuation lesions. Note the variation in size, shape, and attenuation of the lesions, three of which have been manually segmented by three different software systems. For example, in C, the green segmented lesion represents HD1.


Statistical analysis

Absolute and percent volume errors were calculated for each combination of lesion, reader, and system. Mean absolute and percent volume errors were calculated for each reader and for each system. The absolute value of percent volume error was used for calculations of the mean, so that negative and positive errors would not cancel each other out. However, in evaluating the direction of error, the distribution was fairly random (some over- and some underestimation, both by vendor and by reader, without a systematic pattern identified). These are summarized as mean ± 1 standard deviation (SD). One-way analysis of variance (ANOVA) was used to assess each system on accuracy; separate analyses were performed for each reader. If the overall p value for an effect was significant, pairwise comparisons were performed (Fisher's protected LSD). Similar analyses were used to assess the impact of reader, lesion size (<1 cm vs. >1 cm), and lesion attenuation (lower vs. higher than liver attenuation). Variables found to be significant in the univariate analysis were entered into a multivariable main-effects (i.e., no interaction terms) ANOVA model. Pairwise comparisons were performed for vendor, lesion, and reader. p < 0.05 (two-sided) was the criterion for statistical significance. Mean relative percent differences for linear size and software-derived volumes across readers were compared. Bland-Altman 95% limits of agreement were used to assess inter- and intra-observer variability. Separate analyses were done for the initial and advanced vendor assessments. R 2.12.1 (R Development Core Team 2009) was used for all statistical computations and graphics [38].
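
As a rough illustration of this analysis, the R sketch below shows how a one-way ANOVA of percent volume error by system and Bland-Altman 95% limits of agreement between two readers could be computed. It is not the study's actual code; the data, object names, and column names are made up for the example.

# One-way ANOVA of percent volume error by system (made-up long-format data).
set.seed(1)
results <- data.frame(
  system    = rep(c("A", "B", "C", "D", "E"), each = 10),   # 10 lesions per system
  pct_error = abs(rnorm(50, mean = 14, sd = 10))            # absolute percent volume error
)
summary(aov(pct_error ~ system, data = results))

# Bland-Altman bias and 95% limits of agreement between two readers.
# x, y: paired measurements (or percent differences) for the same lesions.
bland_altman <- function(x, y) {
  d    <- x - y
  bias <- mean(d)
  loa  <- bias + c(-1.96, 1.96) * sd(d)                     # 95% limits of agreement
  c(bias = bias, lower = loa[1], upper = loa[2])
}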

Results

Initial vendor assessment

The overall mean percent volume error for the initial five commercially available vendors (i.e., all ten lesions among all three readers across all systems) was 13.8% ± 13.1%, but varied by more than a factor of two among the different software systems. Specifically, the mean % volume error for system A was 8.0% ± 7.5%, system B 13.7% ± 11.2%, system C 14.2% ± 15.2%, system D 16.4% ± 14.8%, and system E 16.9% ± 13.8% (Table 1; Fig. 2). In the univariate analysis, the system/vendor used had a significant impact on the volume measurement for all readers (p = 0.001, p < 0.001, and p = 0.04 for readers 1–3, respectively). This effect persisted in the multivariate model (p = 0.0013). The difference between vendors was less pronounced when evaluating the most experienced reader alone (range 5.8% ± 4.1% to 10.8% ± 8.8%, average 7.5% ± 5.9%, Table 1) but remained statistically significant.

Fig. 2. Percent volume error by system by reader. Box and whisker plots for percent volume error for manual measurements on each of the five advanced visualization software platforms from the initial vendor assessment (A–E), stratified by reader (1–3). Note that systems A and C were the most accurate of the five.

Table 1. Summary of mean percent volume error by system

            Mean percent volume error                                                  Mean percent 1D error
System      Reader 1          Reader 2          Reader 3         Average              Average
A           7.2% ± 5.2%       10.7% ± 11.1%     5.8% ± 4.1%      8.0% ± 7.5%          15.7% ± 13.1%
B           12.6% ± 8.5%      18.3% ± 16.4%     10.1% ± 5.1%     13.7% ± 11.2%        15.7% ± 10.3%
C           10.1% ± 8.2%      28.8% ± 17.1%     3.9% ± 3.8%      14.2% ± 15.2%        14.2% ± 9.9%
D           14.5% ± 16.3%     28.1% ± 13.1%     6.8% ± 4.2%      16.4% ± 14.8%        15.1% ± 10.4%
E           13.4% ± 10.1%     26.4% ± 16.3%     10.8% ± 8.8%     16.9% ± 13.8%        15.8% ± 12.0%
Average     11.6% ± 10.4%     22.5% ± 16.0%     7.5% ± 5.9%      13.8% ± 13.1%        15.3% ± 11.1%

Summary of mean percent volume error as stratified by reader and system, as well as overall mean percent errors for each. This can be compared with the overall mean percent uni-dimensional measurement error, seen in the far right column.


Mean percent volume error also varied by reader (p < 0.001), with significant differences between the least experienced reader (Reader 2) and the two more experienced readers (Readers 1 and 3) (p < 0.001), but none between the two more experienced readers (Readers 1 and 3) (p = 0.24) in pairwise comparisons (Table 1). The univariate effect of reader (least experienced vs. two more experienced) on accuracy persisted in the multivariate model (p < 0.001).

The mean uni-dimensional measurement error (long-axis dimension) was similar to that seen for volume assessment, with an overall mean percent error of 15.3% ± 11.1% (all lesions, all readers, all systems), compared to 13.8% ± 13.1% for volumes. However, there was less variability seen across vendors, with mean percent errors of 15.7% ± 13.1% for system A, 15.7% ± 10.3% for system B, 14.2% ± 9.9% for system C, 15.1% ± 10.4% for system D, and 15.8% ± 12.0% for system E (Table 1). There was also slightly less variability seen across readers, with a mean relative percent difference for linear size measurement among the three readers of 3.85% ± 3.72%, compared to 19.8% ± 24.26% for software-derived volumes.

Lesion characteristics also had an impact on the accuracy of volume estimation. Lesions were divided into small and large using 1-cm linear size as the divider (n = 5 large lesions >1 cm, n = 5 small lesions <1 cm). Readers showed similar but slightly smaller percent errors for large lesions relative to small lesions, but this effect was not statistically significant (p = 0.97) (Fig. 3). Lesion attenuation impacted measurement, with larger percent volume errors consistently seen with the high attenuation lesions (mean percent error 14%) compared to the low (mean percent error 3%) (p < 0.001) (Fig. 4). The effect of lesion attenuation persisted in the multivariate model (p < 0.001).

During the initial vendor assessment, no clear "best" system emerged. System A was the most accurate and precise, and tied for greatest ease of use. However, the average time required by system was widely variable, ranging from 3.5 to 6.1 min per lesion, with system E being the fastest (ranked last in volume error) and system A the slowest. System C was second in accuracy, tied for first for greatest ease of use, and was second in time required per lesion (Table 2). Based on this assessment, systems A and C were selected for the more advanced vendor analysis.

Fig. 3. Percent volume error by lesion size (large vs. small) across readers. Box and whisker plots for percent volume error by lesion size by reader (1–3). Although slightly higher percent errors were seen with the smaller lesions (even small absolute errors were a greater percentage of the whole), this difference was not statistically significant (p = 0.97).

Fig. 4. Percent volume error by lesion attenuation (high vs. low) across readers. Greater percent error was seen with the high attenuation lesions across all readers (p < 0.001).

Advanced vendor analysis

There remained significant variability across vendors in the advanced vendor analysis looking at the two most accurate commercially available systems and a prototype system (Table 3) (p < 0.001). For example, the overall percent error across all three systems using a semi-automated technique at 5-mm slice thickness was 18.2% ± 16.9%, with errors of 5.3% ± 4.1%, 17.8% ± 11.1%, and 31.5% ± 19.7% for systems M (prototype), C, and A, respectively. The semi-automated assessment performed on the prototype system M showed much less error than any of the commercially available systems using a similar semi-automated technique (p < 0.001).

Measurement technique (manual vs. semi-automated) also significantly impacted % volume error in this study (p < 0.001). For the commercially available systems assessed, the manual technique was more accurate than the semi-automated technique (system A, 6.6% ± 4.6% vs. 31.5% ± 19.7%; system C, 7.0% ± 7.0% vs. 17.8% ± 11.1%, respectively, Table 3), whereas the semi-automated technique was much more accurate than the manual technique on the prototype system M (5.3% ± 4.1% semi-automated vs. 79.3% ± 35% manual).

Surprisingly, slice thickness and lesion characteristics (size, attenuation) did not significantly impact the results in this analysis. On the prototype system, there was a trend toward lesion attenuation impacting % volume error (8.7% high attenuation, 1.9% low attenuation). Although only a single lesion was located near the interface of the phantom liver and lung (the equivalent of the hepatic dome), this lesion was consistently problematic across systems, suggesting that specific location can also impact measurement.

Moderate interobserver variability was seen on the two commercially available systems. For example, the 95% limits of agreement on the Bland-Altman plot were +35% and -53% with a bias of -8.7 for system C using the automated technique, and +14% and -20% with a bias of -2.8 on the same system using the manual technique. Similarly, moderate intra-observer variability was seen with the commercially available systems. For example, on system C, an upper limit of +29%, a lower limit of -33%, and a bias of -2 were seen for the semi-automated technique. Inter-observer variability improved on the prototype system M, with an upper limit of +8% and a lower limit of -8%, with bias almost at 0, using the semi-automated technique (Fig. 5). Intra-observer variability also improved on system M (upper limit +10%, lower limit -13%, bias -1.7).

Table 3. Advanced vendor assessment

System      Mean % error,      Mean % error,      Mean % error,      Mean % error,
            5 mm auto          1.25 mm auto       5 mm man           1.25 mm man
A           31.5 ± 19.7        23.5 ± 14.9        6.6 ± 4.6          14.5 ± 10.6
C           17.8 ± 11.1        25.6 ± 23.4        7.0 ± 7.0          7.7 ± 5.0
M           5.3 ± 4.1          5.5 ± 5.9          79.3 ± 35          78 ± 35.9
Overall     18.2 ± 16.9        18.2 ± 18.4        30.9 ± 40.2        33.4 ± 38.5

Comparison of % volume error by vendor, slice thickness, and measurement technique (semi-automated vs. manual). Auto, automated or semi-automated; man, manual.

Table 2. Summary of initial vendor assessment

            Ranking (1–5)
System      Accuracy      Time (min/lesion)      Ease of use
A           1             5 (6.1)                1
B           2             3 (4.4)                4
C           3             2 (4.1)                1
D           4             4 (4.6)                3
E           5             1 (3.5)                5

Summary of the scores received by each system (1 = best, 5 = worst) for the categories of accuracy, average time per lesion, and subjective ease of use.

Fig. 5. Bland-Altman plot of inter-observer variability between readers 1 and 3 on the prototype system M using the semi-automated technique. Note the 95% limits of agreement show fairly low interobserver difference (upper limit 7.6%, lower limit -7.6%, bias -0.003). The demarcations on the plot represent each lesion (HD, high density lesions 1–5; LD, low density lesions 1–5) for each reader.

Discussion

With continued advancement in tumor characterization and treatment, improved imaging biomarkers are needed for assessing response to therapy. Groups such as the Quantitative Imaging Biomarkers Alliance and the American College of Radiology Imaging Network have made extensive efforts in establishing and validating new imaging biomarkers. Size remains a valuable metric, particularly in patients receiving cytotoxic chemotherapy, and volumetric assessment has shown promise as an anatomic imaging biomarker. However, if volumes are to be a viable measure, for example in metastatic hepatic disease, measurements must be accurate, or at least reproducible. Measured volumes should reasonably reflect true changes in tumor burden, not measurement error or variability related to post-processing or the specific advanced visualization software used. Despite recent technologic improvements in advanced visualization software platforms and improved edge detection algorithms, there remains significant variation in lesion volumes obtained across vendors. In the initial portion of this study, manual lesion segmentation was utilized to compare systems given the extreme heterogeneity in the availability and methodology of semi-automated or automated techniques on different platforms. Even using the same measurement technique, the specific vendor system employed had a significant impact on volume measurement results.

In the more advanced vendor assessment, both manual and semi-automated techniques were assessed. The significant effect of vendor on volume measurement persisted across techniques. On the commercially available systems assessed, the manual technique remained more accurate, but a very promising prototype system showed very accurate and reproducible measurements using a semi-automated technique. These results underscore the need for more uniform standards among software vendors if volumetric assessment is to achieve widespread use, in addition to better semi-automated techniques if this method is to become a feasible part of clinical workflow. However, with continued improvements as demonstrated by the prototype system, accurate and reproducible measurements seem possible.

The most important metric in assessing tumor response in many cases is change over time, which may not be impacted if a single vendor is used to make these measurements. However, even if measurements are performed on the same system, moderate intra- and inter-observer variability persists on commercially available systems, which could affect serial measurements. In addition, variability between vendors may become clinically significant if an institution uses multiple vendors, if a patient receives follow-up imaging at varying institutions over time, or if data from multiple institutions using different vendors are pooled in multicenter trials. Although the relative differences between the vendors are reasonably small and may not be clinically significant alone based on current tumor response assessment criteria (for example, RECIST 1.1 requires a 20% increase in the sum of diameters to be progressive disease, compared to a mean 8% difference by vendor for an experienced reader), these differences in vendor taken in combination with inherent intra- and interobserver variability (more than 20% in some cases) may lead to variability in measurement unrelated to tumor response.

There are some settings where gross tumor burden is a useful metric in terms of prognosis, predicting response to therapy, or determining treatment. In these cases, accuracy of assessment regardless of vendor becomes important. For example, in patients with hepatocellular carcinoma awaiting transplant, tumor burden must be within very specific guidelines (Milan criteria), currently described by uni-dimensional measurements [39]. However, if this method of tumor burden assessment were to be changed to a volume assessment, a standard across vendors would be needed so that patients were not unnecessarily excluded from transplant.

In a study recently performed by Keil et al. [30], using a similar phantom model and a semi-automated assessment technique with a single vendor, high correlation and accuracy of long-axis and volume measurements were reported with changing imaging parameters (slice thickness and dose), suggesting that these parameters do not impact measurement in the way that vendor did in our study.

A variety of factors other than vendor may also impact volume measurement, including reader experience, lesion characteristics, and possibly lesion location. Reader experience did impact the volume measurements to a certain extent, in that readers with even limited experience were more accurate than those with none (early post-graduate trainee versus medical student). This effect may become more pronounced in the clinical setting, where tumor imaging features are more complex and margins may be more difficult to delineate. Interobserver variability may be reduced with improved automation techniques, although a recent study by Dubus et al. [34] showed slightly better agreement between readers with the manual compared to the semi-automated technique.

Lesion characteristics, particularly lesion attenuation, seemed to impact the accuracy of volume assessment. This may be related to a mild blooming effect produced by high attenuation lesions, or to better edge detection between the relatively high attenuation liver and the lower attenuation lesion. However, the difference in attenuation between the high attenuation lesions and background liver was very similar to that seen with low attenuation lesions (40–50 HU in both cases). This could have clinical implications for tumor types demonstrating brisk arterial enhancement, such as hepatocellular carcinoma or metastatic neuroendocrine tumors, particularly in patients where a very accurate tumor burden assessment is needed (e.g., liver transplant candidates). These lesion characteristics could be taken into consideration as newer versions of these software platforms emerge. Further study of this effect in a clinical model is warranted.

There are limitations to this study. Advanced visualization software platforms remain a substantially moving target, as new and improved versions are frequently released. Further, not all systems had an automated tumor assessment method, making a comparison of automated and semi-automated measurement techniques to the manual method problematic in the initial vendor assessment. Not all commercially available systems were included in this vendor assessment, only a selected subset based on availability. This study was performed in an idealized liver tumor phantom, rather than a true patient tumor model. However, this allowed us to compare the readers' measurements to the true tumor volumes, a gold standard not generally available in clinical models. Although a clinical model is ultimately more relevant, we wanted to truly assess the accuracy of this tool prior to the application to patients, where obtaining a gold standard (known volume measurement) is difficult. This also gave us the ability to compare uni-dimensional measures to the known long-axis dimension. All the systems were evaluated in the same order by each reader. It is possible that a learning effect was in place (i.e., readers got better at manually assessing volumes by the end of the study). In addition, most systems were evaluated a week apart, which may have produced a memory effect for the lesions (inadequate washout period). The evaluation was slightly limited by the relatively small number of lesions (n = 10) and the small subgroup analyses (n = 5).

In summary, the specific 3D software platform (vendor) used had a significant impact on volume measurements in our liver phantom model. These results underscore the need for the establishment of uniform standards among software vendors before volumetric assessment can achieve widespread use. In addition, multiple other factors seem to impact volume measurement, suggesting that this remains a relatively complex measurement and highlighting the need for improved semi-automated techniques to make it more reproducible and clinically feasible. Despite these issues, a promising prototype system suggests that with the right technology, this may be a viable method for assessing tumors that warrants continued evaluation in clinical models.

Funding: AUR-GE Radiology Research Academic Fellowship (GERRAF).

References

1. Paesmans M, Sculier JP, Libert P, et al. (1997) Response to chemotherapy has predictive value for further survival of patients with advanced non-small cell lung cancer: 10 years experience of the European Lung Cancer Working Party. Eur J Cancer 33:2326–2332

2. Buyse M, Thirion P, Carlson RW, et al. (2000) Relation between tumour response to first-line chemotherapy and survival in advanced colorectal cancer: a meta-analysis. Meta-Analysis Group in Cancer. Lancet 356:373–378

3. Goffin J, Baral S, Tu D, Nomikos D, Seymour L (2005) Objective responses in patients with malignant melanoma or renal cell cancer in early clinical studies do not predict regulatory approval. Clin Cancer Res 11:5928–5934

4. El-Maraghi RH, Eisenhauer EA (2008) Review of phase II trial designs used in studies of molecular targeted agents: outcomes and predictors of success in phase III. J Clin Oncol 26:1346–1354

5. Therasse P, Arbuck SG, Eisenhauer EA, et al. (2000) New guidelines to evaluate the response to treatment in solid tumors. European Organization for Research and Treatment of Cancer, National Cancer Institute of the United States, National Cancer Institute of Canada. J Natl Cancer Inst 92:205–216

6. Eisenhauer EA, Therasse P, Bogaerts J, et al. (2009) New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur J Cancer 45:228–247

7. Arimoto T (1993) Significance of computed tomography-measured volume in the prognosis of cervical carcinoma. Cancer 72:2383–2388

8. Chen SW, Yang SN, Liang JA, Lin FJ, Tsai MH (2009) Prognostic impact of tumor volume in patients with stage III–IVA hypopharyngeal cancer without bulky lymph nodes treated with definitive concurrent chemoradiotherapy. Head Neck 31:709–716

9. Yeo SG, Kim DY, Park JW, et al. (2011) Tumor volume reduction rate after preoperative chemoradiotherapy as a prognostic factor in locally advanced rectal cancer. Int J Radiat Oncol Biol Phys 18(13):3686–3693. doi:10.1245/s10434-011-1822-0

10. Kim YC, Lim JS, Keum KC, et al. (2011) Comparison of diffusion-weighted MRI and MR volumetry in the evaluation of early treatment outcomes after preoperative chemoradiotherapy for locally advanced rectal cancer. J Magn Reson Imaging 34(3):570–576. doi:10.1002/jmri.22696

11. Hopper KD, Kasales CJ, Eggli KD, et al. (1996) The impact of 2D versus 3D quantitation of tumor bulk determination on current methods of assessing response to treatment. J Comput Assist Tomogr 20:930–937

12. Pickhardt PJ, Lehman VT, Winter TC, Taylor AJ (2006) Polyp volume versus linear size measurements at CT colonography: implications for noninvasive surveillance of unresected colorectal lesions. AJR Am J Roentgenol 186:1605–1610

13. Prasad SR, Jhaveri KS, Saini S, et al. (2002) CT tumor measurement for therapeutic response assessment: comparison of unidimensional, bidimensional, and volumetric techniques – initial observations. Radiology 225:416–419

14. Sargent DJ, Rubinstein L, Schwartz L, et al. (2009) Validation of novel imaging methodologies for use as cancer clinical trial endpoints. Eur J Cancer 45:290–299

15. Hillman SL, An MW, O'Connell MJ, et al. (2009) Evaluation of the optimal number of lesions needed for tumor evaluation using the response evaluation criteria in solid tumors: a North Central Cancer Treatment Group investigation. J Clin Oncol 27:3205–3210

16. Bolte H, Riedel C, Muller-Hulsbeck S, et al. (2007) Precision of computer-aided volumetry of artificial small solid pulmonary nodules in ex vivo porcine lungs. Br J Radiol 80:414–421

17. Buckler AJ, Mozley PD, Schwartz L, et al. (2010) Volumetric CT in lung cancer: an example for the qualification of imaging as a biomarker. Acad Radiol 17:107–115

18. Erasmus JJ, Gladish GW, Broemeling L, et al. (2003) Interobserver and intraobserver variability in measurement of non-small-cell carcinoma lung lesions: implications for assessment of tumor response. J Clin Oncol 21:2574–2582

19. Hein PA, Romano VC, Rogalla P, et al. (2009) Linear and volume measurements of pulmonary nodules at different CT dose levels: intrascan and interscan analysis. RoFo 181:24–31

20. Marchiano A, Calabro E, Civelli E, et al. (2009) Pulmonary nodules: volume repeatability at multidetector CT lung cancer screening. Radiology 251:919–925

21. Marten K, Engelke C (2007) Computer-aided detection and automated CT volumetry of pulmonary nodules. Eur Radiol 17:888–901

22. Mozley PD, Schwartz LH, Bendtsen C, et al. (2010) Change in lung tumor volume as a biomarker of treatment response: a critical review of the evidence. Ann Oncol 21:1751–1755

23. Wormanns D, Kohl G, Klotz E, et al. (2004) Volumetric measurements of pulmonary nodules at multi-row detector CT: in vivo reproducibility. Eur Radiol 14:86–92

24. Costello P, Duszlak EJ, Lokich J, Matelski H, Clouse ME (1983) Assessment of tumor response by computed tomography liver volumetry. J Comput Tomogr 7:323–326


25. De Vriendt G, Rigauts H, Meeus L (1998) A semi-automated program for volume measurement in focal hepatic lesions: a first clinical experience. J Belg Radiol 81:181–183

26. Fabel M, Bolte H, von Tengg-Kobligk H, et al. (2011) Semi-automated volumetric analysis of lymph node metastases during follow-up: initial results. Eur Radiol 21:683–692

27. Fabel M, von Tengg-Kobligk H, Giesel FL, et al. (2008) Semi-automated volumetric analysis of lymph node metastases in patients with malignant melanoma stage III/IV: a feasibility study. Eur Radiol 18:1114–1122

28. Keil S, Behrendt FF, Stanzel S, et al. (2008) Semi-automated measurement of hyperdense, hypodense and heterogeneous hepatic metastasis on standard MDCT slices. Comparison of semi-automated and manual measurement of RECIST and WHO criteria. Eur Radiol 18:2456–2465

29. Keil S, Bruners P, Ohnsorge L, et al. (2010) Semiautomated versus manual evaluation of liver metastases treated by radiofrequency ablation. J Vasc Interv Radiol 21:245–251

30. Keil S, Plumhans C, Behrendt FF, et al. (2009) Semi-automated quantification of hepatic lesions in a phantom. Invest Radiol 44:82–88

31. Keil S, Plumhans C, Behrendt FF, et al. (2009) Automated measurement of lymph nodes: a phantom study. Eur Radiol 19:1079–1086

32. Keil S, Plumhans C, Nagy IA, et al. (2010) Dose reduction for semi-automated volumetry of hepatic metastasis in MDCT studies. Invest Radiol 45:77–81

33. Zhou JY, Wong DW, Ding F, et al. (2010) Liver tumour segmentation using contrast-enhanced multi-detector CT data: performance benchmarking of three semiautomated methods. Eur Radiol 20:1738–1748

34. Dubus L, Gayet M, Zappa M, et al. (2011) Comparison of semi-automated and manual methods to measure the volume of liver tumours on MDCT images. Eur Radiol 21:996–1003

35. Keil S, Bruners P, Schiffl K, et al. (2010) Radiofrequency ablation of liver metastases – software-assisted evaluation of the ablation zone in MDCT: tumor-free follow-up versus local recurrent disease. Cardiovasc Interv Radiol 33:297–306

36. Kudo K, Christensen S, Sasaki M, et al. (2013) Accuracy and reliability assessment of CT and MR perfusion analysis software using a digital phantom. Radiology 267:201–211

37. Heye T, Davenport MS, Horvath JJ, et al. (2013) Reproducibility of dynamic contrast-enhanced MR imaging. Part I. Perfusion characteristics in the female pelvis by using multiple computer-aided diagnosis perfusion analysis solutions. Radiology 266:801–811

38. R Development Core Team (2009) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org

39. Mazzaferro V, Regalia E, Doci R, et al. (1996) Liver transplantation for the treatment of small hepatocellular carcinomas in patients with cirrhosis. N Engl J Med 334:693–699
