Validation of EDGE, a Reality-Based Laparoscopic Box Trainer for Quantitative Psychomotor Skill Assessment

Tim Kowalewski MS, Lee White, Thomas S. Lendvay MD, Iris Jiang, Andrew Wright MD, and Blake Hannaford PhD, Fellow, IEEE
Abstract—Computerized analysis of tool motion and force recordings from a multi-institutional study of laparoscopic surgeons performing dry lab tasks discriminated psychomotor skill level and correlated with a Fundamentals of Laparoscopic Skills (FLS) evaluation score. The research demonstrates a means of automatically quantifying psychomotor skill immediately upon the completion of a task using an integrated laparoscopy trainer and motion capture system called the Electronic Data Generation and Evaluation (EDGE) Platform by Simulab Corporation. Additionally, the FLS scoring method was found to provide no significant benefit over task time alone.
Index Terms—Psychomotor Skill, FLS, Hidden Markov Models, Machine Learning, EDGE.
1 INTRODUCTION
THE ACGME (Accreditation Council for Graduate Medical Education) specifically identifies "technical competence in conducting surgical procedures" as a component of its core competencies under Patient Care and Practice-Based Learning [1]. Moreover, the ACGME may establish technical skills as a seventh core competency [2]. Surgical societies need to codify means for assessing surgical skills. Yet related studies are often of variable quality and do not use comparable methodologies [3]. We further observe that the simulation platforms used by such studies often employ different performance measures.
Surgical educators currently rely on crude surgical performance metrics to 'define' proficiency, such as task time and basic error recognition. In addition, existing reality-based surgical simulators (ones that use real-world tools and objects as opposed to virtual computer models) are resource intensive due to human graders and slow the training feedback process. Skill assessment systems such as the Objective Structured Assessment of Technical Skills (OSATS) or the McGill Inanimate System for Training and Evaluation of Laparoscopic Skills (MISTELS) and its resulting Fundamentals of Laparoscopic Skills (FLS) have attempted to provide frameworks for human graders to analyze surgical performance for skill [4], [5], [6], [7]. In 2004, the FLS assessment certification process was launched, and after 5 years of its existence almost 3000 clinicians had become certified [8]. One of the limitations of the technical skills portion of the FLS curriculum is that few objective metrics are captured; currently, task time and errors are the two outcomes. However, such summary measurements may overlook more granular measures of performance since they neglect the temporal or sequential phenomena inherent in the task, phenomena that could, for example, indicate at what point in time, space, or procedural context a subject exhibits a degree of skill. And although discriminating FLS-experienced and inexperienced subjects has been repeatedly demonstrated [9], [10], [11], [12], more granular elements of surgical skill such as economy of motion, grasp forces, and instrument accelerations have not. In part, this is because existing reality-based laparoscopic box trainers are not capable of capturing or reporting these metrics. Moreover, there is no rigorous and repeatable method established to carry out identical performance measures on different platforms [1], [13], [14], [15], [16].

• T. Kowalewski and B. Hannaford are with the Department of Electrical Engineering, University of Washington, Seattle, WA 98195. E-mail: [email protected], [email protected]
• L. White and I. Jiang are with the Bioengineering Department, University of Washington, Seattle, WA 98195. E-mail: [email protected], [email protected]
• T. Lendvay and A. Wright are with the Departments of Urology and Surgery, University of Washington, Seattle, WA 98195. E-mail: [email protected], [email protected]
An ideal surgical training platform would automatically and quantitatively assess technical skill immediately after or even during a surgical task and provide immediate feedback to the trainee. We herein evaluate such a candidate training platform for its validity in providing automated quantitative assessment of technical (psychomotor) skills in reality-based (RB) training tasks. Unlike its virtual reality (VR) counterparts, RB training can be significantly cheaper, employ real surgical tools, and obviate additional simulation and validation steps by providing physical interaction with a real environment and objects. Moreover, RB training provides accurate force feedback and tactile sensation induced by real physical objects that VR may only approximate at considerable expense. However, the realism of RB simulation is limited by the mechanical properties of RB tissue surrogates, which may not adequately mimic human anatomy.

Fig. 1: The EDGE Platform was developed by Simulab Corporation (Seattle, WA) and is based on a mechanism developed by the University of Washington Biorobotics lab. It consists of a pair of interchangeable surgical tools whose motion is constrained to rotate about a fulcrum the same way laparoscopic instruments are constrained by their access ports.
Unlike the traditional FLS box trainer, an instrumented, computerized RB platform can quantitatively provide both granular performance metrics and automated, immediate skill evaluation feedback. To realize these benefits, the University of Washington Biorobotics lab created such a platform called the Red DRAGON [17], [18], a scaled-down table-top version of its original BlueDRAGON [18] platform, which was successfully employed in live porcine models during training. This Red DRAGON prototype, along with the established Hidden Markov Model-based scoring methodology [18], was licensed and commercialized by Simulab Corporation (Seattle, WA) into the Electronic Data Generation and Evaluation (EDGE) platform, shown in Fig. 1. To our knowledge, EDGE is the first dry lab RB box trainer that can obtain high-accuracy motion (position and orientation) measurements along with grasping force [19].
The goal of this work is to establish whether a fully computerized, automated scoring methodology such as that of the EDGE platform can provide equivalent or better psychomotor skill evaluation in RB settings than performance-based scoring methodologies such as those employed by the widely used FLS protocol.
2 METHODS
This study employed the EDGE platform to collect tool motion and task video data from faculty and training surgeons at multiple surgical centers in the United States.
2.1 EDGE Platform Description and Analysis Methods
EDGE hardware consists of an instrumented mechanism which holds laparoscopic tools about a fixed pivot point. These were either Stryker Endoscopy (San Jose, CA) tools with interchangeable tool-tip inserts (5 mm diameter, 33 cm length, with 250-080-282 Maryland Grasper or 250-080-267 Endo Metzenbaum Curved Scissor inserts) or Karl Storz (Tuttlingen, Germany) curved needle drivers (26173 KAL and KAR). Position sensors (potentiometers and optical encoders) measure tool position (x, y, z in cm), tool roll (r in degrees), and grasper angle (θ in degrees). A calibrated strain gauge measures grasping force (Fg in Newtons). EDGE includes custom software that time-stamps all sensor measurements for each hand at 30 Hz with synchronized video capture of the task. Videos are compressed on-the-fly with MPEG4 codecs and average approximately 10 MB per minute of video. Timing is automatically recorded; it begins and ends when tools are moved out of and back into a fixed "home-base" position directly behind each task block which, in turn, is rigidly and repeatably held to a fixed coordinate system.
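A simplified sketch of one way such start/stop detection could be implemented from the recorded tool positions appears below; the home-base coordinates, radius, and logic are illustrative assumptions for this sketch, not EDGE's actual parameters.

# Illustrative sketch (not EDGE's actual implementation): deriving task start/stop
# times from tool-tip positions sampled at 30 Hz. The home-base region geometry
# and threshold below are assumptions for illustration only.
import numpy as np

HOME_LEFT = np.array([-5.0, 0.0, 0.0])   # assumed home-base centers (cm)
HOME_RIGHT = np.array([5.0, 0.0, 0.0])
HOME_RADIUS = 1.0                         # assumed radius (cm) defining "at home"
FS = 30.0                                 # sample rate (Hz)

def task_interval(left_xyz, right_xyz):
    """Return (t_start, t_stop) in seconds for one iteration.

    left_xyz, right_xyz: (N, 3) arrays of tool-tip positions in cm.
    Timing starts when either tool leaves its home-base region and stops
    when both tools have returned to it.
    """
    left_home = np.linalg.norm(left_xyz - HOME_LEFT, axis=1) < HOME_RADIUS
    right_home = np.linalg.norm(right_xyz - HOME_RIGHT, axis=1) < HOME_RADIUS
    away = ~(left_home & right_home)      # True whenever at least one tool is out
    idx = np.flatnonzero(away)
    if idx.size == 0:
        return None                       # tools never left the home base
    return idx[0] / FS, (idx[-1] + 1) / FS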
During all recordings, EDGE was interfaced via USB 2.0 to a laptop computer running Microsoft (Redmond, WA) Windows XP (UW sites only) or Windows 7 (all other sites). EDGE's custom software stored tool data and task video files and labeled them with unique codes identifying the subject, site, task, date, and time of each iteration. All tool motion and demographic data for all iterations were compiled into a single database, sorted, and analyzed by custom scripts in MATLAB software (Mathworks Inc., Natick, MA). Pearson's r and Spearman's ρ with corresponding (p-value) were adopted as measures of correlation for linearity and monotonicity, respectively.
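For reference, the correlation statistics reported throughout this paper can be reproduced with standard library routines; the following is a minimal sketch of an equivalent computation (the original analysis used custom MATLAB scripts, and the function and variable names here are ours).

# Minimal sketch of the correlation statistics used throughout the paper:
# Pearson's r (linearity) and Spearman's rho (monotonicity), each with a p-value.
from scipy.stats import pearsonr, spearmanr

def correlations(x, y):
    """Return ((r, p_r), (rho, p_rho)) for paired samples x and y."""
    r, p_r = pearsonr(x, y)          # linear correlation
    rho, p_rho = spearmanr(x, y)     # rank (monotonic) correlation
    return (r, p_r), (rho, p_rho)

# Example usage (hypothetical arrays of per-iteration values):
# (r, p_r), (rho, p_rho) = correlations(fls_scores, task_times)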
2.2 Subject Pool Description

Subject enrollment was approved and registered under Western IRB 19125-A/B. This multicenter effort included surgeons of various skill levels from the University of Washington Medical Center, the University of Minnesota Medical Center, and three sites in the city of New Orleans enumerated below. The subject pool spanned the General Surgery, Urology, and Gynecology specialties. It consisted of active surgical faculty, surgical fellows, residents, and experienced practicing surgeons. Medical students pursuing surgical practice or FLS-experienced technicians were also enrolled in the study.
Fig. 2: EDGE screenshots of videos recorded during performance of the three FLS tasks: (a) Peg Transfer, (b) Circle Cutting, (c) Suturing.
2.3 Data Collection Sites

Three EDGE platforms, one dedicated to each task, were deployed at each site to maximize subject throughput and allow for simultaneous subjects. An approved study administrator set up the equipment at each site on a daily basis, and subjects were invited to voluntarily participate in the study whenever their schedules allowed. Subjects were allowed to complete the study over multiple sessions.
2.3.1 Site: University of Washington

Data were collected over a three-week period at a surgical simulation training center or within a porcine lab laparoscopic training center.
2.3.2 Site: University of Minnesota

Data were collected over a period of two weeks, mostly in a surgeons' OR lounge during regular operating hours. Some subjects performed their tasks at an adjacent surgical simulation training center.
2.3.3 Site: New Orleans, Louisiana

Data were collected over a 10-day period. Sites included the Louisiana State University Health Science Center, University Hospital New Orleans, and the Interim LSU Public Hospital.
2.4 Questionnaire

Subject demographics were recorded after consent was obtained. The de-identified questionnaire included the subject's gender, age, handedness, training level, surgical specialty, laparoscopic experience, approximate number of relevant procedures performed (where the subject completed more than half of the case), time since last laparoscopic procedure, total number of FLS tasks performed in the past, and FLS certification status.

A post-task questionnaire invited feedback regarding the acceptability of EDGE based on categorical Likert scales and written general comments. A subject's oral comments during the study were also noted in the general comments section by the study administrator.
2.5 Surgical Task Description and FLS Scoring

Three of the five FLS tasks were used in the study: Peg Transfer, Cutting, and Intracorporeal Suturing. Descriptions of each task appear below, along with representative screenshots in Fig. 2. An iteration was defined as one complete execution of a single task. Subjects were asked to complete three iterations of the Peg Transfer task, two iterations of the Cutting task, and two iterations of the Suturing task, in that order. Subjects were also invited to perform additional iterations of any task if they were willing to do so. Each subject was introduced to each task via printed instructions. For each iteration, time (t) starts upon removing either tool tip from a fixed "home-base" position and stops when both tools are returned to that position. The published FLS scoring methodology [20], [21] was adopted to calculate FLS scores for each task (denoted FLStask); the explicit computation is shown in Table 1, and each task's error variables are described below.
TABLE 1: Equations used to compute FLS scores [20], [21].

FLS Task        FLS Score
Peg Transfer    FLSpeg = (300 − t − 17·Edr) / 237
Cutting         FLScut = (300 − t − 2·Ea) / 280
Suturing        FLSsut = (600 − t − Epd − Eg − Eq) / 520
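As a minimal illustration, the normalized scores of Table 1 can be computed directly from the task time and error terms; the function and argument names below are ours, not part of the EDGE or FLS software.

# Sketch of the FLS score equations in Table 1 [20], [21]. t is task time in
# seconds and the e_* arguments are the per-task error variables described in
# Sections 2.5.1-2.5.3.
def fls_peg(t, e_drop):
    return (300.0 - t - 17.0 * e_drop) / 237.0

def fls_cut(t, e_area):
    return (300.0 - t - 2.0 * e_area) / 280.0

def fls_sut(t, e_pd, e_gap, e_knot):
    return (600.0 - t - e_pd - e_gap - e_knot) / 520.0

# Example: a 95 s peg transfer with one non-recovered drop
# score = fls_peg(95, 1)   # -> (300 - 95 - 17) / 237 = about 0.79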
2.5.1 Peg Transfer (PegTx)

The Peg Transfer task, also called block transfer, employed two curved Maryland graspers. Instructions were to transfer six blocks, in minimum time and with minimal errors, from one side to the other and back again without regard for the order or color of the blocks, to transfer each block mid-air between hands, and to avoid dropping blocks. EDGE automatically computed task time t. Each video was later manually reviewed to count the total number of non-recovered drops, which were counted as errors (Edr).
2.5.2 Circle Cutting

The circle cutting task (Cutting), also called pattern cutting, employed a curved Maryland grasper in the non-dominant hand and curved shears in the dominant hand for the duration of each task. Instructions were to cut gauze along a marked circular pattern (diameter = 4 cm) in minimum time and with minimal error, beginning either by making a puncture anywhere on the circle or by cutting in from the gauze edge. Task time t was automatically computed. The accumulated area (mm²) cut beyond the marked circle boundary constituted the error term Ea. To minimize subjective grader error and exploit automation, cut circles were flattened, electronically scanned, and the cutting error was automatically computed with ImageJ (NIH), a public-domain image processing suite [22]. The wand tool automatically outlined and measured out-of-bound areas through an edge-finding algorithm based on pixel values. Scans included a printed reference line in order to scale the image from pixels to mm.
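The pixel-to-millimeter conversion implied by the reference line can be illustrated as follows; the numbers in the example are hypothetical, and the actual measurement was performed in ImageJ [22].

# Sketch of converting an out-of-bounds cut area measured in pixels to mm^2
# using a printed reference line of known physical length.
def cut_error_mm2(area_px, ref_len_px, ref_len_mm):
    """Return the area in mm^2 given its size in pixels and a scale reference."""
    mm_per_px = ref_len_mm / ref_len_px
    return area_px * mm_per_px ** 2

# Example (hypothetical values): 1200 px^2 of out-of-bounds cut, with a 10 mm
# reference line spanning 118 px in the scan
# e_a = cut_error_mm2(1200, 118, 10.0)   # about 8.6 mm^2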
2.5.3 Suturing

The intracorporeal suturing task (Suturing), also called knot tying with intracorporeal suture, employed two curved small needle drivers. Subjects had a choice of a ratcheting or non-ratcheting mechanism at the beginning of each task. Instructions were to complete the task in minimum time and with minimum errors. The task was to puncture a Penrose drain at marked entry and exit dots with a 2-0 V-20 tapered half-circle needle and 12.5 cm of suture, and to tie a surgeon's knot with two initial throws and one final throw. Errors included the distance away from the puncture dots in mm (Epd), the gap of the sutured slit in mm (Eg), and knot quality (Eq), where 0 indicated a secure knot, 10 a slipping knot, and 20 a knot that came apart. Task time t was automatically computed, but errors were manually determined by two FLS-certified graders.
2.6 EDGE Computed Metrics

EDGE software automatically computed tool-tip motion metrics such as tool path length (PathLength, the sum of the left and right tool-tip distances traveled) and economy of motion (EoM, the ratio of path length to task time). EDGE also measured the grasp force exerted by the surgeon at the handle. The maximum of the peak forces (Fpeak) exerted by either hand during an iteration was used to characterize force behavior. Each of these EDGE-computed metrics was calculated in the same way for all iterations, independent of task.
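A minimal sketch of these metrics, computed from a 30 Hz stream of per-hand tool-tip positions and grasp forces, is shown below; the array layout and function name are our own illustration rather than EDGE's internal implementation.

# Sketch of the EDGE-computed metrics described above.
import numpy as np

def edge_metrics(left_xyz, right_xyz, left_fg, right_fg, task_time_s):
    """Return PathLength (cm), EoM (cm/s), and peak grasp force (N).

    left_xyz, right_xyz: (N, 3) arrays of tool-tip positions in cm;
    left_fg, right_fg: arrays of grasp force samples in N."""
    def path(xyz):
        # Sum of sample-to-sample tool-tip displacements
        return np.linalg.norm(np.diff(xyz, axis=0), axis=1).sum()

    path_length = path(left_xyz) + path(right_xyz)    # both tools combined
    eom = path_length / task_time_s                   # average tool-tip speed
    f_peak = max(np.max(left_fg), np.max(right_fg))   # max of either hand's peak
    return {"PathLength": path_length, "EoM": eom, "Fpeak": f_peak}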
2.7 Establishment of the "True Expert" Set

Three methods commonly used in the surgical literature to identify "experts" for establishing construct validity were employed. All three were then combined to establish a set of individual iterations (not subjects) that exemplifies "true expert" skill. First, a subject's self-reported demographic criteria were considered. These included training level (None, Medical School, Post Graduate Years of residency 1 through 5, Fellowship, and Practicing Surgeon) as well as the self-reported estimate of the all-time total number of laparoscopic cases performed. Only practicing laparoscopic surgeons and fellows who had completed more than 100 laparoscopic cases (where they did 50% or more of the case) were considered candidates for the "true expert" candidate subject pool. Second, the highest FLS-scoring iteration for each task of each expert candidate was taken as a set of "true expert" candidate iterations. Finally, the third approach utilized an Objective Structured Assessment of Technical Skill (OSATS) protocol [4], [5], which was modified to focus exclusively on psychomotor skills, denoted p-OSATS and shown in Table 2. Only the "true expert" candidate iterations (the best FLS-scoring iterations of laparoscopic fellows and practicing surgeons) were considered for p-OSATS review. The videos of these iterations were randomly renamed and ordered before evaluation. Two faculty surgeons (coders A and B) served as p-OSATS reviewers. Reviewers were blind to the identity and demographics of the subjects whose videos they reviewed and to the scores of the other reviewer.
Only the iterations that received p-OSATS scores of 3 or above in all domains were included in the "true expert" set. This set consists of single iterations, not individuals. In this way the three criteria for identifying expert skill (training/experience level, validated performance measures, and OSATS review) were combined.

TABLE 2: Psychomotor OSATS (p-OSATS) grading scale used to evaluate and numerically code psychomotor skill [4], [5].

Score 1. Bimanuality: one arm paralyzed, offering no help to complete the step. Motion Quality: unnecessary, hesitant, or awkward movements of tools.
Score 2. (no anchor descriptions)
Score 3. Bimanuality: using both arms most of the time, but a clearly perceivable bias toward accomplishing most of the task with the dominant hand. Motion Quality: reasonably efficient movements of tools but frequent non-effective moves.
Score 4. Motion Quality: mostly fluid.
Score 5. Bimanuality: both arms naturally complementing each other; optimal use of the non-dominant hand. Motion Quality: elegant, fluid, and efficient movements of tools.
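For clarity, the combination of the three criteria can be sketched as a simple filtering procedure over the database of iterations; the field names and data layout below mirror the description above, but the code itself is illustrative and is not the analysis scripts actually used.

# Illustrative sketch of combining the three "true expert" criteria over a
# table of iterations. Field names and data layout are ours.
def true_expert_iterations(iterations, posats):
    """iterations: list of dicts with keys 'id', 'subject', 'task', 'level',
    'cases', and 'fls'; posats: dict mapping iteration id to a dict of
    p-OSATS domain scores (Bimanuality, Motion Quality)."""
    # Criterion 1: demographic experts (fellows or practicing surgeons, >100 cases)
    experts = [it for it in iterations
               if it["level"] in ("Fellowship", "Practicing Surgeon")
               and it["cases"] > 100]
    # Criterion 2: each candidate's best FLS-scoring iteration for each task
    best = {}
    for it in experts:
        key = (it["subject"], it["task"])
        if key not in best or it["fls"] > best[key]["fls"]:
            best[key] = it
    # Criterion 3: keep only iterations scoring 3 or above in every p-OSATS domain
    return [it for it in best.values()
            if all(score >= 3 for score in posats[it["id"]].values())]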
To address inter-session reliability, coder A reviewed all videos again, in a different randomized order, approximately 10 days after the first review. Coder B's scores were compared to coder A's combined scores for inter-rater reliability. Both coders were asked to limit their coding to the psychomotor characteristics of the tool motion in p-OSATS alone, but were repeatedly asked to verbalize their thoughts during review, which were subsequently recorded for each reviewed iteration.
TABLE 3: Inter-session and inter-rater reliability of p-OSATS scores. Coders scored only the "true expert" candidate subset (N = 56). Inter-session compares coder A's scores against his own repeat review; inter-rater compares coder A's mean scores with coder B's. Pearson's r and Spearman's ρ with corresponding (p-value) are shown.

Inter-session:        Peg Transfer    Cutting        Suturing
Bimanuality      r    0.57 (0.01)     0.49 (0.03)    0.76 (0.00)
                 ρ    0.45 (0.04)     0.51 (0.02)    0.69 (0.00)
Motion Quality   r    0.74 (0.00)     0.34 (0.14)    0.85 (0.00)
                 ρ    0.70 (0.00)     0.32 (0.17)    0.90 (0.00)

Inter-rater:          Peg Transfer    Cutting        Suturing
Bimanuality      r    0.55 (0.01)     0.04 (0.86)    0.70 (0.00)
                 ρ    0.45 (0.05)     0.04 (0.88)    0.66 (0.01)
Motion Quality   r    0.63 (0.00)     0.27 (0.26)    0.78 (0.00)
                 ρ    0.60 (0.00)     0.21 (0.38)    0.80 (0.00)
3 RESULTS
3.1 Inter-session and Inter-rater p-OSATS Reliability

If Pearson's r > 0.5 is considered a strong fit and 0.4 < r < 0.5 a moderate fit, then both p-OSATS domains exhibited acceptable reliability (p < .05 or better) for the Peg Transfer and Suturing tasks, with the strongest reliability for the "Motion Quality" p-OSATS domain.
Fig. 3: Box plots of FLS scores vs. demographics typically used to establish construct validity. (a) Training level (0 = none, MS = medical school, R1-R5 = post graduate year of surgical residency, FL = fellow, SG = practicing surgeon). (b) Laparoscopic experience: the self-reported estimate of the all-time total number of laparoscopic cases performed; 1+ indicates subjects who had performed 1 or more laparoscopic cases where they did 50% or more of the case. (c) Age; 25+ indicates subjects who are 25 years or older, etc.
Inter-session and inter-rater reliability were weakest for the Cutting task. See Table 3 for details.

3.2 Data Collection Overview

Not all subjects completed all requested iterations in the study. Some subjects voluntarily completed additional iterations. Incomplete iterations, or iterations with corrupted data (such as missing video) which prevented post-task scoring, were excluded from analysis. An overview of the collected data appears in Table 4.
TABLE 4: Overview of all collected EDGE data.

                           UW     UMN    NOLA    TOTAL
"Expert" Subjects           6       8       3       17
Total Subjects             32      35      31       98
Peg Transfer Iterations    78      88      27      193
Cutting Iterations         61      53      51      165
Suturing Iterations         0      59      30       89
Total Iterations          139     200     108      447
Total Time (hours)                                22.7
Fig. 5: 3D toolpath of the left (blue) and right (red) hands for one iteration of a Peg Transfer task for a novice (right) and an expert (left).
Representative plots of the 3D toolpath of an FLS novice (low-scoring, no FLS experience) and a proficient subject (high FLS-scoring, faculty surgeon) appear in Figure 5, along with the corresponding scores for a single Peg Transfer iteration. Figure 6 shows the left- and right-hand grasping force plotted over time for the same iteration, as well as the corresponding computed values.
3.3 Demographics and FLS Scores

Demographic categories typical for construct validity, like training level and laparoscopic experience, do not correlate well with FLS scores over the entire database (N = 447); see Table 5 (Spearman's ρ ≤ 0.5 for all comparisons except the Cutting task vs. training level and laparoscopic case count). The corresponding scatter plots (see Fig. 3) indicate significant variation of FLS scores among the subjects with the most training and the most experience: demographically identified experts exhibited FLS scores significantly lower than fellows and not significantly different from third- or fourth-year residents. FLS scores correlated most weakly, or not at all, with age.
Fig. 6: Grasping force vs. time for the left (blue) and right (red) hands for the same task as Fig. 5.
TABLE 5: Construct validity categories vs. FLS score by task for the entire database (N = 447). Spearman's ρ (p-value) is shown.

                  PegTx          Cutting        Suturing
Lapr. Cases       0.50 (0.00)    0.59 (0.00)    0.33 (0.00)
Training Level    0.48 (0.00)    0.63 (0.00)    0.41 (0.00)
Age               0.17 (0.02)    0.32 (0.00)    0.07 (0.52)
3.4 Comparison of Criteria Used to Establish the "True Expert" Set

All iterations of "true expert" candidates received p-OSATS scores (N = 52) in addition to their FLS scores and demographic ranking data. For this pool, no demographic category correlated well with either FLS or p-OSATS scores. However, FLS and p-OSATS correlated well (p < .01) for all tasks. See Table 6 for details. Additionally, each individual p-OSATS domain (not shown in the table) correlated significantly with FLS (p < .01), but none did so with laparoscopic experience.

FLS and p-OSATS were further compared with simple task time. FLS score was found to be nearly identical to time (r = −1.00, p < .0000001), indicating that the "true expert" candidates made virtually no mistakes. p-OSATS correlated well with time, though it exhibited some deviation. The correlation details for each task are listed in the lower section of Table 6. Moreover, Fig. 4a illustrates the specific deviations between p-OSATS and FLS; Fig. 4b the near-perfect equivalence of FLS and time; and Fig. 4c the deviation between p-OSATS scores and time.
3.5 Comparison of FLS and EDGE Metrics

All iterations from the database (N = 447) were employed to compare the automated performance metrics (task time, path length, economy of motion, and peak grasp force) with FLS scores. Correlation details appear in Table 7. Fig. 7 indicates near-perfect correlation between FLS and task time. Path length correlates well with task time for all tasks (with correlation coefficients at 0.87 or above and significance p < .0001 or better). EoM shows substantially less correlation with FLS scores, especially for lower EoM values.
Fig. 4: Scatter plots of only the iterations in the "true expert" set for (a) FLS score vs. global p-OSATS score, (b) FLS vs. time, and (c) p-OSATS vs. time. FLS shows little or no statistical deviation from time, indicating that the experts made minimal errors; p-OSATS, however, varied substantially from time.
TABLE 6: Comparison of criteria typically employed for construct validity, used to establish the "true expert" set (N = 56). Pearson's r and Spearman's ρ with corresponding (p-value) are shown.

Comparison                 Peg Transfer     Cutting          Suturing
Lap Exp vs. p-OSATS   r    0.15 (0.52)     -0.01 (0.95)     -0.11 (0.68)
                      ρ    0.25 (0.29)      0.11 (0.65)     -0.32 (0.22)
Lap Exp vs. FLS       r    0.09 (0.71)     -0.21 (0.38)     -0.31 (0.24)
                      ρ   -0.09 (0.72)     -0.09 (0.72)     -0.16 (0.56)
p-OSATS vs. FLS       r    0.57 (0.01)      0.57 (0.01)      0.79 (0.00)
                      ρ    0.55 (0.01)      0.64 (0.00)      0.77 (0.00)
Time vs. p-OSATS      r   -0.57 (0.01)     -0.56 (0.01)     -0.79 (0.00)
                      ρ   -0.55 (0.01)     -0.64 (0.00)     -0.77 (0.00)
Time vs. FLS          r   -1.00 (0.00)     -1.00 (0.00)     -1.00 (0.00)
                      ρ   -1.00 (0.00)     -1.00 (0.00)     -1.00 (0.00)
Finally, peak grasping force Fpeak shows no discernible overall trend across tasks. A closer view of peak grasping force and FLS scores appears in Fig. 8. Iterations with the best FLS scores exhibited the highest peak forces.
TABLE 7: Correlation of EDGE automatically computed metrics with FLS scoring. Pearson's r and Spearman's ρ with corresponding (p-value) are shown.

                      PegTx           Cutting         Suturing
Time (s)          r   -0.99 (0.00)    -1.00 (0.00)    -1.00 (0.00)
                  ρ   -0.99 (0.00)    -1.00 (0.00)    -1.00 (0.00)
Path Length (cm)  r   -0.92 (0.00)    -0.87 (0.00)    -0.95 (0.00)
                  ρ   -0.89 (0.00)    -0.88 (0.00)    -0.93 (0.00)
EoM (cm/s)        r    0.70 (0.00)     0.35 (0.00)     0.40 (0.00)
                  ρ    0.82 (0.00)     0.42 (0.00)     0.61 (0.00)
Peak Force (N)    r   -0.18 (0.01)    -0.27 (0.00)    -0.16 (0.14)
                  ρ   -0.08 (0.30)    -0.27 (0.00)    -0.26 (0.01)
Fig. 7: EDGE-computed performance metrics vs. FLS scores for the entire database.

Fig. 8: Peak force behavior for the Peg Transfer (red) and Suturing (green) tasks near the highest FLS scores.

4 DISCUSSION

Several limitations are present in this study. Data were collected in variable environments. Most of the subjects were not a captive audience: they participated voluntarily and thus had no binding incentive to perform at their peak ability at all times. The reliability testing employed in the p-OSATS video review could benefit from more coders and perhaps a workshop review setting. Post-hoc analysis of recorded reviewer comments made during p-OSATS review of the cutting tasks indicated that coder B frequently emphasized handling of the gauze (a cognitive skill), whereas coder A intentionally ignored it to limit his review to the strictly psychomotor categories of p-OSATS. The cutting task also varied slightly in that a double ring was used at the UW sites and a single ring was used everywhere else; this potentially resulted in different scoring criteria between sites for the cutting task.
Typically, only a single criterion is employed in the literature to group subjects into skill levels like 'expert' and 'novice'. We employed three: OSATS review, demographically identified skill via training level or laparoscopic experience, and the validated FLS scoring methodology. In many cases, they disagreed. In particular, a higher level of laparoscopic experience did not imply better FLS or p-OSATS scores. If experience and training level are taken as the ground truth that determines expert skill, this suggests that the FLS tasks themselves may not relevantly test surgical expertise, at least among higher skill levels. If FLS and p-OSATS are taken as ground truth, this suggests that demographic categories like experience or training level may provide imperfect or even poor means to establish construct validity in simulation studies.
According to the equations of Table 1, each FLS score has the form FLS = (C − t − E)/D, where C and D are task-specific constants, t is task time, and E is the weighted total of error terms; the score is therefore a strictly affine function of task time, and perfectly correlated with it, if and only if the errors are zero. According to our results in Table 7, the correlation between FLS and task time is never substantially reduced by errors for any task over the entire database. This suggests that the error scores, which must be manually collected and computed for each FLS task iteration, have a negligible effect on the final score given these equations and may not warrant the added resource cost they incur.
Since we expect "true experts" to have near-zero FLS error rates, this near-perfect correlation between task time and FLS score among them is not surprising (shown in Fig. 4b and Table 7) and indicates that, with FLS scoring, only task time can discriminate skill among experts. However, the p-OSATS scores of these iterations deviated substantially from task time (see Figs. 4a and 4c and Table 7). If the p-OSATS scores are accurate, this suggests that they discriminate some dimension of psychomotor skill among experts that is not detectable with task time alone.
According to Table 7 and Figure 7, path length, and to a lesser extent EoM, correlated favorably with FLS (and, implicitly, with time). It is reasonable to expect that longer path lengths for a given procedure will result in longer times, and hence that the two are closely dependent. However, EoM, which is merely the average tool velocity over a task (units: cm/s), specifically controls for task time. If it correlated perfectly with task time, this would imply that it is unnecessary. Instead, it appears to measure some other aspect of psychomotor skill. However, the clinical significance of training to higher or lower EoM performance measures is somewhat uncertain, particularly at the whole-task level, which considers the overall average rather than the EoM at specific times.
Fig. 8 shows an incidence of high peak forces (> 30 N) for the best FLS scores. However, no data linking peak grasping force in FLS to tissue damage or outcomes is presented, so this result provides limited insight. We speculate that using excessive rather than minimal force may improve task times. Currently, the FLS methodology provides no way to deter or penalize such behavior, nor do the technical skills objectives and instructions include grasping force targets. We expect that this, at least in part, results in the wide spread (lack of correlation) between peak grasp force and FLS scores shown in Figure 7 and Table 7.
Such behavior can be easily identified and appropriately addressed (e.g., with a score penalty or automated audio feedback) by an instrumented platform like EDGE. Moreover, computerized scoring has better accuracy and repeatability than FLS scoring since it eliminates evaluator bias and human error. Such accuracy is particularly required for the force measurements, which are difficult to quantify from video. However, more research is required to establish clinically relevant grasping force targets for the different tasks or targeted tissues.
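As a hypothetical illustration, an instrumented platform could flag exceedances of a grasp-force limit in real time; the 30 N threshold below is taken from the observation above and is illustrative only, not a validated clinical target.

# Hypothetical sketch of real-time excessive-force flagging on an instrumented
# trainer. The threshold and sampling rate are illustrative assumptions.
FORCE_LIMIT_N = 30.0

def force_warnings(fg_samples, fs=30.0):
    """Return the times (s) at which grasp force first exceeds the limit."""
    times = []
    above = False
    for i, f in enumerate(fg_samples):
        if f > FORCE_LIMIT_N and not above:
            times.append(i / fs)   # rising edge: a new exceedance event
            above = True
        elif f <= FORCE_LIMIT_N:
            above = False
    return times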
To our knowledge, all prior studies indicate FLS to be valid and show positive transfer of skills to the operating room. We do not see any contraindication of this in our results. However, our work suggests that a stopwatch can effectively replace the FLS scoring system: accounting for 'precision' via error counts is superfluous with the currently published FLS scoring equations. While this in no way diminishes the success of the FLS program, it may suggest that alternative metrics beyond task time have not been adequately explored in the FLS tasks. It also suggests that subjects who train only to task time and sacrifice precision will be rewarded by FLS scoring, since additional errors will count negligibly against them. However, either adjusting the FLS scoring equations or employing a more granular metric that can indicate dexterity could remedy this.
We conclude that the FLS scoring equations may provide negligible benefit over task time alone. We also conclude that instrumented reality-based platforms such as EDGE, along with the more granular, automated psychomotor metrics they enable (path length, economy of motion, peak grasp force), may offer significant skill discrimination beyond task time alone, but more research is required to establish this.
ACKNOWLEDGMENTS

The authors would like to thank Dr. Rob Sweet, Troy E. Reihsen, and the SimPORTAL staff from the University of Minnesota site, Dr. John Paige and his staff at Louisiana State University, and all the participants of the study. Funding for this study was provided by Research and Technology Development Grant RTD11 UW SB01 from the Washington Technology Center in partnership with Simulab Corporation.
REFERENCES

[1] Heinrichs LR, et al. Criterion-based training with surgical simulators: proficiency of experienced surgeons. JSLS: Journal of the Society of Laparoendoscopic Surgeons. 2007;11(3):273. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3015829/.
[2] Nasca TJ, Philibert I, Brigham T, Flynn TC. The Next GME Accreditation System—Rationale and Benefits. New England Journal of Medicine. 2012.
[3] Sturm LP, Windsor JA, Cosman PH, Cregan P, Hewett PJ, Maddern GJ. A systematic review of skills transfer after surgical simulation training. Annals of Surgery. 2008;248(2):166.
[4] Martin J, Regehr G, Reznick R, MacRae H, Murnaghan J, Hutchison C, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. British Journal of Surgery. 1997;84(2):273–278. Available from: http://onlinelibrary.wiley.com/doi/10.1046/j.1365-2168.1997.02502.x/abstract.
[5] Reznick R, Regehr G, MacRae H, Martin J, McCulloch W. Testing technical skill via an innovative "bench station" examination. The American Journal of Surgery. 1997;173(3):226–230. Available from: http://www.sciencedirect.com/science/article/pii/S0002961097895979.
[6] Peters J, Fried GM, Swanstrom LL, Soper NJ, Sillin LF, Schirmer B, et al. Development and validation of a comprehensive program of education and assessment of the basic fundamentals of laparoscopic surgery. Surgery. 2004;135(1):21–27. Available from: http://www.astec.arizona.edu/sites/astec.arizona.edu/files/pdf files/Lap%20Program.pdf.
[7] Fried GM, Feldman LS, Vassiliou MC, Fraser SA, Stanbridge D, Ghitulescu G, et al. Proving the value of simulation in laparoscopic surgery. Annals of Surgery. 2004;240(3):518.
[8] Okrainec A, Soper NJ, Swanstrom LL, Fried GM. Trends and results of the first 5 years of Fundamentals of Laparoscopic Surgery (FLS) certification testing. Surgical Endoscopy. 2011;25(4):1192–1198.
[9] Kolozsvari NO, Kaneva P, Brace C, Chartrand G, Vaillancourt M, Cao J, et al. Mastery versus the standard proficiency target for basic laparoscopic skill training: effect on skill transfer and retention. Surgical Endoscopy. 2011;p. 1–8.
[10] Rosenthal ME, Ritter EM, Goova MT, Castellvi AO, Tesfay ST, Pimentel EA, et al. Proficiency-based Fundamentals of Laparoscopic Surgery skills training results in durable performance improvement and a uniform certification pass rate. Surgical Endoscopy. 2010;24(10):2453–2457.
[11] Derevianko AY, Schwaitzberg SD, Tsuda S, Barrios L, Brooks DC, Callery MP, et al. Malpractice carrier underwrites Fundamentals of Laparoscopic Surgery training and testing: a benchmark for patient safety. Surgical Endoscopy. 2010;24(3):616–623.
[12] Sroka G, Feldman LS, Vassiliou MC, Kaneva PA, Fayez R, Fried GM. Fundamentals of Laparoscopic Surgery simulator training to proficiency improves laparoscopic performance in the operating room: a randomized controlled trial. The American Journal of Surgery. 2010;199(1):115–120.
[13] Gallagher AG, Ritter EM, Champion H, Higgins G, Fried MP, Moses G, et al. Virtual reality simulation for the operating room: proficiency-based training as a paradigm shift in surgical skills training. Annals of Surgery. 2005;241(2):364.
[14] Satava RM, Cuschieri A, Hamdorf J. Metrics for objective assessment. Surgical Endoscopy. 2003;17(2):220–226.
[15] Seymour NE, Gallagher AG, Roman SA, O'Brien MK, Bansal VK, Andersen DK, et al. Virtual reality training improves operating room performance: results of a randomized, double-blinded study. Annals of Surgery. 2002;236(4):458. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1422600/.
[16] Pearson A, Gallagher A, Rosser J, Satava R. Evaluation of structured and quantitative training methods for teaching intracorporeal knot tying. Surgical Endoscopy. 2002;16(1):130–137.
[17] Gunther S. Red DRAGON: A Multi-Modality System for Simulation and Training in Minimally Invasive Surgery. University of Washington; 2006.
[18] Rosen J, Brown JD, Chang L, Barreca M, Sinanan M, Hannaford B. The BlueDRAGON—a system for measuring the kinematics and dynamics of minimally invasive surgical tools in-vivo. In: Robotics and Automation, 2002. Proceedings. ICRA'02. IEEE International Conference on. vol. 2. IEEE; 2002. p. 1876–1881.
[19] Wright AS, Hannaford B, Kowalewski TM. Novel Laparoscopic Box Trainer with Integrated Force and Positioning Sensors. In: 12th World Congress of Endoscopic Surgery, Emerging Technology Session, National Harbor, MD; April 2010.
[20] Derossis AM, Fried GM, Abrahamowicz M, Sigman HH, et al. Development of a model for training and evaluation of laparoscopic skills. The American Journal of Surgery. 1998;175(6):482–487.
[21] Fraser S, Feldman L, Stanbridge D, Fried G. Characterizing the learning curve for a basic laparoscopic drill. Surgical Endoscopy. 2005;19(12):1572–1578.
[22] Rasband WS. ImageJ; 1997–2011. U.S. National Institutes of Health, Bethesda, Maryland, USA. http://imagej.nih.gov/ij/.
Timothy Kowalewski  Mr. Kowalewski received the B.S. and M.S. degrees in Electrical Engineering from the University of Washington, Seattle, in 2003 and 20XX respectively. His doctoral research is in the area of training surgeons using machine learning techniques.

Lee White  Mr. White received the B.S.E. degree in Biomedical Engineering from Tulane University in New Orleans, LA, in 2008. He is currently pursuing the Ph.D. degree with research into medical technology and the use of human-guided robotic systems to enable advanced surgical procedures. Please see http://leewhite.org/

Iris Jiang  Body of Iris's biography.
Thomas Lendvay  Body of Tom's biography.

Andrew Wright  Body of Andy's biography.

Blake Hannaford  Dr. Hannaford received the B.S. degree in Engineering and Applied Science from Yale University in 1977, and the M.S. and Ph.D. degrees in Electrical Engineering from the University of California, Berkeley, in 1982 and 1985, respectively. From 1986 to 1989 he worked in the Automated Systems Section of the NASA Jet Propulsion Laboratory, Caltech. Since September 1989, he has been at the University of Washington in Seattle, where he has been Professor of Electrical Engineering since 1997.