

Validation of EDGE, a Reality-Based Laparoscopic Box Trainer for Quantitative Psychomotor Skill Assessment

Tim Kowalewski MS, Lee White, Thomas S. Lendvay MD, Iris Jiang, Andrew Wright MD, and Blake Hannaford PhD, Fellow, IEEE

T. Kowalewski and B. Hannaford are with the Department of Electrical Engineering, University of Washington, Seattle, WA 98195 (e-mail: [email protected], [email protected]). L. White and I. Jiang are with the Bioengineering Department, University of Washington, Seattle, WA 98195 (e-mail: [email protected], [email protected]). T. Lendvay and A. Wright are with the Departments of Urology and Surgery, University of Washington, Seattle, WA 98195 (e-mail: [email protected], [email protected]).

Abstract—Computerized analysis of tool motion and force recordings from a multi-institutional study of laparoscopic surgeons performing dry lab tasks discriminated psychomotor skill level and correlated with a Fundamentals of Laparoscopic Skills (FLS) evaluation score. The research demonstrates a means of automatically quantifying psychomotor skill immediately upon the completion of a task using an integrated laparoscopy trainer and motion capture system called the Electronic Data Generation and Evaluation (EDGE) Platform by Simulab Corporation. Additionally, the FLS scoring method was found to provide no significant benefit over task time alone.

Index Terms—Psychomotor Skill, FLS, Hidden Markov Models, Machine Learning, EDGE.


1 INTRODUCTION

THE ACGME (Accreditation Council for Graduate Medical Education) specifically identifies "technical competence in conducting surgical procedures" as a component of its core competencies under Patient Care and Practice-based Learning [1]. Moreover, the ACGME may establish technical skills as a seventh core competency [2]. Surgical societies need to codify means for assessing surgical skills. Yet related studies are often of variable quality and do not use comparable methodologies [3]. We further observe that the simulation platforms used by such studies often employ different performance measures.

Surgical educators currently rely on crude surgical performance metrics, such as task time and basic error recognition, to 'define' proficiency. In addition, existing reality-based surgical simulators (ones that use real-world tools and objects as opposed to virtual computer models) are resource intensive due to human graders and slow the training feedback process. Skill assessment systems such as the Objective Structured Assessment of Technical Skills (OSATS) or the McGill Inanimate System for Training and Evaluation of Laparoscopic Skills (MISTELS) and its resulting Fundamentals of Laparoscopic Skills (FLS) have attempted to provide frameworks for human graders to analyze surgical performance for skill [4], [5], [6], [7]. In 2004, the FLS assessment certification process was launched, and after 5 years of its existence almost 3000 clinicians have become certified [8]. One limitation of the technical skills portion of the FLS curriculum is that few objective metrics are captured. Currently, task time and errors are the two outcomes.

However, such summary measurements may overlook more granular measures of performance since they neglect the temporal or sequential phenomena inherent in the task, phenomena that could, for example, indicate at what point in time, space, or procedural context a subject exhibits a degree of skill. And although discriminating FLS-experienced from inexperienced subjects has been repeatedly demonstrated [9], [10], [11], [12], more granular elements of surgical skill such as economy of motion, grasp forces, and instrument accelerations have not. In part, this is because existing reality-based laparoscopic box trainers are not capable of capturing or reporting these metrics. Moreover, there is no rigorous and repeatable method established to carry out identical performance measures on different platforms [1], [13], [14], [15], [16].

An ideal surgical training platform would automatically and quantitatively assess technical skill immediately after or even during a surgical task and provide immediate feedback to the trainee. We herein evaluate such a candidate training platform for its validity in providing automated quantitative assessment of technical (psychomotor) skills in reality-based (RB) training tasks. Unlike its virtual reality (VR) counterparts, RB training can be significantly cheaper, employ real surgical tools, and obviate additional simulation and validation steps by providing physical interaction with a real environment and objects.


Fig. 1: The EDGE Platform was developed by Simulab Corporation (Seattle, WA) and is based on a mechanism developed by the University of Washington Biorobotics lab. It consists of a pair of interchangeable surgical tools whose motion is constrained to rotate about a fulcrum the same way laparoscopic instruments are constrained by their access ports.

Moreover, RB training provides accurate force feedback and tactile sensation induced by real physical objects that VR may only approximate at considerable expense. However, the realism of RB simulation is limited by the mechanical properties of RB tissue surrogates, which may not adequately mimic human anatomy.

Unlike the traditional FLS box trainer, an instrumented, computerized RB platform can quantitatively provide both granular performance metrics and automated, immediate skill evaluation feedback. To realize these benefits, the University of Washington Biorobotics lab created such a platform, the Red DRAGON [17], [18], a scaled-down table-top version of its original BlueDRAGON [18] platform, which was successfully employed in live porcine models during training. This Red DRAGON prototype, along with the established Hidden Markov Model-based scoring methodology [18], was licensed and commercialized by Simulab Corporation (Seattle, WA) into the Electronic Data Generation and Evaluation (EDGE) platform, shown in Fig. 1. To our knowledge, EDGE is the first dry lab RB box trainer that can obtain high-accuracy motion (position and orientation) measurements along with grasping force [19].

The goal of this work is to establish whether a fully computerized, automated scoring methodology such as the EDGE platform can provide equivalent or better psychomotor skill evaluation in RB settings than performance-based scoring methodologies such as those employed by the widely used FLS protocol.

2 METHODS

This study employed the EDGE platform to collect tool motion and task video data from faculty and training surgeons at multiple surgical centers in the United States.

2.1 EDGE Platform Description and Analysis Methods

EDGE hardware consists of an instrumented mechanism which holds laparoscopic tools about a fixed pivot point. These were either Stryker Endoscopy (San Jose, CA) tools with interchangeable tool tip inserts (5 mm diameter, 33 cm length, with 250-080-282 Maryland Grasper or 250-080-267 Endo Metzenbaum Curved Scissor inserts) or Karl Storz (Tuttlingen, Germany) curved needle drivers (26173 KAL and KAR). Position sensors (potentiometers and optical encoders) measure tool position (x, y, z in cm), tool roll (r in degrees), and grasper angle (θ in degrees). A calibrated strain gauge measures grasping force (Fg in Newtons). EDGE includes custom software that time-stamps all sensor measurements for each hand at 30 Hz with synchronized video capture of the task. Videos are compressed on-the-fly with MPEG4 codecs and average approximately 10 MB per minute of video. Timing is automatically recorded and begins and ends when tools are moved in and out of a fixed "home-base" position directly behind each task block which, in turn, is rigidly and repeatably held to a fixed coordinate system.

EDGE employed a laptop computer running Microsoft (Redmond, WA) Windows XP (UW sites only) or Windows 7 (all other sites) interfaced via USB 2.0 to EDGE during all recordings. EDGE's custom software stored tool data and task video files and labeled them with unique codes to identify subject, site, task, date, and time of each iteration. All tool motion and demographic data for all iterations were compiled into a single database, sorted, and analyzed by custom scripts in MATLAB (Mathworks Inc., Natick, MA). Pearson's r and Spearman's ρ with corresponding p-values were adopted as measures of correlation for linearity and monotonicity, respectively.
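For reference, the following is a minimal MATLAB sketch of this correlation analysis, not the study's actual script: the data below are synthetic, and corr with the 'Type' option assumes the Statistics and Machine Learning Toolbox is available.

    % Minimal sketch of the correlation analysis (synthetic data, not the
    % study's script). Requires the Statistics and Machine Learning Toolbox.
    rng(0);                                   % reproducible example
    t   = 60 + 120*rand(50,1);                % hypothetical task times (s)
    fls = (300 - t)/237 + 0.02*randn(50,1);   % hypothetical FLS-like scores

    [r,   p_r  ] = corr(t, fls, 'Type', 'Pearson');   % linearity
    [rho, p_rho] = corr(t, fls, 'Type', 'Spearman');  % monotonicity
    fprintf('r = %.2f (p = %.2g), rho = %.2f (p = %.2g)\n', r, p_r, rho, p_rho);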

2.2 Subject Pool Description

Subject enrollment was approved and registered under Western IRB 19125-A/B. This multicenter effort included surgeons of various skill levels from the University of Washington Medical Center, the University of Minnesota Medical Center, and three sites in the city of New Orleans enumerated below. The subject pool spanned the General Surgery, Urology, and Gynecology specialties. It consisted of active surgical faculty, surgical fellows, residents, and experienced practicing surgeons. Medical students pursuing surgical practice or FLS-experienced technicians were also enrolled in the study.


Fig. 2: EDGE screenshots of videos recorded during performance of the three FLS tasks: (a) Peg Transfer, (b) Circle Cutting, and (c) Suturing.

2.3 Data Collection Sites

Three EDGE platforms, one dedicated to each task, were deployed at each site to maximize subject throughput and allow for simultaneous subjects. An approved study administrator set up the equipment at each site on a daily basis, and subjects were invited to voluntarily participate in the study whenever their schedules allowed. Subjects were allowed to complete the study over multiple sessions.

2.3.1 Site: University of Washington

Data were collected over a three-week period at a surgical simulation training center or within a porcine lab laparoscopic training center.

2.3.2 Site: University of Minnesota

Data were collected over a period of two weeks, mostly at a surgeons' OR lounge during regular operating hours. Some subjects did their tasks at an adjacent surgical simulation training center.

2.3.3 Site: New Orleans, Louisiana

Data were collected over a 10-day period. Sites included the Louisiana State University Health Science Center, University Hospital New Orleans, and Interim LSU Public Hospital.

2.4 Questionnaire

Subject demographics were recorded after consent was obtained. The de-identified questionnaire included the subject's gender, age, handedness, training level, surgical specialty, laparoscopic experience, approximate number of relevant procedures done (where the subject completed more than half of the case), time since last laparoscopic procedure, total number of FLS tasks done in the past, and FLS certification status.

A post-task questionnaire invited feedback regarding the acceptability of EDGE based on categorical Likert scales and written general comments. A subject's oral comments during the study were also noted in the general comments section by the study administrator.

2.5 Surgical Task Description and FLS Scoring

Three of the five FLS tasks were used in the study: Peg Transfer, Cutting, and Intracorporeal Suturing. Descriptions of each task appear below, along with representative screenshots in Fig. 2. An iteration was defined as one complete execution of a single task. Subjects were asked to complete three iterations of the Peg Transfer task, two iterations of the Cutting task, and two iterations of the Suturing task, in that order. Subjects were also invited to perform additional iterations of any task if they were willing to do so. Each subject was introduced to each task via printed instructions. For each iteration, time (t) starts upon removing either tool tip from a fixed "home-base" position and stops when both tools are returned to that position. The published FLS scoring methodology [20], [21] was adopted to calculate FLS scores for each task (denoted FLStask); the explicit computation is shown in Table 1, and each task's error variables (Edr, Ea, Epd, Eg, Eq) are described below.

TABLE 1: Equations used to compute FLS scores [20], [21].

    FLS Task       FLS Score
    Peg Transfer   FLSpeg = (300 - t - 17*Edr) / 237
    Cutting        FLScut = (300 - t - 2*Ea) / 280
    Suturing       FLSsut = (600 - t - Epd - Eg - Eq) / 520
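These equations translate directly into code. A minimal MATLAB sketch follows, with the constants exactly as in Table 1; the function handles and the example inputs are illustrative, not part of the published methodology.

    % FLS scoring equations of Table 1 as anonymous functions.
    % t is task time in seconds; error terms are defined in Section 2.5.
    fls_peg = @(t, Edr)          (300 - t - 17*Edr) / 237;
    fls_cut = @(t, Ea)           (300 - t - 2*Ea) / 280;
    fls_sut = @(t, Epd, Eg, Eq)  (600 - t - Epd - Eg - Eq) / 520;

    % Hypothetical example: a 95 s peg transfer with one unrecovered drop.
    score = fls_peg(95, 1);      % (300 - 95 - 17)/237, approx. 0.79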

2.5.1 Peg Transfer (PegTx)

The Peg Transfer task, also called block transfer, employed two curved Maryland graspers. Instructions were to transfer six blocks, in minimum time and with minimal errors, from one side to another and back again without regard for order or color of blocks, to transfer each block mid-air between hands, and to avoid dropping blocks. EDGE automatically computed task time t. Each video was later manually reviewed to count the total number of non-recovered drops, considered to be errors (Edr).


2.5.2 Circle Cutting

The circle cutting task (Cutting), also called pattern cutting, employed a curved Maryland grasper in the non-dominant hand and a curved shear in the dominant hand for the duration of each task. Instructions were to cut gauze along a marked circular pattern (diameter = 4 cm) in minimum time and with minimal error, and to begin by either making a puncture anywhere on the circle or cutting in from the gauze edge. Task time t was automatically computed. The accumulated area (mm2) cut beyond the marked circle boundary constituted the error term Ea. To minimize subjective grader error and exploit automation, cut circles were flattened, electronically scanned, and cutting error was automatically computed via ImageJ (NIH), a public domain image processing suite [22]. The wand tool automatically outlined and measured out-of-bound areas through an edge-finding algorithm that measures pixel values. Scans included a printed reference line in order to scale the image from pixels to mm.
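Since the reference line has a known printed length, the pixel-to-mm conversion reduces to one linear scale factor, applied twice for area. A minimal MATLAB sketch with hypothetical numbers:

    % Sketch of the scan calibration; all values below are made-up examples.
    ref_len_mm = 50;                      % printed reference line length (mm)
    ref_len_px = 410;                     % same line measured in the scan (px)
    area_px    = 15200;                   % ImageJ-reported out-of-bound area (px^2)

    mm_per_px = ref_len_mm / ref_len_px;  % linear scale factor
    Ea        = area_px * mm_per_px^2;    % cutting error in mm^2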

2.5.3 Suturing

The intracorporeal suturing task (Suturing), also called knot tying with intracorporeal suture, employed two curved small needle drivers. Subjects had a choice of a ratcheting or non-ratcheting mechanism at the beginning of each task. Instructions were to complete the task in minimum time and with minimum errors. The task was to puncture a Penrose drain at marked entry and exit dots with a 2-0 V-20 tapered half-circle needle and 12.5 cm of suture and to tie a surgeon's knot with two initial throws and one final throw. Errors included distance away from puncture dots in mm (Epd), gap of the sutured slit in mm (Eg), and knot quality (Eq), where 0 indicated a secure knot, 10 a slipping knot, and 20 a knot that came apart. Task time t was automatically computed, but errors were manually determined by two FLS-certified graders.

2.6 EDGE Computed Metrics

EDGE software automatically computed tool-tip motion metrics such as tool path length (PathLength, the sum of left and right tool-tip distance traveled) and economy of motion (EoM, the ratio of path length to task time). EDGE also measured the grasp force exerted by a surgeon at the handle. The maximum of the peak force (Fpeak) exerted by either hand during an iteration was used to characterize force behavior. Each of these EDGE-computed metrics was calculated in the same way for all iterations, independent of task.
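For concreteness, a minimal MATLAB sketch of these definitions applied to synthetic 30 Hz logs; EDGE's internal implementation may differ, and every signal below is a fabricated placeholder.

    % Per-iteration metrics from hypothetical 30 Hz logs (positions in cm,
    % forces in N); this only illustrates the definitions above.
    fs = 30;  n = 900;                       % 30 s iteration at 30 Hz
    pL = cumsum(0.1*randn(n,3), 1);          % left tool-tip path (synthetic)
    pR = cumsum(0.1*randn(n,3), 1);          % right tool-tip path (synthetic)
    FL = abs(5 + randn(n,1));                % left grasp force (synthetic)
    FR = abs(5 + randn(n,1));                % right grasp force (synthetic)

    t          = n / fs;                                  % task time (s)
    tipdist    = @(p) sum(sqrt(sum(diff(p,1,1).^2, 2)));  % distance traveled
    PathLength = tipdist(pL) + tipdist(pR);               % cm, both hands
    EoM        = PathLength / t;                          % cm/s, avg velocity
    Fpeak      = max([FL; FR]);                           % max of either hand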

2.7 Establishment of the "True Expert" Set

Three methods commonly used in the surgical literature to identify "experts" to establish construct validity were employed. All were then combined to establish a set of individual iterations (not subjects) that exemplifies "true expert" skill. First, a subject's self-reported demographic criteria were considered. These included training level (None, Medical School, Post Graduate Years of residency 1 through 5, Fellowship, and Practicing Surgeon) as well as the self-reported estimate of the all-time total number of laparoscopic cases performed. Only practicing laparoscopic surgeons and fellows who completed more than 100 laparoscopic cases (where they did 50% or more of the case) were considered candidates for the "true expert" candidate subject pool. Second, the highest FLS-scoring iteration for each task of each expert candidate was taken as a set of "true expert" candidate iterations. Finally, the third approach utilized an Objective Structured Assessment of Technical Skill (OSATS) protocol [4], [5] which was modified to focus exclusively on psychomotor skills, denoted p-OSATS and shown in Table 2. Only the "true expert" candidate iterations (the best FLS-scoring iterations of laparoscopic fellows and practicing surgeons) were considered for p-OSATS review. The videos of these iterations were randomly renamed and ordered before evaluation. Two faculty surgeons (coders A and B) served as p-OSATS reviewers. Reviewers were blind to the identity and demographics of the subjects whose videos they reviewed and to the scores of the other reviewer.


Only the iterations that received p-OSATS scores of 3 or above in all domains were included in the "true expert" set. This set consists of single iterations, not individuals. In this way the three criteria for identifying expert skill (training/experience level, validated performance measures, and OSATS review) were combined.


TABLE 2: Psychomotor OSATS (p-OSATS) grading scale used to evaluate and numerically code psychomotor skill [4], [5].

    Score  Bimanuality                                    Motion Quality
    1      One arm paralyzed, offering no help            Unnecessary, hesitant, or awkward
           to complete the step                           movements of tools
    2
    3      Using both arms most of the time, but a        Reasonably efficient movements of tools
           clear perceivable bias of accomplishing most   but frequent non-effective moves
           of the task with the dominant hand
    4                                                     Mostly fluid
    5      Both arms naturally complementing each other;  Elegant, fluid, and efficient
           optimal use of the non-dominant hand           movements of tools

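A minimal MATLAB sketch of this three-stage selection follows; the iteration table, its column names (Level, Cases, Subject, Task, FLS, pOSATS_min), and all values are hypothetical stand-ins for the study's actual records.

    % Hypothetical per-iteration records: training level, lifetime case
    % count, subject/task ids, FLS score, and minimum p-OSATS over domains.
    T = table(categorical({'SG';'FL';'R3';'SG'}), [250;140;30;90], ...
              [1;2;3;4], [1;1;1;1], [0.81;0.77;0.40;0.70], [4;3;2;5], ...
              'VariableNames', {'Level','Cases','Subject','Task','FLS','pOSATS_min'});

    % 1) Demographic screen: fellows/practicing surgeons with >100 cases.
    cand = T(ismember(T.Level, {'FL','SG'}) & T.Cases > 100, :);

    % 2) Best FLS-scoring iteration per candidate subject and task.
    key    = findgroups(cand.Subject, cand.Task);
    isBest = false(height(cand), 1);
    for g = 1:max(key)
        rows = find(key == g);
        [~, k] = max(cand.FLS(rows));
        isBest(rows(k)) = true;
    end
    best = cand(isBest, :);

    % 3) Keep only iterations scoring 3 or above in every p-OSATS domain.
    trueExpert = best(best.pOSATS_min >= 3, :);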

To address inter-session reliability, coder A reviewed all videos again, in a different randomized order, approximately 10 days after the first review. Coder B's scores were compared to coder A's combined scores for inter-rater reliability. Both coders were asked to limit their coding to the psychomotor characteristics of the tool motion in p-OSATS alone, but were repeatedly asked to verbalize their thoughts during review, which were subsequently recorded for each reviewed iteration.

TABLE 3: Inter-session and inter-rater reliability of p-OSATS scores. Coders only scored the ground-truth expert candidate subset (N = 56). Inter-session compares coder A's scores with himself; inter-rater compares coder A's mean scores with coder B's. Pearson's r and Spearman's ρ with corresponding (p-value) are shown.

    Inter-session:        Peg Transfer   Cutting       Suturing
    Bimanuality       r   0.57 (0.01)    0.49 (0.03)   0.76 (0.00)
                      ρ   0.45 (0.04)    0.51 (0.02)   0.69 (0.00)
    Motion Quality    r   0.74 (0.00)    0.34 (0.14)   0.85 (0.00)
                      ρ   0.70 (0.00)    0.32 (0.17)   0.90 (0.00)

    Inter-rater:          Peg Transfer   Cutting       Suturing
    Bimanuality       r   0.55 (0.01)    0.04 (0.86)   0.70 (0.00)
                      ρ   0.45 (0.05)    0.04 (0.88)   0.66 (0.01)
    Motion Quality    r   0.63 (0.00)    0.27 (0.26)   0.78 (0.00)
                      ρ   0.60 (0.00)    0.21 (0.38)   0.80 (0.00)

3 RESULTS

3.1 Inter-session and Inter-rater p-OSATS Reliability

If Pearson’s r > 0.5 is a considered a strong fit and 0.4 <331

r < 0.5 is considered a moderate fit, then both p-OSATS332

domains exhibited acceptable reliability (p < .05 or less)333

for the Peg Transfer and Suturing tasks, with strongest334

reliability for the “Motion Quality” p-OSATS domain.335

Fig. 3: Box plots of FLS scores vs. demographics typically used to establish construct validity. (a) Training Level (0 = none, MS = medical school, R1-R5 = post graduate year of surgical residency, FL = fellow, SG = practicing surgeon). (b) Laparoscopic Experience: the self-reported estimate of the all-time total number of laparoscopic cases performed; 1+ indicates subjects who had performed 1 or more laparoscopic cases, where they did 50% or more of the case. (c) Age; 25+ indicates subjects who are 25 years or older, etc.


Inter-session and inter-rater reliability were weakest for the Cutting task. See Table 3 for details.

3.2 Data Collection Overview

Not all subjects completed all requested iterations in the study. Some subjects voluntarily completed additional iterations. Incomplete iterations, or iterations with corrupted data such as missing video that prevented post-task scoring, were excluded from analysis. An overview of the collected data appears in Table 4.

TABLE 4: Overview of all collected EDGE data.

                             UW    UMN   NOLA   TOTAL
    "Expert" Subjects         6      8      3      17
    Total Subjects           32     35     31      98
    Peg Transfer Iterations  78     88     27     193
    Cutting Iterations       61     53     51     165
    Suturing Iterations       0     59     30      89
    Total Iterations        139    200    108     447
    Total Time (hours)                           22.7

Fig. 5: 3D tool path for the left (blue) and right (red) hands of one iteration of a Peg Transfer task for a novice (right) and expert (left).

Representative plots of the 3D tool path of an FLS novice (low-scoring, no FLS experience) and a proficient subject (high FLS-scoring faculty surgeon) appear in Fig. 5, along with corresponding scores for a single Peg Transfer iteration. Fig. 6 shows the left- and right-hand grasping force plotted in time for the same iteration, as well as the corresponding computed values.

3.3 Demographics and FLS Scores

Demographic categories typical for construct validity, like training level and laparoscopic experience, do not correlate well over the entire database (N = 447); see Table 5 (Spearman's ρ ≤ 0.5 for all but the Cutting task vs. training level and laparoscopic case count). The corresponding scatter plots (see Fig. 3) indicate significant variation of FLS scores in the subjects with the most training and the most experience: demographically identified experts exhibited FLS scores significantly lower than fellows and not significantly different from third- or fourth-year residents. FLS scores correlated most weakly, or not at all, across age.

Fig. 6: The grasping force vs. time for the left (blue) and right (red) hands for the same task as Fig. 5.

TABLE 5: Construct validity categories vs. FLS score by task for the entire database (N = 447). Spearman's ρ (p-value) is shown.

                      PegTx         Cutting       Suturing
    Lapr. Cases       0.50 (0.00)   0.59 (0.00)   0.33 (0.00)
    Training Level    0.48 (0.00)   0.63 (0.00)   0.41 (0.00)
    Age               0.17 (0.02)   0.32 (0.00)   0.07 (0.52)

3.4 Comparison of Criteria Used to Establish the "True Experts" Set

All iterations of "true expert" candidates received p-OSATS scores (N = 52) in addition to their FLS scores and demographic ranking data. For this pool, no demographic category correlated well with either FLS or p-OSATS scores. However, FLS and p-OSATS correlated well (p < .01) for all tasks. See Table 6 for details. Additionally, each individual p-OSATS domain (not shown in the table) correlated significantly with FLS (p < .01), but none did so with laparoscopic experience.

FLS and p-OSATS were further compared with simple task time. FLS score was found to be nearly identical to time (r = -1.00, p < .0000001), indicating that the "true expert" candidates made virtually no mistakes. p-OSATS correlated well with time, though it exhibited some deviation. The correlation details for each task are listed in the lower section of Table 6. Moreover, Fig. 4a illustrates the specific deviations between p-OSATS and FLS; Fig. 4b the near-perfect equivalence of FLS and time; and Fig. 4c the deviation between p-OSATS scores and time.

3.5 Comparison of FLS and EDGE Metrics

All iterations from the database (N = 447) were employed to compare the automated performance metrics (task time, path length, economy of motion, and peak grasp force) and FLS scores. Correlation details appear in Table 7. Fig. 7 indicates near-perfect correlation between FLS and task time. Path length correlates well with task time for all tasks (with correlation coefficients at 0.87 or above and significance p < .0001 or better). EoM shows substantially less correlation with FLS scores, especially for lower EoM values.


Fig. 4: Scatter plots of only the iterations in the "true expert" set for (a) FLS score vs. global p-OSATS score, (b) FLS vs. time, and (c) p-OSATS vs. time. FLS shows little or no statistical deviation from time, indicating that the experts committed minimal errors; however, p-OSATS varied substantially from time.

TABLE 6: Comparison of criteria typically employed for construct validity used to establish the "true experts" set (N = 56). Pearson's r and Spearman's ρ with corresponding (p-value) are shown.

    Comparison               Peg Transfer    Cutting         Suturing
    Lap Exp vs. p-OSATS  r    0.15 (0.52)    -0.01 (0.95)    -0.11 (0.68)
                         ρ    0.25 (0.29)     0.11 (0.65)    -0.32 (0.22)
    Lap Exp vs. FLS      r    0.09 (0.71)    -0.21 (0.38)    -0.31 (0.24)
                         ρ   -0.09 (0.72)    -0.09 (0.72)    -0.16 (0.56)
    p-OSATS vs. FLS      r    0.57 (0.01)     0.57 (0.01)     0.79 (0.00)
                         ρ    0.55 (0.01)     0.64 (0.00)     0.77 (0.00)
    Time vs. p-OSATS     r   -0.57 (0.01)    -0.56 (0.01)    -0.79 (0.00)
                         ρ   -0.55 (0.01)    -0.64 (0.00)    -0.77 (0.00)
    Time vs. FLS         r   -1.00 (0.00)    -1.00 (0.00)    -1.00 (0.00)
                         ρ   -1.00 (0.00)    -1.00 (0.00)    -1.00 (0.00)

Finally, peak grasping force Fpeak shows no discernible overall trend across tasks.

A closer view of peak grasping force and FLS scores appears in Fig. 8. Iterations with the best FLS scores exhibited the highest peak forces.

TABLE 7: Correlation of EDGE automatically computed metrics with FLS scoring.

                          PegTx          Cutting        Suturing
    Time (s)          r   -0.99 (0.00)   -1.00 (0.00)   -1.00 (0.00)
                      ρ   -0.99 (0.00)   -1.00 (0.00)   -1.00 (0.00)
    Path Length (cm)  r   -0.92 (0.00)   -0.87 (0.00)   -0.95 (0.00)
                      ρ   -0.89 (0.00)   -0.88 (0.00)   -0.93 (0.00)
    EoM (cm/s)        r    0.70 (0.00)    0.35 (0.00)    0.40 (0.00)
                      ρ    0.82 (0.00)    0.42 (0.00)    0.61 (0.00)
    Peak Force (N)    r   -0.18 (0.01)   -0.27 (0.00)   -0.16 (0.14)
                      ρ   -0.08 (0.30)   -0.27 (0.00)   -0.26 (0.01)

4 DISCUSSION

Fig. 7: EDGE-computed performance metrics vs. FLS scores for the entire database.

Fig. 8: Peak force behavior for the Peg Transfer (red) and Suturing (green) tasks near the highest FLS scores.

Several limitations are present in this study. Data were collected in variable environments. Most of the subjects


were not a captive audience: they voluntarily participated and thus had no binding incentive to perform at their peak ability at all times. The reliability testing employed in the p-OSATS video review could benefit from more coders and perhaps a workshop review setting. Post-analysis of recorded reviewer comments made during p-OSATS review of Cutting tasks indicated that coder B frequently emphasized handling of the gauze (a cognitive skill) while coder A intentionally ignored it to limit his review to the strictly psychomotor categories of p-OSATS. The Cutting task also varied slightly in that a double ring was used at the UW sites and a single ring everywhere else; this potentially resulted in different scoring criteria between sites for the Cutting task.

Typically, only a single criterion is employed in the literature to group subjects into skill levels like 'expert' and 'novice'. We employed three: OSATS review, demographically identified skill via training level or laparoscopic experience, and the validated FLS scoring methodology. In many cases, they disagreed. In particular, a higher laparoscopic experience level did not imply better FLS or p-OSATS scores. If experience and training level are taken as the ground truth that determines expert skill, this suggests that the FLS tasks themselves may not relevantly test surgical expertise, at least among higher skill levels. If FLS and p-OSATS are taken as ground truth, this suggests that demographic categories like experience or training level may provide imperfect or even poor means to establish construct validity in simulation studies.

According to the equations of Table 1, strict correlation between FLS and task time is required if and only if there are no errors. According to our results in Table 7, the correlation between FLS and task time is never substantially reduced by errors for any task over the entire database. This suggests that the error scores, which must be manually collected and computed for each FLS task iteration, have a negligible effect on the final score given these equations and may not warrant the added resource cost they incur.
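The algebra is immediate in the notation of Table 1, writing C and D for the per-task constants and E for the total weighted error term; the display below is a sketch of the reasoning, not a new result.

    \mathrm{FLS}_{\mathrm{task}} = \frac{C - t - E}{D}, \qquad
    E \equiv 0 \;\Longrightarrow\;
    \mathrm{FLS}_{\mathrm{task}} = \frac{C}{D} - \frac{t}{D}

With E identically zero (or constant across iterations), FLS is a strictly decreasing affine function of t alone, so Pearson's r(FLS, t) = -1 exactly; any |r| < 1 can therefore only be contributed by nonzero, varying error terms.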

Since we expect "true experts" to have near-zero FLS error rates, this near-perfect correlation between task time and FLS score is not surprising among them (shown in Fig. 4b and Table 7) and indicates that, with FLS scoring, only task time can discriminate skill among experts. However, the p-OSATS scores of these iterations deviated substantially from task time (see Figs. 4a, 4c and Table 7). If the p-OSATS scores are accurate, this suggests that they discriminate some dimension of psychomotor skill among experts not detectable with task time alone.

According to Table 7 and Fig. 7, path length, and to a lesser extent EoM, correlated favorably with FLS (and, implicitly, time). It is reasonable to expect that longer path lengths for a given procedure will result in longer times and hence be closely dependent. However, EoM, which is merely the average velocity in a task (units: cm/s), specifically controls for task time. If it correlated perfectly with task time, this would imply that it is unnecessary. Instead, it appears to measure some other aspect of psychomotor skill. However, the clinical significance of training to higher or lower EoM performance measures is somewhat uncertain, particularly at the whole-task level, which considers the overall average rather than the EoM at specific times.

Fig. 8 shows an incidence of high peak force (> 30 N) for the best FLS scores. However, no data linking peak grasping force in FLS to tissue damage or outcomes are presented, so this result provides limited insight. We speculate that using excessive rather than minimal force may improve task times. Currently, the FLS methodology provides no way to deter or penalize such behavior, nor do the technical skills objectives and instructions include grasping force targets. We expect this, at least in part, results in the wide spread (lack of correlation) between peak grasp force and FLS scores shown in Fig. 7 and Table 7.

Such behavior can be easily identified and appropriately addressed (e.g., with a score penalty or automated audio feedback) by an instrumented platform like EDGE. Moreover, computerized scoring has better accuracy and repeatability than FLS scoring since it eliminates evaluator bias and human error. Such accuracy is particularly required for the force measurements, which are difficult to quantify from video. However, more research is required to establish clinically relevant grasping force targets for the different tasks or targeted tissues.
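As an illustration of the kind of automated deterrent discussed above, here is a minimal MATLAB sketch that flags samples exceeding a force threshold and issues immediate audio feedback; the 30 N threshold (motivated by the high-force region in Fig. 8) and the penalty bookkeeping are hypothetical, not clinically validated targets.

    % Sketch of an automated high-force deterrent; all numbers illustrative.
    fs        = 30;                         % sample rate (Hz)
    Fg_thresh = 30;                         % hypothetical threshold (N)
    Fg        = abs(5 + 10*randn(900, 2));  % synthetic grasp force, two hands

    over = any(Fg > Fg_thresh, 2);          % samples with excessive force
    if any(over)
        beep;                               % immediate audio feedback
        penalty = nnz(over) / fs;           % e.g., seconds above threshold
        fprintf('High-force samples: %d (%.1f s above %g N)\n', ...
                nnz(over), penalty, Fg_thresh);
    end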

To our knowledge, all prior studies indicate FLS to be valid and show positive transfer of skills to the operating room. We do not see any contraindication of this in our results. However, our work suggests that a stopwatch can effectively replace the FLS scoring system: accounting for 'precision' via error counts is superfluous with the currently published FLS scoring system. While this in no way diminishes the success of the FLS program, it may suggest that alternative metrics beyond task time have not been adequately explored in the FLS tasks. It also suggests that subjects who train only to task time and sacrifice precision will be rewarded by FLS scoring, since additional errors will count negligibly against them. However, either adjusting the FLS scoring equations or employing a more granular metric that can indicate dexterity can remedy this.

We conclude that the FLS scoring equations may provide negligible benefit over task time. We also conclude that instrumented reality-based platforms such as EDGE, along with the more granular, automated psychomotor metrics they enable (path length, economy of motion, peak grasp force), may offer significant skill discrimination beyond task time alone, but more research is required to establish this.

ACKNOWLEDGMENTS

The authors would like to thank Dr. Rob Sweet, Troy E. Reihsen, and the SimPORTAL staff from the University of Minnesota site, Dr. John Paige and his staff at Louisiana State University, and all the participants of the study. Funding for this study was provided by Research and Technology Development Grant RTD11 UW SB01 from the Washington Technology Center in partnership with Simulab Corporation.

REFERENCES

[1] Heinrichs LR, et al. Criterion-based training with surgical simulators: proficiency of experienced surgeons. JSLS: Journal of the Society of Laparoendoscopic Surgeons. 2007;11(3):273. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3015829/.
[2] Nasca TJ, Philibert I, Brigham T, Flynn TC. The Next GME Accreditation System—Rationale and Benefits. New England Journal of Medicine. 2012.
[3] Sturm LP, Windsor JA, Cosman PH, Cregan P, Hewett PJ, Maddern GJ. A systematic review of skills transfer after surgical simulation training. Annals of Surgery. 2008;248(2):166.
[4] Martin J, Regehr G, Reznick R, MacRae H, Murnaghan J, Hutchison C, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. British Journal of Surgery. 1997;84(2):273-278. Available from: http://onlinelibrary.wiley.com/doi/10.1046/j.1365-2168.1997.02502.x/abstract.
[5] Reznick R, Regehr G, MacRae H, Martin J, McCulloch W. Testing technical skill via an innovative "bench station" examination. The American Journal of Surgery. 1997;173(3):226-230. Available from: http://www.sciencedirect.com/science/article/pii/S0002961097895979.
[6] Peters J, Fried GM, Swanstrom LL, Soper NJ, Sillin LF, Schirmer B, et al. Development and validation of a comprehensive program of education and assessment of the basic fundamentals of laparoscopic surgery. Surgery. 2004;135(1):21-27.
[7] Fried GM, Feldman LS, Vassiliou MC, Fraser SA, Stanbridge D, Ghitulescu G, et al. Proving the value of simulation in laparoscopic surgery. Annals of Surgery. 2004;240(3):518.
[8] Okrainec A, Soper NJ, Swanstrom LL, Fried GM. Trends and results of the first 5 years of Fundamentals of Laparoscopic Surgery (FLS) certification testing. Surgical Endoscopy. 2011;25(4):1192-1198.
[9] Kolozsvari NO, Kaneva P, Brace C, Chartrand G, Vaillancourt M, Cao J, et al. Mastery versus the standard proficiency target for basic laparoscopic skill training: effect on skill transfer and retention. Surgical Endoscopy. 2011;p. 1-8.
[10] Rosenthal ME, Ritter EM, Goova MT, Castellvi AO, Tesfay ST, Pimentel EA, et al. Proficiency-based Fundamentals of Laparoscopic Surgery skills training results in durable performance improvement and a uniform certification pass rate. Surgical Endoscopy. 2010;24(10):2453-2457.
[11] Derevianko AY, Schwaitzberg SD, Tsuda S, Barrios L, Brooks DC, Callery MP, et al. Malpractice carrier underwrites Fundamentals of Laparoscopic Surgery training and testing: a benchmark for patient safety. Surgical Endoscopy. 2010;24(3):616-623.
[12] Sroka G, Feldman LS, Vassiliou MC, Kaneva PA, Fayez R, Fried GM. Fundamentals of Laparoscopic Surgery simulator training to proficiency improves laparoscopic performance in the operating room—a randomized controlled trial. The American Journal of Surgery. 2010;199(1):115-120.
[13] Gallagher AG, Ritter EM, Champion H, Higgins G, Fried MP, Moses G, et al. Virtual reality simulation for the operating room: proficiency-based training as a paradigm shift in surgical skills training. Annals of Surgery. 2005;241(2):364.
[14] Satava RM, Cuschieri A, Hamdorf J. Metrics for objective assessment. Surgical Endoscopy. 2003;17(2):220-226.
[15] Seymour NE, Gallagher AG, Roman SA, O'Brien MK, Bansal VK, Andersen DK, et al. Virtual reality training improves operating room performance: results of a randomized, double-blinded study. Annals of Surgery. 2002;236(4):458. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1422600/.
[16] Pearson A, Gallagher A, Rosser J, Satava R. Evaluation of structured and quantitative training methods for teaching intracorporeal knot tying. Surgical Endoscopy. 2002;16(1):130-137.
[17] Gunther S. Red DRAGON: A Multi-Modality System for Simulation and Training in Minimally Invasive Surgery. University of Washington; 2006.
[18] Rosen J, Brown JD, Chang L, Barreca M, Sinanan M, Hannaford B. The BlueDRAGON—a system for measuring the kinematics and dynamics of minimally invasive surgical tools in-vivo. In: Robotics and Automation, 2002. Proceedings. ICRA'02. IEEE International Conference on. vol. 2. IEEE; 2002. p. 1876-1881.
[19] Wright AS, Kowalewski TM, Hannaford B. Novel Laparoscopic Box Trainer with Integrated Force and Positioning Sensors. In: 12th World Congress of Endoscopic Surgery, Emerging Technology Session, National Harbor, MD; April 2010.
[20] Derossis AM, Fried GM, Abrahamowicz M, Sigman HH, et al. Development of a model for training and evaluation of laparoscopic skills. The American Journal of Surgery. 1998;175(6):482-487.
[21] Fraser S, Feldman L, Stanbridge D, Fried G. Characterizing the learning curve for a basic laparoscopic drill. Surgical Endoscopy. 2005;19(12):1572-1578.
[22] Rasband WS. ImageJ; 1997-2011. U.S. National Institutes of Health, Bethesda, Maryland, USA. http://imagej.nih.gov/ij/.

Timothy Kowalewski Mr. Kowalewski received the B.S. and M.S. degrees in Electrical Engineering from the University of Washington, Seattle, in 2003 and 20XX respectively. His doctoral research is in the area of training surgeons using machine learning techniques.

Lee White Mr. White received the B.S.E. degree in Biomedical Engineering from Tulane University in New Orleans, LA in 2008. He is currently pursuing the Ph.D. degree with research into medical technology and the use of human-guided robotic systems to enable advanced surgical procedures. Please see http://leewhite.org/

Iris Jiang Body of Iris's biography.


Thomas Lendvay Body of Tom's biography.

Andrew Wright Body of Andy's biography.

Blake Hannaford Dr. Hannaford received the B.S. degree in Engineering and Applied Science from Yale University in 1977, and the M.S. and Ph.D. degrees in Electrical Engineering from the University of California, Berkeley, in 1982 and 1985 respectively. From 1986 to 1989 he worked in the Automated Systems Section of the NASA Jet Propulsion Laboratory, Caltech. Since September 1989, he has been at the University of Washington in Seattle, where he has been Professor of Electrical Engineering since 1997.