Validation Studies in Simulation-based Education
April 15, 2015
Deb Rooney, PhD, Professor of Learning Health Sciences
All Rights Reserved.
Objectives
• Validity in the current framework used to evaluate evidence
• How we gather and evaluate validity evidence from a simulator and its associated measures
• Context of the academic product (manuscripts)
• Final considerations
Validity: What is it?
A few definitions to consider:
1. The degree to which the tool measures what it claims to measure
2. The degree to which evidence and theory support the interpretations of test scores as entailed by proposed uses of tests (Standards, 1999)
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Simulator Validation: The Framework
• Evidence relevant to relationships to other variables (e.g., novice-versus-expert discrimination) is over-represented
• Evidence relevant to response processes and consequences of testing is infrequently reported
• Apply the current Standards to ensure rigorous research and reporting
Cook, D. A., Brydges, R., Zendejas, B., Hamstra, S. J., & Hatala, R. (2013). Technology-enhanced simulation to assess health professionals: A systematic review of validity evidence, research methods, and reporting quality. Academic Medicine, 88(6), 872-883.
Simulator Validation: The Evidence
Current AERA Standards*
• Not new/novel
• Unitary construct: all evidence falls under "construct" validity
Five sources of validity evidence:
• Test content (face validity, subjective measures, construct alignment)
• Internal structure (reliability, dimensionality, function across groups)
• Response processes (psychometric properties of measures)
• Relationships to other variables (comparison with previously-validated measures)
• Consequences of testing (standard setting, rater/ratings quality, fidelity vs. stakes)
*Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014)
Validity: What is it NOT?
Validity evidence:
• Does not allow us to make inferences about a curriculum
• Does not allow us to make inferences about different applications, settings, or learners
• Is not a terminal quality determination (about the quality of your measures or application)
• Not "the scale was valid" but rather "evidence supports the use of a scale to measure X in a particular setting/application"
Simulator Validation: How does this evidence apply to us?
We have a much more complex environment to evaluate!
• Simulator: test content, relationships to other variables, consequences of testing
• Instrument (measures): test content, internal structure, response processes, relationships to other variables, consequences of testing
Validity Evidence in Simulation: How/when do we gather evidence?
Creation, design & planning:
• Test content (measures/simulator)
Implementation & evaluation:
• Internal structure (measures)
• Response processes (measures)
• Relationship to other variables (measures/simulator)
• Consequences of testing (measures/simulator)
Validity Evidence in Simulation: How do we disseminate findings?
• Paper 1: Test content (measures & simulator), before implementation
• Paper 2: Quality of performance measures, before or after implementation
• Paper 3: Impact on performance and/or patient outcomes, after full implementation
Most Recent Example: Neurosurgery Sim
• Paper 1a: Preliminary evaluation of quality of simulator/measures
• Paper 2b: Evaluation of performance measures from the simulator
• Paper 3: Evaluation of impact on performance measures/patient outcomes
a Tai B, Rooney D, Stephenson F, Liao P, Sagher O, Shih A, Savastano LE. Development of 3D-printing built ventriculostomy placement simulator. Journal of Neurosurgery (in press).
b Rooney DM, Tai BL, Sagher O, Shih AJ, Wilkinson DA, Savastano LE. A simulator and two tools: Validation of performance measures from a novel neurosurgery simulator using the current Standards framework. Surgery (submitted 3/15).
Paper 1: Simulator validation process
[Flow diagram: three expert review samples (n=7, n=5, n=5) over a 4-month process]
The Content Validity Form: 5 domains
• Physical Attributes
• Realism-experience
• Value
• Relevance
• Overall (global)
Paper 1: The preliminary validation process (Sim)*
• Using a Rasch model, analyzed the data for:
  • Domain rating differences across the 3 sites
  • Mean ratings by item
  • Rasch variability indices, to identify possible inconsistency in ratings
• Ensured psychometric quality of the survey:
  • Using traditional methods, estimated inter-item consistency (Cronbach's alpha) and inter-rater agreement (ICC(2,k))
  • Using a Rasch model, confirmed that the rating scales function as intended
*The performance checklist is a separate/different process.
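For orientation, here is a minimal sketch (not the authors' analysis code) of the two "traditional" indices named above, computed with plain numpy. The ratings matrix is hypothetical: 5 expert raters scoring 7 survey items on a 1-5 scale.

```python
import numpy as np

def cronbach_alpha(x):
    """Inter-item consistency. x: (n_respondents, n_items)."""
    x = np.asarray(x, dtype=float)
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = x.sum(axis=1).var(ddof=1)     # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

def icc_2k(x):
    """Inter-rater agreement, ICC(2,k) of Shrout & Fleiss (1979):
    two-way random effects, average of k raters. x: (n_targets, k_raters)."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_targets = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_raters = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_error = ((x - grand) ** 2).sum() - ss_targets - ss_raters
    bms = ss_targets / (n - 1)                # between-targets mean square
    jms = ss_raters / (k - 1)                 # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))      # residual mean square
    return (bms - ems) / (bms + (jms - ems) / n)

# Hypothetical data: rows = 5 raters, columns = 7 survey items (1-5 scale)
ratings = np.array([[4, 4, 5, 3, 4, 4, 5],
                    [4, 3, 5, 3, 4, 3, 4],
                    [5, 4, 5, 4, 4, 4, 5],
                    [3, 3, 4, 3, 3, 3, 4],
                    [4, 4, 5, 3, 4, 4, 4]])

print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")  # raters as respondents
print(f"ICC(2,k):         {icc_2k(ratings.T):.2f}")        # items as targets
```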
Results: Domain mean ratings by site
[Bar chart: domain mean ratings (0-4.5 scale) at the three sites: UM, HF, WS]
Combined mean ratings across the five domains: 3.4, 3.3, 3.9, 3.3, 2.4
"This simulator requires minor adjustments before it can be considered for use in ventriculostomy placement training."
Results: Mean ratings by item
[Chart of mean ratings by item not reproduced]
Paper 1*: Test Content-Checklist
Expert instructors rated each proposed checklist item on a 4-point scale:
1 = Definitely do not include this task
2 = Not sure if this task should be included
3 = Pretty sure this task should be included
4 = Definitely include this task
Proposed items:
• Position head and mark midline
• Locate Kocher's point (10.5 cm posterior to the nasion and 3 cm lateral to midline)
• Mark incision (approximately 2 cm long in a parasagittal location)
• Incise, clear tissue off cranium, retract scalp
• …
• Suture wound (staples or a 3-0 running nylon or prolene suture)
*shoulda, coulda, woulda
Ask expert instructors about the value of the included steps (items) for measuring X while doing Y.
• A reasonable number of experts is ~3
What else do you ask about?
• Clarity of each item
• Appropriateness of qualifiers (use X instrument, at X location)
• The rating scale
• Missing steps
• Objective measures to include (e.g., "time to" measures)
One way to summarize the resulting expert ratings is sketched below.
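The deck does not prescribe an analysis for the 4-point expert form; a minimal sketch, assuming invented ratings, of one common summary: each step's mean rating plus the share of experts endorsing it (rating 3 or 4). Step names are from the form above; the values are hypothetical.

```python
import numpy as np

steps = ["Position head and mark midline",
         "Locate Kocher's point",
         "Mark incision"]
# rows = proposed steps, columns = the ~3 expert instructors (values invented)
ratings = np.array([[4, 4, 3],
                    [4, 4, 4],
                    [3, 2, 4]])

for step, row in zip(steps, ratings):
    endorsed = (row >= 3).mean()   # proportion rating the step 3 or 4
    print(f"{step}: mean = {row.mean():.2f}, endorsed by {endorsed:.0%}")
```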
Next Step: Deeper Evaluation (Paper 2)
• Paper 1: Test content (measures & simulator), before implementation
• Paper 2: Quality of performance measures, before or after implementation
• Paper 3: Impact on performance and/or patient outcomes, after full implementation
• Evaluation of all validity evidence of the performance measures [à la the Standards]
• Capture a broader (regional/national) sample of performance data via videotaped performances
  • Ideal N (+++) and range of experience
• Compare measures from the novel performance checklist and a gold standard (e.g., OSATS) (relationship with other variables)
• Set/test performance standards (if appropriate)
Rooney DM, Tai BL, Sagher O, Shih A, Wilkinson DA, Savastano L. A simulator and two tools: Validation of performance measures from a novel neurosurgery simulator using the current Standards framework. Surgery (submitted 3/15).
Paper 2: Study design
• Nationally-recognized training program sponsored by the Society of Neurological Surgeons
• A total of n=14 (11 trainees*, 3 attendings) performed ventriculostomy on the simulator
• All performances were video-captured and scored by 3 raters using the novel checklist and a modified version of the OSATS
*First-year neurosurgery fellows
Checklist: Ventriculostomy Procedural Assessment Tool (V-PAT)
Modified Objective Structured Assessment of Technical Skills: m-OSATS
Martin JA, Regehr G, Reznick R, MacRae H, Murnaghan J, Hutchison C, et al. (1997). Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg, 84, 273-278.
Paper 2: Evidence examined
Examined evidence from 5 sources, but packaged a bit differently:
Measures adequately reflect ventriculostomy performance "quality" (relationships to other variables)
• Trainee vs. expert ratings
• Correlation of summed V-PAT scores with summed OSATS scores
Measures are psychometrically sound, i.e., adequate "quality control" (psychometric function of V-PAT & OSATS measures)
• Response processes: Rasch indices → rating scale function
• Test content: Rasch item point-measure correlations, item fit (variability)
• Internal structure: inter-item consistency (Cronbach's α), inter-rater agreement (ICC(2,k))
Measures are free from rater bias (consequences of testing)
• Evaluated Rasch bias indices to identify potential rating differences at the rater level
Results: "Quality control"
• Response processes: Rasch indices (average measures, fit statistics, and Rasch-Andrich thresholds) indicated all rating scales for both V-PAT & OSATS were well-functioning
• Test content: point-measure correlations all positive, [.39, .81]; Rasch item Outfit MS all < 2.0
• Internal structure: Cronbach's α = 0.95 for both instruments; intraclass correlations: V-PAT [-0.33, 0.93], OSATS [0.80, 0.93]
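A minimal sketch of the point-measure correlation idea, with invented data: each item's scores are correlated with examinees' total scores. (Rasch software such as Winsteps correlates item observations against person measures; totals are used here only as a rough stand-in.)

```python
import numpy as np

def point_measure_corr(scores):
    """scores: (n_persons, n_items). Returns one correlation per item."""
    s = np.asarray(scores, dtype=float)
    totals = s.sum(axis=1)                    # crude proxy for person measures
    return np.array([np.corrcoef(s[:, j], totals)[0, 1]
                     for j in range(s.shape[1])])

# Hypothetical data: 6 examinees x 4 items on a 1-5 rating scale
scores = np.array([[5, 4, 5, 4],
                   [3, 3, 2, 3],
                   [4, 4, 4, 5],
                   [2, 1, 2, 2],
                   [4, 3, 4, 4],
                   [3, 2, 3, 3]])

print(point_measure_corr(scores).round(2))   # negative values flag misfit items
```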
Results: Relationship to Other Variables (V-PAT)
Do V-PAT measures adequately differentiate trainee and expert performances?

Instrument/item | Resident observed average (SE) | Attending observed average (SE) | P-value | ICC(2,k)
V-PAT
1. Position head and mark midline | 3.75 (.20) | 4.00 (.47) | 0.54 | *
2. Locate Kocher's point | 3.84 (.18) | 4.11 (.34) | 0.52 | .86
3. Mark an incision approximately 2 cm long in a parasagittal location | 3.61 (.16) | 4.00 (.32) | 0.42 | .10
4. Select drain exit site from the scalp | 2.94 (.17) | 3.71 (.50) | 0.22 | *
5. Incise, clear tissue off cranium, retract scalp | 3.56 (.14) | 4.11 (.34) | 0.20 | .83
6. Set drill stop and drill trephine | 2.80 (.16) | 3.78 (.37) | 0.08 | .70
7. Confirm dura and pierce with 18g spinal needle or 11 blade scalpel | 2.98 (.17) | 3.67 (.35) | 0.25 | .76
8. Confirm landmarks and place catheter to 6-7 cm from outer table of skull | 2.88 (.17) | 3.33 (.35) | 0.24 | .91
9. Confirm CSF flow | 3.41 (.13) | 3.67 (.30) | 0.39 | .73
10. Remove trocar cover, tunnel trocar to exit site and recap trocar | 2.78 (.14) | 3.00 (.47) | 0.62 | -.33
11. Place purse string suture at the scalp exit site to anchor the catheter | 3.00 (.36) | 4.25 (.69) | 0.26 | *
Overall average | 3.30 (.06) | 3.80 (.11) | 0.01 | –
OSATS
1. Respect for Tissue | 2.66 (.21) | 4.11 (.34) | 0.004 | .93
2. Time and Motion | 2.42 (.22) | 4.00 (.43) | 0.005 | .85
3. Instrument Handling | 2.51 (.22) | 4.00 (.46) | 0.007 | .86
4. Knowledge of Instruments | 2.36 (.21) | 4.33 (.43) | 0.002 | .84
5. Flow of Operation | 2.36 (.21) | 4.22 (.45) | 0.001 | .85
6. Knowledge of specific procedure | 2.33 (.23) | 4.33 (.46) | 0.001 | .80
Overall average | 2.32 (.08) | 3.73 (.15) | 0.001 | –
*Too few cases to estimate
Results: Relationship to Other Variables (m-OSATS)
Do m-OSATS measures adequately differentiate trainee and expert performances? (See the table above.)
• Correlation of summed V-PAT scores with summed m-OSATS scores: Pearson's r = 0.72, p = 0.001
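A minimal sketch of the summed-score correlation, with invented scores; the reported r = 0.72 came from the study's actual data (n = 14), not from these values.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-participant summed scores (n = 14; values invented)
vpat_sums  = np.array([33, 36, 30, 38, 35, 29, 41, 37, 32, 34, 31, 42, 40, 39])
osats_sums = np.array([14, 17, 13, 20, 16, 12, 25, 18, 15, 16, 13, 26, 24, 22])

r, p = pearsonr(vpat_sums, osats_sums)       # linear association of the two tools
print(f"Pearson r = {r:.2f}, p = {p:.4f}")
```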
Results: Consequences of Testing
Are measures free from rater bias?

Instrument/rater | Observed average (SE)
V-PAT
1. Rater 1 (LS) | 3.60 (.11)
2. Rater 2 (OS) | 3.60 (.10)
3. Rater 3 (DW) | 3.60 (.11)
4. Participants (variable)† | 2.90 (.11)
Overall average | 3.60 (–)
OSATS
1. Rater 1 (LS)* | 2.10 (.17)
2. Rater 2 (OS) | 3.30 (.15)
3. Rater 3 (DW) | 3.00 (.15)
Overall average | 2.80 (–)
†Comparison with 3 expert raters, p = 0.01
*Comparison with 2 expert raters, p = 0.01
Are OSATS measures free from rater bias?
[Chart: Bias interaction (rater × item): observed averages (0-5) by OSATS item (Q1-Q6) for Rater 1 (LS), Rater 2 (OS), and Rater 3 (DW)]
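One simple way to screen for the rater effect shown above is to compare each rater's mean across the same set of performances. This is a minimal sketch with invented ratings, not the authors' Rasch bias analysis, which models rater-by-item interactions; a one-way ANOVA is only a crude first check (a repeated-measures model would respect that all raters scored the same videos).

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical OSATS item ratings (1-5) from three raters scoring the
# same videotaped performances; rater 1 is deliberately "hawkish" here.
rater1 = np.array([2, 2, 3, 1, 2, 3, 2, 2])
rater2 = np.array([3, 4, 3, 3, 4, 3, 3, 4])
rater3 = np.array([3, 3, 3, 2, 3, 4, 3, 3])

F, p = f_oneway(rater1, rater2, rater3)
print(f"Rater means: {rater1.mean():.2f}, {rater2.mean():.2f}, {rater3.mean():.2f}")
print(f"One-way ANOVA across raters: F = {F:.2f}, p = {p:.3f}")
```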
Summary of Results

Group | Evidence | Inference | V-PAT | OSATS
Quality control | Response processes | Adequate rating scale function | √ | √
Quality control | Test content | Items align with construct | √ | √
Quality control | Internal structure | Inter-item consistency, inter-rater agreement | X* | √
Test of assumptions | Rel. to other variables | Measures differentiate novice/expert performances | X | √
Test of assumptions | Rel. to other variables | V-PAT/OSATS summed scores correlate | √ | √
Test of assumptions | Consequences of testing | V-PAT/OSATS measures are bias-free | √ | X
X = challenges that require resolution
Validity: à la 2015
• Evidence is important, but the interpretive argument is critical
• The content of the interpretive argument determines the kinds of evidence that are most relevant (most important) in validation
• Strategy: develop the interpretive argument based on
  • Validity evidence relevant to inferences
  • Assumptions
  • Challenges (alternative interpretations)
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. Available for purchase via http://teststandards.org/
Challenges: Potential threats to validity (V-PAT)
Problematic inter-rater agreement (ICC) for 5 items should be resolved:
• Item 3 (Mark an incision approximately 2 cm long in a parasagittal location), ICC = .10
• Item 10 (Remove trocar cover, tunnel trocar to exit site and recap trocar), ICC = -.33
• Item 1 (Position head and mark midline)*
• Item 4 (Select drain exit site from the scalp)*
• Item 11 (Place purse string suture at the scalp exit site to anchor the catheter)*
*ICC incalculable
Remedy: examine and refine items to ensure they align with simulator capabilities and are mutually exclusive.
Challenges: Potential threats to validity (m-OSATS)
• "Hawkish" OSATS ratings by one expert rater require follow-up
Remedy: refine items; add rater training on the scoring rubric and administration standards.
Next Step: Evaluation of Impact (Paper 3)
• Paper 1: Test content (measures & simulator), before implementation
• Paper 2: Quality of performance measures, before or after implementation
• Paper 3: Impact on performance and/or patient outcomes, after full implementation
• Evaluation of impact on trainees' clinical performance or patient outcomes [relationship with other variables]
• Examine:
  • Change in trainees' clinical performance (checklist ratings; objective measures such as "time to," length of stay, adverse events)
  • Impact on hospital costs
Barsuk JH, Cohen ER, Feinglass J, Kozmic SE, McGaghie WC, Ganger D, Wayne DB. Cost savings of performing paracentesis procedures at the bedside after simulation-based education. Simul Healthc. 2014 Oct;9(5):312-8.
Considerations
• The validation process is fluid, iterative, and ongoing
• It takes a team:
  • Development (clinicians, instructors, engineers, research assistants)
  • Outcomes (+QI, hospital info)
• There is funding:
  • AHRQ
  • PCORI
  • Blue Cross Blue Shield of Michigan