TRANSCRIPT
Lessons learned in assessment: History, Research and Practical Implications
Cees van der Vleuten, Maastricht University
MHPE, Unit 1, 3 June 2010
PowerPoint at: www.fdg.unimaas.nl/educ/cees/mhpe
Overview of presentation
- Where is education going?
- Lessons learned in assessment
- Areas of development and research
Where is education going?
- School-based learning:
  - Discipline-based curricula
  - (Systems-)integrated curricula
  - Problem-based curricula
  - Outcome/competency-based curricula
Where is education going?
Underlying educational principles:
- Continuous learning of, or practicing with, authentic tasks (in steps of complexity, with constant attention to transfer)
- Integration of cognitive, behavioural and affective skills
- Active, self-directed learning, in collaboration with others
- Fostering domain-independent skills and competencies (e.g. teamwork, communication, presentation, science orientation, leadership, professional behaviour...)
(These principles are underpinned by cognitive psychology, constructivism, cognitive load theory, collaborative learning theory, and empirical evidence.)
Where is education going?
- Work-based learning: practice, practice, practice...
- Optimising learning by:
  - More reflective practice
  - More structure in the haphazard learning process
  - More feedback, monitoring, guiding, reflection, role modelling
  - Fostering of a learning culture or climate
  - Fostering of domain-independent skills (professional behaviour, team skills, etc.)
(Underpinned by deliberate practice theory, emerging work-based learning theories, and empirical evidence.)
Where is education going?
- Educational reform is on the agenda everywhere
- Education is professionalizing rapidly
- A lot of 'educational technology' is available
- How about assessment?
Overview of presentation
- Where is education going?
- Lessons learned in assessment
- Areas of development and research
Miller's pyramid of competence
Knows → Knows how → Shows how → Does
Miller GE. The assessment of clinical skills/competence/performance. Academic Medicine (Supplement) 1990; 65: S63-S67.
Lessons learned while climbing this pyramid with assessment technology
Assessing knowing how
[Miller's pyramid: 'knows how' level highlighted]
- 1960s: written complex simulations (Patient Management Problems, PMPs)
Key findings on written simulations (Van der Vleuten & Newble, 1995)
- Performance on one problem hardly predicted performance on another
- High correlations with simple MCQs
- Experts performed less well than intermediates
- The stimulus format is more important than the response format
Assessing knowing how
[Miller's pyramid: 'knows how' level highlighted]
Specific lessons learned:
- Simple short scenario-based formats work best (Case & Swanson, 2002)
- Validity is a matter of good quality assurance around item construction (Verhoeven et al., 1999)
- Generally, medical schools can do a much better job (Jozefowicz et al., 2002)
- Sharing (good) test material across institutions is a smart strategy (Van der Vleuten et al., 2004).
Moving from assessing 'knows'...
Knows: What is arterial blood gas analysis most likely to show in patients with cardiogenic shock?
A. Hypoxemia with normal pH
B. Metabolic acidosis
C. Metabolic alkalosis
D. Respiratory acidosis
E. Respiratory alkalosis

...to assessing 'knows how'
Knows how: A 74-year-old woman is brought to the emergency department because of crushing chest pain. She is restless, confused, and diaphoretic. On admission, temperature is 36.7 C, blood pressure is 148/78 mm Hg, pulse is 90/min, and respirations are 24/min. During the next hour, she becomes increasingly stuporous, blood pressure decreases to 80/40 mm Hg, pulse increases to 120/min, and respirations increase to 40/min. Her skin is cool and clammy. An ECG shows sinus rhythm and 4 mm of ST-segment elevation in leads V2 through V6. Arterial blood gas analysis is most likely to show:
A. Hypoxemia with normal pH
B. Metabolic acidosis
C. Metabolic alkalosis
D. Respiratory acidosis
E. Respiratory alkalosis

http://www.nbme.org/publications/item-writing-manual.html
Maastricht item review process
[Flow diagram: items from the disciplines (anatomy, physiology, internal medicine, surgery, psychology) feed an item pool; a review committee screens items before test administration (pre-test review); item analyses and student comments feed a post-test review, with information going back to users and accepted items into the item bank.]
Assessing knowing how
[Miller's pyramid: 'knows how' level highlighted]
General lessons learned:
- Competence is specific, not generic
- Assessment is as good as what you are prepared to put into it.
Assessing showing how
[Miller's pyramid: 'shows how' level highlighted]
- 1970s: performance assessment in vitro (OSCE)
Key findings around OSCEs (Van der Vleuten & Swanson, 1990)
- Performance on one station poorly predicted performance on another (many OSCEs are unreliable)
- Validity depends on the fidelity of the simulation (many OSCEs test fragmented skills in isolation)
- Global rating scales do well (improved discrimination across expertise groups; better inter-case reliabilities; Hodges, 2003)
- OSCEs impacted the learning of students
Reliabilities across methods (reliability by testing time)

Method                                                      1 h    2 h    4 h    8 h
MCQ (Norcini et al., 1985)                                  0.62   0.76   0.93   0.93
Case-based short essay (Stalenhoef-Halling et al., 1990)    0.68   0.73   0.84   0.82
PMP (Norcini et al., 1985)                                  0.36   0.53   0.69   0.82
Oral exam (Swanson, 1987)                                   0.50   0.69   0.82   0.90
Long case (Wass et al., 2001)                               0.60   0.75   0.86   0.90
OSCE (Petrusa, 2002)                                        0.47   0.64   0.78   0.88
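The rise in reliability with testing time in this table mainly reflects broader sampling across cases. As a rough quantitative sketch (not from the original slides), the Spearman-Brown prophecy formula from classical test theory predicts the reliability r_k of a test lengthened by a factor k from its one-hour reliability r_1:

\[ r_k = \frac{k \, r_1}{1 + (k - 1) \, r_1} \]

For the MCQ row, r_1 = 0.62 predicts a two-hour reliability of (2)(0.62) / (1 + 0.62), roughly 0.77, close to the observed 0.76; observed values at other test lengths need not track the projection exactly.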
Checklist versus rating scale reliability in an OSCE (Van Luijk & Van der Vleuten, 1990)

Test length (hours)   Examiners using checklists   Examiners using rating scales
1                     0.44                         0.45
2                     0.61                         0.62
3                     0.71                         0.71
4                     0.76                         0.76
5                     0.80                         0.80
Assessing showing how
[Miller's pyramid: 'shows how' level highlighted]
Specific lessons learned:
- OSCE-ology: patient training, checklist writing, standard setting, etc. (Petrusa, 2002)
- OSCEs are neither inherently valid nor reliable; that depends on the fidelity of the simulation and the sampling of stations (Van der Vleuten & Swanson, 1990).
Assessing showing how
[Miller's pyramid: 'shows how' level highlighted]
General lessons learned:
- Objectivity is not the same as reliability (Van der Vleuten, Norman & De Graaff, 1991)
- Subjective expert judgment has incremental value (Van der Vleuten & Schuwirth, in preparation)
- Sampling across content and judges/examiners is eminently important
- Assessment drives learning.
Assessing does
[Miller's pyramid: 'does' level highlighted]
- 1990s: performance assessment in vivo by judging work samples (Mini-CEX, CBD, MSF, DOPS, portfolio)
Key findings on assessing 'does'
- Ongoing work; this is where we currently are
- Reliability findings point to feasible sampling (8-10 judgments seems to be the magical number; Williams et al., 2003)
- Scores tend to be inflated (Govaerts et al., 2007)
- Qualitative/narrative information is (more) useful (Govaerts et al., 2007)
- Lots of work still needs to be done: how (much) to sample across instruments? How to aggregate information?
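The 8-10 figure can be given a back-of-the-envelope reading with the Spearman-Brown formula from the reliability table above, solved for the number of observations k needed to reach a target reliability r_k from a single-observation reliability r_1. The single-encounter value assumed below is purely illustrative, not a figure from the talk:

\[ k = \frac{r_k \, (1 - r_1)}{r_1 \, (1 - r_k)} \]

Assuming r_1 = 0.30 for a single observed encounter and a target of r_k = 0.80 gives k = (0.80)(0.70) / ((0.30)(0.20)), roughly 9.3 encounters, squarely in the 8-10 range.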
Reliabilities across methods (reliability by testing time)

Method                                                      1 h    2 h    4 h    8 h
MCQ (Norcini et al., 1985)                                  0.62   0.76   0.93   0.93
Case-based short essay (Stalenhoef-Halling et al., 1990)    0.68   0.73   0.84   0.82
PMP (Norcini et al., 1985)                                  0.36   0.53   0.69   0.82
Oral exam (Swanson, 1987)                                   0.50   0.69   0.82   0.90
Long case (Wass et al., 2001)                               0.60   0.75   0.86   0.90
OSCE (Petrusa, 2002)                                        0.47   0.64   0.78   0.88
Practice video assessment (Ram et al., 1999)                0.62   0.76   0.93   0.93
Incognito SPs (Gorter, 2002)                                0.61   0.76   0.92   0.93
Mini-CEX (Norcini et al., 1999)                             0.73   0.84   0.92   0.96
Assessing does
[Miller's pyramid: 'does' level highlighted]
Specific lessons learned:
- Reliable sampling is possible
- Qualitative information carries a lot of weight
- Assessment impacts work-based learning (more feedback, more reflection...)
- Validity strongly depends on the users of these instruments, and therefore on the quality of implementation.
Assessing does
[Miller's pyramid: 'does' level highlighted]
General lessons learned:
- Work-based assessment cannot (yet) replace standardised assessment; in other words, no single measure can do it all (Tooke report, UK)
- Validity strongly depends on the implementation of the assessment (Govaerts et al., 2007)
- But there is a definite place for (more subjective) expert judgment (Van der Vleuten & Schuwirth, under editorial review).
Competency/outcome categorizations

CanMEDS roles:
- Medical expert
- Communicator
- Collaborator
- Manager
- Health advocate
- Scholar
- Professional

ACGME competencies:
- Medical knowledge
- Patient care
- Practice-based learning & improvement
- Interpersonal and communication skills
- Professionalism
- Systems-based practice
Measuring the unmeasurable
[Miller's pyramid, contrasting 'domain-independent' skills with 'domain-specific' skills]
Measuring the unmeasurable
Importance of domain-independent skills:
- If things go wrong in practice, these skills are often involved (Papadakis et al., 2005, 2008)
- Success in the labour market is associated with these skills (Meng, 2006)
- Practice performance is related to school performance (Papadakis et al., 2004).
Measuring the unmeasurable
[Miller's pyramid, contrasting 'domain-independent' with 'domain-specific' skills]
Assessment (mostly in vivo) relies heavily on expert judgment and qualitative information.
Measuring the unmeasurable
- Self-assessment
- Peer assessment
- Co-assessment (combined self, peer and teacher assessment)
- Multisource feedback
- Logbook/diary
- Learning process simulations/evaluations
- Product evaluations
- Portfolio assessment

Eva, K. W., & Regehr, G. (2005). Self-assessment in the health professions: A reformulation and research agenda. Academic Medicine, 80(10 Suppl), S46-S54.
Falchikov, N., & Goldfinch, J. (2000). Student peer assessment in higher education: A meta-analysis comparing peer and teacher marks. Review of Educational Research, 70(3), 287-322.
Driessen, E., van Tartwijk, J., van der Vleuten, C., & Wass, V. (2007). Portfolios in medical education: Why do they meet with mixed success? A systematic review. Medical Education, 41(12), 1224-1233.
General lessons learned
- Competence is specific, not generic
- Assessment is as good as what you are prepared to put into it
- Objectivity is not the same as reliability
- Subjective expert judgment has incremental value
- Sampling across content and judges/examiners is eminently important
- Assessment drives learning
- No single measure can do it all
- Validity strongly depends on the implementation of the assessment
Practical implications: competence is specific, not generic
- One measure is no measure
- Increase sampling (across content, examiners, patients...) within measures
- Combine information across measures and across time
- Be aware of (sizable) false positive and false negative decisions
- Build safeguards into examination regulations.
Practical implications: no single measure can do it all
- Use a cocktail of methods across the competency pyramid
- Arrange methods in a programme of assessment
- Any method may have utility (including the 'old' assessment methods), depending on its utility within the programme
- Compromises on the quality of a method should be made in light of its function in the programme
- Treat assessment design like curriculum design:
  - Responsible people/committee(s)
  - Use an overarching structure
  - Involve your stakeholders
  - Implement, monitor and change (assessment programmes 'wear out')
Practical implications: validity strongly depends on the implementation of the assessment
- Pay special attention to implementation (good educational ideas often fail due to implementation problems)
- Involve your stakeholders in the design of the assessment
- Many naive ideas exist around assessment; train and educate your staff and students.
Overview of presentation
- Where is education going?
- Where are we with assessment?
- Where are we going with assessment?
- Conclusions
Areas of development and research
- Understanding expert judgment

Understanding human judgment:
- How does the mind of an expert judge work? How is it influenced?
- What is the link between clinical expertise and judgment expertise?
- There is a clash between the psychology literature on expert judgment and psychometric research.
Areas of development and research
- Understanding expert judgment
- Building non-psychometric rigour into assessment

Qualitative methodology as an inspiration:
Criterion       Quantitative approach   Qualitative approach
Truth value     Internal validity       Credibility
Applicability   External validity       Transferability
Consistency     Reliability             Dependability
Neutrality      Objectivity             Confirmability
Strategies for establishing trustworthiness:
- Prolonged engagement
- Triangulation
- Peer examination
- Member checking
- Structural coherence
- Time sampling
- Stepwise replication
- Dependability audit
- Thick description
- Confirmability audit
Procedural measures and safeguards:
- Assessor training & benchmarking
- Appeal procedures
- Triangulation across sources, saturation
- Assessor panels
- Intermediate feedback cycles
- Decision justification
- Moderation
- Scoring rubrics
- ...
Driessen, E. W., Van der Vleuten, C. P. M., Schuwirth, L. W. T., Van Tartwijk, J., & Vermunt, J. D. (2005). The use of qualitative research criteria for portfolio assessment as an alternative to reliability evaluation: a case study. Medical Education, 39(2), 214-220.
Areas of development and research
- Understanding expert judgment
- Building non-psychometric rigour into assessment
- Construction and governance of assessment programmes (Van der Vleuten & Schuwirth, 2005)
Assessment programmes
- How to design assessment programmes?
- Strategies for governance (implementation, quality assurance)?
- How to aggregate information for decision making? When is enough enough?
A model for designing programmes (Dijkstra et al., in preparation)
Areas of development and research
- Understanding expert judgment
- Building non-psychometric rigour into assessment
- Construction and governance of assessment programmes
- Understanding and using the impact of assessment on learning
Assessment impacting learning
- Lab studies convincingly show that tests improve retention and performance (Larsen et al., 2008)
- There is relatively little empirical research supporting educational practice
- Theoretical insights are absent.
Theoretical model under construction (Cilliers, in preparation)
[Model diagram: sources of impact (assessment strategy, assessment task, volume of assessable material, sampling, cues, the individual assessor) work through determinants of action (impact appraisal: likelihood, severity; response appraisal: efficacy, costs, value; perceived agency; interpersonal factors; normative beliefs and motivation to comply) to produce consequences of impact (cognitive processing strategies and metacognitive regulation strategies: choice, effort, persistence) and, ultimately, outcomes of learning.]
Areas of development and research
- Understanding expert judgment
- Building non-psychometric rigour into assessment
- Construction and governance of assessment programmes
- Understanding and using the impact of assessment on learning
- Understanding and using qualitative information.
Understanding and using qualitative information
- Assessment is dominated by the quantitative discourse (Hodges, 2006)
- How to improve the use of qualitative information?
- How to aggregate qualitative information?
- How to combine qualitative and quantitative information?
- How to use expert judgment here?
Finally
- Assessment in medical education has a rich history of research and development with clear practical implications (we've covered some ground in 40 years!)
- We are moving beyond the psychometric discourse into an educational design discourse
- We are starting to measure the unmeasurable
- Expert human judgment is reinstated as an indispensable source of information, both at the method level and at the programmatic level
- Lots of exciting developments still lie ahead of us!
“Did you ever feel you’re on the verge of an incredible breakthrough?”
This presentation can be found at: www.fdg.unimaas.nl/educ/cees/singapore
Literature

Cilliers, F. (in preparation). Assessment impacts on learning, you say? Please explain how. The impact of summative assessment on how medical students learn.
Dijkstra, J., Schuwirth, L., & Van der Vleuten, C. (in preparation). A model for designing assessment programmes.
Driessen, E., van Tartwijk, J., van der Vleuten, C., & Wass, V. (2007). Portfolios in medical education: Why do they meet with mixed success? A systematic review. Medical Education, 41(12), 1224-1233.
Driessen, E. W., Van der Vleuten, C. P. M., Schuwirth, L. W. T., Van Tartwijk, J., & Vermunt, J. D. (2005). The use of qualitative research criteria for portfolio assessment as an alternative to reliability evaluation: A case study. Medical Education, 39(2), 214-220.
Eva, K. W., & Regehr, G. (2005). Self-assessment in the health professions: A reformulation and research agenda. Academic Medicine, 80(10 Suppl), S46-S54.
Gorter, S., Rethans, J. J., Van der Heijde, D., Scherpbier, A., Houben, H., Van der Vleuten, C., et al. (2002). Reproducibility of clinical performance assessment in practice using incognito standardized patients. Medical Education, 36(9), 827-832.
Govaerts, M. J., Van der Vleuten, C. P., Schuwirth, L. W., & Muijtjens, A. M. (2007). Broadening perspectives on clinical performance assessment: Rethinking the nature of in-training assessment. Advances in Health Sciences Education, 12, 239-260.
Hodges, B. (2006). Medical education and the maintenance of incompetence. Medical Teacher, 28(8), 690-696.
Jozefowicz, R. F., Koeppen, B. M., Case, S. M., Galbraith, R., Swanson, D. B., & Glew, R. H. (2002). The quality of in-house medical school examinations. Academic Medicine, 77(2), 156-161.
Meng, C. (2006). Discipline-specific or academic? Acquisition, role and value of higher education competencies. PhD dissertation, Universiteit Maastricht, Maastricht.
Norcini, J. J., Swanson, D. B., Grosso, L. J., & Webster, G. D. (1985). Reliability, validity and efficiency of multiple choice question and patient management problem item formats in assessment of clinical competence. Medical Education, 19(3), 238-247.
Papadakis, M. A., Hodgson, C. S., Teherani, A., & Kohatsu, N. D. (2004). Unprofessional behavior in medical school is associated with subsequent disciplinary action by a state medical board. Academic Medicine, 79(3), 244-249.
Papadakis, M. A., Teherani, A., et al. (2005). Disciplinary action by medical boards and prior behavior in medical school. New England Journal of Medicine, 353(25), 2673-2682.
Papadakis, M. A., Arnold, G. K., et al. (2008). Performance during internal medicine residency training and subsequent disciplinary action by state licensing boards. Annals of Internal Medicine, 148, 869-876.
Petrusa, E. R. (2002). Clinical performance assessments. In G. R. Norman, C. P. M. Van der Vleuten & D. I. Newble (Eds.), International Handbook of Research in Medical Education (pp. 673-709). Dordrecht: Kluwer Academic Publishers.
Ram, P., Grol, R., Rethans, J. J., Schouten, B., Van der Vleuten, C. P. M., & Kester, A. (1999). Assessment of general practitioners by video observation of communicative and medical performance in daily practice: Issues of validity, reliability and feasibility. Medical Education, 33(6), 447-454.
Stalenhoef-Halling, B. F., Van der Vleuten, C. P. M., Jaspers, T. A. M., & Fiolet, J. B. F. M. (1990). A new approach to assessing clinical problem-solving skills by written examination: Conceptual basis and initial pilot test results. Paper presented at the Teaching and Assessing Clinical Competence conference, Groningen.
Swanson, D. B. (1987). A measurement framework for performance-based tests. In I. Hart & R. Harden (Eds.), Further Developments in Assessing Clinical Competence (pp. 13-45). Montreal: Can-Heal Publications.
Van der Vleuten, C. P., Schuwirth, L. W., Muijtjens, A. M., Thoben, A. J., Cohen-Schotanus, J., & van Boven, C. P. (2004). Cross institutional collaboration in assessment: A case on progress testing. Medical Teacher, 26(8), 719-725.
Van der Vleuten, C. P. M., & Newble, D. I. (1995). How can we test clinical reasoning? The Lancet, 345, 1032-1034.
Van der Vleuten, C. P. M., Norman, G. R., & De Graaff, E. (1991). Pitfalls in the pursuit of objectivity: Issues of reliability. Medical Education, 25, 110-118.
Van der Vleuten, C. P. M., & Schuwirth, L. W. T. (2005). Assessment of professional competence: From methods to programmes. Medical Education, 39, 309-317.
Van der Vleuten, C. P. M., & Schuwirth, L. W. T. (under editorial review). On the value of (aggregate) human judgment. Medical Education.
Van der Vleuten, C. P. M., & Swanson, D. B. (1990). Assessment of clinical skills with standardized patients: State of the art. Teaching and Learning in Medicine, 2(2), 58-76.
Van Luijk, S. J., Van der Vleuten, C. P. M., & Schelven, R. M. (1990). The relation between content and psychometric characteristics in performance-based testing. In W. Bender, R. J. Hiemstra, A. J. J. A. Scherpbier & R. P. Zwierstra (Eds.), Teaching and Assessing Clinical Competence (pp. 202-207). Groningen: Boekwerk Publications.
Wass, V., Jones, R., & Van der Vleuten, C. (2001). Standardized or real patients to test clinical competence? The long case revisited. Medical Education, 35, 321-325.
Williams, R. G., Klamen, D. A., & McGaghie, W. C. (2003). Cognitive, social and environmental sources of bias in clinical performance ratings. Teaching and Learning in Medicine, 15(4), 270-292.