UNIVERSITY OF CALGARY

Application of Classical Test Theory and Item Response Theory to Analyze

Multiple Choice Questions

Mona Nasir

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE

DEGREE OF DOCTOR OF PHILOSOPHY

DEPARTMENT OF MEDICAL SCIENCE

CALGARY, ALBERTA

September, 2014

©Mona Nasir 2014

Abstract

Background

Multiple choice questions are used worldwide for summative assessment in undergraduate

medical education. Only a few studies have looked at their reliability using both classical test

theory and item response theory. The main aim of this research was to use examination data

from the summative multiple choice exams at the University of Calgary in order to assess the

reliability of scores by applying and comparing two methods of analysis, i.e., classical test theory and

item response theory, to items administered three times over a six-year period. In addition, the

temporal stability of the same items was also analyzed using both classical test theory and item

response theory.

Methods

Three courses were chosen for the item analysis. Thirty items from each course over a period

of three years were scrutinized for reliability by conducting an item analysis using SPSS and

Xcalibre 4.2. Item difficulty and discrimination indices were calculated using both classical test

theory and the two-parameter logistic model of item response theory. Correlation coefficients were

calculated for all three years to analyze the relationship between the two measurement methods

and also the inter-year correlation for the three years using both classical test and item response

theory. Cronbach’s Alpha was calculated to look at the reliability of the scores. Furthermore,

item characteristic curves were generated using Xcalibre 4.2. Repeated measures analysis of

variance was conducted on the item parameters from both classical test and item response theory.

Test characteristic curves were generated for each year under the two-parameter logistic model

and compared across the years to assess the stability of the multiple choice items over time.

Results

Difficulty was found to be adequate for half the items when classical test theory was applied

and for two thirds of the items when item response theory was used. Discrimination was mostly

fair to adequate with classical test theory and excellent with item response theory. Standard error

of measure varied from small to large across the item parameters of different items, and the

reliability index for the test scores ranged from 0.56 to 0.65 across the years. Correlation coefficients

were excellent between Years 1 and 3 and only fair for Year 2 when compared with the other two.

Correlation coefficients between classical test and item response theory were excellent. Items

were stable across the three years: repeated measures analysis of variance yielded small F ratios,

indicating stability of item difficulty and discrimination over Times 1, 2 and 3. Visual inspection

of the test characteristic curves yielded the same findings.

Conclusion

Multiple choice questions used by the University of Calgary over a period of three years have

been shown to be fairly reliable and stable over time with different samples of students. Some

differences were noted in the item analysis carried out by the two different methods (i.e.,

classical test and item response theory) but mostly the two measurement methods were

comparable. Some items need review and revision to further improve the reliability of the

exam, after which the multiple choice items may be used repeatedly without affecting their

psychometric properties.

Acknowledgements

I’d like to start with thanking the Almighty; He has always carried me in the palm of His

hand. I am extremely grateful to Dr. Jocelyn Lockyer for her continued guidance and support.

She has inculcated in me the habit of thinking “why”. Thank you, Dr. Lockyer, for your help and

direction with this research.

Dr. Tyrone Donnon and Dr. Tanya Beran, thank you for leading me through the precipitous

road of statistics and encouraging me to delve further into this intriguing field. I am also grateful

to Dr. Claudio Violato for his direction, Dr. Bruce Wright for allowing access to the

Undergraduate Medical Education data and Mr. Alain Chan for his assistance with the data.

My deepest gratitude to my rock, my husband, Saghir. If it weren’t for your continued

encouragement and support, especially in my darkest moments, this research might not have seen

the light of day. I would also like to thank my remarkably resilient and adaptable

children, Alishba, Raza and Murtaza, for their extraordinary patience with my thesis writing.

Guys, thank you for being you!

Last but not least, my sincerest gratitude to my siblings for their continued support,

especially my brother Shabih whose selflessness knows no bounds and my sister Farzana who

pushes me to strive for the best!

Dedication

This dissertation is dedicated to the most cherished memories of my beloved parents,

Nasir Hussain and Zakira, the gems who honed my skills, loved me unconditionally and

continue to guide me in spirit.

Table of Contents

Abstract

Acknowledgements

Dedication

Table of Contents

List of Tables

List of Figures

List of Symbols and Abbreviations

Epigraph

CHAPTER 1: INTRODUCTION

1.1 Overview

1.1.1 Types of Assessment

1.1.2 Importance of Formative and Summative Assessments

1.1.3 Tools of Assessment

1.1.4 Multiple Choice Questions

1.2 Problem Statement

1.3 Significance of the Research

1.4 Purpose of Research

CHAPTER II – LITERATURE REVIEW

2.1 Multiple Choice Questions for Summative Assessments

2.2 Classical Test Theory

2.2.1 Assumptions of Classical Test Theory

2.2.2 Item Analysis with Classical Test Theory

2.2.2.1 Reliability of Tests in the Context of CTT

2.2.2.2 Item Difficulty

2.2.2.3 Item Discrimination

2.2.4 Limitations of Classical Test Theory

2.3 Shift from Classical Test Theory to Item Response Theory

2.4 Item Response Theory

2.4.1 Item Response Theory-Then and Now

2.4.2 Basic Concepts of IRT

2.4.3 Assumptions of IRT

2.4.4 Item Characteristic Curve, Item Difficulty and Item Discrimination

2.4.5 Test Characteristic Curve

2.4.6 IRT Models

2.4.7 Item Analysis with IRT

2.4.8 Applications of IRT

2.4.6.1 Ability and Item Parameter Estimation

2.4.6.2 Differential Item Functioning

2.4.6.3 Computerized Adaptive Testing

2.5 Comparing CTT and IRT

2.6 Temporal Stability and Parameter Drift

2.7 Research Questions

CHAPTER III – RESEARCH METHODS

3.1 Study Design

3.2 Setting and Context

3.3 Sample and Data Source

3.4 Data Analyses

3.4.1 Research Question No. 1: Reliability of scores with CTT and IRT

3.4.1.1 Research Question No. 1A: Item parameters with CTT

3.4.1.2 Research Question No. 1B: Item parameters with IRT

3.4.1.2.1 Two-Parameter Logistic Model of Item Response Theory

3.4.1.2.2 Item Analysis

3.4.1.2.3 Item Difficulty

3.4.1.2.4 Item Discrimination

3.4.1.3 Research Question No. 1C: Comparability of item parameters with CTT and IRT

3.4.1.4 Research Question No. 1D: Reliability index of test scores

3.4.1.5 Research Question No. 1E: Item characteristic curves

3.4.2 Research Question No. 2: Temporal stability of items

3.4.2.1 Research Question No. 2A: Item stability with CTT

3.4.2.1.1 Repeated Measures ANOVA

3.4.2.1.2 Effect Sizes

3.4.2.2 Research Question No. 2B: Item stability with IRT

3.4.2.2.1 Test Characteristic Curve

3.5 Summary of Analyses

3.6 Ethics

CHAPTER IV – RESULTS

4.1 Overview

4.2 Descriptive Analysis

4.3 Results of Research Question No. 1: Reliability of scores with CTT and IRT

4.3.1 Results of Research Question No. 1A: Item parameters with CTT

4.3.2 Results of Research Question No. 1B: Item parameters with IRT

4.3.3 Results of Research Question No. 1C: Item analysis with CTT and IRT

4.3.4 Results of Research Question No. 1D: Reliability index of items

4.3.5 Results of Research Question No. 1E: Item characteristic curves

4.4 Results of Research Question No. 2: Temporal stability of items

4.4.1 Results of Research Question No. 2A: Item stability using CTT

4.4.1.1 Repeated Measures ANOVA CTT

4.4.1.2 Correlation Coefficients CTT

4.4.1.3 Scatter Plots CTT

4.4.2 Results of Research Question No. 2B: TCC for Item stability using IRT

4.4.1.1 Repeated Measures ANOVA IRT

4.4.1.2 Correlation Coefficients IRT

4.4.1.3 Scatter Plots IRT

4.4.1.4 TCCs

CHAPTER V – DISCUSSION

5.1 Research Question No. 1: Reliability of scores using CTT and IRT

5.1.1 Research Question No. 1A: Item parameters with CTT

5.1.2 Research Question No. 1B: Item parameters with IRT

5.1.3 Research Question No. 1C: Item analysis with CTT and IRT

5.1.4 Research Question No. 1D: Reliability index of test scores

5.1.5 Research Question No. 1E: Item characteristic curves

5.2 Research Question No. 2: Temporal stability of items

5.2.1 Research Question No. 2A: Item stability using CTT

5.2.2 Research Question No. 2B: Item stability using IRT

5.3 Implications and Future Directions for Research

5.4 Limitations of the Study

5.5 Conclusion

5.6 Recommendations

REFERENCES

APPENDIX A: COURSE 3

APPENDIX B: COURSE 6

APPENDIX C: ICCS OF COURSE 1

List of Tables

Table 1: Features of Classical Test and Item Response Theory

Table 2: Item Distribution for Individual Year and Course

Table 3: Methods Summary

Table 4: Distribution of MCQs According to Type of Skill

Table 5: Number of Examinees across Courses and Years

Table 6: Content of 30 Items Course 1 Classified by Clinical Presentation and Skills

Table 7: Content of 30 Items Course 3 Classified by Clinical Presentation and Skills

Table 8: Content of 30 Items Course 6 Classified by Clinical Presentation and Skills

Table 9: Descriptive Statistics of Item Parameters for Course 1

Table 10: Descriptive Statistics of Item Parameters for Course 3

Table 11: Descriptive Statistics of Item Parameters for Course 6

Table 12: Item Difficulty (p) and Point Biserial (p-bis) Correl of Course 1 Using CTT

Table 13: Difficulty (b) and Discrimination (a) Indices of Course 1 Using IRT

Table 14: Correl Coefficients of Difficulty Index Between CTT and IRT for Course 1

Table 15: Correl Coeff of p-bis and Discrim Index B/W CTT and IRT for Course 1

Table 16: SE and Reliability Index (Alpha w/o) Course 1 Year 1

Table 17: SE and Reliability Index (Alpha w/o) Course 1 Year 2

Table 18a: SE and Reliability Index (Alpha w/o) Course 1 Year 3

Table 18b: Cronbach’s Alpha for Course 1, 2 and 3 Using CTT and IRT

Table 19: Repeated Measures ANOVA to Determine the Effect of Time on the Item Difficulty Index for Course 1 Using CTT

Table 20: Repeated Measures ANOVA to Determine the Effect of Time on the Item Discrimination Index for Course 1 Using CTT

Discrimination Index for Course 3 Using CTT

Discrimination Index for Course 6 Using CTT

Table 25: Correlation Coefficient of Difficulty Index of Year 1, 2, 3 for CTT

Table 26: Correlation Coefficient of Discrimination Index of Year 1, 2, 3 for CTT

Table 27: Repeated Measures ANOVA to Determine the Effect of Time on the b Parameter for Course 1 Using IRT

Table 28: Repeated Measures ANOVA to Determine the Effect of Time on the a Parameter for Course 1 Using IRT

Course 3 Using IRT

Course 3 Using IRT

Course 6 Using IRT

Table 33: Correlation Coefficient of Difficulty Index of Year 1, 2, 3 for IRT

Table 34: Correlation Coefficient of Discrimination Index of Year 1, 2, 3 for IRT

Table 35: App A1: Item Diff (p) and p-bis Correl of Course 3 Using CTT

Table 36: App A2: Diff (b) and Discrim (a) Indices of Course 3 Using IRT

Table 37: App A3: Correl Coeff of Difficulty Index b/w CTT and IRT for Course 3

Table 38: App A4: Correl Coeff of p-bis and Discrim b/w CTT and IRT for Course 3

Table 39: App A5: SE and Reliability Index (Alpha w/o) Course 3 Year 1

Table 42: App A9: Correl Coeff of Difficulty Index of CTT for Course 3 Year 1, 2, 3

Table 43: App A10: Correl Coeff of Difficulty Index of IRT for Course 3 Year 1, 2, 3

Table 44: App A11: Correl Coeff of Discrim Index of CTT for Course 3 Year 1, 2, 3

Table 45: App A12: Correl Coeff of Discrim Index of IRT for Course 3 Year 1, 2, 3

Table 46: App B1: Item Diff (p) and p-bis Correlation of Course 6 Using CTT

Table 47: App B2: Difficulty (b) and Discrim (a) Indices of Course 6 Using IRT

Table 48: App B3: Correl Coeff of Difficulty Index b/w CTT and IRT for Course 6

Table 49: App B4: Correl Coeff of p-bis and Discrim b/w CTT and IRT for Course 6

Table 50: App B5: SE and Reliability Index (Alpha w/o) Course 6 Year 1

Table 53: App B9: Correl Coeff of Difficulty Index of CTT for Course 6 Year 1, 2, 3

Table 54: App B10: Correl Coeff of Diff Index of IRT for Course 6 Year 1, 2, 3

Table 55: App B11: Correl Coeff of Discrim Index of CTT for Course 6 Year 1, 2, 3

Table 56: App B12: Correl Coeff of Discrim Index of IRT for Course 6 Year 1, 2, 3

List of Figures

Figure 1: b Parameter on Item Characteristic Curve

Figure 2: a Parameter on Item Characteristic Curve

Figure 3: c Parameter on Item Characteristic Curve

Figure 4: Test Characteristic Curve

Figure 5: Causes and Pathophysiology of Hypertension

Figure 6: ICCs for Course 1

Figure 7: Scatter Plot of Item Difficulty for Course 1 with CTT Year 1 and 2

Figure 8: Scatter Plot of Item Difficulty for Course 1 with CTT Year 2 and 3

Figure 9: Scatter Plot of Item Difficulty for Course 1 with CTT Year 3 and 1

Figure 10: Scatter Plot of p-bis for Course 1 with CTT Year 1 and 2

Figure 13: Item Difficulty for Course 1 with IRT Year 1 and 2

Figure 14: Item Difficulty for Course 1 with IRT Year 2 and 3

Figure 15: Item Difficulty for Course 1 with IRT Year 3 and 1

Figure 16: Item Discrimination for Course 1 with IRT Year 1 and 2

Figure 17: Item Discrimination for Course 1 with IRT Year 2 and 1

Figure 18: Item Discrimination for Course 1 with IRT Year 3 and 1

Figure 19: Test Characteristic Curve for Course 1, Year 1

Figure 20: Test Characteristic Curve for Course 1, Year 2

Figure 21: Test Characteristic Curve for Course 1, Year 3

Figure 22: App A8: Item Characteristic Curves for Course 3 for Year 1, 2, 3

Figure 23: App A13: Scatter Plots for Item Difficulty Using CTT for Course 3

Figure 24: App A14: Scatter Plots of Item Difficulty Using IRT for Course 3

Figure 25: App A15: Scatter Plots of Item Discrim (p-bis) Using CTT for Course 3

Figure 26: App A16: Scatter Plots of Item Discrim Using IRT for Course 3

Figure 27: App A17: Test Characteristic Curves for Course 3

Figure 28: App B8: Item Characteristic Curves for Course 6 for Year 1, 2, 3

Figure 29: App B13: Scatter Plots for Item Difficulty Using CTT for Course 6

Figure 30: App B14: Scatter Plots of Item Difficulty Using IRT for Course 6

Figure 31: App B15: Scatter Plots of Item Discrim (p-bis) Using CTT for Course 6

Figure 32: App B16: Scatter Plots of Item Discrimination Using IRT for Course 6

Figure 33: App B17: Test Characteristic Curves for Course 6

List of Abbreviations

a Item Discrimination Index in Item Response Theory

b Item Difficulty Index in Item Response Theory

c Guessing Parameter in Item Response Theory

CAT Computerized Adaptive Testing

CHREB Conjoint Health Research Ethics Board

CTT Classical Test Theory

CVS Cardiovascular

D Item Discrimination in Classical Test Theory

DIF Differential Item Functioning

GIT Gastroenterology

ICC Item Characteristic Curve

IRT Item Response Theory

MCQs Multiple Choice Questions

ANOVA Analysis of Variance

OSCE Objective Structured Clinical Exams

p Item Difficulty in Classical Test Theory

p-bis Point Biserial Correlation

1, 2, 3 PL Model One, Two, Three Parameter Logistic Model

r Correlation Coefficient

SBA Single Best Answer

SEM Standard Error of Measure

TCC Test Characteristic Curve

UGME Undergraduate Medical Education

Epigraph

Knowledge

“Its head is humility, its eye freedom from envy, its ear understanding, its tongue the

truth, its memory research, its heart good intention”.

Ali Ibne Abi Talib (596-661 AD)

CHAPTER 1: INTRODUCTION

1.1 Overview

For any medical training programme to achieve its learning outcomes, it should be designed so that

the graduates acquire the knowledge, behaviour and skills necessary to practice evidence-based

medicine.1, 2 Assessment is an important link in the curricular process and drives learning—by way of its

content, timing, format and subsequent feedback.2 It helps evaluate competencies and identify curricular

deficiencies.3 Furthermore, the effectiveness of instructional skills can be established by the type of

assessment used to assess students’ level of understanding. Recent times have seen the implementation of

numerous changes to assessment of medical undergraduates and graduate students.4-7 In addition to issues

of reliability and validity, elements like educational effect and catalytic effects of assessment have been

highlighted.8 Furthermore, the choice of tools of assessment has been under scrutiny and the utility of one

over the other has been the subject of recent research.7 Multiple choice questions (MCQs) are

commonly used at both the undergraduate and graduate levels in medical education, and concerns about

their stability and security are frequently raised and need to be addressed. This research was carried

out to explore the reliability and stability of MCQs over time.

1.1.1 Types of Assessment

Assessment can be either formative or summative in nature. Formative assessment is defined

as the process of providing individually tailored doses of feedback to students on their

performance in a concrete, effective way.9 It is carried out during the various phases of a

program. Formative assessment can be informal or formal.10 When informal, it can take place in

the course of events during learning and is not necessarily stipulated explicitly within the

curriculum. Formal types of formative assessment, on the contrary, are part of pre-designed

curricular objectives and are provided by the academic staff or the supervisor of the placement

activity within a collaborating organisation at pre-defined intervals.10

Summative assessment, unlike the formative type, comprises the assessment of

students after units, mid-terms and courses.11 It is geared more towards the final outcome.

Summative assessments are high-stakes and require more effort in the development of the exam

and its quality control. Whereas formative assessment is assessment for learning, summative

assessment is directed towards assessment of learning.4

1.1.2 Importance of Formative and Summative Assessments

Current research in assessment has highlighted the vital role of formative assessment in

providing self-motivation and future direction in learning.12 Moreover, aptly conducted

formative assessments aid the learner in setting more advanced goals by providing continued

guidance.10 Formative assessment is important because it lets the instructors know how the

students are progressing and where they need more attention. This helps in making important

adjustments in instructions or arranging more opportunities for learning by practice. These

activities then lead to an improvement in a student’s success. Furthermore, students are able to

identify any gaps that exist between their desired goals and their present knowledge and

competencies. They can then carry out actions necessary to reach their goals.

Summative assessments are vital for reporting on achievements at certain intervals. As stated

earlier, they are high stakes since they are used for certification purposes, both for graduation

and for higher training.3 Their choice is also influenced by the stakeholders’ demands, which

include the public in addition to accreditation and licensing bodies.13 Summative assessments

utilize a number of tools for gathering information about what has been learned by the students.

They are valuable because they provide critical information about the overall learning of the

students as well as an indication of the quality of instruction. They can be carried out in a

number of ways which include end of unit tests or projects, course grades and portfolios. At the

student level, these tools reflect the level of their performance and overall expectations for a

particular course. At the program level, they provide information about the objectives of the

program being achieved by the students. It is useful to create summative assessments prior to

instruction as it helps in identifying the content and process of learning leading to desired

outcomes. Summative assessment can, thus, serve as a guide for giving directions for the

curriculum and instruction.

Recent trends have seen a shift towards competence-based assessments which require

frequent testing of students. Furthermore, the emphasis is now being placed on continuous formative

assessment rather than end of academic year summative assessments.4 Schuwirth and Ash have

also recommended combining the formative and summative functions to inform and guide

student learning.7 Because frequent testing means that item banks are used repeatedly, there is a concern that the

psychometric properties of items may be affected. This needs exploring, as the

repetition of items potentially influences their stability over time. Neither scoring method

is immune to such influences, and both therefore require

exploration in the context of their usability for measuring the effectiveness of the MCQ items.

1.1.3 Tools of Assessment

A number of tools are available for the assessment of different aspects of clinical

competence. According to Miller’s Pyramid of Clinical Competence,14 assessments should be

designed keeping in mind the domains of knows, knows how, shows how and does.15

Structured oral exams are commonly used for assessing the knowledge and understanding of

concepts which form the basis for the knows and knows how tiers of Miller’s Pyramid.16 A

clinical scenario is presented and the candidate is then asked to elaborate on principles of

differential diagnosis, investigations and management. He/she may also be asked to comment on

certain tests or findings. Long and modified essay questions are also used for knowledge testing.

They are written pieces which can be several paragraphs to pages long. They are used to broadly

measure the amount of knowledge retained by the candidates and their ability to use that

knowledge to reason through clinical problems.17 Multiple choice questions are used to assess

the knows and knows how domains of Miller’s Pyramid. Single best answer, multiple best answer, true/false and

extended matching are the different types of MCQs that are used for assessing the students’

knowledge, comprehension and application ability. MCQs have been criticized for being poorly

linked to professional reality and for testing only trivial knowledge.18 One of the objections is

that students are required only to recognize the correct answer from a list of options or eliminate the

incorrect ones. Thus, a student’s free-writing ability, as demonstrated in

an essay, cannot be assessed. MCQs are now mostly constructed in the form of clinical

vignettes so that they can assess deeper knowledge along with the student’s comprehension and

application of that knowledge. The single best answer type of MCQ consists of a

statement followed by a set of answers. The examinee has to select the single most appropriate

answer for the main statement. This process comprises recall of the knowledge, comprehension

of the problem and application of the knowledge to that problem. MCQs are hence able to test

factual recall along with an assessment of the approach of an examinee to a clinically oriented

scenario. It is possible to structure MCQs in such a way that they can test higher order skills and

levels of cognition such as analysis and synthesis. This is especially true for the single best

answer type of MCQs. Case and Swanson19 have shown that well-constructed MCQs can assess

taxonomically higher cognitive processes in addition to just assessing factual knowledge.

Assessment tools available for the shows how domain of Miller’s Pyramid of Clinical

Competence mainly include objective structured clinical examination (OSCE), simulation and

bedside examinations in the form of long and short cases.20-22 OSCE and simulation are more

widely accepted and popular due to better reliability and use of standardized patients.23 Various

modalities are at the assessors’ disposal for evaluating the does domain of Miller’s Pyramid. Of

note amongst these are mini clinical evaluation exercises (Mini-CEX), direct observation of

procedural skills (DOPS), checklists and rating scales, 360-degree multisource feedback (MSF),

portfolios and log books.5 The application of Mini-CEXs and DOPS is widespread as learners

are given feedback on workplace-based performance promptly which in turn helps formulate

remedial measures quickly and accurately if warranted. Feedback is also collected from

colleagues in the form of 360 MSF24, 25 while portfolios and log books allow for personal

reflection to develop and improve professional practice.26 Recent research has recommended

that a variety of assessment modalities should be employed to reach a reliable summative

decision using a feasible number of workplace-based assessments.27, 28

1.1.4 Multiple Choice Questions

Multiple choice questions are widely used for both formative and summative assessments in

undergraduate and graduate medical education.17, 29-33 They are particularly useful in summative

exams because of their ability to assess a large amount of knowledge in a relatively short time34

and contextualization with a clinical vignette and scenario.17 Computerized marking of large sets

of questions also tends to make them widely acceptable. Although MCQs with desirable

reliability are difficult to construct, once constructed, they may be repeated over time without

affecting their reliability. Wass et al. have reported a reliability of >0.9 for a four-hour test,

which is above the desirable level.35 Norcini et al. reported a slightly lower coefficient of 0.88 for

shorter tests that were 90 items long.36 Well-constructed MCQs are useful for summative

assessments because taxonomically higher-order cognitive processes of interpretation, synthesis

and application are assessed adequately, in addition to recall of isolated facts.
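
Reliability coefficients of this kind are often reported as internal-consistency estimates such as Cronbach's alpha, the coefficient used for the test scores in this thesis. The following is a minimal sketch of that calculation for dichotomously scored MCQ responses. It is an illustration only: the analyses in this research were run in SPSS and Xcalibre 4.2, and the function name `cronbach_alpha` and the toy response matrix are assumptions introduced here, not data from the study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of examinee rows of item scores (e.g. 0/1)."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two examinees")
    # Sample variance of each item (column) across examinees.
    item_variances = [
        variance([row[j] for row in responses]) for j in range(n_items)
    ]
    # Sample variance of the total test score across examinees.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical toy data: 4 examinees (rows) x 3 items (columns), 1 = correct.
toy_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(toy_responses)
print(round(alpha, 2))
```

For the toy matrix the item variances sum to half of the total-score variance, so alpha works out to 0.75. Because alpha depends on the number of items as well as on how strongly the items covary, longer tests tend to yield higher coefficients, all else being equal.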

1.2 Problem Statement

The evaluation of the MD certifying exams at the University of Calgary is important not only

to the trainees but also to the faculty and the administration. This research is expected to help

elucidate the interplay of the three main components of the educational process--curriculum,

teaching and evaluation. An insight into the changes warranted in the use of MCQs can help

improve the efficacy of the program in turn. The aim of this research was to

assess the reliability of scores by applying and comparing two methods of analysis, i.e., classical test

theory (CTT) and item response theory (IRT), to MCQ items administered three times over a

six-year period. IRT is a body of theory that describes the application of mathematical models to

data from questionnaires and tests as a basis for measuring abilities, attitudes and other

variables.31 It may be used for the development of assessments and their statistical analyses by

studying the stability of difficulty and discrimination indices of items over time. IRT has been

applied to item-level statistics for MCQs,31, 37 and further research will help explore their reliability

and stability, especially in the setting of high-stakes exams. Item response theory offers the

promise of solving many problems that are faced by psychometricians in medical education. The

major problems that have hindered the widespread use of IRT in the past have now been

overcome to a great extent. With the advent of more sophisticated computer software, it is now

emerging as a favoured method of measurement.38, 39 Establishing whether the use of MCQs is

the right choice for assessing a particular facet of knowledge would assist in providing activities

which facilitate linking theory with practice, exercising the skills of thinking in a practical

context and gaining personal insight including career preferences. It would, furthermore,

facilitate effective delivery of curriculum by rendering it relevant and applicable to the practice

of medicine.

1.3 Significance of the Research

This study assessed the reliability of scores using and comparing two methods of analysis,

i.e., CTT and IRT. CTT forms a vital part of the basis of measurement theory. The underlying

assumption in CTT is that the test score is made up of two components, true score and error

score. This assumption allows for the statistical analyses to be carried out in the form of test and

item analysis. IRT uses 1, 2 or 3 parameter models for item analysis. Two parameter logistic

models have been applied to MCQs in psychology40, 41 and medical education.31, 42, 43 The two

parameter model estimates student performance on a test with differences in item difficulty and

discrimination. Hence, this model includes more information about the items than CTT. IRT is

deemed to be a superior measurement theory in comparison with CTT.44 This is due to its

characteristic of analyzing item level statistics that are sample independent. In this research, the

2PL model was used based on the premise that the examinee sample along with the examiners over a

period of three years belongs to groups with similar characteristics. This study will highlight the

similarities and differences between CTT and IRT in the context of item analysis with directions

and suggestions for changes at both the individual and program levels. At the individual level, it

may help program directors not only to evaluate whether or not the students have met the

standards, but also how fast they are approaching the standards. At the program level, it may

provide data that will help to evaluate the effectiveness of each program. Most importantly, this

will help these schools to view their program as an integrated system so that the knowledge

training and skill training can be balanced, and the link between training at different levels can

be reinforced. It is hoped that this research may also provide more robust evidence of the

psychometrics of MCQs, thus identifying areas of improvement in both the formative and

summative exams.

1.4 Purpose of Research

The main purpose of this research was to use University of Calgary summative examination

data from MCQ exams that were held for three courses over a three-year period. This research

addressed the following questions:

Research Question No. 1

What was the reliability of scores using and comparing two methods of analysis, i.e., CTT and

IRT, on MCQ items administered three times over a six year period?

Research Question No. 2

Do the items exhibit temporal stability when repeated over Year 1, 2 and 3?

The remainder of this thesis is organized as follows. Chapter 2 comprises a literature review related to

the use of MCQs as an assessment tool and research in the context of CTT and 2PL IRT. Chapter

3 describes the methods used for the research, including the data collection techniques and

analyses, and Chapter 4 provides the results. Chapter 5 concludes this research. It

summarizes the research findings, situates the findings within the broader literature, describes the

limitations of the study, identifies future research directions and states some recommendations

for future application.

CHAPTER II – LITERATURE REVIEW

This chapter encompasses the following six sections: 1) a discussion on MCQs as a tool for

summative assessment in medical education, 2) CTT and its assumptions, features and concepts,

3) IRT and its assumptions, features and concepts, 4) comparison of CTT and IRT, 5) temporal

stability of MCQs, and 6) research questions.

2.1 Multiple Choice Questions for Summative Assessments

Multiple choice questions were first utilized in the field of medicine in the 1950s 45 and since

then have been used increasingly. They are used for both formative and summative assessments

to test the acquisition of knowledge and understanding across the curriculum. Many types of

MCQs are described in the literature. For the purpose of this research, the A type was taken into

consideration. The A type of MCQs are characterised by an opening statement followed by a

lead-in.46 There are usually about four to five options provided to choose the correct one from. In

addition to these types, other MCQ formats used in medicine are the true-false and the extended

matching types.1

MCQs have undergone continuous scrutiny since their inception. The major concern remains

the scope of what they can be used to assess. If well constructed, they can be used to assess the

first three to four levels of cognitive domain of Miller’s Pyramid47 and also discriminate between

more and less able students. Research shows that testing of knowledge is the most accurate

method of evaluating expertise. It is, hence, understandable that a lot of time and attention is put

1 http://www.nbme.org/publications/item-writing-manual-download.html

into constructing psychometrically sound MCQs that are deemed capable of doing that. In the

past, MCQ items have been blamed for testing only recall memory.48 Indeed, many consider

them to be poorly suited for testing students in high-stakes exams requiring problem-solving and

a self-directed approach.49 MCQs can competently evaluate knowledge, comprehension,

application and analysis levels by putting up questions that require the student to recognize

problems or discrepancies and infer their causes and devise solutions. Such MCQs are capable of

challenging the analytical skills of students.50 Multiple-choice tests are very well suited to

sampling many diverse test items. They can be administered to a large group of students since

they are easy to mark by computerized optical scanners. They can, in addition, be used for

testing a wide variety of course material.51 MCQs provide objective evaluation of performance as

they have the capacity to overcome the subjectivity that may exist in the assessment by essays

and oral examinations.52 They can motivate students positively and can assist students in

monitoring and affirming their own learning.53

There is little doubt that if the MCQ items are flawed, student scores are affected. This has

been reported by Downing who found that the test scores improved by 10-25% on removing

items that were noted to be technically flawed after an item analysis.54 MCQs assessing the lower

levels of cognition are found to be more flawed than the ones made for the assessment of higher

cognition. It may be because of the fact that those made to evaluate higher levels of cognition are

constructed in a longer period of time with more attention than the simpler ones. It is, therefore,

important to analyze the item indices, their reliability and stability in high stakes summative

exams.

This research used both CTT and IRT for item analyses to highlight the similarities and

differences using both. In addition, reliability of the test items and their stability across the years

was also taken into account. CTT and IRT are widely understood to be two extremely different

frameworks despite the fact that ample literature exists that examines the similarities and

differences in the estimation of item parameters using both the frameworks. A discussion on the

two methods of measurement, i.e., CTT and IRT, follows.

2.2 Classical Test Theory

CTT was founded by Charles Spearman55 in 1904 and it comprises three components. They

are the observed score, true score and random error. Mathematically, it is depicted as:

X = T + E

where X represents the observed score of a student on any test, T is the expected value of the

observed score received on several such tests of equal difficulty when run an infinite number of

times and E is the difference between X and T and is related to the standard error.

An important concept in CTT is that of standard error of measurement (SEM). It is the

standard deviation of errors of measurement that are associated with test scores from a particular

group of examinees.56 It can also be thought of as the determination of the amount of variation or

spread in the measurement errors for a test. From the equation stated above, i.e., X=T + E, it is

known that a person’s true score equals the average of his or her observed scores, hence

accounting for measurement error associated with a test. Because it is not possible to know the

measurement error, all standardized tests have an associated SEM. The SEM is expressed in

standard deviation units. It is inversely related to the reliability of a test. Hence, the smaller the

SEM, the higher the reliability and more precise the scores obtained. The error in CTT is always

assumed to be random and non-systematic. It can be attributed to several factors external or

internal to the examinee. Examples of external errors include ones attributable to test items that

might have been created poorly or those associated with inadequate testing conditions. Internal

errors can result from conditions internal to the examinee; for example, lack of concentration, fatigue and

stress may contribute to the random error in CTT.

Another concept associated with SEM is that of the confidence interval, which is a range of

values expected to contain a population parameter with a stated probability.56 Any confidence level

can be chosen, but the most common ones are 95% and 99%. The interval is defined by a lower and

an upper bound. When the SEM is used to build a 95% confidence interval around an observed score,

the interval refers to the range of values within which the examinee's true score is expected to lie

95% of the time.
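The link between SEM, reliability and a confidence band can be illustrated with a short sketch. It assumes the commonly used textbook relationship SEM = SD × sqrt(1 − reliability), which is not stated in the text above, and a normal-theory multiplier of 1.96 for a 95% band. The function names and the numbers are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability) (assumed textbook formula)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)

def score_band(observed, sem, z=1.96):
    """Approximate band observed +/- z * SEM (z = 1.96 for a 95% band)."""
    return observed - z * sem, observed + z * sem

# Hypothetical test: score SD of 10 points and a reliability of 0.91.
sem = standard_error_of_measurement(sd=10.0, reliability=0.91)
low, high = score_band(observed=70.0, sem=sem)
print(f"SEM = {sem:.2f}, 95% band = ({low:.2f}, {high:.2f})")
```

Under these assumptions, a higher reliability yields a smaller SEM and therefore a narrower band around the observed score.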

Classical test theory deals with both item and test level statistics.55 At the item level, it deals

with item difficulty and discrimination. The item difficulty index is depicted by p and it indicates

the proportion of the students who have answered the item correctly. The item discrimination

index is indicated by D and it informs the extent to which the item differentiates between the

high-ability and the low-ability students. At the test level, CTT deals with the reliability of test

scores, which is defined in terms of parallel tests.57, 58 Two tests are said to be parallel if they measure

the same latent ability and the examinees have the same true scores and error variances on both tests. Parallel tests require

the generation of a large set of items that represent a single content domain. It is recommended

that at minimum, the number of items in this set should be twice the planned size of a single test

form.59 In other words, it should be large enough to establish that the content domain is well

represented.

2.2.1 Assumptions of Classical Test Theory

Some fundamental assumptions have to be made for the estimation of the true score of an

examinee using CTT since both the true and the error scores are unknown. Classical test theory

assumes that the observed score is composed of a true score and a random error arising from the

measurement instrument.59 In addition, variability of the test score and examinee conditions also

contribute to these errors. If the same examinee takes the same exam an infinite number of times

(without the effects of any learning taking place), the average of the errors will approach zero, and the

mean observed score will be equal to the true score. The following four assumptions are implicit within CTT:

1. The observed score of a person is composed of the true score and random error

2. The expected value of any observed score is the person’s true score

3. The covariance of error components from two tests is zero in the population (i.e., errors

from two tests are uncorrelated)

4. Errors in one test are uncorrelated with true scores in another (i.e., measurement errors

are not dependent on traits)
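In symbols, the four assumptions listed above are often written as follows. The subscripts 1 and 2, which index two different tests, and the expectation and covariance operators are notation introduced here for illustration.

```latex
\begin{align}
X &= T + E \\
\mathbb{E}(X) &= T \\
\operatorname{Cov}(E_1, E_2) &= 0 \\
\operatorname{Cov}(E_1, T_2) &= 0
\end{align}
```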

It is important to note that the focus in CTT is on the test score rather than the item score as it

relates the test score to true score rather than the item score to true score.

2.2.2 Item Analysis with Classical Test Theory

It is vital that there is a match between what is taught and what is assessed. There should be a

variety of items in any exam testing both the basic and advanced knowledge. If the items are too

difficult, they lead to examinee frustration due to low scores. If they are too easy, inflation in

scores leads to a false sense of confidence and a decline in examinee motivation.46 Item

improvement is also important as it leads to the development of an item bank that can be reused

over time. For this purpose, item analysis is carried out. Item analysis may be defined as a

method used to evaluate test items, typically for the purpose of test construction and revision.60 It

is a technique available for the improvement of items used in assessments.

The advantage of item analysis is that it helps identify biased or unfair items.42 Another

advantage of item analysis is that it can identify poorly worded and miskeyed questions. Results

of item analysis, once it has been carried out, are then used to refine the item of interest. Items

that are found to be more difficult identify a concept that needs revising. If a distracter is found

to be the most chosen answer, then the item must be re-examined for its correctness. Item

analysis also helps improve the quality of items by observing the reliability of test scores and

although some literature on measurement discusses reliability as somewhat distinct from item

analysis, item characteristics play a vital role in reliability estimation by both CTT and IRT.61

Item difficulty and discrimination are the two components of item analysis which are helpful in

establishing the reliability of test scores. These components are discussed later.

2.2.2.1 Reliability of Test Scores in the Context of CTT

Norcini et al 8 have described seven components or criteria of a good assessment tool. They

are (1) validity or coherence, (2) reproducibility or consistency, (3) equivalence, (4) feasibility,

(5) educational effect, (6) catalytic effect, and (7) acceptability. Reproducibility or consistency is

the extent to which students’ scores in context of time, sampling and factors related to test

administration are reproducible and consistent from one assessment to the next and from one

item to another.62 It is expressed numerically as a coefficient called the reliability coefficient.63

Any value around 0.8 and above is deemed good to excellent in the context of MCQs.64

Reliability estimates the amount of random measurement error in assessments and is

differentiated into several types.63, 65 Test-retest reliability measures the stability of score over

time. Equivalent-form reliability is the degree to which two similar tests administered at the same

time or shortly thereafter produce similar scores from a single group of test takers. Internal

consistency reliability is the extent to which items in a single test are consistent amongst

themselves and with the test as a whole. It can be split-half reliability (which is appropriate for

very long or difficult-to-administer tests), Kuder-Richardson reliability (or KR-20 which can

only be used on dichotomously-scored items like in the selected-response tests) and Cronbach’s

alpha.57 Rater reliability investigates the error attributable to individuals who score the test. It

can be inter-rater, which concerns the consistency of two or more independent scorers scoring the

same participant in the same context, or intra-rater, which concerns the consistency of the

scoring of one rater for the same participant in the same context at two different points in time.

The concept of alpha was developed by Lee Cronbach in 1951.57 It is commonly used in the

fields of medical education and psychology and provides a measure of the internal consistency of

a test or scale.65-70 It is expressed as a number between 0 and 1 and is useful as it elaborates on

the extent to which all the items in a given test are utilized to measure a similar construct or

concept. If the items in a test are found to be highly correlated with each other, the alpha

coefficient increases. It must be kept in mind that correlation is not the only factor affecting the

reliability or the alpha of a test. Test length is another factor that influences Cronbach’s

alpha. Thus, a low value of alpha may be attributable to poor inter-item correlation or to a short

test. It is recommended that items with poor inter-item correlation be either

discarded or revised. A very high value of alpha, on the other hand, may indicate redundant use of

items for a variable in which case again, revision of items is desirable.
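A minimal sketch of how alpha can be computed from an examinee-by-item score matrix is given below. It assumes the standard formula alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), which is not written out in the text above. The function name and the response data are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of examinee rows, each a list of item scores."""
    k = len(scores[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across examinees.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the examinees' total scores.
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)

# Hypothetical dichotomous (0/1) responses: five examinees by four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

For dichotomously scored items such as these, the same calculation corresponds to KR-20.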

2.2.2.2 Item Difficulty

Another concept in item analysis using CTT is that of item difficulty. It refers to the

proportion of examinees who answer an item correctly.59 The item difficulty index is expressed by the

letter “p”. Hence, if an item on a test is answered correctly by 78% of the examinees, the

difficulty index for that item is p = .78. An item is categorized as ‘easy’ if a higher percentage of

people answer it correctly. For example, if another item is answered correctly by only 45% of the

class, this item is said to be more difficult than the previous one where 78% of the examinees got

it right. In other words, the higher the percentage of people who answer an item correctly, the

easier is the item.

There are several factors that have to be considered while establishing appropriate levels of

difficulty.60 The first factor that influences the item difficulty is the probability of answering an

item by chance or guessing. In a true-false type of item, there is always a fifty percent chance of

getting the answer right as there are only two choices. This means that a difficulty index of

p = .50 on such an item can be reached by guessing alone. Examinees are able to

answer such items correctly by guessing only and hence, such an item does not reflect the actual

level of knowledge or ability of the student. In the same way, an MCQ that has five options may

be answered correctly by guessing 20% of the time. Thus, a difficulty index of more than

.20 would be needed for that item to be able to differentiate between students who might be

guessing and those who have a higher degree of ability. A difficulty index between .25 and .75 is

desirable for the item to be able to identify students who have various levels of ability.71
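The difficulty index can be sketched in a few lines. The example assumes responses scored 0 (incorrect) or 1 (correct). The function names and the data are hypothetical, and the .25 to .75 window is the desirable range quoted above.

```python
def item_difficulty(responses):
    """Return p (proportion correct) for each item; rows are examinees."""
    n = len(responses)
    k = len(responses[0])
    return [sum(row[i] for row in responses) / n for i in range(k)]

def within_desirable_range(p, low=0.25, high=0.75):
    """True if p falls inside the desirable difficulty window."""
    return low <= p <= high

# Hypothetical 0/1 responses: four examinees by three items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]
p_values = item_difficulty(responses)
flags = [within_desirable_range(p) for p in p_values]
print(p_values, flags)
```

In this made-up data set the first item is answered correctly by everyone, so it falls outside the window and cannot separate examinees of different ability.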

2.2.2.3 Item Discrimination

Item discrimination is another important element of item analysis. It is expressed as “D”.60 It

determines whether those who did well on the test also did well on a particular item. It is, hence,

able to divide students into low scoring and high scoring groups. It is anticipated that those

students who do well on the test also score highly on a particular item. If an item is answered correctly by a

larger proportion of lower scoring group in comparison to the higher scoring one, it is said to

have negative discrimination. Such an item should either be revised or discarded. Once the two

groups, i.e., low and high performing, have been formed, an item’s discrimination can be

determined.72 It can be calculated as:

D = p_u - p_l

where p_u is the proportion of correct responses for the upper group and p_l is the proportion of

correct responses for the lower group. After the students in the upper one-third and lower

one-third have been identified, the proportion, i.e., percentage passing, is calculated for both the

groups on each item. Then, the p of the lower performing group is subtracted from the p of the top

performing group to yield an item discrimination index. The item discrimination index ranges from

-1 to +1. Past research has given the following four guidelines for interpreting item

discrimination:73

1. If D ≥ .40, no item revision necessary

2. If .30 ≤ D ≤ .39, little to no item revision is needed

3. If .20 ≤ D ≤ .29, item revision is necessary

4. If D ≤ .19, the item should be either completely revised or eliminated
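The calculation of D and the four guideline bands can be sketched as follows. The sketch assumes 0/1 scored responses and forms the upper and lower thirds by ranking examinees on total score; tied totals are kept in input order, which is an arbitrary choice made here. Because the guidelines are stated to two decimals, the cut-offs are implemented as D ≥ .40, D ≥ .30 and D ≥ .20. The function names and the data are hypothetical.

```python
def discrimination_index(responses, item):
    """D = p_u - p_l for one item, using upper and lower thirds by total score."""
    ranked = sorted(responses, key=sum, reverse=True)  # highest totals first
    n_group = max(1, len(ranked) // 3)                 # size of each third
    upper, lower = ranked[:n_group], ranked[-n_group:]
    p_u = sum(row[item] for row in upper) / n_group
    p_l = sum(row[item] for row in lower) / n_group
    return p_u - p_l

def interpret(d):
    """Map D onto the four guideline bands."""
    if d >= 0.40:
        return "no revision necessary"
    if d >= 0.30:
        return "little to no revision needed"
    if d >= 0.20:
        return "revision necessary"
    return "revise completely or eliminate"

# Hypothetical 0/1 responses: six examinees (rows) by three items (columns).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
d_values = [discrimination_index(responses, i) for i in range(3)]
labels = [interpret(d) for d in d_values]
print(d_values, labels)
```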

Item discrimination is also established by determining the correlation coefficient between the

examinees’ performance on an item and their performance on a test.59 This is reported as the

point-biserial correlation (p-bis) between item score and total test score. It is desirable to have a

positive correlation as that is an indication that students who are answering correctly have a

higher overall score and the ones answering incorrectly have lower overall scores. The items should

be revised or discarded if the coefficient is negative. An item with a coefficient close to 1.0 discriminates more

strongly than one with a coefficient closer to 0.
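The point-biserial coefficient is the Pearson correlation between a 0/1 item score and the total test score, so it can be sketched without any special library. In this illustration the total score includes the item itself (an uncorrected coefficient), which is an assumption made here for simplicity. The function name and the data are hypothetical.

```python
import math

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total test score."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when either score has no variance")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: four examinees, their score on one item and on the whole test.
totals = [4, 3, 2, 1]
r_good = point_biserial([1, 1, 0, 0], totals)  # high scorers answer correctly
r_bad = point_biserial([0, 0, 1, 1], totals)   # low scorers answer correctly
print(round(r_good, 3), round(r_bad, 3))
```

The second item, which is answered correctly only by the low scorers, produces a negative coefficient and would be flagged for revision or removal.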

2.2.3 Advantages of Classical Test Theory

Despite the development of newer measurement methods, CTT has continued to remain

popular with the majority of educators.59, 71 This is because the basic concepts of CTT are easy to

understand. The most commonly documented advantage of CTT is its relatively weak

assumptions. It is possible for a variety of data to be analyzed with the application of CTT due to

these assumptions. Because it is not mathematically strenuous, the concepts are easily grasped by

anyone with basic mathematical knowledge. For the purpose of assessing reliability, Cronbach’s

alpha is used universally. Most of the commonly available statistical packages have the option of

carrying out the analyses under CTT. This makes it more acceptable to psychometricians in the

fields of education and psychology. In addition, instruments designed for CTT-based

measurement easily fit into the underlying models, thus yielding desirable results. A significant

advantage of CTT is that individual items need not be optimal.74 Even if the items relate to an

underlying construct only to an extent, this concern can be overcome by constructing several

items assessing the construct under question. Studies have shown that reliability can be improved

to any desired level by increasing the number of items about a variable on a particular test.51, 75
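The effect of test length on reliability is usually expressed with the Spearman-Brown prophecy formula. The text above does not name a formula, so its use here is an assumption, as is the premise that the added items are parallel to the existing ones. The function names and the numbers are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when the test is lengthened by length_factor."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)

def length_factor_needed(current, target):
    """Factor by which the test must be lengthened to reach the target reliability."""
    return (target * (1.0 - current)) / (current * (1.0 - target))

# Hypothetical test with a reliability of 0.60.
doubled = spearman_brown(0.60, 2)          # predicted reliability at twice the length
factor = length_factor_needed(0.60, 0.75)  # lengthening needed to reach 0.75
print(round(doubled, 3), round(factor, 3))
```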

2.2.4 Limitations of Classical Test Theory

There are certain limitations to CTT despite its common usage. Hambleton76 has pointed out

that the item analysis is very much dependent on the sample of the examinees being assessed as

both item difficulty and discrimination indices are influenced by it. As stated elsewhere, if the

sample comprises examinees with high ability, the difficulty index tends to be higher.77 Other

researchers point out that the scores of examinee ability depend on item difficulty in CTT.78

Hence, if the items are easy, the observed test scores are higher. They are lower if the items are

difficult.

Another limitation of classical test theory that was addressed by Hambleton and

Swaminathan107 is that it assumes that the measurement error is the same for all examinees. The

type of test affects the test score and true score. Thus, the students’ scores become dependent on

the items being administered and even though the ability remains the same, one may have lower

scores on difficult tests and higher on easier. Due to their different levels of ability, scores in

tests depict different amounts of error.

A further limitation of CTT is that, for comparison of the performance of different

examinees, the same or parallel items have to be used.79 This limitation is further aggravated as

parallel forms are difficult to achieve in CTT. Parallel testing is also the basis for test reliability

and because of that, test reliability is also affected by the examinee sample. In one study, the

authors presented evidence that reliability is a useful indicator of the quality of a set of test

scores.80 They concluded that it is dependent on the characteristics of the group of examinees

who take the test.

Another issue with CTT is that it is test-oriented which means that it is difficult to predict the

response of examinees on a test item.60 The CTT model, therefore, does not allow the developers

of a test to foresee the level of accomplishment of an examinee on a particular item.

The most significant limitation of CTT amongst the ones discussed above is that of examinee

and item inter-dependence. Both are influenced by the changes in each other’s characteristics. As

a result, it becomes difficult to compare the examinees taking different tests and items whose

characteristics are generated from different groups of examinees.

2.3 Shift from Classical Test Theory to Item Response Theory

Due to the limitations discussed above, newer methods of measurement continued to be

developed. Since the limitations of CTT were related to group dependence, mismatch between

items and examinee ability, weak assumptions and problems with parallel testing, it was only

understandable that the newer model was aimed at overcoming these limitations.

IRT or latent trait theory, as initially labeled by Lord in his dissertation in the 1950s, seemed

to provide a solution to the shortcomings of CTT.81 Once an alternative model had been

developed, it was, very quickly, followed by various other models focusing on measurement

issues. The main focus of IRT is the item and thus, all statistical analyses are carried out at item

level. This continues to be the main advantage of IRT over CTT. The same concept has been

highlighted by several studies in the fields of education 82-87 and psychology.86, 88-93 This supports

the evidence of the widespread utility of IRT in these fields, medical education being no

exception.31, 42, 94

2.4 Item Response Theory

Continuous changes in educational outcome measures demand the development of newer and

psychometrically sound instruments that produce valid scores including scores with high

reliability. In psychometrics, IRT (also known as latent trait theory or modern mental test

theory) is a body of theory that describes the application of mathematical models to data from

questionnaires and tests as a basis for measuring abilities, attitudes or other variables.95 It is used

for statistical analysis and development of assessments, especially for high stakes exams.

IRT is a statistical model that expresses the relationship between an individual’s response to

an item and the underlying latent variable, also called latent trait or construct. This latent variable

is expressed as theta (θ) and is a continuous unidimensional construct that explains the

covariance among item responses.96 People at higher levels of theta have a higher probability of

responding to an item correctly. The ultimate aim of item response theory is to test people.

Hence, its primary interest is focused on establishing the position of the individual along some

latent dimension. Because of the many educational applications, the latent trait is often called

ability.

2.4.1 Item Response Theory-Then and Now

When Frederic Lord published his doctoral thesis on latent trait theory, educators and

psychometricians were provided with an option to choose between CTT and IRT.97 The fact

that IRT modeled the probability of a response pattern of an examinee as a function of the

person’s ability led to a quick propagation of interest. In 1957, Birnbaum44 published a series of

technical reports followed by George Rasch98, 99 who published his book presenting some more

models for IRT in 1960. Baker added to Birnbaum’s works by comparing logistic and normal

ogive functions in 1961.99 While Lord61 and Novick100 put forward dichotomous models,

polytomous models were proposed by Samejima towards the later end of 1960s.101 By the 1970s

and 80s, Applied Psychological Measurement and The Journal of Educational Measurement

were publishing original studies by Hambleton102 and Wright.103

With the advent of the new century, a surge was noted in the software designed for the

analysis of item data sets. These programs handled both the technical and the computational

aspects of the IRT framework and mainly included BILOG,104 MULTILOG39 and WINSTEPS.105

A recent addition to this list is Xcalibre,106 which has supported more widespread use of IRT

by statisticians rather than exclusively by behavioural scientists and psychometricians.

2.4.2 Basic Concepts of IRT

In contrast to CTT which is based on the theoretical model depicted by X=T+E, IRT employs

a mathematical function. Hambleton and Swaminathan107 stated that the characteristics of IRT are

based on the notion that the relationship between the observed response and the trait in question

has to be specified. Furthermore, it is assumed that the examinee performance can be predicted

from one or more abilities. The ability parameter, also called theta, constitutes one of the

parameters of IRT. Crocker and Algina have also noted that the relationship between the

observed score and ability parameter is the same as the observed score and true score.60 They

have, in addition, highlighted the fact that item parameters, i.e., item difficulty and

discrimination are not dependent on the characteristics of the examinee. Furthermore, the ability

estimates are also independent of the items. It can, thus, be said that the item statistics are

person-free and the ability parameters are item-free.

2.4.3 Assumptions of IRT

IRT models include a set of assumptions about the data to which the model is applied.108

The first assumption common to the IRT models most widely used is that only a single ability is

measured by the items that make up the test. This is the assumption of unidimensionality, i.e., the

covariance among the items can be explained by a single underlying dimension.94 This

assumption is sometimes not met when cognitive, personality and test-taking factors might affect

test performance. A few of these factors are level of motivation, test anxiety, ability to work

quickly and tendency to guess when in doubt about the answers. All these factors are said to

contribute to random error. The unidimensionality of a scale can be evaluated by performing an

item-level factor analysis, designed to evaluate the factor structure.109

A second assumption of IRT models is that the items display local independence.110 This

means that when the abilities influencing test performance are held constant, examinees’

responses to any pair of items are statistically independent. This is technically subsumed under

the unidimensionality assumption and requires that, once the items' relationship to the single underlying

construct being measured is accounted for, there is no additional systematic covariance among

the items.111 In other words, local independence means that if the trait level is held constant,

there should be no association among the item responses. Violation of this assumption may result

in parameter estimates that are different from what they would be if the data were locally

independent.
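For a pair of items, local independence can be written as the statement below. The symbols U_i and U_j for the scored responses to items i and j, and u_i and u_j for their observed values, are notation introduced here for illustration.

```latex
\begin{equation}
P(U_i = u_i,\; U_j = u_j \mid \theta)
  = P(U_i = u_i \mid \theta)\, P(U_j = u_j \mid \theta)
\end{equation}
```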

The third assumption of IRT models is that the response of an examinee to an item can be

modeled mathematically as the item response function.99 Item response function is a

mathematical function that looks at the relationship of the theta with the probability of endorsing

an item. When expressed in the form of a graph, it is called the item characteristic curve

(ICC). These curves are discussed in the coming sections.

2.4.4 Item Characteristic Curve, Item Difficulty and Item Discrimination

A basic concept in IRT is the ICC which is a mathematical expression that relates the

probability of success on an item to the ability measured by the test and the characteristics of the

item.109 It is essentially a non-linear regression on ability of probability of a correct response to a

given item. Ability is also called theta in IRT. The two important properties of an ICC

are the difficulty and discrimination of an item. Item difficulty, also called the “b” parameter, is a

location index whose position is depicted on the theta axis, i.e., the x-axis. The second property is that of

discrimination, also called the “a” parameter. It describes the ability of an item to

differentiate between examinees with abilities below and above the item location. The figures

below show the graphic representation of the ICC.

In an ICC, theta or ability lies on the x axis and the probability of endorsing an item on the y

axis. The item difficulty or parameter b is the point on theta scale θ where a person has a 50%

chance of responding positively to the scale item. Hence, it can be observed that b determines the

threshold of the graph. Indices between 0.25 and 0.75 are recommended as desirable levels of

difficulty in IRT. 112 The location of b is plotted by drawing a vertical line from the point of

inflection, i.e., the change in curvature, to the horizontal axis. In the figure below, the value for b

is 1 for the right most curve, 0 for the middle one and -1 for the left one. The closest equivalent

of b parameter in CTT is p.

Figure 1: b Parameter on Item Characteristic Curves

The difficulty parameter, expressed as “b”, is most central to the concept of ICC. If one

observes the ICCs in Figure 1, one notices that each curve changes from

upward concavity to downward concavity as theta increases. The point at which this change occurs is

set by the b parameter, which determines the position of the curve on the x axis or theta. As an item becomes more

difficult, the curve is shifted from left to right.

The discrimination or parameter “a” describes the strength of an item's discrimination

between people with trait levels (θ) below and above the difficulty. It determines the slope of

the curve. The figure below shows the item slopes produced by different values of the discrimination index.

Figure 2: a Parameter on Item Characteristic Curves

The a parameter is determined by drawing a line tangential to the curve at the b parameter.

The steeper the curve, the more discriminating is the item. In Figure 2, respective values for the a

parameter are 2, 1 and 0.5. The item-total correlation (also called the point-biserial correlation) is the

closest equivalent of item discrimination in CTT. As the a parameter decreases,

the ICC continues to get flatter until there is no change in the probability across the ability

continuum. It is obvious that those items which have very low a values are not useful for

discrimination of different ability levels.

The third parameter in IRT is that of guessing, also called the ‘c’ parameter. It is the lower

asymptote parameter that describes the probability that people with a low level of ability respond correctly to an

item. In Figure 3, it can be seen as the value that the ICC approaches as theta tends to negative

infinity.

Figure 3: c Parameter on Item Characteristic Curve
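The roles of the a, b and c parameters described above can be illustrated with a minimal sketch of a logistic ICC. The function name `icc`, the parameter values and the choice of a scaling constant of 1 are assumptions made for this illustration; they are not taken from Xcalibre or from the analyses in this thesis.

```python
import math

def icc(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under a three-parameter logistic ICC.

    Setting c=0 gives the two-parameter form. The scaling constant is
    taken as 1 here, which is an assumption of this sketch.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Three hypothetical items that differ only in difficulty, as in Figure 1.
for b in (-1.0, 0.0, 1.0):
    print(b, round(icc(0.0, a=1.0, b=b), 3))
```

With c = 0 the probability at theta = b is 0.5, whatever the value of a. When c is greater than 0, the probability at theta = b becomes (1 + c)/2, so the 50% interpretation of b applies to the models without a guessing parameter.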

2.4.5 Test Characteristic Curve

IRT methods are also applicable at the test or scale level, in addition to the item level. The concept

of test characteristic curve (TCC) stems from this ability of IRT.113 TCCs are test level

analogues of ICCs that represent a non-linear regression of overall test score on ability. In other

words, a TCC is created by summing all the ICCs across the ability continuum. The TCC can be

a very useful tool for evaluating the range of measurement and the degree of discrimination at

different points of the ability continuum. In addition, the degree to which the TCC is linear

provides an indication of the extent to which the measure provides interval scale or linear

measurement.112

Figure 4: Test Characteristic Curve

It can be observed in Figure 4 that the ability estimate is plotted on the x-axis as for an ICC

and the true score on the y-axis. A TCC expresses the relationship between the true score and the

ability scale. It can be interpreted in nearly the same terms as an ICC. The slope of the curve is

influenced by how the value of true score is affected by the changes in ability.114 There are some

situations where the TCC can be a nearly straight line over most of the ability scale. Most tests,

however, are expressed by a nonlinear curve. Unlike an ICC, a TCC has no explicit parametric

formula of its own; its shape depends on the number of items and their parameters. Hence, the

curve is best described in verbal terms after visual

observation.
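The statement that a TCC is obtained by summing the ICCs across the ability continuum can be sketched in a few lines. The item parameters below are hypothetical and the two-parameter logistic form with a scaling constant of 1 is an assumption of this example.

```python
import math

def icc_2pl(theta, a, b):
    """Two-parameter logistic ICC (scaling constant taken as 1)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Expected number-correct (true) score at theta: the sum of the ICCs.

    `items` is a list of (a, b) pairs; the values used below are illustrative.
    """
    return sum(icc_2pl(theta, a, b) for a, b in items)

items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(tcc(theta, items), 3))
```

The true score rises monotonically with theta and is bounded by zero and the number of items, which is why the curve resembles a stretched ICC.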

2.4.6 IRT Models

There are three types of models that are commonly used in IRT for dichotomous data.

Depending on the number of parameters being used, they are called the one-, two- and

three-parameter models.115 The three parameters being used for these models are the b, a and c

parameters which are the difficulty, discrimination and guessing parameters.

A one-parameter model is the simplest of the three models.60, 95, 107 This model assumes that

the probability that a student will correctly answer a question is a logistic function of the

difference between the student's ability (θ) and the difficulty of the question (b).116 Another

model that should be mentioned here is the Rasch model which, although it takes the student’s

ability and the difficulty of the question into account, is slightly different from the 1PL model. In

the Rasch model, each individual in the person sample has parameters defined for item

estimation. On the other hand, when the person sample has the parameters defined by a mean and

standard deviation for item estimation, it is called the 1PL model of IRT.2 The two-parameter

model has the same function as presented for the one-parameter model. However, in the

two-parameter model, the item discrimination parameter varies across items, as does the item

difficulty parameter.76 The three-parameter model includes a guessing parameter especially

useful for multiple-choice and true-false testing.
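For reference, the three models are commonly written in the logistic forms below. This is a sketch in widely used notation rather than the exact parameterization of any particular software package: the subscript i indexes items, the scaling constant that some authors include in the exponent is omitted, and the 1PL form is shown with the discrimination fixed at 1.

```latex
\begin{align*}
\text{1PL:}\quad P_i(\theta) &= \frac{1}{1 + e^{-(\theta - b_i)}} \\
\text{2PL:}\quad P_i(\theta) &= \frac{1}{1 + e^{-a_i(\theta - b_i)}} \\
\text{3PL:}\quad P_i(\theta) &= c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}
\end{align*}
```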

2.4.7 Item Analysis with IRT

Item analysis is the process by which the quality of an item in a test and the test as a whole is

assessed on the basis of examinee’s response to that item.72 It is useful because not only does it

help improve items for future use but it also helps eliminate the ones that have poor

characteristics. This process also helps instructors develop content-appropriate tests.113

2 http://www.rasch.org/rmt/rmt193h.htm

IRT analyzes a scale at the item level by calculating item difficulty, discrimination and the

test information function.117 In addition, it calculates the SE for the a and b parameter of each

individual item. It is able to estimate the relationship of an item to the construct being measured.

The former is signified by theta on the ICC and the latter by the slope of the curve.118 This

property of IRT helps decide which items to keep in a test and which ones to remove. Depending

on the purpose of the analysis, the items may be placed close to the cut-off value on theta or be

spread uniformly along the continuum from - ∞ to + ∞.

If the purpose of the instrument is to identify participants either for remedial measures or for

placing them into various groups, the location parameters should ideally be close to the cut-off.

If the aim is to measure the trait at all levels, they should be spread evenly along the continuum. IRT is, thus,

able to create tests that are shorter and more reliable and are aimed at the concerned population

to test the desired content.

It is not possible to fully utilize the potential of IRT models without making sure that the

right model has been chosen for item analysis. IRT investigates how test items function as trait

measures. This is carried out by determining item fit statistics. Item fit is vital because it

identifies the test model that is most effective in retaining the integrity of the collected data. It

locates non-essential dimensions affecting the response to an item along with faulty construction

of items, thus recognizing item issues like miskeying of items or ambiguously worded items.

Another feature of item fit analysis is that it indicates errors that might have occurred in the

calibration phase of developing the test.

Most of the methods used for item fit statistics rely on the chi-square statistic. Examinees

are first rank-ordered according to their estimated theta. They are, then, grouped into categories

which may be fixed or subjectively determined. The proportion of examinees who answer an

item correctly is then calculated which is compared to the predicted proportion based on the item

response function. Xcalibre, the IRT software used in my research, also uses the chi-square fit

statistic as an index of the overall fit of an item with the empirical data to evaluate its statistical

significance.
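The grouping procedure described above can be sketched as follows. This is an illustrative Pearson-type statistic under stated assumptions (equal-sized ability groups, a two-parameter logistic item response function, simulated data); it is not the exact computation performed by Xcalibre, and the function names are hypothetical.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic item response function (scaling constant 1)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def chi_square_fit(theta, responses, a, b, n_groups=10):
    """Pearson-type item fit statistic over ability groups.

    theta     : estimated abilities, one per examinee
    responses : 0/1 scores on a single item, in the same order as theta
    """
    order = np.argsort(theta)  # rank-order examinees by estimated theta
    chi2 = 0.0
    for idx in np.array_split(order, n_groups):
        n = len(idx)
        observed = responses[idx].mean()             # observed proportion correct
        expected = icc_2pl(theta[idx], a, b).mean()  # model-predicted proportion
        chi2 += n * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return chi2

# Simulated data for illustration only.
rng = np.random.default_rng(0)
theta = rng.normal(size=500)
responses = (rng.random(500) < icc_2pl(theta, 1.2, 0.0)).astype(int)
fit_true = chi_square_fit(theta, responses, a=1.2, b=0.0)
fit_wrong = chi_square_fit(theta, responses, a=1.2, b=2.0)
print(fit_true, fit_wrong)
```

A small value indicates that the observed proportions lie close to the item response function, whereas a badly misspecified difficulty, as in the second call, inflates the statistic.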

Research aiming at item analysis has yielded valid and reliable information.42, 72, 119 Chang et

al applied the Rasch model to the data from Taiwanese board certification exam in anesthesia

and found a mean examinee ability that was higher than the mean item difficulty in this written

test.42 The participants were able to answer 78% of the items correctly. Swanson et al

investigated the impact of item format and number of options on the psychometric characteristics

in addition to the response times for multiple-choice questions appearing on Step 2 of the United

States Medical Licensing Examination.120 They concluded that use of the extended-matching

format and smaller numbers of options per item resulted in more efficient use of testing time and

greater score precision per unit of testing time. Other studies conducted by May and Jackson,119

Yan et al121 and Bhakta et al 31 have also explored various aspects of item analysis.

2.4.8 Applications of IRT

IRT has numerous applications in educational measurement and social sciences. It is used for

a number of processes because of its unique features, characteristics and components

summarized above. It is the testing model of choice for many high-stakes exams including GRE,

SAT, TOEFL and PISA.122 It is also used for medical licensing and accreditation exams. These

include the Medical Council of Canada Evaluating Examination and the MCQ component of the

Medical Council of Canada Qualifying Examination Part I.123 It is used for assessing reliability110,

124 and providing validity evidence for various types of exams (item and test information

function),125 test equating,102 test assembly and banking,126 scoring and reporting and for

estimating task difficulty and stringency levels of raters.98 One of its main purposes in the

context of assessment is to evaluate how well a tool of assessment works.127 It allows for the

analysis of more complex methods of assessment than what the CTT offers. Perhaps the most

novel application of IRT is in computer based testing where it has been used extensively.

All of the above-mentioned applications of IRT can be broadly put into one or the other of

the following categories: 1) Item analysis 2) Ability and parameter estimation of items

3) Differential item functioning 4) Computerized adaptive testing. For the purpose of this

research, item analysis was taken into consideration which has been discussed earlier but to get a

broader perspective of the widespread use of IRT, it is vital to briefly touch upon a few of the

other applications of IRT as well. These are discussed below.

2.4.8.1 Ability and Item Parameter Estimation

The probability of a correct response in the item response models depends on the

examinee’s ability and the parameters that characterize these items. Because the actual values of

the item parameters are not known, one of the tasks performed when a test is analyzed under IRT

is to estimate these parameters. The obtained item parameter estimates then provide information

as to the technical properties of the test items. This procedure is called maximum likelihood

estimation.39, 128 In IRT, item parameter estimation is computationally intensive and must be

carried out by computer programs specifically designed for such a task. Early software programs

focused on maximum-likelihood estimation as a mechanism for estimating the item

parameters.129 These programs eventually had to adopt numerous ad hoc constraining

mechanisms to avoid some of the problems associated with “pure” maximum-likelihood

estimation of IRT item parameters. Previous item parameter estimation techniques required

relatively long tests and large samples (i.e., several thousand examinees) in order to obtain

accurate IRT item parameter estimates. With the implementation of the maximum likelihood

technique, reasonable estimates of IRT item parameters can be derived from short tests (e.g., 25

items) and small samples of examinees (e.g., fewer than 1000). IRT adopts explicit models for the

probability of each possible response to a test and hence its alternative name, probabilistic test

theory, may be the more apt one. Any attempt at testing is preceded by a calibration study, i.e.,

the items are given to a sufficient number of test persons whose responses are then used to

estimate the item parameters.130
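To make the idea of maximum likelihood estimation concrete, the sketch below estimates a single examinee's ability when the item parameters are treated as known. A simple grid search over theta is used for transparency; this is an assumption of the example and not the algorithm implemented in Xcalibre or other IRT programs, and the item parameters are hypothetical.

```python
import numpy as np

def log_likelihood(theta, responses, a, b):
    """Log-likelihood of a 0/1 response pattern under the 2PL model."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return np.sum(responses * np.log(p) + (1 - responses) * np.log(1.0 - p))

def estimate_theta(responses, a, b, grid=None):
    """Grid-search maximum likelihood estimate of ability."""
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 801)
    ll = [log_likelihood(t, responses, a, b) for t in grid]
    return float(grid[int(np.argmax(ll))])

# Two hypothetical items of equal discrimination, one easy and one hard.
a = np.array([1.0, 1.0])
b = np.array([-1.0, 1.0])
theta_hat = estimate_theta(np.array([1, 0]), a, b)
print(theta_hat)
```

An examinee who answers the easy item correctly and the hard item incorrectly receives an estimate midway between the two difficulties. A pattern with all items correct has no finite maximum, so the grid search simply returns its upper bound, which is one of the problems that constrained or Bayesian estimation methods are designed to avoid.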

2.4.8.2 Differential Item Functioning

Test equity is a concept that characterizes uniformity in testing of subgroups of a population

with different levels of the same construct under study in participants with various levels of

ability. The way to ensure it is by removing content that is biased, i.e., content that favours

one group over another when the same construct is being measured. The items that create such

bias are said to exhibit differential item functioning (DIF). Such items may need to be removed

from the test or scale to make it more reflective of a person’s true abilities. DIF is a statistical

property that states that examinees with similar abilities have differential probabilities of success

on an item. Such items are responsible for affecting the validity of a test and are a serious threat

to such tests that measure the trait level of participants from different subgroups of the

population under study. IRT is a very useful method for identifying such items.

IRT evaluates DIF by studying the difference between the ICCs estimated for two groups of examinees with

similar abilities. If the matched-ability groups plot on the same curve, it is an

indication that the item does not exhibit differential functioning (the smaller the distance

between two ICCs, the less the DIF). This type of analysis is always preceded by test equating. It

is important to note here that group differences might not necessarily be due to DIF but to actual

differences in their means. If the IRT model fits accurately, the same ICCs will be generated.61

One must always question why an item has differential function and if a justifiable reason is not

found, that item might have to be left out due to its content. Such a situation is balanced by

constructing more items in the test that favour the focal group.
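The notion of the distance between two ICCs can be illustrated by approximating the unsigned area between the curves estimated for a reference group and a focal group. This is a minimal sketch, assuming two-parameter logistic ICCs whose parameters have already been placed on a common scale; the function names, the integration range and the parameter values are hypothetical.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic ICC (scaling constant taken as 1)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def unsigned_area(ref, focal, lo=-6.0, hi=6.0, n=2401):
    """Approximate area between two groups' ICCs for the same item.

    ref, focal : (a, b) estimates from the reference and focal groups,
                 assumed to be on a common scale already.
    """
    theta = np.linspace(lo, hi, n)
    gap = np.abs(icc_2pl(theta, *ref) - icc_2pl(theta, *focal))
    # Trapezoidal rule written out explicitly.
    return float(np.sum((gap[1:] + gap[:-1]) / 2.0 * np.diff(theta)))

print(unsigned_area((1.0, 0.0), (1.0, 0.0)))  # identical curves
print(unsigned_area((1.0, 0.0), (1.0, 1.0)))  # item is harder for the focal group
```

When the two curves coincide the area is zero. When the discriminations are equal, the area is close to the absolute difference between the two difficulty estimates.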

2.4.8.3 Computerized Adaptive Testing

IRT is the backbone of computerized adaptive testing (CAT) and its various functions

are utilized in all the three steps of tests administered by this “state of the art” mode. In fact,

CAT cannot function without the property of invariance, a characteristic of IRT. When an item

bank exists with the provision of access to item level statistics generated with the application of

IRT, CAT can be initiated. The actual process is iterative and begins with an evaluation of all

the items that have not yet been administered to the candidate; based on that, a decision is made

about the next item to be administered, namely the one that best suits the current ability estimate. The

chosen item is then answered by the examinee and a new ability estimate is generated that is

based on the responses of the ones administered so far. These steps continue to be repeated until

such time as a criterion for stopping, which has been identified beforehand, is met. This criterion

may be the time spent on the test, the number of items administered, ability estimate, content

tested upon, or the standard error. Students find this method of being tested favourable as it helps

to cut down the testing time by half while maintaining a high level of precision.131
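The item selection and stopping steps of the adaptive cycle can be sketched as follows. Maximum-information selection and a standard-error stopping rule are common choices but are assumptions of this example; the item bank, the function names and the two-parameter logistic model are hypothetical.

```python
import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def next_item(theta, bank, administered):
    """Index of the unused item with maximum information at theta."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))

def standard_error(theta, bank, administered):
    """SE of the ability estimate based on the items given so far."""
    info = sum(item_information(theta, *bank[i]) for i in administered)
    return 1.0 / math.sqrt(info)

bank = [(1.0, -2.0), (1.0, 0.1), (1.0, 1.5)]  # hypothetical (a, b) pairs
first = next_item(0.0, bank, administered=set())
second = next_item(0.0, bank, administered={first})
print(first, second, standard_error(0.0, bank, {first, second}))
```

In a full adaptive test the ability estimate would be updated after every response and the loop would stop once the standard error, or another predefined criterion, reached its target.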

2.4 Comparing CTT and IRT

Certain aspects of classical test theory make it less desirable for educational measurement

than IRT. One of these is that the item characteristics are group-dependent, i.e., if examinees

under study are different from the ones with which the item indices had been obtained, the test

becomes of limited value. Similarly, examinee performance is test-dependent. Furthermore, CTT

is expressed at the test level rather than the item level. In addition, it does not provide a

measure of precision for each ability score.111, 130 IRT, on the other hand, is group-independent,

test-independent and is expressed at the item level.

In contrast to CTT, IRT models are lauded for their ability to generate invariant

estimators.132 That is, theoretically IRT ability estimates, θ, are “item-free” (i.e., would not

change if different items were used) and the item difficulty statistics are “person-free” (i.e.,

would not change if different persons were used). For single ability, dichotomously scored test

items, IRT employs three different models. Because the assumptions of IRT are complex, it is

not always suitable to use it for all situations.133 Several medical school exams utilize CTT rather

than IRT for analyses. On the other hand, IRT is extensively used in several high-stakes exams

like GRE, SAT and PISA due to its suitability for computerized adaptive testing and its ability to handle large data

sets.133 With the development of newer software, it is becoming more commonplace than ever

before to use IRT for medical education-related research.38, 134

Despite the more advanced nature of IRT, CTT has served the psychometricians well for

very long. This is because it has a number of well-documented advantages over other testing

theories.59 Its concepts are fairly basic and methods quite flexible. It has a robust model that is

amenable to changes with changes in the data without skewing it. Furthermore, its underlying

models fit several instruments accurately. It does have some theoretical weaknesses as well that

make it less favourable for certain situations. In CTT, the item-level statistics of difficulty and

discrimination are examinee-dependent.76 Usually, the scales tend to be long with an inability to

differentiate between a common theme that might run across items for the construct under study.

The items, furthermore, are not probed vigilantly. Despite these shortcomings, CTT continues to

be in demand for many types of studies.

In many situations, a combination of CTT and IRT together works better than either of them

on their own. CTT, in such circumstances, can be used to carry out basic statistical analyses and

IRT can be applied to measure examinee abilities and item level statistics.

IRT measurement is an advanced statistical model that is able to address many item-level

concerns not resolved by other testing theories. Although CTT has been used more often than

IRT in medical education, the numerous applications of IRT and now the advent of more

advanced software are making it more acceptable. Test designing and equating, item selection

and scaling and adaptive testing are carried out more conveniently by IRT than by other

available models. IRT offers the promise of solving many problems that are faced by

psychometricians in medical education. Despite the fact that CTT is more robust in terms of

assumptions and data size, IRT provides more useful information in terms of examinee abilities

and item difficulty. Many international testing bodies dealing with larger data prefer applying

IRT models for various high-stakes exams, which is a credit in itself in favour of IRT. A summary

of the differences between some important features of CTT and IRT is given in Table 1.

Table 1: Features of Classical Test and Item Response Theory

Features | Classical Test Theory | Item Response Theory

Focus | It is on determining the error of measurement. | It is on determining the unobserved theoretical latent trait.

Goal | In CTT, the quality of the observed test score is evaluated by estimating the reliability coefficient and the standard error. | In IRT, the score of the latent trait is estimated.

Standard Error | Only a single type of error can be determined in CTT. | The standard errors of individual parameter estimates can be determined in IRT.

Sample Size | CTT works well with both small and large data sets. | IRT requires a larger data for optimal application depending on the model that fits the data.

Assumptions | CTT has a robust model with flexible assumptions. | The various models have strict assumptions of unidimensionality and local independence in IRT.

Reliability of the Scores | CTT calculates the reliability coefficient of the total test score. | In IRT, reliability is reflected by the test information function.

Item Calibration | CTT does not require item calibration due to its flexibility with the data. | It is a prerequisite in IRT for the items to be calibrated before the actual administration of a test.
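The contrast between the two approaches to precision summarized in the table can be written compactly. The expressions below are the commonly used forms, given as a sketch: s_X denotes the standard deviation of observed scores and r_XX' the reliability coefficient (symbols assumed here for illustration), and the information function is shown for the two-parameter logistic model without a scaling constant.

```latex
\begin{align*}
\text{CTT:}\quad & SEM = s_X \sqrt{1 - r_{XX'}} \\
\text{IRT (2PL):}\quad & I(\theta) = \sum_{i=1}^{n} a_i^{2}\, P_i(\theta)\,\bigl[1 - P_i(\theta)\bigr],
\qquad SE(\hat{\theta}) = \frac{1}{\sqrt{I(\theta)}}
\end{align*}
```

The CTT standard error of measurement is a single value for the whole test, whereas the IRT standard error varies with theta because the test information function does.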

2.5 Temporal Stability of MCQs

MCQs are frequently used in high-stakes exams to assess students. Their

psychometric properties may become less stable over time and across administrations due to a

number of factors. This raises concerns since decisions of certification, promotion and

graduation depend on these written assessments. It is, thus, desirable for the items to exhibit

temporal stability for their repeated use in exams. Changes that may occur in the item parameters

over time and administrations are referred to as parameter drift, which is discussed

below.

2.5.1 Parameter Drift

As stated above, parameter drift refers to the phenomenon of changes that occur in the

parameter estimates of an item due to repeated administrations. If the values of the parameters

alter more than would be expected due to measurement error, it cannot be assumed that these

values will remain unchanged over time. As a consequence of this, such items may have to be

removed from the item bank due to threats to stability.

Although the phenomenon of parameter drift is typically associated with IRT, changes may

also be observed in the context of CTT where p values and point biserial correlations may drift

over time and repeated administrations. Parameter drift is observed both in the context of item

difficulty and discrimination estimates. Some researchers have documented that item difficulty

has a stronger parameter drift than item discrimination.135
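The CTT indices referred to above, i.e., p values and point-biserial correlations, together with Cronbach's alpha, can be computed from a matrix of scored responses as in the sketch below. The small data set is invented for illustration, the function name is hypothetical, and the item-total correlation shown is the uncorrected form in which the item is included in the total score.

```python
import numpy as np

def ctt_item_analysis(scores):
    """Classical item statistics for a 0/1 score matrix (examinees x items)."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    p_values = scores.mean(axis=0)  # proportion answering each item correctly
    # Uncorrected item-total (point-biserial) correlation for each item.
    r_pb = np.array([np.corrcoef(scores[:, j], total)[0, 1]
                     for j in range(scores.shape[1])])
    k = scores.shape[1]
    alpha = k / (k - 1) * (1.0 - scores.var(axis=0, ddof=1).sum()
                           / total.var(ddof=1))
    return p_values, r_pb, alpha

# Four hypothetical examinees answering three items.
scores = [[1, 1, 1],
          [1, 1, 0],
          [1, 0, 0],
          [0, 0, 0]]
p_values, r_pb, alpha = ctt_item_analysis(scores)
print(p_values, r_pb, alpha)
```

Repeating this computation for each administration gives the year-by-year p values and point-biserial correlations whose changes indicate drift in the CTT framework.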

The phenomenon of parameter drift is attributable to several reasons in addition to

measurement error. Changes occurring in the construct are one reason for parameter drift. A

construct may change due to alterations in the testing universe, the objective of the assessments

or the target students. This is particularly observed in the context of a curriculum that is still

being developed and is undergoing frequent changes. Since some items testing a particular

construct might not be required for assessing it any longer, the usefulness of such items may

wane, leading to further drift. Another factor influencing the item parameters is the content of the

curriculum. If the curricular content is not dynamic, students become well-versed in picking up a

trend in the exam questions and item stability is affected. Bock et al135 observed parameter drift

in a study on a College Board Physics exam which they attributed to curricular differences. The

items on Basic Mechanics became easier over the years as the content was heavily covered in the

curriculum. Changes in the characteristics of the items also cause changes to occur in the item

parameters. Hence, items that test certain general skills like arithmetic and comprehension have

been noted to drift less compared to ones that are content-specific. They are also affected by the

timing of instructions delivered to different cohorts. If one group of students has been

instructed closer to the exams in contrast to another, changes in the item characteristic may be

noted to influence the scores. One interesting influence on parameter drift is that of recency of

instructions.136 Content that has been emphasized in the near past may lead to improved general

knowledge about such topics, making some items appear less challenging. Bergstrom and

colleagues have reported on parameter drift resulting from differences in pre and post tests due to

changes in practice and motivational effect.137 A national computerized adaptive test yielded a

drift in 32-49% of the items between a pre-test and operational use over a five-year period.

Threats to security also bring about a change in the parameter estimates. Techniques like training

in test wiseness tend to cause parameter drift since students learn to pick up the correct answer

despite lack of content knowledge. This is further aggravated by answer sharing by the

examinees who have already taken the test. Overexposure of the items, either due to repeating

over multiple administrations or due to computer adaptive testing, both lead to a decrease in the

test security since students start anticipating that certain items will be included in the test. Gender

differences, language preference and ethnicity can also cause a drift in the parameters of items.

Furthermore, significant changes in parameter estimates may also result from large changes in

the population. In addition, relatively easy items may become more difficult as the knowledge

being tested by these items becomes less common. On the other hand, difficult items may

become easier as the previously specialized knowledge becomes more commonly known.

Parameter drift leads to a number of consequences that may affect the outcome of an

assessment or a program. Due to its impact on the performance on an item, it affects the scores of

an exam. Students may find the items differentially easy or difficult due to a drift in the

estimates. Where comparisons have to be made in the performance of a student over time, it can

be complicated due to parameter drift as the baseline estimates may be altered with time.

Parameter drift can also pose challenges when decisions need to be made around the cut score of

an exam. This is especially important in the context of a high-stakes exam where decisions about

certification, graduation, etc. may be affected by this phenomenon. In the context of equating,

parameter drift leads to the addition of further equating error in pre-equated test forms if the

parameters are not re-estimated before the administration. Despite the unwanted consequences

of parameter drift, researchers have reported that the overall effect of this phenomenon on the

test forms remains small. Due to its robustness, theta recovery remains intact although drift may

be seen in both b and a parameters.138, 139

Various methods are employed to study the phenomenon of parameter drift. One method is to

use the chi-square test, in which parameter estimates are compared across different time points to look at its

effect.140, 141 It is also detected using a z test, in which a comparison is made between two subgroups

to detect drift.140 In the context of IRT, different models are compared with each other for fit.135

Model fit is determined using the likelihood ratio chi-square test for detecting differences in the

models and hence parameter drift. Alternatively, the fit of one type of model is compared at

different points in time across administrations to observe whether there is a difference in the

parameter estimates leading to a drift. In addition, DIF may be used to detect drift as well.

Parameter estimates that are obtained from such testing are compared across administrations to

study the differences in them. Babcock et al142 and Wollack et al143 have utilized test

characteristic curves to visually compare them as they provide useful information regarding

changes in parameter estimates over time. Amongst all the methods discussed above, the chi

square testing has been documented to be most effective.140
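As an illustration of the z test approach, the sketch below compares the difficulty estimate of one item across two administrations using the standard errors of the estimates. It assumes that the two estimates are independent and on the same scale; the function names and values are hypothetical and the formula is one common form rather than the specific procedure of the studies cited above.

```python
import math

def drift_z(b_first, se_first, b_second, se_second):
    """z statistic for the change in one item's difficulty estimate."""
    return (b_second - b_first) / math.sqrt(se_first ** 2 + se_second ** 2)

def flag_drift(z, critical=1.96):
    """True when |z| exceeds the two-sided critical value (5% level by default)."""
    return abs(z) > critical

z = drift_z(b_first=0.0, se_first=0.1, b_second=0.5, se_second=0.1)
print(round(z, 3), flag_drift(z))
```

An item is flagged only when the change in its estimate is large relative to the measurement error of both estimates, which matches the definition of drift as change beyond what measurement error would produce.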

To conclude, it can be stated that temporal stability can be assessed by analyzing parameter

drift. Parameter drift is not uncommon and can especially be observed in exams utilized for

assessing a large number of students and with a large number of items. It is important to analyze

the MCQs for estimate drift as the items may become differentially easy or hard over repeated

administrations. The choice of method for analyzing the drift should depend on the effectiveness

and ease of application of a method and the stakeholders’ understanding of it.

2.6 Research Questions

The questions that were addressed by this research were mainly to observe the reliability of

scores on an MCQ exam while using two different methods and the stability of these MCQ items

over time. It was hoped that this research would help compare the similarities and differences in

the two methods, i.e., CTT and IRT and also explore some of the factors affecting the stability of

items on being used repeatedly. My research questions were as follows:

1. What was the reliability of scores using and comparing two methods of analysis, i.e.,

CTT and IRT, on MCQ items administered three times over a six year period?

1A. What are the item parameters when conducting item analysis with CTT?

1B. What are the item parameters when conducting item analysis with IRT?

1C. Are the item parameters comparable when analysing with both CTT and IRT?

1D. What is the reliability index of the test scores?

1E. What are the item characteristic curves like for the individual items for each year?

2. Do the items exhibit temporal stability when repeated over Year 1, 2 and 3?

2A. Do the items show stability across years using CTT?

2B. Do the items show stability across years using IRT?

CHAPTER III – RESEARCH METHODS

3.1 Study Design

An exploratory retrospective cohort design was utilized to answer the research questions

in this study. The main aim of this particular design was to assess the reliability of the MCQs

over three selected years using CTT and IRT. In addition, item stability over three years was also

studied. Section two of this chapter presents the setting and context, section three elaborates on

sample and data source, and section four describes the analyses. Ethical concerns are discussed

in section five.

3.2 Setting and Context

This research was carried out at the University of Calgary. The data were obtained from the

Office of Undergraduate Medical Education at the university. The University of Calgary is one of

Canada’s seven premier research universities and is a member of the Network of Centers of

Excellence, a Canada-wide program of research and innovation. In addition, it launched its

own “Eyes High” initiative in 2011. Eyes High is the University’s new strategic direction

aiming at becoming one of Canada’s top five research universities, grounded in innovative

learning and teaching and fully integrated with the local community.

The undergraduate medical program at the University of Calgary, which was established in

1967, is an innovative program that encourages the acquisition of skills required for solving

clinical problems through the use of the “Clinical Presentation Curriculum”. This curriculum

was initially introduced in the early nineties.144 The foundations of this curriculum are the

principles of early contact with patients and integration of basic and clinical sciences. These

principles nurture the growth of knowledge and skills vital for the practice of medicine and the

efficient use of knowledge for the analysis and solution of clinical presentations.

The “clinical presentation curriculum” organizes the instructional strategies around 120

clinical presentations. Clinical history, physical examination and investigations warranted are

covered extensively in this way. For instance, the schema of an approach to a patient with

hypertension is shown in Figure 5 below.144 This curriculum was further strengthened in

2006, after ten years of student and faculty feedback, when the more traditional systems with

overlapping clinical presentations were merged into one longer case.145 For example,

“chest pain” and “dyspnea” were linked together into the “cardio-respiratory system”. This

improvement helped to integrate the clinical presentations horizontally.

[Figure 5 schema. Node labels: Hypertension; True or Mislabeled; Primary; Secondary; Volume-Dependent; Vasoconstrictive; Parenchymal Disease; Mineralocorticoid Excess; Angiotensin II Excess; Catecholamine Excess]

Figure 5: Causes and Pathophysiology of Hypertension

The MD is a three-year program at the University of Calgary and the summative certifying

exams comprise both MCQs and OSCE.3 The items in this research were chosen from three

randomly selected courses, i.e., 1, 3 and 6.4 Course 1 covers the prescribed curriculum of

Hematology and Gastroenterology (GIT) and is offered in the first year of medical school,

Course 3 covers the Cardiovascular (CVS) and Respiratory content and is offered in year one

like Course 1. Course 6 comprises Reproductive Medicine and Human Development and is

offered in year two of the MD program. The Undergraduate Medical Education (UGME) Office

has a well-developed MCQ bank that was accessed in this research with the permission of the

Associate Dean, Undergraduate Medical Education.

Security and copyrighting of MCQ items is an issue that arises whenever MCQ question

banks are accessed.146 These banks are expensive to construct and maintain, both financially

and logistically, because faculty must constantly replenish them with high-quality items that

have been authored, pretested and analyzed. Furthermore, the confidentiality of MCQs

is compromised with repeated use of the same items. For this reason, intellectual property and

digital copyrighting are put in place and implemented by academic institutes and individual

departments. For the same reason, it was not possible to disclose the details of the items analyzed

in this research.

3.3 Sample and Data Source

A total of 90 MCQs used in the assessment of three courses, Course 1 (Hematology and

GIT), Course 3 (CVS and Respiratory System) and Course 6 (Reproduction and Human

Development) over three years each were analyzed in this research. Table 2 shows the

distribution of item selection for each year and course.

3 http://www.ucalgary.ca/mdprogram/admissions/introduction/years--

4 http://www.ucalgary.ca/mdprogram/admissions/teachingmethods

Table 2: Item Distribution for Individual Year and Course

Year Course 1 Course 3 Course 6

2007 30 Items Selected

2008 30 Items Selected 30 Items Selected

2009 Same 30 Items Same 30 Items

2010 Same 30 Items Same 30 Items

2011 Same 30 Items

2012 Same 30 Items

Thirty multiple choice items were chosen for each of Courses 1, 3 and 6. The MCQs selected

were the ones that had been reused in either alternate or successive years. These MCQs were the

single best answer (SBA) type, also known as the one-best answer type. They are the most

commonly used type of MCQs in medicine and other life sciences.32 A clinical scenario usually

acts as an introductory stem in such types of questions which is followed by a lead-in question

and usually five options to choose from. Four of these options are distracters and one the correct

answer. It is important to keep the options homogeneous so that one option does not stand out

more than the other.51 The following is an example of an SBA type of MCQ:

A nine-month old girl is admitted to the hospital for growth faltering. The prenatal history is

unremarkable and the child thrived well for the initial four months. On examination, the child is

found to have a wide open fontanel, is listless and has a nappy rash. She is also below the 5th

percentile for length and weight. No other abnormalities are detected. After 1 week of routine

hospital care, the infant has gained 1 kg and has become more playful and alert. Which of the

following is the most likely explanation for the faltering growth?

(A) Hypothyroidism

(B) Infantile psoriasis

(C) Milk allergy

(D) Parental neglect

(E) Pyloric stenosis

The MCQs selected for this research covered various components that included the

assessment of knowledge about the skills in basic sciences, investigations, treatment and

management. The details of their distributions over the four mentioned skills will be elaborated

upon in the results section.

3.4 Data Analyses

Data analyses for both CTT and IRT were carried out using Xcalibre version 4.2. These

included the descriptive analysis of the research to give an overall picture of the results. Below

are the details of the analyses that were carried out in addition to a summary of the concepts

underlying them. Since the research question had two parts, one related to the reliability of items

and test, and the other to the temporal stability of the items, the analysis has accordingly been

divided into two questions.

3.4.1 Research Question No. 1

As discussed in the literature review, reliability of test scores is an end result of item analysis.

Since the objective of this research was to use two methods of analysis, i.e., CTT and IRT, and to

compare the results of both, the methods were subdivided to yield the answers to the following

questions:

3.4.1.1 Research Question No.1 A

What are the item parameters when conducting item analysis with CTT?

Item difficulty and discrimination are both important constituents of item analysis. For this

purpose, CTT was used to look at the difficulty and discrimination indices of items under study

over a period of three years. In CTT, the difficulty index is denoted by “p” and refers to the

proportion of examinees who have answered the item correctly. The higher the p value, the easier

the item. Its IRT counterpart is the item difficulty parameter denoted by “b”. As discussed in the

literature review, item discrimination in CTT refers to the item-total correlation, called the

point biserial correlation, which can take any value between -1 and +1; the closer the

correlation coefficient is to +1, the more discriminating the item. The IRT analogue of the point

biserial correlation is the discrimination parameter, which is denoted by “a”.
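As a minimal sketch of how these two CTT indices can be computed from scored responses, the following Python fragment works on a 0/1 matrix in which rows are examinees and columns are items. The function names and the small response matrix are hypothetical illustrations, not the thesis data, and the item-total correlation shown is the uncorrected form in which the item's own score is part of the total; the software used in this research may compute it differently.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation of two equal-length lists (assumes nonzero variance)."""
    mx, my = mean(x), mean(y)
    cov = mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

def ctt_item_stats(responses):
    """Return (p, p_bis) lists for a 0/1 matrix: rows = examinees, columns = items."""
    totals = [sum(row) for row in responses]       # total score per examinee
    p, p_bis = [], []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        p.append(mean(item))                       # difficulty: proportion correct
        p_bis.append(pearson(item, totals))        # discrimination: item-total correlation
    return p, p_bis

# Hypothetical responses of six examinees to three items (1 = correct).
demo = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 1],
]
p, p_bis = ctt_item_stats(demo)
```

In this toy matrix the first item is answered correctly by five of six examinees, so it has the highest p and is the easiest item.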

3.4.1.2 Research Question No.1 B

What are the item parameters when conducting item analysis with IRT?

IRT was applied to assess the difficulty and discrimination indices of the same items used for

carrying out the item analysis with CTT. This was done so that differences and similarities could

be highlighted between the two methods.

3.4.1.2.1 Two-Parameter Logistic Model of Item Response Theory

This research has used the 2 PL model of IRT 115 which comprises the following two

parameters:

1. The item difficulty, or threshold, parameter b: it is the point on the latent scale θ where a

person has a 50% chance of responding positively to the scale item.

2. The slope, or discrimination, parameter a: it describes the strength of an item's

discrimination between people with trait levels (θ) below and above the threshold b.

The 2 PL model was used since one of the aims of this research was to compare the a and b

parameters with the difficulty and discrimination parameters of CTT over three years. Also, 3 PL

models require a larger sample size for such analyses.61 The sample size usually recommended

for 3-PL analysis is between 1000-2000 examinees, the larger number being more desirable.147

While carrying out the item analysis with Xcalibre 4.2, the model constant was set at 1.7.

Theta was estimated using maximum likelihood estimation and examinee ability estimates were

rescaled to have a mean theta of 0. Item analysis is briefly recapitulated in the following sections.

Since there are some differences in the interpretation of item analysis using the two different

measurement methods, these will also be discussed.
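For reference, the 2PL model with a scaling constant D is conventionally written as shown here. The subscript i indexing items is notation introduced for illustration only.

```latex
P_i(\theta) = \frac{1}{1 + \exp\left[-D\,a_i\,(\theta - b_i)\right]}, \qquad D = 1.7
```

Setting θ = b_i gives P_i(θ) = 0.5, which agrees with the definition of the threshold parameter b as the point of a 50% chance of success.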

3.4.1.2.2 Item Analysis

Item analysis is the process by which it can be confirmed if the items on a test are

functioning in the desired manner.112 Given that there are limited numbers of items on an

examination, every item has to be written in such a way that it is able to assess higher cognitive

functions along with an evaluation of the understanding and application process of the examinee

sufficiently.17 Item analysis, thus, helps establish the difficulty and discrimination levels of each item.

3.4.1.2.3 Item Difficulty

Item difficulty expresses the proportion or percentage of students who answered the item

correctly. It can range from 0.0 (none of the students answered the item correctly) to 1.0 (all of

the students answered the item correctly). The average difficulty index for a five-option multiple

choice test should be between 0.25 and 0.75. 112 If an item is found to have a difficulty of less

than 0.25, it may be that one of the wrong options has been recorded into the answer scanner as

the correct one (miskeyed item) or that the item was not written clearly. It is also possible

that the item may have more than one correct answer or that at least one distracter is very close

to the correct option.

3.4.1.2.4 Item Discrimination

This refers to the ability of an item to distinguish between the more knowledgeable and the

less knowledgeable students.112 An index of 0.40 and higher is said to be consistent with

excellent discrimination, 0.30 to 0.39 good, 0.10 to 0.29 fair and 0.01-0.10 poor. If the

discrimination index is calculated to be in negative values, the item may be ambiguous or as in

the case of item difficulty may have been miskeyed inadvertently by the programmer.
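The cut-offs quoted in the two sections above can be expressed as a small screening helper. This is only a sketch: the function names are hypothetical, the boundary value 0.10 (which the quoted ranges assign to both "fair" and "poor") is treated here as "fair", and an index of exactly 0 is treated as "poor".

```python
def classify_discrimination(index):
    """Label a discrimination index using the cut-offs quoted in the text."""
    if index < 0:
        return "negative: check for ambiguity or miskeying"
    if index >= 0.40:
        return "excellent"
    if index >= 0.30:
        return "good"
    if index >= 0.10:   # assumption: the shared boundary 0.10 counts as fair
        return "fair"
    return "poor"

def flag_difficulty(p, low=0.25, high=0.75):
    """Flag a difficulty index (proportion correct) against the 0.25-0.75 range."""
    if p < low:
        return "hard: review key and wording"
    if p > high:
        return "easy"
    return "within range"
```

For example, `classify_discrimination(0.401)` returns "excellent" and `flag_difficulty(0.862)` returns "easy".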

3.4.1.3 Research Question No.1 C

Are the item parameters comparable when conducting item analysis with both CTT and IRT?

Correlation coefficients were calculated between the item parameters generated by both CTT

and IRT for all three courses for the three years. This was done to observe whether there was a

relationship between the two methods of measurement. It was assumed that if the correlation was

good to excellent, it would mean that the two methods, irrespective of the differences in them,

would be comparable to each other. A perfect correlation coefficient is that of 1.0. Correlation

coefficients can be negative or positive; a negative coefficient means that one variable decreases

as the other increases, a positive coefficient means that the variables increase together, and a

coefficient near zero means that there is little or no linear relationship. The formula that is used for calculating the correlation coefficient

standardizes the variables. Hence, changes in scale or changes in units of measurement do not

affect its value. P values were also reported along with the correlation coefficients.
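The scale invariance mentioned above can be illustrated with a short sketch. The parameter values are invented for illustration and are not taken from the thesis data; `pearson_r` is a hypothetical helper that standardizes both variables before averaging their products.

```python
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation computed from standardized (z) scores."""
    zx = [(v - mean(x)) / pstdev(x) for v in x]
    zy = [(v - mean(y)) / pstdev(y) for v in y]
    return mean([a * b for a, b in zip(zx, zy)])

# Hypothetical CTT difficulty (p) and IRT difficulty (b) values for five items.
p_values = [0.86, 0.45, 0.70, 0.67, 0.57]
b_values = [-1.9, 0.3, -0.8, -0.6, -0.2]

r = pearson_r(p_values, b_values)

# Re-express p as a percentage and shift/stretch b: the coefficient is unchanged.
r_rescaled = pearson_r([100 * p for p in p_values], [2 * b + 5 for b in b_values])
```

In this toy example the coefficient is negative because a higher p marks an easier item while a higher b marks a harder one, so a strong relationship between the two scales appears as a large negative value.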

3.4.1.4 Research Question No.1 D

What is the reliability index of the test scores?

The SE of parameter estimates was calculated for each item for the three years. As stated

elsewhere, standard error is sensitive to the size of the sample and a larger standard error is noted

for smaller samples than for the larger ones. The size of item parameters, i.e., difficulty and

discrimination indices also influences the standard error as more extreme parameters like a

difficulty index of, for example 1.5, will lead to a larger standard error.

In addition, Cronbach’s alpha was calculated for the test scores of the three years for the

three courses individually to look at the reliability coefficient of the scores. Both CTT and IRT

were used for this purpose so that a comparison could be made between the results of the two.

For IRT, the SEMs of the examinees’ theta were averaged to produce a mean SEM for the

examinees for a given year. This mean SEM was then converted to a reliability coefficient by

applying the following formula:

Reliability = 1 - (SEM / SD_theta)^2

where SEM is the mean standard error of measurement and SD_theta is the standard deviation of the

examinee thetas. This calculation gave us a single value that was used to represent the reliability

of the IRT scores on the examination under scrutiny.
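Both reliability coefficients described in this section can be sketched in a few lines. The function names are hypothetical, the inputs are toy values rather than the examination data, and the use of the sample (n - 1) variance and standard deviation is an assumption, since the text does not state which form was used.

```python
from statistics import mean, stdev, variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a score matrix: rows = examinees, columns = items."""
    k = len(responses[0])
    item_var_sum = sum(variance([row[j] for row in responses]) for j in range(k))
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)

def irt_reliability(sems, thetas):
    """Reliability = 1 - (mean SEM / SD of theta)^2, as in the formula above."""
    return 1 - (mean(sems) / stdev(thetas)) ** 2

# Toy check: two items that always agree give an alpha of 1.
alpha = cronbach_alpha([[1, 1], [0, 0], [1, 1], [0, 0]])

# Toy check: mean SEM of 0.5 against an SD of theta of 1 gives 0.75.
rel = irt_reliability([0.5, 0.5, 0.5], [-1.0, 0.0, 1.0])
```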

3.4.1.5 Research Question No.1 E

What are the item characteristic curves like for the individual items for each year?

A basic concept of IRT is the ICC which is a mathematical expression that relates the

probability of success on an item to the ability measured by the test and the characteristics of the

item.109 Item characteristic curves were generated for each individual item over three years for

three courses. The details of Course 1 are given in the results section. For Courses 3 and 6, the

graphs can be found in Appendices A8 and B8 respectively.

An ICC is essentially a non-linear regression of the probability of a correct response to a

given item on the examinee’s ability. Item difficulty and discrimination influence the shape of

the curve. Difficulty is a location index and describes the function of the item along the ability

scale. The steepness is attributable to the discrimination of the item. The higher the

discrimination index, the steeper the curve.
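The roles of difficulty and discrimination in shaping the curve can be sketched as follows, assuming the 2PL form with the scaling constant D = 1.7 that was set for the calibration. The function name and the item parameters are illustrative only.

```python
import math

def icc_2pl(theta, a, b, D=1.7):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Location: at theta == b the probability is exactly 0.5, whatever the slope.
at_threshold = icc_2pl(0.3, a=1.2, b=0.3)

# Steepness: over the same ability interval around b, the item with the larger
# discrimination parameter shows the larger rise in probability.
rise_flat = icc_2pl(0.5, a=0.5, b=0.0) - icc_2pl(-0.5, a=0.5, b=0.0)
rise_steep = icc_2pl(0.5, a=1.5, b=0.0) - icc_2pl(-0.5, a=1.5, b=0.0)
```

Plotting `icc_2pl` over a grid of theta values (for example -3 to 3) for each item would reproduce the familiar S-shaped curve.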

3.4.2 Research Question No. 2

This research question has been answered by using both CTT and IRT. The sections that

follow have been hence divided accordingly into subsections.

3.4.2.1 Research Question No 2. A

Do the items show stability across years using CTT?

3.4.2.1.1 Repeated Measures ANOVA

IBM’s SPSS (version 22) was utilized to run a repeated measures ANOVA for the three

years for each of the three courses. Repeated measures ANOVA detects differences between the

means of related groups. Furthermore, it helps determine if the dependent variables are altered

by the independent variable (year in the case of this research). It was appropriate for this research

as one of my objectives was to study the change in the item parameters over three points in time.

ANOVA assumes that the variances are equal across the groups or samples under research.

Levene’s Test for Equality of Variances148 is applied to test this assumption of homogeneity of

variances. It can, thus, be used to verify whether or not the variances of the groups are

statistically different. Generally, 0.05 is used as the probability level to establish the statistical

significance; so, if the Levene’s Test shows a significance value of < 0.05, it can be concluded

that the variances are significantly different. Similarly, if it shows a value greater than 0.05, it

means that the variances are not significantly different.

If the Levene’s Test is non-significant, then another statistic is determined for ANOVA

which is called the F ratio. This is the ratio of the variance between groups to the variance within

groups i.e. the ratio of the explained variance to the unexplained variance. The F ratio is used to

test whether or not two variances are equal. If the p values are not significant and the F ratio is

small, it means that the dependent variables, i.e., difficulty and discrimination indices, are stable

over the years and unaffected by time.
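The F ratio for a one-way repeated measures design can be sketched by hand as below, treating each item as the repeatedly measured unit and year as the within factor. This is an illustration under those assumptions and not the SPSS procedure itself; the function name and both data sets are hypothetical, and the p value (which would come from the F distribution with the returned degrees of freedom) is not computed.

```python
def repeated_measures_anova(data):
    """One-way repeated measures ANOVA; data has one row per item, one column per year."""
    n = len(data)        # number of items (repeatedly measured units)
    k = len(data[0])     # number of years (levels of the within factor)
    grand = sum(sum(row) for row in data) / (n * k)
    year_means = [sum(row[j] for row in data) / n for j in range(k)]
    item_means = [sum(row) / k for row in data]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_year = n * sum((m - grand) ** 2 for m in year_means)    # explained by year
    ss_item = k * sum((m - grand) ** 2 for m in item_means)    # between-item differences
    ss_error = ss_total - ss_year - ss_item                    # unexplained

    df_year = k - 1
    df_error = (k - 1) * (n - 1)
    f_ratio = (ss_year / df_year) / (ss_error / df_error)
    return {"F": f_ratio, "df_year": df_year, "df_error": df_error}

# Hypothetical difficulty indices for three items over three years.
stable = [[0.5, 0.6, 0.7], [0.7, 0.5, 0.6], [0.6, 0.7, 0.5]]      # equal year means
drifting = [[0.1, 0.2, 0.4], [0.2, 0.3, 0.4], [0.3, 0.4, 0.4]]    # year means rise

stable_result = repeated_measures_anova(stable)
drifting_result = repeated_measures_anova(drifting)
```

When the yearly means coincide, as in `stable`, the F ratio is essentially zero; when they drift, as in `drifting`, the ratio grows.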

In addition to repeated measure ANOVA, correlation coefficients were calculated and scatter

plots constructed for the two methods and the three years to assess the stability across the years.

Correlation coefficients inform how strongly two or more variables are related to each other.149

The correlation is said to be positive if one variable increases with the other and negative if one

increases while the other decreases. Both the variables are said to have a relationship even if it is

negative. A correlation of + 1 is said to be a perfect correlation. It is said to be moderate if 0.5

and above and excellent if 0.8 and above.

This research, as stated earlier, also reported the overall mean and standard deviation for the

difficulty and discrimination indices of the 270 item administrations (30 items X 3 courses X 3 years). In addition,

descriptive statistics were generated for each course individually.

3.4.2.1.2 Effect Sizes

Partial Eta² was calculated to report the effect sizes for item difficulty and discrimination for

the three courses. Effect size is a useful index to depict the practical significance of study

results.150 It is preferred to statistical significance because it is not dependent on sample size and

is a scale-free index. It can, hence, be interpreted irrespective of the scales of variables. The

index varies from about 0.3 to ∞. It is small if the value is between 0.30-0.49, moderate between

0.50-0.79 and large between 0.80 to ∞. The larger the effect size, the larger the difference

between the distributions of scores.

3.4.2.2 Research Question No. 2 B

Do the items show stability across years using IRT?

Repeated measures ANOVA was used to study the temporal stability of item parameters

obtained with the IRT method. The objective was to observe the change in parameters with time

and to compare the findings with those observed with the application of CTT. In addition, TCCs

were also generated to visually compare the trend of the curves for stability over time. TCCs for

Course 1 are displayed in the results section whereas the ones for Courses 3 and 6 are displayed in

Appendices A17 and B17 respectively.

3.4.2.2.1 Test Characteristic Curve

IRT methods are also applicable at the test or scale level as discussed earlier. The

concept of a TCC stems from this ability of IRT.113 It represents a non-linear regression of

overall test score on ability. The TCC can be a very useful tool for evaluating the range of

measurement and the degree of discrimination at different points of the ability continuum. This

research used TCCs that were generated to assess the temporal stability of multiple choice items

by comparing them. The 2PL model of IRT was used to calculate the theta level of the

examinees in each cohort for each year and course separately along with the proportion correct

units and the number-correct units. These were then plotted on a graph using Xcalibre 4.2.

Individual graphs for each year per one course were then visually compared to establish whether

they looked similar in trend.
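A TCC in number-correct and proportion-correct units can be sketched as the sum of the item characteristic curves at each ability level. This is a minimal illustration assuming the 2PL form with D = 1.7; the function names and the three item parameter pairs are invented, and the actual curves in this research were produced by Xcalibre 4.2.

```python
import math

def icc_2pl(theta, a, b, D=1.7):
    """Probability of a correct response to one item under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def tcc_number_correct(theta, items, D=1.7):
    """Expected number-correct score: the sum of the item curves at theta."""
    return sum(icc_2pl(theta, a, b, D) for a, b in items)

def tcc_proportion_correct(theta, items, D=1.7):
    """Expected proportion-correct score at theta."""
    return tcc_number_correct(theta, items, D) / len(items)

# Hypothetical (a, b) pairs for a three-item test, all with threshold 0.
items = [(1.0, 0.0), (0.5, 0.0), (2.0, 0.0)]

# Tabulate the curve over a grid of ability values for plotting or comparison.
grid = [-3, -2, -1, 0, 1, 2, 3]
curve = [tcc_number_correct(t, items) for t in grid]
```

Curves tabulated this way for each year of a course could then be overlaid to judge visually whether their trends are similar.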

3.5 Summary of Analyses

1. Descriptive analyses were carried out for the difficulty and discrimination indices

calculated by using SPSS and the IRT software called Xcalibre (Version 4.2).

2. The reliability of the test was assessed by carrying out item analysis and calculating SE

of estimates and Cronbach’s Alpha using both CTT and IRT.

3. Correlation coefficients were calculated to look at the comparability of CTT and IRT.

4. ICCs were constructed to study the item parameters.

5. Repeated measures ANOVA was conducted using the difficulty index and the

discrimination index individually as dependent variables and year as the independent

variable to look at the stability of the MCQs across three years. Effect sizes (Partial Eta²)

were also calculated.

6. Year-wise correlation coefficients were calculated to look at the temporal stability of the

items.

7. Temporal stability of the selected MCQs across three years on item response calibrated

difficulty and discrimination indices for a 2 PL model of IRT was analyzed by generating

TCCs across the years.

Table 3: Methods Summary

Research Question | Variables | Statistical Analysis

Do the items exhibit reliability across years using CTT and IRT? | 1. Item difficulty index of 30 X 90 items; 2. Item discrimination index of 30 X 90 items; 3. Comparability of CTT and IRT | Item analysis; SE of Estimates; Cronbach’s Alpha; Correlation coefficients

Do the items show stability across years using CTT and IRT? | 1. Independent: Years of exam (2007-2012); 2. Dependent: Item difficulty and discrimination indices of 30 X 90 items; 3. Theta and item scores | a) Repeated measures ANOVA as it proves stability of items if F ratios are small and p value not significant; effect size for significance index. b) Correlation Coefficients to observe inter-year relationship of item; if moderate to excellent, it would indicate stability. c) IRT to generate TCCs

3.6 Ethics

This study received ethics approval from the Conjoint Health Research Ethics Board

(CHREB) at the University of Calgary. The permission to utilize the items for analysis was

granted by the Office of Undergraduate Medical Education, Faculty of Medicine, University of

Calgary. The participants’ demographic data could not be accessed for my research and they

were completely anonymous. This was due to a lack of permission from the CHREB in the

context of student demography. Except for the item number to identify the selected MCQs, no

other information could be accessed due to the issue of the security of the MCQ bank that arises with

the publishing of MCQs. The data were only accessible to the primary researchers and were

password-protected.

CHAPTER IV-RESULTS

4.1 Overview

This chapter describes the results obtained from the statistical analyses elaborated in

the previous chapter. The main aim of this research was to use University of Calgary summative

examination data from MCQ exams in order to assess the reliability of scores using and

comparing two methods of analysis, i.e., CTT and IRT, on MCQ items administered three times

over a six year period. In addition, the temporal stability of the same items was also analyzed

using both CTT and IRT. Due to a lack of permission from the CHREB, demographic data was

not available. For the purpose of the overview, the descriptive analyses for all three courses over

the three years are presented. For a full elaboration of results of the research questions presented

in chapter III, only Course 1 is discussed at length. Detailed results of Course 3 and 6 can be

viewed in the appendices at the end.

4.2 Descriptive Analysis

Descriptive analyses are shown below for various aspects of this research. These show the

skills, number of examinees and content of the three courses. In addition, descriptive statistics of

item parameters are shown as well.

Table 4 shows the distribution of MCQ items according to the skills which were divided

into Basic Sciences, Diagnosis, Investigation and Treatment. Predominant items in Course 1

(Hematology and GIT) were from the skill of Basic Sciences (N=11), closely followed by

Diagnosis (N=10). For Course 3 (CVS and Respiratory), they were mainly from the Diagnosis

(N=17), followed by Treatment (N=8) and an equal and small number belonged to Basic

Sciences and Investigations. Basic Sciences items (N=9) were slightly predominant in Course 6

(Reproductive Medicine and Pediatrics) with near-equal number across the skills of Diagnosis,

Investigation and Treatment.

Table 4: Distribution of MCQs According to Type of Skill (N=90)

Course Skill Total

Basic Sciences Diagnosis Investigation Treatment

1 11 10 2 7 30

3 2 17 3 8 30

6 9 7 8 6 30

Total 22 34 13 21 90

Table 5 shows the examinees’ numbers across three courses over three years with the

largest number of examinee data analyzed for Course 1, i.e., 527. The number of examinees

varied between 151 and 179 across the courses and was more consistent for Course 1 as

compared to the rest.

Table 5: Number of Examinees Across Courses and Years

Course 2007 2008 2009 2010 2011 2012 Total

1 174 179 174 527

3 151 179 175 505

6 154 176 164 496

Tables 6, 7 and 8 show cross-tabulation of the content of 90 questions used in the

examination for Courses 1, 3 and 6 classified by clinical presentation and skills. For Course 1

(Table 6), 16 clinical presentations were selected. The most common clinical presentations were

Fever/Sore Throat (N=4) and Failing Liver (N=5). For the skills, the items were evenly divided

between Basic Sciences and Diagnosis (N=11), followed by items on Treatment (N=6).

Table 6: Content of 30 Items Course 1 Classified by Clinical Presentation and Skills

Clinical Presentation Skill Total

Basic Sciences Diag Invest Treat

Abnormalities of White Cells 1 2 0 0 3

Acute Abdominal Pain 0 0 0 1 1

Bleeding and Bruising 0 3 0 0 3

Blood in Stool 0 0 0 2 2

Diarrhoea 1 0 0 0 1

Epidemiology 1 0 0 0 1

Fever/Sore Throat 2 0 1 1 4

Genetics 1 0 0 0 1

Immunology 1 0 0 0 1

Jaundice 0 1 0 0 1

Failing Liver 1 1 1 2 5

Lymphadenopathy 0 2 0 0 2

Pharmacology 1 0 0 0 1

Splenomegaly 0 1 0 0 1

Thrombosis 1 0 0 0 1

Undefined 1 1 0 0 2

Total 11 11 2 6 30

Table 7 displays the contents of Course 3 based on clinical presentation and skills. There

were twelve clinical presentations selected for this course of which Chronic Dyspnea was the

most common one (N=6). The largest number of items belonged to the category of Diagnostic

skills (N=17) followed by Treatment (N=8).

Table 7: Content of 30 Items Course 3 Classified by Clinical Presentation and Skills

Clinical Presentation Skill Total

Basic Sciences Diagnosis Investigation Treatment

Anemia/Pallor 0 1 0 0 1

Chest Discomfort 0 0 0 1 1

Chronic Dyspnea 0 3 0 3 6

Congestive Heart Failure 0 1 0 1 2

Cough in Children 0 3 0 0 3

Cough/Fever 0 3 1 1 5

Dyspnea/CHF 0 1 0 0 1

Hypercapnea 1 0 0 0 1

Hypoxemia 1 0 0 2 3

Noisy Breathing in Child 0 2 0 0 2

Lung Nodule/Mass 0 1 1 0 2

Pleural Effusion 0 2 1 0 3

Total 2 17 3 8 30

For Course 6, items were chosen from twenty different clinical presentations, as shown in

Table 8, of which most belonged to the category of Increased Risk/Genetic Disease. Basic

Sciences skills was the dominant one followed by Investigations.

Table 8: Content of 30 Items Course 6 Classified by Clinical Presentation and Skills

Clinical Presentation Skill Total

Basic Sciences Diagnosis Invest Treatment

Childhood/Abnormal Urine Analysis 0 0 1 0 1

Childhood/Adolescent Exam 0 1 0 0 1

Childhood/Developmental Delay 0 2 0 0 2

Childhood/Rash 0 1 0 0 1

Childhood/Respiratory Diseases 0 1 0 0 1

Childhood/Serious Childhood Infection 0 1 0 0 1

Increased Risk/Genetic Disease 6 0 0 0 6

Menopause/Amenorrhea 1 0 0 0 1

Neonatal Jaundice 1 0 0 0 1

Neonatal/SIDS 0 0 0 1 1

Pelvic Mass 0 1 0 0 1

Pregnancy Loss 0 0 0 2 2

Pregnancy/Antepartum Care 0 0 2 0 2

Pregnancy/Intrapartum Care 0 0 2 0 2

Pregnancy/Obstetric Complication 0 0 2 0 2

Pregnancy/Obstetric Emergency 0 0 1 0 1

Prolapse 1 0 0 0 1

Vaginal Discharge/ Urinary Symptoms 0 0 0 1 1

Well Patient/Immunization 0 0 0 1 1

Well Patient/Normal Childhood 0 0 0 1 1

Total 9 7 8 6 30

Item Parameters

Tables 9, 10 and 11 display the descriptive statistics of item parameters of Courses 1, 3 and 6

respectively. For Course 1, the item difficulty indices ranged from 0.26 to 0.92 (Table 9),

the recommended range in the literature being 0.25-0.85.112 Values of the discrimination index

for this course ranged from 0.09 to 0.75, i.e., from below the desirable level to within

the recommended range.112

Table 9: Descriptive Statistics of Item Parameters for Course 1

Parameter N Min Max Mean SD

Difficulty 90 0.26 0.92 0.683 0.148

Discrimination 90 0.09 0.71 0.292 0.132

The trend was a little different for both the indices for Course 3 as compared to Course 1.

This is seen in Table 10. The indices varied from 0.31 to 0.93 for item difficulty

and from 0.31 to 0.58 for item discrimination. This showed that the difficulty index ranged from easy to

adequate levels.112 The discrimination index was noted to vary over a smaller range with a

relatively closer mean but still a moderate standard deviation.112

Table 10: Descriptive Statistics of Item Parameters for Course 3

Parameter N Min Max Mean SD

Difficulty 90 0.31 0.93 0.72 0.133

Discrimination 90 0.13 0.58 0.24 0.126

For Course 6, shown in Table 11, the difficulty index varied from 0.24 to 0.99 and the discrimination

index from 0.07 to 0.69. These findings were consistent with the ones observed in Course 1

and although they showed similar trends of desirable item difficulty,112 and discrimination

index,112 the range of discrimination index was wider with a larger standard deviation from the mean.

Table 11: Descriptive Statistics of Item Parameters for Course 6

Parameter N Min Max Mean SD

Difficulty 90 0.24 0.99 0.755 0.142

Discrimination 90 0.07 0.69 0.178 0.106

4.3 Results of Research Question No. 1

This research question was answered using both CTT and IRT. Item analyses were conducted

to look at the item difficulty and discrimination indices for the three courses over three years.

Details of Course 1 follow. Results of Course 3 can be viewed in Appendices A1 and A2. For

Course 6, they can be viewed in Appendix B1 and B2.

4.3.1 Results of Research Question No. 1 A

What are the item parameters when conducting item analysis with CTT?

One must remember that in CTT, difficulty refers to the proportion of students who are able to

answer an item correctly. The larger the proportion, the easier the item. Table 12 presents the

results of item analysis for Course 1 across three years using CTT. For Year 1, 23 items had

recommended p between 0.25-0.75 and 7 items had a p of more than 0.75. The items with a p

more than 0.75 were no. 1, 6, 17, 18, 19, 21 and 26. Hence, they were easy. For Year 2, the items

that fitted into the category of easy items were 8, 9, 11 and 14 as their p was more than 0.75. The

analysis of items from Year 3 showed similar results as Year 1 with 23 items having adequate

levels of difficulty. Seven items were noted to be easy, i.e., 1, 6, 17, 18, 19, 21 and 26.

Interestingly, items in Years 1 and 3 showed more stability in terms of difficulty. Although the

majority of the items in Year 2 were also stable over time, items 8, 9, 11 and 14 yielded different

results, i.e., they were found to be easy by the students in that year.

Regarding the discrimination index, also called point biserial correlation in CTT, many items

had a value greater than 0.2. For Year 1, four items, i.e., 7, 8, 22 and 26 had a p-bis correlation of

>0.3. There were 9 items that had a p-bis >0.2. They were items no. 3, 4, 5, 12, 14, 17, 24, 27

and 28. For Year 2, one item, i.e., no.8 had a p-bis of >0.4. Nine items had a p-bis >0.2. They

were items no. 5, 7, 9, 14, 17, 18, 20, 22 and 23. Trends similar to Year 1 were observed in Year

3 where the same four items as Year 1 had a p-bis >0.3. They were items no. 7, 8, 22 and 26.

Furthermore, nine items had a p-bis >0.2. They were the same as in Year 1, i.e., items no. 3, 4, 5,

12, 14, 17, 24, 27 and 28.

In summary, by looking at Table 12, one notices that item difficulty was adequate for all

three years although students in Year 2 found the items slightly easier. In addition, items 15 and

19 had higher difficulty for Year 1 and 3 and lower for Year 2 which was different from the

trend of the rest of the items. In contrast, item 1 had higher difficulty for all three years (0.84-

0.86), a potential explanation being that this item was probably testing core knowledge that all the

students were expected to know. Regarding item discrimination, only one item, i.e., no. 8 had

ideal discrimination of 0.40 for Year 2. Also, Year 1 and 3 were more consistent with each other

in respect of item difficulty and discrimination than Year 2.

Table 12: Item Difficulty (p) and Point Biserial (p-bis) Correlation of Course 1 Using CTT

Year 1 Year 2 Year 3

ID p p-bis p p-bis p p-bis

1 0.862 0.093 0.844 0.058 0.861 0.090

2 0.448 0.161 0.408 0.014 0.423 0.165

3 0.695 0.202 0.777 0.117 0.678 0.213

4 0.672 0.244 0.665 0.066 0.671 0.234

5 0.569 0.281 0.704 0.248 0.567 0.279

6 0.816 0.039 0.911 0.112 0.809 0.042

7 0.477 0.305 0.620 0.296 0.472 0.325

8 0.575 0.306 0.765 0.401 0.568 0.303

9 0.592 0.150 0.832 0.258 0.594 0.152

10 0.489 0.081 0.525 0.010 0.488 0.081

11 0.690 0.107 0.816 0.084 0.687 0.114

12 0.746 0.206 0.726 0.072 0.745 0.216

13 0.632 0.160 0.609 0.112 0.630 0.159

14 0.678 0.211 0.866 0.216 0.667 0.208

15 0.638 0.193 0.402 0.092 0.639 0.190

16 0.747 0.153 0.793 0.038 0.734 0.155

17 0.810 0.263 0.844 0.269 0.808 0.270

18 0.839 0.157 0.793 0.202 0.826 0.155

19 0.920 0.085 0.760 0.114 0.925 0.083

20 0.753 0.148 0.799 0.214 0.754 0.147

21 0.822 0.123 0.777 0.105 0.823 0.109

22 0.718 0.348 0.676 0.240 0.716 0.343

23 0.678 0.172 0.626 0.278 0.668 0.170

24 0.742 0.155 0.753 0.028 0.724 0.155

25 0.840 0.262 0.834 0.269 0.808 0.260

26 0.816 0.309 0.816 0.152 0.811 0.301

27 0.744 0.283 0.765 0.180 0.746 0.276

28 0.500 0.227 0.363 0.162 0.497 0.231

29 0.713 0.114 0.682 0.150 0.719 0.112

30 0.687 0.135 0.696 0.148 0.684 0.133

Mean 0.699 0.189 0.714 0.156 0.701 0.187

SD 0.121 0.079 0.137 0.096 0.122 0.079
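The two CTT indices reported in Table 12 can be illustrated with a minimal sketch. It assumes dichotomously scored responses (1 = correct, 0 = incorrect) and an uncorrected point biserial, i.e., the item is left in the total score; whether the software used for this thesis applies a correction is not stated here. The response matrix and the function names are hypothetical.

```python
import math

def item_difficulty(item_scores):
    """CTT difficulty index p: proportion of examinees answering correctly."""
    return sum(item_scores) / len(item_scores)

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def point_biserial(responses, item):
    """Correlation between the 0/1 item score and the total test score.

    Assumption: the total score includes the item itself (uncorrected p-bis).
    """
    item_scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    return pearson_r(item_scores, totals)

# Hypothetical data: rows are examinees, columns are items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
]
n_items = len(responses[0])
p_values = [item_difficulty([row[j] for row in responses]) for j in range(n_items)]
pbis_values = [point_biserial(responses, j) for j in range(n_items)]
```

A larger p means an easier item, whereas a larger p-bis means that students with higher total scores were more likely to answer the item correctly.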

4.3.2 Results of Research Question No. 1B

What are the item parameters when conducting item analysis with IRT?

Table 13 shows the results of item analysis carried out by applying IRT. More than fifty

percent of the items in all three years had an item difficulty of <0.25. In Year 1, six items had a

recommended range of difficulty between 0.25-0.75. They were items no. 5, 8, 9, 13, 15 and 24.

Three items had an item difficulty of > 0.75. They were items no. 7, 10 and 28. In the context of

Year 2, other than one item, i.e., no. 24, a different set of items (compared to Year 1) yielded a

desirable difficulty level of between 0.25-0.75. They were items no. 7, 13, 22 and 23. Three

items in Year 2 had an item difficulty of >0.75. They were items no. 10, 15 and 28. Year 3

showed very similar trends and hence temporal stability with Year 1. Items no. 5, 8, 9, 13, 15 and

24 had desirable levels of item difficulty between 0.25-0.75 and items no. 7, 10 and 28 had an

item difficulty of >0.75. In all three years, several items had negative b values, i.e., they
were easy items.

A report on the discrimination indices for Course 1 for three years follows. All three years

for Course 1 showed stable temporal trends as most of the items had the a parameter higher than

the recommended one of 0.4. This means that they are discriminating well. Item no. 2 had the a

parameter of <0.3 in all three years, i.e., less than the desirable one. In addition, items no. 3, 4

and 10 had indices of <0.4 for Year 2 and item no. 3 for Year 3.

In summary, the majority of the items were easy for all three years when IRT was applied.
Three items stood out as very difficult for students in all three years. It is hard to explain why
they were found to be more difficult for the students in Year 2, who otherwise showed better
performance in general. Items 1 and 2 had the lowest discrimination of all. When
comparing CTT with IRT, one notices that >50% of the items were of the adequate type (difficulty
level between 0.25-0.75) when CTT was applied for item analysis. In contrast, item analysis
with IRT showed that more than 50% of the items were of the easy type. Discrimination was
better when IRT was applied and was in fact noted to be quite high, as several values were above
the recommended cut-off value of 0.4.

Table 13: Difficulty (b) and Discrimination (a) Indices of Course 1 Using IRT

Year 1 Year 2 Year 3

ID a b a b a b

1 0.198 -4.025 0.245 -3.008 0.188 -4.020

2 0.228 1.193 0.215 1.958 0.227 1.189

3 0.423 -0.301 0.365 -0.966 0.329 -0.305

4 0.601 0.129 0.380 0.005 0.590 0.127

5 0.686 0.638 0.605 0.153 0.680 0.649

6 0.549 -0.785 0.638 -1.255 0.544 -0.791

7 0.720 0.999 0.675 0.610 0.712 0.985

8 0.754 0.635 0.883 0.105 0.752 0.629

9 0.752 0.632 0.833 0.101 0.742 0.619

10 0.754 0.637 0.823 0.115 0.732 0.609

11 0.545 -0.015 0.531 -0.674 0.541 -0.012

12 0.680 -0.233 0.486 -0.159 0.688 -0.219

13 0.592 0.320 0.461 0.510 0.572 0.318

14 0.682 0.182 0.645 -0.803 0.662 0.178

15 0.624 0.318 0.444 1.684 0.623 0.312

16 0.654 -0.175 0.498 -0.592 0.559 -0.173

17 0.800 -0.335 0.742 -0.465 0.794 -0.332

18 0.686 -0.660 0.640 -0.299 0.645 -0.650

19 0.750 -1.193 0.522 -0.300 0.735 -1.190

20 0.617 -0.254 0.625 -0.357 0.601 -0.248

21 0.650 -0.609 0.539 -0.376 0.650 -0.519

22 0.852 0.129 0.599 0.294 0.845 0.129

23 0.612 0.120 0.614 0.551 0.616 0.125

24 0.658 0.726 0.528 0.484 0.659 0.690

25 0.591 1.792 0.524 1.217 0.611 1.790

26 0.837 -0.327 0.578 -0.561 0.834 -0.317

27 0.768 -0.136 0.581 -0.223 0.766 -0.126

28 0.646 0.913 0.522 1.823 0.606 0.901

29 0.591 -0.076 0.508 0.154 0.611 -0.073

30 0.627 -0.435 0.625 -0.357 0.622 -0.432

Mean 0.623 0.000 0.540 0.000 0.623 0.000

SD 0.146 1.000 0.140 1.000 0.146 1.000

4.3.3 Results of Research Question No.1 C

Are the item parameters comparable when conducting item analysis with both CTT and IRT?

Table 14 shows the correlation coefficients between the CTT and IRT item difficulty indices
for Course 1 in each of the three years. The correlation coefficients for all three years are
good to excellent in magnitude. The negative sign is expected and reflects only the direction of
the two scales. In IRT, a larger item difficulty index (b) denotes a more difficult item.151, 152
In CTT, on the other hand, a larger item difficulty index (p) denotes an easier item, because p is
the proportion of students who answered the item correctly. Hence, a strong negative correlation
indicates close agreement between the two methods.
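The opposite direction of the two difficulty scales can be checked with a small numerical sketch. It assumes a 2 parameter logistic model without a scaling constant and a standard normal ability distribution, approximated on a grid; the b values and the common a value are hypothetical and are not taken from the tables.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (assumes no scaling constant D)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_p(a, b, n_points=201):
    """Approximate expected proportion correct for theta ~ N(0, 1).

    Uses a simple weighted grid from -4 to 4 rather than exact integration.
    """
    thetas = [-4.0 + 8.0 * i / (n_points - 1) for i in range(n_points)]
    weights = [math.exp(-t * t / 2.0) for t in thetas]
    total = sum(weights)
    return sum(w * p_correct(t, a, b) for t, w in zip(thetas, weights)) / total

# Hypothetical items of increasing IRT difficulty with equal discrimination.
b_values = [-1.5, -0.5, 0.0, 0.5, 1.5]
p_values = [expected_p(0.6, b) for b in b_values]

# As b rises, the expected CTT p falls, so p and b correlate negatively.
is_decreasing = all(p1 > p2 for p1, p2 in zip(p_values, p_values[1:]))
```

Under these assumptions every increase in b lowers the expected p, which is why the coefficients in Table 14 carry a negative sign.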

Table 14: Correlation Coefficients of Difficulty Index Between CTT and IRT for Course 1

Year 1 Year 2 Year 3

-0.807 -0.887 -0.804

Table 15 shows the correlation coefficients between the point biserial (CTT) and the
discrimination index a (IRT) for Course 1. The strongest correlation was observed
for Year 1 (r=0.927, p < 0.00). Year 2 also yielded a strong correlation (r=0.847,
p <0.00). Year 3, on the other hand, yielded only a moderate correlation coefficient (r=0.637, p
< 0.00). This trend is similar to the correlation coefficients calculated for item difficulty with
CTT and IRT though not as strong (with the exception of Year 1).

Table 15: Correlation Coefficients of Point Biserial and Discrimination Index Between CTT and IRT for Course 1

Year 1 pbis-a Year 2 pbis-a Year 3 pbis-a

0.927 0.847 0.637

4.3.4 Results of Research Question No.1 D

What is the reliability index of the items?

Reliability coefficients and SE of parameter estimates were calculated for each item for the

three years under research for all three courses. Results of Course 1 are elaborated upon below.

Results of Course 3 can be viewed in Appendix A5 - A7. For Course 6, the results are displayed

in Appendix B5 - B7.

Table 16 presents the SE of difficulty and discrimination parameters along with the alpha

coefficient of test score of Year 1. In this table, aSE refers to the standard error of estimate for a

parameter, bSE is the standard error of estimate of b parameter, and alpha without is the

reliability index of the test score obtained on the removal of that particular item. It is calculated

by Xcalibre taking CTT statistics into account.
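The "alpha without" statistic can be sketched as follows. This is a minimal illustration of Cronbach’s Alpha recomputed after dropping one item, assuming dichotomous item scores and sample variances (n - 1 denominator); it is not a reproduction of the Xcalibre computation, and the response matrix is hypothetical.

```python
def variance(values):
    """Sample variance with an n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cronbach_alpha(responses):
    """Cronbach's Alpha for a matrix with examinees as rows and items as columns."""
    k = len(responses[0])
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1.0 - sum(item_variances) / total_variance)

def alpha_without(responses, item):
    """Alpha of the test score after removing one item (column index `item`)."""
    reduced = [[s for j, s in enumerate(row) if j != item] for row in responses]
    return cronbach_alpha(reduced)

# Hypothetical data: the last item runs against the other three.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
alpha_full = cronbach_alpha(responses)
alpha_drop = [alpha_without(responses, j) for j in range(len(responses[0]))]
```

An item whose removal raises alpha above the full-test value, as the last hypothetical item does here, is a candidate for revision or removal.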

To clarify the concept of SE of estimates of the a and b parameters, two items are discussed
here. For the first item in Table 16, the a parameter is 0.198 and the aSE is 0.126. The
SE of 0.126 is multiplied by 2 (approximately 1.96) since a 95% confidence interval is being
calculated. The lower limit of the interval is 0.198 – 0.252 = -0.05 and the upper limit is
0.198 + 0.252 = 0.45. In other words, if the item were calibrated repeatedly on comparable
samples of students, about 95% of the intervals constructed in this way would contain the true
value of the a parameter. Because this interval includes zero, the discrimination of this item
is estimated very imprecisely.

For item 5, the a parameter is 0.686 and the aSE 0.223. The 95% confidence interval for the a
parameter of this item therefore runs from 0.24 to 1.132 (0.686 ± 0.446). For the difficulty
index b of item 5, the 95% confidence interval runs from 0.358 to 0.918, since the
bSE is 0.140.
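The interval arithmetic used for these two items can be written as a short sketch. It assumes the approximate multiplier of 2 used in the text (the normal-theory value is about 1.96); the b estimate of 0.638 for item 5 is taken from Table 13, and the function name is hypothetical.

```python
def approx_ci(estimate, se, multiplier=2.0):
    """Approximate 95% confidence interval: estimate +/- multiplier * SE."""
    half_width = multiplier * se
    return estimate - half_width, estimate + half_width

# Values quoted in the text for Course 1, Year 1.
item1_a = approx_ci(0.198, 0.126)   # a parameter of item 1
item5_a = approx_ci(0.686, 0.223)   # a parameter of item 5
item5_b = approx_ci(0.638, 0.140)   # b parameter of item 5 (b from Table 13)

# An interval that spans zero signals a very imprecise discrimination estimate.
item1_spans_zero = item1_a[0] < 0.0 < item1_a[1]
```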

As can be noted in Table 16, removal or revision of item no. 1 improves the alpha coefficient
of the test score to 0.64. Removal or revision of certain other items increases it to 0.63. They
are items no. 6, 10, 11, 19 and 21. Most of the items in Table 16 appear to have a large SE for
both a and b, which may be attributable to the small sample size.

Table 16: SE and Reliability Index (Alpha w/o) Course 1 Year 1

Item ID a aSE b bSE Alpha w/o

Item 1 0.198 0.126 -4.025 0.615 0.64

Item 2 0.228 0.501 1.193 0.396 0.62

Item 3 0.423 0.186 -0.301 0.235 0.62

Item 4 0.601 0.189 0.129 0.166 0.61

Item 5 0.610 0.187 0.126 0.166 0.60

Item 6 0.549 0.136 -0.785 0.214 0.63

Item 7 0.720 0.222 0.999 0.134 0.61

Item 8 0.754 0.210 0.635 0.129 0.61

Item 9 0.570 0.237 0.493 0.167 0.62

Item 10 0.500 0.286 0.974 0.186 0.63

Item 11 0.545 0.185 -0.015 0.184 0.63

Item 12 0.680 0.151 -0.233 0.162 0.62

Item 13 0.592 0.212 0.320 0.164 0.62

Item 14 0.682 0.182 0.182 0.149 0.62

Item 15 0.624 0.205 0.318 0.157 0.62

Item 16 0.654 0.157 -0.175 0.164 0.62

Item 17 0.800 0.139 -0.335 0.151 0.61

Item 18 0.686 0.132 -0.660 0.183 0.62

Item 19 0.750 0.124 -1.193 0.221 0.63

Item 20 0.617 0.155 -0.254 0.175 0.62

Item 21 0.650 0.135 -0.609 0.185 0.63

Item 22 0.852 0.161 0.129 0.126 0.60

Item 23 0.612 0.186 0.120 0.164 0.62

Item 24 0.658 0.234 0.726 0.145 0.62

Item 25 0.591 0.174 1.792 0.172 0.62

Item 26 0.837 0.138 -0.327 0.147 0.61

Item 27 0.768 0.150 -0.136 0.146 0.61

Item 28 0.646 0.240 0.913 0.147 0.62

Item 29 0.591 0.172 -0.076 0.174 0.61

Item 30 0.627 0.144 -0.435 0.180 0.62

As stated earlier, several of the items in Year 2 had a difficulty index <0.25. Discrimination
indices were mostly good. Table 17 shows that the overall reliability of this
test is lower than that of Year 1. In fact, it is only in the lower range of what is deemed good
reliability for an MCQ exam. Revision or removal of item no. 2 improves the reliability
coefficient to 0.58. Items no. 1, 4, 12, 15 and 16 also affect the reliability of the test, as their
revision or removal from the test improves the reliability to 0.57. Item 15 in Table 17 has an
aSE (0.234) about half the size of its a estimate (0.444). The lower limit of the 95% confidence
interval is 0.444 - 0.468 = -0.024 and the upper limit is 0.444 + 0.468 = 0.912.
Hence, the 95% confidence interval for the discrimination index of this item runs from -0.024 to
0.912. This is a very large SE and, understandably, the reliability index may be improved to
0.57 by removing the above-mentioned item.

Table 17: SE and Reliability Index (Alpha w/o) Course 1 Year 2

Item ID a aSE b bSE Alpha w/o

Item 1 0.245 0.124 -3.008 0.498 0.57

Item 2 0.215 0.368 1.958 0.418 0.58

Item 3 0.365 0.145 -0.966 0.294 0.56

Item 4 0.380 0.203 0.005 0.252 0.57

Item 5 0.605 0.166 0.153 0.169 0.54

Item 6 0.638 0.122 -1.255 0.247 0.56

Item 7 0.628 0.112 -1.265 0.244 0.55

Item 8 0.883 0.145 0.105 0.130 0.53

Item 9 0.832 0.156 0.714 0.120 0.54

Item 10 0.393 0.302 0.963 0.232 0.58

Item 11 0.531 0.134 -0.674 0.220 0.56

Item 12 0.486 0.163 -0.159 0.211 0.57

Item 13 0.461 0.228 0.510 0.204 0.56

Item 14 0.645 0.126 -0.803 0.208 0.55

Item 15 0.444 0.234 1.684 0.210 0.57

Item 16 0.498 0.140 -0.592 0.224 0.57

Item 17 0.742 0.131 -0.465 0.174 0.55

Item 18 0.640 0.139 -0.299 0.179 0.55

Item 19 0.522 0.150 -0.300 0.205 0.56

Item 20 0.625 0.138 -0.357 0.184 0.55

Item 21 0.539 0.144 -0.376 0.204 0.56

Item 22 0.599 0.177 0.294 0.167 0.55

Item 23 0.614 0.195 0.551 0.158 0.54

Item 24 0.528 0.207 0.484 0.181 0.56

Item 25 0.524 0.243 1.217 0.177 0.55

Item 26 0.578 0.134 -0.561 0.204 0.56

Item 27 0.581 0.147 -0.223 0.187 0.55

Item 28 0.522 0.196 1.823 0.185 0.56

Item 29 0.508 0.182 0.154 0.194 0.56

Item 30 0.625 0.183 -0.357 0.192 0.54

Table 18a displays the alpha without, aSE and bSE of Course 1 Year 3. As stated earlier,

several of the items in Year 3 had a difficulty index <0.25. Discrimination indices were mostly

good. If one looks at the table below, one notices that the overall reliability of this test is good

and about the same as that of Year 1.

Revision or removal of item no. 2 improves the reliability coefficient to 0.64. Removal of

Item no. 22 improves it to 0.63. Reliability was respectively improved to 0.60 and 0.62 by

removing the above-mentioned items. Again, both aSE and bSE were large with moderate

reliability index, potentially due to small sample size.

Table 18a: SE and Reliability Index (Alpha w/o) Course 1 Year 3

Item ID a aSE b bSE Alpha w/o

Item 1 0.188 0.146 -4.020 0.312 0.61

Item 2 0.227 0.449 1.189 0.426 0.64

Item 3 0.329 0.168 -0.305 0.325 0.62

Item 4 0.590 0.147 0.127 0.303 0.62

Item 5 0.680 0.263 0.649 0.247 0.62

Item 6 0.544 0.208 -0.791 0.227 0.62

Item 7 0.712 0.123 0.985 0.286 0.62

Item 8 0.752 0.121 0.629 0.331 0.62

Item 9 0.601 0.205 0.491 0.177 0.60

Item 10 0.505 0.166 0.972 0.260 0.62

Item 11 0.541 0.151 -0.012 0.247 0.62

Item 12 0.688 0.217 -0.219 0.189 0.61

Item 13 0.572 0.144 0.318 0.245 0.62

Item 14 0.662 0.127 0.178 0.253 0.61

Item 15 0.668 0.125 0.176 0.252 0.60

Item 16 0.559 0.283 -0.173 0.229 0.62

Item 17 0.794 0.140 -0.332 0.240 0.62

Item 18 0.645 0.180 -0.650 0.218 0.61

Item 19 0.735 0.173 -1.190 0.234 0.62

Item 20 0.601 0.127 -0.248 0.244 0.61

Item 21 0.650 0.285 -0.519 0.232 0.62

Item 22 0.845 0.154 0.129 0.272 0.63

Item 23 0.616 0.132 0.125 0.225 0.61

Item 24 0.659 0.174 0.690 0.241 0.62

Item 25 0.611 0.197 1.790 0.244 0.62

Item 26 0.834 0.145 -0.317 0.209 0.61

Item 27 0.766 0.151 -0.126 0.217 0.61

Item 28 0.606 0.126 0.901 0.264 0.62

Item 29 0.611 0.139 -0.073 0.244 0.62

Item 30 0.622 0.139 -0.432 0.234 0.61

Table 18b shows the reliability coefficients for the test scores for the three courses for all
three years, calculated by using both CTT and IRT. For CTT, the reliability coefficients
range over 0.57-0.64 for Course 1, 0.51-0.62 for Course 2 and 0.53–0.62 for Course 3. This
indicates that the coefficients were mostly adequate. For IRT, they were marginally better:
0.59–0.69 for Course 1, 0.56–0.69 for Course 2 and 0.53–0.65 for Course 3. This showed that IRT
was not remarkably superior to CTT for assessing the reliability of test scores.

Table 18b: Cronbach’s Alpha for Course 1, 2 and 3 Using CTT and IRT

Course 1 Course 2 Course 3

CTT IRT CTT IRT CTT IRT

Year 1 0.63 0.69 0.62 0.69 0.61 0.64

Year 2 0.57 0.59 0.51 0.56 0.53 0.53

Year 3 0.64 0.67 0.60 0.64 0.62 0.65

4.3.5 Results of Research Question No.1 E

What are the item characteristic curves like for the individual items for each year?

The two technical properties that are used to describe an ICC are the item difficulty and the

item discrimination. Item difficulty describes where an item functions along the x axis which is

the ability scale. It is, thus, a location index. Hence, it is observed that an easy item functions

among the low-ability students and a hard one among the high-ability ones. Item discrimination

is the second technical property of the ICC and it describes how much an item can differentiate

between students with ability below and above the item location. This property influences the

steepness of the curve in its middle. The steeper the curve, the better the discrimination. On the

other hand, the flatter the curve, the less the item is able to discriminate since the probability of

correct response at low levels of ability is nearly the same as it is at high ability levels.
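The roles of the two properties can be made concrete with a minimal sketch of the 2 parameter logistic ICC. It assumes the logistic form without a scaling constant (some software applies D = 1.7; the convention used by Xcalibre is not stated here). The parameter values are those of items 1 and 22 for Year 1 in Table 13.

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model.

    a is the discrimination (steepness) and b the difficulty (location).
    Assumption: no scaling constant D is applied.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def rise_around_location(a, b, width=1.0):
    """Gain in probability from theta = b - width to theta = b + width."""
    return icc_2pl(b + width, a, b) - icc_2pl(b - width, a, b)

# Year 1 estimates from Table 13.
flat_item = (0.198, -4.025)    # item 1: low discrimination
steep_item = (0.852, 0.129)    # item 22: highest discrimination

prob_at_location = icc_2pl(steep_item[1], *steep_item)  # theta equal to b
flat_rise = rise_around_location(*flat_item)
steep_rise = rise_around_location(*steep_item)
```

At theta equal to b the probability is 0.5 for any item, which is why b acts as a location index, and the larger gain in probability around b for item 22 corresponds to its steeper curve.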

For this research, ICCs were generated for all the items for the three years for all three

courses using Xcalibre. For Course 1, some ICCs are elaborated upon below. The remainder of

the ICCs for all the items can be viewed in Appendix C. For the ICCs, ability or theta is plotted

on the x axis and the probability of endorsing an item on the y axis.

Figure 6 below shows the ICCs for five items selected from Course 1. Item no. 2 can be seen

to have low difficulty index. It is, in fact, a very easy item with only fair discrimination. On

visual inspection, Year 1 and 3 look similar but Year 2 appears to be different. This trend is

noticeable in all the ICCs for Year 2, i.e., visually, if overlapped, it does not follow the same

pattern as that of Year 1 and 3. All three curves appear to be quite flat which is attributable to the

low discrimination indices.

The ICCs for Item no. 3 show slightly steeper curves compared to the previous ones. This

indicates that this item is more discriminating at different ability levels although it is still noted

to have very low difficulty index. The next three items, i.e., item no. 8, 9 and 24 show further

steepness of the ICCs. Hence, these items are better than the previous two in differentiating

between students of lower and higher ability. One can note that these three items have adequate

difficulty indices and they are influencing the curves to move to the right.

Figure 6: ICCs for Course 1 (panels: Item No. 2, Item No. 3, Item No. 8, Item No. 9, Item No. 24)

4.4 Results of Research Question No. 2

4.4.1 Results of Research Question No. 2 A

Do the items show stability across years using CTT?

4.4.1.1 Repeated Measures ANOVA

Repeated measures ANOVAs were conducted for the three courses to evaluate stability across
time, taking year as the independent variable and each item parameter in turn as the dependent
variable. In addition, correlation coefficients were calculated for inter-year correlations and
scatter plots constructed. The results of individual repeated measures ANOVA for Courses 1, 3
and 6 are shown in Tables 19-24 and the tables and scatter plots for correlations are depicted
following them.
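The logic of a one-way repeated measures ANOVA with items as the repeated units can be sketched as follows. This is a textbook formulation, not a reproduction of the SPSS output: it assumes each item contributes one value per year, and under this formulation the degrees of freedom are k - 1 and (k - 1)(n - 1) for k years and n items. The data are the p values of the first five items in Table 12, used only for illustration.

```python
def repeated_measures_anova(data):
    """One-way repeated measures ANOVA.

    data: list of rows, one per item; each row holds one value per year.
    Returns (F, df_years, df_error).
    """
    n, k = len(data), len(data[0])
    grand_mean = sum(sum(row) for row in data) / (n * k)
    year_means = [sum(row[j] for row in data) / n for j in range(k)]
    item_means = [sum(row) / k for row in data]

    ss_years = n * sum((m - grand_mean) ** 2 for m in year_means)
    ss_items = k * sum((m - grand_mean) ** 2 for m in item_means)
    ss_total = sum((v - grand_mean) ** 2 for row in data for v in row)
    ss_error = ss_total - ss_years - ss_items  # item-by-year residual

    df_years, df_error = k - 1, (k - 1) * (n - 1)
    f_ratio = (ss_years / df_years) / (ss_error / df_error)
    return f_ratio, df_years, df_error

# p values of items 1-5 for Years 1, 2 and 3 (Table 12).
difficulty = [
    [0.862, 0.844, 0.861],
    [0.448, 0.408, 0.423],
    [0.695, 0.777, 0.678],
    [0.672, 0.665, 0.671],
    [0.569, 0.704, 0.567],
]
f_ratio, df_years, df_error = repeated_measures_anova(difficulty)
```

A small F ratio means that the year-to-year differences in the mean parameter are small relative to the item-by-year residual variation; the p value is then read from the F distribution with the two degrees of freedom.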

The results of repeated measures ANOVA for Course 1 (Table 19) indicate that there were
no significant differences at α = 0.05 amongst the mean measures of item difficulty across the
three years of measurement, as the p value was greater than 0.05. The main
effect was not significant.150, 153 Levene’s Test for Equality of Variances for item difficulty for
C1 was non-significant, i.e., F(1, N = 90) = 0.04, p = 0.96. This indicated that there is
homogeneity of variances between the items across the three years and that they have similar
characteristics. Repeated measures ANOVA did not yield significant differences between the
means of item difficulty over the three years, F(1, 90) = 0.40, p = 0.67.

Item discrimination also yielded consistent results, as Levene’s Test for Equality of
Variances was non-significant for item discrimination (Table 20), F(1, N = 90) = 0.26, p =
0.76. The differences between the item means were also non-significant, F(1, 90) =
1.23, p = 0.23. Hence, it can be said that the items are stable over time.

Difficulty Index for Course 1

Item Parameter Course Repeated Measures ANOVA

F p Effect Size

Item Difficulty C1 0.040 0.96ns 0.009

Note: ns Not significant (significant at α < 0.05)

Discrimination Index for Course 1

F p Effect Size

Item Discrimination C1 0.264 0.76ns 0.028

The results of repeated measures ANOVA for the difficulty parameter for Course 3 (Table
21) were similar to those for Course 1. The differences amongst the mean measures of item
difficulty across the three years of measurement were non-significant at α = 0.05, as the p
value for Course 3 was greater than 0.05. Levene’s Test for Equality of
Variances for item difficulty was non-significant for Course 3, i.e., F(1, N = 90) = 1.65, p = 0.19.
As for Course 1, the F ratio for the between-groups means was non-significant, F(1, 90) = 1.73,
p = 0.18, hence showing stability of the item difficulty parameter across the three years.

Levene’s Test for Equality of Variances for item discrimination was non-significant as

well, i.e., F(1, N = 90) = 0.30, p = 0.73. The main effect was also not significant (Table 22).

Furthermore, the mean differences between groups were also non-significant as F(1, 90) = 1.65,

p = 0.20. These results pointed towards stability of item discrimination parameter over time.

F p Effect Size

Item Discrimination C3 0.307 0.73ns 0.077

The results of repeated measures ANOVA for the difficulty parameter for Course 6 (Table
23) showed similar trends to Courses 1 and 3. The differences amongst the mean measures of
item difficulty across the three years of measurement were non-significant at α = 0.05, as the
p value for Course 6 was also greater than 0.05. Levene’s Test for Equality of Variances for
the item difficulty parameter for Course 6 was non-significant, F(1, N = 90) = 0.13, p = 0.87.
The between-groups means also yielded non-significant results, F(1, 90) = 0.06, p = 0.93,
thus showing stability of items over time.

Like the other two courses, Levene’s Test for Equality of Variances for item discrimination
for Course 6 for all three years was also non-significant (Table 24), F(1, N = 90) = 1.81, p
= 0.16. The F ratio depicted stability of items across the years, since the differences were
non-significant at F(1, 90) = 0.43, p = 0.16.

F p Effect Size

Item Discrimination C6 1.814 0.16ns 0.091

4.4.1.2 Correlation Coefficient

Correlation coefficients (r) were calculated to look at the temporal stability of items using

CTT. It was assumed that if the r was high, the items were stable. Tables 25 and 26 show the

correlation coefficients across the years, i.e., Year 1, 2 and 3, for Course 1 using CTT for both

difficulty and discrimination parameters. These tables are then followed by scatter plots to depict

correlations of one year with another, first for CTT and then for IRT, for both difficulty and

discrimination parameters. Correlation coefficients for Course 3 can be viewed in Appendix A9

and A11. For Course 6, they can be viewed in Appendix B9 and B11.
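The inter-year correlations can be sketched as a small correlation matrix. The snippet assumes Pearson’s r computed across items, with one difficulty value per item and year; for brevity it uses only the p values of items 1-5 from Table 12, so the coefficients it yields will differ from those based on all 30 items.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of item parameters."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# p values of items 1-5 (Table 12), one list per year.
years = {
    "Year 1": [0.862, 0.448, 0.695, 0.672, 0.569],
    "Year 2": [0.844, 0.408, 0.777, 0.665, 0.704],
    "Year 3": [0.861, 0.423, 0.678, 0.671, 0.567],
}

# Symmetric matrix of inter-year correlations, keyed by pairs of year labels.
corr = {
    (row, col): pearson_r(years[row], years[col])
    for row in years
    for col in years
}
```

A coefficient close to 1 for a pair of years indicates that the items kept their relative difficulty from one administration to the other.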

Table 25 shows that all three years yielded positive correlations with each other. The highest
correlation was noted between Year 1 and Year 3 (r = 0.99, p < 0.00), whereas Year 1 and 2 had a
noticeably lower correlation coefficient (r = 0.71, p < 0.00). Year 2 and 3 showed a similar trend
to Year 1 and 2 (r = 0.71, p < 0.00). A high correlation coefficient points towards homogeneity
of the cohort of students and stability of items across the years.

Table 25: Correlation Coefficient of Difficulty Index of Year 1, 2, 3 for Course 1 for CTT

Year 1 Year 2 Year 3

Year 1 1 0.714 0.998

Year 2 0.714 1 0.711

Year 3 0.998 0.711 1

Table 26 expresses the correlation coefficient for the discrimination index for the three

years calculated by CTT. All of them were positive and the highest correlation was seen between

Year 1 and 3 (r = 0.96, p < 0.00) whereas those between Year 1 and 2 (r = 0.56, p < 0.00) and 2

and 3 (r = 0.56, p < 0.00) were much lower. This indicated that the Year 2 cohort was not as

homogeneous as the other two years and items not as stable as the other two years for

discrimination index.

Table 26: Correlation Coefficient of Discrimination Index of Year 1, 2, 3 for Course 1 for CTT

Year 1 Year 2 Year 3

Year 1 1 0.564 0.996

Year 2 0.564 1 0.565

Year 3 0.996 0.565 1

In summary, it can be stated that Year 2 was not as strongly correlated with Year 1 and 3 as

the latter two, i.e., Year 1 and 3 with each other. CTT and IRT yielded similar sort of correlation

index; hence one method did not stand out over the other in terms of stability over time.

4.4.1.3 Scatter Plots for CTT for Item Parameters

Below are the scatter plots of the items for Course 1 plotted between two respective years, i.e.,

Year 1 and 2, Year 2 and 3, Year 3 and 1. They show the correlation of items in the context of

their difficulty and discrimination using CTT. Figures 7-9 show the comparisons between the

three years for Course 1 using CTT. The scatter plots for Course 3 are displayed in Appendix

A13 and A15. For Course 6, the scatter plots are displayed in Appendix B13 and B15.

The first scatter plot is between Year 1 and Year 2. The plot indicates that there is a positive
correlation between the difficulty index of Year 1 and 2 (r = 0.71, p < 0.00). The degree of
correlation is good, which means that the difficulty of most items in one year tracked their
difficulty in the other. Items no. 15, 25 and 28 were noted to deviate from the line of best fit;
these may be considered influential items. For the second figure, again a good and positive
correlation is seen (r = 0.71, p < 0.00), but some items are noted to deviate from the line of
best fit. On closer inspection, these are the same ones as reported for Year 1 and 2, i.e., 15,
25 and 28. The overall result shows a linear and positive correlation. The last plot depicts a
very strong, positive, linear correlation between the item difficulty of Year 3 and 1, as nearly
all the values of item difficulty fall on the line of best fit. This shows that inter-year
correlations were quite strong for the difficulty index using CTT and that this method yielded
stable results across the three years.

Scatter Plots for Item Difficulty Using CTT for Course 1

Figure 7: Item Difficulty with CTT Year 1 and 2 (axes: Year 1, Year 2)

Figure 8: Item Difficulty with CTT Year 2 and 3 (axes: Year 2, Year 3)

Figure 9: Item difficulty with CTT Year 3 and 1 (axes: Year 1, Year 3)

Scatter Plots of Item Discrimination (p-bis) Using CTT for Course 1

The scatter plots for the three comparisons between the three years for Course 1 using
CTT are seen in Figures 10-12. The first scatter plot is between Year 1 and Year 2. The plot
indicates that there is a positive correlation between the discrimination index of Year 1 and 2.
The degree of correlation is only moderate (r = 0.56, p < 0.00), which means that not all items
correlate with each other strongly. Items no. 9, 10, 23 and 25 were noted to deviate
remarkably from the line of best fit; these may be considered influential items. In addition, the
scatter was wider, which also pointed towards only moderate correlations. For the second figure,
for Year 2 and 3 (r = 0.56, p < 0.00), again a positive relationship was seen, but some items are
noted to deviate from the line of best fit. On closer inspection, these are the same ones as
reported for Year 1 and 2, i.e., 9, 10, 23 and 25. The overall result shows only a moderate,
linear but positive correlation. The last plot depicts a very strong, positive, linear correlation
between the item discrimination of Year 3 and 1, as nearly all the values of item discrimination
fall on the line of best fit.

Figure 10: P-bis with CTT of Year 1 and 2 (axes: Year 1, Year 2)

[Scatter plot of p-bis with CTT; axes: Year 2, Year 3]

To summarize, the higher value of the correlation coefficient for Year 1 and 3 is an indication of
their homogeneity, as indicated by a positive, linear and strong cluster in the scatter plots above.
In contrast, the scatter plots of Year 1 and 2 and of Year 2 and 3 indicate that there are at
least three points that deviate from the line of best fit in the case of CTT and 2
points in the case of IRT, again indicative of heterogeneity of the group. This could be one of the
reasons for the fluctuations in the values of the correlation coefficient reported earlier. As one
will notice in the next section, very similar trends are seen in CTT and IRT for most correlations
and scatter plots.

[Scatter plot of p-bis with CTT; axes: Year 1, Year 3]

4.4.2 Results of Research Question No. 2 B

Do the items show stability across years using IRT?

4.4.2.1 Repeated Measures ANOVA

Repeated measures ANOVA were conducted for IRT for the three courses to evaluate

stability across time by taking years as independent variable and the item parameters as

dependent variables individually. In addition, correlation coefficients were calculated and scatter

plots constructed. Furthermore, TCCs for the three courses were also generated for visual

comparison. The results of individual repeated measures ANOVA for Courses 1, 3 and 6 are

shown in Tables 27-32 and the TCCs follow them.
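A test characteristic curve (TCC) is the sum of the ICCs of the items in a test, so it gives the expected number-correct score at each ability level. The sketch below assumes the 2 parameter logistic model without a scaling constant and, purely for illustration, uses only items 1-5 of Course 1 with the Year 1 and Year 2 estimates from Table 13; it does not reproduce the curves generated by Xcalibre.

```python
import math

def icc_2pl(theta, a, b):
    """2PL probability of a correct response (assumes no scaling constant D)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Expected number-correct score at ability theta: the sum of the ICCs."""
    return sum(icc_2pl(theta, a, b) for a, b in items)

# (a, b) estimates for items 1-5 of Course 1 (Table 13).
year1_items = [(0.198, -4.025), (0.228, 1.193), (0.423, -0.301),
               (0.601, 0.129), (0.686, 0.638)]
year2_items = [(0.245, -3.008), (0.215, 1.958), (0.365, -0.966),
               (0.380, 0.005), (0.605, 0.153)]

# Evaluate both curves on a common ability grid for visual comparison.
thetas = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
tcc_year1 = [tcc(t, year1_items) for t in thetas]
tcc_year2 = [tcc(t, year2_items) for t in thetas]
```

If the two lists stay close to each other across the ability grid, the items behaved similarly as a set in the two years.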

The results of repeated measures ANOVA for Course 1 (Table 27) indicate that there were
no significant differences at α = 0.05 amongst the mean measures of the b parameter across the
three years of measurement, as the p value was greater than 0.05. Levene’s Test for
Equality of Variances for the b parameter for Course 1 was non-significant, F(1, N = 90) = 0.11, p
= 0.89, the interpretation being that there is homogeneity between the items across the three
years and that they have similar characteristics. The F ratio also yielded non-significant
results, F(1, 90) = 0.00, p = 0.99, indicating that the differences in between-groups means were
non-significant and the items, hence, stable over time.

Levene’s Test for Equality of Variances for the a parameter was non-significant,
F(1, N = 90) = 2.63, p = 0.08 (Table 28). The F ratio indicated that item discrimination had
stable characteristics over times 1, 2 and 3, as the result for between-groups differences was
non-significant, F(1, 90) = 3.01, p = 0.22.

Table 27: Repeated Measures ANOVA to Determine the Effect of Time on the b Parameter

for Course 1

F p Effect Size

b Parameter C1 0.113 0.89ns 0.018

Table 28: Repeated Measures ANOVA to Determine the Effect of Time on the a Parameter

for Course 1

F p Effect Size

a Parameter C1 2.63 0.08ns 0.083

The results of repeated measures ANOVA for Course 3 (Table 29) indicate that similar

results as Course 1 were obtained with this course as well. There were no significant differences

at α < 0.05 amongst the mean measures of b parameter across the three years of measurement as

the p value for the three years was more than 0.05. Levene’s Test for Equality of Variances for b

parameter for Course 3 was non-significant and showed that F(1, N = 90) = 0.29, p = 0.74,

pointing towards homogeneity between the items across three years and the fact that they have

similar characteristics. In addition, the F ratio revealed non-significant differences in the

between-groups mean, hence showing item stability over time. It was F(1, 90) = 0.01, p = 1.00.

The a parameter also yielded non-significant result for Levene’s Test for Equality of

Variances for all three years (Table 30). It was F(1, N = 90) = 0.85, p = 0.42. These results

indicated that both the item parameters were stable over times 1, 2 and 3. The F ratio yielded

non-significant result for the between-groups mean. It was F(1, 90) = 13.90, p = 0.96.

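Table 29: Repeated Measures ANOVA to Determine the Effect of Time on the b Parameter for Course 3

F p Effect Size

b Parameter C3 0.295 0.74ns 0.101

Table 30: Repeated Measures ANOVA to Determine the Effect of Time on the a Parameter for Course 3

F p Effect Size

a Parameter C3 0.859 0.42ns 0.096

Note: ns Not significant (significant at α < 0.05)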

The results of repeated measures ANOVA for Course 6 (Table 31) indicate that results similar
to those of Courses 1 and 3 were obtained with this course as well. There were no significant
differences at α = 0.05 amongst the mean measures of the b parameter across the three years of
measurement, as the p value was greater than 0.05. Levene’s Test for Equality
of Variances for the b parameter for Course 6 was non-significant, F(1, N = 90) =
0.07, p = 0.92. The between-groups means did not differ significantly, F(1, 90) = 0.00,
p = 1.00, thus showing item stability over time.

The a parameter also yielded a non-significant result for Levene’s Test for Equality of
Variances (Table 32), F(1, N = 90) = 0.17, p = 0.84. The F ratio also indicated that both
item parameters were stable over times 1, 2 and 3, as the difference in between-groups means
was non-significant, F(1, 90) = 5.09, p = 0.88.

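Table 31: Repeated Measures ANOVA for b Parameter for Course 6

                  F       p        Effect Size
b Parameter C6    0.077   0.92ns   0.011

Table 32: Repeated Measures ANOVA for a Parameter for Course 6

                  F       p        Effect Size
a Parameter C6    0.173   0.84ns   0.031

Note: ns = Not significant (significant at α < 0.05)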

4.4.2.2 Correlation Coefficient

Correlation coefficients (r) were calculated to look at the temporal stability of items using

IRT as was done in the context of CTT. Course 1 is elaborated upon here while the correlation

coefficients for Course 3 and 6 are presented in the appendix. It was assumed that if the r was

high, the items were stable. Tables 33 and 34 show the correlation coefficients across the years,

i.e., Year 1, 2 and 3, for Course 1 using IRT for both difficulty and discrimination parameters.

These tables are then followed by scatter plots in the next section to depict correlations of one

year with another for both difficulty and discrimination parameters when calculated with IRT.

Correlation coefficients for Course 3 can be viewed in Appendix A10 and A12. For Course 6,

they can be viewed in Appendix B10 and B12.

Table 33 shows inter-year correlation coefficients for the difficulty index calculated by IRT. Positive correlation is noted amongst all three years, but the strongest correlation was between Year 1 and 3 (r = 0.99, p < 0.001). Correlations between Year 1 and 2 (r = 0.82, p < 0.001) and between Year 2 and 3 (r = 0.82, p < 0.001) were also quite high. This trend was similar to the one noted when similar analyses were conducted with CTT, although the IRT estimates were more strongly correlated with each other than the CTT ones.

Table 33: Correlation Coefficient of Difficulty Index of Year 1, 2, 3 for Course 1 for IRT

          Year 1   Year 2   Year 3
Year 1    1        0.825    0.999
Year 2    0.825    1        0.822
Year 3    0.999    0.825    1

Table 34 depicts the correlation coefficients for the discrimination index for the three years calculated by IRT. All of them were positive, and the highest correlation was seen between Year 1 and 3 (r = 0.98, p < 0.001), whereas those between Year 1 and 2 (r = 0.73, p < 0.001) and Year 2 and 3 (r = 0.74, p < 0.001) were much lower. This indicated that the Year 2 cohort was not as homogeneous as the other two years and that the items were not as stable in Year 2 when calculating the discrimination index.

Table 34: Correlation Coefficient of Discrimination Index of Year 1, 2, 3 for Course 1 for IRT

          Year 1   Year 2   Year 3
Year 1    1        0.732    0.983
Year 2    0.732    1        0.744
Year 3    0.983    0.744    1
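The inter-year coefficients reported above can be illustrated with a minimal sketch. It assumes that r is Pearson's product-moment coefficient and that each course contributes 30 items; the parameter values and the function names pearson_r and t_statistic are hypothetical and are not taken from the SPSS or Xcalibre output.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing rho = 0, on n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical b parameters for five items in two administrations.
year1_b = [-2.1, -1.4, -0.6, 0.3, 1.2]
year3_b = [-2.0, -1.5, -0.5, 0.2, 1.3]
r = pearson_r(year1_b, year3_b)

# t statistic for a coefficient of 0.73 (the Year 1 vs Year 2 discrimination
# value), assuming n = 30 items.
t = t_statistic(0.73, 30)
print(round(r, 3), round(t, 2))
```

Under these assumptions, even the lowest coefficient in Table 34 corresponds to a t statistic above 5 on 28 degrees of freedom, which is why coefficients of this size are reported as highly significant.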

4.4.2.3 Scatter Plots of Item Difficulty Using IRT for Course 1

Figures 13-15 show positive correlations between the item difficulty index for all three years,

i.e., 1, 2 and 3 for Course 1 when IRT was applied. The scatter plots for Course 3 are displayed

in Appendix A14 and A16. For Course 6, the scatter plots are displayed in Appendix B14 and

In the context of Course 1, Year 1 and 2 show very good correlation with each other (r = 0.82, p < 0.001). Here as well, items 15, 25 and 28 were noted to deviate from the line of best fit. Hence, the items that deviate from the line of best fit are the same ones noted with CTT. A similar type of plot is observed for Year 2 and 3 (r = 0.82, p < 0.001), as all the items showed positive correlation with each other. Again, the same three items as reported with the previous plots are observed here, i.e., 15, 25 and 28. A near-perfect correlation is seen for item difficulty calculated by IRT for Year 3 and 1 (r = 0.99, p < 0.001). Almost all the items fall very close to the line of best fit. All these trends are similar to the ones reported for item difficulty calculated using CTT. Correlation coefficients are slightly better for the IRT analyses.

Figure 13: Item Difficulty with IRT Year 1 and 2 (scatter plot; axes: Year 1 and Year 2)

Figure 14: Item Difficulty with IRT Year 2 and 3 (scatter plot; axes: Year 2 and Year 3)

Figure 15: Item Difficulty with IRT for Year 3 and 1 (scatter plot; axes: Year 1 and Year 3)

4.4.2.4 Scatter Plots for Item Discrimination using IRT for Course 1

The scatter plots for the three comparisons between the three years for Course 1 using IRT are in Figures 16-18. The first scatter plot is between Year 1 and Year 2. The plot indicates that there is a positive correlation between the discrimination index of Year 1 and 2. The degree of correlation is quite good (r = 0.73, p < 0.001), which means that only 2 items did not correlate well and hence were not stable over the years. These were items no. 1 and 2; they deviated from the line of best fit and hence may be called influential items for this set of data. It is interesting to note that with IRT, an entirely different set of items was noted to deviate from the line of best fit. For the second figure, again a positive trend is seen and very few items deviate from the line of best fit. For Year 2 and 3, the correlation coefficient was better than for CTT (r = 0.74, p < 0.001). On closer inspection, the deviating items are the same ones as reported for Year 1 and 2 earlier, i.e., items no. 1 and 2. The overall result shows a moderate, linear and positive correlation. The last plot depicts a very strongly positive, linear correlation between the item discrimination index of Year 3 and 1 (r = 0.98, p < 0.001), as nearly all the values fall on the line of best fit.

Figure 16: Item Discrimination with IRT of Year 1 and 2 (scatter plot; axes: Year 1 and Year 2)

Figure 17: Item Discrimination with IRT of Year 2 and 3 (scatter plot; axes: Year 2 and Year 3)

Figure 18: Item Discrimination with IRT of Year 3 and 1 (scatter plot; axes: Year 3 and Year 1)

To summarize, the higher value of the correlation coefficient for Year 1 and 3 is an indication of their homogeneity, as indicated by a positive, linear and strong cluster in the scatter plots above. In contrast, the scatter plots of Year 1 and 2 and of Year 2 and 3 are not as strongly correlated, though the correlations are still significant. Very similar trends are seen in CTT and IRT for most correlations and scatter plots. Only the discrimination index calculated by IRT has yielded different influential points (items no. 1 and 2). Years 3 and 1 have shown the most stable temporal pattern.

4.4.3 Test Characteristic Curves

Test characteristic curves (TCCs) were generated to elucidate the stability of all three courses

across the three chosen years respectively. The TCC predicts the proportion or number of items

that an examinee would answer correctly as a function of theta. The X-axis depicts the levels of

theta. The left Y-axis is in proportion correct units while the right Y-axis is in number-correct

units. These graphs are presented in the figures below.

Test Characteristic Curves for Course 1

Figure 19. Test Characteristic Curve for Course 1, Year 1

As can be observed, all three curves for the three successive years appear to be quite similar in shape, which signifies the stability of the items across the years, i.e., times 1, 2 and 3. TCCs for Courses 3 and 6 can be seen in A17 and B17 respectively.
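The relationship between theta and the expected score that a TCC displays can be sketched as follows. This is a minimal illustration assuming the 2 parameter logistic model without a scaling constant; the item parameters and the function names icc_2pl and tcc are hypothetical and do not reproduce the Xcalibre output.

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Expected number-correct score at theta: the sum of the item probabilities."""
    return sum(icc_2pl(theta, a, b) for a, b in items)

# Hypothetical (a, b) pairs for a four-item test.
items = [(0.8, -2.0), (1.1, -1.0), (0.6, 0.0), (1.4, 0.5)]

for theta in (-3, -2, -1, 0, 1, 2, 3):
    number_correct = tcc(theta, items)             # right Y-axis units
    proportion_correct = number_correct / len(items)  # left Y-axis units
    print(theta, round(number_correct, 2), round(proportion_correct, 2))
```

Because each item probability increases with theta, the summed curve rises monotonically, and two administrations with similar item parameters produce nearly overlapping curves.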

It can be summarized that both CTT and IRT show temporal stability for the items across the

years. Year 1 and 3 show more stability with each other than Year 2 when using CTT. This

pattern is seen in the context of both item difficulty and discrimination. This may be attributable

to a non-homogeneous sample in Year 2 with students potentially having better abilities than

those in Year 1 and 3. Some items stand out as potentially influential ones leading to their

deviation from the line of best fit. These items need either revision or their removal from the test

to improve the stability over time.

CHAPTER V- DISCUSSION

In this research, item analysis, reliability and stability of 90 MCQs were assessed three times

over six years. These MCQs covered the skills of Basic Sciences, Investigations, Diagnosis and

Management. The data were analyzed using and comparing CTT and IRT. This research showed

that the items had adequate item difficulty and discrimination using both methods along with fair

reliability. Furthermore, the items were stable for some years when repeated. In the context of

Course 1, they were more stable for Year 1 and 3 than for Year 2. What is unique about this

research is that two measurement methods have been used to look at the psychometrics of

MCQs, one observing the raw score, the other using the true scores. In addition, stability of the

MCQs on re-using them in recurring years is also an element that has not been extensively

investigated in the field of medical education. Course 1 has been discussed at length in this

research as the number of examinees was the highest amongst the three courses and also the most

consistent.

5.1 Discussion Related to Research Question No. 1

What was the reliability of scores using and comparing two methods of analysis, i.e., item response theory and classical test theory, on MCQ items administered three times over a six year period?

This research question aimed to look at the item parameters using CTT and IRT. The

reliability index was also calculated for the items along with SEM and ICCs generated and

plotted using IRT.

5.1.1 Research Question No. 1 A: What are the item parameters when conducting item

analysis with CTT?

This research showed that most of the MCQs for Course 1 were of ‘adequate’ type, the

rest being easy. For Year 2, a slightly different trend was noted as half the items were easy.

Differences in the performance of Year 2 as compared to Year 1 and 3 may be attributable to

differences in their learning curves. Students tend to learn more as their experience increases.

One explanation of the difference in performance of students in Year 2 might be that some

students entering in the MD program at the University of Calgary have already finished a

master’s. The other reason could be the difference in the style of teaching. Both the Faculty of

Medicine and the Teaching and Learning Center at the University of Calgary offer teaching

certificates. The former is a requirement for the “master teachers” who provide a large

percentage of the teaching in the medical school. This may have made a difference to the

examinees’ performance in Year 2. Research in the fields of education and social sciences has

shown that teaching strategies and methods of information transfer do make a difference to

students’ results.154, 155

Discrimination index is important as it helps to distinguish between students of different abilities. It also highlights the weaknesses of the MCQs under study by giving a value to the degree of difference between students of high and low ability. Confusing or ambiguous wording, along with an incorrect answer key, may lead to poor discriminatory values. In the case of item discrimination for Course 1 using CTT, nearly half of the items in this research had a fair discrimination index for Year 1 and 3. For Year 2, only about one third had a fair discrimination index of >0.2. Similar findings were observed in the analyses of Course 3 and 6 as well.
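The two CTT indices discussed above can be illustrated with a short sketch. It assumes the upper-lower group form of the discrimination index with 27% tails, which is one common convention and not necessarily the one applied in SPSS for this research; the response matrix and the function name item_statistics are hypothetical.

```python
def item_statistics(responses, item, fraction=0.27):
    """Return (difficulty p, discrimination D) for one item.

    responses: one list of 0/1 item scores per examinee.
    D is the proportion correct in the top-scoring group minus the
    proportion correct in the bottom-scoring group.
    """
    n = len(responses)
    p = sum(row[item] for row in responses) / n
    ranked = sorted(responses, key=sum, reverse=True)  # highest totals first
    k = max(1, round(fraction * n))                    # size of each tail group
    p_upper = sum(row[item] for row in ranked[:k]) / k
    p_lower = sum(row[item] for row in ranked[-k:]) / k
    return p, p_upper - p_lower

# Hypothetical scores: 8 examinees by 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

for item in range(4):
    p, d = item_statistics(responses, item)
    print(item + 1, p, d)
```

In this form, an item answered correctly by most of the upper group and few of the lower group receives a high D, while an item that both groups answer equally often receives a D near zero.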

It has been observed that good students become overcautious in attempting to answer

parts of an item they are not completely sure of as they fear losing hard-earned marks on the

other item parts. On the other hand, relatively weaker students would take risks since they

already know little about the topic. They expect the least score they can get is a zero and hence,

they take a chance at attempting to answer the option. With the SBA type of MCQs only one

option is the correct one; hence the element of chance guessing is reduced to an extent. It is quite striking that, in the case of Course 1, results differed most for Year 2, where a few items were found to be more difficult by the Year 2 cohort. One explanation for poor discrimination, in addition to miskeying and ambiguity of the item, is that concepts may have been less clear to the students in the other two years. Year 2 students, on the other hand, likely selected the right answer because of their intrinsic ability to explore, but it was marked wrong. Other potential causes of varied indices may be the wording of the question and areas of controversy in the topic being questioned. Bhakta et al 31 have noted that

the reason for frequently selecting the incorrect response as the correct one is attributable to the

distracter being very close to the correct option in terms of the accuracy of information it

provides. According to their findings, if a distracter is constructed so that it is very close to the

correct option, it is chosen frequently as the correct option by the students. The difference lies in

the ability of the examinees as the ones with lower ability usually choose the distracter as the

correct option and those with higher ability actually choose the correct option.

Hingorjo et al156 utilized 50 MCQs from a physiology exam for undergraduate students. The

mean difficulty index reported by them is again similar to our research, i.e., 0.78. Furthermore,

they reported a mean discrimination index of 0.35 with 62% of the items having an excellent

discrimination index of 0.4-0.45. This is also comparable to this research where similar

discrimination indices were reported. Contrary to this research, another study reported lower

discrimination indices, mostly between 0.2 and 0.25, on a set of seventy MCQs, which were

randomly selected from para-clinical subjects.157 They attributed these low indices to the

ambiguity in the content of the MCQ items.

It is desirable that MCQs at the medical school level are constructed to assess higher order

thinking and analysis in addition to application and synthesis. In the case of Course 1 and 3

which are offered in the first year of medical school at the University of Calgary, these items

may be slightly less in number but as the student matures and moves on, it is acceptable, in fact

warranted, that more difficult types of MCQs be encountered in an exam. In the context of this

research, more MCQs were of the easy / adequate type and fewer of the difficult type, and discrimination

was only fair. For a summative exam, it is desirable that a certain proportion of the items are the type

that are more difficult and discriminating. In our research, because of similarities in the groups of

students, this was not the case. It must be kept in mind that some items will have low

discrimination indices because they may represent content that is expected to be known and

understood by the student.158

5.1.2 Research Question No.1 B: What are the item parameters when conducting item analysis

with IRT?

The majority of the items were of the easy type for all three years when IRT was applied.

Strikingly in Course 1, three items stood out as very difficult for students in all three years. It is

difficult to explain why they were found to be more difficult for the students in Year 2 who

otherwise have shown better performance in general. One explanation may be that these items

were the ones whose underlying concepts were not taught effectively and although the

misconception was understood by the students in Year 1 and 3 as taught, the students in Year 2,

with their superior ability of reasoning, were able to identify the concept as unclear or wrong.

Items 1 and 2 had the lowest discrimination amongst all. Discrimination was better when IRT

was applied and was in fact noted to be quite high as several values were above the ideal cut-off

value of 0.4.

An item with a difficulty level where fifty percent of the students are able to answer correctly

may be appropriate depending on what the aim of the exam is and what the sample

characteristics might be, i.e., smaller size, narrow content as was the case in this research. The

difficulty and discrimination indices are often reciprocally related.159 However, this may not

always be true. Questions having higher p (easier questions) discriminate poorly; conversely,

questions with lower p (harder questions) are considered to be good discriminators. A potential

reason for such high discrimination indices as noted in this research could be the narrow

examination content that the students were assessed on. On the other hand, if the efficiency of

distracters is good, the discrimination index becomes narrow.

For Course 1, if one looks at the percentages of the items, more than half of the items were of

the easy type. A close inspection revealed that the easy ones were easier for students in Year 2.

This may be an indication of their better ability or better quality of both teaching and learning, as

stated earlier. Some items were easier for Years 1 and 3 when it has been observed elsewhere

that students in Year 2 were better performers. A closer look at the discrimination index shows

that such items were also more discriminating for Year 1 and 3 than for Year 2. It is

recommended that such items be either revised or removed from the exam. On the other hand,

items with a low difficulty index for Year 1 and 3 which had a higher difficulty index for Year 2

were likely appropriately taught and tested. Since it is thought that students in Year 2 had better

abilities, it might be that the concepts underlying these items may have been misunderstood by

students in Year 1 and 3. Another reason for the Year 2 students finding easy items as difficult

could be that although everybody might have made a guess, Year 2 students failed to guess the

right answer. This is where the 3 PL model can help which looks at the guessing behaviour of

students. Difficult items like nos. 10, 15 and 28 are the ones that may play a role in

differentiating between students with high and higher abilities where honours need to be

determined in addition to decisions about pass and fail.
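The way the 3 PL model accounts for guessing can be sketched briefly. This is an illustration only: the 3 PL model was not fitted in this research, and the parameter values and the function name icc_3pl are hypothetical.

```python
import math

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    c is the pseudo-guessing parameter: the lower asymptote of the curve,
    i.e. the chance that a very low-ability examinee answers correctly.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item with a guessing parameter of 0.25
# (for example, a four-option single best answer item).
for theta in (-4, -2, 0, 2, 4):
    print(theta, round(icc_3pl(theta, a=1.0, b=0.0, c=0.25), 3))
```

With c set to zero the expression reduces to the 2 PL model, so any difference between the two sets of estimates reflects the allowance made for guessing.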

From the results so far, one gets an impression that although CTT and IRT are mostly

comparable, there are subtle differences noted in context of both parameters. The fact remains

that in CTT, the item statistics are sample-dependent and in IRT, sample-independent. It appears

that IRT has demonstrated a more specific analysis of the items than CTT which is what was

anticipated as IRT works at item level and CTT at test level. These parameters are sometimes

affected by unidentified changes in the characteristics of a sample drawn from a population and

thus the item statistics are completely changed, thus providing evidence for its sample

dependence in CTT.160

Fan152 conducted research with the objective of looking at the comparability of CTT and IRT

with a very large data size. In this research, 108 MCQs were analyzed that were used to assess

40,000 students. Although Fan152 used all three parameters to assess the comparability of IRT

with CTT, it was found that the results of the analysis were most comparable in the context of

both item difficulty and discrimination when 1 and 2 PL models were used. Similar results were

also reported by a more recent study, again using a very large data set. Guler and colleagues151

looked at comparing the two measurement methods, i.e., CTT and IRT. Although their data set was

smaller than the one reported on by Fan, the results were consistent. CTT and IRT,

especially the 2 PL, were found to be comparable with each other.

5.1.3 Research Question No.1 C: Are the item parameters comparable when conducting item

analysis with both CTT and IRT?

Studies have shown moderate to excellent comparability between item parameters when

applying CTT and IRT.151, 152 This research showed similar results for most of the years for all

three courses. Fan152 conducted research comparing CTT with the three dichotomous models of

IRT. The examinee data size in that research was much larger, at 1,000 for each sample set.

One, two and three parameter logistic models were applied to a criterion-referenced test. As in

this research, correlation coefficients were calculated for CTT and IRT for the item difficulty and

discrimination. The correlation coefficients for item difficulty reported by Fan are around 0.9;

the ones reported by ourselves are about the same or around 0.8 for most of the years for all three

courses. In Fan’s study, the best correlation coefficients were noted for the 1 PL model. Similar

trends were noted for item difficulty for both 2 and 3 PL models. The researcher attributed the

differences in the correlations to the sampling of the items. In contrast, item discrimination,

although correlated with each other, did not do as well as item difficulty. Like the study by Fan,

a ceiling effect is seen in this research as well. Although theirs is attributable to the nature of the

exam which was minimum-competency, ours is likely due to the homogeneity of students. In the

context of item discrimination, Fan found that both CTT and IRT were comparable though not as

strongly. Our research showed very good comparability for item discrimination as well with both

CTT and IRT. Our sample of students was quite consistent with each other in ability levels and

the discrimination was likely uniform due to that reason. One reason for the findings above may

be that although the number of examinees was relatively adequate in our research, the number of

items was small. Fan’s research was replicated by Courville161 to study similarities between CTT

and IRT which further strengthened the notion that CTT and IRT are quite comparable.

In another study carried out by Guler et al,151 about 1200 students were assessed with 25

items for a high school entrance exam. Both CTT and IRT were applied to the data to look at

person and item fit statistics. These data were about the same size as the ones in this study,

although smaller than the ones reported by Fan152 and Courville.161 Our results are quite similar to

the ones reported by Guler as high correlation coefficients were noted between both CTT and

IRT for the given data. The best correlations were seen for the 1 PL model for item difficulty and

for 2 PL model for item discrimination. Interestingly, the poorest correlations were seen with the

3 PL model which was attributed to the guessing behaviour of the students.

5.1.4 Research Question No.1 D: What is the reliability index of the test scores?

Reliability coefficient and SE of estimates were calculated for each item for the three years

examined for all three courses. The SE for both difficulty and discrimination parameters were

mostly found to be large in this research. As a consequence, the reliability was noted to be only

fair to moderate. The large sizes of SE are most likely attributable to the small sample size. It is

also known that the error tends to be larger for students who are high scorers, as was the case in this

research, since the noise from stronger students is likely to be larger than from the weaker

ones.149 There are several ways of ensuring that the SE is kept within the acceptable range and

reliability improved. The items should be written without any confusing or misleading

statements. In addition, the instructions about the question should be clearly written.

Furthermore, the marking should be objective. Reliability of the scores tends to decrease if the

items on a test are too easy or too difficult. It is also affected by the characteristics of a group

since the more heterogeneous the group, the higher the reliability.162 In this particular research,

students belonged to a local medical school where entry is gained after going through a rigorous

admission process. As a result, those who ultimately enter the school have very similar

characteristics and ability levels. One reason why the reliability coefficient of the scores may be

low in this research could be the homogeneity of the sample. In addition, the items that were

analyzed were mostly found to be easy. Both these factors may have led to low reliability

reported in this research. Research has highlighted that reliability coefficients are helpful in

informing the researchers about the sampling errors that can adversely affect the reliability.163 If

the reliability of the scores is low, it may also indicate that either the test is short or the content

being examined is narrow. In one study, a smaller data set of about 25 MCQs for a low stake

exam was analyzed.164 The research was conducted in the field of pulmonology where the MCQs

were randomly selected from a larger pool of 70 items. Cronbach’s alpha was reported to be

0.69, quite similar to the ones reported for most of the years in our research. The research

concluded that the relatively low reliability index was attributable to the narrow content of

assessment.
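The reliability index and the standard error of measurement referred to in this section can be illustrated with a minimal sketch. It assumes the usual formulas for Cronbach's alpha and for the SEM of the total score; the response matrix and the function names cronbach_alpha and standard_error_of_measurement are hypothetical and do not reproduce the SPSS output.

```python
import math
from statistics import stdev, variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a matrix with one row of item scores per examinee."""
    k = len(responses[0])
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def standard_error_of_measurement(responses):
    """SEM of the total score: SD of the totals times sqrt(1 - alpha)."""
    totals = [sum(row) for row in responses]
    return stdev(totals) * math.sqrt(1 - cronbach_alpha(responses))

# Hypothetical scores: 8 examinees by 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
sem = standard_error_of_measurement(responses)
print(round(alpha, 3), round(sem, 3))
```

The formula makes the argument above concrete: a homogeneous group has a small total-score variance relative to the summed item variances, which pulls alpha down, and a short test has a small k, which has the same effect.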

When the reliability coefficients of test scores were compared by the two methods, i.e.,

CTT and IRT, the results indicated that neither CTT nor IRT was particularly better than the

other. In fact, the results with both the methods were quite consistent with each other. Although

the reliability coefficients for the test scores for all three years for the three courses were slightly

better when applying IRT, at a local medical school, they were not significant enough to

recommend the use of only IRT for measurement purposes.

5.1.5 Research Question No.1 E: What are the item characteristic curves like for the

individual items for each year?

ICCs were generated for individual items for the three years for Course 1, 3 and 6. As

discussed earlier, the ICC expresses the relationship between the ability of an examinee and the

probability of his or her endorsing an item. With the SBA type of MCQs, the curve tends to be S-shaped since, as the level of ability increases, the probability of endorsing an item also increases. The curve is steeper where small changes in the level of ability produce large changes in the probability of endorsing an item. This regression is non-linear. In an ICC, the slope is determined by the item discrimination index. The threshold at which the examinees endorse the item

and the slope of the curve establish the effectiveness of the item as an indicator of the ability. In

our research, more than 50% of the items were of the easy type. Hence, several curves are noted

to be moved to the left. Furthermore, one of the objectives of our research was to look at the

temporal stability of the items. One can notice that the curves look similar at a glance for Year 1

and 3 but not so for Year 2. It has also been speculated that the likely reason for the difference is

the performance of students in Year 2 and hence the difference in the curve is attributable to the

ability of the students in this year which seems to be superior to the other two years.
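The link between the slope of the ICC and the discrimination parameter can be written out as a short derivation. It assumes the 2 PL model in the logistic metric with the scaling constant taken as 1; the symbols a_i, b_i and theta follow standard IRT notation rather than the Xcalibre output.

```latex
% 2PL item characteristic curve for item i
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}

% Slope of the curve with respect to ability
\frac{dP_i}{d\theta} = a_i \, P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr)

% At \theta = b_i the probability is 1/2, where the slope is at its maximum
\left.\frac{dP_i}{d\theta}\right|_{\theta = b_i} = \frac{a_i}{4}
```

Under this assumption the curve is steepest at the difficulty level b_i, and an easy item (low b_i) has its steepest section shifted to the left, as noted for the majority of the items in this research.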

In summary, analyses like ours assist the assessors in revising the items. One option is to

merge the answers of an item together if the domains overlap for items with similar curves. This

will lead to the creation of a single option. It may also be advisable to remove unused options

and replace them with more effective ones. Such measures lead to improvement of

discrimination between less and more able students.

5.2 Discussion Related to Research Question No. 2

5.2.1 Research Question No. 2 A: Do the items show stability across years using CTT?

The stability of items was assessed by using repeated measures ANOVA and calculating the

correlation coefficients of the items under scrutiny which yielded stable results for all three years

for the three courses.

The aim of studying the stability of items over time was to present evidence that they are

repeatable across the years without compromising their psychometric properties. For this purpose

F ratios were calculated for the three courses. In the context of F ratio, if the p value is non-

significant, it shows that between-groups differences are not remarkable and item parameters,

hence, stable. Our research did yield small F ratios for the three courses for both item difficulty

and discrimination parameters, thus signifying the stability of items. Baig and Violato conducted

a similar analysis using MANOVA to compare station stability in the context of

OSCEs for international medical graduates in Alberta.165 They also documented adequate

stability of the OSCE stations over three points in time. The construction and maintenance of an

item bank is difficult, both in terms of monetary factors and in terms of faculty time and

expertise. Once an item has been constructed, it also requires timely updating due to the changes

in the curricular content, usually as a result of new knowledge that has been acquired about the

topic that a student is being assessed on. Keeping in mind the logistics of developing and

maintaining such an item bank, the items that show less stability across the years may sometimes

need to be revised due to a threat to their psychometric properties. Alternatively, they might

require removal from the exam altogether. This decision is also influenced by the objective of the

examination. If such exams are low stake, formative type, the items may only need to be revised.

On the other hand, for summative, end-of-year high stakes exams where decisions about

graduation and certification are involved, such items may need removal.
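The F ratio used as evidence of stability can be sketched for a one-way repeated measures design in which items are the repeated units and years are the within factor. This is a minimal illustration with hypothetical parameter values; the degrees of freedom follow the textbook within-subjects layout, which is an assumption and may differ from the SPSS specification used for the tables in Chapter IV, and no p value is computed.

```python
def repeated_measures_anova(data):
    """data: one row per item, one column per year.

    Returns (F ratio for the year effect, partial eta squared).
    """
    n = len(data)            # number of items
    k = len(data[0])         # number of years
    grand = sum(sum(row) for row in data) / (n * k)
    year_means = [sum(row[j] for row in data) / n for j in range(k)]
    item_means = [sum(row) / k for row in data]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_years = n * sum((m - grand) ** 2 for m in year_means)
    ss_items = k * sum((m - grand) ** 2 for m in item_means)
    ss_error = ss_total - ss_years - ss_items

    df_years = k - 1
    df_error = (k - 1) * (n - 1)
    f_ratio = (ss_years / df_years) / (ss_error / df_error)
    partial_eta_sq = ss_years / (ss_years + ss_error)
    return f_ratio, partial_eta_sq

# Hypothetical b parameters for four items over three years.
b_params = [
    [-1.0, -0.9, -1.1],
    [0.2, 0.1, 0.3],
    [1.0, 1.2, 1.0],
    [-0.4, -0.5, -0.3],
]

f_ratio, eta_sq = repeated_measures_anova(b_params)
print(round(f_ratio, 3), round(eta_sq, 3))
```

When the yearly means barely differ, as in this hypothetical data set, the variance attributable to years is small relative to the residual variance, giving an F ratio well below 1 and a small effect size, which is the pattern interpreted as stability above.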

In our research, correlation coefficients were also calculated, most showing very good

correlation with each other, thus providing further evidence for the temporal stability of the

items. Correlation coefficients express the linear relationship between two variables, the years

being those variables in this research. As indicated in the Results section, some items stood out

as having only fair correlation coefficients. These items cause concerns with both their difficulty

and discrimination indices, when assessed with either CTT or IRT. Such items, if noted to be

affecting the reliability of the scores, should be removed.

5.2.2 Research Question No. 2 B: Do the items show stability across years using IRT?

Repeated measures ANOVA was also carried out for the three courses using IRT to look at

the stability of items over three administrations and to compare the findings of CTT with IRT.

The effect size of the F ratio was small for all the three courses signifying stability over time.

The results yielded by repeated measures ANOVA for IRT showed the same trend as CTT. It can

be, thus, stated that neither the CTT nor the IRT is necessarily superior over the other and the

choice between the two is influenced by factors discussed earlier like the objective of the

research, the data size and the model fit.

Test characteristic curves were also generated to further elucidate the stability of all three

courses across the three chosen years respectively. It was assumed that the temporal stability

would be reflected by the uniformity of the curves observed visually. Baig and Violato have

used similar methods for analyzing the temporal stability of OSCE stations for high stakes

licensing exams for international medical graduates.165 The research under discussion revealed

that the scores using IRT were consistent over three years for the three courses when graphs were

plotted between the ability levels and the scores of the items. Very similar results were obtained

for all three courses for the three years. TCCs provide a means for converting ability scores to

true scores. In this way, a number is given to the examinee which relates to the number of items

in the test. It can be noted in the curves generated for the data in this research that the shape is

mostly that of a smooth S. This is dependent on the number of items and the item parameters.

The ability level at which the curve reaches the mid true score (half the number of items) is

read from the theta axis. This point locates the test on the ability scale in the same way that the

difficulty parameter locates a single item, and it aids the interpretation of the curve for descriptive purposes.
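
The conversion from ability to true score described above can be sketched as follows. Under the 2PL model the test characteristic curve is the sum of the item response functions, so the true score at a given theta is the expected number of items answered correctly. This is a minimal sketch and not the Xcalibre implementation: the function names and item parameters are hypothetical, and the scaling constant D sometimes used with logistic models is assumed to be omitted.

```python
import math
from typing import List, Tuple


def p_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2PL model.

    a is the discrimination parameter and b the difficulty parameter.
    When theta equals b the probability is exactly 0.5.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def tcc(theta: float, items: List[Tuple[float, float]]) -> float:
    """True score (expected number correct) at ability theta."""
    return sum(p_2pl(theta, a, b) for a, b in items)


# Hypothetical (a, b) pairs for a three-item test.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]

# Points of the curve for theta from -3.0 to 3.0 in steps of 0.5.
curve = [(t / 2.0, tcc(t / 2.0, items)) for t in range(-6, 7)]
```

Plotting the points in curve for each administration on the same axes gives the kind of visual comparison used here to judge temporal stability.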

5.3 Implications and Future Directions for Research

High stakes exams require the construction of items that are psychometrically sound in the

context of their reliability. Furthermore, they have to be stable for repeatability since item

banking has many logistic issues associated with the construction and security of items. Item

parameters also influence the selection of items for the exams. If the item parameters are not

taken into consideration before the selection for an exam, there is a chance that good items that

should be in an exam are mistakenly removed and weak ones included. This research has shown

that the choice of one method of scoring over the other depends on the objective of the research

and the size of data. At the level of a local medical school, both the methods yielded very

comparable results. Stability of the items across time is also an issue that must be addressed

while administering them repeatedly since changes in construct, curricular content, test-wiseness

and other threats to the security of such items require that the factors that lead to parameter drift

be more thoroughly explored.

Although CTT has been the mainstay of measurement methods in the past, the more recent

decades have seen increased use of IRT. It is now being increasingly utilized in the educational

field for the calibration and evaluation of items in various tests and questionnaires for the scoring

of attitudes, abilities and other traits. Recent advances have seen more frequent application of

IRT in the context of item scaling, equating and CAT. Item calibration and test equating with

IRT are both important for the movement of IRT in a forward direction. As IRT models continue

to evolve, it is hoped that they will soon become less analytically and computationally intensive.

As these models become more able to adapt to the design, size and complexity of assessments,

they are expected to play a more pivotal role in assessments.

5.4 Limitations of the Study

This research looked at the reliability of MCQs using both CTT and IRT. One of the

limitations of this study was the choice of model. Since a 2PL model was applied in this research, the

guessing behaviour of the students could not be studied.
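
The reason the 2PL model cannot capture guessing is visible in the standard form of the 3PL item response function, given here for reference with the scaling constant D assumed to be omitted:

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}
```

Here a_i is the discrimination, b_i the difficulty and c_i the lower asymptote (pseudo-guessing) parameter of item i. The 2PL model is the special case c_i = 0, so the probability of a correct response is forced towards zero for examinees of very low ability and no guessing parameter is estimated.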

Another limitation of this study was the limited choice of items. The SBA type of MCQs

were included in this research. For consistency, it was also decided to include in the study those

items that had five options to choose the correct answer from. They also had to have been

repeated in at least three consecutive or overlapping years. In addition to the factors narrowing

down the data, the content examined was narrow as well. The items were chosen from the four

skills of Basic Sciences, Investigations, Treatment and Management. Several of the items were

of the ‘easy’ type. It is clear that such items do affect the stability over time (as evidenced by the

deviation of some items from the line of best fit in the scatter plots). In future, it might be useful

to look at a wider variety of MCQs as recommended for a high stakes exam.152

5.5 Conclusion

Effective measurement of knowledge is vital for the growth of a program. Methods that

are used to assess students’ knowledge have to be evaluated for the qualities of a good

assessment tool as recommended by Norcini et al.8 It is, therefore, important to evaluate the

MCQs to observe their effectiveness in measuring the knowledge of students in preclinical years.

This research was carried out by using and comparing two methods of item analysis for

establishing the reliability of scores on MCQs of MD certifying exams at the University of

Calgary. Results showed that the analyses of the selected items were comparable between CTT

and IRT to some extent. Several items were noted to be of the ‘easy’ type. Furthermore, the item

discrimination was noted to be ‘good’. The reliability of these MCQs was found to be fair only.

The fair indices of reliability may be attributable to the homogeneity of the student sample and

the relatively small size of the data as also indicated by mostly large standard errors of estimates.

In addition, the correlation coefficients calculated for the three years for three courses were only

moderate to good in some instances which means that those items which correlated to a lesser

extent with each other did not exhibit remarkable temporal stability.
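
The CTT indices referred to above can be sketched as follows for dichotomously scored items. The analyses in this research were carried out in SPSS; this is only a minimal illustration. The function names and the response matrix are hypothetical, and the discrimination index is assumed here to be the corrected item-rest correlation.

```python
import math
from statistics import mean, pvariance
from typing import List


def item_difficulty(responses: List[List[int]]) -> List[float]:
    """Proportion of examinees answering each item correctly (p value)."""
    n_items = len(responses[0])
    return [mean([row[j] for row in responses]) for j in range(n_items)]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; undefined if either variable is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_discrimination(responses: List[List[int]]) -> List[float]:
    """Correlation of each item with the total score on the other items."""
    n_items = len(responses[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]
        result.append(pearson(item, rest))
    return result


def cronbach_alpha(responses: List[List[int]]) -> float:
    """Cronbach's alpha from item variances and total score variance."""
    k = len(responses[0])
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical scored responses: rows are examinees, columns are items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
difficulty = item_difficulty(responses)
discrimination = item_discrimination(responses)
alpha = cronbach_alpha(responses)
```

With real examination data an item answered correctly by every examinee has no variance, so its discrimination index cannot be computed; this is one practical consequence of the very easy items noted above.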

On a continuum from less to more complex, the development of IRT models has taken

place with the intent to address the restrictions posed by CTT. IRT models require larger data sets for

better fit and interpretation. It is clear that the choice between CTT and IRT depends on the aim

of research since IRT is better suited to data when being analyzed at item level. The most

effective application of IRT is with a large data set, since that improves the reliability of the scores.

Despite its advantage of item level statistics, the results so far do not prove the superiority of IRT

over CTT. These results are similar to results reported by Fan,152 Macdonald and Paunonen78 and

Courville.161 This research has shown that both CTT and IRT often yield similar results. There is

a growing body of literature that points strongly towards the fact put forward by Fan152 who

states “when scores developed by IRT can be correlated with those obtained by the more usual

approach to simply sum items scores, typically it is found that the two sets of scores correlate

higher; thus there is hardly any difference between the two approaches or any marked departure

from linearity of the measurement obtained from the two approaches.”

5.6 Recommendations

This research looked at the psychometrics of MCQs at a local medical school. It did not show

significant superiority of one method of measurement over the other and in such situations, both

CTT and IRT have their respective utility. Although CTT is easier to use due to its robustness, a

combination of both measurement methods may be applied at a local medical school to analyze

the psychometric properties of MCQs in high stakes summative exams. Hence, CTT may be used

to look at the reliability of the test scores and IRT may be applied to analyze the item parameters.

It is hoped that a combination of the two methods would be more practical than using only IRT

considering that IRT is less robust than CTT when a smaller data set is being analyzed.

Another aspect of this research was to analyze the stability of MCQs across time.

This research showed stability of items in the context of their parameters although these findings

were not entirely consistent across all the years; some variability was noted in both difficulty and

discrimination parameters. It is, thus, recommended that parameter drift should be analyzed so

that measures can be taken to curtail the observed drift. Since parameter drift has certain

undesirable consequences, schools should make sure that methods are available for assessing and

detecting this drift. One method might be recalibration of an item bank on a regular basis;

another would be to increase the size of the item bank. This is helpful when reusing the same items across a

number of administrations by ensuring that repeating the MCQs in subsequent administrations

does not affect their psychometric properties.
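
A simple screening step of the kind recommended above might look like the following sketch, which flags items whose difficulty estimates vary across administrations by more than a chosen amount. It is not a formal drift detection procedure. The function name, the item labels and the threshold of 0.5 are hypothetical, and the estimates are assumed to have already been placed on a common scale.

```python
from typing import Dict, List


def flag_drift(b_by_year: Dict[str, List[float]],
               threshold: float = 0.5) -> Dict[str, float]:
    """Return items whose difficulty range across years exceeds threshold.

    b_by_year maps an item label to its difficulty estimates, one per
    administration. The returned value for each flagged item is the
    difference between its largest and smallest estimate.
    """
    flagged = {}
    for item_id, estimates in b_by_year.items():
        spread = max(estimates) - min(estimates)
        if spread > threshold:
            flagged[item_id] = spread
    return flagged


# Hypothetical difficulty estimates over three administrations.
estimates = {
    "Q1": [0.10, 0.20, 0.15],
    "Q2": [-0.50, 0.30, 0.40],
}
flagged_items = flag_drift(estimates, threshold=0.5)
```

Items flagged in this way would then be candidates for review or recalibration before being reused.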

REFERENCES

1. Bernstein J. Evidence-Based Medicine. Journal of the American Academy of Orthopaedic

Surgeons. 2004;12(2):80-88.

2. Cooke M, Irby DM, Sullivan W, Ludmerer KM. American Medical Education 100 Years

after the Flexner Report. New England Journal of Medicine. 2006;355(13):1339-1344.

3. Boulet JR. Summative Assessment in Medicine: The Promise of Simulation for High

stakes Evaluation. Academic Emergency Medicine. 2008;15(11):1017-1024.

4. Dannefer EF. Beyond assessment of learning toward assessment for learning: Educating

tomorrow's physicians. Medical Teacher. 2013;35(7):560-563.

5. Driessen E, Scheele F. What is wrong with assessment in postgraduate training? Lessons

from clinical practice and educational research. Medical Teacher. 2013;35(7):569-574.

6. Hodges B. Assessment in the post-psychometric era: Learning to love the subjective and

collective. Medical Teacher. 2013;35(7):564-568.

7. Schuwirth L, Ash J. Assessing tomorrow's learners: In competency-based education only

a radically different holistic method of assessment will work. Six things we could forget.

Medical Teacher. 2013;35(7):555-559.

8. Norcini J, Anderson B, Bollela V, Burch V, Costa MJo, Duvivier R, et al. Criteria for

good assessment: consensus statement and recommendations from the Ottawa 2010

Conference. Medical Teacher. 2010;33(3):206-214.

9. Rudolph JW, Simon R, Raemer DB, Eppich WJ. Debriefing as formative assessment:

closing performance gaps in medical education. Academic Emergency Medicine.

2008;15(11):1010-1016.

10. Yorke M. Formative assessment in higher education: Moves towards theory and the

enhancement of pedagogic practice. Higher Education. 2003;45(4):477-501.

11. Wiliam D, Black P. Meanings and consequences: a basis for distinguishing formative and

summative functions of assessment? British Educational Research Journal.

1996;22(5):537-548.

12. Roberts TE. Assessment est mort, vive assessment 1. Medical Teacher. 2013;35(7):535-

13. Harlen W, James M. Assessment and learning: differences and relationships between

formative and summative assessment. Assessment in Education. 1997;4(3):365-379.

14. Miller GE. The assessment of clinical skills/competence/performance. Academic

Medicine. 1990;65(9):S63-67.

15. Van Der Vleuten CP, Schuwirth LW. Assessing professional competence: from methods

to programmes. Medical education. 2005;39(3):309-317.

16. Davis MH, Karunathilake I. The place of the oral examination in today's assessment

systems. Medical Teacher. 2005;27(4):294-297.

17. Palmer EJ, Devitt PG. Assessment of higher order cognitive skills in undergraduate

education: modified essay or multiple choice questions? Research paper. BMC Medical

Education. 2007;7(1):49-54.

18. Van Der Vleuten CP. The assessment of professional competence: developments,

research and practical implications. Advances in Health Sciences Education.

1996;1(1):41-67.

19. Case S, Swanson D. Extended matching items: a practical alternative to free response

questions. Teaching and Learning in Medicine. 1993;5:107-115.

20. Roberts C, Newble D, Jolly B, Reed M, Hampton K. Assuring the quality of high-stakes

undergraduate assessments of clinical competence. Medical Teacher. 2006;28(6):535-

21. Newble D. Techniques for measuring clinical competence: objective structured clinical

examinations. Medical education. 2004;38(2):199-203.

22. Ramani S. Twelve tips to improve bedside teaching. Medical Teacher. 2003;25(2):112-

23. Stillman P, Swanson D, Regan MB, Philbin MM, Nelson V, Ebert T, et al. Assessment of

Clinical Skills of Residents Utilizing Standardized Patients: A Follow-up Study and

Recommendations for Application. Annals of Internal Medicine. 1991;114(5):393-401.

24. Lockyer J. Multisource feedback in the assessment of physician competencies. Journal of

Continuing Education in the Health Professions. 2003;23(1):4-12.

25. Whitehouse A, Hassell A, Bullock A, Wood L, Wall D. 360 degree assessment

(multisource feedback) of UK trainee doctors: Field testing of team assessment of

behaviours (TAB). Medical Teacher. 2007;29(2-3):171-176.

26. Sandars J. The use of reflection in medical education: AMEE Guide No. 44. Medical

Teacher. 2009;31(8):685-695.

27. Moonen-van Loon J, Overeem K, Donkers H, van der Vleuten C, Driessen E. Composite

reliability of a workplace-based assessment toolbox for postgraduate medical education.

Advances in Health Sciences Education.18(5):1087-1102.

28. Norcini J, Burch V. Workplace-based assessment as an educational tool: AMEE Guide

No. 31. Medical Teacher. 2007;29(9-10):855-871.

29. Alagumalai S, Keeves J. Distractors - Can they be biased too? Journal of Outcome

Measurement. 1999;3:89-102.

30. Beullens J, Struyf E, Van Damme B. Do extended matching multiple-choice questions

measure clinical reasoning? Medical education. 2005;39(4):410-417.

31. Bhakta B, Tennant A, Horton M, Lawton G, Andrich D. Using item response theory to

explore the psychometric properties of extended matching questions examination in

undergraduate medical education. BMC Medical Education. 2005;5(1):5-9.

32. Campbell DE. How to write good multiple choice questions. Journal of paediatrics and

child health. 2013;47(6):322-325.

33. Schuwirth LW, Van Der Vleuten CP. Different written assessment methods: what can be

said about their strengths and weaknesses? Medical education. 2004;38(9):974-979.

34. Fowell SL, Bligh JG. Recent developments in assessing medical students. Postgraduate

medical journal. 1998;74(867):18-24.

35. Wass V, Van der Vleuten C, Shatzer J, Jones R. Assessment of clinical competence. The

Lancet. 2001;357(9260):945-949.

36. Norcini J, Swanson D, Grosso L, Webster G. Reliability, validity and efficiency of

multiple choice question and patient management problem item formats in assessment of

clinical competence. Medical education. 1985;19(3):238-247.

37. Lukhele R, Thissen D, Wainer H. On the Relative Value of Multiple-Choice, Constructed

Response, and Examinee-Selected Items on Two Achievement Tests. Journal of

Educational Measurement. 1994;31(3):234-250.

38. Mislevy RJ, Stocking ML. A Consumer's Guide to LOGIST and BILOG. Applied

Psychological Measurement. 1989;13(1):57-75.

39. Bock R, Aitkin M. Marginal maximum likelihood estimation of item parameters:

Application of an EM algorithm. Psychometrika. 1981;46(4):443-459.

40. Patz RJ, Junker BW. Applications and Extensions of MCMC in IRT: Multiple Item

Types, Missing Data, and Rated Responses. Journal of Educational and Behavioral

Statistics. 1999;24(4):342-366.

41. Drasgow F, Levine MV, Tsien S, Williams B, Mead AD. Fitting Polytomous Item

Response Theory Models to Multiple-Choice Tests. Applied Psychological Measurement.

1995;19(2):143-166.

42. Chang KY, Tsou MY, Chan KH, Chang SH, Tai J, Chen HH. Item analysis for the

written test of Taiwanese board certification examination in anaesthesiology using the

Rasch model. British journal of anaesthesia. 2010;104(6):717-722.

43. Huang Y-F, Tsou M-Y, Chen E-T, Chan K-H, Chang K-Y. Item response analysis on an

examination in anesthesiology for medical students in Taiwan: A comparison of one- and

two-parameter logistic models. Journal of the Chinese Medical

Association. 2010;76(6):344-349.

44. Birnbaum A. Some latent trait models and their use in inferring an examinee’s ability.

Statistical theories of mental test scores. 1968:397–479.

45. Norcini JJ, McKinley DW. Assessment methods in medical education. Teaching and

Teacher Education. 2007;23(3):239-250.

46. Harden RMG, Brown R, Biran L, Ross WD, Wakeford R. Multiple choice questions: to

guess or not to guess. Medical education. 2009;10(1):27-32.

47. Tarrant M, Knierim A, Hayes SK, Ware J. The frequency of item writing flaws in

multiple-choice questions used in high stakes nursing assessments. Nurse education in

practice. 2006;6(6):354-363.

48. Schuwirth LWT, Vleuten CPM, Donkers H. A closer look at cueing effects in multiple-

choice questions. Medical Education. 1996;30(1):44-49.

49. Brady A. Assessment of learning with multiple-choice questions. Nurse Education in

Practice. 2005;5(4):238-242.

50. Tarrant M, Knierim A, Hayes SK, Ware J. The frequency of item writing flaws in

multiple-choice questions used in high stakes nursing assessments. Nurse Education

Today. 2006;26(8):662-671.

51. McCoubrie P. Improving the fairness of multiple-choice questions: a literature review.

Medical Teacher. 2004;26(8):709-712.

52. Fox J. The multiple choice tutorial: its use in the reinforcement of fundamentals in

medical education. Med Educ. 1983;17:90-94.

53. Laura TF. Using feedback to reduce students' judgment bias on test questions. Journal of

Nursing Education. 2001;40(1):10-22.

54. Downing SM. The effects of violating standard item writing principles on tests and

students: the consequences of using flawed test items on achievement examinations in

medical education. Advances in Health Sciences Education. 2005;10(2):133-143.

55. Spearman C. The proof and measurement of association between two things. The

American Journal of Psychology. 1904;15(1):72-101.

56. Harvill LM. Standard Error of Measurement. Educational measurement: Issues and

practice. 1991;10(2):33-41.

57. Cronbach L. Coefficient alpha and the internal structure of tests. Psychometrika.

1951;16(3):297-334.

58. Traub RE, Rowley GL. Understanding reliability. Educational measurement: Issues and

practice. 1991;10(1):37-45.

59. DeVellis RF. Classical test theory. Medical care. 2006;44(11):S50.

60. Crocker L, Algina J. Introduction to classical and modern test theory. New York: Holt,

Rinehart, and Winston. 1986.

61. Lord FM. Applications of item response theory to practical testing problems: Lawrence

Erlbaum Associates New Jersey; 1980.

62. Dent J, Harden RM. A Practical Guide for Medical Teachers E-Book: Churchill

Livingstone; 2009.

63. Gay LR, Mills GE, Airasian PW. Educational research: Competencies for analysis and

applications. 2006.

64. Cox M, Irby DM, Epstein RM. Assessment in medical education. New England Journal

of Medicine. 2007;356(4):387-396.

65. Downing SM. Reliability: on the reproducibility of assessment data. Medical education.

2004;38(9):1006-1012.

66. Cortina JM. What is coefficient alpha? An examination of theory and applications.

Journal of applied psychology. 1993;78(1):98-104.

67. Sijtsma K. On the use, the misuse, and the very limited usefulness of Cronbach's

alpha. Psychometrika. 2009;74(1):107-120.

68. Tavakol M, Dennick R. Making sense of Cronbach's alpha. International journal of

medical education. 2011;2:53-55.

69. Gliem JA, Gliem RR. Calculating, interpreting, and reporting Cronbach's alpha

reliability coefficient for Likert-type scales. In: Midwest Research-to-Practice

Conference in Adult, Continuing, and Community Education; 2003.

70. Phinney JS. The multigroup ethnic identity measure a new scale for use with diverse

groups. Journal of adolescent research. 1992;7(2):156-176.

71. De Champlain AF. A primer on classical test theory and item response theory for

assessments in medical education. Medical education. 2010;44(1):109-117.

72. Sim S, Rasiah RI. Relationship between item difficulty and discrimination indices in

true/false-type multiple choice questions of a para-clinical multidisciplinary paper.

Annals-Academy of Medicine Singapore. 2006;35(2):67-72.

73. Ebel RL. Measuring educational achievement: Prentice-hall Englewood Cliffs, NJ; 1965.

74. DeVellis RF. Classical test theory. Medical care. 2006;44(11):S50-S59.

75. Wells CS, Wollack JA. An instructors guide to understanding test reliability. Testing &

Evaluation Services University of Wisconsin. 2003.

76. Hambleton RK. Emergence of Item Response Modeling in Instrument Development and

Data Analysis. Medical care. 2000;38(9):II60-II65.

77. Kolen MJ. Comparison of traditional and item response theory methods for equating

tests. Journal of Educational Measurement. 1981;18(1):1-11.

78. Macdonald P, Paunonen SV. A Monte Carlo comparison of item and person statistics

based on item response theory versus classical test theory. Educational and psychological

measurement. 2002;62(6):921-943.

79. Bechger TM, Maris G, Verstralen HH, Béguin AA. Using classical test theory in

combination with item response theory. Applied Psychological Measurement.

2003;27(5):319-334.

80. Traub RE. Classical test theory in historical perspective. Educational Measurement:

issues and practice. 2005;16(4):8-14.

81. Lord FM, Wingersky MS. Comparison of IRT True-Score and Equipercentile Observed-

Score "Equatings". Applied Psychological Measurement. 1984;8(4):453-461.

82. Oliveri ME, Olson BF, Ercikan K, Zumbo BD. Methodologies for Investigating Item-

and Test-Level Measurement Equivalence in International Large-Scale Assessments.

International Journal of Testing. 12(3):203-223.

83. McEldoon K, Cho S-J, Rittle-Johnson B, Society for Research on Educational E.

Measuring Intervention Effectiveness: The Benefits of an Item Response Theory

Approach: Society for Research on Educational Effectiveness.

84. Magno C. Demonstrating the Difference between Classical Test Theory and Item

Response Theory Using Derived Test Data: Online Submission; 2009.

85. Wainer H, Kiely GL. Item clusters and computerized adaptive testing: A case for testlets.

Journal of Educational Measurement. 1987;24(3):185-201.

86. Cooke DJ, Michie C. An item response theory analysis of the Hare Psychopathy

Checklist--Revised. Psychological assessment. 1997;9(1):3-10.

87. Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement

in the 21st century. Medical care. 2000;38(9 Suppl):II28-II42.

88. Fraley RC, Waller NG, Brennan KA. An item response theory analysis of self-report

measures of adult attachment. Journal of personality and social psychology.

2000;78(2):350-365.

89. Hulin CL, Drasgow F, Komocar J. Applications of item response theory to analysis of

attitude scale translations. Journal of Applied Psychology. 1982;67(6):818-825.

90. Saha TD, Chou SP, Grant BF. Toward an alcohol use disorder continuum using item

response theory: results from the National Epidemiologic Survey on Alcohol and Related

Conditions. Psychological medicine. 2006;36(7):931-942.

91. Bolt DM, Hare RD, Vitale JE, Newman JP. A Multigroup Item Response Theory

Analysis of the Psychopathy Checklist-Revised. Psychological assessment.

2004;16(2):155-168.

92. Justice LM, Bowles RP, Skibbe LE. Measuring preschool attainment of print-concept

knowledge: a study of typical and at-risk 3-to 5-year-old children using item response

theory. Language, Speech & Hearing Services in Schools. 2006;37(3):460-476.

93. Scherbaum CA, Cohen-Charash Y, Kern MJ. Measuring General Self-Efficacy A

Comparison of Three Measures Using Item Response Theory. Educational and

psychological measurement. 2006;66(6):1047-1063.

94. Downing SM. Item response theory: applications of modern test theory in medical

education. Medical Education. 2003;37(8):739-745.

95. Hambleton RK. Fundamentals of item response theory: Sage Publications, Incorporated;

96. Steinberg L, Thissen D. Uses of Item Response Theory and the Testlet Concept in the

Measurement of Psychopathology. Psychological Methods.

1996;1(1):81-97.

97. Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of item response theory:

Sage; 1991.

98. Reise SP, Ainsworth AT, Haviland MG. Item Response Theory: Fundamentals,

Applications, and Promise in Psychological Research. Current Directions in

Psychological Science. 2005;14(2):95-101.

99. van der Linden WJ, Hambleton RK. Handbook of modern item response theory:

Springer; 1997.

100. Hambleton RK, Van der Linden WJ. Advances in item response theory and applications:

An introduction. 1982.

101. Samejima F. Estimation of latent ability using a response pattern of graded scores.

Psychometrika Monograph Supplement. 1969;34(4, Pt. 2):100.

102. Hambleton RK, Cook LL. Latent Trait Models and Their Use in the Analysis of

Educational Test Data. Journal of Educational Measurement. 1977;14(2):75-96.

103. Wright B. Rasch measurement models. Advances in measurement in educational

research and assessment. 1999:85-97.

104. Mislevy RJ. Foundations of a new test theory. Test theory for a new generation of tests.

1993:19-39.

105. Linacre JM. A user's guide to WINSTEPS MINISTEP Rasch-model computer programs.

Chicago: Winsteps com. 2005.

106. Guyer R, Thompson N. User's manual for Xcalibre 4.1. In: St. Paul MN: Assessment

Systems Corporation.

107. Hambleton RK, Swaminathan H. Item response theory: Principles and applications:

Boston; 1985.

108. Edelen MO, Reeve BB. Applying item response theory (IRT) modeling to questionnaire

development, evaluation, and refinement. Quality of Life Research. 2007;16:5-18.

109. Van Alphen A, Halfens R, Hasman A, Imbos T. Likert or Rasch? Nothing is more

applicable than a good theory. Journal of Advanced Nursing. 1994;20:196 - 201.

110. Wainer H, Thissen D. How is reliability related to the quality of test scores? What is the

effect of local dependence on reliability? Educational measurement: Issues and practice.

1996;15(1):22-29.

111. Hambleton R, Rogers H, Swaminathan H. Fundamentals of item response theory: Sage

Publ.; 1995.

112. De Ayala RJ. Theory and practice of item response theory: Guilford Publications; 2009.

113. Hambleton R, Slater S. Item response theory models and testing practices: Current

international status and future directions. European Journal of Psychological Assessment.

1997;13:20-28.

114. Hambleton RK. Item response theory: a broad psychometric framework for measurement

advances 1, 2. Psicothema. 1994;6(3):535-556.

115. Harris D. Comparison of 1-, 2-, and 3-Parameter IRT Models. Educational measurement:

Issues and practice. 1989;8(1):35-41.

116. Lawson S. One parameter latent trait measurement: Do the results justify the effort.

Advances in educational research: Substantive findings, methodological developments.

1991;1:159-168.

117. Tavakol M, Dennick R. Psychometric evaluation of a knowledge based examination

using Rasch analysis: An illustrative guide: AMEE Guide No. 72. Medical Teacher.

(0):1-11.

118. Van Batenburg T, Laros J. Graphical analysis of test items. Educational Research and

Evaluation. 2002;8:319 - 333.

119. May K, Jackson TS. IRT Item Parameters and the Reliability and Validity of Pretest,

Posttest, and Gain Scores. International Journal of Testing. 2005;5(1):11-18.

120. Swanson DB, Holtzman KZ, Allbee K, Clauser BE. Psychometric Characteristics and

Response Times for Content-Parallel Extended-Matching and One-Best-Answer Items in

Relation to Number of Options. Academic Medicine. 2006;81(10):S52-S55.

121. Yang S-C, Tsou M-Y, Chen E-T, Chan K-H, Chang K-Y. Statistical item analysis of the

examination in anesthesiology for medical students using the Rasch model. Journal of the

Chinese Medical Association. 74(3):125-129.

122. Gonçalves F, Gamerman D, Soares T. Simultaneous multifactor DIF analysis and

detection in Item Response Theory. Computational Statistics & Data Analysis. 59:144-

123. Wang N. Use of the Rasch IRT Model in Standard Setting: An Item Mapping Method.

Journal of Educational Measurement. 2003;40(3):23-253.

124. De Champlain AF, Melnick D, Scoles P, Subhiyah R, Holtzman K, Swanson D, et al.

Assessing medical students' clinical sciences knowledge in France: a collaboration

between the NBME and a consortium of French medical schools. Academic Medicine.

2003;78(5):509-517.

125. Linn RL. Has Item Response Theory Increased the Validity of Achievement Test Scores?

Applied Measurement in Education. 1990;3(2):115-141.

126. Kreiter C, Ferguson K, Gruppen L. Evaluating the usefulness of computerized adaptive

testing for medical in-course assessment. Academic Medicine. 1999;74:1125 - 1128.

127. Thissen D, Orlando M. Item response theory for items scored in two categories. Test

scoring. 2001:73–140.

128. Andersen E, Madsen M. Estimating the parameters of the latent population distribution.

Psychometrika. 1977;42(3):357-374.

129. Williams VSL, Pommerich M, Thissen D. A comparison of developmental scales based

on Thurstone methods and item response theory. Journal of Educational Measurement.

1998;35(2):93-107.

130. Hambleton R. Principles and selected applications of item response theory. Educational

measurement. 1989;3:147-200.

131. Weiss DJ, Kingsbury G. Application of computerized adaptive testing to educational

problems. Journal of Educational Measurement. 1984;21(4):361-375.

132. Novick MR. The axioms and principal results of classical test theory. Journal of

Mathematical Psychology. 1966;3(1):1-18.

133. Lawson DM. Applying the Item Response Theory to classroom examinations. Journal of

manipulative and physiological therapeutics. 2006;29(5):393-397.

134. Linacre J, Wright B. A user’s guide to Winsteps Rasch-model computer program. 2001.

In: MESA Press Chicago, IL.

135. Bock RD, Muraki E, Pfeiffenberger W. Item pool maintenance in the presence of item

parameter drift. Journal of Educational Measurement. 1988;25(4):275-285.

136. Cook LL, Eignor DR, Taft HL. A comparative study of the effects of recency of

instruction on the stability of IRT and conventional item parameter estimates. Journal of

Educational Measurement. 1988;25(1):31-45.

137. Bergstrom B, Stahl J, Netzky B. Factors that influence item parameter drift. In: annual

meeting of the American Educational Research Association, Seattle, WA; 2001.

138. Wells CS, Subkoviak MJ, Serlin RC. The effect of item parameter drift on examinee

ability estimates. Applied Psychological Measurement. 2002;26(1):77-87.

139. Babcock B, Albano AD. Rasch scale stability in the presence of item parameter and trait

drift. Applied Psychological Measurement. 2012;36(7):565-580.

140. Donoghue JR, Isham SP. A comparison of procedures to detect item parameter drift.

Applied Psychological Measurement. 1998;22(1):33-51.

141. Kim W, Nering M. Evaluation of equating items using DFIT. In: Annual meeting of the

National Council on Measurement in Education, Chicago, IL; 2007.

142. Babcock B, Albano A, Raymond M. Nominal Weights Mean Equating A Method for

Very Small Samples. Educational and psychological measurement. 72(4):608-628.

143. Wollack JA, Cohen AS, Wells CS. A Method for Maintaining Scale Stability in the

Presence of Test Speededness. Journal of Educational Measurement. 2003;40(4):307-

144. Mandin H, Harasym P, Eagle C, Watanabe M. Developing a "clinical presentation"

curriculum at the University of Calgary. Academic Medicine. 1995;70(3):186-193.

145. Woloschuk W, Harasym P, Mandin H, Jones A. Use of schema based problem solving:

an evaluation of the implementation and utilization of schemes in a clinical presentation

curriculum. Medical education. 2000;34(6):437-442.

146. Breithaupt K, Ariel AA, Hare DR. Assembling an inventory of multistage adaptive

testing systems. In: Elements of adaptive testing: Springer. p. 247-266.

147. Gao F, Chen L. Bayesian or non-Bayesian: A comparison study of item parameter

estimation in the three-parameter logistic model. Applied Measurement in Education.

2005;18(4):351-380.

148. Gay LR, Airasian PW. Educational research: Competencies for analysis and application.

149. Weir JP. Quantifying test-retest reliability using the intraclass correlation coefficient and

the SEM. The Journal of Strength & Conditioning Research. 2005;19(1):231-240.

150. Hojat M, Xu G. A visitor's guide to effect sizes – statistical significance versus practical

(clinical) importance of research findings. Advances in Health Sciences Education.

2004;9(3):241-249.

151. Güler N, Uyanık GK, Teker GT. Comparison of classical test theory and item response

theory in terms of item parameters. European Journal of Research on

Education. 2013;2(1):1-6.

152. Fan X. Item response theory and classical test theory: An empirical comparison of their

item/person statistics. Educational and psychological measurement. 1998;58(3):357-381.

153. Cohen J. Statistical Power Analysis. Current Directions in Psychological Science.

1992;1(3):98-101.

154. Garet MS, Porter AC, Desimone L, Birman BF, Yoon KS. What makes professional

development effective? Results from a national sample of teachers. American

Educational Research Journal. 2001;38(4):915-945.

155. Hill HC, Rowan B, Ball DL. Effects of teachers mathematical knowledge for teaching on

student achievement. American educational research journal. 2005;42(2):371-406.

156. Hingorjo MR, Jaleel F. Analysis of one-best MCQs: the difficulty index, discrimination

index and distractor efficiency analysis. Journal of Pakistan Medical Association. 2012;

157. Baxi S, Parmar R, Parmar D, Tripathi C. Item Analysis of MCQ from Presently Available

MCQ Books. The Practising Doctor.

158. McGahee TW, Ball J. How to read and really use an item analysis. Nurse educator.

2009;34(4):166-171.

159. Carroll RG. Evaluation of vignette-type examination items for testing medical

physiology. The American journal of physiology. 1993;264(6 Pt 3):S11-15.

160. Hambleton RK, Slater SC. Item response theory models and testing practices: current

international status and future directions. European Journal of Psychological Assessment.

1997;13(1):21-28.

161. Courville TG. An empirical comparison of item response theory and classical test theory

item/person statistics: Texas A&M University; 2004.

162. Frisbie DA. Reliability of Scores From Teacher-Made Tests. Educational measurement:

Issues and practice. 1988;7(1):25-35.

163. Charter RA. Sample size requirements for precise estimates of reliability,

generalizability, and validity coefficients. Journal of Clinical and Experimental

Neuropsychology. 1999;21(4):559-566.

164. Quadrelli S, Davoudi M, Galandez F, Colt HG. Reliability of a 25-item low-stakes

multiple-choice assessment of bronchoscopic knowledge. CHEST Journal.

2009;135(2):315-321.

165. Baig LA, Violato C. Temporal stability of objective structured clinical exams: a

longitudinal study employing item response theory. BMC Medical

Education. 2012;12(1):121.

APPENDIX A: Course 3