UNIVERSITY OF CALGARY
Application of Classical Test Theory and Item Response Theory to Analyze
Multiple Choice Questions
by
Mona Nasir
A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES
IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE
DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF MEDICAL SCIENCE
CALGARY, ALBERTA
September, 2014
©Mona Nasir 2014
Abstract
Background
Multiple choice questions are used worldwide for summative assessment in undergraduate
medical education. Only a few studies have looked at their reliability using both classical test
theory and item response theory. The main aim of this research was to use examination data
from the summative multiple choice exams at the University of Calgary in order to assess the
reliability of scores using and comparing two methods of analysis, i.e., classical test theory and
item response theory, on items administered three times over a six year period. In addition, the
temporal stability of the same items was also analyzed using both classical test theory and item
response theory.
Methods
Three courses were chosen for the item analysis. Thirty items from each course over a period
of three years were scrutinized for reliability by conducting an item analysis using SPSS and
Xcalibre 4.2. Item difficulty and discrimination indices were calculated using both classical test
theory and the two-parameter logistic model of item response theory. Correlation coefficients were
calculated for all three years to analyze the relationship between the two measurement methods
and also the inter-year correlation for the three years using both classical test and item response
theory. Cronbach’s Alpha was calculated to look at the reliability of the scores. Furthermore,
item characteristic curves were generated using Xcalibre 4.2. Repeated measures analysis of
variance was conducted for the item parameters of both classical test theory and item response
theory. Test characteristic curves were generated for each year under the two-parameter logistic
model and compared across the years to assess the stability of the multiple choice items over time.
Results
Difficulty was found to be adequate for half the items when classical test theory was applied
and for two thirds of the items when item response theory was used. Discrimination was mostly
fair to adequate with classical test theory and excellent with item response theory. The standard
error of measure varied from small to large across the item parameters of different items, and the
reliability index of the test scores ranged from 0.56 to 0.65 across the years. Correlation
coefficients were excellent between Years 1 and 3 but only fair between Year 2 and each of the
other two years. Correlation coefficients between classical test theory and item response theory
were excellent. Repeated measures analysis of variance yielded small F ratios, indicating that
item difficulty and discrimination were stable over Times 1, 2 and 3. Visual inspection of the test
characteristic curves yielded the same findings.
Conclusion
Multiple choice questions used by the University of Calgary over a period of three years have
been shown to be fairly reliable and stable over time with different samples of students. Some
differences were noted in the item analysis carried out by the two different methods (i.e.,
classical test theory and item response theory) but mostly the two measurement methods were
comparable. Some items need reviewing and revision to further improve the reliability of the
exam following which the multiple choice items may be used repeatedly without affecting their
psychometric properties.
Acknowledgements
I’d like to start with thanking the Almighty; He has always carried me in the palm of His
hand. I am extremely grateful to Dr. Jocelyn Lockyer for her continued guidance and support.
She has inculcated in me the habit of thinking “why”. Thank you, Dr. Lockyer, for your help and
direction with this research.
Dr. Tyrone Donnon and Dr. Tanya Beran, thank you for leading me through the precipitous
road of statistics and encouraging me to delve further into this intriguing field. I am also grateful
to Dr. Claudio Violato for his direction, Dr. Bruce Wright for allowing access to the
Undergraduate Medical Education data and Mr. Alain Chan for his assistance with the data.
My deepest gratitude to my rock, my husband, Saghir. If it weren’t for your continued
encouragement and support, especially in my darkest moments, this research might not have seen
the light of day. I would also like to thank my remarkably resilient and adaptable
children, Alishba, Raza and Murtaza, for their extraordinary patience with my thesis writing.
Guys, thank you for being you!
Last but not least, my sincerest gratitude to my siblings for their continued support,
especially my brother Shabih whose selflessness knows no bounds and my sister Farzana who
pushes me to strive for the best!
Dedication
This dissertation is dedicated to the most cherished memories of my beloved parents,
Nasir Hussain and Zakira, the gems who honed my skills, loved me unconditionally and
continue to guide me in spirit.
Table of Contents
Abstract
Acknowledgements
Dedication
Table of Contents
List of Tables
List of Figures
List of Symbols and Abbreviations
Epigraph
CHAPTER 1: INTRODUCTION
1.1 Overview
1.1.1 Types of Assessment
1.1.2 Importance of Formative and Summative Assessments
1.1.3 Tools of Assessment
1.1.4 Multiple Choice Questions
1.2 Problem Statement
1.3 Significance of the Research
1.4 Purpose of Research
CHAPTER II – LITERATURE REVIEW
2.1 Multiple Choice Questions for Summative Assessments
2.2 Classical Test Theory
2.2.1 Assumptions of Classical Test Theory
2.2.2 Item Analysis with Classical Test Theory
2.2.2.1 Reliability of Tests in the Context of CTT
2.2.2.2 Item Difficulty
2.2.2.3 Item Discrimination
2.2.4 Limitations of Classical Test Theory
2.3 Shift from Classical Test Theory to Item Response Theory
2.4 Item Response Theory
2.4.1 Item Response Theory-Then and Now
2.4.2 Basic Concepts of IRT
2.4.3 Assumptions of IRT
2.4.4 Item Characteristic Curve, Item Difficulty and Item Discrimination
2.4.5 Test Characteristic Curve
2.4.6 IRT Models
2.4.7 Item Analysis with IRT
2.4.8 Applications of IRT
2.4.6.1 Ability and Item Parameter Estimation
2.4.6.2 Differential Item Functioning
2.4.6.3 Computerized Adaptive Testing
2.5 Comparing CTT and IRT
2.6 Temporal Stability and Parameter Drift
2.7 Research Questions
CHAPTER III – RESEARCH METHODS
3.1 Study Design
3.2 Setting and Context
3.3 Sample and Data Source
3.4 Data Analyses
3.4.1 Research Question No. 1: Reliability of scores with CTT and IRT
3.4.1.1 Research Question No.1 A: Item parameters with CTT
3.4.1.2 Research Question No.1 B: Item parameters with IRT
3.4.1.2.1 Two-Parameter Logistic Model of Item Response Theory
3.4.1.2.2 Item Analysis
3.4.1.2.3 Item Difficulty
3.4.1.2.4 Item Discrimination
3.4.1.3 Research Question No.1 C: Comparability of item parameters with CTT and IRT
3.4.1.4 Research Question No.1 D: Reliability index of test scores
3.4.1.5 Research Question No.1 E: Item characteristic curves
3.4.2 Research Question No. 2: Temporal stability of items
3.4.2.1 Research Question No 2A: Item stability with CTT
3.4.2.1.1 Repeated Measures ANOVA
3.4.2.1.2 Effect Sizes
3.4.2.2 Research Question No. 2B: Item stability with IRT
3.4.2.2.1 Test Characteristic Curve
3.5 Summary of Analyses
3.6 Ethics
CHAPTER IV-RESULTS
4.1 Overview
4.2 Descriptive Analysis
4.3 Results of Research Question No. 1: Reliability of scores CTT and IRT
4.3.1 Results of Research Question No. 1A: Item parameters with CTT
4.3.2 Results of Research Question No. 1B: Item parameters with IRT
4.3.3 Results of Research Question No.1 C: item analysis with CTT and IRT
4.3.4 Results of Research Question No.1 D: Reliability index of items
4.3.5 Results of Research Question No.1 E: Item characteristic curves
4.4 Results of Research Question No.2: Temporal stability of items
4.4.1 Results of Research Question No. 2 A: Item stability using CTT
4.4.1.1 Repeated Measures ANOVA CTT
4.4.1.2 Correlation Coefficients CTT
4.4.1.3 Scatter Plots CTT
4.4.2 Results of Research Question No. 2 B: TCC for Item stability using IRT
4.4.1.1 Repeated Measures ANOVA IRT
4.4.1.2 Correlation Coefficients IRT
4.4.1.3 Scatter Plots IRT
4.4.1.4 TCCs
CHAPTER V-DISCUSSION
5.1 Research Question No.1: Reliability of scores using with CTT and IRT
5.1.1 Research Question No. 1 A: Item parameters with CTT
5.1.2 Research Question No.1 B: Item parameters with IRT
5.1.3 Research Question No.1 C: Item analysis with CTT and IRT
5.1.4 Research Question No.1 D: Reliability index of test scores
5.1.5 Research Question No.1 E: Item characteristic curves
5.2 Research Question No. 2: Temporal stability of items
5.2.1 Research Question No. 2 A: Item stability using CTT
5.2.2 Research Question No. 2 B: Item stability using IRT
5.3 Implications and Future Directions for Research
5.4 Limitations of the Study
5.5 Conclusion
5.6 Recommendations
REFERENCES
APPENDIX A: COURSE 3
APPENDIX B: COURSE 6
APPENDIX C: ICCS OF COURSE 1
List of Tables
Table 1: Features of Classical Test and Item Response Theory
Table 2: Item Distribution for Individual Year and Course
Table 3: Methods Summary
Table 4: Distribution of MCQs According to Type of Skill
Table 5: Number of Examinees across Courses and Years
Table 6: Content of 30 Items Course 1 Classified by Clinical Presentation and Skills
Table 7: Content of 30 Items Course 3 Classified by Clinical Presentation and Skills
Table 8: Content of 30 Items Course 6 Classified by Clinical Presentation and Skills
Table 9: Descriptive Statistics of Item Parameters for Course 1
Table 10: Descriptive Statistics of Item Parameters for Course 3
Table 11: Descriptive Statistics of Item Parameters for Course 6
Table 12: Item Difficulty (p) and Point Biserial (p-bis) Correl of Course 1 Using CTT
Table 13: Difficulty (b) and Discrimination (a) Indices of Course 1 Using IRT
Table 14: Correl Coefficients of Difficulty Index Between CTT and IRT for Course 1
Table 15: Correl Coeff of p-bis and Discrim Index B/W CTT and IRT for Course 1
Table 16: SE and Reliability Index (Alpha w/o) Course 1 Year 1
Table 17: SE and Reliability Index (Alpha w/o) Course 1 Year 2
Table 18a: SE and Reliability Index (Alpha w/o) Course 1 Year 3
Table 18b: Cronbach’s Alpha for Course 1, 2 and 3 Using CTT and IRT
Table 19: Repeated Measures ANOVA to Determine the Effect of Time on the Item Difficulty Index for Course 1 Using CTT
Table 20: Repeated Measures ANOVA to Determine the Effect of Time on the Item Discrimination Index for Course 1 Using CTT
Table 21: Repeated Measures ANOVA to Determine the Effect of Time on the Item Difficulty Index for Course 3 Using CTT
Table 22: Repeated Measures ANOVA to Determine the Effect of Time on the Item Discrimination Index for Course 3 Using CTT
Table 23: Repeated Measures ANOVA to Determine the Effect of Time on the Item Difficulty Index for Course 6 Using CTT
Table 24: Repeated Measures ANOVA to Determine the Effect of Time on the Item Discrimination Index for Course 6 Using CTT
Table 25: Correlation Coefficient of Difficulty Index of Year 1, 2, 3 for CTT
Table 26: Correlation Coefficient of Discrimination Index of Year 1, 2, 3 for CTT
Table 27: Repeated Measures ANOVA to Determine the Effect of Time on the b Parameter for Course 1 Using IRT
Table 28: Repeated Measures ANOVA to Determine the Effect of Time on the a Parameter for Course 1 Using IRT
Table 29: Repeated Measures ANOVA to Determine the Effect of Time on the b Parameter for Course 3 Using IRT
Table 30: Repeated Measures ANOVA to Determine the Effect of Time on the a Parameter for Course 3 Using IRT
Table 31: Repeated Measures ANOVA to Determine the Effect of Time on the b Parameter for Course 6 Using IRT
Table 32: Repeated Measures ANOVA to Determine the Effect of Time on the a Parameter for Course 6 Using IRT
Table 33: Correlation Coefficient of Difficulty Index of Year 1, 2, 3 for IRT
Table 34: Correlation Coefficient of Discrimination Index of Year 1, 2, 3 for IRT
Table 35: App A1: Item Diff (p) and p-bis Correl of Course 3 Using CTT
Table 36: App A2: Diff (b) and Discrim (a) Indices of Course 3 Using IRT
Table 37: App A3: Correl Coeff of Difficulty Index b/w CTT and IRT for Course 3
Table 38: App A4: Correl Coeff of p-bis and Discrim b/w CTT and IRT for Course 3
Table 39: App A5: SE and Reliability Index (Alpha w/o) Course 3 Year 1
Table 40: App A6: SE and Reliability Index (Alpha w/o) Course 3 Year 2
Table 41: App A7: SE and Reliability Index (Alpha w/o) Course 3 Year 3
Table 42: App A9: Correl Coeff of Difficulty Index of CTT for Course 3 Year 1, 2, 3
Table 43: App A10: Correl Coeff of Difficulty Index of IRT for Course 3 Year 1, 2, 3
Table 44: App A11: Correl Coeff of Discrim Index of CTT for Course 3 Year 1, 2, 3
Table 45: App A12: Correl Coeff of Discrim Index of IRT for Course 3 Year 1, 2, 3
Table 46: App B1: Item Diff (p) and p-bis Correlation of Course 6 Using CTT
Table 47: App B2: Difficulty (b) and Discrim (a) Indices of Course 6 Using IRT
Table 48: App B3: Correl Coeff of Difficulty Index b/w CTT and IRT for Course 6
Table 49: App B4: Correl Coeff of p-bis and Discrim b/w CTT and IRT for Course 6
Table 50: App B5: SE and Reliability Index (Alpha w/o) Course 6 Year 1
Table 51: App B6: SE and Reliability Index (Alpha w/o) Course 6 Year 2
Table 52: App B7: SE and Reliability Index (Alpha w/o) Course 6 Year 3
Table 53: App B9: Correl Coeff of Difficulty Index of CTT for Course 6 Year 1, 2, 3
Table 54: App B10: Correl Coeff of Diff Index of IRT for Course 6 Year 1, 2, 3
Table 55: App B11: Correl Coeff of Discrim Index of CTT for Course 6 Year 1, 2, 3
Table 56: App B12: Correl Coeff of Discrim Index of IRT for Course 6 Year 1, 2, 3
List of Figures
Figure 1: b Parameter on Item Characteristic Curve
Figure 2: a Parameter on Item Characteristic Curve
Figure 3: c Parameter on Item Characteristic Curve
Figure 4: Test Characteristic Curve
Figure 5: Causes and Pathophysiology of Hypertension
Figure 6: ICCs for Course 1
Figure 7: Scatter Plot of Item Difficulty for Course 1 with CTT Year 1 and 2
Figure 8: Scatter Plot of Item Difficulty for Course 1 with CTT Year 2 and 3
Figure 9: Scatter Plot of Item Difficulty for Course 1 with CTT Year 3 and 1
Figure 10: Scatter Plot of p-bis for Course 1 with CTT Year 1 and 2
Figure 11: Scatter Plot of p-bis for Course 1 with CTT Year 2 and 3
Figure 12: Scatter Plot of p-bis for Course 1 with CTT Year 3 and 1
Figure 13: Item Difficulty for Course 1 with IRT Year 1 and 2
Figure 14: Item Difficulty for Course 1 with IRT Year 2 and 3
Figure 15: Item Difficulty for Course 1 with IRT Year 3 and 1
Figure 16: Item Discrimination for Course 1 with IRT Year 1 and 2
Figure 17: Item Discrimination for Course 1 with IRT Year 2 and 1
Figure 18: Item Discrimination for Course 1 with IRT Year 3 and 1
Figure 19: Test Characteristic Curve for Course 1, Year 1
Figure 20: Test Characteristic Curve for Course 1, Year 2
Figure 21: Test Characteristic Curve for Course 1, Year 3
Figure 22: App A8: Item Characteristic Curves for Course 3 for Year 1, 2, 3
Figure 23: App A13: Scatter Plots for Item Difficulty Using CTT for Course 3
Figure 24: App A14: Scatter Plots of Item Difficulty Using IRT for Course 3
Figure 25: App A15: Scatter Plots of Item Discrim (p-bis) Using CTT for Course 3
Figure 26: App A16: Scatter Plots of Item Discrim Using IRT for Course 3
Figure 27: App A17: Test Characteristic Curves for Course 3
Figure 28: App B8: Item Characteristic Curves for Course 6 for Year 1, 2, 3
Figure 29: App B13: Scatter Plots for Item Difficulty Using CTT for Course 6
Figure 30: App B14: Scatter Plots of Item Difficulty Using IRT for Course 6
Figure 31: App B15: Scatter Plots of Item Discrim (p-bis) Using CTT for Course 6
Figure 32: App B16: Scatter Plots of Item Discrimination Using IRT for Course 6
Figure 33: App B17: Test Characteristic Curves for Course 6
List of Abbreviations
a Item Discrimination Index in Item Response Theory
b Item Difficulty Index in Item Response Theory
c Guessing Parameter in Item Response Theory
CAT Computerized Adaptive Testing
CHREB Conjoint Health Research Ethics Board
CTT Classical Test Theory
CVS Cardiovascular
D Item Discrimination in Classical Test Theory
DIF Differential Item Functioning
GIT Gastroenterology
ICC Item Characteristic Curve
IRT Item Response Theory
MCQs Multiple Choice Questions
ANOVA Analysis of Variance
OSCE Objective Structured Clinical Exams
p Item Difficulty in Classical Test Theory
p-bis Point Biserial Correlation
1, 2, 3 PL Model One, Two, Three Parameter Logistic Model
R Correlation Coefficient
SBA Single Best Answer
SEM Standard Error of Measure
TCC Test Characteristic Curve
UGME Undergraduate Medical Education
Epigraph
Knowledge
“Its head is humility, its eye freedom from envy, its ear understanding, its tongue the
truth, its memory research, its heart good intention”.
Ali Ibne Abi Talib (596-661 AD)
CHAPTER 1- INTRODUCTION
1.1 Overview
For any medical training programme to achieve its learning outcomes, it should be designed so that
the graduates acquire the knowledge, behaviour and skills necessary to practice evidence-based
medicine.1, 2 Assessment is an important link in the curricular process and drives learning—by way of its
content, timing, format and subsequent feedback.2 It helps evaluate competencies and identify curricular
deficiencies.3 Furthermore, the effectiveness of instructional skills can be established by the type of
assessment used to assess students’ level of understanding. Recent times have seen the implementation of
numerous changes to assessment of medical undergraduates and graduate students.4-7 In addition to issues
of reliability and validity, elements like educational effect and catalytic effects of assessment have been
highlighted.8 Furthermore, the choice of tools of assessment has been under scrutiny and the utility of one
over the other has been the objective of recent research.7 Multiple choice questions (MCQs) are
commonly used at both the undergraduate and graduate levels of medical education, and concerns
about their stability and security are frequently raised and need to be addressed. This research
was carried out to explore the reliability and stability of MCQs over time.
1.1.1 Types of Assessment
Assessment can be either formative or summative in nature. Formative assessment is defined
as the process of providing individually tailored doses of feedback to students on their
performance in a concrete, effective way.9 It is carried out during the various phases of a
program. Formative assessment can be informal or formal.10 When informal, it can take place in
the course of events during learning and is not necessarily stipulated explicitly within the
curriculum. Formal types of formative assessment, on the contrary, are part of pre-designed
curricular objectives and are provided by the academic staff or the supervisor of the placement
activity within a collaborating organisation at pre-defined intervals.10
Summative assessment, unlike the formative type, comprises a process of assessment of
students after units, mid-terms and courses.11 It is geared more towards the final outcome.
Summative assessments are high stakes and require more effort for the development of the exam
and its quality control. Whereas formative assessment is assessment for learning, summative
assessment is directed more towards assessment of learning.4
1.1.2 Importance of Formative and Summative Assessments
Current research in assessment has highlighted the vital role of formative assessment in
providing self-motivation and future direction in learning.12 Moreover, aptly conducted
formative assessments aid the learner in setting more advanced goals by providing continued
guidance.10 Formative assessment is important because it lets the instructors know how the
students are progressing and where they need more attention. This helps in making important
adjustments in instructions or arranging more opportunities for learning by practice. These
activities then lead to an improvement in a student’s success. Furthermore, students are able to
identify any gaps that exist between their desired goals and their present knowledge and
competencies. They can then carry out actions necessary to reach their goals.
Summative assessments are vital for reporting on achievements at certain intervals. As stated
earlier, they are high stakes since they are used for certification purposes, both for graduation
and for higher training.3 Their choice is also influenced by the stakeholders’ demands which
include the public in addition to accreditation and licensing bodies.13 Summative assessments
utilize a number of tools for gathering information about what has been learned by the students.
They are valuable because they provide critical information about the overall learning of the
students as well as an indication of the quality of instruction. They can be carried out in a
number of ways which include end of unit tests or projects, course grades and portfolios. At the
student level, these tools reflect the level of their performance and overall expectations for a
particular course. At the program level, they provide information about the objectives of the
program being achieved by the students. It is useful to create summative assessments prior to
instruction as it helps in identifying the content and process of learning leading to desired
outcomes. Summative assessment can, thus, serve as a guide for giving directions for the
curriculum and instruction.
Recent trends have seen a shift towards competence-based assessments which require
frequent testing of students. Furthermore, the onus is now being placed on continuous formative
assessment rather than end of academic year summative assessments.4 Schuwirth and Ash have
also recommended combining the formative and summative functions to inform and guide
student learning.7 Since the item banks are used repeatedly, there is a concern that the
psychometric properties of items may be affected. This needs exploring because the
repetition of items may influence their stability over time. Neither scoring method is
immune to such influences, so both need to be examined for their usefulness in
measuring the effectiveness of the MCQ items.
1.1.3 Tools of Assessment
A number of tools are available for the assessment of different aspects of clinical
competence. According to Miller’s Pyramid of Clinical Competence,14 assessments should be
designed keeping in mind the domains of knows, knows how, shows how and does.15
Structured oral exams are commonly used for assessing the knowledge and understanding of
concepts which form the bases for the knows and knows how tiers of Miller’s Pyramid.16 A
clinical scenario is presented and the candidate is then asked to elaborate on principles of
differential diagnosis, investigations and management. He/she may also be asked to comment on
certain tests or findings. Long and modified essay questions are also used for knowledge testing.
They are written pieces which can be several paragraphs to pages long. They are used to broadly
measure the amount of knowledge retained by the candidates and their ability to use that
knowledge to reason through clinical problems.17 Multiple choice questions are used to assess
the knows and knows how domains of Miller’s Pyramid. Single best answer, multiple best answer,
true/false and extended matching are the types of MCQs used for assessing students’
knowledge, comprehension and application ability. MCQs have been criticized for being poorly
linked to professional reality and for testing only trivial knowledge.18 One objection is
that students are required only to recognize the correct answer from a list of options or to
eliminate the incorrect ones. Thus, a student’s free-writing ability, as demonstrated in an
essay, cannot be assessed. MCQs are now mostly constructed in the form of clinical
vignettes so that they can assess deeper knowledge along with comprehension and
application of that knowledge. The single best answer type of MCQ consists of a
statement followed by a set of answers. The examinee has to select the single most appropriate
answer for the main statement. This process comprises recall of the knowledge, comprehension
of the problem and application of the knowledge to that problem. MCQs are hence able to test
factual recall along with an assessment of the approach of an examinee to a clinically oriented
scenario. It is possible to structure MCQs in such a way that they can test higher order skills and
levels of cognition such as analysis and synthesis. This is especially true for the single best
answer type of MCQs. Case and Swanson19 have shown that well-constructed MCQs can assess
taxonomically higher cognitive processes in addition to just assessing factual knowledge.
Assessment tools available for the shows how domain of Miller’s Pyramid of Clinical
Competence mainly include objective structured clinical examination (OSCE), simulation and
bedside examinations in the form of long and short cases.20-22 OSCE and simulation are more
widely accepted and popular due to better reliability and use of standardized patients.23 Various
modalities are at the assessors’ disposal for evaluating the does domain of Miller’s Pyramid. Of
note amongst these are mini clinical evaluation exercises (Mini-CEX), direct observation of
procedural skills (DOPS), checklists and rating scales, 360-degree multisource feedback (MSF),
portfolios and log books.5 The application of Mini-CEXs and DOPS is widespread as learners
are given feedback on workplace-based performance promptly which in turn helps formulate
remedial measures quickly and accurately if warranted. Feedback is also collected from
colleagues in the form of 360 MSF24, 25 while portfolios and log books allow for personal
reflection to develop and improve professional practice.26 Recent research has recommended
that a variety of assessment modalities should be employed to reach a reliable summative
decision using a feasible number of workplace-based assessments.27, 28
1.1.4 Multiple Choice Questions
Multiple choice questions are widely used for both formative and summative assessments in
undergraduate and graduate medical education.17, 29-33 They are particularly useful in summative
exams because of their ability to assess a large amount of knowledge in a relatively short time34
and contextualization with a clinical vignette and scenario.17 Computerized marking of large sets
of questions also tends to make them widely acceptable. Although MCQs with desirable
reliability are difficult to construct, once constructed, they may be repeated over time without
affecting their reliability. Wass et al have reported a reliability of >0.9 for a four-hour long test
which is above the desirable level.35 Norcini et al reported a slightly lower coefficient of 0.88 for
shorter tests that were 90 items long.36 Well-constructed MCQs are useful for summative
assessments because taxonomically higher-order cognitive processes of interpretation, synthesis
and application are assessed adequately besides recall of isolated facts.
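Reliability coefficients of the kind quoted above are commonly internal-consistency estimates. As an illustration only, the following minimal sketch computes Cronbach's alpha, the reliability index used later in this thesis, from a matrix of dichotomously scored items. The function name and the response data are hypothetical, population variances are assumed throughout, and the sketch is not the SPSS procedure used in the study.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of examinee rows of item scores (e.g. 0/1)."""
    k = len(scores[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across examinees.
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    # Variance of the examinees' total scores.
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: four examinees, three items (1 = correct).
consistent = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(cronbach_alpha(consistent))
```

When every item ranks the examinees identically, as in this toy matrix, alpha reaches its maximum; items that are unrelated to one another pull it towards zero.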
1.2 Problem Statement
The evaluation of the MD certifying exams at the University of Calgary is important not only
to the trainees but also to the faculty and the administration. This research is expected to help
elucidate the interplay of the three main components of the educational process--curriculum,
teaching and evaluation. An insight into the changes warranted in the use of MCQs can help
improve the efficacy of the program in turn. The aim of this research was to
assess the reliability of scores using and comparing two methods of analysis, i.e., classical test
theory (CTT) and item response theory (IRT), on MCQ items administered three times over a six
year period. IRT is a body of theory that describes the application of mathematical models to
data from questionnaires and tests as a basis for measuring abilities, attitudes and other
variables.31 It may be used for the development of assessments and their statistical analysis by
studying the stability of difficulty and discrimination indices of items over time. IRT has been
applied to item level statistics for MCQs31, 37 and further research will help explore the reliability
and stability especially in the setting of high stakes exams. Item response theory offers the
promise of solving many problems that are faced by psychometricians in medical education. The
major problems that have hindered the widespread use of IRT in the past have now been
overcome to a great extent. With the advent of more sophisticated computer software, it is now
emerging as a favoured method of measurement.38, 39 Establishing whether the use of MCQs is
the right choice for assessing a particular facet of knowledge would assist in providing activities
which facilitate linking theory with practice, exercising the skills of thinking in a practical
context and gaining personal insight including career preferences. It would, furthermore,
facilitate effective delivery of curriculum by rendering it relevant and applicable to the practice
of medicine.
1.3 Significance of the Research
This study assessed the reliability of scores using and comparing two methods of analysis,
i.e., CTT and IRT. CTT forms a vital part of the basis of measurement theory. The underlying
assumption in CTT is that the test score is made up of two components, true score and error
score. This assumption allows for the statistical analyses to be carried out in the form of test and
item analysis. IRT uses 1, 2 or 3 parameter models for item analysis. Two parameter logistic
models have been applied to MCQs in psychology40, 41 and medical education.31, 42, 43 The two
parameter model estimates student performance on a test with differences in item difficulty and
discrimination. Hence, this model includes more information about the items than CTT. IRT is
deemed to be a superior measurement theory in comparison with CTT.44 This is because its
item-level statistics are sample independent. In this research, the 2PL model
was used on the premise that the examinees and the examiners over a
period of three years belong to groups with similar characteristics. This study will highlight the
similarities and differences between CTT and IRT in the context of item analysis with directions
and suggestions for changes at both the individual and program levels. At the individual level, it
may help program directors not only to evaluate whether or not the students have met the
standards, but also how fast they are approaching the standards. At the program level, it may
provide data that will help to evaluate the effectiveness of each program. Most importantly, this
will help these schools to view their program as an integrated system so that the knowledge
training and skill training can be balanced, and the link between training at different levels can
be reinforced. It is hoped that this research may also provide more robust evidence of the
psychometrics of MCQs, thus identifying areas of improvement in both the formative and
summative exams.
1.4 Purpose of Research
The main purpose of this research was to use University of Calgary summative examination
data from MCQ exams that were held for three courses over a three-year period. This research
addressed the following questions:
Research Question No. 1
What was the reliability of scores using and comparing two methods of analysis, i.e., CTT and
IRT, on MCQ items administered three times over a six year period?
Research Question No. 2
Do the items exhibit temporal stability when repeated over Year 1, 2 and 3?
The remainder of this thesis is organized as follows. Chapter 2 comprises a literature review related to
the use of MCQs as an assessment tool and research in the context of CTT and 2PL IRT, Chapter
3 describes the methods used for the research including the data collection techniques and
analyses, and Chapter 4 provides the results. Chapter 5 concludes this research. It
summarizes the research findings, situates the findings within the broader literature, describes the
limitations of the study, identifies future research directions and states some recommendations
for future application.
CHAPTER II – LITERATURE REVIEW
This chapter encompasses the following six sections: 1) a discussion on MCQs as a tool for
summative assessment in medical education, 2) CTT and its assumptions, features and concepts,
3) IRT and its assumptions, features and concepts, 4) comparison of CTT and IRT, 5) temporal
stability of MCQs, and 6) research questions.
2.1 Multiple Choice Questions for Summative Assessments
Multiple choice questions were first utilized in the field of medicine in the 1950s 45 and since
then have been used increasingly. They are used for both formative and summative assessments
to test the acquisition of knowledge and understanding across the curriculum. Many types of
MCQs are described in literature. For the purpose of this research, the A type were taken into
consideration. The A type of MCQs are characterised by an opening statement followed by a
lead-in.46 There are usually four to five options from which the correct one is chosen. In
addition to this type, other MCQ formats used in medicine are the true-false and the extended
matching types.1
MCQs have undergone continuous scrutiny since their inception. The major concern remains
the scope of what they can be used to assess. If well constructed, they can be used to assess the
first three to four levels of cognitive domain of Miller’s Pyramid47 and also discriminate between
more and less able students. Research shows that testing of knowledge is the most accurate
method of evaluating expertise. It is, hence, understandable that a lot of time and attention is put
1 http://www.nbme.org/publications/item-writing-manual-download.html
into constructing psychometrically sound MCQs that are deemed capable of doing that. In the
past, MCQ items have been blamed for testing only recall memory.48 Indeed, many consider
them to be poorly suited for testing students in high-stakes exams requiring problem-solving and
a self-directed approach.49 MCQs can competently evaluate knowledge, comprehension,
application and analysis levels by putting up questions that require the student to recognize
problems or discrepancies and infer their causes and devise solutions. Such MCQs are capable of
challenging the analytical skills of students.50 Multiple-choice tests are very well suited to
sampling many diverse test items. They can be administered to a large group of students since
they are easy to mark by computerized optical scanners. They can, in addition, be used for
testing a wide variety of course material.51 MCQs provide objective evaluation of performance as
they have the capacity to overcome the subjectivity that may exist in the assessment by essays
and oral examinations.52 They can motivate students positively and can assist students in
monitoring and affirming their own learning.53
There is little doubt that if the MCQ items are flawed, student scores are affected. This has
been reported by Downing who found that the test scores improved by 10-25% on removing
items that were noted to be technically flawed after an item analysis.54 MCQs assessing the lower
levels of cognition are found to be more flawed than the ones made for the assessment of higher
cognition. It may be because of the fact that those made to evaluate higher levels of cognition are
constructed in a longer period of time with more attention than the simpler ones. It is, therefore,
important to analyze the item indices, their reliability and stability in high stakes summative
exams.
This research used both CTT and IRT for item analyses to highlight the similarities and
differences between them. In addition, reliability of the test items and their stability across the years
was also taken into account. CTT and IRT are widely understood to be two extremely different
frameworks despite the fact that ample literature exists that examines the similarities and
differences in the estimation of item parameters using both the frameworks. A discussion on the
two methods of measurement, i.e., CTT and IRT, follows.
2.2 Classical Test Theory
CTT was founded by Charles Spearman55 in 1904 and it comprises three components. They
are the observed score, true score and random error. Mathematically, it is depicted as:
X=T+E
where X represents the observed score of a student on any test, T is the expected value of the
observed score received on several such tests of equal difficulty when run an infinite number of
times and E is the difference between X and T and is related to the standard error.
An important concept in CTT is that of standard error of measurement (SEM). It is the
standard deviation of errors of measurement that are associated with test scores from a particular
group of examinees.56 It can also be thought of as the determination of the amount of variation or
spread in the measurement errors for a test. From the equation stated above, i.e., X=T + E, it is
known that a person’s true score equals the average of his or her observed scores, hence
accounting for measurement error associated with a test. Because it is not possible to know the
measurement error, all standardized tests have an associated SEM. The SEM is expressed in
standard deviation units. It is inversely related to the reliability of a test. Hence, the smaller the
SEM, the higher the reliability and more precise the scores obtained. The error in CTT is always
assumed to be random and non-systematic. It can be attributed to several factors external or
internal to the examinee. Examples of external errors include ones attributable to test items that
might have been created poorly or those associated with inadequate testing conditions. Internal
errors can result from conditions internal to the examinee; lack of concentration, fatigue and
stress, for example, may contribute to the random error in CTT.
Another concept associated with SEM is that of the confidence interval, a range of values that
is expected to contain a population parameter with a stated probability.56 Any confidence level
may be chosen, but the most common ones are 95% and 99%. In the context of test scores, the
confidence interval is built around the observed score using the SEM and gives an upper and a
lower bound within which the true score is expected to lie. When a 95% confidence interval is
used, it refers to a range constructed so that it contains the true score about 95% of the time.
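The relationship between SEM, reliability and the confidence interval can be illustrated with a short sketch. The formula SEM = SD × √(1 − reliability) and the multiplier 1.96 for a 95% interval are the standard textbook expressions rather than values taken from this study, and the standard deviation, reliability and observed score used below are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Standard textbook formula: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)

def score_interval(observed, sem, z=1.96):
    """Lower and upper bound of observed +/- z * SEM (z = 1.96 for about 95%)."""
    return observed - z * sem, observed + z * sem

# Hypothetical test: score SD of 10 points and reliability of 0.84.
sem = standard_error_of_measurement(sd=10.0, reliability=0.84)
low, high = score_interval(observed=70.0, sem=sem)
print(f"SEM = {sem:.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

With a higher reliability the SEM shrinks and the interval around the observed score narrows accordingly.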
Classical test theory deals with both item and test level statistics.55 At the item level, it deals
with item difficulty and discrimination. The item difficulty index is depicted by p and it indicates
the proportion of the students who have answered the item correctly. The item discrimination
index is indicated by D and it informs the extent to which the item differentiates between the
high-ability and the low-ability students. At the test level, CTT deals with the reliability of a test
that is parallel.57, 58 Two tests are said to be parallel if they measure the same latent ability for
which the examinees have the same true score and errors across the tests. Parallel tests require
the generation of a large set of items that represent a single content domain. It is recommended
that at minimum, the number of items in this set should be twice the planned size of a single test
form.59 In other words, it should be large enough to establish that the content domain is well
represented.
2.2.1 Assumptions of Classical Test Theory
Some fundamental assumptions have to be made for the estimation of the true score of an
examinee using CTT since both the true and the error scores are unknown. Classical test theory
assumes that the observed score consists of a true score and a random error arising from the
measurement instrument.59 In addition, variability of the test score and examinee conditions also
contribute to these errors. If the same examinee takes the same exam an infinite number of times
(without the effects of any learning taking place), the errors will average out to zero, and the mean
observed score will be equal to the true score. The following four assumptions are implicit within CTT:
1. The observed score of a person is comprised of the true score and random error
2. The expected value of any observed score is the person’s true score
3. The covariance of error components from two tests is zero in the population (i.e., errors
from two tests are uncorrelated)
4. Errors in one test are uncorrelated with true scores in another (i.e., measurement errors
are not dependent on traits)
It is important to note that the focus in CTT is on the test score rather than the item score as it
relates the test score to true score rather than the item score to true score.
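The first two assumptions can be made concrete with a small simulation. This is only a sketch: the true score, the error standard deviation and the choice of normally distributed errors are hypothetical and are not drawn from the examination data.

```python
import random
import statistics

def simulate_observed_scores(true_score, error_sd, n_administrations, seed=0):
    """Generate observed scores X = T + E with random, zero-mean error E."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_administrations)]

# Hypothetical examinee with a true score of 75 and an error SD of 5 points.
scores = simulate_observed_scores(true_score=75.0, error_sd=5.0, n_administrations=20000)
mean_observed = statistics.fmean(scores)   # approaches the true score T
spread = statistics.stdev(scores)          # approaches the error SD, i.e., the SEM
print(f"mean observed score = {mean_observed:.2f}, SD of observed scores = {spread:.2f}")
```

Any single observed score may be several points away from 75, but the mean over many administrations settles close to the true score, which is what the second assumption states.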
2.2.2 Item Analysis with Classical Test Theory
It is vital that there is a match between what is taught and what is assessed. There should be a
variety of items in any exam testing both the basic and advanced knowledge. If the items are too
difficult, they lead to examinee frustration due to low scores. If they are too easy, inflation in
scores leads to a false sense of confidence and a decline in examinee motivation.46 Item
improvement is also important as it leads to the development of an item bank that can be reused
over time. For this purpose, item analysis is carried out. Item analysis may be defined as a
method used to evaluate test items, typically for the purpose of test construction and revision.60 It
is a technique available for the improvement of items used in assessments.
The advantage of item analysis is that it helps identify biased or unfair items.42 Another
advantage of item analysis is that it can identify poorly worded and miskeyed questions. Results
of item analysis, once it has been carried out, are then used to refine the item of interest. Items
that are found to be more difficult identify a concept that needs revising. If a distracter is found
to be the most chosen answer, then the item must be re-examined for its correctness. Item
analysis also helps improve the quality of items by observing the reliability of test scores and
although some literature on measurement discusses reliability as somewhat distinct from item
analysis, item characteristics play a vital role in reliability estimation by both CTT and IRT.61
Item difficulty and discrimination are the two components of item analysis which are helpful in
establishing the reliability of test scores. These components are discussed later.
2.2.2.1 Reliability of Test Scores in the Context of CTT
Norcini et al 8 have described seven components or criteria of a good assessment tool. They
are (1) validity or coherence, (2) reproducibility or consistency, (3) equivalence, (4) feasibility,
(5) educational effect, (6) catalytic effect, and (7) acceptability. Reproducibility or consistency is
the extent to which students’ scores in context of time, sampling and factors related to test
administration are reproducible and consistent from one assessment to the next and from one
item to another.62 It is expressed numerically as a coefficient called the reliability coefficient.63
Any value around 0.8 and above is deemed good to excellent in the context of MCQs.64
Reliability estimates the amount of random measurement error in assessments and is
differentiated into several types.63, 65 Test-retest reliability measures the stability of scores over
time. Equivalent-form reliability is the degree to which two similar tests administered at the same
time or shortly thereafter produce similar scores from a single group of test takers. Internal
consistency reliability is the extent to which items in a single test are consistent amongst
themselves and with the test as a whole. It can be split-half reliability (which is appropriate for
very long or difficult-to-administer tests), Kuder-Richardson reliability (or KR-20 which can
only be used on dichotomously-scored items like in the selected-response tests) and Cronbach’s
alpha.57 Rater reliability investigates the error attributable to individuals who score the test. It
can be inter-rater, which concerns the consistency of two or more independent scorers scoring the
same participant in the same context, or intra-rater, which concerns the consistency of the
scoring of one rater for the same participant in the same context at two different points in time.
The concept of alpha was developed by Lee Cronbach in 1951.57 It is commonly used in the
fields of medical education and psychology and provides a measure of the internal consistency of
a test or scale.65-70 It is expressed as a number between 0 and 1 and is useful as it elaborates on
the extent to which all the items in a given test are utilized to measure a similar construct or
concept. If the items in a test are found to be highly correlated with each other, the alpha
coefficient increases. It must be kept in mind that correlation is not the only factor affecting the
reliability or the alpha of a test. Test length is another factor that influences Cronbach’s
alpha. Thus, a low value of alpha may be attributable to poor inter-item correlation or a short
test. It is recommended that items with poor correlation should then be either
discarded or revised. A high value of alpha, on the other hand, may indicate redundant use of
items for a variable in which case again, revision of items is desirable.
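As an illustration, alpha can be computed from a matrix of item scores with the usual formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The sketch below is not the SPSS procedure used in this research; the function name and the small response matrix are hypothetical.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for rows of item scores (one row per examinee)."""
    k = len(responses[0])                      # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical dichotomous data: 4 examinees by 3 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

For dichotomously scored items such as these, the same formula yields the KR-20 coefficient.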
2.2.2.2 Item Difficulty
Another concept in item analysis using CTT is that of item difficulty. It refers to the
proportion of examinees who answer an item correctly.59 The item difficulty index is expressed by the
letter “p”. Hence, if an item on a test is answered correctly by 78% of the examinees, the
difficulty index for that item is p = .78. An item is categorized as ‘easy’ if a higher percentage of
people answer it correctly. For example, if another item is answered correctly by only 45% of the
class, this item is said to be more difficult than the previous one where 78% of the examinees got
it right. In other words, the higher the percentage of people who answer an item correctly, the
easier is the item.
There are several factors that have to be considered while establishing appropriate levels of
difficulty.60 The first factor that influences the item difficulty is the probability of answering an
item by chance or guessing. In a true-false type of item, there is always a fifty percent chance to
get the answer right as there are only two choices. Such an item is therefore a poor choice for a
test, as a difficulty index of p = .50 can be reached by guessing alone. Examinees are able to
answer such items correctly by guessing only and hence, such an item does not reflect the actual
level of knowledge or ability of the student. In the same way, a MCQ that has five options may
be answered correctly by guessing at least 20% of the time. Thus, a difficulty index more than
.20 would be needed for that item to be able to differentiate between students who might be
guessing and those who have a higher degree of ability. A difficulty index between .25 and .75 is
desirable for the item to be able to identify students who have various levels of ability.71
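The difficulty index is simply the proportion of correct responses per item, as the following sketch shows. The response matrix and the helper names are hypothetical; the .25 to .75 band is the desirable range cited above.

```python
def item_difficulty(responses):
    """Difficulty index p for each item: proportion of examinees answering correctly."""
    n = len(responses)          # number of examinees
    k = len(responses[0])       # number of items
    return [sum(row[i] for row in responses) / n for i in range(k)]

def within_desirable_range(p, low=0.25, high=0.75):
    """True if p falls inside the .25 to .75 band described in the text."""
    return low <= p <= high

# Hypothetical data: 4 examinees by 3 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
p_values = item_difficulty(responses)
flags = [within_desirable_range(p) for p in p_values]
print(p_values, flags)
```

A higher p marks an easier item, so the first item in this example is the easiest and the third the most difficult.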
2.2.2.3 Item Discrimination
Item discrimination is another important element of item analysis. It is expressed as “D”.60 It
determines whether those who did well on the test also did well on a particular item. It is, hence,
able to divide students into low scoring and high scoring groups. It is anticipated that those
students who do well on the test also score highly on a particular item. If an item is answered
correctly by a larger proportion of the lower scoring group than of the higher scoring one, it is said to
have negative discrimination. Such an item should either be revised or discarded. Once the two
groups, i.e., low and high performing, have been formed, an item’s discrimination can be
determined.72 It can be calculated as:
D = p_u − p_l
where p_u is the proportion of correct responses for the upper group and p_l is the proportion of
correct responses for the lower group. After the students in the upper one-third and lower one-
third have been identified, the proportion, i.e., percentage passing is calculated for both the
groups on each item. Then, the p of lower performing group is subtracted from the p of the top
performing group to yield an item discrimination index. The item discrimination index ranges from
-1 to +1. Past research has given the following four guidelines for the interpretation of item
discrimination:73
1. If D ≥ .40, no item revision necessary
2. If .30 ≤ D ≤ .39, little to no item revision is needed
3. If .20 ≤ D ≤ .29, item revision is necessary
4. If D ≤ .19, the item should be either completely revised or eliminated
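The calculation of D and its interpretation under the four guidelines can be sketched as follows. The response matrix is hypothetical, ties in total score are broken by the original order of the examinees, and treating the guideline bands as continuous cut-offs at .40, .30 and .20 is an assumption made for the illustration.

```python
def discrimination_index(responses, item):
    """D = p_u - p_l for one item, using the upper and lower thirds by total score."""
    ranked = sorted(responses, key=sum, reverse=True)   # highest totals first
    n_group = max(1, len(ranked) // 3)                  # size of each third
    upper, lower = ranked[:n_group], ranked[-n_group:]
    p_u = sum(row[item] for row in upper) / n_group     # proportion correct, upper group
    p_l = sum(row[item] for row in lower) / n_group     # proportion correct, lower group
    return p_u - p_l

def interpret_discrimination(d):
    """Map D onto the four guideline categories listed above."""
    if d >= 0.40:
        return "no revision necessary"
    if d >= 0.30:
        return "little to no revision needed"
    if d >= 0.20:
        return "revision necessary"
    return "revise completely or eliminate"

# Hypothetical data: 6 examinees by 3 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
d_values = [discrimination_index(responses, i) for i in range(3)]
labels = [interpret_discrimination(d) for d in d_values]
print(d_values, labels)
```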
Item discrimination is also established by determining the correlation coefficient between the
examinees’ performance on an item and their performance on a test.59 This is reported as the
point-biserial correlation (p-bis) between item score and total test score. It is desirable to have a
positive correlation as that is an indication that students who are answering correctly have a
higher overall score and those answering incorrectly have lower overall scores. The items should
be revised or discarded if the coefficient is negative. A value close to 1.0 discriminates more
strongly than one closer to 0.
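The point-biserial correlation is the Pearson correlation between a dichotomous item score and the total test score, and may be sketched as below. The data are hypothetical, and the total score here includes the item itself; some analyses remove the item from the total first, which is not shown.

```python
import math

def point_biserial(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and total test scores."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, total_scores))
    ss_x = sum((x - mean_x) ** 2 for x in item_scores)
    ss_y = sum((y - mean_y) ** 2 for y in total_scores)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical data for 4 examinees.
totals = [3, 2, 1, 0]
good_item = [1, 1, 1, 0]      # answered correctly by the higher scorers
bad_item = [0, 0, 0, 1]       # answered correctly only by the lowest scorer
r_good = point_biserial(good_item, totals)
r_bad = point_biserial(bad_item, totals)
print(f"r_good = {r_good:.3f}, r_bad = {r_bad:.3f}")
```

The second item yields a negative coefficient and would, following the text, be revised or discarded.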
2.2.3 Advantages of Classical Test Theory
Despite the development of newer measurement methods, CTT has continued to remain
popular with the majority of educators.59, 71 This is because the basic concepts of CTT are easy to
understand. The most commonly documented advantage of CTT is its relatively weak
assumptions. It is possible for a variety of data to be analyzed with the application of CTT due to
these assumptions. Because it is not mathematically strenuous, the concepts are easily grasped by
anyone with basic mathematical knowledge. For the purpose of assessing reliability, Cronbach’s
alpha is used universally. Most of the commonly available statistical packages have the option of
carrying out the analyses under CTT. This makes it more acceptable by psychometricians in the
fields of education and psychology. In addition, instruments designed for CTT-based
measurement easily fit into the underlying models, thus yielding desirable results. A significant
advantage of CTT is that individual items need not be optimal.74 Even if the items relate to an
underlying construct only to an extent, this concern can be overcome by constructing several
items assessing the construct under question. Studies have shown that reliability can be improved
to any desired level by increasing the number of items about a variable on a particular test.51, 75
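The effect of test length on reliability is commonly expressed through the Spearman-Brown prophecy formula, which is not stated in the text above and is introduced here only as an illustration. It assumes that the added items are comparable in quality to the existing ones; the reliability values used are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

def required_length_factor(reliability, target):
    """Factor by which the test must be lengthened to reach the target reliability."""
    return (target * (1 - reliability)) / (reliability * (1 - target))

# Hypothetical test with a reliability of 0.60.
doubled = spearman_brown(0.60, 2)               # reliability after doubling the items
factor = required_length_factor(0.60, 0.90)     # lengthening needed to reach 0.90
print(f"doubled: {doubled:.2f}, factor needed for 0.90: {factor:.1f}")
```

In this example a test would need to be about six times as long to move from 0.60 to 0.90, which shows that the gain from each added item diminishes as reliability rises.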
2.2.4 Limitations of Classical Test Theory
There are certain limitations to CTT despite its common usage. Hambleton76 has pointed out
that the item analysis is very much dependent on the sample of the examinees being assessed as
both item difficulty and discrimination indices are influenced by it. As stated elsewhere, if the
sample comprises examinees with high ability, the difficulty index tends to be higher.77 Other
researchers point out that the scores of examinee ability depend on item difficulty in CTT.78
Hence, if the items are easy, the observed test scores are higher. They are lower if the items are
difficult.
Another limitation of classical test theory that was addressed by Hambleton and
Swaminathan107 is that it assumes that the measurement error is the same for all examinees. The
type of test affects the test score and true score. Thus, the students’ scores become dependent on
the items being administered and even though the ability remains the same, one may have lower
scores on difficult tests and higher scores on easier ones. Because examinees differ in ability, their
test scores carry different amounts of error.
A further limitation of CTT is that, for comparison of the performance of different
examinees, the same or parallel items have to be used.79 This limitation is further aggravated as
parallel forms are difficult to achieve in CTT. Parallel testing is also the basis for test reliability
and because of that, test reliability is also affected by the examinee sample. In one study, the
authors presented evidence that reliability is a useful indicator of the quality of a set of test
scores.80 They concluded that it is dependent on the characteristics of the group of examinees
who take the test.
Another issue with CTT is that it is test-oriented which means that it is difficult to predict the
response of examinees on a test item.60 The CTT model, therefore, does not allow the developers
of a test to foresee the level of accomplishment of an examinee on a particular item.
The most significant limitation of CTT amongst the ones discussed above is that of examinee
and item inter-dependence. Both are influenced by the changes in each other’s characteristics. As
a result, it becomes difficult to compare the examinees taking different tests and items whose
characteristics are generated from different groups of examinees.
2.3 Shift from Classical Test Theory to Item Response Theory
Due to the limitations discussed above, newer methods of measurement continued to be
developed. Since the limitations of CTT were related to group dependence, mismatch between
items and examinee ability, weak assumptions and problems with parallel testing, it was only
understandable that the newer model was aimed at overcoming these limitations.
IRT or latent trait theory, as initially labeled by Lord in his dissertation in the 1950s, seemed
to provide a solution to the shortcomings of CTT.81 Once an alternative model had been
developed, it was, very quickly, followed by various other models focusing on measurement
issues. The main focus of IRT is the item and thus, all statistical analyses are carried out at item
level. This continues to be the main advantage of IRT over CTT. The same concept has been
highlighted by several studies in the fields of education 82-87 and psychology.86, 88-93 This supports
the evidence of the widespread utility of IRT in these fields, medical education being no
exception.31, 42, 94
2.4 Item Response Theory
Continuous changes in educational outcome measures demand the development of newer and
psychometrically sound instruments that produce valid scores including scores with high
reliability. In psychometrics, IRT (also known as latent trait theory or modern mental test
theory) is a body of theory that describes the application of mathematical models to data from
questionnaires and tests as a basis for measuring abilities, attitudes or other variables.95 It is used
for statistical analysis and development of assessments, especially for high stakes exams.
IRT is a statistical model that expresses the relationship between an individual’s response to
an item and the underlying latent variable, also called latent trait or construct. This latent variable
is expressed as theta (θ) and is a continuous unidimensional construct that explains the
covariance among item responses.96 People at higher levels of theta have a higher probability of
responding to an item correctly. The ultimate aim of item response theory is to test people.
Hence, its primary interest is focused on establishing the position of the individual along some
latent dimension. Because of the many educational applications, the latent trait is often called
ability.
2.4.1 Item Response Theory-Then and Now
When Frederic Lord published his doctoral thesis on latent trait theory, educators and
psychometricians were provided with an option to choose between CTT and IRT.97 The fact
that IRT modeled the probability of a response pattern of an examinee as a function of the
person’s ability led to a quick propagation of interest. In 1957, Birnbaum44 published a series of
technical reports, followed by Georg Rasch,98, 99 who published his book presenting some more
models for IRT in 1960. Baker added to Birnbaum’s works by comparing logistic and normal
ogive functions in 1961.99 While Lord61 and Novick100 put forward dichotomous models,
polytomous models were proposed by Samejima towards the later end of 1960s.101 By the 1970s
and 80s, Applied Psychological Measurement and The Journal of Educational Measurement
were publishing original studies by Hambleton102 and Wright.103
With the advent of the new century, a surge was noted in the software designed for the
analysis of item data sets. These programs handled both the technical and the computational
aspects of the IRT framework and mainly included BILOG,104 MULTILOG39 and WINSTEPS.105
A recent addition to this list is Xcalibre,106 which has helped more widespread use of IRT
by statisticians rather than exclusively by behavioural scientists and psychometricians.
2.4.2 Basic Concepts of IRT
In contrast to CTT which is based on the theoretical model depicted by X=T+E, IRT employs
a mathematical function. Hambleton and Swaminathan107 stated that the characteristics of IRT are
based on the notion that the relationship between the observed response and the trait in question
has to be specified. Furthermore, it is assumed that the examinee performance can be predicted
from one or more abilities. The ability parameter, also called theta, constitutes one of the
parameters of IRT. Crocker and Algina have also noted that the relationship between the
observed score and ability parameter is the same as the observed score and true score.60 They
have, in addition, highlighted the fact that item parameters, i.e., item difficulty and
discrimination are not dependent on the characteristics of the examinee. Furthermore, the ability
estimates are also independent of the items. It can, thus, be said that the item statistics are
person-free and the ability parameters are item-free.
2.4.3 Assumptions of IRT
IRT models include a set of assumptions about the data to which the model is applied.108
The first assumption common to the IRT models most widely used is that only a single ability is
measured by the items that make up the test. This is the assumption of unidimensionality, i.e., the
covariance among the items can be explained by a single underlying dimension.94 This
assumption is sometimes not met when cognitive, personality and test-taking factors might affect
test performance. A few of these factors are level of motivation, test anxiety, ability to work
quickly and tendency to guess when in doubt about the answers. All these factors are said to
contribute to random error. The unidimensionality of a scale can be evaluated by performing an
item-level factor analysis, designed to evaluate the factor structure.109
A second assumption of IRT models is that the items display local independence.110 This
means that when the abilities influencing test performance are held constant, examinees’
responses to any pair of items are statistically independent. This is technically subsumed under
the unidimensionality assumption and requires that, once the relationship of the items to the underlying
construct being measured is accounted for, there is no additional systematic covariance among
the items.111 In other words, local independence means that if the trait level is held constant,
there should be no association among the item responses. Violation of this assumption may result
in parameter estimates that are different from what they would be if the data were locally
independent.
The third assumption of IRT models is that the response of an examinee to an item can be
modeled mathematically as the item response function.99 The item response function is a
mathematical function that relates ability (theta) to the probability of endorsing
an item. When expressed in the form of a graph, it is called the item characteristic curve
(ICC). These curves are discussed in the coming sections.
2.4.4 Item Characteristic Curve, Item Difficulty and Item Discrimination
A basic concept in IRT is the ICC, which is a mathematical expression that relates the
probability of success on an item to the ability measured by the test and the characteristics of the
item.109 It is essentially a non-linear regression of the probability of a correct response to a
given item on ability. Ability is also called theta in IRT. The two important properties of an ICC
are the difficulty and the discrimination of an item. Item difficulty, also called the “b” parameter, is a
location index whose position is depicted on the theta axis, i.e., the x-axis. The second property is
discrimination, also called the “a” parameter. It describes the ability of an item to
differentiate between examinees with abilities below and above the item location. The figures
below show the graphic representation of the ICC.
In an ICC, theta or ability lies on the x-axis and the probability of endorsing an item on the
y-axis. The item difficulty or parameter b is the point on the theta (θ) scale where a person has a 50%
chance of responding positively to the item. Hence, it can be observed that b determines the
threshold of the graph. Indices between 0.25 and 0.75 are recommended as desirable levels of
difficulty in IRT.112 The location of b is found by drawing a vertical line from the point of
inflection, i.e., the change in curvature, to the horizontal axis. In the figure below, the value for b
is 1 for the rightmost curve, 0 for the middle one and -1 for the leftmost one. The closest equivalent
of the b parameter in CTT is p.
Figure 1: b Parameter on Item Characteristic Curves
The difficulty parameter, expressed as “b”, is most central to the concept of the ICC. If one
observes the ICCs in Figure 1 from left to right, one notices a change in the shape of each curve from
upward concavity to downward concavity. The point at which this change occurs is given by the b parameter,
which determines the position of the curve on the x-axis or theta. As an item becomes more
difficult, the curve is shifted from left to right.
The discrimination or parameter “a” describes the strength of an item's discrimination
between people with trait levels (θ) below and above the item difficulty. It determines the slope of
the curve. The figure below shows the item slopes produced by different values of the discrimination parameter.
Figure 2: a Parameter on Item Characteristic Curves
The a parameter is determined by drawing a line tangential to the curve at the b parameter.
The steeper the curve, the more discriminating the item. In Figure 2, the respective values for the a
parameter are 2, 1 and 0.5. Item-total correlation (also called point biserial correlation) is the
equivalent of item discrimination in CTT. As the a parameter decreases,
the ICC gets flatter until there is no change in the probability across the ability
continuum. Items that have very low a values are therefore not useful for
discriminating between different ability levels.
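The link between the a parameter and the tangent line at b can be made explicit with a short derivation. This is a sketch that assumes the two-parameter logistic form without the scaling constant D = 1.7 that some texts include; the notation P(θ) is introduced here for illustration.

```latex
\begin{align}
P(\theta) &= \frac{1}{1 + e^{-a(\theta - b)}} \\
\frac{dP}{d\theta} &= a\,P(\theta)\,\bigl(1 - P(\theta)\bigr) \\
\left.\frac{dP}{d\theta}\right|_{\theta = b} &= a \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = \frac{a}{4}
\end{align}
```

Under this assumption, a values of 2, 1 and 0.5 would correspond to tangent slopes of 0.5, 0.25 and 0.125 at the point of inflection.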
The third parameter in IRT is that of guessing, also called the ‘c’ parameter. It is the lower
asymptote parameter, which reflects the probability that examinees with a low level of ability respond correctly to an
item. In Figure 3, it can be seen as the lowest point of the ICC, which the curve approaches as theta tends to
negative infinity.
Figure 3: c Parameter on Item Characteristic Curve
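The roles of the a, b and c parameters can be illustrated with a minimal Python sketch of a logistic ICC. The function name `icc` and the parameter values are hypothetical, and the logistic form without a scaling constant is an assumption rather than the exact function used by any particular software. Note that the probability at θ = b equals 0.5 only when c = 0; with guessing it becomes (1 + c) / 2.

```python
import math

def icc(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under a three-parameter logistic ICC.

    With c = 0 this reduces to the two-parameter form.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Tabulate three curves that differ only in difficulty (b = -1, 0, 1).
for b in (-1.0, 0.0, 1.0):
    row = [round(icc(theta, a=1.0, b=b), 3) for theta in (-3, -2, -1, 0, 1, 2, 3)]
    print(f"b={b:+.0f}", row)
```

Evaluating the function over a grid of theta values and plotting the result would give curves of the kind shown in the figures of this section.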
2.4.5 Test Characteristic Curve
IRT methods are also applicable at the test or scale level, in addition to the item level. The concept
of the test characteristic curve (TCC) stems from this ability of IRT.113 TCCs are test-level
analogues of ICCs that represent a non-linear regression of overall test score on ability. In other
words, a TCC is created by summing all the ICCs across the ability continuum. The TCC can be
a very useful tool for evaluating the range of measurement and the degree of discrimination at
different points of the ability continuum. In addition, the degree to which the TCC is linear
provides an indication of the extent to which the measure provides interval scale or linear
measurement.112
Figure 4: Test Characteristic Curve
It can be observed in Figure 4 that the ability estimate is plotted on the x-axis as for an ICC
and the true score on the y-axis. A TCC expresses the relationship between the true score and the
ability scale. It can be interpreted in nearly the same terms as an ICC. The slope of the curve is
influenced by how the value of true score is affected by the changes in ability.114 There are some
situations where the TCC can be a nearly straight line over most of the ability scale. Most tests,
however, are expressed by a nonlinear curve. Unlike an ICC, a TCC does not have a particular formula with
parameters of its own. Hence, the curve is best described in verbal terms after visual
inspection.
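Since a TCC is obtained by summing the ICCs, it can be computed numerically even though it has no parameters of its own. The following Python sketch assumes logistic ICCs and a hypothetical three-item test; the item values are made up for illustration.

```python
import math

def icc(theta, a, b, c=0.0):
    """Logistic item characteristic curve (assumed form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Expected true score at theta: the sum of the item ICCs."""
    return sum(icc(theta, a, b, c) for (a, b, c) in items)

# Hypothetical three-item test, one (a, b, c) tuple per item.
items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
for theta in (-3, -1.5, 0, 1.5, 3):
    print(theta, round(tcc(theta, items), 3))
```

The returned value ranges between the sum of the c parameters and the number of items, which corresponds to the true-score axis of the curve.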
2.4.6 IRT Models
There are three types of models that are commonly used in IRT for dichotomous data.
Depending on the number of parameters being used, they are called the one-, two- and three-parameter
models.115 The three parameters used in these models are the b, a and c
parameters, which are the difficulty, discrimination and guessing parameters, respectively.
A one-parameter model is the simplest of the three models.60, 95, 107 This model assumes that
the probability that a student will correctly answer a question is a logistic function of the
difference between the student's ability (θ) and the difficulty of the question (b).116 Another
model that should be mentioned here is the Rasch model which, although it takes the student’s
ability and the difficulty of the question into account, is slightly different from the 1PL model. In
the Rasch model, each individual in the person sample has parameters defined for item
estimation. On the other hand, when the person sample has the parameters defined by a mean and
standard deviation for item estimation, it is called the 1PL model of IRT.2 The two-parameter
model has the same function as presented for the one-parameter model. However, in the two-parameter
model, the item discrimination parameter varies across items, as does the item
difficulty parameter.76 The three-parameter model adds a guessing parameter, which is especially
useful for multiple-choice and true-false testing.
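The three models can be written side by side as follows. This is a sketch in one common notation, in which the item index i is introduced here for illustration and the scaling constant D used in some texts is omitted; it is not necessarily the exact parameterization of the software used in this research.

```latex
\begin{align}
\text{1PL:}\quad & P_i(\theta) = \frac{1}{1 + e^{-(\theta - b_i)}} \\
\text{2PL:}\quad & P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}} \\
\text{3PL:}\quad & P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}
\end{align}
```

Each model is nested in the next: setting c_i = 0 in the 3PL gives the 2PL, and fixing the discrimination at a common value in the 2PL gives the 1PL.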
2.4.7 Item Analysis with IRT
Item analysis is the process by which the quality of an item in a test and of the test as a whole is
assessed on the basis of examinees’ responses to that item.72 It is useful because not only does it
help improve items for future use but it also helps eliminate the ones that have poor
characteristics. This process also helps instructors develop content-appropriate tests.113
2 http://www.rasch.org/rmt/rmt193h.htm
IRT analyzes a scale at the item level by calculating item difficulty, discrimination and the
test information function.117 In addition, it calculates the SE for the a and b parameters of each
individual item. It is able to estimate the location of an item and the strength of its relationship to the
construct being measured. The former is signified by the item's position on theta in the ICC and the
latter by the slope of the curve.118 This
property of IRT helps decide which items to keep in a test and which ones to remove. Depending
on the purpose of the analysis, the items may be placed close to the cut-off value on theta or be
spread uniformly along the continuum from −∞ to +∞.
If the purpose of the instrument is to identify participants either for remedial measures or for
placing them into various groups, the location parameters should ideally be close to the cut-off.
If the aim is to measure the trait at all levels, they should be spread evenly. IRT is, thus,
able to create tests that are shorter and more reliable and are aimed at the concerned population
to test the desired content.
It is not possible to fully utilize the potential of IRT models without making sure that the
right model has been chosen for item analysis. IRT investigates how test items function as trait
measures. This is carried out by determining item fit statistics. Item fit is vital because it
identifies the test model that is most effective in retaining the integrity of the collected data. It
locates non-essential dimensions affecting the response to an item along with faulty construction
of items, thus recognizing item issues like miskeying of items or ambiguously worded items.
Another feature of item fit analysis is that it indicates errors that might have occurred in the
calibration phase of developing the test.
Most of the methods used for item fit statistics rely on the chi-square statistic. Examinees
are first rank-ordered according to their estimated theta. They are then grouped into categories,
which may be fixed or subjectively determined. The proportion of examinees in each category who answer an
item correctly is then calculated and compared to the predicted proportion based on the item
response function. Xcalibre, the IRT software used in my research, also uses the chi-square fit
statistic as an index of the overall fit of an item to the empirical data and evaluates its statistical
significance.
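The grouping procedure described above can be sketched in Python as follows. This is an illustrative version of a grouped chi-square fit statistic under a logistic ICC, with equal-sized groups as an assumption; it is not the exact computation performed by Xcalibre, and the data and function names are hypothetical.

```python
import math

def icc(theta, a, b, c=0.0):
    """Logistic item characteristic curve (assumed form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def chi_square_item_fit(thetas, responses, a, b, c=0.0, n_groups=5):
    """Grouped chi-square fit for one item; returns (chi_square, groups_used)."""
    # Step 1: rank-order examinees by estimated theta.
    ranked = sorted(zip(thetas, responses))
    n = len(ranked)
    chi_sq, used = 0.0, 0
    for g in range(n_groups):
        # Step 2: cut the ranked list into groups of near-equal size.
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        size = len(group)
        # Step 3: compare observed and model-predicted proportions correct.
        observed = sum(u for _, u in group) / size
        expected = sum(icc(t, a, b, c) for t, _ in group) / size
        chi_sq += size * (observed - expected) ** 2 / (expected * (1.0 - expected))
        used += 1
    return chi_sq, used

# Hypothetical theta estimates and 0/1 responses to one item.
thetas = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
responses = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(chi_square_item_fit(thetas, responses, a=1.5, b=0.0))
print(chi_square_item_fit(thetas, responses, a=1.5, b=3.0))
```

A larger statistic indicates a larger gap between the observed and predicted proportions, so the second call, which uses a deliberately misplaced difficulty, yields the larger value.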
Research aiming at item analysis has yielded valid and reliable information.42, 72, 119 Chang et
al applied the Rasch model to the data from a Taiwanese board certification exam in anesthesia
and found a mean examinee ability that was higher than the mean item difficulty in this written
test.42 The participants were able to answer 78% of the items correctly. Swanson et al
investigated the impact of item format and number of options on the psychometric characteristics
of, and the response times for, multiple-choice questions appearing on Step 2 of the United
States Medical Licensing Examination.120 They concluded that use of the extended-matching
format and smaller numbers of options per item resulted in more efficient use of testing time and
greater score precision per unit of testing time. Other studies conducted by May and Jackson,119
Yan et al121 and Bhakta et al31 have also explored various aspects of item analysis.
2.4.8 Applications of IRT
IRT has numerous applications in educational measurement and the social sciences. It is used for
a number of processes because of its unique features, characteristics and components
summarized above. It is the testing model of choice for many high-stakes exams including the GRE,
SAT, TOEFL and PISA.122 It is also used for medical licensing and accreditation exams. These
include the Medical Council of Canada Evaluating Examination and the MCQ component of the
Medical Council of Canada Qualifying Examination Part I.123 It is used for assessing reliability110, 124
and providing validity evidence for various types of exams (item and test information
function),125 test equating,102 test assembly and banking,126 scoring and reporting and for
estimating task difficulty and stringency levels of raters.98 One of its main purposes in the
context of assessment is to evaluate how well a tool of assessment works.127 It allows for the
analysis of more complex methods of assessment than CTT offers. Perhaps the most
novel application of IRT is in computer-based testing, where it has been used extensively.
All of the above-mentioned applications of IRT can be broadly placed into one or another of
the following categories: 1) Item analysis 2) Ability and parameter estimation of items
3) Differential item functioning 4) Computerized adaptive testing. Item analysis, which was the
application used in this research, has been discussed earlier. To give a
broader perspective of the widespread use of IRT, a few of the
other applications are briefly discussed below.
2.4.8.1 Ability and Item Parameter Estimation
The probability of a correct response in the item response models depends on the
examinee’s ability and the parameters that characterize the items. Because the actual values of
the item parameters are not known, one of the tasks performed when a test is analyzed under IRT
is to estimate these parameters. The obtained item parameter estimates then provide information
as to the technical properties of the test items. The procedure commonly used to obtain these
estimates is maximum likelihood
estimation.39, 128 In IRT, item parameter estimation is computationally intensive and must be
carried out by computer programs specifically designed for such a task. Early software programs
focused on maximum-likelihood estimation as a mechanism for estimating the item
parameters.129 These programs eventually had to adopt numerous ad hoc constraining
mechanisms to avoid some of the problems associated with “pure” maximum-likelihood
estimation of IRT item parameters. Previous item parameter estimation techniques required
relatively long tests and large samples (i.e., several thousand examinees) in order to obtain
accurate IRT item parameter estimates. With the implementation of the maximum likelihood
technique, reasonable estimates of IRT item parameters can be derived from short tests (e.g., 25
items) and small samples of examinees (e.g., fewer than 1000). IRT adopts explicit models for the
probability of each possible response to a test and hence its alternative name, probabilistic test
theory, may be the more apt one. Any attempt at testing is preceded by a calibration study, i.e.,
the items are given to a sufficient number of test persons whose responses are then used to
estimate the item parameters.130
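The idea of maximum likelihood estimation can be illustrated on the simpler, ability side of the problem. The Python sketch below assumes that the item parameters are already known and searches a grid of theta values for the one that maximizes the likelihood of a response pattern under a two-parameter logistic model. Item calibration programs solve a harder, joint problem, so this is only an illustration; the function names and item values are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at a given theta."""
    total = 0.0
    for (a, b), u in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def ml_theta(items, responses, lo=-4.0, hi=4.0, step=0.01):
    """Grid-search maximum likelihood estimate of theta on [lo, hi]."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + k * step for k in range(n_steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]  # hypothetical (a, b) values
print(round(ml_theta(items, [1, 1, 0]), 2))
print(round(ml_theta(items, [1, 0, 0]), 2))
```

For all-correct or all-incorrect patterns the likelihood has no interior maximum, so the grid search simply returns a boundary value. This is one of the problems with “pure” maximum likelihood that constrained approaches are meant to avoid.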
2.4.8.2 Differential Item Functioning
Test equity is a concept that characterizes uniformity in the testing of subgroups of a population
on the same construct, across participants with various levels of
ability. The way to ensure it is by removing content that is biased, i.e., content that favours
one group more than the other when the same construct is being measured. The items that create such
bias are said to exhibit differential item functioning (DIF). Such items should be removed
from the test or scale to make it more reflective of a person’s true abilities. DIF is a statistical
property whereby examinees with similar abilities but from different groups have different probabilities of success
on an item. Such items affect the validity of a test and are a serious threat
to tests that measure the trait level of participants from different subgroups of the
population under study. IRT is a very useful method for identifying such items.
IRT assesses DIF by studying the difference between the ICCs estimated for two groups of examinees with
potentially similar abilities. If the matched-ability examinees plot on the same curve, it is an
indication that the item does not exhibit differential functioning (the smaller the distance
between two ICCs, the less the DIF). This type of analysis is always preceded by test equating. It
is important to note here that group differences might not necessarily be due to DIF but to actual
differences in the group means. If the IRT model fits accurately, the same ICCs will be generated.61
One must always question why an item has differential functioning and, if a justifiable reason is not
found, that item might have to be left out due to its content. Such a situation is balanced by
constructing more items in the test that favour the focal group.
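The notion of the distance between two ICCs can be sketched as the area between the curves estimated for a reference group and a focal group. The Python example below approximates this area numerically. It assumes logistic ICCs that have already been placed on a common scale, and the parameter values are hypothetical; it illustrates an area-based index in general and not the procedure of any specific software.

```python
import math

def icc(theta, a, b, c=0.0):
    """Logistic item characteristic curve (assumed form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def unsigned_area(ref_params, focal_params, lo=-8.0, hi=8.0, step=0.01):
    """Approximate the area between two ICCs with the midpoint rule."""
    n_steps = int(round((hi - lo) / step))
    area = 0.0
    for k in range(n_steps):
        theta = lo + (k + 0.5) * step
        gap = icc(theta, *ref_params) - icc(theta, *focal_params)
        area += abs(gap) * step
    return area

# Hypothetical (a, b, c) estimates for the same item in two groups.
no_dif = unsigned_area((1.0, 0.0, 0.0), (1.0, 0.0, 0.0))
with_dif = unsigned_area((1.0, 0.0, 0.0), (1.0, 1.0, 0.0))
print(round(no_dif, 3), round(with_dif, 3))
```

When the two groups share the same curve the area is zero. When the curves differ only in difficulty, the area is close to the difference between the two b values.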
2.4.8.3 Computerized Adaptive Testing
IRT is the backbone of computerized adaptive testing (CAT) and its various functions
are utilized in all three steps of tests administered by this “state of the art” mode. In fact,
CAT cannot function without the property of invariance, a characteristic of IRT. When an item
bank exists with access to item-level statistics generated with the application of
IRT, CAT can be initiated. The actual process is iterative and begins with the analysis of all
those items that have not yet been administered to the candidate; based on that, a decision is made
about the next item to be administered, which will suit the ability level currently estimated. The
chosen item is then answered by the examinee and a new ability estimate is generated that is
based on the responses to the items administered so far. These steps continue to be repeated until
a criterion for stopping, which has been identified beforehand, is met. This criterion
may be the time spent on the test, the number of items administered, the ability estimate, the content
tested, or the standard error. Students find this method of being tested favourable as it helps
to cut down the testing time by half while maintaining a high level of precision.131
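The iterative process described above can be sketched in Python as follows. Several choices here are assumptions made for illustration and are not taken from any operational CAT system: items follow a two-parameter logistic model, the next item is the unused one with maximum information at the current estimate, ability is re-estimated as a posterior mean under a standard normal prior, and testing stops on a maximum number of items or a target standard error. The item bank and the simulated examinee are hypothetical.

```python
import math

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Information of a two-parameter logistic item at theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

GRID = [-4.0 + 0.1 * k for k in range(81)]

def eap_estimate(administered):
    """Posterior mean and SD of theta under a standard normal prior.

    administered: list of ((a, b), response) pairs.
    """
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)
        for (a, b), u in administered:
            p = p_correct(t, a, b)
            w *= p if u == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, max_items=10, se_target=0.4):
    """Administer items adaptively until a stopping criterion is met."""
    theta, se = 0.0, 1.0
    remaining = list(range(len(bank)))
    administered, order = [], []
    while remaining and len(order) < max_items and se > se_target:
        # Pick the unused item that is most informative at the current estimate.
        best = max(remaining, key=lambda i: item_information(theta, *bank[i]))
        remaining.remove(best)
        order.append(best)
        administered.append((bank[best], answer(best)))
        theta, se = eap_estimate(administered)
    return theta, se, order

# Hypothetical bank of (a, b) pairs and a deterministic simulated examinee
# who answers correctly whenever the item difficulty is below 1.0.
bank = [(1.5, -3.0 + 0.25 * k) for k in range(25)]
theta_hat, se_hat, order = run_cat(bank, lambda i: 1 if bank[i][1] < 1.0 else 0)
print(round(theta_hat, 2), round(se_hat, 2), len(order))
```

The posterior mean is used instead of a pure maximum likelihood estimate so that the estimate stays finite after the first few responses, when the response pattern may be all correct or all incorrect.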
2.4 Comparing CTT and IRT
Certain aspects of classical test theory make it less desirable for educational measurement
than IRT. One of these is that the item characteristics are group-dependent, i.e., if examinees
under study are different from the ones with which the item indices had been obtained, the test
becomes of limited value. Similarly, examinee performance is test-dependent. Furthermore, the
theory is expressed at the test level rather than the item level. In addition, it does not provide a
measure of precision for each ability score.111, 130 IRT, on the other hand, is group-independent and
test-independent and is expressed at the item level.
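The group dependence of the classical item indices can be illustrated with a small Python sketch that computes the p value and the point biserial correlation for the same item in two cohorts. The data are hypothetical and the point biserial is computed against the uncorrected total score, which is an assumption of this example.

```python
import math

def item_p_value(item_scores):
    """CTT difficulty: proportion of examinees answering correctly."""
    return sum(item_scores) / len(item_scores)

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item and the total score."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: the same item answered by a stronger and a weaker cohort.
strong_item = [1, 1, 1, 1, 0, 1, 1, 1]
strong_total = [28, 27, 25, 24, 18, 26, 29, 23]
weak_item = [0, 1, 0, 0, 1, 0, 0, 1]
weak_total = [12, 20, 10, 14, 19, 11, 15, 21]

print(item_p_value(strong_item), item_p_value(weak_item))
print(round(point_biserial(strong_item, strong_total), 2),
      round(point_biserial(weak_item, weak_total), 2))
```

The item appears easy in the stronger cohort and difficult in the weaker one, although the item itself has not changed.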
In contrast to CTT, IRT models are lauded for their ability to generate invariant
estimators.132 That is, theoretically IRT ability estimates, θ, are “item-free” (i.e., would not
change if different items were used) and the item difficulty statistics are “person-free” (i.e.,
would not change if different persons were used). For single-ability, dichotomously scored test
items, IRT employs three different models. Because the assumptions of IRT are complex, it is
not suitable for all situations.133 Several medical school exams utilize CTT rather
than IRT for analyses. On the other hand, IRT is extensively used in several high-stakes exams
like the GRE, SAT and PISA due to its suitability for computerized adaptive testing and its ability to
handle large data
sets.133 With the development of newer software, it is becoming more commonplace than ever
before to use IRT for medical education-related research.38, 134
Despite the more advanced nature of IRT, CTT has served psychometricians well for a
very long time. This is because it has a number of well-documented advantages over other testing
theories.59 Its concepts are fairly basic and its methods quite flexible. It has a robust model that is
amenable to changes in the data without skewing it. Furthermore, its underlying
models fit several instruments accurately. It does have some theoretical weaknesses as well that
make it less favourable for certain situations. In CTT, the item-level statistics of difficulty and
discrimination are examinee-dependent.76 Usually, the scales tend to be long with an inability to
differentiate between a common theme that might run across items for the construct under study.
The items, furthermore, are not probed vigilantly. Despite these shortcomings, CTT continues to
be in demand for many types of studies.
In many situations, a combination of CTT and IRT works better than either of them
on its own. CTT, in such circumstances, can be used to carry out basic statistical analyses and
IRT can be applied to measure examinee abilities and item-level statistics.
IRT measurement is an advanced statistical model that is able to address many item-level
concerns not resolved by other testing theories. Although CTT has been used more often than
IRT in medical education, the numerous applications of IRT and now the advent of more
advanced software are making it more acceptable. Test design and equating, item selection
and scaling, and adaptive testing are carried out more conveniently by IRT than by other
available models. IRT offers the promise of solving many problems that are faced by
psychometricians in medical education. Although CTT is more robust in terms of
assumptions and data size, IRT provides more useful information in terms of examinee abilities
and item difficulty. Many international testing bodies dealing with larger data sets prefer applying
IRT models for various high-stakes exams, which is a credit in itself in favour of IRT. A summary
of the differences between some important features of CTT and IRT is given in Table 1.
Table 1: Features of Classical Test and Item Response Theory
Features | Classical Test Theory | Item Response Theory
Focus | It is on determining the error of measurement. | It is on determining the unobserved theoretical latent trait.
Goal | In CTT, the quality of the observed test score is evaluated by estimating the reliability coefficient and the standard error. | In IRT, the score of the latent trait is estimated.
Standard Error | Only a single type of error can be determined in CTT. | The standard errors of individual parameter estimates can be determined in IRT.
Sample Size | CTT works well with both small and large data sets. | IRT requires a larger data for optimal application depending on the model that fits the data.
Assumptions | CTT has a robust model with flexible assumptions. | The various models have strict assumptions of unidimensionality and local independence in IRT.
Reliability of the Scores | CTT calculates the reliability coefficient of the total test score. | In IRT, reliability is reflected by the test information function.
Item Calibration | CTT does not require item calibration due to its flexibility with the data. | It is a prerequisite in IRT for the items to be calibrated before the actual administration of a test.
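The contrast between a single reliability coefficient in CTT and a conditional standard error in IRT can be sketched in Python. The example assumes Cronbach's alpha as the CTT reliability coefficient and a two-parameter logistic model for the test information function; the response matrix and item parameters are hypothetical.

```python
import math
from statistics import pvariance

def cronbach_alpha(score_matrix):
    """Cronbach's alpha for a persons-by-items matrix of item scores."""
    k = len(score_matrix[0])
    item_vars = [pvariance([row[j] for row in score_matrix]) for j in range(k)]
    total_var = pvariance([sum(row) for row in score_matrix])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)

def ctt_sem(score_matrix):
    """Single standard error of measurement for the whole test (CTT)."""
    total_sd = math.sqrt(pvariance([sum(row) for row in score_matrix]))
    return total_sd * math.sqrt(1.0 - cronbach_alpha(score_matrix))

def irt_se(theta, items):
    """Conditional standard error at theta from the test information."""
    info = 0.0
    for a, b in items:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        info += a * a * p * (1.0 - p)
    return 1.0 / math.sqrt(info)

scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]  # hypothetical responses
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]           # hypothetical (a, b)
print(cronbach_alpha(scores), round(ctt_sem(scores), 3))
print([round(irt_se(t, items), 2) for t in (-3, 0, 3)])
```

The CTT functions return one value for the whole test, whereas the IRT standard error changes with theta and is smallest where the items are most informative.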
2.5 Temporal Stability of MCQs
MCQs are frequently used in high stakes exams to assess the students. Their
psychometric properties may become less stable over time and across administrations due to a
number of factors. This raises concerns since decisions of certification, promotion and
graduation depend on these written assessments. It is, thus, desirable for the items to exhibit
temporal stability for their repeated use in exams. Changes that may occur in the item parameters
over time and administrations are referred to as parameter drift, which is discussed
below.
2.5.1 Parameter Drift
As stated above, parameter drift refers to the phenomenon of changes that occur in the
parameter estimates of an item due to repeated administrations. If the values of the parameters
alter more than would be expected due to measurement error, it cannot be assumed that these
values will remain unchanged over time. As a consequence of this, such items may have to be
removed from the item bank due to threats to stability.
Although the phenomenon of parameter drift is typically associated with IRT, changes may
also be observed in the context of CTT where p values and point biserial correlations may drift
over time and repeated administrations. Parameter drift is observed both in the context of item
difficulty and discrimination estimates. Some researchers have documented that item difficulty
has a stronger parameter drift than item discrimination.135
The phenomenon of parameter drift is attributable to several reasons in addition to
measurement error. Changes occurring in the construct are one reason for parameter drift. A
construct may change due to alterations in the testing universe, the objective of the assessments
or the target students. This is particularly observed in the context of a curriculum that is still
being developed and is undergoing frequent changes. Since some items testing a particular
construct might not be required for assessing it any longer, the usefulness of such items may
wane, leading to further drift. Another factor influencing the item parameters is the content of the
curriculum. If the curricular content is not dynamic, students become well-versed in picking up a
trend in the exam questions and item stability is affected. Bock et al135 observed parameter drift
in a study on a College Board Physics exam which they attributed to curricular differences. The
items on Basic Mechanics became easier over the years as the content was heavily covered in the
curriculum. Changes in the characteristics of the items also cause changes in the item
parameters. Hence, items that test certain general skills like arithmetic and comprehension have
been noted to drift less than ones that are content-specific. Item parameters are also affected by the
timing of instruction delivered to different cohorts. If one group of students has been
instructed closer to the exams than another, changes in the item characteristics may be
noted to influence the scores. One interesting influence on parameter drift is that of recency of
instructions.136 Content that has been emphasized in the near past may lead to improved general
knowledge about such topics, making some items appear less challenging. Bergstrom and
colleagues have reported on parameter drift resulting from differences between pre- and post-tests due to
changes in practice and motivational effects.137 A national computerized adaptive test yielded a
drift in 32-49% of the items between a pre-test and operational use over a five-year period.
Threats to security also bring about a change in the parameter estimates. Techniques like training
in test-wiseness tend to cause parameter drift since students learn to pick the correct answer
despite a lack of content knowledge. This is further aggravated by answer sharing by
examinees who have already taken the test. Overexposure of the items, either due to repetition
over multiple administrations or due to computerized adaptive testing, leads to a decrease in
test security since students start anticipating that certain items will be included in the test. Gender
differences, language preference and ethnicity can also cause a drift in the parameters of items.
Furthermore, significant changes in parameter estimates may also result from large changes in
the population. In addition, relatively easy items may become more difficult as the knowledge
being tested by these items becomes less common. On the other hand, difficult items may
become easier as the previously specialized knowledge becomes more commonly known.
Parameter drift leads to a number of consequences that may affect the outcome of an
assessment or a program. Due to its impact on the performance on an item, it affects the scores of
an exam. Students may find the items differentially easy or difficult due to a drift in the
estimates. Where comparisons have to be made in the performance of a student over time, it can
be complicated due to parameter drift as the baseline estimates may be altered with time.
Parameter drift can also pose challenges when decisions need to be made around the cut score of
an exam. This is especially important in the context of a high-stakes exam where decisions about
certification, graduation, etc. may be affected by this phenomenon. In the context of equating,
parameter drift introduces further equating error in pre-equated test forms if the
parameters are not re-estimated before the administration. Despite the undesirable consequences
of parameter drift, researchers have reported that the overall effect of this phenomenon on
test forms remains small. Theta recovery is robust and remains intact although drift may
be seen in both the b and a parameters.138, 139
Various methods are employed to study the phenomenon of parameter drift. One method is to
use the chi-square test, in which parameter estimates are compared across different time points to
detect drift.140, 141 Drift is also detected using the z test, in which a comparison is made between two
subgroups.140 In the context of IRT, different models are compared with each other for fit.135
Model fit is determined using the likelihood ratio chi-square test for detecting differences in the
models and hence parameter drift. Alternatively, the fit of one type of model is compared at
different points in time across administrations to observe whether there is a difference in the
parameter estimates leading to a drift. In addition, DIF may be used to detect drift as well.
Parameter estimates that are obtained from such testing are compared across administrations to
study the differences in them. Babcock et al142 and Wollack et al143 have visually compared test
characteristic curves across administrations, as these provide useful information regarding
changes in parameter estimates over time. Amongst all the methods discussed above,
chi-square testing has been documented to be most effective.140
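A z test for drift can be sketched in Python as a comparison of an item's difficulty estimates from two administrations relative to their standard errors. The form of the statistic below is an assumption: it treats the two estimates as independent and as lying on a common scale, and it is not necessarily the exact procedure of the cited studies. The item identifiers and values are hypothetical.

```python
import math

def drift_z(b_first, se_first, b_second, se_second):
    """z statistic for the change in an item's difficulty estimate."""
    return (b_second - b_first) / math.sqrt(se_first ** 2 + se_second ** 2)

def flag_drift(estimates, critical=1.96):
    """Return ids of items whose |z| exceeds the critical value.

    estimates: {item_id: (b_first, se_first, b_second, se_second)}
    """
    return [item for item, values in estimates.items()
            if abs(drift_z(*values)) > critical]

# Hypothetical b estimates and standard errors from two administrations.
estimates = {
    "item_01": (0.00, 0.10, 0.50, 0.10),
    "item_02": (-0.40, 0.15, -0.30, 0.15),
}
print(flag_drift(estimates))
```

Only changes that are large relative to the standard errors are flagged, which reflects the idea that drift is a change beyond what would be expected from measurement error.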
To conclude, it can be stated that temporal stability can be assessed by analyzing parameter
drift. Parameter drift is not uncommon and can especially be observed in exams utilized for
assessing a large number of students and with a large number of items. It is important to analyze
the MCQs for estimate drift as the items may become differentially easy or hard over repeated
administrations. The choice of method for analyzing the drift should depend on the effectiveness
and ease of application of a method and the stakeholders’ understanding of it.
2.6 Research Questions
The questions that were addressed by this research were mainly to observe the reliability of
scores on an MCQ exam while using two different methods and the stability of these MCQ items
over time. It was hoped that this research would help compare the similarities and differences in
the two methods, i.e., CTT and IRT and also explore some of the factors affecting the stability of
items on being used repeatedly. My research questions were as follows:
1. What was the reliability of scores using and comparing two methods of analysis, i.e.,
CTT and IRT, on MCQ items administered three times over a six-year period?
1A. What are the item parameters when conducting item analysis with CTT?
1B. What are the item parameters when conducting item analysis with IRT?
1C. Are the item parameters comparable when analysing with both CTT and IRT?
1D. What is the reliability index of the test scores?
1E. What are the item characteristic curves like for the individual items for each year?
2. Do the items exhibit temporal stability when repeated over Year 1, 2 and 3?
2A. Do the items show stability across years using CTT?
2B. Do the items show stability across years using IRT?
CHAPTER III – RESEARCH METHODS
3.1 Study Design
An exploratory retrospective cohort design was utilized to answer the research questions
in this study. The main aim of this particular design was to assess the reliability of the MCQs
over three selected years using CTT and IRT. In addition, item stability over three years was also
studied. Section two of this chapter presents the setting and context, section three elaborates on
sample and data source, and section four describes the analyses. Ethical concerns are discussed
in section five.
3.2 Setting and Context
This research was carried out at the University of Calgary. The data were obtained from the
Office of the Undergraduate Medical Education at the university. The University of Calgary is one of
Canada’s seven premier research universities and is a member of the Network of Centers of
Excellence, a Canada-wide program of research and innovation. In addition, it launched its
own initiative, “Eyes High”, in 2011. Eyes High is the University’s new strategic direction,
which aims at making it one of Canada’s top five research universities, grounded in innovative
learning and teaching and fully integrated with the local community.
The undergraduate medical program at the University of Calgary, which was established in
1967, is an innovative program that encourages the acquisition of skills required for solving
clinical problems through the use of the “Clinical Presentation Curriculum”. This curriculum
was initially introduced in the early nineties.144 The foundations of this curriculum are the
principles of early contact with patients and integration of basic and clinical sciences. These
principles nurture the growth of knowledge and skills vital for the practice of medicine and the
efficient use of knowledge for the analysis and solution of clinical presentations.
The “clinical presentation curriculum” organizes the instructional strategies around 120
clinical presentations. Clinical history, physical examination and investigations warranted are
covered extensively in this way. For instance, the schema of an approach to a patient with
hypertension is shown in Figure 5 below.144 This curriculum was further strengthened in
2006, following ten years of student and faculty feedback, when the more traditional systems with
overlapping clinical presentations were merged together into one longer case.145 For example,
“chest pain” and “dyspnea” were linked together into the “cardio-respiratory system”. This
improvement helped to integrate the clinical presentations horizontally.
[Figure 5 schema nodes: Hypertension; True or Mislabeled; Primary; Secondary; Volume-Dependent; Vasoconstrictive; Renal Parenchymal Disease; Mineralocorticoid Excess; Angiotensin II Excess; Catecholamine Excess]
Figure 5: Causes and Pathophysiology of Hypertension
The MD is a three-year program at the University of Calgary and the summative certifying
exams comprise both MCQs and OSCE.3 The items in this research were chosen from three
randomly selected courses, i.e., 1, 3 and 6.4 Course 1 covers the prescribed curriculum of
Hematology and Gastroenterology (GIT) and is offered in the first year of medical school.
Course 3 covers the Cardiovascular (CVS) and Respiratory content and is also offered in year
one. Course 6 comprises Reproductive Medicine and Human Development and is
offered in year two of the MD program. The Undergraduate Medical Education (UGME) Office
has a well-developed MCQ bank that was accessed in this research with the permission of the
Associate Dean, Undergraduate Medical Education.
Security and copyrighting of MCQ items is an issue that arises whenever MCQ question
banks are accessed.146 These banks are expensive to construct and maintain both due to financial
constraints and logistical problems with the faculty as they require constant replenishing of high-
quality items after authoring, pretesting and analysis. Furthermore, the confidentiality of MCQs
is compromised with repeated use of the same items. For this reason, intellectual property and
digital copyrighting are put in place and implemented by academic institutes and individual
departments. For the same reason, it was not possible to disclose the details of the items analyzed
in this research.
3.3 Sample and Data Source
A total of 90 MCQs used in the assessment of three courses, Course 1 (Hematology and
GIT), Course 3 (CVS and Respiratory System) and Course 6 (Reproduction and Human
3 http://www.ucalgary.ca/mdprogram/admissions/introduction/years--
4 http://www.ucalgary.ca/mdprogram/admissions/teachingmethods
Development) over three years each were analyzed in this research. Table 2 shows the
distribution of item selection for each year and course.
Table 2: Item Distribution for Individual Year and Course
Year Course 1 Course 3 Course 6
2007 30 Items Selected
2008 30 Items Selected 30 Items Selected
2009 Same 30 Items Same 30 Items
2010 Same 30 Items Same 30 Items
2011 Same 30 Items
2012 Same 30 Items
Thirty multiple choice items were chosen for each of Courses 1, 3 and 6. The MCQs selected
were the ones that had been reused in either alternate or successive years. These MCQs were the
single best answer (SBA) type, also known as the one-best answer type. They are the most
commonly used type of MCQs in medicine and other life sciences.32 A clinical scenario usually
acts as an introductory stem in such types of questions which is followed by a lead-in question
and usually five options to choose from. Four of these options are distracters and one the correct
answer. It is important to keep the options homogeneous so that one option does not stand out
more than the other.51 The following is an example of an SBA type of MCQ:
A nine-month old girl is admitted to the hospital for growth faltering. The prenatal history is
unremarkable and the child thrived well for the initial four months. On examination, the child is
found to have a wide open fontanel, is listless and has a nappy rash. She is also below the 5th
percentile for length and weight. No other abnormalities are detected. After 1 week of routine
hospital care, the infant has gained 1 kg and has become more playful and alert. Which of the
following is the most likely explanation for the faltering growth?
(A) Hypothyroidism
(B) Infantile psoriasis
(C) Milk allergy
(D) Parental neglect
(E) Pyloric stenosis
The MCQs selected for this research covered various components that included the
assessment of knowledge about the skills in basic sciences, investigations, treatment and
management. The details of their distributions over the four mentioned skills will be elaborated
upon in the results section.
3.4 Data Analyses
Data analyses for both CTT and IRT were carried out using Xcalibre version 4.2. These
included the descriptive analysis of the research to give an overall picture of the results. Below
are the details of the analyses that were carried out in addition to a summary of the concepts
underlying them. Since the research question had two parts, one related to the reliability of items
and test, and the other to the temporal stability of the items, the analysis has accordingly been
divided into two questions.
3.4.1 Research Question No. 1
What was the reliability of scores using and comparing two methods of analysis, i.e., CTT and
IRT, on MCQ items administered three times over a six year period?
As discussed in the literature review, reliability of test scores is an end result of item analysis.
Since the objective of this research was to use two methods of analysis, i.e., CTT and IRT, and to
compare the results of both, the methods were subdivided to yield the answers to the following
questions:
3.4.1.1 Research Question No.1 A
What are the item parameters when conducting item analysis with CTT?
Item difficulty and discrimination are both important constituents of item analysis. For this
purpose, CTT was used to look at the difficulty and discrimination indices of items under study
over a period of three years. In CTT, the difficulty index is denoted by “p” and refers to the
proportion of examinees who have answered the item correctly. The higher the p value, the easier
the item. Its IRT counterpart is the item difficulty parameter denoted by “b”, although the two
run in opposite directions: a higher b indicates a harder item. As discussed in the
literature review, item discrimination in CTT refers to the item-total correlation and is called a
point biserial correlation, which can take any value between -1 and +1; the closer the
correlation coefficient is to 1, the more discriminating the item. The IRT analogue of the point
biserial correlation is the discrimination index, which is denoted by “a”.
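The two CTT indices described above can be illustrated with a minimal Python sketch. It is not the procedure run in SPSS or Xcalibre for this research; the function names and the six-examinee data set are hypothetical, and the item-total correlation is assumed to be the uncorrected one (the item is included in the total score).

```python
from math import sqrt


def item_difficulty(item_scores):
    """CTT difficulty index p: proportion of examinees answering correctly."""
    return sum(item_scores) / len(item_scores)


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total test score."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one item answered by six examinees, with their totals.
item = [1, 1, 1, 0, 0, 1]
totals = [28, 25, 22, 15, 12, 20]

p = item_difficulty(item)            # higher p means an easier item
r_pbis = point_biserial(item, totals)  # closer to +1 means more discriminating
print(f"p = {p:.3f}, p-bis = {r_pbis:.3f}")
```

In this made-up example the examinees who answered the item correctly also have the higher total scores, so the point biserial correlation is positive.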
3.4.1.2 Research Question No.1 B
What are the item parameters when conducting item analysis with IRT?
IRT was applied to assess the difficulty and discrimination indices of the same items used for
carrying out the item analysis with CTT. This was done so that differences and similarities could
be highlighted between the two methods.
3.4.1.2.1 Two-Parameter Logistic Model of Item Response Theory
This research used the 2 PL model of IRT,115 which comprises the following two
parameters:
1. The item difficulty, or threshold, parameter b: the point on the latent scale θ where a
person has a 50% chance of responding positively to the scale item.
2. The slope, or discrimination, parameter a: it describes the strength of an item's
discrimination between people with trait levels (θ) below and above the threshold b.
The 2 PL model was used since one of the aims of this research was to compare the a and b
parameters with the difficulty and discrimination parameters of CTT over three years. Also, 3 PL
models require a larger sample size for such analyses.61 The sample size usually recommended
for 3-PL analysis is between 1000 and 2000 examinees, the larger number being more desirable.147
While carrying out the item analysis with Xcalibre 4.2, the model constant was set at 1.7.
Theta was estimated using a maximum likelihood estimate and examinee ability estimates were
rescaled to have a mean theta of 0. Item analysis is briefly recapitulated in the following sections.
Since there are some differences in the interpretation of item analysis using the two different
measurement methods, these will also be discussed.
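The 2 PL response function can be sketched in a few lines of Python. This is an illustration only, assuming that the model constant of 1.7 mentioned above is the usual scaling constant D of the logistic model; the function name and the item values are hypothetical and are not Xcalibre output.

```python
from math import exp

D = 1.7  # scaling constant (assumed to be the "model constant" set to 1.7)


def prob_correct_2pl(theta, a, b, d=D):
    """Probability of a correct response under the 2PL model.

    theta: examinee ability; a: discrimination; b: difficulty (threshold).
    """
    return 1.0 / (1.0 + exp(-d * a * (theta - b)))


# Hypothetical item with a = 1.2 and b = 0.5.
for theta in (-1.0, 0.5, 2.0):
    print(theta, round(prob_correct_2pl(theta, a=1.2, b=0.5), 3))
```

When theta equals b the exponent is zero, so the probability is exactly 0.5, which matches the definition of the threshold parameter given above.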
3.4.1.2.2 Item Analysis
Item analysis is the process by which it can be confirmed if the items on a test are
functioning in the desired manner.112 Given that there are limited numbers of items on an
examination, every item has to be written in such a way that it is able to assess higher cognitive
functions along with an evaluation of the understanding and application process of the examinee
sufficiently.17 Item analysis, thus, helps establish the difficulty and discrimination levels of each
item.
3.4.1.2.3 Item Difficulty
Item difficulty expresses the proportion or percentage of students who answered the item
correctly. It can range from 0.0 (none of the students answered the item correctly) to 1.0 (all of
the students answered the item correctly). The average difficulty index for a five-option multiple
choice test should be between 0.25 and 0.75.112 If an item is found to have a difficulty of less
than 0.25, it may be that one of the wrong options has been recorded into the answer scanner as
the correct one (miskeyed item) or that the item was not written clearly. It is also possible
that the item may have more than one correct answer or that at least one distracter is very close
to the correct option.
3.4.1.2.4 Item Discrimination
This refers to the ability of an item to distinguish between the more knowledgeable and the
less knowledgeable students.112 An index of 0.40 and higher is said to be consistent with
excellent discrimination, 0.30 to 0.39 good, 0.10 to 0.29 fair and 0.01 to 0.10 poor. If the
discrimination index is negative, the item may be ambiguous or, as in
the case of item difficulty, may have been miskeyed inadvertently by the programmer.
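The screening rules in the two sections above can be written down as a small Python sketch. The function names are hypothetical, and two boundary choices are assumptions rather than statements from the cited source: a value of exactly 0.10 is treated as fair, and values between 0 and 0.01 are treated as poor.

```python
def classify_discrimination(index):
    """Label a discrimination index using the cut-offs quoted above."""
    if index < 0:
        return "negative: review for ambiguity or miskeying"
    if index >= 0.40:
        return "excellent"
    if index >= 0.30:
        return "good"
    if index >= 0.10:  # assumption: exactly 0.10 counts as fair
        return "fair"
    return "poor"      # assumption: 0 to 0.01 is also treated as poor


def flag_difficulty(p):
    """Flag items whose difficulty index falls below 0.25."""
    if p < 0.25:
        return "review: possibly miskeyed, unclear, or more than one answer"
    return "no flag"


# Hypothetical indices for three items.
for p, r in [(0.86, 0.09), (0.57, 0.31), (0.20, -0.05)]:
    print(p, flag_difficulty(p), "|", r, classify_discrimination(r))
```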
3.4.1.3 Research Question No.1 C
Are the item parameters comparable when conducting item analysis with both CTT and IRT?
Correlation coefficients were calculated between the item parameters generated by both CTT
and IRT for all three courses for the three years. This was done to observe whether there was a
relationship between the two methods of measurement. It was assumed that if the correlation was
good to excellent, it would mean that the two methods, irrespective of the differences in them,
would be comparable to each other. A perfect correlation coefficient is that of 1.0. Correlation
coefficients can be negative or positive; a positive coefficient means that the variables
increase together, a negative coefficient means that one variable decreases as the other
increases, and a coefficient near zero means that there is little or no linear relationship
between the variables under study. The formula that is used for calculating the correlation
coefficient standardizes the variables. Hence, changes in scale or changes in units of
measurement do not affect its value. P values were also reported along with the correlation
coefficients.
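The scale-free property of the correlation coefficient can be checked with a short Python sketch. The five pairs of CTT p values and IRT b values below are hypothetical and are not taken from the results of this research. Because a higher p means an easier item while a higher b means a harder one, the two difficulty indices would be expected to correlate negatively.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical CTT difficulty (p) and IRT difficulty (b) for five items.
ctt_p = [0.86, 0.45, 0.70, 0.57, 0.82]
irt_b = [-1.9, 0.3, -0.9, -0.2, -1.6]

r = pearson_r(ctt_p, irt_b)
# Re-expressing p as a percentage changes the unit but not the coefficient.
r_percent = pearson_r([100 * p for p in ctt_p], irt_b)
print(round(r, 3), round(r_percent, 3))
```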
3.4.1.4 Research Question No.1 D
What is the reliability index of the test scores?
The SE of parameter estimates was calculated for each item for the three years. As stated
elsewhere, standard error is sensitive to the size of the sample and a larger standard error is noted
for smaller samples than for the larger ones. The size of item parameters, i.e., difficulty and
discrimination indices, also influences the standard error, as more extreme parameters, for
example a difficulty index of 1.5, will lead to a larger standard error.
In addition, Cronbach’s alpha was calculated for the test scores of the three years for the
three courses individually to look at the reliability coefficient of the scores. Both CTT and IRT
were used for this purpose so that a comparison could be made between the results of the two.
For IRT, the SEMs of the examinees’ theta were averaged to produce a mean SEM for the
examinees for a given year. This mean SEM was then converted to a reliability coefficient by
applying the following formula:
Reliability = 1 - (SEM / SD theta)^2
where SEM is the mean standard error of measurement and SD theta is the standard deviation of the
examinee thetas. This calculation gave a single value that was used to represent the reliability
of the IRT scores on the examination under scrutiny.
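Both reliability coefficients can be sketched in Python as follows. This is an illustration under stated assumptions, not the SPSS or Xcalibre computation: the function names and the data are hypothetical, sample (n - 1) variances are assumed throughout, and Cronbach’s alpha is written in its standard form, k/(k - 1) multiplied by (1 - sum of item variances / variance of total scores).

```python
from statistics import mean, stdev, variance


def cronbach_alpha(responses):
    """Cronbach's alpha; responses holds one row of item scores per examinee."""
    k = len(responses[0])
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)


def irt_reliability(sems, thetas):
    """Reliability = 1 - (mean SEM / SD of theta) ** 2, as in the formula above."""
    return 1.0 - (mean(sems) / stdev(thetas)) ** 2


# Hypothetical 0/1 responses of five examinees to four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 3))

# Hypothetical theta estimates and their standard errors of measurement.
print(round(irt_reliability([0.3, 0.3, 0.3], [-1.0, 0.0, 1.0]), 3))
```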
3.4.1.5 Research Question No.1 E
What are the item characteristic curves like for the individual items for each year?
A basic concept of IRT is the ICC which is a mathematical expression that relates the
probability of success on an item to the ability measured by the test and the characteristics of the
item.109 Item characteristic curves were generated for each individual item over three years for
three courses. The details of Course 1 are given in the results section. For Courses 3 and 6, the
graphs can be found in Appendix A8 and B8 respectively.
An ICC is essentially a non-linear regression of the probability of a correct response to a
given item on the examinee’s ability. Item difficulty and discrimination influence the shape of
the curve. Difficulty is a location index and describes the function of the item along the ability
scale. The steepness is attributable to the discrimination of the item. The higher the
discrimination index, the steeper the curve.
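The link between discrimination and steepness can be made concrete with a brief Python sketch. It assumes a logistic ICC with a scaling constant of 1.7; for that curve the slope at the point where ability equals difficulty is D multiplied by a, divided by 4. The function names and item values are hypothetical.

```python
from math import exp


def icc(theta, a, b, d=1.7):
    """Logistic item characteristic curve (2PL form, assumed here)."""
    return 1.0 / (1.0 + exp(-d * a * (theta - b)))


def slope_at_difficulty(a, d=1.7):
    """Slope of the logistic ICC at theta = b, which equals d * a / 4."""
    return d * a / 4.0


def rise_across_difficulty(a, b, half_width=0.5):
    """Increase in P(correct) from b - half_width to b + half_width."""
    return icc(b + half_width, a, b) - icc(b - half_width, a, b)


# Two hypothetical items with the same difficulty but different discrimination.
for a in (0.5, 1.5):
    print(a, slope_at_difficulty(a), round(rise_across_difficulty(a, b=0.0), 3))
```

The item with the larger a gains more probability over the same interval of ability, which is what a steeper curve looks like on the plot, while changing b only shifts the curve along the ability scale.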
3.4.2 Research Question No. 2
Do the items exhibit temporal stability when repeated over Year 1, 2 and 3?
This research question has been answered by using both CTT and IRT. The sections that
follow have been hence divided accordingly into subsections.
3.4.2.1 Research Question No 2. A
Do the items show stability across years using CTT?
3.4.2.1.1 Repeated Measures ANOVA
IBM’s SPSS (version 22) was utilized to run a repeated measures ANOVA for the three
years for each of the three courses. Repeated measures ANOVA detects differences between the
means of related groups. Furthermore, it helps determine if the dependent variables are altered
by the independent variable (year in the case of this research). It was appropriate for this research
as one of my objectives was to study the change in the item parameters over three points in time.
ANOVA assumes that the variances are equal across the groups or samples under research.
Levene’s Test for Equality of Variances148 is applied to test this assumption of homogeneity of
variances. It can, thus, be used to verify whether or not the variances of the groups are
statistically different. Generally, 0.05 is used as the probability level to establish the statistical
significance; so, if Levene’s Test shows a significance value of < 0.05, it can be concluded
that the variances are significantly different. Similarly, if it shows a value greater than 0.05, it
means that the variances are not significantly different.
If Levene’s Test is non-significant, then another statistic is determined for ANOVA,
which is called the F ratio. This is the ratio of the variance between groups to the variance within
groups, i.e., the ratio of the explained variance to the unexplained variance. The F ratio is used to
test whether or not two variances are equal. If the p values are not significant and the F ratio is
small, it means that the dependent variables, i.e., difficulty and discrimination indices, are stable
over the years and unaffected by time.
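The F ratio for a one-way repeated measures design can be sketched in Python as follows. This is the textbook partition of the sums of squares, with items playing the role of the repeated units and year as the within factor; it is an assumption that this mirrors the SPSS procedure used here, and the function name and data are hypothetical.

```python
def repeated_measures_f(data):
    """One-way repeated measures ANOVA.

    data: one row per item, one column per year.
    Returns (F ratio, df for year, df for error).
    """
    n = len(data)       # number of items (repeated units)
    k = len(data[0])    # number of years (levels of the within factor)
    grand = sum(sum(row) for row in data) / (n * k)
    year_means = [sum(row[j] for row in data) / n for j in range(k)]
    item_means = [sum(row) / k for row in data]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_year = n * sum((m - grand) ** 2 for m in year_means)   # explained
    ss_item = k * sum((m - grand) ** 2 for m in item_means)   # between items
    ss_error = ss_total - ss_year - ss_item                   # unexplained

    df_year = k - 1
    df_error = (n - 1) * (k - 1)
    f_ratio = (ss_year / df_year) / (ss_error / df_error)
    return f_ratio, df_year, df_error


# Hypothetical difficulty indices of four items over three years.
difficulty = [
    [0.86, 0.84, 0.85],
    [0.45, 0.41, 0.43],
    [0.70, 0.78, 0.68],
    [0.67, 0.66, 0.68],
]
f_ratio, df_year, df_error = repeated_measures_f(difficulty)
print(round(f_ratio, 3), df_year, df_error)
```

A small F ratio means that the variation between the yearly means is small relative to the residual variation, which is the pattern interpreted above as stability over time. The p value would be obtained from the F distribution with the two degrees of freedom returned.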
In addition to repeated measure ANOVA, correlation coefficients were calculated and scatter
plots constructed for the two methods and the three years to assess the stability across the years.
Correlation coefficients inform how strongly two or more variables are related to each other.149
The correlation is said to be positive if one variable increases with the other and negative if one
increases while the other decreases. Both the variables are said to have a relationship even if it is
negative. A correlation of + 1 is said to be a perfect correlation. It is said to be moderate if 0.5
and above and excellent if 0.8 and above.
This research, as stated earlier, also reported the overall mean and standard deviation for the
difficulty and discrimination indices of the 270 items (30 items X 3 courses X 3 years). In addition,
descriptive statistics were generated for each course individually.
3.4.2.1.2 Effect Sizes
Partial Eta² was calculated to report the effect sizes for item difficulty and discrimination for
the three courses. Effect size is a useful index to depict the practical significance of study
results.150 It is preferred to statistical significance because it is not dependent on sample size and
is a scale-free index. It can, hence, be interpreted irrespective of the scales of variables. The
index varies from about 0.3 to ∞. It is small if the value is between 0.30-0.49, moderate between
0.50-0.79 and large between 0.80 to ∞. The larger the effect size, the larger the difference
between the distributions of scores.
3.4.2.2 Research Question No. 2 B
Do the items show stability across years using IRT?
Repeated measures ANOVA was used to study the temporal stability of item parameters
obtained with the IRT method. The objective was to observe the change in parameters with time
and to compare the findings with those observed with the application of CTT. In addition, TCCs
were also generated to visually compare the trend of the curves for stability over time. TCCs for
Course 1 are displayed in the results section whereas the ones for Courses 3 and 6 are displayed in
Appendix A17 and B17 respectively.
3.4.2.2.1 Test Characteristic Curve
IRT methods are also applicable at the test or scale level, as discussed earlier. The
concept of a TCC stems from this ability of IRT.113 It represents a non-linear regression of
overall test score on ability. The TCC can be a very useful tool for evaluating the range of
measurement and the degree of discrimination at different points of the ability continuum. This
research used TCCs that were generated to assess the temporal stability of multiple choice items
by comparing them. The 2PL model of IRT was used to calculate the theta level of the
examinees in each cohort for each year and course separately along with the proportion correct
units and the number-correct units. These were then plotted on a graph using Xcalibre 4.2.
The individual graphs for each year of a given course were then visually compared to establish
whether they looked similar in trend.
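The relationship between the item parameters and the TCC can be sketched in Python. Under the 2 PL model the expected number-correct score at a given theta is the sum of the item probabilities, and the proportion-correct score is that sum divided by the number of items. The function names and the three items are hypothetical, a scaling constant of 1.7 is assumed, and the sketch is not a reproduction of the Xcalibre plots.

```python
from math import exp


def prob_correct_2pl(theta, a, b, d=1.7):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + exp(-d * a * (theta - b)))


def tcc_number_correct(theta, items):
    """Expected number-correct score at theta; items is a list of (a, b)."""
    return sum(prob_correct_2pl(theta, a, b) for a, b in items)


def tcc_proportion_correct(theta, items):
    """Expected proportion-correct score at theta."""
    return tcc_number_correct(theta, items) / len(items)


# Hypothetical (a, b) pairs for a three-item test.
items = [(0.8, -1.0), (1.2, 0.0), (1.0, 1.0)]
for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(tcc_number_correct(theta, items), 3),
          round(tcc_proportion_correct(theta, items), 3))
```

Evaluating the same functions with the item parameters estimated in different years would give curves that can be compared point by point as well as visually.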
3.5 Summary of Analyses
1. Descriptive analyses were carried out for the difficulty and discrimination indices
calculated by using SPSS and the IRT software called Xcalibre (Version 4.2).
2. The reliability of the test was assessed by carrying out item analysis and calculating SE
of estimates and Cronbach’s Alpha using both CTT and IRT.
3. Correlation coefficients were calculated to look at the comparability of CTT and IRT.
4. ICCs were constructed to study the item parameters.
5. Repeated measures ANOVA was conducted using the difficulty index and the
discrimination index individually as dependent variables and year as the independent
variable to look at the stability of the MCQs across three years. Effect sizes (Partial Eta²)
were also calculated.
6. Year-wise correlation coefficients were calculated to look at the temporal stability of the
items.
7. Temporal stability of the selected MCQs across three years on item response calibrated
difficulty and discrimination indices for a 2 PL model of IRT was analyzed by generating
TCCs across the years.
Table 3: Methods Summary
Research Question: Do the items exhibit reliability across years using CTT and IRT?
Variables:
1. Item difficulty index of 30 X 90 items
2. Item discrimination index of 30 X 90 items
3. Comparability of CTT and IRT
Statistical Analysis: Item analysis; SE of Estimates; Cronbach’s Alpha; Correlation coefficients; ICCs
Research Question: Do the items show stability across years using CTT and IRT?
Variables:
1. Independent: Years of exam (2007-2011)
2. Dependent: Item difficulty and discrimination indices of 30 X 90 items
3. Theta and item scores
Statistical Analysis:
a) Repeated measures ANOVA as it proves stability of items if F ratios are small and p value not significant; effect size for significance index.
b) Correlation Coefficients to observe inter-year relationship of item; if moderate to excellent, it would indicate stability
c) IRT to generate TCCs
3.6 Ethics
This study received ethics approval from the Conjoint Health Research Ethics Board
(CHREB) at the University of Calgary. The permission to utilize the items for analysis was
granted by the Office of Undergraduate Medical Education, Faculty of Medicine, University of
Calgary. The participants’ demographic data could not be accessed for my research and they
were completely anonymous. This was due to a lack of permission from the CHREB in the
context of student demography. Except for the item number to identify the selected MCQs, no
other information could be accessed due to the issue of the security of the MCQ bank that arises with
the publishing of MCQs. The data were only accessible to the primary researchers and were
password-protected.
CHAPTER IV-RESULTS
4.1 Overview
This chapter describes the results obtained from the statistical analyses elaborated in
the previous chapter. The main aim of this research was to use University of Calgary summative
examination data from MCQ exams in order to assess the reliability of scores using and
comparing two methods of analysis, i.e., CTT and IRT, on MCQ items administered three times
over a six year period. In addition, the temporal stability of the same items was also analyzed
using both CTT and IRT. Due to a lack of permission from the CHREB, demographic data were
not available. For the purpose of the overview, the descriptive analyses for all three courses over
the three years are presented. For a full elaboration of results of the research questions presented
in chapter III, only Course 1 is discussed at length. Detailed results of Courses 3 and 6 can be
viewed in the appendices at the end.
4.2 Descriptive Analysis
Descriptive analyses are shown below for various aspects of this research. These show the
skills, number of examinees and content of the three courses. In addition, descriptive statistics of
item parameters are shown as well.
Table 4 shows the distribution of MCQ items according to the skills which were divided
into Basic Sciences, Diagnosis, Investigation and Treatment. Predominant items in Course 1
(Hematology and GIT) were from the skill of Basic Sciences (N=11), closely followed by
Diagnosis (N=10). For Course 3 (CVS and Respiratory), they were mainly from Diagnosis
(N=17), followed by Treatment (N=8), and small numbers belonged to Basic
Sciences (N=2) and Investigation (N=3). Basic Sciences items (N=9) were slightly predominant in Course 6
(Reproductive Medicine and Pediatrics) with near-equal numbers across the skills of Diagnosis,
Investigation and Treatment.
Table 4: Distribution of MCQs According to Type of Skill (N=90)
Course Skill Total
Basic Sciences Diagnosis Investigation Treatment
1 11 10 2 7 30
3 2 17 3 8 30
6 9 7 8 6 30
Total 22 34 13 21 90
Table 5 shows the examinees’ numbers across three courses over three years with the
largest number of examinee data analyzed for Course 1, i.e., 527. The number of examinees
varied between 151 and 179 across the courses and was more consistent for Course 1 as
compared to the rest.
Table 5: Number of Examinees Across Courses and Years
Course 2007 2008 2009 2010 2011 2012 Total
1 174 179 174 527
3 151 179 175 505
6 154 176 164 496
Tables 6, 7 and 8 show cross-tabulation of the content of 90 questions used in the
examination for Courses 1, 3 and 6 classified by clinical presentation and skills. For Course 1
(Table 6), 16 clinical presentations were selected. The most common clinical presentations were
Fever/Sore Throat (N=4) and Failing Liver (N=5). For the skills, the items were evenly divided
between Basic Sciences and Diagnosis (N=11), followed by items on Treatment (N=6).
Table 6: Content of 30 Items Course 1 Classified by Clinical Presentation and Skills
Clinical Presentation Skill Total
Basic Sciences Diag Invest Treat
Abnormalities of White Cells 1 2 0 0 3
Acute Abdominal Pain 0 0 0 1 1
Bleeding and Bruising 0 3 0 0 3
Blood in Stool 0 0 0 2 2
Diarrhoea 1 0 0 0 1
Epidemiology 1 0 0 0 1
Fever/Sore Throat 2 0 1 1 4
Genetics 1 0 0 0 1
Immunology 1 0 0 0 1
Jaundice 0 1 0 0 1
Failing Liver 1 1 1 2 5
Lymphadenopathy 0 2 0 0 2
Pharmacology 1 0 0 0 1
Splenomegaly 0 1 0 0 1
Thrombosis 1 0 0 0 1
Undefined 1 1 0 0 2
Total 11 11 2 6 30
Table 7 displays the contents of Course 3 based on clinical presentation and skills. There
were twelve clinical presentations selected for this course, of which Chronic Dyspnea was the
most common one (N=6). The largest number of items belonged to the category of Diagnostic
skills (N=17) followed by Treatment (N=8).
Table 7: Content of 30 Items Course 3 Classified by Clinical Presentation and Skills
Clinical Presentation Skill Total
Basic Sciences Diagnosis Investigation Treatment
Anemia/Pallor 0 1 0 0 1
Chest Discomfort 0 0 0 1 1
Chronic Dyspnea 0 3 0 3 6
Congestive Heart Failure 0 1 0 1 2
Cough in Children 0 3 0 0 3
Cough/Fever 0 3 1 1 5
Dyspnea/CHF 0 1 0 0 1
Hypercapnea 1 0 0 0 1
Hypoxemia 1 0 0 2 3
Noisy Breathing in Child 0 2 0 0 2
Lung Nodule/Mass 0 1 1 0 2
Pleural Effusion 0 2 1 0 3
Total 2 17 3 8 30
For Course 6, items were chosen from twenty different clinical presentations, as shown in
Table 8, of which most belonged to the category of Increased Risk/Genetic Disease. Basic
Sciences skills was the dominant one followed by Investigations.
Table 8: Content of 30 Items Course 6 Classified by Clinical Presentation and Skills
Clinical Presentation Skills Total
Basic Sciences Diagnosis Invest Treatment
Childhood/Abnormal Urine Analysis 0 0 1 0 1
Childhood/Adolescent Exam 0 1 0 0 1
Childhood/Developmental Delay 0 2 0 0 2
Childhood/Rash 0 1 0 0 1
Childhood/Respiratory Diseases 0 1 0 0 1
Childhood/Serious Childhood Infection 0 1 0 0 1
Increased Risk/Genetic Disease 6 0 0 0 6
Menopause/Amenorrhea 1 0 0 0 1
Neonatal Jaundice 1 0 0 0 1
Neonatal/SIDS 0 0 0 1 1
Pelvic Mass 0 1 0 0 1
Pregnancy Loss 0 0 0 2 2
Pregnancy/Antepartum Care 0 0 2 0 2
Pregnancy/Intrapartum Care 0 0 2 0 2
Pregnancy/Obstetric Complication 0 0 2 0 2
Pregnancy/Obstetric Emergency 0 0 1 0 1
Prolapse 1 0 0 0 1
Vaginal Discharge/ Urinary Symptoms 0 0 0 1 1
Well Patient/Immunization 0 0 0 1 1
Well Patient/Normal Childhood 0 0 0 1 1
Total 9 7 8 6 30
Item Parameters
Tables 9, 10 and 11 display the descriptive statistics of item parameters of Courses 1, 3 and 6
respectively. For Course 1, item difficulty ranged from 0.26 to 0.92 (Table 9), the recommended
range in the literature being 0.25-0.85.112 Values of the discrimination index for this course
ranged from 0.09 to 0.75, that is, from below the desirable level to within the recommended
range.112
Table 9: Descriptive Statistics of Item Parameters for Course 1
Parameter N Min Max Mean SD
Difficulty 90 0.26 0.92 0.683 0.148
Discrimination 90 0.09 0.71 0.292 0.132
The trend was a little different for both the indices for Course 3 as compared to Course 1.
This is seen in Table 10. The indices varied from 0.31 to 0.93 for item difficulty
and from 0.31 to 0.58 for item discrimination. This showed that the difficulty index ranged from easy to
adequate levels.112 The discrimination index was noted to vary over a smaller range with a
relatively closer mean but still a moderate standard deviation.112
Table 10: Descriptive Statistics of Item Parameters for Course 3
Parameter N Min Max Mean SD
Difficulty 90 0.31 0.93 0.72 0.133
Discrimination 90 0.13 0.58 0.24 0.126
For Course 6, shown in Table 11, the indices varied from 0.24 to 0.99 for item difficulty and from
0.07 to 0.69 for item discrimination. These findings were consistent with the ones observed in Course 1
and although they showed similar trends of desirable item difficulty,112 and discrimination
index,112 the range of discrimination index was wider with a larger standard deviation from the
mean.
Table 11: Descriptive Statistics of Item Parameters for Course 6
Parameter N Min Max Mean SD
Difficulty 90 0.24 0.99 0.755 0.142
Discrimination 90 0.07 0.69 0.178 0.106
4.3 Results of Research Question No. 1
What was the reliability of scores using and comparing two methods of analysis, i.e., CTT and
IRT, on MCQ items administered three times over a six year period?
This research question was answered using both CTT and IRT. Item analyses were conducted
to look at the item difficulty and discrimination indices for the three courses over three years.
Details of Course 1 follow. Results of Course 3 can be viewed in Appendix A1 and A2. For
Course 6, they can be viewed in Appendix B1 and B2.
4.3.1 Results of Research Question No. 1 A
What are the item parameters when conducting item analysis with CTT?
One must remember that in CTT, difficulty refers to the proportion of students who are able to
answer an item correctly. The higher the proportion, the easier the item. Table 12 presents the
results of item analysis for Course 1 across three years using CTT. For Year 1, 23 items had
recommended p between 0.25-0.75 and 7 items had a p of more than 0.75. The items with a p
more than 0.75 were no. 1, 6, 17, 18, 19, 21 and 26. Hence, they were easy. For Year 2, the items
that fitted into the category of easy items were 8, 9, 11 and 14 as their p was more than 0.75. The
analysis of items from Year 3 showed similar results as Year 1 with 23 items having adequate
levels of difficulty. Seven items were noted to be easy, i.e., 1, 6, 17, 18, 19, 21 and 26.
Interestingly, items in Years 1 and 3 showed more stability in terms of difficulty. Although the
majority of the items in Year 2 were also stable over time, items 8, 9, 11 and 14 yielded different
results, i.e., they were found to be easy by the students in that year.
Regarding the discrimination index, also called point biserial correlation in CTT, many items
had a value greater than 0.2. For Year 1, four items, i.e., 7, 8, 22 and 26 had a p-bis correlation of
>0.3. There were 9 items that had a p-bis >0.2. They were items no. 3, 4, 5, 12, 14, 17, 24, 27
and 28. For Year 2, one item, i.e., no.8 had a p-bis of >0.4. Nine items had a p-bis >0.2. They
were items no. 5, 7, 9, 14, 17, 18, 20, 22 and 23. Trends similar to Year 1 were observed in Year
3 where the same four items as Year 1 had a p-bis >0.3. They were items no. 7, 8, 22 and 26.
Furthermore, nine items had a p-bis >0.2. They were the same as in Year 1, i.e., items no. 3, 4, 5,
12, 14, 17, 24, 27 and 28.
In summary, by looking at Table 12, one notices that item difficulty was adequate for all
three years although students in Year 2 found the items slightly easier. In addition, items 15 and
19 had higher difficulty for Year 1 and 3 and lower for Year 2 which was different from the
trend of the rest of the items. In contrast, item 1 had higher difficulty for all three years (0.84-
0.86), a potential explanation being that this item was probably testing core knowledge that all the
students were expected to know. Regarding item discrimination, only one item, i.e., no. 8 had
ideal discrimination of 0.40 for Year 2. Also, Year 1 and 3 were more consistent with each other
in respect of item difficulty and discrimination than Year 2.
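The two CTT indices reported in this section can be illustrated with a minimal sketch. It assumes that p is the proportion of examinees answering an item correctly and that p-bis is the Pearson correlation between the 0/1 item score and the total test score with the item included (an uncorrected point biserial). The response matrix and the function names are hypothetical and are not taken from the thesis data or from SPSS.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # Undefined (division by zero) if either sequence is constant.
    return sxy / sqrt(sxx * syy)

def item_difficulty(responses):
    """CTT difficulty index p: proportion of correct answers per item."""
    n = len(responses)
    k = len(responses[0])
    return [sum(row[j] for row in responses) / n for j in range(k)]

def point_biserial(responses):
    """Correlation of each 0/1 item score with the (uncorrected) total score."""
    totals = [sum(row) for row in responses]
    k = len(responses[0])
    return [pearson_r([row[j] for row in responses], totals) for j in range(k)]

# Hypothetical scored responses: 5 examinees x 3 items (1 = correct).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
p_values = item_difficulty(responses)
p_bis = point_biserial(responses)
print(p_values)
print(p_bis)
```

With real data, each row would hold one student's scored answers to the 30 items of a course in a given year.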
Table 12: Item Difficulty (p) and Point Biserial (p-bis) Correlation of Course 1 Using CTT
Year 1 Year 2 Year 3
ID p p-bis p p-bis p p-bis
1 0.862 0.093 0.844 0.058 0.861 0.090
2 0.448 0.161 0.408 0.014 0.423 0.165
3 0.695 0.202 0.777 0.117 0.678 0.213
4 0.672 0.244 0.665 0.066 0.671 0.234
5 0.569 0.281 0.704 0.248 0.567 0.279
6 0.816 0.039 0.911 0.112 0.809 0.042
7 0.477 0.305 0.620 0.296 0.472 0.325
8 0.575 0.306 0.765 0.401 0.568 0.303
9 0.592 0.150 0.832 0.258 0.594 0.152
10 0.489 0.081 0.525 0.010 0.488 0.081
11 0.690 0.107 0.816 0.084 0.687 0.114
12 0.746 0.206 0.726 0.072 0.745 0.216
13 0.632 0.160 0.609 0.112 0.630 0.159
14 0.678 0.211 0.866 0.216 0.667 0.208
15 0.638 0.193 0.402 0.092 0.639 0.190
16 0.747 0.153 0.793 0.038 0.734 0.155
17 0.810 0.263 0.844 0.269 0.808 0.270
18 0.839 0.157 0.793 0.202 0.826 0.155
19 0.920 0.085 0.760 0.114 0.925 0.083
20 0.753 0.148 0.799 0.214 0.754 0.147
21 0.822 0.123 0.777 0.105 0.823 0.109
22 0.718 0.348 0.676 0.240 0.716 0.343
23 0.678 0.172 0.626 0.278 0.668 0.170
24 0.742 0.155 0.753 0.028 0.724 0.155
25 0.840 0.262 0.834 0.269 0.808 0.260
26 0.816 0.309 0.816 0.152 0.811 0.301
27 0.744 0.283 0.765 0.180 0.746 0.276
28 0.500 0.227 0.363 0.162 0.497 0.231
29 0.713 0.114 0.682 0.150 0.719 0.112
30 0.687 0.135 0.696 0.148 0.684 0.133
Mean 0.699 0.189 0.714 0.156 0.701 0.187
SD 0.121 0.079 0.137 0.096 0.122 0.079
4.3.2 Results of Research Question No. 1B
What are the item parameters when conducting item analysis with IRT?
Table 13 shows the results of item analysis carried out by applying IRT. More than fifty
percent of the items in all three years had an item difficulty of <0.25. In Year 1, six items had a
recommended range of difficulty between 0.25-0.75. They were items no. 5, 8, 9, 13, 15 and 24.
Three items had an item difficulty of > 0.75. They were items no. 7, 10 and 28. In the context of
Year 2, other than one item, i.e., no. 24, a different set of items (compared to Year 1) yielded a
desirable difficulty level of between 0.25-0.75. They were items no. 7, 13, 22 and 23. Three
items in Year 2 had an item difficulty of >0.75. They were items no. 10, 15 and 28. Year 3
showed very similar trends and hence temporal stability with Year 1. Items no. 5, 8, 9, 13, 15 and
24 had desirable levels of item difficulty between 0.25-0.75 and items no. 7, 10 and 28 had an
item difficulty of >0.75. In all three years, quite a few items had an item difficulty with negative
values.
A report on the discrimination indices for Course 1 for three years follows. All three years
for Course 1 showed stable temporal trends as most of the items had the a parameter higher than
the recommended one of 0.4. This means that they are discriminating well. Item no. 2 had the a
parameter of <0.3 in all three years, i.e., less than the desirable one. In addition, items no. 3, 4
and 10 had indices of <0.4 for Year 2 and item no. 3 for Year 3.
In summary, the majority of the items were easy for all three years when IRT was applied.
Three items stood out as very difficult for students in all three years. It is hard to explain why
they were found to be more difficult for the students in Year 2 who otherwise have shown better
performance in general. Items 1 and 2 had the lowest discrimination amongst all. When
comparing CTT with IRT, one notices that >50% of the items were of adequate type (difficulty
level between 0.25-0.75) when CTT was applied for item analysis. On the contrary, item analysis
with IRT showed that more than 50% of the items were of the easy type. Discrimination was
better when IRT was applied and was in fact noted to be quite high as several values were above
the ideal cut-off value of 0.
Table 13: Difficulty (b) and Discrimination (a) Indices of Course 1 Using IRT
Year 1 Year 2 Year 3
ID a b a b a b
1 0.198 -4.025 0.245 -3.008 0.188 -4.020
2 0.228 1.193 0.215 1.958 0.227 1.189
3 0.423 -0.301 0.365 -0.966 0.329 -0.305
4 0.601 0.129 0.380 0.005 0.590 0.127
5 0.686 0.638 0.605 0.153 0.680 0.649
6 0.549 -0.785 0.638 -1.255 0.544 -0.791
7 0.720 0.999 0.675 0.610 0.712 0.985
8 0.754 0.635 0.883 0.105 0.752 0.629
9 0.752 0.632 0.833 0.101 0.742 0.619
10 0.754 0.637 0.823 0.115 0.732 0.609
11 0.545 -0.015 0.531 -0.674 0.541 -0.012
12 0.680 -0.233 0.486 -0.159 0.688 -0.219
13 0.592 0.320 0.461 0.510 0.572 0.318
14 0.682 0.182 0.645 -0.803 0.662 0.178
15 0.624 0.318 0.444 1.684 0.623 0.312
16 0.654 -0.175 0.498 -0.592 0.559 -0.173
17 0.800 -0.335 0.742 -0.465 0.794 -0.332
18 0.686 -0.660 0.640 -0.299 0.645 -0.650
19 0.750 -1.193 0.522 -0.300 0.735 -1.190
20 0.617 -0.254 0.625 -0.357 0.601 -0.248
21 0.650 -0.609 0.539 -0.376 0.650 -0.519
22 0.852 0.129 0.599 0.294 0.845 0.129
23 0.612 0.120 0.614 0.551 0.616 0.125
24 0.658 0.726 0.528 0.484 0.659 0.690
25 0.591 1.792 0.524 1.217 0.611 1.790
26 0.837 -0.327 0.578 -0.561 0.834 -0.317
27 0.768 -0.136 0.581 -0.223 0.766 -0.126
28 0.646 0.913 0.522 1.823 0.606 0.901
29 0.591 -0.076 0.508 0.154 0.611 -0.073
30 0.627 -0.435 0.625 -0.357 0.622 -0.432
Mean 0.623 0.000 0.540 0.000 0.623 0.000
SD 0.146 1.000 0.140 1.000 0.146 1.000
4.3.3 Results of Research Question No. 1C
Are the item parameters comparable when conducting item analysis with both CTT and IRT?
Table 14 shows the correlation coefficients calculated for the three years for Course 1 for
item difficulty using both CTT and IRT. It can be noted that the correlation coefficients for all
three years are good to excellent. The negative sign here is expected: in IRT, the item
difficulty index (b) increases as items become more difficult.151, 152 In CTT, on the other
hand, the item difficulty index (p) increases as items become easier. Hence, a negative
correlation holds and only its magnitude needs to be interpreted.
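The direction of this relationship can be checked with a short sketch that correlates the Year 1 values of p (Table 12) with the Year 1 values of b (Table 13). It assumes a plain Pearson coefficient over the 30 items; the thesis does not state exactly how SPSS was configured, so the result is illustrative and is not expected to reproduce Table 14 digit for digit.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Year 1 CTT difficulty (p), items 1-30, transcribed from Table 12.
p_year1 = [
    0.862, 0.448, 0.695, 0.672, 0.569, 0.816, 0.477, 0.575, 0.592, 0.489,
    0.690, 0.746, 0.632, 0.678, 0.638, 0.747, 0.810, 0.839, 0.920, 0.753,
    0.822, 0.718, 0.678, 0.742, 0.840, 0.816, 0.744, 0.500, 0.713, 0.687,
]
# Year 1 IRT difficulty (b), items 1-30, transcribed from Table 13.
b_year1 = [
    -4.025, 1.193, -0.301, 0.129, 0.638, -0.785, 0.999, 0.635, 0.632, 0.637,
    -0.015, -0.233, 0.320, 0.182, 0.318, -0.175, -0.335, -0.660, -1.193, -0.254,
    -0.609, 0.129, 0.120, 0.726, 1.792, -0.327, -0.136, 0.913, -0.076, -0.435,
]

r_pb = pearson_r(p_year1, b_year1)
# A high p means an easy item and a high b means a hard item,
# so the coefficient is expected to be negative.
print(round(r_pb, 3))
```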
Table 14: Correlation Coefficients of Difficulty Index Between CTT and IRT for Course 1
Year 1 Year 2 Year 3
p-b p-b p-b
-0.807 -0.887 -0.804
Table 15 shows the correlation coefficients of point biserial and discrimination index
between CTT and IRT for Course 1. It can be observed that the most remarkable correlation was
for Year 1 (r=0.927, p < 0.00). Year 2 has also yielded relatively stronger correlations (r=0.847,
p <0.00). Year 3 on the other hand, has yielded only moderate correlation coefficient (r=0.637, p
< 0.00). This trend is similar to the correlation coefficients calculated for item difficulty with
CTT and IRT though not as strong (with the exception of Year 1).
Table 15: Correlation Coefficients of Point Biserial and Discrimination Index Between
CTT and IRT for Course 1
Year 1 Year 2 Year 3
pbis-a pbis-a pbis-a
0.927 0.847 0.637
4.3.4 Results of Research Question No. 1D
What is the reliability index of the items?
Reliability coefficients and SE of parameter estimates were calculated for each item for the
three years under research for all three courses. Results of Course 1 are elaborated upon below.
Results of Course 3 can be viewed in Appendix A5 - A7. For Course 6, the results are displayed
in Appendix B5 - B7.
Table 16 presents the SE of difficulty and discrimination parameters along with the alpha
coefficient of test score of Year 1. In this table, aSE refers to the standard error of estimate for a
parameter, bSE is the standard error of estimate of b parameter, and alpha without is the
reliability index of the test score obtained on the removal of that particular item. It is calculated
by Xcalibre taking CTT statistics into account.
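The "alpha without" statistic can be sketched as follows. This is a minimal illustration of the textbook Cronbach's alpha formula, recomputed after dropping each item in turn; it is an assumption that Xcalibre uses an equivalent computation. The response matrix and the function names are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Cronbach's alpha for a matrix of scored responses (rows = examinees)."""
    k = len(responses[0])
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def alpha_without(responses):
    """Alpha of the remaining items after removing each item in turn."""
    k = len(responses[0])
    return [
        cronbach_alpha([row[:j] + row[j + 1:] for row in responses])
        for j in range(k)
    ]

# Hypothetical scored responses: 6 examinees x 4 items (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
alpha_all = cronbach_alpha(responses)
alpha_drop = alpha_without(responses)
print(alpha_all)
print(alpha_drop)
```

An item whose "alpha without" value exceeds the overall alpha is one whose removal or revision would raise the reliability of the test score.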
To clarify the concept of SE of estimates of a and b parameters, two items are discussed
here. If one looks at the first item in Table 16, the a parameter is 0.198 and the aSE 0.126. The
SE of 0.126 is multiplied by 2 since a confidence interval is being calculated at 95%. This means
that for a 95% confidence interval, the true mean for the a parameter of this item may fall
between -0.05 and +0.45. Here, the lower limit of the score band is 0.198 – 0.252 = -0.05 and the
upper limit of the score band is 0.198 + 0.252 = 0.45. In other words, if this item is repeatedly
used to assess a student without further learning taking place, 95% of the time, the true score will
lie between -0.05 and +0.45.
For item 5, the a parameter is 0.686 and the aSE 0.223. This means that at the 95%
confidence interval, the true mean for the a parameter of this item will fall between 0.24 and
1.132. In other words, if this item is repeatedly used to assess a student without further learning
taking place, 95% of the time, the true score will lie between 0.24 and 1.132. If one looks at item
5 for the difficulty index, i.e., b, the mean for this item falls between 0.358 and 0.918 since the
bSE is 0.140.
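The interval arithmetic used in these two examples can be written out as a short sketch. It assumes, as the text does, that an approximate 95% band is the estimate plus or minus two standard errors; the function name is hypothetical.

```python
def approx_ci95(estimate, se, multiplier=2.0):
    """Approximate 95% band: estimate +/- multiplier * SE."""
    half_width = multiplier * se
    return estimate - half_width, estimate + half_width

# Values quoted in the text for item 1 (a parameter) and item 5 (a and b).
item1_a = approx_ci95(0.198, 0.126)
item5_a = approx_ci95(0.686, 0.223)
item5_b = approx_ci95(0.638, 0.140)

for label, (low, high) in [("item 1, a", item1_a),
                           ("item 5, a", item5_a),
                           ("item 5, b", item5_b)]:
    print(f"{label}: {low:.3f} to {high:.3f}")
```

A band that includes zero, as for the a parameter of item 1, indicates that the discrimination estimate cannot be distinguished from no discrimination at this sample size.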
As can be noted in Table 16, removal or revision of item no. 1 improves the alpha coefficient
of the test score to 0.64. Removal or revision of certain items increases it to 0.63. They are items
no. 6, 10, 11, 19 and 21. Most of the items in Table 16 appear to have large SE for both a and b
which may be attributable to the small sample size.
Table 16: SE and Reliability Index (Alpha w/o) Course 1 Year 1
Item ID a aSE b bSE Alpha w/o
Item 1 0.198 0.126 -4.025 0.615 0.64
Item 2 0.228 0.501 1.193 0.396 0.62
Item 3 0.423 0.186 -0.301 0.235 0.62
Item 4 0.601 0.189 0.129 0.166 0.61
Item 5 0.610 0.187 0.126 0.166 0.60
Item 6 0.549 0.136 -0.785 0.214 0.63
Item 7 0.720 0.222 0.999 0.134 0.61
Item 8 0.754 0.210 0.635 0.129 0.61
Item 9 0.570 0.237 0.493 0.167 0.62
Item 10 0.500 0.286 0.974 0.186 0.63
Item 11 0.545 0.185 -0.015 0.184 0.63
Item 12 0.680 0.151 -0.233 0.162 0.62
Item 13 0.592 0.212 0.320 0.164 0.62
Item 14 0.682 0.182 0.182 0.149 0.62
Item 15 0.624 0.205 0.318 0.157 0.62
Item 16 0.654 0.157 -0.175 0.164 0.62
Item 17 0.800 0.139 -0.335 0.151 0.61
Item 18 0.686 0.132 -0.660 0.183 0.62
Item 19 0.750 0.124 -1.193 0.221 0.63
Item 20 0.617 0.155 -0.254 0.175 0.62
Item 21 0.650 0.135 -0.609 0.185 0.63
Item 22 0.852 0.161 0.129 0.126 0.60
Item 23 0.612 0.186 0.120 0.164 0.62
Item 24 0.658 0.234 0.726 0.145 0.62
Item 25 0.591 0.174 1.792 0.172 0.62
Item 26 0.837 0.138 -0.327 0.147 0.61
Item 27 0.768 0.150 -0.136 0.146 0.61
Item 28 0.646 0.240 0.913 0.147 0.62
Item 29 0.591 0.172 -0.076 0.174 0.61
Item 30 0.627 0.144 -0.435 0.180 0.62
As stated earlier, several of the items in Year 2 had a difficulty index <0.25. Discrimination
indices were mostly good. If one looks at Table 17, one notices that the overall reliability of this
test is less than that of Year 1. In fact, it is only in the lower range of what is deemed good
reliability for an MCQ exam. Revision or removal of item no. 2 improves the reliability
coefficient to 0.58. Items no. 1, 4, 12, 15 and 16 also affect the reliability of the test as their
revision or removal from the test improves the reliability to 0.57. If one looks at item 15 in
Table 17, it has an aSE about half the size of the a estimate (0.234 against 0.444). This means
that at the 95% confidence interval, the true mean for the discrimination index of this item falls
between -0.024 and +0.912. In other words, if this item is repeatedly used to assess a student
without further learning taking place, 95% of the time, the true score will lie between the above
values. Here, the lower limit of the score band is 0.444 - 0.468 = -0.024 and the upper limit of
the score band is 0.444 + 0.468 = 0.912. This is a very large SE and understandably, the
reliability index may be improved to 0.57 by removing the above-mentioned item.
Table 17: SE and Reliability Index (Alpha w/o) Course 1 Year 2
Item ID a aSE b bSE Alpha w/o
Item 1 0.245 0.124 -3.008 0.498 0.57
Item 2 0.215 0.368 1.958 0.418 0.58
Item 3 0.365 0.145 -0.966 0.294 0.56
Item 4 0.380 0.203 0.005 0.252 0.57
Item 5 0.605 0.166 0.153 0.169 0.54
Item 6 0.638 0.122 -1.255 0.247 0.56
Item 7 0.628 0.112 -1.265 0.244 0.55
Item 8 0.883 0.145 0.105 0.130 0.53
Item 9 0.832 0.156 0.714 0.120 0.54
Item 10 0.393 0.302 0.963 0.232 0.58
Item 11 0.531 0.134 -0.674 0.220 0.56
Item 12 0.486 0.163 -0.159 0.211 0.57
Item 13 0.461 0.228 0.510 0.204 0.56
Item 14 0.645 0.126 -0.803 0.208 0.55
Item 15 0.444 0.234 1.684 0.210 0.57
Item 16 0.498 0.140 -0.592 0.224 0.57
Item 17 0.742 0.131 -0.465 0.174 0.55
Item 18 0.640 0.139 -0.299 0.179 0.55
Item 19 0.522 0.150 -0.300 0.205 0.56
Item 20 0.625 0.138 -0.357 0.184 0.55
Item 21 0.539 0.144 -0.376 0.204 0.56
Item 22 0.599 0.177 0.294 0.167 0.55
Item 23 0.614 0.195 0.551 0.158 0.54
Item 24 0.528 0.207 0.484 0.181 0.56
Item 25 0.524 0.243 1.217 0.177 0.55
Item 26 0.578 0.134 -0.561 0.204 0.56
Item 27 0.581 0.147 -0.223 0.187 0.55
Item 28 0.522 0.196 1.823 0.185 0.56
Item 29 0.508 0.182 0.154 0.194 0.56
Item 30 0.625 0.183 -0.357 0.192 0.54
Table 18a displays the alpha without, aSE and bSE of Course 1 Year 3. As stated earlier,
several of the items in Year 3 had a difficulty index <0.25. Discrimination indices were mostly
good. If one looks at the table below, one notices that the overall reliability of this test is good
and about the same as that of Year 1.
Revision or removal of item no. 2 improves the reliability coefficient to 0.64. Removal of
Item no. 22 improves it to 0.63. Reliability was respectively improved to 0.60 and 0.62 by
removing the above-mentioned items. Again, both aSE and bSE were large with moderate
reliability index, potentially due to small sample size.
Table 18a: SE and Reliability Index (Alpha w/o) Course 1 Year 3
Item ID a aSE b bSE Alpha w/o
Item 1 0.188 0.146 -4.020 0.312 0.61
Item 2 0.227 0.449 1.189 0.426 0.64
Item 3 0.329 0.168 -0.305 0.325 0.62
Item 4 0.590 0.147 0.127 0.303 0.62
Item 5 0.680 0.263 0.649 0.247 0.62
Item 6 0.544 0.208 -0.791 0.227 0.62
Item 7 0.712 0.123 0.985 0.286 0.62
Item 8 0.752 0.121 0.629 0.331 0.62
Item 9 0.601 0.205 0.491 0.177 0.60
Item 10 0.505 0.166 0.972 0.260 0.62
Item 11 0.541 0.151 -0.012 0.247 0.62
Item 12 0.688 0.217 -0.219 0.189 0.61
Item 13 0.572 0.144 0.318 0.245 0.62
Item 14 0.662 0.127 0.178 0.253 0.61
Item 15 0.668 0.125 0.176 0.252 0.60
Item 16 0.559 0.283 -0.173 0.229 0.62
Item 17 0.794 0.140 -0.332 0.240 0.62
Item 18 0.645 0.180 -0.650 0.218 0.61
Item 19 0.735 0.173 -1.190 0.234 0.62
Item 20 0.601 0.127 -0.248 0.244 0.61
Item 21 0.650 0.285 -0.519 0.232 0.62
Item 22 0.845 0.154 0.129 0.272 0.63
Item 23 0.616 0.132 0.125 0.225 0.61
Item 24 0.659 0.174 0.690 0.241 0.62
Item 25 0.611 0.197 1.790 0.244 0.62
Item 26 0.834 0.145 -0.317 0.209 0.61
Item 27 0.766 0.151 -0.126 0.217 0.61
Item 28 0.606 0.126 0.901 0.264 0.62
Item 29 0.611 0.139 -0.073 0.244 0.62
Item 30 0.622 0.139 -0.432 0.234 0.61
Table 18b shows the reliability coefficients for the test scores for the three courses for all
the three years calculated by using both CTT and IRT. It can be noted that for CTT, the
reliability coefficients fall between the ranges of 0.57–0.64 for Course 1, 0.51–0.62 for Course 2
and 0.53–0.62 for Course 3. This indicates that the coefficients were mostly adequate. For IRT,
they were marginally better. For Course 1, they were 0.59–0.69; for Course 2, they were
0.56–0.69 and for Course 3, they were between 0.53–0.65. This showed that IRT was not remarkably
superior to CTT for assessing the reliability of test scores.
Table 18b: Cronbach’s Alpha for Course 1, 2 and 3 Using CTT and IRT
Course 1 Course 2 Course 3
CTT IRT CTT IRT CTT IRT
Year 1 0.63 0.69 0.62 0.69 0.61 0.64
Year 2 0.57 0.59 0.51 0.56 0.53 0.53
Year 3 0.64 0.67 0.60 0.64 0.62 0.65
4.3.5 Results of Research Question No. 1E
What are the item characteristic curves like for the individual items for each year?
The two technical properties that are used to describe an ICC are the item difficulty and the
item discrimination. Item difficulty describes where an item functions along the x axis which is
the ability scale. It is, thus, a location index. Hence, it is observed that an easy item functions
among the low-ability students and a hard one among the high-ability ones. Item discrimination
is the second technical property of the ICC and it describes how much an item can differentiate
between students with ability below and above the item location. This property influences the
steepness of the curve in its middle. The steeper the curve, the better the discrimination. On the
other hand, the flatter the curve, the less the item is able to discriminate since the probability of
correct response at low levels of ability is nearly the same as it is at high ability levels.
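The roles of the two parameters can be illustrated with a minimal sketch of the 2 parameter logistic ICC. It assumes the standard form P(theta) = 1 / (1 + exp(-D * a * (theta - b))) with scaling constant D = 1.0; whether Xcalibre applies D = 1.0 or D = 1.7 for these calibrations is not stated here, so the `scale` argument is an assumption. The a and b values are those reported for items 8 and 2 of Course 1, Year 1.

```python
from math import exp

def icc_2pl(theta, a, b, scale=1.0):
    """Probability of a correct response under the 2PL model.

    theta: examinee ability, a: discrimination, b: difficulty (location),
    scale: logistic scaling constant D (assumed 1.0 here).
    """
    return 1.0 / (1.0 + exp(-scale * a * (theta - b)))

abilities = [-3, -2, -1, 0, 1, 2, 3]

# Item 8 (a = 0.754, b = 0.635) against the flatter item 2 (a = 0.228, b = 1.193).
curve_item8 = [icc_2pl(t, a=0.754, b=0.635) for t in abilities]
curve_item2 = [icc_2pl(t, a=0.228, b=1.193) for t in abilities]

for t, p8, p2 in zip(abilities, curve_item8, curve_item2):
    print(f"theta={t:+d}  item 8: {p8:.3f}  item 2: {p2:.3f}")
```

At theta = b the probability is 0.5 for any a, which is why b acts as a location index, while a larger a makes the probability change faster around that point.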
For this research, ICCs were generated for all the items for the three years for all three
courses using Xcalibre. For Course 1, some ICCs are elaborated upon below. The remainder of
the ICCs for all the items can be viewed in Appendix C. For the ICCs, ability or theta is plotted
on the x axis and the probability of endorsing an item on the y axis.
Figure 6 below shows the ICCs for five items selected from Course 1. Item no. 2 can be seen
to have a low difficulty index. It is, in fact, a very easy item with only fair discrimination. On
visual inspection, Year 1 and 3 look similar but Year 2 appears to be different. This trend is
noticeable in all the ICCs for Year 2, i.e., visually, if overlapped, it does not follow the same
pattern as that of Year 1 and 3. All three curves appear to be quite flat, which is attributable to the
low discrimination indices.
The ICCs for Item no. 3 show slightly steeper curves compared to the previous ones. This
indicates that this item is more discriminating at different ability levels although it is still noted
to have a very low difficulty index. The next three items, i.e., items no. 8, 9 and 24, show further
steepness of the ICCs. Hence, these items are better than the previous two in differentiating
between students of lower and higher ability. One can note that these three items have adequate
difficulty indices and they are influencing the curves to move to the right.
Figure 6: ICCs for Course 1 (panels for Items No. 2, 3, 8, 9 and 24, each shown for Year 1, Year 2 and Year 3)
4.4 Results of Research Question No. 2
Do the items exhibit temporal stability when repeated over Year 1, 2 and 3?
4.4.1 Results of Research Question No. 2 A
Do the items show stability across years using CTT?
4.4.1.1 Repeated Measures ANOVA
Repeated measures ANOVA were conducted for the three courses to evaluate stability across
time by taking years as independent variable and the item parameters as dependent variables
individually. In addition, correlation coefficients were calculated for inter-year correlations and
scatter plots constructed. The results of individual repeated measures ANOVA for Courses 1, 3
and 6 are shown in Tables 19-24 and the tables and scatter plots for correlations are depicted
following them.
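The logic of a one-way repeated measures ANOVA can be sketched as follows. This is the textbook decomposition, with each item treated as the repeatedly measured unit (rows) and each year as a condition (columns); it is not a reproduction of the SPSS procedure used in the thesis, and the small data set is hypothetical. The p value, which requires the F distribution, is left out of the sketch.

```python
def rm_anova_f(data):
    """One-way repeated measures ANOVA F ratio.

    data[i][j] is the value for unit i (e.g. an item) under condition j
    (e.g. a year). Returns (F, df_conditions, df_error).
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    unit_means = [sum(row) / k for row in data]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_unit = k * sum((m - grand) ** 2 for m in unit_means)
    # Variation left after removing condition and unit effects.
    ss_error = ss_total - ss_cond - ss_unit

    df_cond, df_error = k - 1, (n - 1) * (k - 1)
    f_ratio = (ss_cond / df_cond) / (ss_error / df_error)
    return f_ratio, df_cond, df_error

# Hypothetical values: 3 units measured under 3 conditions.
toy = [
    [1, 2, 3],
    [2, 4, 3],
    [3, 3, 6],
]
f_ratio, df_cond, df_error = rm_anova_f(toy)
print(f_ratio, df_cond, df_error)
```

For the analyses in this section, each row would hold one item's parameter value (for example p) in Years 1, 2 and 3, giving 30 rows and 3 columns per course.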
The results of repeated measures ANOVA for Course 1 (Table 19) indicate that there were
no significant differences at α < 0.05 amongst the mean measures of item difficulty across the
three years of measurement as the p value for the three years was more than 0.05. The main
effect was not significant.150, 153 Levene’s Test for Equality of Variances for item difficulty for
C1 was non-significant, i.e., F(1, N = 90) = 0.04, p = 0.96. This indicated that there is
homogeneity of variances between the items across three years and they have similar
characteristics. Repeated measures ANOVA did not yield significant differences between the
means in the context of item difficulty over three years. The F ratio calculated was F(1, 90) =
0.40, p = 0.67.
Item discrimination also yielded consistent results as Levene’s Test for Equality of
Variances was non-significant for item discrimination (Table 20). It was F(1, N = 90) = 0.26, p =
0.76. The result of the differences between the means of items was non-significant as F(1, 90) =
1.23, p = 0.23. Hence, it can be said that the items are stable over time.
Table 19: Repeated Measures ANOVA to Determine the Effect of Time on the Item
Difficulty Index for Course 1
Item Parameter Course Repeated Measures ANOVA
F p Effect Size
Item Difficulty C1 0.040 0.96ns 0.009
Note: ns Not significant (significant at α < 0.05)
Table 20: Repeated Measures ANOVA to Determine the Effect of Time on the Item
Discrimination Index for Course 1
Item Parameter Course Repeated Measures ANOVA
F p Effect Size
Item Discrimination C1 0.264 0.76ns 0.028
Note: ns Not significant (significant at α < 0.05)
The results of repeated measures ANOVA for the difficulty parameter for Course 3 (Table
21) were similar to Course 1. They indicate that the differences were non-significant at α < 0.05
amongst the mean measures of item difficulty across the three years of measurement as the p
value for Course 3 across the three years was more than 0.05. Levene’s Test for Equality of
Variances for item difficulty was non-significant for Course 3, i.e., F(1, N = 90) = 1.65, p = 0.19.
Like Course 1, the F ratio was non-significant for the between-groups mean, hence showing
stability over time. The F ratio was F(1, 90) = 1.73, p = 0.18, hence showing stability for item
difficulty parameter for all the three years across time.
Levene’s Test for Equality of Variances for item discrimination was non-significant as
well, i.e., F(1, N = 90) = 0.30, p = 0.73. The main effect was also not significant (Table 22).
Furthermore, the mean differences between groups were also non-significant as F(1, 90) = 1.65,
p = 0.20. These results pointed towards stability of item discrimination parameter over time.
Table 21: Repeated Measures ANOVA to Determine the Effect of Time on the Item
Difficulty Index for Course 3
Item Parameter Course Repeated Measures ANOVA
F p Effect Size
Item Difficulty C3 1.655 0.19ns 0.038
Note: ns Not significant (significant at α < 0.05)
Table 22: Repeated Measures ANOVA to Determine the Effect of Time on the Item
Discrimination Index for Course 3
Item Parameter Course Repeated Measures ANOVA
F p Effect Size
Item Discrimination C3 0.307 0.73ns 0.077
Note: ns Not significant (significant at α < 0.05)
The results of repeated measures ANOVA for the difficulty parameter for Course 6 (Table
23) showed similar trends as Courses 1 and 3 and indicate that the differences were non-
significant at α < 0.05 amongst the mean measures of item difficulty across the three years of
measurement as the p values for Course 6 across the three years were also greater than 0.05.
Levene’s Test for Equality of Variances for item difficulty parameter for Course 6 was
non-significant as F(1, N = 90) = 0.13, p = 0.87. Between-groups mean also yielded non-significant
results, thus showing stability of items over time. The F ratio was F(1, 90) = 0.06, p = 0.93.
Like the other two courses, Levene’s Test for Equality of Variances for item discrimination
for Course 6 for all three years was also non-significant (Table 24). It was F(1, N = 90) = 1.81, p
= 0.16. The F ratio depicted stability of items across the years since the differences were
non-significant at F(1, 90) = 0.43, p = 0.16.
Table 23: Repeated Measures ANOVA to Determine the Effect of Time on the Item
Difficulty Index for Course 6
Item Parameter Course Repeated Measures ANOVA
F p Effect Size
Item Difficulty C6 0.133 0.87ns 0.001
Note: ns Not significant (significant at α < 0.05)
Table 24: Repeated Measures ANOVA to Determine the Effect of Time on the Item
Discrimination Index for Course 6
Item Parameter Course Repeated Measures ANOVA
F p Effect Size
Item Discrimination C6 1.814 0.16ns 0.091
Note: ns Not significant (significant at α < 0.05)
4.4.1.2 Correlation Coefficient
Correlation coefficients (r) were calculated to look at the temporal stability of items using
CTT. It was assumed that if the r was high, the items were stable. Tables 25 and 26 show the
correlation coefficients across the years, i.e., Year 1, 2 and 3, for Course 1 using CTT for both
difficulty and discrimination parameters. These tables are then followed by scatter plots to depict
correlations of one year with another, first for CTT and then for IRT, for both difficulty and
discrimination parameters. Correlation coefficients for Course 3 can be viewed in Appendix A9
and A11. For Course 6, they can be viewed in Appendix B9 and B11.
Table 25 shows that all three years yielded positive correlation with each other. The highest
correlation was noted between Year 1 and Year 3 (r = 0.99, p < 0.00) whereas Year 1 and 2 had a
lower correlation coefficient (r = 0.71, p < 0.00). Year 2 and 3 showed a similar trend to
Year 1 and 2 (r = 0.71, p < 0.00). A high correlation coefficient points towards homogeneity of the
cohort of students and stability of items across the years.
Table 25: Correlation Coefficient of Difficulty Index of Year 1, 2, 3 for Course 1 for CTT
Year 1 Year 2 Year 3
Year 1 1 0.714 0.998
Year 2 0.714 1 0.711
Year 3 0.998 0.711 1
Table 26 expresses the correlation coefficient for the discrimination index for the three
years calculated by CTT. All of them were positive and the highest correlation was seen between
Year 1 and 3 (r = 0.96, p < 0.00) whereas those between Year 1 and 2 (r = 0.56, p < 0.00) and 2
and 3 (r = 0.56, p < 0.00) were much lower. This indicated that the Year 2 cohort was not as
homogeneous as the other two years and items not as stable as the other two years for
discrimination index.
Table 26: Correlation Coefficient of Discrimination Index of Year 1, 2, 3 for Course 1 for
CTT
Year 1 Year 2 Year 3
Year 1 1 0.564 0.996
Year 2 0.564 1 0.565
Year 3 0.996 0.565 1
In summary, it can be stated that Year 2 was not as strongly correlated with Year 1 and 3 as
the latter two, i.e., Year 1 and 3 with each other. CTT and IRT yielded similar sort of correlation
index; hence one method did not stand out over the other in terms of stability over time.
4.4.1.3 Scatter Plots for CTT for Item Parameters
Below are the scatter plots of the items for Course 1 plotted between two respective years, i.e.,
Year 1 and 2, Year 2 and 3, Year 3 and 1. They show the correlation of items in the context of
their difficulty and discrimination using CTT. Figures 7-9 show the comparisons between the
three years for Course 1 using CTT. The scatter plots for Course 3 are displayed in Appendix
A13 and A15. For Course 6, the scatter plots are displayed in Appendix B13 and B15.
The first scatter plot is between Year 1 and Year 2. The plots indicate that there is a positive
correlation between the difficulty index of Year 1 and 2 (r = 0.71, p < 0.00). The degree of
correlation is good which means that several items correlated with each other strongly. Items
no. 15, 25 and 28 were noted to be deviating from the line of best fit; these may be considered as
influential items. For the second figure, again a good and positive correlation is seen (r = 0.71, p
< 0.00) but some items are noted to deviate from the line of best fit. On closer inspection, these
are the same ones as reported for Year 1 and 2 earlier, i.e., 15, 25 and 28. The overall result shows a
linear and positive correlation. The last plot depicts very strongly positive, linear correlations
between the item difficulty of Year 3 and 1 as nearly all the values of item difficulty fall on the
line of best fit. This shows that inter-year correlations were quite strong for the difficulty index
using CTT and that this method yielded stable results on being used across the three years.
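One way to make "deviating from the line of best fit" concrete is to rank items by the size of their residual from an ordinary least-squares line. The sketch below does this for the Year 1 and Year 2 difficulty values transcribed from Table 12. It is an illustration only: the thesis appears to identify these items by visual inspection, and the function names and the choice of three items are assumptions.

```python
def least_squares_line(x, y):
    """Slope and intercept of the ordinary least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def largest_residuals(x, y, top=3):
    """(item ID, residual) pairs with the largest absolute residuals.

    Item IDs are 1-based positions in the input sequences.
    """
    slope, intercept = least_squares_line(x, y)
    residuals = [
        (i + 1, b - (intercept + slope * a))
        for i, (a, b) in enumerate(zip(x, y))
    ]
    return sorted(residuals, key=lambda pair: abs(pair[1]), reverse=True)[:top]

# CTT difficulty (p) for items 1-30, transcribed from Table 12.
p_year1 = [
    0.862, 0.448, 0.695, 0.672, 0.569, 0.816, 0.477, 0.575, 0.592, 0.489,
    0.690, 0.746, 0.632, 0.678, 0.638, 0.747, 0.810, 0.839, 0.920, 0.753,
    0.822, 0.718, 0.678, 0.742, 0.840, 0.816, 0.744, 0.500, 0.713, 0.687,
]
p_year2 = [
    0.844, 0.408, 0.777, 0.665, 0.704, 0.911, 0.620, 0.765, 0.832, 0.525,
    0.816, 0.726, 0.609, 0.866, 0.402, 0.793, 0.844, 0.793, 0.760, 0.799,
    0.777, 0.676, 0.626, 0.753, 0.834, 0.816, 0.765, 0.363, 0.682, 0.696,
]

flagged = largest_residuals(p_year1, p_year2, top=3)
for item_id, residual in flagged:
    print(f"item {item_id}: residual {residual:+.3f}")
```

A large residual marks an item whose Year 2 difficulty is far from what its Year 1 difficulty would predict; whether such an item is also influential on the fitted line depends on its leverage as well.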
Scatter Plots for Item Difficulty Using CTT for Course 1
Figure 7: Item Difficulty with CTT Year 1 and 2 (scatter plot; axes: Year 1 and Year 2, each scaled from 0 to 1)
Figure 8: Item Difficulty with CTT Year 2 and 3 (scatter plot; axes: Year 2 and Year 3, each scaled from 0 to 1)
Figure 9: Item difficulty with CTT Year 3 and 1 (scatter plot; axes: Year 1 and Year 3, each scaled from 0 to 1)
Scatter Plots of Item Discrimination (p-bis) Using CTT for Course 1
The scatter plots for the three comparisons between the three years for Course 1 using
CTT are seen in Figures 10-12. The first scatter plot is between Year 1 and Year 2. The plot
indicates that there is a positive correlation between the discrimination index of Year 1 and 2. The
degree of correlation is only moderate (r = 0.56, p < 0.00) which means that not all items
correlate with each other strongly. Items no. 9, 10, 23 and 25 were noted to be deviating
remarkably from the line of best fit; these may be considered as influential items. In addition, the
scatter was wider which also pointed towards only moderate correlations. For the second figure
for Year 2 and 3 (r = 0.56, p < 0.00), again a positive relationship was seen but some items are
noted to deviate from the line of best fit. On closer inspection, these are the same ones as
reported for Year 1 and 2 earlier, i.e., 9, 10, 23 and 25. The overall result shows only a moderate,
linear but positive correlation. The last plot depicts very strongly positive, linear correlations
between the item discrimination of Year 3 and 1 as nearly all the values of item discrimination fall on
the line of best fit.
Figure 10: P-bis with CTT of Year 1 and 2 (scatter plot; axes: Year 1 and Year 2)
Figure 11: P-bis with CTT of Year 2 and 3 (scatter plot; axes: Year 2 and Year 3)
To summarize, the higher value of correlation coefficient for Year 1 and 3 is an indication of
their homogeneity as indicated by a positive, linear and strong cluster in the scatter plots above.
On the contrary, the scatter plots of Year 1 and 2 and Year 2 and Year 3 indicate that there are at
least three points that seem to be deviating from the line of best fit in the case of CTT and 2
points in the case of IRT, again indicative of heterogeneity of the group. This could be one of the
reasons for the fluctuations in the values of correlation coefficient reported earlier. As one will
notice in the next section, very similar trends are seen in CTT and IRT for most correlations and
scatter plots.
Figure 12: P-bis with CTT of Year 3 and 1 (scatter plot; axes: Year 1 and Year 3)
4.4.2 Results of Research Question No. 2 B
Do the items show stability across years using IRT?
4.4.2.1 Repeated Measures ANOVA
Repeated measures ANOVA were conducted for IRT for the three courses to evaluate
stability across time by taking years as independent variable and the item parameters as
dependent variables individually. In addition, correlation coefficients were calculated and scatter
plots constructed. Furthermore, TCCs for the three courses were also generated for visual
comparison. The results of individual repeated measures ANOVA for Courses 1, 3 and 6 are
shown in Tables 27-32 and the TCCs follow them.
The results of repeated measures ANOVA for Course 1 (Table 27) indicate that there were
no significant differences at α < 0.05 amongst the mean measures of b parameter across the three
years of measurement as the p value for the three years was more than 0.05. Levene’s Test for
Equality of Variances for b parameter for Course 1 was non-significant as F(1, N = 90) = 0.11, p
= 0.89, the interpretation being that there is homogeneity between the items across three years
and they have similar characteristics. The F ratio also yielded non-significant results as F(1, 90)
= 0.00, p = 0.99, indicating that the differences in between-groups mean were non-significant
and the items, hence, stable over time.
Levene’s Test for Equality of Variances for the a parameter was non-significant as it was
F(1, N = 90) = 2.63, p = 0.08 (Table 28). The F ratio indicated that the item discrimination had
stable characteristics over times 1, 2 and 3 as the result for between-groups differences was non-
significant. It was F(1, 90), 3.01, p = 0.22.
Table 27: Repeated Measures ANOVA to Determine the Effect of Time on the b Parameter for Course 1
Item Parameter    Course    Repeated Measures ANOVA
                            F        p          Effect Size
b Parameter       C1        0.113    0.89 ns    0.018
Note: ns = not significant at α < 0.05.
Table 28: Repeated Measures ANOVA to Determine the Effect of Time on the a Parameter for Course 1
Item Parameter    Course    Repeated Measures ANOVA
                            F        p          Effect Size
a Parameter       C1        2.63     0.08 ns    0.083
Note: ns = not significant at α < 0.05.
The results of repeated measures ANOVA for Course 3 (Table 29) were similar to those
for Course 1. There were no significant differences at α < 0.05 amongst the mean measures of the
b parameter across the three years of measurement, as the p value for the three years was more
than 0.05. Levene’s Test for Equality of Variances for the b parameter for Course 3 was
non-significant, F(1, N = 90) = 0.29, p = 0.74, pointing towards homogeneity between the items
across the three years and the fact that they have similar characteristics. In addition, the F ratio
revealed non-significant differences in the between-groups means, F(1, 90) = 0.01, p = 1.00,
hence showing item stability over time.
The a parameter also yielded a non-significant result for Levene’s Test for Equality of
Variances for all three years, F(1, N = 90) = 0.85, p = 0.42 (Table 30). These results
indicated that both the item parameters were stable over times 1, 2 and 3. The F ratio yielded a
non-significant result for the between-groups means, F(1, 90) = 13.90, p = 0.96.
Table 29: Repeated Measures ANOVA to Determine the Effect of Time on the b Parameter for Course 3
Item Parameter    Course    Repeated Measures ANOVA
                            F        p          Effect Size
b Parameter       C3        0.295    0.74 ns    0.101
Note: ns = not significant at α < 0.05.
Table 30: Repeated Measures ANOVA to Determine the Effect of Time on the a Parameter for Course 3
Item Parameter    Course    Repeated Measures ANOVA
                            F        p          Effect Size
a Parameter       C3        0.859    0.42 ns    0.096
Note: ns = not significant at α < 0.05.
The results of repeated measures ANOVA for Course 6 (Table 31) were similar to those
for Courses 1 and 3. There were no significant differences at α < 0.05 amongst the mean measures
of the b parameter across the three years of measurement, as the p values for the three years were
more than 0.05. Levene’s Test for Equality of Variances for the b parameter for Course 6 was
non-significant, F(1, N = 90) = 0.07, p = 0.92. The between-groups means did not differ
significantly, as the F ratio was non-significant, F(1, 90) = 0.00, p = 1.00, thus showing item
stability over time.
The a parameter also yielded a non-significant result for Levene’s Test for Equality of
Variances, F(1, N = 90) = 0.17, p = 0.84 (Table 32). The F ratio also indicated that both
the item parameters were stable over times 1, 2 and 3 and the difference in between-groups means
was non-significant, F(1, 90) = 5.09, p = 0.88.
Table 31: Repeated Measures ANOVA to Determine the Effect of Time on the b Parameter for Course 6
Item Parameter    Course    Repeated Measures ANOVA
                            F        p          Effect Size
b Parameter       C6        0.077    0.92 ns    0.011
Note: ns = not significant at α < 0.05.
Table 32: Repeated Measures ANOVA to Determine the Effect of Time on the a Parameter for Course 6
Item Parameter    Course    Repeated Measures ANOVA
                            F        p          Effect Size
a Parameter       C6        0.173    0.84 ns    0.031
Note: ns = not significant at α < 0.05.
4.4.2.2 Correlation Coefficient
Correlation coefficients (r) were calculated to look at the temporal stability of items using
IRT as was done in the context of CTT. Course 1 is elaborated upon here while the correlation
coefficients for Course 3 and 6 are presented in the appendix. It was assumed that if the r was
high, the items were stable. Tables 33 and 34 show the correlation coefficients across the years,
i.e., Year 1, 2 and 3, for Course 1 using IRT for both difficulty and discrimination parameters.
These tables are then followed by scatter plots in the next section to depict correlations of one
year with another for both difficulty and discrimination parameters when calculated with IRT.
Correlation coefficients for Course 3 can be viewed in Appendix A10 and A12. For Course 6,
they can be viewed in Appendix B10 and B12.
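As an illustration of how such an inter-year correlation matrix is obtained, the following Python sketch computes Pearson's r between the item parameters of each pair of years, together with the t value used to test r against zero. The parameter values, the dictionary layout and the function names are hypothetical and are not taken from the thesis data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of item parameters."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical b parameters for five items in three years (not thesis data).
years = {
    "Year 1": [-2.1, -1.0, -0.4, 0.3, 1.2],
    "Year 2": [-1.6, -1.3, 0.1, 0.2, 0.8],
    "Year 3": [-2.0, -1.1, -0.5, 0.4, 1.1],
}
names = list(years)
matrix = [[pearson_r(years[p], years[q]) for q in names] for p in names]
for name, row in zip(names, matrix):
    print(name, [round(v, 3) for v in row])

# With 30 items, r = 0.82 gives a t value of roughly 7.6 on 28 degrees of freedom.
print(round(t_statistic(0.82, 30), 2))
```

A t value of that size on 28 degrees of freedom corresponds to a p value well below 0.001, which is why all the inter-year correlations reported for Course 1 are statistically significant.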
Table 33 shows inter-year correlation coefficients for the difficulty index calculated by IRT.
Positive correlation is noted amongst all three years but the most remarkable correlation was
noted between Year 1 and 3 (r = 0.99, p < 0.001). Correlations between Year 1 and 2 (r = 0.82,
p < 0.001) and between Year 2 and 3 (r = 0.82, p < 0.001) were also quite high. This was similar in
trend to the ones noted when similar analyses were conducted with CTT though, in contrast to
CTT, the ones with IRT were more strongly correlated with each other.
Table 33: Correlation Coefficient of Difficulty Index of Year 1, 2, 3 for Course 1 for IRT
          Year 1    Year 2    Year 3
Year 1    1         0.825     0.999
Year 2    0.825     1         0.822
Year 3    0.999     0.825     1
Table 34 depicts the correlation coefficients for the discrimination index for the three years
calculated by IRT. All of them were positive and the highest correlation was seen between Year
1 and 3 (r = 0.98, p < 0.001) whereas those between Year 1 and 2 (r = 0.73, p < 0.001) and Year 2
and 3 (r = 0.74, p < 0.001) were much lower. This indicated that the Year 2 cohort was not as
homogeneous as the other two years and the items were not as stable as in the other two years when
calculating the discrimination index.
Table 34: Correlation Coefficient of Discrimination Index of Year 1, 2, 3 for Course 1 for IRT
          Year 1    Year 2    Year 3
Year 1    1         0.732     0.983
Year 2    0.732     1         0.744
Year 3    0.983     0.744     1
4.4.2.3 Scatter Plots of Item Difficulty Using IRT for Course 1
Figures 13-15 show positive correlations between the item difficulty index for all three years,
i.e., 1, 2 and 3, for Course 1 when IRT was applied. The scatter plots for Course 3 are displayed
in Appendix A14 and A16. For Course 6, the scatter plots are displayed in Appendix B14 and
B16.
In the context of Course 1, Year 1 and 2 show very good correlation with each other (r =
0.82, p < 0.001). Here as well, items 15, 25 and 28 were noted to deviate from the line of best fit.
Hence, it can be noted that the same items as the ones noted with CTT deviate from the line of
best fit. A similar type of plot is observed for Year 2 and 3 (r = 0.82, p < 0.001) as all the
items showed positive correlation with each other. Again, the same three items as the ones
reported with previous plots are observed here, i.e., 15, 25 and 28. A near-perfect correlation is
seen in the case of item difficulty calculated by IRT for Year 3 and 1 (r = 0.99, p < 0.001). Almost
all the items are noted to fall very close to the line of best fit. All these trends are similar to the ones
reported for item difficulty calculated using CTT. Correlation coefficients are noted to be slightly
better for IRT analyses.
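The idea of an item deviating from the line of best fit can be made concrete with a short Python sketch that fits a least-squares line to the parameters of two years and flags items whose residual is large. In this research the deviating items were identified by inspecting the scatter plots; the cut-off of two residual standard deviations, the function names and the toy values below are assumptions made only for illustration.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x


def flag_deviating_items(x, y, threshold=2.0):
    """Return 1-based item numbers whose residual exceeds threshold * residual SD."""
    slope, intercept = fit_line(x, y)
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    resid_sd = math.sqrt(sum(e * e for e in residuals) / (len(x) - 2))
    return [i + 1 for i, e in enumerate(residuals) if abs(e) > threshold * resid_sd]


# Hypothetical difficulty parameters for nine items in two years;
# item 5 is deliberately made much harder in the second year.
year_a = [-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
year_b = [-4.0, -3.0, -2.0, -1.0, 3.0, 1.0, 2.0, 3.0, 4.0]
print(flag_deviating_items(year_a, year_b))  # prints [5]
```

Items flagged in this way for more than one pair of years would be natural candidates for the review suggested in this section.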
Figure 13: Item Difficulty with IRT Year 1 and 2 (scatter plot; recovered axis label: Year 1)
4.4.2.4 Scatter Plots for Item Discrimination using IRT for Course 1
The scatter plots for the three comparisons between the three years for Course 1 using
IRT are in Figures 16-18. The first scatter plot is between Year 1 and Year 2. The plot indicates
that there is a positive correlation between the discrimination index of Year 1 and 2. The degree
of correlation is quite good (r = 0.73, p < 0.001) which means that only 2 items did not correlate
Figure 14: Item Difficulty with IRT Year 2 and 3 (scatter plot; axes: Year 2 and Year 3)
Figure 15: Item Difficulty with IRT for Year 3 and 1 (scatter plot; axes: Year 1 and Year 3)
well and hence were not stable over the years. These were noted to be items no. 1 and 2; they
deviated from the line of best fit and hence may be called influential items for this set of
data. It is interesting to note that with IRT, an entirely different set of items was noted to deviate
from the line of best fit. For the second figure, again a positive trend is seen and very few items are
noted to deviate from the line of best fit. For Year 2 and 3 here, the correlation coefficient was
noted to be better than for CTT (r = 0.74, p < 0.001). On closer inspection, the deviating items are
the same ones as reported for Year 1 and 2 earlier, i.e., items no. 1 and 2. The overall result shows a
moderate, linear and positive correlation. The last plot depicts a very strong, positive, linear
correlation between the item discrimination index of Year 3 and 1 (r = 0.98, p < 0.001) as nearly
all the values lie on the line of best fit.
Figure 16: Item Discrimination with IRT of Year 1 and 2 (scatter plot; axes: Year 1 and Year 2)
Figure 18: Item Discrimination with IRT of Year 3 and 1
To summarize, the higher value of the correlation coefficient for Year 1 and 3 is an indication of
their homogeneity as indicated by a positive, linear and strong cluster in the scatter plots above.
On the contrary, the scatter plots of Year 1 and 2 and Year 2 and 3 are not as strongly
Figure 17: Item Discrimination with IRT of Year 2 and 3 (scatter plot; axes: Year 2 and Year 3)
correlated though the correlations are still significant. Very similar trends are seen in CTT and
IRT for most correlations and scatter plots. Only the discrimination index calculated by IRT has
yielded different influential points (items no. 1 and 2). Years 3 and 1 have shown the most stable
temporal pattern.
4.4.3 Test Characteristic Curves
Test characteristic curves (TCCs) were generated to elucidate the stability of all three courses
across the three chosen years respectively. The TCC predicts the proportion or number of items
that an examinee would answer correctly as a function of theta. The X-axis depicts the levels of
theta. The left Y-axis is in proportion correct units while the right Y-axis is in number-correct
units. These graphs are presented in the figures below.
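The relationship between the TCC and the individual item curves can be sketched in a few lines of Python: under the 2 PL model, the expected number-correct score at a given theta is the sum of the item response probabilities, and dividing by the number of items gives the proportion-correct scale. The item parameters below are hypothetical, and the scaling constant D (often set to 1.0 or 1.7) is an assumption, since the value applied by the software is not restated here.

```python
import math


def icc_2pl(theta, a, b, D=1.0):
    """2 PL probability of a correct response (D is an assumed scaling constant)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def tcc(theta, items, D=1.0):
    """Expected number-correct score: sum of the item curves at theta."""
    return sum(icc_2pl(theta, a, b, D) for a, b in items)


# Hypothetical (a, b) pairs for a three-item test, not thesis data.
items = [(1.0, 0.0), (0.8, -1.0), (1.2, 1.0)]

for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    expected = tcc(theta, items)
    # Right-hand axis: number correct; left-hand axis: proportion correct.
    print(theta, round(expected, 2), round(expected / len(items), 2))
```

Plotting these values against theta for each year's item parameters gives curves of the kind compared in Figures 19-21.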
Test Characteristic Curves for Course 1
Figure 19. Test Characteristic Curve for Course 1, Year 1
Figure 20. Test Characteristic Curve for Course 1, Year 2
Figure 21. Test Characteristic Curve for Course 1, Year 3
As can be observed, all three curves for the three successive years appear to be quite similar
in shape, which signifies the stability of the items across the years, i.e., times 1, 2 and 3. TCCs for
Courses 3 and 6 can be seen in Appendix A17 and B17 respectively.
It can be summarized that both CTT and IRT show temporal stability for the items across the
years. Year 1 and 3 show more stability with each other than with Year 2 when using CTT. This
pattern is seen in the context of both item difficulty and discrimination. This may be attributable
to a non-homogeneous sample in Year 2 with students potentially having better abilities than
those in Year 1 and 3. Some items stand out as potentially influential ones, as shown by their
deviation from the line of best fit. These items need either revision or removal from the test
to improve the stability over time.
CHAPTER V- DISCUSSION
In this research, item analysis, reliability and stability of 90 MCQs were assessed three times
over six years. These MCQs covered the skills of Basic Sciences, Investigations, Diagnosis and
Management. The data were analyzed using and comparing CTT and IRT. This research showed
that the items had adequate item difficulty and discrimination using both methods along with fair
reliability. Furthermore, the items were stable for some years when repeated. In the context of
Course 1, they were more stable for Year 1 and 3 than for Year 2. What is unique about this
research is that two measurement methods have been used to look at the psychometrics of
MCQs, one working with observed raw scores (CTT), the other with a latent ability scale (IRT).
In addition, the stability of MCQs when re-used in recurring years is also an element that has not
been extensively investigated in the field of medical education. Course 1 has been discussed at length in this
research as the number of examinees was the highest amongst the three courses and also the most
consistent.
5.1 Discussion Related to Research Question No. 1
What was the reliability of scores using and comparing two methods of analysis, i.e., item
response theory and classical test theory, on MCQ items administered three times over a six
year period?
This research question aimed to look at the item parameters using CTT and IRT. The
reliability index was also calculated for the items along with SEM and ICCs generated and
plotted using IRT.
5.1.1 Research Question No. 1 A: What are the item parameters when conducting item
analysis with CTT?
This research showed that most of the MCQs for Course 1 were of ‘adequate’ type, the
rest being easy. For Year 2, a slightly different trend was noted as half the items were easy.
Differences in the performance of Year 2 as compared to Year 1 and 3 may be attributable to
differences in their learning curves. Students tend to learn more as their experience increases.
One explanation of the difference in performance of students in Year 2 might be that some
students entering the MD program at the University of Calgary have already finished a
master’s degree. Another reason could be the difference in the style of teaching. Both the Faculty of
Medicine and the Teaching and Learning Center at the University of Calgary offer teaching
certificates. The former is a requirement for the “master teachers” who provide a large
percentage of the teaching in the medical school. This may have made a difference to the
examinees’ performance in Year 2. Research in the fields of education and social sciences has
shown that teaching strategies and methods of information transfer do make a difference to
students’ results.154, 155
Discrimination index is important as it helps to distinguish between students of different
abilities. It also highlights the weaknesses of MCQs under study by giving a value to the degree
of difference between students of high and low ability. Confusing or ambiguous wording along
with an incorrect answer key may lead to poor discriminatory values. In the case of item
discrimination for Course 1 using CTT, nearly half of the items in this research had a fair
discrimination index for Year 1 and 3. For Year 2, only about one third had a fair discrimination
index of >0.2. Similar findings were observed in the analyses of Course 3 and 6 as well.
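For readers less familiar with the CTT indices discussed here, the following Python sketch shows one common way of computing them from a scored response matrix: item difficulty as the proportion of examinees answering correctly, and discrimination as the point-biserial correlation between the item score and the total score. The response matrix and function names are hypothetical; the sketch correlates each item with the uncorrected total score, whereas some software removes the item from the total first.

```python
import math


def item_difficulty(item_scores):
    """Proportion of examinees answering the item correctly (p value)."""
    return sum(item_scores) / len(item_scores)


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total score.

    Undefined (division by zero) if every examinee has the same item score.
    """
    n = len(item_scores)
    mean_i = sum(item_scores) / n
    mean_t = sum(total_scores) / n
    cov = sum((i - mean_i) * (t - mean_t) for i, t in zip(item_scores, total_scores))
    var_i = sum((i - mean_i) ** 2 for i in item_scores)
    var_t = sum((t - mean_t) ** 2 for t in total_scores)
    return cov / math.sqrt(var_i * var_t)


# Hypothetical scored responses: 6 examinees (rows) by 4 items (columns).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
totals = [sum(row) for row in responses]

for j in range(len(responses[0])):
    scores = [row[j] for row in responses]
    p = item_difficulty(scores)
    r_pb = point_biserial(scores, totals)
    # 0.2 is the cut-off for a fair discrimination index used in this section.
    label = "fair or better" if r_pb > 0.2 else "poor"
    print(f"item {j + 1}: p = {p:.2f}, point-biserial = {r_pb:.2f} ({label})")
```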
It has been observed that good students become overcautious in attempting to answer
parts of an item they are not completely sure of as they fear losing hard-earned marks on the
other item parts. On the other hand, relatively weaker students would take risks since they
already know little about the topic. They expect the least score they can get is a zero and hence,
they take a chance at attempting to answer the option. With the SBA type of MCQs only one
option is the correct one; hence the element of chance guessing is reduced to an extent. It is quite
striking to note that in the case of Course 1, results were the most different for Year 2, where a
few items were found more difficult by the Year 2 cohort. One explanation of poor
discrimination, in addition to miskeying and ambiguity of the item, is that the clarity
of concepts may have been less among the students in the other two years. Year 2 students, on
the other hand, likely selected the right answer because of their intrinsic ability to explore, but it
was marked wrong. Some more potential causes of varied indices may be the wording of the
question and areas of controversy in the topic being questioned. Bhakta et al 31 have noted that
the reason for frequently selecting the incorrect response as the correct one is attributable to the
distracter being very close to the correct option in terms of the accuracy of information it
provides. According to their findings, if a distracter is constructed so that it is very close to the
correct option, it is chosen frequently as the correct option by the students. The difference lies in
the ability of the examinees as the ones with lower ability usually choose the distracter as the
correct option and those with higher ability actually choose the correct option.
Hingorjo et al156 utilized 50 MCQs from a physiology exam for undergraduate students. The
mean difficulty index reported by them is again similar to our research, i.e., 0.78. Furthermore,
they reported a mean discrimination index of 0.35 with 62% of the items having an excellent
discrimination index of 0.4-0.45. This is also comparable to this research where similar
discrimination indices were reported. Contrary to this research, another study reported lower
discrimination indices, mostly between 0.2 and 0.25, on a set of seventy MCQs, which were
randomly selected from para-clinical subjects.157 They attributed these low indices to the
ambiguity in the content of the MCQ items.
It is desirable that MCQs at the medical school level are constructed to assess higher order
thinking and analysis in addition to application and synthesis. In the case of Course 1 and 3
which are offered in the first year of medical school at the University of Calgary, these items
may be slightly fewer in number but as the student matures and moves on, it is acceptable, in fact
warranted, that more difficult types of MCQs be encountered in an exam. In the context of this
research, more MCQs were of the easy / adequate type and fewer of the difficult type, and
discrimination was only fair. For a summative exam, it is desirable that a certain proportion of the
items are of the type that is more difficult and discriminating. In our research, because of similarities in the groups of
students, this was not the case. It must be kept in mind that some items will have low
discrimination indices because they may represent content that is expected to be known and
understood by the student.158
5.1.2 Research Question No.1 B: What are the item parameters when conducting item analysis
with IRT?
The majority of the items were of the easy type for all three years when IRT was applied.
Strikingly in Course 1, three items stood out as very difficult for students in all three years. It is
difficult to explain why they were found to be more difficult for the students in Year 2 who
otherwise have shown better performance in general. One explanation may be that these items
were the ones whose underlying concepts were not taught effectively and although the
misconception was understood by the students in Year 1 and 3 as taught, the students in Year 2,
with their superior ability of reasoning, were able to identify the concept as unclear or wrong.
Items 1 and 2 had the lowest discrimination amongst all. Discrimination was better when IRT
was applied and was in fact noted to be quite high as several values were above the ideal cut-off
value of 0.4.
An item with a difficulty level where fifty percent of the students are able to answer correctly
may be appropriate depending on what the aim of the exam is and what the sample
characteristics might be, i.e., smaller size, narrow content as was the case in this research. The
difficulty and discrimination indices are often reciprocally related.159 However, this may not
always be true. Questions having a higher p (easier questions) tend to discriminate poorly; conversely,
questions with a lower p (harder questions) are considered to be good discriminators. A potential
reason for such high discrimination indices as noted in this research could be the narrow
examination content that the students were assessed on. On the other hand, if the efficiency of
distracters is good, the discrimination index becomes narrow.
For Course 1, if one looks at the percentages of the items, more than half of the items were of
the easy type. A close inspection revealed that the easy ones were easier for students in Year 2.
This may be an indication of their better ability or better quality of both teaching and learning, as
stated earlier. Some items were easier for Years 1 and 3 when it has been observed elsewhere
that students in Year 2 were better performers. A closer look at the discrimination index shows
that such items were also more discriminating for Year 1 and 3 than for Year 2. It is
recommended that such items be either revised or removed from the exam. On the other hand,
items with a low difficulty index for Year 1 and 3 which had a higher difficulty index for Year 2
were likely appropriately taught and tested. Since it is thought that students in Year 2 had better
abilities, it might be that the concepts underlying these items may have been misunderstood by
students in Year 1 and 3. Another reason for the Year 2 students finding easy items difficult
could be that although everybody might have made a guess, Year 2 students failed to guess the
right answer. This is where the 3 PL model, which accounts for the guessing behaviour of
students, can help. Difficult items like nos. 10, 15 and 28 are the ones that may play a role in
differentiating between students with high and higher abilities where honours need to be
determined in addition to decisions about pass and fail.
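How the 3 PL model accommodates guessing can be seen from its response function, sketched below in Python. The pseudo-guessing parameter c sets a lower asymptote, so that even examinees of very low ability retain a probability c of answering correctly. The value c = 0.25 (a blind guess among four options), the other parameter values and the scaling constant D are assumptions for illustration only; the 3 PL model was not fitted in this research.

```python
import math


def icc_3pl(theta, a, b, c, D=1.0):
    """3 PL probability of a correct response.

    a: discrimination, b: difficulty, c: pseudo-guessing (lower asymptote),
    D: assumed scaling constant.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item: a = 1.0, b = 0.0, c = 0.25.
for theta in (-4.0, -2.0, 0.0, 2.0, 4.0):
    with_guessing = icc_3pl(theta, 1.0, 0.0, 0.25)
    without_guessing = icc_3pl(theta, 1.0, 0.0, 0.0)  # reduces to the 2 PL curve
    print(theta, round(with_guessing, 3), round(without_guessing, 3))
```

At theta equal to b the 3 PL probability is c + (1 - c) / 2 rather than 0.5, which is one reason difficulty estimates from the 2 PL and 3 PL models are not directly interchangeable.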
From the results so far, one gets an impression that although CTT and IRT are mostly
comparable, there are subtle differences noted in context of both parameters. The fact remains
that in CTT, the item statistics are sample-dependent and in IRT, sample-independent. It appears
that IRT has demonstrated a more specific analysis of the items than CTT which is what was
anticipated as IRT works at item level and CTT at test level. These parameters are sometimes
affected by unidentified changes in the characteristics of a sample drawn from a population and
thus the item statistics are completely changed, thus providing evidence for its sample
dependence in CTT.160
Fan152 conducted research with the objective of looking at the comparability of CTT and IRT
with a very large data set. In that study, 108 MCQs were analyzed that were used to assess
40,000 students. Although Fan152 used all three parameter logistic models to assess the comparability
of IRT with CTT, it was found that the results of the analysis were most comparable in the context of
both item difficulty and discrimination when the 1 and 2 PL models were used. Similar results were
also reported by a more recent study, again using a very large data set. Guler and colleagues151
looked at comparing the two measurement methods, i.e., CTT and IRT. Although their data set is
smaller compared to the one reported on by Fan, the results were consistent. CTT and IRT,
especially the 2 PL, were found to be comparable with each other.
5.1.3 Research Question No.1 C: Are the item parameters comparable when conducting item
analysis with both CTT and IRT?
Studies have shown moderate to excellent comparability between item parameters when
applying CTT and IRT.151, 152 This research showed similar results for most of the years for all
three courses. Fan152 conducted research comparing CTT with the three dichotomous models of
IRT. The examinee sample size in that research was much larger, at 1,000 for each sample set.
One, two and three parameter logistic models were applied to a criterion-referenced test. As in
this research, correlation coefficients were calculated for CTT and IRT for the item difficulty and
discrimination. The correlation coefficients for item difficulty reported by Fan are around 0.9;
the ones reported by ourselves are about the same or around 0.8 for most of the years for all three
courses. In Fan’s study, the best correlation coefficients were noted for the 1 PL model. Similar
trends were noted for item difficulty for both 2 and 3 PL models. The researcher attributed the
differences in the correlations to the sampling of the items. In contrast, item discrimination,
although correlated with each other, did not do as well as item difficulty. As in the study by Fan,
a ceiling effect is seen in this research as well. Although theirs is attributable to the nature of the
exam, which was minimum-competency, ours is likely due to the homogeneity of students. In the
context of item discrimination, Fan found that both CTT and IRT were comparable though not as
strongly. Our research showed very good comparability for item discrimination as well with both
CTT and IRT. Our sample of students was quite consistent with each other in ability levels and
the discrimination was likely uniform due to that reason. One reason for the findings above may
be that although the number of examinees was relatively adequate in our research, the number of
items was small. Fan’s research was replicated by Courville161 to study similarities between CTT
and IRT which further strengthened the notion that CTT and IRT are quite comparable.
In another study carried out by Guler et al151 about 1200 students were assessed with 25
items for a high school entrance exam. Both CTT and IRT were applied to the data to look at
person and item fit statistics. These data were about the same size as the ones in this study,
although smaller than ones reported by Fan152 and Courville.161 Our results are quite similar to
the ones reported by Guler as high correlation coefficients were noted between both CTT and
IRT for the given data. The best correlations were seen for the 1 PL model for item difficulty and
for 2 PL model for item discrimination. Interestingly, the poorest correlations were seen with the
3 PL model which was attributed to the guessing behaviour of the students.
5.1.4 Research Question No.1 D:
What is the reliability index of the test scores?
Reliability coefficient and SE of estimates were calculated for each item for the three years
examined for all three courses. The SE for both difficulty and discrimination parameters were
mostly found to be large in this research. As a consequence, the reliability was noted to be only
fair to moderate. The large sizes of SE are most likely attributable to the small sample size. It is
also known that the error tends to be larger for students who are high scorers, as is the case in this
research, since the noise from stronger students is likely to be larger than from the weaker
ones.149 There are several ways of ensuring that the SE is kept within the acceptable range and
reliability improved. The items should be written without any confusing or misleading
statements. In addition, the instructions about the question should be clearly written.
Furthermore, the marking should be objective. Reliability of the scores tends to decrease if the
items on a test are too easy or too difficult. It is also affected by the characteristics of a group
since the more heterogeneous the group, the higher the reliability.162 In this particular research,
students belonged to a local medical school where entry is gained after going through a rigorous
admission process. As a result, those who ultimately enter the school have very similar
characteristics and ability levels. One reason why the reliability coefficient of the scores may be
low in this research could be the homogeneity of the sample. In addition, the items that were
analyzed were mostly found to be easy. Both these factors may have led to low reliability
reported in this research. Research has highlighted that reliability coefficients are helpful in
informing the researchers about the sampling errors that can adversely affect the reliability.163 If
the reliability of the scores is low, it may also indicate that either the test is short or the content
being examined is narrow. In one study, a smaller data set of about 25 MCQs for a low stake
exam was analyzed.164 The research was conducted in the field of pulmonology where the MCQs
were randomly selected from a larger pool of 70 items. Cronbach’s alpha was reported to be
0.69, quite similar to the ones reported for most of the years in our research. The research
concluded that the relatively low reliability index was attributable to the narrow content of
assessment.
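The reliability index discussed in this section can be illustrated with a short Python sketch of Cronbach's alpha, together with the classical test-level standard error of measurement derived from it (total-score SD multiplied by the square root of one minus the reliability). The response matrix and function names are hypothetical, and the SEM shown is the CTT quantity for total scores, not the standard errors of the IRT item parameter estimates.

```python
import math


def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(responses):
    """Cronbach's alpha for a matrix of examinees (rows) by items (columns)."""
    k = len(responses[0])
    item_vars = [sample_variance([row[j] for row in responses]) for j in range(k)]
    total_var = sample_variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)


def standard_error_of_measurement(responses):
    """CTT SEM: total-score SD times sqrt(1 - alpha), floored at zero."""
    total_sd = math.sqrt(sample_variance([sum(row) for row in responses]))
    return total_sd * math.sqrt(max(0.0, 1.0 - cronbach_alpha(responses)))


# Hypothetical scored responses: 6 examinees by 4 items (not thesis data).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 3))
print(round(standard_error_of_measurement(responses), 3))
```

Because alpha depends on the ratio of summed item variances to total-score variance, a homogeneous cohort answering mostly easy items yields little total-score variance and hence a lower alpha, which is the mechanism suggested above for the modest reliability observed.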
When the reliability coefficients of test scores were compared by the two methods, i.e.,
CTT and IRT, the results indicated that neither CTT nor IRT was particularly better than the
other. In fact, the results with both the methods were quite consistent with each other. Although
the reliability coefficients for the test scores for all three years for the three courses were slightly
better when applying IRT, at a local medical school, they were not significant enough to
recommend the use of only IRT for measurement purposes.
5.1.5 Research Question No.1 E: What are the item characteristic curves like for the
individual items for each year?
ICCs were generated for individual items for the three years for Course 1, 3 and 6. As
discussed earlier, the ICC expresses the relationship between the ability of an examinee and the
probability of his or her endorsing an item. With SBA type of MCQs, the curve tends to be s-
shaped since with the increase in the level of ability, the probability of endorsing an item also
increases. The curve is steeper where a small change in the level of ability produces a large
change in the probability of endorsing an item. This regression is non-linear. In an ICC, the slope is
determined by the item discrimination index. The threshold at which the examinees endorse the item
and the slope of the curve establish the effectiveness of the item as an indicator of the ability. In
our research, more than 50% of the items were of the easy type. Hence, several curves are noted
to be moved to the left. Furthermore, one of the objectives of our research was to look at the
temporal stability of the items. One can notice that the curves look similar at a glance for Year 1
and 3 but not so for Year 2. It has also been speculated that the likely reason for the difference is
the performance of students in Year 2 and hence the difference in the curve is attributable to the
ability of the students in this year which seems to be superior to the other two years.
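The two features of the ICC described above, its location and its slope, can be checked numerically with a short Python sketch of the 2 PL curve. Under this model the probability of a correct response is 0.5 where theta equals b, and the slope at that point is D multiplied by a and divided by 4, so a more discriminating item has a steeper curve and an easier item has a curve shifted to the left. The parameter values and the scaling constant D are assumptions made for illustration.

```python
import math


def icc_2pl(theta, a, b, D=1.0):
    """2 PL probability of a correct response (D is an assumed scaling constant)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def icc_slope(theta, a, b, D=1.0):
    """Derivative of the 2 PL curve with respect to theta: D * a * P * (1 - P)."""
    p = icc_2pl(theta, a, b, D)
    return D * a * p * (1.0 - p)


# Hypothetical items: an easy, weakly discriminating item and a harder,
# strongly discriminating one (not thesis data).
easy_item = {"a": 0.5, "b": -2.0}
hard_item = {"a": 1.5, "b": 1.0}

for name, item in (("easy", easy_item), ("hard", hard_item)):
    at_b = icc_2pl(item["b"], item["a"], item["b"])
    slope = icc_slope(item["b"], item["a"], item["b"])
    at_zero = icc_2pl(0.0, item["a"], item["b"])
    print(name, round(at_b, 2), round(slope, 3), round(at_zero, 3))
```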
In summary, analyses like ours assist the assessors in revising the items. One option is to
merge the answers of an item together if the domains overlap for items with similar curves. This
will lead to the creation of a single option. It may also be advisable to remove unused options
and replace them with more effective ones. Such measures lead to improvement of
discrimination between less and more able students.
5.2 Discussion Related to Research Question No. 2
Do the items exhibit temporal stability when repeated over Year 1, 2 and 3?
5.2.1 Research Question No. 2 A : Do the items show stability across years using CTT?
The stability of items was assessed by using repeated measures ANOVA and calculating the
correlation coefficients of the items under scrutiny, which yielded stable results for all three years
for the three courses.
The aim of studying the stability of items over time was to present evidence that they are
repeatable across the years without compromising their psychometric properties. For this purpose
F ratios were calculated for the three courses. In the context of the F ratio, if the p value is
non-significant, it shows that between-groups differences are not remarkable and item parameters,
hence, stable. Our research did yield small F ratios for the three courses for both item difficulty
and discrimination parameters, thus signifying the stability of items. Baig and Violato conducted
a similar sort of analysis using MANOVA to compare station stability in the context of
OSCEs for international medical graduates in Alberta.165 They also documented adequate
stability of the OSCE station over three points in time. The construction and maintenance of an
item bank is difficult, both in terms of monetary factors and in terms of faculty time and
expertise. Once an item has been constructed, it also requires timely updating due to the changes
in the curricular content, usually as a result of new knowledge that has been acquired about the
topic that a student is being assessed on. Keeping in mind the logistics of developing and
maintaining such an item bank, the items that show less stability across the years may sometimes
need to be revised due to a threat to their psychometric properties. Alternatively, they might
require removal from the exam altogether. This decision is also influenced by the objective of the
examination. If such exams are low-stakes and formative, the items may only need to be revised.
On the other hand, for summative, end-of-year high stakes exams where decisions about
graduation and certification are involved, such items may need removal.
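The F ratio discussed above can be illustrated with a minimal sketch of a one-way repeated measures ANOVA, in which each item is the repeated unit and the three years are the within-item factor. This is not the SPSS procedure used in this research; the function name and the difficulty values are hypothetical, and the sketch reports only the F ratio and partial eta squared (a p value would additionally require the F distribution).

```python
def repeated_measures_anova(scores):
    """One-way repeated measures ANOVA.

    scores: list of rows, one row per item, one column per year.
    Returns (F ratio for the year effect, partial eta squared).
    """
    n = len(scores)        # number of items (the repeated units)
    k = len(scores[0])     # number of administrations (years)
    grand_mean = sum(sum(row) for row in scores) / (n * k)

    year_means = [sum(row[j] for row in scores) / n for j in range(k)]
    item_means = [sum(row) / k for row in scores]

    # Partition the total sum of squares into years, items and residual error.
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_years = n * sum((m - grand_mean) ** 2 for m in year_means)
    ss_items = k * sum((m - grand_mean) ** 2 for m in item_means)
    ss_error = ss_total - ss_years - ss_items

    df_years = k - 1
    df_error = (n - 1) * (k - 1)
    ms_error = ss_error / df_error
    if ms_error < 1e-12:
        raise ValueError("no residual variance; the F ratio is undefined")

    f_ratio = (ss_years / df_years) / ms_error
    partial_eta_sq = ss_years / (ss_years + ss_error)
    return f_ratio, partial_eta_sq


# Hypothetical CTT difficulty indices: four items (rows) over three years (columns).
difficulty = [
    [0.82, 0.80, 0.85],
    [0.64, 0.70, 0.66],
    [0.91, 0.88, 0.90],
    [0.55, 0.58, 0.52],
]
f_ratio, effect_size = repeated_measures_anova(difficulty)
print(f"F = {f_ratio:.3f}, partial eta squared = {effect_size:.3f}")
```

A small F ratio with a small effect size would be read, as in the discussion above, as evidence that the parameter did not shift systematically from year to year.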
In our research, correlation coefficients were also calculated, most showing very good
correlation with each other, thus providing further evidence for the temporal stability of the
items. Correlation coefficients express the linear relationship between two variables, the years
being those variables in this research. As indicated in the Results section, some items stood out
as having only fair correlation coefficients. These items raise concerns about both their difficulty
and discrimination indices, whether assessed with CTT or IRT. Such items, if noted to be
affecting the reliability of the scores, should be removed.
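As a rough illustration of the inter-year correlations described above, the following sketch computes Pearson's r for every pair of years from one item parameter. The function names and the parameter values are hypothetical and are not taken from the data of this research.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of item parameters."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def inter_year_correlations(by_year):
    """by_year maps a year label to item values listed in the same item order."""
    labels = list(by_year)
    return {
        (a, b): pearson_r(by_year[a], by_year[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    }


# Hypothetical difficulty indices for the same five items in three years.
difficulty_by_year = {
    "Year 1": [0.82, 0.64, 0.91, 0.55, 0.73],
    "Year 2": [0.80, 0.70, 0.88, 0.58, 0.75],
    "Year 3": [0.85, 0.66, 0.90, 0.52, 0.70],
}
for pair, r in inter_year_correlations(difficulty_by_year).items():
    print(pair, round(r, 3))
```

Because r summarizes all items at once, items that depart from the overall linear trend still have to be identified individually, for example from the scatter plots.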
5.2.2 Research Question No. 2 B: Do the items show stability across years using IRT?
Repeated measures ANOVA was also carried out for the three courses using IRT to look at
the stability of items over three administrations and to compare the findings of CTT with IRT.
The effect size of the F ratio was small for all three courses, signifying stability over time.
The results yielded by repeated measures ANOVA for IRT showed the same trend as for CTT. It can
thus be stated that neither CTT nor IRT is necessarily superior to the other and that the
choice between the two is influenced by factors discussed earlier, such as the objective of the
research, the data size and the model fit.
Test characteristic curves were also generated to further elucidate the stability of all three
courses across the three chosen years respectively. It was assumed that the temporal stability
would be reflected by the uniformity of the curves observed visually. Baig and Violato have
used similar methods for analyzing the temporal stability of OSCE stations for high stakes
licensing exams for international medical graduates.165 The research under discussion revealed
that the scores using IRT were consistent over three years for the three courses when graphs were
plotted between the ability levels and the scores of the items. Very similar results were obtained
for all three courses for the three years. TCCs provide a means for converting ability scores to
true scores. In this way, a number is given to the examinee which relates to the number of items
in the test. It can be noted in the curves generated for the data in this research that the shape is
mostly that of a smooth S. This is dependent on the number of items and the item parameters.
The ability of the examinee is noted to correspond to the mid true score of the examinee and is
plotted on theta. The mid true score is actually the difficulty level of the item and contributes to
the interpretation of the curve for descriptive purposes.
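The conversion from ability to true score described above can be sketched as follows, assuming the 2 parameter logistic model: the test characteristic curve is the sum of the item response functions at each ability level. The item parameters are hypothetical, and the logistic function is written without the scaling constant D = 1.7 that some programs apply, which is an assumption of this sketch.

```python
import math


def p_correct_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model.

    a is the discrimination and b the difficulty of the item.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def expected_true_score(theta, items):
    """Test characteristic curve: expected number-correct score at ability theta."""
    return sum(p_correct_2pl(theta, a, b) for a, b in items)


# Hypothetical (a, b) pairs for a three-item test.
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5)]

# Evaluate the curve from theta = -3 to theta = +3 in steps of 0.5.
thetas = [-3 + 0.5 * i for i in range(13)]
curve = [(theta, expected_true_score(theta, items)) for theta in thetas]
for theta, score in curve:
    print(f"theta = {theta:+.1f}  expected true score = {score:.2f}")
```

The expected score rises monotonically from near zero to near the number of items, which gives the smooth S shape; curves for different years can be overlaid on the same theta axis to compare them visually.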
5.3 Implications and Future Directions for Research
High stakes exams require the construction of items that are psychometrically sound in the
context of their reliability. Furthermore, they have to be stable for repeatability since item
banking has many logistic issues associated with the construction and security of items. Item
parameters also influence the selection of items for the exams. If the item parameters are not
taken into consideration before the selection for an exam, there is a chance that good items that
should be in an exam are mistakenly removed and weak ones included. This research has shown
that the choice of one method of scoring over the other depends on the objective of the research
and the size of the data. At the level of a local medical school, both methods yielded very
comparable results. Stability of the items across time is also an issue that must be addressed
while administering them repeatedly, since changes in construct, curricular content, test-wiseness
and other threats to the security of such items require that the factors that lead to parameter drift
be more thoroughly explored.
Although CTT has been the mainstay of measurement methods in the past, the more recent
decades have seen increased use of IRT. It is now being increasingly utilized in the educational
field for the calibration and evaluation of items in various tests and questionnaires for the scoring
of attitudes, abilities and other traits. Recent advances have seen more frequent application of
IRT in the context of item scaling, equating and CAT. Item calibration and test equating with
IRT are both important for the movement of IRT in a forward direction. As IRT models continue
to evolve, it is hoped that they will soon become less analytically and computationally intensive.
As these models become more able to adapt to the design, size and complexity of assessments,
they are expected to play a more pivotal role in assessments.
5.4 Limitations of the Study
This research looked at the reliability of MCQs using both CTT and IRT. One of the
limitations of this study was the choice of model. Since a 2PL model was applied in this research, the
guessing behaviour of the students could not be studied.
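The point about guessing can be made concrete with a short sketch of the 3 parameter logistic model, which adds a lower asymptote c to the 2PL model. The parameter values are hypothetical; a value of c = 0.2 is used only because it corresponds to random guessing among five options, which is an assumption and not an estimate from this research.

```python
import math


def p_correct_3pl(theta, a, b, c=0.0):
    """Probability of a correct response under the 3PL model.

    c is the pseudo-guessing parameter (lower asymptote).
    With c = 0 the expression reduces to the 2PL model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# A very low-ability examinee on a hypothetical item (a = 1.0, b = 0.0).
low_theta = -4.0
print("2PL:", round(p_correct_3pl(low_theta, 1.0, 0.0), 3))
print("3PL:", round(p_correct_3pl(low_theta, 1.0, 0.0, c=0.2), 3))
```

Under the 2PL model the probability of success tends to zero as ability decreases, whereas under the 3PL model it levels off at c, which is why guessing cannot be examined without the third parameter.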
Another limitation of this study was the limited choice of items. Only the SBA type of MCQs
was included in this research. For consistency, it was also decided to include in the study only those
items that had five options to choose the correct answer from. They also had to have been
repeated in at least three consecutive or overlapping years. In addition to these factors narrowing
down the data, the content examined was narrow as well. The items were chosen from the four
skills of Basic Sciences, Investigations, Treatment and Management. Several of the items were
of the ‘easy’ type. It is clear that such items do affect the stability over time (as evidenced by the
deviation of some items from the line of best fit in the scatter plots). In future, it might be useful
to look at a wider variety of MCQs as recommended for a high stakes exam.152
5.5 Conclusion
Effective measurement of knowledge is vital for the growth of a program. Methods that
are used to assess students’ knowledge have to be evaluated for the qualities of a good
assessment tool as recommended by Norcini et al.8 It is, therefore, important to evaluate the
MCQs to observe their effectiveness in measuring the knowledge of students in preclinical years.
This research was carried out by using and comparing two methods of item analysis for
establishing the reliability of scores on MCQs of MD certifying exams at the University of
Calgary. Results showed that the analyses of the selected items were comparable between CTT
and IRT to some extent. Several items were noted to be of the ‘easy’ type. Furthermore, the item
discrimination was noted to be ‘good’. The reliability of these MCQs was found to be only fair.
The fair indices of reliability may be attributable to the homogeneity of the student sample and
the relatively small size of the data as also indicated by mostly large standard errors of estimates.
In addition, the correlation coefficients calculated for the three years for three courses were only
moderate to good in some instances which means that those items which correlated to a lesser
extent with each other did not exhibit remarkable temporal stability.
On a continuum from less to more complex, the development of IRT models has taken
place with the intent to address the restrictions posed by CTT. IRT models require larger data sets for
better fit and interpretation. It is clear that the choice between CTT and IRT depends on the aim
of research since IRT is better suited to data when being analyzed at item level. The most
effective application of IRT is with a large data set since that improves the reliability of the scores.
Despite its advantage of item level statistics, the results so far do not prove the superiority of IRT
over CTT. These results are similar to results reported by Fan,152 Macdonald and Paunonen78 and
Courville.161 This research has shown that both CTT and IRT often yield similar results. There is
a growing body of literature that points strongly towards the fact put forward by Fan152 who
states “when scores developed by IRT can be correlated with those obtained by the more usual
approach to simply sum items scores, typically it is found that the two sets of scores correlate
higher; thus there is hardly any difference between the two approaches or any marked departure
from linearity of the measurement obtained from the two approaches.”
5.6 Recommendations
This research looked at the psychometrics of MCQs at a local medical school. It did not show
significant superiority of one method of measurement over the other and in such situations, both
CTT and IRT have their respective utility. Although CTT is easier to use due to its robustness, a
combination of both measurement methods may be applied at a local medical school to analyze
the psychometric properties of MCQs in high stakes summative exams. Hence, CTT may be used
to look at the reliability of the test scores and IRT may be applied to analyze the item parameters.
It is hoped that a combination of the two methods would be more practical than using only IRT,
considering that IRT is less robust than CTT when a smaller data set is being analyzed.
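The CTT side of this recommendation can be sketched in a few lines, assuming dichotomously scored responses (1 = correct, 0 = incorrect): item difficulty is the proportion of correct answers and score reliability is summarized by Cronbach's alpha. The function names and the response matrix are hypothetical, and population variances are used throughout as a simplifying assumption.

```python
from statistics import pvariance


def item_difficulty(responses):
    """Proportion of examinees answering each item correctly."""
    n = len(responses)
    k = len(responses[0])
    return [sum(row[j] for row in responses) / n for j in range(k)]


def cronbach_alpha(responses):
    """Cronbach's alpha from an examinee-by-item matrix of item scores."""
    k = len(responses[0])
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: four examinees (rows) by three items (columns).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print("difficulty:", item_difficulty(responses))
print("alpha:", cronbach_alpha(responses))
```

The IRT item parameters would then be estimated separately with dedicated calibration software, as was done with Xcalibre in this research.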
Another aspect of this research was to analyze the temporal stability of MCQs across time.
This research showed stability of items in the context of their parameters although these findings
were not entirely consistent across all the years; some variability was noted in both difficulty and
discrimination parameters. It is, thus, recommended that parameter drift should be analyzed so
that measures can be taken to curtail the observed drift. Since parameter drift has certain
undesirable consequences, schools should make sure that methods are available for assessing and
detecting this drift. One method might be recalibration of an item bank on a regular basis;
another would be to increase the item bank. This is helpful when reusing the same items across a
number of administrations by ensuring that repeating the MCQs in subsequent administrations
does not affect their psychometric properties.
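A very simple screening step for drift might look like the sketch below, which flags items whose IRT difficulty estimate has shifted by more than a chosen amount between two administrations. This is an illustrative heuristic and not a formal drift test; the function name, the item labels and the threshold of 0.5 are hypothetical, and the comparison is meaningful only if both sets of estimates are on a common scale.

```python
def flag_drift(base, later, threshold=0.5):
    """Return items whose difficulty changed by more than the threshold.

    base and later map item labels to difficulty (b) estimates from two
    administrations; items missing from the later administration are skipped.
    """
    flagged = {}
    for item_id, b_base in base.items():
        if item_id not in later:
            continue
        shift = later[item_id] - b_base
        if abs(shift) > threshold:
            flagged[item_id] = shift
    return flagged


# Hypothetical difficulty estimates for three items in two administrations.
year_1 = {"Q1": 0.0, "Q2": -1.0, "Q3": 1.0}
year_3 = {"Q1": 0.1, "Q2": -0.2, "Q3": 0.4}
print(flag_drift(year_1, year_3))
```

Flagged items could then be reviewed for content changes or exposure before a decision is made to revise, recalibrate or retire them.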
REFERENCES
1. Bernstein J. Evidence-Based Medicine. Journal of the American Academy of Orthopaedic
Surgeons. 2004;12(2):80-88.
2. Cooke M, Irby DM, Sullivan W, Ludmerer KM. American Medical Education 100 Years
after the Flexner Report. New England Journal of Medicine. 2006;355(13):1339-1344.
3. Boulet JR. Summative Assessment in Medicine: The Promise of Simulation for High
stakes Evaluation. Academic Emergency Medicine. 2008;15(11):1017-1024.
4. Dannefer EF. Beyond assessment of learning toward assessment for learning: Educating
tomorrow's physicians. Medical Teacher. 2013;35(7):560-563.
5. Driessen E, Scheele F. What is wrong with assessment in postgraduate training? Lessons
from clinical practice and educational research. Medical Teacher. 2013;35(7):569-574.
6. Hodges B. Assessment in the post-psychometric era: Learning to love the subjective and
collective. Medical Teacher. 2013;35(7):564-568.
7. Schuwirth L, Ash J. Assessing tomorrow's learners: In competency-based education only
a radically different holistic method of assessment will work. Six things we could forget.
Medical Teacher. 2013;35(7):555-559.
8. Norcini J, Anderson B, Bollela V, Burch V, Costa MJ, Duvivier R, et al. Criteria for
good assessment: consensus statement and recommendations from the Ottawa 2010
Conference. Medical Teacher. 2010;33(3):206-214.
9. Rudolph JW, Simon R, Raemer DB, Eppich WJ. Debriefing as formative assessment:
closing performance gaps in medical education. Academic Emergency Medicine.
2008;15(11):1010-1016.
10. Yorke M. Formative assessment in higher education: Moves towards theory and the
enhancement of pedagogic practice. Higher Education. 2003;45(4):477-501.
11. Wiliam D, Black P. Meanings and consequences: a basis for distinguishing formative and
summative functions of assessment? British Educational Research Journal.
1996;22(5):537-548.
12. Roberts TE. Assessment est mort, vive assessment 1. Medical Teacher. 2013;35(7):535-
536.
13. Harlen W, James M. Assessment and learning: differences and relationships between
formative and summative assessment. Assessment in Education. 1997;4(3):365-379.
14. Miller GE. The assessment of clinical skills/competence/performance. Academic
Medicine. 1990;65(9):S63-67.
15. Van Der Vleuten CP, Schuwirth LW. Assessing professional competence: from methods
to programmes. Medical education. 2005;39(3):309-317.
16. Davis MH, Karunathilake I. The place of the oral examination in today's assessment
systems. Medical Teacher. 2005;27(4):294-297.
17. Palmer EJ, Devitt PG. Assessment of higher order cognitive skills in undergraduate
education: modified essay or multiple choice questions? Research paper. BMC Medical
Education. 2007;7(1):49-54.
18. Van Der Vleuten CP. The assessment of professional competence: developments,
research and practical implications. Advances in Health Sciences Education.
1996;1(1):41-67.
19. Case S, Swanson D. Extended matching items: a practical alternative to free response
questions. Teaching and Learning in Medicine. 1993;5:107-115.
20. Roberts C, Newble D, Jolly B, Reed M, Hampton K. Assuring the quality of high-stakes
undergraduate assessments of clinical competence. Medical Teacher. 2006;28(6):535-
543.
21. Newble D. Techniques for measuring clinical competence: objective structured clinical
examinations. Medical education. 2004;38(2):199-203.
22. Ramani S. Twelve tips to improve bedside teaching. Medical Teacher. 2003;25(2):112-
115.
23. Stillman P, Swanson D, Regan MB, Philbin MM, Nelson V, Ebert T, et al. Assessment of
Clinical Skills of Residents Utilizing Standardized Patients: A Follow-up Study and
Recommendations for Application. Annals of Internal Medicine. 1991;114(5):393-401.
24. Lockyer J. Multisource feedback in the assessment of physician competencies. Journal of
Continuing Education in the Health Professions. 2003;23(1):4-12.
25. Whitehouse A, Hassell A, Bullock A, Wood L, Wall D. 360 degree assessment
(multisource feedback) of UK trainee doctors: Field testing of team assessment of
behaviours (TAB). Medical Teacher. 2007;29(2-3):171-176.
26. Sandars J. The use of reflection in medical education: AMEE Guide No. 44. Medical
Teacher. 2009;31(8):685-695.
27. Moonen-van Loon J, Overeem K, Donkers H, van der Vleuten C, Driessen E. Composite
reliability of a workplace-based assessment toolbox for postgraduate medical education.
Advances in Health Sciences Education.18(5):1087-1102.
28. Norcini J, Burch V. Workplace-based assessment as an educational tool: AMEE Guide
No. 31. Medical Teacher. 2007;29(9-10):855-871.
29. Alagumalai S, Keeves J. Distractors - Can they be biased too? Journal of Outcome
Measurement. 1999;3:89-102.
30. Beullens J, Struyf E, Van Damme B. Do extended matching multiple-choice questions
measure clinical reasoning? Medical education. 2005;39(4):410-417.
31. Bhakta B, Tennant A, Horton M, Lawton G, Andrich D. Using item response theory to
explore the psychometric properties of extended matching questions examination in
undergraduate medical education. BMC Medical Education. 2005;5(1):5-9.
32. Campbell DE. How to write good multiple choice questions. Journal of paediatrics and
child health. 2013;47(6):322-325.
33. Schuwirth LW, Van Der Vleuten CP. Different written assessment methods: what can be
said about their strengths and weaknesses? Medical education. 2004;38(9):974-979.
34. Fowell SL, Bligh JG. Recent developments in assessing medical students. Postgraduate
medical journal. 1998;74(867):18-24.
35. Wass V, Van der Vleuten C, Shatzer J, Jones R. Assessment of clinical competence. The
Lancet. 2001;357(9260):945-949.
36. Norcini J, Swanson D, Grosso L, Webster G. Reliability, validity and efficiency of
multiple choice question and patient management problem item formats in assessment of
clinical competence. Medical education. 1985;19(3):238-247.
37. Lukhele R, Thissen D, Wainer H. On the Relative Value of Multiple-Choice, Constructed
Response, and Examinee-Selected Items on Two Achievement Tests. Journal of
Educational Measurement. 1994;31(3):234-250.
38. Mislevy RJ, Stocking ML. A Consumer's Guide to LOGIST and BILOG. Applied
Psychological Measurement. 1989;13(1):57-75.
39. Bock R, Aitkin M. Marginal maximum likelihood estimation of item parameters:
Application of an EM algorithm. Psychometrika. 1981;46(4):443-459.
40. Patz RJ, Junker BW. Applications and Extensions of MCMC in IRT: Multiple Item
Types, Missing Data, and Rated Responses. Journal of Educational and Behavioral
Statistics. 1999;24(4):342-366.
41. Drasgow F, Levine MV, Tsien S, Williams B, Mead AD. Fitting Polytomous Item
Response Theory Models to Multiple-Choice Tests. Applied Psychological Measurement.
1995;19(2):143-166.
42. Chang KY, Tsou MY, Chan KH, Chang SH, Tai J, Chen HH. Item analysis for the
written test of Taiwanese board certification examination in anaesthesiology using the
Rasch model. British journal of anaesthesia. 2010;104(6):717-722.
43. Huang Y-F, Tsou M-Y, Chen E-T, Chan K-H, Chang K-Y. Item response analysis on an
examination in anesthesiology for medical students in Taiwan: A comparison of one- and
two-parameter logistic models. Journal of the Chinese Medical
Association. 2010;76(6):344-349.
44. Birnbaum A. Some latent trait models and their use in inferring an examinee’s ability.
Statistical theories of mental test scores. 1968:397–479.
45. Norcini JJ, McKinley DW. Assessment methods in medical education. Teaching and
Teacher Education. 2007;23(3):239-250.
46. Harden RMG, Brown R, Biran L, Ross WD, Wakeford R. Multiple choice questions: to
guess or not to guess. Medical education. 2009;10(1):27-32.
47. Tarrant M, Knierim A, Hayes SK, Ware J. The frequency of item writing flaws in
multiple-choice questions used in high stakes nursing assessments. Nurse education in
practice. 2006;6(6):354-363.
48. Schuwirth LWT, Vleuten CPM, Donkers H. A closer look at cueing effects in multiple-
choice questions. Medical Education. 1996;30(1):44-49.
49. Brady A. Assessment of learning with multiple-choice questions. Nurse Education in
Practice. 2005;5(4):238-242.
50. Tarrant M, Knierim A, Hayes SK, Ware J. The frequency of item writing flaws in
multiple-choice questions used in high stakes nursing assessments. Nurse Education
Today. 2006;26(8):662-671.
51. McCoubrie P. Improving the fairness of multiple-choice questions: a literature review.
Medical Teacher. 2004;26(8):709-712.
52. Fox J. The multiple choice tutorial: its use in the reinforcement of fundamentals in
medical education. Med Educ. 1983;17:90-94.
53. Laura TF. Using feedback to reduce students' judgment bias on test questions. Journal of
Nursing Education. 2001;40(1):10-22.
54. Downing SM. The effects of violating standard item writing principles on tests and
students: the consequences of using flawed test items on achievement examinations in
medical education. Advances in Health Sciences Education. 2005;10(2):133-143.
55. Spearman C. The proof and measurement of association between two things. The
American Journal of Psychology. 1904;15(1):72-101.
56. Harvill LM. Standard Error of Measurement. Educational measurement: Issues and
practice. 1991;10(2):33-41.
57. Cronbach L. Coefficient alpha and the internal structure of tests. Psychometrika.
1951;16(3):297-334.
58. Traub RE, Rowley GL. Understanding reliability. Educational measurement: Issues and
practice. 1991;10(1):37-45.
59. DeVellis RF. Classical test theory. Medical care. 2006;44(11):S50.
60. Crocker L, Algina J. Introduction to classical and modern test theory. New York: Holt,
Rinehart, and Winston. 1986.
61. Lord FM. Applications of item response theory to practical testing problems: Lawrence
Erlbaum Associates New Jersey; 1980.
62. Dent J, Harden RM. A Practical Guide for Medical Teachers E-Book: Churchill
Livingstone; 2009.
63. Gay LR, Mills GE, Airasian PW. Educational research: Competencies for analysis and
applications. 2006.
64. Cox M, Irby DM, Epstein RM. Assessment in medical education. New England Journal
of Medicine. 2007;356(4):387-396.
65. Downing SM. Reliability: on the reproducibility of assessment data. Medical education.
2004;38(9):1006-1012.
66. Cortina JM. What is coefficient alpha? An examination of theory and applications.
Journal of applied psychology. 1993;78(1):98-104.
67. Sijtsma K. On the use, the misuse, and the very limited usefulness of Cronbach’s
alpha. Psychometrika. 2009;74(1):107-120.
68. Tavakol M, Dennick R. Making sense of Cronbach's alpha. International journal of
medical education. 2011;2:53-55.
69. Gliem JA, Gliem RR. Calculating, interpreting, and reporting Cronbach’s alpha
reliability coefficient for Likert-type scales. In; 2003: Midwest Research-to-Practice
Conference in Adult, Continuing, and Community Education; 2003.
70. Phinney JS. The multigroup ethnic identity measure a new scale for use with diverse
groups. Journal of adolescent research. 1992;7(2):156-176.
71. De Champlain AF. A primer on classical test theory and item response theory for
assessments in medical education. Medical education. 2010;44(1):109-117.
72. Sim S, Rasiah RI. Relationship between item difficulty and discrimination indices in
true/false-type multiple choice questions of a para-clinical multidisciplinary paper.
Annals-Academy of Medicine Singapore. 2006;35(2):67-72.
73. Ebel RL. Measuring educational achievement: Prentice-hall Englewood Cliffs, NJ; 1965.
74. DeVellis RF. Classical test theory. Medical care. 2006;44(11):S50-S59.
75. Wells CS, Wollack JA. An instructors guide to understanding test reliability. Testing &
Evaluation Services University of Wisconsin. 2003.
76. Hambleton RK. Emergence of Item Response Modeling in Instrument Development and
Data Analysis. Medical care. 2000;38(9):II60-II65.
77. Kolen MJ. Comparison of traditional and item response theory methods for equating
tests. Journal of Educational Measurement. 1981;18(1):1-11.
78. Macdonald P, Paunonen SV. A Monte Carlo comparison of item and person statistics
based on item response theory versus classical test theory. Educational and psychological
measurement. 2002;62(6):921-943.
79. Bechger TM, Maris G, Verstralen HH, Baguin AA. Using classical test theory in
combination with item response theory. Applied Psychological Measurement.
2003;27(5):319-334.
80. Traub RE. Classical test theory in historical perspective. Educational Measurement:
issues and practice. 2005;16(4):8-14.
81. Lord FM, Wingersky MS. Comparison of IRT True-Score and Equipercentile Observed-
Score "Equatings". Applied Psychological Measurement. 1984;8(4):453-461.
82. Oliveri ME, Olson BF, Ercikan K, Zumbo BD. Methodologies for Investigating Item-
and Test-Level Measurement Equivalence in International Large-Scale Assessments.
International Journal of Testing.12(3):203-223.
83. McEldoon K, Cho S-J, Rittle-Johnson B, Society for Research on Educational E.
Measuring Intervention Effectiveness: The Benefits of an Item Response Theory
Approach: Society for Research on Educational Effectiveness.
84. Magno C. Demonstrating the Difference between Classical Test Theory and Item
Response Theory Using Derived Test Data: Online Submission; 2009.
85. Wainer H, Kiely GL. Item clusters and computerized adaptive testing: A case for testlets.
Journal of Educational Measurement. 1987;24(3):185-201.
86. Cooke DJ, Michie C. An item response theory analysis of the Hare Psychopathy
Checklist--Revised. Psychological assessment. 1997;9(1):3-10.
87. Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement
in the 21st century. Medical care. 2000;38(9 Suppl):II28-II42.
88. Fraley RC, Waller NG, Brennan KA. An item response theory analysis of self-report
measures of adult attachment. Journal of personality and social psychology.
2000;78(2):350-365.
89. Hulin CL, Drasgow F, Komocar J. Applications of item response theory to analysis of
attitude scale translations. Journal of Applied Psychology. 1982;67(6):818-825.
90. Saha TD, Chou SP, Grant BF. Toward an alcohol use disorder continuum using item
response theory: results from the National Epidemiologic Survey on Alcohol and Related
Conditions. Psychological medicine. 2006;36(7):931-942.
91. Bolt DM, Hare RD, Vitale JE, Newman JP. A Multigroup Item Response Theory
Analysis of the Psychopathy Checklist-Revised. Psychological assessment.
2004;16(2):155-168.
92. Justice LM, Bowles RP, Skibbe LE. Measuring preschool attainment of print-concept
knowledge: a study of typical and at-risk 3-to 5-year-old children using item response
theory. Language, Speech & Hearing Services in Schools. 2006;37(3):460-476.
93. Scherbaum CA, Cohen-Charash Y, Kern MJ. Measuring General Self-Efficacy A
Comparison of Three Measures Using Item Response Theory. Educational and
psychological measurement. 2006;66(6):1047-1063.
94. Downing SM. Item response theory: applications of modern test theory in medical
education. Medical Education. 2003;37(8):739-745.
95. Hambleton RK. Fundamentals of item response theory: Sage Publications, Incorporated;
1991.
96. Steinberg L, Thissen D. Uses of Item Response Theory and the Testlet Concept in the
Measurement of Psychopathology. Psychological Methods.
1996;1(1):81-97.
97. Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of item response theory:
Sage; 1991.
98. Reise SP, Ainsworth AT, Haviland MG. Item Response Theory: Fundamentals,
Applications, and Promise in Psychological Research. Current Directions in
Psychological Science. 2005;14(2):95-101.
99. van der Linden WJ, Hambleton RK. Handbook of modern item response theory:
Springer; 1997.
100. Hambleton RK, Van der Linden WJ. Advances in item response theory and applications:
An introduction. 1982.
101. Samejima F. Estimation of latent ability using a response pattern of graded scores.
Psychometrika Monograph Supplement. 1969;34(4, Pt. 2):100.
102. Hambleton RK, Cook LL. Latent Trait Models and Their Use in the Analysis of
Educational Test Data. Journal of Educational Measurement. 1977;14(2):75-96.
103. Wright B. Rasch measurement models. Advances in measurement in educational
research and assessment. 1999:85-97.
104. Mislevy RJ. Foundations of a new test theory. Test theory for a new generation of tests.
1993:19-39.
105. Linacre JM. A user's guide to WINSTEPS MINISTEP Rasch-model computer programs.
Chicago: Winsteps com. 2005.
106. Guyer R, Thompson N. User's manual for Xcalibre 4.1. In: St. Paul MN: Assessment
Systems Corporation.
107. Hambleton RK, Swaminathan H. Item response theory: Principles and applications:
Boston; 1985.
108. Edelen MO, Reeve BB. Applying item response theory (IRT) modeling to questionnaire
development, evaluation, and refinement. Quality of Life Research. 2007;16:5-18.
109. Van Alphen A, Halfens R, Hasman A, Imbos T. Likert or Rasch? Nothing is more
applicable than a good theory. Journal of Advanced Nursing. 1994;20:196 - 201.
110. Wainer H, Thissen D. How is reliability related to the quality of test scores? What is the
effect of local dependence on reliability? Educational measurement: Issues and practice.
1996;15(1):22-29.
111. Hambleton R, Rogers H, Swaminathan H. Fundamentals of item response theory: Sage
Publ.; 1995.
112. De Ayala RJ. Theory and practice of item response theory: Guilford Publications; 2009.
113. Hambleton R, Slater S. Item response theory models and testing practices: Current
international status and future directions. European Journal of Psychological Assessment.
1997;13:20-28.
114. Hambleton RK. Item response theory: a broad psychometric framework for measurement
advances 1, 2. Psicothema. 1994;6(3):535-556.
115. Harris D. Comparison of 1 , 2 , and 3 Parameter IRT Models. Educational measurement:
Issues and practice. 1989;8(1):35-41.
116. Lawson S. One parameter latent trait measurement: Do the results justify the effort.
Advances in educational research: Substantive findings, methodological developments.
1991;1:159-168.
117. Tavakol M, Dennick R. Psychometric evaluation of a knowledge based examination
using Rasch analysis: An illustrative guide: AMEE Guide No. 72. Medical Teacher.
(0):1-11.
118. Van Batenburg T, Laros J. Graphical analysis of test items. Educational Research and
Evaluation. 2002;8:319 - 333.
119. May K, Jackson TS. IRT Item Parameters and the Reliability and Validity of Pretest,
Posttest, and Gain Scores. International Journal of Testing. 2005;5(1):11-18.
120. Swanson DB, Holtzman KZ, Allbee K, Clauser BE. Psychometric Characteristics and
Response Times for Content-Parallel Extended-Matching and One-Best-Answer Items in
Relation to Number of Options. Academic Medicine. 2006;81(10):S52-S55.
121. Yang S-C, Tsou M-Y, Chen E-T, Chan K-H, Chang K-Y. Statistical item analysis of the
examination in anesthesiology for medical students using the Rasch model. Journal of the
Chinese Medical Association.74(3):125-129.
122. Gonzalves F, Gamerman D, Soares T. Simultaneous multifactor DIF analysis and
detection in Item Response Theory. Computational Statistics & Data Analysis.59:144-
160.
123. Wang N. Use of the Rasch IRT Model in Standard Setting: An Item Mapping Method.
Journal of Educational Measurement. 2003;40(3):231-253.
124. De Champlain AF, Melnick D, Scoles P, Subhiyah R, Holtzman K, Swanson D, et al.
Assessing medical students' clinical sciences knowledge in France: a collaboration
between the NBME and a consortium of French medical schools. Academic Medicine.
2003;78(5):509-517.
125. Linn RL. Has Item Response Theory Increased the Validity of Achievement Test Scores?
Applied Measurement in Education. 1990;3(2):115-141.
126. Kreiter C, Ferguson K, Gruppen L. Evaluating the usefulness of computerized adaptive
testing for medical in-course assessment. Academic Medicine. 1999;74:1125 - 1128.
127. Thissen D, Orlando M. Item response theory for items scored in two categories. Test
scoring. 2001:73–140.
128. Andersen E, Madsen M. Estimating the parameters of the latent population distribution.
Psychometrika. 1977;42(3):357-374.
129. Williams VSL, Pommerich M, Thissen D. A comparison of developmental scales based
on Thurstone methods and item response theory. Journal of Educational Measurement.
1998;35(2):93-107.
130. Hambleton R. Principles and selected applications of item response theory. Educational
measurement. 1989;3:147-200.
131. Weiss DJ, Kingsbury G. Application of computerized adaptive testing to educational
problems. Journal of Educational Measurement. 1984;21(4):361-375.
132. Novick MR. The axioms and principal results of classical test theory. Journal of
Mathematical Psychology. 1966;3(1):1-18.
133. Lawson DM. Applying the Item Response Theory to classroom examinations. Journal of
manipulative and physiological therapeutics. 2006;29(5):393-397.
134. Linacre J, Wright B. A user’s guide to Winsteps Rasch-model computer program. 2001.
In: MESA Press Chicago, IL.
135. Bock RD, Muraki E, Pfeiffenberger W. Item pool maintenance in the presence of item
parameter drift. Journal of Educational Measurement. 1988;25(4):275-285.
136. Cook LL, Eignor DR, Taft HL. A comparative study of the effects of recency of
instruction on the stability of IRT and conventional item parameter estimates. Journal of
Educational Measurement. 1988;25(1):31-45.
137. Bergstrom B, Stahl J, Netzky B. Factors that influence item parameter drift. In: Annual
meeting of the American Educational Research Association, Seattle, WA; 2001.
138. Wells CS, Subkoviak MJ, Serlin RC. The effect of item parameter drift on examinee
ability estimates. Applied Psychological Measurement. 2002;26(1):77-87.
139. Babcock B, Albano AD. Rasch scale stability in the presence of item parameter and trait
drift. Applied Psychological Measurement. 2012;36(7):565-580.
140. Donoghue JR, Isham SP. A comparison of procedures to detect item parameter drift.
Applied Psychological Measurement. 1998;22(1):33-51.
141. Kim W, Nering M. Evaluation of equating items using DFIT. In: Annual Meeting of the
National Council on Measurement in Education, Chicago, IL; 2007.
142. Babcock B, Albano A, Raymond M. Nominal Weights Mean Equating: A Method for
Very Small Samples. Educational and psychological measurement. 2012;72(4):608-628.
143. Wollack JA, Cohen AS, Wells CS. A Method for Maintaining Scale Stability in the
Presence of Test Speededness. Journal of Educational Measurement. 2003;40(4):307-
330.
144. Mandin H, Harasym P, Eagle C, Watanabe M. Developing a "clinical presentation"
curriculum at the University of Calgary. Academic Medicine. 1995;70(3):186-193.
145. Woloschuk W, Harasym P, Mandin H, Jones A. Use of schema based problem solving:
an evaluation of the implementation and utilization of schemes in a clinical presentation
curriculum. Medical education. 2000;34(6):437-442.
146. Breithaupt K, Ariel AA, Hare DR. Assembling an inventory of multistage adaptive
testing systems. In: Elements of adaptive testing: Springer. p. 247-266.
147. Gao F, Chen L. Bayesian or non-Bayesian: A comparison study of item parameter
estimation in the three-parameter logistic model. Applied Measurement in Education.
2005;18(4):351-380.
148. Gay LR, Airasian PW. Educational research: Competencies for analysis and application.
2000.
149. Weir JP. Quantifying test-retest reliability using the intraclass correlation coefficient and
the SEM. The Journal of Strength & Conditioning Research. 2005;19(1):231-240.
150. Hojat M, Xu G. A visitor's guide to effect sizes–statistical significance versus practical
(clinical) importance of research findings. Advances in Health Sciences Education.
2004;9(3):241-249.
151. Güler N, Uyanık GK, Teker GT. Comparison of classical test theory and item response
theory in terms of item parameters. European Journal of Research on
Education. 2013;2(1):1-6.
152. Fan X. Item response theory and classical test theory: An empirical comparison of their
item/person statistics. Educational and psychological measurement. 1998;58(3):357-381.
153. Cohen J. Statistical Power Analysis. Current Directions in Psychological Science.
1992;1(3):98-101.
154. Garet MS, Porter AC, Desimone L, Birman BF, Yoon KS. What makes professional
development effective? Results from a national sample of teachers. American
Educational Research Journal. 2001;38(4):915-945.
155. Hill HC, Rowan B, Ball DL. Effects of teachers' mathematical knowledge for teaching on
student achievement. American Educational Research Journal. 2005;42(2):371-406.
156. Hingorjo MR, Jaleel F. Analysis of one-best MCQs: the difficulty index, discrimination
index and distractor efficiency analysis. Journal of Pakistan Medical Association. 2012;
157. Baxi S, Parmar R, Parmar D, Tripathi C. Item Analysis of MCQ from Presently Available
MCQ Books. The Practising Doctor.
158. McGahee TW, Ball J. How to read and really use an item analysis. Nurse educator.
2009;34(4):166-171.
159. Carroll RG. Evaluation of vignette-type examination items for testing medical
physiology. The American journal of physiology. 1993;264(6 Pt 3):S11-15.
160. Hambleton RK, Slater SC. Item response theory models and testing practices: current
international status and future directions. European Journal of Psychological Assessment.
1997;13(1):21-28.
161. Courville TG. An empirical comparison of item response theory and classical test theory
item/person statistics: Texas A&M University; 2004.
162. Frisbie DA. Reliability of Scores From Teacher-Made Tests. Educational measurement:
Issues and practice. 1988;7(1):25-35.
163. Charter RA. Sample size requirements for precise estimates of reliability,
generalizability, and validity coefficients. Journal of Clinical and Experimental
Neuropsychology. 1999;21(4):559-566.
164. Quadrelli S, Davoudi M, Galandez F, Colt HG. Reliability of a 25-item low-stakes
multiple-choice assessment of bronchoscopic knowledge. CHEST Journal.
2009;135(2):315-321.
165. Baig LA, Violato C. Temporal stability of objective structured clinical exams: a
longitudinal study employing item response theory. BMC Medical
Education. 2012;12(1):121.
APPENDIX A: Course 3
List of Tables
Table 35: App A1: Item Diff (p) and p-bis Correl of Course 3 Using CTT
Table 36: App A2: Diff (b) and Discrim (a) Indices of Course 3 Using IRT
Table 37: App A3: Correl Coeff of Difficulty Index b/w CTT and IRT for Course 3
Table 38: App A4: Correl Coeff of p-bis and Discrim b/w CTT and IRT for Course 3
Table 39: App A5: SE and Reliability Index (Alpha w/o) Course 3 Year 1
Table 40: App A6: SE and Reliability Index (Alpha w/o) Course 3 Year 2
Table 41: App A7: SE and Reliability Index (Alpha w/o) Course 3 Year 3
Table 42: App A9: Correl Coeff of Difficulty Index of CTT for Course 3 Year 1, 2, 3
Table 43: App A10: Correl Coeff of Difficulty Index of IRT for Course 3 Year 1, 2, 3
Table 44: App A11: Correl Coeff of Discrim Index of CTT for Course 3 Year 1, 2, 3
Table 45: App A12: Correl Coeff of Discrim Index of IRT for Course 3 Year 1, 2, 3
List of Figures
Figure 22: App A8: Item Characteristic Curves for Course 3 for Year 1, 2, 3
Figure 23: App A13: Scatter Plots for Item Difficulty Using CTT for Course 3
Figure 24: App A14: Scatter Plots of Item Difficulty Using IRT for Course 3
Figure 25: App A15: Scatter Plots of Item Discrim (p-bis) Using CTT for Course 3
Figure 26: App A16: Scatter Plots of Item Discrim Using IRT for Course 3
Figure 27: App A17: Test Characteristic Curves for Course 3
Table 35: Appendix A1 - Item Difficulty (p) and Point Biserial (p-bis) Correlation of
Course 3 Using CTT
Year 1 Year 2 Year 3
ID p p-bis p p-bis p p-bis
1 0.887 0.149 0.793 0.320 0.777 0.171
2 0.411 0.006 0.631 0.123 0.480 0.037
3 0.556 0.114 0.777 0.010 0.731 0.069
4 0.821 0.111 0.810 0.095 0.777 0.032
5 0.344 0.256 0.670 0.138 0.434 0.109
6 0.623 0.051 0.827 0.084 0.651 0.147
7 0.762 0.204 0.866 0.102 0.903 0.094
8 0.550 0.334 0.927 0.044 0.909 0.001
9 0.722 0.290 0.737 0.241 0.623 0.295
10 0.801 0.160 0.782 0.171 0.731 0.037
11 0.848 0.154 0.726 0.052 0.766 0.093
12 0.815 0.211 0.682 0.182 0.611 0.247
13 0.722 0.198 0.721 0.242 0.789 0.127
14 0.801 0.183 0.939 0.106 0.863 0.144
15 0.768 0.207 0.659 0.075 0.617 0.302
16 0.662 0.219 0.581 0.147 0.503 0.116
17 0.768 0.247 0.844 0.134 0.800 0.127
18 0.768 0.155 0.715 0.259 0.697 0.152
19 0.775 0.145 0.771 0.088 0.714 0.119
20 0.874 0.274 0.737 0.123 0.869 0.165
21 0.788 0.011 0.816 0.025 0.806 0.111
22 0.788 0.415 0.849 0.010 0.806 0.162
23 0.623 0.160 0.877 0.088 0.840 0.213
24 0.344 0.320 0.659 0.296 0.309 0.091
27 0.755 0.119 0.821 0.198 0.766 0.197
28 0.775 0.189 0.860 0.127 0.869 0.085
29 0.788 0.011 0.816 0.025 0.806 0.111
30 0.788 0.415 0.849 0.010 0.806 0.162
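The two columns reported per year in Table 35 can be illustrated with a short sketch. It assumes that p is the proportion of examinees answering the item correctly and that p-bis is the uncorrected point-biserial correlation between the 0/1 item score and the total score (the item is left in the total). The function name `item_stats` and the toy response matrix are hypothetical; the thesis obtained its values from SPSS, and whether a corrected item-total correlation was used is not stated here.

```python
import math

def item_stats(responses):
    """Return a list of (p, p_bis) tuples, one per item.

    responses: list of examinee rows, each a list of 0/1 item scores.
    p_bis is the uncorrected point-biserial (item included in the total).
    """
    n = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    # Population standard deviation of the total scores
    sd_total = math.sqrt(sum((t - mean_total) ** 2 for t in totals) / n)

    stats = []
    for j in range(n_items):
        scores = [row[j] for row in responses]
        p = sum(scores) / n
        if p in (0.0, 1.0) or sd_total == 0:
            # The correlation is undefined when the item or the total has no variance
            stats.append((p, float("nan")))
            continue
        mean_correct = sum(t for t, s in zip(totals, scores) if s == 1) / sum(scores)
        p_bis = (mean_correct - mean_total) / sd_total * math.sqrt(p / (1 - p))
        stats.append((p, p_bis))
    return stats

# Hypothetical toy data: 4 examinees, 3 items (not taken from the thesis)
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
stats = item_stats(responses)
for item_id, (p, p_bis) in enumerate(stats, start=1):
    print(f"item {item_id}: p={p:.3f}  p-bis={p_bis:.3f}")
```

A higher p means an easier item, and a p-bis near zero means the item does little to separate high scorers from low scorers.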
Table 36: Appendix A2 - Difficulty (b) and Discrimination (a) Indices of Course 3 Using
IRT
Year 1 Year 2 Year 3
ID a b a b a b
1 0.377 -2.114 0.603 0.572 0.351 -0.853
2 0.234 2.215 0.322 1.209 0.212 1.547
3 0.307 0.903 0.257 -0.681 0.314 -0.568
4 0.402 -1.044 0.238 -1.382 0.360 -0.768
5 0.467 2.297 0.306 0.820 0.379 1.818
6 0.339 0.445 0.350 -0.575 0.429 0.439
7 0.463 -0.296 0.372 -0.939 0.535 -1.317
8 0.517 1.093 0.392 -1.829 0.467 -1.697
9 0.534 0.125 0.398 0.552 0.552 0.750
10 0.456 -0.638 0.364 0.006 0.398 -0.206
11 0.490 -0.946 0.286 0.172 0.439 -0.346
12 0.505 -0.602 0.353 0.885 0.512 0.779
13 0.386 0.203 0.292 0.497 0.542 -0.209
14 0.408 -0.405 0.396 -0.249 0.506 -0.168
15 0.472 -0.372 0.386 -0.764 0.512 -1.010
18 0.430 -0.443 0.408 0.767 0.462 0.203
19 0.437 -0.474 0.317 -0.128 0.436 0.028
20 0.565 -0.972 0.326 0.274 0.559 -0.857
21 0.379 0.380 0.324 1.652 0.401 1.396
22 0.479 -0.793 0.345 1.098 0.392 -0.456
23 0.391 0.543 0.359 -1.184 0.563 -0.592
24 0.527 2.218 0.418 1.211 0.420 2.612
25 0.452 0.418 0.404 -0.884 0.402 0.233
26 0.386 0.203 0.292 0.497 0.542 -0.209
27 0.408 -0.405 0.396 -0.249 0.506 -0.168
28 0.472 -0.372 0.386 -0.764 0.512 -1.010
29 0.365 -0.873 0.307 -0.729 0.476 -0.551
30 0.699 -0.054 0.306 -1.199 0.499 -0.483
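As a reading aid for the a and b columns of Table 36, the following sketch evaluates the two-parameter logistic (2PL) item characteristic function for one tabulated item. The function name `icc_2pl` and the scaling constant `D` are assumptions: Xcalibre can report parameters on either the logistic (D = 1.0) or the normal-ogive-like (D = 1.7) metric, and the metric used for these tables is not stated in this appendix, so D = 1.0 is only a default.

```python
import math

def icc_2pl(theta, a, b, D=1.0):
    """Probability of a correct response under the 2PL model.

    theta: examinee ability, a: discrimination, b: difficulty.
    D is a scaling constant (assumed 1.0 here; 1.7 is also common).
    """
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Item 1, Course 3, Year 1: a and b as listed in Table 36
a, b = 0.377, -2.114
for theta in (-3.0, b, 0.0, 3.0):
    print(f"theta={theta:+.3f}  P(correct)={icc_2pl(theta, a, b):.3f}")
```

Whatever value D takes, the probability equals 0.5 when theta equals b, which is why b is read as the item's difficulty; a controls how steeply the curve rises around that point.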
Table 37: Appendix A3 - Correlation Coefficients of Difficulty Index Between CTT and
IRT for Course 3
Year 1 Year 2 Year 3
p-b p-b p-b
-0.972 -0.950 -0.982
Table 38: Appendix A4 - Correlation Coefficients of Point Biserial and Discrimination
Index Between CTT and IRT for Course 3
Year 1 Year 2 Year 3
pbis-a pbis-a pbis-a
0.741 0.894 0.690
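The negative sign of the p-b coefficients in Table 37 follows from the direction of the two scales: a larger p means an easier item, while a larger b means a harder one. The sketch below, assuming a plain Pearson correlation, shows this on the Year 1 values of items 1 to 5 only (p from Table 35, b from Table 36). The helper name `pearson_r` is hypothetical, and a five-item subset is not expected to reproduce the coefficient reported for the full item set.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Year 1 values for items 1-5 of Course 3
p_values = [0.887, 0.411, 0.556, 0.821, 0.344]    # CTT difficulty (Table 35)
b_values = [-2.114, 2.215, 0.903, -1.044, 2.297]  # IRT difficulty (Table 36)

r = pearson_r(p_values, b_values)
print(f"r(p, b) = {r:.3f}")  # strongly negative for this subset
```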
Table 39: Appendix A5 - SE and Reliability Index (Alpha w/o) Course 3 Year 1
Item ID a aSE b bSE Alpha w/o
Item 1 0.377 0.131 -2.114 0.412 0.619
Item 2 0.234 0.326 2.215 0.424 0.637
Item 3 0.307 0.307 0.903 0.325 0.634
Item 4 0.402 0.145 -1.044 0.319 0.622
Item 5 0.467 0.191 2.297 0.232 0.608
Item 6 0.339 0.246 0.445 0.302 0.630
Item 7 0.463 0.163 -0.296 0.255 0.614
Item 9 0.534 0.175 0.125 0.214 0.605
Item 10 0.456 0.151 -0.638 0.274 0.618
Item 11 0.490 0.141 -0.946 0.282 0.619
Item 12 0.505 0.148 -0.602 0.256 0.614
Item 13 0.429 0.179 -0.085 0.261 0.615
Item 14 0.484 0.151 -0.551 0.260 0.616
Item 15 0.482 0.161 -0.297 0.248 0.614
Item 16 0.472 0.201 0.404 0.228 0.612
Item 17 0.463 0.163 -0.296 0.255 0.614
Item 18 0.430 0.161 -0.443 0.275 0.619
Item 19 0.437 0.159 -0.474 0.273 0.620
Item 20 0.565 0.138 -0.972 0.267 0.610
Item 21 0.379 0.224 0.380 0.276 0.626
Item 22 0.479 0.145 -0.793 0.275 0.612
Item 23 0.391 0.233 0.543 0.265 0.619
Item 24 0.527 0.183 2.218 0.209 0.601
Item 25 0.452 0.206 0.418 0.236 0.613
Item 26 0.386 0.208 0.203 0.275 0.624
Item 27 0.408 0.166 -0.405 0.284 0.622
Item 28 0.472 0.159 -0.372 0.255 0.615
Item 29 0.365 0.155 -0.873 0.329 0.634
Item 30 0.699 0.156 -0.054 0.184 0.594
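The "Alpha w/o" column can be read as Cronbach's alpha recomputed with the item in that row left out. A minimal sketch of that calculation follows, assuming the usual variance-based formula with sample variances. The function names and the toy response matrix are hypothetical and are not the thesis data, which were analysed in SPSS.

```python
def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of examinee rows of item scores."""
    k = len(responses[0])
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def alpha_without(responses, j):
    """Alpha recomputed after dropping item j (0-based index)."""
    reduced = [row[:j] + row[j + 1:] for row in responses]
    return cronbach_alpha(reduced)

# Hypothetical toy data: 5 examinees, 4 items
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
print(f"alpha (all items) = {cronbach_alpha(responses):.3f}")
for j in range(len(responses[0])):
    print(f"alpha without item {j + 1} = {alpha_without(responses, j):.3f}")
```

An item whose removal raises alpha above the full-test value is lowering internal consistency; an item whose removal lowers alpha is contributing to it.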
Table 40: Appendix A6 - SE and Reliability Index (Alpha w/o) Course 3 Year 2
Item ID a aSE b bSE Alpha w/o
Item 1 0.603 0.145 0.572 0.198 0.518
Item 2 0.322 0.213 1.209 0.297 0.509
Item 3 0.257 0.146 -0.681 0.419 0.524
Item 4 0.238 0.135 -1.382 0.476 0.536
Item 5 0.306 0.194 0.820 0.318 0.506
Item 6 0.350 0.132 -0.575 0.342 0.513
Item 7 0.372 0.125 -0.939 0.356 0.511
Item 8 0.286 0.167 0.172 0.355 0.519
Item 10 0.364 0.145 0.006 0.304 0.502
Item 11 0.286 0.167 0.172 0.355 0.519
Item 12 0.353 0.184 0.885 0.281 0.512
Item 13 0.405 0.165 0.713 0.256 0.524
Item 14 0.441 0.124 -1.713 0.419 0.511
Item 15 0.293 0.203 0.883 0.329 0.516
Item 16 0.312 0.242 1.634 0.299 0.505
Item 17 0.363 0.129 -0.689 0.343 0.507
Item 18 0.408 0.167 0.767 0.253 0.548
Item 19 0.317 0.148 -0.128 0.341 0.513
Item 20 0.326 0.161 0.274 0.318 0.508
Item 21 0.324 0.238 1.652 0.289 0.520
Item 22 0.345 0.199 1.098 0.282 0.519
Item 23 0.359 0.123 -1.184 0.380 0.512
Item 24 0.418 0.188 1.211 0.237 0.548
Item 26 0.292 0.125 0.497 0.341 0.517
Item 27 0.396 0.135 -0.249 0.302 0.544
Item 28 0.386 0.127 -0.764 0.338 0.518
Item 29 0.307 0.135 -0.729 0.377 0.521
Item 30 0.306 0.127 -1.199 0.407 0.524
Table 41: Appendix A7 - SE and Reliability Index (Alpha w/o) Course 3 Year 3
Item ID a aSE b bSE Alpha w/o
Item 1 0.351 0.146 -0.853 0.312 0.617
Item 2 0.212 0.449 1.547 0.426 0.640
Item 3 0.314 0.168 -0.568 0.325 0.626
Item 4 0.360 0.147 -0.768 0.303 0.628
Item 5 0.379 0.263 1.818 0.247 0.626
Item 6 0.429 0.208 0.439 0.227 0.621
Item 7 0.535 0.123 -1.317 0.286 0.620
Item 8 0.467 0.121 -1.697 0.331 0.625
Item 9 0.552 0.205 0.750 0.177 0.607
Item 10 0.398 0.166 -0.206 0.260 0.629
Item 11 0.439 0.151 -0.346 0.247 0.623
Item 12 0.512 0.217 0.779 0.189 0.612
Item 13 0.459 0.144 -0.466 0.245 0.620
Item 14 0.527 0.127 -0.902 0.253 0.618
Item 15 0.549 0.208 0.778 0.178 0.606
Item 16 0.407 0.283 1.360 0.229 0.625
Item 18 0.462 0.180 0.203 0.218 0.619
Item 19 0.436 0.173 0.028 0.234 0.622
Item 20 0.559 0.127 -0.857 0.244 0.616
Item 21 0.401 0.285 1.396 0.232 0.626
Item 22 0.392 0.154 -0.456 0.272 0.632
Item 23 0.563 0.132 -0.592 0.225 0.613
Item 24 0.420 0.174 2.612 0.241 0.628
Item 25 0.402 0.197 0.233 0.244 0.625
Item 26 0.542 0.145 -0.209 0.209 0.613
Item 27 0.392 0.154 -0.456 0.272 0.632
Item 28 0.512 0.126 -1.010 0.264 0.622
Item 29 0.476 0.139 -0.551 0.244 0.621
Item 30 0.499 0.139 -0.483 0.234 0.617
Figure 22: Appendix A8 - Item Characteristic Curves for Course 3 for Year 1, 2 and 3
Panels: Year 1, Year 2, Year 3
Correlation Coefficients
Table 42: Appendix A9 - Correlation Coefficients of Difficulty Index of CTT for Course 3
Year 1, 2, 3
Year 1 Year 2 Year 3
Year 1 1 0.604 0.650
Year 2 0.604 1 0.654
Year 3 0.650 0.654 1
Table 43: Appendix A10 - Correlation Coefficients of Difficulty Index of IRT for Course 3
Year 1, 2, 3
Year 1 Year 2 Year 3
Year 1 1 0.676 0.607
Year 2 0.676 1 0.723
Year 3 0.607 0.723 1
Table 44: Appendix A11 - Correlation Coefficient of Discrimination Index of CTT for
Course 3 Year 1, 2, 3
Year 1 Year 2 Year 3
Year 1 1 0.577 0.790
Year 2 0.577 1 0.593
Year 3 0.790 0.593 1
Table 45: Appendix A12 - Correlation Coefficient of Discrimination Index of IRT for
Course 3 Year 1, 2, 3
Year 1 Year 2 Year 3
Year 1 1 0.551 0.747
Year 2 0.551 1 0.770
Year 3 0.747 0.770 1
Figure 23: Appendix A13 – Scatter Plots for Item Difficulty Using CTT for Course 3
Item Difficulty with CTT Year 2 and 3 (x-axis: Year 2; y-axis: Year 3)
Item Difficulty with CTT Year 1 and 2 (x-axis: Year 1; y-axis: Year 2)
Item Difficulty with CTT Year 3 and 1 (x-axis: Year 3; y-axis: Year 1)
Figure 24: Appendix A14 - Scatter Plots of Item Difficulty Using IRT for Course 3
Item Difficulty with IRT Year 1 and 2 (x-axis: Year 1; y-axis: Year 2)
Item Difficulty with IRT Year 2 and 3 (x-axis: Year 2; y-axis: Year 3)
Item Difficulty with IRT Year 3 and 1 (x-axis: Year 3; y-axis: Year 1)
Figure 25: Appendix A15 - Scatter Plots of Item Discrimination (p-bis) with CTT
Item Discrimination with CTT Year 1 and 2 (x-axis: Year 1; y-axis: Year 2)
Item Discrimination with CTT Year 2 and 3 (x-axis: Year 2; y-axis: Year 3)
Item Discrimination with CTT Year 3 and 1 (x-axis: Year 3; y-axis: Year 1)
Figure 26: Appendix A16 - Scatter Plots of Item Discrimination Using IRT for Course 3
Item Discrimination with IRT Year 1 and 2 (x-axis: Year 1; y-axis: Year 2)
Item Discrimination with IRT Year 2 and 3 (x-axis: Year 2; y-axis: Year 3)
Item Discrimination with IRT Year 3 and 1 (x-axis: Year 3; y-axis: Year 1)
Figure 27: Appendix A17 - Test Characteristic Curves for Course 3
Test Characteristic Curve for Course 3, Year 1
Test Characteristic Curve for Course 3, Year 2
Test Characteristic Curve for Course 3, Year 3
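A test characteristic curve of the kind plotted in Figure 27 gives the expected number-correct score at each ability level, obtained by summing the item characteristic curves. The sketch below assumes the 2PL model with a scaling constant D = 1.0 (an assumption; the metric is not stated in this appendix) and, to stay short, uses only the Year 1 parameters of items 1 to 5 from Table 36, so it does not reproduce the full 30-item curve. The names `icc_2pl` and `tcc` are hypothetical.

```python
import math

def icc_2pl(theta, a, b, D=1.0):
    """2PL probability of a correct response (D = 1.0 is an assumption)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, items, D=1.0):
    """Expected number-correct score at ability theta: the sum of the ICCs."""
    return sum(icc_2pl(theta, a, b, D) for a, b in items)

# (a, b) pairs for items 1-5, Course 3, Year 1, from Table 36
items = [
    (0.377, -2.114),
    (0.234, 2.215),
    (0.307, 0.903),
    (0.402, -1.044),
    (0.467, 2.297),
]

for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(f"theta={theta:+.1f}  expected score={tcc(theta, items):.2f} of {len(items)}")
```

Because every ICC increases with theta, the curve rises monotonically and stays between 0 and the number of items.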
APPENDIX B: COURSE 6
List of Tables
Table 46: App B1: Item Diff (p) and p-bis Correlation of Course 6 Using CTT
Table 47: App B2: Difficulty (b) and Discrim (a) Indices of Course 6 Using IRT
Table 48: App B3: Correl Coeff of Difficulty Index b/w CTT and IRT for Course 6
Table 49: App B4: Correl Coeff of p-bis and Discrim b/w CTT and IRT for Course 6
Table 50: App B5: SE and Reliability Index (Alpha w/o) Course 6 Year 1
Table 51: App B6: SE and Reliability Index (Alpha w/o) Course 6 Year 2
Table 52: App B7: SE and Reliability Index (Alpha w/o) Course 6 Year 3
Table 53: App B9: Correl Coeff of Difficulty Index of CTT for Course 6 Year 1, 2, 3
Table 54: App B10: Correl Coeff of Diff Index of IRT for Course 6 Year 1, 2, 3
Table 55: App B11: Correl Coeff of Discrim Index of CTT for Course 6 Year 1, 2, 3
Table 56: App B12: Correl Coeff of Discrim Index of IRT for Course 6 Year 1, 2, 3
List of Figures
Figure 28: App B8 - Item Characteristic Curves for Course 6 for Year 1, 2, 3
Figure 29: App B13 - Scatter Plots for Item Difficulty Using CTT for Course 6
Figure 30: App B14 - Scatter Plots of Item Difficulty Using IRT for Course 6
Figure 31: App B15 - Scatter Plots of Item Discrim (p-bis) Using CTT for Course 6
Figure 32: App B16 - Scatter Plots of Item Discrimination Using IRT for Course 6
Figure 33: App B17 - Test Characteristic Curves for Course 6
Table 46: Appendix B1 - Item Difficulty (p) and Point Biserial (p-bis) Correlation of
Course 6 Using CTT
Year 1 Year 2 Year 3
ID p p-bis p p-bis p p-bis
1 0.909 0.058 0.920 0.149 0.915 0.217
2 0.610 0.014 0.648 0.006 0.341 0.141
3 0.779 0.117 0.858 0.114 0.780 0.013
4 0.994 0.066 0.989 0.111 0.994 0.067
5 0.747 0.248 0.915 0.256 0.829 0.168
6 0.500 0.112 0.682 0.051 0.598 0.048
7 0.851 0.296 0.920 0.204 0.835 0.079
8 0.766 0.401 0.926 0.334 0.872 0.213
9 0.792 0.258 0.682 0.29 0.762 0.051
10 0.721 0.01 0.631 0.16 0.622 0.206
11 0.968 0.084 0.892 0.154 0.890 0.042
12 0.701 0.072 0.705 0.211 0.762 0.096
13 0.662 0.112 0.739 0.198 0.744 0.054
14 0.877 0.216 0.778 0.183 0.805 0.011
15 0.558 0.092 0.727 0.207 0.628 0.088
16 0.877 0.038 0.756 0.219 0.841 0.093
17 0.610 0.269 0.244 0.247 0.256 0.123
18 0.766 0.202 0.824 0.155 0.799 0.105
19 0.500 0.114 0.642 0.145 0.701 0.264
20 0.799 0.214 0.784 0.274 0.866 0.15
21 0.896 0.15 0.847 0.011 0.835 0.131
22 0.701 0.148 0.773 0.415 0.659 0.122
23 0.955 0.278 0.642 0.16 0.707 0.142
24 0.799 0.164 0.864 0.32 0.939 0.16
27 0.708 0.18 0.682 0.119 0.634 0.15
28 0.760 0.162 0.665 0.189 0.720 0.188
29 0.896 0.15 0.847 0.011 0.835 0.131
30 0.701 0.148 0.773 0.415 0.659 0.122
Table 47: Appendix B2 - Difficulty (b) and Discrimination (a) Indices of Course 6 Using
IRT
Year 1 Year 2 Year 3
ID a b a b a b
1 0.482 -1.329 0.421 -2.016 0.780 -0.682
2 0.358 0.926 0.274 0.167 0.459 2.333
3 0.439 -0.099 0.402 -1.164 0.403 -0.502
4 0.713 -2.429 0.726 -2.099 0.738 -2.492
5 0.552 0.424 0.532 -1.196 0.594 -0.312
6 0.411 1.740 0.553 0.663 0.437 0.866
7 0.525 -0.412 0.603 -0.992 0.523 -0.537
8 0.501 0.195 0.688 -0.816 0.679 -0.457
9 0.405 -0.314 0.606 0.724 0.480 -0.098
10 0.539 0.567 0.523 0.901 0.567 0.852
11 0.571 -1.909 0.581 -0.734 0.551 -0.964
12 0.420 0.457 0.638 0.646 0.513 -0.017
13 0.430 0.248 0.388 0.946 0.520 0.584
14 0.536 0.094 0.610 -0.006 0.467 -0.129
15 0.336 1.313 0.522 0.362 0.467 0.724
18 0.446 0.050 0.561 -0.194 0.539 -0.210
19 0.341 1.741 0.436 0.729 0.609 0.494
20 0.467 -0.143 0.515 -0.020 0.620 -0.532
21 0.377 0.954 0.591 -0.044 0.483 -0.175
22 0.442 -0.928 0.631 -0.204 0.625 -1.154
23 0.668 -1.243 0.434 0.725 0.517 0.335
24 0.391 -0.429 0.539 -0.584 0.707 -1.089
25 0.430 0.248 0.388 0.946 0.520 0.584
26 0.536 0.094 0.610 -0.006 0.467 -0.129
27 0.451 0.483 0.487 0.571 0.512 0.740
28 0.432 0.058 0.415 0.547 0.576 0.355
29 0.561 -0.754 0.547 -0.410 0.577 -0.389
30 0.427 0.473 0.448 -0.121 0.500 0.591
Table 48: Appendix B3 - Correlation Coefficients of Difficulty Index Between CTT and
IRT for Course 6
Year 1 Year 2 Year 3
p-b p-b p-b
-0.972 -0.949 -0.970
Table 49: Appendix B4 - Correlation Coefficients of Point Biserial and Discrimination
Index Between CTT and IRT for Course 6
Year 1 Year 2 Year 3
pbis-a pbis-a pbis-a
0.601 0.631 0.637
Table 50: Appendix B5 - SE and Reliability Index (Alpha w/o) Course 6 Year 1
Item ID a aSEM b bSEM Alpha w/o
Item 1 0.482 0.128 -1.329 0.354 0.661
Item 2 0.358 0.264 0.926 0.279 0.655
Item 3 0.439 0.153 -0.099 0.269 0.648
Item 4 0.713 0.191 -2.429 0.674 0.670
Item 5 0.552 0.163 0.424 0.208 0.629
Item 6 0.411 0.316 1.740 0.239 0.660
Item 7 0.525 0.135 -0.412 0.262 0.657
Item 9 0.405 0.150 -0.314 0.295 0.689
Item 10 0.539 0.173 0.567 0.206 0.636
Item 11 0.571 0.140 -1.909 0.447 0.682
Item 12 0.420 0.189 0.457 0.254 0.676
Item 13 0.405 0.215 0.695 0.255 0.671
Item 14 0.552 0.163 0.424 0.208 0.629
Item 15 0.336 0.338 1.313 0.291 0.600
Item 16 0.565 0.132 -0.536 0.265 0.657
Item 17 0.566 0.220 1.199 0.183 0.618
Item 18 0.446 0.158 0.050 0.259 0.672
Item 19 0.341 0.369 1.741 0.285 0.604
Item 20 0.467 0.147 -0.143 0.261 0.668
Item 21 0.377 0.258 0.954 0.266 0.683
Item 22 0.442 0.131 -0.928 0.322 0.683
Item 23 0.668 0.139 -1.243 0.345 0.658
Item 24 0.391 0.148 -0.429 0.307 0.698
Item 25 0.430 0.172 0.248 0.257 0.672
Item 26 0.536 0.149 0.094 0.228 0.650
Item 27 0.451 0.184 0.483 0.240 0.660
Item 28 0.432 0.161 0.058 0.264 0.676
Item 29 0.561 0.130 -0.754 0.285 0.658
Item 30 0.427 0.189 0.473 0.251 0.669
Table 51: Appendix B6 - SE and Reliability Index (Alpha w/o) Course 6 Year 2
Item ID a aSE b bSE Alpha w/o
Item 1 0.421 0.117 -2.016 0.395 0.506
Item 2 0.274 0.232 0.167 0.344 0.533
Item 3 0.402 0.123 -1.164 0.319 0.522
Item 4 0.726 0.157 -2.099 0.513 0.515
Item 5 0.436 0.227 0.729 0.217 0.531
Item 6 0.553 0.188 0.663 0.179 0.575
Item 7 0.603 0.120 -0.992 0.274 0.514
Item 8 0.688 0.122 -0.816 0.252 0.500
Item 9 0.606 0.186 0.724 0.164 0.570
Item 10 0.523 0.223 0.901 0.182 0.584
Item 11 0.581 0.121 -0.734 0.249 0.518
Item 12 0.638 0.174 0.646 0.160 0.557
Item 13 0.562 0.161 0.364 0.186 0.581
Item 14 0.447 0.146 -0.172 0.243 0.538
Item 15 0.522 0.167 0.362 0.197 0.501
Item 16 0.585 0.154 0.302 0.183 0.584
Item 17 0.449 0.150 3.125 0.237 0.535
Item 18 0.561 0.132 -0.194 0.213 0.502
Item 19 0.436 0.227 0.729 0.217 0.531
Item 20 0.515 0.144 -0.020 0.214 0.512
Item 21 0.591 0.136 -0.044 0.198 0.594
Item 22 0.631 0.128 -0.204 0.201 0.518
Item 23 0.434 0.227 0.725 0.219 0.527
Item 24 0.539 0.124 -0.584 0.243 0.520
Item 25 0.388 0.282 0.946 0.238 0.553
Item 27 0.487 0.193 0.571 0.201 0.501
Item 28 0.415 0.211 0.547 0.231 0.540
Item 29 0.547 0.127 -0.410 0.229 0.512
Item 30 0.448 0.148 -0.121 0.240 0.541
Table 52: Appendix B7 - SE and Reliability Index (Alpha w/o) Course 6 Year 3
Item ID a aSE b bSE Alpha w/o
Item 1 0.780 0.130 -0.682 0.224 0.636
Item 2 0.459 0.203 2.333 0.219 0.641
Item 3 0.403 0.150 -0.502 0.281 0.663
Item 4 0.738 0.180 -2.492 0.611 0.658
Item 5 0.594 0.136 -0.312 0.214 0.638
Item 6 0.437 0.261 0.866 0.222 0.660
Item 7 0.523 0.135 -0.537 0.243 0.652
Item 8 0.679 0.131 -0.457 0.211 0.633
Item 9 0.480 0.157 -0.098 0.232 0.657
Item 10 0.567 0.219 0.852 0.176 0.628
Item 11 0.551 0.126 -0.964 0.270 0.656
Item 12 0.513 0.156 -0.017 0.217 0.649
Item 13 0.485 0.164 0.041 0.224 0.657
Item 14 0.469 0.142 -0.448 0.252 0.663
Item 15 0.467 0.232 0.724 0.211 0.652
Item 16 0.542 0.134 -0.530 0.238 0.650
Item 17 0.502 0.158 2.802 0.219 0.645
Item 19 0.609 0.178 0.494 0.174 0.618
Item 20 0.620 0.130 -0.532 0.225 0.642
Item 21 0.483 0.152 -0.175 0.234 0.659
Item 22 0.625 0.126 -1.154 0.284 0.655
Item 23 0.517 0.180 0.335 0.203 0.641
Item 24 0.707 0.130 -1.089 0.275 0.644
Item 25 0.520 0.202 0.584 0.195 0.641
Item 26 0.542 0.134 -0.530 0.238 0.650
Item 27 0.512 0.221 0.740 0.195 0.639
Item 28 0.576 0.172 0.355 0.186 0.633
Item 29 0.577 0.135 -0.389 0.222 0.644
Item 30 0.500 0.207 0.591 0.202 0.645
Figure 28: Appendix B8 - Item Characteristic Curves for Course 6 for Year 1, 2 and 3
Panels: Year 1, Year 2, Year 3
Correlation Coefficients
Table 53: Appendix B9 - Correlation Coefficient of Difficulty Index of CTT for Course 6
Year 1, 2, 3
Year 1 Year 2 Year 3
Year 1 1 0.527 0.661
Year 2 0.527 1 0.852
Year 3 0.661 0.852 1
Table 54: Appendix B10 - Correlation Coefficient of Difficulty Index of IRT for Course 6
Year 1, 2, 3
Year 1 Year 2 Year 3
Year 1 1 0.585 0.727
Year 2 0.585 1 0.778
Year 3 0.727 0.778 1
Table 55: Appendix B11 - Correlation Coefficient of Discrimination Index of CTT for
Course 6 Year 1, 2, 3
Year 1 Year 2 Year 3
Year 1 1 0.661 0.651
Year 2 0.661 1 0.620
Year 3 0.651 0.620 1
Table 56: Appendix B12 – Correlation Coefficient of Discrimination Index of IRT for
Course 6 Year 1, 2, 3
Year 1 Year 2 Year 3
Year 1 1 0.530 0.610
Year 2 0.530 1 0.675
Year 3 0.610 0.675 1
Figure 29: Appendix B13 - Scatter Plots for Item Difficulty Using CTT for Course 6
Item Difficulty with CTT Year 1 and 2 (x-axis: Year 1; y-axis: Year 2)
Item Difficulty with CTT Year 3 and 1 (x-axis: Year 3; y-axis: Year 1)
Item Difficulty with CTT Year 2 and 3 (x-axis: Year 2; y-axis: Year 3)
Figure 30: Appendix B14 - Scatter Plots of Item Difficulty Using IRT for Course 6
Item Difficulty with IRT Year 1 and 2 (x-axis: Year 1; y-axis: Year 2)
Item Difficulty with IRT Year 2 and 3 (x-axis: Year 2; y-axis: Year 3)
Item Difficulty with IRT Year 3 and 1 (x-axis: Year 3; y-axis: Year 1)
Figure 31: Appendix B15 - Scatter Plots of Item Discrimination (P-bis) Using CTT for
Course 6
Item Discrimination with CTT Year 2 and 3 (x-axis: Year 2; y-axis: Year 3)
Item Discrimination with CTT Year 3 and 1 (x-axis: Year 3; y-axis: Year 1)
Item Discrimination with CTT Year 1 and 2 (x-axis: Year 1; y-axis: Year 2)
Figure 32: Appendix B16 - Scatter Plots of Item Discrimination Using IRT for Course 6
Item Discrimination with IRT Year 1 and 2 (x-axis: Year 1; y-axis: Year 2)
Item Discrimination with IRT Year 2 and 3 (x-axis: Year 2; y-axis: Year 3)
Item Discrimination with IRT Year 3 and 1 (x-axis: Year 3; y-axis: Year 1)
Figure 33: Appendix B17 - Test Characteristic Curves for Course 6
Test Characteristic Curve for Course 6, Year 1
Test Characteristic Curve for Course 6, Year 2
Test Characteristic Curve for Course 6, Year 3
APPENDIX C: ICCs for COURSE 1
Panels: Year 1, Year 2, Year 3
Items shown: 1, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30