examining the untestable assumptions of the chained linear linking for livingston score adjustment...
8/13/2019 EXAMINING THE UNTESTABLE ASSUMPTIONS OF THE CHAINED LINEAR LINKING FOR LIVINGSTON SCORE ADJUSTMENT WITH APPLICATION TO TH
EXAMINING THE UNTESTABLE ASSUMPTIONS OF THE CHAINED LINEAR LINKING FOR LIVINGSTON SCORE ADJUSTMENT WITH APPLICATION TO
THE 2005 MSCE MATHEMATICS PAPER 2.
M.Ed (Testing, Measurement and Evaluation) Thesis
By CHIFUNDO STEVEN AZIZI
BSc (Ed) Mzuzu University
Submitted to the Department of Educational Foundations, Faculty of Education,
in partial fulfilment of the requirements for the degree of
Master of Education (Testing, Measurement and Evaluation)
University of Malawi
Chancellor College
June, 2009
DECLARATION
I, the undersigned, hereby declare that this thesis is my own original work which has not
been submitted to any other institution for similar purposes. Where other people's work
has been used, acknowledgements have been made.
____________________________________
Full Legal Name
_____________________________________
Signature
_____________________________________
Date
Certificate of Approval
The undersigned certify that this thesis represents the student's own work and effort and has been submitted with our approval.
Signature: ____________________________ Date: __________________________
M. Kazima PhD (Senior Lecturer)
Main Supervisor
Signature: ____________________________ Date: __________________________
L. Kazembe PhD (Senior Lecturer)
Member, Supervisory Committee
To the memory of my late father, Charles Frank Azizi and late brother, Charles Mike
Azizi. May their souls rest in peace!
ACKNOWLEDGEMENTS
I would like to thank Dr. M. Kazima and Dr. L. Kazembe, my main supervisor and
co-supervisor respectively, for their many suggestions and constant support during this
research. Without them this work would never have come into existence.
I also wish to thank the headteachers of Blantyre, Henry Henderson Institute,
Bangwe, Chiradzulu, and Njamba secondary schools for allowing me to collect data from
their institutions. My gratitude also goes to the Executive Director of the Malawi
National Examinations Board (MANEB) for authorising me to use the 2005 MSCE mathematics
examination paper 2. My sincere appreciation also goes to the students who participated in
this study; you really helped me a lot.
I am grateful to my mum, my fiancée, brothers and sisters for their love and
financial support. Special mention goes to the Ministry of Education for funding my tuition
fees. Finally, words alone cannot express my gratitude to the Almighty God who made it
possible for me to complete this study and for the infinite blessings.
ABSTRACT
MSCE mathematics paper 2, like many high-stakes test formats, includes a section
of optional questions in addition to a mandatory part. It has been argued that offering
options and comparing final scores is often not fair to examinees, especially to those who
attempt the most difficult questions from the optional part. Livingston (1988) proposed a way
of adjusting essay scores. This was later explained from the perspective of test equating by
Allen, Holland, and Thayer (1993), who concluded that the proposal made implicit
assumptions of chained linear equating about the unobserved data. This study tested
these assumptions with application to the 2005 MSCE mathematics examination paper 2 so as
to determine whether the Livingston score adjustment could be used on this examination.
The study used systematic sampling to obtain examinees from five purposively
selected secondary schools. The 2005 MSCE mathematics paper 2 was administered to
247 examinees in two parts, section A followed by section B. For section B, examinees
were asked to first indicate their choice of three optional questions and were then
instructed to answer all of the questions.
The results were analysed using Root Mean Square Difference (RMSD) and Root
Expected Mean Square Difference (REMSD) to quantify the differences between the
subgroups' linking functions of unobserved and observed data. It was found that group
invariance did not hold across all the subgroups that were involved. This means that
the Livingston score adjustment would not be possible on this examination. It is
recommended that, in order to minimize inequity in optional question scores, item writers
use analytical methods to strictly match the levels of cognitive demand of topics by
using MSCE mathematics performance level descriptors when constructing the optional
items.
TABLE OF CONTENTS
DEDICATION
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS AND ABBREVIATIONS
CHAPTER
1 INTRODUCTION
1.1 Background
1.1.1 Characteristic of the examination investigated
1.1.2 Grade Awarding Process
1.1.3 Comparability of optional questions raw scores
1.1.4 Livingston's raw score adjustment
1.2 Statement of the Problem
1.2.1 Purpose of the Study
1.2.2 Research Questions
1.2.3 Significance of the study
1.3 Theoretical Framework
1.4 Definition of terms
2 LITERATURE REVIEW
2.1 Introduction
2.2 General information on optional questions
2.3 Advantages of optional questions
2.4 Problems of optional questions
2.4.1 The syllabus
2.4.2 The abilities of candidates
2.5 Relationship between candidates' question choice and getting high scores
2.6 Linking and Equating
2.7 Can we link or equate optional questions?
2.8 What are the consequences of not linking/equating optional questions scores?
3 METHODOLOGY
3.1 Introduction
3.2 The Research Questions
3.3 The Design
3.3.1 Description of the Research
3.3.2 Population
3.3.3 Sampling
3.3.4 Instruments
3.3.5 The administration of the instruments and data gathering
3.4 Data Analysis
3.4.1 Extent of difficulty in optional questions
3.4.2 Correlation of scores on section B and total scores of the section A
3.4.3 Establishing group invariance on linking/equating functions of examinees that chose a concerned optional question and for those that selected other questions
3.5 Ethical Considerations
3.6 Validity and Reliability
3.7 Delimitations and Limitations of the study
3.7.1 Delimitations
3.7.2 Limitations
4 RESULTS AND DISCUSSION OF THE FINDINGS
4.1 Introduction
4.2 To what extent do optional questions differ?
4.2.1 Preliminary analysis
4.2.2 Comparing p-values of section B
4.3 How are scores on section A and section B with choice correlated?
4.4 Establishing group invariance on linking/equating functions of examinees that chose a concerned optional question and for those that selected other questions
4.4.1 Linking functions that largely vary at lower tail of choice question scale
4.4.2 Linking functions that largely vary at upper tail of choice question scale
4.4.3 Linking functions that largely vary at lower and second upper tail of choice question scale
4.4.4 Linking functions that largely vary at both lower and upper tails of choice question scale
4.4.5 Linking functions that constantly vary across the entire score scale
5 CONCLUSIONS, IMPLICATIONS AND RECOMMENDATION
5.1 Introduction
5.2 Conclusions
5.2.1 The main findings of the literature review
5.2.2 The main findings of the empirical investigation
5.3 Implications
5.4 Recommendation
REFERENCES
APPENDICES
A. Pairs of subgroups that chose particular questions and other questions
B. Pairs of subgroups that chose particular questions and other questions
C. Section A of 2005 M.S.C.E. Examination paper 2 presented in this study as paper I
D. Section B of 2005 M.S.C.E. Examination paper 2 presented in this study as paper II
E. Answer sheet cover page for paper I
F. Answer sheet cover page for paper II
G. Original form of 2005 M.S.C.E. Examination mathematics paper 2
H. Letter to Executive Director of Malawi National Examinations Board
I. Letter from Executive Director of Malawi National Examinations Board
J. Letter to secondary school headteacher
K. Letter to Shirehighlands Education Division Manageress
L. Letter to South West Education Division Manager
M. My introduction letter from Head of Department to secondary schools headteachers
LIST OF TABLES
4.1 Major content areas of section A
4.2 Major content areas of section B
4.3 P-values for questions in section A and section B without choice
4.4 Pairs of subgroups that chose particular questions and other questions and their graphs are illustrated in appendix A
4.5 Pairs of subgroups that chose particular questions and other questions and their graphs are illustrated in appendix B
LIST OF FIGURES
4.1 Equated scores on section A from optional question 7 that largely vary at lower tail of choice question scale
4.2 Equated scores on section A from optional question 8 that largely vary at higher tail of choice question scale
4.3 Equated scores on section A that largely vary at lower and second upper tail of choice question scale from different optional questions
4.4 Equated scores on section A that largely vary at both lower and upper tails of score scale of optional question 10
4.5 Equated scores on section A that vary constantly across the entire score scale of optional question 7
LIST OF ACRONYMS AND ABBREVIATIONS
AP Advanced Placement
CSE Certificate of Secondary Education
DTM Difference That Matters
HHI Henry Henderson Institute
IRT Item Response Theory
MANEB Malawi National Examinations Board
MSCE Malawi School Certificate of Education
NEAT Non-Equivalent groups Anchor Test
REMSD Root Expected Mean Square Difference
RMSD Root Mean Square Difference
CHAPTER 1
1.0 INTRODUCTION
This chapter provides a general overview of the problem under study. It
considers important concepts that break the problem into manageable components.
The first section is the background, followed by the statement of the problem, the
theoretical framework, and finally the definition of terms.
1.1 Background
The Malawi School Certificate of Education (MSCE) examination is used, among other
things, for certification, selection for tertiary education, and employment decisions.
Several subjects are examined at MSCE, including mathematics, which is rated as one of
the most significant subjects for entry into most programmes in Malawian
universities. The University of Malawi, in particular, prefers candidates with at least a
credit in mathematics, among other subjects, for enrolment in almost every programme that
it offers.
1.1.1 Characteristic of the examination investigated
In the MSCE examination, mathematics has two papers: paper 1 and paper 2. Paper
1 asks candidates to attempt all 24 questions in 2 hours and, by design, it is easier
than paper 2, although the two papers carry the same weight: each paper carries 100
marks. Paper 2 has two sections, A and B (see appendix G). Section A is
compulsory, and candidates attempt six questions worth 55 marks in total. In
section B, however, candidates are allowed a choice of questions to answer. Out of six
questions, candidates are asked to answer only three, worth 45 marks in
total. Paper 2 runs for 2 hours.
1.1.2 Grade Awarding Process
Mathematics, like all other subjects at MSCE examination, is graded on a nine-
point scale (Malawi National Examinations Board, 1999).
1-2 denote pass with distinction;
3-6 denote pass with credit;
7-8 denote general pass; and
9 denotes fail.
The raw score of each candidate is converted into a grade. This is done by
an awards committee that uses grade boundaries (cutoff scores) to turn scores into
grades (Khembo, 2004). Because mathematics has two papers, each paper is graded
separately and then the corresponding cutoff scores at the 2/3, 6/7, and 8/9 boundaries
are summed to determine the final cutoff scores for the subject.
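The summation of per-paper cutoff scores can be illustrated with a minimal sketch. The cutoff values and the function name below are hypothetical and chosen only for illustration; they are not MANEB figures.

```python
# Hypothetical per-paper cutoff scores (marks out of 100); not actual MANEB values.
paper1_cuts = {"2/3": 75, "6/7": 50, "8/9": 35}
paper2_cuts = {"2/3": 70, "6/7": 45, "8/9": 30}


def subject_cutoffs(cuts_a, cuts_b):
    """Sum the corresponding grade-boundary cutoffs of two papers."""
    return {boundary: cuts_a[boundary] + cuts_b[boundary] for boundary in cuts_a}


final_cuts = subject_cutoffs(paper1_cuts, paper2_cuts)
print(final_cuts)
```

Under these assumed values, the subject-level cutoffs are simply the boundary-by-boundary sums on the combined 200-mark scale.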
1.1.3 Comparability of optional questions raw scores
Livingston (1988) observed that question developers try their best to make
optional questions equally difficult. Angoff (1971); Newton (1977); and Wainer &
Thissen (1994), however, argue that it is not easy to produce tests that are similar in
difficulty. Though item setters strive to produce questions of equal difficulty, the
questions have their own inherent intricacy that cannot be equalized. This inherent
difficulty comes from the complexity of the topics from which the questions are
drawn. It would be naïve to compare a raw score that an examinee gets on an
optional question which elicits, for example, the use of Venn diagrams to analyse
and interpret data with the score on a question which asks an examinee to find the sum
of a geometric progression using a formula. These two questions come from different
topics which differ in complexity; hence raw scores on these two questions do not
indicate the same level of knowledge and skill. The scores will not be comparable. To treat
them as if they were comparable would be misleading for the score users and unfair
to the examinees.
Having looked at the complexity of measuring examinees who answer different
questions, the question becomes: should choice questions still be incorporated in our
examinations? The merits and demerits of optional questions are discussed in the
literature review. However, Kierkegaard (1986, p.24) argues "if you allow
choice, you will regret it, if you don't allow choice, you will regret it; whether you
allow choice or not, you will regret both". This argument highlights that if choice
were not allowed, the limitations on domain coverage forced by the small
number of questions might unfairly affect some candidates. On the other hand,
choice would compromise test fairness when it comes to comparison of scores,
because each optional question elicits different levels of knowledge and skill
from examinees. One could propose increasing the length of the test, but this is
often not practical (Wainer and Thissen, 1994), taking into consideration
examination time and examinees' fatigue. The onus, therefore, remains
with the examiners.
In the case of mathematics paper 2, there have not been any intense arguments over
the behaviour of optional questions, except Khembo's (2004) sentiments against the policy
of allowing choice. With little or no study done on optional questions in
examinations administered by the Malawi National Examinations Board (MANEB), the
policy of allowing choice questions in mathematics paper 2 would continue without
reforms and innovations to improve fair assessment because most of the stakeholders
would not know how the choice questions are performing on this paper.
1.1.4 Livingston's raw score adjustment
Psychometricians, nevertheless, have tried to find a post hoc solution to the
incomparability of optional question scores. Livingston (1988) developed a method
for adjusting scores on optional questions to remove the differences in difficulty among
the questions. In brief, the procedure imputes a score for the examinee on
each optional question which the examinee does not answer, and then averages the
scores, observed and imputed, over all optional questions. Allen, Holland and
Thayer (1993) observe that the methodology makes implicit assumptions when
imputing scores using chained linear equating. Under this procedure, raw scores on
optional question i are transformed to the scale of optional question j through
scores on the mandatory section (also known as the common portion) for the examinees that
answered question i.
1.2 Statement of the Problem
Mathematics is one of the papers at MSCE examinations that are not pre-tested
(Khembo, 2004). Pretesting allows item analysis, which in turn ensures that only
questions of proven quality are included in the final examination. When examiners
compile an examination paper they assume that the selected questions have equal
inherent difficulty, as evidenced by the equal allocation of marks (each optional
question carries 15 marks).
A study by Khembo (2004), which investigated the use of
performance level descriptors to ensure consistency and comparability in standard
setting, revealed that item difficulty indices (item p-values) for the 2002 mathematics
paper 2 examination varied considerably for questions in section B. For example,
question 10a and b had p-values of 0.03 and 0.01, question 7a and b had p-values of
0.52 and 0.15, and question 12a and b had difficulty indices of 0.27 and 0.14. Comparing
the p-values of the mentioned questions, one would note that the items were
differentially difficult. However, some would argue that the items were attempted by
non-equivalent groups conditioned on choice, and that it would not be possible to
compare their p-values outright. This argument is valid, but in the mentioned study,
the researcher employed competent mathematics teachers to establish differential
difficulty on the optional questions. The ratings by the judges using performance
level descriptors for questions in section B of the 2002 and 2003 mathematics papers
confirmed that some questions required higher order cognitive demands than others
for an examinee to succeed. The judges' ratings complemented what was observed from the
p-values.
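As an illustration of how an item difficulty index can be computed for a question part that carries several marks, the sketch below takes the p-value to be the mean mark as a proportion of the maximum mark. This definition, the function name, and the marks are assumptions made for illustration; they are not taken from Khembo (2004).

```python
def p_value(scores, max_marks):
    """Item difficulty index: mean score as a proportion of the maximum marks."""
    if not scores:
        raise ValueError("no scores supplied")
    return sum(scores) / (len(scores) * max_marks)


# Hypothetical marks awarded to five examinees on two question parts.
part_a = [4, 5, 3, 4, 4]   # marked out of 5
part_b = [1, 0, 2, 0, 2]   # marked out of 10

easy = p_value(part_a, 5)    # close to 1 means an easy item
hard = p_value(part_b, 10)   # close to 0 means a difficult item
print(round(easy, 2), round(hard, 2))
```

A large gap between two such indices suggests differential difficulty, although, as noted above, the comparison is only indicative when the groups answering each question are self-selected.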
Given the teachers' observations, coupled with the conspicuously different
p-values for optional questions, it is clear that the introduction of optional questions
into this paper brings unfairness into grading. The basis for comparability of raw
scores is thus considerably weakened, since different examinees would answer
samples of questions that are not comparable in difficulty.
For this reason, there is a need to find a method which would circumvent the
incomparability of measurements. Livingston (1988) proposed a method of adjusting
raw scores on optional questions to achieve fairness in grading examinees that take
different questions. Allen et al. (1993) note that the procedure relies on implicit
assumptions in order to adjust the scores. They call them
"Livingston missing data assumptions".
The assumptions are based on a key theoretical requirement of test equating,
which emphasises that the resulting equating functions should not depend on the
population on which they are calculated. In other words, the two equating functions
should be identical regardless of which subpopulation has attempted which question.
Therefore, before the method is adopted and adapted in our grading system,
especially in mathematics, there is a need to scrutinise it in detail.
1.2.1 Purpose of the Study
General objective
The general objective of the study is to test the assumptions of chain linear
equating/linking for the Livingston raw score adjustment method on optional question
scores of MSCE mathematics paper 2.
Specific objectives
• To distinguish the item difficulty level of optional questions using item difficulty
indices of raw scores.
• To compare correlations between total scores on the compulsory section (i.e.
Section A/common portion) and scores on the optional questions portion.
• To establish whether the equating/linking functions of examinees that chose a
given optional question and of those that selected a different choice
question are group invariant.
1.2.2 Research Questions
1. To what extent do optional questions differ in difficulty?
2. How are scores on optional questions portion and total scores on the
common portion correlated?
3. Are the equating/linking functions of examinees that chose a given
optional question and of those that selected an alternate question group
invariant?
1.2.3 Significance of the study
Fairness in measurement is of paramount significance. Every examinee ought to
be measured using the same instrument and the same scale for comparability to be
meaningful. As already mentioned, mathematics is one of the subjects that are
treasured at Malawi School Certificate of Education; and as a result a certificate
without a pass in mathematics puts a person at a disadvantage when it comes
to selection for further studies or even job selection.
To forestall this measurement quandary, Livingston suggests a method for adjusting
optional question scores to a common scale. It would be easy to adjust the
scores of MSCE mathematics paper 2 using this method. The consequences of that action,
however, are not known in our context; it is therefore worth testing
the fundamental assumptions mentioned above, since Dorans (2004) and Liu, Cahn and Dorans
(2006) say that subgroup invariance is most critical and plays a significant role
in assessing fairness.
Furthermore, to the knowledge of the researcher, there has been no detailed research
that has addressed the consequences of optional questions in the
examinations administered by the Malawi National Examinations Board. This study
would evaluate the extent of the relationship between knowledge and skills measured in
section A and those measured in section B. It would also explore the pattern of
choices in section B conditional on topics in the Malawi senior mathematics syllabus.
1.3 Theoretical Framework
The process of equating is used to obtain comparable scores when more than one
test form is used in a test administration (Holland, von Davier, Sinharay, and Han,
2006). Angoff (1971) defined the equating of tests as a process to convert the
system of units of one form to the system of units of the other so that the scores
obtained from one form could be compared directly with the scores obtained from
the other form.
The central reason for equating different test forms is to ensure fair decision
making regarding the test results (Liu and Dorans, 2008). There are three
designs for making different test forms comparable, known as equating
procedures (Jaeger, 1981; Petersen, Kolen, and Hoover, 1989; Cook and Eignor,
1991): random groups, single group, and common item non-equivalent groups
(also known as non-equivalent groups anchor test, NEAT).
Three equating methods are used in the common item non-equivalent groups
design: Tucker, Levine, and chain linear (von Davier and Kong, 2005). This
study focuses on the chain linear method because it uses common item score(s) as the
middle link in a chain of linear linking relationships. Basically, chain linear
linking is done by equalising standardised deviation scores (z-scores) on the two test
forms via the standardised deviation scores on the common item(s). Before going into the
detail of chain linear equating/linking, we first look at the Livingston score adjustment
procedure in steps as presented by Allen, Holland, and Thayer (1993, pp. 17-18), because
in the end we would like to connect it with the chain equating/linking functions. Here is
some notation to make what follows easier to grasp:
$Y_{j,\mathrm{imputed}}$ = the imputed score on question $j$ for an examinee not in $P_j$
$Y_j^{*}$ = the score that would be imputed if the score on question $j$ were perfectly correlated with the scores on section A
$P$ = the entire population of examinees who take section A, which is also known as test $X$
$P_i$ = subpopulation of $P$ that answer question $i$ in section B
$P_j$ = subpopulation of $P$ that answer question $j$ in section B
$X$ = score on section A (common portion)
$Y_i$ = score on optional question $i$
$Y_j$ = score on optional question $j$
$\mu_{i1}, \sigma_{i1}, \rho_{XY_i 1}$ denote mean, standard deviation, correlation coefficient in $P_i$
$\mu_{j1}, \sigma_{j1}, \rho_{XY_j 1}$ denote mean, standard deviation, correlation coefficient in $P_j$
Step 1: equating $Y_i$ to each of the $Y_j$. For examinees in $P_i$ obtain the converted
value of the observed $y_i$ to the scales of the other $Y_j$s. The converted values are
denoted $Y^{*}_{ij}(y_i)$.
Step 2: obtaining imputed values, $y_{j,\mathrm{imputed}}(y_i)$, for $j \neq i$ for every examinee in $P_i$.
These imputed scores are weighted averages of the raw score $y_i$ and its equated score
in the $Y_j$ scale, $Y^{*}_{ij}(y_i)$:
$$y_{j,\mathrm{imputed}}(y_i) = (1 - \rho_{XY_j 1})\, y_i + \rho_{XY_j 1}\, Y^{*}_{ij}(y_i)$$
Step 3: calculate the adjusted score as the simple average of the observed raw score
and the imputed scores over all $k$ optional questions.
$$Y_{adj}(y_i) = \Big\{ y_i + \sum_{j \neq i} y_{j,\mathrm{imputed}}(y_i) \Big\} \Big/ k$$
Combining steps 2 and 3 to get a simple expression for $Y_{adj}$, we first denote $\bar{\rho}$ as the
average of all the correlations $\rho_{XY_j 1}$:
$$\bar{\rho} = \frac{\sum_j \rho_{XY_j 1}}{k} \quad \text{and} \quad \bar{Y}_i(y_i) = \frac{\sum_j \rho_{XY_j 1}\, Y^{*}_{ij}(y_i)}{\sum_j \rho_{XY_j 1}},$$
where $\bar{Y}_i(y_i)$ is the weighted average of the converted values, in other words, a
transformation of $y_i$ into an average scale of the $k$ question scores determined by the
equations with weights proportional to the correlations $\rho_{XY_j 1}$.
A simple Livingston adjusted score function is expressed as
$$Y_{adj}(y_i) = (1 - \bar{\rho})\, y_i + \bar{\rho}\, \bar{Y}_i(y_i)$$
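The three steps of the score adjustment can be sketched numerically. The code below is a minimal illustration, assuming that the converted values on each question's scale and the correlations with section A are already known; the function names and all numbers are hypothetical. It computes the adjusted score step by step and then through the combined expression, taking the converted value on the examinee's own question scale to be the raw score itself.

```python
def adjusted_stepwise(y_i, i, converted, rho):
    """Steps 2 and 3: impute a score for every unanswered question, then average.

    converted[j] is the raw score y_i converted to the scale of question j
    (converted[i] is y_i itself); rho[j] is the correlation between section A
    scores and question j scores among examinees who chose question j.
    """
    k = len(rho)
    imputed = [(1 - rho[j]) * y_i + rho[j] * converted[j]
               for j in range(k) if j != i]
    return (y_i + sum(imputed)) / k


def adjusted_combined(y_i, converted, rho):
    """Combined form: weight the raw score and the average converted score."""
    k = len(rho)
    rho_bar = sum(rho) / k
    y_bar = sum(r * c for r, c in zip(rho, converted)) / sum(rho)
    return (1 - rho_bar) * y_i + rho_bar * y_bar


# Hypothetical examinee who chose question 0 and scored 9 on it.
converted = [9.0, 11.0, 7.0]   # score expressed on each question's scale
rho = [0.5, 0.6, 0.4]          # assumed correlations with section A

stepwise = adjusted_stepwise(9.0, 0, converted, rho)
combined = adjusted_combined(9.0, converted, rho)
print(stepwise, combined)
```

Both routes give the same adjusted score under these assumptions, which is what combining steps 2 and 3 requires.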
Coming back to chain linear equating/linking functions and connecting them with
Livingston score adjustment, it is discovered that:
In step 1, the linear equation for equating $Y_i$ to the scale of $X$ in $P_i$ is
$$X_i(y_i) = \mu_{X_i 1} + \frac{\sigma_{X_i 1}}{\sigma_{i1}}\,(y_i - \mu_{i1}) \qquad (1)$$
and the linear equation for equating $X$ to the scale of $Y_j$ in $P_j$ is
$$Y_j(x) = \mu_{j1} + \frac{\sigma_{j1}}{\sigma_{X_j 1}}\,(x - \mu_{X_j 1}) \qquad (2)$$
where $\mu_{X_j 1}$ and $\sigma_{X_j 1}$ are the mean and standard deviation of $X$ for examinees
choosing question $j$. The essence of the word "chained" in the chained linear
equating is the substitution of $x$ in the $Y_j(x)$ of equation (2) with $X_i(y_i)$ in equation
(1), neglecting the fact that the two equating functions are for different populations
(Brennan, 2006). That is
$$Y_j(X_i(y_i)) = \mu_{j1} + \frac{\sigma_{j1}}{\sigma_{X_j 1}}\,(\mu_{X_i 1} - \mu_{X_j 1}) + \frac{\sigma_{j1}}{\sigma_{X_j 1}}\,\frac{\sigma_{X_i 1}}{\sigma_{i1}}\,(y_i - \mu_{i1}) = Y^{*}_{ij}(y_i) \qquad (3)$$
8/13/2019 EXAMINING THE UNTESTABLE ASSUMPTIONS OF THE CHAINED LINEAR LINKING FOR LIVINGSTON SCORE ADJUSTMENT WITH APPLICATION TO TH
28/117
13
Braun and Holland (1982) indicate that for chain equating/linking to produce
unbiased results, the two chained equating/linking functions should not depend on
which population is used for the equating. Dorans and Holland (2000); von Davier,
Holland, and Thayer (2004); Dorans (2004); Liu, Cahn, and Dorans (2006) call this
requirement population invariance. It means that equating $Y_i$ to $X$ on $P_i$ ought to
give the same equating function as $Y_i$ to $X$ on $P_j$ (Allen et al., 1993). In this case $Y_i$
is missing data on $P_j$, which in this study will be available. The resulting linear
equating function of $Y_i$ to $X$ on $P_j$ is
$$X_{ij}(y_i) = \mu_{X_j 1} + \frac{\sigma_{X_j 1}}{\sigma_{ij1}}\,(y_i - \mu_{ij1}) \qquad (4)$$
The two linear equating/linking functions (1) and (4) therefore must have the same
slope and intercepts in order to meet the above condition or requirement.
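The invariance requirement can be checked numerically by computing the slope and intercept of the linking function separately in each group and comparing them. The sketch below is only an illustration with made-up scores and a hypothetical helper name; it uses population standard deviations from Python's `statistics` module.

```python
from statistics import mean, pstdev


def link_coefficients(y_scores, x_scores):
    """Slope and intercept of the linear link from question scores to section A."""
    slope = pstdev(x_scores) / pstdev(y_scores)
    intercept = mean(x_scores) - slope * mean(y_scores)
    return slope, intercept


# Hypothetical scores on question i and on section A for the two groups.
y_in_pi, x_in_pi = [6, 9, 12], [20, 30, 40]   # examinees who chose question i
y_in_pj, x_in_pj = [3, 6, 9], [20, 30, 40]    # examinees who chose question j

slope_i, intercept_i = link_coefficients(y_in_pi, x_in_pi)
slope_j, intercept_j = link_coefficients(y_in_pj, x_in_pj)

print("slope difference:", slope_i - slope_j)
print("intercept difference:", intercept_i - intercept_j)
```

In this made-up case the slopes agree but the intercepts differ, so the two groups would not share a single linking function and invariance would fail.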
1.4 Definition of terms
Conventional secondary school: a public school owned by the Malawi government.
Cutoff score/cut score: a point on a score scale such that scores at or above
that point are in a different category or classification than scores below the
point.
Difficulty: a factor causing trouble in achieving a positive result or tending to
produce a negative result.
Optional questions: examinees' self-selected questions or choice of questions in a
test.
Performance descriptors: scale of achievement levels with a set of observable
behavioural descriptions
Test form: examination paper
National secondary: a school whose students are selected for admission from
different districts across Malawi.
District secondary: a school that admits students taken from the same district. It
offers boarding and lodging.
Day secondary: a school that offers no boarding and lodging. Its students come from
surrounding communities.
Grant-aided secondary: church affiliated school that receives financial assistance
from Malawi Government.
CHAPTER 2
2.0 LITERATURE REVIEW
2.1 Introduction
The literature review has seven sections. The first section gives general
information on optional questions. The second section discusses some advantages of
optional questions regarding their use in test forms. The third section looks at
problems that come with the policy of allowing candidates to choose questions in an
examination. The relationship between candidates' question choice and high performance
is discussed in the fourth section. The definition of linking and equating used in this
study is given in the fifth section. The sixth section discusses the possibility of
linking and equating optional questions using traditional equating methods. The last
section discusses the consequences of not linking/equating when choice items are
differentially difficult.
2.2 General information on optional questions
The introduction of optional questions into examinations complicates the
process of measurement, since different groups of candidates attempt different
questions from a single paper, so that candidates' scripts in effect represent a
combination of different test forms (Willmott & Hall, 1975;
Bell, 1997). In the context of mathematics paper 2, choosing three questions out of
six creates twenty possible combinations of test forms. The complication comes in
because candidates answer in effect different papers out of these different
combinations, especially when questions vary much in difficulty. It then means the
same total mark may not represent comparable performance (Lewis, 1974).
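The count of twenty test forms can be checked directly. The short sketch below enumerates every way of choosing three of six optional questions; the question labels 7 to 12 are used only for illustration.

```python
from itertools import combinations
from math import comb

# Illustrative labels for the six optional questions in section B.
optional_questions = [7, 8, 9, 10, 11, 12]

# Every distinct set of three questions a candidate could attempt.
test_forms = list(combinations(optional_questions, 3))

print(len(test_forms))  # number of possible three-question test forms
print(comb(6, 3))       # the same count from the binomial coefficient
```

Each tuple in `test_forms` is one effective test form, which is why two candidates with the same total mark may have sat different papers.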
A good test adequately samples questions from the content domain to provide
a sound basis for determining the extent to which a student has mastered the course.
Mann (1845, pp. 37-40), as cited by Wainer, Wang, and Thissen (1991, p. 2), argued
that
it is clear that the larger the number of questions put to a scholar the
better is the opportunity to test his merits. If but a single question is put,
the best scholar in the school may miss it, though he would succeed in
answering the next twenty without a blunder; or the poorest scholar may
succeed in answering one question, though certain to fail in twenty
others. Each question is a partial test, and the greater the number of
questions, therefore, the nearer does the test approach to completeness. It
is very uncertain which face of a die will turn up at the first throw; but if
the dice are thrown all day, there will be a greater equality in the number
of faces turned up.
Mann's argument is quite plausible in the context of the MSCE mathematics
syllabus. To determine that one has indeed mastered MSCE mathematics, it does not
take a single question answered correctly, but enough questions to cover the
content domain fairly. Section A, which is the mandatory section of mathematics paper 2,
contains fairly small items, whilst section B contains large items. Wainer et al.
(1991) define large items as those that take an examinee longer to complete than
short items. Large items provide deep coverage of the content domain, so that
answering them correctly assures the examiner that the examinee has
thoroughly mastered the course. For this to hold, the large items need to be many, but an
examinee cannot complete many large items within the allotted testing time. One
way of reconciling testing time limits with domain coverage is to provide many
large items and allow examinees to choose among them.
2.3 Advantages of optional questions
Optional questions have some advantages to candidates, teachers and examiners.
In this study, only three main advantages are discussed.
First, optional questions give each candidate the chance to answer questions
on a wide range of topics (Bradlow and Thomas, 1998). This is because a paper that
carries more questions than the time allows a candidate to answer covers more of the
syllabus. This in turn increases fairness among candidates (Allen, Holland, and
Thayer, 2005) because they are not restricted to answering questions sampled from a
few topics.
Second, optional questions are used in examinations that aim to
measure higher-order cognitive skills (Allen et al., 2005). In these examinations,
examiners perceive candidates' work to be more authentic
(Bradlow and Thomas, 1998). This advantage applies most to optional essay
questions, where candidates are simply given a topic to write about. In mathematics it
also applies, because optional questions demand a high level of thinking. When an
examinee gets all marks on an optional question, it means s/he has demonstrated
high-level cognitive ability.
Third, examinations with question choice give teachers freedom to teach
particular portions of the syllabus in which they may be particularly interested
(Schools Council Examinations Bulletin, 1971; Willmott and Hall, 1975). Similarly,
candidates concentrate on particular aspects of the topics in which they are able to
show themselves to the best advantage. However, with the optional questions of mathematics
paper 2, no teacher can confidently know which topics will be examined; therefore,
in essence, there is no freedom to teach particular topics and leave out others.
Nevertheless, some teachers have problems in delivering lessons on
some mathematics topics. As a result, they either engage someone who is
comfortable with the particular topics or present the topics poorly. The latter
situation puts students in an awkward position in terms of thorough examination
preparation. It eventually influences their choices in the examination negatively,
since the mathematics domain has been reduced by the teachers' incompetence.
Nonetheless, candidates are forced to prepare thoroughly by studying the whole
syllabus. A candidate may be good at a particular topic, but s/he is still extrinsically
motivated to study hard on the other topics in order to do well, because no one can
predict the exact topics that will be examined.
2.4 Problems of optional questions
Although the merits of the above section cannot be denied, little attention has
been paid to the problems brought by optional questions when they are used in
examinations. It appears that examiners overlook some very important aspects of a
test as a measuring instrument. Below are accounts of two main problems associated
with examinees' choice of questions. The first concerns the difference in the
cognitive demands of topics in a syllabus, while the second concerns
the variability in the abilities of candidates.
2.4.1 The syllabus
In a syllabus, there are a number of different topics. It may be argued whether or
not syllabus topics are of the same basic level of difficulty (Willmott, 1972). One
good example of these arguments is presented by the Schools Council
Examinations Bulletin (1971), which asks whether, in mathematics, the quoting of a geometry
theorem followed by an example is on a par with factorisation followed by the solution
of a pair of simultaneous equations. Certainly, the two topics or branches of
mathematics could not be at the same difficulty level in our syllabus. There are quite
a number of topics in the senior secondary school mathematics syllabus which have
different levels of difficulty. The comparability of the results of candidates
attempting questions drawn from these different topics may be questioned.
Therefore, putting scores from different optional questions on the same scale is
necessary for fair comparisons.
2.4.2 The abilities of candidates
The level of questions may vary considerably within the same test form in terms
of level of proficiency required of the candidates to be able to answer the question
fully (Willmott, 1972). When question choice is provided, the type of
responses required of the candidates over the whole paper is not controlled in any
way. Some candidates may choose to answer questions with a certain pattern of
proficiency. For example, if a paper of ten questions consisted of five descriptive
questions and five explanatory questions, and candidates were to answer five
questions in all, one would likely see describers only and explainers only (Schools
Council Examinations Bulletin, 1971). This would create a measurement problem when one
tries to consider candidates with the same marks to be of the same ability
level (Willmott, 1972). In the case of mathematics, candidates who are not good at
graphs, for example, will tend to avoid graph questions, and some whose proficiency
is low in matrices and vectors will choose other questions. However, the fact that
they have answered their preferred questions does not guarantee that they get full
marks on those questions. The gist of the matter is that if they like geometry more
than arithmetic and algebra, they go for that branch of mathematics. The problem
that then arises is one of comparison: is my geometry better than your algebra or
arithmetic? Wainer and Thissen (1994) are also concerned with such comparisons,
because the difficulty of the accomplishment needs to be taken into account for the
comparison to be meaningful. It would not be fair to judge two examinees'
mathematics proficiency based on different questions. Fair play ought to be
achieved.
2.5 Relationship between candidates' question choice and getting high scores
The suggestion that optional questions allow candidates to select the questions
on which they can perform better is contradicted by research evidence. According to
Wang (1996), the correlation between the popularity ranking of the five choice
questions and their corresponding means was $-0.60$, and the correlation between
the ranking of the choice question combinations and the mean score was $-0.22$. The
negative correlations are very surprising because it is assumed that
examinees choose questions they feel they would get right. A study by Taylor and Nuttal
(1974), as cited by Bell (1997), asked candidates taking a Certificate of
Secondary Education (CSE) examination to answer the questions they had omitted on a
separate occasion after the actual examination. It was found that about 25% of
candidates actually showed an improvement in their final marks. This meant that not
all candidates are able to choose in advance the questions on which they will score
most highly.
Power, Fowles, Farnum, and Gerritz (1992) found that the more the examinees
liked a particular topic, the lower they scored on an essay they subsequently wrote
on the chosen topic. This phenomenon holds particularly when the choice between the
questions is relatively hard for examinees to make, that is, the choices are not
strongly determined (Allen, Holland, and Thayer, 1993). It is not known
whether MSCE mathematics paper 2 optional questions present this kind of
scenario, in which most candidates find it hard to select the questions on which they
would score most highly. Malawi National Examinations Board item
developers do try to produce optional questions of equivalent difficulty by
following available guidelines (Khembo, 2004). It is yet to be seen whether examiners'
effort to produce optional questions that are, on face value, equivalent in difficulty
produces hard choices on the part of examinees. The words "on face value" are
used because no detailed research has been done to ascertain the notion of equal
difficulty of optional questions.
2.6 Linking and Equating
Linking encompasses a broad perspective on score adjustment of different test
forms. Feuer, Holland, Green, Bertenthal, and Hemphill (1999), in their Uncommon
Measures report, presented three types of linking of scores of different tests that are
built on
1. the same framework and same test specifications,
2. the same framework and different test specifications, or
3. different frameworks and different test specifications.
Kolen and Brennan (2004, p. 427) defined the term "framework" as "a delineation
of the scope and extent (e.g., specific content areas, skills, etc.) of the domain to be
represented in the assessment". They also defined "test specifications" or "blueprint" as
the "specific mix of content areas and item formats, number of tasks/items, scoring
rules, etc." On the other hand, Mislevy (1992) and Linn (1993) proposed a
taxonomy for linking which mainly focuses on methodologies. They grouped the
taxonomy into four categories, based on the strength of the resulting linkage, starting
with equating, followed by calibration, projection, and lastly moderation.
When the first two types of linking presented by Feuer et al. (1999), Mislevy
(1992) and Linn (1993) are put into the same perspective, one finds that the score
adjustment relationship of different test forms that are built on the same framework
and same test specifications is called equating (Kolen and Brennan, 2004). When tests that
are developed on the same framework but different specifications are linked, the
resulting relationship is called calibration. The term projection applies when
the methodology does not require the test forms to measure the same constructs or
domain, and the score adjustment relationship is obtained through linear or non-linear
regression. Moderation is a type of linking in which the test frameworks are different
but the constructs are similar (Kolen and Brennan, 2004). In this case, the
fundamental aspect relies on distribution matching.
Looking specifically at equating as one type of linking, Lord (1980) outlined
four requirements that must be met for equating of, say, test $Y_i$ to test $Y_j$:
1. the same construct: the two tests must measure the same construct,
2. equity: once two test forms have been equated, it should not matter to
the examinees which form of the test is administered,
3. symmetry: the equating transformation should be symmetric. This
means the equating of $Y_i$ to $Y_j$ should be the inverse of equating $Y_j$ to
$Y_i$,
4. subpopulation invariance: the equating transformation should be
invariant across subpopulations.
As noted previously in the definitions of the types of linking in the Uncommon
Measures report, same framework is viewed as construct similarity and same test
specifications is considered as similarity in measurement characteristics such as test
length, test format, administration conditions, etc. (Kolen and Brennan, 2004). These
definitions are concordant with the four requirements for equating delineated by Lord
(1980). This study uses these definitions as benchmarks for deciding the type of
linking involved. Therefore, the term linking is used
henceforth to refer to any function used to connect the scores on one test to those
of another test, and the term equating is reserved for the special case of linking
that satisfies the benchmarks.
Livingston (2004), von Davier, Holland, and Thayer (2004), and Holland, von
Davier, Sinharay, and Han (2006) describe chain linking as equating the scores on
the new form to scores on the anchor and then equating the scores on the anchor to
scores on the reference form. In our context, chained linear
linking means linking the scores on a particular optional question (new form) to
total scores on the common portion (anchor) and then linking the total scores on the
common portion to scores on the other optional questions (reference forms). The
chain formed by these two linking functions connects the scores on the concerned
optional question to the scores on the other optional questions.
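The two-step chain described above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the thesis's own computation: the scores are invented, the helper name `linear_link` is hypothetical, and each link is taken to be the linear function that matches means and (population) standard deviations within one group.

```python
from statistics import mean, pstdev


def linear_link(from_scores, to_scores):
    """Return a function mapping the 'from' scale onto the 'to' scale
    by matching the means and standard deviations observed in one group."""
    slope = pstdev(to_scores) / pstdev(from_scores)
    mu_from, mu_to = mean(from_scores), mean(to_scores)
    return lambda y: mu_to + slope * (y - mu_from)


# Invented scores for illustration only (not thesis data).
# Group P_i chose question i; group P_j chose question j.
question_i_in_Pi = [4, 8, 10, 12, 16]
section_a_in_Pi = [20, 30, 35, 40, 50]
section_a_in_Pj = [15, 25, 30, 35, 45]
question_j_in_Pj = [3, 9, 12, 15, 21]

# Link 1: optional question i (new form) -> section A total (anchor), in P_i.
to_anchor = linear_link(question_i_in_Pi, section_a_in_Pi)
# Link 2: section A total (anchor) -> optional question j (reference form), in P_j.
from_anchor = linear_link(section_a_in_Pj, question_j_in_Pj)


def chained(y):
    """Compose the two links: question i score -> anchor -> question j scale."""
    return from_anchor(to_anchor(y))


print(chained(10))  # a question i score expressed on the question j scale
```

The first link is estimated only in the group that answered question i, which is why the assumption that it also holds in the other group cannot be tested in an ordinary examination.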
The study is particularly interested in the first part of the chain, where the
scores on a particular optional question are linked to total scores on the common portion.
There is an assumption that the linear function linking a particular optional
question's scores to the common portion is the same in the two populations, those that
answer the concerned question and those that do not ($P_i$ and $P_j$) (von Davier &
Kong, 2005). Based on the assumption's level of attainment, we can substantiate the
Thayer (2005) discovered that question choice tends to be positively associated
with performance, in the sense that the better an examinee does on a question the
more likely s/he is to prefer that question, and vice versa. This finding, however, is
muddied by a reversal in which examinees who prefer a certain question perform
better on the unpreferred question. They concluded that there is a substantial amount
of variation in performance on preferred and unpreferred choices
and, therefore, it is difficult to justify the non-ignorable selection. Given the above
findings, it seems impossible for scores on optional questions to be treated
interchangeably through traditional equating, because doing so is inconsistent with the
notion of standardised testing (Kolen and Brennan, 2004).
Though it is deemed impossible to equate optional question scores,
comparability of scores is nevertheless possible through score adjustment
procedures (Kolen and Brennan, 2004) that employ linking paradigms. Wainer,
Wang, and Thissen (1991) employed Item Response Theory (IRT) to explore
the possibility of equating choice items by assuming ignorable non-response, using data
from the College Board's Advanced Placement (AP) test in Chemistry. They treated
examinees as two subpopulations that were both administered the common items but
differed in the chosen questions they answered, and used this design to calibrate the item
parameters for the common items and selected questions. They succeeded, but
without the confirmatory evidence that could only be obtained with further data.
Allen, Holland, and Thayer (1994a, b) provided a general procedure based on
missing-data methods for non-ignorable non-response to estimate distribution of
scores on an optional part of a 1987 Advanced Placement (AP) European History
test. Using sensitivity analysis approach, they observed that an assumption of
ignorable non-response given additional information from the common section score
could determine the correct assumption about the non-response when only the
optional essay score and the common section were available. Fitzpatrick and Yen
(1995) investigated the psychometric characteristics of constructed response items
referring to choice and non-choice passages administered to students in Grades 3, 5,
and 8. The items were scaled using IRT methodology. The findings indicated that
the scores obtained on different choice sets were comparable when these choices
were scaled together with the non-choice items that all students took. The non-
choice items play an important role in producing comparable scores. Bridgeman,
Morgan, and Wang (1997) assessed the ability of history students to choose the
essay topic on which they could get the highest score. They concluded that techniques
for equating scores generated by different topics are not totally satisfactory; therefore,
scoring rubrics must be established by a single group of raters to enable a single
standard.
As can be noted, there is a mixed bag of success and failure in making choice
item scores comparable. Most of the mentioned studies used IRT methodology in
data analysis, which requires strong assumptions about the test, such as
unidimensionality and local independence. Unidimensionality is the statistical
dependence among items which comes about because the test is measuring one latent
trait, and local independence is achieved when items are statistically independent for
each subpopulation of examinees whose members are homogeneous with respect to
the latent trait (Crocker and Algina, 1986; Hambleton, Swaminathan, and Rogers,
1991). The opponents of IRT always argue that it is naive to assume that a single
latent trait accounts for the responses to items on a test. Thus, this study uses
classical item analysis statistics in testing a key assumption of Livingston's score
adjustment on MSCE mathematics paper 2, based on the requirement that the two
equating/linking functions should not depend on the particular population used for
equating.
At this juncture, it should be emphasised that the examinations used in the
mentioned studies are quite different in format from the one under study. In
those studies, examination papers had more than two sections, whilst ours has two
sections only. In view of this, it would not be plausible to conclude that equating is not
possible for every examination with optional questions until this is proven beyond
reasonable doubt.
2.8 What are the consequences of not linking/equating optional question scores?
Linking/equating has the potential to ameliorate the problems presented by choice
by adjusting scores for differences in the difficulty of the questions chosen. "If
examinees who choose different items are to be fairly compared with one another, the
scores obtained on these items must be equated" (Wainer, Wang, and Thissen, 1991, p. 2).
This process facilitates the linkage of scores on optional items to one another by
putting them on a comparable scale using the z-score model.
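The z-score model can be written out explicitly. As a sketch in the usual notation for linear linking (the symbols $\mu$ and $\sigma$ for a group's mean and standard deviation on each scale are assumed here for illustration), a score $y$ on an optional item and a score $x$ on the target scale are treated as equivalent when their standardised values coincide:

```latex
\frac{x - \mu_X}{\sigma_X} = \frac{y - \mu_Y}{\sigma_Y}
\quad\Longrightarrow\quad
x = \mu_X + \frac{\sigma_X}{\sigma_Y}\,(y - \mu_Y)
```

The resulting function is linear in $y$, with slope $\sigma_X/\sigma_Y$ and intercept $\mu_X - (\sigma_X/\sigma_Y)\,\mu_Y$.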
The optional questions are intended to test the same skills and types of
knowledge, which are taken from the same syllabus. Though test developers try to
make the questions equally difficult, oftentimes some optional questions turn out to
be harder than others. Wang, Wainer, & Thissen (1993) observed that in the 1989 AP
Chemistry and 1989 AP American History examinations, women were adversely affected because
most of them chose the more difficult items. This is one example among many of the
unfairness that comes with question choice. When some optional questions are
harder than others, the raw scores on those questions do not indicate the same
level of the knowledge or skill the questions are intended to measure; thus the scores
are not comparable.
As noted previously, developing choice items of equal
difficulty is a gargantuan challenge. Even so, removing choice from the examination
would reduce domain coverage because of the small number of items that would be
examined. This would affect some students. Increasing the length of time to
accommodate a large number of items is often impractical. Since choice has been
decided upon as the desirable format for MSCE mathematics paper 2 examinations, there
are two main consequences of not putting the scores on the same scale. First, the same
observed raw score on each optional item would not imply the same accomplishment,
because the difficulties of the tasks are different. Second, observed total raw scores
from choice item combinations in section B would still reflect different patterns of
mathematical proficiency from one combination to another, which would complicate
comparison.
CHAPTER 3
3.0 METHODOLOGY
3.1 Introduction
This chapter describes how the research problem was investigated. The list of
questions to be answered is given first. This is followed by the design of the study,
analysis plan, ethical consideration, validity and reliability. In the final section, a
narrative of delimitations and limitations of the study is presented.
3.2 The Research Questions
The following questions were addressed in this study:
1. To what extent do optional questions differ in difficulty?
2. How are scores on optional questions and total scores on the common
portion correlated?
3. Are the linking/equating functions for examinees who chose a given
optional question and for those who selected another optional question
similar?
3.3 The Design
3.3.1 Description of the Research
The research strategy employed was a survey, because the researcher
wanted the measures used to be reliable and valid, and wanted a guarantee of
fair representation of all individuals to whom the results were to
apply (Cohen, Manion, & Morrison, 2000; Slavin, 1984). Further, a quantitative
approach was used because it follows the positivist position, which holds
that the social environment is real and constant regardless of time and
setting (Creswell, 1994).
3.3.2 Population
The population of the study was all form 4 students from purposively sampled
secondary schools in the South West and Shire Highlands Education Divisions.
3.3.3 Sampling
The study used purposive sampling, in which five secondary schools were chosen
to participate in the study. There are two main reasons why purposive sampling
was preferred to other methods. First, the researcher wanted to ensure representation of the
four major types of conventional secondary school. This is in agreement with Borg, Gall
and Gall (1996), who say that a purposive sample provides more focused data and
allows for a detailed analysis of a particular segment of the population. Second, due to
limitations of research funds and time, it was judicious to engage schools which
were close to each other.
classroom assessment. Since the study wanted sixty participants from each school, a
sampling interval, $k$, was computed by dividing the number of students in the form 4
class at each school by 60. From the teacher's list, the name of the student corresponding
to the $k$th number was picked, and every $k$th name thereafter was chosen until the
required number was achieved.
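The selection rule just described is a form of systematic sampling and can be sketched as follows. The function name and the class list are hypothetical, and rounding the interval down with integer division is an assumption, since the text does not say how a non-integer interval was handled.

```python
def systematic_sample(names, sample_size):
    """Pick the k-th name and every k-th name thereafter,
    where k is the class size divided by the required sample size."""
    k = len(names) // sample_size  # sampling interval (rounded down: an assumption)
    if k < 1:
        raise ValueError("class is smaller than the required sample")
    picked = names[k - 1::k]       # k-th, 2k-th, 3k-th, ... name on the list
    return picked[:sample_size]    # stop once the required number is reached


# Hypothetical class list of 180 form 4 students.
class_list = [f"student_{n}" for n in range(1, 181)]
sample = systematic_sample(class_list, 60)

print(len(sample), sample[0], sample[-1])
```

With 180 names the interval is 3, so the third name is picked first and every third name after it.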
3.3.4 Instruments
The main instrument used was the 2005 Malawi School Certificate of
Education Examinations mathematics paper 2 (see appendix G). This paper was
purposively chosen because it was the latest paper available at the time of writing the
research proposal.
The design was that the candidates had no choice in section B, thereby
increasing the test length by three more questions. In view of this, the paper was
divided into two parts: paper 1 representing section A (see appendix C) and paper 2
representing section B (see appendix D). This was done in agreement with the
observation of Hand (2004, p. 120) that the more questions are included in a test, the
more difficulty one might find in obtaining valid responses, since candidates tire as
the number of questions increases and might even refuse to take part if there are
too many.
Paper 1 consisted of six questions and the time allotted to it was 1 hour 30 minutes.
Paper 2 took 2 hours and had six question choices. In this paper, examinees were
instructed to read all the optional questions and choose three questions, and that they
should write down the numbers of these questions in order of preference. Then they
were instructed to answer all six questions.
The other instrument was the questionnaire that was used as a cover page for
candidates' answer sheets for paper 1 and paper 2 (see appendices E and F
respectively). The questionnaire was used to solicit extra information from the
candidates, such as gender, age and, exclusively for paper 2, question choice
preference.
3.3.5 The administration of the instruments and data gathering
The two papers were administered three weeks prior to the commencement of the
National Examinations. This was done to ensure that students had prepared
thoroughly in terms of mastering the whole mathematics syllabus. This is the time
when the majority of secondary schools finish delivering lessons to students and
instead engage in revision of the various courses that are offered. The two test
papers were administered on the same day, starting with paper 1; after a 30-minute
break, paper 2 was taken.
Students were instructed to answer the questionnaire first before attempting the
questions in both papers. The time given to fill in the questionnaire was two minutes.
3.4 Data Analysis
3.4.1 Extent of difficulty in optional questions
The item difficulty indices (p-values) were used to analyse the extent of
difficulty in the optional questions. These p-values are obtained by computing the
average mark obtained on the question divided by the maximum mark for that
question (Nuttal & Willmott, 1972). The p-values for questions in section A and
section B without choice (i.e. no choices were allowed on the optional questions
portion) were all calculated in the same manner. The item difficulty indices for
questions in section B without choice were unbiased statistics because all
examinees (population) were used to compute them.
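As a small illustration of the index defined above, the sketch below computes a p-value as the average mark on a question divided by the maximum mark. The marks and the maximum mark are invented for the example and are not thesis data.

```python
def difficulty_index(marks, max_mark):
    """Item difficulty (p-value): average mark on the question
    divided by the maximum mark for that question."""
    return (sum(marks) / len(marks)) / max_mark


# Invented marks of four examinees on a question marked out of 20.
marks_on_question = [5, 10, 15, 10]
p_value = difficulty_index(marks_on_question, max_mark=20)

print(p_value)
```

A value close to 1 indicates an easy question and a value close to 0 a hard one, so p-values for different optional questions can be compared directly.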
3.4.2 Correlation of scores on section B and total scores of the section A
The Pearson product-moment correlation coefficient between the common portion
and the question choice portion was calculated. The coefficient of determination was
worked out to determine the variance in section A that is associated with the variance in
section B. This analysis helped the researcher to see whether the examinees would differ
in the same way on the common portion as they would on the optional questions
portion. If the correlation coefficient were strong, the researcher would know
that section A measured a similar construct to section B. A strong correlation signifies
that the mathematical knowledge and skills assessed in section A were also assessed
in section B, making the two sections measure the same mathematical elements.
This is one requirement for two tests to be amenable to equating (Liu, Cahn, and Dorans,
2006).
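The two quantities used in this analysis can be computed as in the following sketch. The function name is hypothetical and the section totals are invented for illustration; the coefficient of determination is simply the square of the correlation coefficient.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between paired score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


# Invented totals for five examinees (not thesis data).
section_a_totals = [20, 30, 35, 40, 50]
section_b_totals = [10, 14, 20, 19, 27]

r = pearson_r(section_a_totals, section_b_totals)
r_squared = r ** 2  # proportion of variance shared by the two sections

print(round(r, 3), round(r_squared, 3))
```

A value of `r_squared` near 1 would suggest that the two sections rank examinees in much the same way, which is the sense in which they are taken to measure a similar construct.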
3.4.3 Establishing group invariance on equating/linking functions of examinees that
chose a concerned optional question and for those that selected another
question
In a normal examination, the raw score $Y_i$ for an examinee selecting question $j$ is
unobservable; in fact, $Y_i$ is a missing datum. Equating $Y_i$ onto $X$ on $P_j$ is therefore
impossible. This equating function is denoted $X_{ij}(y_i)$. For instance, an examinee
who chose optional questions, say, 7, 9, and 12 would have unobserved scores on
optional questions 8, 10, and 11. Thus equating the score on, say, question 8 to the scale
of the total score of section A in the group that selected question 7, or 9, or 12 is
impossible. We could denote this equating function as $X_{8,7}(y_8)$, or $X_{8,9}(y_8)$, or
$X_{8,12}(y_8)$ with respect to the chosen optional questions.
The missing scores, however, were available and were used to determine the
means $\mu_{ij,1}$ and standard deviations $\sigma_{ij,1}$. These moments were used together with
the means $\mu_{X_j,1}$ and standard deviations $\sigma_{X_j,1}$ of section A to establish the slopes and
intercepts of the functions $X_{ij}(y_i)$. The computable missing linear equating function
is
\[ X_{ij}(y_i) = \mu_{X_j,1} + \frac{\sigma_{X_j,1}}{\sigma_{ij,1}}\,(y_i - \mu_{ij,1}). \]
The slopes and intercepts of the observable score
equating functions $X_i(y_i)$ were computed using the equation
\[ X_i(y_i) = \mu_{X_i,1} + \frac{\sigma_{X_i,1}}{\sigma_{i,1}}\,(y_i - \mu_{i,1}). \]
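As a rough sketch of how such a linear function can be obtained, the code below matches the mean and standard deviation of a question's scores to those of the section A totals. The scores are hypothetical values invented for the example, the function name linear_link is an assumption of this sketch, and the use of the population standard deviation (pstdev) is likewise an assumption rather than a detail confirmed by the study.

```python
from statistics import mean, pstdev

def linear_link(y_scores, x_scores):
    """Return (slope, intercept) of the line mapping a Y score onto the X scale.

    The line matches the mean and standard deviation of Y to those of X:
    X(y) = mu_x + (sigma_x / sigma_y) * (y - mu_y).
    """
    mu_y, sigma_y = mean(y_scores), pstdev(y_scores)  # population SD: an assumption
    mu_x, sigma_x = mean(x_scores), pstdev(x_scores)
    slope = sigma_x / sigma_y
    intercept = mu_x - slope * mu_y
    return slope, intercept

# Hypothetical scores for one subgroup, invented for illustration only.
question_scores = [3, 7, 10, 5, 12, 8]        # one optional question, out of 15
section_a_totals = [20, 35, 44, 28, 50, 39]   # section A total scores

slope, intercept = linear_link(question_scores, section_a_totals)
linked = intercept + slope * 9  # a question score of 9 placed on the section A scale
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, X(9) = {linked:.2f}")
```

By construction, the mean question score maps onto the mean section A total, and a score one standard deviation above the question mean maps one standard deviation above the section A mean.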
For each optional question, there were five sets of linear functions. In each
set, one function belonged to the subgroup that chose the question concerned; another
function was for the subgroup that did not select the question concerned but chose
another question; and the last function was for the combined group. The two
subgroups in each set were mutually exclusive.
Dorans and Holland (2000) introduced two statistics to summarise differences
between the equating functions obtained from subgroups and the combined group. The
first one is the standardised Root Mean Square Difference, RMSD, which gives
detailed information as to which Y-score points, y, are most affected by
subgroup differences. The second one is the standardised Root Expected Mean
Square Difference, REMSD, which summarises overall differences between the
equating/linking functions. The formulae for the two statistics are
\[ \mathrm{RMSD}(y) = \frac{\sqrt{\sum_{h=1}^{H} w_h \left[ eq_{X_h}(y) - eq_X(y) \right]^2}}{\sigma_{X(\text{combined group})}} \tag{5} \]

\[ \mathrm{REMSD} = \frac{\sqrt{\sum_{h=1}^{H} w_h \sum_{y=\min(y)}^{\max(y)} \nu_{yh} \left[ eq_{X_h}(y) - eq_X(y) \right]^2}}{\sigma_{X(\text{combined group})}} \tag{6} \]

$eq_X$ represents transformed scores on $Y$ to the scale of $X$ for the combined group, and
$eq_{X_h}$ represents transformed scores on $Y$ to the scale of $X$ for subgroup $h$. $N_h$ is the
sample size for subgroup $h$, $N$ is the total number of examinees, and $w_h = N_h / N$ is the
weight for subgroup $h$. Furthermore, $N_{yh}$ is the number of examinees in
subgroup $h$ with a particular score ($y$) on $Y$, and $\nu_{yh} = N_{yh} / N_h$ is a weighting factor
for subgroup $h$ and score ($y$).
As can be noted, RMSD is computed at each y-value, and the contribution of
each subgroup is weighted by its proportional representation in the combined group.
REMSD is a doubly weighted statistic, weighted over both $\nu_{yh}$ and $w_h$.
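The two statistics can be sketched in code as follows, following equations (5) and (6): subgroup weights are N_h / N, and the within-subgroup score weights are N_yh / N_h. All linking functions, score counts, and the standard deviation below are hypothetical values invented for the example, and the function names are assumptions of this sketch.

```python
from math import sqrt

def rmsd(y, sub_funcs, combined_func, weights, sigma_x):
    """Standardised RMSD at score point y, as in equation (5)."""
    total = sum(w * (f(y) - combined_func(y)) ** 2
                for f, w in zip(sub_funcs, weights))
    return sqrt(total) / sigma_x

def remsd(sub_funcs, combined_func, sub_counts, sigma_x):
    """Standardised REMSD, as in equation (6).

    sub_counts[h] maps each score y to N_yh, the number of examinees
    in subgroup h who obtained that score.
    """
    sizes = [sum(counts.values()) for counts in sub_counts]  # N_h per subgroup
    n_total = sum(sizes)                                     # N
    total = 0.0
    for f, counts, n_h in zip(sub_funcs, sub_counts, sizes):
        w_h = n_h / n_total
        inner = sum((n_yh / n_h) * (f(y) - combined_func(y)) ** 2
                    for y, n_yh in counts.items())
        total += w_h * inner
    return sqrt(total) / sigma_x

# Hypothetical linking functions and score counts, for illustration only.
combined = lambda y: 10.0 + 2.0 * y
choosers = lambda y: 11.0 + 2.0 * y       # subgroup that chose the question
non_choosers = lambda y: 9.5 + 2.0 * y    # subgroup that chose another question
counts = [{5: 10, 10: 10}, {5: 20, 10: 20}]
sigma_x = 10.0                            # SD of section A scores, combined group

weights = [20 / 60, 40 / 60]
rmsd_at_5 = rmsd(5, [choosers, non_choosers], combined, weights, sigma_x)
remsd_value = remsd([choosers, non_choosers], combined, counts, sigma_x)
print(f"RMSD(5) = {rmsd_at_5:.4f}, REMSD = {remsd_value:.4f}")
```

In this toy case the subgroup lines are parallel to the combined line, so RMSD is the same at every score point and coincides with REMSD; with lines of different slope, RMSD would vary along the score scale.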
To evaluate the relative magnitude of RMSD and REMSD, Dorans and
Feigenbaum (1994) suggested the notion of a score Difference That Matters (DTM)
in the context of linking the SAT to the old SAT. For a test that is reported in 10-point
units, linking functions that are within 5 scaled score points of each other at a given
raw score point are treated as close enough to ignore because they differ by less than half
of a reported score unit of 10 (Dorans, 2004). Kolen & Brennan (2004, p. 462) give
a good illustration of the logic of DTM when reported scores are integers:
equivalents of 15.4 and 15.6 round to different integers even though they differ by
only .2 (less than a DTM), whereas equivalents of 14.6 and 15.4 round to the same
integer even though they differ by .8 (more than a DTM). The score unit on the
MSCE mathematics examination is 1 point, which is an integer. This means that
half of a score unit, .5, was considered as the score Difference That Matters.
Recall that the RMSD and REMSD statistics are standardised by dividing by the
standard deviation of scores on the compulsory section for the combined group. DTM was
standardised in the same manner so that it could be used as a benchmark for
evaluating RMSD and REMSD. When REMSD was below the standardised DTM, it
indicated that the equating functions for each subgroup were very close to that of
the combined group, hence they were group invariant. Otherwise, they failed
the group invariance test. These functions and RMSD were plotted on graphs to
visually display their similarities and differences.
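The benchmark logic can be sketched as follows. The standard deviation and the REMSD values used here are hypothetical numbers chosen for the example, and the function names are assumptions of this sketch; only the half-unit rule and the rounding illustration come from the text.

```python
def standardised_dtm(score_unit, sigma_x):
    """Half a reporting unit, divided by the SD used to standardise RMSD/REMSD."""
    return (score_unit / 2) / sigma_x

def is_group_invariant(remsd_value, score_unit, sigma_x):
    """True when REMSD falls below the standardised DTM benchmark."""
    return remsd_value < standardised_dtm(score_unit, sigma_x)

# Rounding illustration from Kolen & Brennan (2004): a small difference can
# cross a rounding boundary while a larger one may not.
print(round(15.4), round(15.6))   # differ by .2 but round to different integers
print(round(14.6), round(15.4))   # differ by .8 but round to the same integer

# Hypothetical values, for illustration only.
sigma_x = 12.0                    # SD of section A scores, combined group
benchmark = standardised_dtm(1, sigma_x)   # score unit of 1 point
print(f"standardised DTM = {benchmark:.4f}")
print(is_group_invariant(0.03, 1, sigma_x), is_group_invariant(0.07, 1, sigma_x))
```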
3.5 Ethical Considerations
Creswell (2003) says codes of professional conduct for researchers are
applicable to all research methods: qualitative, quantitative, and mixed methods. In
this study, the researcher observed two ethical codes of conduct. The first was obtaining
informed consent, and the second concerned privacy and confidentiality.
First, Gay and Airasian (2003) say that it is very rarely possible to conduct
research without the cooperation of people in the setting of the study. Such cooperation
depends on the researcher obtaining consent from participants. Before
carrying out the research, written permission was sought from the Education
Division Managers and headteachers to conduct the research at their schools
(appendices J, K, & L), and furthermore, students of the participating schools were
asked whether they agreed to take part in the study. Only those who agreed were
systematically selected to be the candidates. Rossman and Rallis (2003) comment
on the significance of getting informed consent from participants by saying that the
permission from the subjects is crucial for the ethical conduct of the research
because it serves to protect the privacy of the participants.
Second, Fowler (1995); Vaughn, Schumm, and Sinagub (1996); Rossman and
Rallis (2003) mention that privacy and confidentiality during data collection are of
paramount importance. Participants' responses should be kept confidential and they
should know the purpose of the study. Based on these assertions, the study assured
subjects of their privacy and confidentiality during the administration of the tests by
advising them not to disclose or write their names on the answer sheets. Letters and
numerical values were used to distinguish examinees from one another.
3.6 Validity and Reliability
Validity is defined as the accuracy or truthfulness of a measurement with
reference to a construct of specific interest; and reliability is concerned with
consistency of a measurement (Crocker & Algina, 1986; Bakewell, 2003). Hand
(2004, p.129) defines validity as how well the measured variable represents the
attribute being measured, or how well it captures the concept which is the target of
measurement. He further defines reliability as the differences between multiple
measurements of an attribute.
On validity, MANEB item setters developed the instrument that was used in
this study. These item setters are well-trained personnel with vast teaching
experience in mathematics. During the development of the tests, they use
blueprints, that is, tables of specifications, to guide them in terms of content coverage
and the level of cognitive demands. The blueprints help to maintain consistency of
the difficulty level of the tests over the years. The papers, therefore, possess the required
magnitude of content validity based on how they are designed. Furthermore, the
examinees took the tests three weeks prior to the National examinations. This means
that the students at that time were well prepared. Hence their responses were taken
as their optimal performance or achievement in MSCE mathematics paper 2 as they
displayed their true mathematics knowledge and skills.
In assuring reliability, a marking scheme was used for consistency in scoring,
and one item rater was used to avoid inter-rater variability. The marking scheme
used for scoring the test was developed by two experienced mathematics teachers
from Chiradzulu Secondary School. These teachers are also MANEB mathematics
raters. The scheme is similar to the MANEB scheme in terms of mark allocation and
content specification. Furthermore, before rating the items, the researcher and the
two teachers standardised the marking scheme to encompass the diversity of examinees'
answers. To ensure consistency, one question at a time was marked on every script before
the subsequent question was marked.
3.7 Delimitations and Limitations of the study
3.7.1 Delimitations
The study focused only on the optional questions of mathematics paper 2; hence
the findings would not apply to other MANEB examinations that allow examinee
choice.
The results would not be generalised to all secondary schools in Malawi
because the participating schools were purposively sampled. However, the results
would be related to other schools with characteristics similar to those of the sampled ones.
3.7.2 Limitations
Visiting all the secondary schools that offer mathematics would have been
ideal, but this was impossible due to time and financial constraints. Instead, the study
was done in five schools only.
Some students declined to participate in the study after previously agreeing to
do so. In some instances, candidates took only one paper instead of two. This
provided scores for one paper only, as scores for the other one were not
available. For this reason, they were dropped from the study, thereby reducing the
targeted sample size. This attrition was most pronounced at Njamba Secondary
School. A total of 53 candidates were lost through attrition.
Finally, the MANEB marking scheme was not issued to the researcher.
MANEB treats it as a confidential document that cannot be given to anyone outside the
organisation. This created a minor setback because the plan had been to use the MANEB mark
scheme. It resulted in extra financial and other resource costs in bringing together two
experienced teachers from Chiradzulu Secondary School, who are also MANEB
item writers and scorers, and the researcher to develop another marking
scheme. Nonetheless, our combined experience as item scorers made the marking
scheme similar to the ones developed by MANEB.
CHAPTER 4
4.0 RESULTS AND DISCUSSION OF THE FINDINGS
4.1 Introduction
In this chapter, the results and discussion of the findings are presented under three
main sections. The sections were formulated based on the research questions. Thus
they present answers to the research questions in the order in which they were posed, starting
with the first research question, followed by the second. The third research question is
addressed in the final section, together with a chapter summary.
4.2 To what extent do optional questions differ?
4.2.1 Preliminary Analysis
The item content and major content areas that made up section A and section B
are outlined in Tables 4.1 and 4.2 respectively. Almost all content areas that were
examined in section A were also tested in section B, but with different item
content. This signifies that the two sections were measuring the same construct.
Construct similarity is viewed as same framework (Feuer et al. 1999); thus both
sections were built on the same framework.
Furthermore, Feuer et al. (1999) define same test specifications as similarity
in measurement characteristics/conditions such as test length, test format,
administration conditions, etc. Popham (1974) as cited by Crocker and Algina
(1986) defines item specification as sources of item content, descriptions of the
problem situations or stimuli, etc. In view of both definitions, the items in both
sections were built on different item specifications. This is evidenced by similar
item format but different sources of item content. Further, the differences rested in
the levels of cognitive operation demands. Most questions in section A demand less
cognitive operation than those in section B as indicated by p-values in Table 4.3.
Table 4.1: Major content areas of section A

Section A
Question No.   Item content            Content areas
1(a)           Algebra fractions       Algebra, patterns, & functions
1(b)           Irrational numbers      Numeration
2(a)           Subject of a formula    Algebra, patterns, & functions
2(b)           Matrices                Algebra, patterns, & functions
3(a)           Triangle geometry       Geometry
3(b)           Remainder theorem       Algebra, patterns, & functions
4(a)           Circle geometry         Geometry
4(b)           Mapping                 Algebra, patterns, & functions
5(a)           Measurement             Numeration
5(b)           Speed-time graph        Numeration
6(a)           Similar figures         Geometry
6(b)           Vectors                 Numeration
Table 4.2: Major content areas of section B

Section B
Question No.   Item content                                Content areas
7(a)           Statistics                                  Statistics & probability
7(b)           Formulation & solving quadratic equation    Algebra, patterns, & functions
8(a)           Partial variation                           Algebra, patterns, & functions
8(b)           Probability                                 Statistics & probability
9(a)           Exponential equation                        Algebra, patterns, & functions
9(b)           Linear programming                          Algebra, patterns, & functions
10(a)          Equation of a straight line                 Algebra, patterns, & functions
10(b)          Arithmetic progression                      Algebra, patterns, & functions
11(a)          Cyclic quadrilateral                        Geometry
11(b)          Sets                                        Numeration
12(a)          Trigonometry                                Numeration
12(b)          Solving polynomial equation graphically     Algebra, patterns, & functions

Having looked at the framework and the test/item specifications of the two
sections of the test under investigation, it would be reasonable to use the term
linking rather than equating because the two sections had different item content
but the same content areas, and the choice items in section B were not equal in length
to the items in section A. Further, the level of cognitive processes required in the two sections
was different, as illustrated in subsection 4.2.2. Thus, the two portions measured the
same construct but were built to different specifications. However, when equating choice items,
the interest is in item content as opposed to the content areas of the test form because
item scores are the ones to be linked within the same test.
4.2.2 Comparing p-values of section B
Table 4.3: P-values for questions in section A and section B 'without choice'

Section A                                      Section B
Item   Max. mark   Average mark   p-value      Item   Max. mark   Average mark   p-value
1      8           5.190          0.649        7      15          6.436          0.429
2      7           4.401          0.629        8      15          5.061          0.337
3      9           5.518          0.613        9      15          5.869          0.391
4      10          5.801          0.580        10     15          5.116          0.341
5      11          6.324          0.575        11     15          3.927          0.262
6      10          1.917          0.192        12     15          7.566          0.504

Table 4.3 displays the item difficulty indices (p-values) for questions in section
A and section B without any choice. Questions in section A generally have higher
p-values than those in section B. This affirms the notion that section A questions
are easier than section B questions. The questions in the latter section were
relatively difficult because they usually provided deep coverage of the content
domain. Adopting the terms used by Wainer and Thissen (1994), most of the
questions in section B would be called large items. Section A questions would be
dubbed short items because most of them were considerably straightforward.
However, question 6 in section A had the lowest p-value amongst all questions in
the test. The predicament which candidates faced in attempting this question was
translating the word problem into correct computable mathematical concepts.
Levels of proficiency in language skills might have influenced the performances on
this question (Crocker and Algina, 1986).
Focusing on section B questions, it is noted that question 11 was the most
difficult, and question 12 was the easiest. Ordering them from the least
difficult to the most difficult, one gets questions 12, 7, 9, 10, 8, and
11.
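The ordering above can be checked with a short sketch. It assumes that the p-value of a question is its average mark divided by its maximum mark, which is consistent with the figures in Table 4.3; the marks are those reported in the table, while the variable and function names are assumptions of this sketch.

```python
# (maximum mark, average mark) for each section B question, from Table 4.3
section_b = {
    7: (15, 6.436),
    8: (15, 5.061),
    9: (15, 5.869),
    10: (15, 5.116),
    11: (15, 3.927),
    12: (15, 7.566),
}

def p_value(max_mark, average_mark):
    """Item difficulty index: average mark as a proportion of the maximum mark."""
    return average_mark / max_mark

p_values = {q: round(p_value(*marks), 3) for q, marks in section_b.items()}

# A higher p-value means an easier question, so sort in descending order.
easiest_to_hardest = sorted(p_values, key=p_values.get, reverse=True)
print(p_values)
print(easiest_to_hardest)
```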
As noted, optional question 11 was the most difficult. Under the current assessment
policy on MSCE mathematics paper 2 examinations, however, it makes no difference
whether a student earns a raw score of, say, 7 on question 11 or on question 12,
which is the easiest. In all fairness, it is clear that a student who receives a score of 7
on question 11 demonstrates more proficiency than another student who gets the same score on
question 12. Wainer, Wang, and Thissen (1991) and Wainer and Thissen (1994) say
that when optional questions that are differentially difficu