
EXAMINING THE UNTESTABLE ASSUMPTIONS OF THE CHAINED LINEAR LINKING FOR LIVINGSTON SCORE ADJUSTMENT WITH APPLICATION TO

THE 2005 MSCE MATHEMATICS PAPER 2.

M.Ed (Testing, Measurement and Evaluation) Thesis

By CHIFUNDO STEVEN AZIZI

BSc (Ed) Mzuzu University

Submitted to the Department of Educational Foundations, Faculty of Education,

in partial fulfilment of the requirements for the degree of

Master of Education (Testing, Measurement and Evaluation)

University of Malawi, Chancellor College

June, 2009



DECLARATION

I, the undersigned, hereby declare that this thesis is my own original work which has not

been submitted to any other institution for similar purposes. Where other people's work

has been used, acknowledgements have been made.

____________________________________

Full Legal Name

_____________________________________

Signature

_____________________________________

Date



Certificate of Approval

The undersigned certify that this thesis represents the student's own work and effort and has been submitted with our approval.

Signature: ____________________________ Date: __________________________

M. Kazima PhD (Senior Lecturer)

Main Supervisor

Signature: ____________________________ Date: __________________________

L. Kazembe PhD (Senior Lecturer)

Member, Supervisory Committee



To the memory of my late father, Charles Frank Azizi and late brother, Charles Mike

Azizi. May their souls rest in peace!



ACKNOWLEDGEMENTS

I would like to thank Dr. M. Kazima and Dr. L. Kazembe, my main supervisor and

co-supervisor respectively, for their many suggestions and constant support during this

research. Without them this work would never have come into existence.

I also wish to thank the headteachers of Blantyre, Henry Henderson Institute,

Bangwe, Chiradzulu, and Njamba secondary schools for allowing me to collect data from

their institutions. My gratitude also goes to the Executive Director of the Malawi National

Examinations Board (MANEB) for authorising me to use the 2005 MSCE mathematics

examination paper 2. My sincere appreciation also goes to the students who participated in

this study; you really helped me a lot.

I am grateful to my mum, my fiance, brothers and sisters for their love and

financial support. Special mention goes to the Ministry of Education for funding my tuition

fees. Finally, words alone cannot express my gratitude to the Almighty God, who made it

possible for me to complete this study, and for the infinite blessings.



ABSTRACT

MSCE mathematics paper 2, like many high-stakes test formats, includes a section

of optional questions in addition to a mandatory part. It has been argued that offering

options and comparing final scores is often not fair to examinees, especially to those who

attempt the most difficult questions from the optional part. Livingston (1988) proposed a way

of adjusting essay scores. This was later explained from the perspective of test equating by

Allen, Holland, and Thayer (1993), who concluded that the proposal made implicit

assumptions of chained linear equating about the unobserved data. This study tested

these assumptions with application to the 2005 MSCE mathematics examination paper 2 so as

to determine whether Livingston score adjustment could be used on this examination.

The study used systematic sampling to obtain examinees from five purposively

selected secondary schools. The 2005 MSCE mathematics paper 2 was administered to

247 examinees in two parts, section A followed by section B. For section B, examinees

were asked to first indicate their choice of three optional questions and were then

instructed to answer all of the questions.

The results were analysed using Root Mean Square Difference (RMSD) and Root

Expected Mean Square Difference (REMSD) to quantify the differences between the

subgroups' linking functions for unobserved and observed data. It was found that group

invariance did not hold across all the subgroups that were involved. This means that

Livingston score adjustment would not be possible on this examination. It is

recommended that, in order to minimise inequity in optional question scores, item writers

use analytical methods to strictly match the levels of cognitive demand of topics by

using MSCE mathematics performance level descriptors when constructing the optional

items.



TABLE OF CONTENTS

DEDICATION

ACKNOWLEDGEMENTS

ABSTRACT

LIST OF TABLES

LIST OF FIGURES

LIST OF ACRONYMS AND ABBREVIATIONS

CHAPTER

1 INTRODUCTION

1.1 Background

1.1.1 Characteristic of the examination investigated

1.1.2 Grade Awarding Process

1.1.3 Comparability of optional questions' raw scores

1.1.4 Livingston's raw score adjustment

1.2 Statement of the Problem

1.2.1 Purpose of the Study

1.2.2 Research Questions

1.2.3 Significance of the study

1.3 Theoretical Framework

1.4 Definition of terms

2 LITERATURE REVIEW

2.1 Introduction

2.2 General information on optional questions

2.3 Advantages of optional questions

2.4 Problems of optional questions

2.4.1 The syllabus

2.4.2 The abilities of candidates

2.5 Relationship between candidates' question choice and getting high scores

2.6 Linking and Equating

2.7 Can we link or equate optional questions?

2.8 What are the consequences of not linking/equating optional questions scores?

3 METHODOLOGY

3.1 Introduction

3.2 The Research Questions

3.3 The Design

3.3.1 Description of the Research

3.3.2 Population

3.3.3 Sampling

3.3.4 Instruments

3.3.5 The administration of the instruments and data gathering

3.4 Data Analysis

3.4.1 Extent of difficulty in optional questions

3.4.2 Correlation of scores on section B and total scores of the section A

3.4.3 Establishing group invariance on linking/equating functions of examinees that chose a concerned optional question and for those that selected other questions

3.5 Ethical Considerations

3.6 Validity and Reliability

3.7 Delimitations and Limitations of the study

3.7.1 Delimitations

3.7.2 Limitations

4 RESULTS AND DISCUSSION OF THE FINDINGS

4.2 To what extent do optional questions differ?

4.2.1 Preliminary analysis

4.2.2 Comparing p-values of section B

4.3 How are scores on section A and section B with choice correlated?

4.4 Establishing group invariance on linking/equating functions of examinees that chose a concerned optional question and for those that selected other questions

4.4.1 Linking functions that largely vary at lower tail of choice question scale

4.4.2 Linking functions that largely vary at upper tail of choice question scale

4.4.3 Linking functions that largely vary at lower and second upper tail of choice question scale

4.4.4 Linking functions that largely vary at both lower and upper tails of choice question scale

4.4.5 Linking functions that constantly vary across the entire score scale

5 CONCLUSIONS, IMPLICATIONS AND RECOMMENDATION

5.2 Conclusions

5.2.1 The main findings of the literature review

5.2.2 The main findings of the empirical investigation

5.3 Implications

5.4 Recommendation

REFERENCES

APPENDICES

A. Pairs of subgroups that chose particular questions and other questions

B. Pairs of subgroups that chose particular questions and other questions

C. Section A of 2005 M.S.C.E. Examination paper 2 presented in this study as paper I

D. Section B of 2005 M.S.C.E. Examination paper 2 presented in this study as paper II

E. Answer sheet cover page for paper I

F. Answer sheet cover page for paper II

G. Original form of 2005 M.S.C.E. Examination mathematics paper 2

H. Letter to Executive Director of Malawi National Examinations Board

I. Letter from Executive Director of Malawi National Examinations Board

J. Letter to secondary school headteacher

K. Letter to Shirehighlands Education Division Manageress

L. Letter to South West Education Division Manager

M. My introduction letter from Head of Department to secondary schools headteachers



LIST OF TABLES

Table

4.1 Major content areas of section A

4.2 Major content areas of section B

4.3 P-values for questions in section A and section B without choice

4.4 Pairs of subgroups that chose particular questions and other questions, whose graphs are illustrated in appendix A

4.5 Pairs of subgroups that chose particular questions and other questions, whose graphs are illustrated in appendix B



LIST OF FIGURES

Figure

4.1 Equated scores on section A from optional question 7 that largely vary at lower tail of choice question scale

4.2 Equated scores on section A from optional question 8 that largely vary at higher tail of choice question scale

4.3 Equated scores on section A that largely vary at lower and second upper tail of choice question scale from different optional questions

4.4 Equated scores on section A that largely vary at both lower and upper tails of score scale of optional question 10

4.5 Equated scores on section A that vary constantly across the entire score scale of optional question 7



LIST OF ACRONYMS AND ABBREVIATIONS

AP Advanced Placement

CSE Certificate of Secondary Education

DTM Difference That Matters

HHI Henry Henderson Institute

IRT Item Response Theory

MANEB Malawi National Examinations Board

MSCE Malawi School Certificate of Education

NEAT Non-Equivalent groups Anchor Test

REMSD Root Expected Mean Square Difference

RMSD Root Mean Square Difference



CHAPTER 1

1.0 INTRODUCTION

This chapter provides a general overview of the problem under study. It

considers important concepts that dissect the problem into manageable components.

The first section is the background, followed by the statement of the problem, the theoretical

framework, and the definition of terms.

1.1 Background

The Malawi School Certificate of Education (MSCE) examination is used, among other things,

for certification, selection for tertiary education, and employment decisions. Several

subjects are examined at MSCE, including mathematics, which is rated as one of

the most significant subjects for entry into most programmes in Malawian

universities. The University of Malawi, in particular, prefers candidates with at least a

credit in mathematics, among other subjects, for enrolment in almost every programme that

is offered.

1.1.1 Characteristic of the examination investigated

In the MSCE examination, mathematics has two papers: paper 1 and paper 2. Paper

1 asks candidates to attempt all 24 questions in 2 hours and, by design, it is easier

than paper 2, although the two papers carry the same weight: each paper carries 100



marks. Paper 2 has two sections, A and B (see appendix G). Section A is

compulsory, where candidates attempt six questions worth 55 marks in total. In

section B, however, candidates are allowed a choice of questions to answer. Out of six

questions, candidates are asked to answer three questions only, worth 45 marks in

total. Paper 2 runs for 2 hours.

1.1.2 Grade Awarding Process

Mathematics, like all other subjects at MSCE examination, is graded on a nine-

point scale (Malawi National Examinations Board, 1999).

1-2, denote pass with distinction;

3-6, denote pass with credit;

7-8, denote general pass; and

9, denotes fail.

The raw score of each candidate is converted into a grade. This is done by an

awards committee that uses grade boundaries (cutoff scores) to turn scores into

grades (Khembo, 2004). Because mathematics has two papers, each paper is graded

separately and then the corresponding cutoff scores at 2/3, 6/7, and 8/9 are summed to

determine the final cutoff scores for the subject.
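The summing of paper-level cutoff scores can be illustrated with a minimal Python sketch. The cutoff values below are hypothetical (the text does not report actual boundaries), the names `subject_cutoffs` and `grade_band` are introduced here for illustration only, and the rule that a score at or above a boundary falls in the higher band is an assumption taken from the definition of a cutoff score.

```python
# Hypothetical cutoff scores (out of 100) for each paper; not actual MANEB values.
paper1_cutoffs = {"2/3": 80, "6/7": 50, "8/9": 35}
paper2_cutoffs = {"2/3": 70, "6/7": 40, "8/9": 25}


def subject_cutoffs(p1, p2):
    """Sum the corresponding paper cutoffs to get subject cutoffs (out of 200)."""
    return {boundary: p1[boundary] + p2[boundary] for boundary in p1}


def grade_band(total, cutoffs):
    """Place a combined raw score into one of the four bands of the nine-point scale."""
    if total >= cutoffs["2/3"]:
        return "distinction (1-2)"
    if total >= cutoffs["6/7"]:
        return "credit (3-6)"
    if total >= cutoffs["8/9"]:
        return "pass (7-8)"
    return "fail (9)"


final_cutoffs = subject_cutoffs(paper1_cutoffs, paper2_cutoffs)
band = grade_band(89, final_cutoffs)  # a combined score just below the 6/7 boundary
```

The sketch only separates the four bands; locating the individual grades within a band would need the remaining boundaries, which the text does not describe.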

1.1.3 Comparability of optional questions' raw scores

Livingston (1988) observed that question developers try their best to make

optional questions equally difficult. Angoff (1971); Newton (1977); and Wainer &

Thissen (1994), however, argue that it is not easy to produce tests that are similar in

difficulty. Though item setters strive to produce questions of equal difficulty, the



questions have their own inherent intricacy that cannot be equalized. This inherent

difficulty comes from the complexity of the topics from which the questions are

drawn. It would be naive to compare a raw score that an examinee gets on an

optional question which elicits, for example, the use of Venn diagrams to analyse

and interpret data with the raw score on a question which asks an examinee to find the sum of

a geometric progression using a formula. These two questions come from different

topics which differ in complexity; hence raw scores on these two questions will not

mean the same thing because the raw scores on the two questions do not indicate

the same level of knowledge and skill. The scores will not be comparable. To treat

them as if they are comparable would be misleading for the score users and unfair

to the examinees.

Having looked at the complexity of measuring examinees who answer different

questions, the question would be: should choice questions still be incorporated in our

examinations? The merits and demerits of optional questions are discussed in

the literature review. However, Kierkegaard (1986, p.24) argues, "if you allow

choice, you will regret it; if you don't allow choice, you will regret it; whether you

allow choice or not, you will regret both." This argument highlights that if choice

were not allowed, the limitations on the domain coverage forced by the small

number of questions might unfairly affect some candidates. And on the other hand,

choice would compromise test fairness when it comes to comparison of scores

because of different levels of knowledge and skills being elicited from examinees

from each optional question. One could propose increasing the length

of the test, but this is often not practical (Wainer and Thissen, 1994), taking into

consideration examination time and examinees' fatigue. The onus, therefore, remains

with the examiners.

In the case of mathematics paper 2, there have not been any intense arguments over

the behaviour of optional questions, except Khembo's (2004) sentiments against the policy

of allowing choice. With little or no study done on optional questions on

examinations administered by Malawi National Examinations Board (MANEB), the

policy of allowing choice questions in mathematics paper 2 would continue without

reforms and innovations to improve fair assessment because most of the stakeholders

would not know how the choice questions are performing on this paper.

1.1.4 Livingston's raw score adjustment

Psychometricians, nevertheless, have tried to find a post hoc solution to the

incomparability of optional questions scores. Livingston (1988) developed a method

for adjusting scores of optional questions to take away the differential in difficulty of

the questions. The procedures, in brief, are imputing a score for the examinee on

each optional question which the examinee does not answer, and then averaging the

scores, observed and imputed, over all optional questions. Allen, Holland and

Thayer (1993) observe that the methodology makes implicit assumptions when

imputing scores using chained linear equating. Under this procedure, raw scores on

optional question i are transformed to the scale of optional question j through

scores on the mandatory section (also known as the common portion) for the examinees that

answered question i.



1.2 Statement of the Problem

Mathematics is one of the papers at MSCE examinations that are not pre-tested

(Khembo, 2004). Pretesting allows item analysis, which in turn ensures that only

questions of proven quality are included in the final examination. When examiners

compile an examination paper they assume that the selected questions have equal

inherent difficulty, as evidenced by the equal allocation of marks (each optional

question carries 15 marks).

A study by Khembo (2004), which investigated the use of

performance level descriptors to ensure consistency and comparability in standard

setting, revealed that item difficulty indices (item p-values) for the 2002 mathematics

paper 2 examination varied considerably for questions in section B. For example,

questions 10a and 10b had p-values of 0.03 and 0.01; questions 7a and 7b had p-values of

0.52 and 0.15; and questions 12a and 12b had difficulty indices of 0.27 and 0.14. Comparing

the p-values of the mentioned questions, one would note that the items were

differentially difficult. However, some would argue that the items were attempted by

non-equivalent groups conditioned on choice, and that it would not be possible to

compare their p-values outright. This argument is valid, but in the mentioned study,

the researcher employed competent mathematics teachers to establish differential

difficulty on the optional questions. The rating by the judges using performance

level descriptors for questions in section B for 2002 and 2003 mathematics papers

confirmed that some questions required higher order cognitive demands than others

for an examinee to succeed. The judges complemented what was observed from the

p-values.
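To make the notion of an item p-value concrete, here is a minimal Python sketch. It assumes the common convention that, for an item marked out of several points, the difficulty index is the mean score expressed as a proportion of the maximum marks; the function name `item_p_value` and the scores are hypothetical and are not taken from Khembo (2004) or from this study's data.

```python
def item_p_value(scores, max_marks):
    """Difficulty index: mean item score as a proportion of the maximum marks.

    Values near 1 indicate an easy item; values near 0 indicate a hard item.
    """
    if not scores:
        raise ValueError("at least one score is required")
    return sum(scores) / (len(scores) * max_marks)


# Hypothetical marks of five examinees on two sub-questions, each out of 5.
part_a_scores = [4, 2, 3, 0, 1]
part_b_scores = [1, 0, 0, 2, 0]

p_a = item_p_value(part_a_scores, max_marks=5)
p_b = item_p_value(part_b_scores, max_marks=5)
```

Under this convention a lower value for part (b) than for part (a) would indicate that part (b) was harder for the examinees who attempted it, which is the kind of comparison made in the paragraph above.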



Given the observations from the teachers, coupled with the conspicuously differential

p-values for optional questions, it is clear that the introduction of optional questions

into this paper brings in unfairness in grading. The basis for comparability of raw

scores, thus, is considerably weakened since different examinees would answer

samples of questions that are not comparable in difficulty.

For this reason, there is a need to find a method which would circumvent

incomparability of measurements. Livingston (1988) proposed a method of adjusting

raw scores of optional questions to achieve fairness in grading examinees that take

different questions. In the procedure, Allen et al. (1993) note that there are implicit

assumptions, which are used in order to adjust the scores. They call them

Livingston missing data assumptions.

The assumptions are based on a key theoretical requirement of test equating

which emphasises that the resulting equating function should not depend on the

population on which it is calculated. In other words, the two equating functions

should be identical regardless of which subpopulation has attempted which question.

Therefore, before the method is adopted and adapted in our grading system,

especially in mathematics, there is a need to scrutinise it in detail.

1.2.1 Purpose of the Study

General objective

The general objective of the study is to test the assumption of chain linear

equating/linking for Livingston raw score adjustment method on optional questions

scores of MSCE mathematics paper 2.



Specific objectives

1. To distinguish the item difficulty levels of optional questions using item difficulty

indices of raw scores.

2. To compare correlations between total scores on the compulsory section (i.e.

Section A/common portion) and scores on the optional questions portion.

3. To establish whether the equating/linking functions of examinees that chose a

concerned optional question and of those that selected a different choice

question are group invariant.

1.2.2 Research Questions

1. To what extent do optional questions differ in difficulty?

2. How are scores on optional questions portion and total scores on the

common portion correlated?

3. Are the equating/linking functions of examinees that chose a concerned

optional question and of those that selected an alternative question group

invariant?

1.2.3 Significance of the study

Fairness in measurement is of paramount significance. Every examinee ought to

be measured using the same instrument and the same scale for comparability to be

meaningful. As already mentioned, mathematics is one of the subjects that are

treasured at Malawi School Certificate of Education; and as a result a certificate



without a pass in mathematics puts a person at a disadvantage when it comes

to selection for further studies or even job selection.

To forestall this measurement quandary, Livingston suggests a method for score

adjustment of optional questions to a common scale. It would be easy to adjust the

scores of MSCE mathematics paper 2 using this method. The consequences,

however, of that action are not known in our context; it is therefore worth testing

the mentioned fundamental assumptions, since Dorans (2004) and Liu, Cahn and Dorans

(2006) say that subgroup invariance is the most critical requirement and plays a significant role

in assessing fairness.

Furthermore, there has been no detailed research to the knowledge of the

researcher that has addressed the consequences of optional questions on the

examinations administered by Malawi National Examinations Board. This study

would evaluate the extent of relationship between knowledge and skills measured in

section A and those measured in section B. It would also explore the pattern of

choices in section B conditioned on topics in the Malawi senior mathematics syllabus.

1.3 Theoretical Framework

The process of equating is used to obtain comparable scores when more than one

test form is used in a test administration (Holland, von Davier, Sinharay, and Han,

2006). Angoff (1971) has defined the equating of tests as a process to convert the

system of units of one form to the system of units of the other so that the scores

obtained from one form could be compared directly with the scores obtained from

the other form.



The central reason for equating different test forms is to ensure fair decision

making regarding the test results (Liu and Dorans, 2008). There are three techniques

and methodologies for making different test forms comparable, known as equating

procedures (Jaeger, 1981; Petersen, Kolen, and Hoover, 1989; Cook and Eignor,

1991) or designs, namely random groups, single group, and common item

non-equivalent groups (also known as non-equivalent groups anchor test, NEAT).

The equating methods used in the common item non-equivalent groups

design include Tucker, Levine, and chain linear (von Davier and Kong, 2005). This

study focuses on the chain linear method because it uses common item score(s) as the

middle link in a chain of linear linking relationships. Basically, chain linear

linking is done by equalising standardised deviation scores (z-scores) on the two test

forms via the standardised deviation scores on the common item(s). Before going into the detail of

chain linear equating/linking, we first look at the Livingston score adjustment procedure

in steps, as presented by Allen, Holland, and Thayer (1993, pp. 17-18), because in the

end we would like to connect it with the chain equating/linking functions. Here is some

notation to make what follows easier to grasp:

$P$ = the entire population of examinees who take section A, which is also known as test $X$

$P_i$ = the subpopulation of $P$ that answers question $i$ in section B

$P_j$ = the subpopulation of $P$ that answers question $j$ in section B

$X$ = score on section A (common portion)

$Y_i$ = score on optional question $i$

$Y_j$ = score on optional question $j$

$\mu_{ii}, \sigma_{ii}, \rho_{XY_i}$ denote the mean, standard deviation, and correlation coefficient in $P_i$

$\mu_{jj}, \sigma_{jj}, \rho_{XY_j}$ denote the mean, standard deviation, and correlation coefficient in $P_j$

$Y_j$ = the score imputed on question $j$, for an examinee not in $P_j$

$Y^*_j$ = the score that would be imputed if the score on question $j$ were perfectly correlated with the scores on section A

Step 1: equating $Y_i$ to each of the $Y_j$. For examinees in $P_i$, obtain the converted

value of the observed $y_i$ on the scales of the other $Y_j$s. The converted values are

denoted $Y^*_{ij}(y_i)$.

Step 2: obtaining imputed values, $y_{j,\text{imputed}}(y_i)$, for $j \neq i$ for every examinee in $P_i$.

These imputed scores are weighted averages of the raw score $y_i$ and its equated score

on the $Y_j$ scale, $Y^*_{ij}(y_i)$:

\[ y_{j,\text{imputed}}(y_i) = (1 - \rho_{XY_j})\, y_i + \rho_{XY_j}\, Y^*_{ij}(y_i) \]

Step 3: calculate the adjusted score as the simple average of the observed raw score

and the imputed scores over all $k$ optional questions.

\[ Y_{\text{adj}} = \Big\{ y_i + \sum_{j \neq i} y_{j,\text{imputed}}(y_i) \Big\} \Big/ k \]

Combining steps 2 and 3 to get a simple expression for $Y_{\text{adj}}$, we first denote $\bar{\rho}$ as the

average of all the correlations $\rho_{XY_j}$:

\[ \bar{\rho} = \sum_j \rho_{XY_j} \Big/ k \quad \text{and} \quad \bar{Y}_i(y_i) = \sum_j \rho_{XY_j}\, Y^*_{ij}(y_i) \Big/ \sum_j \rho_{XY_j}, \]

where $\bar{Y}_i(y_i)$ is the weighted average of the converted values, in other words, a

transformation of $y_i$ into an average scale of the $k$ question scores determined by the

equations with weights proportional to the correlations $\rho_{XY_j}$.

A simple Livingston adjusted score function is expressed as

\[ Y_{\text{adj}} = (1 - \bar{\rho})\, y_i + \bar{\rho}\, \bar{Y}_i(y_i) \]
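The three steps and the combined expression can be checked numerically with a minimal Python sketch. It takes the converted values and the correlations as given inputs, and it assumes that the converted value of a score on its own question is the score itself, which is what makes the step-by-step average and the closed form agree. The function names and all numbers are hypothetical illustrations, not values from this study.

```python
from math import isclose


def livingston_stepwise(y_i, i, converted, rhos):
    """Steps 2 and 3: impute a score for each unanswered question, then average.

    converted[j] is the converted value of y_i on the scale of question j;
    rhos[j] is the correlation between section A scores and question j scores.
    """
    k = len(rhos)
    total = y_i  # observed raw score on the chosen question i
    for j in range(k):
        if j == i:
            continue
        # Step 2: weighted average of the raw score and its converted value.
        total += (1 - rhos[j]) * y_i + rhos[j] * converted[j]
    return total / k  # Step 3: simple average over all k questions


def livingston_closed_form(y_i, converted, rhos):
    """Combined expression; assumes converted[i] == y_i for the chosen question."""
    rho_bar = sum(rhos) / len(rhos)
    y_bar = sum(r * c for r, c in zip(rhos, converted)) / sum(rhos)
    return (1 - rho_bar) * y_i + rho_bar * y_bar


# Hypothetical example: three optional questions, examinee chose question 0.
y_i = 9.0
converted = [9.0, 11.0, 7.5]  # converted[0] equals y_i by assumption
rhos = [0.6, 0.5, 0.7]

stepwise = livingston_stepwise(y_i, 0, converted, rhos)
closed = livingston_closed_form(y_i, converted, rhos)
```

In this sketch both routes give the same adjusted score, which illustrates that the combined expression is only a rearrangement of steps 2 and 3.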

Returning to the chain linear equating/linking functions and connecting them with the

Livingston score adjustment, we find that:

In step 1, the linear equation for equating $Y_i$ to the scale of $X$ in $P_i$ is

\[ X_i(y_i) = \mu_{Xi} + \frac{\sigma_{Xi}}{\sigma_{ii}} (y_i - \mu_{ii}) \tag{1} \]

and the linear equation for equating $X$ to the scale of $Y_j$ in $P_j$ is

\[ Y_j(x) = \mu_{jj} + \frac{\sigma_{jj}}{\sigma_{Xj}} (x - \mu_{Xj}) \tag{2} \]

where $\mu_{Xj}$ and $\sigma_{Xj}$ are the mean and standard deviation of $X$ for examinees

choosing question $j$. The essence of the word "chained" in chained linear

equating is the substitution of $x$ in the $Y_j(x)$ of equation (2) with $X_i(y_i)$ from equation

(1), neglecting the fact that the two equating functions are for different populations

(Brennan, 2006). That is

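\[ Y_j(X_i(y_i)) = \mu_{jj} + \frac{\sigma_{jj}}{\sigma_{Xj}} (\mu_{Xi} - \mu_{Xj}) + \frac{\sigma_{jj}}{\sigma_{Xj}} \frac{\sigma_{Xi}}{\sigma_{ii}} (y_i - \mu_{ii}) = Y^*_{ij}(y_i) \tag{3} \]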


Braun and Holland (1982) indicate that for chain equating/linking to produce

unbiased results, the two chained equating/linking functions should not depend on

which population is used for the equating. Dorans and Holland (2000); von Davier,

Holland, and Thayer (2004); Dorans (2004); Liu, Cahn, and Dorans (2006) call this

requirement population invariance. It means that equating $Y_i$ to $X$ on $P_i$ ought to

give the same equating function as equating $Y_i$ to $X$ on $P_j$ (Allen et al., 1993). In this case $Y_i$

is missing data on $P_j$, which in this study will be available. The resulting linear

equating function of $Y_i$ to $X$ on $P_j$ is

\[ X_{ij}(y_i) = \mu_{Xj} + \frac{\sigma_{Xj}}{\sigma_{ij}} (y_i - \mu_{ij}) \tag{4} \]

where $\mu_{ij}$ and $\sigma_{ij}$ are the mean and standard deviation of $Y_i$ in $P_j$. The two linear

equating/linking functions (1) and (4) therefore must have the same

slope and intercept in order to meet the above requirement.
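A simple way to see what equal slope and intercept means in practice is to estimate the linear linking function separately in the two subpopulations and compare the coefficients. The Python sketch below does only that; it is not the RMSD or REMSD computation used later in the study, the data are hypothetical, and the function name `linear_link_coeffs` is introduced here for illustration.

```python
from statistics import mean, pstdev


def linear_link_coeffs(y_scores, x_scores):
    """Slope and intercept of the linear function linking Y to the X scale."""
    slope = pstdev(x_scores) / pstdev(y_scores)
    intercept = mean(x_scores) - slope * mean(y_scores)
    return slope, intercept


# Hypothetical scores on the same optional question (y) and on section A (x).
# Subgroup that chose the question:
y_choosers, x_choosers = [3, 6, 9, 12], [20, 30, 40, 50]
# Subgroup that chose other questions but also answered this one:
y_others, x_others = [2, 4, 6, 8], [15, 25, 35, 45]

slope_i, intercept_i = linear_link_coeffs(y_choosers, x_choosers)
slope_j, intercept_j = linear_link_coeffs(y_others, x_others)

# Difference between the two linking functions at each possible score 0..15.
differences = [
    (slope_i * y + intercept_i) - (slope_j * y + intercept_j) for y in range(16)
]
```

If the invariance requirement held exactly, every entry of `differences` would be zero; in this made-up example the two lines cross, so the differences change sign along the score scale.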

1.4 Definition of terms

Conventional secondary school: public school owned by Malawi government.

Cutoff score/cut score: a point on a score scale such that scores at or above

that point are in a different category or classification from scores below the

point.

Difficulty: a factor causing trouble in achieving a positive result or tending to

produce a negative result.

Optional questions: examinees' self-selected questions or choice of questions in a

test.



Performance descriptors: scale of achievement levels with a set of observable

behavioural descriptions

Test form: examination paper

National secondary: a school whose students are selected for admission from

different districts across Malawi.

District secondary: a school that admits students taken from the same district. It

offers boarding and lodging.

Day secondary: a school that offers no boarding and lodging. Its students come from

surrounding communities.

Grant-aided secondary: church affiliated school that receives financial assistance

from Malawi Government.



CHAPTER 2

2.0 LITERATURE REVIEW

2.1 Introduction

The literature review has seven sections. The first section gives general

information on optional questions. The second section discusses some advantages of

optional questions regarding their use in test forms. The third section looks at

problems that come with the policy of allowing candidates to choose questions in an

examination. The relationship between candidates' question choice and high performance is

discussed in the fourth section. The definition of linking and equating used in this study is

given in the fifth section. The sixth section discusses the possibility of linking and

equating optional questions using traditional equating methods. The last section

discusses the consequences of not linking/equating when choice items are

differentially difficult.

2.2 General information on optional questions

The introduction of optional questions into examinations brings in a certain

complication of the process of measurement, since different groups of candidates

will attempt different questions from a single paper, thereby creating room for

combinations of different test forms in candidates' scripts (Willmott & Hall, 1975;

Bell, 1997). In the context of mathematics paper 2, choosing three questions out of



six creates twenty possible combinations of test forms. The complication comes in

because candidates answer in effect different papers out of these different

combinations, especially when questions vary much in difficulty. It then means the

same total mark may not represent comparable performance (Lewis, 1974).
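The figure of twenty possible test forms follows from choosing three questions out of six, and can be confirmed with a short Python sketch. The question numbers 7 to 12 are an assumption about how section B is numbered; only the count depends on the text.

```python
from itertools import combinations
from math import comb

# Assumed numbering of the six optional questions in section B.
section_b_questions = [7, 8, 9, 10, 11, 12]

# Every distinct set of three questions is, in effect, a different test form.
test_forms = list(combinations(section_b_questions, 3))
number_of_forms = len(test_forms)
```

Each tuple in `test_forms` is one combination an examinee could submit, and the length of the list agrees with the binomial coefficient `comb(6, 3)`.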

A good test adequately samples questions from the content domain to provide

a sound basis for determining the extent to which a student has mastered the course.

Mann (1845, pp.37-40) as cited by Wainer, Wang, and Thissen (1991, p.2) argued

that

it is clear that the larger the number of questions put to a scholar the

better is the opportunity to test his merits. If but a single question is put,

the best scholar in the school may miss it, though he would succeed in

answering the next twenty without a blunder; or the poorest scholar may

succeed in answering one question, though certain to fail in twenty

others. Each question is a partial test, and the greater the number of

questions, therefore, the nearer does the test approach to completeness. It

is very uncertain which face of a die will turn up at the first throw; but if

the dice are thrown all day, there will be a greater equality in the number

of faces turned up.

The argument of Mann is quite plausible in the context of MSCE mathematics

syllabus. To determine that one has indeed mastered MSCE mathematics, it does not

take a single question answered correctly, but enough questions that cover fairly the

content domain. Section A, which is a mandatory section of the mathematics paper 2

contains fairly small items whilst in section B there are large items. Wainer et al.



(1991) define large items as those that take an examinee longer to complete than do

short items. Large items provide deep coverage of the content domain, which can

assure the examiner that an examinee who answers them correctly has

thoroughly mastered the course. For this purpose many large items are needed, but an

examinee cannot complete many large items within the allotted testing time. One

way of reconciling testing time limits with domain coverage is to provide many

large items and allow examinees to choose among them.

2.3 Advantages of optional questions

Optional questions have some advantages to candidates, teachers and examiners.

In this study, only three main advantages are discussed.

First, optional questions give each candidate the chance to answer questions on a
wide range of topics (Bradlow and Thomas, 1998). Because the paper carries more
questions than time would allow a candidate to answer, it covers more of the
syllabus. This in turn increases fairness among candidates (Allen, Holland, and
Thayer, 2005) because they are not restricted to answering questions sampled from a
few topics.

Second, optional questions are used in examinations that aim to measure the
higher-order cognitive domain (Allen et al., 2005). In these examinations, examiners
perceive candidates' work to be more authentic (Bradlow and Thomas, 1998). This
advantage applies most to optional essay questions, where candidates are simply
given a topic to write about. It also applies in mathematics, because optional
questions demand a high level of thinking. When an



examinee gets all marks on an optional question, it means s/he has demonstrated

high-level cognitive ability.

Third, examinations with question choice give teachers freedom to teach

particular portions of the syllabus in which they may be particularly interested

(Schools Council Examinations Bulletin, 1971; Willmott and Hall, 1975). Similarly,
candidates can concentrate on particular aspects of the topics in which they are
able to show themselves to the best advantage. With the optional questions of
mathematics paper 2, however, no teacher can confidently know which topics will be
examined; in essence, therefore, there is no freedom to teach particular topics and
leave out others.

Nevertheless, some teachers have problems delivering lessons on some mathematics
topics. As a result, they either engage someone who is comfortable with those
topics or present the topics poorly themselves. The latter situation leaves students
at a disadvantage in preparing thoroughly for the examination. It eventually
restricts their choices in the examination, since the mathematics domain has been
reduced by the teachers' incompetence.

Nonetheless, candidates are forced to prepare thoroughly by studying the whole

syllabus. One can be good at a particular topic, but still s/he is extrinsically

motivated to study hard on the other topics in order to do well because no one can

predict exact topics that will be examined.

2.4 Problems of optional questions

Although the merits of the above section cannot be denied, little attention has

been paid to the problems brought by optional questions when they are used in



examinations. Examiners appear to overlook some very important aspects of a test as
a measuring instrument. Below are accounts of two main problems associated with
examinees' choice of questions. The first concerns the difference in cognitive
demands across topics in a syllabus, while the second concerns the variability in
candidates' abilities.

2.4.1 The syllabus

A syllabus contains a number of different topics, and it may be argued whether or
not these topics are of the same basic level of difficulty (Willmott, 1972). A good
example of this argument is the question posed by the Schools Council Examinations
Bulletin (1971): in mathematics, is the quoting of a geometry theorem followed by an
example on a par with factorisation followed by the solution of a pair of
simultaneous equations? Certainly, the two topics or branches of mathematics could
not be at the same difficulty level in our syllabus. Quite a number of topics in the
senior secondary school mathematics syllabus have different levels of difficulty.
The comparability of the results of candidates attempting questions drawn from
different topics may therefore be questioned, and putting scores from different
optional questions on the same scale is necessary for fair comparisons.

2.4.2 The abilities of candidates

The level of questions may vary considerably within the same test form in terms

of level of proficiency required of the candidates to be able to answer the question



fully (Willmott, 1972). When question choice is provided, the type of responses
required of candidates over the whole paper is not controlled in any way. Some
candidates may choose to answer questions with a certain pattern of proficiency.
For example, if a paper of ten questions consisted of five descriptive questions
and five explanatory questions, and candidates were to answer five questions in
all, one would likely see some candidates who only describe and others who only
explain (Schools Council Examinations Bulletin, 1971). This would create a
measurement problem when one tries to treat candidates with the same marks as
having the same ability level (Willmott, 1972). In the case of mathematics,
candidates who are not good at graphs, for example, will tend to avoid graph
questions, and some whose proficiency is low in matrices and vectors will choose
other questions. However, answering their preferred questions does not guarantee
that they will get full marks on those questions. The gist of the matter is that if
they like geometry more than arithmetic and algebra, they go for that branch of
mathematics. The problem that then arises is one of comparison: is my geometry
better than your algebra or arithmetic? Wainer and Thissen (1994) are also
concerned with such comparisons, because the difficulty of the accomplishment must
be taken into account for a comparison to be meaningful. It would not be fair to
judge two examinees' mathematics proficiency on the basis of different questions.
Fair play ought to be achieved.



2.5 Relationship between candidates' question choice and getting high scores

The suggestion that optional questions allow candidates to select the questions on
which they can perform better is contradicted by research evidence. According to
Wang (1996), the correlation between the popularity ranking of the five choice
questions and their corresponding means was $-0.60$, and the correlation between
the ranking of the choice question combinations and mean score was $-0.22$. These
negative correlations are surprising because it is assumed that examinees choose
questions they feel they would get right. A study by Taylor and Nuttal (1974), as
cited by Bell (1997), asked candidates taking a Certificate of Secondary Education
(CSE) examination to answer the questions they had omitted on a separate occasion
after the actual examination. It was found that about 25% of candidates actually
showed an improvement in their final marks. This meant that not all candidates are
able to choose in advance the questions on which they will score most highly.

Power, Fowles, Farnum, and Gerritz (1992) found that the more the examinees

liked a particular topic, the lower they scored on an essay they subsequently wrote

on the chosen topic. This phenomenon holds particularly when the choice between the
questions is relatively hard for examinees to make, that is, when the choices are
not strongly determined (Allen, Holland, and Thayer, 1993). It is not known whether
the MSCE mathematics paper 2 optional questions present this kind of scenario, in
which most candidates find it hard to select the questions on which they would
score most highly. Malawi National Examinations Board item developers do try to
produce optional questions of equivalent difficulty by



following available guidelines (Khembo, 2004). It is yet to be seen whether
examiners' efforts to produce optional questions that are, at face value, of
equivalent difficulty produce hard choices on the part of examinees. The words "at
face value" are used because no detailed research has been done to ascertain the
notion of equal difficulty of optional questions.

2.6 Linking and Equating

Linking encompasses a broad perspective on score adjustment of different test
forms. Feuer, Holland, Green, Bertenthal, and Hemphill (1999), in their Uncommon
Measures report, presented three types of linking of scores of different tests that
are built based on

1. the same framework and same test specifications,

2. the same framework and different test specifications, or

3. different frameworks and different test specifications.

Kolen and Brennan (2004, p.427) defined the term "framework" as "a delineation of
the scope and extent (e.g., specific content areas, skills, etc) of the domain to be
represented in the assessment". They also defined "test specifications or blue
print" as the "specific mix of content areas and item formats, number of
tasks/items, scoring rules, etc". On the other hand, Mislevy (1992) and Linn (1993)
proposed a taxonomy for linking which focuses mainly on methodology. They grouped
linking into four categories, ordered by the strength of the resulting linkage:
equating, followed by calibration, projection, and lastly moderation.



When the first two types of linking presented by Feuer et al. (1999) are put into
the same perspective as the categories of Mislevy (1992) and Linn (1993), one finds
that the score adjustment relationship between test forms built on the same
framework and the same test specifications is called equating (Kolen and Brennan,
2004). When tests developed on the same framework but different specifications are
linked, the resulting relationship is called calibration. The term projection
applies when the methodology does not require the test forms to measure the same
constructs or domain, and the score adjustment relationship is obtained through
linear or non-linear regression. Moderation is a type of linking in which the test
frameworks are different but the constructs are similar (Kolen and Brennan, 2004).
In this case, the linkage relies fundamentally on distribution matching.

Looking specifically at equating as one type of linking, Lord (1980) outlined four
requirements that must be met for equating of, say, test $Y_i$ to test $Y_j$:

1. the same construct: the two tests must measure the same construct,

2. equity: once two test forms have been equated, it should not matter to the
examinees which form of the test is administered,

3. symmetry: the equating transformation should be symmetric. This means the
equating of $Y_i$ to $Y_j$ should be the inverse of equating $Y_j$ to $Y_i$,

4. subpopulation invariance: the equating transformation should be invariant across
subpopulations.

As noted previously from the definitions of the types of linking in the Uncommon
Measures report, same framework is viewed as construct similarity and same test



specifications is considered as similarity in measurement characteristics such as
test length, test format, administration conditions, etc (Kolen and Brennan, 2004).
These definitions are concordant with the four requirements for equating delineated
by Lord (1980). This study uses these definitions as benchmarks for deciding the
type of linking involved. Henceforth, therefore, the term linking refers to any
function used to connect the scores on one test to those of another test, and the
term equating is reserved for the special case of linking that satisfies the
benchmarks.

Livingston (2004); von Davier, Holland, and Thayer (2004); Holland, von

Davier, Sinharay, and Han (2006) describe chain linking as equating the scores on

the new form to scores on the anchor and then equating the scores on the anchor to

scores on the reference form. Putting the definition in our context, chain linear

linking describes equating the scores on a particular optional question (new form) to

total scores on common portion (anchor) and then linking the total scores on the

common portion to scores on the other optional questions (reference forms). The

chain formed by these two linking functions connects the score on the concerned

optional question to the scores on the other optional questions.
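The chain described above can be illustrated with a minimal sketch. It assumes that both links are linear functions that match means and standard deviations; the function names and all the moments used are hypothetical and are not taken from the study data.

```python
def linear_link(mean_from, sd_from, mean_to, sd_to):
    """Build a linear function that maps a score from one scale onto another
    by matching means and standard deviations."""
    slope = sd_to / sd_from
    intercept = mean_to - slope * mean_from
    return lambda score: intercept + slope * score


def chain(first_link, second_link):
    """Compose two linking functions into a single chained function."""
    return lambda score: second_link(first_link(score))


# Hypothetical moments, chosen only for illustration.
# Link 1: optional question (new form) -> common portion (anchor).
question_to_anchor = linear_link(mean_from=8.0, sd_from=4.0,
                                 mean_to=30.0, sd_to=10.0)
# Link 2: common portion (anchor) -> another optional question (reference form).
anchor_to_reference = linear_link(mean_from=32.0, sd_from=10.0,
                                  mean_to=10.0, sd_to=5.0)

question_to_reference = chain(question_to_anchor, anchor_to_reference)
print(question_to_reference(12.0))
```

In this sketch each link would be estimated in a different group of examinees, which is why the anchor has a different mean in the two links.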

The study is particularly interested in the first part of the chain where a

particular optional question scores are linked to total scores on common portion.

There is an assumption that the linear function from a particular optional
question's scores to scores on the common portion is the same in the two
populations, those that answer the concerned question and those that do not ($P_i$
and $P_j$) (von Davier & Kong, 2005). Based on the assumption's level of
attainment, we can substantiate the



Thayer (2005) discovered that question choice tends to be positively associated
with performance, in the sense that the better an examinee does on a question, the
more likely s/he is to prefer that question, and vice versa. This finding, however,
is muddied by a reversal in which examinees who prefer a certain question perform
better on the unpreferred question. They concluded that there is a substantial
amount of variation in performance on preferred and unpreferred choices and that it
is, therefore, difficult to justify the non-ignorable selection. Given these
findings, it seems impossible for scores on optional questions to be treated
interchangeably through traditional equating, because doing so is inconsistent with
the notion of standardised testing (Kolen and Brennan, 2004).

Though it is deemed impossible to equate optional question scores, comparability of
scores is nevertheless possible through score adjustment procedures (Kolen and
Brennan, 2004) that employ linking paradigms. Wainer, Wang, and Thissen (1991)
employed Item Response Theory (IRT) to explore the possibility of equating choice
items by assuming ignorable non-response, using data from the College Board's
Advanced Placement (AP) test in Chemistry. They treated examinees as two
subpopulations. Both were administered the common items but differed in the chosen
questions they took, and the data were used to calibrate the item parameters for
the common items and the selected questions. They succeeded, but without the
confirmatory evidence that could only be obtained with further data.

Allen, Holland, and Thayer (1994a, b) provided a general procedure based on

missing-data methods for non-ignorable non-response to estimate distribution of

scores on an optional part of a 1987 Advanced Placement (AP) European History



test. Using sensitivity analysis approach, they observed that an assumption of

ignorable non-response given additional information from the common section score

could determine the correct assumption about the non-response when only the

optional essay score and the common section were available. Fitzpatrick and Yen

(1995) investigated the psychometric characteristics of constructed response items

referring to choice and non-choice passages administered to students in Grades 3, 5,

and 8. The items were scaled using IRT methodology. The findings indicated that

the scores obtained on different choice sets were comparable when these choices

were scaled together with the non-choice items that all students took. The non-

choice items play an important role in producing comparable scores. Bridgeman,

Morgan, and Wang (1997) assessed the ability of history students to choose the

essay topic on which they could get the highest score. They concluded that
techniques for equating scores generated by different topics are not totally
satisfactory; therefore, scoring rubrics must be established by a single group of
raters to enable a single standard.

As can be noted, there is a mixed bag of success and failure in making choice item
scores comparable. Most of the studies mentioned used IRT methodology in their data
analyses, which requires strong assumptions about the test, such as
unidimensionality and local independence. Unidimensionality is the statistical
dependence among items that comes about because the test is measuring one latent
trait, and local independence is achieved when items are statistically independent
for each subpopulation of examinees whose members are homogeneous with respect to
the latent trait (Crocker and Algina, 1986; Hambleton, Swaminathan, and Rogers,



1991). Opponents of IRT argue that it is naive to assume that a single latent trait
accounts for the responses to items on a test. Thus, this study uses classical item
analysis statistics to test a key assumption of Livingston's score adjustment on
MSCE mathematics paper 2, based on the requirement that the two equating/linking
functions should not depend on the particular population used for equating.

At this juncture, it should be emphasised that the examinations used in the studies
mentioned differ in format from the one under study. In those studies, examination
papers had more than two sections, whilst ours has two sections only. In view of
this, it would not be plausible to conclude that equating is impossible for every
examination with optional questions until this is proven beyond reasonable doubt.

2.8 What are the consequences of not linking/equating optional questions scores?

Linking/equating has the potential to ameliorate the problems presented by choice
by adjusting scores on the optional questions for differences in difficulty. If
examinees who choose different items are to be fairly compared with one another,
the scores obtained on these items must be equated (Wainer, Wang, and Thissen,
1991, p.2). This process facilitates the linkage of scores on optional items to one
another by putting them on a comparable scale using the z-score model.

The optional questions are intended to test the same skills and types of

knowledge which are taken from the same syllabus. Though test developers try to

make the questions equally difficult, oftentimes some optional questions turn out to



be harder than others. Wang, Wainer, & Thissen (1993) observed that in 1989 AP

chemistry and 1989 AP American History, women were adversely affected because

most of them chose the more difficult items. This is one example among many of

unfairness that comes along with question choice. When some optional questions are

harder than others, the raw scores on those questions would not indicate the same

level of the knowledge or skill the questions are intended to measure thus the scores

would not be comparable.

As noted previously, it remains a fact that to develop choice items of equal

difficulty is a gargantuan challenge. Even so, to remove choice from the examination

would reduce domain coverage because of small number of items that would be

examined. This would affect some students. Increasing the length of time to

accommodate large number of items is often impractical. Since choice has been

decided as the desirable format for MSCE mathematics paper 2 examinations, there

are two main consequences of not putting the scores on the same scale. First, same

observed raw score on each optional item would not imply the same accomplishment

because the difficulties of the tasks are different. Second, observed total raw
scores from combinations of choice items in section B would still reflect different
patterns of mathematical proficiency from one combination to another, which might
complicate comparison.



CHAPTER 3

3.0 METHODOLOGY

3.1 Introduction

This chapter describes how the research problem was investigated. The list of

questions to be answered is given first. This is followed by the design of the study,

analysis plan, ethical consideration, validity and reliability. In the final section, a

narrative of delimitations and limitations of the study is presented.

3.2 The Research Questions

The following questions were addressed in this study:

1. To what extent do optional questions differ in difficulty?

2. How are scores on optional questions and total scores on the common

portion correlated?

3. Are linking/equating functions of examinees that chose a concerned

optional question and for those that selected another choice question

similar?



3.3 The Design

3.3.1 Description of the Research

The research strategy employed was a survey, because the researcher wanted the
measures used to be reliable and valid and wanted to guarantee fair representation
of all individuals to whom the results were to apply (Cohen, Manion, & Morrison,
2000; Slavin, 1984). Further, a quantitative approach was used because it follows
the positivist paradigm, which holds that the social environment is real and
constant regardless of time and setting (Creswell, 1994).

3.3.2 Population

The population of the study was all Form 4 students from purposively sampled
secondary schools in the South West and Shire Highlands Education Divisions.

3.3.3 Sampling

The study used purposive sampling where five secondary schools were chosen

to participate in the study. Two main reasons are given why purposive sampling

was preferred to others. First, the researcher wanted to ensure representation of
the four major types of conventional secondary school. This is in agreement with
Borg, Gall and Gall (1996), who say that a purposive sample provides more focused
data and allows for a detailed analysis of a particular segment of the population.
Second, due to limited research funds and time, it was judicious to engage schools
which were close to each other.



classroom assessment. Since the study wanted sixty participants from each school, a
sampling interval, $k$, was computed by dividing the number of students in the form
4 class at each school by 60. From the teacher's list, the name of the student
corresponding to the $k$th position was picked, and every $k$th name thereafter was
chosen until the required number was achieved.
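The selection procedure can be sketched as follows. This is an illustration only: the function name and the class list are hypothetical, and rounding the sampling interval down to a whole number is an assumption, since the text does not say how a non-integer interval was handled.

```python
def systematic_sample(names, sample_size):
    """Pick the k-th name and every k-th name thereafter, where k is the
    class size divided by the required sample size."""
    # Assumption: the interval is rounded down to a whole number.
    k = len(names) // sample_size
    if k < 1:
        raise ValueError("class is smaller than the required sample size")
    # names[k - 1] is the k-th name on the list; step through every k-th name
    # and stop once the required number has been reached.
    return names[k - 1::k][:sample_size]


# Hypothetical class list of 180 students; 60 are required, so k = 3.
class_list = [f"student_{i}" for i in range(1, 181)]
chosen = systematic_sample(class_list, 60)
print(len(chosen), chosen[:3])
```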

3.3.4 Instruments

The main instrument that was used is a 2005 Malawi School Certificate of

Education Examinations mathematics paper 2 (see appendix G). This paper was

purposively chosen because it was the latest paper at the time of writing the

research proposal.

The design was that the candidates had no choice in section B, thereby

increasing the test length by three more questions. In view of this, the paper was

divided into two parts; paper 1 representing section A (see appendix C) and paper 2

representing section B (see appendix D). This was done in agreement with the

observation of Hand (2004, p.120) that the more questions included in a test, the

more difficulty one might find in obtaining valid responses and candidates tire as

the number of questions increases, and might even refuse to take part if there are

too many.

Paper 1 consisted of six questions and the time allotted to it was 1 hour 30
minutes. Paper 2 took 2 hours and had six question choices. In this paper,
examinees were instructed to read all the optional questions, choose three
questions, and write down the numbers of these questions in order of preference.
They were then instructed to answer all six questions.

The other instrument was the questionnaire that was used as a cover page for

candidates' answer sheets for paper 1 and paper 2 (see appendices E and F

respectively). The questionnaire was used to solicit extra information from the

candidates such as question choice preference, exclusively for paper 2, gender, and

age.

3.3.5 The administration of the instruments and data gathering

The two papers were administered three weeks prior to commencement of

National Examinations. This was done to ensure that students had prepared

thoroughly in terms of mastering the whole mathematics syllabus. This is the time

when the majority of the secondary schools finish delivering lessons to students and

instead engage in revision of the various courses that are offered. The two test
papers were administered on the same day, starting with paper 1; after a 30-minute
break, paper 2 was taken.

Students were instructed to answer the questionnaire first before attempting the

questions in both papers. The time given to fill the questionnaire was two minutes.

3.4 Data Analysis

3.4.1 Extent of difficulty in optional questions

The item difficulty indices (p-values) were used to analyse the extent of

difficulty in optional question. These p-values are obtained by computing the



average mark obtained on the question divided by the maximum mark for that

question (Nuttal & Willmott, 1972). The p-values for questions in section A and

section B without choice (i.e. no choices were allowed on the optional questions

portion) were all calculated in the same manner. The item difficulty indices for

questions in section B without choice were unbiased statistics because all

examinees (population) were used to compute them.
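As a small illustration of the difficulty index defined above, the following sketch divides the average mark on a question by its maximum mark. The function name and the marks are hypothetical and are not taken from the study data.

```python
def difficulty_index(marks, max_mark):
    """Return the p-value of a question: the mean mark obtained on the
    question divided by the maximum mark for that question."""
    mean_mark = sum(marks) / len(marks)
    return mean_mark / max_mark


# Hypothetical marks of five examinees on a question marked out of 20.
marks_on_question = [10, 5, 15, 0, 20]
p_value = difficulty_index(marks_on_question, max_mark=20)
print(p_value)
```

A p-value close to 1 indicates an easy question and a value close to 0 a hard one.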

3.4.2 Correlation of scores on section B and total scores of the section A

The Pearson product-moment correlation coefficient between the common portion and
the question choice portion was calculated. The coefficient of determination was
worked out to determine the proportion of variance in section A that is associated
with the variance in section B. This analysis helped the researcher to see whether
examinees would differ in the same way on the common portion as they would on the
optional questions portion. If the correlation coefficient were strong, then the
researcher would know that section A measured a similar construct to section B. It
signifies that the

mathematical knowledge and skills that were asked in section A were also available

in section B; making the two sections measure the same mathematical elements.

This is one requirement amenable to equating for two tests (Liu, Cahn, and Dorans,

2006).
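The two quantities used in this analysis can be sketched as follows. The function name and the scores are hypothetical; the coefficient of determination is taken here as the square of the Pearson coefficient.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical total scores of five examinees on section A and section B.
section_a = [10, 20, 30, 40, 50]
section_b = [12, 18, 33, 37, 50]

r = pearson_r(section_a, section_b)
r_squared = r ** 2  # proportion of variance shared by the two sections
print(round(r, 3), round(r_squared, 3))
```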



3.4.3 Establishing group invariance on equating/linking functions of examinees that

chose a concerned optional question and for those that selected another

question

In a normal examination, the raw score $Y_i$ for an examinee selecting question $j$
is unobservable; in fact, $Y_i$ is a missing datum. Equating $Y_i$ onto $X$ in
$P_j$, therefore, is impossible. This equating function is denoted $X_{ij}(y_i)$.
For instance, an examinee who chose optional questions, say, 7, 9, and 12 would
have unobserved scores on optional questions 8, 10, and 11. Thus equating the score
of, say, question 8 to the scale of the total score of section A on the group that
selected question 7, or 9, or 12 is impossible. We could denote this equating
function as $X_{8,7}(y_8)$, or $X_{8,9}(y_8)$, or $X_{8,12}(y_8)$ with respect to
the chosen optional questions.

The missing scores, however, were available and were used to determine the means
$\mu_{ij,1}$ and standard deviations $\sigma_{ij,1}$. These moments were used
together with the means $\mu_{X_j,1}$ and standard deviations $\sigma_{X_j,1}$ of
section A to establish the slopes and intercepts of the functions $X_{ij}(y_i)$.
The computable missing linear equating is

$$X_{ij}(y_i) = \mu_{X_j,1} + \frac{\sigma_{X_j,1}}{\sigma_{ij,1}}\left(y_i - \mu_{ij,1}\right).$$

Other slopes and intercepts, those of the observable-score equating functions
$X_i(y_i)$, were computed using the equation

$$X_i(y_i) = \mu_{X_i,1} + \frac{\sigma_{X_i,1}}{\sigma_{i,1}}\left(y_i - \mu_{i,1}\right).$$

For each optional question, there were five sets of linear functions. For each

set, one function belonged to subgroup that chose a concerned question; the other

function was for a subgroup that never selected the concerned question but chose



another question; and the last function was for the combined group. The two

subgroups in each set were mutually exclusive.
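The slope and intercept of one such linear function can be sketched as below for the two mutually exclusive subgroups. The scores are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption; the text does not state which form of the standard deviation was used.

```python
from statistics import mean, pstdev


def fit_linear_link(y_scores, x_scores):
    """Return (slope, intercept) of the linear function that places optional
    question scores (y) on the scale of section A total scores (x)."""
    slope = pstdev(x_scores) / pstdev(y_scores)
    intercept = mean(x_scores) - slope * mean(y_scores)
    return slope, intercept


# Hypothetical scores for the subgroup that chose the concerned question.
choosers_y = [12, 15, 18]   # marks on the optional question
choosers_x = [40, 50, 60]   # total marks on section A

# Hypothetical scores for the subgroup that chose another question instead.
others_y = [4, 8, 12]
others_x = [20, 30, 40]

slope_c, intercept_c = fit_linear_link(choosers_y, choosers_x)
slope_o, intercept_o = fit_linear_link(others_y, others_x)
# The combined-group function pools the two mutually exclusive subgroups.
slope_all, intercept_all = fit_linear_link(choosers_y + others_y,
                                           choosers_x + others_x)
print(slope_c, intercept_c, slope_o, intercept_o)
```

If the two subgroup functions differ markedly from the combined-group function, the linking depends on the group used to estimate it.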

Dorans and Holland (2000) introduced two statistics to summarise differences
between the equating functions obtained from the subgroups and the combined group.
The first is the standardised Root Mean Square Difference, RMSD, which gives
detailed information as to which Y-score points, $y$, are most affected by subgroup
differences. The second is the standardised Root Expected Mean Square Difference,
REMSD, which summarises overall differences between the equating/linking functions.
The formulae for the two statistics are

$$RMSD(y) = \frac{\sqrt{\sum_{h=1}^{H} w_h \left[eq_{X_h}(y) - eq_X(y)\right]^2}}{\sigma_{X(\text{combined group})}} \qquad (5)$$

$$REMSD = \frac{\sqrt{\sum_{h=1}^{H} w_h \sum_{y=\min(y)}^{\max(y)} \nu_{yh} \left[eq_{X_h}(y) - eq_X(y)\right]^2}}{\sigma_{X(\text{combined group})}} \qquad (6)$$

$eq_X$ represents transformed scores on $Y$ to the scale of $X$ for the combined
group, and $eq_{X_h}$ represents transformed scores on $Y$ to the scale of $X$ for
subgroup $h$. $N_h$ is the sample size for subgroup $h$, $N$ is the total number of
examinees, and $w_h = N_h/N$ is the weight for subgroup $h$. Furthermore, $N_{yh}$
is the number of examinees in subgroup $h$ with a particular score $y$ on $Y$, and
$\nu_{yh} = N_{yh}/N_h$ is a weighting factor for subgroup $h$ and score $y$.

As can be noted, RMSD is computed at each $y$-value, and the contribution of each
subgroup is weighted by its proportional representation in the combined group.
REMSD is a doubly weighted statistic, over $\nu_{yh}$ and $w_h$.
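A minimal sketch of how the two statistics could be computed is given below. It assumes that the subgroup and combined-group linking functions are already available as Python functions, that the subgroup weights are the proportions $N_h/N$, and that the score weights are the within-subgroup proportions $N_{yh}/N_h$. All names and numbers are hypothetical.

```python
import math


def rmsd(y, subgroup_links, combined_link, weights, sd_x_combined):
    """Standardised root mean square difference at a single score point y."""
    total = sum(w * (link(y) - combined_link(y)) ** 2
                for link, w in zip(subgroup_links, weights))
    return math.sqrt(total) / sd_x_combined


def remsd(subgroup_links, combined_link, weights, score_counts, sd_x_combined):
    """Standardised root expected mean square difference over all score points.

    score_counts holds one dict per subgroup mapping a score y to the number
    of examinees in that subgroup who obtained it.
    """
    total = 0.0
    for link, w, counts in zip(subgroup_links, weights, score_counts):
        n_h = sum(counts.values())
        for y, n_yh in counts.items():
            total += w * (n_yh / n_h) * (link(y) - combined_link(y)) ** 2
    return math.sqrt(total) / sd_x_combined


# Hypothetical linking functions: each subgroup differs from the combined
# group by one score point at every y.
subgroup_links = [lambda y: 2 * y + 1, lambda y: 2 * y - 1]
combined_link = lambda y: 2 * y
weights = [0.5, 0.5]                        # equal-sized subgroups
score_counts = [{5: 2, 10: 2}, {5: 2, 10: 2}]
sd_x_combined = 10.0                        # hypothetical section A SD

print(rmsd(5, subgroup_links, combined_link, weights, sd_x_combined))
print(remsd(subgroup_links, combined_link, weights, score_counts, sd_x_combined))
```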

To evaluate the relative magnitude of RMSD and REMSD, Dorans and Feigenbaum (1994)
suggested the notion of a score Difference That Matters (DTM) in the context of
linking the SAT to the old SAT. For a test reported in 10-point units, linking
functions that are within 5 scaled score points of each other at a given raw score
point are treated as close enough to ignore, because they differ by less than half
of the reported score unit of 10 (Dorans, 2004). Kolen & Brennan (2004, p. 462)
give a good illustration of the logic of DTM when reported scores are integers:
equivalents of 15.4 and 15.6 round to different integers even though they differ by
only .2 (less than a DTM), whereas equivalents of 14.6 and 15.4 round to the same
integer even though they differ by .8 (more than a DTM). The score unit on the MSCE
mathematics examination is 1 point, an integer. This means that half of a score
unit, .5, was taken as the score Difference That Matters.

Recall that RMSD and REMSD statistics are standardised by dividing by the

standard deviation of scores on compulsory section for combined group. DTM was

standardised in the same manner so that it could be used as a benchmark for

evaluating RMSD and REMSD. When REMSD was below the standardised DTM, it indicated
that the equating functions for each subgroup were very close to that of the
combined group, and hence that they were group invariant. Otherwise, they failed
the group invariance test. These functions and RMSD were plotted on graphs to
display their similarities and differences visually.
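The decision rule can be sketched as follows. The function names, the standard deviation, and the REMSD values are hypothetical; only the score unit of 1 point and the half-unit DTM come from the text. The last lines reproduce the rounding illustration cited from Kolen & Brennan.

```python
def standardised_dtm(score_unit, sd_x_combined):
    """Half of the reported score unit, divided by the standard deviation of
    the compulsory section for the combined group."""
    return (score_unit / 2.0) / sd_x_combined


def is_group_invariant(remsd_value, score_unit, sd_x_combined):
    """True when REMSD falls below the standardised DTM."""
    return remsd_value < standardised_dtm(score_unit, sd_x_combined)


# Hypothetical combined-group standard deviation of section A scores.
sd_section_a = 10.0
print(standardised_dtm(1, sd_section_a))          # benchmark for a 1-point unit
print(is_group_invariant(0.03, 1, sd_section_a))  # hypothetical REMSD below DTM
print(is_group_invariant(0.08, 1, sd_section_a))  # hypothetical REMSD above DTM

# Rounding illustration: a small difference can change the reported integer,
# while a larger one may not.
print(round(15.4), round(15.6))
print(round(14.6), round(15.4))
```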



3.5 Ethical Considerations

Creswell (2003) says codes of professional conduct for researchers are

applicable to all research methods: qualitative, quantitative, and mixed methods. In

this study, the researcher observed two ethical codes of conduct. First was obtaining

informed consent, and second was to do with privacy and confidentiality.

First, Gay and Airasian (2003) say that very rarely is it possible to conduct
research without the cooperation of people in the setting of the study. Such
cooperation comes into play when the researcher obtains consent from participants.
Before carrying out the research, written permission was sought from the Education
Division Managers and headteachers to conduct the research at their schools
(appendices J, K, & L). Furthermore, students of the participating schools were
asked whether they agreed to take part in the study. Only those who agreed were
systematically selected to be candidates. Rossman and Rallis (2003) comment

on the significance of getting informed consent from participants by saying that the

permission from the subjects is crucial for the ethical conduct of the research

because it serves to protect the privacy of the participants.

Second, Fowler (1995); Vaughn, Schumm, and Sinagub (1996); Rossman and

Rallis (2003) mention that privacy and confidentiality during data collection is of

paramount importance. Participants' responses should be kept confidential and they

should know the purpose of the study. Based on these assertions, the study assured

subjects of their privacy and confidentiality during the administration of the tests by

advising them not to disclose or write their names on the answer sheets. Letters and

numerical values were used to distinguish examinees from one another.


3.6 Validity and Reliability

Validity is defined as the accuracy or truthfulness of a measurement with
reference to a construct of specific interest, and reliability is concerned with the
consistency of a measurement (Crocker & Algina, 1986; Bakewell, 2003). Hand
(2004, p. 129) defines validity as how well the measured variable represents the
attribute being measured, or how well it captures the concept which is the target of
measurement. He further defines reliability in terms of the differences between
multiple measurements of an attribute.

On validity, the instrument used in this study was developed by MANEB item
setters. These item setters are well-trained personnel with vast teaching experience
in mathematics. During the development of the tests, they use blueprints, that is,
tables of specifications, to guide them in terms of content coverage and the level of
cognitive demand. The blueprints help to maintain a consistent difficulty level of
the tests over the years. The papers, therefore, possess the required degree of
content validity by virtue of how they are designed. Furthermore, the examinees
took the tests three weeks prior to the national examinations. This means that the
students were well prepared at that time. Hence their responses were taken as their
optimal performance or achievement in MSCE mathematics paper 2, as they
displayed their true mathematics knowledge and skills.

In assuring reliability, a marking scheme was used for consistency in scoring,
and one item rater was used to avoid inter-rater variability. The marking scheme
used for scoring the test was developed by two experienced mathematics teachers
from Chiradzulu Secondary School. These teachers are also MANEB mathematics
raters. The scheme is similar to the MANEB scheme in terms of mark allocation and
content specification. Furthermore, before rating the items, the researcher and the
two teachers standardised the marking scheme to encompass the diversity of
examinees' answers. To ensure consistency, one question at a time was marked
across the scripts before the subsequent question was marked.

3.7 Delimitations and Limitations of the Study

3.7.1 Delimitations

The study focused only on the optional questions of mathematics paper 2; hence
the findings would not apply to other MANEB examinations that allow examinee
choice.

The results cannot be generalised to all secondary schools in Malawi
because the participating schools were purposively sampled. However, the results
may apply to other schools with characteristics similar to those of the sampled ones.

3.7.2 Limitations

Visiting all the secondary schools that offer mathematics would have been
ideal, but this was impossible due to time and financial constraints. Instead, the
study was done on five schools only.

Some students declined to participate in the study after previously agreeing to
do so. In some instances, candidates took only one paper instead of two, so scores
were available for one paper only. These candidates were dropped from the study,
thereby reducing the targeted sample size. This attrition was observed mainly at
Njamba Secondary School. In total, 53 candidates were lost to attrition.

Finally, the MANEB marking scheme was not issued to the researcher. MANEB
regards it as a confidential document that cannot be given to anyone outside the
organisation. This created a minor setback because the plan had been to use the
MANEB marking scheme. It resulted in extra finances and resources being spent on
bringing together two experienced teachers from Chiradzulu Secondary School, who
are also MANEB item writers and scorers, and the researcher to develop another
marking scheme. Nonetheless, our combined experience as item scorers made the
marking scheme similar to the ones developed by MANEB.


CHAPTER 4

4.0 RESULTS AND DISCUSSION OF THE FINDINGS

4.1 Introduction

In this chapter, the results and discussion of the findings are presented under
three main sections. The sections were formulated based on the research questions.
Thus they answer the research questions in order, starting with the first research
question, followed by the second. The third research question is addressed in the
final section, together with a chapter summary.

4.2 To what extent do optional questions differ?

4.2.1 Preliminary Analysis

The item content and major content areas that made up section A and section B
are outlined in Tables 4.1 and 4.2 respectively. Almost all content areas that were
examined in section A were also tested in section B, but with different item
content. This signifies that the two sections were measuring the same construct.
Construct similarity is viewed as the same framework (Feuer et al., 1999); thus both
sections were built on the same framework.

Furthermore, Feuer et al. (1999) define same test specifications as similarity
in measurement characteristics or conditions such as test length, test format,
administration conditions, etc. Popham (1974), as cited by Crocker and Algina
(1986), defines item specification as sources of item content, descriptions of the
problem situations or stimuli, etc. In view of both definitions, the items in the two
sections were built on different item specifications. This is evidenced by similar
item formats but different sources of item content. Further, the sections differed in
the level of cognitive operation demanded. Most questions in section A demand less
cognitive operation than those in section B, as indicated by the p-values in Table 4.3.

Table 4.1: Major content areas of section A

Section A

Question No.   Item content            Content areas
1 a            Algebra fractions       Algebra, patterns, & functions
1 b            Irrational numbers      Numeration
2 a            Subject of a formula    Algebra, patterns, & functions
2 b            Matrices                Algebra, patterns, & functions
3 a            Triangle geometry       Geometry
3 b            Remainder theorem       Algebra, patterns, & functions
4 a            Circle geometry         Geometry
4 b            Mapping                 Algebra, patterns, & functions
5 a            Measurement             Numeration
5 b            Speed-time graph        Numeration
6 a            Similar figures         Geometry
6 b            Vectors                 Numeration

Table 4.2: Major content areas of section B

Section B

Question No.   Item content                                 Content areas
7 a            Statistics                                   Statistics & probability
7 b            Formulation & solving quadratic equation     Algebra, patterns, & functions
8 a            Partial variation                            Algebra, patterns, & functions
8 b            Probability                                  Statistics & probability
9 a            Exponential equation                         Algebra, patterns, & functions
9 b            Linear programming                           Algebra, patterns, & functions
10 a           Equation of a straight line                  Algebra, patterns, & functions
10 b           Arithmetic progression                       Algebra, patterns, & functions
11 a           Cyclic quadrilateral                         Geometry
11 b           Sets                                         Numeration
12 a           Trigonometry                                 Numeration
12 b           Solving polynomial equation graphically      Algebra, patterns, & functions

Having looked at the framework and the test/item specifications of the two
sections of the test under investigation, it would be reasonable to use the term
'linking' rather than 'equating' because the two sections had different item content
but the same content areas, and the choice items in section B were not equal in
length to the items in section A. Further, the level of cognitive processes required
in the two sections was different, as illustrated in subsection 4.2.2. Thus, the two
portions measured the same construct but were built on different specifications.
However, when equating choice items, the interest is in item content as opposed to
the content areas of the test form, because item scores are the ones to be linked
within the same test.

4.2.2 Comparing p-values of section B

Table 4.3: P-values for questions in section A and section B 'without choice'

Section A                                      Section B
Item   Max. mark   Average mark   p-value      Item   Max. mark   Average mark   p-value
1      8           5.190          0.649        7      15          6.436          0.429
2      7           4.401          0.629        8      15          5.061          0.337
3      9           5.518          0.613        9      15          5.869          0.391
4      10          5.801          0.580        10     15          5.116          0.341
5      11          6.324          0.575        11     15          3.927          0.262
6      10          1.917          0.192        12     15          7.566          0.504

Table 4.3 displays the item difficulty indices (p-values) for questions in section
A and section B without any choice. Each p-value is the average mark divided by
the maximum mark for the question. Questions in section A generally have higher
p-values than those in section B. This affirms the notion that section A questions
are easier than section B questions. The questions in the latter section were
relatively difficult because they usually provided deep coverage of the content
domain. Adopting the terms used by Wainer and Thissen (1994), most of the
questions in section B would be called large items. Section A questions would be
dubbed short items because most of them were considerably straightforward.
However, question 6 in section A had the lowest p-value amongst all questions in
the test. The predicament which candidates faced in attempting this question was
translating the word problem into correct computable mathematical concepts.
Levels of proficiency in language skills might have influenced performance on
this question (Crocker and Algina, 1986).
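The p-values in Table 4.3 can be reproduced from the maximum and average marks. The sketch below assumes that each p-value is simply the average mark divided by the maximum mark, which agrees with every row of the table to three decimal places. The variable and function names are illustrative only.

```python
# (maximum mark, average mark) for each question, copied from Table 4.3.
TABLE_4_3 = {
    1: (8, 5.190), 2: (7, 4.401), 3: (9, 5.518),
    4: (10, 5.801), 5: (11, 6.324), 6: (10, 1.917),
    7: (15, 6.436), 8: (15, 5.061), 9: (15, 5.869),
    10: (15, 5.116), 11: (15, 3.927), 12: (15, 7.566),
}

def p_value(max_mark, average_mark):
    """Item difficulty index: proportion of the maximum mark earned on average."""
    return average_mark / max_mark

# Round to three decimals, as reported in the table.
p_values = {q: round(p_value(mx, avg), 3) for q, (mx, avg) in TABLE_4_3.items()}

section_a = [p_values[q] for q in range(1, 7)]   # questions 1 to 6
section_b = [p_values[q] for q in range(7, 13)]  # questions 7 to 12

# The question with the lowest p-value is the most difficult one.
hardest = min(p_values, key=p_values.get)

print(p_values)
print("mean p-value, section A:", round(sum(section_a) / len(section_a), 3))
print("mean p-value, section B:", round(sum(section_b) / len(section_b), 3))
print("lowest p-value: question", hardest)
```

A higher p-value means an easier question, so comparing the two section means gives a quick check on the claim that section A was the easier section overall.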

Focusing on section B questions, it is noted that question 11 was the most
difficult and question 12 was the easiest. Ordering them from the least difficult to
the most difficult, one would get questions 12, 7, 9, 10, 8, and 11.
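The ordering above follows directly from sorting the section B p-values in Table 4.3 from highest (easiest) to lowest (most difficult). A minimal sketch, with illustrative variable names:

```python
# Section B p-values from Table 4.3, keyed by question number.
SECTION_B_P_VALUES = {7: 0.429, 8: 0.337, 9: 0.391, 10: 0.341, 11: 0.262, 12: 0.504}

# A higher p-value means an easier question, so sort in descending order.
easiest_to_hardest = sorted(
    SECTION_B_P_VALUES, key=SECTION_B_P_VALUES.get, reverse=True
)

print(easiest_to_hardest)  # [12, 7, 9, 10, 8, 11]
```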

As noted, optional question 11 was the most difficult. Under the current
assessment policy on MSCE mathematics paper 2 examinations, however, a raw score
of, say, 7 counts the same whether it is earned on question 11 or on question 12,
which is the easiest. In all fairness, it is clear that a student who receives a score
of 7 on question 11 has demonstrated more proficiency than another student who
gets the same score on question 12. Wainer, Wang, and Thissen (1991) and Wainer
and Thissen (1994) say

that when optional questions that are differentially difficu
