
Using Tests Dr Ayaz Afsar


Introduction

Since the spelling test of Rice (1897), the fatigue test of Ebbinghaus (1897) and the intelligence scale of Binet (1905), the growth of tests has proceeded at an extraordinary pace in terms of volume, variety, scope and sophistication.

The field of testing is so extensive that the comments that follow are necessarily introductory; the reader seeking a deeper understanding will need to refer to specialist texts and sources on the subject.

Limitations of space permit no more than a brief outline of a small number of key issues to do with tests and testing.

Tests give researchers a powerful method of data collection: an impressive array of instruments for gathering data of a numerical rather than verbal kind.


Cont.

In considering testing for gathering research data, several issues need to be borne in mind, not the least of which is why tests are being used at all:

• What are we testing (e.g. achievement, aptitude, attitude, personality, intelligence, social adjustment etc.)?

• Are we dealing with parametric or non-parametric tests?

• Are they norm-referenced or criterion-referenced?

• Are they available commercially for researchers to use, or will researchers have to develop home-produced tests?

• Do the test scores derive from a pretest and post-test in the experimental method?

• Are they group or individual tests?

• Do they involve self-reporting or are they administered tests?

Let us unpack some of these issues in the following:


What are we testing?

There is a myriad of tests covering all aspects of a student’s life and all ages (young children to old adults), for example: aptitude, attainment, personality, social adjustment, attitudes and values, stress and burnout, performance, projective tests, potential, ability, achievement, diagnosis of difficulties, intelligence, verbal and non-verbal reasoning, higher order thinking, performance in school subjects, introversion and extraversion, self-esteem, locus of control, depression and anxiety, reading readiness, university entrance tests, interest inventories, language proficiency tests, motivation and interest, sensory and perceptual tests, special abilities and disabilities, and many others.


Parametric and non-parametric tests

Parametric tests are designed to represent the wider population, e.g. of a country or age group. They make assumptions about that wider population and its characteristics, i.e. the parameters of abilities are known. They assume the following (Morrison 1993):

There is a normal curve of distribution of scores in the population: the bell-shaped symmetry of the Gaussian curve of distribution seen, for example, in standardized scores of IQ or the measurement of people’s height or the distribution of achievement on reading tests in the population as a whole.

There are continuous and equal intervals between the test scores and, with tests that have a true zero, the opportunity for a score of, say, 80 per cent to be double that of 40 per cent; this differs from the ordinal scaling of rating scales discussed earlier in connection with questionnaire design where equal intervals between each score could not be assumed.
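As an aside not in the original text, the first assumption – a normal distribution of scores – is something a researcher can probe directly before committing to parametric statistics. A minimal sketch in Python, with invented scores, might look like this:

```python
# A minimal sketch (not from the original text) of checking the normality
# assumption before treating scores as parametric data. Scores are invented.
from scipy import stats

scores = [52, 61, 58, 70, 66, 49, 73, 64, 57, 68, 60, 55]

# Shapiro-Wilk test: the null hypothesis is that the scores are normally
# distributed; a small p-value casts doubt on the normality assumption.
statistic, p_value = stats.shapiro(scores)
if p_value < 0.05:
    print("Normality is doubtful - consider non-parametric statistics.")
else:
    print("No evidence against normality - parametric statistics may be used.")
```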


Parametric tests

Parametric tests will usually be published tests which are commercially available and which have been piloted and standardized on a large and representative sample of the whole population.

They usually arrive complete with the backup data on sampling, reliability and validity statistics which have been computed in the devising of the tests.

Working with these tests enables the researcher to use statistics applicable to interval and ratio levels of data.


Non-parametric tests

Non-parametric tests make few or no assumptions about the distribution of the population (the parameters of the scores) or the characteristics of that population.

The tests do not assume a regular bell-shaped curve of distribution in the wider population; indeed the wider population is perhaps irrelevant, as these tests are designed for a given specific population – a class in school, a chemistry group, a primary school year group.

Because they make no assumptions about the wider population, the researcher must work with non-parametric statistics appropriate to nominal and ordinal levels of data. Such tests, with a true zero and marks awarded, are the stock-in-trade of classroom teachers – the spelling test, the mathematics test, the end-of-year examination, the mock examination.


Non-parametric tests vs Parametric tests

• The attraction of non-parametric statistics is their utility for small samples because they do not make any assumptions about how normal, even and regular the distributions of scores will be.

• Furthermore, computation of statistics for non-parametric tests is less complicated than that for parametric tests. Non-parametric tests have the advantage of being tailored to particular institutional, departmental and individual circumstances.

• They offer teachers a valuable opportunity for quick, relevant and focused feedback on student performance.

• Parametric tests are more powerful than non-parametric tests because they not only derive from standardized scores but also enable the researcher to compare sub-populations with a whole population (e.g. to compare the results of one school or local education authority with the whole country, for instance in comparing students’ performance in norm-referenced or criterion-referenced tests against a national average score in that same test).


Cont.

Parametric tests enable the researcher to use powerful statistics in data processing and to make inferences about the results. Because non-parametric tests make no assumptions about the wider population, a different set of statistics is available to the researcher.

These can be used in very specific situations – one class of students, one year group, one style of teaching, one curriculum area – and hence are valuable to teachers.
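As a concrete illustration of this difference (not part of the original lecture), the sketch below compares two invented classes twice: once with a parametric statistic and once with its non-parametric counterpart, using Python’s scipy.

```python
# A hedged illustration of the practical consequence described above: the
# same two-group comparison uses different statistics depending on whether
# parametric assumptions hold. All data are invented.
from scipy import stats

class_a = [55, 60, 62, 48, 70, 66, 59, 64]
class_b = [50, 52, 58, 45, 61, 57, 49, 54]

# Parametric route (interval/ratio data, roughly normal distributions):
t_stat, t_p = stats.ttest_ind(class_a, class_b)

# Non-parametric route (ordinal data or no distributional assumptions):
u_stat, u_p = stats.mannwhitneyu(class_a, class_b, alternative="two-sided")

print(f"t-test p = {t_p:.3f}; Mann-Whitney U p = {u_p:.3f}")
```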


Norm-referenced, criterion-referenced and domain-referenced tests

A norm-referenced test compares students’ achievements relative to other students’ achievements, for example a national test of mathematical performance or a test of intelligence which has been standardized on a large and representative sample of students between the ages of 6 and 16. A criterion-referenced test does not compare student with student but, rather, requires the student to fulfil a given set of criteria, a predefined and absolute standard or outcome (Cunningham 1998).

For example, a driving test is usually criterion-referenced, since passing it requires the ability to meet certain test items – reversing round a corner, undertaking an emergency stop, avoiding a crash, etc. – regardless of how many others have or have not passed the driving test. Similarly, many tests of playing a musical instrument require specified performances, such as the ability to play a particular scale or arpeggio, or to play a Bach fugue without hesitation or technical error.

If the student meets the criteria, then he or she passes the examination.
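A small sketch (invented here, not from the text) can make the contrast concrete: a criterion-referenced decision checks a candidate against fixed criteria, while a norm-referenced one locates a score within a cohort. The criteria and scores below are hypothetical.

```python
# Hypothetical contrast between the two interpretations of performance.

def criterion_referenced_pass(skills_demonstrated, required_skills):
    """Pass only if every required criterion is met, regardless of peers."""
    return required_skills.issubset(skills_demonstrated)

def norm_referenced_rank(score, cohort_scores):
    """Express a score as the percentage of the cohort scoring below it."""
    below = sum(1 for s in cohort_scores if s < score)
    return 100 * below / len(cohort_scores)

required = {"reverse around a corner", "emergency stop", "avoid a crash"}
candidate = {"reverse around a corner", "emergency stop", "avoid a crash"}
print(criterion_referenced_pass(candidate, required))  # True: all criteria met

cohort = [40, 55, 62, 70, 48, 66, 59, 73]
print(norm_referenced_rank(62, cohort))  # 50.0: half the cohort scored lower
```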


Cont.

A criterion-referenced test provides the researcher with information about exactly what a student has learned, what he or she can do, whereas a norm-referenced test can only provide the researcher with information on how well one student has achieved in comparison with another, enabling rank orderings of performance and achievement to be constructed.

Hence a major feature of the norm-referenced test is its ability to discriminate between students and their achievements – a well-constructed norm-referenced test enables differences in achievement to be measured acutely, i.e. to provide variability or a great range of scores.

For a criterion-referenced test this is less of a problem: the intention here is to indicate whether students have achieved a set of given criteria, regardless of how many others might or might not have achieved them, hence variability or range is less important here.

More recently, an outgrowth of criterion-referenced testing has been the rise of domain-referenced tests (Gipps 1994: 81).


Cont.

Here considerable significance is accorded to the careful and detailed specification of the content or the domain which will be assessed. The domain is the particular field or area of the subject that is being tested, for example, light in science, two-part counterpoint in music, parts of speech in English language.

The domain is set out very clearly and very fully, such that the full depth and breadth of the content are established. Test items are then selected from this very full field, with careful attention to sampling procedures so that representativeness of the wider field is ensured in the test items.

The student’s achievements on that test are computed to yield a proportion of the maximum score possible, and this, in turn, is used as an index of the proportion of the overall domain that he or she has grasped.

So, for example, if a domain has 1,000 items and the test has 50 items, and the student scores 30 marks from the possible 50, then it is inferred that s/he has grasped 60 per cent ({30 ÷ 50} × 100) of the domain of 1,000 items.
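The inference above is simple proportional reasoning; the helper below (an illustrative sketch, using the figures from the text) makes the underlying assumption explicit.

```python
# The arithmetic of the domain-referenced inference above, as a tiny helper.
# The figures (1,000-item domain, 50-item test, 30 marks) come from the text.

def estimated_domain_mastery(marks, test_items, domain_items):
    """Infer the share of the whole domain grasped from a sampled test.

    Assumes the test items are a representative sample of the domain,
    as the sampling procedures described above are meant to ensure.
    """
    proportion = marks / test_items          # e.g. 30 / 50 = 0.6
    estimated_items = proportion * domain_items
    return proportion * 100, estimated_items

percent, items = estimated_domain_mastery(30, 50, 1000)
print(f"{percent:.0f}% of the domain, i.e. about {items:.0f} of 1,000 items")
# -> 60% of the domain, i.e. about 600 of 1,000 items
```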


Commercially produced tests and researcher-produced tests

There is a battery of tests in the public domain which cover a vast range of topics and that can be used for evaluative purposes (references were indicated earlier). Most schools will have used published tests at one time or another.

There are several attractions to using published tests:

◦ They are objective.

◦ They have been piloted and refined.

◦ They have been standardized across a named population (e.g. a region of the country, the whole country, a particular age group or various age groups) so that they represent a wide population.

◦ They declare how reliable and valid they are (mentioned in the statistical details which are usually contained in the manual of instructions for administering the test).


Cont… Commercially produced tests and researcher-produced tests

◦ They tend to be parametric tests, hence enabling sophisticated statistics to be calculated.

◦ They come complete with instructions for administration.

◦ They are often straightforward and quick to administer and to mark.

◦ Guides to the interpretation of the data are usually included in the manual.

◦ Researchers are spared the task of having to devise, pilot and refine their own test.

On the other hand, Howitt and Cramer (2005) suggest that commercially produced tests are expensive to purchase and to administer; they are often targeted to special, rather than to general populations (e.g. in psychological testing), and they may not be exactly suited to the purpose required.

Further, several commercially produced tests have restricted release or availability, hence the researcher might have to register with a particular association or be given clearance to use the test or to have copies of it.


Cont… Commercially produced tests and researcher-produced tests

• For example, Harcourt Assessment and McGraw-Hill publishers not only hold the rights to a world-wide battery of tests of all kinds but also require registration before releasing tests.

• In this example Harcourt Assessment also has different levels of clearance, so that certain parties or researchers may not be eligible to have a test released to them because they do not fulfil particular criteria for eligibility.

• Published tests by definition are not tailored to institutional or local contexts or needs; indeed their claim to objectivity is made on the grounds that they are deliberately supra-institutional. The researcher wishing to use published tests must be certain that the purposes, objectives and content of the published tests match the purposes, objectives and content of the evaluation.

• For example, a published diagnostic test might not fit the needs of an evaluation requiring an achievement test; a test of achievement might not have the predictive quality that the researcher seeks in an aptitude test; a published reading test might not address the areas of reading that the researcher wishes to cover; and a verbal reading test written in English might contain language that is difficult for a student whose first language is not English.


Cont… Commercially produced tests and researcher-produced tests

• The golden rule for deciding to use a published test is that it must demonstrate fitness for purpose. If it fails to demonstrate this, then tests will have to be devised by the researcher.

• The attraction of this latter point is that such a ‘home-grown’ test will be tailored to the local and institutional context very tightly, i.e. that the purposes, objectives and content of the test will be deliberately fitted to the specific needs of the researcher in a specific, given context.

• In discussing fitness for purpose, Cronbach (1949) and Gronlund and Linn (1990) set out a range of criteria against which a commercially produced test can be evaluated for its suitability for specific research purposes.

• Against these advantages, of course, there are several important considerations in devising a ‘home-grown’ test.

• Not only might it be time-consuming to devise, pilot, refine and then administer the test but also, because much of it will probably be non-parametric, there will be a more limited range of statistics that may be applied to the data than in the case of parametric tests.


Cont… Commercially produced tests and researcher-produced tests

The scope of tests and testing is far-reaching; no areas of educational activity are untouched by them. Achievement tests, largely summative in nature, measure achieved performance in a given content area. Aptitude tests are intended to predict capability, achievement potential, learning potential and future achievements.

However, the assumption that these two constructs – achievement and aptitude – are separate has to be questioned (Cunningham 1998); indeed, it is often the case that a test of aptitude for, say, geography at a particular age or stage will be measured by using an achievement test at that age or stage.

Cunningham (1998) has suggested that an achievement test might include more straightforward measures of basic skills, whereas aptitude tests might put these in combination, for example combining reasoning (often abstract) and particular knowledge; thus achievement and aptitude tests differ according to what they are testing.


Cont… Commercially produced tests and researcher-produced tests

Not only do the tests differ according to what they measure, but also, since both can be used predictively, they differ according to what they might be able to predict.

For example, because an achievement test is more specific and often tied to a specific content area, it will be useful as a predictor of future performance in that content area but will be largely unable to predict future performance out of that content area. An aptitude test tends to test more generalized abilities (e.g. aspects of ‘intelligence’, skills and abilities that are common to several areas of knowledge or curricula), hence it is able to be used as a more generalized predictor of achievement.

Achievement tests, Gronlund (1985) suggests, are more linked to school experiences whereas aptitude tests encompass out-of-school learning and wider experiences and abilities.


Constructing a test

In devising a test the researcher will have to consider:

• The purposes of the test (for answering evaluation questions and ensuring that it tests what it is supposed to be testing, e.g. the achievement of the objectives of a piece of the curriculum)

• The type of test (e.g. diagnostic, achievement, aptitude, criterion-referenced, norm-referenced)

• The objectives of the test (cast in very specific terms so that the content of the test items can be seen to relate to specific objectives of a programme or curriculum)

• The content of the test (what is being tested and what the test items are)

• The construction of the test, involving item analysis in order to clarify the item discriminability and item difficulty of the test

• The format of the test: its layout, instructions, method of working and of completion (e.g. oral instructions to clarify what students will need to write, or a written set of instructions to introduce a practical piece of work)

• The nature of the piloting of the test


Cont…Constructing a test

• The validity and reliability of the test

• The provision of a manual of instructions for the administration, marking and data treatment of the test (this is particularly important if the test is not to be administered by the researcher, or if the test is to be administered by several different people, so that reliability is ensured by having a standard procedure).

In planning a test the researcher can proceed thus:

1. Identify the purposes of the test.

2. Identify the test specifications.

3. Select the contents of the test.

4. Consider the form of the test.

5. Write the test items.

6. Consider the layout of the test.

7. Consider the timing of the test.

8. Plan the scoring of the test.


Identify the purposes of the test

The purposes of a test are several, for example to diagnose a student’s strengths, weaknesses and difficulties, to measure achievement, to measure aptitude and potential, to identify readiness for a programme.

Gronlund and Linn (1990) term this ‘placement testing’; it usually takes the form of a pretest, normally designed to discover whether students have the essential prerequisites to begin a programme (e.g. in terms of knowledge, skills, understandings).

These types of tests occur at different stages. For example, the placement test is conducted prior to the commencement of a programme, and will identify starting abilities and achievements – the initial or ‘entry’ abilities in a student.


Cont…Identify the purposes of the test

If the placement test is designed to assign students to tracks, sets or teaching groups (i.e. to place them into administrative or teaching groupings), then the entry test might be criterion-referenced or norm-referenced; if it is designed to measure detailed starting points, knowledge, abilities and skills, then the test might be more criterion-referenced, as it requires a high level of detail.

It has its equivalent in ‘baseline assessment’ and is an important feature if one is to measure the ‘value-added’ component of teaching and learning:

One can only assess how much a set of educational experiences has added value to the student if one knows that student’s starting point and starting abilities and achievements.


Cont…Identify the purposes of the test

Formative testing is undertaken during a programme, and is designed to monitor students’ progress during that programme, to measure achievement of sections of the programme, and to diagnose strengths and weaknesses.

It is typically criterion-referenced.

Diagnostic testing is an in-depth test to discover particular strengths, weaknesses and difficulties that a student is experiencing, and is designed to expose causes and specific areas of weakness or strength. Clearly this type of test is criterion-referenced.

Summative testing is the test given at the end of the programme, and is designed to measure achievement, outcomes, or ‘mastery’. This might be criterion-referenced or norm-referenced, depending to some extent on the use to which the results will be put (e.g. to award certificates or grades, to identify achievement of specific objectives).


Identify the test specifications

• Which programme objectives and student learning outcomes will be addressed

• Which content areas will be addressed

• The relative weightings, balance and coverage of items

• The total number of items in the test

• The number of questions required to address a particular element of a programme or learning outcomes

• The exact items in the test

• To ensure validity in a test it is essential to ensure that the objectives of the test are fairly addressed in the test items.

• Objectives, it is argued (Mager 1962; Wiles and Bondi 1984), should:

◦ be specific and be expressed with an appropriate degree of precision

◦ represent intended learning outcomes

◦ identify the actual and observable behaviour that will demonstrate achievement

◦ include an active verb

◦ be unitary (focusing on one item per objective).


Cont…Identify the test specifications

One way of ensuring that the objectives are fairly addressed in test items is through a matrix frame that indicates the coverage of content areas, the coverage of objectives of the programme, and the relative weighting of the items on the test.

Such a matrix is set out below, taking an example from a secondary school history syllabus.

The matrix indicates the main areas of the programme to be covered in the test (content areas); then it indicates which objectives or detailed content areas will be covered (1a–3c) – these numbers refer to the identified specifications in the syllabus; then it indicates the marks/percentages to be awarded for each area.


Cont…Identify the test specifications

A matrix of test items


Cont…Identify the test specifications

The matrix indicates several points:

• The least emphasis is given to the build-up to and end of the war (10 marks each in the ‘total’ column).

• The greatest emphasis is given to the invasion of France (35 marks in the ‘total’ column).

• There is fairly even coverage of the objectives specified (the figures in the ‘total’ row only vary from 9 to 13).

• Greatest coverage is given to objectives 2a and 3a, and least coverage is given to objective 1c.

• Some content areas are not covered in the test items (the blanks in the matrix).

Hence we have here a test scheme that indicates relative weightings, coverage of objectives and content, and the relation between these two latter elements.
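The original matrix itself is not reproduced in this transcript. As a purely hypothetical miniature consistent with the weightings described above (row totals of 10, 35 and 10; column totals between 9 and 13, highest for 2a and 3a, lowest for 1c), the sketch below shows how such a specification matrix can be tabulated and its totals checked; all cell values are invented.

```python
# Hypothetical miniature of a test-specification matrix: rows are content
# areas, columns are syllabus objectives, cells are marks, None marks a blank.
content_areas = ["Build-up to the war", "Invasion of France", "End of the war"]
objectives = ["1a", "1b", "1c", "2a", "3a"]

marks = [
    [2, 3, None, 4, 1],   # build-up: least emphasis (total 10)
    [6, 5, 7,    8, 9],   # invasion: greatest emphasis (total 35)
    [2, 2, 2,    1, 3],   # end of the war: least emphasis (total 10)
]

# Row totals give the relative weighting of each content area;
# column totals show how evenly the objectives are covered.
for area, row in zip(content_areas, marks):
    print(f"{area:<22}{sum(m for m in row if m is not None):>4}")
for j, obj in enumerate(objectives):
    col = [row[j] for row in marks if row[j] is not None]
    print(f"Objective {obj}: {sum(col)}")
```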


Cont…Identify the test specifications

Having undertaken the test specifications, the researcher should have achieved clarity on the exact test items that test certain aspects of achievement of objectives, programmes, contents etc., the coverage and balance of coverage of the test items and the relative weightings of the test items.

Compiling elements of test items: [the accompanying diagram is not reproduced in this transcript]


Select the contents of the test

• Here the test is subject to item analysis. Gronlund and Linn (1990) suggest that an item analysis will need to consider:

• The suitability of the format of each item for the (learning) objective (appropriateness)

• The ability of each item to enable students to demonstrate their performance of the (learning) objective (relevance)

• The clarity of the task for each item

• The straightforwardness of the task

• The unambiguity of the outcome of each item, and agreement on what that outcome should be

• The cultural fairness of each item

• The independence of each item (i.e. where the influence of other items of the test is minimal and where successful completion of one item is not dependent on successful completion of another).

• The adequacy of coverage of each (learning) objective by the items of the test.


Cont…Select the contents of the test

In moving to test construction the researcher will need to consider how each element to be tested will be operationalized:

• What indicators and kinds of evidence of achievement of the objective will be required

• What indicators of high, moderate and low achievement there will be

• What the students will be doing when they are working on each element of the test

• What the outcome of the test will be (e.g. a written response, a tick in a box of multiple-choice items, an essay, a diagram, a computation)

Indeed the Task Group on Assessment and Testing (1988) in the UK suggest that attention will have to be given to the presentation, operation and response modes of a test:

• How the task will be introduced (e.g. oral, written, pictorial, computer, practical demonstration)


Cont…Select the contents of the test

• What the students will be doing when they are working on the test (e.g. mental computation, practical work, oral work, written work)

• What the outcome will be – how they will show achievement and present the outcomes (e.g. choosing one item from a multiple-choice question, writing a short response, open-ended writing, oral, practical outcome, computer output)

Operationalizing a test from objectives can proceed by stages:

◦ Identify the objectives/outcomes/elements to be covered.

◦ Break down the objectives/outcomes/elements into constituent components or elements.

◦ Select the components that will feature in the test, such that, if possible, they will represent the larger field (i.e. domain referencing, if required).


Cont…Select the contents of the test

◦ Recast the components in terms of specific, practical, observable behaviours, activities and practices that fairly represent and cover that component.

◦ Specify the kinds of data required to provide information on the achievement of the criteria.

◦ Specify the success criteria (performance indicators) in practical terms, working out marks and grades to be awarded and how weightings will be addressed.

◦ Write each item of the test.

◦ Conduct a pilot to refine the language/readability and presentation of the items, to gauge item discriminability, item difficulty and distractors, and to address validity and reliability (a sketch of these item statistics follows this list).
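As a sketch of what that piloting stage computes (invented responses; not from the original text): item difficulty as the proportion of students answering an item correctly, and item discriminability as the correlation between performance on an item and performance on the rest of the test.

```python
# Invented pilot data: rows = students, columns = items; 1 = correct.
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
])

difficulty = responses.mean(axis=0)   # easier items -> values nearer 1.0
totals = responses.sum(axis=1)

for i in range(responses.shape[1]):
    rest = totals - responses[:, i]   # total score excluding the item itself
    disc = np.corrcoef(responses[:, i], rest)[0, 1]
    print(f"Item {i + 1}: difficulty {difficulty[i]:.2f}, discrimination {disc:.2f}")
```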


Cont…Select the contents of the test

The contents of the test will also need to take account of the notion of fitness for purpose, for example in the types of test items.

Here the researcher will need to consider whether the kinds of data to demonstrate ability, understanding and achievement will be best demonstrated in, for example (Lewis 1974; Cohen et al. 2004):

◦ an open essay

◦ a factual and heavily directed essay

◦ short answer questions

◦ divergent thinking items

◦ completion items

◦ multiple-choice items (with one correct answer or more than one correct answer)

◦ matching pairs of items or statements

◦ inserting missing words


Cont…Select the contents of the test

◦ incomplete sentences or incomplete, unlabelled diagrams

◦ true/false statements

◦ open-ended questions where students are given guidance on how much to write (e.g. 300 words, a sentence, a paragraph)

◦ closed questions.

These items can test recall, knowledge, comprehension, application, analysis, synthesis and evaluation, i.e. different orders of thinking. These take their rationale from Bloom (1956) on hierarchies of thinking – from low order (comprehension, application), through middle order thinking (analysis, synthesis), to higher order thinking (evaluation, judgement, criticism).


Cont…Select the contents of the test

Clearly the selection of the form of the test item will be based on the principle of gaining the maximum amount of information in the most economical way.

This is evidenced in the use of machine-scorable multiple choice completion tests, where optical mark readers and scanners can enter and process large-scale data rapidly.
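Machine scoring itself is simple; a toy sketch (with an invented answer key, not from the original text) of the kind of processing an optical mark reader feeds might be:

```python
# Toy multiple-choice scorer against a key; all data invented.
answer_key = ["B", "D", "A", "C", "B"]

def score_sheet(responses, key=answer_key):
    """Return the number of items answered correctly."""
    return sum(r == k for r, k in zip(responses, key))

print(score_sheet(["B", "D", "A", "A", "B"]))  # -> 4
```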

In considering the contents of a test, the test writer must also consider the scale to be used for some kinds of test (e.g. the rating scales discussed earlier in connection with questionnaire design).

Further, the selection of the items needs to be considered in order to have the highest reliability.

Let us say that we have ten items that measure students’ negative examination stress. Each item is intended to measure stress, for example:


Cont…Select the contents of the test

Item 1: Loss of sleep at examination time.
Item 2: Anxiety at examination time.
Item 3: Irritability at examination time.
Item 4: Depression at examination time.
Item 5: Tearfulness at examination time.
Item 6: Unwillingness to do household chores at examination time.
Item 7: Mood swings at examination time.
Item 8: Increased consumption of coffee at examination time.
Item 9: Positive attitude and cheerfulness at examination time.
Item 10: Eager anticipation of the examination.


Cont…Select the contents of the test

You run a reliability test of internal consistency and find strong intercorrelations between items 1–5 (e.g. around 0.85), negative correlations between items 9 and 10 and all the other items (e.g. −0.79), and a very low intercorrelation between items 6 and 8 and all the others (e.g. 0.26). Item-to-total correlations (one kind of item analysis in which the item in question is correlated with the sum of the other items) vary here. What do you do?

You can retain items 1–5. For items 9 and 10 you can reverse the scoring (as these items looked at positive rather than negative aspects), and for items 6 and 8 you can consider excluding them from the test, as they appear to be measuring something else.

Such item analysis is designed to include items that measure the same construct and to exclude items that do not.
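As a closing sketch (not from the original text), the procedure just described – reverse-scoring the positively worded items and inspecting item-to-total correlations – can be carried out in a few lines. The random ratings below will not reproduce the exact correlations quoted above; the point is the procedure.

```python
# Hedged sketch of the item analysis described above, with invented 1-5
# ratings for 30 students on the ten stress items.
import numpy as np

ratings = np.random.default_rng(0).integers(1, 6, size=(30, 10))

# Reverse-score items 9 and 10 on a 1-5 scale: 1<->5, 2<->4, 3 stays 3.
ratings[:, 8:10] = 6 - ratings[:, 8:10]

totals = ratings.sum(axis=1)
for i in range(10):
    rest = totals - ratings[:, i]            # total of the *other* items
    r = np.corrcoef(ratings[:, i], rest)[0, 1]
    print(f"Item {i + 1:>2}: item-to-total r = {r:.2f}")

# Items whose correlation stays very low even after reverse-scoring
# (items 6 and 8 in the text's example) are candidates for exclusion.
```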