TRANSCRIPT
TESTS
© LOUIS COHEN, LAWRENCE MANION & KEITH MORRISON
STRUCTURE OF THE CHAPTER
• What are we testing?
• Parametric and non-parametric tests
• Norm-referenced, criterion-referenced and domain-referenced tests
• Commercially produced tests and researcher-produced tests
• Constructing a test
• Software for preparation of a test
• Devising a pre-test and post-test
• Ethical issues in testing
• Computerized adaptive testing
INITIAL CONSIDERATIONS
• What are we testing (e.g. achievement, aptitude, attitude, personality, intelligence, social adjustment etc.)?
• Are we dealing with parametric or non-parametric tests?
• Are they norm-referenced or criterion-referenced?
• Are they available commercially for researchers to use or will researchers have to develop home-produced tests?
• Do the test scores derive from a pre-test and post-test in the experimental method?
• Are they group or individual tests?
• Do they involve self-reporting or are they administered tests?
WHAT ARE WE TESTING?
• Handbook of Psychoeducational Assessment
• Handbook of Psychological and Educational Assessment of Children: Intelligence, Aptitude and Achievement
• The Eighteenth Mental Measurements Yearbook
• Tests in Print VII
PARAMETRIC AND NON-PARAMETRIC TESTS
• Parametric tests:
– assume a normal curve of distribution of scores in the population
– assume continuous and equal intervals between test scores and, in some tests, a true zero
– use standardized scores
• Non-parametric tests:
– make few or no assumptions about the distribution or characteristics of the population
– are useful for small samples
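As a rough illustration of the distinction, the sketch below (Python, with invented scores) computes a parametric Welch t statistic, which leans on means and variances, alongside a non-parametric Mann-Whitney U statistic, which uses only the rank order of the scores:

```python
import math
import statistics

# Hypothetical test scores for two small groups (invented data).
group_a = [52, 61, 58, 49, 66, 57]
group_b = [45, 50, 47, 53, 44, 48]

# Parametric: Welch's t statistic assumes interval-level data and
# (approximately) normally distributed scores in each population.
def welch_t(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (mx - my) / math.sqrt(vx / len(x) + vy / len(y))

# Non-parametric: the Mann-Whitney U statistic counts how often a
# score in one group exceeds a score in the other, so it makes no
# assumptions about the shape of the distribution.
def mann_whitney_u(x, y):
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

print(welch_t(group_a, group_b))        # mean difference in standard-error units
print(mann_whitney_u(group_a, group_b))  # ranges from 0 to len(x) * len(y)
```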
NORM-REFERENCED TESTS
• Norm-referenced tests:
– compare students’ attainments relative to other students’ attainments
– are usually standardized to the curve of distribution
– provide the researcher with information on how well one student has achieved in comparison to another, enabling rank orderings of performance and achievement to be constructed
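A minimal sketch of the two operations named above, standardizing scores and rank ordering, using invented names and raw scores:

```python
import statistics

# Hypothetical raw scores for a small class (invented data).
scores = {"Ana": 72, "Ben": 58, "Cai": 65, "Dee": 80}

mean = statistics.mean(scores.values())
sd = statistics.stdev(scores.values())

# Standardized (z) score: each student's position relative to the group.
z = {name: (s - mean) / sd for name, s in scores.items()}

# Rank ordering of performance and achievement, highest first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['Dee', 'Ana', 'Cai', 'Ben']
```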
CRITERION-REFERENCED TESTS
• Criterion-referenced tests:
– do not compare student with student but require the student to fulfil a given set of criteria, a predefined and absolute standard or outcome
– provide the researcher with information about exactly what a student has learned and can do
– A driving test is an example of a criterion-referenced test: if the candidate meets the requirements then s/he passes the test, regardless of, and without reference to, other candidates (i.e. s/he is not being compared to other candidates)
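The driving-test example can be sketched in code (the checklist names are invented): passing depends only on whether every criterion is met, never on how other candidates performed.

```python
# Hypothetical criteria for a criterion-referenced driving test.
REQUIRED = {"hill_start", "parallel_park", "emergency_stop", "observation"}

def passes(criteria_met):
    # Pass/fail depends solely on fulfilling the predefined, absolute
    # set of criteria; other candidates' results are irrelevant.
    return REQUIRED <= set(criteria_met)

print(passes({"hill_start", "parallel_park", "emergency_stop", "observation"}))  # True
print(passes({"hill_start", "parallel_park"}))  # False
```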
DOMAIN-REFERENCED TESTS
• Domain-referenced tests:
– The domain to be assessed is specified clearly.
– A domain is the particular field or area of the subject that is being tested (e.g. light in science).
– The domain is set out in depth and breadth. Test items are then selected from this full domain, with careful attention to sampling to ensure that the test items are representative of the wider field.
– The student’s achievements on the test are computed to yield a proportion of the maximum score possible. This is used as an index of the proportion of the overall domain that s/he has grasped.
– Inferences are made from a limited number of items to the student’s achievements in the whole domain.
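The domain-referenced index described above is a simple proportion; a sketch with invented item scores and maxima:

```python
# Domain-referenced index: proportion of the maximum possible score,
# read as an estimate of how much of the domain the student has grasped.
# Item scores and maxima below are invented for illustration.
item_scores = [3, 2, 4, 1, 5]
item_maxima = [4, 4, 4, 4, 5]

domain_index = sum(item_scores) / sum(item_maxima)
print(domain_index)  # 15/21 ≈ 0.714
```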
COMMERCIALLY PRODUCED TESTS
• Are objective
• Have been piloted and refined
• Have been standardized across a named population
• Declare how reliable and valid they are
• Tend to be parametric
• Include instructions for administration
• Are straightforward and quick to administer and mark
• Guides to the interpretation of the data are usually included in the manual
• Save researchers the task of having to devise, pilot and refine their own test
COMMERCIALLY PRODUCED TESTS
• Are expensive
• Are often targeted to special, rather than to general, populations
• May not be exactly suited to the researcher’s specific purposes
• May be culturally/linguistically biased
• May have restricted release or availability
RESEARCHER PRODUCED TESTS
• Are cheap
• Are targeted to the population/sample in hand
• Fit the local context and situation
• Fit the researcher’s specific purposes (fitness for purpose)
• May be culturally/linguistically biased
• May have restricted release or availability
RESEARCHER PRODUCED TESTS
• Are time-consuming to devise, pilot, refine and administer
• Are unstandardized
• May require extensive procedures for validation and reliability testing
• Often yield non-parametric data
• Have limited generalizability
CONSTRUCTING A TEST
Step 1: Consider the basis of the test (classical test theory/item response theory)
Step 2: Consider the purposes of the test
Step 3: Consider the type of test
Step 4: Consider the objectives of the test
Step 5: Write the test specifications, items and content
CONSTRUCTING A TEST
Step 6: Construct the test, involving item analysis, item discriminability, item difficulty and distractors
Step 7: Plan the format, layout, form and timing of the test
Step 8: Pilot the test
Step 9: Address validity and reliability
Step 10: Devise the manual of instructions for the administration, scoring, marking, weighting and data treatment of the test
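The item analysis in Step 6 can be sketched with two common indices (the response matrix below is invented): item difficulty as the proportion of testees answering correctly, and a crude discriminability index comparing the top and bottom halves of testees, which is one standard approach among several.

```python
# Hypothetical 0/1 responses (rows = testees, columns = items).
responses = [
    [1, 1, 1, 0],  # strong testee
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],  # weak testee
]

n_testees = len(responses)
n_items = len(responses[0])
totals = [sum(row) for row in responses]

# Item difficulty (facility): proportion of testees answering correctly.
difficulty = [sum(row[i] for row in responses) / n_testees
              for i in range(n_items)]

# Crude discriminability: difference in facility between the top half
# and bottom half of testees, ranked by total score. An item that the
# strong testees get right and the weak ones get wrong discriminates well.
order = sorted(range(n_testees), key=lambda r: totals[r], reverse=True)
top, bottom = order[:n_testees // 2], order[-(n_testees // 2):]

def facility(group, item):
    return sum(responses[r][item] for r in group) / len(group)

discrimination = [facility(top, i) - facility(bottom, i)
                  for i in range(n_items)]

print(difficulty)       # [0.6, 0.6, 0.4, 0.0]
print(discrimination)   # [1.0, 0.5, 0.5, 0.0]
```

An item with difficulty 0.0 (nobody correct) or 1.0 (everybody correct) tells the researcher nothing about differences between testees, which is why both indices are inspected together.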
CONSTRUCTING A TEST
Address classical test theory or item response theory
Classical test theory:
– assumes that there is a ‘true score’: the score an individual would obtain on the test if the measurement were made without error, i.e. the score s/he would average if s/he took the same test on an infinite number of occasions.
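The true-score idea can be illustrated by simulation (a sketch with invented numbers): each administration adds random measurement error, but the average over many hypothetical administrations converges on the true score.

```python
import random

random.seed(0)

TRUE_SCORE = 70.0  # the hypothetical error-free score

def administer(true_score, error_sd=5.0):
    # Classical test theory: observed score = true score + random error.
    return true_score + random.gauss(0.0, error_sd)

# A single administration may miss the true score...
one = administer(TRUE_SCORE)

# ...but the mean over many administrations approaches it.
many = [administer(TRUE_SCORE) for _ in range(100_000)]
estimate = sum(many) / len(many)
print(round(estimate, 1))  # close to 70.0
```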
CONSTRUCTING A TEST
Item response theory assumes that:
• It is possible to measure single, specific traits, abilities or attributes that are not themselves observable
• It is possible to identify objective levels of difficulty of an item
• It is possible to devise items that discriminate between individuals
• An item can be described independently of any particular sample of people responding to it
• A testee’s proficiency can be described in terms of his/her achievement of an item of a known difficulty level
CONSTRUCTING A TEST
Item response theory assumes that:
• Traits are unidimensional and single traits are specifiable
• A set of items can measure a common trait or ability
• A testee’s response to any one test item will not affect his/her response to another test item
• The probability of a correct response to an item does not depend on the number of testees who might be at the same level of ability
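These assumptions underpin models such as the one-parameter (Rasch) logistic model. As a sketch (the model is a standard one, not taken from the chapter), the probability of a correct response depends only on the gap between the testee's ability and the item's difficulty:

```python
import math

def p_correct(theta, difficulty):
    # One-parameter (Rasch) logistic model: ability (theta) and item
    # difficulty sit on the same logit scale, and the probability of
    # success depends only on their difference.
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# When ability equals difficulty, the chance of success is 50%.
print(p_correct(0.0, 0.0))   # 0.5
# An able testee facing an easy item is very likely to succeed.
print(p_correct(2.0, -1.0))  # ≈ 0.95
```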
SOFTWARE FOR PREPARING A TEST
• Software and online testing can remove some of the burden of layout, marking, data entry and analysis, as these can be done automatically
• Optical mark scanners can read in marks from hard copy into a computer file
DEVISING A PRE-TEST AND POST-TEST
• The pre-test may have questions which differ in form or wording from the post-test, though the two tests must test the same content, i.e. ‘alternate forms’ of a test
• In an experiment the pre-test and post-test must be the same for the control and experimental groups.
• Care must be taken in the construction of a post-test to avoid making the test easier to complete by one group than another.
• The level of difficulty must be the same in both tests.
ETHICAL ISSUES IN TESTING
How ethical are these?
• Ensuring coverage of the objectives and program that will be tested;
• Restricting the coverage of the program content and objectives to only those that will be tested;
• Preparing students with ‘exam technique’;
• Practice with past/similar papers;
• Directly matching the teaching to specific test items, where each piece of teaching and content is the same as each test item;
ETHICAL ISSUES IN TESTING
How ethical are these?
• Practice on an exactly parallel form of the test;
• Telling students in advance what will appear on the test;
• Practice on, and preparation of, the identical test itself without teacher input;
• Practice on, and preparation of, the identical test itself with teacher input, maybe providing sample answers;
• Inflating or adjusting marks.
ETHICAL ISSUES IN TESTING
• Tests must be valid and reliable
• The administration, marking and use of the test should only be undertaken by suitably competent/qualified people
• Access to test materials should be controlled
• Tests should benefit the testee (beneficence)
• Clear marking and grading protocols should operate
• Test results must be reported in a way that cannot be misinterpreted
ETHICAL ISSUES IN TESTING
• The privacy and dignity of individuals should be respected
• Individuals should not be harmed by the test or its results (non-maleficence)
• Informed consent to participate in the test should be sought
COMPUTERIZED ADAPTIVE TESTING
• Which test items to administer is based on the testee’s responses to previous items, i.e. the test adapts to the student’s performance on prior items: if an item proves too hard, the next item can be easier; if the testee succeeds on an item, the next item can be harder.
• Avoids the problem of tests being too easy or too difficult.
• The first item is pitched in the middle of the assumed ability range; if the testee answers it correctly then it is followed by a more difficult item, and if the testee answers it incorrectly then it is followed by an easier item.
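The adaptive loop described above can be sketched as follows; the item bank, the halving step size and the testee's success rule are all invented for illustration, and a real system would re-estimate ability with an IRT model after each response.

```python
# Items are represented only by difficulty values on an arbitrary scale.
ITEM_BANK = sorted([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])

def run_cat(answers_correctly, n_items=5):
    """Step the ability estimate up after a success, down after a failure."""
    ability, step = 0.0, 1.0  # start in the middle of the assumed range
    administered = []
    for _ in range(n_items):
        # Choose the unused item closest to the current ability estimate.
        item = min((d for d in ITEM_BANK if d not in administered),
                   key=lambda d: abs(d - ability))
        administered.append(item)
        if answers_correctly(item):
            ability += step   # success: try a harder item next
        else:
            ability -= step   # failure: try an easier item next
        step /= 2             # narrow the adjustment as evidence accumulates
    return ability, administered

# Hypothetical testee who succeeds on any item easier than 0.7.
ability, items = run_cat(lambda d: d < 0.7)
print(ability, items)
```

Because the step halves each round, the estimate homes in on the level where the testee starts failing, which is exactly how such a test avoids administering items that are uniformly too easy or too hard.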
COMPUTERIZED ADAPTIVE TESTING
• The test is scored instantly.
• Requires a large item pool for each area of content domain to be developed, with sufficient numbers, variety and spread of difficulty.
• All items must measure a single aptitude or dimension.
• Items must be independent of each other, i.e. a person’s response to an item should not depend on that person’s response to another item.