TRANSCRIPT
TESTS
© LOUIS COHEN, LAWRENCE MANION & KEITH MORRISON
STRUCTURE OF THE CHAPTER
• What are we testing?
• Parametric and non-parametric tests
• Norm-referenced, criterion-referenced and domain-referenced tests
• Commercially produced tests and researcher-produced tests
• Constructing a test
• Software for preparation of a test
• Devising a pre-test and post-test
• Ethical issues in testing
• Computerized adaptive testing
INITIAL CONSIDERATIONS
• What are we testing (e.g. achievement, aptitude, attitude, personality, intelligence, social adjustment etc.)?
• Are we dealing with parametric or non-parametric tests?
• Are they norm-referenced or criterion-referenced?
• Are they available commercially for researchers to use or will researchers have to develop home-produced tests?
• Do the test scores derive from a pre-test and post-test in the experimental method?
• Are they group or individual tests?
• Do they involve self-reporting or are they administered tests?
WHAT ARE WE TESTING?
• Handbook of Psychoeducational Assessment
• Handbook of Psychological and Educational Assessment of Children: Intelligence, Aptitude and Achievement
• The Eighteenth Mental Measurements Yearbook
• Tests in Print VII
PARAMETRIC AND NON-PARAMETRIC TESTS
• Parametric tests:
– assume a normal curve of distribution of scores in the population
– assume continuous and equal intervals between test scores and, in some tests, a true zero
– use standardized scores
• Non-parametric tests:
– make few or no assumptions about the distribution or characteristics of the population
– are useful for small samples
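As a rough illustration of the distinction, the sketch below (Python, with invented scores) computes a parametric Welch t statistic, which leans on means and variances, alongside a non-parametric Mann-Whitney U statistic, which uses only the rank order of the scores:

```python
import math
import statistics

# Hypothetical test scores for two small groups (invented data).
group_a = [52, 61, 58, 49, 66, 57]
group_b = [45, 50, 47, 53, 44, 48]

# Parametric: Welch's t statistic assumes interval-level data and
# (approximately) normally distributed scores in each population.
def welch_t(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (mx - my) / math.sqrt(vx / len(x) + vy / len(y))

# Non-parametric: the Mann-Whitney U statistic counts how often a
# score in one group exceeds a score in the other, so it makes no
# assumptions about the shape of the distribution.
def mann_whitney_u(x, y):
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

print(welch_t(group_a, group_b))        # mean difference in standard-error units
print(mann_whitney_u(group_a, group_b))  # ranges from 0 to len(x) * len(y)
```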
NORM-REFERENCED TESTS
• Norm-referenced tests:
– compare students’ attainments relative to other students’ attainments
– are usually standardized to the curve of distribution
– provide the researcher with information on how well one student has achieved in comparison to another, enabling rank orderings of performance and achievement to be constructed
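A minimal sketch of the two operations named above, standardizing scores and rank ordering, using invented names and raw scores:

```python
import statistics

# Hypothetical raw scores for a small class (invented data).
scores = {"Ana": 72, "Ben": 58, "Cai": 65, "Dee": 80}

mean = statistics.mean(scores.values())
sd = statistics.stdev(scores.values())

# Standardized (z) score: each student's position relative to the group.
z = {name: (s - mean) / sd for name, s in scores.items()}

# Rank ordering of performance and achievement, highest first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['Dee', 'Ana', 'Cai', 'Ben']
```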
CRITERION-REFERENCED TESTS
• Criterion-referenced tests:
– do not compare student with student but require the student to fulfil a given set of criteria, a predefined and absolute standard or outcome
– provide the researcher with information about exactly what a student has learned and can do
– A driving test is an example of a criterion-referenced test: if the candidate meets the requirements then s/he passes the test, regardless of, and without reference to, other candidates (i.e. s/he is not being compared to other candidates)
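The driving-test example can be sketched in code (the checklist names are invented): passing depends only on whether every criterion is met, never on how other candidates performed.

```python
# Hypothetical criteria for a criterion-referenced driving test.
REQUIRED = {"hill_start", "parallel_park", "emergency_stop", "observation"}

def passes(criteria_met):
    # Pass/fail depends solely on fulfilling the predefined, absolute
    # set of criteria; other candidates' results are irrelevant.
    return REQUIRED <= set(criteria_met)

print(passes({"hill_start", "parallel_park", "emergency_stop", "observation"}))  # True
print(passes({"hill_start", "parallel_park"}))  # False
```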
DOMAIN-REFERENCED TESTS
• Domain-referenced tests:
– The domain to be assessed is specified clearly.
– A domain is the particular field or area of the subject that is being tested (e.g. light in science).
– The domain is set out in depth and breadth. Test items are then selected from this full domain, with careful attention to sampling to ensure that the test items are representative of the wider field.
– The student’s achievements on the test are computed to yield a proportion of the maximum score possible. This is used as an index of the proportion of the overall domain that s/he has grasped.
– Inferences are made from a limited number of items to the student’s achievements in the whole domain.
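The domain-referenced index described above is a simple proportion; a sketch with invented item scores and maxima:

```python
# Domain-referenced index: proportion of the maximum possible score,
# read as an estimate of how much of the domain the student has grasped.
# Item scores and maxima below are invented for illustration.
item_scores = [3, 2, 4, 1, 5]
item_maxima = [4, 4, 4, 4, 5]

domain_index = sum(item_scores) / sum(item_maxima)
print(domain_index)  # 15/21 ≈ 0.714
```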
COMMERCIALLY PRODUCED TESTS
• Are objective
• Have been piloted and refined
• Have been standardized across a named population
• Declare how reliable and valid they are
• Tend to be parametric
• Include instructions for administration
• Are straightforward and quick to administer and mark
• Guides to the interpretation of the data are usually included in the manual
• Save researchers the task of having to devise, pilot and refine their own test
COMMERCIALLY PRODUCED TESTS
• Are expensive
• Are often targeted to special, rather than to general, populations
• May not be exactly suited to the researcher’s specific purposes
• May be culturally/linguistically biased
• May have restricted release or availability
RESEARCHER PRODUCED TESTS
• Are cheap
• Are targeted to the population/sample in hand
• Fit the local context and situation
• Fit the researcher’s specific purposes (fitness for purpose)
• May be culturally/linguistically biased
• May have restricted release or availability
RESEARCHER PRODUCED TESTS
• Are time-consuming to devise, pilot, refine and administer
• Are unstandardized
• May require extensive procedures for validation and reliability testing
• Often yield non-parametric data
• Have limited generalizability
CONSTRUCTING A TEST
Step 1: Consider the basis of the test (classical test theory/item response theory)
Step 2: Consider the purposes of the test
Step 3: Consider the type of test
Step 4: Consider the objectives of the test
Step 5: Write the test specifications, items and content
CONSTRUCTING A TEST
Step 6: Construct the test, involving item analysis, item discriminability, item difficulty and distractors
Step 7: Plan the format, layout, form and timing of the test
Step 8: Pilot the test
Step 9: Address validity and reliability
Step 10: Devise the manual of instructions for the administration, scoring, marking, weighting and data treatment of the test
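The item analysis in Step 6 can be sketched with two common indices (the response matrix below is invented): item difficulty as the proportion of testees answering correctly, and a crude discriminability index comparing the top and bottom halves of testees, which is one standard approach among several.

```python
# Hypothetical 0/1 responses (rows = testees, columns = items).
responses = [
    [1, 1, 1, 0],  # strong testee
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],  # weak testee
]

n_testees = len(responses)
n_items = len(responses[0])
totals = [sum(row) for row in responses]

# Item difficulty (facility): proportion of testees answering correctly.
difficulty = [sum(row[i] for row in responses) / n_testees
              for i in range(n_items)]

# Crude discriminability: difference in facility between the top half
# and bottom half of testees, ranked by total score. An item that the
# strong testees get right and the weak ones get wrong discriminates well.
order = sorted(range(n_testees), key=lambda r: totals[r], reverse=True)
top, bottom = order[:n_testees // 2], order[-(n_testees // 2):]

def facility(group, item):
    return sum(responses[r][item] for r in group) / len(group)

discrimination = [facility(top, i) - facility(bottom, i)
                  for i in range(n_items)]

print(difficulty)       # [0.6, 0.6, 0.4, 0.0]
print(discrimination)   # [1.0, 0.5, 0.5, 0.0]
```

An item with difficulty 0.0 (nobody correct) or 1.0 (everybody correct) tells the researcher nothing about differences between testees, which is why both indices are inspected together.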
CONSTRUCTING A TEST
Address classical test theory or item response theory
Classical test theory:
– assumes that there is a ‘true score’: the score an individual would obtain on the test if the measurement were made without error, i.e. the score s/he would average if s/he took the same test on an infinite number of occasions.
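The true-score idea can be illustrated by simulation (a sketch with invented numbers): each administration adds random measurement error, but the average over many hypothetical administrations converges on the true score.

```python
import random

random.seed(0)

TRUE_SCORE = 70.0  # the hypothetical error-free score

def administer(true_score, error_sd=5.0):
    # Classical test theory: observed score = true score + random error.
    return true_score + random.gauss(0.0, error_sd)

# A single administration may miss the true score...
one = administer(TRUE_SCORE)

# ...but the mean over many administrations approaches it.
many = [administer(TRUE_SCORE) for _ in range(100_000)]
estimate = sum(many) / len(many)
print(round(estimate, 1))  # close to 70.0
```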
CONSTRUCTING A TEST
Item response theory assumes that:
• It is possible to measure single, specific traits, abilities or attributes that are not themselves observable
• It is possible to identify objective levels of difficulty of an item
• It is possible to devise items that discriminate between individuals
• An item can be described independently of any particular sample of people responding to it
• A testee’s proficiency can be described in terms of his/her achievement of an item of a known difficulty level
CONSTRUCTING A TEST
Item response theory assumes that:
• Traits are unidimensional and single traits are specifiable
• A set of items can measure a common trait or ability
• A testee’s response to any one test item will not affect his/her response to another test item
• The probability of a correct response to an item does not depend on the number of testees who might be at the same level of ability
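These assumptions underpin models such as the one-parameter (Rasch) logistic model. As a sketch (the model is a standard one, not taken from the chapter), the probability of a correct response depends only on the gap between the testee's ability and the item's difficulty:

```python
import math

def p_correct(theta, difficulty):
    # One-parameter (Rasch) logistic model: ability (theta) and item
    # difficulty sit on the same logit scale, and the probability of
    # success depends only on their difference.
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# When ability equals difficulty, the chance of success is 50%.
print(p_correct(0.0, 0.0))   # 0.5
# An able testee facing an easy item is very likely to succeed.
print(p_correct(2.0, -1.0))  # ≈ 0.95
```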
SOFTWARE FOR PREPARING A TEST
• Software and online testing can remove some of the burden of layout, marking, data entry and analysis, as these can be done automatically
• Optical mark scanners can read in marks from hard copy into a computer file
DEVISING A PRE-TEST AND POST-TEST
• The pre-test may have questions which differ in form or wording from the post-test, though the two tests must test the same content, i.e. ‘alternate forms’ of a test
• In an experiment the pre-test and post-test must be the same for the control and experimental groups.
• Care must be taken in the construction of a post-test to avoid making the test easier to complete by one group than another.
• The level of difficulty must be the same in both tests.
ETHICAL ISSUES IN TESTING
How ethical are these?
• Ensuring coverage of the objectives and program that will be tested;
• Restricting the coverage of the program content and objectives to only those that will be tested;
• Preparing students with ‘exam technique’;
• Practice with past/similar papers;
• Directly matching the teaching to specific test items, where each piece of teaching and content is the same as each test item;
ETHICAL ISSUES IN TESTING
How ethical are these?
• Practice on an exactly parallel form of the test;
• Telling students in advance what will appear on the test;
• Practice on, and preparation of, the identical test itself without teacher input;
• Practice on, and preparation of, the identical test itself with teacher input, maybe providing sample answers;
• Inflating or adjusting marks.
ETHICAL ISSUES IN TESTING
• Tests must be valid and reliable
• The administration, marking and use of the test should only be undertaken by suitably competent/qualified people
• Access to test materials should be controlled
• Tests should benefit the testee (beneficence)
• Clear marking and grading protocols should operate
• Test results must be reported in a way that cannot be misinterpreted
ETHICAL ISSUES IN TESTING
• The privacy and dignity of individuals should be respected
• Individuals should not be harmed by the test or its results (non-maleficence)
• Informed consent to participate in the test should be sought
COMPUTERIZED ADAPTIVE TESTING
• Which test items to administer is based on the testee’s responses to previous items, i.e. the test adapts to the student’s performance on prior items: if an item proves too hard, the next item can be easier; if the testee succeeds on an item, the next item can be harder.
• Avoids the problem of tests being too easy or too difficult.
• The first item is pitched in the middle of the assumed ability range; if the testee answers it correctly then it is followed by a more difficult item, and if the testee answers it incorrectly then it is followed by an easier item.
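The adaptive loop described above can be sketched as follows; the item bank, the halving step size and the testee's success rule are all invented for illustration, and a real system would re-estimate ability with an IRT model after each response.

```python
# Items are represented only by difficulty values on an arbitrary scale.
ITEM_BANK = sorted([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])

def run_cat(answers_correctly, n_items=5):
    """Step the ability estimate up after a success, down after a failure."""
    ability, step = 0.0, 1.0  # start in the middle of the assumed range
    administered = []
    for _ in range(n_items):
        # Choose the unused item closest to the current ability estimate.
        item = min((d for d in ITEM_BANK if d not in administered),
                   key=lambda d: abs(d - ability))
        administered.append(item)
        if answers_correctly(item):
            ability += step   # success: try a harder item next
        else:
            ability -= step   # failure: try an easier item next
        step /= 2             # narrow the adjustment as evidence accumulates
    return ability, administered

# Hypothetical testee who succeeds on any item easier than 0.7.
ability, items = run_cat(lambda d: d < 0.7)
print(ability, items)
```

Because the step halves each round, the estimate homes in on the level where the testee starts failing, which is exactly how such a test avoids administering items that are uniformly too easy or too hard.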
COMPUTERIZED ADAPTIVE TESTING
• The test is scored instantly.
• Requires a large item pool for each area of content domain to be developed, with sufficient numbers, variety and spread of difficulty.
• All items must measure a single aptitude or dimension.
• Items must be independent of each other, i.e. a person’s response to an item should not depend on that person’s response to another item.