m&e slides tutorial 2
CHAPTER 3
HOW TO ASSESS?
OBJECTIVE TESTS
HOW TO ASSESS
Objective Tests
Essay Tests
Projects, Practicals, Fieldwork & Oral Tests
Observations & Portfolio Assessment
OBJECTIVE TEST
Definition
A written test consisting of questions that require respondents to select from a list of possible answers. Marking/scoring of answers is not influenced by the subjective opinions of the marker.
Formats/Types
Multiple-Choice Questions
Matching Questions
True-False Questions
Parts of an MCQ
Stem: What is the capital of Mongolia?
Options/Alternatives:
(A) Cochin
(B) Calcutta
(C) Katmandu
(D) Ulan Bator
Key: (D) Ulan Bator
Distracters: (A), (B) and (C)
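The parts of an MCQ named above can be pictured as a small data structure. The sketch below is only an illustration: the class name `MultipleChoiceItem` and its field names are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    """One MCQ: a stem, its labelled options, and the label of the key."""
    stem: str
    options: dict[str, str]  # option label -> option text
    key: str                 # label of the single correct option

    def __post_init__(self) -> None:
        # The key must be one of the listed options.
        if self.key not in self.options:
            raise ValueError(f"key {self.key!r} is not among the options")

    @property
    def distracters(self) -> dict[str, str]:
        """Every option that is not the key."""
        return {label: text for label, text in self.options.items()
                if label != self.key}


# The example item from the slide.
item = MultipleChoiceItem(
    stem="What is the capital of Mongolia?",
    options={"A": "Cochin", "B": "Calcutta", "C": "Katmandu", "D": "Ulan Bator"},
    key="D",
)
print(item.distracters)
```

Storing the key as a label rather than as text keeps the "only ONE correct response" rule easy to check.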
The Stem
In the form of a question or statement
• Direct-question form
• Incomplete-statement form
Clear & concise with a definite focus, free from poor grammar, complex sentences, ambiguity & double negatives
Present a positive question (highlight negative if used)
Ask for ONE answer only
Avoid asking for opinions
Avoid using ALWAYS & NEVER in the stem
Include in the stem as many of the words common to all alternatives as possible
The Stem
Can be in the form of a question or statement
• Direct-question form
E.g. Who was the first Prime Minister of Malaysia?
(A) Tun Dr. Mahathir
(B) Tun Abdul Razak
(C) Tun Hussein Onn
(D) Tunku Abdul Rahman
• Incomplete-statement form
E.g. The first Prime Minister of Malaysia was
(A) Tun Dr. Mahathir
(B) Tun Abdul Razak
(C) Tun Hussein Onn
(D) Tunku Abdul Rahman
The above examples are the CORRECT-ANSWER TYPE of multiple-choice item
The other type: The BEST-ANSWER TYPE
Example:
Which of the following is the best title for the passage?
(A) A bad experience
(B) An eventful journey
(C) A terrifying occasion
(D) An unforgettable day
Clear & concise with a definite focus
Poor Item: World War II was:
(A) the result of the failure of the League of Nations
(B) horrible
(C) fought in Europe, Asia and Africa
(D) fought during the period of 1939-1945
N.B. The stem gives no sense of what the question is asking.
Better Item: In which of these time periods was World War II fought?
(A) 1914 – 1917
(B) 1929 – 1934
(C) 1939 – 1945
(D) 1951 – 1955
N.B. The improved version identifies the question more clearly and offers the student a set of homogeneous choices.
Use clear, straightforward language. A stem with complex wording may become a test of reading comprehension rather than an assessment of the subject matter.
Poor Item: As the level of fertility approaches its nadir, what is the most likely ramification for the citizenry of a developing nation?
(A) a decrease in the labour force participation rate of women
(B) a downward trend in the youth dependency ratio
(C) a broader base in the population pyramid
(D) an increased infant mortality rate
Better Item: A major decline in fertility in a developing nation is likely to produce
(A) a decrease in the labour force participation rate of women
(B) a downward trend in the youth dependency ratio
(C) a broader base in the population pyramid
(D) an increased infant mortality rate
N.B. The improved question uses simpler words: “nadir” is replaced with “decline” and “ramification” with “produce”.
Present a positive question (highlight negative if used)
Example:
Which of the following is NOT a symptom of osteoporosis?
(A) decreased bone density
(B) frequent bone fractures
(C) raised body temperature
(D) lower back pain
Better Item: Which of the following is a symptom of osteoporosis?
(A) hair loss
(B) painful joints
(C) decreased bone density
(D) raised body temperature
Include in the stem as many of the words common to all alternatives as possible
Poor Item: Theorists of pluralism have asserted which of the following?
(A) The maintenance of democracy requires a large middle class.
(B) The maintenance of democracy requires autonomous centres of countervailing power.
(C) The maintenance of democracy requires the existence of a multiplicity of religious groups.
(D) The maintenance of democracy requires the separation of governmental powers.
Better Item: Theorists of pluralism have asserted that the maintenance of democracy requires
(A) a large middle class
(B) autonomous centres of countervailing power
(C) the existence of a multiplicity of religious groups
(D) the separation of governmental powers
Avoid giving away the answer because of grammatical cues
Poor Item: A fertile area in the desert in which the water table reaches the ground surface is called an
(A) oasis
(B) polder
(C) mirage
(D) water hole
Better Item: A fertile area in the desert in which the water table reaches the ground surface is called a/an
(A) oasis
(B) polder
(C) mirage
(D) water hole
Avoid asking for an opinion
Poor Item
Which of the following men contributed most towards the defeat of Hitler's Germany in World War II?
(A) Winston Churchill
(B) Josef Stalin
(C) Franklin D. Roosevelt
(D) George Patton
The Options/Alternatives
Each item should have 4 or 5 options
Options should be grammatically consistent with the stem
Options should be clearly different, with only ONE correct response
Options should be fairly consistent in length
Avoid “None of the above” & “All of the above”
The key should be clearly correct to the informed, while distracters should be clearly incorrect but plausible to the uninformed
Options should be fairly consistent in length
Poor Item: The main purpose of a placement test is to
(A) determine the prerequisite skills of learners so that they can be placed at an appropriate level
(B) determine end-of-course achievement
(C) determine learning progress
(D) determine learning difficulties
Better Item: The main purpose of a placement test is to determine learners’
(A) prerequisite skills
(B) learning progress
(C) learning difficulties
(D) overall achievement
Options should be clearly different with only ONE correct response
Poor Item: What is the main source of pollution of Malaysian rivers?
(A) land clearing
(B) open burning
(C) coastal erosion
(D) solid waste dumping
N.B. (A) and (B) could be the answers.
Better Item: What is the main source of pollution of Malaysian rivers?
(A) carbon dioxide emission
(B) open burning
(C) solid waste dumping
(D) coastal erosion
Use only plausible and attractive alternatives as distractors
Poor Item: Who was the third Prime Minister of Malaysia?
(A) Hussein Onn
(B) Ghafar Baba
(C) Mahathir Mohamad
(D) Musa Hitam
N.B. (B) and (D) are not serious distracters.
Better Item: Who was the third Prime Minister of Malaysia?
(A) Hussein Onn
(B) Abdul Razak Hussein
(C) Mahathir Mohamad
(D) Abdullah Badawi
Refer to Linn & Gronlund (2000), pp. 203-214, for more examples
MCQ: Strengths/Advantages
Measure learning outcomes (LOs) from simple to complex
Provide highly structured and clear tasks
Capable of covering a wide range of areas taught
Distracters provide diagnostic information
Scores – more reliable than subjective marking
Easy scoring
Can include options that vary in degree of correctness
Allow for item analysis – reveal which item is too difficult or ambiguous
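As a rough illustration of the item-analysis point above, the sketch below computes one common difficulty index (the proportion of respondents who chose the key) and tallies how often each option was chosen. The responses are invented for the example, and the function names are hypothetical; the slides do not prescribe a particular procedure.

```python
from collections import Counter


def item_difficulty(responses, key):
    """Proportion of respondents who chose the key (higher = easier item)."""
    return sum(1 for r in responses if r == key) / len(responses)


def option_counts(responses):
    """How many respondents chose each option (diagnostic use of distracters)."""
    return Counter(responses)


# Hypothetical responses of 10 students to one item whose key is "C".
responses = ["C", "C", "A", "C", "B", "C", "C", "D", "C", "A"]

p = item_difficulty(responses, "C")
counts = option_counts(responses)
print(p)
print(counts)
```

A very low proportion suggests an item that is too difficult or ambiguous, and a distracter that nobody chooses is probably not plausible.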
MCQ: Weaknesses/Disadvantages/Limitations
Time-consuming to write good items
Difficult to find plausible distracters
Not suitable for measuring the ability to organise & express ideas
Scores can be influenced by reading ability
Unable to detect individual thought processes
Unable to measure writing and speaking skills (language test)
Open to guessing
TRUE-FALSE QUESTIONS
Strengths
Suitable for testing recall or comprehension
Wide coverage of content
Easy to construct & can be written quickly
Easy to score
Scores are more reliable – objective scoring
Tunku Abdul Rahman was the first Prime Minister of Malaysia
True False
Limitations
Open to guessing – 50% chance
Recognising a false statement does not indicate that the respondent knows what is right
Difficult to write true-false statements for complex materials
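To make the guessing limitation concrete, the sketch below works out what pure guessing yields on a test where every item has the same number of options. It assumes each guess is independent and equally likely to land on any option; the function names and the 10-item test are hypothetical.

```python
from math import comb


def expected_chance_score(n_items, n_options):
    """Average number of items answered correctly by blind guessing."""
    return n_items / n_options


def prob_at_least(n_items, n_options, min_correct):
    """Binomial probability of guessing at least `min_correct` items right."""
    p = 1 / n_options
    return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
               for k in range(min_correct, n_items + 1))


# A hypothetical 10-item test: true-false (2 options) vs. 4-option MCQ.
print(expected_chance_score(10, 2))  # 5.0
print(prob_at_least(10, 2, 7))       # 0.171875
print(prob_at_least(10, 4, 7))
```

On these assumptions, a guesser reaches 7 out of 10 on a true-false test about 17% of the time, and far less often on a 4-option MCQ test.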
Constructing True-False Qs
Avoid broad general statements
Avoid trivial statements
Avoid the use of negative statements, esp. double negatives
Avoid long complex sentences
Avoid including more than one idea in one statement
Avoid statements of opinion
Avoid True and False statements of unequal length
Avoid unequal number of true & false statements
Linn, R.L. & Gronlund, N.E. (2000). Measurement and assessment in teaching. NJ: Prentice Hall.
Matching Questions
Column A (premises)
1. First US astronaut to walk in space
2. First US astronaut to ride in a space capsule
3. First US astronaut to orbit the earth
4. First US astronaut to step on the moon

Column B (responses)
A. Edwin Aldrin
B. Neil Armstrong
C. Frank Borman
D. Scott Carpenter
E. John Glenn
F. Wally Schirra
G. Alan Shepard
H. Edward White

Answers: 1 – H, 2 – G, 3 – E, 4 – B
Advantages
Good at assessing understanding of relationships, e.g. achievement – people
Possible to measure a large amount of content
Generally easy to write and score
Disadvantages
Limited to measurement of factual information
Possible to use elimination to pick the right answer
Constructing Matching Questions
Provide clear directions
Include an unequal number of responses & premises, or allow responses to be used more than once
Keep information in each column homogeneous
Put items with more words on the left (Column A)
Place all of the items for one matching exercise on one page
Table of Specifications
Test blue-print that includes the following information:
Topics/Skills/knowledge to be tested
Types & formats of questions
Weighting of each section/question
Time allocation
Cell entries are question numbers.

Topics                                                          Recall   Application   Evaluation   Weighting
A. Identity crisis vs. role confusion; achievement motivation   2, 9     4, 21, 33     16           18%
B. Adolescent sexual behavior; transition of puberty            5, 8     1, 13, 26     11           18%
C. Social isolation and self-esteem; person perception          14, 6    3, 20         25           15%
D. Egocentrism; adolescent idealism                             7, 29    12, 31        10, 15, 27   21%
E. Law and maintenance of the social order                      17       22            18           9%
F. Authoritarian bias; moral development                        19       30            24           9%
G. Universal ethical principle orientation                      28       23            32           9%
Weighting                                                       33%      40%           27%
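The weightings in a table of specifications follow from the number of questions in each cell. The sketch below recomputes them from the question numbers in the table above, assuming every question carries equal weight; the variable names are hypothetical. Note that the Application column works out to 13 of 33 questions (39.4%), which the slide shows as 40%.

```python
# Question numbers per topic and cognitive level, copied from the table.
blueprint = {
    "A": {"Recall": [2, 9], "Application": [4, 21, 33], "Evaluation": [16]},
    "B": {"Recall": [5, 8], "Application": [1, 13, 26], "Evaluation": [11]},
    "C": {"Recall": [14, 6], "Application": [3, 20], "Evaluation": [25]},
    "D": {"Recall": [7, 29], "Application": [12, 31], "Evaluation": [10, 15, 27]},
    "E": {"Recall": [17], "Application": [22], "Evaluation": [18]},
    "F": {"Recall": [19], "Application": [30], "Evaluation": [24]},
    "G": {"Recall": [28], "Application": [23], "Evaluation": [32]},
}
levels = ("Recall", "Application", "Evaluation")

# Total number of questions on the test.
total_items = sum(len(items) for row in blueprint.values()
                  for items in row.values())

# Percentage of the test devoted to each topic (row weighting).
topic_weights = {
    topic: round(100 * sum(len(items) for items in row.values()) / total_items)
    for topic, row in blueprint.items()
}

# Percentage of the test devoted to each cognitive level (column weighting).
level_weights = {
    level: round(100 * sum(len(row[level]) for row in blueprint.values())
                 / total_items)
    for level in levels
}

print(total_items)
print(topic_weights)
print(level_weights)
```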
CHAPTER 7
RELIABILITY & VALIDITY
What is a good test?
A good test must be able to measure the TRUE ABILITY of an individual, i.e. it should be able to give the TRUE SCORE of an individual.
The TRUE SCORE is difficult to obtain because of the presence of errors, which may come from various sources, such as
• within the test takers
• within the test
• in the administration of the test
• during the scoring/marking of the test
OBSERVED SCORE = TRUE SCORE + ERROR
To ensure that a test measures the TRUE SCORE, we should reduce the magnitude of error in our test.
While it’s impossible to eliminate error completely, it is possible to reduce it. To reduce the error, the test must be reliable and valid.
RELIABILITY
Reliability refers to the consistency of the measurement
A test is reliable when
(a) it yields the same score for a student who
• takes the test on different occasions
• takes parallel forms of the same test
(b) a student who answers a given question correctly is more likely to answer other similar or related questions correctly as well
METHODS FOR ESTIMATING RELIABILITY
• Test-Retest
• Parallel or Equivalent Forms
• Internal Consistency
  - Split-half
  - Cronbach Alpha
TEST-RETEST/PARALLEL FORMS
Subject Score 1 Score 2
1 4 8
2 8 10
3 20 18
4 12 12
5 14 16
6 8 10
7 20 16
8 4 4
9 20 16
10 20 16
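Pearson Product Moment Correlation
\[
r = \frac{N \sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)}
         {\sqrt{\left[N \sum X_i^2 - \left(\sum X_i\right)^2\right]\left[N \sum Y_i^2 - \left(\sum Y_i\right)^2\right]}}
\]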
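As a worked illustration, the sketch below applies the raw-score form of the Pearson product-moment formula to the Score 1 and Score 2 columns of the test-retest table above. The function name `pearson_r` is hypothetical; the data are the ten pairs of scores from the slide.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation from raw sums of paired scores."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Scores of the 10 subjects on the two administrations (from the table).
score_1 = [4, 8, 20, 12, 14, 8, 20, 4, 20, 20]
score_2 = [8, 10, 18, 12, 16, 10, 16, 4, 16, 16]

r = pearson_r(score_1, score_2)
print(round(r, 2))  # about 0.94
```

A value this close to 1.00 indicates that students kept nearly the same relative standing on both occasions.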
Internal Consistency – Split-half
Subject ODD EVEN
1 4 8
2 8 10
3 20 18
4 12 12
5 14 16
6 8 10
7 20 16
8 4 4
9 20 16
10 20 16
\[
r_{sb} = \frac{2\, r_{xy}}{1 + r_{xy}}
\]
Spearman-Brown Correlation coefficient
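The split-half procedure can be sketched in two steps: correlate the ODD and EVEN half-test scores from the table above, then apply the Spearman-Brown formula to estimate the reliability of the full-length test. The function names are hypothetical, and the sketch takes r_xy to be the Pearson correlation between the two halves.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation from raw sums of paired scores."""
    n = len(x)
    numerator = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
    denominator = sqrt((n * sum(a * a for a in x) - sum(x) ** 2)
                       * (n * sum(b * b for b in y) - sum(y) ** 2))
    return numerator / denominator


def spearman_brown(r_half):
    """Step the half-test correlation up to a full-test reliability estimate."""
    return 2 * r_half / (1 + r_half)


# Half-test scores of the 10 subjects (from the table).
odd = [4, 8, 20, 12, 14, 8, 20, 4, 20, 20]
even = [8, 10, 18, 12, 16, 10, 16, 4, 16, 16]

r_xy = pearson_r(odd, even)
r_sb = spearman_brown(r_xy)
print(round(r_xy, 2), round(r_sb, 2))  # about 0.94 and 0.97
```

The corrected value is higher than the raw half-test correlation because a full-length test is longer, and longer tests tend to be more reliable.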
Internal Consistency – Cronbach Alpha
Suitable for checking the reliability of a measurement instrument with
• Binary-type items, e.g. I’m afraid of school tests   T   F
• Scale items, e.g. I’m afraid of school tests   SA   A   N   D   SD
• MCQs
Reliability = correlation between the individual items & the extent to which individual items correlate with the total test (Refer to p.156 for the formula)
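The slides point to p. 156 for the formula and do not reproduce it. The sketch below uses the commonly cited form of Cronbach's alpha, alpha = k/(k-1) x (1 - sum of item variances / variance of total scores), which is an assumption here and may differ in notation from the textbook. The responses are invented for illustration.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha; `scores` has one row per respondent, one column per item."""
    k = len(scores[0])  # number of items
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical binary responses (1 = True, 0 = False): 5 respondents x 4 items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

alpha = cronbach_alpha(scores)
print(round(alpha, 2))
```

The same function works for scale items (e.g. SA = 5 down to SD = 1) because it only needs a numeric score per item.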
Value of Reliability Coefficient (rxy)
\[
r_{xy} = \frac{\text{Variance of the True Score}}{\text{Variance of the Observed Score}}
\]
0.00 = No reliability
1.00 = Perfect reliability
Rule of Thumb – Reliability for a classroom test
Reliability      Interpretation
.90 & above      Excellent reliability
.80 - .90        Very good
.70 - .80        Good for a classroom test but a few items could be improved
.60 - .70        Somewhat low. Some items could be removed or improved
.50 - .60        Test needs to be revised
.50 & below      Questionable reliability. Test needs to be replaced / needs major revision
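The rule of thumb can be written as a simple lookup. The bands in the table share their end points (for example, .80 appears in two rows), so the sketch below assumes that a boundary value belongs to the higher band; that choice, and the function name, are not from the slides.

```python
def interpret_reliability(r):
    """Map a reliability coefficient to the rule-of-thumb interpretation."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability coefficient must be between 0 and 1")
    if r >= 0.90:
        return "Excellent reliability"
    if r >= 0.80:
        return "Very good"
    if r >= 0.70:
        return "Good for a classroom test but a few items could be improved"
    if r >= 0.60:
        return "Somewhat low. Some items could be removed or improved"
    if r >= 0.50:
        return "Test needs to be revised"
    return "Questionable reliability. Test needs to be replaced / needs major revision"


print(interpret_reliability(0.94))
print(interpret_reliability(0.65))
```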
Use of Test Reliability to determine the true score (p. 152)
Standard Error of Measurement (Sm) - the standard deviation of the error scores of a test, i.e. the extent the error scores deviate from the mean error score.
You can determine Sm if you know SD & r of a test.
\[
S_m = SD\,\sqrt{1 - r}, \quad \text{where } r = \text{test reliability}
\]
You can estimate a student’s TRUE SCORE with some degree of certainty based on the observed score & Sm
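A short sketch of how Sm might be used in practice. The SD, reliability and observed score below are invented values, and the function names are hypothetical. The interpretation of the band (roughly 68% of the time the true score lies within one Sm of the observed score) is the usual one and assumes normally distributed errors.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Sm = SD * sqrt(1 - r)."""
    return sd * sqrt(1 - reliability)


def true_score_band(observed, sd, reliability, n_sem=1):
    """Range of plausible true scores: observed score +/- n_sem * Sm."""
    sm = standard_error_of_measurement(sd, reliability)
    return observed - n_sem * sm, observed + n_sem * sm


# Hypothetical test: SD = 10, reliability r = 0.84, observed score = 70.
sm = standard_error_of_measurement(sd=10, reliability=0.84)
low, high = true_score_band(observed=70, sd=10, reliability=0.84)
print(round(sm, 2))                     # 4.0
print(round(low, 2), round(high, 2))    # 66.0 74.0
```

The lower the reliability, the larger Sm becomes and the wider the band around the observed score.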
INTER-RATER RELIABILITY
- Indicates whether two examiners are consistent in their scoring/marking
INTRA-RATER RELIABILITY
- Indicates whether an examiner is consistent in his scoring when marking at different times
Validity
Validity refers to the extent to which a test measures what it is supposed to measure.
Types of validity
• Construct validity
• Content validity
• Criterion-related validity
  - Predictive validity
  - Concurrent validity
Construct Validity
• How far does the test measure the attributes of a construct?
Content Validity
• How far does the test cover the content (syllabus) that has been taught?
Criterion-related Validity
How far is the test related to some other criterion measure?
Examples:
• Predictive Validity: How far is the students’ SPM performance related to their performance in STPM?
• Concurrent Validity: How far is the students’ year-end English performance related to their SPM English performance?
Factors Affecting Reliability & Validity
Construction of test items
Length of test
Selection of topics
Choice of testing techniques
Method of administration
Method of marking
Task
Can you explain how each of the following testing situations could have happened?
(1) The test is valid but not reliable
(2) The test is not reliable and not valid.
(3) The test is reliable and valid.
(4) The test is reliable but not valid.