m&e slides tutorial 2

CHAPTER 3

HOW TO ASSESS?

OBJECTIVE TESTS

HOW TO ASSESS

• Objective Tests
• Essay Tests
• Projects, Practicals, Fieldwork & Oral Tests
• Observations & Portfolio Assessment

OBJECTIVE TEST

Definition

A written test consisting of questions which require respondents to select from a list of possible answers. Marking/scoring of answers is not influenced by the subjective opinions of the marker.

Formats/Types

• Multiple-Choice Questions
• Matching Questions
• True-False Questions

Parts of an MCQ

What is the capital of Mongolia?

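(A) Cochin
(B) Calcutta
(C) Katmandu
(D) Ulan Bator

Stem: the question itself, "What is the capital of Mongolia?"

Options/Alternatives: the choices (A) to (D)

Key: the correct option, (D) Ulan Bator

Distracters: the incorrect options, (A), (B) and (C)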

The Stem

In the form of a question or statement

• Direct-question form
• Incomplete-statement form

Clear & concise with a definite focus, free from poor grammar, complex sentences, ambiguity & double negatives

Present a positive question (highlight negative if used)

Ask for ONE answer only

Avoid asking for opinions

Avoid using ALWAYS & NEVER in the stem

Include in the stem as many as possible of the words common to all alternatives

The Stem

Can be in the form of a question or statement

• Direct-question form
E.g. Who was the first Prime Minister of Malaysia?

(A) Tun Dr. Mahathir
(B) Tun Abdul Razak
(C) Tun Hussein Onn
(D) Tunku Abdul Rahman

• Incomplete-statement form
E.g. The first Prime Minister of Malaysia was

(A) Tun Dr. Mahathir
(B) Tun Abdul Razak
(C) Tun Hussein Onn
(D) Tunku Abdul Rahman

The above examples are the CORRECT-ANSWER TYPE of multiple-choice item.

The other type is the BEST-ANSWER TYPE.

Example:

Which of the following is the best title for the passage?

(A) A bad experience

(B) An eventful journey

(C) A terrifying occasion

(D) An unforgettable day

Clear & concise with a definite focus

Poor item:
World War II was:
(A) the result of the failure of the League of Nations
(B) horrible
(C) fought in Europe, Asia and Africa
(D) fought during the period of 1939-1945

N.B. The stem gives no sense of what the question is asking.

Better item:
In which of these time periods was World War II fought?
(A) 1914 – 1917
(B) 1929 – 1934
(C) 1939 – 1945
(D) 1951 – 1955

N.B. The improved version identifies the question more clearly and offers the student a set of homogeneous choices.

Use clear, straightforward language. A stem with complex wording may become a test of reading comprehension rather than an assessment of the subject matter.

Poor item:
As the level of fertility approaches its nadir, what is the most likely ramification for the citizenry of a developing nation?
(A) a decrease in the labour force participation rate of women
(B) a downward trend in the youth dependency ratio
(C) a broader base in the population pyramid
(D) an increased infant mortality rate

Better item:
A major decline in fertility in a developing nation is likely to produce
(A) a decrease in the labour force participation rate of women
(B) a downward trend in the youth dependency ratio
(C) a broader base in the population pyramid
(D) an increased infant mortality rate

N.B. In the improved question, “nadir” is replaced with “decline” and “ramification” with “produce”, which are simpler words.

Present a positive question (highlight negative if used)

Example:

Which of the following is NOT a symptom of osteoporosis?
(A) decreased bone density
(B) frequent bone fractures
(C) raised body temperature
(D) lower back pain

Better item:
Which of the following is a symptom of osteoporosis?
(A) hair loss
(B) painful joints
(C) decreased bone density
(D) raised body temperature

Include in the stem as many as possible of the words common to all alternatives

Poor item:
Theorists of pluralism have asserted which of the following?
(A) The maintenance of democracy requires a large middle class.
(B) The maintenance of democracy requires autonomous centres of countervailing power.
(C) The maintenance of democracy requires the existence of a multiplicity of religious groups.
(D) The maintenance of democracy requires the separation of governmental powers.

Better item:
Theorists of pluralism have asserted that the maintenance of democracy requires
(A) a large middle class
(B) autonomous centres of countervailing power
(C) the existence of a multiplicity of religious groups
(D) the separation of governmental powers

Avoid giving away the answer because of grammatical cues

Poor item:
A fertile area in the desert in which the water table reaches the ground surface is called an
(A) oasis
(B) polder
(C) mirage
(D) water hole

Better item:
A fertile area in the desert in which the water table reaches the ground surface is called a/an
(A) oasis
(B) polder
(C) mirage
(D) water hole

Avoid asking for an opinion

Poor Item

Which of the following men contributed most towards the defeat of Hitler's Germany in World War II?

(A) Winston Churchill

(B) Josef Stalin

(C) Franklin D. Roosevelt

(D) George Patton

The Options/Alternatives

• Each item should have 4 or 5 options
• Options should be grammatically consistent with the stem
• Options should be clearly different, with only ONE correct response
• Options should be fairly consistent in length
• Avoid “None of the above” & “All of the above”
• The key should be clearly correct to the informed, while distracters should be clearly incorrect but plausible to the uninformed

Options should be fairly consistent in length

Poor item:
The main purpose of a placement test is to
(A) determine the prerequisite skills of learners so that they can be placed at an appropriate level
(B) determine end-of-course achievement
(C) determine learning progress
(D) determine learning difficulties

Better item:
The main purpose of a placement test is to determine learners’
(A) prerequisite skills
(B) learning progress
(C) learning difficulties
(D) overall achievement

Options should be clearly different with only ONE correct response

Poor item:
What is the main source of pollution of Malaysian rivers?
(A) land clearing
(B) open burning
(C) coastal erosion
(D) solid waste dumping

N.B. (A) and (D) could both be the answer.

Better item:
What is the main source of pollution of Malaysian rivers?
(A) carbon dioxide emission
(B) open burning
(C) solid waste dumping
(D) coastal erosion

Use only plausible and attractive alternatives as distracters

Poor item:
Who was the third Prime Minister of Malaysia?
(A) Hussein Onn
(B) Ghafar Baba
(C) Mahathir Mohamad
(D) Musa Hitam

N.B. (B) and (D) are not serious distracters.

Better item:
Who was the third Prime Minister of Malaysia?
(A) Hussein Onn
(B) Abdul Razak Hussein
(C) Mahathir Mohamad
(D) Abdullah Badawi

Refer to Linn & Gronlund (2000), pp. 203-214, for more examples.

MCQ: Strengths/Advantages

Measure learning outcomes (LOs) from simple to complex

Provide highly structured and clear tasks

Capable of covering a wide range of areas taught

Distracters provide diagnostic information

Scores – more reliable than subjective marking

Easy scoring

Can include options that vary in degree of correctness

Allow for item analysis – reveal which item is too difficult or ambiguous

MCQ: Weaknesses/Disadvantages/Limitations

• Time-consuming to write good items
• Difficult to find plausible distracters
• Not suitable for measuring the ability to organise & express ideas
• Scores can be influenced by reading ability
• Unable to detect individual thought processes
• Unable to measure writing and speaking skills (language test)
• Open to guessing

TRUE-FALSE QUESTIONS

Strengths

• Suitable for testing recall or comprehension
• Wide coverage of content
• Easy to construct & can be written quickly
• Easy to score
• Scores are more reliable (objective scoring)

Tunku Abdul Rahman was the first Prime Minister of Malaysia

True False

Limitations

• Open to guessing: a 50% chance of answering correctly
• Recognising a false statement does not indicate that the respondent knows what is right
• Difficult to write true-false statements for complex materials
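The effect of guessing can be made concrete with a short calculation. The sketch below is a minimal illustration, assuming pure blind guessing with every option equally likely; the function name and the 40-item test length are hypothetical.

```python
def expected_guess_score(num_items: int, options_per_item: int) -> float:
    """Expected number of correct answers from blind guessing alone."""
    # Each item is answered correctly with probability 1 / options_per_item.
    return num_items / options_per_item


# Hypothetical 40-item tests answered entirely by guessing.
true_false = expected_guess_score(40, 2)   # two options: True or False
four_option = expected_guess_score(40, 4)  # four-option MCQ

print(f"True-false: {true_false:.0f} of 40 expected correct by chance")
print(f"4-option MCQ: {four_option:.0f} of 40 expected correct by chance")
```

Under these assumptions a guesser expects half marks on a true-false test but only a quarter on a four-option MCQ, which is why guessing is a more serious limitation for true-false items.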

Constructing True-False Qs

• Avoid broad general statements
• Avoid trivial statements
• Avoid the use of negative statements, esp. double negatives
• Avoid long complex sentences
• Avoid including more than one idea in one statement
• Avoid statements of opinion
• Avoid True and False statements of unequal length
• Avoid an unequal number of true & false statements

Linn, R.L. & Gronlund, N.E. (2000). Measurement and assessment in teaching. NJ: Prentice Hall.

Matching Questions

Column A (premises)
1. First US astronaut to walk in space
2. First US astronaut to ride in a space capsule
3. First US astronaut to orbit the earth
4. First US astronaut to step on the moon

Column B (responses)
A. Edwin Aldrin
B. Neil Armstrong
C. Frank Borman
D. Scott Carpenter
E. John Glenn
F. Wally Schirra
G. Alan Shepard
H. Edward White

Key: 1-H, 2-G, 3-E, 4-B

Advantages
• Good at assessing understanding of relationships, e.g. achievement – people
• Possible to measure a large amount of content
• Generally easy to write and score

Disadvantages
• Limited to measurement of factual information
• Possible to use elimination to pick the right answer

Constructing Matching Questions
• Provide clear directions
• Include an unequal number of responses & premises, or allow responses to be used more than once
• Keep information in each column homogeneous
• Put items with more words on the left (Column A)
• Place all of the items for one matching exercise on one page

Table of Specifications

Test blueprint that includes the following information:

Topics/Skills/knowledge to be tested

Types & formats of questions

Weighting of each section/question

Time allocation

(Cell entries are item numbers.)

Topics                                                          Recall  Application  Evaluation  Weight
A. Identity crisis vs. role confusion; achievement motivation.  2, 9    4, 21, 33    16          18%
B. Adolescent sexual behavior; transition of puberty.           5, 8    1, 13, 26    11          18%
C. Social isolation and self-esteem; person perception.         14, 6   3, 20        25          15%
D. Egocentrism; adolescent idealism.                            7, 29   12, 31       10, 15, 27  21%
E. Law and maintenance of the social order.                     17      22           18          9%
F. Authoritarian bias; moral development.                       19      30           24          9%
G. Universal ethical principle orientation.                     28      23           32          9%
Weight                                                          33%     40%          27%
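The weightings in the table can be checked by counting the items in each cell. The following is a minimal sketch, assuming each weight is simply the share of the 33 items that falls in that row or column; the variable names are illustrative only.

```python
# Item numbers per topic and cognitive level, copied from the table of specifications.
blueprint = {
    "A": {"Recall": [2, 9], "Application": [4, 21, 33], "Evaluation": [16]},
    "B": {"Recall": [5, 8], "Application": [1, 13, 26], "Evaluation": [11]},
    "C": {"Recall": [14, 6], "Application": [3, 20], "Evaluation": [25]},
    "D": {"Recall": [7, 29], "Application": [12, 31], "Evaluation": [10, 15, 27]},
    "E": {"Recall": [17], "Application": [22], "Evaluation": [18]},
    "F": {"Recall": [19], "Application": [30], "Evaluation": [24]},
    "G": {"Recall": [28], "Application": [23], "Evaluation": [32]},
}
levels = ["Recall", "Application", "Evaluation"]

# Flatten every cell to get the full list of item numbers on the test.
all_items = [n for row in blueprint.values() for cell in row.values() for n in cell]
total_items = len(all_items)

# Weight of each topic (row) and each cognitive level (column), in whole percent.
row_weights = {
    topic: round(100 * sum(len(row[level]) for level in levels) / total_items)
    for topic, row in blueprint.items()
}
col_weights = {
    level: round(100 * sum(len(row[level]) for row in blueprint.values()) / total_items)
    for level in levels
}

print("Total items:", total_items)
print("Topic weights (%):", row_weights)
print("Level weights (%):", col_weights)
```

Counted this way, the Application column holds 13 of 33 items, about 39%; the table shows 40%, presumably rounded so that the three level weights total 100%.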

CHAPTER 7

RELIABILITY & VALIDITY

What is a good test?

A good test must be able to measure the TRUE ABILITY of an individual, i.e. it should be able to give the TRUE SCORE of an individual

TRUE SCORE is difficult to obtain because of the presence of errors, which may come from various sources such as
• within the test takers
• within the test
• in the administration of the test
• during the scoring/marking of the test

OBSERVED SCORE = TRUE SCORE + ERROR

To ensure that a test measures the TRUE SCORE, we should reduce the magnitude of error in our test.

[Diagram: the OBSERVED SCORE is made up of the TRUE SCORE and ERROR]

While it’s impossible to eliminate error completely, it is possible to reduce it. To reduce the error, the test must be reliable and valid

RELIABILITY

Reliability refers to the consistency of the measurement

A test is reliable

(a) when it yields the same score for a student who
• takes the test on different occasions
• takes parallel forms of the same test

(b) when a student who answers a given question correctly is more likely to answer other similar or related questions correctly as well

METHODS FOR ESTIMATING RELIABILITY

• Test-Retest
• Parallel or Equivalent Forms
• Internal Consistency
  • Split-half
  • Cronbach Alpha

TEST-RETEST/PARALLEL FORMS

Subject   Score 1   Score 2
1         4         8
2         8         10
3         20        18
4         12        12
5         14        16
6         8         10
7         20        16
8         4         4
9         20        16
10        20        16

Pearson Product Moment Correlation

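\[
r = \frac{N \sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)}{\sqrt{\left[N \sum X_i^2 - \left(\sum X_i\right)^2\right]\left[N \sum Y_i^2 - \left(\sum Y_i\right)^2\right]}}
\]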
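As a worked check, the raw-score formula can be applied to the ten pairs of scores in the test-retest table. This is a minimal Python sketch; the function name pearson_r is illustrative and not taken from the slides.

```python
import math

# Scores of the ten subjects on the two administrations (from the table).
score_1 = [4, 8, 20, 12, 14, 8, 20, 4, 20, 20]
score_2 = [8, 10, 18, 12, 16, 10, 16, 4, 16, 16]


def pearson_r(x, y):
    """Pearson product-moment correlation using the raw-score formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(score_1, score_2)
print(f"r = {r:.3f}")  # about 0.938 for these data
```

For these scores the sums are ΣX = 130, ΣY = 126, ΣXY = 1896, ΣX² = 2100 and ΣY² = 1772, giving r = 2580 / √(4100 × 1844), roughly 0.94.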

Internal Consistency – Split-half

Subject   ODD   EVEN
1         4     8
2         8     10
3         20    18
4         12    12
5         14    16
6         8     10
7         20    16
8         4     4
9         20    16
10        20    16

Spearman-Brown correlation coefficient

\[
r_{sb} = \frac{2 r_{xy}}{1 + r_{xy}}
\]
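The split-half procedure can be sketched in two steps: correlate the odd-item and even-item half scores, then step the result up with the Spearman-Brown formula, since a correlation between two half-length tests tends to understate the reliability of the full-length test. The sketch below uses the ODD/EVEN scores from the table; the function names are illustrative only.

```python
import math

# Half-test scores of the ten subjects (odd-numbered vs. even-numbered items).
odd = [4, 8, 20, 12, 14, 8, 20, 4, 20, 20]
even = [8, 10, 18, 12, 16, 10, 16, 4, 16, 16]


def pearson_r(x, y):
    """Pearson product-moment correlation using the raw-score formula."""
    n = len(x)
    numerator = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
    denominator = math.sqrt(
        (n * sum(a * a for a in x) - sum(x) ** 2)
        * (n * sum(b * b for b in y) - sum(y) ** 2)
    )
    return numerator / denominator


def spearman_brown(r_half):
    """Step a half-test correlation up to a full-length reliability estimate."""
    return 2 * r_half / (1 + r_half)


r_xy = pearson_r(odd, even)   # correlation between the two halves
r_sb = spearman_brown(r_xy)   # estimated reliability of the whole test

print(f"r_xy = {r_xy:.3f}")   # about 0.938
print(f"r_sb = {r_sb:.3f}")   # about 0.968
```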

Internal Consistency – Cronbach Alpha

Suitable for checking the reliability of a measurement instrument with
• Binary-type items, e.g. I’m afraid of school tests   T   F
• Scale items, e.g. I’m afraid of school tests   SA   A   N   D   SD
• MCQs

Reliability = correlation between the individual items & the extent to which individual items correlate with the total test (Refer to p.156 for the formula)
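The slides do not reproduce the formula itself. As a hedged illustration, the sketch below uses the commonly cited form of Cronbach's alpha, α = k/(k − 1) × (1 − Σ item variances / variance of total scores), on the assumption that this matches the textbook formula referred to. The response data are hypothetical (five students, four binary items), and the function name is illustrative.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])  # number of items
    # Variance of each item across respondents.
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the respondents' total scores.
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: 1 = correct (or True), 0 = incorrect (or False).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")  # 0.80 for this hypothetical data
```

Here the item variances sum to 0.80 and the total-score variance is 2.0, so α = (4/3) × (1 − 0.4) = 0.80. The same function works for scale items if each response is coded numerically (e.g. SA = 5 to SD = 1).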

Value of Reliability Coefficient (rxy)

\[
r_{xy} = \frac{\text{Variance of the True Score}}{\text{Variance of the Observed Score}}
\]

0.00 = No reliability

1.00 = Perfect reliability

Rule of Thumb – Reliability for a classroom test

Reliability    Interpretation
.90 & above    Excellent reliability
.80 - .90      Very good
.70 - .80      Good for a classroom test but a few items could be improved
.60 - .70      Somewhat low. Some items could be removed or improved
.50 - .60      Test needs to be revised
.50 & below    Questionable reliability. Test needs to be replaced or needs major revision

Use of Test Reliability to determine the true score (p. 152)

Standard Error of Measurement (Sm) - the standard deviation of the error scores of a test, i.e. the extent to which the error scores deviate from the mean error score.

You can determine Sm if you know the SD & r of a test.

\[
S_m = SD \sqrt{1 - r}
\]

where r = test reliability

You can estimate a student’s TRUE SCORE with some degree of certainty based on the observed score & Sm
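A short sketch of how Sm turns an observed score into a band for the true score. The test values here (SD = 10, r = 0.84, observed score = 70) are hypothetical, and the bands assume measurement errors are roughly normally distributed, so that about 68% of true scores fall within 1 Sm of the observed score and about 95% within 2 Sm.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Sm = SD * sqrt(1 - r), where r is the test reliability."""
    return sd * math.sqrt(1 - reliability)


# Hypothetical test statistics and one student's observed score.
sd = 10.0
r = 0.84
observed = 70.0

s_m = standard_error_of_measurement(sd, r)

# Bands around the observed score, assuming normally distributed error.
band_68 = (observed - s_m, observed + s_m)
band_95 = (observed - 2 * s_m, observed + 2 * s_m)

print(f"Sm = {s_m:.1f}")
print(f"About 68% band: {band_68[0]:.1f} to {band_68[1]:.1f}")
print(f"About 95% band: {band_95[0]:.1f} to {band_95[1]:.1f}")
```

With these values Sm is 4, so the true score is estimated to lie between 66 and 74 with roughly 68% certainty. A more reliable test (higher r) gives a smaller Sm and a narrower band.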

INTER-RATER RELIABILITY

- Indicates whether two examiners are consistent in their scoring/marking

INTRA-RATER RELIABILITY

- Indicates whether an examiner is consistent in his scoring when marking at different times

Validity

Validity refers to the extent to which a test measures what it is supposed to measure.

Types of validity

• Construct validity
• Content validity
• Criterion-related validity
  • Predictive validity
  • Concurrent validity

Construct Validity

• How far does the test measure the attributes of a construct?

Content Validity

• How far does the test cover the content (syllabus) that has been taught?

Criterion-related Validity

How far is the test related to some other criterion measure?

Examples:

• How far is the students’ SPM performance related to their performance in STPM? (Predictive Validity)
• How far is the students’ year-end English performance related to their SPM English performance? (Concurrent Validity)

Factors Affecting Reliability & Validity

• Construction of test items
• Length of test
• Selection of topics
• Choice of testing techniques
• Method of administration
• Method of marking

Task

Can you explain how each of the following testing situations could have happened?

(1) The test is valid but not reliable

(2) The test is not reliable and not valid.

(3) The test is reliable and valid.

(4) The test is reliable but not valid.
