Reliability in Measurement
CH 6 & 7
Reliability
• The proportion of variance in a set of test scores that is due to the real or “true” attributes of the persons being measured, rather than to error
• Also called repeatability, consistency, or stability
$$r_{xx} = \frac{\sigma^2_{\text{True}}}{\sigma^2_{\text{Observed}}} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_e}, \qquad 0 \le r_{xx} \le 1$$
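As a hypothetical illustration: if the true-score variance were 9 and the error variance 1, then

$$r_{xx} = \frac{9}{9 + 1} = .90,$$

meaning 90% of the observed-score variance would reflect true differences among the people measured.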
Reliability as Repeatability
• Conceptually, any observation has some degree of error or imprecision
Observed score = TRUE SCORE + ERRORS OF MEASUREMENT
• By taking multiple measurements it is presumed that these random errors will cancel each other out
• Under certain assumptions the mean of repeated measurements is considered an estimate of the “true” score
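A minimal Python sketch of this idea, assuming normally distributed random error (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 133.0      # a person's hypothetical "true" weight
error_sd = 5.0          # spread of the random measurement error
n_repeats = 1_000       # number of repeated measurements

# Observed score = true score + random error on each occasion
observed = true_score + rng.normal(0.0, error_sd, size=n_repeats)

# With many repetitions the random errors cancel out, so the mean
# of the observed scores approaches the true score
print(observed.mean())  # close to 133.0
```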
Components of Reliability
• Test scores reflect:
– Consistency: stable characteristics
– Inconsistency: factors that affect the scores but have nothing to do with the characteristics being measured
Components of Reliability
• Want a statistic of the proportion of total test score variance that is due to the true score variance
• i.e., what proportion is not due to error variance?
• Defining true score variance as the consistent, stable variance
Classical Test Theory (CTT) Reliability
• Observed score = true score + error
• X = True + error
• $\sigma^2_X = \sigma^2_T + \sigma^2_e$
– What is observed is a function of the variability in the true scores and the variability of the errors of measurement
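A small simulation of this decomposition (a sketch; the variance components are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

n_people = 100_000
var_true, var_error = 9.0, 1.0   # hypothetical variance components

T = rng.normal(50.0, np.sqrt(var_true), size=n_people)   # true scores
e = rng.normal(0.0, np.sqrt(var_error), size=n_people)   # random errors
X = T + e                                                # observed scores

# Because T and e are uncorrelated, the variances add
print(X.var())            # approx. var_true + var_error = 10
print(T.var() / X.var())  # reliability, approx. 9/10 = .90
```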
Assumptions of Classical Test Theory
• Error of measurement is an unsystematic or random deviation of an individual’s score from the theoretically expected observed score (true score)
– Observed score = True score + error
– The true score is an ‘expected’ or mean score, not a ‘real’ trait; it “represents a combination of all the factors that lead to consistency in the measurement”
– Errors are not correlated with true scores (i.e., they are random)
• For a given individual, an error may not be a completely random event. However, across a number of individuals, the causes of error are assumed to be random.
– Errors on two different tests are not correlated
Reliability
• Completely Reliable scale
133 lb, 133 lb, 133 lb, 133 lb, 133 lb
Reliability
• Completely Unreliable scale
115 lb, 140 lb, 141 lb, 122 lb, 118 lb
Reliability
• Reliability is highest when, in X = T + E, the error term E is small
– Less error, higher reliability
Methods of Assessing Reliability
1. Test-retest
2. Alternate forms
3. Split-half
4. Internal consistency
5. Inter-rater reliability
1. Test-Retest Reliability
• "Temporal stability”: Simply, the rank-order stability of scores from one administration of a test to another.1. Administer the test to a group of people2. Re-administer it at some other time to the same
group of people3. correlate Time 1 and Time 2 scores
• If correlation < 1.0, due to error variance
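A minimal sketch of the computation, with hypothetical scores:

```python
import numpy as np

# Hypothetical Time 1 and Time 2 scores for the same five people
time1 = np.array([12, 18, 25, 31, 40])
time2 = np.array([14, 17, 27, 30, 38])

# Test-retest reliability is the Pearson correlation between the
# two administrations
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(round(r_test_retest, 2))
```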
Test-Retest Issues
• The “true score” should remain the same; that is what is being correlated across the time points
– The lower the correlation, the less stable the scores and the more error or extraneous variation
• Problems with this approach?
Problem One
• Characteristics or attributes being measured may change between Time 1 and Time 2
• Why might this happen?
– Change in “true” score; almost all psychological traits exhibit some change across a long enough time interval (e.g., reading ability of children)
• Use short time intervals (1 week, 1 month) to estimate reliability
– We want error due to random fluctuations, not long-term changes
Problem Two
• Practice effects (a.k.a. carry-over effects)
• Learning might occur during the first administration
• Remembering content
• Especially an issue if the time between administrations is too short
• More of a problem for performance-type measures
Problem Three
• Reactivity Effects
• The experience of taking the test itself can change a person’s true score
– E.g., on a test of geography, test takers may become curious about the correct answers to the questions and go study them
– E.g., a test of marital satisfaction may involve questions addressing dimensions the person had never thought of before; they may then start paying attention to those dimensions and change accordingly. Thus, the mere act of measurement may change the test-taker.
2. Parallel (Alternative) Forms Reliability
• Using the same test on repeated occasions has certain problems (test-retest)
• Therefore, we can use parallel forms of the test on the two occasions
• Persons take one form at Time 1 and the alternate form at Time 2 (e.g., GRE)
• The correlation between the tests is the reliability of the test
• The mean of the two scores is an estimate of an individual’s true score
• The correlation between parallel forms is also an index of temporal stability
To Be Parallel You Must
• Have the same means and standard deviations
• Items must be of the same difficulty
• Same number of items, expressed in the same form, and covering the same content
• ALL other characteristics must be the same
• Examinees should be indifferent to the forms
Issues With Parallel Forms
• Nice idea, but it is not easy to construct forms that are identical, or even very similar, in all respects
• Just like the test-retest method, this method requires two separate test administrations; thus, it can be quite costly and cumbersome
• Unless the forms are perfectly parallel, this form of reliability violates the assumptions of CTT
3. Split-Half Reliability
• Instead of creating two different forms, why not create one form and split it into two?
• Reliability is the correlation between the two halves
Issues with Split-Half Reliability
• How should we split the test?
– 1st half vs. 2nd half? (not a good idea; think about fatigue effects)
– Even vs. odd items
– Random halves
• Fact: each half does not contain all the items
– This is a problem because there is a direct relationship between test length and reliability
• Using only half the items reduces our estimate of reliability
– It also fundamentally violates the assumptions of CTT
Spearman-Brown Formula
• A way to “correct” for using only half of the items
• A formula that computes what the reliability would be if the test were longer or shorter (see the sketch below)
– So it corrects for the small number of items
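The slide leaves the formula implicit; the standard Spearman-Brown prophecy formula is $r' = \frac{k\,r}{1 + (k - 1)\,r}$, where $k$ is the factor by which the test length changes. A short sketch (the half-test correlation of .70 is hypothetical):

```python
def spearman_brown(r, k):
    """Predicted reliability when a test is lengthened by factor k.

    r: reliability (correlation) of the current test, e.g., a half-test
    k: length multiplier (k=2 corrects a split-half correlation up to
       full-test length; k=0.5 predicts the reliability of a half test)
    """
    return k * r / (1 + (k - 1) * r)

# Correct a half-test correlation of .70 up to full-test length
print(spearman_brown(0.70, 2))  # approx. .82
```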
4. Internal Consistency
• Average item intercorrelation
• Cronbach’s coefficient alpha
• Take the logic of split-half and parallel-forms reliability to the extreme
– Every ITEM is a parallel test of the construct
– Therefore, the average correlation among items is an index of reliability
– Last, the correlation of an item with the total is an index of reliability
Average Item Intercorrelation
• Internal consistency reliability
• ICR estimates differ from the other methods; the focus is on the number of items in the test, the intercorrelations among the items, and their correlations with the test as a whole
• An example: imagine two people taking an internally consistent test of extraversion
Internal Consistency Example
• Person A is very extraverted; Person B is not
– For every item, Person A always responds “true” and Person B always responds “false”
• So, within a sample of different people, the responses to items will be correlated
– People who score high on item 1 will also score high on items 2, 3, …, n
– Internal consistency
A Second Example
• Imagine Person A and Person B take an internally consistent test of intelligence
– Person A is very intelligent; Person B is not so bright
– Person A passes every item; Person B fails nearly every item
– Again, within a sample of different people, the item responses will be correlated
– People who pass item 1 will tend to pass items 2, 3, …, n
Internal Consistency Data
1. Administering a test to a group of individuals
2. Computing correlations among all items and averaging those intercorrelations
3. Computing the correlation of each item’s score with the score of the test without that item (see the sketch below)
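A sketch of steps 2 and 3, assuming the responses are stored as a people × items NumPy array (the data are hypothetical):

```python
import numpy as np

# Hypothetical responses: rows = people, columns = items
scores = np.array([
    [5, 4, 5, 4],
    [2, 1, 2, 2],
    [4, 4, 3, 4],
    [1, 2, 1, 1],
    [3, 3, 4, 3],
], dtype=float)

# Step 2: average of the unique inter-item correlations
R = np.corrcoef(scores, rowvar=False)     # item x item correlation matrix
lower = R[np.tril_indices_from(R, k=-1)]  # below-diagonal entries only
print(lower.mean())                       # average inter-item correlation

# Step 3: correlate each item with the total of the remaining items
for i in range(scores.shape[1]):
    rest = np.delete(scores, i, axis=1).sum(axis=1)
    r_item_total = np.corrcoef(scores[:, i], rest)[0, 1]
    print(f"item {i + 1}: item-total r = {r_item_total:.2f}")
```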
Average Inter-Item Correlations
I1 I2 I3 I4 I5 In
I1 1.00
I2 .89 1.00
I3 .91 .92 1.00
I4 .88 .93 .95 1.00
I5 .84 .86 .92 .85 1.00
In .88 .91 .95 .87 .85 1.00
Average inter-item correlation = .90
Internal Consistency Reliability
• The reliability of a test is based on the number of items in the test (k) and the average intercorrelation among test items.
• Thus, ICR methods are mathematically linked to the split-half method
– Alpha represents the mean reliability coefficient one would obtain from all possible split halves
Cronbach’s Coefficient Alpha
• Most commonly used measure of internal reliability
• Alpha is the average value of all possible split-half reliabilities
$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s^2_i}{s^2_{\text{total score}}}\right)$$

where $k$ is the number of items, $s^2_i$ is the variance of item $i$, and $s^2_{\text{total score}}$ is the variance of the total test scores.
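A direct translation of the formula into Python (a sketch; the score matrix is hypothetical, and the variances use the sample formula, ddof=1, as most statistical packages do):

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a people x items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # s_i^2 for each item
    total_var = scores.sum(axis=1).var(ddof=1)  # s^2 of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: rows = people, columns = items
scores = [[5, 4, 5, 4],
          [2, 1, 2, 2],
          [4, 4, 3, 4],
          [1, 2, 1, 1],
          [3, 3, 4, 3]]
print(cronbach_alpha(scores))
```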
Internal Consistency as a Method
• The principal advantage of ICR methods is practicality
– You don’t need to administer the test multiple times
• Split-half methods may look more computationally challenging
– This is no longer an issue; computers generate reliability estimates in a second
Using Cronbach’s Alpha
• By convention, alpha should be at least .70 to retain an item in an "adequate" scale
• Many researchers require a cut-off of .80 for a good scale
• Most consider alpha above .95 an indication of item redundancy
• However, the specificity or generality of the scale’s focus affects these estimates
Some Examples of Reliability
The Data
• 291 participants
• From a recent scale-development project
• Included 308 potential items over 5 facets
– This facet ended up with 25 items
Observation Item3 Item5 Item9 Item12 Item25
19 5 1 2 1 1
24 5 4 3 4 2
30 4 2 1 3 1
58 4 4 3 2 3
71 5 5 3 3 1
86 5 1 2 5 1
93 4 2 2 1 1
99 5 4 3 2 3
100 5 1 2 3 1
261 5 1 4 4 1
Sample Mean 2.51 2.66 2.14 2.74 1.82
Split-Half Reliability
• Items 1–13 vs. items 14–25: r = .92
• Odd vs. even items: r = .92
• Split by 3’s: r = .97
Average Inter-item Correlations
         Item 3  Item 5  Item 9  Item 12  Item 25
Item 3    1.00
Item 5    0.68    1.00
Item 9    0.58    0.52    1.00
Item 12   0.71    0.67    0.55     1.00
Item 25   0.71    0.68    0.53     0.70     1.00

Mean inter-item correlation = 0.56
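The two summaries are consistent with each other. Applying the generalized Spearman-Brown formula to the mean inter-item correlation (assuming $\bar{r} \approx .56$ holds across all 25 items, not just the five shown) reproduces the alpha reported on the next slide:

$$\alpha \approx \frac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}} = \frac{25 \times .56}{1 + 24 \times .56} = \frac{14.0}{14.44} \approx .97$$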
Cronbach’s Alpha

Item      Alpha   Alpha If Item Deleted   Item-Total Correlation
Item 3    .97     .97                     .80
Item 5    .97     .97                     .74
Item 9    .97     .97                     .75
Item 12   .97     .97                     .81
Item 25   .97     .97                     .71

α = 0.97
SPSS
[SPSS reliability output shown on the original slide]

Reliability Estimates and Error
[figure shown on the original slide]
5. Inter-rater Reliability
• What if you’re not using a test, but instead observing individuals’ behaviors as a psychological assessment tool?
• How can we tell if the judges (assessors) are reliable?
• Typically a set of criteria are established for judging the behavior and the judge is trained on the criteria
• Then to establish the reliability of both the set of criteria and the judge, multiple judges rate the same series of behaviors
• The correlation between the judges is the typical measure of reliability
• Kappa is a measure of inter-rater reliability that corrects for chance agreement (a sketch follows below)
• Values range from −1 (less agreement than expected by chance) to +1 (perfect agreement)
• Common benchmarks:
– Above .75: “excellent”
– .40 to .75: “fair to good”
– Below .40: “poor”
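A minimal sketch of the kappa computation for two raters (the ratings are hypothetical):

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Agreement between two raters, corrected for chance agreement."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(rater1, rater2)

    p_observed = np.mean(rater1 == rater2)  # raw proportion of agreement
    # Chance agreement: product of each rater's marginal proportions,
    # summed over categories
    p_chance = sum(np.mean(rater1 == c) * np.mean(rater2 == c)
                   for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical behavior codes assigned by two trained judges
judge1 = [1, 2, 2, 3, 1, 2, 3, 3, 1, 2]
judge2 = [1, 2, 3, 3, 1, 2, 3, 2, 1, 2]
print(round(cohens_kappa(judge1, judge2), 2))  # approx. .70
```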