Reliability in Measurement
CH 6 & 7
Reliability
• The proportion of variance in a set of test scores that is due to the real or “true” attributes of the persons being measured, rather than to error
• Also called repeatability, consistency, or stability
$$r_{xx} = \frac{\sigma^2_{\text{True}}}{\sigma^2_{\text{Observed}}} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_e}, \qquad 0 \le r_{xx} \le 1$$
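As a hypothetical illustration: if the true-score variance were 9 and the error variance 1, then

$$r_{xx} = \frac{9}{9 + 1} = .90,$$

meaning 90% of the observed-score variance would reflect true differences among the people measured.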
Reliability as Repeatability
• Conceptually, any observation has some degree of error or imprecision
Observed score = TRUE SCORE + ERRORS OF MEASUREMENT
• By taking multiple measurements it is presumed that these random errors will cancel each other out
• Under certain assumptions the mean of repeated measurements is considered an estimate of the “true” score
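A minimal Python sketch of this idea, assuming normally distributed random error (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 133.0      # a person's hypothetical "true" weight
error_sd = 5.0          # spread of the random measurement error
n_repeats = 1_000       # number of repeated measurements

# Observed score = true score + random error on each occasion
observed = true_score + rng.normal(0.0, error_sd, size=n_repeats)

# With many repetitions the random errors cancel out, so the mean
# of the observed scores approaches the true score
print(observed.mean())  # close to 133.0
```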
Components of Reliability
• Test scores reflect:
– Consistency: stable characteristics
– Inconsistency: factors that affect the scores but have nothing to do with the characteristics being measured
Components of Reliability
• Want a statistic of the proportion of total test score variance that is due to the true score variance
• i.e., what proportion is not due to error variance?
• Defining true score variance as the consistent, stable variance
Classical Test Theory (CTT) Reliability
• Observed score = true score + error
• X = True + error
• $\sigma^2_X = \sigma^2_T + \sigma^2_e$
– What is observed is a function of the variability in the true scores and the variability of the errors of measurement
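A small simulation of this decomposition (a sketch; the variance components are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

n_people = 100_000
var_true, var_error = 9.0, 1.0   # hypothetical variance components

T = rng.normal(50.0, np.sqrt(var_true), size=n_people)   # true scores
e = rng.normal(0.0, np.sqrt(var_error), size=n_people)   # random errors
X = T + e                                                # observed scores

# Because T and e are uncorrelated, the variances add
print(X.var())            # approx. var_true + var_error = 10
print(T.var() / X.var())  # reliability, approx. 9/10 = .90
```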
Assumptions of Classical Test Theory
• Error of measurement is an unsystematic or random deviation of an individual’s score from the theoretically expected observed score (true score)
– Observed score = True score + error
– The true score is an ‘expected’ or mean score, not a ‘real’ trait; it “represents a combination of all the factors that lead to consistency in the measurement”
– Errors are not correlated with true scores (i.e., they are random)
• For a given individual, an error may not be a completely random event. However, across a number of individuals, the causes of error are assumed to be random.
– Errors on two different tests are not correlated
Reliability
• Completely Reliable scale
133 lb, 133 lb, 133 lb, 133 lb, 133 lb
Reliability
• Completely Unreliable scale
115 lb, 140 lb, 141 lb, 122 lb, 118 lb
Reliability
• Reliability is highest when, in X = T + E, the error term E is small
– Less error, higher reliability
Methods of Assessing Reliability
1. Test-retest
2. Alternate forms
3. Split-half
4. Internal consistency
5. Inter-rater reliability
1. Test-Retest Reliability
• "Temporal stability”: Simply, the rank-order stability of scores from one administration of a test to another.1. Administer the test to a group of people2. Re-administer it at some other time to the same
group of people3. correlate Time 1 and Time 2 scores
• If correlation < 1.0, due to error variance
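A minimal sketch of the computation, with hypothetical scores:

```python
import numpy as np

# Hypothetical Time 1 and Time 2 scores for the same five people
time1 = np.array([12, 18, 25, 31, 40])
time2 = np.array([14, 17, 27, 30, 38])

# Test-retest reliability is the Pearson correlation between the
# two administrations
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(round(r_test_retest, 2))
```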
Test-Retest Issues
• The “true score” should remain the same; that is what is being correlated across the time points
– The lower the correlation, the less stable the scores and the more error or extraneous variation
• Problems with this approach?
Problem One
• Characteristics or attributes being measured may change between Time 1 and Time 2
• Why might this happen?
– Change in “true” score; almost all psychological traits exhibit some change across a long enough time interval (e.g., reading ability of children)
• Use short time intervals (1 week, 1 month) to estimate reliability
– We want error due to random fluctuations, not long-term changes
Problem Two
• Practice effects (a.k.a. carry-over effects)
• Learning might occur during the first administration
• Remembering content
• Especially an issue if the time between administrations is too short
• More of a problem for performance-type measures
Problem Three
• Reactivity Effects
• The experience of taking the test itself can change a person’s true score
– E.g., on a test of geography, test takers may become curious about the correct answers to the questions and go study them
– E.g., a test of marital satisfaction may involve questions addressing dimensions the person had never thought of before; they may then start paying attention to those dimensions and change accordingly. Thus, the mere act of measurement may change the test-taker.
2. Parallel (Alternative) Forms Reliability
• Using the same test on repeated occasions has certain problems (test-retest)
• Therefore, we can use parallel forms of the test on the two occasions
• Persons take one form at Time 1 and the alternate form at Time 2 (e.g., GRE)
• The correlation between the tests is the reliability of the test
• The mean of the two scores is an estimate of an individual’s true score
• The correlation between parallel forms is also an index of temporal stability
To Be Parallel You Must
• Have the same means and standard deviations
• Items must be of the same difficulty
• Same number of items, expressed in the same form, and covering the same content
• ALL other characteristics must be the same
• Examinees should be indifferent to the forms
Issues With Parallel Forms
• Nice idea, but it is not easy to construct forms that are identical, or even very similar, in all respects
• Just like the test-retest method, this method requires two separate test administrations; thus, it can be quite costly and cumbersome
• Unless the forms are perfectly parallel, this form of reliability violates the assumptions of CTT
3. Split-Half Reliability
• Instead of creating two different forms, why not create one form and split it into two?
• Reliability is the correlation between the two halves
Issues with Split-Half Reliability
• How should we split the test?
– 1st half vs. 2nd half? (not a good idea; think about fatigue effects)
– Even vs. odd items
– Random halves
• Fact: each half does not contain all the items
– This is a problem because there is a direct relationship between test length and reliability
• Using only half the items reduces our estimate of reliability
– It also fundamentally violates the assumptions of CTT
Spearman-Brown Formula
• A way to “correct” for using only half of the items
• A formula that computes what the reliability would be if the test were longer or shorter (see the sketch below)
– So it corrects for the small number of items
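The slide leaves the formula implicit; the standard Spearman-Brown prophecy formula is $r' = \frac{k\,r}{1 + (k - 1)\,r}$, where $k$ is the factor by which the test length changes. A short sketch (the half-test correlation of .70 is hypothetical):

```python
def spearman_brown(r, k):
    """Predicted reliability when a test is lengthened by factor k.

    r: reliability (correlation) of the current test, e.g., a half-test
    k: length multiplier (k=2 corrects a split-half correlation up to
       full-test length; k=0.5 predicts the reliability of a half test)
    """
    return k * r / (1 + (k - 1) * r)

# Correct a half-test correlation of .70 up to full-test length
print(spearman_brown(0.70, 2))  # approx. .82
```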
4. Internal Consistency
• Average item intercorrelation
• Cronbach’s coefficient alpha
• Take the logic of split-half and parallel-forms reliability to the extreme
– Every ITEM is a parallel test of the construct
– Therefore, the average correlation among items is an index of reliability
– Last, the correlation of an item with the total is an index of reliability
Average Item Intercorrelation
• Internal consistency reliability
• ICR estimates differ from the other methods; the focus is on the number of items in the test, the intercorrelations among the items, and their correlations with the test as a whole
• An example: imagine two people taking an internally consistent test of extraversion
Internal Consistency Example
• Person A is very extraverted; Person B is not
– For every item, Person A always responds “true” and Person B always responds “false”
• So, within a sample of different people, the responses to items will be correlated
– People who score high on item 1 will also score high on items 2, 3, …, n
– Internal consistency
A Second Example
• Imagine Person A and Person B take an internally consistent test of intelligence
– Person A is very intelligent; Person B is not so bright
– Person A passes every item; Person B fails nearly every item
– Again, within a sample of different people, the item responses will be correlated
– People who pass item 1 will tend to pass items 2, 3, …, n
Internal Consistency Data
1. Administering a test to a group of individuals
2. Computing correlations among all items and averaging those intercorrelations
3. Computing the correlation of each item’s score with the score of the test without that item (see the sketch below)
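A sketch of steps 2 and 3, assuming the responses are stored as a people × items NumPy array (the data are hypothetical):

```python
import numpy as np

# Hypothetical responses: rows = people, columns = items
scores = np.array([
    [5, 4, 5, 4],
    [2, 1, 2, 2],
    [4, 4, 3, 4],
    [1, 2, 1, 1],
    [3, 3, 4, 3],
], dtype=float)

# Step 2: average of the unique inter-item correlations
R = np.corrcoef(scores, rowvar=False)     # item x item correlation matrix
lower = R[np.tril_indices_from(R, k=-1)]  # below-diagonal entries only
print(lower.mean())                       # average inter-item correlation

# Step 3: correlate each item with the total of the remaining items
for i in range(scores.shape[1]):
    rest = np.delete(scores, i, axis=1).sum(axis=1)
    r_item_total = np.corrcoef(scores[:, i], rest)[0, 1]
    print(f"item {i + 1}: item-total r = {r_item_total:.2f}")
```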
Average Inter-Item Correlations
I1 I2 I3 I4 I5 In
I1 1.00
I2 .89 1.00
I3 .91 .92 1.00
I4 .88 .93 .95 1.00
I5 .84 .86 .92 .85 1.00
In .88 .91 .95 .87 .85 1.00
Average inter-item correlation = .90
Internal Consistency Reliability
• The reliability of a test is based on the number of items in the test (k) and the average intercorrelation among test items.
• Thus, ICR methods are mathematically linked to the split-half method
– Alpha represents the mean reliability coefficient one would obtain from all possible split halves
Cronbach’s Coefficient Alpha
• Most commonly used measure of internal reliability
• Alpha is the average value of all possible split-half reliabilities
$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s^2_i}{s^2_{\text{total score}}}\right)$$

where $k$ is the number of items, $s^2_i$ is the variance of item $i$, and $s^2_{\text{total score}}$ is the variance of the total test scores.
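A direct translation of the formula into Python (a sketch; the score matrix is hypothetical, and the variances use the sample formula, ddof=1, as most statistical packages do):

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a people x items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # s_i^2 for each item
    total_var = scores.sum(axis=1).var(ddof=1)  # s^2 of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: rows = people, columns = items
scores = [[5, 4, 5, 4],
          [2, 1, 2, 2],
          [4, 4, 3, 4],
          [1, 2, 1, 1],
          [3, 3, 4, 3]]
print(cronbach_alpha(scores))
```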
Internal Consistency as a Method
• The principal advantage of ICR methods is practicality
– You don’t need to administer the test multiple times
• Split-half methods may look more computationally challenging
– This is no longer an issue; computers generate reliability estimates in a second
Using Cronbach’s Alpha
• By convention, alpha should be at least .70 to retain an item in an "adequate" scale
• Many researchers require a cut-off of .80 for a good scale
• Most consider alpha above .95 an indication of item redundancy
• However, the specificity or generality of the scale’s focus affects these estimates
Some Examples of Reliability
The Data
• 291 participants
• From a recent scale-development project
• Included 308 potential items over 5 facets
– This facet ended up with 25 items
Observation Item3 Item5 Item9 Item12 Item25
19 5 1 2 1 1
24 5 4 3 4 2
30 4 2 1 3 1
58 4 4 3 2 3
71 5 5 3 3 1
86 5 1 2 5 1
93 4 2 2 1 1
99 5 4 3 2 3
100 5 1 2 3 1
261 5 1 4 4 1
Sample Mean 2.51 2.66 2.14 2.74 1.82
Split-Half Reliability
• Items 1–13 vs. items 14–25: r = .92
• Odd vs. even items: r = .92
• Split by 3’s: r = .97
Average Inter-item Correlations
         Item 3  Item 5  Item 9  Item 12  Item 25
Item 3    1.00
Item 5    0.68    1.00
Item 9    0.58    0.52    1.00
Item 12   0.71    0.67    0.55     1.00
Item 25   0.71    0.68    0.53     0.70     1.00

Mean inter-item correlation = 0.56
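The two summaries are consistent with each other. Applying the generalized Spearman-Brown formula to the mean inter-item correlation (assuming $\bar{r} \approx .56$ holds across all 25 items, not just the five shown) reproduces the alpha reported on the next slide:

$$\alpha \approx \frac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}} = \frac{25 \times .56}{1 + 24 \times .56} = \frac{14.0}{14.44} \approx .97$$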
Cronbach’s Alpha

Item      Alpha   Alpha If Item Deleted   Item-Total Correlation
Item 3    .97     .97                     .80
Item 5    .97     .97                     .74
Item 9    .97     .97                     .75
Item 12   .97     .97                     .81
Item 25   .97     .97                     .71

α = 0.97
SPSS
[SPSS reliability output shown on the original slide]

Reliability Estimates and Error
[figure shown on the original slide]
5. Inter-rater Reliability
• What if you’re not using a test, but instead observing individuals’ behaviors as a psychological assessment tool?
• How can we tell if the judges (assessors) are reliable?
• Typically a set of criteria are established for judging the behavior and the judge is trained on the criteria
• Then to establish the reliability of both the set of criteria and the judge, multiple judges rate the same series of behaviors
• The correlation between the judges is the typical measure of reliability
• Kappa is a measure of inter-rater reliability that corrects for chance agreement (a sketch follows below)
• Values range from −1 (less agreement than expected by chance) to +1 (perfect agreement)
• Common benchmarks:
– Above .75: “excellent”
– .40 to .75: “fair to good”
– Below .40: “poor”
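A minimal sketch of the kappa computation for two raters (the ratings are hypothetical):

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Agreement between two raters, corrected for chance agreement."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(rater1, rater2)

    p_observed = np.mean(rater1 == rater2)  # raw proportion of agreement
    # Chance agreement: product of each rater's marginal proportions,
    # summed over categories
    p_chance = sum(np.mean(rater1 == c) * np.mean(rater2 == c)
                   for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical behavior codes assigned by two trained judges
judge1 = [1, 2, 2, 3, 1, 2, 3, 3, 1, 2]
judge2 = [1, 2, 3, 3, 1, 2, 3, 2, 1, 2]
print(round(cohens_kappa(judge1, judge2), 2))  # approx. .70
```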