1 Basic Psychometric Characteristics: Variability, Reliability, Interpretability October 15, 2009 Anita L. Stewart Institute for Health & Aging University of California, San Francisco

Class 4

Basic Psychometric Characteristics: Variability, Reliability, Interpretability

October 15, 2009

Anita L. Stewart
Institute for Health & Aging

University of California, San Francisco

Overview of Class 4

Concepts of error, sources of error and bias in measures.

Indicators of variability and reasons for poor variability

Indicators of reliability
Interpretability of scores

Components of an Individual’s Observed Item Score

(Simplistic view)

Observed item score = true score + error

Components of an Individual’s Observed Item Score

Observed item score = true score + error
True score: “score that would be obtained over repeated testings” (Nunnally, 1994, p. 211)

Random versus Systematic Error

Observed item score = true score + error (random + systematic)

Random versus Systematic Error

Observed item score = true score + error (random + systematic)
Random error: relevant to reliability
Systematic error: relevant to validity

Components of Variability in Item Scores of a Group of Individuals

Observed score variance = true score variance + error variance
Observed score variance: total variance (sum of all observed item scores)

Components of Variability in Item Scores of a Group of Individuals

Observed score variance = true score variance + (random) error variance
Observed score variance: total variance (sum of all observed item scores)

Combining Items into Multi-Item Scales

When items are combined into a summated scale, random error to some extent “cancels out”
– Error variance is reduced as the number of items increases
– Reducing random error increases the proportion of “true score” variance
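The "cancelling out" of random error can be illustrated with a small simulation. This is only a sketch under stated assumptions: each item's random error is drawn independently from a normal distribution with SD 1, and the scale is scored as the mean of its items (a summed score behaves the same way relative to its true score variance). The function name and parameters are illustrative, not from the slides.

```python
import random
import statistics

def scale_error_variance(n_items, n_people=2000, error_sd=1.0, seed=0):
    """Variance of the random error left in a scale scored as the mean of n_items items."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    scale_errors = []
    for _ in range(n_people):
        # Independent random error on every item for this person (assumption).
        item_errors = [rng.gauss(0.0, error_sd) for _ in range(n_items)]
        scale_errors.append(statistics.fmean(item_errors))
    return statistics.pvariance(scale_errors)

for k in (1, 5, 20):
    # Under these assumptions the error variance should be close to 1/k.
    print(k, round(scale_error_variance(k), 3))
```

With independent errors, the error variance of the averaged score shrinks roughly in proportion to 1 / (number of items), which is why longer scales tend to carry a larger share of true score variance.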

Sources of Error

Subjects
Observers or interviewers
Measure or instrument

Example: Measuring Weight of Children

Observed score is a linear combination of many sources of variation for an individual

Measuring Weight in Pounds (Without Shoes) of One Child

Observed weight = true weight (80 lbs) + amount of water past 30 min + weight of clothes + scale is miscalibrated + person weighing children is not very precise

Measuring Weight in Pounds (Without Shoes) of One Child

Observed weight (82.1 lbs) = true weight (80 lbs) + amount of water past 30 min (+.25 lb) + weight of clothes (+.70 lb) + scale is miscalibrated (+.1 lb) + person weighing children is not very precise (+1 lb)
82.1 ≈ 80 + .25 + .70 + .1 + 1 (82.05 before rounding)

Sources of Error in Measuring Weight of Children

Weight of clothes
– Subject source of random error
Scale is miscalibrated
– Instrument source of systematic error
Person weighing child is not precise
– Observer source of random error

Measuring Depressive Symptoms (past 4 weeks) in an Asian or Latino Man

Observed depression score = “true” depression (16) + hard to choose number on the 1-6 response choice scale + unwillingness to tell interviewer + poor memory of feelings + measure misses 2 culturally-bound symptoms

Measuring Depressive Symptoms (past 4 weeks) in an Asian or Latino Man

Observed depression score (12) = “true” depression (16) + hard to choose number on the 1-6 response choice scale (+1) + unwilling to tell interviewer (-2) + poor memory of feelings (-1) + measure misses 2 culturally-bound symptoms (-2)
12 = 16 + 1 - 2 - 1 - 2

Sources of Error in Measuring Depression

Hard to choose one number on 1-6 response scale
– Subject source of random error
Unwilling to tell interviewer, poor memory of feelings
– Subject sources of systematic error (underreport true depression)
Measure misses culturally-bound symptoms
– Instrument source of systematic error (underestimate true depression)
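The depression example can be tallied by type of error. The sketch below uses the point values and the random/systematic labels given in the example; the data structure and function name are illustrative only.

```python
# (description, source, kind of error, points added to the true score), taken from the example
components = [
    ("Hard to choose number on 1-6 response scale", "subject", "random", +1),
    ("Unwilling to tell interviewer", "subject", "systematic", -2),
    ("Poor memory of feelings", "subject", "systematic", -1),
    ("Measure misses culturally-bound symptoms", "instrument", "systematic", -2),
]
true_score = 16

def total_error(kind):
    """Sum the error points of one kind ('random' or 'systematic')."""
    return sum(points for _, _, k, points in components if k == kind)

random_error = total_error("random")
systematic_error = total_error("systematic")
observed_score = true_score + random_error + systematic_error
print(observed_score, random_error, systematic_error)  # 12 1 -5
```

In this example nearly all of the gap between the true and observed score comes from systematic error, which pushes the score in one direction (underreporting) rather than adding noise.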

Four Types of Memory Errors: From Cognitive Psychology

Encoding
– Information inadequately stored in memory
Storage
– Memory eroded over time
Retrieval
– Some events/feelings harder to recall
Reconstruction
– Errors filling in missing pieces
R Tourangeau, Chap 3, in AA Stone et al. (eds), The Science of Self-Report, London: Lawrence Erlbaum, 2000

Memory and Time
Autobiographical memory – memory of things in time and space
Events not encoded with their calendar dates
– Thus time is a poor retrieval method
Numerous errors remembering “when” and “how often” something occurred within a particular time frame
N Bradburn, Chap 4, The Science of Self-Report

Memory and Emotion

Tend to remember
– positive more than negative experiences
– more emotionally intense than neutral experiences
– non-threatening events more than threatening, sensitive events
Kihlstrom et al, Chap 6, The Science of Self-Report

Overview

Concepts of error
Basic psychometric characteristics

– Variability

– Reliability

– Interpretability

Variability

Good variability
– All (or nearly all) scale levels are represented
– Distribution approximates bell-shaped normal
Variability is a function of the sample
– Need to understand variability of a measure in sample similar to one you are studying
Review criteria
– Adequate variability on the latent variable that is relevant to your study

Indicators of Variability

Range of scores
Mean, median, mode
Standard deviation (or standard error)
Interquartile range
Skewness statistic
% at floor (lowest possible score)
% at ceiling (highest possible score)
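Several of these indicators can be computed in a few lines. The sketch below is illustrative rather than a prescribed procedure: the function name is made up, the 12 scores are the ones used in the quartile example of this class, and the possible range of 1-8 is an assumption made only so that floor and ceiling percentages can be shown.

```python
import statistics

def variability_summary(scores, min_possible, max_possible):
    """Basic variability indicators for one measure in one sample."""
    n = len(scores)
    return {
        "observed_range": (min(scores), max(scores)),
        "possible_range": (min_possible, max_possible),
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
        "sd": statistics.stdev(scores),  # sample standard deviation
        "pct_floor": 100.0 * sum(s == min_possible for s in scores) / n,
        "pct_ceiling": 100.0 * sum(s == max_possible for s in scores) / n,
    }

scores = [2, 3, 8, 1, 7, 4, 4, 3, 2, 7, 5, 3]
# Possible range 1-8 is assumed here for illustration only.
summary = variability_summary(scores, min_possible=1, max_possible=8)
for name, value in summary.items():
    print(name, value)
```

Comparing the observed range with the possible range, and checking the floor and ceiling percentages, shows quickly whether part of the scale is unused in a given sample.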

Range of Scores: Possible and Observed

Especially important for multi-item measures
Example:
– CES-D possible range is 0-30
– Wong et al. study of mothers of young children: observed range was 0-23
» missing entire high end of the distribution (none had high levels of depression)

Mean, Median, Mode

Mean - average
Median - midpoint
Mode - most frequent score
In normally distributed measures, these are all the same
In non-normal distributions, they will vary

Mean and Standard Deviation
Most information on variability is from mean and standard deviation
– Can envision how measure is distributed on the possible range
– Mean ± 1 SD = 68% of the scores (in a normal distribution)
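For a normally distributed measure, the share of scores within a given number of standard deviations of the mean follows from the normal curve. A minimal sketch (the function name is illustrative):

```python
import math

def normal_coverage(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

print(round(normal_coverage(1) * 100, 1))  # 68.3
print(round(normal_coverage(2) * 100, 1))  # 95.4
```

These percentages hold only to the extent that the observed distribution is close to normal; for skewed scales they can be well off.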

Interquartile Range (IR)

Difference between the 3rd and 1st quartiles
IR = Quartile 3 - Quartile 1
This range contains the middle 50% of the distribution
– 25% of the sample is above and 25% is below this range

Quartiles

Divide distribution into 4 parts with 25% of the sample in each part (quartiles)

Quartile 1 - the scale score at the boundary of the lowest 25% of the distribution

Quartile 2 - the score that divides the distribution in half (same as the median)

Quartile 3 - the score at the boundary of the highest 25% (25% of the sample scores above this point)

Set of Scores on 12 people

Person: 1 2 3 4 5 6 7 8 9 10 11 12
Score: 2 3 8 1 7 4 4 3 2 7 5 3
Re-arrange scores in numeric order:
Person: 4 9 1 8 2 12 7 6 11 10 5 3
Score: 1 2 2 3 3 3 4 4 5 7 7 8

Example of Quartiles: Set of Scores on 12 people

1 2 2 3 3 3 4 4 5 7 7 8
Q1 (2.5) = lowest 25% (lowest 3 people)
Q2 (3.5) = median (50% below, 50% above)
Q3 (6) = highest 25% (highest 3 people)

Example of Quartiles: Set of Scores on 12 people

1 2 2 3 3 3 4 4 5 7 7 8
Q1 = 2.5, Q2 = 3.5, Q3 = 6
Interquartile range = quartile 3 - quartile 1 = 6 - 2.5 = 3.5
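The quartile values in this example can be reproduced by taking the midpoint between the two scores on either side of each 25% boundary. This is a sketch of that one convention, with an illustrative function name; it assumes the sample size is a multiple of 4, as in the 12-person example. Statistical packages use other quartile conventions and may give slightly different values.

```python
def quartiles_by_boundary(scores):
    """Q1, Q2, Q3 as midpoints between adjacent sorted scores at the 25/50/75% cuts.

    Assumes len(scores) is a multiple of 4 so that each quarter holds the
    same number of people, as in the 12-person example.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0 or n % 4 != 0:
        raise ValueError("this sketch needs a non-empty sample whose size is a multiple of 4")
    step = n // 4
    return tuple((ordered[i * step - 1] + ordered[i * step]) / 2 for i in (1, 2, 3))

scores = [2, 3, 8, 1, 7, 4, 4, 3, 2, 7, 5, 3]
q1, q2, q3 = quartiles_by_boundary(scores)
print(q1, q2, q3, q3 - q1)  # 2.5 3.5 6.0 3.5
```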

Skewness

Positive skew - scores bunched at low end, long tail to the right
Negative skew - opposite pattern
Skewness coefficient ranges from - infinity to + infinity
– the closer to zero, the more normal
Scores beyond ±2.0 are cause for concern
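The sign of the skewness coefficient can be checked on small made-up data. The sketch below uses the simple moment-based formula (mean cubed z-score with the population SD); statistical packages often apply a small-sample correction, so their values may differ somewhat. The function name and both data sets are illustrative, and the scores must not all be identical (SD of zero).

```python
import statistics

def skewness(scores):
    """Moment-based skewness: mean cubed z-score, using the population SD."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return sum(((s - mean) / sd) ** 3 for s in scores) / len(scores)

bunched_low = [1, 1, 1, 2, 2, 3, 9]    # long tail to the right
bunched_high = [9, 9, 9, 8, 8, 7, 1]   # mirror image: long tail to the left

print(skewness(bunched_low) > 0)   # True: positive skew
print(skewness(bunched_high) < 0)  # True: negative skew
```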

Ceiling and Floor Effects: Similar to Skewness Information

Ceiling effects: substantial number of people get highest possible score
Floor effects: opposite
More helpful for single-item measures or coarse scales with only a few levels

… to what extent did health problems limit you in everyday physical activities (such as walking and climbing stairs)?

[Bar chart: % of respondents (axis 0-50) choosing each response: Not at all, Slightly, Moderately, Quite a bit, Extremely]

49% not limited at all (can’t improve)

SF-36 Variability Information in Patients with Chronic Conditions (N=3,445)

Statistic | Physical function (10 items) | Role-physical (4 items) | Mental health (5 items) | Vitality (energy) (5 items)
Mean (SD) | 80 (27) | 75 (41) | 71 (21) | 54 (22)
Skewness | -.99 | -.26 | -.83 | -.24
% floor | <1 | 24 | <1 | <1
% ceiling | 19 | 37 | 4 | <1

McHorney C et al. Med Care. 1994;32:40-66.

All on 0-100 scales, higher is better

Evidence of Floor and Ceiling Effects in One SF-36 Scale

Statistic | Physical function (10 items) | Role-physical (4 items) | Mental health (5 items) | Vitality (energy) (5 items)
Mean (SD) | 80 (27) | 75 (41) | 71 (21) | 54 (22)
Skewness | -.99 | -.26 | -.83 | -.24
% floor | <1 | 24 | <1 | <1
% ceiling | 19 | 37 | 4 | <1
McHorney C et al. Med Care. 1994;32:40-66.
All on 0-100 scales, higher is better
Highlighted: Role-physical, with 24% at floor and 37% at ceiling

Reasons for Poor Variability

Low variability in construct being measured in that “sample” (true low variation)
Items not adequately tapping construct
– If only one item, especially hard
Items not detecting variation at one end
What to do:
– If developing measures, add items
– If selecting measures, find another one

Advantages of Multi-item Scales Revisited

Using multi-item scales minimizes likelihood of ceiling/floor effects

Even if items are skewed, multi-item scale “normalizes” the skew

Percent with “Best” Score on 5 Items in the MOS MHI-5

6-level response scale - all of the time to none of the time:

Stewart A. et al., Measuring Functioning and Well-Being, 1992

%

Very nervous person (none of the time) 34

Felt calm and peaceful (all of the time) 4

Felt downhearted and blue (none of the time) 33

Happy person (all of the time) 10

So down in the dumps nothing could cheer you up (none of the time) 63

5-item scale: only 5% had highest score

Overview

Concepts of error
Basic psychometric characteristics

– Variability

– Reliability

– Interpretability

Reliability

Extent to which an observed score is free of random error
– Produces the same score each time it is administered (all else being equal)
Population-specific - reliability affected by:
– sample size
– variability in scores (dispersion)
– a person’s level on the scale

Back to Components of Variability in Item Scores of a Group of Individuals

Observed score variance = true score variance + error variance
Observed score variance: total variance (variation is the sum of all observed item scores)

Reliability Depends on True Score Variance

Reliability is a group-level statistic
Reliability:
– Reliability = 1 – (error variance / total variance)
– OR
– Reliability = variance due to true score / total variance (the proportion of variance due to true score)

Reliability Depends on True Score Variance

Reliability of .70 means 30% of variance in observed scores is due to error
Reliability = total variance – error variance (both expressed as proportions of total variance)
.70 = 1.0 – .30
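The two ways of stating reliability follow from the variance decomposition. As a worked equation, using symbols that are not in the slides and are introduced here only for illustration (observed score variance as sigma_X squared, true score variance as sigma_T squared, error variance as sigma_E squared, reliability as r_XX):

```latex
\sigma_X^2 = \sigma_T^2 + \sigma_E^2
\quad\Longrightarrow\quad
r_{XX} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

With total variance set to 1.0 and error variance at .30, this gives the .70 in the example.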

Reliability Coefficient

Typically ranges from .00 - 1.00
Higher scores indicate better reliability

Importance of Reliability

Necessary for validity (but not sufficient)
– Low reliability (or high measurement error) attenuates correlations with other variables
– May conclude that two variables are not related when they are
Greater reliability = greater power
– The more reliable your scales, the smaller sample size you need to detect an association
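The size of the attenuation can be sketched with the classical attenuation formula, in which the expected observed correlation equals the correlation between true scores multiplied by the square root of the product of the two reliabilities. The formula is not given in the slides; the function name and example values are illustrative.

```python
import math

def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Expected observed correlation when both measures contain random error."""
    return true_r * math.sqrt(reliability_x * reliability_y)

# A true correlation of .50 measured with two scales of reliability .70 each:
print(round(attenuated_correlation(0.50, 0.70, 0.70), 2))  # 0.35
```

A weaker observed correlation is harder to detect, which is why less reliable scales call for larger samples.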

Reliable Scale?

NO!
There is no such thing as a “reliable” scale
We accumulate “evidence” of reliability in a variety of populations in which it has been tested

How Do You Know if a Scale or Measure Has Adequate Reliability?

Adequacy of reliability judged according to standard criteria

– Criteria depend on type of coefficient

Types of Reliability Tests

Internal-consistency
Test-retest
Inter-rater
Intra-rater

Internal Consistency Reliability: Cronbach’s Alpha

Requires multiple items supposedly measuring same construct to calculate

Extent to which all items measure the same construct (same latent variable)

Internal-Consistency Reliability

For multi-item scales
Cronbach’s alpha
– for scales using ordinal items (e.g., 1-5)
Kuder Richardson 20 (KR-20)
– for scales using dichotomous items
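Cronbach's alpha can be computed directly from item-level data with the standard formula: k / (k - 1) times (1 minus the sum of the item variances divided by the variance of the total score). The sketch below assumes complete data (no skipped items); the function name and the five sets of responses are made up for illustration.

```python
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one row per respondent, one column per item, no missing values."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across respondents (columns of the data).
    item_variances = [statistics.variance(col) for col in zip(*item_scores)]
    # Variance of the summed scale score across respondents.
    total_variance = statistics.variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses: 5 people x 3 items on a 1-5 scale.
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 3, 3],
    [4, 5, 4],
    [5, 4, 5],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Applied to items scored 0/1, the same formula corresponds to KR-20, up to the choice of sample versus population variance.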

Minimum Standards for Internal Consistency Reliability

For group comparisons (e.g., regression, correlational analyses)
– .70 or above is minimum (Nunnally, 1978)
– .80 is optimal
– above .90 is unnecessary
For individual assessment (e.g., treatment decisions)
– .90 or above (.95) is preferred (Nunnally, 1978)

Internal-Consistency Reliability Can be Spurious

Based on only those who answered all questions in the measure
– If a lot of people are having trouble with the items and skip some, they are not included in test of reliability
Important to compare sample size in reliability calculation to total sample

Internal-Consistency Reliability is a Function of Number of Items in Scale

Increases with the number of items
Very large scales (20 or more items) can have high reliability without other good psychometric properties
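The dependence on the number of items can be seen from the standardized form of alpha (equivalently, the Spearman-Brown relationship), which needs only the number of items and the average inter-item correlation. The formula is not stated in the slides; the function name and the average correlation of .25 are illustrative assumptions.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized alpha from the number of items and the mean inter-item correlation."""
    return (n_items * mean_inter_item_r) / (1 + (n_items - 1) * mean_inter_item_r)

# Same modest inter-item correlation, increasing scale length.
for k in (2, 5, 10, 20):
    print(k, round(standardized_alpha(k, 0.25), 2))
```

With items that correlate only .25 on average, a 20-item scale still reaches an alpha above .85, which is why a high coefficient on a long scale says little about how well the individual items hang together.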

Example: 20 item Beck Depression Inventory (BDI)

BDI 1978 version (asks about past week)
– Internal consistency reliability = .86

– BUT: 3 items correlated < .30 with other items in the scale

Beck AT et al. J Clin Psychol. 1984;40:1365-1367

Reliability Varies by Level on Measure

Reliability can be poorer for those scoring at one end of the scale

Example: Number of visits to doctor in past 12 months
– More reliable for those with fewer visits

Test-Retest Reliability

Repeat assessment on individuals not expected to change

Time between assessments should be:
– Short enough so no change occurs
– Long enough so subjects don’t recall first response
Only reliability test for single item measures
Coefficient: correlation between 2 measurements

Appropriate Test-Retest Coefficients by Type of Scale

Continuous scales (ratio or interval scales, multi-item Likert scales):
– Pearson
Ordinal or non-normally distributed scales:
– Spearman or Kendall’s tau
Dichotomous (categorical) measures:
– Phi or Kappa

Minimum Standards for Test-Retest Reliability

Magnitude of a test-retest correlation is important, not significance

Criterion: similar to that for internal consistency

– >.70 is desirable

– >.80 is optimal

Observer or Rater Reliability

Inter-rater reliability (across two or more raters)
– Consistency (correlation) between two or more observers of the same subjects (one point in time)
Intra-rater reliability (within one rater)
– Consistency within one observer
– Correlation among repeated values obtained by the same observer (over time)

Observer or Rater Reliability

Sometimes Pearson correlations are used: scores on a group of individuals obtained by one observer correlated with scores obtained by another observer
– Assesses association only
.65 to .95 are typical correlations
>.85 is considered acceptable

McDowell I et al. Measuring Health, 2006, p. 45.

Association vs. Agreement When Correlating Scores from Two Times or Ratings

Association: degree to which scores of one rater linearly predict scores of 2nd rater

Agreement: extent to which same score obtained on 2nd measurement (retest, 2nd rater)

Can have high correlation and poor agreement
– If second score is consistently higher for all subjects, can obtain high correlation
– Need second test of mean differences

Hypothetical Scores on 4 Subjects by 2 Observers

[Chart: scores (axis 1-7) given by each of the 2 observers to subjects S1-S4]

Example of Association and Agreement

Scores by observer 1 are exactly 2 points above scores by observer 2
– Correlation (association) would be perfect (r=1.0)
– Agreement is poor (no agreement on score in all cases - a difference of 2 between scores on each subject)
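The contrast between association and agreement can be checked numerically. In the sketch below the four scores for observer 2 are made up (the plotted values are not recoverable from the transcript); observer 1 is set to exactly 2 points higher, as in the example. The function name is illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

observer_2 = [1, 2, 4, 5]                 # hypothetical scores for 4 subjects
observer_1 = [s + 2 for s in observer_2]  # always exactly 2 points higher

r = pearson_r(observer_1, observer_2)
mean_difference = sum(a - b for a, b in zip(observer_1, observer_2)) / len(observer_1)
exact_agreement = sum(a == b for a, b in zip(observer_1, observer_2)) / len(observer_1)
print(r, mean_difference, exact_agreement)  # 1.0 2.0 0.0
```

The correlation is perfect while the two observers never give the same score, which is why a test of mean differences is needed alongside the correlation.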

Intraclass Correlation Coefficient (Kappa) for Testing Inter-rater Reliability

Coefficient indicates level of agreement of two or more judges, exceeding that which would be expected by chance

Appropriate for dichotomous (categorical) scales and ordinal scales

Several forms of kappa:
– e.g., Cohen’s kappa: 2 judges, dichotomous scale

Sensitive to number of observations, distribution of data
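Cohen's kappa for two judges can be sketched as observed agreement corrected for the agreement expected by chance from each judge's marginal proportions. The function name and the ten pairs of ratings are made up for illustration; the sketch assumes the judges do not agree purely by construction (expected agreement below 1).

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two judges rating the same subjects on a categorical scale."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: product of the two judges' marginal proportions, summed over categories.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical dichotomous ratings (1 = present, 0 = absent) of 10 subjects.
judge_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
judge_2 = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(judge_1, judge_2)
print(round(kappa, 2))  # 0.6
```

Here the judges agree on 80% of subjects, but half of that agreement would be expected by chance, so kappa is .60 rather than .80.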

Interpreting Magnitude of Kappa: Level of Reliability

Kappa | Level of reliability
<0.00 | Poor
.00 - .20 | Slight
.21 - .40 | Fair
.41 - .60 | Moderate
.61 - .80 | Substantial
.81 - 1.00 | Almost perfect

.60 or higher is acceptable (Landis, 1977)
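The interpretation bands can be written as a simple lookup. In this sketch the function name is illustrative, and treating each band's upper limit as inclusive (so that values between, say, .20 and .21 fall in the lower band) is an assumption, since the bands are stated to two decimals only.

```python
def kappa_label(kappa):
    """Verbal label for a kappa value, following the bands listed above."""
    if kappa < 0.0:
        return "Poor"
    bands = (
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost perfect"),
    )
    for upper_limit, label in bands:
        if kappa <= upper_limit:  # upper limits treated as inclusive (assumption)
            return label
    raise ValueError("kappa cannot exceed 1.0")

print(kappa_label(0.45))  # Moderate
print(kappa_label(0.72))  # Substantial
```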

Reliability Often Poorer in Lower SES or Low Literacy Groups

More random error due to
– Reading problems
– Difficulty understanding complex questions
– Unfamiliarity with questionnaires and surveys

Advantages of Multi-item Scales Revisited

Using multi-item scales improves reliability

Random error is “canceled out” across multiple items

What Makes a Measure Reliable?

Preventing measurement error easier than assessing its effects

Measure
– Clear items, appropriate response choices, etc.
Format
– Make instrument easily understood
Method of administration
– Train raters to do their job
– Adhere to standard administration procedures

Overview

Concepts of error
Basic psychometric characteristics

– Variability

– Reliability

– Interpretability

Interpretability: What does a Score Mean?

What are the endpoints?
What does a high score mean? (direction of scoring)
Compared to norms - is score low or high?

Single items, more easily interpretable

Multi-item scales, no inherent meaning to scores

Endpoints

What is minimum and maximum possible?
– Enable interpretation of mean score
When scores are added, endpoints depend on number of items & number of response choices
– 5 items, 4 response choices = 5 to 20
– 3 items, 5 response choices = 3 to 15
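The endpoints of a summed score follow from the number of items, the number of response choices, and the code given to the lowest choice. In this sketch the function name is illustrative and the default of coding the lowest choice as 1 is an assumption that matches the two examples above.

```python
def summed_score_range(n_items, n_choices, lowest_code=1):
    """Minimum and maximum possible summed score for a multi-item scale."""
    highest_code = lowest_code + n_choices - 1
    return n_items * lowest_code, n_items * highest_code

print(summed_score_range(5, 4))  # (5, 20)
print(summed_score_range(3, 5))  # (3, 15)
# If the lowest choice is coded 0 instead of 1, both endpoints shift:
print(summed_score_range(5, 4, lowest_code=0))  # (0, 15)
```

Because the endpoints shift with the coding of the response choices, a mean score cannot be interpreted without knowing both the number of items and how the choices were coded.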

Compare Results to Norms

Comparing your means to published norms helps interpret the mean of your sample

SF-36 has numerous norms, e.g.
– General population
» By age group, gender, and chronic disease

SF-36 in MOS Patients versus Population Norms

Group, mean (SD) | Physical function | Role-physical | Mental health | Vitality (energy)
MOS patients | 80 (27) | 75 (41) | 71 (21) | 54 (22)
Norms: general population | 84 (23) | 81 (34) | 75 (18) | 61 (21)
Norms: age 75+ | 53 (30) | 45 (42) | 74 (20) | 50 (24)

JE Ware et al, SF-36 Health Survey Manual and Interpretation Guide, The Health Institute, 1993.

Direction of Scoring

What does a high score mean?
Where in the range does the mean score lie?
– Toward top, bottom?
– In the middle?

Descriptive Statistics for ~3,000 Women

            M (SD)        Min     Max
Age         46.2 (2.7)    42.0    52.9
Activity     7.7 (1.8)     3.0    14.0
Stress       8.6 (2.9)     4.0    19.0

Med Care, 2003;41:1262-1276


Descriptive Statistics for ~3,000 Women

            M (SD)        Min     Max
Age         46.2 (2.7)    42.0    52.9
Activity     7.7 (1.8)     3.0    14.0
Stress       8.6 (2.9)     4.0    19.0

Med Care, 2003;41:1262

Activity: no measure mentioned
Stress: Perceived Stress Scale (Cohen, 1983)


Perceived Stress Scale (Cohen 1983): Hard to Find

Available in JSTOR
– Can print one page at a time

Searched article “on line”
– Could not find scoring information other than reverse 7 of the 14 items and sum them
   » Possible score range of 0-56
– Could not find response choices
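The scoring rule described above (reverse some items, then sum) can be sketched as follows. A 0-56 range for 14 items is consistent with each item being coded 0 to 4, so the sketch assumes that coding. The response choices themselves were not found, and the slide does not say which 7 items are reversed, so the caller has to supply the reversed positions. The function name and the example positions are hypothetical.

```python
def score_reversed_sum(responses, reversed_positions, low=0, high=4):
    """Sum item responses after reverse-scoring the listed positions.

    responses: item codes in questionnaire order.
    reversed_positions: 0-based positions of the items to reverse.
    low, high: assumed lowest and highest item codes (0-4 is an assumption).
    """
    total = 0
    for position, value in enumerate(responses):
        if not low <= value <= high:
            raise ValueError(f"item {position} out of range: {value}")
        if position in reversed_positions:
            value = low + high - value  # e.g. 0 <-> 4, 1 <-> 3
        total += value
    return total


# Hypothetical choice of 7 reversed items, for illustration only.
example_reversed = {3, 4, 5, 6, 8, 9, 12}
print(score_reversed_sum([0] * 14, example_reversed))  # 28
```

Under these assumptions, a respondent who answers 0 to every item scores 28, not 0, because the 7 reversed items each contribute 4. Without the response choices and the list of reversed items, a reported mean cannot be placed within the 0-56 range with confidence.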


Another Example: Mean Scores in a Sample of Older Adults

                        Mean
Physical functioning    45.0
Sleep problems          28.1
Disability              35.7


Making it Easier to Interpret

                        Mean*
Physical functioning    45.0
Sleep problems          28.1
Disability              35.7

* All scores 0-100


Making it Easier to Interpret

                            Mean*
Physical functioning (+)    45.0
Sleep problems (-)          28.1
Disability (-)              35.7

* All scores 0-100
(+) indicates higher score is better health
(-) indicates lower score is better health
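An alternative to carrying (+) and (-) labels is to reflect the negatively scored scales so that a higher score always means better health. This is a minimal sketch, assuming 0-100 scores. The function name `reflect` is illustrative, and the reflected values are computed here rather than reported in the table.

```python
def reflect(score, minimum=0.0, maximum=100.0):
    """Flip a score within its possible range so high and low swap."""
    return minimum + maximum - score


means = {
    "Physical functioning": (45.0, "+"),
    "Sleep problems": (28.1, "-"),
    "Disability": (35.7, "-"),
}

for name, (mean, direction) in means.items():
    # Leave (+) scales alone; reflect (-) scales so higher is better.
    higher_is_better = mean if direction == "+" else reflect(mean)
    print(f"{name}: {higher_is_better:.1f}")
```

Because reflection is linear, reflecting a mean gives the same result as reflecting each individual score and then averaging. A reflected scale should also be relabeled (for example, a reflected "Sleep problems" score no longer measures problems), or the label itself becomes a source of confusion.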


Confusion Introduced by Labels:

SF-36 Bodily Pain scale
– Higher score is no pain or limitations due to pain
– Rationale: so 8 subscales scored in same direction

Social Adjustment Scale (Weissman)

Functional Status Index (Jette)


Mean Has to be Interpreted Within Possible Range

Parents’ harsh discipline practices*      M       SD
  Interviewers’ ratings of mother        2.55     .74
  Husbands’ reports of wife              5.32    3.30

*Note: high score indicates more harsh practices


Mean Has to be Interpreted Within Possible Range (Add Range)

Parents’ harsh discipline practices*           M       SD
  Interviewers’ ratings of mother (1-5)       2.55     .74
  Husbands’ reports of wife (1-7)             5.32    3.30

*Note: high score indicates more harsh practices


Mean Has to be Interpreted Within Possible Range

Parents’ harsh discipline practices*           M       SD
  Interviewers’ ratings of mother (1-5)       2.55     .74
  Husbands’ reports of wife (1-7)             5.32    3.30

*Note: high score indicates more harsh practices

Interviewer: 1 2 3 4 5 (mean 2.55 marked on the scale)

Husband: 1 2 3 4 5 6 7 (mean 5.32 marked on the scale)


Transforming a Summated Scale to a 0-100 Scale

Works with any ordinal or summated scale

Transforms it so 0 is the lowest possible and 100 is the highest possible

Eases interpretation across numerous scales

Transformed score = 100 x (observed score - minimum possible score) / (maximum possible score - minimum possible score)
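The transformation can be sketched in Python as below. The function name `to_0_100` is illustrative. The examples reuse the harsh-discipline means from the earlier slide (2.55 on a 1-5 scale, 5.32 on a 1-7 scale) and a hypothetical summed score of 12 on a 5-item, 4-choice scale (possible range 5 to 20).

```python
def to_0_100(observed, minimum, maximum):
    """Linearly rescale a score so minimum -> 0 and maximum -> 100."""
    if maximum <= minimum:
        raise ValueError("maximum must exceed minimum")
    if not minimum <= observed <= maximum:
        raise ValueError("observed score outside possible range")
    return 100 * (observed - minimum) / (maximum - minimum)


print(to_0_100(2.55, 1, 5))           # about 38.75
print(round(to_0_100(5.32, 1, 7), 1)) # 72.0
print(round(to_0_100(12, 5, 20), 1))  # 46.7
```

On the common 0-100 metric the interviewer rating sits below the midpoint and the husband report well above it, a contrast that is hard to see from the raw means of 2.55 and 5.32 alone.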


Homework

Complete rows 13-19 on matrix for both measures
– Interpretability, nature of samples on which it has been tested, variability and central tendency, reliability