
PSY 513 – Lecture 1
Reliability

Characteristics of a psychological test or measuring procedure

Answers to the questions: How do I know if I have a good test or not? What makes a good test? Is this a good test?

There are (only) two primary measures of test quality.

1. Reliability – The extent to which a test or instrument or measuring procedure yields the same score for the same person from one administration to the next.

2. Validity – the extent to which scores on a test correlate with some valued criterion. The criterion can be other measures of the same construct, other measures of different constructs, performance on a job or task.

So, we ask about any test: Is it reliable? Is it valid? These are the two main questions.

Other, less important characteristics, considered only for reliable and valid tests.

3. Reading level

4. Face validity – Extent to which the test appears to measure what it is supposed to measure.

5. Content validity – Extent to which test content corresponds to content of what it is designed to measure or predict.

6. Cost

7. Length – time required to administer the test.

So what is a good test?

A good psychological test is reliable, valid, has a reading level appropriate to the intended population, has acceptable face and content validity, is cheap and doesn’t take too long.

This will likely be a test question.


Scoring psychological tests
Most tests have multiple items. The test score is usually the sum or average of responses to the multiple items.

If the test is one of knowledge, the score is typically the sum of the number of correct responses. But newer methods based on Item Response Theory use a different type of score.

If the test is a measure of a personality characteristic, the score is often the sum or mean of numerically coded responses, e.g., 1s, 2s, . . ., 5s.

But I’ll argue later that methods based on Item Response Theory or Factor Analysis may be better.

Sometimes subtest scores are computed and the overall score will be the sum of scores on subtests.

Occasionally, the overall score will be the result of performance on some task, such as holding a stylus on a revolving disk, as in the Pursuit Rotor task or moving pegs from holes in one board to holes in another, as in the Pegboard dexterity task. But most psychological tests are “paper and pencil” or the computer equivalent of paper and pencil.

Invariably the result of the “measurement” of a characteristic using a psychological test is a number – the person’s score on that test, just as the result of measurement of weight is a number – the score on the face of the bathroom scale.

Give Thompson Big Five Minimarkers for Inclass Administration (MDBR\Scales) here

Dimensions are Extraversion, Openness to Experience, Stability, Conscientiousness, and Agreeableness.

Below are summary statistics from a group of 206 UTC students, mostly undergraduates.

Plot your mean responses on the following graph to get an idea of your Big Five profile.


Reliability
Working Definition: The extent to which a test or measuring procedure yields the same score for the same person from one administration to the next in instances when the person’s true amount of whatever is being measured has not changed from one time to the next.

Consider the following hypothetical measurements of IQ

Person   Highly Reliable Test            Test with Low Reliability
         IQ at Time 1   IQ at Time 2     IQ at Time 1   IQ at Time 2
1        112            111              112            105
2        140            141              140            128
3        85             86               85             92
4        106            108              106            100
5        108            107              108            116
6        95             93               95             105
7        117            118              117            110
8        120            121              120            126
9        135            134              135            130

High reliability: Persons' scores will be about the same from measurement to measurement.

Low reliability: Persons' scores will be different from measurement to measurement.

Note that there is no claim that these IQ scores are the “correct” values for Persons 1-9. That is, this is not about whether or not they are valid or accurate measures. It’s just about whether whatever measures we have are the same from one time to the next.

Lay people often use the word, “reliability”, to mean validity. Don’t be one of them.

Reliability means simply whether or not the scores of the same people stay the same from one measurement to the next, regardless of whether those scores represent the true amount of whatever the test is supposed to measure.

Why do we care about reliability?

Think about your bathroom scale and the number it gives you from day to day.

What would you prefer – a number that varied considerably from day to day, or a number that, assuming you haven’t changed, was about the same from day to day?

Obviously, we’re mostly interested in the validity of our tests. But we first have to consider reliability.

The need for high reliability is a technical issue – we have to have an instrument that gives the same result every time we use it before we can consider whether or not the result is valid.


Classical Test Theory: A way of thinking about test scores.

Basic Assumption: Each observed score is the sum of a True Score and an Error of Measurement.

True scores are assumed to be unchanged from one time to the next.
Errors of measurement are assumed to vary randomly and independently from one time to the next.

Observed score. The score of a person on the measuring instrument.

True score. The actual amount of the characteristic possessed by an individual. It is assumed to be unchanged from measurement to measurement (within reason).

Error of measurement. An addition to or subtraction from the true score which is random and unique to the person and time of measurement.

In Classical Test Theory, the observed score is the sum of true score and the error of measurement.

Symbolically: Observed Score at time j = True Score + Error of Measurement at time j.

Xj = T + Ej where j represents the measurement time.

Note that T is not subscripted because it is assumed to be constant across times of measurement.

It is assumed that if there were no error of measurement the observed score would equal the true score. But, typically error of measurement causes the observed score to be different from the true score.

This means that everyone who measures anything hates error of measurement. At last, something we can agree on.

So, for a person,
Observed Score at time 1 = True Score + Measurement Error at time 1.
Observed Score at time 2 = True Score + Measurement Error at time 2.

Note again that the true score is assumed to remain constant across measurements.

Implications for Reliability

Notice that if the measurement error at each time is small, then the observed scores will be close to each other and the test will be reliable – we’ll get essentially the same number each time we measure.

So, reliability is related to the sizes of measurement errors – smaller measurement errors mean high reliability.

This means that unreliability is the fault of errors of measurement. If it weren’t for errors of measurement, all psychological tests would be perfectly reliable – scores would not change from one time to the next.
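To see how this plays out, here is a small simulation sketch (Python, not from the lecture; the true-score and error standard deviations are arbitrary) showing that small errors of measurement yield nearly identical observed scores across two administrations, while large errors do not:

import numpy as np

rng = np.random.default_rng(0)
n = 1000
true = rng.normal(100, 15, n)          # constant true scores (IQ-like scale)

def administer(true_scores, error_sd):
    # Observed score = True score + random error of measurement (X = T + E)
    return true_scores + rng.normal(0, error_sd, len(true_scores))

for error_sd in (2, 15):               # small vs. large errors of measurement
    x1 = administer(true, error_sd)    # Time 1
    x2 = administer(true, error_sd)    # Time 2
    r = np.corrcoef(x1, x2)[0, 1]
    print(f"error SD = {error_sd:2d}: correlation between administrations = {r:.2f}")

With small errors the two sets of scores are nearly identical; with large errors they are not.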


Two Ways of Conceptualizing reliability

Two possibilities, both requiring measurement at two points in time.

1. Conceptualizing reliability as differences between scores from one time to another.
This is the conceptualization that follows directly from the Classical Test Theory notions above.

Consider just the absolute differences between measures.

Person   Highly Reliable Test                         Test with Low Reliability
         IQ at Time 1   IQ at Time 2   Difference     IQ at Time 1   IQ at Time 2   Difference
1        112            111            1              112            109            3
2        140            140            0              140            128            12
3        85             86             -1             85             92             -7
4        106            108            -2             106            100            6
5        108            107            1              108            116            -8
6        95             93             2              95             105            -10
7        117            118            -1             117            110            7
8        120            120            0              120            123            -3
9        135            135            0              135            130            5

The distributions of differences

A measure of variability of the differences could be used as a summary of reliability. One such measure is the standard deviation of the difference scores obtained from two applications of the same test. The smaller the standard deviation, the more reliable the test.

Advantages
1) This conceptualization naturally stems from the Classical Test Theory framework – it is based directly on the variability of the Es in the Xi = T + Ei formulation. Small Eis mean less variability.
2) So, it’s easy to understand, kind of.

Problems:
1) It's a golf score – smaller is better. Some nongolfers have trouble with such measures.
2) The standard deviation of difference scores depends on the response scale. Tests with a 1-7 scale will have larger standard deviations than tests that use a 1-5 scale, even though the test items might be identical.
3) It requires that the test be given twice, with no memory of the first test when participants take the 2nd test, a situation that’s hard to create.

It is useful, however, for assessing how much one could expect a person’s score to vary from one time to another.

For example: Suppose you miss the cutoff for a program by 10 points. If the standard deviation of differences is 40, then you have a good chance of exceeding the cutoff next time you take the test. If the standard deviation of differences is 2, then your chances of exceeding the cutoff by taking the test again are much smaller.
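As a rough illustration (Python; the paired scores are the low-reliability columns from the table above), the difference-score summary could be computed like this:

import numpy as np

# Paired IQ scores from the "Test with Low Reliability" columns above
time1 = np.array([112, 140, 85, 106, 108, 95, 117, 120, 135])
time2 = np.array([109, 128, 92, 100, 116, 105, 110, 123, 130])

diffs = time1 - time2                            # difference scores
print("Differences:", diffs)
print("SD of differences:", diffs.std(ddof=1))   # smaller SD = more reliable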


2. Conceptualizing reliability as the correlation between measurements at two time periods.

This conceptualization is based on the fact that if the differences between scores on two successive measurements are small, then the correlation between those two sets of scores will be large and positive.

[Scatterplots of paired scores: HIGHREL2 plotted against HIGHREL1 for the highly reliable test, and LOWREL2 plotted against LOWREL1 for the test with low reliability.]

If the measurements are identical from time 1 to time 2 indicating perfect reliability, r = 1.

If there is no correspondence between measures at the two time periods, indicating the worst possible reliability, r = 0.

Advantages of using the correlation between two administrations as a measure of reliability -

1) It’s a bowling score – bigger r means higher reliability.
2) It is relatively independent of response scale – items responded to on a 1-5 scale are about as reliable as the same items responded to on a 1-7 scale.
3) The correlation is a standardized measure ranging from 0 to 1, so it’s easy to conceptualize reliability in an absolute sense – close to 1 is good; close to 0 is bad.

Disadvantages
1) Nonobvious relationship to Classical Test Theory requires some thought.
2) Assessment as described above requires two administrations.

Conclusion

Most common measures of reliability are based on the conception of reliability as the correlation between successive measures.


Correlation between the two administrations of a highly reliable test.

If the scores on two administrations are nearly the same, then the correlation between the paired scores will be positive and large.

Correlation between the two administrations of a test with low reliability.

If the scores on two administrations are not nearly the same, then the correlation between the paired scores will be close to zero.


Definition of reliability

The reliability of a test is the correlation between the population of values of the test at time 1 and the population of values at time 2 assuming constant true scores and no relationship between errors of measurement on the two occasions.

Symbolized as Population rXX' or simply as rXX'. This is pronounced “r sub X, X-prime”.

As is the case with any population quantity such as the population mean or population variance, the definition of reliability refers to a situation that most likely is not realizable in practice.

1) If the population is large, vague, or infinite as most are, then it will be impossible to access all the members of the population.

2) The assumption of no carry-over from Time 1 to Time 2 is very difficult to realize in practice, since people remember how they performed or responded on tests. For this reason, it is often (though not always) not feasible in practice to test people twice to measure reliability.

The bottom line is that the true reliability of a test is a quantity that we’ll never actually know, just as we’ll never know the true value of a population mean or a population variance. What we will know is the value of one or more estimates of reliability.

You’ll hear people speak about “the reliability of the test”. You should remember that they should say, “the estimate of the reliability of the test”.

I’ll use the phrase “true reliability” or “population reliability” to refer to the population value. I’ll try to remember to use “estimate of reliability” when referring to one of the estimates.

Some facts about reliability if Classical Test Theory is true, in case you’re interested . . .

1. Variance of Observed scores = Variance of True scores + Variance of Errors of Measurement

σ²X = σ²T + σ²E

2. True reliability = Variance of True scores / Variance of Observed scores.

rXX' = σ²T / σ²X

Neither of these is of particular use in practice, though. They’re presented here for completeness.
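If you are curious, a quick simulation sketch (Python; the variances are arbitrary choices) illustrates both facts:

import numpy as np

rng = np.random.default_rng(1)
n = 100_000
T = rng.normal(0, 10, n)          # true scores, variance 100
E = rng.normal(0, 5, n)           # errors of measurement, variance 25
X = T + E                         # observed scores

print(X.var(), T.var() + E.var()) # fact 1: var(X) is about var(T) + var(E)
print(T.var() / X.var())          # fact 2: true reliability is about 100/125 = .80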


Estimates of Reliability

As said above, we never know the true reliability of a test. So we have to get by with estimates of reliability.

Test-retest estimate – the most acceptable estimate if you can meet the assumptions

Operational Definition

1. Give the test to a normative group.
2. Minimize memory/carryover from the first administration.
3. Ensure that there are no changes in true values of what is being measured.
4. Give the test again to the same people.
5. Compute the correlation between scores on the two administrations.

Most straightforward – fits nicely with the conceptual definition of true reliability
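In code, step 5 is just a Pearson correlation between the paired scores. A minimal sketch in Python, using the Highly Reliable Test columns from the IQ table above:

import numpy as np

# scores from the two administrations, paired by person
time1 = np.array([112, 140, 85, 106, 108, 95, 117, 120, 135])
time2 = np.array([111, 141, 86, 108, 107, 93, 118, 121, 134])

test_retest_estimate = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability estimate: {test_retest_estimate:.3f}")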

Disadvantages

Requires two administrations of the test – more time.

May be inflated by memory/carryover from the first administration to the second.
May be deflated by changes in True scores from the first to the second administration.

Advantages

Has good “face” validity.

Essentially always acceptable if the assumptions are met – regardless of the nature of the test.

For performance tests, the test-retest method may be the only feasible method.

For single-item scores, may be the only feasible method.

Bottom Line

You should always compute and report the test-retest reliability estimate if you can. If you can meet the assumptions necessary, it is the most generally accepted estimate of reliability in my view.

Practicing what I preach. Excerpt from a recent paper . . .


Parallel Forms estimate

Solving the “memory for previous responses to the same items” problem.

Operational Definition

1. Develop two equivalent forms of the test. They should have the same mean and variance.
2. Give both forms to the normative group.
3. Compute the correlation between paired scores. That correlation is the reliability estimate of each form.

Note that this definition has introduced a new notion – the notion that an equivalent form can “stand in” for the original test when computing the correlation that is the estimate of reliability.

If we give the same test twice to compute a test-retest estimate, we can be reasonably sure it’s the same test on the second administration as it was on the first.

But giving an equivalent form of the test requires a leap of faith – that the 2nd form is interchangeable with the original form on that second administration.

The key to the success of the parallel forms method is that the two forms be equivalent. Equal means and variances are necessary for that equivalence.

Advantages

Don’t have to worry about memory/carryover between two administrations.

Having two forms that can be used interchangeably may be useful in practice – a bonus.

One reliability estimate, if high enough, can be applied to TWO tests.

Disadvantages

Takes more time to develop two forms than it does one.

It may not be possible to develop alternative, equivalent forms.

A low reliability estimate, i.e., low r between forms, has two interpretations

1. It could be due to low reliability of one or both of the forms.

2. It could be that the forms are not equivalent.

The idea represented here – that the correlation between equivalent measures of the same thing can be used to assess the reliability of each is a profound one, one that has had important implications for the estimation of reliability, as we’ll soon see.

Bottom Line

If you can develop two alternative and equivalent forms of the test, then by all means use them and report the correlation of the two as the reliability estimate of each.


Split-half estimate

“Halving your test and using it two.”
The lazy person’s answer to parallel forms.

Operational Definition

1. Identify two equivalent halves of the test with equal means and variances.
2. Give the test once.
3. Score the halves separately, so that you have two scores for each person – score on 1st half and on 2nd half.
4. Compute the correlation between the 1st Half and 2nd Half scores. Call that correlation rH1,H2.

That correlation is the parallel forms reliability estimate of each half. But we want the reliability of the sum of the two halves, not each half separately. What to do????
(Dial a statistician. Ring Ring. “Hello, Spearman here. Hmm, good question. Let me ask Brown about it, and I’ll get back with you.”)

5. Plug the correlation into the following Spearman-Brown Prophecy Formula

                                                   2 * rH1,H2
Split-half reliability estimate of whole test = ----------------
                                                   1 + rH1,H2

Trust me: the higher the correlation between the two halves, the larger the estimated reliability.
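As a sketch of the whole procedure (Python; splitting into odd- and even-numbered items is just one possible halving, and a persons-by-items data matrix is assumed):

import numpy as np

def split_half_reliability(items):
    # items: 2-D array, rows = persons, columns = item responses
    half1 = items[:, 0::2].sum(axis=1)        # score on odd-numbered items
    half2 = items[:, 1::2].sum(axis=1)        # score on even-numbered items
    r_h1h2 = np.corrcoef(half1, half2)[0, 1]  # correlation between the two halves
    return (2 * r_h1h2) / (1 + r_h1h2)        # Spearman-Brown step-up for the whole test

A larger correlation between the halves always yields a larger stepped-up estimate, as the formula implies.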

This is what is called an Internal Consistency Estimate – assuming that if the two halves are highly correlated, the whole test would correlate highly with itself, if given twice.

The split-half method is the simplest example of what are called internal consistency estimates of reliability.

They’re called internal consistency estimates because they rely on the consistency (correlation) of the two halves, both of which are internal to the test.

The greater the consistency – correlation - of the two halves, the higher the reliability.

Advantages

1. It allows you to estimate reliability in a single setting – a major contribution to reliability estimation.
2. Very computerizable. The program that scores the whole test can be programmed to score the two halves and compute a reliability estimate at the same time.

Disadvantages

1. Test may not be splittable – single item tests or performance tests.
2. Requires equivalent halves. This may be hard to achieve.
3. A low reliability estimate may be the result of either 1) low reliability of one or both halves or 2) nonequivalence of the halves.

4. Different halving techniques give different estimates of reliability.


Cronbach’s Coefficient Alpha estimate

Coefficient alpha takes the notion introduced by the split-half technique to its logical conclusion.

Logic

The split-half uses the consistency of two halves to estimate the reliability of the whole – the sum of the two halves.

But it’s surely the case that the particular halves chosen will affect the estimate of reliability. Some will lead to lower estimates. Other possible halves might lead to larger estimates of reliability.

So, the logic goes, why not look at all possible halves; compute a reliability estimate for each possible split; then average all those reliability estimates.

Coefficient alpha essentially does this, although it is not based directly on halving the test.

Instead, alpha is based on splitting the test into as many pieces as you can, usually into as many items as there are on the test, and computing the correlations between all of the pairs of pieces.

The basic idea is that if all the pieces are correlated with each other, the total of those pieces will be reliable from one administration to the next.

Operational Definition of Standardized Cronbach’s Alpha.

1. Identify as many equivalent pieces of the test as possible. Let K be the number of pieces identified. Each piece is usually one item, so K is usually the number of items on the test.

2. Compute the correlations between all possible pairs of pieces. You’ll compute K*(K-1)/2 correlations.
3. Compute the mean (arithmetic average) of the correlations. Call it r-bar. (r for correlation; bar for mean)
4. Plug K (number of items) and r-bar (mean of the K*(K-1)/2 correlations) into the following formula

                                              K * r-bar
Standardized alpha of whole test = α = ---------------------
                                           1 + (K-1) * r-bar

Relationship to split-half reliability

Coefficient alpha is simply an extension of split-half reliability to more than two pieces.

Note that if K = 2, then there is only one correlation – the correlation between the two halves.
So if there were only two pieces, r-bar would be simply rH1,H2, the correlation between the two halves of the test.

So if K=2, the formula for alpha reduces to 2*rH1,H2 / (1 + rH1,H2). This is the split-half formula.

“Regular” Cronbach’s Alpha

There is another formula, based on variances of the pieces and covariances between them that is typically computed and reported. If you see alpha reported, it will likely be the variance-based version.

I presented the standardized version here, because 1) its formula is easier to follow than the variance-based formula and 2) its value is typically within .02 of the variance-based formula.

SPSS used to report both. Now, I believe, it reports only “Regular” alpha.

Hand Computation Of Standardized Coefficient Alpha

Suppose a scale of job satisfaction has four items.

Q1: I'M HAPPY ON MY JOB.
Q2: I LOOK FORWARD TO GOING TO WORK EACH DAY.
Q3: I HAVE FRIENDLY RELATIONSHIPS WITH MY COWORKERS.
Q4: MY JOB PAYS WELL.

Suppose I gave this "job satisfaction" instrument to a group of 100 employees. Each person responded with extent of agreement to each item on a scale of 1 to 5. Total score, i.e., observed amount of job satisfaction, is either the sum of the responses to the four items or the mean of the four items.

The data matrix might look like the following:

Two different Expressions of Scale scores

PERSON   Q1    Q2    Q3    Q4    TOTAL   MEAN
1        3     4     3     3     13      3.25
2        5     4     5     5     19      4.75
3        1     2     1     1     5       1.25
4        3     2     3     3     11      2.75
5        4     5     4     3     16      4.00
6        4     4     3     2     13      3.25
etc      etc   etc   etc   etc   etc     etc.

Suppose the correlations between the items were as follows:

      Q1    Q2    Q3    Q4
Q1    1
Q2    .4    1
Q3    .5    .4    1
Q4    .3    .4    .5    1

The average of the interitem correlations, r-bar, is

r-bar = (.4 + .5 + .3 + .4 + .4 + .5) / 6 = 2.5 / 6 = .417

Standardized Coefficient alpha is

          No. items * r-bar              4 * .417          1.668
Alpha = --------------------------- = ----------------- = ------- = .74
          1 + (No. items - 1)*r-bar     1 + (4-1)*.417      2.251

Notes:

1. Alpha is merely a re-expression of the correlations between the items. The more highly the items are intercorrelated, the larger the value of alpha.

2. Alpha can be increased by adding items, as long as adding them does not decrease the average of the interitem correlations, r-bar. So any test can be made more reliable by adding relevant items - items which correlate with the other items.

3. Just as was the split-half reliability estimate, alpha depends on the consistency (correlations) of the pieces of the test, all of which are internal to, i.e., part of, the test. So it’s an internal consistency estimate. The more consistent the responses to the items, the higher the reliability
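A quick check of the hand computation above (Python):

import numpy as np

K = 4
r = [.4, .5, .3, .4, .4, .5]                    # the six unique interitem correlations
r_bar = np.mean(r)                              # = .417
alpha = (K * r_bar) / (1 + (K - 1) * r_bar)     # standardized coefficient alpha
print(round(r_bar, 3), round(alpha, 2))         # prints .417 and 0.74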


Obviously, each item correlated perfectly with itself, so the 1's on the diagonal will not be used in computation of alpha.


The SPSS RELIABILITY PROCEDURE

Example Data: Items of a Job Satisfaction Scale. 60 respondents. 1=Dissatisfied; 7=Satisfied.

Q27 Q32 Q35 Q37 Q43 Q45 Q50 OVSAT

1.00 5.00 2.00 2.00 1.00 1.00 2.00 2.00 1.00 7.00 6.00 4.00 6.00 2.00 6.00 4.57 7.00 7.00 1.00 7.00 7.00 6.00 7.00 6.00 4.00 6.00 6.00 6.00 6.00 6.00 6.00 5.71 1.00 6.00 5.00 2.00 1.00 1.00 3.00 2.71 3.00 3.00 7.00 6.00 7.00 1.00 6.00 4.71 6.00 7.00 7.00 6.00 6.00 6.00 7.00 6.43 2.00 7.00 3.00 3.00 3.00 1.00 3.00 3.14 6.00 6.00 7.00 6.00 6.00 6.00 6.00 6.14 4.00 6.00 5.00 4.00 4.00 3.00 3.00 4.14 1.00 3.00 6.00 5.00 5.00 6.00 5.00 4.43 1.00 5.00 1.00 1.00 1.00 1.00 1.00 1.57 1.00 5.00 1.00 1.00 1.00 5.00 1.00 2.14 1.00 7.00 2.00 2.00 3.00 3.00 3.00 3.00 7.00 7.00 6.00 7.00 7.00 7.00 7.00 6.86 6.00 4.00 4.00 7.00 6.00 7.00 7.00 5.86 7.00 7.00 7.00 7.00 5.00 7.00 7.00 6.71 7.00 7.00 4.00 4.00 7.00 7.00 6.00 6.00 6.00 5.00 7.00 7.00 6.00 5.00 6.00 6.00 7.00 5.00 5.00 6.00 6.00 2.00 9.00 5.17 3.00 6.00 6.00 3.00 5.00 5.00 5.00 4.71 3.00 7.00 6.00 7.00 4.00 3.00 7.00 5.29 6.00 6.00 7.00 7.00 7.00 6.00 7.00 6.57 3.00 7.00 7.00 7.00 6.00 1.00 7.00 5.43 5.00 7.00 6.00 6.00 7.00 6.00 6.00 6.14 4.00 6.00 6.00 6.00 6.00 3.00 6.00 5.29 5.00 5.00 6.00 5.00 5.00 1.00 5.00 4.57 3.00 6.00 2.00 5.00 6.00 6.00 5.00 4.71 4.00 4.00 2.00 3.00 3.00 2.00 2.00 2.86 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 5.00 6.00 6.00 4.00 7.00 6.00 6.00 5.71 7.00 6.00 4.00 7.00 7.00 5.00 7.00 6.14 4.00 5.00 4.00 5.00 7.00 5.00 7.00 5.29 3.00 7.00 7.00 7.00 6.00 6.00 7.00 6.14 6.00 6.00 6.00 6.00 5.00 5.00 6.00 5.71 4.00 5.00 7.00 4.00 6.00 4.00 7.00 5.29 7.00 7.00 6.00 7.00 7.00 6.00 7.00 6.71 6.00 5.00 2.00 7.00 6.00 6.00 7.00 5.57 3.00 6.00 7.00 5.00 3.00 7.00 6.00 5.29 6.00 6.00 7.00 7.00 6.00 6.00 7.00 6.43 6.00 4.00 5.00 7.00 6.00 6.00 6.00 5.71 4.00 4.00 4.00 6.00 4.00 1.00 2.00 3.57 5.00 5.00 6.00 6.00 7.00 5.00 6.00 5.71 4.00 6.00 6.00 6.00 6.00 6.00 6.00 5.71 5.00 6.00 6.00 6.00 6.00 6.00 6.00 5.86 2.00 2.00 2.00 2.00 2.00 2.00 3.00 2.14 5.00 6.00 6.00 5.00 5.00 6.00 6.00 5.57 2.00 6.00 6.00 5.00 3.00 5.00 6.00 4.71 5.00 6.00 2.00 5.00 5.00 6.00 4.00 4.71 5.00 6.00 7.00 6.00 6.00 7.00 7.00 6.29 1.00 6.00 6.00 2.00 5.00 1.00 5.00 3.71 5.00 6.00 7.00 6.00 6.00 3.00 7.00 5.71 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 7.00 7.00 7.00 7.00 7.00 6.00 7.00 6.86 7.00 1.00 6.00 7.00 3.00 5.00 6.00 5.00 4.00 6.00 6.00 5.00 5.00 6.00 6.00 5.43 1.00 5.00 5.00 5.00 1.00 2.00 5.00 3.43 1.00 6.00 5.00 3.00 5.00 5.00 3.00 4.00 7.00 7.00 7.00 7.00 7.00 5.00 7.00 6.71 4.00 6.00 7.00 7.00 7.00 5.00 7.00 6.14


Analyze -> Scale -> Reliability Analysis …


The syntax for this output, if you’re interested.

RELIABILITY
  /VARIABLES=Q27 Q32 Q35 Q37 Q43 Q45 Q50
  /SCALE('ALL VARIABLES') ALL
  /MODEL=ALPHA
  /STATISTICS=DESCRIPTIVE SCALE
  /SUMMARY=TOTAL MEANS VARIANCE COV CORR.

Reliability

Scale: ALL VARIABLES

Case Processing Summary
                       N      %
Cases   Valid          60     100.0
        Excluded(a)    0      .0
        Total          60     100.0
a. Listwise deletion based on all variables in the procedure.

Reliability Statistics
Cronbach's Alpha   Cronbach's Alpha Based on Standardized Items   N of Items
.866               .862                                           7


Item Statistics
       Mean     Std. Deviation   N
Q27    4.3167   2.06251          60
Q32    5.7000   1.27957          60
Q35    5.2500   1.86516          60
Q37    5.2833   1.76685          60
Q43    5.2000   1.81145          60
Q45    4.5500   2.04546          60
Q50    5.6000   1.75827          60

In the correlation matrix below, look for items with small or negative correlations with the other items. They'll be the most likely candidates for exclusion from the scale. Item 32’s correlations have been highlighted.

Inter-Item Correlation Matrix
       Q27     Q32     Q35     Q37     Q43     Q45     Q50
Q27    1.000   .139    .296    .738    .659    .565    .648
Q32    .139    1.000   .217    .143    .326    .245    .277
Q35    .296    .217    1.000   .534    .477    .252    .631
Q37    .738    .143    .534    1.000   .686    .495    .801
Q43    .659    .326    .477    .686    1.000   .514    .760
Q45    .565    .245    .252    .495    .514    1.000   .496
Q50    .648    .277    .631    .801    .760    .496    1.000

Summary Item Statistics
                          Mean    Minimum   Maximum   Range   Maximum / Minimum   Variance   N of Items
Item Means                5.129   4.317     5.700     1.383   1.320               .264       7
Item Variances            3.293   1.637     4.254     2.617   2.598               .761       7
Inter-Item Covariances    1.583   .324      2.688     2.365   8.305               .626       7
Inter-Item Correlations   .471    .139      .801      .662    5.747               .043       7

Item-Total Statistics
       Scale Mean if   Scale Variance if   Corrected Item-     Squared Multiple   Cronbach's Alpha
       Item Deleted    Item Deleted        Total Correlation   Correlation        if Item Deleted
Q27    31.5833         62.484              .699                .644               .839
Q32    30.2000         81.417              .280                .160               .884
Q35    30.6500         69.926              .516                .442               .864
Q37    30.6167         63.901              .796                .739               .826
Q43    30.7000         63.536              .786                .649               .827
Q45    31.3500         66.401              .568                .374               .859
Q50    30.3000         62.959              .841                .767               .820

Scale Statistics
Mean      Variance   Std. Deviation   N of Items
35.9000   89.515     9.46125          7

I’ve reproduced the display of alpha for the whole scale to make it easier to use the values in the rightmost column above.

Reliability Statistics
Cronbach's Alpha   Cronbach's Alpha Based on Standardized Items   N of Items
.866               .862                                           7


All items should have approximately equal standard deviations. Item 32 is suspect here. In general, items with small standard deviations will tend to suppress reliability.

Use the "Cronbach's Alpha if Item Deleted" column to identify items whose inclusion makes alpha smaller than it would be without the item.


Reliability Example
Tests with Right/Wrong Answers

The example below illustrates how reliability analysis would be performed on a multiple choice test in which there was a right and wrong answer to each item.

I chose to enter the raw responses to the items into SPSS from within a Syntax Window.

The DATA LIST command tells SPSS the names of the variables (q1, q2, . . ., q36) and where each is located within a line (columns 1-36). For this example, q36 was an essay question and was not included in the reliability analysis.

The values represent responses marked by test takers as follows:

1=a 2=b 3=c 4=d 9=no answer provided.

DATA LIST /q1 to q36 1-36.BEGIN DATA.333331112113322114241221114421423122311333432212311114341422321112133224323333431212311424242411331222413225333333441212321421242411322921423223313333441232311121241411324121423225323333141212311121242412321221423225111411431212213434342421111222433225211321413212333314142413224121443124333311112313122412341411222122423221332333133212311414142411224222423220323333431212311414242441321221433223213332131212311134142211221221433225313333441312322114122411214222333325323333431212314121242411324221422223331332412212312114241111311422413221333133413232311124142214131121433224313333431212311414242411221121423225313321432313341324342431311221433224312321431333212424232111223221433223323332131213311414242411321421433224323333441212311124242411321221423225313333431212311123242411321221421225331333431212311121242411214312423293113333412313313121242241221321423324END DATA.


The following syntax commands "score" each response and put the score for each question into a new variable.

RECODE q1 (3=1) (ELSE=0) INTO q1score.
RECODE q2 (2=1) (ELSE=0) INTO q2score.
RECODE q3 (3=1) (ELSE=0) INTO q3score.
RECODE q4 (3=1) (ELSE=0) INTO q4score.
RECODE q5 (3=1) (ELSE=0) INTO q5score.
RECODE q6 (3=1) (ELSE=0) INTO q6score.
RECODE q7 (4=1) (ELSE=0) INTO q7score.
RECODE q8 (3=1) (ELSE=0) INTO q8score.
RECODE q9 (1=1) (ELSE=0) INTO q9score.
RECODE q10 (2,3=1) (ELSE=0) into q10score.
RECODE q11 (1=1) (ELSE=0) INTO q11score.
RECODE q12 (2=1) (ELSE=0) INTO q12score.
RECODE q13 (3=1) (ELSE=0) INTO q13score.
RECODE q14 (1=1) (ELSE=0) INTO q14score.
RECODE q15 (1=1) (ELSE=0) INTO q15score.
RECODE q16 (1=1) (ELSE=0) INTO q16score.
RECODE q17 (2=1) (ELSE=0) INTO q17score.
RECODE q18 (1=1) (ELSE=0) INTO q18score.
RECODE q19 (2=1) (ELSE=0) INTO q19score.
RECODE q20 (4=1) (ELSE=0) INTO q20score.
RECODE q21 (2=1) (ELSE=0) INTO q21score.
RECODE q22 (4=1) (ELSE=0) INTO q22score.
RECODE q23 (1=1) (ELSE=0) INTO q23score.
RECODE q24 (1=1) (ELSE=0) INTO q24score.
RECODE q25 (2,3=1) (ELSE=0) INTO q25score.
RECODE q26 (2=1) (ELSE=0) INTO q26score.
RECODE q27 (1=1) (ELSE=0) INTO q27score.
RECODE q28 (2=1) (ELSE=0) INTO q28score.
RECODE q29 (2=1) (ELSE=0) INTO q29score.
RECODE q30 (1=1) (ELSE=0) INTO q30score.
RECODE q31 (4=1) (ELSE=0) INTO q31score.
RECODE q32 (2=1) (ELSE=0) INTO q32score.
RECODE q33 (3=1) (ELSE=0) INTO q33score.
RECODE q34 (2=1) (ELSE=0) INTO q34score.
RECODE q35 (2=1) (ELSE=0) INTO q35score.
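For reference, here is a rough Python equivalent of this scoring step (the answer key is taken from the RECODE statements above; the function and record handling are illustrative, not part of the lecture):

# Answer key from the RECODE statements; q10 and q25 accept two answers.
key = {1: {3}, 2: {2}, 3: {3}, 4: {3}, 5: {3}, 6: {3}, 7: {4}, 8: {3}, 9: {1},
       10: {2, 3}, 11: {1}, 12: {2}, 13: {3}, 14: {1}, 15: {1}, 16: {1}, 17: {2},
       18: {1}, 19: {2}, 20: {4}, 21: {2}, 22: {4}, 23: {1}, 24: {1}, 25: {2, 3},
       26: {2}, 27: {1}, 28: {2}, 29: {2}, 30: {1}, 31: {4}, 32: {2}, 33: {3},
       34: {2}, 35: {2}}

def score_record(line):
    # line: a string of response digits in columns 1-36; column 36 (the essay item)
    # is ignored. Returns 35 item scores, 1 = correct, 0 = incorrect or no answer (9).
    return [1 if int(line[i - 1]) in key[i] else 0 for i in range(1, 36)]

# e.g., total = sum(score_record(record)) for one respondent's record from the raw data above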

Note: q10 and q25 each had two correct answers, hence the (2,3=1) recodes above.

The following is a list of the newly created "score" variables.

(Columns: q1score through q35score, followed by TOTSCORE.)

1 0 1 1 1 0 0 0 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 1 1 1 1 1 0 1 161 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1 1 0 0 0 0 0 1 1 1 211 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 301 0 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 291 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 291 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 320 0 0 0 0 0 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1 0 0 1 1 1 0 1 0 1 1 1 180 0 0 1 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 1 0 1 171 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 1 1 1 1 1 171 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 251 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 300 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 261 0 1 1 1 1 1 0 1 1 1 1 1 0 0 1 0 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1 211 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 321 0 0 1 1 0 1 0 0 1 1 1 1 1 0 1 0 0 1 1 0 0 1 1 1 0 1 0 1 0 1 0 1 1 1 211 0 1 0 1 1 1 0 0 1 0 1 1 1 1 1 1 0 0 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 1 221 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 301 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 1 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 231 0 0 1 0 0 1 1 1 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 211 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 271 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 331 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 321 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 270 0 1 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 0 1 25

The RELIABILITY procedure was invoked with the following syntax command. Obviously, it can also be invoked from a pull down menu:

Note that the variables which are assessed are the 1/0 "score" variables, not the original responses.

RELIABILITY
  /VARIABLES=q1score q2score q3score q4score q5score q6score q7score q8score q9score
    q10score q11score q12score q13score q14score q15score q16score q17score q18score
    q19score q20score q21score q22score q23score q24score q25score q26score q27score
    q28score q29score q30score q31score q32score q33score q34score q35score
  /FORMAT=NOLABELS
  /SCALE(ALPHA)=ALL
  /MODEL=ALPHA
  /STATISTICS=DESCRIPTIVE SCALE
  /SUMMARY=TOTAL CORR .


Reliability output from a previous version of SPSS.

****** Method 2 (covariance matrix) will be used for this analysis ******

R E L I A B I L I T Y   A N A L Y S I S  -  S C A L E  (A L P H A)

                   Mean     Std Dev   Cases
 1. Q1SCORE        .8333    .3807     24.0
 2. Q2SCORE        .2500    .4423     24.0
 3. Q3SCORE        .7083    .4643     24.0
 4. Q4SCORE        .9167    .2823     24.0
 5. Q5SCORE        .7917    .4149     24.0
 6. Q6SCORE        .6250    .4945     24.0
 7. Q7SCORE        .7500    .4423     24.0
 8. Q8SCORE        .5417    .5090     24.0
 9. Q9SCORE        .6250    .4945     24.0
10. Q10SCORE       .9583    .2041     24.0
11. Q11SCORE       .8750    .3378     24.0
12. Q12SCORE       .7500    .4423     24.0
13. Q13SCORE       .8750    .3378     24.0
14. Q14SCORE       .7500    .4423     24.0
15. Q15SCORE       .6250    .4945     24.0
16. Q16SCORE       .5417    .5090     24.0
17. Q17SCORE       .5000    .5108     24.0
18. Q18SCORE       .2500    .4423     24.0
19. Q19SCORE       .6250    .4945     24.0
20. Q20SCORE       .9167    .2823     24.0
21. Q21SCORE       .7917    .4149     24.0
22. Q22SCORE       .7500    .4423     24.0
23. Q23SCORE       .7500    .4423     24.0
24. Q24SCORE       .8333    .3807     24.0
25. Q25SCORE       .8750    .3378     24.0
26. Q26SCORE       .6667    .4815     24.0
27. Q27SCORE       .5833    .5036     24.0
28. Q28SCORE       .5000    .5108     24.0
29. Q29SCORE       .9167    .2823     24.0
30. Q30SCORE       .6667    .4815     24.0
31. Q31SCORE       .9167    .2823     24.0
32. Q32SCORE       .5000    .5108     24.0
33. Q33SCORE       .9167    .2823     24.0
34. Q34SCORE       .8333    .3807     24.0
35. Q35SCORE       .9583    .2041     24.0

* * * Warning * * * Determinant of matrix is zero

Statistics based on inverse matrix for scale ALPHA are meaningless and printed as .

This message will be printed whenever the number of variables exceeds the number of persons. Alpha is not affected.

R E L I A B I L I T Y A N A L Y S I S - S C A L E (A L P H A)

N of Cases = 24.0

Statistics for     Mean      Variance   Std Dev   N of Variables
      Scale        25.1667   28.7536    5.3622    35

Inter-item Correlations   Mean     Minimum   Maximum   Range     Max/Min    Variance
                          .0972    -.3780    .7977     1.1757    -2.1106    .0454

R E L I A B I L I T Y A N A L Y S I S - S C A L E (A L P H A)

Item-total Statistics

             Scale      Scale      Corrected
             Mean       Variance   Item-         Squared       Alpha
             if Item    if Item    Total         Multiple      if Item
             Deleted    Deleted    Correlation   Correlation   Deleted

Q1SCORE      24.3333    27.6232    .2463         .             .8050
Q2SCORE      24.9167    26.0797    .5486         .             .7937
Q3SCORE      24.4583    26.6938    .3844         .             .7999
Q4SCORE      24.2500    27.9348    .2477         .             .8051
Q5SCORE      24.3750    26.3315    .5285         .             .7950
Q6SCORE      24.5417    25.4764    .6075         .             .7901
Q7SCORE      24.4167    28.2536    .0647         .             .8119
Q8SCORE      24.6250    27.7228    .1440         .             .8101
Q9SCORE      24.5417    25.5634    .5890         .             .7909
Q10SCORE     24.2083    27.9982    .3304         .             .8042
Q11SCORE     24.2917    28.5634    .0211         .             .8113
Q12SCORE     24.4167    27.0362    .3308         .             .8021
Q13SCORE     24.2917    27.1721    .4166         .             .8001
Q14SCORE     24.4167    26.5145    .4486         .             .7976
Q15SCORE     24.5417    25.6504    .5707         .             .7917
Q16SCORE     24.6250    28.1576    .0624         .             .8135
Q17SCORE     24.6667    26.1449    .4495         .             .7969
Q18SCORE     24.9167    26.9493    .3503         .             .8013
Q19SCORE     24.5417    25.8243    .5342         .             .7933
Q20SCORE     24.2500    28.1087    .1888         .             .8065
Q21SCORE     24.3750    27.0272    .3604         .             .8011
Q22SCORE     24.4167    27.2101    .2921         .             .8035
Q23SCORE     24.4167    27.3841    .2536         .             .8050
Q24SCORE     24.3333    28.1449    .1148         .             .8092
Q25SCORE     24.2917    27.1721    .4166         .             .8001
Q26SCORE     24.5000    26.9565    .3130         .             .8028
Q27SCORE     24.5833    27.4710    .1949         .             .8079
Q28SCORE     24.6667    27.1884    .2449         .             .8058
Q29SCORE     24.2500    28.6304    .0144         .             .8106
Q30SCORE     24.5000    27.1304    .2774         .             .8042
Q31SCORE     24.2500    28.1087    .1888         .             .8065
Q32SCORE     24.6667    26.8406    .3122         .             .8029
Q33SCORE     24.2500    30.0217    -.4356        .             .8208
Q34SCORE     24.3333    27.0145    .4028         .             .7999
Q35SCORE     24.2083    28.9547    -.1105        .             .8117

R E L I A B I L I T Y A N A L Y S I S - S C A L E (A L P H A)

Reliability Coefficients 35 items

Alpha = .8080 Standardized item alpha = .7902


The logic behind coefficient alpha

Coefficient alpha is based on a premise that originated with the use of parallel forms estimates and continued with the use of the Spearman-Brown split-half estimate: if different tests or different parts of a test correlate highly with each other, then they would be likely to correlate highly with themselves at some different time. That’s because, according to classical test theory, these items have small errors of measurement. T is much bigger than E for each item.

Coefficient alpha is best for instruments with the following characteristics . . .

1. The test is comprised of multiple items.
2. Items have essentially equal means and standard deviations and are U.S.
3. All items fit classical test theory with the same T value.

Relationships between Test-retest and Coefficient Alpha

Here are test-retest reliability and coefficient alpha values for four common measures of affect. Sample size is 1195. All participants responded to each scale twice – once when completing a “HEXACO” Sona project, the other when completing a “NEO” Sona project.

Scale                                 Test-retest   Alpha 1   Alpha 2
Rosenberg self-esteem Scale           .843          .912      .910
Costello & Comrey Depression scale    .829          .947      .946
Watson PANAS PA scale                 .760          .895      .887
Watson PANAS NA scale                 .786          .890      .870

Note that the coefficient alpha estimates of reliability are all slightly larger than the test-retest estimates involving the same variables. This is common.

Which estimate of reliability should you use?

If you can compute and defend your method, a test-retest estimate is probably the most defensible.
If your instrument meets the assumptions listed above for alpha, it should be reported.
Parallel forms and split-half estimates are alternative estimates in special circumstances.
You must report SOME estimate of reliability.

Acceptable Reliability

How high should reliability be?
How tall is tall? How tall should you be to play in the NBA?

Some very general guidelines

Reliability Range   Characterization
0 - .6              Poor
.6 - .7             Marginally Acceptable
.7 - .8             Acceptable
.8 - .9             Good
.9+                 Very Good
.95+                Too good – this IS psychology, after all.


Factors affecting estimates of reliability and their relationships to population reliability.

There are at least three major factors that will affect the relationship of a reliability estimate to the true reliability of the test in the population in which the test will be used.

Let’s call the sample of persons upon whom the reliability estimate is based the reliability sample.

1. Variability of the people in reliability sample relative to variability of people in the population in which the instrument will be used.

If the reliability sample is more homogeneous than the population on which the test will be used, the reliability estimate from the sample will be smaller than true reliability for the whole population.

On the assumption that you want to report as high a reliability coefficient as possible, this suggests that you should make the sample from whom you obtain the estimate of reliability as heterogeneous as possible. The sample should definitely be at least as variable as the population.

2. Errors of measurement specific to the reliability sample.

Guessing is represented in Classical Test Theory by large errors of measurement.

Suppose the test requires the reading level of a college graduate and will be used with college graduates, but you include persons not in college in the reliability sample.

This means that some of the people won’t understand some of the items and will guess.

So test characteristics such as inappropriate reading level or poor wording that cause large errors of measurement reduce reliability and estimates of reliability.

3. Consistency of the people making up the reliability sample.

The specific people making up the sample may contribute to the errors of measurement referred to in 2 above.

Some people are more careless (?) inconsistent (?) than others. If the reliability sample is composed of a bunch of careless respondents, the reliability estimates will be smaller than if the reliability sample were composed of consistent responders.

Reddock, Biderman, & Nguyen (International Journal of Selection and Assessment, 2011) split a sample into two groups based on the variability of their responses to items within the same Big Five dimension. Here are the coefficient alpha reliability estimates from the two groups . . .

Group                Extraversion   Agreeableness   Conscientiousness   Stability   Openness
Consistent Group     .92            .83             .84                 .90         .85
Inconsistent Group   .85            .69             .79                 .83         .76

So the bottom line is that the more consistent the respondents in the reliability sample, the higher the estimate of reliability.


4. For multiple item tests, the average of the interitem correlations determines reliability estimates.

The greater the mean of the interitem correlations, the higher the reliability estimate, all other things, e.g., K, being equal.

Recall . . .

              K * r-bar
Alpha = ---------------------
           1 + (K-1) * r-bar

Trust the mathematicians – alpha gets bigger as r-bar gets bigger.

5. For multiple item tests, the number of items making up the test affects reliability. As K increases, alpha increases.

Longer tests have higher reliability, all other things, e.g., r-bar, being equal.


So, a 5-item scale with a mean interitem correlation of .3 would have reliability = .68. Adding 5 items to make a 10-item scale would increase reliability to .81.

A 5-item scale with a mean interitem correlation of .5 would have reliability = .83. Adding 5 items to make a 10-item scale would increase reliability to .91.
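A sketch of the arithmetic behind these numbers (Python):

def standardized_alpha(k, r_bar):
    # alpha = K * r-bar / (1 + (K-1) * r-bar)
    return k * r_bar / (1 + (k - 1) * r_bar)

for r_bar in (.3, .5):
    for k in (5, 10):
        print(f"K = {k:2d}, r-bar = {r_bar}: alpha = {standardized_alpha(k, r_bar):.2f}")
# Prints .68 and .81 for r-bar = .3, and .83 and .91 for r-bar = .5.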

So, heterogeneous samples, items understandable by all respondents, consistent responders, closely-related items, lots of items lead to high reliability estimates.


[Graph: alpha = (K * r-bar) / (1 + (K-1) * r-bar) plotted as a function of K (number of items) for 3 different values of r-bar. Note that the point of diminishing returns is at K equal to 7 or 8. After that, each additional item contributes less and less to alpha.]


Why be concerned with reliability? The Reliability Ceiling: Start here on 1/23/18

Goal of research: To find relationships (significant correlations) between independent and dependent variables. By the way, significant difference between means counts as a significant relationship.

If we find significant correlations, our work is lauded, published, rewarded.

If we don’t find significant correlations, our work is round-filed, we end up homeless.

So, most of the time we want large correlations between the measures we administer.

Basic Question: Of all the tests out there, with which test will your test correlate most highly?

The answer is that any test will correlate most highly with itself.

It cannot correlate more highly with any other test than it does with itself.

And reliability is the extent to which a test would be expected to correlate with itself on two administrations. So test reliability is the best you can do if you’re looking for large correlations.

If reliability is low, that means that a test won’t even correlate highly with itself.

If a test won’t correlate with itself, how could we expect it to correlate highly with any other test?

And the answer is: we couldn’t. If a test can’t be expected to correlate highly with itself, it couldn’t be expected to correlate highly with any other test.

The fact that the reliability of a test limits its ability to correlate with other tests is called the reliability ceiling associated with a test.

Reliability Ceiling Formula.

Suppose X is the independent variable and Y is the dependent variable in the relationship being tested.

Let rXX’ and rYY’ be the true reliability of X and Y respectively.

Let rtX,tY be the correlation between the True scores on the X dimension and True scores on the Y dimension.

Then rXY <= rtX,tY * sqrt(rXX’ * rYY’)

The observed correlation between observed X and Y scores can be expected to be no higher than the true correlation between X and Y times the square root of the product of the two reliabilities. Unless the reliabilities are 1, this means that the observed correlation is expected to be less than the true correlation.
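A small numeric illustration (Python; the reliabilities and the true-score correlation are made-up values):

import math

r_xx = 0.70       # reliability of X (hypothetical)
r_yy = 0.60       # reliability of Y (hypothetical)
r_true = 0.50     # correlation between the true scores (hypothetical)

ceiling = math.sqrt(r_xx * r_yy)     # highest possible observed correlation (if r_true were 1)
expected_r_xy = r_true * ceiling     # expected observed correlation
print(round(ceiling, 3), round(expected_r_xy, 3))   # 0.648, 0.324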

This means that low reliability is associated with smaller chances of significant correlations.

Let’s all hate low reliability. Something else we can agree on.


Turning the Reliability Ceiling Around – Estimating rTxTy.

If rXY < = rtX,tY * sqrt(rXX’*rYY’)

then, using the algebra you learned in 7th grade . . .

                  rXY
rtX,tY >= ---------------------
            sqrt(rXX’ * rYY’)

If the reliabilities of two tests are known, then a “purer” estimate of the correlation between true scores can be obtained by dividing the observed correlation by the square root of the product of the reliabilities.

So what?

Estimates of the true correlation are more constant across different scales of the same construct than are observed correlations.

Estimates of the true correlation give us a better perspective on how related or unrelated different scales are.

Industrial – Organizational Psych Example

Are job satisfaction and job commitment different characteristics?

Le, H., Schmidt, F. L., Harter, J. K., & Lauver, K. J. (2010). The problem of empirical redundancy of constructs in organizational research: An empirical investigation. Organizational Behavior and Human Decision Processes, 112, 112-125.

Question asked in this research: Are satisfaction and commitment different constructs?

Le et al. correlated Satisfaction with Commitment. The correlation was rXY = .72 averaged over two measurement periods (Table 1, p. 120). This is pretty high, but not so high as to cause us to treat them as the same construct.

But after adjusting for reliability of the measures using the above formula . . .

The true correlation adjusted for unreliability = .90.

This suggests that the two constructs – job satisfaction and job commitment – are essentially identical.

Even though the questionnaires seem different to us, they’re responded to essentially identically by respondents.

Le et al. (2010) argued that the Satisfaction literature and the Commitment literature may be redundant.


Affect Example from UTC research.

Many consider there to be two separate types of affect.

Self-esteem is typically viewed as a type of positive affect.

Depression is typically viewed as a type of negative affect.

For many people the two types of affect are viewed as distinct characteristics.

Here is the observed correlation, rXY, of Rosenberg Self-esteem scale scores (Rosenberg, 1965) with Costello and Comrey Depression scores: -0.805.

The test-retest reliability of the Rosenberg scale is .843 (see the table above).
The test-retest reliability of the Depression scale is .829 (see the table above).

The estimate of the true-score correlation is

                  rXY                -.805             -.805
rTX,TY = --------------------- = ----------------- = --------- = -.963
           sqrt(rXX’ * rYY’)      sqrt(.843*.829)       .836
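Checking the arithmetic (Python):

import math

r_xy = -.805                 # observed correlation between self-esteem and depression
r_xx, r_yy = .843, .829      # test-retest reliability estimates from the table above

r_txty = r_xy / math.sqrt(r_xx * r_yy)
print(round(r_txty, 3))      # about -0.963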

If self-esteem and depression were truly distinct and independent characteristics, the correlation between the two would be zero.

If self-esteem and depression were different views of the same affective state, the adjusted correlation would be -1.00.

It’s certainly not 0. And it’s not -1, but it’s getting very very close to -1. This suggests that these two constructs, which are typically measured separately and are parts of very different literatures, are in fact very highly related to each other, so much so that they must certainly involve many of the same underlying biological structures.

The bottom line is that if you’re going to study the self-esteem literature, you will probably benefit from reviewing the depression literature, and vice versa.


Reasons for nonsignificant correlations between independent and dependent variables

We’ve now covered three reasons for failure to achieve statistically significant correlations.

1. Sucky Theory: The correlation between true scores, rtX,tY, is 0, i.e., X and Y really are not related to each other.

This means that our theory which predicts a relationship between X and Y is wrong.

We must revise our thinking about the X – Y relationship.

From a methodological point of view, this is the only excusable reason for a nonsignificant result.

2. Low power (from last semester).
There could really be a relationship between True X and True Y, i.e., rtX,tY is different from 0, but our sample size is too small for our statistical test to detect it.

This is inexcusable. (Ring, Ring. “Hello, who is this?” “This is Arnold Schwarzenegger. You’re terminated!!”)

We should always have sufficient sample size to detect the relationship we expect to find.

3. Low reliability (new – from this semester).
There could really be a relationship between True X and True Y, i.e., rtX,tY is different from 0, but our measures of X and Y are so unreliable that even though True X and True Y may be correlated the observed correlation is not significant.

This is inexcusable. (Ring, Ring. “Hello. Wait, I only just learned about reliability.” “OK, but if this happens next week, you’re fired!”)

We should always have measures sufficiently reliable to allow us to detect the relationship we expect to find.

The above is a good candidate for an essay test question or a couple of multiple choice questions.


Introduction to Path Diagrams

Symbols

Observed variables are symbolized by Squares or Rectangles.

Theoretical Constructs, also called Latent Variables are symbolized by Circles or Ellipses.

Correlations between variables are represented by double-headed arrows

"Causal" or "Predictive" relationships between variables are represented by single-headed arrows


[Example path diagrams illustrating these symbols: observed variables (rectangles, with columns of scores beside them), latent variables / theoretical constructs (ellipses), “correlation” arrows (double-headed) between variables, and “causal” arrows (single-headed) from latent variables to observed variables and between latent variables.]


Representation of Classical Test Theory

In equation form: Observed Scores = True Scores + Errors of Measurement

Xi = T + Ei

That is, every observed score is the sum of a person's true position on the dimension of interest plus whatever error occurred in the process of measuring. The relationship between T and O is one in which Observed score is said to be a reflective indicator of the True amount.

In terms of the labels of the diagrams . . .


[Path diagram of Classical Test Theory: the True Score (a latent variable, T) and the Error of Measurement (a latent variable, E) each send a “causal” arrow to the Observed Scores (an observed variable, X), i.e., T -> X <- E.]


The relationship of observed correlations to true score correlations using path notation

SYMBOLICALLY, rXY ≤ rTxTy * sqrt(rXX' * rYY')

Example of how reliability affects estimates of rXY

From a recent study in which Intelligence was measured by the Paper and Pencil Wonderlic and Academic Ability was measured by Academic Record GPAs taken from Banner:

rXY = 0.299.

From a more recent study in which Intelligence was measured by an unproctored web-based short form of the Wonderlic and Academic Ability was measured by self-reported GPA:

rXY = 0.180

Why is the 2nd correlation so much smaller than the first? Perhaps because the unproctored web-based short form of the Wonderlic is less reliable than the Paper and Pencil form.


[Path diagram: the latent variables Intelligence and Academic Ability are connected by the True rXY (what we want); each has an Error of Measurement arrow into its observed measure – the Wonderlic and GPA’s – and the two observed measures are connected by the Observed rXY (what we observe).]