
Psychological Methods, 1996, Vol. 1, No. 2, 199-223

Copyright 1996 by the American Psychological Association, Inc. 1082-989X/96/$3.00

Measurement Error in Psychological Research: Lessons From 26 Research Scenarios

Frank L. Schmidt, University of Iowa

John E. Hunter, Michigan State University

As research in psychology becomes more sophisticated and more oriented toward the development and testing of theory, it becomes more important to eliminate biases in data caused by measurement error. Both failure to correct for biases induced by measurement error and improper corrections can lead to erroneous conclusions that retard progress toward cumulative knowledge. Corrections for attenuation due to measurement error are common in the literature today and are becoming more common, yet errors are frequently made in this process. Technical psychometric presentations of abstract measurement theory principles have proved inadequate in improving the practices of working researchers. As an alternative, this article uses realistic research scenarios (cases) to illustrate and explain appropriate and inappropriate instances of correction for measurement error in commonly occurring research situations.

There are two central questions as to error of measurement: (a) Should we correct empirical findings for the biases produced by imperfect measurement? (b) If we decide to correct, then how do we use different estimates of reliability to make corrections in various situations? This article addresses both questions.

Should empirical findings be corrected for the distortions produced by imperfect measurement? There are two very different answers to this question suggested by the current literature. On the one hand, the methodological literature since 1910 has been virtually universal in stating that correction is not only desirable but critical to both accurate estimation of scientific quantities and to the assessment of scientific theories. We summarize that informed opinion later in this article.

Frank L. Schmidt, Department of Management and Organization, College of Business, University of Iowa; John E. Hunter, Department of Psychology, Michigan State University.

Correspondence concerning this article should be addressed to Frank L. Schmidt, Department of Management and Organization, College of Business, 108 Pappajohn Business Administration Building, University of Iowa, Iowa City, Iowa 52242-1000. Electronic mail may be sent to [email protected].

On the other hand, examination of currently published studies shows that most studies—especially in laboratory research areas—still make no mention either of error of measurement or of reliability. Does this mean that there are many areas in which it is safe to assume perfect measurement in psychology? We note later that empirical findings have shown this to be false. Every psychological variable yet studied has been found to be imperfectly measured, as is true throughout all other areas of science. Furthermore, even in articles in which reliabilities are reported, the majority of studies do not use those reliabilities to correct findings for the distortion produced by error of measurement. Does this mean that if one reports the reliability, then the distortions produced by error of measurement will no longer occur? That is, is it true that a scientist need only mention error of measurement in order to make it go away? Alas, this too is known to be false. If one knows the reliability, then one can compute the extent of distortion and correct for that distortion. However, the distortion in the primary findings is still there even if one reports the reliabilities in the Method section.

There has never been either mathematical or substantive rebuttal of the main findings of psychometric theory. Random error of measurement distorts virtually every statistic computed in modern studies. Accurate scientific estimation requires correction for these distortions in all cases.

However, while the basic correction formulas are very simple, application of those formulas is often not simple. Reliability can be estimated in different ways, and different methods capture different components. If a large error component is missed by the method used in a given study, then correction using the reliability estimate may be only partial correction. Furthermore, the distortion produced by error of measurement differs for different statistics and differs across situations depending on the specific error processes that enter a particular computation. Many researchers have found it very difficult to translate psychometric textbooks into useful procedures for their particular study. The bulk of this article is devoted to presenting these issues in what we believe to be a much more understandable manner than that used in current measurement texts.

Should We Correct Empirical Findings for Distortions Due to Imperfect Measurement?

In recent years, changes have taken place that make it more important for researchers to understand the effects of measurement error in their research and to correct properly for these effects. Research has increasingly moved away from the mechanically empirical approach of the past to an increasing emphasis on the development and testing of theoretical explanations of behavior (e.g., see Klimoski, 1993; Schmitt & Landy, 1993). This change has led to the realization that in theory-based research, the real interest is in the relationships that exist between actual traits or constructs rather than between specific measures of traits or constructs. That is, the interest is in correlations at the true-score (or construct) level rather than in the observed-score correlations that are different for different measures, depending on the amount of measurement error they contain (J. P. Campbell, 1990; Schmidt, 1993; Schmidt & Hunter, 1992). As a result, psychological researchers have increasingly come to realize the need in theory testing to correct observed relationships for the biasing effects of measurement error. J. P. Campbell (1990) has explored this change in some detail and has indicated that it will continue into the future.

Because of the emphasis in structural equation modeling on correcting biases caused by measurement error, the increasing use of structural equation modeling has also contributed to this change in thinking. These developments indicate that it is becoming increasingly important to make the appropriate corrections for biases induced in research data by measurement error.

At the same time, there has developed a broader emphasis on precision in estimating correlations and effect sizes, both corrected and uncorrected and in both applied and theory-based research. A better understanding of the magnitude of the impact of sampling error has led to the use of larger samples in individual studies and has thereby led to increased precision in estimates of correlations. In addition, the use of meta-analysis has provided more precise and generalizable estimates of relationships among different constructs and measures (Schmidt & Hunter, 1992). These advances have led to the demise of the older belief that population relations among variables are unique to settings, organizations, or subgroups and have therefore encouraged attempts to develop general theories and explanations (Schmidt, 1992). This increasing emphasis on precision has made it more important that corrections for biases due to measurement error be as precise and accurate as possible and not merely approximately correct.

The amount of measurement error variance in some measures used in psychological research is large, often in the neighborhood of 50% of the total variance of the measure. This is especially likely to be true of the short scales often used in field research to save time and thereby allow a larger number of measures to be included. Yet many researchers today still do not correct their data at all for biases induced by measurement error. In fact, in many research areas, failure to even acknowledge biases created by measurement error is still a more important problem today than is the use of inappropriate corrections.

Problems in Applying Reliability Theory and Methods

Corrections for biases due to measurement error are used increasingly frequently in the research literature. We know from our experience as reviewers and from reading the published literature that inappropriate attenuation corrections continue to be made by researchers. Some of these errors lead to large downward or upward biases. Researchers need to know how to make these corrections correctly and how to evaluate them properly when they are made and presented by others in the research literature. How can this best be accomplished? Many extended technical treatises on the abstract principles of reliability theory have appeared over the years, especially in the educational domain (e.g., Cronbach, 1947; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Feldt & Brennan, 1989; Stanley, 1971; Thorndike, 1951; Traub, 1994). Virtually all of these have discussed the biasing effects of measurement error on correlations, and most have discussed corrections for these biases. Yet these abstract psychometric dissertations appear to have had little impact on working researchers. These discussions rarely illustrate the abstract technical principles with concrete research applications. Without such examples, it is difficult for researchers to apply the abstract technical principles to concrete research problems that appear in a wide variety of forms and with a myriad of obscuring features, details, and idiosyncrasies. This article is based on a different approach: examination of a series of concrete research scenarios that we have encountered in our work as researchers, advisors to researchers, or reviewers. This article examines 26 real-world "case studies" and explicates them on the basis of the principles of reliability theory found in Cronbach (1947, 1951) and Cronbach et al. (1972). Although we cite and rely on these sources, other sources (e.g., Thorndike, 1951) would yield identical resolutions; only the terminology would sometimes be (slightly) different.

In classical measurement theory, the fundamental general formula for the observed correlation between any two measures, x and y, is

r_{xy} = r_{x_t y_t} (r_{xx} r_{yy})^{1/2},   (1)

where r_{xy} is the observed correlation, r_{x_t y_t} is the correlation between the true scores of the measures x and y, and r_{xx} and r_{yy} are the reliabilities of x and y, respectively. This is called the attenuation formula, because it shows how measurement error in the x and y measures reduces the observed (computed) correlation (r_{xy}) below the true-score correlation (r_{x_t y_t}). Solving this equation for r_{x_t y_t} yields the disattenuation formula:

r_{x_t y_t} = r_{xy} / (r_{xx} r_{yy})^{1/2}.   (2)

If the sample size is infinite, both these formulas are perfectly accurate. That is, they are perfectly accurate in the population. In the smaller samples used in real research, there are sampling errors in the estimates of r_{xy}, r_{xx}, and r_{yy}, and therefore there is also sampling error in the estimate of r_{x_t y_t}. Because of this, a circumflex is sometimes used to indicate that all values are estimates:

\hat{r}_{x_t y_t} = \hat{r}_{xy} / (\hat{r}_{xx} \hat{r}_{yy})^{1/2}.   (3)

The \hat{r}_{x_t y_t} is the estimated correlation between the construct underlying the measure x and the construct underlying the measure y.¹ Alternatively, it is an estimate of the (uncorrected) correlation that we would observe between the measures x and y if both measures could be made free of measurement errors. This much is familiar to some researchers. What confuses many, however, is the question of what type of reliability estimate should be used under different circumstances to provide the appropriate estimate of the true-score correlation. Indeed, because there are several types of reliability coefficients, each with different properties, this question can be confusing and can lead to erroneous estimates, as is illustrated in the research scenarios that follow.
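To make Equations 1 through 3 concrete, here is a minimal sketch (ours, in Python, not part of the original article); the reliabilities and the true-score correlation are illustrative assumptions rather than values from any scenario.

```python
import math

def attenuate(r_true, rxx, ryy):
    """Equation 1: observed correlation implied by a true-score correlation
    and the reliabilities of the two measures."""
    return r_true * math.sqrt(rxx * ryy)

def disattenuate(r_obs, rxx, ryy):
    """Equations 2 and 3: estimate the construct-level correlation from an
    observed correlation and the two reliabilities."""
    return r_obs / math.sqrt(rxx * ryy)

# Assumed values: a construct-level correlation of .50 measured with
# reliabilities of .70 and .80.
r_observed = attenuate(0.50, 0.70, 0.80)              # about .37
r_recovered = disattenuate(r_observed, 0.70, 0.80)    # recovers .50
print(round(r_observed, 2), round(r_recovered, 2))
```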

Corrections for measurement error are also included in meta-analysis methods as described by Hunter and Schmidt (1990) and Hunter, Schmidt, and Jackson (1982). In our advising of researchers conducting meta-analyses, we have repeatedly observed uncertainty and confusion as to the kinds of reliability estimates that are appropriate for inclusion in meta-analyses. The principles that apply to the selection of the appropriate reliability coefficient for use in single studies also apply to meta-analyses.

¹ Item response theory can be used to demonstrate that the true scores for many scales based on classical measurement theory are monotonically but not perfectly linearly related to the underlying trait (construct) in question (Lord & Novick, 1968). However, the relationship is usually close enough to linear that the effect on the estimated construct-level correlations of taking true scores as collinear with the construct is negligible. However, the trend toward increased precision in psychological research means that someday even this factor may have to be routinely taken into account.


Hence the guidance provided in the single-study scenarios in this article can be used to determine the appropriate type of reliability coefficients to use in meta-analyses.

There are two approaches in meta-analysis to correcting biases due to measurement error. In the first, individual correlations or d values from each primary study are first corrected, and then the meta-analysis is performed on these corrected correlations or d values (Hunter & Schmidt, 1990, chaps. 3 & 7). In this approach to meta-analysis, the entire process of making corrections for measurement error is identical to that described in this article.

Often, the reliability information necessary to correct each correlation or d value individually is not presented in the primary studies. In those cases, the "artifact distribution" meta-analysis method is used (Hunter & Schmidt, 1990, chaps. 4 & 7). In this approach, reliability values are taken from individual studies, test manuals, and other sources that do report the appropriate reliability estimates. These values are then compiled into reliability distributions characteristic of that research literature, and these distributions are then used with appropriate calculational procedures to correct for the biases induced by measurement error. In this case it is again important to know what types of reliability estimates are appropriate for use in these artifact distributions. This article provides that information for the most commonly occurring research situations.
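The following sketch (ours) illustrates only the first approach, with hypothetical study values: each correlation is corrected individually and the corrected values are then averaged with simple sample-size weights. The full Hunter-Schmidt procedure uses somewhat different weights and also partitions the remaining variance, so treat this strictly as a schematic.

```python
import math

# Hypothetical studies: observed r, sample size, and the reliabilities of
# the predictor and criterion measures in that study.
studies = [
    {"r": 0.25, "n": 120, "rxx": 0.80, "ryy": 0.60},
    {"r": 0.31, "n": 200, "rxx": 0.75, "ryy": 0.55},
    {"r": 0.18, "n": 85,  "rxx": 0.85, "ryy": 0.65},
]

# Correct each correlation for attenuation (Equation 2), then take a
# sample-size-weighted average of the corrected values.
corrected = [(s["r"] / math.sqrt(s["rxx"] * s["ryy"]), s["n"]) for s in studies]
mean_corrected_r = sum(r * n for r, n in corrected) / sum(n for _, n in corrected)
print(round(mean_corrected_r, 3))
```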

Failure to Correct and Resulting Problems

Scenario 1

Situation. A researcher is interested in theories relating job satisfaction to perceptions of fairness in the reward for performance. She decides to test one such theory by comparing the level of satisfaction following performance under two different incentive schemes. In a laboratory experiment, she randomly assigns 20 students to one of two conditions; 10 to each cell of the design. The 10 subjects assigned to Condition A perform with reward determined by Incentive Scheme A. The 10 subjects assigned to Condition B perform with reward determined by Incentive Scheme B. The dependent variable is the subject's report of satisfaction with the reward. Subjects were asked to record their feeling about the experiment as "satisfied" or "not satisfied."

The researcher was quite troubled when she found no significant difference between the means for the two groups. When asked whether she assessed the reliability of the dependent variable, she claimed that there is no error of measurement in her study. She stated she was careful to use perfectly random assignment to conditions and was equally careful in correctly recording exactly which response option was chosen by the subject.

Problem. There are two errors in this researcher's beliefs. First, the random assignment of subjects to conditions is irrelevant to this issue. If subjects were nonrandomly assigned to conditions, there might be a potential problem with the construct validity of the treatment variable, but there is little likelihood that the nonrandom assignment would have any effect on the extent of random measurement error in the measure of the dependent variable. Thus random versus nonrandom assignment is irrelevant to the issue of imperfect reliability in the measurement of satisfaction.

The deeper and more pervasive error in the researcher's thinking is the identification of the concept "error of measurement" with the process of correctly recording the data. Random error in the measure of satisfaction is produced if the response recorded is subject to any random influence. It is true that error in recording the responses would be one such source of error, a source that is carefully controlled in most psychological studies. But many studies have shown that error in recording responses is usually trivial in comparison to other sources of random error (see Hunter & Schmidt, 1990, pp. 117-125, for a full discussion of such sources).

The largest source of random error in the recorded response is randomness in the response process itself. Many studies have shown that most human responses have a large random component. In the case of very well rehearsed responses such as answering "Are you male or female?," the response is often highly predictable. For example, in the famous "American Voter" study (A. Campbell, Converse, Miller, & Stokes, 1960), the test-retest reliability of the variable sex was .95. In these extreme cases, random response error is small in magnitude, and ignoring it causes little error in causal inference.

But the essence of most field research and virtually all laboratory research is to study responses that are not well rehearsed.


Studies are conducted on the basis of subject responses that should vary in order to reflect the causal processes to be studied. Thus in most research contexts—and in laboratory studies in particular—any observed response will have a very large random component. If a subject is observed in an exactly repeated situation, the correlation between "replicated" responses is rarely any higher than .25. That is, for unrehearsed single responses, it is unwise to assume a test-retest reliability higher than .25.

In field research, researchers are taught to be concerned about measurement. They are taught that randomness can be reduced if multiple responses are observed and averaged to generate the final measurement. This is the principle of the "test" or "scale" of measurement.

Consider the researcher in this scenario. In using a single item to measure satisfaction, it is likely that the reliability of that measure is no higher than .25. For purposes of this scenario, let us assume that value. Had the researcher used a seven-item scale to measure satisfaction, then the Spearman-Brown formula indicates that reliability would climb to .70, that is, nearly three times more reliable measurement. With a 15-item scale, the reliability would have been .84.
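A short sketch of the Spearman-Brown projection used here (ours; the single-item reliability of .25 is the value assumed in the scenario):

```python
def spearman_brown(r_single, k):
    """Projected reliability of a scale lengthened by a factor of k, given
    the reliability of a single item."""
    return k * r_single / (1 + (k - 1) * r_single)

for k in (1, 7, 15):
    print(k, round(spearman_brown(0.25, k), 2))   # 0.25, 0.7, and roughly 0.83
```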

Consider now the effect size for the study and the implications of differences in quality of measurement for statistical power. Lipsey and Wilson (1993) recently reviewed over 300 meta-analyses on a very wide variety of psychological treatments. The average effect size was d = .46 (corresponding to a point-biserial r of .22, if sample sizes are equal in experimental and control groups). Suppose we take this to be the effect size for perfect measurement in the present researcher's study.

With a reliability of .25, the effect size in her study would be reduced from a potential d = .46 to an actual d = .23 (i.e., √.25 × .46 = .23). The probability of finding a significant result would then be only 12% (using a one-tailed test). That is, the probability of an error for the significance test in this study is 88%. So it is no great surprise that she failed to find a significant difference.

Had she used a seven-item scale to measure satisfaction, the effect size would have been d = .31. The probability of finding a statistically significant difference in this study would be 17% for a one-tailed test. With a sample size of 10 subjects per cell, this means that the significance test still has an error rate of 83%. But the odds using a seven-item scale improve from 12% to 17%, a 42% increase in the odds of drawing the correct conclusion from the study.

For a 15-item scale, the effect size rises to .40—almost the maximum value achieved using perfect measurement—and the probability of observing a significant difference rises to 23% using a one-tailed test. That is, with a 15-item scale the odds that the significance test will produce the wrong result drop to 77% using a one-tailed test. By comparison, the odds of drawing a correct conclusion rise from 12% for a 1-item scale, to 17% for a 7-item scale, to 23% for a 15-item scale.

For perfect measurement, the effect size is d = .46 and the probability that the significance test is right is 26% using a one-tailed test. This is not a lot higher than the power for the 15-item scale. Thus there is only limited improvement in going from a reliability of .84 to a reliability of 1.00.
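The power figures in this scenario can be approximated as follows (our sketch, using SciPy's noncentral t distribution; the attenuation step uses the usual d × √reliability relation).

```python
import math
from scipy import stats

def attenuated_d(d_true, reliability):
    """Observed d implied by measurement error in the dependent variable."""
    return d_true * math.sqrt(reliability)

def power_one_tailed_t(d, n_per_cell, alpha=0.05):
    """Approximate power of a one-tailed two-sample t test for effect size d."""
    df = 2 * n_per_cell - 2
    ncp = d * math.sqrt(n_per_cell / 2)        # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha, df)
    return stats.nct.sf(t_crit, df, ncp)

d_obs = attenuated_d(0.46, 0.25)                                   # .23, as in the scenario
print(round(d_obs, 2), round(power_one_tailed_t(d_obs, 10), 2))    # power near .12
for d in (0.31, 0.40, 0.46):                                       # other values discussed above
    print(d, round(power_one_tailed_t(d, 10), 2))
```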

This researcher would have been well rewarded had she correctly learned the nature of random error in the measurement of human responses (animal responses are even more random). Even in laboratory studies it is critical to measure the reliability of the dependent variable. It is also important to correct the study value for attenuation using the measured reliability, since this makes it possible to compare results across studies that use measurement scales with differing reliabilities.

Scenario 2

Situation. A researcher is testing several hypotheses from a theory of work adjustment. These hypotheses predict certain relationships among the constructs of role ambiguity, role stress, and overall work adjustment. The researcher is aware that measurement error can distort research findings, so she decides to use scales that are somewhat longer than average to attain adequate levels of reliability for her measures. After her data are gathered, she finds that coefficient alpha is .76 for the measure of role ambiguity, .85 for the measure of role stress, and .81 for the measure of overall work adjustment. She has read in a text on research methods that a reliability of .70 or above is adequate for research purposes. She concludes that there is no need to correct for measurement error, because the fact that she has adequate levels of reliability means that measurement error is not a problem in her research data.

Problem. This case is an example of the magic number belief. This researcher falsely believes that if reliability is as high as the magic number .70, then there will be no downward bias in observed correlations due to attenuation resulting from measurement error. But the bias introduced into estimates of correlations between constructs does not magically cease to exist when reliability hits the .70 level. In fact, when reliability is .70, the bias factor is √.70 = .84; that is, the observed correlation will be on average 16% below its correct value. But that is the effect of measurement error in only one of the two variables. If both variables have reliability of .70, the bias factor is √(.70)(.70) = .70. This means that the observed correlation would on average be 30% below its correct value, a very large bias. In this researcher's data, the bias factor for the observed correlation between the measures of role ambiguity and overall work adjustment would be √(.76)(.81) = .78, a downward bias of 22%.

It is true that larger reliabilities produce smaller downward biases than do smaller reliabilities. But some bias continues to exist unless and until reliability reaches 1.00, which never happens. Thus there is never a magic value for reliability beyond which estimates of correlations from research data are unbiased and can be taken at face value as estimates of correlations among the constructs of interest.

Scenario 3

Situation. A researcher is studying a job in which mental arithmetic must be used to make on-the-spot decisions about warehouse storage. His task is to determine the best way to assure adequate mental arithmetic skills on the job to prevent costly decision errors by employees. He first evaluates a training program designed to improve this skill; comparing the experimental and control groups, he obtains a d value of .41. That is, he finds that the training program increases skill in mental arithmetic by 41% of a standard deviation. He next conducts a predictive criterion-related validity study (on a different group) of a general mental ability test, using as the criterion the same measure of mental arithmetic given at the end of the training program. He corrects the observed criterion-related validity coefficient of .32 for measurement error in this criterion. However, he does not correct the d value from the training evaluation study, stating that measurement error corrections are applicable only to correlations.

Problem. Measurement error corrections are just as applicable to d values as to correlations. The d value is the standardized difference between the control group and experimental group means; it is biased downward by measurement error in the dependent variable (here, the measure of mental arithmetic skills) in exactly the same way as the correlation, and the correction should be made in exactly the same manner (Hunter & Schmidt, 1990, pp. 241-247).
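A minimal sketch of the correction (ours): the d value is divided by the square root of the criterion reliability, the d-value analogue of Equation 2. The criterion reliability of .75 is a hypothetical value; the scenario does not report one.

```python
import math

def correct_d(d_obs, ryy):
    """Disattenuate a d value for measurement error in the dependent variable."""
    return d_obs / math.sqrt(ryy)

print(round(correct_d(0.41, 0.75), 2))   # about .47 under the assumed reliability
```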

Scenario 4

Situation. An industrial-organizational psychologist conducts a large-scale validity study of an assessment center used to select first-line supervisors in a large manufacturing firm. Six different criterion measures of job behavior are used; two of these measures are promotion rate and job tenure. The researcher makes corrections for criterion unreliability for the other measures, but she argues that there is no unreliability in promotion rate and tenure. People are either promoted or not, and there is no error in the recording of the promotions. The records also clearly show whether each participant left the company and, if so, how long tenure was before leaving. She is confident there are no recording errors in tenure, either. In her research report, she therefore states that the uncorrected validity coefficients for the criteria of promotion rate and tenure are unbiased estimates of the true validities.

Problem. For purposes of this scenario, we accept the researcher's claim that there are no recording errors in these data, although in reality this is unlikely to be the case. But even if there were no recording errors, the promotion and tenure measures cannot be assumed to be perfectly reliable. As a general matter, there are no measures that are completely free of measurement error. Even in the hard sciences, whenever measures have been examined, measurement error has been found.

Although it may be difficult to assess the reliability of promotion rate, it is not impossible.


Promotions occur over considerable time periods and are based on assessments of employee job performance and capabilities. For most employees, there are multiple opportunities for promotion. For employees who have stayed with the organization for a reasonable period of time, personnel records can be used to obtain their promotion rates during two periods of time. The correlation between these two promotion rate measures can be corrected using the Spearman-Brown formula to yield a reliability appropriate for the time period used in the validity study.

It is also possible to determine whether the relative standing of employees on promotion rate changes over time. Doing that requires obtaining measures of promotion rate for three time periods during careers. The three correlations among these three periods are then computed. If there has been no change, then these three correlations will be equal (within the limits of sampling error). If systematic change is found to be occurring, then the reliability of the promotion rate measure (r_xx) is computed from these three intercorrelations (a sketch of one such estimate appears after this scenario).

A similar procedure can be used to estimate the reliability of the measure of tenure. However, that determination would be more difficult, because it would require tenure records on some individuals from several organizations, and organizational differences might bias the reliability estimates. Determining the reliability of promotion rate and tenure might be difficult or even infeasible in any one study. However, it should be possible to obtain estimates of this general sort from the research literature. It should be, but it currently is not, because to our knowledge the needed studies have not been conducted, or if they have, they have not been published. Thus this is an information gap in the research literature that needs to be filled. Until this gap is filled, the best option when such reliability estimates cannot be made is to state that the observed validity coefficients for promotion rate and tenure are downwardly biased to some unknown extent. It is erroneous to pretend that these measures are perfectly reliable.
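A sketch of the two procedures just described (ours). The correlations below are hypothetical, and the simplex-type estimate in the second step is a standard construction offered only as an illustration, not necessarily the exact formula the authors had in mind.

```python
def spearman_brown(r_half, k=2):
    """Step the correlation between two period-specific promotion-rate
    measures up to the reliability of the full-period measure."""
    return k * r_half / (1 + (k - 1) * r_half)

# Two-period approach: hypothetical correlation of .55 between promotion
# rates computed from two halves of each employee's record.
print(round(spearman_brown(0.55), 2))    # about .71

# Three-period approach with systematic change: one conventional
# simplex-based estimate of single-period reliability.
r12, r23, r13 = 0.60, 0.58, 0.45         # hypothetical period intercorrelations
print(round((r12 * r23) / r13, 2))       # about .77
```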

Scenario 5

Situation. On the basis of his own research and findings by others, a researcher hypothesizes that a major cause of job dissatisfaction is negative affectivity, the general tendency to experience negative emotions. He further hypothesizes that the trait of negative affectivity is highly stable over time. To calibrate the degree of stability, he administers a measure of negative affectivity to a large group and 1 year later administers a parallel form to the same group. Both forms have been carefully developed and are reliable; one has an estimated alpha coefficient of .86 and the other an alpha of .84. The correlation over the 1-year period is .80. On the basis of this 1-year correlation of .80, the researcher concludes that the trait (construct) of negative affectivity is fairly but not completely stable. He concludes that 20% of the variance of the trait is unstable and 80% is stable.

Problem. This estimate of the stability of the trait of negative affectivity is downwardly biased. The researcher has confused the stability of the measures with the stability of the trait. The 1-year stability of the measures is .80; the 1-year stability of the trait is higher. Random errors of measurement at each time period bias the correlation between the two time periods downward. These errors of measurement are assessed by the alpha coefficients. Correcting for measurement errors at Times 1 and 2 provides an estimate of the stability of the trait:

Estimated stability = .80/(.86 × .84)^{1/2} = .94.

Thus the trait of negative affectivity is actually considerably more stable than this researcher concluded.

Actually, this trait may be nearly perfectly stable. The .94 value above is at least to some small degree an underestimate. This conclusion stems from the fact that coefficient alpha consigns transient error (Hunter & Schmidt, 1990, pp. 123-125) to true variance and, as a result, somewhat underestimates measurement error (and overestimates reliability) at Time 1 and at Time 2. In this study, subjects respond to the measure of negative affectivity at Time 1 and Time 2. At each time, that particular occasion is characterized for each subject by a certain mood, level of emotion, type of feeling, and so on. These factors are transient: A day later or a week later, they may be different, and hence the individual's responses may be somewhat different. These transient influences on the scores are not part of the construct (negative affectivity) that the researcher is trying to measure. They are a part of measurement error variance, not true variance. The ideal research design would be as follows.


At Time 1, administer parallel-form measures a week or so apart and estimate reliability at Time 1 as the correlation between these two measures taken at these two times. This estimate of reliability controls for transient error: Transient moods and other factors will not reproduce themselves on the second occasion. Do the same thing at Time 2. These reliability estimates should be slightly smaller than the alpha coefficients. Then correct the .80 correlation across the 1-year period using these reliability estimates. The resulting stability figure will probably be somewhat larger than the .94 above; it might be 1.00, indicating perfect stability of the trait.
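A one-line sketch of the correction used in this scenario (ours), with the values reported above:

```python
import math

# Observed 1-year stability of .80, corrected for measurement error at each
# occasion using the two alpha coefficients (.86 and .84).
print(round(0.80 / math.sqrt(0.86 * 0.84), 2))   # about .94
```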

Scenario 6

Situation. A researcher is testing specific aptitude theory against the g factor theory of mental ability. He uses multiple regression to test his hypothesis that the specific aptitudes of quantitative, verbal, and spatial ability contribute to the prediction of job performance over and above the prediction from general mental ability (g). All of his ability measures are reliable; each has reliability of at least .80. With a large N (N = 3,312) to ensure stable results, his regression results appear to confirm his hypothesis: Although the standardized regression weights for the specific aptitudes are much smaller than the weight for g, some are nevertheless statistically significant and seemingly large enough to be of practical value in some situations. In addition, the multiple R is .02 larger when the specific aptitudes are added.

Problem. Unless the measure of g has perfect reliability (never the case, and not the case here), the addition of the specific aptitude measures to the measure of general ability serves to raise the reliability (and hence the validity) of the predictor set as a whole (the weighted sum of the independent variable scores) as an overall measure of g. The effect of this can be to create the false appearance of nonzero beta weights for the specific aptitudes and the false appearance of an increment to the multiple correlation by the specific aptitudes. Schmidt, Hunter, and Caplan (1981) discussed this phenomenon in some detail. After being informed of this possibility, the researcher tested this hypothesis by correcting all the (criterion-related) validities and intercorrelations for measurement error and then recomputing the standardized regression weights. This time he found that all the weights for the specific aptitudes were zero and that the increment to the multiple correlation from adding the specific aptitudes as predictors was zero. These findings confirm the alternative hypothesis that the trait of general intelligence (g) alone was responsible for the prediction.
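The reanalysis described here can be sketched as follows (ours, with a small hypothetical matrix for g, one specific aptitude, and job performance; the values are illustrative, not the scenario's data). With these numbers, the specific aptitude's standardized weight is visibly positive before correction and shrinks to near zero once the correlations are disattenuated.

```python
import numpy as np

def disattenuate_matrix(R, rel):
    """Correct every off-diagonal correlation in R for unreliability, given a
    vector of reliabilities (Equation 2 applied element by element)."""
    Rc = R / np.sqrt(np.outer(rel, rel))
    np.fill_diagonal(Rc, 1.0)
    return Rc

def std_betas(R, predictors, criterion):
    """Standardized regression weights computed from a correlation matrix."""
    Rpp = R[np.ix_(predictors, predictors)]
    rpc = R[np.ix_(predictors, [criterion])]
    return np.linalg.solve(Rpp, rpc).ravel()

# Hypothetical correlations among [g, quantitative aptitude, job performance]
R = np.array([[1.00, 0.70, 0.40],
              [0.70, 1.00, 0.32],
              [0.40, 0.32, 1.00]])
rel = np.array([0.85, 0.80, 0.60])        # hypothetical reliabilities

print(np.round(std_betas(R, [0, 1], 2), 3))                            # observed-score betas
print(np.round(std_betas(disattenuate_matrix(R, rel), [0, 1], 2), 3))  # corrected betas
```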

Scenario 7

Situation. A researcher has developed a 30-item self-report measure of organizational citizenship behavior that she believes has more construct validity than the self-report measure that is currently used in most studies in the literature. She hypothesizes that her measure correlates higher with supervisory evaluations of citizenship behavior than does the older measure. She tests this hypothesis on a group of 48 laboratory technicians who complete both measures and are rated on citizenship behaviors by their supervisors. She reports that her measure correlates significantly with the supervisory ratings (r = .27, p < .05), while the other measure does not (r = .22, ns). She uses coefficient alpha to estimate the reliability of the organizational citizenship measures and uses the appropriate measures of interrater reliability to estimate the reliability of the ratings. Corrected for unreliability in both the measure and the ratings, the correlation for her measure increases from .27 to .46. She does not correct the correlation for the other measure because it is not significant, stating that a correlation that is not significantly different from zero should not be corrected. She concludes that the best estimate of the true-score correlations with supervisory judgments of organizational citizenship behaviors is .46 for her measure and zero for the other measure. Therefore she concludes that her measure has construct validity and that the other measure does not.

Problem. This comparison is quite biased. Consider the observed correlations of .22 and .27 for the two measures. The 95% confidence intervals for these two correlations show an overlap of .51 correlation points: the 95% confidence interval for the old measure is -.06 < .22 < .50; the 95% confidence interval for the new measure is -.01 < .27 < .55. Thus the confidence intervals provide no foundation for concluding that these correlations are different.


(A significance test of the difference between these two correlations would also indicate that they are not significantly different, but the confidence interval provides much more direct information [Berenstein, 1994; Schmidt, 1994].) Thus there is no basis for this researcher's conclusion that the correlation of .27 should be corrected for measurement error while the correlation of .22 should not.

This researcher has used a false decision rule that is often used in data analysis (Schmidt, 1992). This decision rule states that if a difference or correlation is not statistically significant, then the best point estimate of its value is zero. But the best point estimate of the population value of an observed correlation or difference is the observed value of the correlation or difference, regardless of whether it is statistically significant or not. Hence in this case the best estimates of the measurement error-attenuated population correlations are .22 for the old measure and .27 for the new measure. The best estimates of the true-score correlations for these measures are these values corrected for measurement error. For the new measure, correction for the bias introduced by measurement error increases the estimate from .27 to .46. For the old measure, this correction increases the estimate from .22 to .37. Comparison of these two corrected values suggests that the older measure may have less construct validity than does the new measure. However, this is only a weak suggestion, as can be seen by comparison of the confidence intervals for the corrected correlations.

The confidence intervals for the corrected correlations can be computed by correcting the endpoints of the confidence intervals for the observed correlations for the biasing effects of measurement error (Hunter & Schmidt, 1990, pp. 121-122). When we do this we get the following true-score correlation confidence intervals: the 95% confidence interval for the old measure is -.10 < .37 < .85; the 95% confidence interval for the new measure is -.02 < .46 < .94. Comparison of these confidence intervals makes it clear that each correlation lies well within the confidence interval for the other. Thus the difference between the .46 and the .37 is well within the limits of differences expected from sampling error. Thus even if the data in this example are analyzed correctly, the original question cannot be answered here with any certainty. A sample size of 48 does not contain enough information to provide an adequate test of the researcher's hypothesis. This is true whether or not statistical significance tests are used. The key point illustrated in this scenario, however, is that the practice of correcting only statistically significant correlations for measurement error leads to serious biases. It is akin to the practice of reporting only statistically significant correlations, leading to an upward bias in reported correlations.
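A sketch of the endpoint-correction step (ours). The standard error formula below is a common approximation and may not be exactly the one behind the intervals quoted above, and the two reliabilities are hypothetical stand-ins chosen so that the observed .27 corrects to roughly .46.

```python
import math

def confidence_interval_r(r, n, z=1.96):
    """Approximate 95% confidence interval for an observed correlation."""
    se = (1 - r ** 2) / math.sqrt(n - 1)
    return r - z * se, r + z * se

def correct_interval(lo, hi, rxx, ryy):
    """Correct both endpoints for attenuation due to measurement error."""
    a = math.sqrt(rxx * ryy)
    return lo / a, hi / a

lo, hi = confidence_interval_r(0.27, 48)
print(round(lo, 2), round(hi, 2))
print(tuple(round(v, 2) for v in correct_interval(lo, hi, 0.60, 0.57)))
```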

Scenario 8

Situation. A researcher is interested in the heritability of the personality traits of extraversion and emotional stability (neuroticism, when scored negatively). Using a well-known personality inventory, he measures these traits in samples of identical and fraternal twins. On the basis of the scores obtained, his observed heritabilities are .43 for extraversion and .45 for emotional stability. He concludes that for the trait of extraversion, 43% of the variance is due to genetic effects and 57% to environmental effects. For the trait of emotional stability (neuroticism), he concludes that 45% of the variance is explained by genetic differences between individuals, while 55% is due to differing environmental influences.

Problem. These conclusions are incorrect. The heritabilities that are the basis of these conclusions are observed heritabilities; they have not been corrected for the biases caused by measurement error in the personality scales. Thus, for example, .43 is a downwardly biased estimate of the heritability of the trait of extraversion. The .43 figure estimates the heritability not of the trait but of the scores on the measure of extraversion. Furthermore, some of the 57% of variation attributed to environmental effects is actually the variance of measurement errors; thus this researcher has overestimated the size of environmental effects in addition to underestimating the heritability of the traits.

What type of reliability should be used to correct these observed heritability estimates? We need a measure of reliability that assigns to measurement error not only random errors of measurement but also transient errors (see Scenario 5) and specific factor variance. Thus what is needed is the correlation between two parallel forms of these measures separated by, say, 2 weeks in time. In the personality domain, such reliabilities average about .65. Suppose the values in this case are .67 for extraversion and .64 for emotional stability.


Then the unbiased estimate of heritability for the trait of extraversion would be .43/.67 = .69. And the unbiased estimate of the heritability for the trait of emotional stability is .45/.64 = .70. Thus the heritabilities of the traits are considerably higher than are the heritabilities of the observed scores on the measures. The amount of variation that can be attributed to environmental effects is considerably less, about 30%.

The reader probably noticed that in making these corrections for measurement error, we divided by the reliabilities themselves instead of by the square roots of the reliabilities. This is because the heritability is an estimate of a squared correlation: the squared correlation between genetic differences and the scores, in the case of observed heritabilities, and the squared correlation between genetic differences and the actual trait, in the case of corrected heritabilities. (This is the reason that heritabilities are represented by the symbol h² and are given percent-variance-accounted-for interpretations.) Therefore the square root of the observed heritability is the estimated correlation between the genetic differences (genotypes) and the observed scores; for example, for extraversion this correlation is √.43 = .66. Likewise, the square root of the corrected heritability is the estimated correlation between genetic differences (genotypes) and the trait itself; for example, for extraversion this correlation is √.69 = .83. For the trait of emotional stability, this estimated correlation is √.70 = .84. Thus the correlations underlying heritabilities are considerably larger than are the heritabilities themselves. A trait with a heritability of .70 correlates .84 with genetic differences between individuals.
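A short sketch of the heritability correction (ours), using the emotional-stability values from the scenario; note the division by the reliability itself rather than by its square root.

```python
import math

def correct_heritability(h2_observed, reliability):
    """Heritability of the trait itself: divide the observed heritability by
    the reliability, because heritability is a squared correlation."""
    return h2_observed / reliability

h2_trait = correct_heritability(0.45, 0.64)   # observed h2 = .45, reliability = .64
print(round(h2_trait, 2))                     # about .70
print(round(math.sqrt(h2_trait), 2))          # correlation with genotype, about .84
```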

Scenario 9

Situation. A researcher hypothesizes that behaviorally anchored rating scales (BARS; Smith & Kendall, 1963) will lead to reduced halo error in comparison with Likert scales. Halo error in job performance ratings refers to unrealistically high correlations between differently rated dimensions of job performance. For example, if the population correlation between personal appearance and computer programming skills is .10, a correlation of .85 between personal appearance and computer programming skills would be evidence of halo error. To test his hypothesis, the researcher develops BARS and Likert scales to measure the same 10 dimensions of job performance (e.g., skills in dealing with customer complaints). The BARS scales are of the usual type, and each Likert subscale has seven items. He has a group of supervisors do performance appraisals with both sets of scales and computes scale intercorrelations. The average correlation between performance dimensions is .72 for the Likert scales but only .53 for the BARS scales. The researcher concludes that BARS scales do indeed result in reduced halo.

Problem. The researcher failed to control for the reliability of the ratings. Each BARS rating is based on only one item; that is, for each performance dimension, there is only one judgment and only one response. Each Likert rating, on the other hand, is based on seven items; the rater makes seven judgments for each performance dimension. Just as in the case of tests, longer measures are usually more reliable than are shorter measures. Higher reliability means higher correlations among dimensions, because there is less downward biasing of these correlations due to measurement error. Thus the lower intercorrelations between performance dimensions for the BARS scales may be due merely to lower reliability.

The following procedure can be used to test this hypothesis. First, develop a BARS scale that has two items per dimension. Within each dimension, the correlation between these two items, after correction using the Spearman-Brown formula, is the reliability of the sum of the two items on that dimension. These summed two-item scores should then be correlated across the 10 dimensions, and these 45 correlations should be corrected using these reliability estimates. Then the average of these corrected correlations is computed for the BARS scale.

The same procedure should then be followed for the Likert scales. Coefficient alpha should be computed for each of the 10 seven-item Likert scales; this provides an estimate of reliability comparable with that computed for the BARS scales. Next, the correlations should be computed between total scores on the 10 Likert scales. Then these 45 correlations should be corrected for unreliability using the alpha coefficients for the Likert scales.

The average of the corrected correlations for the Likert scales should then be compared with the corresponding average for the BARS scales to determine which has more halo.


Correcting for bias due to unreliability eliminates the possibility that differences between the two scale types in average dimension intercorrelation are due to differences in the reliability of ratings. Only if the BARS average is smaller than that for the Likert scales could one conclude that BARS scales are less affected by halo error.

Despite almost 30 years of discussion and debate about whether the BARS rating format reduces halo error, the analysis described here has never, to our knowledge, been conducted. Yet it is the only approach that can provide an unambiguous answer to the question.
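The comparison procedure can be sketched as follows (ours). To keep it short, the example uses 3 dimensions rather than 10, and every correlation and reliability is a hypothetical placeholder.

```python
import math
import numpy as np

def spearman_brown(r12, k=2):
    """Reliability of a two-item dimension score from the correlation between
    its two items."""
    return k * r12 / (1 + (k - 1) * r12)

def mean_corrected_intercorrelation(R_dims, reliabilities):
    """Average of all pairwise dimension intercorrelations after each one is
    disattenuated by the reliabilities of the two dimensions involved."""
    k = R_dims.shape[0]
    vals = [R_dims[i, j] / math.sqrt(reliabilities[i] * reliabilities[j])
            for i in range(k) for j in range(i + 1, k)]
    return float(np.mean(vals))

R_bars = np.array([[1.00, 0.53, 0.50],
                   [0.53, 1.00, 0.55],
                   [0.50, 0.55, 1.00]])
rel_bars = [spearman_brown(r) for r in (0.55, 0.60, 0.58)]  # two-item BARS reliabilities

R_likert = np.array([[1.00, 0.72, 0.70],
                     [0.72, 1.00, 0.74],
                     [0.70, 0.74, 1.00]])
rel_likert = [0.88, 0.90, 0.86]                             # alpha for each Likert scale

print(round(mean_corrected_intercorrelation(R_bars, rel_bars), 2))
print(round(mean_corrected_intercorrelation(R_likert, rel_likert), 2))
```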

Correcting Using the Wrong Reliability Coefficient

Scenario 10

Situation. A researcher conducts a criterion-related validity study of an assessment center using supervisory ratings as her criterion measure of job performance. Job performance is rated by the first-line supervisor (manager) on 10 dimensions of job performance. The sum of standardized ratings across these 10 dimensions is used as the final index of overall job performance. On the basis of the correlations among these rated dimensions, she computes coefficient alpha (α = .84), which she uses to correct the observed validity coefficient for attenuation due to unreliability in the criterion measure.

Problem. Coefficient alpha as used here is a measure of intrarater reliability. It is an estimate of what the correlation would be if the same rater rerated the same employees (Cronbach, 1951). What is needed is a measure of interrater reliability. The problem with intrarater reliability is that it assigns specific error (unique to the individual rater) to true (construct) variance. Each rater is analogous to a different form of the rating instrument; the specific error for each rater is that rater's idiosyncratic perceptions of employee job performance. This specific error is known to be very large in ratings of job performance (King, Hunter, & Schmidt, 1980; Rothstein, 1990). Average intrarater reliability can be measured using coefficient alpha, as is done here, or by having the same rater rate the same group of employees at two different times and correlating the ratings. With either method, intrarater reliability for the total rating score from multiple-subscale rating instruments is usually found to be in the .80 to .90 range. Interrater reliability, however, is much smaller, averaging about .50 for the same type of rating scales (Rothstein, 1990). That is, the average correlation of total ratings between two different knowledgeable raters who rate the same employees is only about .50. The difference between these two figures is quite large, indicating that somewhere between 30% and 40% of the variance of the ratings of the average rater is specific factor error variance.

Use of intrarater reliabilities to correct criterion-related validity coefficients for criterion unreliability produces substantial downward biases in estimates of actual validity. This bias results from the fact that intrarater reliability assigns specific factor error variance to true job performance variance. In Scenario 12, we see that random response error variance is very large in employment interviews. Random response error is somewhat smaller in total job performance ratings that are the sum of ratings across, say, 10 or more rating judgments (10 or more rating subscales). Such job performance ratings vary less from day to day, as is shown by intrarater reliabilities in the .80 to .90 range. But specific factor error is a very large component of ratings, perhaps an even larger component than in the case of personality tests (see Scenario 13). With both ratings and personality inventories, failure to control for specific factor error leads to especially serious errors in research conclusions.
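A two-line illustration of the size of this bias (ours): a hypothetical observed validity of .25 is corrected once with the scenario's coefficient alpha of .84 and once with the interrater value of about .50 cited above.

```python
import math

r_observed = 0.25                                # hypothetical observed validity
print(round(r_observed / math.sqrt(0.84), 2))    # corrected with alpha: about .27
print(round(r_observed / math.sqrt(0.50), 2))    # corrected with interrater reliability: about .35
```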

Scenario 11

Situation. Many researchers have hoped that they could measure different dimensions of job performance by asking supervisors to rate employees on those dimensions. However, the ratings on different dimensions are typically so highly correlated that most researchers have adopted a halo hypothesis about ratings. The extreme form of this hypothesis holds that in rating different dimensions of job performance, raters are essentially rating the dimension of overall performance over and over.

A researcher is studying halo in ratings of employee job performance. He defines 12 dimensions of job performance, and each employee is rated on each of those 12 dimensions. Each of the 12 dimensions is measured by a separate nine-item Likert scale.


For each rater, a 12 × 12 correlation matrix between dimensions is formed by correlating the ratings made by that rater judging the subordinates who work under that rater. The researcher found these matrices to be essentially identical in form for different raters, and so he averaged across raters to reduce the impact of the sampling error.

For every rater, the correlations between the rated work dimensions are quite large, with all correlations falling in the .70 to .80 range. When averaged across raters, the correlation matrix appeared to be flat, with an average correlation of .75. The researcher then considered the hypothesis that halo is so strong that judges make no distinction between dimensions at all. All nominal dimensions are really the same as the rater's overall evaluation.

If this extreme halo hypothesis were true, then the correlations between dimensions would differ from 1.00 only because of error of measurement. That is, if the extreme hypothesis were true, then once the correlations are corrected for the downward bias due to measurement error, the correlations would differ from 1.00 only by sampling error.

The researcher knows that in the personnel selection literature, the proper reliability for ratings by a single supervisor is the interrater reliability, that is, the correlation between ratings of workers made by different supervisors. For well-constructed multiple-item rating scales scored by adding across items, the average correlation between two raters is .47 (King et al., 1980; see also the aggregate study by Rothstein, 1990, which found an average correlation of .50). The researcher used this value as the interrater reliability to correct for attenuation.

When corrected for attenuation using the average interrater reliability of .47, the average correlation between dimensions of job performance rose from .75 to 1.60. This was quite shocking to the researcher.
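The arithmetic that shocked the researcher is simply the attenuation formula applied with .47 as the reliability of each rated dimension; a minimal sketch:

```python
# Both dimensions are corrected with the interrater reliability of .47.
r_obs, r_interrater = .75, .47
r_corrected = r_obs / (r_interrater * r_interrater) ** 0.5   # .75/.47, about 1.60: an impossible value
print(round(r_corrected, 2))
```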

Problem. The error in this analysis is a failure to define the concept of error of measurement in substantive terms. The definition of error of measurement used in the dimensionality research on ratings made by a single rater is different from the definition used in personnel selection. The corresponding definition of "reliability" will also differ.

In personnel selection research, the goal is to measure actual job performance. Research shows that supervisors are quite idiosyncratic in their evaluation of workers. This idiosyncrasy is the largest known source of error of measurement in using ratings to evaluate performance. Thus, as the first step in defining error of measurement, we define the "true score" for performance ratings to be the worker's average evaluation across a population of raters. One kind of error of measurement is the difference between the rating made by one supervisor and the consensus rating that would be obtained by averaging ratings made by the population of supervisors. This is the kind of error of measurement assessed by the interrater reliability.

In the dimensional research, the focus is on ratings made by a single supervisor. Because the researcher is studying the perceptual process in single raters, the idiosyncrasy of perception is part of the construct being studied. Because we are not trying to estimate true performance, the idiosyncrasy is not an error process in this study. Thus the interrater reliability is irrelevant to the issue of error of measurement in this research.

What is the relevant definition of error of measurement for ratings made by a single rater on a single dimension? Consider the problems that the rater has in responding to the items on the rating instrument. The rater must translate their memories, thoughts, and perceptions into a specific response to a specific question. Research in all areas has shown a large element of randomness in this translation process. This is random response error in the rating made on a specific item.

The researcher implicitly took this aspect of random response error into account by devising nine items for each dimension. Adding across the nine item responses is equivalent to averaging the individual-item random response errors. However, empirical research has shown that the size of the random response error to single items is so large that even averaging across nine responses will not eliminate the random error in the final scale score. The averaged error in the final scale score will still be large enough to require correction.

What is the relevant statistical definition of "random error" for this research, and how do we measure the relevant scale reliability? For each dimension, the rater is asked to consider nine items that are all considered to measure the same dimension. If this hypothesis is correct, then the nine responses to those nine items differ from the dimension score only by random response error. If this is true, then the relevant reliability for the dimension score is the usual Spearman-Brown reliability (or coefficient alpha or Kuder-Richardson 20 [KR-20]) for the dimension computed across the nine items. This scale reliability would be computed for each of the raters separately. The scale reliability for each rater could then be averaged across raters to reduce the impact of sampling error.

Consider the nine items for a given dimension and a given rater. Suppose the average correlation between the items for that rater is .39. The "internal consistency" reliability for the scale is computed using the Spearman-Brown formula with n equal to nine. So computed, the scale reliability for that scale for that rater is .85.

In principle, the average scale reliability across raters could be different for each dimension. This is usually not true, and for simplicity we assume that it is not true here. Assume that all dimensions have the same scale reliability of .85. We can then use that scale reliability to correct the correlation between dimensions. The average correlation between dimensions for a single rater in the research project was found to be .75. If we correct this using a reliability of .85 for each dimension, the corrected correlation is .75/.85 = .88. This correlation is less than the 1.00 predicted by the extreme halo hypothesis, though still far larger than the correlation between actual work dimensions (at least for some dimensions such as physical appearance vs. report writing, or either of these vs. peer rapport). This finding of a true-score correlation of .88 between ratings of different dimensions is similar to the average estimated true-score correlation in the 13 studies reviewed by King et al. (1980).
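The two computations in this scenario, the Spearman-Brown scale reliability and the corrected between-dimension correlation, can be reproduced directly; a minimal sketch:

```python
def spearman_brown(r_bar, n):
    """Reliability of an n-item scale given the average inter-item correlation r_bar."""
    return n * r_bar / (1 + (n - 1) * r_bar)

rel_dimension = spearman_brown(.39, 9)        # about .85 for each nine-item dimension scale
r_obs_between_dims = .75
r_true = r_obs_between_dims / rel_dimension   # .75/.85, about .88: below the 1.00 of extreme halo
print(round(rel_dimension, 2), round(r_true, 2))
```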

Scenario 12

Situation. A researcher is testing a theory that postulates that managerial motivation increases with age. The test of this hypothesis depends on the size of the true-score (construct level) correlation between managerial motivation and age. She uses the Incomplete Sentences Test (IST), a projective instrument, to measure managerial motivation. The responses of the managers to the IST are scored independently by two trained response coders; intercoder agreement (correlation) for total score is found to be .85. The researcher uses this reliability figure to correct the observed correlation (r = .20) between managerial motivation and age for attenuation due to measurement error. She finds that the corrected correlation (r = .22) is much smaller than predicted by the theory and concludes that this prediction from the theory is disconfirmed.

Problem. The procedure used here to estimate reliability is a commonly used one, but the resulting reliability estimate is not the appropriate one for testing the hypothesis under consideration. There are three sources of measurement error that bias the observed correlation downward from its true-score value in this study and that are not taken into account by this researcher.

First, there is purely random response error. This is similar to transient error (see Scenario 5) except that it varies within occasions as well as across occasions. The factors that produce random response error vary across moments within occasions. Administering the test twice on a single occasion would control for purely random response error but not for transient error. In this study, the two coders score exactly the same responses for each subject; hence the protocols that they evaluate contain exactly the same random response errors. Hence random response errors act to inflate the correlation between the two raters.

Second, another source of measurement error that inflates the .85 reliability estimate in this study is specific error. Specific error results from the fact that only one form of the IST is used. Each real or potential parallel form of the IST has some specific factor variance; that is, each form, in addition to measuring the construct of interest, also measures factors specific to that form; these factors result from the peculiarities of the content of that form. Specific factors are not part of the construct and hence are a form of measurement error. When two different parallel forms are correlated, the specific factors are assigned to measurement error in the resulting estimate of reliability because they are different in the two forms and do not correlate with each other and hence drop out.

Third, there is transient error. In this study, subjects respond to the IST on only one occasion; for each person, that particular moment in time is characterized by a certain mood, level of emotion, type of feeling, and so on. These factors are transient: A week later, or even a day later, they may be different, and hence the individual's responses may be somewhat different. The critical point is that these transient influences on the scores are not part of the construct (managerial motivation) that the researcher is trying to measure; they are a part of measurement error variance, not true variance.

Random response error and transient error can be controlled by administering the test to the subjects on two different occasions; the correlation between coders across occasions then provides an estimate of reliability that properly removes both transient error and purely random error from the estimate of construct variance (true variance). This more accurate reliability estimate is likely to be much smaller, as has been found to be the case for the employment interview (McDaniel, Whetzel, Schmidt, & Maurer, 1994). When two interviewers jointly interview each interviewee (i.e., both interviewers are present during the same interview), the average correlation between interviewers is .81. However, when the two interviewers interview the same applicants on two different occasions, the average correlation drops to .52. This is a large difference; apparently, about 29% of the variance of employment interview evaluation scores is due to random variations in questions and topics, random variations in interviewee responses, and perhaps transient errors. Only 19% (1.00 - .81) is due to scorer (coder) disagreement. The best estimate of the reliability of the typical employment interview is .52. The best way to increase this reliability is to have different interviewers interview candidates on different occasions and then average the interview scores across the occasions and interviewers.

The correlation between scorers across two occasions for the IST is the reliability for a score from one administration of the test. If the score to be used is the average across two occasions, the familiar Spearman-Brown formula can be used to compute this (higher) reliability.

Finally, specific error can be controlled by administering parallel forms of the IST on two occasions and computing the correlation between forms, across occasions, and across scorers. This estimate of reliability controls for all three types of measurement error.

These considerations are not just technicalities; they are critical to meaningful theory testing. For example, the study described here tested the hypothesis that there is a substantial true-score correlation between managerial motivation and age. The uncorrected correlation is .20, and use of the inflated .85 reliability estimate leads to a false estimate of .22 for the true-score correlation. The accurate estimate of the reliability of the IST scores, described above, would perhaps be around .40. This reliability figure would indicate that the true-score correlation is .32. This value might be large enough to lead the researcher to conclude that the hypothesis was supported, the opposite of her original conclusion.
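Because age is treated as measured without error, only the motivation measure is corrected here, and the choice of reliability estimate drives the conclusion; a minimal sketch of the two corrections just described:

```python
# Correct the observed motivation-age correlation with two different reliability values.
r_obs = .20
for reliability in (.85, .40):   # intercoder agreement vs. an estimate controlling all three error types
    print(reliability, round(r_obs / reliability ** 0.5, 2))   # .22 with .85; .32 with .40
```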

Scenario 13

Situation. A researcher examining a theory of mental ability wants a precise estimate of the true-score correlation between vocabulary (one facet of verbal ability) and three-dimensional spatial ability (one facet of perceptual ability). He administers his test of vocabulary on one occasion, and 1 month later he administers his test of three-dimensional spatial ability. The researcher believes that because there is a 1-month interval between the administrations, he should make his corrections for attenuation using the coefficient of stability (test-retest using the same test; Cronbach, 1947) estimated with a 1-month interval between administrations. These reliability figures are presented in the test manual for each scale. The group on which these reliability estimates were computed has the same test standard deviations as the researcher's group, so he uses them to correct for the bias induced by measurement error.

Problem. This procedure controls for both transient error and purely random measurement errors, but it fails to control for specific error. The researcher's reliability estimate is the correlation between the same form of the test administered on two different occasions. Because the same forms are used both times, the influence of specific factors on the scores will be the same on both occasions. They will correlate with each other and hence inflate the reliability estimate, overestimating the proportion of variance in (both sets of) the scores that is due to construct variance. As a result, this researcher undercorrects, and his estimate of the true-score correlation between vocabulary and spatial ability is downwardly biased. The correct estimate would be made using parallel forms of each test on each occasion. In some research domains, specific factor variance is quite large. For example, in the measurement of personality traits, it may account for 20% or more of score variance. In our research in the area of personality, we have observed that the correlation between the same form of a measure of a personality trait administered twice is often .20 or more higher than is the correlation between parallel forms of the measures (with the same time interval). In such domains, it is particularly important to control for specific factor error when conducting construct-based, theory-oriented research. In the measurement of specific abilities (such as verbal or spatial ability), specific factor error is usually smaller, typically accounting for 3% to 5% of score variance. However, if measures of different specific abilities are used and interpreted as measures of general mental ability, specific factor error variance in these measures of specific abilities in relation to the trait of general mental ability is larger, typically 10% to 20%. Hence specific error is important enough to require careful attention and control.

Scenario 14

Situation. A researcher is interested in the correlation between the constructs of mechanical ability and knowledge of electronics. She administers tests of each during a single session. On the basis of these data, she computes the KR-20 reliabilities of each test and the correlation between the tests. She then uses the KR-20 reliability estimates to correct the observed correlation and estimate the correlation between the constructs.

Problem. This procedure results in a close approximation to the correct answer. KR-20 reliability is the special case of coefficient alpha (Cronbach, 1951) that exists when item responses are dichotomous (i.e., items are scored correct or incorrect, agree or disagree, etc.). The KR-20 reliability formula assigns specific factor variance in items (and hence in the measure) to error; it also assigns purely random measurement errors to error (Cronbach, 1951). Transient error is not assigned to error in KR-20 reliability estimates, but transient error has repeatedly been found to be very small or nonexistent in the mental abilities domain. Thus when used to correct for the bias created by measurement error, the KR-20 reliability coefficient can be expected to generate accurate estimates of the true-score correlation.

Scenario 15

Situation. An applied researcher is developing a selection system for clerical workers at a large insurance company. He is aware of validity generalization findings from both civilian and military settings, showing that perceptual speed (clerical speed and accuracy) predicts performance in clerical jobs with substantial validity (Pearlman, Schmidt, & Hunter, 1980). He feels that the Minnesota Clerical Test is the best commercially published test of perceptual speed, but the company has asked him to develop its own measure. He reasons that he can have confidence in the scale he has developed if he can show that its true-score correlation with the Minnesota Clerical Test is 1.00 or nearly 1.00. Accordingly, he administers both his new perceptual speed test and the Minnesota Clerical Test to approximately 1,800 applicants to the company for clerical jobs and finds that the two tests correlate .81. He next computes the KR-20 reliabilities of each test on this same sample, obtaining a value of .96 for his test and .94 for the Minnesota Clerical Test. He then computes his estimate of the true-score correlation between the two tests as follows:

r_T = .81/√(.96 × .94) = .85.

He concludes that the true-score correlation of .85 departs too much from 1.00 to indicate that the two tests are measuring the same underlying construct. He states that because 15% of the true-score variance of each is not common to the other, the two tests are to some important extent measuring different constructs. He is puzzled by this finding, because the two tests appear to have very similar items and general content. Nevertheless, he decides that he must construct a second new test of perceptual speed that he hopes will show a larger true-score correlation with the Minnesota Clerical Test.

Problem. The true-score correlation of .85 is an underestimate. Tests of perceptual speed are highly speeded; in fact, they are often used as examples of the quintessential speed test. Perceptual speed tests are made up of very easy items, and it is usually assumed that with unlimited time every examinee could answer every question correctly. Hence they have very stringent time limits. Most perceptual speed tests used in clerical selection comprise name- and number-matching items. The examinee must, for example, indicate whether two names are exactly identical or slightly different. The KR-20 reliability estimate cannot be used for measures that are substantially speeded, because speeding biases the reliability estimate upward. To estimate the reliability of speed tests, one must administer separately timed forms and correlate these. This provides an estimate of parallel forms reliability. Alternately, one can administer separately timed halves of the test and correct the resulting correlation using the Spearman-Brown formula. Unlike the KR-20 estimates, these reliability estimates are not upwardly biased.

Informed of these facts, this researcher estimated the reliability of each test on a new group of 1,150 clerical applicants by administering separately timed halves and correcting their intercorrelation with the Spearman-Brown formula. The resulting reliability estimate was .79 for his test and .87 for the Minnesota Clerical Test. His estimate of the true-score correlation between the two tests was then

r_T = .81/√(.79 × .87) = .98.

The researcher then concluded that his new test of perceptual speed measured the same underlying construct as the Minnesota Clerical Test, and he proceeded to incorporate his test into his new clerical selection battery for the company.

Scenario 16

Situation. A researcher is studying the relationship between measures of the personality trait conscientiousness and evaluations of "organizational citizenship behaviors" on the job. Coefficient alpha for the self-report measure of conscientiousness is .86. He uses supervisory ratings to measure the frequency of citizenship behaviors. He has the same two supervisors rate each subject on 20 different citizenship behaviors; the sum across both raters of these standardized ratings is the total citizenship score used in the study. The ratings are made independently, and the average correlation between raters (interrater reliability) for total score is .48. The researcher uses this figure of .48 to correct the observed correlations for the effects of measurement error in the measure of citizenship behavior. He uses the coefficient alpha value of .86 to correct for unreliability in the measure of conscientiousness. The uncorrected correlation is found to be .35, and the correlation corrected for measurement error in the ratings is .54.

Problem. Assuming that transient error is nonexistent or minimal, the use of coefficient alpha for the measure of conscientiousness is correct. However, the use of .48 as the reliability of the ratings is in error. The value .48 is the reliability for ratings from one rater, while the researcher has used the sum (or average) of ratings from two independent raters. From the Spearman-Brown formula, the interrater reliability for this sum is 2(.48)/(1 + .48) = .65. When the observed correlation of .35 is divided by √(.65 × .86), the result is .47 rather than .54. Hence use of the improper interrater reliability inflated the true-score correlation estimate by 15%.

This researcher was testing a theory of conscientiousness and therefore was interested in construct-level relationships. Thus unbiased estimation required correction for measurement error in both variables. If the researcher's purpose had been to estimate the true validity of the conscientiousness measure as a selection device, there would be no correction for measurement error in the test, because observed scores from the test would have to be used in selection. The estimate of operational validity for the observed test scores is .35/√.65 = .43.
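The corrections in this scenario follow directly from the Spearman-Brown formula for the two-rater sum and the standard attenuation formula; a minimal sketch:

```python
alpha_consc = .86                                      # conscientiousness self-report alpha
r_one_rater = .48                                      # interrater reliability for a single rater
rel_two_raters = 2 * r_one_rater / (1 + r_one_rater)   # about .65 for the sum of two raters
r_obs = .35

r_true = r_obs / (rel_two_raters * alpha_consc) ** 0.5  # about .47 (construct-level estimate)
r_operational = r_obs / rel_two_raters ** 0.5           # about .43 (criterion corrected only)
print(round(rel_two_raters, 2), round(r_true, 2), round(r_operational, 2))
```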

Scenario 17

Situation. A researcher has developed her own measure of general mental ability for use in her research, and she has constructed two parallel forms of this measure. To estimate the reliability of these scales, she would like to administer the two parallel forms with 8 weeks between administrations but finds it impossible to arrange for this because of practical difficulties. However, she can arrange to have the test administered to a large group (of the appropriate composition) on one occasion, along with an older, well-established measure of general ability. On the basis of this administration, she finds that the two forms correlate .87. The KR-20 values for the two parallel forms are .86 and .88. One form correlates .88 with the established ability test, and the other correlates .86. Although these figures are very encouraging, she is disappointed that she was not able to obtain a direct estimate of the coefficient of equivalence and stability (Cronbach, 1947) by correlating the two forms over a time interval. She nevertheless concludes in her published article that the equivalence and stability reliability for her two forms is probably at least .85 and may be as high as .87.

Problem. On the basis of measurement theory alone, this conclusion may not be justified, because the estimates of reliability obtained by this researcher do not control for transient error or for any real changes in general intelligence that might occur over the 8-week period. However, on the basis of substantive cumulative knowledge from research on abilities, her conclusion is justified. It is a well-established fact that in the case of abilities and aptitudes, transient error (see Scenario 5) is small or nonexistent. It is also an established fact that no real changes in mental abilities occur over such short time periods. This can be inferred from the fact that in this domain estimates of the coefficient of equivalence—provided by KR-20 and parallel forms correlations at a single point in time—are usually very similar to estimates of the stability and equivalence reliability. That is, the insertion of a time interval, so long as it is of a reasonable length (e.g., 6 months or so for adults), does not reduce the reliability estimate below equivalence estimates.

This scenario illustrates the important fact that substantive research findings and cumulative knowledge can modify the interpretation of reliability estimates. Specifically, if enough evidence accumulates that a particular source of measurement error is negligible in a particular domain, then that source can subsequently be ignored for certain purposes. Research evidence indicates that transient error is apparently not very important in the abilities area, especially for general mental ability.

Scenario 18

Situation. A researcher is using a large primary database developed by a national polling organization to study the attitudes of the American public toward labor unions. He is particularly interested in the construct of general favorability (general evaluative attitude) and its relation to the value that people place on egalitarianism. The database contains six questions assessing the respondents' general evaluative attitude toward unions, and the reported coefficient alpha for these items is .83.

After examining these six items, the researcher concludes that they are all similar to each other and are in fact redundant with each other. To eliminate this redundancy in his own research, he decides to use only the best question: the item that he judges on the basis of content to have the greatest construct validity. He then correlates this measure of general favorability toward unions with the measure of egalitarianism and uses the alpha coefficients to correct these correlations for the downward bias created by measurement error in the measurement of attitudes. The value of egalitarianism is measured by a 10-item scale that has an alpha coefficient of .73, and the researcher uses this value in making the corrections for measurement error in the dependent variable. The researcher's hypothesis is that people who place a strong emphasis on the value of egalitarianism will have more favorable attitudes toward unions.

Problem. The reliability (coefficient alpha) used for the measure of general attitude toward unions is the reliability of the sum or average of the six questions. But the observed correlations are based on measurement of this attitude using only the "one best item." Hence the reliability estimate is too large, and the corrected correlations are downwardly biased estimates of the true-score correlation. Because the reliability coefficient is too large, it undercorrects the observed correlation.

What can be done to solve this problem? The Spearman-Brown formula can be used (in reverse) to compute the reliability of one item from the known reliability (.83) of six items, and this estimate could be used to correct the observed correlations. Although this procedure will produce unbiased estimates, it is not the optimal approach. The best approach would be to go back to the data and recompute the observed correlations using the sum or average of scores from all six questions. Because the observed correlations are then based on all available data, observed correlations will be larger and will have smaller standard errors. Because the summed scores that are based on all six questions have higher reliability, there will be a smaller correction, and hence the corrected correlation will have less sampling error. That is, both the observed and the corrected correlations will have less sampling error (and therefore a smaller standard error). In short, although the practice of dropping items, questions, or raters is seen with some frequency, it is bad practice to discard data. This researcher dropped five of the six questions, not because he thought they were invalid, but just to make data analysis ostensibly easier by eliminating "redundant" data. These items are not redundant; they increase the reliability of the measure used and thus contribute to increasing the stability of the final research findings.
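For completeness, the reverse Spearman-Brown computation mentioned above (the workable but suboptimal option) looks like the sketch below; the single-item value is an implication of the reported six-item alpha, not a figure given in the scenario:

```python
def single_part_reliability(rel_k, k):
    """Spearman-Brown in reverse: reliability of one part given the reliability of k parts."""
    return rel_k / (k - (k - 1) * rel_k)

rel_one_item = single_part_reliability(.83, 6)   # about .45 for a single attitude item
print(round(rel_one_item, 2))
```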

Scenario 19

Situation. A researcher seeks to test the hypothesis that organizational commitment and job involvement are distinct constructs. He seeks to demonstrate this by showing that the true-score correlation between measures of these two variables is less than 1.00. His argument is that, while they may be substantially—even highly—correlated, they are not perfectly correlated, even at the true-score level and even in large, representative, and heterogeneous groups of employees. He arranges to administer measures of both variables to a large group (total N = 1,983) of employees in a major service firm. On the basis of responses from the administration at one point in time, he computes coefficient alpha for each scale and uses these reliability values to correct the correlation between these measures computed in the total group of nearly 2,000 subjects. The corrected correlation is .82; on the basis of this finding, he concludes that organizational commitment and job involvement are similar but distinct constructs.

Problem. Coefficient alpha is an alternate-form reliability estimate. It treats each scale item as if that item were an alternate form of a test for the construct. This analytic strategy captures the effects of random response error and specific error but does not control the effects of transient error. That is, transient error is treated as part of true scores. If either organizational commitment or job involvement is subject to transient error, then coefficient alpha will overestimate the reliability of that measure. The corrected correlation would thus be underestimated. Thus the corrected correlation of .82 could be lower than the actual correlation between these constructs. If the true-score correlation is actually 1.00, then the conclusion this researcher reached would be false.

There are variables whose measurement has been found to be virtually unaffected by transient error; examples include aptitudes, abilities, and job performance ratings. For such variables, the reliability estimate used by this researcher would be a good way to handle error of measurement. However, it is not clear whether this is true of organizational commitment or job involvement.

There are those who believe that transient error can be ignored. They would say, "But if we want to know the person's organizational commitment on a given day or at a given moment, then transient variation would not be error and coefficient alpha would be the correct reliability estimate." However, examination of the phrase organizational commitment reveals that this criticism depends sharply on the substantive nature of the transient error process.

Consider one hypothesis of transient error that all investigators would regard as "error variance" if true. Suppose that organizational commitment feelings on a given day depend on the person's emotional state at that moment. In particular, suppose that commitment depends on the person's level of physical energy. Suppose that when people feel energetic, they look forward to work and have higher commitment feelings and that when they are sick and feel lethargic, they have lower commitment feelings toward work. Would any theorist really want to define the important construct of organizational commitment in such a way that commitment is lower if the person has a cold than if the person is well? Probably not. That is, it would be obvious that transient variation is measurement error and not part of the construct itself.

By contrast, there are constructs for which we are interested in day-by-day fluctuations. For such constructs, day-to-day instability is not an error process and would thus not contribute to the attenuation of correlations with other constructs, especially if the other construct is also defined on a day-by-day basis. For example, consider the correlation between sleep deprivation and negative mood. If a person says, "I'm irritable when I don't get enough sleep," that person is specifically talking about day-to-day fluctuations in both sleep level and mood state. In this case, the day-to-day fluctuations are not random error but rather the focus of the correlation.

On the other hand, consider the use of the item "Do you sleep well?" as a measure of anxiety on a personality scale. The use of this item stems from the known fact that people suffering from anxiety have trouble sleeping. If one is measuring the stable trait of anxiety, then the time reference for the item is "Generally speaking, do you sleep well?" If one asks instead "Did you sleep well last night?," then the day-to-day fluctuations in sleep will act as transient error in measuring anxiety.

In most contexts, transient error is usually due to "trivial" factors that we would want to exclude from our definition of the construct. Thus we cannot dismiss transient error as irrelevant without knowing the substantive nature of the transient error.

In summary, in this research scenario, if the measurement of organizational commitment and job involvement is subject to little or no transient error, then the procedures used by this researcher to estimate the true-score correlation between these two constructs are correct and the researcher's estimate of .82 is appropriate. If transient error is a significant factor in these measures, then the two reliabilities will be overestimates and the estimated true-score correlation will be an underestimate.

What research could be done to find out whether transient error is important here? Consider the organizational commitment scale. If parallel forms of that scale were administered a week or so apart, then the correlation between the two administrations would be an estimate of reliability in which transient variation is controlled for (i.e., is assigned to measurement error). If this value is the same (or nearly the same) as the coefficient alpha computed earlier by the researcher, then we know that transient error is unimportant in this measurement domain and can be ignored. In this case, we would conclude that the researcher's procedures were correct.

On the other hand, if the correlation between the two administrations is considerably smaller than coefficient alpha (say, .60 vs. .85), then we know that transient error is substantial here. We also know that the alpha coefficients are inflated by transient error and cannot be used in making the corrections for measurement error. Instead, the correlation between the two administrations of the parallel forms must be used in making the corrections. If this is the case, then the actual true-score correlation between organizational commitment and job involvement is larger than the .82 estimated by this researcher.

There are many areas like this one in which the research required to find out whether transient error is important has not been conducted. Researchers in these areas typically assume without any evidence that it is of no importance. This is bad research practice. Research of the kind described here is needed in these areas to provide the answers needed to improve research practices.

Scenario 20

Situation. An industrial-organizational psychologist develops a new type of employment interview, the Contextual-Behavioral Interview (CBI). Because the process of conducting this interview is complex, the first question she seeks to answer is whether even trained interviewers can conduct these interviews reliably. She is also concerned that the CBI might be suspected of being an orally administered intelligence test. She decides to proceed by correlating CBI total scores with a well-respected intelligence test and correcting this correlation for attenuation due to measurement error to estimate the correlation with the construct (trait) of intelligence. She trains two psychologists in the use of CBI interviews and then has them interview 243 job applicants. Both interviewers are present at each interview, and each alternately asks questions for different segments of the interview while the other observes. The interviewers tabulate their final interview scores independently (without discussion between them). The correlation between the two interviewers is .81, and on the basis of this finding, she concludes that the CBI interview has adequate reliability.

The correlation between the intelligence test and interview scores is .67 for each interviewer. The KR-20 reliability of the general mental ability test is .86. Her estimate of the true-score correlation between interview scores and intelligence is

r_T = .67/√(.81 × .86) = .80.

On the basis of these findings, she concludes that (a) CBI scores are reliable and (b) the CBI measures more than just mental ability. She states that her results indicate that 36% of the true-score variance of interview scores (1.00 - .80²) reflects factors independent of mental ability. Thus the CBI interview is more than just an orally administered ability test.

Problem. The estimate of reliability used for the CBI interview is inflated, leading to a downward bias in the estimated true-score correlation between the two constructs. As is explained in Scenario 12, much of the variance of employment interview scores consists of random response measurement error or transient error. On the basis of data from 210 samples, McDaniel et al. (1994) reported that the average correlation between interviewers sitting in on the same interview was .81, the value obtained here. However, when the interviewers conducted their interviews on different days, the average correlation was only .52. Thus random response error (and perhaps transient and specific factor error) causes considerable instability in interviewees' responses from occasion to occasion. These unstable influences on scores cannot be considered to be part of the construct or constructs measured by the interview; they must be considered error of measurement. The best available estimate of the real reliability of CBI scores is the .52 figure from McDaniel et al. (1994). Thus the CBI is probably not nearly as reliable as the researcher concluded. When .52 is used in the attenuation correction formula as the reliability of the CBI, the estimated true-score correlation is

r_T = .67/√(.52 × .86) = 1.00.

This suggests that the CBI really is nothing more than an orally administered test of general mental ability.
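The two corrections contrasted in this scenario differ only in which interrater value is treated as the reliability of the CBI; a minimal sketch:

```python
r_obs, rel_test = .67, .86        # interview-test correlation; KR-20 of the ability test
rel_cbi_same_occasion = .81       # both interviewers present at the same interview
rel_cbi_diff_occasions = .52      # interviews conducted on different occasions (McDaniel et al., 1994)

print(round(r_obs / (rel_cbi_same_occasion * rel_test) ** 0.5, 2))   # about .80 (downwardly biased)
print(round(r_obs / (rel_cbi_diff_occasions * rel_test) ** 0.5, 2))  # about 1.00
```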

Scenario 21

Situation. A researcher is concerned that industrial-organizational psychology has drifted away from its previous focus on real job behaviors and has come to rely too heavily on ratings and questionnaire measures (often self-reports) as a substitute for actual behavior on the job. She believes that research should be based more heavily on observation of actual job behavior than is the case today. She argues that such measures would not only be more valid, they would also be more reliable. She reported the following research to support the hypothesis that reliability would be higher. She assigned pairs of trained observers to small groups of workers; these observers recorded eight different categories of actual worker behaviors for a period of 4 hr. Across the behavior categories, the average correlation between the two observers was found to be .89. On the basis of this finding, she concluded that focusing on actual job behaviors leads to much higher reliabilities than can be obtained with supervisory ratings of job performance.

Problem. The reported figure of .89 is intercoder reliability, often referred to as conspect reliability or scorer reliability. Some of the problems associated with this form of reliability were discussed in Scenario 12. For purposes of the present scenario, the critical problem is that scorer reliability fails to control for random response error. The two observers in this research coded the behavior of the same workers at the same time. They appear to have coded these behaviors quite accurately, but these behaviors themselves might be quite unstable from day to day. On the basis of findings in other areas, we can predict that if these two observers had coded the behaviors on different days, the correlation between their evaluations would have been considerably below the reported .89, perhaps in the neighborhood of the average correlation between job performance ratings by two supervisors (.50; Rothstein, 1990).

Scenario 22

Situation. A psychologist seeks to introduce "multiskilling" for two jobs performed on the production floor. His goal is to have each employee currently performing Job A learn to perform Job B, and vice versa, so that any worker can fill in for people in either job when necessary. At present, employees in each job have observed the other job but not performed it. Before the cross-training, he decides to collect baseline data on how much job knowledge each employee currently has for both jobs, so he constructs and administers content-valid job knowledge tests for both jobs to all employees on the two jobs. Because knowledge of either job depends on cognitive ability, he hypothesizes that those employees whose job knowledge is highest for their own job will also tend to have acquired more knowledge about the other job (even though they have only observed the other job and have not done it). To test this hypothesis, he correlates scores on the two tests, finding a correlation of .52 for the Job A incumbents. The correlation for the Job B incumbents is also .52. He knows that the real relationship is stronger than this because unreliability in his two job knowledge tests has reduced this correlation somewhat. To estimate the real relationship, he computes the KR-21 reliabilities of each test and corrects for measurement error:

r_T = .52/√(.81 × .79) = .65.

Problem. The .65 is an upwardly biased estimate of the true extent of the relationship, because the KR-21 reliability formula underestimates the reliability of the job knowledge tests. The KR-21 formula, unlike the KR-20 formula, assumes that all test items are equal in difficulty (i.e., are all answered correctly by the same percentage of examinees). This is rarely true in any domain, and it is almost never true for job knowledge tests. Job knowledge tests almost always contain a wide range of item difficulties. When items vary in difficulty, KR-21 underestimates reliability, leading to overcorrections for measurement error. Told about this, the researcher computed the KR-20 reliabilities and used these to correct the observed correlation:

r_T = .52/√(.85 × .84) = .61.

Thus the appropriate (unbiased) estimate of the true-score correlation is .61. The .65 obtained earlier was an overestimate by almost 7%.

Computing the KR-20 reliability estimate requires access to item-level data; one needs to know the difficulty (or endorsement rate) for each item. This information is not required for KR-21 (only the average item difficulty need be known). KR-21 is sometimes used because item-level data are not available. In such cases, the researcher should note that the KR-21 estimates are underestimates of reliability.
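A short sketch of the two formulas, applied to a simulated 0/1 item-response matrix (the data here are hypothetical and for illustration only), shows why KR-21 falls below KR-20 whenever item difficulties vary:

```python
import numpy as np

def kr20(items):
    """KR-20: k/(k-1) * (1 - sum of item variances p*q / total-score variance); items = 0/1 matrix."""
    k = items.shape[1]
    p = items.mean(axis=0)
    return k / (k - 1) * (1 - (p * (1 - p)).sum() / items.sum(axis=1).var(ddof=1))

def kr21(items):
    """KR-21: the same formula, but every item is assumed to have the mean difficulty."""
    k = items.shape[1]
    p_bar = items.mean()
    return k / (k - 1) * (1 - k * p_bar * (1 - p_bar) / items.sum(axis=1).var(ddof=1))

# Simulated job knowledge test: 40 items with a wide spread of difficulties, 300 examinees.
rng = np.random.default_rng(1)
ability = rng.standard_normal(300)
difficulty = rng.uniform(-2, 2, 40)
items = (ability[:, None] + rng.standard_normal((300, 40)) > difficulty).astype(int)

print(round(kr20(items), 2), round(kr21(items), 2))   # KR-21 is the smaller of the two
```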

Measurement Error in More Complex Cases

Scenario 23

Situation. A researcher is attempting to validate predictors of voluntary absenteeism from work, which he measures over a period of 1 year. He finds that the year-to-year correlation of absenteeism is .70; that is, the reliability of absenteeism over 1 year is .70. However, the researcher decides not to correct the observed validity coefficients for unreliability in the absenteeism criterion. He argues that the departure of the .70 from 1.00 does not reflect error of measurement but rather is due to real changes in the behavior. That is, he reasons that the .70 merely reflects reality. He states that corrected validities would be overestimates, because in the real world there is no such thing as perfectly stable (perfectly reliable) absenteeism behavior.

Problem. This researcher argues that the instability observed in the measure of absenteeism over 1 year is due to real (relative) change among employees and not to measurement error. However, he presents no evidence to support this contention, and therefore his statement is merely a hypothesis. There is an empirical test that can be applied to determine whether there has been real change. This test requires that absenteeism be measured at three points in time. For example, we could measure absenteeism in each of 3 subsequent years. If no real change has taken place, then, if there were no sampling error, we would expect that r_13 = r_12 = r_23. For example, if we find that r_12 = .70, r_13 = .70, and r_23 = .70, the data would indicate that no real change has taken place. This would indicate that all of the instability in the measure is due to measurement error. This means that the researcher's hypothesis is wrong and that his validity coefficient should be corrected using his .70 reliability figure (obtained for a 1-year period).

If employees' relative standing on absenteeism is really changing over time, then we expect to find that r_13 < r_12 and r_13 < r_23. For example, suppose we find that r_13 = .61, while r_12 = .70 and r_23 = .70. This finding suggests that some real change in absenteeism may be occurring over time. However, this finding does not imply that all of the instability of absenteeism scores is due to real change. The departure of the researcher's .70 figure from 1.00 is to some extent due to purely random errors of measurement: failures to record every absence, absences by one employee erroneously recorded for another employee, involuntary absences mistakenly recorded as voluntary absences, and so on. These measurement errors are artifacts, and their effect is to downwardly bias the observed validity. A correction should be made for this bias. Under these conditions of real change, the reliability for Time Period 2 is

r_22 = (r_12 × r_23)/r_13 = .70(.70)/.61 = .80.

This reliability coefficient should be used in making the correction for measurement error. Notice that this reliability is higher than the .70 that holds for a situation in which there is no real change over time.
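A minimal sketch of the three-wave check just described, using the hypothetical year-to-year correlations from the text and the Time Period 2 reliability formula given above:

```python
# Correlations among absenteeism measured in three successive years (values from the text).
r12, r23, r13 = .70, .70, .61

if abs(r13 - r12) < .02 and abs(r13 - r23) < .02:   # tolerance is arbitrary; sampling error ignored
    print("No evidence of real change; use the 1-year stability as the reliability.")
else:
    rel_time2 = r12 * r23 / r13                     # reliability of the Time 2 measure given real change
    print(f"Real change indicated; Time 2 reliability = {rel_time2:.2f}")   # .80
```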

Since real change has taken place, the true validity could be different for different periods. Therefore, observed validities should be computed separately for each time period and then corrected using the reliability computed for that time period. If the estimated true validities are essentially identical for each time period, then we conclude that the changes over time in absenteeism do not affect the validity of the predictor for predicting absenteeism.

Finally, we note in passing that real change over time in individuals' relative standing on a trait or other dimension has no necessary effect, positive or negative, on the correlation between that dimension and another variable or measure (cf. Ackerman, 1989; Rogosa & Willett, 1985; Schmidt, Ones, & Hunter, 1992, pp. 645-646). This means that real change over time in job performance does not necessarily reduce the validity of predictors of job performance.

Scenario 24

Situation. A social psychologist testing a theory of personal values hypothesizes that each of 10 different "peripheral" values is positively related to the "core" value of individual freedom. Each of these 11 values is measured by its own nine-item Likert scale; each of these 11 scales has a coefficient alpha reliability of .75 or higher. The researcher computes the correlations between the 10 peripheral values and the core value in his research sample. However, his statistical analysis shows that only 2 of the 10 correlations are statistically significant. His interpretation of this is that 8 of the peripheral values are apparently unrelated to the core value and that the theory is therefore not supported. However, after some thought the researcher concludes that in the context of the theory under study, it is really the true-score correlations that are important, not the observed correlations. He then corrects the 10 correlations for measurement error. Applying the same significance test, he finds that the corrected correlations are all significant beyond the .01 level. He concludes that these findings of significant relationships between peripheral values and the core value provide strong support for his theory.

Problem. The conventional statistical significance tests applied to the estimated true-score correlations are not legitimate, because the correction for attenuation increases the standard error of the correlation coefficients. As a general rule, corrections for measurement error cannot make a correlation, or other zero-order statistic, statistically significant if it is not statistically significant before the correction. For example, correcting a correlation coefficient from .25 to .35 increases it by 40%. However, its standard error also increases by 40%, so its level of statistical significance (p value) remains constant. If the appropriate standard error for the corrected coefficient is computed and used in the significance test, that significance test is statistically correct. However, it yields the same outcome as the conventional significance test conducted on the observed (uncorrected) correlation. Hence in this study, only the significance tests conducted on the uncorrected correlations are legitimate. In this analysis, only 2 of the 10 correlations were statistically significant.

However, this does not mean that the other 8 are zero or are best considered to be zero. The operational rule of thumb in the social sciences has long been, "If it is not significant, it is zero." However, this rule of thumb is erroneous and leads to frequent errors in data interpretation (Schmidt, 1992, 1994), especially in situations of low statistical power. The researcher in this case appears to follow this erroneous rule of thumb. He seems to assume that his nonsignificant correlations indicate the absence of a relationship. This researcher should look at the magnitude of the estimated true-score correlations. Although the significance test he conducted is erroneous, the estimated true-score correlations themselves are unbiased estimates of the magnitude of the relationships. If these correlations are small, the theory is indeed not supported, but if they are substantial, that finding would support the theory. The absence of statistical significance is in that case probably due to low statistical power. Further research can then test this hypothesis.

Scenario 25

Situation. A researcher seeks to determine whether two different measures of satisfaction with pay are interchangeable. The manuals for each of these instruments provide estimates of coefficient alpha computed on large samples (Ns > 1,500) of appropriate composition, and the alphas are similar in magnitude. She administers both measures to a sample of 31 employees, correlates them, and uses the alphas from the manuals to correct this correlation to estimate the construct-level correlation. To her surprise, the corrected correlation is 1.23. She decides that there might be serious problems with attenuation corrections and vows never to make such corrections in her research again.

Problem. The problem here is not that the wrong reliability estimates were used to correct for bias induced by measurement error; assuming no transient error, the use of coefficient alpha is correct. Rather, this is a case of failure to understand sampling error. When the sample is as small as N = 31 (and even for much larger Ns), sampling error will often cause the observed correlation to depart from its population value by a substantial amount. (Sampling errors are much larger than most researchers realize. Examples illustrating this point can be found in Hunter & Schmidt, 1990, chaps. 1 & 2; Schmidt, 1992; and Schmidt, Ocasio, Hillery, & Hunter, 1985.) If the observed correlation happens to be a low random bounce, the corrected correlation will be underestimated. In this case, it might have been .90. If this had happened, this researcher would probably not have been shocked. However, if the observed correlation happens to be a high random bounce, the estimate of the true-score correlation can be greater than 1.00. If the true-score population correlation is actually 1.00 (as is probably the case here), this is to be expected about half the time. That is, purely because of ordinary sampling error, about 50% of the sample estimates of the true-score correlation should be above 1.00. If the actual true-score population correlation is lower, say .80, then the expected percentage of sample estimates above 1.00 is less than 50%. In either case, such sample estimates pose no problem. Since the correlation cannot exceed 1.00 by definition, such estimates are simply rounded down to 1.00. As we noted in the introduction to this article, the correction for attenuation, properly applied, is perfectly accurate in the population (i.e., when N = ∞). When N is less than infinite, corrected correlations, like uncorrected correlations, are estimates and contain sampling error. Sampling error can cause computed estimates to be larger than 1.00.
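A small Monte Carlo sketch illustrates the point; the two reliability values are assumed for illustration (they are not given in the scenario), and the population true-score correlation is set to 1.00:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha_x, alpha_y = 31, .90, .88      # sample size from the scenario; alphas are assumed values
n_sims = 10_000
over_one = 0

for _ in range(n_sims):
    t = rng.standard_normal(n)                                     # shared true score (true-score r = 1.00)
    x = np.sqrt(alpha_x) * t + np.sqrt(1 - alpha_x) * rng.standard_normal(n)
    y = np.sqrt(alpha_y) * t + np.sqrt(1 - alpha_y) * rng.standard_normal(n)
    r_obs = np.corrcoef(x, y)[0, 1]
    over_one += (r_obs / np.sqrt(alpha_x * alpha_y)) > 1.0          # corrected estimate above 1.00?

print(over_one / n_sims)   # roughly .5, as the text predicts
```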

Scenario 26

Situation. A researcher hypothesizes that there is a positive relation between job satisfaction and the construct of "role adjustment." She obtains measures of both constructs on a sample of 243 appliance salespeople. The observed correlation is .25, and the 95% confidence interval (CI) is .13 to .37. Using coefficient alpha reliabilities for each scale, she corrects the observed correlation for measurement error:

r_T = .25/√(.66 × .73) = .36.

She concludes that her hypothesis is supported. She states that the best estimate of the actual relationship is .36 but that the true value could be as low as .13 or as high as .37.

Problem. The researcher is correct in stating that the best point estimate of the true-score correlation is .36. However, her statement about the confidence interval is erroneous. The confidence interval used applies to the uncorrected correlation; it does not apply to the corrected correlation. The confidence interval for the corrected correlation can be estimated by correcting the endpoints of the confidence interval for the observed correlation:

Lower CI endpoint = .13/(.66 × .73)^(1/2) = .19,
Upper CI endpoint = .37/(.66 × .73)^(1/2) = .53.

Thus the correct statement is that the best estimate of the true-score correlation is .36 but that the 95% confidence interval is from .19 to .53.
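For concreteness, the arithmetic of this correction can be verified with a short sketch (in Python) that uses only the values reported in the scenario:

```python
import math

# Values reported in Scenario 26: observed r = .25, 95% CI [.13, .37],
# and coefficient alphas of .66 and .73 for the two scales.
r_obs, ci_lo, ci_hi = 0.25, 0.13, 0.37
alpha_x, alpha_y = 0.66, 0.73

attenuation = math.sqrt(alpha_x * alpha_y)        # (r_xx * r_yy)^(1/2)
r_corrected = r_obs / attenuation                 # point estimate of the true-score r
ci_corrected = (ci_lo / attenuation, ci_hi / attenuation)  # corrected CI endpoints

print(f"Corrected r = {r_corrected:.2f}")                                    # .36
print(f"Corrected 95% CI = ({ci_corrected[0]:.2f}, {ci_corrected[1]:.2f})")  # (.19, .53)
```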

Concluding Remarks

These scenarios involving measurement error problems in psychological research reflect real situations that we have encountered in our own research and that of others. In addition, they reflect the situations we have encountered most frequently. Because of this, we feel that they are more realistic, informative, and useful than abstract psychometric discussions of measurement theory and reliability principles. Over the years, many bad methodological practices have retarded progress in the development of cumulative knowledge in psychology and the other social sciences (Hunter & Schmidt, 1990). Failure to attend properly to measurement error is perhaps not the worst of these; that honor probably goes to the practice of relying on statistical significance tests in interpreting research data (Hunter & Schmidt, 1990; Schmidt, 1992, 1994). But measurement error is at least second in importance. Research progress in theory development and cumulative knowledge is impossible without careful attention to problems created by measurement error. We believe that these real-life, concrete examples provide useful guides to working researchers grappling with the real-world complexities of their own research programs.

References

Ackerman, P. L. (1989). Within-task intercorrelations of skill performance: Implications for predicting individual differences? Journal of Applied Psychology, 74, 360-364.

Borenstein, M. (1994). The case for confidence intervals in controlled clinical trials. Controlled Clinical Trials, 15, 411-428.

Campbell, A., Converse, P. E., Miller, W. E., & Stokes, D. E. (1960). The American voter. New York: Wiley.

Campbell, J. P. (1990). Modeling the performance prediction problem in industrial and organizational psychology. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (Vol. 1, 2nd ed.). Palo Alto, CA: Consulting Psychologists Press.

Cronbach, L. J. (1947). Test reliability: Its meaning and determination. Psychometrika, 12, 1-16.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York: Macmillan.

Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.

Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research findings across studies. Newbury Park, CA: Sage.

King, L. M., Hunter, J. E., & Schmidt, F. L. (1980). Halo in a multidimensional forced-choice performance evaluation scale. Journal of Applied Psychology, 65, 507-516.

Klimoski, R. J. (1993). Predictor constructs and their measurement. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 99-134). San Francisco: Jossey-Bass.

Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181-1209.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. (1994). The validity of employment interviews: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79, 599-616.

Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict job proficiency and training criteria in clerical occupations. Journal of Applied Psychology, 65, 373-407.

Rogosa, D., & Willett, J. B. (1985). Satisfying a simplex structure is simpler than it should be. Journal of Educational Statistics, 10, 99-107.

Rothstein, H. R. (1990). Interrater reliability of job performance ratings: Growth to asymptote with increasing opportunity to observe. Journal of Applied Psychology, 75, 322-327.

Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47, 1173-1181.

Schmidt, F. L. (1993). Personnel psychology at the cutting edge. In N. Schmitt & W. Borman (Eds.), Personnel selection in organizations (pp. 497-516). San Francisco: Jossey-Bass.

Schmidt, F. L. (1994, August). Quantitative methods and cumulative knowledge in psychology: Implications for the training of researchers. Division 5 presidential address presented at the Annual Convention of the American Psychological Association, Los Angeles, CA.

Schmidt, F. L., & Hunter, J. E. (1992). Development of a causal model of processes determining job performance. Current Directions in Psychological Science, 1, 89-92.

Schmidt, F. L., Hunter, J. E., & Caplan, J. R. (1981). Validity generalization results for two jobs in the petroleum industry. Journal of Applied Psychology, 66, 261-273.

Schmidt, F. L., Ocasio, B. P., Hillery, J. M., & Hunter, J. E. (1985). Further within-setting empirical tests of the situational specificity hypothesis in personnel selection. Personnel Psychology, 38, 509-524.

Schmidt, F. L., Ones, D. S., & Hunter, J. E. (1992). Personnel selection. Annual Review of Psychology, 43, 627-670.

Schmitt, N., & Landy, F. (1993). The concept of validity. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 275-309). San Francisco: Jossey-Bass.

Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149-155.

Stanley, J. C. (1971). Reliability. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 356-442). Washington, DC: American Council on Education.

Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement (pp. 560-620). Washington, DC: American Council on Education.

Traub, R. E. (1994). Reliability for the social sciences. Newbury Park, CA: Sage.

Received February 21, 1995
Revision received June 19, 1995

Accepted June 28, 1995