

Journal of Applied Psychology 1979, Vol. 64, No. 6, 609-626

Impact of Valid Selection Procedures on Work-Force Productivity

Frank L. Schmidt
U.S. Office of Personnel Management, Washington, D.C., and George Washington University

John E. Hunter
Michigan State University

Robert C. McKenzie and Tressie W. Muldrow
U.S. Office of Personnel Management, Washington, D.C.

Decision theoretic equations were used to estimate the impact of a valid test (the Programmer Aptitude Test; PAT) on productivity if it were used to select new computer programmers for one year in (a) the federal government and (b) the national economy. A newly developed technique was used to estimate the standard deviation of the dollar value of employee job performance, which in the past has been the most difficult and expensive item of required information to estimate. For the federal government and the U.S. economy, separately, results are presented for different selection ratios and for different assumed values for the validity of previously used selection procedures. The impact of the PAT on programmer productivity was substantial for all combinations of assumptions. The results support the conclusion that hundreds of millions of dollars in increased productivity could be realized by increasing the validity of selection decisions in this occupation. Likely similarities between computer programmers and other occupations are discussed. It is concluded that the impact of valid selection procedures on work-force productivity is considerably greater than most personnel psychologists have believed.

Questions concerning the economic and productivity implications of valid selection procedures increasingly have come to the fore in industrial-organizational psychology. Dunnette and Borman's chapter in the Annual Review of Psychology (in press) includes, for the first time, a separate section on the utility and productivity implications of selection methods. This development is due at least in part to the emphasis placed on the practical utility of selection procedures in some of the litigation in recent years involving selection tests. Hunter and Schmidt (in press) have contended, on the basis of a review of the empirical literature on the economic utility of selection procedures, that personnel psychologists have typically failed to appreciate the magnitude of productivity gains that result from use of valid selection procedures. The major purpose of this study is to illustrate the productivity (economic utility) implications of a valid selection procedure in the occupation of computer programmer in the federal government and in the economy as a whole. However, to set the stage we first review developments related to selection utility.

The opinions expressed herein are those of the authors and do not necessarily reflect official policy of the U.S. Office of Personnel Management.

We would like to thank Lee J. Cronbach and Hubert E. Brogden for their comments on an earlier version of this article.

Requests for reprints should be sent to Frank L. Schmidt, who is at Personnel Research and Development Center, U.S. Office of Personnel Management, 1900 E Street, N.W., Washington, D.C. 20415.

In the Public Domain

History and Development of Selection Utility Models

The evaluation of benefit obtained from selection devices has been a problem of continuing interest in industrial psychology. Most attempts to evaluate benefit have focused on the validity coefficient, and at least five approaches to the interpretation of the validity coefficient have been advanced over the years. The oldest of these is the index of forecasting efficiency (E): E = 1 − √(1 − r_xy²), where r_xy is the validity coefficient. This index compares the standard error of job performance scores predicted by means of the test (the standard error of estimate) to the standard error that results when there is no valid information about applicants and one predicts the mean level of performance for everyone (the standard deviation of job performance). The index of forecasting efficiency was heavily emphasized in early texts (Hull, 1928; Kelley, 1923) as the appropriate means for evaluating the value of a selection procedure. This index describes a test correlating .50 with job performance as predicting with a standard deviation of estimate errors only 13% smaller than chance, a very unrealistic and pessimistic interpretation of the test's economic value.
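The 13% figure can be verified directly from the definition of E; a minimal sketch in Python (the function name is illustrative, not from the article):

```python
import math

def forecasting_efficiency(r_xy):
    """Index of forecasting efficiency: the proportional reduction in the
    standard error of estimate relative to simply predicting the mean."""
    return 1.0 - math.sqrt(1.0 - r_xy ** 2)

# A test correlating .50 with job performance shrinks prediction error
# by only about 13%.
print(round(forecasting_efficiency(0.50), 3))  # → 0.134
```

Only at very high validities does E approach 1, which is why this index makes even good tests look nearly worthless.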

The index of forecasting efficiency was succeeded by the coefficient of determination, which became popular during the 1930s and 1940s. The coefficient of determination is simply the square of the validity coefficient, or r_xy². This coefficient was referred to as "the proportion of variance in the job performance measure accounted for" by the test. The coefficient of determination describes a test of validity of .50 as "accounting for" 25% of the variance of job performance. Although r_xy² is still occasionally referred to by selection psychologists (and has surfaced in litigation on personnel tests), the "amount of variance accounted for" has no direct relationship to productivity gains resulting from use of a selection device.

Both E and r_xy² lead to the conclusion that only tests with relatively high correlations with job performance will have significant practical value. Neither of these interpretations recognizes that the value of a test varies as a function of the parameters of the situation in which it is used. They are general interpretations of the correlation coefficient and have been shown to be inappropriate for interpreting the validity coefficient in selection (Brogden, 1946; Cronbach & Gleser, 1965, p. 31; Curtis & Alf, 1969).

The well-known interpretation developed by Taylor and Russell (1939) goes beyond the validity coefficient itself and takes into account two properties of the selection problem: the selection ratio (the proportion of applicants hired) and the base rate (the percentage of applicants who would be "successful" without use of the test). This model yields a much more realistic interpretation of the value of selection devices. The Taylor-Russell model indicates that even a test with a modest validity can substantially increase the percentage who are successful among those selected when the selection ratio is low. For example, when the base rate is 50% and the selection ratio is .10, a test with validity of only .25 will increase the percentage among the selectees who are successful from 50% to 67%, a gain of 17 additional successful employees per 100 hired.
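The 50%-to-67% figure can be reproduced by numerical integration under the bivariate-normal test/criterion model that underlies the Taylor-Russell tables; a sketch (function and variable names are ours):

```python
import math
from statistics import NormalDist

N = NormalDist()  # standard normal

def taylor_russell(validity, base_rate, selection_ratio, steps=4000):
    """Success rate among selectees, assuming test and criterion are
    bivariate normal (the model behind the Taylor-Russell tables)."""
    x_cut = N.inv_cdf(1 - selection_ratio)   # test cut score
    y_cut = N.inv_cdf(1 - base_rate)         # criterion "success" cutoff
    s = math.sqrt(1 - validity ** 2)         # SD of criterion given test score
    # P(X > x_cut, Y > y_cut) = ∫ pdf(x) · P(Y > y_cut | X = x) dx,
    # integrated by Simpson's rule over x in [x_cut, x_cut + 10]
    h = 10 / steps
    total = 0.0
    for i in range(steps + 1):
        x = x_cut + i * h
        w = 1 if i in (0, steps) else (4 if i % 2 else 2)  # Simpson weights
        total += w * N.pdf(x) * (1 - N.cdf((y_cut - validity * x) / s))
    p_both = total * h / 3
    return p_both / selection_ratio

print(round(taylor_russell(0.25, 0.50, 0.10), 2))  # ≈ .67, matching the tables
```

Raising the validity or lowering the selection ratio raises the success rate, exactly as the tables show.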

Although an improvement, the Taylor-Russell approach to determining selection utility does have disadvantages. Foremost among them is the need for a dichotomous criterion. Current employees and new hires must be sorted into an unrealistic 2-point distribution of job performance: "successful" and "unsuccessful" (or "satisfactory" and "unsatisfactory"). As a result, information on levels of performance within each group is lost (Cronbach & Gleser, 1965, pp. 123-124, 138). All those within the "successful" group, for example, are implicitly assumed to be equal in value, whether they perform in an outstanding manner or barely exceed the cutoff. This fact makes it difficult to express utility in units that are comparable across situations.

A second disadvantage of the Taylor-Russell model results from the fact that the decision as to where to draw the line to create the dichotomy in job performance is arbitrary. Objective information on which to base this decision is rarely available, and thus different individuals may draw the line at very different points. This state of affairs creates a problem because the apparent usefulness of the selection procedure depends on where the line is drawn. For example, suppose both the selection ratio and the validity are .50. If the dichotomy is drawn so that 90% of non-test-selected employees are assigned to the successful category, the test raises this figure to 97%, a gain of only seven successful employees per 100 hired, or an 8% increase in the success rate. However, if the dichotomy is drawn so that 50% are considered successful, this same test raises the percentage successful to 67, a gain of 17 successful employees per 100 hired, or a 34% increase in the success rate. Finally, if the line is drawn so that only 10% of employees are considered successful, use of the test raises this figure to 17%. Here the gain is 7 additional successful employees per 100 hired, as it was when 90% were considered successful. But here the percentage increase in the success rate is 70% rather than 8%. Thus the Taylor-Russell approach appears to give different answers to the question of how useful a test is, depending on where the arbitrary dichotomy is drawn.

The next major advance was left to Brogden (1949), who used the principles of linear regression to demonstrate how the selection ratio (SR) and the standard deviation of job performance in dollars (SD_y) affect the economic utility of a selection device. Despite the fact that Brogden's derivations are a landmark in the development of selection utility models, they are straightforward and simple to understand.

Let r_xy = the correlation between the test (x) and job performance measured in dollar value. The basic linear model is

Y = β·Z_x + μ_y + e,

where Y = job performance measured in dollar value; β = the linear regression weight on test scores for predicting job performance; Z_x = test performance in standard score form in the applicant group; μ_y = mean job performance (in dollars) of randomly selected employees; and e = error of prediction. This equation applies to the job performance of an individual. The equation that gives the average job performance for the selected (s) group (or for any other subgroup) is

E(Y_s) = β·E(Z_xs) + μ_y + E(e).

Since E(e) = 0, and β and μ_y are constants, this becomes

Ȳ_s = β·Z̄_xs + μ_y.

This equation can be further simplified by noting that β = r_xy(SD_y/SD_x), where SD_y is the standard deviation of job performance measured in dollar value among randomly selected employees. Since SD_x = 1.00, β = r_xy·SD_y. We thus obtain

Ȳ_s = r_xy·SD_y·Z̄_xs + μ_y.

This equation gives the absolute dollar value of average job performance in the selected group. What is needed is an equation that gives the increase in dollar value of average performance that results from using the test. That is, we need an equation for marginal utility. Note that if the test were not used, Ȳ_s would be μ_y. That is, mean performance in the selected group would be the same as mean performance in a group selected randomly from the applicant pool. Thus the increase due to use of a valid test is r_xy·SD_y·Z̄_xs. The equation we want is produced by transposing μ_y to give

Ȳ_s − μ_y = r_xy·SD_y·Z̄_xs.

The value on the right in the above equation is the difference between mean productivity in the group selected using the test and mean productivity in a group selected without using the test, that is, a group selected randomly. The above equation thus gives mean gain in productivity per selectee resulting from use of the test, that is,

ΔU/selectee = r_xy·SD_y·Z̄_xs,   (1)

where U is utility and ΔU is marginal utility. Equation 1 states that the average productivity gain in dollars per person hired is the product of the validity coefficient, the average standard score on the test of those hired, and the SD of job performance in dollars. The value r_xy·Z̄_xs is the mean standard score on the dollar criterion of those selected, Z̄_ys. Thus utility per selectee is the mean Z score on the criterion of those selected times the standard deviation of the criterion in dollars. The only assumption that Equation 1 makes is that the relation between the test and job performance is linear. If we further assume that the test scores are normally distributed, the mean test score of those selected is φ/p, where p = the selection ratio and φ = the ordinate of the standard normal density N(0, 1) at the point of cut corresponding to p. Thus Equation 1 can be written

ΔU/selectee = r_xy·(φ/p)·SD_y.   (2)
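The term φ/p in Equation 2 is just the mean standard test score of the selected group, which can be confirmed by simulation; a brief sketch (names are ours):

```python
import random
from statistics import NormalDist

N = NormalDist()  # standard normal

def mean_score_of_selectees(p):
    """φ/p: the expected standard test score of those hired when the top
    proportion p of a normally distributed applicant pool is selected."""
    cut = N.inv_cdf(1 - p)     # cut score with P(Z > cut) = p
    return N.pdf(cut) / p

print(round(mean_score_of_selectees(0.10), 3))  # → 1.755

# Cross-check against a direct simulation of top-10% selection
random.seed(1)
sample = sorted(random.gauss(0, 1) for _ in range(100_000))
top_decile = sample[-10_000:]
print(round(sum(top_decile) / len(top_decile), 2))  # close to 1.75
```

The analytic value and the simulated mean agree, illustrating why the normality assumption is a convenience rather than a necessity: Z̄_xs can always be computed directly from the data.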

The above equations illustrate the critical role of SD_y and suggest the possibility of situations in which tests of low validity have higher utility than tests of high validity. For example:

                                         r_xy   Z̄_xs     SD_y   ΔU/selectee
  Mid-level job (e.g., systems analyst)   .20   1.00   $25,000       $5,000
  Lower-level job (e.g., janitor)         .60   1.00    $2,000       $1,200

The total utility of the test depends on the number of persons hired. The total utility (total productivity) gain resulting from use of the test is simply the mean gain per selectee times the number of people selected, N_s. That is, the total productivity gain is

ΔU = N_s·r_xy·SD_y·Z̄_xs.

In this example, the average marginal utilities are $5,000 and $1,200. If 10 people were hired, the actual utilities would be $50,000 and $12,000, respectively. If 1,000 people were to be hired, then the utilities would be $500,000 and $120,000, respectively. Obviously the total dollar value of tests is greater for large employers than for small employers. However, this fact can be misleading. On a percentage basis it is average gain in utility that counts, and that is what counts to each individual employer.
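Equation 1 and the example above can be restated in a few lines of code; a minimal sketch (function names are ours):

```python
def marginal_utility_per_selectee(validity, mean_z_selected, sd_y):
    """Equation 1: average dollar productivity gain per person hired."""
    return validity * mean_z_selected * sd_y

# The two hypothetical jobs above: the lower-validity test has the
# higher utility because SD_y is so much larger for the mid-level job.
mid_level = marginal_utility_per_selectee(0.20, 1.00, 25_000)
janitor = marginal_utility_per_selectee(0.60, 1.00, 2_000)
print(mid_level, janitor)            # → 5000.0 1200.0
print(10 * mid_level, 10 * janitor)  # totals for 10 hires → 50000.0 12000.0
```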

Equations 1 and 2 clearly illustrate the basis for Brogden's (1946) conclusion that the validity coefficient itself is a direct index of selective efficiency. Brogden (1946) showed that, given only the assumption of linearity, the validity coefficient is the proportion of maximum utility attained, where maximum utility is the productivity gain that would result from a perfectly valid test. A test with a validity of .50, for example, can be expected to produce 50% of the gain that would result from a perfect selection device (validity = 1.00) used for the same job and at the same selection ratio. A glance at Equation 1 or Equation 2 verifies this verbal statement. Since the validity coefficient enters the equation as a multiplicative factor, increasing or decreasing the validity by any factor will increase or decrease the utility by the same factor. For example, if we increase validity by a factor of two by raising it from .20 to .40, Equation 2 shows that utility doubles. If we decrease validity by a factor of one half by lowering it from 1.00 to .50, utility is cut in half. Equations 1 and 2 also illustrate the fact that there are limitations on the utility of even a perfectly valid selection device. If the selection ratio is very high, the term φ/p (or Z̄_xs) approaches zero and even a perfect test has little value. If the selection ratio is 1.00, the perfect test has no value at all. Likewise, as SD_y decreases, the utility of even a perfect test decreases. In a hypothetical world in which SD_y were zero, even a perfect test would have no value.

Brogden (1946) further showed that the validity coefficient could be expressed as the following ratio:

r_xy = (Z̄_y(x) − Z̄_y(r)) / (Z̄_y(y) − Z̄_y(r)),

where Z̄_y(x) = the mean job performance (y) standard score for those selected using the test (x); Z̄_y(y) = the mean job performance standard score resulting if selection were on the criterion itself, at the same selection ratio; Z̄_y(r) = the mean job performance standard score resulting if selection decisions were made randomly (from among the otherwise screened pool of applicants); and r_xy = the validity coefficient. Since Z̄_y(r) = 0 by definition, the above formula reduces to Z̄_y(x)/Z̄_y(y). This formulation has implications for the development of new methods of estimating selection procedure validity. If reasonably accurate estimates of both Z̄_y(x) and Z̄_y(y) can be obtained, validity can be estimated without conducting a traditional validity study.

In Equations 1 and 2, the values for r_xy and SD_y should be those that would hold if applicants were hired randomly with respect to test scores. That is, they should be values applicable to the applicant population, the group in which the selection procedure is actually used. Values of r_xy and SD_y computed on incumbents will typically be underestimates because of reduced variance among incumbents on both test and job performance measures. Values of r_xy computed on incumbents can be corrected for range restriction to produce estimates of the value in the applicant pool (Thorndike, 1949, pp. 169-176). The applicant pool is made up of all who have survived screening on any prior selection hurdles that might be employed, for example, minimum educational requirements or physical examinations.

The correlation between the test and a well-developed measure of job performance (y') provides a good estimate of r_xy, the correlation of the test with job performance measured in dollars (productivity). It is a safe assumption that job performance and the value of that performance are at least monotonically related. It is inconceivable that lower performance could have greater dollar value than higher performance. Ordinarily, the relation between y' and y will be not only monotonic but also linear. If there are departures from linearity, the departures will typically be produced by leniency in job performance ratings, which leads to ceiling effects in the measuring instrument. The net effect of such ceiling effects is to make the test's correlation with the measure of job performance smaller than its correlation with actual performance, that is, smaller than its true value, making r_xy' an underestimate of r_xy. An alternative statement of this effect is that ceiling effects due to leniency produce an artificial nonlinear relation between job performance ratings and the actual dollar value of performance. A nonlinear relation of this form would lead to an underestimation of selection utility because the performance measure underestimates the relative value of very high performers.

Values of r_xy' should also be corrected for attenuation due to errors of measurement in the criterion. Random error in the observed measure of job performance causes the test's correlation with that measure to be lower than its correlation with actual job performance. Since it is the correlation with actual performance that determines test utility, it is the attenuation-corrected estimate that is needed in the utility formulas. This estimate is simply r_xy'/√(r_y'y'), where r_y'y' is the reliability of the performance measure. (See Schmidt, Hunter, & Urry, 1976, for further discussion of these points.)
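The two corrections just described can be sketched as follows. The range-restriction formula is Thorndike's (1949) Case II correction; the numeric values are illustrative, not taken from the article:

```python
import math

def correct_for_attenuation(r_obs, criterion_reliability):
    """Correct an observed validity for random measurement error in the
    criterion: divide by the square root of the criterion reliability."""
    return r_obs / math.sqrt(criterion_reliability)

def correct_for_range_restriction(r_restricted, u):
    """Thorndike's (1949) Case II correction for direct range restriction,
    where u = (applicant-pool test SD) / (incumbent test SD), u >= 1."""
    r = r_restricted
    return r * u / math.sqrt(1 - r * r + r * r * u * u)

# Illustration: an incumbent-sample validity of .30, an SD ratio u of 1.5,
# and a criterion reliability of .80 (hypothetical values)
r = correct_for_range_restriction(0.30, 1.5)
r = correct_for_attenuation(r, 0.80)
print(round(r, 3))  # → 0.477
```

Both corrections raise the estimate, consistent with the article's point that incumbent-sample values typically understate applicant-pool validity.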

The next major advance in this area came in the form of the monumental work by Cronbach and Gleser (1965), Psychological Tests and Personnel Decisions. First published in 1957, this work was republished in 1965 in augmented form. The book consists of detailed and sophisticated applications of decision theory principles not only to the single-stage fixed-job selection decisions that we have thus far discussed but also to placement and classification decisions and sequential selection strategies. In these latter areas, many of their derivations were indeed new to the field of personnel testing. Their formulas for utility in the traditional selection setting, however, are, as they note (Cronbach & Gleser, 1965, chap. 4), identical to those of Brogden (1949) except that they formally incorporate cost of testing (information gathering) into the equations. This fact is not obvious.

Brogden, it will be recalled, approached the problem from the point of view of mean gain in utility per selectee. Cronbach and Gleser (1965, chap. 4) derived their initial equation in terms of mean gain per applicant. Their initial formula was (ignoring cost of testing for the moment):

ΔU/applicant = r_xy·SD_y·φ.

All terms are as defined earlier. Multiplying by the number of applicants, N, yields total or overall gain in utility. The Brogden formula for overall utility is

ΔU = N_s·(ΔU/selectee) = N_s·r_xy·SD_y·(φ/p).   (3)

N_s, it will be recalled, is the number selected. If we note that p = N_s/N, that is, the ratio of selectees to applicants, we find that Brogden's equation immediately reduces to the Cronbach-Gleser (1965) equation for total utility:

ΔU = N·r_xy·SD_y·φ.

Role of the Cost of Testing

The previous section ignored the cost of testing, which is small in most testing situations. For example, in a typical job situation, the applicant pool consists of people who walk through the door and ask for a job (i.e., there are no recruiting costs). Hiring is then done on the basis of an application blank and a test that are administered by a trained clerical worker at a cost of $10 or so. If the selection ratio is 10%, then the cost of testing per person hired is $10 for each person hired and $90 for the nine persons rejected in finding the person hired, or $100 altogether. This is negligible in relation to the usual magnitude of utility gains. Furthermore, this $100 is a one-time cost, whereas utility gains continue to accumulate over as many years as the person hired stays with the organization. When cost of testing is included, Equation 2 becomes

ΔU/selectee = r_xy·SD_y·(φ/p) − C/p,   (4)

where C is the cost of testing one applicant.

Although cost of testing typically has only a trivial impact on selection utility, it is possible to conjure up hypothetical situations in which cost plays a critical role. For example, suppose an employer were recruiting one individual for a sales position that would last only 1 year. Suppose further that the employer decided to base the selection on the results of an assessment center that costs $1,000 per assessee and has a true validity of .40. If the yearly value of SD_y for this job is $10,000 and 10 candidates are assessed, the expected gain in productivity is .4 × $10,000 × 1.758, or $7,032. However, the cost of the assessment center is 10 × $1,000 = $10,000, which is $2,968 greater than the expected productivity gain. That is, under these conditions it would cost more to test 10 persons than would be gained in improved performance. If the employer tested only 5 candidates, then the expected gain in performance would be $5,607, whereas the cost of testing would be $5,000, for an expected gain of $607. In this situation, the optimal number to test is 3 persons. The best person of 3 would have an expected gain in performance of $4,469, with a cost of testing of $3,000, for an expected utility of $1,469.
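Using Equation 4 with p = 1/n, the assessment-center example can be reproduced programmatically; a sketch (function names are ours; the dollar amounts differ slightly from the article's tabled values because of rounding, but the optimum of 3 assessees is the same):

```python
from statistics import NormalDist

N = NormalDist()  # standard normal

def net_utility(n, validity=0.40, sd_y=10_000, cost=1_000):
    """Expected net gain (Equation 4 with SR p = 1/n) from assessing
    n candidates in order to hire the single best one."""
    p = 1.0 / n
    # φ/p, the mean standard score of the selectee; 0 when everyone is hired
    mean_z = N.pdf(N.inv_cdf(1 - p)) / p if n > 1 else 0.0
    return validity * sd_y * mean_z - cost * n

best = max(range(1, 11), key=net_utility)
print(best)  # → 3
```

As in the text, assessing 10 candidates yields a negative net utility, while assessing 3 maximizes it.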

Relation Between SR and Utility

In most situations, the number to be hired is fixed by organizational needs. If the applicant pool is also fixed, the question of which SR would yield maximum utility becomes academic. The SR is determined by circumstances and is not under the control of the employer. However, employers can often exert some control over the size of the applicant pool by increasing or decreasing recruiting efforts. If this is the case, the question is then how many applicants the employer should test to obtain the needed number of new employees in order to maximize productivity gains from selection.

This question can be answered using a formula given by Cronbach and Gleser (1965, p. 309):

Z_x = (φ − C/(r_xy·SD_y))/p,   (5)

where Z_x is the cutting score on the test in Z-score form. This equation must be solved by iteration. Only one value of the SR (i.e., p) will satisfy this equation, and p will always be less than or equal to .50. The value computed for the optimal SR indicates the number that should be tested in relation to the number to be selected. For example, if the number to be selected is 100 and Equation 5 indicates that the optimal SR is .05, the employer will maximize selection utility by recruiting and testing 2,000 candidates (100/.05 = 2,000). The cost of recruiting additional applicants beyond those available without recruitment efforts must be incorporated into the cost of testing term, C. C then becomes the average cost of recruiting and testing one applicant. The lower the cost of testing and recruiting, the larger the number of applicants it is profitable to test in selecting a given number of new employees. Since the cost of testing is typically low relative to productivity gains from selection, the number tested should typically be large relative to the number selected.
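Equation 5 has no closed-form solution and must be iterated. A sketch of one such iteration using bisection on the cut score; the optimality condition coded here, z·p(z) − φ(z) + C/(r_xy·SD_y) = 0, is our reading of the Cronbach-Gleser formula and should be checked against the original (1965, p. 309):

```python
from statistics import NormalDist

N = NormalDist()  # standard normal

def optimal_selection_ratio(validity, sd_y, cost_per_applicant):
    """Bisection on the cut score z for the assumed optimality condition
    z·p(z) − φ(z) + C/(r·SD_y) = 0, where p(z) = P(Z > z)."""
    c = cost_per_applicant / (validity * sd_y)
    f = lambda z: z * (1 - N.cdf(z)) - N.pdf(z) + c
    # f is strictly increasing (f' = p > 0): f(0) < 0 for modest cost,
    # and f(z) → c > 0 as z grows large, so one root exists in (0, 8)
    lo, hi = 0.0, 8.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    z = (lo + hi) / 2
    return 1 - N.cdf(z)  # the optimal SR

# Cheap testing implies a very low optimal SR (test many per hire);
# raising the cost per applicant raises the optimal SR.
print(optimal_selection_ratio(0.40, 10_000, 10) <
      optimal_selection_ratio(0.40, 10_000, 100))  # → True
```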

In situations in which the applicant pool is constant, statements about optimal SRs typically do not have practical value, since the SR is not under the control of the employer. Given a fixed applicant pool, ΔU/selectee increases as the SR decreases if cost of testing is not considered. Brogden (1949) showed that when cost of testing is taken into account and when this cost is unusually high, ΔU/selectee will be less at very low SRs than at somewhat higher SRs. If the cost of testing per applicant is unusually high, the cost of testing per selectee can become greater at extremely low SRs than ΔU/selectee, producing a loss rather than a gain in utility. In practice, however, the combination of extremely high testing costs and extremely low SRs that could lead to negative utilities occurs rarely, if ever. When the applicant pool is fixed, the SR that is optimal for ΔU/selectee is not necessarily the optimal SR for total gain in utility. Cronbach and Gleser showed that total utility is always greatest when the SR falls at .50. As the SR decreases from .50, ΔU/selectee increases until it reaches its maximum, the location of which depends on the cost of testing. But as ΔU/selectee increases, the number of selectees N_s is decreasing, and the product N_s · ΔU/selectee, or total utility, is also decreasing. In a fixed applicant pool, total gain is always greatest when 50% are selected and 50% are rejected (Cronbach & Gleser, 1965, pp. 38-40).

Reasons for Failure to Employ Selection Utility Models

Despite the availability since 1949 of the utility equations discussed above, applied differential psychologists have been notably slow in carrying out decision-theoretic utility analyses of selection procedures. In our judgment, the scarcity of work in this area is primarily traceable to three facts. First, many psychologists believe that the utility equations presented above are of no value unless the data exactly fit the linear homoscedastic model and all marginal distributions are normal. Many reject the model in the belief that their data do not perfectly meet the assumptions.

Second, psychologists once believed that validity was situationally specific, that there were subtle differences in the performance requirements of jobs from situation to situation that produced (nontrivial) differences in test validities. If this were true, then the results of a utility analysis conducted in a given setting could not be generalized to apparently identical test-job combinations in new settings. Combined with the belief that utility analyses must include costly cost-accounting applications, it is easy to see why belief in the situational specificity of test validities would lead to reluctance to carry out utility analyses.

Third, it has been extremely difficult in most cases to obtain all the information called for by the equations. The SR and cost of testing can be determined reasonably accurately and at relatively little expense. The item of information that has been most difficult to obtain is the needed estimate of SDy (Cronbach & Gleser, 1965, p. 121). It has generally been assumed that SDy can be estimated only by the use of costly and complicated cost-accounting methods. These procedures involve first costing out the dollar value of the job behaviors of each employee (Brogden & Taylor, 1950a) and then computing the standard deviation of these values. In an earlier review (Hunter & Schmidt, in press), we were able to locate only two studies in which cost-accounting procedures were used to estimate SDy. In this study, we will present an alternative to cost-accounting estimates of SDy.

We now examine each of these reasons in detail.

Are the Statistical Assumptions Met?

The linear homoscedastic model includes the following three assumptions: (a) linearity, (b) equality of variances of conditional distributions, and (c) normality of conditional distributions. As we have shown above, the basic selection utility equation (Equation 1) depends only on linearity. Equation 2 does assume normality of the test score distribution. However, Brogden (1949) and Cronbach and Gleser (1965) introduced this assumption essentially for derivational convenience; it provides an exact relation between the SR and Z̄xs, the mean standard test score of selectees. One need not use the normality-based relation φ/p = Z̄xs to compute Z̄xs; its value can be computed directly. Thus in the final analysis, linearity is the only required assumption.
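The two routes to Z̄xs can be sketched side by side: the normality-based shortcut φ/p, and the direct computation from the scores themselves, which requires no distributional assumption. A minimal illustration on simulated scores:

```python
from statistics import NormalDist
import random

def zbar_normal(p):
    """Mean standard test score of selectees under normality: phi(z_c)/p,
    where z_c is the cutoff that selects the top proportion p."""
    nd = NormalDist()
    z_c = nd.inv_cdf(1.0 - p)
    return nd.pdf(z_c) / p

def zbar_direct(scores, p):
    """Direct computation: standardize the observed scores, then average
    the top p fraction -- no normality assumption needed."""
    mu = sum(scores) / len(scores)
    sd = (sum((x - mu) ** 2 for x in scores) / len(scores)) ** 0.5
    z = sorted(((x - mu) / sd for x in scores), reverse=True)
    k = max(1, round(p * len(z)))
    return sum(z[:k]) / k

random.seed(1)
scores = [random.gauss(0, 1) for _ in range(100_000)]
print(round(zbar_normal(0.5), 3))          # phi(0)/.5, about .798
print(round(zbar_direct(scores, 0.5), 3))  # empirical value, very close
```

With non-normal score distributions, only `zbar_direct` remains exact, which is the point of the passage: the normality assumption is a convenience, not a requirement.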

To what extent do data in differential psychology fit the linear homoscedastic model? To answer this question, we must of necessity examine sample rather than population data. However, it is only conditions in populations that are of interest; sample data are of interest only as a means of inferring the state of nature in populations. Obviously, the larger the sample used, the more clearly the situation in the sample will reflect that in the population, given that the sample is random. A number of researchers have addressed themselves to the question of the fit of the linear homoscedastic model to data in differential psychology.

Sevier (1957), using Ns from 105 to 250, tested the assumptions of linearity, normality of conditional criterion distributions, and equality of conditional variances. The data were from an education study, with cumulative grade point average being the criterion


and high school class rank and various test scores being the predictors. Out of 24 tests of the linearity assumption, only 1 showed a departure significant at the .05 level. Out of eight samples tested for equality of conditional variances, only one showed a departure significant at the .05 level. However, 25 of the 60 tests for normality of the conditional criterion distributions were significant at the .05 level. Violation of this assumption throws interpretations of conditional standard deviations based on normal curve tables into some doubt. However, this statistic typically is not used in practical prediction situations, such as selection or placement. Sevier's study indicates that the assumptions of linearity and equality of conditional variances may be generally tenable.

Ghiselli and Kahneman (1962) examined 60 aptitude variables on one sample of 200 cases and reported that fully 40% of the variables departed significantly from the linear homoscedastic model. Ninety percent of these departures were reported to have held up on cross-validation. Tupes (1964) reanalyzed the Ghiselli and Kahneman data and found that only 20% of the relationships departed from the linear homoscedastic model at the .05 level. He also found that three of the "significant" departures from linearity were probably due to typographical or clerical errors in the data. Later, Ghiselli (1964) accepted and agreed with Tupes's reanalysis of his data. Tupes's findings must be interpreted in light of the fact that the frequency of departure from the linear homoscedastic model expected at the .05 level is in fact much greater than 5%. Tupes carried out two statistical tests on each test-criterion relation: one for linearity and one for equality of conditional variances. Thus the expected proportion of data samples in which at least one test is significant is not .05 but rather a little over .09. If three statistical tests are run at the .05 level—one for linearity, one for normality of conditional distributions, and one for homogeneity of conditional variances—the expected proportion of data samples in which at least one of these tests is significant is approximately .14 when relations in the parent populations are perfectly linear and homoscedastic.

Tiffin and Vincent (1960) found no significant departures from the bivariate normal model in 15 independent samples (ranging in size from 14 to 157 subjects) of test-criterion data. In each set of data, a chi-square test was used to compare the percentage of employees in the "successful" job performance category in each fifth of the test score distribution to the percentages predicted from the normal bivariate surface (which incorporates the linear homoscedastic model) corresponding to the computed validity coefficient. Surgent (1947) performed a similar analysis on similar data and reported the same findings.

Hawk (1970) reported a major study searching for departures from linearity. The data were drawn from 367 studies conducted between 1950 and 1966 on the General Aptitude Test Battery (GATB) used by the U.S. Department of Labor. A total of 3,303 relations, based on 23,428 individuals, between the nine subtests of the GATB and measures of job performance (typically supervisory ratings) were examined. The frequency of departures from linearity significant at the .05 level was .054. Using the .01 level, the frequency was .012. Frequencies closer to the chance level can hardly be imagined. If any substantial proportion of the relations in the Hawk study had in fact been nonlinear, statistical power to detect this fact would have been high—even if statistical power were low for each of the individual 3,303 relations. For example, suppose statistical power to detect nonlinearity had been as low as .30 in each of the individual tests. Then if 40% of the relations were in fact nonlinear, the expected proportion of significant tests for nonlinearity would have been .30 × .40 + .05, or 17%. If only 20% of the relations were truly nonlinear, the expected proportion significant would have been .30 × .20 + .05, or 11%. If only 10% of the relations were truly nonlinear, the expected proportion significant would have been 8%. The obtained proportion was 5.4%. Thus the Hawk study provides extremely strong evidence against the nonlinearity hypothesis. (For further discussion of statistical power in studies of studies, see Hunter & Schmidt, 1978.)
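The expected rates quoted in the last two paragraphs follow from simple probability arithmetic, and can be verified in a few lines. Note that the second formula is the authors' approximation, which adds the full alpha rather than applying it only to the truly linear relations:

```python
def familywise_rate(alpha, n_tests):
    """P(at least one of n independent tests significant) when all nulls are true."""
    return 1 - (1 - alpha) ** n_tests

def expected_sig(power, prop_nonlinear, alpha=0.05):
    """Expected proportion of significant nonlinearity tests when a fraction
    prop_nonlinear of relations is truly nonlinear (authors' approximation:
    alpha is added for all relations, not just the linear ones)."""
    return power * prop_nonlinear + alpha

print(round(familywise_rate(0.05, 2), 4))  # two tests: about .0975, "a little over .09"
print(round(familywise_rate(0.05, 3), 4))  # three tests: about .1426, approximately .14
print(round(expected_sig(0.30, 0.40), 2))  # .17
print(round(expected_sig(0.30, 0.20), 2))  # .11
```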

During his years as technical director of what is now the Army Research Institute for the Behavioral and Social Sciences, Brogden and his research associate Lubin spent a considerable


amount of time and effort attempting to identify nonlinear test-criterion relationships in large samples of military selection data. Although quadratic and other higher-order nonlinear equations sometimes provided impressive fits to the data in the initial sample, not one of the equations cross-validated successfully in a new sample from the same population. In cross-validation samples, the nonlinear functions were never superior to simple linear functions (Brogden, Note 1).

These findings, taken in toto, indicate that the linear homoscedastic model generally fits the data in this area well. The linearity assumption, the only truly critical assumption, is particularly well supported.

We turn now to the question of normality of marginal distributions. In certain forms (see Equation 2), the Brogden-Cronbach-Gleser utility formulas assume, in addition to linearity, a normal distribution for predictor (test) scores. The Taylor-Russell tables, based on the assumption of a normal bivariate surface, assume normality of the total test score distribution also. One obviously relevant question is whether violations of this assumption seriously distort utility estimates. Van Naerssen (1963; cf. also Cronbach & Gleser, 1965) found that they do not. He derived a set of utility equations parallel to the Brogden-Cronbach-Gleser equations, except that they were based on the assumption of a rectangular distribution of test scores. He found that when applied to the same set of empirical data, the two kinds of equations produced highly similar utility estimates (p. 288). Cronbach and Gleser (1965) point out that this finding "makes it possible to generalize over the considerable variety of distributions intermediate between normal and rectangular" (p. 160). Results from the Schmidt and Hoffman (1973) study suggest the same conclusion. In their data neither the predictor nor the criterion scores appeared to be normally distributed. Yet the utility estimates produced by the Taylor-Russell tables were off by only 4.09% at SR = .30 and 11.29% at SR = .50.
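Van Naerssen's point can be illustrated with a small comparison. Under the linear model, the gain per selectee is proportional to the mean standard predictor score of those selected, so sensitivity to the score distribution reduces to comparing that mean under normal versus rectangular (uniform) distributions. A rough sketch (the uniform-case formula below is our own elementary derivation for illustration, not van Naerssen's equations):

```python
from statistics import NormalDist

def mean_top_normal(p):
    """Mean standard score of the top proportion p under a normal distribution."""
    nd = NormalDist()
    return nd.pdf(nd.inv_cdf(1 - p)) / p

def mean_top_uniform(p):
    """Mean standard score of the top proportion p under a uniform distribution
    standardized to mean 0, SD 1 (support [-sqrt(3), sqrt(3)]): the selected
    band runs from sqrt(3)(1 - 2p) to sqrt(3), so its mean is sqrt(3)(1 - p)."""
    return 3 ** 0.5 * (1 - p)

for p in (0.3, 0.5, 0.7):
    n, u = mean_top_normal(p), mean_top_uniform(p)
    print(f"SR={p}: normal {n:.3f}, uniform {u:.3f}, diff {abs(u - n) / n:.1%}")
```

Across these SRs the two distributions disagree by well under 10%, consistent with the 5%-10% error range discussed in the following paragraph.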

Thus it appears that an obsessive concern with statistical assumptions is not justified. This is especially true in light of the fact that for most purposes, there is no need for utility estimates to be accurate down to the last dollar. Approximations are usually adequate for the kinds of decisions that these estimates are used to make (van Naerssen, 1963, p. 282; cf. also Cronbach & Gleser, 1965, p. 139). Alternatives to use of the utility equations will typically be procedures that produce larger errors or, even worse, no utility analyses at all. Faced with these alternatives, errors in the 5%-10% range appear negligible. Further, if overestimation of utility is considered more serious than underestimation, one can always employ conservative estimates of equation parameters (e.g., rxy, SDy) to virtually guarantee against overestimation of utilities.

Are Test Validities Situationally Specific?

The second reason we postulated for the failure of personnel psychologists to exploit the Brogden-Cronbach-Gleser utility models was belief in the doctrine of situational specificity of validity coefficients. This belief precludes generalization of validities from one setting to another, making criterion-related validity studies, and utility analyses, necessary in each situation. The empirical basis for the principle of situational specificity has been the fact that considerable variability in observed validity coefficients is typically apparent from study to study, even when jobs and tests appear to be similar or essentially identical (Ghiselli, 1966). However, there are a priori grounds for postulating that this variance is due to statistical, measurement, and other artifacts unrelated to the underlying relation between test and job performance. There are at least seven such sources of artifactual variance: (a) differences between studies in criterion reliability, (b) differences between studies in test reliability, (c) differences between studies in range restriction, (d) sampling error (i.e., variance due to N < ∞), (e) differences between studies in amount and kind of criterion contamination and deficiency (Brogden & Taylor, 1950b), (f) computational and typographical errors (Wolins, 1962), and (g) slight differences in factor structure between tests of a given type (e.g., arithmetic reasoning tests).

In a purely analytical substudy, Schmidt, Hunter, Pearlman, and Shane (1979) showed that the first four sources alone are capable, under specified and realistic circumstances, of


producing as much variation in validities as is typically observed from study to study. They then turned to analyses of empirical data. Using 14 distributions of validity coefficients from the published and unpublished literature for various tests in the occupations of clerical worker and first-line supervisor, they found that artifactual variance sources a through d accounted for an average of 62% of the variance in validity coefficients, with a range from 43% to 87%. Thus there was little remaining variance in which situational moderators could operate. In an earlier study (Schmidt & Hunter, 1977), it was found that sources a, c, and d alone accounted for an average of about 50% of the observed variance in distributions of validity coefficients presented by Ghiselli (1966, p. 29). If one could correct for all seven sources of error variance, one would, in all likelihood, consistently find that the remaining variance was zero or near zero. That is, it is likely that the small amounts of remaining variance in the studies cited here are due to the sources of artifactual variance not corrected for. Thus there is now strong evidence that the observed variation in validities from study to study for similar test-job combinations is artifactual in nature. These findings cast considerable doubt on the situational specificity hypothesis.
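To give the flavor of such an analysis, the sketch below estimates the share of between-study variance in observed validities attributable to sampling error alone, using the standard sampling-error variance of a correlation, (1 - r̄²)²/(N - 1). The validity coefficients and sample size are hypothetical, chosen only for illustration; the full Schmidt et al. (1979) procedure also corrects for criterion and test unreliability and range restriction:

```python
from statistics import fmean, pvariance

def pct_variance_from_sampling_error(rs, n_per_study):
    """Fraction of observed between-study variance in validity coefficients
    expected from sampling error alone: (1 - rbar^2)^2 / (N - 1),
    divided by the observed variance of the coefficients."""
    rbar = fmean(rs)
    expected_sampling_var = (1 - rbar ** 2) ** 2 / (n_per_study - 1)
    return expected_sampling_var / pvariance(rs, mu=rbar)

# Hypothetical validity coefficients from five small-sample studies (N = 68 each)
rs = [0.12, 0.22, 0.30, 0.38, 0.48]
print(f"{pct_variance_from_sampling_error(rs, 68):.0%} of observed variance")
```

Even this single artifact, with modest study Ns, can account for most of the apparent study-to-study variation before any situational moderator needs to be invoked.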

Rejection of the situational specificity doctrine obviously opens the way to validity generalization. However, validity generalization is possible in many cases even if the situational specificity hypothesis cannot be definitively rejected. After correcting the mean and variance of the validity distribution for sampling error, for attenuation due to criterion unreliability, and for range restriction (based on average values of both), one may find that a large percentage, say 90%, of all values in the distribution lie above the minimum useful level of validity. In such a case, one can conclude with 90% confidence that true validity is at or above this minimum level in a new situation involving the same test type and job without carrying out a validation study of any kind. Only a job analysis is necessary—to ensure that the job at hand is a member of the class of jobs on which the validity distribution was derived. In Schmidt and Hunter (1977), two of the four validity distributions fell into

this category, even though only three sources of artifactual variance could be corrected for. In the later study (Schmidt et al., 1979) in which it was possible to correct for four sources of error variance, 12 of the 14 corrected distributions had 90% or more of validities above levels that would typically be indicative of significant practical utility (cf. Hunter & Schmidt, in press).

These methods and findings indicate that in the future, validity generalization will be possible for a wide variety of test-job combinations. Such a development will do much to encourage the application of decision-theoretic utility estimation tools.

Difficulties in Estimating SDy

The third major reason for neglect of the powerful Brogden-Cronbach-Gleser utility model was the difficulty of estimating SDy. As noted above, the generally recommended procedure for estimating SDy uses cost-accounting procedures. Such procedures are supposed to be used to estimate the dollar value of performance of a number of individuals (cf. Brogden & Taylor, 1950a), and the SD of these values is then computed. Roche's (1961) dissertation illustrates well the tremendous time and effort such an endeavor entails. This study (summarized in Cronbach and Gleser, 1965, pp. 256-266) was carried out on radial drill operators in a large midwestern plant of a heavy equipment manufacturer. A cost-accounting procedure called standard costing was used to determine the contribution of each employee to the profits of the company. The procedure was extremely detailed and complex, involving such considerations as cost estimates for each piece of material machined, direct and indirect labor costs, overhead, and perishable tool usage. There was also a "burden adjustment" for below-standard performance. But despite the complexity and apparent objectivity, Roche was compelled to admit that "many estimates and arbitrary allocations entered into the cost accounting" (Cronbach & Gleser, 1965, p. 263). Cronbach, in commenting on the study after having discussed it with Roche, stated that some of the cost-accounting procedures used were unclear or questionable (Cronbach & Gleser, 1965, pp. 266-267) and that the accountants perhaps


did not fully understand the utility estimation problem. Thus even given great effort and expense, cost-accounting procedures may nevertheless lead to a questionable final product.

Recently, we have developed a procedure for obtaining rational estimates of SDy. This method was used in a pilot study by 62 experienced supervisors of budget analysts to estimate SDy for that occupation. Supervisors were used as judges because they have the best opportunities to observe actual performance and output differences between employees on a day-to-day basis. The method is based on the following reasoning: If job performance in dollar terms is normally distributed, then the difference between the value to the organization of the products and services produced by the average employee and those produced by an employee at the 85th percentile in performance is equal to SDy. Budget analyst supervisors were asked to estimate both these values; the final estimate was the average difference across the 62 supervisors. The estimation task presented to the supervisors may appear difficult at first glance, but only 1 out of 62 supervisors objected and stated that he did not think he could make meaningful estimates. Use of a carefully developed questionnaire to obtain the estimates apparently aided significantly; a similar questionnaire was used in the present study and is described in the Method section. The final estimate of SDy for the budget analyst occupation was $11,327 per year (SEM = $1,120). This estimate is based on incumbents rather than applicants and must therefore be considered to be an underestimate.
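The reasoning rests on the fact that the 85th percentile of a normal distribution lies approximately one standard deviation above the mean (the exact z for the 85th percentile is about 1.04). A minimal sketch of the estimation, using hypothetical supervisor judgments rather than the study's data:

```python
from statistics import NormalDist, fmean, stdev

# The 85th percentile of a normal distribution sits ~1 SD above the mean,
# so (value at 85th percentile) - (value at 50th) estimates SD_y directly.
z_85 = NormalDist().inv_cdf(0.85)  # about 1.04

# Hypothetical per-supervisor dollar-value judgments (85th and 50th percentile)
superior = [32_000, 29_000, 35_000, 31_000, 33_000]
average  = [22_000, 20_000, 23_500, 21_000, 22_500]

diffs = [s - a for s, a in zip(superior, average)]
sdy = fmean(diffs)                       # final SD_y estimate
sem = stdev(diffs) / len(diffs) ** 0.5   # standard error of the mean
print(f"SD_y estimate: ${sdy:,.0f} (SEM ${sem:,.0f})")
```

Averaging the per-judge differences, as here, is the aggregation step described in the text; the names and dollar figures are illustrative only.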

As noted earlier, it is generally not critical that estimates of utility be accurate down to the last dollar. Utility estimates are typically used to make decisions about selection procedures, and for this purpose only errors large enough to lead to incorrect decisions are of any consequence. Such errors may be very infrequent. Further, they may be as frequent or more frequent when cost-accounting procedures are used. As we noted above, Roche (1961) found that even in the case of the simple and structured job he studied, the cost accountants were frequently forced to rely on subjective estimates and arbitrary allocations.

This is generally true in cost accounting and may become a more severe problem as one moves up in the occupational hierarchy. What objective cost-accounting techniques, for example, can be used to assess the dollar value of an executive's impact on the morale of his or her subordinates? It is the jobs with the largest SDy values, that is, the jobs for which ΔU/selectee is potentially greatest, that are handled least well by cost-accounting methods. Rational estimates—to one degree or another—are virtually unavoidable at the higher job levels.

Our procedure has at least two advantages in this respect. First, the mental standard to be used by the supervisor-judges is the estimated cost to the organization of having an outside consulting firm provide the same products and/or services. In many occupations, this is a relatively concrete standard. Second, the idiosyncratic tendencies, biases, and random errors of individual experts can be controlled by averaging across a large number of judges. In our initial study, the final estimate of SDy was the average across 62 supervisors. Unless there is an upward or downward bias in the group as a whole, such an average should be fairly accurate. In our example, the standard error of the mean was $1,120; thus the interval $9,480-$13,175 should contain the relevant population mean with 90% confidence. (One truly bent on being conservative could employ the lower bound of this interval in his or her calculations.)
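The interval follows from the usual normal-theory bounds, mean ± 1.645 × SEM; a minimal sketch (the computed bounds land within a few dollars of the figures quoted above, the residue of rounding):

```python
from statistics import NormalDist

def ci_from_sem(mean, sem, confidence=0.90):
    """Normal-theory confidence interval for a mean, given its standard error."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # about 1.645 for 90%
    return mean - z * sem, mean + z * sem

lo, hi = ci_from_sem(11_327, 1_120)
print(f"90% CI: ${lo:,.0f} - ${hi:,.0f}")
```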

Methods similar to the one described here have been used successfully by the Decision Analysis Group of the Stanford Research Institute (Howard, Note 2) to scale otherwise unmeasurable but critical variables. Resulting measures have been used in the application of decision-theoretic principles to high-level policy decision making in such areas as nuclear power plant construction, corporate risk policies, investment and expansion programs, and hurricane seeding (Howard, 1966; Howard, Matheson, & North, 1972; Matheson, 1969; Raiffa, 1968). All indications are that the response to the work of this group has been positive; these methods have been judged by high-level decision makers to contribute valuably to improvement of socially and economically important decisions.


In most cases, the alternatives to using a procedure like ours to estimate SDy are unpalatable. The first alternative is to abandon the idea of a utility analysis. This course of action will typically lead to a gross (implicit) underestimate of the economic value of valid selection procedures. This follows if one accepts our contention (Hunter & Schmidt, in press) that the empirical studies that are available indicate much higher dollar values than psychologists have expected. The second alternative in most situations is use of a less systematized, and probably less accurate, procedure for estimating SDy. Both of these alternatives can be expected to lead to more erroneous decisions about selection procedures.

The Present Study

The procedure for estimating SDy described here assumes that dollar outcomes are normally distributed. One purpose of the present study is the evaluation of that assumption.

The present study has three purposes: (a) to illustrate the magnitude of the productivity implications of a valid selection procedure, (b) to demonstrate the application of decision-theoretic utility equations, and (c) to test the assumption that the dollar value of employee productivity is normally distributed.

The major reason for our choice of the job of computer programmer was the remarkably accurate validity estimates for this job that a previous study (Schmidt, Rosenberg, & Hunter, Note 3) had provided. Applying the Schmidt-Hunter (Schmidt et al., 1979) validity generalization model to all available validity data for the Programmer Aptitude Test (PAT; Hughes & McNamara, 1959; McNamara & Hughes, 1961), this study found that the percentage of variance in validity coefficients accounted for in the case of job proficiency criteria for the PAT total score was 94%. This finding effectively refutes the situational specificity hypothesis. The estimated true validity was .76. Thus the evidence is strong that the (multivariate) total PAT score validity is high for predicting performance of computer programmers and that this validity is essentially constant across situations (e.g., different organizations; Schmidt et al., Note 3). Since it is total score that is typically used in selecting programmers, this

study concerns itself only with total score validity. Because the PAT is no longer available commercially, testing costs had to be estimated. In this study, we assumed a testing cost of $10 per examinee.

Method

Definition of Relevant Job Group

This study focused on selection of computer programmers at the GS-5 through GS-9 levels. GS-5 is the lowest level in this occupational series. Beyond GS-9, it is unlikely that an aptitude test like the PAT would be used in selection. Applicants for higher level programmer positions are expected (and required) to have considerable developed expertise in programming and are selected on the basis of achievement and experience, rather than directly on aptitude. The vast majority of programmers hired at the GS-9 level are promoted to GS-11 after 1 year. Similarly, all but a minority hired at the GS-5 level advance to GS-7 in 1 year and to GS-9 the following year. Therefore, the SDy estimates were obtained for the GS-9 through GS-11 levels. Statistical information obtained from the Bureau of Personnel Management Information Systems of the U.S. Office of Personnel Management indicated that the number of programmer incumbents in the federal government at the relevant levels (GS-5 through GS-9) was 4,404 (as of October 31, 1976, the latest date for which figures were available). The total number of computer programmers at all grade levels was 18,498. For 1975-1976, 61.3% of all new hires were at the GS-5 through GS-9 levels. The number of new hires governmentwide in this occupation at these levels was 655 and 565 for calendar years 1975 and 1976, respectively, for an average yearly selection rate of 618. The average tenure of computer programmers hired at GS-5 through GS-9 levels was determined to be 9.69 years.

Data from the 1970 U.S. Census (U.S. Bureau of the Census, 1970) showed that there were 166,556 computer programmers in the U.S. in that year. Because the growth rate has been rapid in this occupation recently, this figure undoubtedly underestimates the current number of programmers. However, it is the most recent estimate available. In any event, the effect of underestimation on the utility results is a conservative one. It was not possible to determine the number of computer programmers that are hired yearly in the U.S. economy. For purposes of this study, it was assumed that the turnover rate was 10% in this occupation and that therefore .10 × 166,556, or 16,655, were hired to replace those who had quit, retired, or died. Extrapolating from the federal to the private sector work force, it was assumed that 61.3% of these new hires were at occupational levels for which the PAT would be appropriate. Thus it was assumed that .613 × 16,655, or 10,210, computer programmers could be hired each year in the U.S. economy using the PAT. In view of the current rapid expansion of this occupation, it is likely that this number is a substantial underestimate.


It was not possible to determine prevailing SRs for computer programmers in the general economy. Because the total yearly number of applicants for this job in the government could not be determined, it was also impossible to estimate the government SR. This information lack is of no real consequence, however, since it is more instructive to examine utilities for a variety of SRs. Utilities were calculated for SRs of .05, .10, .20 ... .80. The gains in utility or productivity as computed from Equation 4 are those that result when a valid procedure is introduced where previously no procedure or a totally invalid procedure has been used. The assumption that the true validity of the previous procedure is essentially zero may be valid in some cases, but in other situations the PAT would, if introduced, replace a procedure with lower but nonzero true validity. Hence, utilities were calculated assuming previous procedure true validities of .20, .30, .40, and .50, as well as .00.

Estimating SDy

Estimates of SDy were provided by experienced supervisors of computer programmers in 10 federal agencies. These supervisors were selected by their own supervisors after consultation with the first author. Participation was voluntary. Of 147 questionnaires distributed, 105 were returned (all in usable form), for a return rate of 71.4%. To test the hypothesis that dollar outcomes are normally distributed, the supervisors were asked to estimate values for the 15th percentile ("low-performing programmers"), the 50th percentile ("average programmers"), and the 85th percentile ("superior programmers"). The resulting data thus provide two estimates of SDy. If the distribution is approximately normal, these two estimates will not differ substantially in value.

The instructions to the supervisors were as follows:

The dollar utility estimates we are asking you to make are critical in estimating the relative dollar value to the government of different selection methods. In answering these questions, you will have to make some very difficult judgments. We realize they are difficult and that they are judgments or estimates. You will have to ponder for some time before giving each estimate, and there is probably no way you can be absolutely certain your estimate is accurate when you do reach a decision. But keep in mind three things:

(1) The alternative to estimates of this kind is application of cost accounting procedures to the evaluation of job performance. Such applications are usually prohibitively expensive. And in the end, they produce only imperfect estimates, like this estimation procedure.

(2) Your estimates will be averaged in with those of other supervisors of computer programmers. Thus errors produced by too high and too low estimates will tend to be averaged out, providing more accurate final estimates.

(3) The decisions that must be made about selection methods do not require that all estimates be accurate down to the last dollar. Substantially accurate estimates will lead to the same decisions as perfectly accurate estimates.

Based on your experience with agency programmers, we would like for you to estimate the yearly value to your agency of the products and services produced by the average GS 9-11 computer programmer. Consider the quality and quantity of output typical of the average programmer and the value of this output. In placing an overall dollar value on this output, it may help to consider what the cost would be of having an outside firm provide these products and services.

Based on my experience, I estimate the value to my agency of the average GS 9-11 computer programmer at ______ dollars per year.

We would now like for you to consider the "superior" programmer. Let us define a superior performer as a programmer who is at the 85th percentile. That is, his or her performance is better than that of 85% of his or her fellow GS 9-11 programmers, and only 15% turn in better performances. Consider the quality and quantity of the output typical of the superior programmer. Then estimate the value of these products and services. In placing an overall dollar value on this output, it may again help to consider what the cost would be of having an outside firm provide these products and services.

Based on my experience, I estimate the value to my agency of a superior GS 9-11 computer programmer to be ______ dollars per year.

Finally, we would like you to consider the "low performing" computer programmer. Let us define a low performing programmer as one who is at the 15th percentile. That is, 85% of all GS 9-11 computer programmers turn in performances better than the low performing programmer, and only 15% turn in worse performances. Consider the quality and quantity of the output typical of the low performing programmer. Then estimate the value of these products and services. In placing an overall dollar value on this output, it may again help to consider what the cost would be of having an outside firm provide these products and services.

Based on my experience, I estimate the value to my agency of the low performing GS 9-11 computer programmer at ______ dollars per year.

The wording of this questionnaire was carefully developed and pretested on a small sample of programmer supervisors and personnel psychologists. None of the programmer supervisors who returned questionnaires in the study reported any difficulty in understanding the questionnaire or in making the estimates.

Computation of Impact on Productivity

Using a modification of Equation 4, utilities that would result from 1 year's use of the PAT for selection of new hires in the federal government and the economy


622 SCHMIDT, HUNTER, McKENZIE, AND MULDROW

Table 1
Estimated Productivity Increase From 1 Year's Use of the Programmer Aptitude Test to Select Computer Programmers in the Federal Government (in Millions of Dollars)

Selection          True validity of previous procedure
ratio         .00      .20      .30      .40      .50
 .05         97.2     71.7     58.9     46.1     33.3
 .10         82.8     60.1     50.1     39.2     28.3
 .20         66.0     48.6     40.0     31.3     22.6
 .30         54.7     40.3     33.1     25.9     18.7
 .40         45.6     34.6     27.6     21.6     15.6
 .50         37.6     27.7     22.8     17.8     12.9
 .60         30.4     22.4     18.4     14.4     10.4
 .70         23.4     17.2     14.1     11.1      8.0
 .80         16.5     12.2     10.0      7.8      5.6

as a whole were computed for each of the combinations of SR and previous procedure validity given above. When the previous procedure was assumed to have zero validity, its associated testing cost was also assumed to be zero; that is, it was assumed that no procedure was used and that otherwise prescreened applicants were hired randomly. When the previous procedure was assumed to have a nonzero validity, its associated cost was assumed to be the same as that of the PAT, that is, $10 per applicant. As mentioned above, average tenure for government programmers was found to be 9.69 years; in the absence of other information, this tenure figure was also assumed for the private sector. ΔU/selectee per year was multiplied by 9.69 to give final ΔU/selectee. Cost of testing was charged only to the first year.

Building all of these factors into Equation 4, we obtain the equation actually used in computing the utilities:

ΔU = t N_s (r_1 − r_2) SD_y φ/p − N_s (C_1 − C_2)/p,   (6)

where ΔU = the gain in productivity in dollars from using the new selection procedure for 1 year; t = tenure in years of the average selectee, here 9.69; N_s = number selected in a given year (this figure was 618 for the federal government and 10,210 for the U.S. economy); r_1 = validity of the new procedure, here the PAT (r_1 = .76); r_2 = validity of the previous procedure (r_2 ranges from 0 to .50); C_1 = per-applicant cost of the new procedure, here $10; and C_2 = per-applicant cost of the previous procedure, here 0 or $10. The terms SD_y, φ, and p are as defined previously. The figure for SD_y was the average of the two estimates obtained in this study. Note that although this equation gives the productivity gain that results from substituting for 1 year the new (more valid) selection procedure for the previous procedure, these gains are not all realized the first year. They are spread out over the tenure of the new employees.
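As a sketch, the computation in Equation 6 can be written out in code. The normal ordinate φ and the cutoff come from the standard normal distribution; Python's statistics.NormalDist is an assumed stand-in for the normal-curve tables of the era, and the function name is ours, not the article's.

```python
from statistics import NormalDist


def selection_utility(t, n_sel, r1, r2, sd_y, sr, c1, c2):
    """Equation 6: one-year gain from substituting a selection
    procedure of validity r1 for one of validity r2."""
    nd = NormalDist()
    z_cut = nd.inv_cdf(1 - sr)     # cutoff score that selects the top `sr`
    phi = nd.pdf(z_cut)            # ordinate of the normal curve at the cutoff
    gain = t * n_sel * (r1 - r2) * sd_y * phi / sr
    cost = n_sel * (c1 - c2) / sr  # testing cost over the n_sel / sr applicants
    return gain - cost


# Federal-government cell of Table 1: SR = .05, previous validity .00
du = selection_utility(t=9.69, n_sel=618, r1=.76, r2=.00,
                       sd_y=10_413, sr=.05, c1=10, c2=0)
print(round(du / 1e6, 1))  # close to the $97.2 million shown in Table 1
```

Small differences from the tabled values reflect rounding of φ in the original calculations.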

Results and Discussion

Estimation of Yearly SDy

The two estimates of SD_y were similar. The mean estimated difference in dollar value of yearly job performance between programmers at the 85th and 50th percentiles in job performance was $10,871 (SE = $1,673). The figure for the difference between the 50th and 15th percentiles was $9,955 (SE = $1,035). The difference of $916 is roughly 8% of each of the estimates and is not statistically significant. Thus the hypothesis that computer programmer productivity in dollars is normally distributed cannot be rejected. The distribution appears to be at least approximately normal. The average of these two estimates, $10,413, was the SD_y figure used in the utility calculations below. This figure must be considered an underestimate, since it applies to incumbents rather than to the applicant pool. As can be seen from the two standard errors, supervisors showed better agreement on the productivity difference between "low-performing" and "average" programmers than on the difference between "average" and "superior" programmers.
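The SD_y arithmetic above is simple enough to restate as a check, using the article's own supervisor estimates:

```python
upper_half = 10_871  # mean estimated 85th-minus-50th percentile dollar difference
lower_half = 9_955   # mean estimated 50th-minus-15th percentile dollar difference

# Near-equality of the two halves is what supports approximate normality;
# their average is the SD_y figure carried into the utility calculations.
asymmetry = upper_half - lower_half   # $916, about 8% of either estimate
sd_y = (upper_half + lower_half) / 2  # $10,413
```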

Impact on Productivity

Table 1 shows the gains in productivity in millions of dollars that would result from 1 year's use of the PAT to select computer programmers in the federal government for different combinations of SR and previous procedure validity. As expected, these gains increase as SR decreases and as the validity of the previous procedure decreases. When SR is .05 and the previous procedure has no validity, use of the PAT for 1 year produces a productivity gain of $97.2 million. At the other extreme, if SR is .80 and the procedure the PAT replaces has a validity of .50, the gain is only $5.6 million. The figures in all cells of Table 1 are large, larger than most industrial-organizational psychologists would, in our judgment, have expected. These figures, of course, are for total utility. Gain per selectee for any cell in Table 1 can be computed by dividing the cell entry by 618, the assumed yearly number of selectees. For example, when SR = .20 and the previous procedure has a


IMPACT OF VALID SELECTION PROCEDURES 623

validity of .30, the gain per selectee is $64,725. As indicated earlier, the gains shown in Table 1 are produced by 1 year's use of the PAT but are not all realized during the first year; they are spread out over the tenure of the new employees. Per-year gains for any cell in Table 1 can be obtained by dividing the cell entry by 9.69, the average tenure of computer programmers.
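The per-selectee and per-year breakdowns described above are straight divisions of a Table 1 cell; for instance:

```python
cell = 40.0e6     # Table 1 entry for SR = .20, previous validity = .30
n_selected = 618  # assumed yearly number of federal selectees
tenure = 9.69     # average programmer tenure in years

gain_per_selectee = cell / n_selected  # about $64,725
gain_per_year = cell / tenure          # about $4.1 million realized per year
```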

Table 2 shows productivity gains for the economy as a whole resulting from use of the PAT or substitution of the PAT for less valid procedures. Table 2 figures are based on the assumed yearly selection of 10,210 computer programmers nationwide. Again, the figures are for the total productivity gain, but gain per selectee can be computed by dividing the cell entry by the number selected. Once mean gain per selectee is obtained, the reader can easily compute total gain for any desired number of selectees. As expected, these figures are considerably larger, exceeding $1 billion in several cells. Although we have no direct evidence on this point, we again judge that the productivity gains are much higher than most industrial-organizational psychologists would have suspected.

In addition to the assumptions of linearity and normality discussed earlier, the productivity gain figures in Tables 1 and 2 are based on two additional assumptions. The first is the assumption that selection proceeds from top-scoring applicants downward until the SR has been reached. That is, these analyses assume that selection procedures are used optimally. Because of the linearity of the relation between test score and job performance, any other usage of a valid test would result in lower mean productivity levels among selectees. For example, if a cutting score were set at a point lower than that corresponding to the SR and if applicants scoring above this minimum score were then selected randomly (or selected using other nonvalid procedures or considerations), productivity gains would be considerably lower than those shown in Tables 1 and 2. (They would, however, typically still be substantial.)

The second additional assumption is that all applicants who are offered jobs accept and are hired. This is often not the case, and the effect of rejection of job offers by applicants is to

Table 2
Estimated Productivity Increase From 1 Year's Use of the Programmer Aptitude Test to Select Computer Programmers in the U.S. Economy (in Millions of Dollars)

Selection          True validity of previous procedure
ratio          .00       .20       .30       .40       .50
 .05         1,605     1,184      973       761       550
 .10         1,367     1,008      828       648       468
 .20         1,091       804      661       517       373
 .30           903       666      547       428       309
 .40           753       555      455       356       257
 .50           622       459      376       295       213
 .60           501       370      304       238       172
 .70           387       285      234       183       132
 .80           273       201      165       129        93

increase the SR and thus lower the productivity gains from selection. For example, if an SR of .10 would yield the needed number of new employees given no rejections by applicants, then if half of all job offers are rejected, the SR must be increased to .20 to yield the desired number of selectees. If the validity of the previous procedure were zero, Table 1 shows that rejection by applicants would reduce productivity gains from $82.8 to $66.0 million, a reduction of $16.8 million. If the validity of the previous procedure were nonzero, job rejection by applicants would reduce both its utility and the utility of the new test. However, the function is multiplicative, and hence the utility of the more valid procedure would be reduced by a greater amount. Therefore, the utility advantage of the more valid procedure over the less valid procedure would be reduced. For example, Table 1 shows that if the validity of the previous procedure were .30, the productivity advantage of the more valid test would be $50.1 million if the needed workers could be hired using a selection ratio of .10. But if half of the applicants rejected job offers, we would have to use an SR of .20, and the advantage of the more valid test would drop by one fifth to $40 million.
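A minimal sketch of the selection-ratio inflation described above (the function name is ours, not the article's):

```python
def effective_sr(planned_sr, acceptance_rate):
    """If a fraction of offers is declined, more offers must be made,
    which raises the selection ratio actually achieved."""
    return planned_sr / acceptance_rate


# Half of all offers rejected: a planned SR of .10 becomes .20, moving the
# zero-prior-validity Table 1 gain from $82.8 million to $66.0 million.
sr = effective_sr(0.10, 0.5)
```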

Hogarth and Einhorn (1976) have pointed out that utility losses caused by job offer rejection can often be offset in part by additional recruiting efforts that increase the size of the applicant pool and, therefore, restore use of smaller SRs. They present equations


that allow one to compute the optimal number of additional applicants to recruit and test and the optimal SR under various combinations of circumstances.

The PAT is no longer available commercially. Originally marketed by Psychological Corporation, it was later distributed by IBM as part of package deals to computer systems purchasers. However, this practice was dropped around 1974, and since then the PAT has not been available to most users (Dyer, Note 4). This fact, however, need create no problems in terms of validity generalization. The validity estimates from Schmidt et al. (Note 3) generalize directly to other tests and subtests with the same factor structure. The three subscales of the PAT are composed of conventional number series, figure analogies, and arithmetic reasoning items. New tests can easily be constructed that correlate 1.00, corrected for attenuation, with the PAT subtests.

The productivity gains shown in Tables 1 and 2 are based on an estimated true validity of .76 for the PAT total score (Schmidt et al., Note 3). For many jobs, alternative selection procedures with known validities of this magnitude may not be available. However, the methods used here would be equally applicable. For example, if the alternate selection procedure has an estimated true validity of .50, Equation 6 can be used in the same way to estimate values comparable to those in Tables 1 and 2. Obviously, in this case, all productivity gains would be smaller, and there would be no productivity gain at all from substituting the new procedure for an existing procedure with validity of .50. But work-force productivity will often be optimized by combining the existing procedure and the new procedure to obtain validity higher than either procedure can provide individually. This fact is well known in personnel psychology, and we therefore do not develop it further here.

It should be noted that productivity gains comparable to those shown in Tables 1 and 2 can probably be realized in other occupations, such as clerical work, in which lower SD_y values will be offset by the larger numbers of selectees. Pearlman, Schmidt, and Hunter (in press) present extensive data on the generalizability of validity for a number of different kinds of cognitive measures (constructs) for several job families of clerical work.

There is another way to approach the question of productivity gains resulting from use of valid selection procedures. One can ask what the productivity gain would have been had the entire incumbent population been selected using the more valid procedure. As indicated earlier, the incumbent population of interest in the federal government numbers 18,498. As an example, suppose this population had been selected using a procedure with true validity of .30 using an SR of .20. Then had the PAT been used instead, the productivity gain would have been approximately $1.2 billion [9.69 × 18,498 × (.76 − .30) × $10,413 × .28/.20]. Expanding this example to the economy as a whole, the productivity gain that would have resulted is $10.78 billion.
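The bracketed incumbent-population arithmetic can be checked directly; here φ = .28 is the normal ordinate corresponding to a .20 selection ratio:

```python
t, n_incumbents = 9.69, 18_498   # tenure in years, federal incumbent population
r_new, r_old = .76, .30          # PAT validity vs. assumed previous validity
sd_y, phi, sr = 10_413, .28, .20

# Same form as the productivity-gain term of Equation 6, applied to incumbents.
gain = t * n_incumbents * (r_new - r_old) * sd_y * phi / sr
print(round(gain / 1e9, 1))  # roughly $1.2 billion
```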

Obviously, there are many other such examples that can be worked out, and we encourage readers to ask their own questions and derive their own answers. However, virtually regardless of the question, the answer always seems to include the conclusion that it does make a difference (an important, practical difference) how people are selected. We conclude that the implications of valid selection procedures for work-force productivity are much greater than most of us have realized in the past.

Finally, we note by way of a necessary caution that productivity gains in individual jobs from improved selection cannot be extrapolated in a simple way to productivity gains in the composite of all jobs making up the national economy. To illustrate, if the potential gain economywide in the computer programmer occupation is $10.78 billion and if there are N jobs in the economy, the gain to be expected from use of improved selection procedures in all N jobs will not in general be as great as N times $10.78 billion. Since the total talent pool is not unlimited, gains due to selection in one job are partially offset by losses in other jobs. The size of the net gain for the economy depends on such factors as the number of jobs, the correlation between jobs of predicted success composites (ŷ), and differences between jobs in SD_y. Nevertheless, potential net gains for the economy as a whole are large. The impact of selection procedures on the


economy as a whole is explored in detail in Hunter and Schmidt (in press).

Reference Notes

1. Brogden, H. E. Personal communication, November 1967.

2. Howard, R. A. Decision analysis: Applied decision theory. Paper presented at the Fourth International Conference on Operational Research, Boston, 1966.

3. Schmidt, F. L., Rosenberg, I. G., & Hunter, J. E. Application of the Schmidt-Hunter validity generalization model to computer programmers. Personnel Research and Development Center, U.S. Civil Service Commission, Washington, D.C., 1978.

4. Dyer, P. Personal communication, April 20, 1978.

References

Brogden, H. E. On the interpretation of the correlation coefficient as a measure of predictive efficiency. Journal of Educational Psychology, 1946, 37, 65-76.

Brogden, H. E. When testing pays off. Personnel Psychology, 1949, 2, 171-183.

Brogden, H. E., & Taylor, E. K. The dollar criterion: Applying the cost accounting concept to criterion construction. Personnel Psychology, 1950, 3, 133-154. (a)

Brogden, H. E., & Taylor, E. K. A theory and classification of criterion bias. Educational and Psychological Measurement, 1950, 10, 159-186. (b)

Cronbach, L. J., & Gleser, G. C. Psychological tests and personnel decisions. Urbana: University of Illinois Press, 1965.

Curtis, E. W., & Alf, E. F. Validity, predictive efficiency, and practical significance of selection tests. Journal of Applied Psychology, 1969, 53, 327-337.

Dunnette, M. D., & Borman, W. C. Personnel selection and classification systems. Annual Review of Psychology, 1979, in press.

Ghiselli, E. E. Dr. Ghiselli's comments on Dr. Tupes' note. Personnel Psychology, 1964, 17, 61-63.

Ghiselli, E. E. The validity of occupational aptitude tests. New York: Wiley, 1966.

Ghiselli, E. E., & Kahneman, D. Validity and nonlinear heteroscedastic models. Personnel Psychology, 1962, 15, 1-11.

Hawk, J. Linearity of criterion-GATB aptitude relationships. Measurement and Evaluation in Guidance, 1970, 2, 249-251.

Hogarth, R. M., & Einhorn, H. J. Optimal strategies for personnel selection when candidates can reject offers. Journal of Business, 1976, 49, 478-495.

Howard, R. A. (Ed.). Proceedings of the Fourth International Conference on Operational Research. New York: Wiley, 1966.

Howard, R. A., Matheson, J. E., & North, D. W. The decision to seed hurricanes. Science, 1972, 176, 1191-1202.

Hughes, J. L., & McNamara, W. J. Manual for the revised Programmer Aptitude Test. New York: Psychological Corporation, 1959.

Hull, C. L. Aptitude testing. Yonkers, N.Y.: World Book, 1928.

Hunter, J. E., & Schmidt, F. L. Differential and single group validity of employment tests by race: A critical analysis of three recent studies. Journal of Applied Psychology, 1978, 63, 1-11.

Hunter, J. E., & Schmidt, F. L. Fitting people to jobs: The impact of personnel selection on national productivity. In E. A. Fleishman (Ed.), Human performance and productivity, in press.

Kelley, T. L. Statistical method. New York: Macmillan, 1923.

Matheson, J. E. Decision analysis practice: Examples and insights. In Proceedings of the Fifth International Conference on Operational Research (OR 69). London: Tavistock, 1969.

McNamara, W. J., & Hughes, J. L. A review of research on the selection of computer programmers. Personnel Psychology, 1961, 14, 39-51.

Pearlman, K., Schmidt, F. L., & Hunter, J. E. Test of a new model of validity generalization: Results for job proficiency and training criteria in clerical occupations. Journal of Applied Psychology, in press.

Raiffa, H. Decision analysis: Introductory lectures on choices under uncertainty. Reading, Mass.: Addison-Wesley, 1968.

Roche, W. J. The Cronbach-Gleser utility function in fixed treatment employee selection (Doctoral dissertation, Southern Illinois University, 1961). Dissertation Abstracts International, 1961-62, 22, 4413. (University Microfilms No. 62-1570) (Portions reproduced in L. J. Cronbach & G. C. Gleser (Eds.), Psychological tests and personnel decisions. Urbana: University of Illinois Press, 1965.)

Schmidt, F. L., & Hoffman, B. Empirical comparison of three methods of assessing the utility of a selection device. Journal of Industrial and Organizational Psychology, 1973, 1, 13-22.

Schmidt, F. L., & Hunter, J. E. Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 1977, 62, 529-540.

Schmidt, F. L., Hunter, J. E., Pearlman, K., & Shane, G. S. Further tests of the Schmidt-Hunter validity generalization model. Personnel Psychology, 1979, 32, 257-281.

Schmidt, F. L., Hunter, J. E., & Urry, V. W. Statistical power in criterion-related validity studies. Journal of Applied Psychology, 1976, 61, 473-485.

Sevier, F. A. C. Testing the assumptions underlying multiple regression. Journal of Experimental Education, 1957, 25, 323-330.

Surgent, L. V. The use of aptitude tests in the selection of radio tube mounters. Psychological Monographs, 1947, 61(2, Whole No. 283).

Taylor, H. C., & Russell, J. T. The relationship of validity coefficients to the practical effectiveness of tests in selection: Discussion and tables. Journal of Applied Psychology, 1939, 23, 565-578.

Thorndike, R. L. Personnel selection. New York: Wiley, 1949.


Tiffin, J., & Vincent, N. L. Comparison of empirical and theoretical expectancies. Personnel Psychology, 1960, 13, 59-64.

Tupes, E. C. A note on "Validity and non-linear heteroscedastic models." Personnel Psychology, 1964, 17, 59-61.

U.S. Bureau of the Census. Census of population: 1970. Subject reports (Final Rep. PC(2)-7A). Washington, D.C.: Author, 1970.

van Naerssen, R. F. Selectie van chauffeurs. Groningen, The Netherlands: Wolters, 1963. (Portions reprinted in L. J. Cronbach & G. C. Gleser [Eds.; J. Wassing, trans.], Psychological tests and personnel decisions. Urbana: University of Illinois Press, 1965.)

Wolins, L. Responsibility for raw data. American Psychologist, 1962, 17, 657-658.

Received December 27, 1978
