Anne Anastasi - Psychological Testing I


Page 1: Anne Anastasi- Psychological Testing I

ANNE ANASTASI
Professor of Psychology, Fordham University

Psychological Testing

MACMILLAN PUBLISHING CO., INC. New York

Collier Macmillan Publishers London

Page 2: Anne Anastasi- Psychological Testing I

IN A revised edition, one expects both similarities and differences. This edition shares with the earlier versions the objectives and basic approach of the book. The primary goal of this text is still to contribute toward the proper evaluation of psychological tests and the correct interpretation and use of test results. This goal calls for several kinds of information: (1) an understanding of the major principles of test construction, (2) psychological knowledge about the behavior being assessed, (3) sensitivity to the social and ethical implications of test use, and (4) broad familiarity with the types of available instruments and the sources of information about tests. A minor innovation in the fourth edition is the addition of a suggested outline for test evaluation (Appendix C).

In successive editions, it has been necessary to exercise more and more restraint to keep the number of specific tests discussed in the book from growing with the field; it has never been my intention to provide a miniature Mental Measurements Yearbook! Nevertheless, I am aware that principles of test construction and interpretation can be better understood when applied to particular tests. Moreover, acquaintance with the major types of available tests, together with an understanding of their special contributions and limitations, is an essential component of knowledge about contemporary testing. For these reasons, specific tests are again examined and evaluated in Parts 3, 4, and 5. These tests have been chosen either because they are outstanding examples with which the student of testing should be familiar or because they illustrate some special point of test construction or interpretation. In the text itself, the principal focus is on types of tests rather than on specific instruments. At the same time, Appendix E contains a classified list of over 250 tests, including not only those cited in the text but also others added to provide a more representative sample.

As for the differences, they loomed especially large during the preparation of this edition. Much that has happened in human society since the mid-1960's has had an impact on psychological testing. Some of these developments were briefly described in the last two chapters of the third edition. Today they have become part of the mainstream of psychological testing and have been accordingly incorporated in the appropriate sections throughout the book. Recent changes in psychological testing that are reflected in the present edition can be described on three levels: (1) general orientation toward testing, (2) substantive and methodological developments, and (3) "ordinary progress" such as the publication of new tests and revision of earlier tests.

All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the Publisher.

Earlier editions copyright 1954 and © 1961 by Macmillan Publishing Co., Inc., and copyright © 1968 by Anne Anastasi.

MACMILLAN PUBLISHING CO., INC.

866 Third Avenue, New York, New York 10022

COLLIER MACMILLAN CANADA, LTD.

Library of Congress Cataloging in Publication Data

Anastasi, Anne, (date)
Psychological testing.

Bibliography: p.
Includes indexes.
1. Mental tests. 2. Personality tests. I. Title.
[DNLM: 1. Psychological tests. WM145 A534P]
BF431.A573 1976 153.9 75-2206
ISBN 0-02-302980-3

Page 3: Anne Anastasi- Psychological Testing I

Preface

An example of changes on the first level is the increasing awareness of the ethical, social, and legal implications of testing. In the present edition, this topic has been expanded and treated in a separate chapter early in the book (Ch. 3) and in Appendixes A and B. A cluster of related developments represents a broadening of test uses. Besides the traditional applications of tests in selection and diagnosis, increasing attention is being given to administering tests for self-knowledge and self-development, and to training individuals in the use of their own test results in decision making (Chs. 3 and 4). In the same category are the continuing replacement of global scores with multitrait profiles and the application of classification strategies, whereby "everyone can be above average" in one or more socially valued variables (Ch. 7). From another angle, efforts are being made to modify traditional interpretations of test scores, in both cognitive and noncognitive areas, in the light of accumulating psychological knowledge. In this edition, Chapter 12 brings together psychological issues in the interpretation of intelligence test scores, touching on such problems as stability and change in intellectual level over time; the nature of intelligence; and the testing of intelligence in early childhood, in old age, and in different cultures. Another example is provided by the increasing emphasis on situational specificity and person-by-situation interactions in personality testing, stimulated in large part by the social-learning theorists (Ch. 17).

The second level, covering substantive and methodological changes, is illustrated by the impact of computers on the development, administration, scoring, and interpretation of tests (see especially Chs. 4, 11, 13, 17, 18, 19). The use of computers in administering or managing instructional programs has also stimulated the development of criterion-referenced tests, although other conditions have contributed to the upsurge of interest in such tests in education. Criterion-referenced tests are discussed principally in Chapters 4, 5, and 14. Other types of instruments that have risen to prominence and have received fuller treatment in the present edition include: tests for identifying specific learning disabilities (Ch. 16), inventories and other devices for use in behavior modification programs (Ch. 20), instruments for assessing early childhood education (Ch. 14), Piagetian "ordinal" scales (Chs. 10 and 14), basic education and literacy tests for adults (Chs. 13 and 14), and techniques for the assessment of environments (Ch. 20). Problems to be considered in the assessment of minority groups, including the question of test bias, are examined from different angles in Chapters 3, 7, 8, and 12.

On the third level, it may be noted that over 100 of the tests listed in this edition have been either initially published or revised since the publication of the preceding edition (1968). Major examples include the McCarthy Scales of Children's Abilities, the WISC-R, the 1972 Stanford-Binet norms (with all the resulting readjustments in interpretations),

Forms S and T of the DAT (including a computerized Career Planning Program), the Strong-Campbell Interest Inventory (merged form of the SVIB), and the latest revisions of the Stanford Achievement Test and the Metropolitan Readiness Tests.

It is a pleasure to acknowledge the assistance received from many sources in the preparation of this edition. The completion of the project was facilitated by a one-semester Faculty Fellowship awarded by Fordham University and by a grant from the Fordham University Research Council covering principally the services of a research assistant. These services were performed by Stanley Friedland with an unusual combination of expertise, responsibility, and graciousness. I am indebted to the many authors and test publishers who provided reprints, unpublished manuscripts, specimen sets of tests, and answers to my innumerable inquiries by mail and telephone. For assistance extending far beyond the interests and responsibilities of any single publisher, I am especially grateful to Anna Dragositz of Educational Testing Service and Blythe Mitchell of Harcourt Brace Jovanovich, Inc. I want to acknowledge the significant contribution of John T. Cowles of the University of Pittsburgh, who assumed complete responsibility for the preparation of the Instructor's Manual to accompany this text.

For informative discussions and critical comments on particular topics, I want to convey my sincere thanks to William H. Angoff of Educational Testing Service and to several members of the Fordham University Psychology Department, including David R. Chabot, Marvin Reznikoff, Reuben M. Schonebaum, and Warren W. Tryon. Grateful acknowledgment is also made of the thoughtful recommendations submitted by course instructors in response to the questionnaire distributed to current users of the third edition. Special thanks in this connection are due to Mary Carol Cahill for her extensive, constructive, and wide-ranging suggestions. I wish to express my appreciation to Victoria Overton of the Fordham University library staff for her efficient and courteous assistance in bibliographic matters. Finally, I am happy to record the contributions of my husband, John Porter Foley, Jr., who again participated in the solution of countless problems at all stages in the preparation of the book.

A.A.

Page 4: Anne Anastasi- Psychological Testing I

CONTENTS

PART 1
CONTEXT OF PSYCHOLOGICAL TESTING

1. FUNCTIONS AND ORIGINS OF PSYCHOLOGICAL TESTING 3

Current uses of psychological tests
Early interest in classification and training of the mentally retarded 5
The first experimental psychologists 7
Contributions of Francis Galton 8
Cattell and the early "mental tests" 9
Binet and the rise of intelligence tests 10
Group testing 12
Aptitude testing 13
Standardized achievement tests 16
Measurement of personality 18
Sources of information about tests 20

2. NATURE AND USE OF PSYCHOLOGICAL TESTS

What is a psychological test? 23
Reasons for controlling the use of psychological tests
Test administration 32
Rapport 34
Test anxiety 37
Examiner and situational variables 39
Coaching, practice, and test sophistication 41

3. SOCIAL AND ETHICAL IMPLICATIONS OF TESTING

User qualifications 45
Testing instruments and procedures 47
Protection of privacy 49
Confidentiality 52
Communicating test results 56
Testing and the civil rights of minorities 57


Page 5: Anne Anastasi- Psychological Testing I

PART 2
PRINCIPLES OF PSYCHOLOGICAL TESTING

4. NORMS AND THE INTERPRETATION OF TEST SCORES

Statistical concepts 68
Developmental norms 73
Within-group norms 77
Relativity of norms 88
Computer utilization in the interpretation of test scores 94
Criterion-referenced testing 96

5. RELIABILITY

The correlation coefficient 104
Types of reliability 110
Reliability of speeded tests 122
Dependence of reliability coefficients on the sample tested 125
Standard error of measurement 127
Reliability of criterion-referenced tests 131

6. VALIDITY: BASIC CONCEPTS

Content validity 134
Criterion-related validity 140
Construct validity 151
Overview 158

7. VALIDITY: MEASUREMENT AND INTERPRETATION

Validity coefficient and error of estimate 163
Test validity and decision theory 167
Moderator variables 177
Combining information from different tests 180
Use of tests for classification decisions 186
Statistical analyses of test bias 191

8. ITEM ANALYSIS

Item difficulty 199
Item validity 206
Internal consistency 215
Item analysis of speeded tests 217
Cross validation 219
Item-group interaction 222

PART 3
TESTS OF GENERAL INTELLECTUAL LEVEL

9. INDIVIDUAL TESTS

Stanford-Binet Intelligence Scale 230
Wechsler Adult Intelligence Scale 245
Wechsler Intelligence Scale for Children 255
Wechsler Preschool and Primary Scale of Intelligence 260

10. TESTS FOR SPECIAL POPULATIONS

Infant and preschool testing 266
Testing the physically handicapped 281
Cross-cultural testing 287

11. GROUP TESTING

Group tests versus individual tests 299
Multilevel batteries 305
Tests for the college level and beyond 318

12. PSYCHOLOGICAL ISSUES IN INTELLIGENCE TESTING

Longitudinal studies of intelligence 327
Intelligence in early childhood 332
Problems in the testing of adult intelligence 337
Problems in cross-cultural testing 343
Nature of intelligence 349

PART 4
TESTS OF SEPARATE ABILITIES

13. MEASURING MULTIPLE APTITUDES

Factor analysis 362
Theories of trait organization 369
Multiple aptitude batteries 378
Measurement of creativity 388

14. EDUCATIONAL TESTING

Achievement tests: their nature and uses 398
General achievement batteries 403
Standardized tests in separate subjects 410
Teacher-made classroom tests 412

Page 6: Anne Anastasi- Psychological Testing I

Diagnostic and criterion-referenced tests 417
Specialized prognostic tests 423
Assessment in early childhood education 425

15. OCCUPATIONAL TESTING

Validation of industrial tests 435
Short screening tests for industrial personnel 439
Special aptitude tests 442
Testing in the professions 458

16. CLINICAL TESTING

Diagnostic use of intelligence tests 465
Special tests for detecting cognitive dysfunction
Identifying specific learning disabilities 478
Clinical judgment 482
Report writing 487

PART 5
PERSONALITY TESTS

17. SELF-REPORT INVENTORIES

Content validation 494
Empirical criterion keying 496
Factor analysis in test development 506
Personality theory in test development 510
Test-taking attitudes and response sets 515
Situational specificity 521
Evaluation of personality inventories

18. MEASURES OF INTERESTS, ATTITUDES, AND VALUES 527

Interest inventories 528
Opinion and attitude measurement 543
Attitude scales 546
Assessment of values and related variables 552

19. PROJECTIVE TECHNIQUES

Nature of projective techniques 558
Inkblot techniques 559
Thematic Apperception Test and related instruments
Other projective techniques 569
Evaluation of projective techniques 576

20. OTHER ASSESSMENT TECHNIQUES

"Objective" performance tests 588
Situational tests 593
Self-concepts and personal constructs 598
Assessment techniques in behavior modification programs
Observer reports 606
Biographical inventories 614
The assessment of environments 616

B. Guidelines on Employee Selection Procedures (EEOC)

Guidelines for Reporting Criterion-Related and Content Validity (OFCC)

Page 7: Anne Anastasi- Psychological Testing I

PART 1
Context of Psychological Testing

Page 8: Anne Anastasi- Psychological Testing I

CHAPTER 1

Functions and Origins of Psychological Testing

ANYONE reading this book today could undoubtedly illustrate what is meant by a psychological test. It would be easy enough to recall a test the reader himself has taken in school, in college, in the armed services, in the counseling center, or in the personnel office. Or perhaps the reader has served as a subject in an experiment in which standardized tests were employed. This would certainly not have been the case fifty years ago. Psychological testing is a relatively young branch of one of the youngest of the sciences.

CURRENT USES OF PSYCHOLOGICAL TESTS

Basically, the function of psychological tests is to measure differences between individuals or between the reactions of the same individual on different occasions. One of the first problems that stimulated the development of psychological tests was the identification of the mentally retarded. To this day, the detection of intellectual deficiencies remains an important application of certain types of psychological tests. Related clinical uses of tests include the examination of the emotionally disturbed, the delinquent, and other types of behavioral deviants. A strong impetus to the early development of tests was likewise provided by problems arising in education. At present, schools are among the largest test users. The classification of children with reference to their ability to profit from different types of school instruction, the identification of the intellectually retarded on the one hand and the gifted on the other, the diagnosis of academic failures, the educational and vocational counseling of high school and college students, and the selection of applicants for professional and other special schools are among the many educational uses of tests.

The selection and classification of industrial personnel represent another major application of psychological testing. From the assembly-line

Page 9: Anne Anastasi- Psychological Testing I


operator or filing clerk to top management, there is scarcely a type of job for which some kind of psychological test has not proved helpful in such matters as hiring, job assignment, transfer, promotion, or termination. To be sure, the effective employment of tests in many of these situations, especially in connection with high-level jobs, usually requires that the tests be used as an adjunct to skillful interviewing, so that test scores may be properly interpreted in the light of other background information about the individual. Nevertheless, testing constitutes an important part of the total personnel program. A closely related application of psychological testing is to be found in the selection and classification of military personnel. From simple beginnings in World War I, the scope and variety of psychological tests employed in military situations underwent a phenomenal increase during World War II. Subsequently, research on test development has been continuing on a large scale in all branches of the armed services.

The use of tests in counseling has gradually broadened from a narrowly defined guidance regarding educational and vocational plans to an involvement with all aspects of the person's life. Emotional well-being and effective interpersonal relations have become increasingly prominent objectives of counseling. There is growing emphasis, too, on the use of tests to enhance self-understanding and personal development. Within this framework, test scores are part of the information given to the individual as aids to his own decision-making processes.

It is clearly evident that psychological tests are currently being employed in the solution of a wide range of practical problems. One should not, however, lose sight of the fact that such tests are also serving important functions in basic research. Nearly all problems in differential psychology, for example, require testing procedures as a means of gathering data. As illustrations, reference may be made to studies on the nature and extent of individual differences, the identification of psychological traits, the measurement of group differences, and the investigation of biological and cultural factors associated with behavioral differences. For all such areas of research, and for many others, the precise measurement of individual differences made possible by well-constructed tests is an essential prerequisite. Similarly, psychological tests provide standardized tools for investigating such varied problems as life-span developmental changes within the individual, the relative effectiveness of different educational procedures, the outcomes of psychotherapy, the impact of community programs, and the influence of noise on performance.

From the many different uses of psychological tests, it follows that someknowledge of such tests is needed for an adequate understanding of mostfields of contemporary psychology. It is primarily with this end in viewthat the present book has been prepared. The book is not designed to

make the individual either n skilled examiner and test administrator oran"experf on test construction. It is directed, not to the test specialist, butto the general student of psychology. Some acquaintance with the lead·'ing current tests is necessary in order to understand references to the useof such tests in the psychological literature. And a proper evaluation andinterpretation of test results must ultimately rest on a knowledge of howthe tests were constructe<l, what they can be expected to accomplish, andwhat are their peculiar limitations. Today a familiarity with tests is re-quired, not only b~' those who give or construct tests, but by the generalpsychologist as well.

A brief overview of the historical antecedents and origins of psychological testing will provide perspective and should aid in the understanding of present-day tests.¹ The direction in which contemporary psychological testing has been progressing can be clarified when considered in the light of the precursors of such tests. The special limitations as well as the advantages that characterize current tests likewise become more intelligible when viewed against the background in which they originated.

The roots of testing are lost in antiquity. DuBois (1966) gives a provocative and entertaining account of the system of civil service examinations prevailing in the Chinese empire for some three thousand years. Among the ancient Greeks, testing was an established adjunct to the educational process. Tests were used to assess the mastery of physical as well as intellectual skills. The Socratic method of teaching, with its interweaving of testing and teaching, has much in common with today's programmed learning. From their beginnings in the Middle Ages, European universities relied on formal examinations in awarding degrees and honors. To identify the major developments that shaped contemporary testing, however, we need go no farther than the nineteenth century. It is to these developments that we now turn.

EARLY INTEREST IN CLASSIFICATION AND TRAINING OF THE MENTALLY RETARDED

The nineteenth century witnessed a strong awakening of interest in the humane treatment of the mentally retarded and the insane. Prior to that time, neglect, ridicule, and even torture had been the common lot of these unfortunates. With the growing concern for the proper care of mental

¹ A more detailed account of the early origins of psychological tests can be found in Goodenough (1949) and J. Peterson (1926). See also Boring (1950) and Murphy and Kovach (1972) for more general background, DuBois (1970) for a brief but comprehensive history of psychological testing, and Anastasi (1965) for historical antecedents of the study of individual differences.

Page 10: Anne Anastasi- Psychological Testing I


deviates came a realization that some uniform criteria for identifying and classifying these cases were required. The establishment of many special institutions for the care of the mentally retarded in both Europe and America made the need for setting up admission standards and an objective system of classification especially urgent. First it was necessary to differentiate between the insane and the mentally retarded. The former manifested emotional disorders that might or might not be accompanied by intellectual deterioration from an initially normal level; the latter were characterized essentially by intellectual defect that had been present from birth or early infancy. What is probably the first explicit statement of this distinction is to be found in a two-volume work published in 1838 by the French physician Esquirol (1838), in which over one hundred pages are devoted to mental retardation. Esquirol also pointed out that there are many degrees of mental retardation, varying along a continuum from normality to low-grade idiocy. In the effort to develop some system for classifying the different degrees and varieties of retardation, Esquirol tried several procedures but concluded that the individual's use of language provides the most dependable criterion of his intellectual level. It is interesting to note that current criteria of mental retardation are also largely linguistic and that present-day intelligence tests are heavily loaded with verbal content. The important part verbal ability plays in our concept of intelligence will be repeatedly demonstrated in subsequent chapters.

Of special significance are the contributions of another French physician, Seguin, who pioneered in the training of the mentally retarded. Having rejected the prevalent notion of the incurability of mental retardation, Seguin (1866) experimented for many years with what he termed the physiological method of training; and in 1837 he established the first school devoted to the education of mentally retarded children. In 1848 he emigrated to America, where his ideas gained wide recognition. Many of the sense-training and muscle-training techniques currently in use in institutions for the mentally retarded were originated by Seguin. By these methods, severely retarded children are given intensive exercise in sensory discrimination and in the development of motor control. Some of the procedures developed by Seguin for this purpose were eventually incorporated into performance or nonverbal tests of intelligence. An example is the Seguin Form Board, in which the individual is required to insert variously shaped blocks into the corresponding recesses as quickly as possible.

More than half a century after the work of Esquirol and Seguin, the French psychologist Alfred Binet urged that children who failed to respond to normal schooling be examined before dismissal and, if considered educable, be assigned to special classes (T. H. Wolf, 1973). With


his fellow members of the Society for the Psychological Study of the Child, Binet stimulated the Ministry of Public Instruction to take steps to improve the condition of retarded children. A specific outcome was the establishment of a ministerial commission for the study of retarded children, to which Binet was appointed. This appointment was a momentous event in the history of psychological testing, of which more will be said later.

THE FIRST EXPERIMENTAL PSYCHOLOGISTS

The early experimental psychologists of the nineteenth century were not, in general, concerned with the measurement of individual differences. The principal aim of psychologists of that period was the formulation of generalized descriptions of human behavior. It was the uniformities rather than the differences in behavior that were the focus of attention. Individual differences were either ignored or were accepted as a necessary evil that limited the applicability of the generalizations. Thus, the fact that one individual reacted differently from another when observed under identical conditions was regarded as a form of error. The presence of such error, or individual variability, rendered the generalizations approximate rather than exact. This was the attitude toward individual differences that prevailed in such laboratories as that founded by Wundt at Leipzig in 1879, where many of the early experimental psychologists received their training.

In their choice of topics, as in many other phases of their work, the founders of experimental psychology reflected the influence of their backgrounds in physiology and physics. The problems studied in their laboratories were concerned largely with sensitivity to visual, auditory, and other sensory stimuli and with simple reaction time. This emphasis on sensory phenomena was in turn reflected in the nature of the first psychological tests, as will be apparent in subsequent sections.

Still another way in which nineteenth-century experimental psychology influenced the course of the testing movement may be noted. The early psychological experiments brought out the need for rigorous control of the conditions under which observations were made. For example, the wording of directions given to the subject in a reaction-time experiment might appreciably increase or decrease the speed of the subject's response. Or again, the brightness or color of the surrounding field could markedly alter the appearance of a visual stimulus. The importance of making observations on all subjects under standardized conditions was thus vividly demonstrated. Such standardization of procedure eventually became one of the special earmarks of psychological tests.

Page 11: Anne Anastasi- Psychological Testing I

CONTRIBUTIONS OF FRANCIS GALTON

It was the English biologist Sir Francis Galton who was primarily responsible for launching the testing movement. A unifying factor in Galton's numerous and varied research activities was his interest in human heredity. In the course of his investigations on heredity, Galton realized the need for measuring the characteristics of related and unrelated persons. Only in this way could he discover, for example, the exact degree of resemblance between parents and offspring, brothers and sisters, cousins, or twins. With this end in view, Galton was instrumental in inducing a number of educational institutions to keep systematic anthropometric records on their students. He also set up an anthropometric laboratory at the International Exposition of 1884 where, by paying threepence, visitors could be measured in certain physical traits and could take tests of keenness of vision and hearing, muscular strength, reaction time, and other simple sensorimotor functions. When the exposition closed, the laboratory was transferred to South Kensington Museum, London, where it operated for six years. By such methods, the first large, systematic body of data on individual differences in simple psychological processes was gradually accumulated.

Galton himself devised most of the simple tests administered at his anthropometric laboratory, many of which are still familiar either in their original or in modified forms. Examples include the Galton bar for visual discrimination of length, the Galton whistle for determining the highest audible pitch, and graduated series of weights for measuring kinesthetic discrimination. It was Galton's belief that tests of sensory discrimination could serve as a means of gauging a person's intellect. In this respect, he was partly influenced by the theories of Locke. Thus Galton wrote: "The only information that reaches us concerning outward events appears to pass through the avenue of our senses; and the more perceptive the senses are of difference, the larger is the field upon which our judgment and intelligence can act" (Galton, 1883, p. 27). Galton had also noted that idiots tend to be defective in the ability to discriminate heat, cold, and pain, an observation that further strengthened his conviction that sensory discriminative capacity "would on the whole be highest among the intellectually ablest" (Galton, 1883, p. 29).

Galton also pioneered in the application of rating-scale and questionnaire methods, as well as in the use of the free association technique subsequently employed for a wide variety of purposes. A further contribution of Galton is to be found in his development of statistical methods for the analysis of data on individual differences. Galton selected and adapted a number of techniques previously derived by mathematicians. These techniques he put in such form as to permit their use by the mathematically untrained investigator who might wish to treat test results quantitatively. He thereby extended enormously the application of statistical procedures to the analysis of test data. This phase of Galton's work has been carried forward by many of his students, the most eminent of whom was Karl Pearson.

CATTELL AND THE EARLY "MENTAL TESTS"

An especially prominent position in the development of psychological testing is occupied by the American psychologist James McKeen Cattell. The newly established science of experimental psychology and the still newer testing movement merged in Cattell's work. For his doctorate at Leipzig, he completed a dissertation on individual differences in reaction time, despite Wundt's resistance to this type of investigation. While lecturing at Cambridge in 1888, Cattell's own interest in the measurement of individual differences was reinforced by contact with Galton. On his return to America, Cattell was active both in the establishment of laboratories for experimental psychology and in the spread of the testing movement.

In an article written by Cattell in 1890, the term "mental test" was used for the first time in the psychological literature. This article described a series of tests that were being administered annually to college students in the effort to determine their intellectual level. The tests, which had to be administered individually, included measures of muscular strength, speed of movement, sensitivity to pain, keenness of vision and of hearing, weight discrimination, reaction time, memory, and the like.

In his choice of tests, Cattell shared Galton's view that a measure of intellectual functions could be obtained through tests of sensory discrimination and reaction time. Cattell's preference for such tests was also bolstered by the fact that simple functions could be measured with precision and accuracy, whereas the development of objective measures for the more complex functions seemed at that time a well-nigh hopeless task.

Cattell's tests were typical of those to be found in a number of test series developed during the last decade of the nineteenth century. Such test series were administered to schoolchildren, college students, and miscellaneous adults. At the Columbian Exposition held in Chicago in 1893, Jastrow set up an exhibit at which visitors were invited to take tests of sensory, motor, and simple perceptual processes and to compare their skill with the norms (J. Peterson, 1926; Philippe, 1894). A few attempts to evaluate such early tests yielded very discouraging results. The individual's performance showed little correspondence from one test to another (Sharp, 1898-1899; Wissler, 1901), and it exhibited little or no

Page 12: Anne Anastasi- Psychological Testing I


relation to independent estimates of intellectual level based on teachers' ratings (Bolton, 1891-1892; J. A. Gilbert, 1894) or academic grades (Wissler, 1901).

A number of test series assembled by European psychologists of the period tended to cover somewhat more complex functions. Kraepelin (1895), who was interested primarily in the clinical examination of psychiatric patients, prepared a long series of tests to measure what he regarded as basic factors in the characterization of an individual. The tests, employing chiefly simple arithmetic operations, were designed to measure practice effects, memory, and susceptibility to fatigue and to distraction. A few years earlier, Oehrn (1889), a pupil of Kraepelin, had employed tests of perception, memory, association, and motor functions in an investigation on the interrelations of psychological functions. Another German psychologist, Ebbinghaus (1897), administered tests of arithmetic computation, memory span, and sentence completion to schoolchildren. The most complex of the three tests, sentence completion, was the only one that showed a clear correspondence with the children's scholastic achievement.

Like Kraepelin, the Italian psychologist Ferrari and his students were interested primarily in the use of tests with pathological cases (Guicciardi & Ferrari, 1896). The test series they devised ranged from physiological measures and motor tests to apprehension span and the interpretation of pictures. In an article published in France in 1895, Binet and Henri criticized most of the available test series as being too largely sensory and as concentrating unduly on simple, specialized abilities. They argued further that, in the measurement of the more complex functions, great precision is not necessary, since individual differences are larger in these functions. An extensive and varied list of tests was proposed, covering such functions as memory, imagination, attention, comprehension, suggestibility, aesthetic appreciation, and many others. In these tests we can recognize the trends that were eventually to lead to the development of the famous Binet intelligence scales.

BINET AND THE RISE OF INTELLIGENCE TESTS

Binet and his co-workers devoted many years to active and ingenious research on ways of measuring intelligence. Many approaches were tried, including even the measurement of cranial, facial, and hand form, and the analysis of handwriting. The results, however, led to a growing conviction that the direct, even though crude, measurement of complex intellectual functions offered the greatest promise. Then a specific situation arose that brought Binet's efforts to immediate practical fruition. In 1904, the Minister of Public Instruction appointed Binet to the previously cited commission to study procedures for the education of retarded children. It was in connection with the objectives of this commission that Binet, in collaboration with Simon, prepared the first Binet-Simon Scale (Binet & Simon, 1905).

This scale, known as the 1905 scale, consisted of 30 problems or tests arranged in ascending order of difficulty. The difficulty level was determined empirically by administering the tests to 50 normal children aged 3 to 11 years, and to some mentally retarded children and adults. The tests were designed to cover a wide variety of functions, with special emphasis on judgment, comprehension, and reasoning, which Binet regarded as essential components of intelligence. Although sensory and perceptual tests were included, a much greater proportion of verbal content was found in this scale than in most test series of the time. The 1905 scale was presented as a preliminary and tentative instrument, and no precise objective method for arriving at a total score was formulated.

In the second, or 1908, scale, the number of tests was increased, some unsatisfactory tests from the earlier scale were eliminated, and all tests were grouped into age levels on the basis of the performance of about 300 normal children between the ages of 3 and 13 years. Thus, in the 3-year level were placed all tests passed by 80 to 90 percent of normal 3-year-olds; in the 4-year level, all tests similarly passed by normal 4-year-olds; and so on to age 13. The child's score on the entire test could then be expressed as a mental level corresponding to the age of normal children whose performance he equaled. In the various translations and adaptations of the Binet scales, the term "mental age" was commonly substituted for "mental level." Since mental age is such a simple concept to grasp, the introduction of this term undoubtedly did much to popularize intelligence testing.² Binet himself, however, avoided the term "mental age" because of its unverified developmental implications and preferred the more neutral term "mental level" (T. H. Wolf, 1973).

A third revision of the Binet-Simon Scale appeared in 1911, the year of Binet's untimely death. In this scale, no fundamental changes were introduced. Minor revisions and relocations of specific tests were instituted. More tests were added at several year levels, and the scale was extended to the adult level.

Even prior to the 1908 revision, the Binet-Simon tests attracted wide

² Goodenough (1949, pp. 50-51) notes that in 1887, 21 years before the appearance of the 1908 Binet-Simon Scale, S. E. Chaille published in the New Orleans Medical and Surgical Journal a series of tests for infants arranged according to the age at which the tests are commonly passed. Partly because of the limited circulation of the journal and partly, perhaps, because the scientific community was not ready for it, the significance of this age-scale concept passed unnoticed at the time. Binet's own scale was influenced by the work of some of his contemporaries, notably Blin and Damaye, who prepared a set of oral questions from which they derived a single global score for each child (T. H. Wolf, 1973).
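The age-scale logic of the 1908 revision lends itself to a compact illustration. The sketch below is hypothetical (the item names are invented and the strict all-items-passed rule is a simplification; the actual scale also granted credit for tests passed above the basal age), but it shows how a mental level is read off from items grouped by age:

```python
# Hypothetical age-scale scoring in the spirit of the 1908 Binet-Simon
# scale: items are grouped by the age at which most normal children
# pass them, and the child's "mental level" is the highest consecutive
# age level at which every test is passed (a simplification).
ITEMS_BY_AGE = {              # invented item labels, not Binet's items
    3: ["points_to_nose", "repeats_two_digits"],
    4: ["names_objects", "compares_lines"],
    5: ["copies_square", "counts_four_pennies"],
}

def mental_level(passed: set[str]) -> int:
    level = 0
    for age in sorted(ITEMS_BY_AGE):
        if all(item in passed for item in ITEMS_BY_AGE[age]):
            level = age       # every test at this age level was passed
        else:
            break             # stop at the first level not fully passed
    return level

print(mental_level({"points_to_nose", "repeats_two_digits",
                    "names_objects", "compares_lines"}))   # prints 4
```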

Page 13: Anne Anastasi- Psychological Testing I


attention among psychologists throughout the world. Translations and adaptations appeared in many languages. In America, a number of different revisions were prepared, the most famous of which is the one developed under the direction of L. M. Terman at Stanford University, and known as the Stanford-Binet (Terman, 1916). It was in this test that the intelligence quotient (IQ), or ratio between mental age and chronological age, was first used. The latest revision of this test is widely employed today and will be more fully considered in Chapter 9. Of special interest, too, is the first Kuhlmann-Binet revision, which extended the scale downward to the age level of 3 months (Kuhlmann, 1912). This scale represents one of the earliest efforts to develop preschool and infant tests of intelligence.
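As a worked illustration of the ratio just defined (the multiplier of 100, which clears the decimal point, follows the standard formula rather than anything stated in the text above):

\[
\mathrm{IQ} \;=\; 100 \times \frac{\mathrm{MA}}{\mathrm{CA}},
\qquad \text{e.g., } \mathrm{MA} = 10,\ \mathrm{CA} = 8:\quad
\mathrm{IQ} = 100 \times \frac{10}{8} = 125 .
\]

A child whose mental age equals his chronological age thus obtains an IQ of exactly 100, the value conventionally taken as average.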

GROUP TESTING

The Binet tests, as well as all their revisions, are individual scales in the sense that they can be administered to only one person at a time. Many of the tests in these scales require oral responses from the subject or necessitate the manipulation of materials. Some call for individual timing of responses. For these and other reasons, such tests are not adapted to group administration. Another characteristic of the Binet type of test is that it requires a highly trained examiner. Such tests are essentially clinical instruments, suited to the intensive study of individual cases.

Group testing, like the first Binet scale, was developed to meet a pressing practical need. When the United States entered World War I in 1917, a committee was appointed by the American Psychological Association to consider ways in which psychology might assist in the conduct of the war. This committee, under the direction of Robert M. Yerkes, recognized the need for the rapid classification of the million and a half recruits with respect to general intellectual level. Such information was relevant to many administrative decisions, including rejection or discharge from military service, assignment to different types of service, or admission to officer-training camps. It was in this setting that the first group intelligence test was developed. In this task, the Army psychologists drew on all available test materials, and especially on an unpublished group intelligence test prepared by Arthur S. Otis, which he turned over to the Army. A major contribution of Otis's test, which he designed while a student in one of Terman's graduate courses, was the introduction of multiple-choice and other "objective" item types.

The tests finally developed by the Army psychologists came to be known as the Army Alpha and the Army Beta. The former was designed for general routine testing; the latter was a nonlanguage scale employed with illiterates and with foreign-born recruits who were unable to take a test in English. Both tests were suitable for administration to large groups.

Shortly after the termination of World War I, the Army tests were released for civilian use. Not only did the Army Alpha and Army Beta themselves pass through many revisions, the latest of which are even now in use, but they also served as models for most group intelligence tests. The testing movement underwent a tremendous spurt of growth. Soon group intelligence tests were being devised for all ages and types of persons, from preschool children to graduate students. Large-scale testing programs, previously impossible, were now being launched with zestful optimism. Because group tests were designed as mass testing instruments, they not only permitted the simultaneous examination of large groups but also simplified the instructions and administration procedures so as to demand a minimum of training on the part of the examiner. Schoolteachers began to give intelligence tests to their classes. College students were routinely examined prior to admission. Extensive studies of special adult groups, such as prisoners, were undertaken. And soon the general public became IQ-conscious.

The application of such group intelligence tests far outran their technical improvement. That the tests were still crude instruments was often forgotten in the rush of gathering scores and drawing practical conclusions from the results. When the tests failed to meet unwarranted expectations, skepticism and hostility toward all testing often resulted. Thus, the testing boom of the twenties, based on the indiscriminate use of tests, may have done as much to retard as to advance the progress of psychological testing.

APTITUDE TESTING

Although intelligence tests were originally designed to sample a wide variety of functions in order to estimate the individual's general intellectual level, it soon became apparent that such tests were quite limited in their coverage. Not all important functions were represented. In fact, most intelligence tests were primarily measures of verbal ability and, to a lesser extent, of the ability to handle numerical and other abstract and symbolic relations. Gradually psychologists came to recognize that the term "intelligence test" was a misnomer, since only certain aspects of intelligence were measured by such tests.

To be sure, the tests covered abilities that are of prime importance in our culture. But it was realized that more precise designations, in terms of the type of information these tests are able to yield, would be preferable.

Page 14: Anne Anastasi- Psychological Testing I


For example, a number of tests that would probably have been called intelligence tests during the twenties later came to be known as scholastic aptitude tests. This shift in terminology was made in recognition of the fact that many so-called intelligence tests measure that combination of abilities demanded by academic work.

Even prior to World War I, psychologists had begun to recognize the need for tests of special aptitudes to supplement the global intelligence tests. These special aptitude tests were developed particularly for use in vocational counseling and in the selection and classification of industrial and military personnel. Among the most widely used are tests of mechanical, clerical, musical, and artistic aptitudes.

The critical evaluation of intelligence tests that followed their widespread and indiscriminate use during the twenties also revealed another noteworthy fact: an individual's performance on different parts of such a test often showed marked variation. This was especially apparent on group tests, in which the items are commonly segregated into subtests of relatively homogeneous content. For example, a person might score relatively high on a verbal subtest and low on a numerical subtest, or vice versa. To some extent, such internal variability is also discernible on a test like the Stanford-Binet, in which, for example, all items involving words might prove difficult for a particular individual, whereas items employing pictures or geometric diagrams may place him at an advantage.

Test users, and especially clinicians, frequently utilized such intercomparisons in order to obtain more insight into the individual's psychological make-up. Thus, not only the IQ or other global score but also scores on subtests would be examined in the evaluation of the individual case. Such a practice is not to be generally recommended, however, because intelligence tests were not designed for the purpose of differential aptitude analysis. Often the subtests being compared contain too few items to yield a stable or reliable estimate of a specific ability. As a result, the obtained difference between subtest scores might be reversed if the individual were retested on a different day or with another form of the same test. If such intraindividual comparisons are to be made, tests are needed that are specially designed to reveal differences in performance in various functions.
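The instability of short subtests can be made concrete with the Spearman-Brown formula discussed in Chapter 5 (the figures below are illustrative, not from the text). Lengthening a test by a factor k changes its reliability according to

\[
r_{kk} \;=\; \frac{k\, r_{11}}{1 + (k - 1)\, r_{11}},
\qquad \text{e.g., } r_{11} = .50,\ k = 2:\quad
r_{kk} = \frac{2(.50)}{1 + .50} \approx .67 .
\]

Read in the other direction (k < 1), the formula shows why a brief subtest carved out of a longer instrument has markedly lower reliability than the total score, and hence why differences between such subtest scores may fail to reappear on retesting.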

While the practical application of tests demonstrated the need for differential aptitude tests, a parallel development in the study of trait organization was gradually providing the means for constructing such tests. Statistical studies on the nature of intelligence had been exploring the interrelations among scores obtained by many persons on a wide variety of different tests. Such investigations were begun by the English psychologist Charles Spearman (1904, 1927) during the first decade of the present century. Subsequent methodological developments, based on the work of such American psychologists as T. L. Kelley (1928) and L. L. Thurstone (1935, 1947), as well as on that of other American and English investigators, have come to be known as "factor analysis."

The contributions that the methods of factor analysis have made to test construction will be more fully examined and illustrated in Chapter 13. For the present, it will suffice to note that the data gathered by such procedures have indicated the presence of a number of relatively independent factors, or traits. Some of these traits were represented, in varying proportions, in the traditional intelligence tests. Verbal comprehension and numerical reasoning are examples of this type of trait. Others, such as spatial, perceptual, and mechanical aptitudes, were found more often in special aptitude tests than in intelligence tests.
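A minimal computational sketch of this idea follows. It uses modern principal-axis extraction by eigendecomposition rather than the centroid methods of Spearman's and Thurstone's day, and the correlation matrix is invented purely for illustration:

```python
import numpy as np

# Invented correlations among four tests: vocabulary, analogies,
# number series, and block design (values chosen for illustration).
R = np.array([
    [1.00, 0.72, 0.35, 0.30],
    [0.72, 1.00, 0.40, 0.33],
    [0.35, 0.40, 1.00, 0.68],
    [0.30, 0.33, 0.68, 1.00],
])

# Eigendecomposition of R; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # largest factors first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Factor loadings: the correlation of each test with each factor.
loadings = eigvecs * np.sqrt(eigvals)

# The first column emerges as a general factor on which all four tests
# load; the second contrasts the verbal pair with the numerical-spatial
# pair, the kind of pattern that rotation (Ch. 13) resolves into
# separate group factors.
print(loadings[:, :2].round(2))
```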

One of the chief practical outcomes of factor analysis was the development of multiple aptitude batteries. These batteries are designed to provide a measure of the individual's standing in each of a number of traits. In place of a total score or IQ, a separate score is obtained for such traits as verbal comprehension, numerical aptitude, spatial visualization, arithmetic reasoning, and perceptual speed. Such batteries thus provide a suitable instrument for making the kind of intraindividual analysis, or differential diagnosis, that clinicians had been trying for many years to obtain, with crude and often erroneous results, from intelligence tests. These batteries also incorporate into a comprehensive and systematic testing program much of the information formerly obtained from special aptitude tests, since the multiple aptitude batteries cover some of the traits not ordinarily included in intelligence tests.

Multiple aptitude batteries represent a relatively late development in the testing field. Nearly all have appeared since 1945. In this connection, the work of the military psychologists during World War II should also be noted. Much of the test research conducted in the armed services was based on factor analysis and was directed toward the construction of multiple aptitude batteries. In the Air Force, for example, special batteries were constructed for pilots, bombardiers, radio operators, range finders, and scores of other military specialists. A report of the batteries prepared in the Air Force alone occupies at least nine of the nineteen volumes devoted to the aviation psychology program during World War II (Army Air Forces, 1947-1948). Research along these lines is still in progress under the sponsorship of various branches of the armed services. A number of multiple aptitude batteries have likewise been developed for civilian use and are being widely applied in educational and vocational counseling and in personnel selection and classification. Examples of such batteries will be discussed in Chapter 13.

To avoid confusion, a point of terminology should be clarified. The

Page 15: Anne Anastasi- Psychological Testing I


term "aptitude test" has been tracHtiollalJ" cmployed to refer to testsmeasuring relativel\" homo ('ncous and dparlv defined sc rn1C'nts of• I I \., t le term "intelliO'ence test" customarih' refers to more hderogenc-Co) e-. .

~ests yielding a single global score sm:h as an IQ. S~)ecial aptitu~ctests typically measure a single aptitude. ~lultiple al~tltl1de battenesmeasure a number of aptitudes but pro\"ide a profile of scores, one foreaeh aptitude.

STANDARDIZED ACHIEVEMENT TESTS

While psychologists were busy developing intelligence and aptitude tests, traditional school examinations were undergoing a number of technical improvements (Caldwell & Courtis, 1923; Ebel & Damrin, 1960). An important step in this direction was taken by the Boston public schools in 1845, when written examinations were substituted for the oral interrogation of students by visiting examiners. Commenting on this innovation, Horace Mann cited arguments remarkably similar to those used much later to justify the replacement of essay questions by objective multiple-choice items. The written examinations, Mann noted, put all students in a uniform situation, permitted a wider coverage of content, reduced the chance element in question choice, and eliminated the possibility of favoritism on the examiner's part.

After the turn of the century, the first standardized tests for measuring the outcomes of school instruction began to appear. Spearheaded by the work of E. L. Thorndike, these tests utilized measurement principles developed in the psychological laboratory. Examples include scales for rating the quality of handwriting and written compositions, as well as tests in spelling, arithmetic computation, and arithmetic reasoning. Still later came the achievement batteries, initiated by the publication of the first edition of the Stanford Achievement Test in 1923. Its authors were three early leaders in test development: Truman L. Kelley, Giles M. Ruch, and Lewis M. Terman. Foreshadowing many characteristics of modern testing, this battery provided comparable measures of performance in different school subjects, evaluated in terms of a single normative group.

At the same time, evidence was accumulating regarding the lack of agreement among teachers in grading essay tests. By 1930 it was widely recognized that essay tests were not only more time-consuming for examiners and examinees, but also yielded less reliable results than the "new type" of objective items. As the latter came into increasing use in standardized achievement tests, there was a growing emphasis on the design of items to test the understanding and application of knowledge and other broad educational objectives. The decade of the 1930s also witnessed the introduction of test-scoring machines, for which the new objective tests could be readily adapted.

The establishment of statewide, regional, and national testing programs was another noteworthy parallel development. Probably the best known of these programs is that of the College Entrance Examination Board (CEEB). Established at the turn of the century to reduce duplication in the examining of entering college freshmen, this program has undergone profound changes in its testing procedures and in the number and nature of participating colleges, changes that reflect intervening developments in both testing and education. In 1947, the testing functions of the CEEB were merged with those of the Carnegie Corporation and the American Council on Education to form Educational Testing Service (ETS). In subsequent years, ETS has assumed responsibility for a growing number of testing programs on behalf of universities, professional schools, government agencies, and other institutions. Mention should also be made of the American College Testing Program, established in 1959 to screen applicants to colleges not included in the CEEB program, and of several national testing programs for the selection of highly talented students for scholarship awards.

Achievement tests are used not only for educational purposes but also in the selection of applicants for industrial and government jobs. Mention has already been made of the systematic use of civil service examinations in the Chinese empire, dating from 1115 B.C. In modern times, selection of government employees by examination was introduced in European countries in the late eighteenth and early nineteenth centuries. The United States Civil Service Commission installed competitive examinations as a regular procedure in 1883 (Kavruck, 1956). Test construction techniques developed during and prior to World War I were introduced into the examination program of the United States Civil Service with the appointment of L. J. O'Rourke as director of the newly established research division in 1922.

As more and more psychologists trained in psychometrics participated in the construction of standardized achievement tests, the technical aspects of achievement tests increasingly came to resemble those of intelligence and aptitude tests. Procedures for constructing and evaluating all these tests have much in common. The increasing efforts to prepare achievement tests that would measure the attainment of broad educational goals, as contrasted to the recall of factual minutiae, also made the content of achievement tests resemble more closely that of intelligence tests. Today the difference between these two types of tests is chiefly one of degree of specificity of content and extent to which the test presupposes a designated course of prior instruction.



Another area of psychological testing is concerned with the affective or nonintellectual aspects of behavior. Tests designed for this purpose are commonly known as personality tests, although some psychologists prefer to use the term personality in a broader sense, to refer to the entire individual. Intellectual as well as nonintellectual traits would thus be included under this heading. In the terminology of psychological testing, however, the designation "personality test" most often refers to measures of such characteristics as emotional adjustment, interpersonal relations, motivation, interests, and attitudes.

An early precursor of personality testing may be recognized in Kraepelin's use of the free association test with abnormal patients. In this test the subject is given specially selected stimulus words and is required to respond to each with the first word that comes to mind. Kraepelin (1892) also employed this technique to study the psychological effects of fatigue, hunger, and drugs and concluded that all these agents increase the relative frequency of superficial associations. Sommer (1894), also writing during the last decade of the nineteenth century, suggested that the free association test might be used to differentiate between the various forms of mental disorder. The free association technique has subsequently been utilized for a variety of testing purposes and is still currently employed. Mention should also be made of the work of Galton, Pearson, and Cattell in the development of standardized questionnaire and rating-scale techniques. Although originally devised for other purposes, these procedures were eventually employed by others in constructing some of the most common types of current personality tests.

The prototype of the personality questionnaire, or self-report inventory, is the Personal Data Sheet developed by Woodworth during World War I (DuBois, 1970; Symonds, 1931, ch. 5; Goldberg, 1971). This test was designed as a rough screening device for identifying seriously neurotic men who would be unfit for military service. The inventory consisted of a number of questions dealing with common neurotic symptoms, which the individual answered about himself. A total score was obtained by counting the number of symptoms reported. The Personal Data Sheet was not completed early enough to permit its operational use before the war ended. Immediately after the war, however, civilian forms were prepared, including a special form for use with children. The Woodworth Personal Data Sheet, moreover, served as a model for most subsequent emotional adjustment inventories. In some of these questionnaires, an attempt was made to subdivide emotional adjustment into more specific forms, such as home adjustment, school adjustment, and vocational adjustment. Other tests concentrated more intensively on a narrower area of behavior or were concerned with more distinctly social responses, such as dominance-submission in interpersonal contacts. A later development was the construction of tests for quantifying the expression of interests and attitudes. These tests, too, were based essentially on questionnaire techniques.

Another approach to the measurement of personality is through the application of performance or situational tests. In such tests, the subject has a task to perform whose purpose is often disguised. Most of these tests simulate everyday-life situations quite closely. The first extensive application of such techniques is to be found in the tests developed in the late twenties and early thirties by Hartshorne, May, and their associates (1928, 1929, 1930). This series, standardized on schoolchildren, was concerned with such behavior as cheating, lying, stealing, cooperativeness, and persistence. Objective, quantitative scores could be obtained on each of a large number of specific tests. A more recent illustration, for the adult level, is provided by the series of situational tests developed during World War II in the Assessment Program of the Office of Strategic Services (OSS, 1948). These tests were concerned with relatively complex and subtle social and emotional behavior and required rather elaborate facilities and trained personnel for their administration. The interpretation of the subject's responses, moreover, was relatively subjective.

Projective techniques represent a third approach to the study of personality and one that has shown phenomenal growth, especially among clinicians. In such tests, the subject is given a relatively unstructured task that permits wide latitude in its solution. The assumption underlying such methods is that the individual will project his characteristic modes of response into such a task. Like the performance and situational tests, projective techniques are more or less disguised in their purpose, thereby reducing the chances that the subject can deliberately create a desired impression. The previously cited free association test represents one of the earliest types of projective techniques. Sentence-completion tests have also been used in this manner. Other tasks commonly employed in projective techniques include drawing, arranging toys to create a scene, extemporaneous dramatic play, and interpreting pictures or inkblots.

All available types of personality tests present serious difficulties, both practical and theoretical. Each approach has its own special advantages and disadvantages. On the whole, personality testing has lagged far behind aptitude testing in its positive accomplishments. But such lack of progress is not to be attributed to insufficient effort. Research on the measurement of personality has attained impressive proportions, and many ingenious devices and technical improvements are under investigation. It is rather the special difficulties encountered in the measurement of personality that account for the slow advances in this area.


Psychological testing is in a state of rapid change. There are shifting orientations, a constant stream of new tests, revised forms of old tests, and additional data that may refine or alter the interpretation of scores on existing tests. The accelerating rate of change, together with the vast number of available tests, makes it impracticable to survey specific tests in any single text. More intensive coverage of testing instruments and problems in special areas can be found in books dealing with the use of tests in such fields as counseling, clinical practice, personnel selection, and education. References to such publications are given in the appropriate chapters of this book. In order to keep abreast of current developments, however, anyone working with tests needs to be familiar with more direct sources of contemporary information about tests.

One of the most important sources is the series of Mental Measurements Yearbooks (MMY) edited by Buros (1972). These yearbooks cover nearly all commercially available psychological, educational, and vocational tests published in English. The coverage is especially complete for paper-and-pencil tests. Each yearbook includes tests published during a specified period, thus supplementing rather than supplanting the earlier yearbooks. The Seventh Mental Measurements Yearbook, for example, is concerned principally with tests appearing between 1964 and 1970. Tests of continuing interest, however, may be reviewed repeatedly in successive yearbooks, as new data accumulate from pertinent research. The earliest publications in this series were merely bibliographies of tests. Beginning in 1938, however, the yearbook assumed its current form, which includes critical reviews of most of the tests by one or more test experts, as well as a complete list of published references pertaining to each test. Routine information regarding publisher, price, forms, and age of subjects for whom the test is suitable is also regularly given.

A comprehensive bibliography covering all types of published tests available in English-speaking countries is provided by Tests in Print (Buros, 1974). Two related sources are Reading Tests and Reviews (Buros, 1968) and Personality Tests and Reviews (Buros, 1970). Both include a number of tests not found in any volume of the MMY, as well as master indexes that facilitate the location of tests in the MMY. Reviews of specific tests are also published in several psychological and educational journals, such as the Journal of Educational Measurement and the Journal of Counseling Psychology.

Since 1970 several sourcebooks have appeared which provide information about unpublished or little-known instruments, largely supplementing the material listed in the MMY. A comprehensive survey of such instruments can be found in A Sourcebook for Mental Health Measures (Comrey, Backer, & Glaser, 1973). Containing approximately 1,100 abstracts, this sourcebook includes tests, questionnaires, rating scales, and other devices for assessing both aptitude and personality variables in adults and children. Another similar reference is entitled Measures for Psychological Assessment (Chun, Cobb, & French, 1975). For each of 3,000 measures, this volume gives the original source as well as an annotated bibliography of the studies in which the measure was subsequently used. The entries were located through a search of 26 measurement-related journals for the years 1960 to 1970.

Information on assessment devices suitable for children from birth to 12 years is summarized in Tests and Measurements in Child Development: A Handbook (Johnson & Bommarito, 1971). Covering only tests not listed in the MMY, this handbook describes instruments located through an intensive journal search spanning a ten-year period. Selection criteria included availability of the test to professionals, adequate instructions for administration and scoring, sufficient length, and convenience of use (i.e., not requiring expensive or elaborate equipment). A still more specialized collection covers measures of social and emotional development applicable to children between the ages of 3 and 6 years (Walker, 1973).

Finally, it should be noted that the most direct source of information regarding specific current tests is provided by the catalogues of test publishers and by the manual that accompanies each test. A comprehensive list of test publishers, with addresses, can be found in the latest Mental Measurements Yearbook. For ready reference, the names and addresses of some of the larger American publishers and distributors of psychological tests are given in Appendix D. Catalogues of current tests can be obtained from each of these publishers on request. Manuals and specimen sets of tests can be purchased by qualified users.

The test manual should provide the essential information required for administering, scoring, and evaluating a particular test. In it should be found full and detailed instructions, scoring key, norms, and data on reliability and validity. Moreover, the manual should report the number and nature of subjects on whom norms, reliability, and validity were established, the methods employed in computing indices of reliability and validity, and the specific criteria against which validity was checked. In the event that the necessary information is too lengthy to fit conveniently into the manual, references to the printed sources in which such information can be readily located should be given. The manual should, in other words, enable the test user to evaluate the test before choosing it for his specific purpose. It might be added that many test manuals still fall short of this goal. But some of the larger and more professionally oriented test publishers are giving increasing attention to the preparation of manuals that meet adequate scientific standards.



An enlightened public of test users provides the firmest assurance that such standards will be maintained and improved in the future.

A succinct but comprehensive guide for the evaluation of psychological tests is to be found in Standards for Educational and Psychological Tests (1974), published by the American Psychological Association. These standards represent a summary of recommended practices in test construction, based on the current state of knowledge in the field. They are concerned with the information about validity, reliability, norms, and other test characteristics that ought to be reported in the manual. In their latest revision, the Standards also provide a guide for the proper use of tests and for the correct interpretation and application of test results. Relevant portions of the Standards will be cited in the following chapters, in connection with the appropriate topics.

CHAPTER 2

Nature and Use of Psychological Tests

THE HISTORICAL introduction in Chapter 1 has already suggested some of the many uses of psychological tests, as well as the wide diversity of available tests. Although the general public may still associate psychological tests most closely with "IQ tests" and with tests designed to detect emotional disorders, these tests represent only a small proportion of the available types of instruments. The major categories of psychological tests will be discussed and illustrated in Parts 3, 4, and 5, which cover tests of general intellectual level, traditionally called intelligence tests; tests of separate abilities, including multiple aptitude batteries, tests of special aptitudes, and achievement tests; and personality tests, concerned with measures of emotional and motivational traits, interpersonal behavior, interests, attitudes, and other noncognitive characteristics.

In the face of such diversity in nature and purpose, what are the common differentiating characteristics of psychological tests? How do psychological tests differ from other methods of gathering information about individuals? The answer is to be found in certain fundamental features of both the construction and use of tests. It is with these features that the present chapter is concerned.

BEHAVIOR SAMPLE. A psychological test is essentially an objective and standardized measure of a sample of behavior. Psychological tests are like tests in any other science, insofar as observations are made on a small but carefully chosen sample of an individual's behavior. In this respect, the psychologist proceeds in much the same way as the chemist who tests a patient's blood or a community's water supply by analyzing one or more samples of it.


If the psychologist wishes to test the extent of a child's vocabulary, a clerk's ability to perform arithmetic computations, or a pilot's eye-hand coordination, he examines their performance with a representative set of words, arithmetic problems, or motor tests. Whether or not the test adequately covers the behavior under consideration obviously depends on the number and nature of items in the sample. For example, an arithmetic test consisting of only five problems, or one including only multiplication items, would be a poor measure of the individual's computational skill. A vocabulary test composed entirely of baseball terms would hardly provide a dependable estimate of a child's total range of vocabulary.

The diagnostic or predictive value of a psychological test depends on the degree to which it serves as an indicator of a relatively broad and significant area of behavior. Measurement of the behavior sample directly covered by the test is rarely, if ever, the goal of psychological testing. The child's knowledge of a particular list of 50 words is not, in itself, of great interest. Nor is the job applicant's performance on a specific set of 20 arithmetic problems of much importance. If, however, it can be demonstrated that there is a close correspondence between the child's knowledge of the word list and his total mastery of vocabulary, or between the applicant's score on the arithmetic problems and his computational performance on the job, then the tests are serving their purpose.

It should be noted in this connection that the test items need not resemble closely the behavior the test is to predict. It is only necessary that an empirical correspondence be demonstrated between the two. The degree of similarity between the test sample and the predicted behavior may vary widely. At one extreme, the test may coincide completely with a part of the behavior to be predicted. An example might be a foreign vocabulary test in which the students are examined on 20 of the 50 new words they have studied; another example is provided by the road test taken prior to obtaining a driver's license. A lesser degree of similarity is illustrated by many vocational aptitude tests administered prior to job training, in which there is only a moderate resemblance between the tasks performed on the job and those incorporated in the test. At the other extreme one finds projective personality tests such as the Rorschach inkblot test, in which an attempt is made to predict from the subject's associations to inkblots how he will react to other people, to emotionally toned stimuli, and to other complex, everyday-life situations. Despite their superficial differences, all these tests consist of samples of the individual's behavior. And each must prove its worth by an empirically demonstrated correspondence between the subject's performance on the test and in other situations.

Whether the term "diagnosis" or the term "prediction" is employed in this connection also represents a minor distinction. Prediction commonly connotes a temporal estimate, the individual's future performance on a job, for example, being forecast from his present test performance.

In a broader sense, however, even the diagnosis of present condition, such as mental retardation or emotional disorder, implies a prediction of what the individual will do in situations other than the present test. It is logically simpler to consider all tests as behavior samples from which predictions regarding other behavior can be made. Different types of tests can then be characterized as variants of this basic pattern.

Another point that should be considered at the outset pertains to the concept of capacity. It is entirely possible, for example, to devise a test for predicting how well an individual can learn French before he has even begun the study of French. Such a test would involve a sample of the types of behavior required to learn the new language, but would in itself presuppose no knowledge of French. It could then be said that this test measures the individual's "capacity" or "potentiality" for learning French. Such terms should, however, be used with caution in reference to psychological tests. Only in the sense that a present behavior sample can be used as an indicator of other, future behavior can we speak of a test measuring "capacity." No psychological test can do more than measure behavior. Whether such behavior can serve as an effective index of other behavior can be determined only by empirical try-out.

STANDARDIZATION. It will be recalled that in the initial definition a psychological test was described as a standardized measure. Standardization implies uniformity of procedure in administering and scoring the test. If the scores obtained by different individuals are to be comparable, testing conditions must obviously be the same for all. Such a requirement is only a special application of the need for controlled conditions in all scientific observations. In a test situation, the single independent variable is usually the individual being tested.

In order to secure uniformity of testing conditions, the test constructor provides detailed directions for administering each newly developed test. The formulation of such directions is a major part of the standardization of a new test.

Such standardization extends to the exact materials employed, time limits, oral instructions to subjects, preliminary demonstrations, ways of handling queries from subjects, and every other detail of the testing situation. Many other, more subtle factors may influence the subject's performance on certain tests. Thus, in giving instructions or presenting problems orally, consideration must be given to the rate of speaking, tone of voice, inflection, pauses, and facial expression. In a test involving the detection of absurdities, for example, the correct answer may be given away by smiling or pausing when the crucial word is read. Standardized testing procedure, from the examiner's point of view, will be discussed further in a later section of this chapter dealing with problems of test administration.



Another important step in the standardization of a test is the establishment of norms. Psychological tests have no predetermined standards of passing or failing; an individual's score is evaluated by comparing it with the scores obtained by others. As its name implies, a norm is the normal or average performance. Thus, if normal 8-year-old children complete 12 out of 50 problems correctly on a particular arithmetic reasoning test, then the 8-year-old norm on this test corresponds to a score of 12. The latter is known as the raw score on the test. It may be expressed as number of correct items, time required to complete a task, number of errors, or some other objective measure appropriate to the content of the test. Such a raw score is meaningless until evaluated in terms of a suitable set of norms.

In the process of standardizing a test, it is administered to a large, representative sample of the type of subjects for whom it is designed. This group, known as the standardization sample, serves to establish the norms. Such norms indicate not only the average performance but also the relative frequency of varying degrees of deviation above and below the average. It is thus possible to evaluate different degrees of superiority and inferiority. The specific ways in which such norms may be expressed will be considered in Chapter 4. All permit the designation of the individual's position with reference to the normative or standardization sample.
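
To make the use of norms concrete, the following minimal sketch (written in Python, a present-day computational notation; all scores are invented for illustration and represent no actual test) shows how a raw score might be referred to a standardization sample, first as a comparison with the norm and then as a deviation score:

    from statistics import mean, stdev

    # Invented raw scores for a small, hypothetical standardization sample.
    standardization_sample = [8, 10, 11, 12, 12, 13, 13, 14, 15, 17]

    norm = mean(standardization_sample)      # the average performance
    spread = stdev(standardization_sample)   # variability about the norm

    raw_score = 15                           # one examinee's raw score
    z = (raw_score - norm) / spread          # position relative to the sample
    print(f"norm = {norm:.1f}, z = {z:+.2f}")

A positive z indicates performance above the norm and a negative z performance below it; deviation measures of this kind underlie several of the norm types taken up in Chapter 4.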

It might also be noted that norms are established for personality tests in essentially the same way as for aptitude tests. The norm on a personality test is not necessarily the most desirable or "ideal" performance, any more than a perfect or errorless score is the norm on an aptitude test. On both types of tests, the norm corresponds to the performance of typical or average individuals. On dominance-submission tests, for example, the norm falls at an intermediate point representing the degree of dominance or submission manifested by the average individual. Similarly, in an emotional adjustment inventory, the norm does not ordinarily correspond to a complete absence of unfavorable or maladaptive responses, since a few such responses occur in the majority of "normal" individuals in the standardization sample. It is thus apparent that psychological tests, of whatever type, are based on empirically established norms.


OBJECTIVE MEASUREMENT OF DIFFICULTY. Reference to the definition of a psychological test with which this discussion opened will show that such a test was characterized as an objective as well as a standardized measure. In what specific ways are such tests objective? Some aspects of the objectivity of psychological tests have already been touched on in the discussion of standardization. Thus, the administration, scoring, and interpretation of scores are objective insofar as they are independent of the subjective judgment of the individual examiner. Any one individual should theoretically obtain the identical score on a test regardless of who happens to be his examiner. This is not entirely so, of course, since perfect standardization and objectivity have not been attained in practice. But at least such objectivity is the goal of test construction and has been achieved to a reasonably high degree in most tests.

There are other major ways in which psychological tests can be properly described as objective. The determination of the difficulty level of an item or of a whole test is based on objective, empirical procedures. When Binet and Simon prepared their original, 1905 scale for the measurement of intelligence, they arranged the 30 items of the scale in order of increasing difficulty. Such difficulty, it will be recalled, was determined by trying out the items on 50 normal and a few mentally retarded children. The items correctly solved by the largest number of children were, ipso facto, taken to be the easiest; those passed by relatively few children were regarded as more difficult items. By this procedure, an empirical order of difficulty was established. This early example typifies the objective measurement of difficulty level, which is now common practice in psychological test construction.

Not only the arrangement but also the selection of items for inclusion in a test can be determined by the proportion of subjects in the trial samples who pass each item. Thus, if there is a bunching of items at the easy or difficult end of the scale, some items can be discarded. Similarly, if items are sparse in certain portions of the difficulty range, new items can be added to fill the gaps. More technical aspects of item analysis will be considered in Chapter 8.
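
In present-day terms, this tryout procedure amounts to computing, for each item, the proportion of the trial sample passing it. A brief hypothetical sketch (Python; the pass-fail records below are invented):

    # Each row is one child's record on a five-item tryout (1 = pass, 0 = fail).
    responses = [
        [1, 1, 0, 1, 0],
        [1, 0, 0, 1, 0],
        [1, 1, 1, 1, 0],
        [1, 1, 0, 0, 0],
        [1, 1, 0, 1, 1],
    ]
    n = len(responses)

    # Difficulty of each item = proportion of the sample passing it.
    p_values = [sum(row[i] for row in responses) / n
                for i in range(len(responses[0]))]

    # Items arranged in order of increasing difficulty (decreasing p).
    order = sorted(range(len(p_values)), key=lambda i: p_values[i], reverse=True)
    print("p-values:", p_values)
    print("items, easiest first:", order)

Items with p near 1.00 are passed by nearly everyone and items with p near zero by almost no one; bunching at either end of the scale signals items that might be discarded, while sparse regions mark gaps to be filled.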

RELIABILITY. How good is this test? Does it really work? These questions could, and occasionally do, result in long hours of futile discussion. Subjective opinions, hunches, and personal biases may lead, on the one hand, to extravagant claims regarding what a particular test can accomplish and, on the other hand, to stubborn rejection. The only way questions such as these can be conclusively answered is by empirical trial. The objective evaluation of psychological tests involves primarily the determination of the reliability and the validity of the test in specified situations.

As used in psychometrics, the term reliability always means consistency. Test reliability is the consistency of scores obtained by the same persons when retested with the identical test or with an equivalent form of the test.


If a child receives an IQ of 110 on Monday and an IQ of 80 when retested on Friday, it is obvious that little or no confidence can be put in either score. Similarly, if in one set of 50 words an individual identifies 40 correctly, whereas in another, supposedly equivalent set he gets a score of only 20 right, then neither score can be taken as a dependable index of his verbal comprehension. To be sure, in both illustrations it is possible that only one of the two scores is in error, but this could be demonstrated only by further retests. From the given data, we can conclude only that both scores cannot be right. Whether one or neither is an adequate estimate of the individual's ability in vocabulary cannot be established without additional information.
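
The retest comparison just described is, in effect, a correlation problem. A minimal hypothetical sketch (Python; the paired scores are invented) expresses the consistency of two administrations as a Pearson correlation:

    from statistics import mean

    # Invented scores for eight persons tested twice with the same test.
    first  = [110, 95, 102, 88, 120, 97, 105, 93]
    second = [108, 99, 100, 90, 118, 95, 107, 96]

    mx, my = mean(first), mean(second)
    cov = sum((x - mx) * (y - my) for x, y in zip(first, second))
    vx = sum((x - mx) ** 2 for x in first)
    vy = sum((y - my) ** 2 for y in second)
    r = cov / (vx * vy) ** 0.5       # test-retest reliability coefficient
    print(f"test-retest r = {r:.2f}")

Coefficients near 1.00 indicate that the two administrations rank the persons consistently; scores as discrepant as those in the examples above would yield a much lower coefficient.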

Before a psychological test is released for general use, a thorough, objective check of its reliability should be carried out. The different types of test reliability, as well as methods of measuring each, will be considered in Chapter 5. Reliability can be checked with reference to temporal fluctuations, the particular selection of items or behavior sample constituting the test, the role of different examiners or scorers, and other aspects of the testing situation. It is essential to specify the type of reliability and the method employed to determine it, because the same test may vary in these different aspects. The number and nature of individuals on whom reliability was checked should likewise be reported. With such information, the test user can predict whether the test will be about equally reliable for the group with which he expects to use it, or whether it is likely to be more reliable or less reliable.
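One of these checks, reliability with reference to the selection of items, can be illustrated from a single administration: the test is split into halves, the half-scores are correlated, and the result is stepped up by the Spearman-Brown formula. The sketch below is hypothetical (Python; invented item responses), anticipating the fuller treatment in Chapter 5:

    from statistics import mean

    def pearson(xs, ys):
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        return cov / (vx * vy) ** 0.5

    # Each row: one person's scores (1 = correct) on a six-item test.
    items = [
        [1, 1, 0, 1, 0, 1],
        [1, 0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 1, 0, 0, 1, 0],
        [1, 1, 1, 1, 0, 1],
    ]
    odd  = [sum(row[0::2]) for row in items]   # items 1, 3, 5
    even = [sum(row[1::2]) for row in items]   # items 2, 4, 6
    r_half = pearson(odd, even)
    r_full = 2 * r_half / (1 + r_half)         # Spearman-Brown correction
    print(f"half-test r = {r_half:.2f}, full-test estimate = {r_full:.2f}")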

VALIDITY. Undoubtedly the most important question to be asked about any psychological test concerns its validity, i.e., the degree to which the test actually measures what it purports to measure. Validity provides a direct check on how well the test fulfills its function. The determination of validity usually requires independent, external criteria of whatever the test is designed to measure. For example, if a medical aptitude test is to be used in selecting promising applicants for medical school, ultimate success in medical school would be a criterion. In the process of validating such a test, it would be administered to a large group of students at the time of their admission to medical school. Some measure of performance in medical school would eventually be obtained for each student on the basis of grades, ratings by instructors, success or failure in completing training, and the like. Such a composite measure constitutes the criterion with which each student's initial test score is to be correlated. A high correlation, or validity coefficient, would signify that those individuals who scored high on the test had been relatively successful in medical school, whereas those scoring low on the test had done poorly in medical school. A low correlation would indicate little correspondence between test score and criterion measure and hence poor validity for the test.

The validity coefficient enables us to determine how closely the criterion performance could have been predicted from the test scores.
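
In computational terms, the validity coefficient is simply the correlation between test scores and criterion measures. A minimal hypothetical sketch (Python; both the test scores and the criterion composite below are invented):

    from statistics import mean

    # Invented admission-test scores and a later criterion composite
    # (e.g., an index of performance in training) for eight persons.
    test_scores = [52, 61, 47, 70, 58, 66, 43, 75]
    criterion   = [2.8, 3.1, 2.5, 3.6, 2.9, 3.4, 2.4, 3.7]

    mx, my = mean(test_scores), mean(criterion)
    cov = sum((x - mx) * (y - my) for x, y in zip(test_scores, criterion))
    vx = sum((x - mx) ** 2 for x in test_scores)
    vy = sum((y - my) ** 2 for y in criterion)
    validity = cov / (vx * vy) ** 0.5   # correlation of test with criterion
    print(f"validity coefficient = {validity:.2f}")

The machinery is the same Pearson correlation used for reliability; what changes is the meaning of the second variable, an independent criterion measure rather than a retest.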

In a similar manner, tests designed for other purposes can be validated against appropriate criteria. A vocational aptitude test, for example, can be validated against on-the-job success of a trial group of new employees. A pilot aptitude battery can be validated against achievement in flight training. Tests designed for broader and more varied uses are validated against a number of criteria, and their validity can be established only by the gradual accumulation of data from many different kinds of investigations.

The reader may have noticed an apparent paradox in the concept of test validity. If it is necessary to follow up the subjects or in other ways to obtain independent measures of what the test is trying to predict, why not dispense with the test? The answer to this riddle is to be found in the distinction between the validation group on the one hand and the groups on which the test will eventually be employed for operational purposes on the other. Before the test is ready for use, its validity must be established on a representative sample of subjects. The scores of these persons are not themselves employed for operational purposes but serve only in the process of testing the test. If the test proves valid by this method, it can then be used on other samples in the absence of criterion measures.

It might still be argued that we would need only to wait for the criterion measure to mature, to become available, on any group in order to obtain the information that the test is trying to predict. But such a procedure would be so wasteful of time and energy as to be prohibitive in most instances. Thus, we could determine which applicants will succeed on a job or which students will satisfactorily complete college by admitting all who apply and waiting for subsequent developments! It is the very wastefulness of this procedure, and its deleterious emotional impact on individuals, that tests are designed to minimize. By means of tests, the person's present level of prerequisite skills, knowledge, and other relevant characteristics can be assessed with a determinable margin of error. The more valid and reliable the test, the smaller will be this margin of error.
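
The phrase "determinable margin of error" can be given numerical content through the standard error of estimate, the criterion standard deviation multiplied by the square root of (1 - r squared), which shrinks as the validity coefficient r rises. A brief hypothetical sketch (Python; the criterion standard deviation is an invented figure):

    # Margin of error in predicting a criterion from test scores:
    # standard error of estimate = SD(criterion) * sqrt(1 - r**2).
    sd_criterion = 0.45
    for r in (0.0, 0.4, 0.6, 0.8):
        se = sd_criterion * (1 - r ** 2) ** 0.5
        print(f"validity r = {r:.1f} -> standard error of estimate = {se:.2f}")

With zero validity the margin of error equals the full spread of the criterion; as validity rises, prediction narrows accordingly.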

The special problems encountered in determining the validity of different types of tests, as well as the specific criteria and statistical procedures employed, will be discussed in Chapters 6 and 7. One further point, however, should be considered at this time. Validity tells us more than the degree to which the test is fulfilling its function. It actually tells us what the test is measuring. By studying the validation data, we can objectively determine what the test is measuring. It would thus be more accurate to define validity as the extent to which we know what the test measures. The interpretation of test scores would undoubtedly be clearer and less ambiguous if tests were regularly named in terms of the criterion through which they had been validated.



A tendency in this direction may be recognized in such test labels as "scholastic aptitude test" and "personnel classification test" in place of the vague title "intelligence test."

REASONS FOR CONTROLLING THE USE OF PSYCHOLOGICAL TESTS

"May I have a Stanford-Binet blank? My nephew has to take it next week for admission to School X and I'd like to give him some practice so he can pass."

"To improve the reading program in our school, we need a culture-free IQ test that measures each child's innate potential."

"Last night I answered the questions in an intelligence test published in a magazine and I got an IQ of 80. I think psychological tests are silly."

"My roommate is studying psych. She gave me a personality test and I came out neurotic. I've been too upset to go to class ever since."

"Last year you gave a new personality test to our employees for research purposes. We would now like to have the scores for their personnel folders."

The above remarks are not imaginary. Each is based on a real incident, and the list could easily be extended by any psychologist. Such remarks illustrate potential misuses or misinterpretations of psychological tests in such ways as to render the tests worthless or to hurt the individual. Like any scientific instrument or precision tool, psychological tests must be properly used to be effective. In the hands of either the unscrupulous or the well-meaning but uninformed user, such tests can cause serious damage.

There are two principal reasons for controlling the use of psychological tests: (a) to prevent general familiarity with test content, which would invalidate the test, and (b) to ensure that the test is used by a qualified examiner. Obviously, if an individual were to memorize the correct responses on a test of color blindness, such a test would no longer be a measure of color vision for him. Under these conditions, the test would be completely invalidated. Test content clearly has to be restricted in order to forestall deliberate efforts to fake scores.

In other cases, however, the effect of familiarity may be less obvious, or the test may be invalidated in good faith by misinformed persons. A schoolteacher, for example, may give her class special practice in problems closely resembling those on an intelligence test, "so that the pupils will be well prepared to take the test." Such an attitude is simply a carry-over from the usual procedure of preparing for a school examination. When applied to an intelligence test, however, it is likely that such specific training or coaching will raise the scores on the test without appreciably affecting the broader area of behavior the test tries to sample. Under such conditions, the validity of the test as a predictive instrument is reduced.

The need for a qualified examiner is evident in each of the three major aspects of the testing situation: selection of the test, administration and scoring, and interpretation of scores. Tests cannot be chosen like lawn mowers, from a mail-order catalogue. They cannot be evaluated by name, author, or other easy marks of identification. To be sure, it requires no psychological training to consider such factors as cost, bulkiness and ease of transporting test materials, testing time required, and ease and rapidity of scoring. Information on these practical points can usually be obtained from a test catalogue and should be taken into account in planning a testing program. For the test to serve its function, however, an evaluation of its technical merits in terms of such characteristics as validity, reliability, difficulty level, and norms is essential. Only in such a way can the test user determine the appropriateness of any test for his particular purpose and its suitability for the type of persons with whom he plans to use it.

The introductory discussion of test standardization earlier in this chapter has already suggested the importance of a trained examiner. An adequate realization of the need to follow instructions precisely, as well as a thorough familiarity with the standard instructions, is required if the test scores obtained by different examiners are to be comparable or if any one individual's score is to be evaluated in terms of the published norms. Careful control of testing conditions is also essential. Similarly, incorrect or inaccurate scoring may render the test score worthless. In the absence of proper checking procedures, scoring errors are far more likely to occur than is generally realized.

The proper interpretation of test scores requires a thorough understanding of the test, the individual, and the testing conditions. What is being measured can be objectively determined only by reference to the specific procedures in terms of which the particular test was validated. Other information, pertaining to reliability, nature of the group on which norms were established, and the like, is likewise relevant. Some background data regarding the individual being tested are essential in interpreting any test score. The same score may be obtained by different persons for very different reasons. The conclusions to be drawn from such scores would therefore be quite dissimilar. Finally, some consideration must also be given to special factors that may have influenced a particular score, such as unusual testing conditions, temporary emotional or physical state of the subject, and extent of the subject's previous experience with tests.


The basic rationale of testing involves generalization from the behavior sample observed in the testing situation to behavior manifested in other, nontest situations. A test score should help us to predict how the client will feel and act outside the clinic, how the student will achieve in college courses, and how the applicant will perform on the job. Any influences that are specific to the test situation constitute error variance and reduce test validity. It is therefore important to identify any test-related influences that may limit or impair the generalizability of test results.

A whole volume could easily be devoted to a discussion of desirable procedures of test administration. But such a survey falls outside the scope of the present book. Moreover, it is more practicable to acquire such techniques within specific settings, because no one person would normally be concerned with all forms of testing, from the examination of infants to the clinical testing of psychotic patients or the administration of a mass testing program for military personnel. The present discussion will therefore deal principally with the common rationale of test administration rather than with specific questions of implementation. For detailed suggestions regarding testing procedure, see Palmer (1970), Sattler (1974), and Terman and Merrill (1960) for individual testing, and Clemans (1971) for group testing.

ADVANCE PREPARATION OF EXAMINERS. The most important requirement for good testing procedure is advance preparation. In testing there can be no emergencies. Special efforts must therefore be made to foresee and forestall emergencies. Only in this way can uniformity of procedure be assured.

Advance preparation for the testing session takes many forms. Memorizing the exact verbal instructions is essential in most individual testing. Even in a group test in which the instructions are read to the subjects, some previous familiarity with the statements to be read prevents misreading and hesitation and permits a more natural, informal manner during test administration. The preparation of test materials is another important preliminary step. In individual testing, and especially in the administration of performance tests, such preparation involves the actual layout of the necessary materials to facilitate subsequent use with a minimum of search or fumbling. Materials should generally be placed on a table near the testing table so that they are within easy reach of the examiner but do not distract the subject. When apparatus is employed, frequent periodic checking and calibration may be necessary.


In group testing, all test blanks, answer sheets, special pencils, or other materials needed should be carefully counted, checked, and arranged in advance of the testing day.

Thorough familiarity with the specific testing procedure is another important prerequisite in both individual and group testing. For individual testing, supervised training in the administration of the particular test is usually essential. Depending upon the nature of the test and the type of subjects to be examined, such training may require from a few demonstration and practice sessions to over a year of instruction. For group testing, and especially in large-scale projects, such preparation may include advance briefing of examiners and proctors, so that each is fully informed about the functions he is to perform. In general, the examiner reads the instructions, takes care of timing, and is in charge of the group in any one testing room. The proctors hand out and collect test materials, make certain that subjects are following instructions, answer individual questions of subjects within the limitations specified in the manual, and prevent cheating.


TESTING CONDITIONS. Standardized procedure applies not only to verbal instructions, timing, materials, and other aspects of the tests themselves but also to the testing environment. Some attention should be given to the selection of a suitable testing room. This room should be free from undue noise and distraction and should provide adequate lighting, ventilation, seating facilities, and working space for the subjects. Special steps should also be taken to prevent interruptions during the test. Posting a sign on the door to indicate that testing is in progress is effective, provided all personnel have learned that such a sign means no admittance under any circumstances. In the testing of large groups, locking the doors or posting an assistant outside each door may be necessary to prevent the entrance of late-comers.

It is important to realize the extent to which testing conditions may influence scores. Even apparently minor aspects of the testing situation may appreciably alter performance. Such a factor as the use of desks or of chairs with desk arms, for example, proved to be significant in a group testing project with high school students, the groups using desks tending to obtain higher scores (Kelley, 1943; Traxler & Hilkert, 1942). There is also evidence to show that the answer sheet employed may affect test scores (Bell, Hoff, & Hoyt, 1963). With the establishment of independent test-scoring and data-processing agencies that provide their own machine-scorable answer sheets, examiners sometimes administer group tests with answer sheets other than those used in the standardization sample. In the absence of empirical verification, the equivalence of these answer sheets cannot be assumed.



The Differential Aptitude Tests, for example, may be administered with any of five different answer sheets. On the Clerical Speed and Accuracy Test of this battery, separate norms are provided for three of the five answer sheets, because they were found to yield substantially different scores than those obtained with the answer sheets used by the standardization sample. In testing children below the fifth grade, the use of any separate answer sheet may significantly lower test scores (Metropolitan Achievement Test Special Report, 1975). At these grade levels, having the child mark the answers in the test booklet itself is generally preferable.

Many other, more subtle testing conditions have been shown to affect performance on ability as well as personality tests. Whether the examiner is a stranger or someone familiar to the subjects may make a significant difference in test scores (Sacks, 1952; Tsudzuki, Hata, & Kuze, 1957). In another study, the general manner and behavior of the examiner, as illustrated by smiling, nodding, and making such comments as "good" or "fine," were shown to have a decided effect on test results (Wickes, 1956). In a projective test requiring the subject to write stories to fit given pictures, the presence of the examiner in the room tended to inhibit the inclusion of strongly emotional content in the stories (Bernstein, 1956). In the administration of a typing test, job applicants typed at a significantly faster rate when tested alone than when tested in groups of two or more (Kirchner, 1966).

Examples could readily be multiplied. The implications are threefold. First, follow standardized procedures to the minutest detail. It is the responsibility of the test author and publisher to describe such procedures fully and clearly in the test manual. Second, record any unusual testing conditions, however minor. Third, take testing conditions into account when interpreting test results. In the intensive assessment of a person through individual testing, an experienced examiner may occasionally depart from the standardized test procedure in order to elicit additional information for special reasons. When he does so, he can no longer interpret the subject's responses in terms of the test norms. Under these circumstances, the test stimuli are used only for qualitative exploration; and the responses should be treated in the same way as any other informal behavioral observations or interview data.

RAPPORT. In psychometrics, the term "rapport" refers to the examiner's efforts to arouse the subject's interest in the test, elicit his cooperation, and ensure that he follows the standard test instructions. In ability tests, the instructions call for careful concentration on the given tasks and for putting forth one's best efforts to perform well; in personality inventories, they call for frank and honest responses to questions about one's usual behavior; in certain projective tests, they call for full reporting of associations evoked by the stimuli, without any censoring or editing of content. Still other kinds of tests may require other approaches. But in all instances, the examiner endeavors to motivate the subject to follow the instructions as fully and conscientiously as he can.

The training of examiners covers techniques for the establishment of rapport as well as those more directly related to test administration. In establishing rapport, as in other testing procedures, uniformity of conditions is essential for comparability of results. If a child is given a coveted prize whenever he solves a test problem correctly, his performance cannot be directly compared with the norms or with that of other children who are motivated only with the standard verbal encouragement or praise. Any deviation from standard motivating conditions for a particular test should be noted and taken into account in interpreting performance.

Although rapport can be more fully established in individual testing, steps can also be taken in group testing to motivate the subjects and relieve their anxiety. Specific techniques for establishing rapport vary with the nature of the test and with the age and other characteristics of the subjects. In testing preschool children, special factors to be considered include shyness with strangers, distractibility, and negativism. A friendly, cheerful, and relaxed manner on the part of the examiner helps to reassure the child. The shy, timid child needs more preliminary time to become familiar with his surroundings. For this reason it is better for the examiner not to be too demonstrative at the outset, but rather to wait until the child is ready to make the first contact. Test periods should be brief, and the tasks should be varied and intrinsically interesting to the child. The testing should be presented to the child as a game and his curiosity aroused before each new task is introduced. A certain flexibility of procedure is necessary at this age level because of possible refusals, loss of interest, and other manifestations of negativism.

Children in the first two or three grades of elementary school present many of the same testing problems as the preschool child. The game approach is still the most effective way of arousing their interest in the test. The older schoolchild can usually be motivated through an appeal to his competitive spirit and his desire to do well on tests. When testing children from educationally disadvantaged backgrounds or from different cultures, however, the examiner cannot assume they will be motivated to excel on academic tasks to the same extent as children in the standardization sample. This problem and others pertaining to the testing of persons with dissimilar experiential backgrounds will be considered further in Chapters 3, 7, and 12.

Special motivational problems may be encountered in testing emotionally disturbed persons, prisoners, or juvenile delinquents.


Especially when examined in an institutional setting, such persons are likely to manifest a number of unfavorable attitudes, such as suspicion, insecurity, fear, or cynical indifference. Abnormal conditions in their past experiences are also likely to influence their test performance adversely. As a result of early failures and frustrations in school, for example, they may have developed feelings of hostility and inferiority toward academic tasks, which the tests resemble. The experienced examiner makes special efforts to establish rapport under these conditions. In any event, he must be sensitive to these special difficulties and take them into account in interpreting and explaining test performance.

In testing any school-age child or adult, one should bear in mind that every test presents an implied threat to the individual's prestige. Some reassurance should therefore be given at the outset. It is helpful to explain, for example, that no one is expected to finish or to get all the items correct. The individual might otherwise experience a mounting sense of failure as he advances to the more difficult items or finds that he is unable to finish any subtest within the time allowed.

It is also desirable to eliminate the element of surprise from the test situation as far as possible, because the unexpected and unknown are likely to produce anxiety. Many group tests provide a preliminary explanatory statement that is read to the group by the examiner. An even better procedure is to announce the tests a few days in advance and to give each subject a printed booklet that explains the purpose and nature of the tests, offers general suggestions on how to take tests, and contains a few sample items. Such explanatory booklets are regularly available to participants in large-scale testing programs such as those conducted by the College Entrance Examination Board (1974a, 1974b). The United States Employment Service has likewise developed a booklet on how to take tests, as well as a more extensive pretesting orientation technique for use with culturally disadvantaged applicants unfamiliar with tests.

More general orientation booklets are also available, such as Meeting the Test (Anderson, Katz, & Shimberg, 1965). A tape recording and two booklets are combined in Test Orientation Procedure (TOP), designed specifically for job applicants with little prior testing experience (Bennett & Doppelt, 1967). The first booklet, used together with the tape, provides general information on how to take tests; the second contains practice tests. In the absence of a tape recorder, the examiner may read the instructions from a printed script.

Adult testing presents some additional problems. Unlike the schoolchild, the adult is not so likely to work hard at a task merely because it is assigned to him. It therefore becomes more important to "sell" the purpose of the tests to the adult, although high school and college students also respond to such an appeal. Cooperation of the examinee can usually be secured by convincing him that it is in his own interests to obtain a valid score, i.e., a score correctly indicating what he can do rather than overestimating or underestimating his abilities. Most persons will understand that an incorrect decision, which might result from invalid test scores, would mean subsequent failure, loss of time, and frustration for them. This approach can serve not only to motivate the individual to try his best on ability tests but also to reduce faking and encourage frank reporting on personality inventories, because the examinee realizes that he himself would otherwise be the loser. It is certainly not in the best interests of the individual to be admitted to a course of study for which he is not qualified or assigned to a job he cannot perform or that he would find uncongenial.

Many of the practices designed to enhance rapport serve also to reduce test anxiety. Procedures tending to dispel surprise and strangeness from the testing situation and to reassure and encourage the subject should certainly help to lower anxiety. The examiner's own manner and a well-organized, smoothly running testing operation will contribute toward the same goal. Individual differences in test anxiety have been studied with both schoolchildren and college students (Gaudry & Spielberger, 1974; Spielberger, 1972). Much of this research was initiated by Sarason and his associates at Yale (Sarason, Davidson, Lighthall, Waite, & Ruebush, 1960). The first step was to construct a questionnaire to assess the individual's test-taking attitudes. The children's form, for example, contains items such as the following:

Do you worry a lot before taking a test?

When the teacher says she is going to find out how much you have learned, does your heart begin to beat faster?

While you are taking a test, do you usually think you are not doing well?

Of primary interest is the finding that both school achievement and intelligence test scores yielded significant negative correlations with test anxiety. Similar correlations have been found among college students (I. G. Sarason, 1961). Longitudinal studies likewise revealed an inverse relation between changes in anxiety level and changes in intelligence or achievement test performance (Hill & Sarason, 1966; Sarason, Hill, & Zimbardo, 1964).

Such findings, of course, do not indicate the direction of causal relationships. It is possible that children develop test anxiety because they perform poorly on tests and have thus experienced failure and frustration in previous test situations.



In support of this interpretation is the finding that within subgroups of high scorers on intelligence tests, the negative correlation between anxiety level and test performance disappears (Denny, 1966; Feldhusen & Klausmeier, 1962). On the other hand, there is evidence suggesting that at least some of the relationship results from the deleterious effects of anxiety on test performance. In one study (Waite, Sarason, Lighthall, & Davidson, 1958), high-anxious and low-anxious children equated in intelligence test scores were given repeated trials in a learning task. Although initially equal in the learning test, the low-anxious group improved significantly more than the high-anxious.

Several investigators have compared test performance under conditions designed to evoke "anxious" and "relaxed" states. Mandler and Sarason (1952), for example, found that ego-involving instructions, such as telling subjects that everyone is expected to finish in the time allotted, had a beneficial effect on the performance of low-anxious subjects, but a deleterious effect on that of high-anxious subjects. Other studies have likewise found an interaction between testing conditions and such individual characteristics as anxiety level and achievement motivation (Lawrence, 1962; Paul & Eriksen, 1964). It thus appears likely that the relation between anxiety and test performance is nonlinear, a slight amount of anxiety being beneficial, while a large amount is detrimental. Individuals who are customarily low-anxious benefit from test conditions that arouse some anxiety, while those who are customarily high-anxious perform better under more relaxed conditions.

et:>, ",hi e t lose who are customarilv hi<rh-anxiol1s )erform betterIi ' firmore re axe can itions.it is undoubtedl\' true that a ~hronicalh- high amidv len'l will c:I;erJ a

deb'imental effect 'on school learning and' int~lIectual dewlopllleltf,_",~~ch"aneffect, howe\'er, should be distinguished horn the tesr:tiinit1!,r- ~'ectswith which this discussion is concerned. To what extent do~s test auxier.·

,make the individual's test performance unrepresentative of his cust~mar~';'performance level in nontest situations? Because of the competitive pre~-sureexperienced by college-bound high school seniors in ,,\merica today,it has been argued that performance on c'OlIege ~dmissif>il tests may beunduly affected by test anxiety. In a thorough ana::4ontrol1ed investi.gationof this question, French (1962) compar~d Jhf'p,erformancc of highschool students on a test given as part of the fe-gular administration ofthe SAT with performance on a parallel form of the test administered at

,a different time under "relaxed" conditions, The instructions on the latter, occasion specified that the test was given for 'research purposes only and

scores would not be sent to any college. The results showed that per-formance was no poorer during the standard administration than duringthe relaxed administration. Moreover, the concurrent validitv of the testscoresagainst high school course grades did not differ signifi~antly underthe two conditions.

Comprehensive surveys of the effects of examiner and situational variables on test scores have been prepared by S. B. Sarason (1954), Masling (1960), Moriarty (1961, 1966), Sattler and Theye (1967), Palmer (1970), and Sattler (1970, 1974). Although some effects have been demonstrated with objective group tests, most of the data have been obtained with either projective techniques or individual intelligence tests. These extraneous factors are more likely to operate with unstructured and ambiguous stimuli, as well as with difficult and novel tasks, than with clearly defined and well-learned functions. In general, children are more susceptible to examiner and situational influences than are adults; in the examination of preschool children, the role of the examiner is especially crucial. Emotionally disturbed and insecure persons of any age are also more likely to be affected by such conditions than are well-adjusted persons.

There is considerable evidence that test results may vary systematically as a function of the examiner (E. Cohen, 1965; Masling, 1960). These differences may be related to personal characteristics of the examiner, such as his age, sex, race, professional or socioeconomic status, training and experience, personality characteristics, and appearance. Several studies of these examiner variables, however, have yielded misleading or inconclusive results because the experimental designs failed to control or isolate the influence of different examiner or subject characteristics. Hence the effects of two or more variables may be confounded.

The examiner's behavior before and during test administration has also been shown to affect test results. For example, controlled investigations have yielded significant differences in intelligence test performance as a result of a "warm" versus a "cold" interpersonal relation between examiner and examinees, or a rigid and aloof versus a natural manner on the part of the examiner (Exner, 1966; Masling, 1959). Moreover, there may be significant interactions between examiner and examinee characteristics, in the sense that the same examiner characteristic or testing manner may have a different effect on different examinees as a function of the examinee's own personality characteristics. Similar interactions may occur with task variables, such as the nature of the test, the purpose of the testing, and the instructions given to the subjects. Dyer (1973) adds even more variables to this list, calling attention to the possible influence of the test givers' and the test takers' diverse perceptions of the functions and goals of testing.

Still another way in which an examiner may inadvertently affect the examinee's responses is through his own expectations. This is simply a special instance of the self-fulfilling prophecy (Rosenthal, 1966; Rosenthal & Rosnow, 1969). An experiment conducted with the Rorschach will illustrate this effect (Masling, 1965). The examiners were 14 graduate student volunteers, 7 of whom were told, among other things, that experienced examiners elicit more human than animal responses from the subjects, while the other 7 were told that experienced examiners elicit more animal than human responses. Under these conditions, the two groups of examiners obtained significantly different ratios of animal to human responses from their subjects. These differences occurred despite the fact that neither examiners nor subjects reported awareness of any influence attempt. Moreover, tape recordings of all testing sessions revealed no evidence of verbal influence on the part of any examiner. The examiners' expectations apparently operated through subtle postural and facial cues to which the subjects responded.

Apart from the examiner, other aspects of the testing situation may significantly affect test performance. Military recruits, for example, are often examined shortly after induction, during a period of intense readjustment to an unfamiliar and stressful situation. In one investigation designed to test the effect of acclimatization to such a situation on test performance, 2,724 recruits were given the Navy Classification Battery during their ninth day at the Naval Training Center (Gordon & Alf, 1960). When their scores were compared with those obtained by 2,180 recruits tested at the conventional time, during their third day, the 9-day group scored significantly higher on all subtests of the battery.

The examinees' activities immediately preceding the test may also affect their performance, especially when such activities produce emotional disturbance, fatigue, or other handicapping conditions. In an investigation with third- and fourth-grade schoolchildren, there was some evidence to suggest that IQ on the Draw-a-Man Test was influenced by the children's preceding classroom activity (McCarthy, 1944). On one occasion, the class had been engaged in writing a composition on "The Best Thing That Ever Happened to Me"; on the second occasion, they had again been writing, but this time on "The Worst Thing That Ever Happened to Me." The IQ's on the second test, following what may have been an emotionally depressing experience, averaged 4 or 5 points lower than on the first test. These findings were corroborated in a later investigation specifically designed to determine the effect of immediately preceding experience on the Draw-a-Man Test (Reichenberg-Hackett, 1953). In this study, children who had had a gratifying experience involving the successful solution of an interesting puzzle, followed by a reward of toys and candy, showed more improvement in their test scores than those who had undergone neutral or less gratifying experiences. Similar results were obtained by W. E. Davis (1969a, 1969b) with college students. Performance on an arithmetic reasoning test was significantly poorer when preceded by a failure experience on a verbal comprehension test than it was in a control group given no preceding test and in one that had taken a standard verbal comprehension test under ordinary conditions.

Several studies have been concerned with the effects of feedback regarding test scores on the individual's subsequent test performance. In a particularly well-designed investigation with seventh-grade students, Bridgeman (1974) found that "success" feedback was followed by significantly higher performance on a similar test than was "failure" feedback in subjects who had actually performed equally well to begin with. This type of motivational feedback may operate largely through the goals the subjects set for themselves in subsequent performance and may thus represent another example of the self-fulfilling prophecy. Such general motivational feedback, however, should not be confused with corrective feedback, whereby the individual is informed about the specific items he missed and given remedial instruction; under these conditions, feedback is much more likely to improve the performance of initially low-scoring persons.

The examples cited in this section illustrate the wide diversity of test-related factors that may affect test scores. In the majority of well-administered testing programs, the influence of these factors is negligible for practical purposes. Nevertheless, the skilled examiner is constantly on guard to detect the possible operation of such factors and to minimize their influence. When circumstances do not permit the control of these conditions, the conclusions drawn from test performance should be qualified.

In evaluating the effect of coaching or practice on test scores, a fundamental question is whether the improvement is limited to the specific items included in the test or whether it extends to the broader area of behavior that the test is designed to predict. The answer to this question represents the difference between coaching and education. Obviously, any educational experience the individual undergoes, either formal or informal, in or out of school, should be reflected in his performance on tests sampling the relevant aspects of behavior. Such broad influences will in no way invalidate the test, since the test score presents an accurate picture of the individual's standing in the abilities under consideration. The difference is, of course, one of degree. Influences cannot be classified as either narrow or broad, but obviously vary widely in scope, from those affecting only a single administration of a single test, through those affecting performance on all items of a certain type, to those influencing the individual's performance in the large majority of his activities. From the standpoint of effective testing, however, a workable distinction can be made. Thus, it can be stated that a test score is invalidated only when a particular experience raises it without appreciably affecting the criterion behavior that the test is designed to predict.

:";{CHIKC.'the effects of coaching on test scores have been widely in-gated. Many of these studies were conducted by British psycholo-,with special reference to the effects of practice and coaching on thebrinerly used in assigning ll-year-old children to different types of'Ilrv;,schools (Yates et aI., 195:3-1954). As might be expected, theot ~~ovement depends on the ability and earlier educational;

'ences of'the examinees, the nature of the tests, and the amount and'of coaching provided. Individuals with deficient educational back-

unds are more likely to benefit from special coaching than are those'ihave had superior educational opportunities and are already pre-, to do well on the tests. It is obvious, too, that the closer the re-

,blance between test content and coaching material, the greater willthe improvement in test scores. On the other hand, the more closelytruction is restricted to specific test content, the less likely is improve-:nt to extend to criterion performance."n America, the College Entrance Examination Board has been con-hed about the spread of ill-advised commercial coaching courses forlege applicants. To clarify the issues, the College Board conducted

veral well-controlled experiments to determine the effects of coaching'its Scholastic Aptitude Test and surveyed the results of similar studiesother, independent investigators (Angoff, 19711>;Conege Entrance

'amination Board, 1968). These studies covered a variety of coachingethods and included students in both public and private high schools;e investigation was conducted with black students in 15 urban and

'"ralhigh schools in Tennessee. The conclusion from all"these studies is':at intensive drill on items similar to those on the SAT is unlikelY to'oduce appreciably greater gains than occur wrJ/i students are rete~ted'th the SAT after a year of regular high schot;il instruction.On the basis of such research, the Trustees of the College Board issued

.formal statement about coaching, in which the fonowing points wereade, among others (College Entrance Examination Board, 1968,p.8-9):

The results of the coaching studies which have thus far been completed indicate that average increases of less than 10 points on a 600 point scale can be expected. It is not reasonable to believe that admissions decisions can be affected by such small changes in scores. This is especially true since the tests are merely supplementary to the school record and other evidence taken into account by admissions officers. . . . As the College Board uses the term, aptitude is not something fixed and impervious to influence by the way the child lives and is taught. Rather, this particular Scholastic Aptitude Test is a measure of abilities that seem to grow slowly and stubbornly, profoundly influenced by conditions at home and at school over the years, but not responding to hasty attempts to relive a young lifetime.

It should also be noted that in its test construction procedures, the College Board investigates the susceptibility of new item types to coaching (Angoff, 1971b; Pike & Evans, 1972). Item types on which performance can be appreciably raised by short-term drill or instruction of a narrowly limited nature are not included in the operational forms of the tests.

PRACTICE. The effects of sheer repetition, or practice, on test performance are similar to the effects of coaching, but usually less pronounced. It should be noted that practice, as well as coaching, may alter the nature of the test, since the subjects may employ different work methods in solving the same problems. Moreover, certain types of items may be much easier when encountered a second time. An example is provided by problems requiring insightful solutions which, once attained, can be applied directly in solving the same or similar problems in a retest. Scores on such tests, whether derived from a repetition of the identical test or from a parallel form, should therefore be carefully scrutinized.

A number of studies have been concerned with the effects of the identical repetition of intelligence tests over periods ranging from a few days to several years (see Quereshi, 1968). Both adults and children, and both normal and mentally retarded persons, have been employed. The studies have covered individual as well as group tests. All agree in showing significant mean gains on retests. Nor is improvement necessarily limited to the initial repetitions. Whether gains persist or level off in successive administrations seems to depend on the difficulty of the test and the ability level of the subjects. The implications of such findings are illustrated by the results obtained in annual retests of 3,500 schoolchildren with a variety of intelligence tests (Dearborn & Rothney, 1941). When the same test was readministered in successive years, the median IQ of the group rose from 102 to 113, but it dropped to 104 when another test was substituted. Because of the retest gains, the meaning of an IQ obtained on an initial and a later trial proved to be quite different. For example, an IQ of 100 fell approximately at the average of the distribution on the initial trial, but in the lowest quarter on a retest. Such IQ's, though numerically identical and derived from the same test, might thus signify normal ability in the one instance and inferior ability in the other.
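The size of this shift is easy to verify with a short computation. The sketch below is an illustrative addition, not part of the original findings: it assumes roughly normally distributed IQ's with a standard deviation of 16 (a conventional figure for ratio IQ's of that period), and takes only the means of 102 and 113 from the Dearborn and Rothney results cited above.

    import math

    def percentile_rank(score, mean, sd):
        # Percentile rank of a score in a normal distribution with the
        # given mean and standard deviation, via the normal cumulative
        # distribution function expressed through the error function.
        z = (score - mean) / sd
        return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

    print(percentile_rank(100, mean=102, sd=16))  # ~45th percentile: near average
    print(percentile_rank(100, mean=113, sd=16))  # ~21st percentile: lowest quarter

Under these assumptions, the same IQ of 100 drops from roughly the 45th to the 21st percentile once the group mean has risen through retest gains, which is precisely the reversal the text describes.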

Gains in score are also found on retesting with parallel forms of the same test, although such gains tend in general to be smaller. Significant mean gains have been reported when alternate forms of a test were administered in immediate succession or after intervals ranging from one day to three years (Angoff, 1971b; Droege, 1966; Peel, 1951, 1952). Similar results have been obtained with normal and intellectually gifted schoolchildren, high school and college students, and employee samples. Data on the distribution of gains to be expected on a retest with a parallel form should be provided in test manuals, and allowance for such gains should be made when interpreting test scores.

TEST SOPHISTICATION. The general problem of test sophistication should also be considered in this connection. The individual who has had extensive prior experience in taking psychological tests enjoys a certain advantage in test performance over one who is taking his first test (Heim & Wallace, 1949-1950; Millman, Bishop, & Ebel, 1965; Rodger, 1936). Part of this advantage stems from having overcome an initial feeling of strangeness, as well as from having developed more self-confidence and better test-taking attitudes. Part is the result of a certain amount of overlap in the type of content and functions covered by many tests. Specific familiarity with common item types and practice in the use of objective answer sheets may also improve performance slightly. It is particularly important to take test sophistication into account when comparing the scores obtained by children from different types of schools, where the extent of test-taking experience may have varied widely. Short orientation and practice sessions, as described earlier in this chapter, can be quite effective in equalizing test sophistication (Wahlstrom & Boersma, 1968).

CHAPTER 3

Social and Ethical Implications of Testing

IN ORDER to prevent the misuse of psychological tests, it has become necessary to erect a number of safeguards around both the tests themselves and the test scores. The distribution and use of psychological tests constitutes a major area in Ethical Standards of Psychologists, the code of professional ethics officially adopted by the American Psychological Association and reproduced in Appendix A. Principles 13, 14, and 15 are specifically directed to testing, being concerned with Test Security, Test Interpretation, and Test Publication. Other principles that, although broader in scope, are highly relevant to testing include 6 (Confidentiality), 7 (Client Welfare), and 9 (Impersonal Services). Some of the matters discussed in the Ethical Standards are closely related to points covered in the Standards for Educational and Psychological Tests (1974), cited in Chapter 1. For a fuller and richer understanding of the principles set forth in the Ethical Standards, the reader should consult two companion publications, the Casebook on Ethical Standards of Psychologists (1967) and Ethical Principles in the Conduct of Research with Human Participants (1973). Both report specific incidents to illustrate each principle. Special attention is given to marginal situations in which there may be a conflict of values, as between the advancement of science for human betterment and the protection of the rights and welfare of individuals.

The requirement that tests be used only by appropriately qualified examiners is one step toward protecting the individual against the improper use of tests. Of course, the necessary qualifications vary with the type of test. Thus, a relatively long period of intensive training and supervised experience is required for the proper use of individual intelligence tests and most personality tests, whereas a minimum of specialized psychological training is needed in the case of educational achievement or vocational proficiency tests. It should also be noted that students who take tests in class for instructional purposes are not usually equipped to administer the tests to others or to interpret the scores properly.

The well-trained examiner chooses tests that are appropriate for both the particular purpose for which he is testing and the person to be examined. He is also cognizant of the available research literature on the chosen test and able to evaluate its technical merits with regard to such characteristics as norms, reliability, and validity. In administering the test, he is sensitive to the many conditions that may affect test performance, such as those illustrated in Chapter 2. He draws conclusions or makes recommendations only after considering the test score (or scores) in the light of other pertinent information about the individual. Above all, he should be sufficiently knowledgeable about the science of human behavior to guard against unwarranted inferences in his interpretations of test scores. When tests are administered by psychological technicians or assistants, or by persons in other professions, it is essential that an adequately qualified psychologist be available, at least as a consultant, to provide the needed perspective for a proper interpretation of test performance.

Misconceptions about the nature and purpose of tests and misinterpretations of test results underlie many of the popular criticisms of psychological tests. In part, these difficulties arise from inadequate communication between psychometricians and their various publics: educators, parents, legislators, job applicants, and so forth. Probably the most common examples center on unfounded inferences from IQs. Not all misconceptions about tests, however, can be attributed to inadequate communication between psychologists and laymen. Psychological testing itself has tended to become dissociated from the mainstream of behavioral science (Anastasi, 1967). The growing complexity of the science of psychology has inevitably been accompanied by increasing specialization among psychologists. In this process, psychometricians have concentrated more and more on the technical refinements of test construction and have tended to lose contact with developments in other relevant specialties, such as learning, child development, individual differences, and behavior genetics. Thus, the technical aspects of test construction have tended to outstrip the psychological sophistication with which test results are interpreted. Test scores can be properly interpreted only in the light of all available knowledge regarding the behavior that the tests are designed to measure.

Who is a qualified psychologist? Obviously, with the diversification of the field and the consequent specialization of training, no psychologist is equally qualified in all areas. In recognition of this fact, the Ethical Standards specify: "The psychologist recognizes the boundaries of his competence and the limitations of his techniques and does not offer services or use techniques that fail to meet professional standards established in particular fields" (Appendix A, Principle 2c). A useful distinction is that between a psychologist working in an institutional setting, such as a school system, university, clinic, or government agency, and one engaged in independent practice. Because the independent practitioner is less subject to judgment and evaluation by knowledgeable colleagues than is the institutional psychologist, he needs to meet higher standards of professional qualifications. The same would be true of a psychologist responsible for the supervision of other institutional psychologists or one who serves as an expert consultant to institutional personnel.

A significant step, both in upgrading professional standards and in helping the public to identify qualified psychologists, was the enactment of state licensing and certification laws for psychologists. Nearly all states now have such laws. Although the terms "licensing" and "certification" are often used interchangeably, in psychology certification typically refers to legal protection of the title "psychologist," whereas licensing controls the practice of psychology. Licensing laws thus need to include a definition of the practice of psychology. In either type of law, the requirements are generally a PhD in psychology, a specified amount of supervised experience, and satisfactory performance on a qualifying examination. Violations of the APA ethics code constitute grounds for revoking a certificate or license. Although most states began with the simpler certification laws, there has been continuing movement toward licensing.

At a more advanced level, specialty certification within psychology is provided by the American Board of Professional Psychology (ABPP). Requiring a high level of training and experience within designated specialties, ABPP grants diplomas in such areas as clinical, counseling, industrial and organizational, and school psychology. The Biographical Directory of the APA contains a list of current diplomates in each specialty, which can also be obtained directly from ABPP. The principal function of ABPP is to provide information regarding qualified psychologists. As a privately constituted board within the profession, ABPP does not have the enforcement authority available to the agencies administering the state licensing and certification laws.

The purchase of tests is generally restricted to persons who meet certain minimal qualifications. The catalogues of major test publishers specify requirements that must be met by purchasers. Usually individuals with a master's degree in psychology or its equivalent qualify. Some publishers classify their tests into levels with reference to user qualifications, ranging from educational achievement and vocational proficiency tests, through group intelligence tests and interest inventories, to such clinical instruments as individual intelligence tests and most personality tests. Distinctions are also made between individual purchasers and authorized institutional purchasers of appropriate tests. Graduate students who may need a particular test for a class assignment or for research must have the order countersigned by their psychology instructor, who assumes responsibility for the proper use of the test.

Efforts to restrict the distribution of tests have a dual objective: security of test materials and prevention of misuse. The Ethical Standards state: "Access to such devices is limited to persons with professional interests who will safeguard their use" (Principle 13); "Test scores, like test materials, are released only to persons who are qualified to interpret and use them properly" (Principle 14). It should be noted that although test distributors make sincere efforts to implement these objectives, the control they are able to exert is necessarily limited. The major responsibility for the proper use of tests resides in the individual user or institution concerned. It is evident, for example, that an MA degree in psychology, or even a PhD, a state license, and an ABPP diploma, do not necessarily signify that the individual is qualified to use a particular test or that his training is relevant to the proper interpretation of the results obtained with that test.

Another professional responsibility concerns the marketing of psychological tests by authors and publishers. Tests should not be released prematurely for general use. Nor should any claims be made regarding the merits of a test in the absence of sufficient objective evidence. When a test is distributed early for research purposes only, this condition should be clearly specified and the distribution of the test restricted accordingly. The test manual should provide adequate data to permit an evaluation of the test itself, as well as full information regarding administration, scoring, and norms. The manual should be a factual exposition of what is known about the test rather than a selling device designed to put the test in a favorable light. It is the responsibility of the test author and publisher to revise tests and norms often enough to prevent obsolescence. The rapidity with which a test becomes outdated will, of course, vary with the nature of the test.

Specific content of tests should not be published in a newspaper, magazine, or popular book, either for descriptive purposes or for self-evaluation. Under these conditions, self-evaluation would not only be subject to such drastic errors as to be worthless, but it might also be psychologically injurious to the individual. Moreover, any publicity given to specific test items will tend to invalidate the future use of the test with other persons. It might also be added that presentation of test materials in this fashion tends to create an erroneous and distorted picture of psychological testing. Such publicity may foster either naive credulity or indiscriminate resistance on the part of the public toward all psychological testing.

Another unprofessional practice is testing by mail. An individual's performance on either aptitude or personality tests cannot be properly assessed by mailing test forms to him and having him return them by mail for scoring and interpretation. Not only does this procedure provide no control of testing conditions, but usually it also involves the interpretation of test scores in the absence of other pertinent information about the individual. Under these conditions, test results may be worse than useless.

A question arising particularly in connection with personality tests is that of invasion of privacy. Insofar as some tests of emotional, motivational, or attitudinal traits are necessarily disguised, the subject may reveal characteristics in the course of such a test without realizing that he is so doing. Although there are few available tests whose approach is subtle enough to fall into this category, the possibility of developing such indirect testing procedures imposes a grave responsibility on the psychologist who uses them. For purposes of testing effectiveness, it may be necessary to keep the examinee in ignorance of the specific ways in which his responses on any one test are to be interpreted. Nevertheless, a person should not be subjected to any testing program under false pretenses. Of primary importance in this connection is the obligation to have a clear understanding with the examinee regarding the use that will be made of his test results. The following statement contained in Ethical Standards of Psychologists (Principle 7d) is especially germane to this problem:

The psychologist who asks that an individual reveal personal information in the course of interviewing, testing, or evaluation, or who allows such information to be divulged to him, does so only after making certain that the responsible person is fully aware of the purposes of the interview, testing, or evaluation and of the ways in which the information may be used.

Although concerns about the invasion of privacy have been expressed most commonly about personality tests, they logically apply to any type of test. Certainly any intelligence, aptitude, or achievement test may reveal limitations in skills and knowledge that an individual would rather not disclose. Moreover, any observation of an individual's behavior, as in an interview, casual conversation, or other personal encounter, may yield information about him that he would prefer to conceal and that he may reveal unwittingly. The fact that psychological tests have often been singled out in discussions of the invasion of privacy probably reflects prevalent misconceptions about tests. If all tests were recognized as measures of behavior samples, with no mysterious powers to penetrate beyond behavior, popular fears and suspicion would be lessened.

It should also be noted that all behavior research, whether employing tests or other observational procedures, presents the possibility of invasion of privacy. Yet, as scientists, psychologists are committed to the goal of advancing knowledge about human behavior. Principle 1a in Ethical Standards of Psychologists (Appendix A) clearly spells out the psychologist's conviction "that society will be best served when he investigates where his judgment indicates investigation is needed." Several other principles, on the other hand, are concerned with the protection of privacy and with the welfare of research subjects (see, e.g., 7d, 8a, 16). Conflicts of values may thus arise, which must be resolved in individual cases. Examples of such conflict resolutions can be found in the previously cited Ethical Principles in the Conduct of Research with Human Participants (1973).

The problem is obviously not simple; and it has been the subject of extensive deliberation by psychologists and other professionals. In a report titled Privacy and Behavioral Research (1967), prepared for the Office of Science and Technology, the right to privacy is defined as "the right of the individual to decide for himself how much he will share with others his thoughts, his feelings, and the facts of his personal life" (p. 2). It is further characterized as "a right that is essential to insure dignity and freedom of self-determination" (p. 2). To safeguard personal privacy, no universal rules can be formulated; only general guidelines can be provided. In the application of these guidelines to specific cases, there is no substitute for the ethical awareness and professional responsibility of the individual psychologist. Solutions must be worked out in terms of the particular circumstances.

One relevant factor is the purpose for which the testing is conducted: whether for individual counseling, institutional decisions regarding selection and classification, or research. In clinical or counseling situations, the client is usually willing to reveal himself in order to obtain help with his problems. The clinician or examiner does not invade privacy where he is freely admitted. Even under these conditions, however, the client should be warned that in the course of the testing or interviewing he may reveal information about himself without realizing that he is so doing; or he may disclose feelings of which he himself is unaware.

When testing is conducted for institutional purposes, the examinee should be fully informed as to the use that will be made of his test scores. It is also desirable, however, to explain to the examinee that correct assessment will benefit him, since it is not to his advantage to be placed in a position where he will fail or which he will find uncongenial. The results of tests administered in a clinical or counseling situation, of course, should not be made available for institutional purposes, unless the examinee gives his consent.

When tests are given for research purposes, anonymity should be preserved as fully as possible, and the procedures for ensuring such anonymity should be explained in advance to the subjects. Anonymity does not, however, solve the problem of protecting privacy in all research contexts. Some subjects may resent the disclosure of facts they consider personal, even when complete confidentiality of responses is assured. In most cases, however, cooperation of subjects may be elicited if they are convinced that the information is needed for the research in question and if they have sufficient confidence in the integrity and competence of the investigator. All research on human behavior, whether or not it utilizes tests, may present conflicts of values. Freedom of inquiry, which is essential to the progress of science, must be balanced against the protection of the individual. The investigator must be alert to the values involved and must carefully weigh alternative solutions (see Ethical Principles, 1973; Privacy and Behavioral Research, 1967; Ruebhausen & Brim, 1966).

Whatever the purposes of testing, the protection of privacy involves two key concepts: relevance and informed consent. The information that the individual is asked to reveal must be relevant to the stated purposes of the testing. An important implication of this principle is that all practicable efforts should be made to ascertain the validity of tests for the particular diagnostic or predictive purpose for which they are used. An instrument that is demonstrably valid for a given purpose is one that provides relevant information. It also behooves the examiner to make sure that test scores are correctly interpreted. An individual is less likely to feel that his privacy is being invaded by a test assessing his readiness for a particular educational program than by a test allegedly measuring his "innate intelligence."

The concept of informed consent also requires clarification; and its application in individual cases may call for the exercise of considerable judgment (Ethical Principles, 1973; Ruebhausen & Brim, 1966). The examinee should certainly be informed about the purpose of testing, the kinds of data sought, and the use that will be made of his scores. It is not implied, however, that he be shown the test items in advance or told how specific responses will be scored. Nor should the test items be shown to a parent, in the case of a minor. Such information would usually invalidate the test. Not only would the giving of this information seriously impair the usefulness of an ability test, but it would also tend to distort responses on many personality tests. For example, if an individual is told in advance that a self-report inventory will be scored with a dominance scale, his responses are likely to be influenced by stereotyped (and often erroneous) ideas he may have about this trait or by a false or distorted self-concept.

In the testing of children, special questions arise with regard to parental consent. Helpful guidance on these questions is provided by the Russell Sage Foundation Guidelines for the Collection, Maintenance, and Dissemination of Pupil Records (1970). With reference to consent, the Guidelines differentiate between individual consent, given by the child, his parents, or both, and representational consent, given by the parents' legally elected or appointed representatives, such as a school board. While avoiding rigid prescriptions, the Guidelines cite aptitude and achievement tests as examples of the type of instrument for which representational consent should be sufficient, and personality assessment as an example of the type for which individual consent is needed. A helpful feature of the Guidelines is the inclusion of sample forms for obtaining written consent. There is also a selected bibliography on the ethical and legal aspects of school record keeping.

Test administration procedures and experimental designs that protect the individual's right to decline to participate and that adequately safeguard his privacy, while yielding scientifically meaningful data, present a challenge to the psychologist's ingenuity. With proper rapport and the establishment of attitudes of mutual respect, however, the number of refusals to participate may be reduced to a negligible quantity. The technical difficulties of biased sampling and volunteer error may thus be avoided. Experience from both national and statewide surveys suggests that this can be achieved, both in the testing of educational outcomes and in the more sensitive area of personality research (Holtzman, 1971; Womer, 1970). There is also some evidence that the number of respondents who consider a personality inventory an invasion of privacy or find some of the items offensive is significantly reduced when the inventory is preceded by a simple and forthright explanation of how items were selected and how scores will be interpreted (Fink & Butcher, 1972). From the standpoint of test validity, it should be added that such an explanation did not affect the mean profile of scores on the personality inventory.

CONFIDENTIALITY

Like the protection of privacy, to which it is related, the problem of the confidentiality of test data is multifaceted. The fundamental question is: Who shall have access to test results? Several considerations influence the answer in particular situations. Among them are the security of test content, the hazards of misunderstanding test scores, and the need of various persons to know the results.

There has been a growing awareness of the right of the individual himself to have access to the findings in his test report. He should also have the opportunity to comment on the contents of the report and, if necessary, to clarify or correct factual information. Counselors are now trying more and more to involve the client as an active participant in his own assessment. For these purposes, test results should be presented in a form that is readily understandable, free from technical jargon or labels, and oriented toward the immediate objective of the testing. Proper safeguards must be observed against misuse and misinterpretation of test findings (see Ethical Standards, Principle 14).

In the case of minors, one must also consider the parents' right of access to the child's test record. This presents a possible conflict with the child's own right to privacy, especially in the case of older children. In a searching analysis of the problem, Ruebhausen and Brim (1966, pp. 431-432) wrote: "Should not a child, even before the age of full legal responsibility, be accorded the dignity of a private personality? Considerations of healthy personal growth, buttressed with reasons of ethics, seem to command that this be done." The previously mentioned Guidelines (Russell Sage Foundation, 1970, p. 27) recommend that "when a student reaches the age of eighteen and no longer is attending high school, or is married (whether age eighteen or not)," he should have the right to deny parental access to his records. However, this recommendation is followed by the caution that school authorities check local state laws for possible legal difficulties in implementing such a policy.

Apart from these possible exceptions, the question is not whether to communicate test results to parents of a minor, but how to do so. Parents normally have a legal right to information about their child; and it is usually desirable for them to have such information. In some cases, moreover, a child's academic or emotional difficulties may arise in part from parent-child relations. Under these conditions, the counselor's contact with the parents is of prime importance, both to fill in background data and to elicit parental cooperation.

Discussions of the confidentiality of test records have usually dealt with accessibility to a third person, other than the individual tested (or the parent of a minor) and the examiner (Ethical Standards, Principle 6; Russell Sage Foundation, 1970). The underlying principle is that such records should not be released without the knowledge and consent of the individual.

When tests are administered in an institutional context, as in a school system, court, or employment setting, the individual should be informed at the time of testing regarding the purpose of the test, how the results will be used, and their availability to institutional personnel who have a legitimate need for them. Under these conditions, no further permission is needed at the time the test results are made available within the institution.

A different situation exists when test results are requested by outsiders, as when a prospective employer or a college requests test results from a school system. In such instances, individual consent for the release of the data is required. The same requirement applies to tests administered in clinical and counseling contexts, or for research purposes. The previously cited Guidelines (Russell Sage Foundation, 1970, p. 42) contain a sample form for the use of school systems in securing consent for the transmission of pupil data.

Another problem pertains to the retention of records in institutions. On the one hand, longitudinal records can be very valuable, not only for research purposes but also for understanding and counseling the person. As is so often the case, these advantages presuppose proper use and interpretation of test results. On the other hand, the availability of old records opens the way for such misuses as incorrect inferences from obsolete data and unauthorized access for other than the original testing purpose. It would be a manifest absurdity, for example, to cite an IQ or a reading achievement score obtained by a child in the third grade when evaluating him for admission to college. Too much may have happened to him in the intervening years to make such early and outdated scores meaningful. Similarly, when records are retained for many years, there is danger that they may be used for purposes that the individual (or his parents) never suspected and would not have approved.

To prevent such misuses, when records are retained, either for legitimate longitudinal use in the interest of the individual or for acceptable research purposes, access to them should be subject to unusually stringent controls. In the previously cited Guidelines (Russell Sage Foundation, 1970), school records are classified into three categories with regard to retention. A major determining factor in this classification is the degree of objectivity and verifiability of the data; another is relevance to the educational objectives of the school. It would be desirable for any type of institution to formulate similar explicit policies regarding the destruction, retention, and accessibility of personal records.

The problems of maintenance, security, and accessibility of test results, and of all other personal data, have been magnified by the development of computerized data banks. In a preface to the Guidelines (Russell Sage Foundation, 1970, pp. 5-6), Ruebhausen wrote:

Modern science has introduced a new dimension into the issues of privacy. There was a time when among the strongest allies of privacy were the inefficiency of man, the fallibility of his memory, and the healing compassion that accompanied both the passing of time and the warmth of human recollection. . . . Modern science has given us the capacity to record faithfully, to maintain permanently, to retrieve promptly, and to communicate both widely and instantly.

The unprecedented advances in storing, processing, and retrieving data made possible by computers can be of inestimable service both in research and in the more immediate handling of social problems. The potential dangers of invasion of privacy and violation of confidentiality need to be faced squarely, constructively, and imaginatively. Rather than fearing the centralization and efficiency of complex computer systems, we should explore the possibility that these very characteristics may permit more effective procedures for protecting the security of individual records.

An example of what can be accomplished with adequate facilities is provided by the Link system developed by the American Council on Education (Astin & Boruch, 1970). In a longitudinal research program on the effects of different types of college environments, questionnaires were administered annually to several hundred thousand college freshmen. To permit the collection of follow-up data on the same persons while preventing the identification of individual responses by anyone at any future time, a three-file system of computer tapes was devised. The first tape, containing each student's responses marked with an arbitrary identification number, is readily accessible for research purposes. The second tape, containing only the students' names and addresses with the same identification numbers, was originally housed in a locked vault and used only to print labels for follow-up mailings. After the preparation of these tapes, the original questionnaires were destroyed.

This two-file system represents the traditional security system. It still did not provide complete protection, since some staff members would have access to both files. Moreover, such files are subject to judicial and legislative subpoena. For these reasons, a third file was prepared. Known as the Link file, it contained only the original identification numbers and a new set of random numbers which were substituted for the original identification numbers in the name and address file. The Link file was deposited at a computer facility in a foreign country, with the agreement that the file would never be released to anyone, including the American Council on Education. Follow-up data tapes are sent to the foreign facility, which substitutes one set of code numbers for the other. With the decoding files and the research data files under the control of different organizations, no one can identify the responses of individuals in the data files. Such elaborate precautions for the protection of confidentiality obviously would not be feasible except in a large-scale computerized data bank. The procedure could be simplified somewhat if the linking facility were located in a domestic agency given adequate protection against subpoena.
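The logic of the three-file arrangement can be summarized in a few lines of code. The sketch below is purely illustrative: the file layouts, field names, and values are hypothetical, not the American Council on Education's actual formats. It shows why neither the data file nor the name file alone identifies anyone, and why only the custodian of the link file can connect them.

    import secrets

    # File 1: research data, keyed by an arbitrary research ID (no names).
    data_file = {1001: {"year": 1969, "responses": [3, 1, 4]}}

    # File 3, the Link file: research ID -> a new random code. Held by a
    # separate custodian, it is the only bridge between the other two files.
    link_file = {1001: secrets.token_hex(4)}

    # File 2: names and addresses, keyed by the random code rather than by
    # the research ID, so it shares no key with the data file.
    name_file = {link_file[1001]: "Jane Doe, 12 Elm St."}

    # Only an agency holding link_file can produce follow-up mailing labels.
    def mailing_label(research_id):
        return name_file[link_file[research_id]]

Because the researchers hold only data_file and the mailing agency holds only name_file, neither party can match a person to his responses; compromising any single file reveals nothing.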


Psychologists have given much thought to the communication of test results in a form that will be meaningful and useful. It is clear that the information should not be transmitted routinely, but should be accompanied by interpretive explanations by a professionally trained person. In communicating scores to parents, for example, a recommended procedure is to arrange a group meeting at which a counselor or school psychologist explains the purpose and nature of the tests, the sort of conclusions that may reasonably be drawn from the results, and the limitations of the data. Written reports about their own children may then be distributed to the parents, and arrangements made for personal interviews with any parents wishing to discuss the reports further. Regardless of how they are transmitted, however, an important condition is that test results should be presented in terms of descriptive performance levels rather than isolated numerical scores. This is especially true of intelligence tests, which are more likely to be misinterpreted than are achievement tests.

In communicating results to teachers, school administrators, employers, and other appropriate persons, similar safeguards should be provided. Broad levels of performance and qualitative descriptions in simple terms are to be preferred over specific numerical scores, except when communicating with adequately trained professionals. Even well-educated laymen have been known to confuse percentiles with percentage scores, percentile ranks with IQ's, norms with standards, and interest ratings with aptitude scores.
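The first of these confusions is worth pinning down. A percentage score states how many items were answered correctly; a percentile rank states where the examinee stands relative to a norm group. The brief sketch below is an illustrative addition with invented numbers, not data from the text; it computes both figures for one hypothetical examinee on a 40-item test.

    # Raw scores of a hypothetical ten-person norm group on a 40-item test.
    norm_scores = [12, 18, 22, 25, 26, 28, 30, 31, 33, 36]
    raw = 30  # the examinee's raw score

    percentage = 100 * raw / 40  # percent of items answered correctly
    percentile = 100 * sum(s < raw for s in norm_scores) / len(norm_scores)

    print(percentage, percentile)  # 75.0 vs 60.0 -- two different quantities

The same raw score thus yields a percentage of 75 but a percentile rank of only 60, since the two numbers answer entirely different questions.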

But a more serious misinterpretation pertains to the conclusions drawn from test scores, even when their technical meaning is correctly understood. A familiar example is the popular assumption that an IQ indicates a fixed characteristic of the individual which predetermines his lifetime level of intellectual achievement.

In all test communication, it is desirable to take into account the characteristics of the person who is to receive the information. This applies not only to that person's general education and knowledge about psychology and testing, but also to his anticipated emotional response to the information. In the case of a parent or teacher, for example, personal emotional involvement with the child may interfere with a calm and rational acceptance of factual information.

Last, but by no means least, is the problem of communicating test results to the individual himself, whether child or adult. The same general safeguards against misinterpretation apply here as in communicating with a third party. The person's emotional reaction to the information is especially important, of course, when he is learning about his own assets and shortcomings. When an individual is given his own test results, not only should the data be interpreted by a properly qualified person, but facilities should also be available for counseling anyone who may become emotionally disturbed by such information. For example, a college student might become seriously discouraged when he learns of his poor performance on a scholastic aptitude test. A gifted schoolchild might develop habits of laziness and shiftlessness, or he might become uncooperative and unmanageable, if he discovers that he is much brighter than any of his associates. A severe personality disorder may be precipitated when a maladjusted individual is given his score on a personality test. Such detrimental effects may, of course, occur regardless of the correctness or incorrectness of the score itself. Even when a test has been accurately administered and scored and properly interpreted, a knowledge of such a score without the opportunity to discuss it further may be harmful to the individual.

Counseling psychologists have been especially concerned with the development of effective ways of transmitting test information to their clients (see, e.g., Goldman, 1971, Ch. 14-16). Although the details of this process are beyond the scope of the present discussion, two major guidelines are of particular interest. First, test-reporting is to be viewed as an integral part of the counseling process and incorporated into the total counselor-client relationship. Second, insofar as possible, test results should be reported as answers to specific questions raised by the counselee. An important consideration in counseling relates to the counselee's acceptance of the information presented to him. The counseling situation is such that if the individual rejects any information, for whatever reasons, then that information is likely to be totally wasted.

THE SETTING. The decades since 1950 have witnessed an increasing public concern with the rights of minorities,¹ a concern that is reflected in the enactment of civil rights legislation at both federal and state levels. In connection with mechanisms for improving the educational and vocational opportunities of such groups, psychological testing has been a major focus of attention. The psychological literature of the 1960s and early 1970s contains many discussions of the topic, whose impact ranges from clarification to obfuscation. Among the more clarifying contributions are several position papers by professional associations (see, e.g., American Psychological Association, 1969; Cleary, Humphreys, Kendrick, & Wesman, 1975; Deutsch, Fishman, Kogan, North, & Whiteman, 1964; The responsible use of tests, 1972). A brief but cogent paper by Flaugher (1974) also helps to clear away some prevalent misconceptions.

¹ Although women represent a statistical majority in the national population, legally, occupationally, and in other ways they have shared many of the problems of minorities. Hence, when the term "minority" is used in this section, it will be understood to include women.

Much of the concern centers on the lowering of test scores by cultural conditions that may have affected the development of aptitudes, interests, motivation, attitudes, and other psychological characteristics of minority group members. Some of the proposed solutions, however, reflect misunderstandings about the nature and function of psychological tests. Differences in the experiential backgrounds of groups or individuals are inevitably manifested in test performance. Every psychological test measures a behavior sample. Insofar as culture affects behavior, its influence will and should be detected by tests. If we rule out all cultural differentials from a test, we may thereby lower its validity as a measure of the behavior domain it was designed to assess. In that case, the test would fail to provide the kind of information needed to correct the very conditions that impaired performance.

Because the testing of minorities represents a special case of the broader problem of cross-cultural testing, the theoretical rationale and testing procedures are examined more fully in Chapter 12. A technical analysis of the concept of test bias is given in Chapter 7, in connection with test validity. In the present chapter, our interest is primarily in the basic issues and social implications of minority group testing.


TEST-RELATED FACTORS. In testing culturally diverse persons, it is important to differentiate between cultural factors that affect both test and criterion behavior and those whose influence is restricted to the test. It is the latter, test-related factors that reduce test validity. Examples of such factors include previous experience in taking tests, motivation to perform well on tests, rapport with the examiner, and any other variables affecting performance on the particular test but irrelevant to the criterion behavior under consideration. Special efforts should be made to reduce the operation of these test-related factors when testing persons with dissimilar cultural backgrounds. A desirable procedure is to provide adequate test-taking orientation and preliminary practice, as illustrated by the booklets and tape recordings cited in Chapter 2. Retesting with a parallel form is also recommended with low-scoring examinees who have had little or no previous test-taking experience.

Specific test content may also influence test scores in ways that are unrelated to criterion performance. In a test of arithmetic reasoning, for example, the use of names or pictures of objects unfamiliar in a particular cultural milieu would obviously represent a test-restricted handicap. Ability to carry out quantitative thinking does not depend upon familiarity with such objects. On the other hand, if the development of arithmetic ability itself is more strongly fostered in one culture than in another, scores on an arithmetic test should not eliminate or conceal such a difference.

Another, more subtle way in which specific test content may spuriously affect performance is through the examinee's emotional and attitudinal responses. Stories or pictures portraying typical suburban middle-class family scenes, for example, may alienate a child reared in a low-income inner-city home. Exclusive representation of the physical features of a single racial type in test illustrations may have a similar effect on members of an ethnic minority. In the same vein, women's organizations have objected to the perpetuation of sex stereotypes in test content, as in the portrayal of male doctors or executives and female nurses or secretaries. Certain words, too, may have acquired connotations that are offensive to minority groups. As one test publisher aptly expressed it, "Until fairly recently, most standardized tests were constructed by white middle-class people, who sometimes clumsily violate the feelings of the test-taker without even knowing it. In a way, one could say that we have been not so much culture biased as we have been 'culture blind'" (Fitzgibbon, 1972, pp. 2-3).

The major test publishers now make special efforts to weed out inappropriate test content. Their own test construction staffs have become sensitized to potentially offensive, culturally restricted, or stereotyped material. Members of different ethnic groups participate either as regular staff members or as consultants. And the reviewing of test content with reference to possible minority implications is a regular step in the process of test construction. An example of the application of these procedures in item construction and revision is provided by the 1970 edition of the Metropolitan Achievement Tests (Fitzgibbon, 1972; Harcourt Brace Jovanovich, 1972).

INTERPRETATION AND USE OF TEST SCORES. By far the most important considerations in the testing of culturally diverse groups, as in all testing, pertain to the interpretation of test scores. The most frequent misgivings regarding the use of tests with minority group members stem from misinterpretations of scores. If a minority examinee obtains a low score on an aptitude test or a deviant score on a personality test, it is essential to investigate why he did so. For example, an inferior score on an arithmetic test could result from low test-taking motivation, poor reading ability, or inadequate knowledge of arithmetic, among other reasons. Some thought should also be given to the type of norms to be employed in evaluating individual scores. Depending on the purpose of the testing, the appropriate norms may be general norms or subgroup norms based on . . .


According to a popular misconception, the IQ is an index of innate intellectual potential and represents a fixed property of the organism. As will be seen in Chapter 12, this view is neither theoretically defensible nor supported by empirical data. When properly interpreted, intelligence test scores should not foster a rigid categorizing of persons. On the contrary, intelligence tests (and any other test) may be regarded as a map on which the individual's present position can be located. When combined with information about his experiential background, test scores should facilitate effective planning for the optimal development of the individual.

Misinterpreted as an index of fixed capacity, however, an IQ would thus serve to perpetuate the examinee's handicap. It is largely because implications of permanent status have become attached to the IQ that in 1964 the use of group intelligence tests was discontinued in the New York City public schools (H. B. Gilbert, 1966; Loretan, 1966). That it proved necessary to discard the tests in order to eliminate the misconceptions about the fixity of the IQ is a revealing commentary on the tenacity of the misconceptions. It should also be noted that the use of individual intelligence tests like the Stanford-Binet, which are administered and interpreted by trained examiners and school psychologists, was not eliminated. It was the mass testing and routine use of IQs by relatively unsophisticated persons that was considered hazardous.

OBJECTIVITY OF TESTS. When social stereotypes and prejudice may distort interpersonal evaluations, tests provide a safeguard against favoritism and arbitrary or capricious decisions. Commenting on the use of tests in schools, Gardner (1961, pp. 48-49) wrote: "The tests couldn't see whether the youngster was in rags or in tweeds, and they couldn't hear the accents of the slum. The tests revealed intellectual gifts at every level of the population."

In the same vein, the Guidelines for Testing Minority Group Children (Deutsch et al., 1964, p. 139) contain the following observation:

Many bright, non-conforming pupils, with backgrounds different from those of their teachers, make favorable showings on achievement tests, in contrast to their low classroom marks. These are very often children whose cultural handicaps are most evident in their overt social and interpersonal behavior. Without the intervention of standardized tests, many such children would be stigmatized by the adverse subjective ratings of teachers who tend to reward conformist behavior of middle-class character.

With regard to personnel selection, the contribution of tests was aptly characterized in the following words by John W. Macy, Jr., Chairman of the United States Civil Service Commission (Testing and Public Policy, 1965, p. 883):

The necessity to measure characteristics of people that are related to job performance is at the very root of the merit system, which is the basis for entry to the career services of the Federal Government. Over the years, the service has had a vital interest in the development and application of psychological testing methods. I have no doubt that the widespread public acceptance of the objectivity of our examining procedures has in large part been earned by the public's perception of the fairness, the practicality, and the validity of the appraisal methods they must submit to.

The Guidelines on Employee Selection Procedures, prepared by the Equal Employment Opportunity Commission (1970) as an aid in the implementation of the Civil Rights Act, begin with the following statement of purpose:

The guidelines in this part are based on the belief that properly validated and standardized employee selection procedures can significantly contribute to the implementation of nondiscriminatory personnel policies, as required by Title VII. It is also recognized that professionally developed tests, when used in conjunction with other tools of personnel assessment and complemented by sound programs of job design, may significantly aid in the development and maintenance of an efficient work force and, indeed, aid in the utilization and conservation of human resources generally.

In summary, tests can be misused in testing culturally disadvantaged persons, as in testing anyone else. When properly used, however, they serve an important function in preventing irrelevant and unfair discrimination. They also provide a quantitative index of the extent of cultural handicap as a necessary first step in remedial programs.


LEGAL REGULATIONS. A number of states enacted legislation and established Fair Employment Practices Commissions (FEPC) to implement it prior to the development of such legal mechanisms at the federal level. Among the states that did so later, some efforts have been made to pattern the regulations after the federal ones.2 The most pertinent federal legislation is provided by the Equal Employment Opportunity Act (Title VII of the Civil Rights Act of 1964 and its subsequent amendments). Responsibility for implementation and enforcement is vested in the Equal Employment Opportunity Commission (EEOC). When charges are filed, the EEOC investigates the complaint and, if it finds the charges to be justified, tries first to correct the situation through conferences and voluntary compliance. If these procedures fail, EEOC may proceed to hold hearings, issue cease and desist orders, and finally bring action in the federal courts. In states having an approved FEPC, the Commission will defer to the local agency and will give its findings and conclusions "substantial weight."

2 A brief summary of the major legal developments since midcentury, including legislative actions, executive orders, and court decisions, can be found in Fincher (1973).

The Office of Federal Contract Compliance (OFCC) has the authority to monitor the use of tests for employment purposes by government contractors. Colleges and universities are among the institutions concerned with OFCC regulations, because of their many research and training grants from such federal sources as the Department of Health, Education, and Welfare. Both EEOC and OFCC have drawn up guidelines regarding employee testing and other selection procedures, which are virtually identical in substance. A copy of the EEOC Guidelines on Employee Selection Procedures is reproduced in Appendix B, together with a 1974 amendment of the OFCC guidelines clarifying acceptable procedures for reporting test validity.3

3 In 1973, in the interest of simplification and improved coordination, the preparation of a set of uniform guidelines was undertaken by the Equal Employment Opportunity Coordinating Council, consisting of representatives of EEOC, the U.S. Department of Justice, the U.S. Civil Service Commission, the U.S. Department of Labor, and the U.S. Commission on Civil Rights. No uniform version has as yet been adopted.

Some major provisions in the EEOC Guidelines should be noted. The Equal Employment Opportunity Act prohibits discrimination by employers, trade unions, or employment agencies on the basis of race, color, religion, sex, or national origin. It is recognized that properly conducted testing programs not only are acceptable under this Act but can also contribute to the "implementation of nondiscriminatory personnel policies." Moreover, the same regulations specified for tests are also applied to all other formal and informal selection procedures, such as educational or work-history requirements, interviews, and application forms (Sections 2 and 13).

When the use of a test (or other selection procedure) results in a significantly higher rejection rate for minority candidates than for nonminority candidates, its utility must be justified by evidence of validity for the job in question. In defining acceptable procedures for establishing validity, the Guidelines make explicit reference to the Standards for Educational and Psychological Tests (1974) prepared by the American Psychological Association. A major portion of the Guidelines covers minimum requirements for acceptable validation (Sections 5 to 9). The reader may find it profitable to review these requirements after reading the more detailed technical discussion of validity in Chapters 6 and 7 of this book. It will be seen that the requirements are generally in line with good psychometric practice.

In the final section, dealing with affirmative action, the Guidelines point out that even when selection procedures have been satisfactorily



validated, if disproportionate rejection rates result for minorities, steps must be taken to reduce this discrepancy as much as possible. Affirmative action implies that an organization does more than merely avoid discriminatory practices. Psychologically, affirmative action programs may be regarded as efforts to compensate for the residual effects of past social inequities. Such effects may include deficiencies in aptitudes, job skills, motivation, and other job-related behavior. They may also be reflected in a person's reluctance to apply for a job not traditionally open to such candidates, or in his inexperience in job-seeking procedures. Examples of affirmative actions in meeting these problems include recruiting through media most likely to reach minorities; explicitly encouraging minority candidates to apply and following other recruiting practices designed to counteract past stereotypes; and, when practicable, providing special training programs for the acquisition of prerequisite skills and knowledge.

PART 2
Principles of Psychological Testing


CHAPTER 4
Norms and the Interpretation of Test Scores

IN THE absence of additional interpretive data, a raw score on any psychological test is meaningless. To say that an individual has correctly solved 15 problems on an arithmetic reasoning test, or identified 34 words in a vocabulary test, or successfully assembled a mechanical object in 57 seconds conveys little or no information about his standing in any of these functions. Nor do the familiar percentage scores provide a satisfactory solution to the problem of interpreting test scores. A score of 65 percent correct on one vocabulary test, for example, might be equivalent to 30 percent correct on another, and to 80 percent correct on a third. The difficulty level of the items making up each test will, of course, determine the meaning of the score. Like all raw scores, percentage scores can be interpreted only in terms of a clearly defined and uniform frame of reference.

Scores on psychological tests are most commonly interpreted by reference to norms which represent the test performance of the standardization sample. The norms are thus empirically established by determining what a representative group of persons actually do on the test. Any individual's raw score is then referred to the distribution of scores obtained by the standardization sample, to discover where he falls in that distribution. Does his score coincide with the average performance of the standardization group? Is he slightly below average? Or does he fall near the upper end of the distribution?

In order to determine more precisely the individual's exact position with reference to the standardization sample, the raw score is converted into some relative measure. These derived scores are designed to serve a dual purpose. First, they indicate the individual's relative standing in the normative sample and thus permit an evaluation of his performance in reference to other persons. Second, they provide comparable measures that permit a direct comparison of the individual's performance on different tests. For example, if an individual has a raw score of 40 on a vocabulary test and a raw score of 22 on an arithmetic reasoning test, we



know nothing about his relative performance on the two tests. Is he better in vocabulary or in arithmetic, or equally good in both? Since raw scores on different tests are usually expressed in different units, a direct comparison of such scores is impossible. The difficulty level of the particular test would also affect such a comparison between raw scores. Derived scores, on the other hand, can be expressed in the same units and referred to the same or to closely similar normative samples for different tests. The individual's relative performance in many different functions can thus be compared.

There are various ways in which raw scores may be converted to fulfill the two objectives stated above. Fundamentally, however, derived scores are expressed in one of two major ways: (1) developmental level attained, or (2) relative position within a specified group. These types of scores, together with some of their common variants, will be considered in separate sections of this chapter. But first it will be necessary to examine certain elementary statistical concepts that underlie the development and utilization of norms. The following section is included simply to clarify the meaning of certain common statistical measures. Simplified computational examples are given only for this purpose and not to provide training in statistical methods. For computational details and specific procedures to be followed in the practical application of these techniques, the reader is referred to any recent textbook on psychological or educational statistics.

TABLE 1
Frequency Distribution of Scores of 1,000 College Students on a Code-Learning Test
(Data from Anastasi, 1934, p. 34)

Class Interval    Frequency
52-55                  1
48-51                  1
44-47                 20
40-43                 73
36-39                156
32-35                328
28-31                244
24-27                136
20-23                 28
16-19                  8
12-15                  3
8-11                   2
Total              1,000

The object of statistical method is to organize and summarize quantitative data in order to facilitate their understanding. A list of 1,000 scores can be an overwhelming sight. In that form, it conveys little meaning. A first step in bringing order into such a chaos of raw data is to tabulate the scores into a frequency distribution, as illustrated in Table 1. Such a distribution is prepared by grouping the scores into convenient class intervals and tallying each score in the appropriate interval. When all scores have been entered, the tallies are counted to find the frequency, or number of cases, in each class interval. The sums of these frequencies will equal the total number of cases in the group. Table 1 shows the scores of 1,000 college students in a code-learning test in which one set of words, or nonsense syllables, was to be substituted for another. The scores, giving number of correct syllables substituted in a timed trial, ranged from 8 to 52. They have been grouped into class intervals of 4 points, from 52-55 at the top of the distribution down to 8-11. The frequency column reveals that two persons scored between 8 and 11, three between 12 and 15, eight between 16 and 19, and so on.

The information provided by a frequency distribution can also be presented graphically in the form of a distribution curve. Figure 1 shows the data of Table 1 in graphic form. On the baseline, or horizontal axis, are the scores grouped into class intervals; on the vertical axis are the frequencies, or number of cases falling within each class interval. The graph has been plotted in two ways, both forms being in common use. In the histogram, the height of the column erected over each class interval corresponds to the number of persons scoring in that interval. We can think of each individual as standing on another's shoulders to form the column. In the frequency polygon, the number of persons in each interval is indicated by a point placed in the center of the class interval and across from the appropriate frequency. The successive points are then joined by straight lines.

Except for minor irregularities, the distribution portrayed in Figure 1 resembles the bell-shaped normal curve. A mathematically determined, perfect normal curve is reproduced in Figure 3. This type of curve has important mathematical properties and provides the basis for many kinds of statistical analyses. For the present purpose, however, only a few features will be noted. Essentially, the curve indicates that the largest number of cases cluster in the center of the range and that the number drops off gradually in both directions as the extremes are approached. The curve is bilaterally symmetrical, with a single peak in the center. Most distributions of human traits, from height and weight to aptitudes and personality characteristics, approximate the normal curve. In general, the larger the group, the more closely will the distribution resemble the theoretical normal curve.
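The tallying procedure described above lends itself to a short computational sketch. The following Python fragment is an added illustration, not part of the original directions; the ten raw scores are invented, and only the 4-point class intervals follow the plan of Table 1.

from collections import Counter

def frequency_distribution(scores, low, high, width):
    # Tally each score into the class interval that contains it.
    tallies = Counter((score - low) // width for score in scores)
    table = []
    for i in range((high - low + 1) // width):
        start = low + i * width
        table.append(((start, start + width - 1), tallies.get(i, 0)))
    return table

scores = [34, 41, 22, 51, 38, 29, 33, 36, 27, 30]   # invented raw scores
for (start, stop), freq in reversed(frequency_distribution(scores, 8, 55, 4)):
    print(f"{start}-{stop}: {freq}")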


FIG. 1. Distribution Curves: Frequency Polygon and Histogram. (Data from Table 1.)

A group of scores can also be described in terms of some measure of central tendency. Such a measure provides a single, most typical or representative score to characterize the performance of the entire group. The most familiar of these measures is the average, more technically known as the mean (M). As is well known, this is found by adding all scores and dividing the sum by the number of cases (N). Another measure of central tendency is the mode, or most frequent score. In a frequency distribution, the mode is the midpoint of the class interval with the highest frequency. Thus, in Table 1, the mode falls midway between 32 and 35, being 33.5. It will be noted that this score corresponds to the highest point on the distribution curve in Figure 1. A third measure of central tendency is the median, or middlemost score when all scores have been arranged in order of size. The median is the point that bisects the distribution, half the cases falling above it and half below.

Further description of a set of test scores is given by measures of variability, or the extent of individual differences around the central tendency. The most obvious and familiar way of reporting variability is in terms of the range between the highest and lowest score. The range, however, is extremely crude and unstable, for it is determined by only two scores. A single unusually high or low score would thus markedly affect its size. A more precise method of measuring variability is based on the difference between each individual's score and the mean of the group.

At this point it will be helpful to look at the example in Table 2, in which the various measures under consideration have been computed on the same scores. A small group was chosen in order to simplify the demonstration, although in actual practice we would rarely perform these computations on so few cases. Table 2 serves also to introduce certain standard statistical symbols that should be noted for future reference. Original raw scores are conventionally designated by a capital X, and a small x is used to refer to deviations of each score from the group mean. The Greek letter Σ means "sum of."

TABLE 2
Illustration of Central Tendency and Variability

Score (X)    Deviation (x)    Deviation Squared (x²)
   48            +8                   64
   47            +7                   49
   43            +3                    9
   41            +1                    1
   41            +1                    1
   40             0                    0
   38            -2                    4
   36            -4                   16
   34            -6                   36
   32            -8                   64
ΣX = 400      Σ|x| = 40            Σx² = 244

M = ΣX/N = 400/10 = 40
Median = 40.5
AD = Σ|x|/N = 40/10 = 4
Variance = σ² = Σx²/N = 244/10 = 24.40
SD or σ = √(Σx²/N) = √24.40 = 4.9

It will be seen that the first column in Table 2 gives the data for the computation of mean and median. The mean is 40; the median is 40.5, falling midway between 40 and 41: five cases (50 percent) are above the median and five below.
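All of the measures in Table 2 can be reproduced in a few lines of Python. The sketch below is an added illustration, using the ten scores as read from the table; it returns the same values: mean 40, median 40.5, AD 4, variance 24.40, SD 4.9.

from math import sqrt

scores = [48, 47, 43, 41, 41, 40, 38, 36, 34, 32]     # the ten scores of Table 2
N = len(scores)

mean = sum(scores) / N                                 # M = ΣX/N = 400/10 = 40
ordered = sorted(scores)
median = (ordered[N // 2 - 1] + ordered[N // 2]) / 2   # midway between 40 and 41

deviations = [x - mean for x in scores]                # the x column; sums to zero
AD = sum(abs(d) for d in deviations) / N               # average deviation = 40/10 = 4
variance = sum(d ** 2 for d in deviations) / N         # σ² = Σx²/N = 244/10 = 24.40
SD = sqrt(variance)                                    # σ = 4.9 to one decimal

print(mean, median, AD, variance, round(SD, 1))        # 40.0 40.5 4.0 24.4 4.9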


There is little point in computing a mode in such a small group, since the cases do not show clear-cut clustering on any one score. Technically, however, 41 would represent the mode, because two persons obtained this score, while all other scores occur only once.

The second column shows how far each score deviates above or below the mean of 40. The sum of these deviations will always equal zero, because the positive and negative deviations around the mean necessarily balance or cancel each other out (+20 - 20 = 0). If we ignore signs, however, we can average the absolute deviations, thus obtaining a measure known as the average deviation (AD). The symbol |x| in the AD formula indicates that absolute values were summed, without regard to sign. Although of some descriptive value, the AD is not suitable for use in further mathematical analyses because of the arbitrary discarding of signs.

A much more serviceable measure of variability is the standard deviation (symbolized by either SD or σ), in which the negative signs are legitimately eliminated by squaring each deviation. This procedure has been followed in the last column of Table 2. The sum of this column (Σx²) divided by the number of cases N is known as the variance, or mean square deviation, and is symbolized by σ². The variance has proved extremely useful in sorting out the contributions of different factors to individual differences in test performance. For the present purposes, however, our chief concern is with the SD, which is the square root of the variance, as shown in Table 2. This measure is commonly employed in comparing the variability of different groups. In Figure 2, for example, are two distributions having the same mean but differing in variability. The distribution with wider individual differences yields a larger SD than the one with narrower individual differences.

FIG. 2. Frequency Distributions with the Same Mean but Different Variability.

The SD also provides the basis for expressing an individual's scores on different tests in terms of norms, as will be shown in the section on standard scores. The interpretation of the SD is especially clear-cut when applied to a normal or approximately normal distribution curve. In such a distribution, there is an exact relationship between the SD and the proportion of cases, as shown in Figure 3. On the baseline of this normal curve have been marked distances representing one, two, and three standard deviations above and below the mean. For instance, in the example given in Table 2, the mean would correspond to a score of 40, +1σ to 44.9 (40 + 4.9), +2σ to 49.8 (40 + 2 × 4.9), and so on. The percentage of cases that fall between the mean and +1σ in a normal curve is 34.13. Because the curve is symmetrical, 34.13 percent of the cases are likewise found between the mean and -1σ, so that between +1σ and -1σ on both sides of the mean there are 68.26 percent of the cases. Nearly all the cases (99.72 percent) fall within ±3σ from the mean. These relationships are particularly relevant in the interpretation of standard scores and percentiles, to be discussed in later sections.

FIG. 3. Percentage Distribution of Cases in a Normal Curve. (68.26 percent of the cases fall within ±1σ, 95.44 percent within ±2σ, and 99.72 percent within ±3σ.)
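These percentages need not be taken on faith; they follow from the equation of the normal curve. The brief added sketch below recovers them with the error function available in the Python standard library; the figures differ from those in the text only through rounding in the last digit.

from math import erf, sqrt

def proportion_below(z):
    # Proportion of a normal distribution falling below a given σ-distance.
    return (1 + erf(z / sqrt(2))) / 2

for k in (1, 2, 3):
    within = proportion_below(k) - proportion_below(-k)
    print(f"within ±{k}σ: {within:.2%}")   # 68.27%, 95.45%, 99.73%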

One way in which meaning can be attached to test scores is to indicate how far along the normal developmental path the individual has progressed. Thus an 8-year-old who performs as well as the average 10-year-old on an intelligence test may be described as having a mental age of 10; a mentally retarded adult who performs at the same level would likewise be assigned an MA of 10. In a different context, a fourth-grade child may be characterized as reaching the sixth-grade norm in a reading test and the third-grade norm in an arithmetic test. Other developmental systems utilize more highly qualitative descriptions of behavior in


functions ranging from sensorimotor activities to concept formation. However expressed, scores based on developmental norms tend to be psychometrically crude and do not lend themselves well to precise statistical treatment. Nevertheless, they have considerable appeal for descriptive purposes, especially in the intensive clinical study of individuals and for certain research purposes.

MENTAL AGE. In Chapter 1 it was noted that the term "mental age" was widely popularized through the various translations and adaptations of the Binet-Simon scales, although Binet himself had employed the more neutral term "mental level." In age scales such as the Binet and its revisions, items are grouped into year levels. For example, those items passed by the majority of 7-year-olds in the standardization sample are placed in the 7-year level, those passed by the majority of 8-year-olds are assigned to the 8-year level, and so forth. A child's score on the test will then correspond to the highest year level that he can successfully complete. In actual practice, the individual's performance shows a certain amount of scatter. In other words, the subject fails some tests below his mental age level and passes some above it. For this reason, it is customary to compute the basal age, i.e., the highest age at and below which all tests are passed. Partial credits, in months, are then added to this basal age for all tests passed at higher year levels. The child's mental age on the test is the sum of the basal age and the additional months of credit earned at higher age levels.

Mental age norms have also been employed with tests that are not divided into year levels. In such a case, the subject's raw score is first determined. Such a score may be the total number of correct items on the whole test; or it may be based on time, on number of errors, or on some combination of such measures. The mean raw scores obtained by the children in each year group within the standardization sample constitute the age norms for such a test. The mean raw score of the 8-year-old children, for example, would represent the 8-year norm. If an individual's raw score is equal to the mean 8-year-old raw score, then his mental age on the test is 8 years. All raw scores on such a test can be transformed in a similar manner by reference to the age norms.

It should be noted that the mental age unit does not remain constant with age, but tends to shrink with advancing years. For example, a child who is one year retarded at age 4 will be approximately three years retarded at age 12. One year of mental growth from ages 3 to 4 is equivalent to three years of growth from ages 9 to 12. Since intellectual development progresses more rapidly at the earlier ages and gradually decreases as the individual approaches his mature limit, the mental age unit shrinks correspondingly with age. This relationship may be more readily visualized if we think of the individual's height as being expressed in terms of "height age." The difference in inches between a height age of 3 and 4 years would be greater than that between a height age of 10 and 11. Owing to the progressive shrinkage of the MA unit, one year of acceleration or retardation at, let us say, age 5 represents a larger deviation from the norm than does one year of acceleration or retardation at age 10.
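The conversion of a raw score to a mental age by reference to age norms can be sketched as follows. The age norms in this added illustration are hypothetical; only the method (matching the raw score against the mean score of each year group, with interpolation between adjacent years) follows the text.

age_norms = {6: 14, 7: 19, 8: 23, 9: 26, 10: 28}   # hypothetical mean raw scores

def mental_age(raw_score, norms):
    # Locate the raw score among the year norms, interpolating between years.
    ages = sorted(norms)
    if raw_score <= norms[ages[0]]:
        return ages[0]
    for lower, upper in zip(ages, ages[1:]):
        if raw_score <= norms[upper]:
            fraction = (raw_score - norms[lower]) / (norms[upper] - norms[lower])
            return lower + fraction
    return ages[-1]

print(mental_age(23, age_norms))   # equals the 8-year norm, so MA = 8.0
print(mental_age(21, age_norms))   # midway between the 7- and 8-year norms: 7.5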

GRADE EQUIVALENTS. Scores on educational achievement tests are often interpreted in terms of grade equivalents. This practice is understandable because the tests are employed within an academic setting. To describe a pupil's achievement as equivalent to seventh-grade performance in spelling, eighth-grade in reading, and fifth-grade in arithmetic has the same popular appeal as the use of mental age in the traditional intelligence tests.

Grade norms are found by computing the mean raw score obtained by children in each grade. Thus, if the average number of problems solved correctly on an arithmetic test by the fourth graders in the standardization sample is 23, then a raw score of 23 corresponds to a grade equivalent of 4. Intermediate grade equivalents, representing fractions of a grade, are usually found by interpolation, although they can also be obtained directly by testing children at different times within the school year. Because the school year covers ten months, successive months can be expressed as decimals. For example, 4.0 refers to average performance at the beginning of the fourth grade (September testing), 4.5 refers to average performance at the middle of the grade (February testing), and so forth.

Despite their popularity, grade norms have several shortcomings. First, the content of instruction varies somewhat from grade to grade. Hence, grade norms are appropriate only for common subjects taught throughout the grade levels covered by the test. They are not generally applicable at the high school level, where many subjects may be studied for only one or two years. Even with subjects taught in each grade, however, the emphasis placed on different subjects may vary from grade to grade, and progress may therefore be more rapid in one subject than in another during a particular grade. In other words, grade units are obviously unequal and these inequalities occur irregularly in different subjects.

Grade norms are also subject to misinterpretation unless the test user keeps firmly in mind the manner in which they were derived. For example, if a fourth-grade child obtains a grade equivalent of 6.9 in arithmetic, it does not mean that he has mastered the arithmetic processes taught in the sixth grade. He undoubtedly obtained his score largely by


superior performance in fourth-grade arithmetic. It certainly could not be assumed that he has the prerequisites for seventh-grade arithmetic.

Finally, grade norms tend to be incorrectly regarded as performance standards. A sixth-grade teacher, for example, may assume that all pupils in her class should fall at or close to the sixth-grade norm in achievement tests. This misconception is certainly not surprising when grade norms are used. Yet individual differences within any one grade are such that the range of achievement test scores will inevitably extend over several grades.
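The interpolation by which intermediate grade equivalents are found can likewise be sketched briefly. In this added illustration the grade norms are hypothetical, except for the fourth-grade value of 23 taken from the example in the text.

grade_norms = {3: 17, 4: 23, 5: 30, 6: 35}   # mean raw score of each grade group

def grade_equivalent(raw_score, norms):
    # Decimals stand for tenths of the ten-month school year (4.5 = February).
    grades = sorted(norms)
    for lower, upper in zip(grades, grades[1:]):
        if norms[lower] <= raw_score <= norms[upper]:
            fraction = (raw_score - norms[lower]) / (norms[upper] - norms[lower])
            return round(lower + fraction, 1)
    raise ValueError("score outside the range of the norms")

print(grade_equivalent(23, grade_norms))   # 4.0
print(grade_equivalent(26, grade_norms))   # about 4.4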

ORDINAL SCALES. Another approach to developmental norms derives from research in child psychology. Empirical observation of behavior development in infants and young children led to the description of behavior typical of successive ages in such functions as locomotion, sensory discrimination, linguistic communication, and concept formation. An example is provided by the work of Gesell and his associates at Yale (Ames, 1937; Gesell et al., 1940; Gesell & Amatruda, 1947; Halverson, 1933). The Gesell Developmental Schedules show the approximate developmental level in months that the child has attained in each of four major areas of behavior, namely, motor, adaptive, language, and personal-social. These levels are found by comparing the child's behavior with that typical of eight key ages, ranging from 4 weeks to 36 months.

Gesell and his co-workers emphasized the sequential patterning of early behavior development. They cited extensive evidence of uniformities of developmental sequences and an orderly progression of behavior changes. For example, the child's reactions toward a small object placed in front of him exhibit a characteristic chronological sequence in visual fixation and in hand and finger movements. Use of the entire hand in crude attempts at palmar prehension occurs at an earlier age than use of the thumb in opposition to the palm; this type of prehension is followed by use of the thumb and index finger in a more efficient pincer-like grasp of the object. Such sequential patterning was likewise observed in walking, stair climbing, and most of the sensorimotor development of the first few years. The scales developed by Gesell are ordinal in the sense that developmental stages follow in a constant order, each stage presupposing mastery of prerequisite behavior characteristic of earlier stages.1

1 This usage of the term "ordinal scale" differs from that in statistics, in which an ordinal scale is simply one that permits a rank-ordering of individuals without knowledge about amount of difference between them. In the statistical sense, ordinal scales are contrasted to equal-unit interval scales. Ordinal scales of child development are usually designed on the model of a Guttman scale, or simplex, in which successful performance at one level implies success at all lower levels (Guttman, 1944). An extension of Guttman's analysis to include nonlinear hierarchies is described by Bart and Airasian (1974), with special reference to Piagetian scales.


Since the 1960s, there has been a sharp upsurge of interest in the developmental theories of the Swiss child psychologist, Jean Piaget (see Flavell, 1963; Ginsburg & Opper, 1969; Green, Ford, & Flamer, 1971). Piaget's research has focused on the development of cognitive processes from infancy to the midteens. He is concerned with specific concepts rather than broad abilities. An example of such a concept, or schema, is object permanence, whereby the child is aware of the identity and continuing existence of objects when they are seen from different angles or are out of sight. Another widely studied concept is conservation, or the recognition that an attribute remains constant over changes in perceptual appearance, as when the same quantity of liquid is poured into differently shaped containers, or when rods of the same length are placed in different spatial arrangements.

Piagetian tasks have been used widely in research by developmental psychologists and some have been organized into standardized scales, to be discussed in Chapters 10 and 14 (Goldschmid & Bentler, 1968b; Loretan, 1966; Pinard & Laurendeau, 1964; Uzgiris & Hunt, 1975). In accordance with Piaget's approach, these instruments are ordinal scales, in which the attainment of one stage is contingent upon completion of the earlier stages in the development of the concept. The tasks are designed to reveal the dominant aspects of each developmental stage; only later are empirical data gathered regarding the ages at which each stage is typically reached. In this respect, the procedure differs from that followed in constructing age scales, in which items are selected in the first place on the basis of their differentiating between successive ages.

In summary, ordinal scales are designed to identify the stage reached by the child in the development of specific behavior functions. Although scores may be reported in terms of approximate age levels, such scores are secondary to a qualitative description of the child's characteristic behavior. The ordinality of such scales refers to the uniform progression of development through successive stages. Insofar as these scales typically provide information about what the child is actually able to do (e.g., climbs stairs without assistance; recognizes identity in quantity of liquid when poured into differently shaped containers), they share important features with the criterion-referenced tests to be discussed in a later section of this chapter.
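The defining property of such scales (success at any stage presupposes success at every earlier stage) can be stated compactly. The following added sketch simply checks a pass-fail record against the Guttman, or simplex, pattern mentioned in the footnote above; it is an illustration of the scale model, not of any published instrument.

def follows_simplex(pattern):
    # pattern lists pass (True) or fail (False) results from the lowest
    # stage to the highest; in a simplex, no pass may follow a failure.
    first_failure = len(pattern)
    for i, passed in enumerate(pattern):
        if not passed:
            first_failure = i
            break
    return all(not passed for passed in pattern[first_failure:])

print(follows_simplex([True, True, True, False, False]))   # True
print(follows_simplex([True, False, True, False, False]))  # False: a reversal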

Nearly all standardized tests now provide some form of within-group norms. With such norms, the individual's performance is evaluated in


terms of the performance of the most nearly comparable standardization group, as when comparing a child's raw score with that of children of the same chronological age or in the same school grade. Within-group scores have a uniform and clearly defined quantitative meaning and can be appropriately employed in most types of statistical analysis.

PERCENTILES. Percentile scores are expressed in terms of the percentage of persons in the standardization sample who fall below a given raw score. For example, if 28 percent of the persons obtain fewer than 15 problems correct on an arithmetic reasoning test, then a raw score of 15 corresponds to the 28th percentile (P28). A percentile indicates the individual's relative position in the standardization sample. Percentiles can also be regarded as ranks in a group of 100, except that in ranking it is customary to start counting at the top, the best person in the group receiving a rank of one. With percentiles, on the other hand, we begin counting at the bottom, so that the lower the percentile, the poorer the individual's standing.

The 50th percentile (P50) corresponds to the median, already discussed as a measure of central tendency. Percentiles above 50 represent above-average performance; those below 50 signify inferior performance. The 25th and 75th percentiles are known as the first and third quartile points (Q1 and Q3), because they cut off the lowest and highest quarters of the distribution. Like the median, they provide convenient landmarks for describing a distribution of scores and comparing it with other distributions.

Percentiles should not be confused with the familiar percentage scores. The latter are raw scores, expressed in terms of the percentage of correct items; percentiles are derived scores, expressed in terms of percentage of persons. A raw score lower than any obtained in the standardization sample would have a percentile rank of zero (P0); one higher than any score in the standardization sample would have a percentile rank of 100 (P100). These percentiles, however, do not imply a zero raw score and a perfect raw score.

Percentile scores have several advantages. They are easy to compute and can be readily understood, even by relatively untrained persons. Moreover, percentiles are universally applicable. They can be used equally well with adults and children and are suitable for any type of test, whether it measures aptitude or personality variables.
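Computing a percentile rank amounts to a single counting operation, as the added sketch below shows. The schematic sample of 100 persons is constructed only so as to reproduce the arithmetic-reasoning example given earlier in this section.

def percentile_rank(raw_score, sample):
    # Percentage of the standardization sample falling below the raw score.
    below = sum(1 for s in sample if s < raw_score)
    return 100 * below / len(sample)

sample = [12] * 28 + [16] * 72       # 28 of 100 persons score below 15
print(percentile_rank(15, sample))   # 28.0, i.e., a raw score of 15 is P28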

The chief drawback of percentile scores arises from the marked inequality of their units, especially at the extremes of the distribution. If the distribution of raw scores approximates the normal curve, as is true of most test scores, then raw score differences near the median or center of the distribution are exaggerated in the percentile transformation, whereas raw score differences near the ends of the distribution are greatly shrunk. This distortion of distances between scores can be seen in Figure 4. In a normal curve, it will be recalled, cases cluster closely at the center and scatter more widely as the extremes are approached. Consequently, any given percentage of cases near the center covers a shorter distance on the baseline than the same percentage near the ends of the distribution. In Figure 4, this discrepancy in the gaps between percentile ranks (PR) can readily be seen if we compare the distance between a PR of 40 and a PR of 50 with that between a PR of 10 and a PR of 20. Even more striking is the discrepancy between these distances and that between a PR of 10 and a PR of 1. (In a mathematically derived normal curve, zero percentile is not reached until infinity and hence cannot be shown on the graph.)

FIG. 4. Percentile Ranks in a Normal Distribution. (The baseline marks the quartile points Q1, Mdn, and Q3, percentile ranks from 1 to 99, and σ-distances from -3σ to +3σ, with the percentile rank corresponding to each σ-distance given under the graph.)

The same relationship can be seen from the opposite direction if we examine the percentile ranks corresponding to equal σ-distances from the mean of a normal curve. These percentile ranks are given under the graph in Figure 4. Thus, the percentile difference between the mean and +1σ is 34 (84 - 50). That between +1σ and +2σ is only 14 (98 - 84). It is apparent that percentiles show each individual's relative position in the normative sample but not the amount of difference between scores. If plotted on arithmetic probability paper, however, percentile scores can also provide a correct visual picture of the differences between scores. Arithmetic probability paper is a cross-section paper in which the vertical lines are spaced in the same way as the percentile points in a normal distribution (as in Figure 4), whereas the horizontal lines are uniformly spaced, or vice versa (as in Figure 5). Such normal percentile


charts can be used to plot the scores of different persons on the same chart or the scores of the same person on different tests. In either case, the interscore differences will be correctly represented. Many aptitude and achievement batteries now utilize this technique in their score profiles, which show the individual's performance in each test. An example is the Individual Report Form of the Differential Aptitude Tests, reproduced in Figure 13 (Ch. 5).

FIG. 5. A Normal Percentile Chart. Percentiles are spaced so as to correspond to equal distances in a normal distribution. Compare the score distance between John and Mary with that between Ellen and Edgar; within both pairs, the percentile difference is 5 points. Jane and Dick differ by 10 percentile points, as do Bill and Debby.

STANDARD SCORES. Current tests are making increasing use of standard scores, which are the most satisfactory type of derived score from most points of view. Standard scores express the individual's distance from the mean in terms of the standard deviation of the distribution.

Standard scores may be obtained by either linear or nonlinear transformations of the original raw scores. When found by a linear transformation, they retain the exact numerical relations of the original raw scores, because they are computed by subtracting a constant from each raw score and then dividing the result by another constant. The relative magnitude of differences between standard scores derived by such a linear transformation corresponds exactly to that between the raw scores. All properties of the original distribution of raw scores are duplicated in the distribution of these standard scores. For this reason, any computations that can be carried out with the original raw scores can also be carried out with linear standard scores, without any distortion of results.

Linearly derived standard scores are often designated simply as "standard scores" or "z scores." To compute a z score, we find the difference between the individual's raw score and the mean of the normative group and then divide this difference by the SD of the normative group. Table 3 shows the computation of z scores for two individuals, one of whom falls 1 SD above the group mean, the other .40 SD below the mean. Any raw score that is exactly equal to the mean is equivalent to a z score of zero. It is apparent that such a procedure will yield derived scores that have a negative sign for all subjects falling below the mean. Moreover, because the total range of most groups extends no farther than about 3 SD's above and below the mean, such standard scores will have to be reported to at least one decimal place in order to provide sufficient differentiation among individuals.

TABLE 3
Computation of Standard Scores

z = (X - M)/SD

John's score: X1 = 65     z1 = (65 - 60)/5 = +1.00
Bill's score: X2 = 58     z2 = (58 - 60)/5 = -0.40

Both the above conditions, viz., the occurrence of negative values and of decimals, tend to produce awkward numbers that are confusing and difficult to use for both computational and reporting purposes. For this reason, some further linear transformation is usually applied, simply to put the scores into a more convenient form. For example, the scores on the Scholastic Aptitude Test (SAT) of the College Entrance Examination Board are standard scores adjusted to a mean of 500 and an SD of 100. Thus a standard score of -1 on this test would be expressed as 400 (500 - 100 = 400). Similarly, a standard score of +1.5 would correspond to 650 (500 + 1.5 × 100 = 650). To convert an original standard score to the new scale, it is simply necessary to multiply the standard score by the


desired SD (100) and add it to or subtract it from the desired mean (500). Any other convenient values can be arbitrarily chosen for the new mean and SD. Scores on the separate subtests of the Wechsler Intelligence Scales, for instance, are converted to a distribution with a mean of 10 and an SD of 3. All such measures are examples of linearly transformed standard scores.
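The whole chain, from raw score to z score to any convenient linear scale, reduces to two one-line functions. The sketch below is an added illustration using the figures of Table 3 and the CEEB and Wechsler-subtest constants just mentioned.

def z_score(raw, mean, sd):
    return (raw - mean) / sd

def rescale(z, new_mean, new_sd):
    # The further linear transformation: multiply by the desired SD,
    # then add the result to the desired mean.
    return new_mean + z * new_sd

z_john = z_score(65, 60, 5)        # +1.00, as in Table 3
z_bill = z_score(58, 60, 5)        # -0.40

print(rescale(-1.0, 500, 100))     # 400.0, the SAT example in the text
print(rescale(z_john, 500, 100))   # 600.0
print(rescale(z_bill, 10, 3))      # 8.8 on a Wechsler-subtest-type scale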

It will be recalled that one of the reasons for transforming raw scores into any derived scale is to render scores on different tests comparable. The linearly derived standard scores discussed in the preceding section will be comparable only when found from distributions that have approximately the same form. Under such conditions, a score corresponding to 1 SD above the mean, for example, signifies that the individual occupies the same position in relation to both groups. His score exceeds approximately the same percentage of persons in both distributions, and this percentage can be determined if the form of the distribution is known. If, however, one distribution is markedly skewed and the other normal, a z score of +1.00 might exceed only 50 percent of the cases in one group but would exceed 84 percent in the other.

In order to achieve comparability of scores from dissimilarly shaped distributions, nonlinear transformations may be employed to fit the scores to any specified type of distribution curve. The mental age and percentile scores described in earlier sections represent nonlinear transformations, but they are subject to other limitations already discussed. Although under certain circumstances another type of distribution may be more appropriate, the normal curve is usually employed for this purpose. One of the chief reasons for this choice is that most raw score distributions approximate the normal curve more closely than they do any other type of curve. Moreover, physical measures such as height and weight, which use equal-unit scales derived through physical operations, generally yield normal distributions.2 Another important advantage of the normal curve is that it has many useful mathematical properties, which facilitate further computations.

2 Partly for this reason and partly as a result of other theoretical considerations, it has frequently been argued that, by normalizing raw scores, an equal-unit scale could be developed for psychological measurement similar to the equal-unit scales of physical measurement. This, however, is a debatable point that involves certain questionable assumptions.

Normalized standard scores are standard scores expressed in terms of a distribution that has been transformed to fit a normal curve. Such scores can be computed by reference to tables giving the percentage of cases falling at different SD distances from the mean of a normal curve. First, the percentage of persons in the standardization sample falling at or above each raw score is found. This percentage is then located in the normal curve frequency table, and the corresponding normalized standard score is obtained. Normalized standard scores are expressed in the same form as linearly derived standard scores, viz., with a mean of zero and an SD of 1. Thus, a normalized score of zero indicates that the individual falls at the mean of a normal curve, excelling 50 percent of the group. A score of -1 means that he surpasses approximately 16 percent of the group; and a score of +1, that he surpasses 84 percent. These percentages correspond to a distance of 1 SD below and 1 SD above the mean of a normal curve, respectively, as can be seen by reference to the bottom line of Figure 4.
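Where the text consults a normal curve frequency table, a present-day reader can invert the normal distribution directly. The added sketch below assumes one common convention, counting half of any tied scores toward the proportion falling below a given score.

from statistics import NormalDist

def normalized_z(raw_score, sample):
    # Proportion of the sample below the score (plus half the ties),
    # converted to the z score holding that position in a normal curve.
    below = sum(1 for s in sample if s < raw_score)
    ties = sum(1 for s in sample if s == raw_score)
    proportion = (below + ties / 2) / len(sample)
    return NormalDist().inv_cdf(proportion)

sample = list(range(1, 100))                # a schematic sample of 99 persons
print(round(normalized_z(50, sample), 2))   # 0.0: the middlemost score
print(round(normalized_z(84, sample), 2))   # about +1, surpassing roughly 84%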

Like linearly derived standard scores, normalized standard scores can be put into any convenient form. If the normalized standard score is multiplied by 10 and added to or subtracted from 50, it is converted into a T score, a type of score first proposed by McCall (1922). On this scale, a score of 50 corresponds to the mean, a score of 60 to 1 SD above the mean, and so forth. Another well-known transformation is represented by the stanine scale, developed by the United States Air Force during World War II. This scale provides a single-digit system of scores with a mean of 5 and an SD of approximately 2.3 The name stanine (a contraction of "standard nine") is based on the fact that the scores run from 1 to 9. The restriction of scores to single-digit numbers has certain computational advantages, for each score requires only a single column on computer punched cards.

3 Kaiser (1958) proposed a modification of the stanine scale that involves slight changes in the percentages and yields an SD of exactly 2, thus being easier to handle quantitatively. Other variants are the C scale (Guilford & Fruchter, 1973, Ch. 19), consisting of 11 units and also yielding an SD of 2, and the 10-unit sten scale, with 5 units above and 5 below the mean (Canfield, 1951).

TABLE 4
Normal Curve Percentages for Use in Stanine Conversion

Percentage    4    7    12    17    20    17    12    7    4
Stanine       1    2     3     4     5     6     7    8    9

Raw scores can readily be converted to stanines by arranging the original scores in order of size and then assigning stanines in accordance with the normal curve percentages reproduced in Table 4. For example, if the group consists of exactly 100 persons, the 4 lowest-scoring persons receive a stanine score of 1, the next 7 a score of 2, the next 12 a score of 3, and so on. When the group contains more or fewer than 100 cases, the number corresponding to each designated percentage is first computed, and these numbers of cases are then given the appropriate stanines. Thus, out of 200 cases, 8 would be assigned a stanine of 1 (4 percent of 200 = 8). With 150 cases, 6 would receive a stanine of 1 (4 percent of 150 = 6). For any group containing from 10 to 100 cases, Bartlett and Edgerton (1966) have prepared a table whereby ranks can be directly converted to stanines. Because of their practical as well as theoretical advantages, stanines are being used increasingly, especially with aptitude and achievement tests.
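The assignment rule just described can be sketched as follows. This is an added illustration: with groups whose size is not a multiple of 100, the interval boundaries must be rounded, and tied scores falling at a boundary are split arbitrarily here.

STANINE_PERCENTS = [4, 7, 12, 17, 20, 17, 12, 7, 4]   # from Table 4

def assign_stanines(scores):
    # Rank the scores from lowest to highest and hand out stanines 1-9
    # in the Table 4 proportions.
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    stanines = [0] * n
    cumulative, start = 0, 0
    for stanine, pct in enumerate(STANINE_PERCENTS, start=1):
        cumulative += pct
        stop = round(n * cumulative / 100)
        for rank in range(start, stop):
            stanines[order[rank]] = stanine
        start = stop
    return stanines

result = assign_stanines(list(range(100)))   # a group of exactly 100 persons
print(result.count(1), result.count(2))      # 4 and 7, as in the text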


Norms and the Interpretation of Test Scores 85

for comparability of ratio IQ's throughout their age range. Chiefly forthis reason, the ratio IQ has been largely replaced by the so-called devi-ation IQ, which is actually another variant of the familiar standard score.The deviation IQ is a standard score with a mean of 100 and an SDthat approximates the SD of the Stanford-Binet IQ distribution. Al-though the SD of the Stanford-Binet ratio IQ (last used in the 1937edition) was not exactly constant at all ages, it fluctuated around amedian value slightly greater than 16. Hence, if an SD close to 16 ischosen in reporting standard scores on a newly developed test, the result-ing scores can be interpreted in the same way as Stanford-Binet ratioIQ's. Since Stanford-Binet IQ's have been in use for many years, testersand clinicians have become accustomed to interpreting and classifyingtest performance in terms of such IQ levels. They have learned what toexpect from individuals with IQ's of 40, 70, 90, 130, and so forth. Thereare therefore certain practical advantages in the use of a derived scalethat corresponds to the familiar distribution of Stanford-Binet IQ's.Such a correspondence of score units can be achieved by the selection ofnumerical values for the mean and SD that agree closely with those inthe Stanford-Binet distribution.

It should be added that the use of the term "IQ" to designate suchstandard scores may seem to be somewhat misleading. Such IQ's are notderived by the same methods employed in finding traditional ratio IQ's.They are not ratios of mental ages and chronological ages. The justifi-cation lies in the general familiarity of the term "IQ," and in the factthat such scores can be interpreted as IQ's provided that their SDis approximately equal to that of previously known IQ's. Among the firsttests to express scores in terms of deviation IQ's were the \Vechsler In-telligence Scales. In these tests, the mean is 100 and the SD 15. DeviationIQ's are also used in a number of current group tests of intelligenceand in the latest revision of the Stanford-Binet itself.

\Vith the increasing use of deviation IQ's, it is important to rememberthat deviation IQ's from different tests are comparable only when theyemploy the same or closely similar values for the SD. This value should,always be reported in the manual and carefully noted by the test user. Ifa test maker chooses a different value for the SD in making up his devia-tion IQ scale, the meaning of any given IQ on his test will be quite differ-ent from its meaning on other tests. These discrepancies are illustrated inTable 5, which shows the percentage of cases}i1normal distriblltions withSD's from 12 to 18 who would obtain IQ's at different l~els.These SDvalues have actually been employed in the IQ scales ofp*lJli~hed tests.Table 5 shows, for example, that an IQ of 70 cuts off the lo\v(j:..st3.1 per-cent when the SD is 16 (as in the Stanford-Binet); but it _",;;y cut off.as few as 0.7 percent (SD = 12) or as many as 5.1 percen .' = 18) .An IQ of 70 has been used traditionally as a cutoff point fpl' . ying

Prillciplcs of Psycl1010gical Testing

us,out of 200 cases, 8 would be assigned a stanine of 1 (4 percent of= 8). With 150 cases, 6 would receive a stanine of 1 (4 percent of== 6). For any group containing from 10 to 100 cases, Bartlett and

,erton (1966) have prepared a table whereby ranks can be directlyrted to stanines. Because of their practical as well as theoretical

rimtages,stanines are being used increasingly, especially with aptitudeachievement tests.Ithough nOlmalized standard scores are the most satisfactory type of.refor the majority of purposes, there are nevertheless certain tech-al objections to normalizing all distributions routinely. Such a trans-:)ation should be carried out only when the sample is large and rep-Iltativeand when there is reason to believe that the deviation fromin~litvresults from defects in the test rather than from characteristicshe sample or from other factors affecting the behavior under con-ration/it should also be noted that whpn-the original distribution ofscoresapproximates normality, the linearly derived standard scoresthe normalized standard scores will be very similar. Although the:ods of deriving these two types of scores are quite different, the

tiltingscores will be nearly identical under such conditions. ObViously,.!proeessof normaliZing a distribution that is already virtually normalr produce little or no change. Whenever feasible, it is generally more'rable to obtain a normal distribution of raw scores by proper adjust-,t of the llifficulty' level of test items rather than ~by subsequentlyalizing a markedly nonnormal distribution. With an approximatelyal distributiou of raw scores, the linearl\' derived standard scores

,servethe same purposes as normalized st;ndard scores.

THE DEVIATION IQ. In an effort to convert MA scores into a uniform index of the individual's relative status, the ratio IQ (Intelligence Quotient) was introduced in early intelligence tests. Such an IQ was simply the ratio of mental age to chronological age, multiplied by 100 to eliminate decimals (IQ = 100 × MA/CA). Obviously, if a child's MA equals his CA, his IQ will be exactly 100. An IQ of 100 thus represents normal or average performance. IQ's below 100 indicate retardation; IQ's above 100, acceleration.

The apparent logical simplicity of the traditional ratio IQ, however, proved deceptive. A major technical difficulty is that, unless the SD of the IQ distribution remains approximately constant with age, IQ's will not be comparable at different age levels. An IQ of 115 at one age, for example, may indicate the same degree of superiority as a numerically different IQ at another age, since both may fall at a distance of 1 SD from the means of their respective age distributions. In actual practice, it proved very difficult to construct tests that met the psychometric requirement of a constant IQ distribution across age levels.
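The ratio IQ formula above reduces to a single line of code; the MA and CA values here are hypothetical.

```python
def ratio_iq(mental_age, chronological_age):
    """Traditional ratio IQ: 100 * MA / CA (both ages in years)."""
    return 100 * mental_age / chronological_age

print(ratio_iq(12.0, 10.0))  # MA ahead of CA -> 120.0 (acceleration)
print(ratio_iq(10.0, 10.0))  # MA equal to CA -> 100.0 (average)
print(ratio_iq(8.0, 10.0))   # MA behind CA  ->  80.0 (retardation)
```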

TABLE 5
Percentage of Cases at Each IQ Interval in Normal Distributions with Mean of 100 and Different Standard Deviations
(Courtesy Test Department, Harcourt Brace Jovanovich, Inc.)

IQ Interval       SD = 12   SD = 14   SD = 16   SD = 18
130 and above         0.7       1.6       3.1       5.1
120-129               4.3       6.3       7.5       8.5
110-119              15.2      16.0      15.8      15.4
100-109              29.8      26.1      23.6      21.0
90-99                29.8      26.1      23.6      21.0
80-89                15.2      16.0      15.8      15.4
70-79                 4.3       6.3       7.5       8.5
Below 70              0.7       1.6       3.1       5.1
Total               100.0     100.0     100.0     100.0

(The two middle rows combined, i.e., the 90-110 range, contain 59.6, 52.2, 47.2, and 42.0 percent of the cases, respectively.)

The same discrepancies, of course, apply to IQ's of 130 and above, which might be used in selecting children for special programs for the intellectually gifted. The IQ range between 90 and 110, generally described as normal, may include as few as 42 percent or as many as 59.6 percent of the population, depending on the test chosen. To be sure, test publishers are making efforts to adopt the uniform SD of 16 in new tests and in new editions of earlier tests. There are still enough variations among currently available tests, however, to make the checking of the SD imperative.

[Figure 6, originally shown here, aligns the common within-group scales against the normal curve: z scores from -4 to +4; T scores from 10 to 90; CEEB scores from 200 to 800; deviation IQ's (SD = 15) from 55 to 145; stanines 1 through 9, containing 4, 7, 12, 17, 20, 17, 12, 7, and 4 percent of the cases; and percentiles from 1 to 99.]

FIG. 6. Relationships among Different Types of Test Scores in a Normal Distribution.
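The entries in Table 5 are simply normal-curve areas, so they can be verified directly. The sketch below is a minimal check, assuming interval boundaries at the printed IQ values; small disagreements with the printed table reflect its rounding and grouping conventions.

```python
from statistics import NormalDist

# Percentage of cases in each IQ interval of Table 5, for normal
# distributions with mean 100 and the four published SD values.
bounds = [("Below 70", None, 70), ("70-79", 70, 80), ("80-89", 80, 90),
          ("90-99", 90, 100), ("100-109", 100, 110), ("110-119", 110, 120),
          ("120-129", 120, 130), ("130 and above", 130, None)]

for sd in (12, 14, 16, 18):
    dist = NormalDist(mu=100, sigma=sd)
    print(f"SD = {sd}")
    for label, lo, hi in bounds:
        p_lo = dist.cdf(lo) if lo is not None else 0.0
        p_hi = dist.cdf(hi) if hi is not None else 1.0
        print(f"  {label:15s} {100 * (p_hi - p_lo):5.1f}")
```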

INTERRELATIONSHIPS OF WITHIN-GROUP SCORES. At this stage in our discussion of derived scores, the reader may have become aware of a rapprochement among the various types of scores. Percentiles have gradually been taking on at least a graphic resemblance to normalized standard scores. Linear standard scores are indistinguishable from normalized standard scores if the original distribution of raw scores closely approximates the normal curve. Finally, standard scores have become IQ's and vice versa. In connection with the last point, a reexamination of the meaning of a ratio IQ on such a test as the Stanford-Binet will show that these IQ's can themselves be interpreted as standard scores. If we know that the distribution of Stanford-Binet ratio IQ's had a mean of 100 and an SD of approximately 16, we can conclude that an IQ of 116 falls at a distance of 1 SD above the mean and represents a standard score of +1.00. Similarly, an IQ of 132 corresponds to a standard score of +2.00, an IQ of 76 to a standard score of -1.50, and so forth. Moreover, a Stanford-Binet ratio IQ of 116 corresponds to a percentile rank of approximately 84, because in a normal curve 84 percent of the cases fall below +1.00 SD (Figure 4).

In Figure 6 are summarized the relationships that exist in a normal distribution among the types of scores so far discussed in this chapter. These include z scores, College Entrance Examination Board (CEEB) scores, Wechsler deviation IQ's (SD = 15), T scores, stanines, and percentiles. Ratio IQ's on any test will coincide with the given deviation IQ scale if they are normally distributed and have an SD of 15. Any other normally distributed IQ could be added to the chart, provided we know its SD. If the SD is 20, for instance, then an IQ of 120 corresponds to +1 SD, an IQ of 80 to -1 SD, and so on.

In conclusion, the exact form in which scores are reported is dictated largely by convenience, familiarity, and ease of developing norms. Standard scores in any form (including the deviation IQ) have generally replaced other types of scores because of certain advantages they offer with regard to test construction and statistical treatment of data. Most types of within-group derived scores, however, are fundamentally similar if carefully derived and properly interpreted. When certain statistical conditions are met, each of these scores can be readily translated into any of the others.
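As a concrete illustration of such translation, the sketch below maps a z score onto the other scales of Figure 6; it assumes an approximately normal distribution, and the scale constants are the conventional ones just listed.

```python
from statistics import NormalDist

def from_z(z):
    """Translate a z score into the other within-group scales of
    Figure 6 (assumes an approximately normal distribution)."""
    return {
        "z": round(z, 2),
        "T": round(50 + 10 * z),           # T score: mean 50, SD 10
        "CEEB": round(500 + 100 * z),      # CEEB: mean 500, SD 100
        "IQ(SD=15)": round(100 + 15 * z),  # Wechsler deviation IQ
        "percentile": round(100 * NormalDist().cdf(z), 1),
    }

print(from_z(1.0))   # T 60, CEEB 600, IQ 115, percentile ~84
print(from_z(-1.5))
```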

INTERTEST COMPARISONS. An IQ, or any other score, should always be accompanied by the name of the test on which it was obtained. Test scores cannot be properly interpreted in the abstract; they must be referred to particular tests. If the school records show that Bill Jones received an IQ of 94 and Tom Brown an IQ of 110, such IQ's cannot be accepted at face value without further information. The positions of these two students might have been reversed by exchanging the particular tests that each was given in his respective school.

Similarly, an individual's relative standing in different functions may be grossly misrepresented through lack of comparability of test norms. Let us suppose that a student has been given a verbal comprehension test and a spatial aptitude test to determine his relative standing in the two fields. If the verbal ability test was standardized on a random sample of high school students, while the spatial test was standardized on a selected group of boys attending elective shop courses, the examiner might erroneously conclude that the individual is much more able along verbal than along spatial lines, when the reverse may actually be the case.

Still another example involves longitudinal comparisons of a single individual's test performance over time. If a schoolchild's cumulative record shows IQ's of 118, 115, and 101 at the fourth, fifth, and sixth grades, the first question to ask before interpreting these changes is, "What tests did he take on these three occasions?" The apparent decline may reflect no more than the differences among the tests. In that case, he would have obtained these scores even if the three tests had been administered within a week of each other.

There are three principal reasons to account for systematic variations among the scores obtained by the same individual on different tests. First, tests may differ in content despite their similar labels. So-called intelligence tests provide many illustrations of this confusion. Although commonly described by the same blanket term, one of these tests may include only verbal content, another may tap predominantly spatial aptitudes, and still another may cover verbal, numerical, and spatial content in about equal proportions. Second, the scale units may not be comparable. As explained earlier in this chapter, if IQ's on one test have an SD of 12 and IQ's on another have an SD of 18, then an individual who received an IQ of 112 on the first test is most likely to receive an IQ of 118 on the second. Third, the composition of the standardization samples used in establishing norms for different tests may vary. Obviously, the same individual will appear to have performed better when compared with an inferior group than when compared with a superior group.

Lack of comparability of either test content or scale units can usually be detected by reference to the test itself or to the test manual. Differences in the respective normative samples, however, are more likely to be overlooked. Such differences probably account for many otherwise unexplained discrepancies in test results.
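The second of these reasons, noncomparable scale units, is easy to verify numerically: an IQ of 112 on a test whose SD is 12 and an IQ of 118 on a test whose SD is 18 both lie exactly 1 SD above the mean. A minimal sketch:

```python
def equivalent_iq(iq, sd_from, sd_to, mean=100):
    """Map an IQ from one deviation-IQ scale to another by holding
    the z score constant (both scales assumed normal, mean 100)."""
    z = (iq - mean) / sd_from
    return mean + z * sd_to

print(equivalent_iq(112, sd_from=12, sd_to=18))  # -> 118.0
```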

THE NORMATIVE SAMPLE. Any norm, however expressed, is restricted to the particular normative population from which it was derived. The test user should never lose sight of the way in which norms are established. Psychological test norms are in no sense absolute, universal, or permanent. They merely represent the test performance of the subjects constituting the standardization sample. In choosing such a sample, an effort is usually made to obtain a representative cross section of the population for which the test is designed.

In statistical terminology, a distinction is made between sample and population. The former refers to the group of individuals actually tested. The latter designates the larger, but similarly constituted, group from which the sample is drawn. For example, if we wish to establish norms of test performance for the population of 10-year-old, urban, public school boys, we might test a carefully chosen sample of 500 10-year-old boys attending public schools in several American cities. The sample would be checked with reference to geographical distribution, socioeconomic level, ethnic composition, and other relevant characteristics to ensure that it was truly representative of the defined population.

In the development and application of test norms, considerable attention should be given to the standardization sample. It is apparent that the sample on which the norms are based should be large enough to provide stable values. Another, similarly chosen sample of the same population should not yield norms that diverge appreciably from those obtained.


Norms with a large sampling error would obviously be of little value in the interpretation of test scores. Equally important is the requirement that the sample be representative of the population under consideration. Subtle selective factors that might render the sample unrepresentative should be carefully investigated. A number of such selective factors are illustrated in institutional samples. Because such samples are usually large and readily available for testing purposes, they offer an alluring field for the accumulation of normative data. The special limitations of these samples, however, should be carefully analyzed. Testing subjects in school, for example, will yield an increasingly superior selection of cases in the successive grades, owing to the progressive dropping out of the less able pupils. Nor does such elimination affect different subgroups equally. For example, the rate of selective elimination from school is greater for boys than for girls, and it is greater in lower than in higher socioeconomic levels.

Selective factors likewise operate in other institutional samples, such as prisoners, patients in mental hospitals, or institutionalized mental retardates. Because of many special factors that determine institutionalization itself, such groups are not representative of the entire population of criminals, psychotics, or mental retardates. For example, mental retardates with physical handicaps are more likely to be institutionalized than are the physically fit. Similarly, the relative proportion of severely retarded persons will be much greater in institutional samples than in the total population.

Closely related to the question of representativeness of sample is the need for defining the specific population to which the norms apply. Obviously, one way of ensuring that a sample is representative is to restrict the population to fit the specifications of the available sample. For example, if the population is defined to include only 14-year-old schoolchildren rather than all 14-year-old children, then a school sample would be representative. Ideally, of course, the desired population should be defined in advance in terms of the objectives of the test. Then a suitable sample should be assembled. Practical obstacles in obtaining subjects, however, may make this goal unattainable. In such a case, it is far better to redefine the population more narrowly than to report norms on an ideal population which is not adequately represented by the standardization sample. In actual practice, very few tests are standardized on such broad populations as is popularly assumed. No test provides norms for the human species! And it is doubtful whether any tests give truly adequate norms for such broadly defined populations as "adult American men," "10-year-old American children," and the like. Consequently, the samples obtained by different test constructors often tend to be unrepresentative of their alleged populations and biased in different ways. Hence, the resulting norms are not comparable.

NATIONAL ANCHOR NORMS. One solution for the lack of comparability of norms is to use an anchor test to work out equivalency tables for scores on different tests. Such tables are designed to show what score in Test A is equivalent to each score in Test B. This can be done by the equipercentile method, in which scores are considered equivalent when they have equal percentiles in a given group. For example, if the 80th percentile in the same group corresponds to an IQ of 115 on Test A and to an IQ of 120 on Test B, then Test A IQ 115 is considered to be equivalent to Test B IQ 120. This approach has been followed to a limited extent by some test publishers, who have prepared equivalency tables for a few of their own tests (see, e.g., Lennon, 1966a).

More ambitious proposals have been made from time to time for calibrating each new test against a single anchor test, which has itself been administered to a highly representative, national normative sample (Lennon, 1966b). No single anchor test, of course, could be used in establishing norms for all tests, regardless of content. What is required is a battery of anchor tests, all administered to the same national sample. Each new test could then be checked against the most nearly similar anchor test in the battery.

The data gathered in Project TALENT (Flanagan et al., 1964) so far come closest to providing such an anchor battery for a high school population. Using a random sample of about 5 percent of the high schools in this country, the investigators administered a two-day battery of specially constructed aptitude, achievement, interest, and temperament tests to approximately 400,000 students in grades 9 through 12. Even with the availability of anchor data such as these, however, it must be recognized that independently developed tests can never be regarded as completely interchangeable. At best, the use of national anchor norms would appreciably reduce the lack of comparability among tests, but it would not eliminate it.4

The Project TALENT battery has been employed to calibrate several test batteries in use by the Navy and Air Force (Dailey, Shaycoft, & Orr, 1962; Shaycoft, Neyman, & Dailey, 1962). The general procedure is to administer both the Project TALENT battery and the tests to be calibrated to the same sample. Through correlational analysis, a composite of Project TALENT tests is identified that is most nearly comparable to each test to be normed. By means of the equipercentile method, tables are then prepared giving the corresponding scores on the Project TALENT composite and on the particular test. For several other batteries, data have been gathered to identify the Project TALENT composite corresponding to each test in the battery (Cooley, 1965; Cooley & Miller, 1965). These batteries include the General Aptitude Test Battery of the United States Employment Service, the Differential Aptitude Tests, and the Flanagan Aptitude Classification Tests.

Of particular interest is The Anchor Test Study conducted by the Educational Testing Service under the auspices of the U.S. Office of Education (Jaeger, 1973). This study represents a systematic effort to provide comparable and truly representative national norms for the seven most widely used reading achievement tests for elementary schoolchildren. Through an unusually well-controlled experimental design, over 300,000 fourth-, fifth-, and sixth-grade schoolchildren were examined in 50 states. The anchor test consisted of the reading comprehension and vocabulary subtests of the Metropolitan Achievement Test, for which new norms were established in one phase of the project. In the equating phase of the study, each child took the reading comprehension and vocabulary subtests from two of the seven batteries, each battery being paired in turn with every other battery. Some groups took parallel forms of the two subtests from the same battery. In still other groups, all the pairings were duplicated in reverse sequence, in order to control for order of administration. From statistical analyses of all these data, score equivalency tables for the seven tests were prepared by the equipercentile method. A manual for interpreting scores is provided for use by school systems and other interested persons (Loret, Seder, Bianchini, & Vale, 1974).

4 For an excellent analysis of some of the technical difficulties involved in efforts to achieve score comparability with different tests, see Angoff (1962, 1966, 1971a).

SPECIFIC NORMS. Another approach to the nonequivalence of existing norms, and probably a more realistic one for most tests, is to standardize tests on more narrowly defined populations, so chosen as to suit the specific purposes of each test. In such cases, the limits of the normative population should be clearly reported with the norms. Thus, the norms might be said to apply to "employed clerical workers in large business organizations" or to "first-year engineering students." For many testing purposes, highly specific norms are desirable. Even when representative norms are available for a broadly defined population, it is often helpful to have separately reported subgroup norms. This is true whenever recognizable subgroups yield appreciably different scores on a particular test. The subgroups may be formed with respect to age, grade, type of curriculum, sex, geographical region, urban or rural environment, socioeconomic level, and many other factors. The use to be made of the test determines the type of differentiation that is most relevant, as well as whether general or specific norms are more appropriate.

Mention should also be made of local norms, often developed by the test users themselves within a particular setting. The groups employed in deriving such norms are even more narrowly defined than the subgroups considered above. Thus, an employer may accumulate norms on applicants for a given type of job within his company. A college admissions office may develop norms on its own student population. Or a single elementary school may evaluate the performance of individual pupils in terms of its own score distribution. These local norms are more appropriate than broad national norms for many testing purposes, such as the prediction of subsequent job performance or college achievement, the comparison of a child's relative achievement in different subjects, or the measurement of an individual's progress over time.
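The equipercentile equating described under national anchor norms above (scores are equivalent when they cut off equal percentages of the same group) can be sketched in a few lines. The data here are hypothetical, and a real equating study would first smooth both score distributions.

```python
# Hypothetical scores of the SAME group on two tests.
test_a = [88, 92, 95, 100, 104, 107, 110, 115, 121, 130]
test_b = [90, 96, 99, 103, 108, 112, 118, 122, 127, 138]

def percentile_rank(xs, x):
    """Percentage of the group scoring below x (half of ties counted)."""
    n = len(xs)
    return 100 * (sum(y < x for y in xs) + 0.5 * sum(y == x for y in xs)) / n

def equipercentile_equivalent(score_a, a_scores, b_scores):
    """Find the Test B score whose percentile rank in the group is
    closest to that of score_a on Test A."""
    target = percentile_rank(a_scores, score_a)
    return min(b_scores,
               key=lambda b: abs(percentile_rank(b_scores, b) - target))

print(equipercentile_equivalent(110, test_a, test_b))  # -> 118
```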

FIXED REFERENCE GROUP. Although most derived scores are computed in such a way as to provide an immediate normative interpretation of test performance, there are some notable exceptions. One type of nonnormative scale utilizes a fixed reference group in order to ensure comparability and continuity of scores, without providing normative evaluation of performance. With such a scale, normative interpretation requires reference to independently collected norms from a suitable population. Local or other specific norms are often used for this purpose.

One of the clearest examples of scaling in terms of a fixed reference group is provided by the score scale of the College Board Scholastic Aptitude Test (Angoff, 1962, 1971b). Between 1926 (when this test was first administered) and 1941, SAT scores were expressed on a normative scale, in terms of the mean and SD of the candidates taking the test at each administration. As the number and variety of College Board member colleges increased and the composition of the candidate population changed, it was concluded that scale continuity should be maintained. Otherwise, an individual's score would depend on the characteristics of the group tested during a particular year. An even more urgent reason for scale continuity stemmed from the observation that students taking the SAT at certain times of the year performed more poorly than those taking it at other times, owing to the differential operation of selective factors. After 1941, therefore, all SAT scores were expressed in terms of the mean and SD of the approximately 11,000 candidates who took the test in 1941. These candidates constitute the fixed reference group employed in scaling all subsequent forms of the test. Thus, a score of 500 on any form of the SAT corresponds to the mean of the 1941 sample; a score of 600 falls 1 SD above that mean, and so forth.

To permit translation of raw scores on any form of the SAT into these fixed-reference-group scores, a short anchor test (or set of common items) is included in each form. Each new form is thereby linked to one or two earlier forms, which in turn are linked with other forms by a chain of items extending back to the 1941 form. These nonnormative SAT scores can then be interpreted by comparison with any appropriate distribution of scores, such as that of a particular college, a type of college, a region, etc. These specific norms are more useful in making college admission decisions than would be annual norms based on the entire candidate population. Any changes in the candidate population over time, moreover, can be detected only with a fixed-score scale. It will be noted that the principal difference between the fixed-reference-group scales under consideration and the previously discussed scales based on national anchor norms is that the latter require the choice of a single group that is broadly representative and appropriate for normative purposes. Apart from the practical difficulties in obtaining such a group and the need to update the norms, it is likely that for many testing purposes such broad norms are not required.

Scales built from a fixed reference group are analogous in one respect to scales employed in physical measurement. In this connection, Angoff (1962, pp. 32-33) writes:

There is hardly a person here who knows the precise original definition of the length of the foot used in the measurement of height or distance, or which king it was whose foot was originally agreed upon as the standard; on the other hand, there is no one here who does not know how to evaluate lengths and distances in terms of this unit. Our ignorance of the precise original meaning or derivation of the foot does not lessen its usefulness to us in any way. Its usefulness derives from the fact that it remains the same over time and allows us to familiarize ourselves with it. Needless to say, precisely the same considerations apply to other units of measurement - the inch, the mile, the degree of Fahrenheit, and so on. In the field of psychological measurement it is similarly reasonable to say that the original definition of the scale is or should be of no consequence. What is of consequence is the maintenance of a constant scale - which, in the case of a multiple-form testing program, is achieved by rigorous form-to-form equating - and the provision of supplementary normative data to aid in interpretation and in the formation of specific decisions, data which would be revised from time to time as conditions warrant.
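The form-to-form linking just described can be pictured as a chain of equatings: each new form is mapped onto an earlier form's scale through common anchor material, and hence, by composition, back to the fixed reference scale. The sketch below is a deliberately simplified stand-in for operational equating methods, with entirely hypothetical anchor statistics.

```python
def linear_link(mean_new, sd_new, mean_old, sd_old):
    """Return a function mapping scores on a new form onto an older
    form's scale by matching means and SDs on common anchor material
    (a simplified stand-in for operational equating procedures)."""
    def convert(x):
        return mean_old + (x - mean_new) * (sd_old / sd_new)
    return convert

# Hypothetical chain: Form C -> Form B -> fixed reference scale.
c_to_b = linear_link(mean_new=41.0, sd_new=9.5, mean_old=43.0, sd_old=10.0)
b_to_ref = linear_link(mean_new=43.0, sd_new=10.0, mean_old=45.0, sd_old=10.5)

raw = 50.0
# A raw score on Form C expressed on the fixed reference scale:
print(b_to_ref(c_to_b(raw)))
```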

COMPUTER UTILIZATION IN THE INTERPRETATION OF TEST SCORES

Computers have already made a significant impact upon every phase of testing, from test construction to administration, scoring, reporting, and interpretation. The obvious uses of computers, and those developed earliest, represent simply an unprecedented increase in the speed with which traditional data analyses and scoring processes can be carried out. Far more important, however, are the adoption of new procedures and the exploration of new approaches to psychological testing which would have been impossible without the flexibility, speed, and data-processing capabilities of computers. As Baker (1971, p. 227) succinctly puts it, computer capabilities should serve "to free one's thinking from the constraints of the past."

Various testing innovations resulting from computer utilization will be discussed under appropriate topics throughout the book. In the present connection, we shall examine some applications of computers in the interpretation of test scores. At the simplest level, most current tests, and especially those designed for group administration, are now adapted for computer scoring (Baker, 1971). Several test publishers, as well as independent test-scoring organizations, are equipped to provide such scoring services to test users. Although separate answer sheets are commonly used for this purpose, optical scanning equipment available at some scoring centers permits the reading of responses directly from test booklets. Many innovative possibilities, such as diagnostic scoring and path analysis (recording a student's progress at various stages of learning), have barely been explored.

At a somewhat more complex level, certain tests now provide facilities for computer interpretation of test scores. In such cases, the computer program associates prepared verbal statements with particular patterns of test responses. This approach has been pursued with both personality and aptitude tests. For example, with the Minnesota Multiphasic Personality Inventory (MMPI), to be discussed in Chapter 17, test users may obtain computer printouts of diagnostic and interpretive statements about the subject's personality tendencies and emotional condition, together with the numerical scores. Similarly, the Differential Aptitude Tests (see Ch. 13) provide a Career Planning Report, which includes a profile of scores on the separate subtests as well as an interpretive computer printout. The latter contains verbal statements that combine the test data with information on interests and goals given by the student on a Career Planning Questionnaire. These statements are typical of what a counselor would say to the student in going over his test results in an individual conference (Super, 1973).

Individualized interpretation of test scores at a still more complex level is illustrated by interactive computer systems, in which the individual is in direct contact with the computer by means of response stations and in effect engages in a dialogue with the computer (J. A. Harris, 1973; Holtzman, 1970; M. R. Katz, 1974; Super, 1970). This technique has been investigated with regard to educational and vocational planning and decision making. In such a situation, test scores are usually incorporated in the computer data base, together with other information provided by the student or client. Essentially, the computer combines all the available information about the individual with stored data about educational programs and occupations; and it utilizes all relevant facts and relations in answering the individual's questions and aiding him in reaching decisions.

Examples of such interactive computer systems, in various stages of operational development, include IBM's Education and Career Exploration System (ECES) and the System for Interactive Guidance and Information (SIGI). Preliminary field trials show good acceptance of such systems by high school students and their parents (Harris, 1973).

Test results also represent an integral part of the data utilized in computer-assisted instruction (CAI).5 In order to present instructional material appropriate to each student's current level of attainment, the computer must repeatedly score and evaluate the student's responses to the ongoing material. On the basis of this response history, the student may be routed to more advanced material, to further practice at the present level, or to a remedial branch in which he receives instruction in more elementary prerequisite material. Diagnostic analysis of errors may lead to an instructional program designed to correct the specific learning difficulties identified in individual cases.

A less costly and operationally more feasible variant of computer utilization in learning is computer-managed instruction (CMI; see Hambleton, 1974). In such systems, the learner does not interact directly with the computer, whose role is to assist the teacher in managing the instructional process. The individualized instruction may utilize instructional packages or more conventional procedures. The contribution of the computer is to process the rather formidable mass of data accumulated daily regarding the performance of each student in a class where each learner may be involved in a different activity, and to use these data in prescribing the next instructional step for each student. Examples of this application of computers are provided by the University of Pittsburgh's IPI (Individually Prescribed Instruction; see Glaser, 1968; Cooley & Glaser, 1969) and by Project PLAN (Planning for Learning in Accordance with Needs), developed by the American Institutes for Research (Flanagan, 1971; Flanagan, Shanner, Brudner, & Marker, 1975). Project PLAN includes a program of self-knowledge, individual development, and occupational planning, as well as instruction in elementary and high school subjects.

5 For a description of a widely used CAI system for teaching reading to first-grade children, see Atkinson (1968).
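The branching decision described for CAI reduces to a rule over the student's recent response history. The sketch below is schematic only; the thresholds are hypothetical, not those of any actual system.

```python
def next_step(recent_responses, advance_at=0.85, remediate_at=0.60):
    """Route a student based on the proportion correct in the most
    recent block of items (illustrative thresholds only)."""
    p = sum(recent_responses) / len(recent_responses)
    if p >= advance_at:
        return "advance to next unit"
    if p < remediate_at:
        return "branch to remedial prerequisite material"
    return "further practice at present level"

print(next_step([1, 1, 1, 0, 1, 1, 1, 1]))  # mostly correct -> advance
print(next_step([0, 1, 0, 0, 1, 0, 0, 1]))  # many errors -> remedial branch
```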

CRITERION-REFERENCED TESTING

NATURE AND USES. An approach to testing that has aroused a surge of activity, particularly in education, is generally designated as "criterion-referenced testing." First proposed by Glaser (1963), this term is still somewhat loosely defined, and its definition varies among different writers. Moreover, several alternative terms are in common use, such as content-,

domain-, and objective-referenced. These terms are sometimes employed as synonyms for criterion-referenced and sometimes with slightly different connotations. "Criterion-referenced," however, seems to have gained ascendancy, although it is not the most appropriate term.

Typically, criterion-referenced testing uses as its interpretive frame of reference a specified content domain rather than a specified population of persons. In this respect, it has been contrasted with the usual norm-referenced testing, in which an individual's score is interpreted by comparing it with the scores obtained by others on the same test. In criterion-referenced testing, for example, an examinee's test performance may be reported in terms of the specific kinds of arithmetic operations he has mastered, the estimated size of his vocabulary, the difficulty level of reading matter he can comprehend (from comic books to literary classics), or the chances of his achieving a designated performance level on an external criterion (educational or vocational).

Thus far, criterion-referenced testing has found its major applications in several recent innovations in education. Prominent among these are computer-assisted, computer-managed, and other individualized, self-paced instructional systems. In all these systems, testing is closely integrated with instruction, being introduced before, during, and after completion of each instructional unit to check on prerequisite skills, diagnose possible learning difficulties, and prescribe subsequent instructional procedures. The previously cited Project PLAN and IPI are examples of such programs.

From another angle, criterion-referenced tests are useful in broad surveys of educational accomplishment, such as the National Assessment of Educational Progress (Womer, 1970), and in meeting demands for educational accountability (Gronlund, 1974). From still another angle, testing for the attainment of minimum requirements, as in qualifying for a driver's license or a pilot's license, illustrates criterion-referenced testing. Finally, familiarity with the concepts of criterion-referenced testing can contribute to the improvement of the traditional, informal tests prepared by teachers for classroom use. Gronlund (1973) provides a helpful guide for this purpose, as well as a simple and well-balanced introduction to criterion-referenced testing. A brief but excellent discussion of the chief limitations of criterion-referenced tests is given by Ebel (1972b).

CONTENT MEANING. The major distinguishing feature of criterion-referenced testing (however defined and whether designated by this term or by one of its synonyms) is its interpretation of test performance in terms of content meaning. The focus is clearly on what the person can do and what he knows, not on how he compares with others. A fundamental requirement in constructing this type of test is a clearly defined domain of knowledge or skills to be assessed by the test. If scores on such a test are to have communicable meaning, the content domain to be sampled must be widely recognized as important. The selected domain must then be subdivided into small units defined in performance terms. In an educational context, these units correspond to behaviorally defined instructional objectives, such as "multiplies three-digit by two-digit numbers" or "identifies the misspelled word in which the final e is retained when adding -ing." In the programs prepared for individualized instruction, these objectives run to several hundred for a single school subject. After the instructional objectives have been formulated, items are prepared to sample each objective. This procedure is admittedly difficult and time-consuming. Without such careful specification and control of content, however, the results of criterion-referenced testing could degenerate into an idiosyncratic and uninterpretable jumble.

When strictly applied, criterion-referenced testing is best adapted for testing basic skills (as in reading and arithmetic) at elementary levels. In these areas, instructional objectives can also be arranged in an ordinal hierarchy, the acquisition of more elementary skills being prerequisite to the acquisition of higher-level skills.6 It is impracticable and probably undesirable, however, to formulate highly specific objectives for advanced levels of knowledge in less highly structured subjects. At these levels, both the content and sequence of learning are likely to be much more flexible.

On the other hand, in its emphasis on content meaning in the interpretation of test scores, criterion-referenced testing may exert a salutary effect on testing in general. The interpretation of intelligence test scores, for example, would benefit from this approach. To describe a child's intelligence test performance in terms of the specific intellectual skills and knowledge it represents might help to counteract the confusions and misconceptions that have become attached to the IQ. When stated in these general terms, however, the criterion-referenced approach is equivalent to interpreting test scores in the light of the demonstrated validity of the particular test, rather than in terms of vague underlying entities. Such an interpretation can certainly be combined with norm-referenced scores.

6 Ideally, such tests follow the simplex model of a Guttman scale (see Popham & Husek, 1969), as do the Piagetian ordinal scales discussed earlier in this chapter.

MASTERY TESTING. A second major feature almost always found in criterion-referenced testing is the procedure of testing for mastery. Essentially, this procedure yields an all-or-none score, indicating that the individual has or has not attained the preestablished level of mastery. When basic skills are tested, nearly complete mastery is generally expected (e.g., 80-85% correct items). A three-way distinction may also be employed, including mastery, nonmastery, and an intermediate, doubtful, or "review" interval.

In connection with individualized instruction, some educators have argued that, given enough time and suitable instructional methods, nearly everyone can achieve complete mastery of the chosen instructional objectives. Individual differences would thus be manifested in learning time rather than in final achievement as in traditional educational testing (Bloom, 1968; J. B. Carroll, 1963, 1970; Cooley & Glaser, 1969; Gagné, 1965). It follows that in mastery testing, individual differences in performance are of little or no interest. Hence, as generally constructed, criterion-referenced tests minimize individual differences.7 For example, they include items passed or failed by all or nearly all examinees, although such items are usually excluded from norm-referenced tests. Mastery testing is regularly employed in the previously cited programs for individualized instruction. It is also characteristic of published criterion-referenced tests for basic skills, suitable for elementary school. Examples of such tests include the Prescriptive Reading Inventory and Prescriptive Mathematics Inventory (California Test Bureau), the Skills Monitoring System in Reading and in Study Skills (Harcourt Brace Jovanovich), and Diagnosis: An Instructional Aid Series in Reading and in Mathematics (Science Research Associates).

Beyond basic skills, mastery testing is inapplicable or insufficient. In more advanced and less structured subjects, achievement is open-ended. The individual may progress almost without limit in such functions as understanding, critical thinking, appreciation, and originality. Moreover, content coverage may proceed in many different directions, depending upon the individual's abilities, interests, and goals, as well as local instructional facilities. Under these conditions, complete mastery is unrealistic and unnecessary. Hence norm-referenced evaluation is generally employed in such cases to assess degree of attainment. Some published tests are so constructed as to permit both norm-referenced and criterion-referenced applications. An example is the 1973 Edition of the Stanford Achievement Test. While providing appropriate norms at each level, this battery meets three important requirements of criterion-referenced tests: specification of detailed instructional objectives, adequate coverage of each objective with appropriate items, and wide range of item difficulty.

7 As a result of this reduction in variability, the usual methods for finding reliability and validity are inapplicable to most criterion-referenced tests. Further discussion of these points will be found in Chapters 5, 6, and 8.
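A minimal sketch of the all-or-none (or three-way) mastery decision described above; the 80-85 percent band is taken from the text, and the objective used in the example is hypothetical.

```python
def mastery_status(n_correct, n_items, review_band=(0.80, 0.85)):
    """Three-way mastery decision: below the band is nonmastery,
    within it is the doubtful 'review' interval, at or above it
    is mastery."""
    p = n_correct / n_items
    lo, hi = review_band
    if p >= hi:
        return "mastery"
    if p >= lo:
        return "review"
    return "nonmastery"

# Hypothetical objective: "multiplies three-digit by two-digit numbers"
print(mastery_status(18, 20))  # 90% correct -> mastery
print(mastery_status(16, 20))  # 80%         -> review
print(mastery_status(12, 20))  # 60%         -> nonmastery
```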

It should be noted that criterion-referenced testing is neither as new nor as clearly divorced from norm-referenced testing as some of its proponents imply. Evaluating an individual's test performance in absolute terms, such as by letter grades or percentage of correct items, is certainly older than normative interpretations. More precise attempts to evaluate test performance in terms of content meaning also antedate the introduction of the term "criterion-referenced testing" (Ebel, 1962; see also Anastasi, 1968, pp. 69-70). Other examples may be found in early product scales for assessing the quality of handwriting, compositions, or drawings by matching the individual's work sample against a set of standard specimens. Ebel (1972b) observes, furthermore, that the concept of mastery in education, in the sense of all-or-none learning of specific units, achieved considerable popularity in the 1920s and 1930s and was later abandoned.

A normative framework is implicit in all testing, regardless of how scores are expressed (Angoff, 1974). The very choice of content or skills to be measured is influenced by the examiner's knowledge of what can be expected from human organisms at a particular developmental or instructional stage. Such a choice presupposes information about what other persons have done in similar situations. Moreover, by imposing uniform cutoff scores on an ability continuum, mastery testing does not thereby eliminate individual differences. To describe an individual's level of reading comprehension as "the ability to understand the content of the New York Times" still leaves room for a wide range of individual differences in degree of understanding.

EXPECTANCY TABLES. Test scores may also be interpreted in terms of expected criterion performance, as in a training program or on a job. This usage of the term "criterion" follows standard psychometric practice, as when a test is said to be validated against a particular criterion (see Ch. 2). Strictly speaking, the term "criterion-referenced testing" should refer to this type of performance interpretation, while the other approaches discussed in this section can be more precisely described as content-referenced. This terminology, in fact, is used in the APA test standards (1974).

An expectancy table gives the probability of different criterion outcomes for persons who obtain each test score. For example, if a student obtains a score of 530 on the CEEB Scholastic Aptitude Test, what are the chances that his freshman grade-point average in a specific college will fall in the A, B, C, D, or F category? This type of information can be obtained by examining the bivariate distribution of predictor scores (SAT) plotted against criterion status (freshman grade-point average). If the number of cases in each cell of such a bivariate distribution is changed to a percentage, the result is an expectancy table, such as the one illustrated in Table 6. The data for this table were obtained from 171 high school boys enrolled in courses in American history. The predictor was the Verbal Reasoning test of the Differential Aptitude Tests, administered early in the course. The criterion was end-of-course grades. The correlation between test scores and criterion was .66.

TABLE 6
Expectancy Table Showing Relation between DAT Verbal Reasoning Test and Course Grades in American History for 171 Boys in Grade 11
(Adapted from Fifth Edition Manual for the Differential Aptitude Tests, Forms S and T. Reproduced by permission. Copyright © 1973, 1974 by The Psychological Corporation, New York, N.Y. All rights reserved.)

Test         Number      Percentage Receiving Each Criterion Grade
Score        of Cases    Below 70    70-79    80-89    90 & above
40 & above       46                     15       22        63
30-39            36           6         39       39        17
20-29            43          12         63       21         5
Below 20         46          30         52       17

The first column of Table 6 shows the test scores, divided into four class intervals; the number of students whose scores fall into each interval is given in the second column. The remaining entries in each row of the table indicate the percentage of cases within each test-score interval who received each grade at the end of the course. Thus, of the 46 students with scores of 40 or above on the Verbal Reasoning test, 15 percent received grades of 70-79, 22 percent grades of 80-89, and 63 percent grades of 90 or above. At the other extreme, of the 46 students scoring below 20 on the test, 30 percent received grades below 70, 52 percent grades between 70 and 79, and 17 percent between 80 and 89. Within the limitations of the available data, these percentages represent the best estimates of the probability that an individual will receive a given criterion grade. For example, if a new student receives a test score of 34 (i.e., in the 30-39 interval), we would conclude that the probability of his obtaining a grade of 90 or above is 17 out of 100, the probability of his obtaining a grade between 80 and 89 is 39 out of 100, and so on.

In many practical situations, criteria can be dichotomized into "success" and "failure" in a job, a course of study, or other undertaking. Under these conditions, an expectancy chart can be prepared, showing the probability of success or failure corresponding to each score interval. Figure 7 is an example of such an expectancy chart. Based on a pilot selection battery developed by the Air Force, this expectancy chart shows

the percentage of men scoring within each stanine on the battery who failed to complete primary flight training. It can be seen that 77 percent of the men receiving a stanine of 1 were eliminated in the course of training, while only 4 percent of those at stanine 9 failed to complete the training satisfactorily. Between these extremes, the percentage of failures decreases consistently over the successive stanines. On the basis of this expectancy chart, it could be predicted, for example, that approximately 40 percent of pilot cadets who obtain a stanine score of 4 will fail and approximately 60 percent will satisfactorily complete primary flight training. Similar statements regarding the probability of success and failure could be made about individuals who receive each stanine. Thus, an individual with a stanine of 4 has a 60:40, or 3:2, chance of completing primary flight training. Besides providing a criterion-referenced interpretation of test scores, it can be seen that both expectancy tables and expectancy charts give a general idea of the validity of a test in predicting a given criterion.

[Figure 7, originally shown here, is a bar chart giving the percentage eliminated from primary flight training at each stanine; the numbers of men at stanines 9 down to 1 were 21,474; 19,444; 32,129; 39,398; 34,975; 23,699; 11,209; 2,139; and 904.]

FIG. 7. Expectancy Chart Showing Relation between Performance on Pilot Selection Battery and Elimination from Primary Flight Training. (From Flanagan, 1947, p. 58.)
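An expectancy table of the kind shown in Table 6 is mechanical to construct from paired predictor and criterion records. A minimal sketch, with hypothetical data; a real table would be built from a full bivariate distribution such as the one behind Table 6.

```python
from collections import Counter, defaultdict

# Hypothetical (test_score, grade_category) pairs.
records = [(45, "90+"), (42, "80-89"), (35, "90+"), (33, "70-79"),
           (28, "70-79"), (25, "below 70"), (18, "70-79"), (15, "below 70")]

def score_interval(score):
    """Class intervals matching Table 6's layout."""
    if score >= 40: return "40 & above"
    if score >= 30: return "30-39"
    if score >= 20: return "20-29"
    return "Below 20"

rows = defaultdict(Counter)
for score, grade in records:
    rows[score_interval(score)][grade] += 1

# Convert cell counts to row percentages: the expectancy table proper.
for interval, counts in rows.items():
    n = sum(counts.values())
    pcts = {g: round(100 * c / n) for g, c in counts.items()}
    print(interval, "N =", n, pcts)
```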

CHAPTER 5
Reliability

RELIABILITY refers to the consistency of scores obtained by the same persons when reexamined with the same test on different occasions, or with different sets of equivalent items, or under other variable examining conditions. This concept of reliability underlies the computation of the error of measurement of a single score, whereby we can predict the range of fluctuation likely to occur in a single individual's score as a result of irrelevant, chance factors.

The concept of test reliability has been used to cover several aspects of score consistency. In its broadest sense, test reliability indicates the extent to which individual differences in test scores are attributable to "true" differences in the characteristics under consideration and the extent to which they are attributable to chance errors. To put it in more technical terms, measures of test reliability make it possible to estimate what proportion of the total variance of test scores is error variance. The crux of the matter, however, lies in the definition of error variance. Factors that might be considered error variance for one purpose would be classified under true variance for another. For example, if we are interested in measuring fluctuations of mood, then the day-by-day changes in scores on a test of cheerfulness-depression would be relevant to the purpose of the test and would hence be part of the true variance of the scores. If, on the other hand, the test is designed to measure more permanent personality characteristics, the same daily fluctuations would fall under the heading of error variance.

Essentially, any condition that is irrelevant to the purpose of the test represents error variance. Thus, when the examiner tries to maintain uniform testing conditions by controlling the testing environment, instructions, time limits, rapport, and other similar factors, he is reducing error variance and making the test scores more reliable. Despite optimum testing conditions, however, no test is a perfectly reliable instrument. Hence, every test should be accompanied by a statement of its reliability. Such a measure of reliability characterizes the test when administered under standard conditions and given to subjects similar to those constituting the normative sample. The characteristics of this sample should therefore be specified, together with the type of reliability that was measured.
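In these terms, a reliability coefficient can be read directly as the proportion of score variance that is true variance. A minimal sketch, assuming a known reliability coefficient and total variance:

```python
def variance_components(total_variance, reliability):
    """Split total score variance into 'true' and error components,
    taking the reliability coefficient as the proportion of total
    variance attributable to true differences."""
    true_var = reliability * total_variance
    error_var = (1 - reliability) * total_variance
    return true_var, error_var

# E.g., SD = 15 (variance 225) and reliability .90:
print(variance_components(225, 0.90))  # -> (202.5, 22.5)
```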


There could, of course, be as many varieties of test reliability as there are conditions affecting test scores, since any such conditions might be irrelevant for a certain purpose and would thus be classified as error variance. The types of reliability computed in actual practice, however, are relatively few. In this chapter, the principal techniques for measuring the reliability of test scores will be examined, together with the sources of error variance identified by each. Since all types of reliability are concerned with the degree of consistency or agreement between two independently derived sets of scores, they can all be expressed in terms of a correlation coefficient. Accordingly, the next section will consider some of the basic characteristics of correlation coefficients, in order to clarify their use and interpretation. More technical discussion of correlation, as well as more detailed specifications of computing procedures, can be found in any elementary textbook of educational or psychological statistics, such as Guilford and Fruchter (1973).

MEANING OF CORRELATION. Essentially, a correlation coefficient (r) expresses the degree of correspondence, or relationship, between two sets of scores. Thus, if the top-scoring individual in variable 1 also obtains the top score in variable 2, the second-best individual in variable 1 is second best in variable 2, and so on down to the poorest individual in the group, then there would be a perfect correlation between variables 1 and 2. Such a correlation would have a value of +1.00.

A hypothetical illustration of a perfect positive correlation is shown in Figure 8. This figure presents a scatter diagram, or bivariate distribution. Each tally mark in this diagram indicates the score of one individual in both variable 1 (horizontal axis) and variable 2 (vertical axis). It will be noted that all of the 100 cases in the group are distributed along the diagonal running from the lower left- to the upper right-hand corner of the diagram. Such a distribution indicates a perfect positive correlation (+1.00), since it shows that each individual occupies the same relative position in both variables. The closer the bivariate distribution of scores approaches this diagonal, the higher will be the positive correlation.

[Figure 8, originally shown here, is a scatter diagram in which the tallies of all 100 cases fall along the diagonal from the lower left- to the upper right-hand corner.]

FIG. 8. Bivariate Distribution for a Hypothetical Correlation of +1.00.

Figure 9 illustrates a perfect negative correlation (-1.00). In this case, there is a complete reversal of scores from one variable to the other. The best individual in variable 1 is the poorest in variable 2 and vice versa, this reversal being consistently maintained throughout the distribution. It will be noted that, in this scatter diagram, all individuals fall on the diagonal extending from the upper left- to the lower right-hand corner. This diagonal runs in the reverse direction from that in Figure 8.
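A minimal sketch of the computation behind r, the product-moment correlation, using hypothetical paired scores; the coefficient is simply the mean product of the paired z scores.

```python
def pearson_r(xs, ys):
    """Product-moment correlation: mean product of the two sets of
    z scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

# Hypothetical paired scores on two variables:
var1 = [30, 35, 40, 45, 50, 55, 60, 65]
var2 = [38, 36, 48, 50, 55, 53, 62, 70]
print(round(pearson_r(var1, var2), 2))  # strongly positive, below +1.00

# A strictly linear reversal, as in Figure 9's idealized case:
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # -> -1.0
```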

"- ....,,1,,.;,,~;."l;,,~t,,~('()mnlete "bsence of rdationship, such as

might occur by chance, If each individ l'out of a hat to determine his 'f' tl.a s n~me were pulled at randomwere repeated for variabl~ C) pOSI IOn m vanable 1, and if the processUnder these conditions l't -, alzderbo~r near~zero correlation would result.

, WOu e ImpOSSIblet d' ,relative standing in variable 2 from k 0 pre, Ict an 1l1dividual's1.The top-~oring subJ'ect I'n "bl a1 n~whledge of IllS score in variablE!, valla I" mlg t scar I' I IIn variable 2. Some individual 'h b h e ug I, ow, or average~oth variables, or below ave;a~e~l~gb~th~ ~hance ~core above average inIn one variable and below in the oth .' 'Uers might ~all above averageaverage in one and at th ' .er, sh others 11lIght be above the

, e avel acre 111 the second d fwould be no regularit}, in the relate: h' f ' an so orth. There

TI lOns Ip rom one i d' "d IIe coefficients found in a t I' n 1\I ua to another.extremes, having some value 'h~ ~1 p~actIce generally fall between theselations between measures of ~1,t't an zero but lower than 1.00. Corre-frequentlY low When a a I,lIes are nearly ;rlways positive, aIthoug'h

" negative conel t' . b'such variables, it usually results f th a IOn IS a tamed between twopressed, For example 1'£ t' rom e way in which the scores are ex-

, Ime scores are correla't d 'thnegative correlation wl'11 prob bl I Th ,e;. WI amount scores, a, a y resu t. u '~f -h b' ,an anthmetic computation t t' d s, '1. cae su lect s score:()n

, es IS recor ed as the xi' b fqmred to complete all itenls h"l h' ',pm er a secondsre·

, W I e IS Score on an 'th .test represents the number of bl ,~, an mehc reasoning1 ' pro ems correctly soh d .ahon can be expected In su I h . ~" a negative corre-

. CIa case, t e poorest (i.e.", slowest) individ-

Page 60: Anne Anastasi- Psychological Testing I

. R l' bet \'een PerformanceCh t Showmg e atIon \ .,IG,7. Expectancy aT p.' . Flight Training.ejectionBattery and Elimination from I1maly

.{FromFlanagan, 1947, p. 58.)

: . ,thin each stanine on the battery who,thepercentage of men scormg \\ I . . . It b seen that 77 percent, J' fI' ht trammg can eailed to comp :t: pnmary. Ig f 1were eiiminated in the course of train-of the men receIVing a stamne 0 . 9 f 'led to complete the

1 '1 I 4 t of those at stamne aling, W 11 C on y percen es the ercentage of failuresh'aining satisfactorily, Between these ex.trcm ", Po the basis of this

. 1 the succeSSl'\'e stanmes. ndecreases consistent y over '. d f I that approximatelyexpectancy chart, it ~uld be predlCt,e , °tr e~amPcoe~eof 4 will fail and

f 'J t d t who obtam a s amne s40 percent 0 pI 0 ca e s 1 . flight train-itpproximately 60 percent wil1;atis~~~tor~'~b~~~i~ye~fl:~:::~and failureing, Similar statements re~a~ m~ hp i each stanine. Thus, ancould be made about. indlvldua s w 6~.~~c:rv;:2 chance of completing

. individual. with a. s~amne o.f 4 has ~idin' a criterion-referenced interpre-primary fhght trammg. Besldebspro thg t both expectancy tables and

. f t t scores it can e seen a d'tatlon 0 es .' I 'd f the validitv of a test in pre lct-expectancy charts glVe a genera 1 ea 0 J

ing a given criterion.

No. ofMen

9 21,474

S 19,444

7 32,129

6 39,398

5 34,975

4 '23,699

3 11,209

2 2,139

904

CHAPTER 5Reliability

RLIABILITY refers to the consistency of scores obtained by thesame persons when reexamined with the same test on differentoccasions, or with diHerent sets of equivalent items, or under

othel: variable examining conditions. This concept of reliability underliesthe computation of the error of measurement of a single score, wherebywe can predict the range of fluctuation likely to occur in a single indi-vidual's score as a result of irrelevant, chance factors.

The concept of test reliability has been used to cover several aspects ofscore consistency. In its broadest sense, test reliability indicates the extentto which individual diHerences in test scores are attributable to "true"differences in the characteristics under consideration and the extent towhich they are attributable to chance errors. To put it in more technicalterms, measures of test reliability make it possible to estimate what pro-portion of the total variance of test scores is error variance. The crux ofthe matter, however, lies in the definition of error variance, Factors thatmight be considered error variance for one purpose would be classifiedunder true variance for another. For example, if we are interested inmeasuring fluctuations of mood, then the day-by-day changes in scoreson a test of cheerfulness-depression would be relevant to the purpose ofthe test and would hence be part of the true variance of the scores. If, onthe other hand, the test is designed to measure more permanent person-ality characteristics, the same daily fluctuations would fall under theheading of error variance.

Essentially, any condition that is irrelevant to the purpose of the testrepresents error variance. Thus, when the examiner tries to maintainuniform testing conditions by controlling the testing environment, in-structions, time limits, rapport, and other similar factors, he is reducingerror variance and making the test scores more reliable. Despite optimumtesting conditions, however, no test is a perfectly reliable instrument.Hence, every test should be accompanied by a statement of its reliability.Such a measure of reliability characterizes the test when administeredunder standard conditions and given to subjects similllr to those con-stituting the normative sample. The characteristics of thiss~mple shouldtherefore be specified, together with the type of reliabIlity that was meas-ured.


There would, of course, be as many varieties of test reliability as there are conditions affecting test scores, since any such conditions might be irrelevant for a certain purpose and would thus be classified as error variance. The types of reliability computed in actual practice, however, are few. In this chapter, the principal techniques for measuring the reliability of test scores will be examined, together with the sources of variance identified by each. Since all types of reliability are concerned with the degree of consistency or agreement between two independently derived sets of scores, they can all be expressed in terms of a correlation coefficient. Accordingly, the next section will consider some of the basic characteristics of correlation coefficients, in order to clarify their use and interpretation. More technical discussion of correlation, as well as more detailed specifications of computing procedures, can be found in any elementary textbook of educational or psychological statistics, such as Guilford and Fruchter (1973).

[Figure 8 is a scatter diagram (bivariate distribution) in which the tally marks for 100 cases fall along the diagonal from the lower left- to the upper right-hand corner; axes: Score on Variable 1 (horizontal) and Score on Variable 2 (vertical).]

FIG. 8. Bivariate Distribution for a Hypothetical Correlation of +1.00.

MEANING OF CORRELATION. Essentially, a correlation coefficient (r) expresses the degree of correspondence, or relationship, between two sets of scores. Thus, if the top-scoring individual in variable 1 also obtains the top score in variable 2, the second-best individual in variable 1 is second best in variable 2, and so on down to the poorest individual in the group, then there would be a perfect correlation between variables 1 and 2. Such a correlation would have a value of +1.00.

A hypothetical illustration of a perfect positive correlation is shown in Figure 8. This figure presents a scatter diagram, or bivariate distribution. Each tally mark in this diagram indicates the score of one individual in both variable 1 (horizontal axis) and variable 2 (vertical axis). It will be noted that all of the 100 cases in the group are distributed along a diagonal running from the lower left- to the upper right-hand corner of the diagram. Such a distribution indicates a perfect positive correlation (+1.00), since it shows that each individual occupies the same relative position in both variables. The closer the bivariate distribution of scores approaches this diagonal, the higher will be the positive correlation.

Figure 9 illustrates a perfect negative correlation (-1.00). In this case, there is a complete reversal of scores from one variable to the other. The best individual in variable 1 is the poorest in variable 2 and vice versa, this reversal being consistently maintained throughout the distribution. It will be noted that, in this scatter diagram, all individuals fall on the diagonal extending from the upper left- to the lower right-hand corner. This diagonal runs in the reverse direction from that in Figure 8.

[Figure 9 is a scatter diagram whose tally marks fall along the diagonal from the upper left- to the lower right-hand corner; axes: Score on Variable 1 (horizontal) and Score on Variable 2 (vertical).]

FIG. 9. Bivariate Distribution for a Hypothetical Correlation of -1.00.

A zero correlation indicates a complete absence of relationship, such as might occur by chance. If each individual's name were pulled at random out of a hat to determine his position in variable 1, and if the process were repeated for variable 2, a zero or near-zero correlation would result. Under these conditions, it would be impossible to predict an individual's relative standing in variable 2 from a knowledge of his score in variable 1. The top-scoring subject in variable 1 might score high, low, or average in variable 2. Some individuals might by chance score above average in both variables, or below average in both; others might fall above average in one variable and below in the other; still others might be above the average in one and at the average in the second; and so forth. There would be no regularity in the relationship from one individual to another.

The coefficients found in actual practice generally fall between these extremes, having some value higher than zero but lower than 1.00. Correlations between measures of abilities are nearly always positive, although frequently low. When a negative correlation is obtained between two such variables, it usually results from the way in which the scores are expressed. For example, if time scores are correlated with amount scores, a negative correlation will probably result. Thus, if each subject's score on an arithmetic computation test is recorded as the number of seconds required to complete all items, while his score on an arithmetic reasoning test represents the number of problems correctly solved, a negative correlation can be expected. In such a case, the poorest (i.e., slowest) individual will have the numerically highest score on the first test, while the best individual will have the highest score on the second.

Correlation coefficients may be computed in various ways, depending on the nature of the data. The most common is the Pearson Product-Moment Correlation Coefficient. This correlation coefficient takes into account not only the person's position in the group, but also the amount of his deviation above or below the group mean. It will be recalled that when each individual's standing is expressed in terms of standard scores, persons falling above the average receive positive standard scores, while those below the average receive negative scores. Thus, an individual who is superior in both variables to be correlated would have two positive standard scores; one inferior in both would have two negative standard scores. If, now, we multiply each individual's standard score in variable 1 by his standard score in variable 2, all the products will be positive, provided that each individual falls on the same side of the mean on both variables. The Pearson correlation coefficient is simply the mean of these products. It will have a high positive value when corresponding standard scores are of equal sign and of approximately equal amount in the two variables. When subjects above the average in one variable are below the average in the other, the corresponding cross-products will be negative. If the sum of the cross-products is negative, the correlation will be negative. When some products are positive and some negative, the correlation will be close to zero.

In actual practice, it is not necessary to convert each raw score to a standard score before finding the cross-products, since this conversion can be made once for all after the cross-products have been added. There are many shortcuts for computing the Pearson correlation coefficient. The method demonstrated in Table 7 is not the quickest, but it illustrates the meaning of the correlation coefficient more clearly than other methods that utilize computational shortcuts. Table 7 shows the computation of a Pearson correlation between the arithmetic and reading scores of 10 children. Next to each child's name are his scores in the arithmetic test (X) and the reading test (Y). The sums and means of the 10 scores are given under the respective columns. The third column shows the deviation (x) of each arithmetic score from the arithmetic mean; and the fourth column, the deviation (y) of each reading score from the reading mean. These deviations are squared in the next two columns, and the sums of the squares are used in computing the standard deviations of the arithmetic and reading scores by the method described in Chapter 4. Rather than dividing each x and y by its corresponding standard deviation to find standard scores, we perform this division only once at the end, as shown in the correlation formula in Table 7. The cross-products in the last column (xy) have been found by multiplying the corresponding deviations in the x and y columns. To compute the correlation (r), the sum of these cross-products is divided by the number of cases (N) and by the product of the two standard deviations (sigma_x and sigma_y).

TABLE 7
Computation of Pearson Product-Moment Correlation Coefficient

Pupil      Arithmetic X  Reading Y    x    y    x^2   y^2    xy
Bill            41          17       +1   -4     1    16    -4
Carol           38          28       -2   +7     4    49   -14
Geoffrey        48          22       +8   +1    64     1     8
Ann             32          16       -8   -5    64    25    40
Bob             34          18       -6   -3    36     9    18
Jane            36          15       -4   -6    16    36    24
Ellen           41          24       +1   +3     1     9     3
Ruth            43          20       +3   -1     9     1    -3
Dick            47          23       +7   +2    49     4    14
Mary            40          27        0   +6     0    36     0
Sum            400         210        0    0   244   186    86
Mean            40          21

\sigma_x = \sqrt{244/10} = \sqrt{24.40} = 4.94        \sigma_y = \sqrt{186/10} = \sqrt{18.60} = 4.31

r_{xy} = \frac{\Sigma xy}{N \sigma_x \sigma_y} = \frac{86}{(10)(4.94)(4.31)} = \frac{86}{212.91} = .40
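The computation in Table 7 is easy to check by machine. The following minimal Python sketch is our own illustration (the function name is ours, and the data are simply the X and Y columns of Table 7); it reproduces the deviation-score formula used above:

```python
# Minimal sketch: Pearson product-moment correlation via deviation scores,
# following the layout of Table 7. Population standard deviations
# (division by N) are used, as in the table.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dev_x = [x - mx for x in xs]                    # x column of Table 7
    dev_y = [y - my for y in ys]                    # y column of Table 7
    sum_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y))   # sum of xy
    sigma_x = math.sqrt(sum(dx * dx for dx in dev_x) / n)   # sigma_x
    sigma_y = math.sqrt(sum(dy * dy for dy in dev_y) / n)   # sigma_y
    return sum_xy / (n * sigma_x * sigma_y)

arithmetic = [41, 38, 48, 32, 34, 36, 41, 43, 47, 40]   # X scores
reading    = [17, 28, 22, 16, 18, 15, 24, 20, 23, 27]   # Y scores
print(round(pearson_r(arithmetic, reading), 2))          # -> 0.4
```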


STATISTICAL SIGNIFICANCE. The correlation of .40 found in Table 7 indicates a moderate degree of positive relationship between the arithmetic and reading scores. There is some tendency for those children doing well in arithmetic also to perform well on the reading test, and vice versa, although the relation is not close. If we are concerned only with the performance of these 10 children, we can accept this correlation as an adequate description of the degree of relation existing between the two variables in this group. In psychological research, however, we are usually interested in generalizing beyond the particular sample of individuals tested to the larger population which they represent. For example, we might want to know whether arithmetic and reading ability are correlated among American schoolchildren of the same age as those we tested. Obviously, the 10 cases actually examined would constitute a very inadequate sample of such a population. Another comparable sample of the same size might yield a much lower or a much higher correlation.

There are statistical procedures for estimating the probable fluctuation to be expected from sample to sample in the size of correlations, means, standard deviations, and any other group measures. The question usually asked about correlations, however, is simply whether the correlation is significantly greater than zero. In other words, if the correlation in the population is zero, could a correlation as high as that obtained in our sample have resulted from sampling error alone? When we say that a correlation is "significant at the 1 percent (.01) level," we mean the chances are no greater than one out of 100 that the population correlation is zero. Hence, we conclude that the two variables are truly correlated. Significance levels refer to the risk of error we are willing to take in drawing conclusions from our data. If a correlation is said to be significant at the .05 level, the probability of error is 5 out of 100. Most psychological research applies either the .01 or the .05 levels, although other significance levels may be employed for special reasons.

The correlation of .40 found in Table 7 fails to reach significance even at the .05 level. As might have been anticipated, with only 10 cases it is difficult to establish a general relationship conclusively. With this size of sample, the smallest correlation significant at the .05 level is .63. Any correlation below that value simply leaves unanswered the question of whether the two variables are correlated in the population from which the sample was drawn.
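The .63 criterion quoted above can be verified from the usual t test for a correlation coefficient. The following worked equation is our own illustration, assuming the standard two-tailed test with the critical value t = 2.306 for N - 2 = 8 degrees of freedom:

r_{crit} = \frac{t_{.05}}{\sqrt{t_{.05}^2 + (N - 2)}} = \frac{2.306}{\sqrt{5.318 + 8}} = \frac{2.306}{3.649} \approx .63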

The minimum correlations significant at the .01 and .05 levels for groups of different sizes can be found by consulting tables of the significance of correlations in any statistics textbook. For interpretive purposes in this book, however, only an understanding of the general concept is required. Parenthetically, it might be added that significance levels can be interpreted in a similar way when applied to other statistical measures. For example, to say that the difference between two means is significant at the .01 level indicates that we can conclude, with only one chance out of 100 of being wrong, that a difference in the obtained direction would be found if we tested the whole population from which our samples were drawn. For instance, if in the sample tested the boys had obtained a significantly higher mean than the girls on a mechanical comprehension test, we could conclude that the boys would also excel in the total population.

THE RELIABILITY COEFFICIENT. Correlation coefficients have many uses in the analysis of psychological data. The measurement of test reliability represents one application of such coefficients. An example of a reliability coefficient, computed by the Pearson Product-Moment method, is to be found in Figure 10. In this case, the scores of 104 persons on two equivalent forms of a Word Fluency test1 were correlated. In one form, the subjects were given five minutes to write as many words as they could that began with a given letter. The second form was identical, except that a different letter was employed. The two letters were chosen by the test authors as being approximately equal in difficulty for this purpose.

The correlation between the number of words written in the two forms of this test was found to be .72. This correlation is high and significant at the .01 level. With 104 cases, any correlation of .25 or higher is significant at this level. Nevertheless, the obtained correlation is somewhat lower than is desirable for reliability coefficients, which usually fall in the .80's or .90's. An examination of the scatter diagram in Figure 10 shows a typical bivariate distribution of scores corresponding to a high positive correlation. It will be noted that the tallies cluster close to the diagonal extending from the lower left- to the upper right-hand corner; the trend is definitely in this direction, although there is a certain amount of scatter of individual entries. In the following section, the use of the correlation coefficient in computing different measures of test reliability will be considered.

1 One of the subtests of the SRA Tests of Primary Mental Abilities for Ages 11 to 17. The data were obtained in an investigation by Anastasi and Drake (1954).


[Figure 10 is a scatter diagram of scores on the two forms of the Word Fluency Test (Form 1 on the horizontal axis, Form 2 on the vertical axis); the tallies cluster along the diagonal from the lower left- to the upper right-hand corner.]

FIG. 10. A Reliability Coefficient of .72. (Data from Anastasi & Drake, 1954.)

TYPES OF RELIABILITY

TEST-RETEST RELIABILITY. The most obvious method for finding the reliability of test scores is by repeating the identical test on a second occasion. The reliability coefficient (r11) in this case is simply the correlation between the scores obtained by the same persons on the two administrations of the test. The error variance corresponds to the random fluctuations of performance from one test session to the other. These variations may result in part from uncontrolled testing conditions, such as extreme changes in weather, sudden noises and other distractions, or a broken pencil point. To some extent, however, they arise from changes in the condition of the subject himself, as illustrated by illness, fatigue, emotional strain, worry, recent experiences of a pleasant or unpleasant nature, and the like. Retest reliability shows the extent to which scores on a test can be generalized over different occasions; the higher the reliability, the less susceptible the scores are to the random daily changes in the condition of the subject or of the testing environment.

When retest reliability is reported in a test manual, the interval over which it was measured should always be specified. Since retest correlations decrease progressively as this interval lengthens, there is not one but an infinite number of retest reliability coefficients for any test. It is also desirable to give some indication of relevant intervening experiences of the subjects on whom reliability was measured, such as educational or job experiences, counseling, psychotherapy, and so forth.

Apart from the desirability of reporting length of interval, what considerations should guide the choice of interval? Illustrations could readily be cited of tests showing high reliability over periods of a few days or weeks, but whose scores reveal an almost complete lack of correspondence when the interval is extended to as long as ten or fifteen years. Many preschool intelligence tests, for example, yield moderately stable measures within the preschool period, but are virtually useless as predictors of late childhood or adult IQ's. In actual practice, however, a simple distinction can usually be made. Short-range, random fluctuations that occur during intervals ranging from a few hours to a few months are generally included under the error variance of the test score. Thus, in checking this type of test reliability, an effort is made to keep the interval short. In testing young children, the period should be even shorter than for older persons, since at early ages progressive developmental changes are discernible over a period of a month or even less. For any type of person, the interval between retests should rarely exceed six months.

Any additional changes in the relative test performance of individuals that occur over longer periods of time are apt to be cumulative and progressive rather than entirely random. Moreover, they are likely to characterize a broader area of behavior than that covered by the test performance itself. Thus, one's general level of scholastic aptitude, mechanical comprehension, or artistic judgment may have altered appreciably over a ten-year period, owing to unusual intervening experiences. The individual's status may have either risen or dropped appreciably in relation to others of his own age, because of circumstances peculiar to his own home, school, or community environment, or for other reasons such as illness or emotional disturbance.

The extent to which such factors can affect an individual's psychological development provides an important problem for investigation. This question, however, should not be confused with that of the reliability of a particular test. When we measure the reliability of the Stanford-Binet, for example, we do not ordinarily correlate retest scores over a period of ten years, or even one year, but over a few weeks. To be sure, long-range retests have been conducted with such tests, but the results are generally discussed in terms of the predictability of adult intelligence from childhood performance, rather than in terms of the reliability of a particular test. The concept of reliability is generally restricted to short-range, random changes that characterize the test performance itself rather than the entire behavior domain that is being tested.

It should be noted that different behavior functions may themselves vary in the extent of daily fluctuation they exhibit. For example, steadiness of delicate finger movements is undoubtedly more susceptible to slight changes in the person's condition than is verbal comprehension. If we wish to obtain an over-all estimate of the individual's habitual finger steadiness, we would probably require repeated tests on several days, whereas a single test session would suffice for verbal comprehension. Again we must fall back on an analysis of the purposes of the test and on a thorough understanding of the behavior the test is designed to predict.

Although apparently simple and straightforward, the test-retest technique presents difficulties when applied to most psychological tests. Practice will probably produce varying amounts of improvement in the retest scores of different individuals. Moreover, if the interval between retests is fairly short, the examinees may recall many of their former responses. In other words, the same pattern of right and wrong responses is likely to recur through sheer memory. Thus, the scores on the two administrations of the test are not independently obtained and the correlation between them will be spuriously high. The nature of the test itself may also change with repetition. This is especially true of problems involving reasoning or ingenuity. Once the subject has grasped the principle involved in the problem, or once he has worked out a solution, he can reproduce the correct response in the future without going through the intervening steps. Only tests that are not appreciably affected by repetition lend themselves to the retest technique. A number of sensory discrimination and motor tests would fall into this category. For the large majority of psychological tests, however, the retest technique is inappropriate.

ALTERNATE-FORM RELIABILITY. One way of avoiding the difficulties encountered in test-retest reliability is through the use of alternate forms of the test. The same persons can thus be tested with one form on the first occasion and with another, comparable form on the second. The correlation between the scores obtained on the two forms represents the reliability coefficient of the test. It will be noted that such a reliability coefficient is a measure of both temporal stability and consistency of response to different item samples (or test forms). This coefficient thus combines two types of reliability. Since both types are important for most testing purposes, however, alternate-form reliability provides a useful measure for evaluating many tests.

The concept of item sampling, or content sampling, underlies not only alternate-form reliability but also other types of reliability to be discussed shortly. It is therefore appropriate to examine it more closely. Everyone has probably had the experience of taking a course examination in which he felt he had a "lucky break" because many of the items covered the very topics he happened to have studied most carefully. On another occasion, he may have had the opposite experience, finding an unusually large number of items on areas he had failed to review. This familiar situation illustrates error variance resulting from content sampling. To what extent do scores on this test depend on factors specific to the particular selection of items? If a different investigator, working independently, were to prepare another test in accordance with the same specifications, how much would an individual's score differ on the two tests?

Let us suppose that a 40-item vocabulary test has been constructed as a measure of general verbal comprehension. Now suppose that a second list of 40 different words is assembled for the same purpose, and that the items are constructed with equal care to cover the same range of difficulty as the first test. The differences in the scores obtained by the same individuals on these two tests illustrate the type of error variance under consideration. Owing to fortuitous factors in the past experience of different individuals, the relative difficulty of the two lists will vary somewhat from person to person. Thus, the first list might contain a larger number of words unfamiliar to individual A than does the second list. The second list, on the other hand, might contain a disproportionately large number of words unfamiliar to individual B. If the two individuals are approximately equal in their overall word knowledge (i.e., in their "true scores"), B will nevertheless excel A on the first list, while A will excel B on the second. The relative standing of these two persons will therefore be reversed on the two lists, owing to chance differences in the selection of items.

Like test-retest reliability, alternate-form reliability should always be accompanied by a statement of the length of the interval between test administrations, as well as a description of relevant intervening experiences. If the two forms are administered in immediate succession, the resulting correlation shows reliability across forms only, not across occasions. The error variance in this case represents fluctuations in performance from one set of items to another, but not fluctuations over time.

In the development of alternate forms, care should of course be exercised to ensure that they are truly parallel. Fundamentally, parallel forms of a test should be independently constructed tests designed to meet the same specifications. The tests should contain the same number of items,


and the items should be expressed in the same form and should cover the same type of content. The range and level of difficulty of the items should also be equal. Instructions, time limits, illustrative examples, format, and all other aspects of the test must likewise be checked for comparability.

It should be added that the availability of parallel test forms is desirable for other reasons besides the determination of test reliability. Alternate forms are useful in follow-up studies or in investigations of the effects of some intervening experimental factor on test performance. The use of several alternate forms also provides a means of reducing the possibility of coaching or cheating.

Although much more widely applicable than test-retest reliability, alternate-form reliability also has certain limitations. In the first place, if the behavior functions under consideration are subject to a large practice effect, the use of alternate forms will reduce but not eliminate such an effect. To be sure, if all examinees were to show the same improvement with repetition, the correlation between their scores would remain unaffected, since adding a constant amount to each score does not alter the correlation coefficient. It is much more likely, however, that individuals will differ in amount of improvement, owing to extent of previous practice with similar material, motivation in taking the test, and other factors. Under these conditions, the practice effect represents another source of variance that will tend to reduce the correlation between the two test forms. If the practice effect is small, reduction will be negligible.

Another related question concerns the degree to which the nature of the test will change with repetition. In certain types of ingenuity problems, for example, any item involving the same principle can be readily solved by most subjects once they have worked out the solution to the first. In such a case, changing the specific content of the items in the second form would not suffice to eliminate this carry-over from the first form. Finally, it should be added that alternate forms are unavailable for many tests, because of the practical difficulties of constructing comparable forms. For all these reasons, other techniques for estimating test reliability are often required.

SPLIT-HALF RELIABILITY. From a single administration of one form of a test it is possible to arrive at a measure of reliability by various split-half procedures. In such a way, two scores are obtained for each person by dividing the test into comparable halves. It is apparent that split-half reliability provides a measure of consistency with regard to content sampling. Temporal stability of the scores does not enter into such reliability, because only one test session is involved. This type of reliability coefficient is sometimes called a coefficient of internal consistency, since only a single administration of a single form is required.

To find split-half reliability, the first problem is how to split the test in order to obtain the most nearly comparable halves. Any test can be divided in many different ways. In most tests, the first half and the second half would not be comparable, owing to differences in nature and difficulty level of items, as well as to the cumulative effects of warming up, practice, fatigue, boredom, and any other factors varying progressively from the beginning to the end of the test. A procedure that is adequate for most purposes is to find the scores on the odd and even items of the test. If the items were originally arranged in an approximate order of difficulty, such a division yields very nearly equivalent half-scores. One precaution to be observed in making such an odd-even split pertains to groups of items dealing with a single problem, such as questions referring to a particular mechanical diagram or to a given passage in a reading test. In this case, a whole group of items should be assigned intact to one or the other half. Were the items in such a group to be placed in different halves of the test, the similarity of the half-scores would be spuriously inflated, since any single error in understanding of the problem might affect items in both halves.

Once the two half-scores have been obtained for each person, they may be correlated by the usual method. It should be noted, however, that this correlation actually gives the reliability of only a half test. For example, if the entire test consists of 100 items, the correlation is computed between two sets of scores each of which is based on only 50 items. In both test-retest and alternate-form reliability, on the other hand, each score is based on the full number of items in the test.

Other things being equal, the longer a test, the more reliable it will be.2 It is reasonable to expect that, with a larger sample of behavior, we can arrive at a more adequate and consistent measure. The effect that lengthening or shortening a test will have on its coefficient can be estimated by means of the Spearman-Brown formula, given below:

r_{nn} = \frac{n\,r_{11}}{1 + (n - 1)\,r_{11}}

in which r_{nn} is the estimated coefficient, r_{11} the obtained coefficient, and n is the number of times the test is lengthened or shortened. Thus, if the number of test items is increased from 25 to 100, n is 4; if it is decreased from 60 to 30, n is 1/2. The Spearman-Brown formula is widely used in determining reliability by the split-half method, many test manuals reporting reliability in this form. When applied to split-half reliability, the formula always involves doubling the length of the test. Under these conditions, it can be simplified as follows:

r_{nn} = \frac{2\,r_{11}}{1 + r_{11}}

2 Lengthening a test, however, will increase only its consistency in terms of content sampling, not its stability over time (see Cureton, 1965).
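To make the prophecy formula concrete, here is a minimal Python sketch; it is our own illustration (the function name is ours), covering both the general case and the doubled-length split-half case:

```python
# Minimal sketch of the Spearman-Brown formula:
# r_nn = n * r11 / (1 + (n - 1) * r11), where n is the factor by which
# the test is lengthened (n > 1) or shortened (n < 1).
def spearman_brown(r11, n):
    return n * r11 / (1 + (n - 1) * r11)

# Doubling a half-test correlation of .72 (the split-half case, n = 2):
print(round(spearman_brown(0.72, 2), 2))    # -> 0.84
# Shortening a 60-item test with r = .90 to 30 items (n = 1/2):
print(round(spearman_brown(0.90, 0.5), 2))  # -> 0.82
```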

An alternate method for finding split-half reliability was developed by Rulon (1939). It requires only the variance of the differences between each person's scores on the two half-tests (sigma-squared of d) and the variance of total scores (sigma-squared of t); these two values are substituted in the following formula, which yields the reliability of the whole test directly:

r_{11} = 1 - \frac{\sigma_d^2}{\sigma_t^2}

It is interesting to note the relationship of this formula to the definition of error variance. Any difference between a person's scores on the two half-tests represents chance error. The variance of these differences, divided by the variance of total scores, gives the proportion of error variance in the scores. When this error variance is subtracted from 1.00, it gives the proportion of "true" variance, which is equal to the reliability coefficient.
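The Rulon formula is easy to compute from paired half-scores. The sketch below is our own illustration (function names and data are hypothetical; population variances, dividing by N, are assumed, as in the text):

```python
# Minimal sketch of the Rulon split-half formula:
# r11 = 1 - (variance of half-score differences / variance of total scores).
def variance(values):                     # population variance (divide by N)
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def rulon(half1, half2):
    diffs  = [a - b for a, b in zip(half1, half2)]   # chance-error component
    totals = [a + b for a, b in zip(half1, half2)]   # whole-test scores
    return 1 - variance(diffs) / variance(totals)

# Hypothetical odd/even half-scores for five examinees:
odd  = [20, 15, 18, 24, 22]
even = [19, 14, 20, 23, 21]
print(round(rulon(odd, even), 2))   # -> 0.96
```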

KUDER-RICHARDSON RELIABILITY. A fourth method for finding reliability, also utilizing a single administration of a single form, is based on the consistency of responses to all items in the test. This interitem consistency is influenced by two sources of error variance: (1) content sampling (as in alternate-form and split-half reliability); and (2) heterogeneity of the behavior domain sampled. The more homogeneous the domain, the higher the interitem consistency. For example, if one test includes only multiplication items, while another comprises addition, subtraction, multiplication, and division items, the former test will probably show more interitem consistency than the latter. In the latter, more heterogeneous test, one subject may perform better in subtraction than in any of the other arithmetic operations; another subject may score relatively well on the division items, but more poorly in addition, subtraction, and multiplication; and so on. A more extreme example would be represented by a test consisting of 40 vocabulary items, in contrast to one containing 10 vocabulary, 10 spatial relations, 10 arithmetic reasoning, and 10 perceptual speed items. In the latter test, there might be little or no relationship between an individual's performance on the different types of items.

It is apparent that test scores will be less ambiguous when derived from relatively homogeneous tests. Suppose that in the highly heterogeneous, 40-item test cited above, Smith and Jones both obtain a score of 20. Can we conclude that the performances of the two on this test were equal? Not at all. Smith may have correctly completed 10 vocabulary items, 10 perceptual speed items, and none of the arithmetic reasoning and spatial relations items. In contrast, Jones may have received a score


of 20 by the successful completion of 5 perceptual speed, 5 spatial relations, 10 arithmetic reasoning, and no vocabulary items.

Many other combinations could obviously produce the same total score of 20. This score would have a very different meaning when obtained through such dissimilar combinations of items. In the relatively homogeneous vocabulary test, on the other hand, a score of 20 would probably mean that the subject had succeeded with approximately the first 20 words, if the items were arranged in ascending order of difficulty. He might have failed two or three easier words and correctly responded to two or three more difficult items beyond the 20th, but such individual variations are slight in comparison with those found in a more heterogeneous test.

A highly relevant question in this connection is whether the criterion that the test is trying to predict is itself relatively homogeneous or heterogeneous. Although homogeneous tests are to be preferred because their scores permit fairly unambiguous interpretation, a single homogeneous test is obviously not an adequate predictor of a highly heterogeneous criterion. Moreover, in the prediction of a heterogeneous criterion, the heterogeneity of test items would not necessarily represent error variance. Traditional intelligence tests provide a good example of heterogeneous tests designed to predict heterogeneous criteria. In such a case, however, it may be desirable to construct several relatively homogeneous tests, each measuring a different phase of the heterogeneous criterion. Thus, unambiguous interpretation of test scores could be combined with adequate criterion coverage.

The most common procedure for finding interitem consistency is that developed by Kuder and Richardson (1937). As in the split-half methods, interitem consistency is found from a single administration of a single test. Rather than requiring two half-scores, however, such a technique is based on an examination of performance on each item. Of the various formulas derived in the original article, the most widely applicable, commonly known as "Kuder-Richardson formula 20," is the following:

r_{11} = \left(\frac{n}{n-1}\right)\left(\frac{\sigma_t^2 - \Sigma pq}{\sigma_t^2}\right)

In this formula, r11 is the reliability coefficient of the whole test, n is the number of items in the test, and sigma_t the standard deviation of total scores on the test. The only new term in this formula, Sigma pq, is found by tabulating the proportion of persons who pass (p) and the proportion who do not pass (q) each item. The product of p and q is computed for each item, and these products are then added for all items, to give Sigma pq. Since in the process of test construction p is often routinely recorded in order to find the difficulty level of each item, this method of determining reliability involves little additional computation.

3 A simple derivation of this formula can be found in Ebel (1965, pp. 325-327).
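As a minimal illustration (our own sketch, not from the text), KR-20 can be computed directly from a matrix of 0/1 item responses. Replacing the Sigma pq term with the sum of item-score variances would give the generalized coefficient alpha discussed below:

```python
# Minimal sketch of Kuder-Richardson formula 20 for right/wrong (0/1) items.
# rows = examinees, columns = items.
def kr20(scores):
    n_items = len(scores[0])
    n_people = len(scores)
    totals = [sum(row) for row in scores]
    mean_t = sum(totals) / n_people
    var_t = sum((t - mean_t) ** 2 for t in totals) / n_people  # sigma_t^2
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in scores) / n_people  # proportion passing item j
        sum_pq += p * (1 - p)                         # p * q for item j
    return (n_items / (n_items - 1)) * (var_t - sum_pq) / var_t

responses = [  # five examinees on a hypothetical four-item test
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 2))  # -> 0.8
```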


It can be shown mathematically that the Kuder-Richardson reliability coefficient is actually the mean of all split-half coefficients resulting from different splittings of a test (Cronbach, 1951).4 The ordinary split-half coefficient, on the other hand, is based on a planned split designed to yield equivalent sets of items. Hence, unless the test items are highly homogeneous, the Kuder-Richardson coefficient will be lower than the split-half reliability. An extreme example will serve to highlight the difference. Suppose we construct a 50-item test out of 25 different kinds of items, such that items 1 and 2 are vocabulary items, items 3 and 4 arithmetic reasoning, items 5 and 6 spatial orientation, and so on. The odd and even scores on this test could theoretically agree quite closely, thus yielding a high split-half reliability coefficient. The homogeneity of this test, however, would be very low, since there would be little consistency of performance among the entire set of 50 items. In this example, we would expect the Kuder-Richardson reliability to be much lower than the split-half reliability. It can be seen that the difference between Kuder-Richardson and split-half reliability coefficients may serve as a rough index of the heterogeneity of a test.

The Kuder-Richardson formula is applicable to tests whose items are scored as right or wrong, or according to some other all-or-none system. Some tests, however, may have multiple-scored items. On a personality inventory, for example, the respondent may receive a different numerical score on an item, depending on whether he checks "usually," "sometimes," "rarely," or "never." For such tests, a generalized formula has been derived, known as coefficient alpha (Cronbach, 1951; Novick & Lewis, 1967). In this formula, the value Sigma pq is replaced by the sum of the variances of item scores. The procedure is to find the variance of all individuals' scores for each item and then to add these variances across all items. The complete formula for coefficient alpha is given below:

r_{11} = \left(\frac{n}{n-1}\right)\left(\frac{\sigma_t^2 - \Sigma \sigma_i^2}{\sigma_t^2}\right)

A clear description of the computational layout for finding coefficient alpha can be found in Ebel (1965, pp. 326-330).

4 This is strictly true only when the split-half coefficients are found by the Rulon formula, not when they are found by correlation of halves and the Spearman-Brown formula (Novick & Lewis, 1967).

SCORER RELIABILITY. It should now be apparent that the different types of reliability vary in the factors they subsume under error variance. In one case, error variance covers temporal fluctuations; in another, it refers to differences between sets of parallel items; and in still another, it includes any interitem inconsistency. On the other hand, the factors excluded from measures of error variance are broadly of two types: (a) those factors whose variance should remain in the scores, since they are part of the true differences under consideration; and (b) those irrelevant factors that can be experimentally controlled. For example, it is not customary to report the error of measurement resulting when a test is administered under distracting conditions or with a longer or shorter time limit than that specified in the manual. Timing errors and serious distractions can be empirically eliminated from the testing situation. Hence, it is not necessary to report special reliability coefficients corresponding to "distraction variance" or "timing variance."

Similarly, most tests provide such highly standardized procedures for administration and scoring that error variance attributable to these factors is negligible. This is particularly true of group tests designed for mass testing and computer scoring. With such instruments, we need only to make certain that the prescribed procedures are carefully followed and adequately checked. With clinical instruments employed in intensive individual examinations, on the other hand, there is evidence of considerable "examiner variance." Through special experimental designs, it is possible to separate this variance from that attributable to temporal fluctuations in the subject's condition or to the use of alternate test forms.

One source of error variance that can be checked quite simply is scorer variance. Certain types of tests, notably tests of creativity and projective tests of personality, leave a good deal to the judgment of the scorer. With such tests, there is as much need for a measure of scorer reliability as there is for the more usual reliability coefficients. Scorer reliability can be found by having a sample of test papers independently scored by two examiners. The two scores thus obtained by each examinee are then correlated in the usual way, and the resulting correlation coefficient is a measure of scorer reliability. This type of reliability is commonly computed when subjectively scored instruments are employed in research. Test manuals should also report it when appropriate.

OVERVIEW. The different types of reliability coefficients discussed in this section are summarized in Tables 8 and 9. In Table 8 the operations followed in obtaining each type of reliability are classified with regard to number of test forms and number of testing sessions required. Table 9 shows the sources of variance treated as error variance by each procedure.


TABLE 8
Techniques for Measuring Reliability, in Relation to Test Form and Testing Session

Testing Sessions    Test Forms Required
Required            One                          Two

One                 Split-Half                   Alternate-Form (Immediate)
                    Kuder-Richardson
                    Scorer

Two                 Test-Retest                  Alternate-Form (Delayed)

TABLE 9
Sources of Error Variance in Relation to Reliability Coefficients

Type of Reliability Coefficient           Error Variance
Test-Retest                               Time sampling
Alternate-Form (Immediate)                Content sampling
Alternate-Form (Delayed)                  Time sampling and Content sampling
Split-Half                                Content sampling
Kuder-Richardson and Coefficient Alpha    Content sampling and Content heterogeneity
Scorer                                    Interscorer differences

Any reliability coefficient may be interpreted directly in terms of the percentage of score variance attributable to different sources. Thus, a reliability coefficient of .85 signifies that 85 percent of the variance in test scores depends on true variance in the trait measured and 15 percent depends on error variance (as operationally defined by the specific procedure followed). The statistically sophisticated reader may recall that it is the square of a correlation coefficient that represents proportion of common variance. Actually, the proportion of true variance in test scores is the square of the correlation between scores on a single form of the test and true scores free from chance errors. This correlation, known as the index of reliability,5 is equal to the square root of the reliability coefficient. When the index of reliability is squared, the result is the reliability coefficient (r11), which can therefore be interpreted directly as the percentage of true variance.

Experimental designs that yield more than one type of reliability coefficient for the same group permit the analysis of total score variance into different components. Let us consider the following hypothetical example. Forms A and B of a creativity test have been administered with a two-month interval to 100 sixth-grade children. The resulting alternate-form reliability is .70. From the responses of either form, a split-half reliability coefficient can also be computed.6 This coefficient, stepped up by the Spearman-Brown formula, is .80. Finally, a second scorer has rescored a random sample of 50 papers, from which a scorer reliability of .92 is obtained. The three reliability coefficients can now be analyzed to yield the error variances shown in Table 10 and Figure 11. It will be noted that by subtracting the error variance attributable to content sampling alone (split-half reliability) from the error variance attributable to both content and time sampling (alternate-form reliability), we find that .10 of the variance can be attributed to time sampling alone. Adding the error variances attributable to content sampling (.20), time sampling (.10), and interscorer difference (.08) gives a total error variance of .38 and hence a true variance of .62. These proportions, expressed in the more familiar percentage terms, are shown graphically in Figure 11.

TABLE 10
Analysis of Sources of Error Variance in a Hypothetical Test

From delayed alternate-form reliability:       1 - .70 = .30   (time sampling plus content sampling)
From split-half, Spearman-Brown reliability:   1 - .80 = .20*  (content sampling)
Difference:                                              .10*  (time sampling)
From scorer reliability:                       1 - .92 = .08*  (interscorer difference)

* Total Measured Error Variance = .20 + .10 + .08 = .38
  True Variance = 1 - .38 = .62

5 Derivations of the index of reliability, based on two different sets of assumptions, are given by Gulliksen (1950b, Chs. 2 and 3).

6 For a better estimate of the coefficient of internal consistency, split-half correlations could be computed for each form and the two coefficients averaged by the appropriate statistical procedures.




RELIABILITY OF SPEEDED TESTS

Both in test construction and in the interpretation of test scores, an important distinction is that between the measurement of speed and of power. A pure speed test is one in which individual differences depend entirely on speed of performance. Such a test is constructed from items of uniformly low difficulty, all of which are well within the ability level of the persons for whom the test is designed. The time limit is made so short that no one can finish all the items. Under these conditions, each person's score reflects only the speed with which he worked. A pure power test, on the other hand, has a time limit long enough to permit everyone to attempt all items. The difficulty of the items is steeply graded, and the test includes some items too difficult for anyone to solve, so that no one can get a perfect score.

It will be noted that both speed and power tests are designed to prevent the achievement of perfect scores. The reason for such a precaution is that perfect scores are indeterminate, since it is impossible to know how much higher the individual's score would have been if more items, or more difficult items, had been included. To enable each individual to show fully what he is able to accomplish, the test must provide adequate ceiling, either in number of items or in difficulty level. An exception to this rule is found in mastery testing, as illustrated by the criterion-referenced tests discussed in Chapter 4. The purpose of such testing is not to establish the limits of what the individual can do, but to determine whether a preestablished performance level has or has not been reached.

In actual practice, the distinction between speed and power tests is one of degree, most tests depending on both power and speed in varying proportions. Information about these proportions is needed for each test in order not only to understand what the test measures but also to choose the proper procedures for evaluating its reliability. Single-trial reliability coefficients, such as those found by odd-even or Kuder-Richardson techniques, are inapplicable to speeded tests. To the extent that individual differences in test scores depend on speed of performance, reliability coefficients found by these methods will be spuriously high. An extreme example will help to clarify this point. Let us suppose that a 50-item test depends entirely on speed, so that individual differences in score are based wholly on number of items attempted, rather than on errors. Then, if individual A obtains a score of 44, he will obviously have 22 correct odd items and 22 correct even items. Similarly, individual B, with a score of 34, will have odd and even scores of 17 and 17, respectively. Consequently, except for accidental careless errors on a few items, the correlation between odd and even scores would be perfect, or +1.00. Such a correlation, however, is entirely spurious and provides no information about the reliability of the test.

An examination of the procedures followed in finding both split-half and Kuder-Richardson reliability will show that both are based on the consistency in number of errors made by the examinee. If, now, individual differences in test scores depend, not on errors, but on speed, the measure of reliability must obviously be based on consistency in speed of work. When test performance depends on a combination of speed and power, the single-trial reliability coefficient will fall below 1.00, but it will still be spuriously high. As long as individual differences in test scores are appreciably affected by speed, single-trial reliability coefficients cannot be properly interpreted.

What alternative procedures are available to determine the reliability of significantly speeded tests? If the test-retest technique is applicable, it would be appropriate. Similarly, equivalent-form reliability may be properly employed with speed tests. Split-half techniques may also be used, provided that the split is made in terms of time rather than in terms of items. In other words, the half-scores must be based on separately timed parts of the test. One way of effecting such a split is to administer two equivalent halves of the test with separate time limits. For example, the odd and even items may be separately printed on different pages, and each set of items given with one-half the time limit of the entire test. Such a procedure is tantamount to administering two equivalent forms of the test in immediate succession. Each form, however, is half as long as the test proper, while the subjects' scores are normally based on the whole test. For this reason, either the Spearman-Brown or some other appropriate formula should be used to find the reliability of the whole test.

If it is not feasible to administer the two half-tests separately, an alternative procedure is to divide the total time into quarters, and to find a score for each of the four quarters. This can easily be done by having the examinees mark the item on which they are working whenever the examiner gives a prearranged signal. The number of items correctly completed within the first and fourth quarters can then be combined to

represent one half-score, while those in the second and third quarters can be combined to yield the other half-score. Such a combination of quarters tends to balance out the cumulative effects of practice, fatigue, and other factors. This method is especially satisfactory when the items are not steeply graded in difficulty level.

When is a test appreciably speeded? Under what conditions must the special precautions discussed in this section be observed? Obviously, the mere employment of a time limit does not signify a speed test. If all subjects finish within the given time limit, speed of work plays no part in determining the scores. Percentage of persons who fail to complete the test might be taken as a crude index of speed versus power. Even when no one finishes the test, however, the role of speed may be negligible. For example, if everyone completes exactly 40 items of a 50-item test, individual differences with regard to speed are entirely absent, although no one had time to attempt all the items.

The essential question, of course, is: "To what extent are individual differences in test scores attributable to speed?" In more technical terms, we want to know what proportion of the total variance of test scores is speed variance. This proportion can be estimated roughly by finding the variance of number of items completed by different persons and dividing it by the variance of total test scores (σc²/σt²). In the example cited above, in which every individual finishes 40 items, the numerator of this fraction would be zero, since there are no individual differences in number of items completed (σc² = 0). The entire index would thus equal zero in a pure power test. On the other hand, if the total test variance (σt²) is attributable to individual differences in speed, the two variances will be equal and the ratio will be 1.00. Several more refined procedures have been developed for determining this proportion, but their detailed consideration falls beyond the scope of this book.7

7 See, e.g., Cronbach & Warrington (1951), Gulliksen (1950a, 1950b), Guttman (1955), Helmstadter & Ortmeyer (1953).
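The crude index just described is easy to compute. A minimal sketch in Python (the function name is ours), using the 40-items-completed example from the text:

```python
import statistics

def speed_index(items_completed, total_scores):
    """Rough proportion of speed variance: variance of the number of items
    completed divided by the variance of total test scores."""
    return statistics.pvariance(items_completed) / statistics.pvariance(total_scores)

# Every examinee finishes exactly 40 items, so the numerator -- and hence
# the index -- is zero, and the test behaves as a pure power test:
print(speed_index([40, 40, 40, 40], [28, 35, 31, 38]))  # 0.0
```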

An example of the effect of speed on single-trial reliability coefficients is provided by data collected in an investigation of the first edition of the SRA Tests of Primary Mental Abilities for Ages 11 to 17 (Anastasi & Drake, 1954). In this study, the reliability of each test was first determined by the usual odd-even procedure. These coefficients, given in the first row of Table 11, are closely similar to those reported in the test manual. Reliability coefficients were then computed by correlating scores on separately timed halves. These coefficients are shown in the second row of Table 11. Calculation of speed indexes showed that the Verbal Meaning test is primarily a power test, while the Reasoning test is somewhat more dependent on speed. The Space and Number tests proved to be highly speeded.

TABLE 11
Reliability Coefficients of Four of the SRA Tests of Primary Mental Abilities for Ages 11 to 17 (1st Edition)
(Data from Anastasi & Drake, 1954)

Reliability Coefficient Found by:    Verbal Meaning    Reasoning    Space    Number
Single-trial odd-even method              .94             .96        .90       .92
Separately timed halves                   .90             .87        .75       .83

It will be noted in Table 11 that, when properly computed, the reliability of the Space test is .75, in contrast to a spuriously high odd-even coefficient of .90. Similarly, the reliability of the Reasoning test drops from .96 to .87, and that of the Number test drops from .92 to .83. The reliability of the relatively unspeeded Verbal Meaning test, on the other hand, shows a negligible difference when computed by the two methods.

DEPENDENCE OF RELIABILITY COEFFICIENTS ON THE SAMPLE TESTED

HETEROGENEITY. An important factor influencing the size of a reliability coefficient is the nature of the group on which reliability is measured. In the first place, any correlation coefficient is affected by the range of individual differences in the group. If every member of a group were alike in spelling ability, then the correlation of spelling with any other ability would be zero in that group. It would obviously be impossible, within such a group, to predict an individual's standing in any other ability from a knowledge of his spelling score.

Another, less extreme, example is provided by the correlation between two aptitude tests, such as a verbal comprehension and an arithmetic reasoning test. If these tests were administered to a highly homogeneous sample, such as a group of 300 college sophomores, the correlation between the two would probably be close to zero. There is little relationship, within such a selected sample of college students, between any individual's verbal ability and his numerical reasoning ability. On the other hand, were the tests to be given to a heterogeneous sample of 300 persons, ranging from institutionalized mentally retarded persons to college graduates, a high correlation would undoubtedly be obtained between the two tests. The mentally retarded would obtain poorer scores than the college graduates on both tests, and similar relationships would hold for other subgroups within this highly heterogeneous sample.


An examination of the hypothetical scatter diagram given in Figure 12 will further illustrate the dependence of correlation coefficients on the variability, or extent of individual differences, within the group. This scatter diagram shows a high positive correlation in the entire, heterogeneous group, since the entries are closely clustered about the diagonal extending from lower left- to upper right-hand corners. If, now, we consider only the subgroup falling within the small rectangle in the upper right-hand portion of the diagram, it is apparent that the correlation between the two variables is close to zero. Individuals falling within this restricted range in both variables represent a highly homogeneous group, as did the college sophomores mentioned above.

Like all correlation coefficients, reliability coefficients depend on the variability of the sample within which they are found. Thus, if the reliability coefficient reported in a test manual was determined in a group ranging from fourth-grade children to high school students, it cannot be assumed that the reliability would be equally high within, let us say, an eighth-grade sample. When a test is to be used to discriminate individual differences within a more homogeneous sample than the standardization group, the reliability coefficient should be redetermined on such a sample. Formulas for estimating the reliability coefficient to be expected when the standard deviation of the group is increased or decreased are available in elementary statistics textbooks. It is preferable, however, to recompute the reliability coefficient empirically on a group comparable to that on which the test is to be used. For tests designed to cover a wide range of age or ability, the test manual should report separate reliability coefficients for relatively homogeneous subgroups within the standardization sample.
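One common textbook formula of this kind assumes that the error variance, σt²(1 - r11), stays constant while the group's standard deviation changes. A minimal sketch under that assumption (the function name and the illustrative values are ours):

```python
def reliability_in_new_group(r_old, sd_old, sd_new):
    """Estimate the reliability expected in a group of different variability,
    assuming the error variance sd_old**2 * (1 - r_old) is unchanged."""
    error_variance = sd_old ** 2 * (1 - r_old)
    return 1 - error_variance / sd_new ** 2

# A coefficient of .90 obtained in a heterogeneous group (SD = 20) implies
# only about .60 in a more homogeneous group (SD = 10):
print(reliability_in_new_group(0.90, sd_old=20, sd_new=10))
```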

[Figure 12 here: hypothetical scatter diagram showing a high positive correlation over the full, heterogeneous range of scores, with a small rectangle in the upper right-hand corner marking a restricted subgroup within which the correlation is close to zero; horizontal axis: Score on Variable 1.]

Fig. 12. The Effect of Restricted Range upon a Correlation Coefficient.

ABILITY LEVEL. Not only does the reliability coefficient vary with the extent of individual differences in the sample, but it may also vary between groups differing in average ability level. These differences, moreover, cannot usually be predicted or estimated by any statistical formula, but can be discovered only by empirical tryout of the test on groups differing in age or ability level. Such differences in the reliability of a single test may arise from the fact that a slightly different combination of abilities is measured at different difficulty levels of the test. Or it may result from the statistical properties of the scale itself, as in the Stanford-Binet (Pinneau, 1961, Ch. 5). Thus, for different ages and for different IQ levels, the reliability coefficient of the Stanford-Binet varies from .83 to .98. In other tests, reliability may be relatively low for the younger and less able groups, since their scores are unduly influenced by guessing. Under such circumstances, the particular test should not be employed at these levels.

It is apparent that every reliability coefficient should be accompanied by a full description of the type of group on which it was determined. Special attention should be given to the variability and the ability level of the sample. The reported reliability coefficient is applicable only to samples similar to that on which it was computed. A desirable and growing practice in test construction is to fractionate the standardization sample into more homogeneous subgroups, with regard to age, sex, grade level, occupation, and the like, and to report separate reliability coefficients for each subgroup. Under these conditions, the reliability coefficients are more likely to be applicable to the samples with which the test is to be used in actual practice.

INTERPRETATION OF INDIVIDUAL SCORES. The reliability of a test may be expressed in terms of the standard error of measurement (σmeas), also called the standard error of a score. This measure is particularly well suited to the interpretation of individual scores. For many testing purposes, it is therefore more useful than the reliability coefficient. The standard error of measurement can be easily computed from the reliability coefficient of the test, by the following formula:

σmeas = σt √(1 - r11)

in which σt is the standard deviation of the test scores and r11 the reliability coefficient, both computed on the same group. For example, if deviation IQ's on a particular intelligence test have a standard deviation of 15 and a reliability coefficient of .89, the σmeas of an IQ on this test is 15√(1 - .89) = 15√.11 = 15(.33) = 5.

To understand what the σmeas tells us about a score, let us suppose that we had a set of 100 IQ's obtained with the above test by a single boy, Jim. Because of the types of chance errors discussed in this chapter, these scores will vary, falling into a normal distribution around Jim's true score. The mean of this distribution of 100 scores can be taken as the true score, and the standard deviation of the distribution can be taken as the σmeas. Like any standard deviation, this standard error can be interpreted in terms of the normal curve frequencies discussed in Chapter 4 (see Figure 3). It will be recalled that between the mean and ±1σ there are approximately 68 percent of the cases in a normal curve. Thus, we can conclude that the chances are roughly 2:1 (or 68:32) that Jim's IQ on this test will fluctuate between ±1σmeas, or 5 points, on either side of his true IQ. If his true IQ is 110, we would expect him to score between 105 and 115 about two-thirds (68 percent) of the time.

If we want to be more certain of our prediction, we can choose higher odds than 2:1. Reference to Figure 3 in Chapter 4 shows that ±3σ covers 99.7 percent of the cases. It can be ascertained from normal curve frequency tables that a distance of 2.58σ on either side of the mean includes exactly 99 percent of the cases. Hence, the chances are 99:1 that Jim's IQ will fall within 2.58σmeas, or (2.58)(5) = 13 points, on either side of his true IQ. We can thus state at the 99 percent confidence level (with only one chance of error out of 100) that Jim's IQ on any single administration of the test will lie between 97 and 123 (110 - 13 and 110 + 13). If Jim were given 100 equivalent tests, his IQ would fall outside this band of values only once.

In actual practice, of course, we do not have the true scores, but only the scores obtained in a single test administration. Under these circumstances, we could try to follow the above reasoning in the reverse direction. If an individual's obtained score is unlikely to deviate by more than 2.58σmeas from his true score, we could argue that his true score must lie within 2.58σmeas of his obtained score. Although we cannot assign a probability to this statement for any given obtained score, we can say that the statement would be correct for 99 percent of all the cases. On the basis of this reasoning, Gulliksen (1950b, pp. 17-20) proposed that the standard error of measurement be used as illustrated above to estimate the reasonable limits of the true score for persons with any given obtained score. It is in terms of such "reasonable limits" that the error of measurement is customarily interpreted in psychological testing, and it will be so interpreted in this book.
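The computations in Jim's example are easily verified. A minimal sketch in Python (the function name is ours):

```python
import math

def standard_error_of_measurement(sd, r11):
    """SEM = SD * sqrt(1 - r11)."""
    return sd * math.sqrt(1 - r11)

sem = standard_error_of_measurement(15, 0.89)   # about 5 IQ points
true_iq = 110

# 68 percent band (plus or minus 1 SEM) and 99 percent "reasonable limits"
# (plus or minus 2.58 SEM) around Jim's true IQ of 110:
print(true_iq - sem, true_iq + sem)                # roughly 105 to 115
print(true_iq - 2.58 * sem, true_iq + 2.58 * sem)  # roughly 97 to 123
```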

The standard error of measurement and the reliability coefficient are obviously alternative ways of expressing test reliability. Unlike the reliability coefficient, the error of measurement is independent of the variability of the group on which it is computed. Expressed in terms of individual scores, it remains unchanged when found in a homogeneous or a heterogeneous group. On the other hand, being reported in score units, the error of measurement will not be directly comparable from test to test. The usual problems of comparability of units would thus arise when errors of measurement are reported in terms of arithmetic problems, words in a vocabulary test, and the like. Hence, if we want to compare the reliability of different tests, the reliability coefficient is the better measure. To interpret individual scores, the standard error of measurement is more appropriate.

INTERPRETATION OF SCORE DIFFERENCES. It is particularly important to consider test reliability and errors of measurement when evaluating the differences between two scores. Thinking in terms of the range within which each score may fluctuate serves as a check against overemphasizing small differences between scores. Such caution is desirable both when comparing test scores of different persons and when comparing the scores of the same individual in different abilities. Similarly, changes in scores following instruction or other experimental variables need to be interpreted in the light of errors of measurement.

A frequent question about test scores concerns the individual's relative standing in different areas. Is Jane more able along verbal than along numerical lines? Does Tom have more aptitude for mechanical than for verbal activities? If Jane scored higher on the verbal than on the numerical subtests of an aptitude battery, and Tom scored higher on the mechanical than on the verbal, how sure can we be that they would still do so on a retest with another form of the battery? In other words, could the score differences have resulted merely from the chance selection of specific items in the particular verbal, numerical, and mechanical tests employed?

Because of the growing interest in the interpretation of score profiles, test publishers have been developing report forms that permit the

evaluation of scores in terms of their errors of measurement. An example is the Individual Report Form for use with the Differential Aptitude Tests, reproduced in Figure 13. On this form, percentile scores on each subtest of the battery are plotted as one-inch bars, with the obtained percentile at the center. Each percentile bar corresponds to a distance of approximately 1½ to 2 standard errors on either side of the obtained score.8 Hence the assumption that the individual's "true" score falls within the bar is correct about 90 percent of the time. In interpreting the profiles, test users are advised not to attach importance to differences between scores whose percentile bars overlap, especially if they overlap by more than half their length. In the profile illustrated in Figure 13, for example, the difference between the Verbal Reasoning and Numerical Ability scores probably reflects a genuine difference in ability level; that between Mechanical Reasoning and Space Relations probably does not; the difference between Abstract Reasoning and Mechanical Reasoning is in the doubtful range.

[Figure 13 here: a Differential Aptitude Tests report form on which each subtest's raw score and percentile are plotted as a one-inch percentile band on the profile chart.]

Fig. 13. Score Profile on the Differential Aptitude Tests, Illustrating Use of Percentile Bands. (Fig. 2, Fifth Edition Manual, p. 73. Reproduced by permission. Copyright © 1973, 1974 by The Psychological Corporation, New York, N.Y. All rights reserved.)

8 Because the reliability coefficient (and hence the σmeas) varies somewhat with subtest, grade, and sex, the actual ranges covered by the one-inch lines are not identical, but they are sufficiently close to permit uniform interpretations for practical purposes.

It is well to bear in mind that the standard error of the difference between two scores is larger than the error of measurement of either of the two scores. This follows from the fact that this difference is affected by the chance errors present in both scores. The standard error of the difference between two scores can be found from the standard errors of measurement of the two scores by the following formula:

σdiff = √(σmeas1² + σmeas2²)

in which σdiff is the standard error of the difference between the two scores, and σmeas1 and σmeas2 are the standard errors of measurement of the separate scores. By substituting SD√(1 - r11) for σmeas1 and SD√(1 - r22) for σmeas2, we may rewrite the formula directly in terms of reliability coefficients, as follows:

σdiff = SD √(2 - r11 - r22)

In this substitution, the same SD was used for tests 1 and 2, since their scores would have to be expressed in terms of the same scale before they could be compared.

We may illustrate the above procedure with the Verbal and Performance IQ's on the Wechsler Adult Intelligence Scale (WAIS). The split-half reliabilities of these scores are .96 and .93, respectively. WAIS deviation IQ's have a mean of 100 and an SD of 15. Hence the standard error of the difference between these two scores can be found as follows:

σdiff = 15√(2 - .96 - .93) = 4.95

To determine how large a score difference could be obtained by chance at the .05 level, we multiply the standard error of the difference (4.95) by 1.96. The result is 9.70, or approximately 10 points. Thus the difference between an individual's WAIS Verbal and Performance IQ should be at least 10 points to be significant at the .05 level.
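The WAIS computation can likewise be written out directly. A minimal sketch in Python (the function name is ours):

```python
import math

def sem_of_difference(sd, r1, r2):
    """Standard error of the difference between two scores on the same scale:
    SD * sqrt(2 - r1 - r2)."""
    return sd * math.sqrt(2 - r1 - r2)

sd_diff = sem_of_difference(15, 0.96, 0.93)   # about 4.95 IQ points
# Minimum Verbal-Performance difference significant at the .05 level;
# prints about 9.75 (the text's 9.70 reflects rounding the SEM to 4.95):
print(1.96 * sd_diff)
```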

RELIABILITY OF CRITERION-REFERENCED TESTS

It will be recalled from Chapter 4 that criterion-referenced tests usually (but not necessarily) evaluate performance in terms of mastery rather than degree of achievement. A major statistical implication of

mastery testing is a reduction in variability of scores among persons. Theoretically, if everyone continues training until the skill is mastered, variability is reduced to zero. Not only is low variability a result of the way such tests are used; it is also built into the tests through the construction and choice of items, as will be shown in Chapter 8.

In an earlier section of this chapter, we saw that any correlation, including reliability coefficients, is affected by the variability of the group in which it is computed. As the variability of the sample decreases, so does the correlation coefficient. Obviously, then, it would be inappropriate to assess the reliability of most criterion-referenced tests by the usual procedures.9 Under these conditions, even a highly stable and internally consistent test could yield a reliability coefficient near zero.

In the construction of criterion-referenced tests, two important questions are: (1) How many items must be used for reliable assessment of each of the specific instructional objectives covered by the test? (2) What proportion of items must be correct for the reliable establishment of mastery? In much current testing, these two questions have been answered by judgmental decisions. Efforts are under way, however, to develop appropriate statistical techniques that will provide objective, empirical answers (see, e.g., Ferguson & Novick, 1973; Glaser & Nitko, 1971; Hambleton & Novick, 1973; Livingston, 1972; Millman, 1974). A few examples will serve to illustrate the nature and scope of these efforts.

The two questions about number of items and cutoff score can be incorporated into a single hypothesis, amenable to testing within the framework of decision theory and sequential analysis (Glaser & Nitko, 1971; Lindgren & McElrath, 1969; Wald, 1947). Specifically, we wish to test the hypothesis that the examinee has achieved the required level of mastery in the content domain or instructional objective sampled by the test items. Sequential analysis consists in taking observations one at a time and deciding after each observation whether to: (1) accept the hypothesis; (2) reject the hypothesis; or (3) make additional observations. Thus the number of observations (in this case, number of items) needed to reach a reliable conclusion is itself determined during the process of testing. Rather than being presented with a fixed, predetermined number of items, the examinee continues taking the test until a mastery or nonmastery decision is reached. At that point, testing is discontinued and the student is either directed to the next instructional level or returned to the nonmastered level for further study. With the computer facilities described in an earlier chapter, such sequential decision pro-

9 For fuller discussion of special statistical procedures required for the construction and evaluation of criterion-referenced tests, see Glaser and Nitko (1971), Hambleton and Novick (1973), Millman (1974), and Popham and Husek (1969). A set of tables for determining the minimum number of items required for establishing mastery at specified levels is provided by Millman (1972, 1973).

cedures are feasible and can reduce total testing time while yielding reliable estimates of mastery (Glaser & Nitko, 1971).
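In outline, such a sequential rule is a loop that makes a three-way choice after every item. The sketch below is a simplified illustration of the logic only, not one of the cited procedures; the thresholds, the minimum and maximum test lengths, and the item-presenting callable are all invented for the example:

```python
def sequential_mastery_decision(present_item, accept_at=0.85, reject_at=0.50,
                                min_items=5, max_items=30):
    """Administer items one at a time; after each, accept mastery, reject it,
    or continue testing. `present_item()` returns True for a correct answer."""
    correct = 0
    for n in range(1, max_items + 1):
        if present_item():
            correct += 1
        if n >= min_items:
            proportion = correct / n
            if proportion >= accept_at:
                return "mastery", n       # route to the next instructional level
            if proportion <= reject_at:
                return "nonmastery", n    # return for further study
    return "nonmastery", max_items        # conservative call if still undecided

# Example: an examinee who answers about 90 percent of items correctly.
import random
print(sequential_mastery_decision(lambda: random.random() < 0.9))
```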

Some investigators have been exploring the use of Bayesian estimation techniques, which lend themselves well to the kind of decisions required by mastery testing. Because of the large number of specific instructional objectives to be tested, criterion-referenced tests typically provide only a small number of items for each objective. To supplement this limited information, procedures have been developed for incorporating collateral data from the student's previous performance history, as well as from the test results of other students (Ferguson & Novick, 1973; Hambleton & Novick, 1973).

When flexible, individually tailored procedures are impracticable, more traditional techniques can be utilized to assess the reliability of a given test. For example, mastery decisions reached at a prerequisite instructional level can be checked against performance at the next instructional level. Is there a sizeable proportion of students who reached or exceeded the cutoff score on the mastery test at the lower level and failed to achieve mastery at the next level within a reasonable period of instructional time? Does an analysis of their difficulties suggest that they had not truly mastered the prerequisite skills? If so, these findings would strongly suggest that the mastery test was unreliable. Either the addition of more items or the establishment of a higher cutoff score would seem to be indicated. Another procedure for determining the reliability of a mastery test is to administer two parallel forms to the same individuals and note the percentage of persons for whom the same decision (mastery or nonmastery) is reached on both forms (Hambleton & Novick, 1973).
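The parallel-form check reduces to a percentage-agreement count. A minimal sketch (the function name, scores, and cutoff are invented):

```python
def decision_agreement(form_a, form_b, cutoff):
    """Percentage of examinees given the same mastery/nonmastery decision
    by two parallel forms of a criterion-referenced test."""
    same = sum((a >= cutoff) == (b >= cutoff) for a, b in zip(form_a, form_b))
    return 100.0 * same / len(form_a)

print(decision_agreement([8, 5, 9, 4], [7, 6, 9, 3], cutoff=6))  # 75.0
```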

In the development of several criterion-referenced tests, Educational Testing Service has followed an empirical procedure to set standards of mastery. This procedure involves administering the test in classes one grade below and one grade above the grade where the particular concept or skill is taught. The dichotomization can be further refined by using teacher judgments to exclude any cases in the lower grade known to have mastered the concept or skill and any cases in the higher grade who have demonstrably failed to master it. A cutting score, in terms of number or percentage of correct items, is then selected that best discriminates between the two groups.
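The final step of such a procedure, picking the score that best separates the two grade groups, can be sketched as a search over candidate cutoffs. This is an illustration of the idea only, not ETS's actual procedure; the function name and score values are invented:

```python
def best_cutting_score(lower_grade, upper_grade):
    """Choose the cutting score that misclassifies the fewest students, when
    the upper grade is presumed to have mastered the skill and the lower not."""
    def misclassified(c):
        false_masters = sum(score >= c for score in lower_grade)
        false_nonmasters = sum(score < c for score in upper_grade)
        return false_masters + false_nonmasters
    candidates = range(min(lower_grade), max(upper_grade) + 2)
    return min(candidates, key=misclassified)

print(best_cutting_score([2, 3, 4, 5], [6, 7, 8, 9]))  # 6
```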

All statistical procedures for use with criterion-referenced tests are in an exploratory stage. Much remains to be done, in both theoretical development and empirical tryouts, before the most effective methodology for different testing situations can be formulated.


CHAPTER 6

Validity: Basic Concepts

THE VALIDITY of a test concerns what the test measures and how well it does so. In this connection, we should guard against accepting the test name as an index of what the test measures. Test names provide short, convenient labels for identification purposes. Most test names are far too broad and vague to furnish meaningful clues to the behavior area covered, although increasing efforts are being made to use more specific and operationally definable test names. The trait measured by a given test can be defined only through an examination of the objective sources of information and empirical operations utilized in establishing its validity (Anastasi, 1950). Moreover, the validity of a test cannot be reported in general terms. No test can be said to have "high" or "low" validity in the abstract. Its validity must be determined with reference to the particular use for which the test is being considered.

Fundamentally, all procedures for determining test validity are concerned with the relationships between performance on the test and other independently observable facts about the behavior characteristics under consideration. The specific methods employed for investigating these relationships are numerous and have been described by various names. In the Standards for Educational and Psychological Tests (1974), these procedures are classified under three principal categories: content, criterion-related, and construct validity. Each of these types of validation procedures will be considered in one of the following sections, and the relations among them will be examined in a concluding section. Techniques for analyzing and interpreting validity data with reference to practical decisions will be discussed in Chapter 7.

CONTENT VALIDITY

NATURE. Content validity involves essentially the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured. Such a validation procedure is commonly used in evaluating achievement tests. This type of test is designed to measure how well the individual has mastered a specific skill or course of study. It might thus appear that mere inspection of the content of the test should suffice to establish its validity for such a purpose. A test of multiplication, spelling, or bookkeeping would seem to be valid by definition if it consists of multiplication, spelling, or bookkeeping items, respectively.

The solution, however, is not so simple as it appears to be.1 One difficulty is that of adequately sampling the item universe. The behavior domain to be tested must be systematically analyzed to make certain that all major aspects are covered by the test items, and in the correct proportions. For example, a test can easily become overloaded with those aspects of the field that lend themselves more readily to the preparation of objective items. The domain under consideration should be fully described in advance, rather than being defined after the test has been prepared. A well-constructed achievement test should cover the objectives of instruction, not just its subject matter. Content must therefore be broadly defined to include major objectives, such as the application of principles and the interpretation of data, as well as factual knowledge. Moreover, content validity depends on the relevance of the individual's test responses to the behavior area under consideration, rather than on the apparent relevance of item content. Mere inspection of the test may fail to reveal the processes actually used by examinees in taking the test.

It is also important to guard against any tendency to overgeneralize regarding the domain sampled by the test. For instance, a multiple-choice spelling test may measure the ability to recognize correctly and incorrectly spelled words. But it cannot be assumed that such a test also measures ability to spell correctly from dictation, frequency of misspellings in written compositions, and other aspects of spelling ability (Ahlstrom, 1964; Knoell & Harris, 1952). Still another difficulty arises from the possible inclusion of irrelevant factors in the test scores. For example, a test designed to measure proficiency in such areas as mathematics or mechanics may be unduly influenced by the ability to understand verbal directions or by speed of performing simple, routine tasks.

1 Further discussions of content validity from several angles can be found in Ebel (1956), Huddleston (1956), and Lennon (1956).

SPECIFIC PROCEDURES. Content validity is built into a test from the outset through the choice of appropriate items. For educational tests, the preparation of items is preceded by a thorough and systematic examination of relevant course syllabi and textbooks, as well as by consultation


with subject-matter experts. On the basis of the information thus gathered, test specifications are drawn up for the item writers. These specifications should show the content areas or topics to be covered, the instructional objectives or processes to be tested, and the relative importance of individual topics and processes. On this basis, the number of items of each kind to be prepared on each topic can be established. A convenient way to set up such specifications is in terms of a two-way table, with processes across the top and topics in the left-hand column (see Ch. 14). Not all cells in such a table, of course, need to have items, since certain processes may be unsuitable or irrelevant for certain topics. It might be added that such a specification table will also prove helpful in the preparation of teacher-made examinations for classroom use in any subject.
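Such a two-way specification table is straightforward to represent in code. The sketch below uses invented topics, processes, and item counts purely for illustration:

```python
# A two-way test blueprint: content topics down the side, instructional
# processes across the top, with the number of items planned for each cell.
blueprint = {
    "Fractions":   {"Knowledge": 4, "Comprehension": 3, "Application": 5},
    "Decimals":    {"Knowledge": 3, "Comprehension": 4, "Application": 4},
    "Percentages": {"Knowledge": 2, "Comprehension": 3, "Application": 6},
}

total_items = sum(sum(cells.values()) for cells in blueprint.values())
print(total_items)  # 34 items to be written, allocated by topic and process
```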

In listing objectives to be covered in an educational achievement test, the test constructor can be guided by the extensive survey of educational objectives given in the Taxonomy of Educational Objectives (Bloom et al., 1956; Krathwohl et al., 1964). Prepared by a group of specialists in educational measurement, this handbook also provides examples of many types of items designed to test each objective. Two volumes are available, covering cognitive and affective domains, respectively. The major categories given in the cognitive domain include knowledge (in the sense of remembered facts, terms, methods, principles, etc.), comprehension, application, analysis, synthesis, and evaluation. The classification of affective objectives, concerned with the modification of attitudes, interests, values, and appreciation, includes five major categories: receiving, responding, valuing, organization, and characterization.

The discussion of content validity in the manual of an achievement test

should include information on the content areas and the skills or objectives covered by the test, with some indication of the number of items in each category. In addition, the procedures followed in selecting categories and classifying items should be described. If subject-matter experts participated in the test-construction process, their number and professional qualifications should be stated. If they served as judges in classifying items, the directions they were given should be reported, as well as the extent of agreement among judges. Because curricula and course content change over time, it is particularly desirable to give the dates when subject-matter experts were consulted. Information should likewise be provided about number and nature of course syllabi and textbooks surveyed, including publication dates.

A number of empirical procedures may also be followed in order to supplement the content validation of an achievement test. Both total scores and performance on individual items can be checked for grade progress. In general, those items are retained that show the largest gains in the percentages of children passing them from the lower to the upper

[Figure 14 here: a portion of an item-classification table for 30 Reading items, showing each item's classification by type of material (narrative, humanities, science, social studies) and learning skill, together with the percentage of children answering it correctly in grades 7, 8, and 9.]

grades. Figure 14 shows a portion of a table from the manual of the Sequential Tests of Educational Progress-Series II (STEP). For every item in each test in this achievement battery, the information provided includes its classification with regard to learning skill and type of material, as well as the percentage of children in the normative sample who chose the right answer to the item in each of the grades for which that level of the test is designed. The 30 items included in Figure 14 represent one part of the Reading test for Level 3, which covers grades 7 to 9.

Other supplementary procedures that may be employed, when appropriate, include analyses of types of errors commonly made on a test and observation of the work methods employed by examinees. The latter can be done by testing students individually with instructions to "think aloud" while solving each problem. The contribution of speed can be checked by noting how many persons fail to finish the test or by one of the more refined methods discussed in Chapter 5. To detect the possible irrelevant influence of ability to read instructions on test performance, scores on the test can be correlated with scores on a reading comprehension test. On the other hand, if the test is designed to measure reading comprehension, giving the questions without the reading passage on which they are based will show how many could be answered simply from the examinees' prior information or other irrelevant cues.


APPLICATIONS. Especially when bolstered by such empirical checks as those illustrated above, content validity provides an adequate technique for evaluating achievement tests. It permits us to answer two questions that are basic to the validity of an achievement test: (1) Does the test cover a representative sample of the specified skills and knowledge? (2) Is test performance reasonably free from the influence of irrelevant variables?

Content validity is particularly appropriate for the criterion-referenced tests described in Chapter 4. Because performance on these tests is interpreted in terms of content meaning, it is obvious that content validity is a prime requirement for their effective use. Content validation is also applicable to certain occupational tests designed for employee selection and classification, to be discussed in Chapter 15. This type of validation is suitable when the test is an actual job sample or otherwise calls for the same skills and knowledge required on the job. In such cases, a thorough job analysis should be carried out in order to demonstrate the close resemblance between the job activities and the test.

For aptitude and personality tests, on the other hand, content validity is usually inappropriate and may, in fact, be misleading. Although considerations of relevance and effectiveness of content must obviously enter into the initial stages of constructing any test, eventual validation of aptitude or personality tests requires empirical verification by the procedures to be described in the following sections. These tests bear less intrinsic resemblance to the behavior domain they are trying to sample than do achievement tests. Consequently, the content of aptitude and personality tests can do little more than reveal the hypotheses that led the test constructor to choose a certain type of content for measuring a specified trait. Such hypotheses need to be empirically confirmed to establish the validity of the test.

Unlike achievement tests, aptitude and personality tests are not based on a specified course of instruction or uniform set of prior experiences from which test content can be drawn. Hence, in the latter tests, individuals are likely to vary more in the work methods or psychological processes employed in responding to the same test items. The identical test might thus measure different functions in different persons. Under these conditions, it would be virtually impossible to determine the psychological functions measured by the test from an inspection of its content. For example, college graduates might solve a problem in verbal or mathematical terms, while a mechanic would arrive at the same solution in terms of spatial visualization. Or a test measuring arithmetic reasoning among high school freshmen might measure only individual differences in speed of computation when given to college students. A specific illustration of the dangers of relying on content analysis of aptitude tests is provided by a study conducted with a digit-symbol substitution test (Burik, 1950). This test, generally regarded as a typical "code-learning" test, was found to measure chiefly motor speed in a group of high school students.

FACE VALIDITY. Content validity should not be confused with face validity. The latter is not validity in the technical sense; it refers, not to what the test actually measures, but to what it appears superficially to measure. Face validity pertains to whether the test "looks valid" to the examinees who take it, the administrative personnel who decide on its use, and other technically untrained observers. Fundamentally, the question of face validity concerns rapport and public relations. Although common usage of the term validity in this connection may make for confusion, face validity itself is a desirable feature of tests. For example, when tests originally designed for children and developed within a classroom setting were first extended for adult use, they frequently met with resistance and criticism because of their lack of face validity. Certainly if test content appears irrelevant, inappropriate, silly, or childish, the result will be poor cooperation, regardless of the actual validity of the


test. Especially in adult testing, it is not sufficient for a test to be objectively valid. It also needs face validity to function effectively in practical situations.

Face validity can often be improved by merely reformulating test items in terms that appear relevant and plausible in the particular setting in which they will be used. For example, if a test of simple arithmetic reasoning is constructed for use with machinists, the items should be worded in terms of machine operations rather than in terms of "how many oranges can be purchased for 36 cents" or other traditional schoolbook problems. Similarly, an arithmetic test for naval personnel can be expressed in naval terminology, without necessarily altering the functions measured. To be sure, face validity should never be regarded as a substitute for objectively determined validity. It cannot be assumed that improving the face validity of a test will improve its objective validity. Nor can it be assumed that when a test is modified so as to increase its face validity, its objective validity remains unaltered. The validity of the test in its final form will always need to be directly checked.

CRITERION-RELATED VALIDITY

Criterion-related validity indicates the effectiveness of a test in predicting an individual's behavior in specified situations. For this purpose, performance on the test is checked against a criterion, i.e., a direct and independent measure of that which the test is designed to predict. Thus, for a mechanical aptitude test, the criterion might be subsequent job performance as a machinist; for a scholastic aptitude test, it might be college grades; and for a neuroticism test, it might be associates' ratings or other available information on the subjects' behavior in various life situations.

CONCURRENT AND PREDICTIVE VALIDITY. The criterion measure against which test scores are validated may be obtained at approximately the same time as the test scores or after a stated interval. The APA test standards (1974) differentiate between concurrent and predictive validity on the basis of these time relations between criterion and test. The term "prediction" can be used in the broader sense, to refer to prediction from the test to any criterion situation, or in the more limited sense of prediction over a time interval. It is in the latter sense that it is used in the expression "predictive validity." The information provided by predictive validity is most relevant to tests used in the selection and classification of personnel. Hiring job applicants, selecting students for admission to college or professional schools, and assigning military personnel to occupational training programs represent examples of the sort of decisions requiring a knowledge of the predictive validity of tests. Other examples include the use of tests to screen out applicants likely to develop emotional disorders in stressful environments and the use of tests to identify psychiatric patients most likely to benefit from a particular therapy.

In a number of instances, concurrent validity is found merely as a substitute for predictive validity. It is frequently impracticable to extend validation procedures over the time required for predictive validity or to obtain a suitable preselection sample for testing purposes. As a compromise solution, therefore, tests are administered to a group on whom criterion data are already available. Thus, the test scores of college students may be compared with their cumulative grade-point average at the time of testing, or those of employees compared with their current job success.

For certain uses of psychological tests, on the other hand, concurrent validity is the most appropriate type and can be justified in its own right. The logical distinction between predictive and concurrent validity is based, not on time, but on the objectives of testing. Concurrent validity is relevant to tests employed for diagnosis of existing status, rather than prediction of future outcomes. The difference can be illustrated by asking: "Is Smith neurotic?" (concurrent validity) and "Is Smith likely to become neurotic?" (predictive validity).

Because the criterion for concurrent validity is always available at the time of testing, we might ask what function is served by the test in such situations. Basically, such tests provide a simpler, quicker, or less expensive substitute for the criterion data. For example, if the criterion consists of continuous observation of a patient during a two-week hospitalization period, a test that could sort out normals from neurotic and doubtful cases would appreciably reduce the number of persons requiring such extensive observation.

CRITERION CONTAMINATION. An essential precaution in finding the validity of a test is to make certain that the test scores do not themselves influence any individual's criterion status. For example, if a college instructor or a foreman in an industrial plant knows that a particular individual scored very poorly on an aptitude test, such knowledge might influence the grade given to the student or the rating assigned to the worker. Or a high-scoring person might be given the benefit of the doubt when academic grades or on-the-job ratings are being prepared. Such influences would obviously raise the correlation between test scores and criterion in a manner that is entirely spurious or artificial.

This possible source of error in test validation is known as criterion contamina-


tion, since the criterion ratings become "contaminated" by the rater's knowledge of the test scores. To prevent the operation of such an error, it is absolutely essential that no person who participates in the assignment of criterion ratings have any knowledge of the examinees' test scores. For this reason, test scores employed in "testing the test" must be kept strictly confidential. It is sometimes difficult to convince teachers, employers, military officers, and other line personnel that such a precaution is essential. In their urgency to utilize all available information for practical decisions, such persons may fail to realize that the test scores must be put aside until the criterion data mature and validity can be checked.


MON CRiTERIA. Any test may be validated against as many criteriae are specific uses for it. Any method for assessing behavior in

tion could provide a criterion measure for some particular pur-he criteria employed in £ndif\g the validities reported in test

Is,however, fall into a few common categories. Among the criteriaequendyemployed in validating intelligence test~ is some index ofic ac ' t. It is for this reason that such tests have oftenore precisely described as measures of scholastic aptitude,. The

cindicesused as criterion measures include school grades, achieve-est scores, promotion and graduation records, special honors and

as,and teachers' or instructors' ratings for "intelligence." Insofar asratings given within an acade~ic setting are likely to be heavily~dby the individual's scholastic performance, they may be properlyed with the criterion of academic achievement.

e various indices of academic achievement have provided criterionat all educational levels, from the primary grades to college anduateschool.Although employed principally in the validation of gen-intelligence tests, they have also served as criteria for certain'pIe-aptitude and personahty tests. In the validation of any of theses.oftests for use in the selection of college students, for example, a

:~on criterion is freshman grade-point average. This measure is the.age grade in all courses taken during the freshman year, each gradegweighted by the number of course points for which it w.a~~ceived.variant of the criterion of academic achievement frequenl:ly em-edwith out-of-school adults is the amount of education the individualpleted. It is expected that in general the more intelligent individualsinutl their education longer, while the less inte.lli ent drop out of01earlier. The assumption underlying this crite . that the educa-al ladder serves as a progressively selective nee, eliminating

A variant of the criterion of academic achievement frequently employed with out-of-school adults is the amount of education the individual completed. It is expected that in general the more intelligent individuals continue their education longer, while the less intelligent drop out of school earlier. The assumption underlying this criterion is that the educational ladder serves as a progressively selective influence, eliminating those incapable of continuing beyond each step. Although it is undoubtedly true that college graduates, for example, represent a more highly selected group than elementary school graduates, the relation between amount of education and scholastic aptitude is far from perfect. Especially at the higher educational levels, economic, social, motivational, and other nonintellectual factors may influence the continuation of the individual's education. Moreover, with such concurrent validation it is difficult to disentangle cause-and-effect relations. To what extent are the obtained differences in intelligence test scores simply the result of the varying amount of education? And to what extent could the test have predicted individual differences in subsequent educational progress? These questions can be answered only when the test is administered before the criterion data have matured, as in predictive validation.

In the development of special aptitude tests, a frequent type of criterion is based on performance in specialized training. For example, mechanical aptitude tests may be validated against final achievement in shop courses. Various business school courses, such as stenography, typing, or bookkeeping, provide criteria for aptitude tests in these areas. Similarly, performance in music or art schools has been employed in validating music or art aptitude tests. Several professional aptitude tests have been validated in terms of achievement in schools of law, medicine, dentistry, engineering, and other areas. In the case of custom-made tests designed for use within a specific testing program, training records are a frequent source of criterion data. An outstanding illustration is the validation of Air Force pilot selection tests against performance in basic flight training. Performance in training programs is also commonly used as a criterion for test validation in other military occupational specialties and in some industrial validation studies.

Among the specific indices of training performance employed for criterion purposes may be mentioned achievement tests administered on completion of training, formally assigned grades, instructors' ratings, and successful completion of training versus elimination from the program. Multiple aptitude batteries have often been checked against grades in specific high school or college courses, in order to determine their validity as differential predictors. For example, scores on a verbal comprehension test may be compared with grades in English courses, spatial visualization scores with geometry grades, and so forth.

In connection with the use of training records in general as criterion measures, a useful distinction is that between intermediate and ultimate criteria. In the development of an Air Force pilot-selection test or a medical aptitude test, for example, the ultimate criteria would be combat performance and eventual achievement as a practicing physician, respectively. Obviously it would require a long time for such criterion data to mature. It is doubtful, moreover, whether a truly ultimate criterion is ever obtained in actual practice. Finally, even were such an ultimate criterion available, it would probably be subject to many uncontrolled factors that would render it relatively useless. For example, it would be difficult to evaluate the relative degree of success of physicians practicing in different specialties and in different parts of the country. For these reasons, such intermediate criteria as performance records at some stage of training are frequently employed as criterion measures.

For many purposes, the most satisfactory type of criterion measure is that based on follow-up records of actual job performance. This criterion has been used to some extent in the validation of general intelligence as well as personality tests, and to a larger extent in the validation of special aptitude tests. It is a common criterion in the validation of custom-made tests for specific jobs. The "jobs" in question may vary widely in both level and kind, including work in business, industry, the professions, and the armed services. Most measures of job performance, although probably not representing ultimate criteria, at least provide good intermediate criteria for many testing purposes. In this respect they are to be preferred to training records. On the other hand, the measurement of job performance does not permit as much uniformity of conditions as is possible during training. Moreover, since it usually involves a longer follow-up, the criterion of job performance is likely to entail a loss in the number of available subjects. Because of the variation in the nature of nominally similar jobs in different organizations, test manuals reporting validity data against job criteria should describe not only the specific criterion measures employed but also the job duties performed by the workers.

Validation by the method of contrasted groups generally involves a composite criterion that reflects the cumulative and uncontrolled selective influences of everyday life. This criterion is ultimately based on survival within a particular group versus elimination therefrom. For example, in the validation of an intelligence test, the scores obtained by institutionalized mentally retarded children may be compared with those obtained by schoolchildren of the same age. In this case, the multiplicity of factors determining commitment to an institution for the mentally retarded constitutes the criterion. Similarly, the validity of a musical aptitude or a mechanical aptitude test may be checked by comparing the scores obtained by students enrolled in a music school or an engineering school, respectively, with the scores of unselected high school or college students.

To be sure, contrasted groups can be selected on the basis of any criterion, such as school grades, ratings, or job performance, by simply choosing the extremes of the distribution of criterion measures. The contrasted groups included in the present category, however, are distinct groups that have gradually become differentiated through the operation of the multiple demands of daily living. The criterion under consideration is thus more complex and less clearly definable than those previously discussed.

The method of contrasted groups is used quite commonly in the validation of personality tests. Thus, in validating a test of social traits, the test performance of salesmen or executives, on the one hand, may be compared with that of clerks or engineers, on the other. The assumption underlying such a procedure is that, with reference to many social traits, individuals who have entered and remained in such occupations as selling or executive work will as a group excel persons in such fields as clerical work or engineering. Similarly, college students who have engaged in many extracurricular activities may be compared with those who have participated in none during a comparable period of college attendance. Occupational groups have frequently been used in the development and validation of interest tests, such as the Strong Vocational Interest Blank, as well as in the preparation of attitude scales. Other groups sometimes employed in the validation of attitude scales include political, religious, geographical, or other special groups generally known to represent distinctly different points of view on certain issues.

In the development of certain personality tests, psychiatric diagnosis is used both as a basis for the selection of items and as evidence of test validity. Psychiatric diagnosis may serve as a satisfactory criterion provided that it is based on prolonged observation and detailed case history rather than on a cursory psychiatric interview or examination. In the latter case, there is no reason to expect the psychiatric diagnosis to be superior to the test score itself as an indication of the individual's emotional condition. Such a psychiatric diagnosis could not be regarded as a criterion measure, but rather as an indicator or predictor whose own validity would have to be determined.

Mention has already been made, in connection with other criterion categories, of certain types of ratings by schoolteachers, instructors in specialized courses, and job supervisors. To these can be added ratings by officers in military situations, ratings of students by school counselors, and ratings by co-workers, classmates, fellow club members, and other groups of associates. The ratings discussed earlier represented merely a subsidiary technique for obtaining information regarding such criteria as academic achievement, performance in specialized training, or job success. We are now considering the use of ratings as the very core of the criterion measure. Under these circumstances, the ratings themselves define the criterion. Moreover, such ratings are not restricted to the evaluation of specific achievement, but involve a personal judgment by an observer regarding any of the variety of traits that psychological tests attempt to measure. Thus, the subjects in the validation sample might be rated on such characteristics as dominance, mechanical ingenuity, originality, leadership, or honesty.

Ratings have been employed in the validation of almost every type of test. They are particularly useful in providing criteria for personality tests, since objective criteria are much more difficult to find in this area.


This is especially true of distinctly social traits, in which ratings based on personal contact may constitute the most logically defensible criterion. Although ratings may be subject to many judgmental errors, when obtained under carefully controlled conditions they represent a valuable source of criterion data. Techniques for improving the accuracy of ratings and for reducing common types of errors will be considered in Chapter 20.

A special hazard in this connection is known as criterion contamination, since the criterion ratings become "contaminated" by the rater's knowledge of the test scores. To prevent the operation of such an error, it is absolutely essential that no person who participates in the assignment of criterion ratings have any knowledge of the examinees' test scores. For this reason, test scores employed in "testing the test" must be kept strictly confidential. It is sometimes difficult to convince teachers, employers, military officers, and other line personnel that such a precaution is essential. In their urgency to utilize all available information for practical decisions, such persons may fail to realize that the test scores must be put aside until the criterion data mature and validity can be checked.

Finally, correlations between a new test and previously available tests are frequently cited as evidence of validity. When the new test is an abbreviated or simplified form of a currently available test, the latter can properly be regarded as a criterion measure. Thus, a paper-and-pencil test might be validated against a more elaborate and time-consuming performance test whose validity had previously been established. Or a group test might be validated against an individual test. The Stanford-Binet, for example, has repeatedly served as a criterion in validating group tests. In such a case, the new test may be regarded at best as a crude approximation of the earlier one. It should be noted that unless the new test represents a simpler or shorter substitute for the earlier test, the use of the latter as a criterion is indefensible.

SPECIFICITY OF CRITERIA. Criterion-related validity is most appropriate for local validation studies, in which the effectiveness of a test for a specific program is to be assessed. This is the approach followed, for example, when a given company wishes to evaluate a test for selecting applicants for one of its jobs, or when a given college wishes to determine how well an academic aptitude test can predict the course performance of its students. Criterion-related validity can be best characterized as the practical validity of a test in a specified situation. This type of validation represents applied research, as distinguished from basic research, and as such it provides results that are less generalizable than the results of other procedures.

That criterion-related validity may be quite specific has been demonstrated repeatedly. Figure 15 gives examples of the wide variation in the correlations of a single type of test with criteria of job proficiency. The first graph shows the distribution of 72 correlations found between intelligence test scores and measures of the job proficiency of general clerks; the second graph summarizes in similar fashion 191 correlations between finger dexterity tests and the job proficiency of benchworkers. Although in both instances the correlations tend to cluster in a particular range of validity, the variation among individual studies is considerable. The validity coefficient may be high and positive in one study and negligible or even substantially negative in another.

[Figure 15 consists of two frequency distributions of validity coefficients plotted on a scale from -1.00 to +1.00: 72 coefficients for general clerks on intelligence tests, and 191 coefficients for bench workers on finger dexterity tests, in both cases against job proficiency criteria.]

FIG. 15. Examples of Variation in Validity Coefficients of Given Tests for Particular Jobs. (Adapted from Ghiselli, 1966, p. 29.)

Similar variation with regard to the prediction of course grades is illustrated in Figure 16. This figure shows the distributions of correlations obtained between grades in mathematics and scores on each of the subtests of the Differential Aptitude Tests. Thus, for the Numerical Ability test (NA), the largest number of validity coefficients among boys fell between .50 and .59; but the correlations obtained in different mathematics courses and in different schools ranged from .22 to .75. Equally wide differences were found with the other subtests and, it might be added, with grades in other subjects not included in Figure 16.

FIG. 16. Graphic Summary of Validity Coefficients of the Differential Aptitude Tests (Forms S and T) for Course Grades in Mathematics. The numbers in each column indicate the number of coefficients in the range given at the left. (From Fifth Edition Manual, p. 82. Reproduced by permission. Copyright © 1975 by The Psychological Corporation, New York, N.Y. All rights reserved.)


Some of the variation in validity coefficients against job criteria reported in Figure 15 results from differences among the specific tests employed in different studies to measure intelligence or finger dexterity. In the results of both Figures 15 and 16, moreover, some variation is attributable to differences in homogeneity and level of the groups tested. The range of validity coefficients found, however, is far wider than could be explained in these terms. Differences in the criteria themselves are undoubtedly a major reason for the variation observed among validity coefficients. Thus, the duties of office clerks or benchworkers may differ


widely among companies or among departments in the same company. Similarly, courses in the same subject may differ in content, teaching method, instructor characteristics, bases for evaluating student achievement, and numerous other ways. Consequently, what appears to be the same criterion may represent a very different combination of traits in different situations.

Criteria may also vary over time in the same situation. For example, the validity coefficient of a test against job training criteria often differs from its validity against job performance criteria (Ghiselli, 1966). There is evidence that the traits required for successful performance of a given job, or even of a single task, vary with the amount of training or job experience of the individual (Fleishman & Fruchter, 1960; Ghiselli & Haire, 1960). There is also evidence that job criteria themselves may change over time, for such reasons as the changing nature of jobs, shifts in organizational structure, individual advancement in rank, and other temporal conditions (MacKinney, 1967; Prien, 1966). It is well known, of course, that educational curricula and course content change over time. In other words, the criteria most commonly used in validating intelligence and aptitude tests, namely job performance and educational achievement, are dynamic rather than static. It follows that criterion-related validity is itself subject to temporal changes.

SYNTHETIC VALIDITY. Criteria not only differ across situations and over time, but they are also likely to be complex (see, e.g., Richards, Taylor, Price, & Jacobsen, 1965). Success on a job, in school, or in other activities of daily life depends not on one trait but on many traits. Hence, practical criteria are likely to be multifaceted. Several different indicators or measures of job proficiency or academic achievement could thus be used in validating a test. Since these measures may tap different traits or combinations of traits, it is not surprising to find that they yield different validity coefficients for any given test. When different criterion measures are obtained for the same individuals, their intercorrelations are often quite low. For instance, accident records or absenteeism may show virtually no relation to productivity or error data for the same job (Seashore, Indik, & Georgopoulos, 1960). These differences, of course, are reflected in the validity coefficients of any given test against different criterion measures. Thus, a test may fail to correlate significantly with supervisors' ratings of job proficiency and yet show appreciable validity in predicting who will resign and who will be promoted at a later date (Albright, Smith, & Glennon, 1959).

Because of criterion complexity, validating a test against a composite criterion of job proficiency, academic achievement, or other similar accomplishments may be of questionable value and is certainly of limited generality. If different subcriteria are relatively independent, a more effective procedure is to validate each test against that aspect of the criterion it is best designed to measure. An analysis of these more specific relationships lends meaning to the test scores in terms of the multiple dimensions of criterion behavior (Dunnette, 1963; Ebel, 1961; S. R. Wallace, 1965). For example, one test might prove to be a valid predictor of a clerk's perceptual speed and accuracy in handling detail work, another of his ability to spell correctly, and still another of his ability to resist distraction.

If, now, we return to the practical question of evaluating a test or combination of tests for effectiveness in predicting a complex criterion such as success on a given job, we are faced with the necessity of conducting a separate validation study in each local situation and repeating it at frequent intervals. This is admittedly a desirable procedure and one that is often recommended in test manuals. In many situations, however, it is not feasible to follow this procedure because of well-nigh insurmountable practical obstacles. Even if adequately trained personnel are available to carry out the necessary research, most criterion-related validity studies conducted in industry are likely to prove unsatisfactory for


at least three reasons. First, it is difficult to obtain dependable and sufficiently comprehensive criterion data. Second, the number of employees engaged in the same or closely similar jobs within a company is often too small for significant statistical results. Third, correlations will very probably be lowered by restriction of range through preselection, since only those persons actually hired can be followed up on the job.

For all the reasons discussed above, personnel psychologists have shown increasing interest in a technique known as synthetic validity. First introduced by Lawshe (1952), the concept of synthetic validity has been defined by Balma (1959, p. 395) as "the inferring of validity in a specific situation from a systematic analysis of job elements, a determination of test validity for these elements, and a combination of elemental validities into a whole." Several procedures have been developed for gathering the needed empirical data and for combining these data to obtain an estimate of synthetic validity for a particular complex criterion (see, e.g., Guion, 1965; Lawshe & Balma, 1966, Ch. 14; McCormick, 1959; Primoff, 1959, 1975). Essentially, the process involves three steps: (1) detailed job analysis to identify the job elements and their relative weights; (2) analysis and empirical study of each test to determine the extent to which it measures proficiency in performing each of these job elements; and (3) finding the validity of each test for the given job synthetically from the weights of these elements in the job and in the test.

In a long-term research program conducted with U.S. Civil Service job applicants, Primoff (1975) has developed the J-coefficient (for "job-coefficient") as an index of synthetic validity. Among the special features of this procedure are the listing of job elements expressed in terms of worker behavior and the rating of the relative importance of these elements in each job by supervisors and job incumbents. Correlations between test scores and self-ratings on job elements are found in total applicant samples (not subject to the preselection of employed workers). Various checking procedures are followed to ensure stability of correlations and weights derived from self-ratings, as well as adequacy of criterion coverage. For these purposes, data are obtained from different samples of applicant populations. The final estimate of correlation between test and job performance is found from the correlation of each job element with the particular job and the weight of the same element in the given test.¹ There is evidence that the J-coefficient has proved helpful in improving the employment opportunities of minority applicants and persons with little formal education, because of its concentration on job-relevant skills (Primoff, 1975).

¹ The statistical procedures are essentially an adaptation of multiple regression equations, to be discussed in Chapter 7. For each job element, its correlation with the job is multiplied by its weight in the test, and these products are added across all appropriate job elements.

A different application of synthetic validity, especially suitable for use in a small company with few employees in each type of job, is described by Guion (1965). The study was carried out in a company having 48 employees, each of whom was doing a job that was appreciably different from the jobs of the other employees. Detailed job analyses nevertheless revealed seven job elements common to many jobs. Each employee was rated on the job elements appropriate to his job; and these ratings were then checked against the employees' scores on each test in a trial battery. On the basis of these analyses, a separate battery could be "synthesized" for each job by combining the two best tests for each of the job elements demanded by that job. When the batteries thus assembled were applied to a subsequently hired group of 13 employees, the results showed considerable promise. Because of the small number of cases, these results are only suggestive. The study was conducted primarily to demonstrate a model for the utilization of synthetic validity.

The two examples of synthetic validity were cited only to illustrate the scope of possible applications of these techniques. For a description of the actual procedures followed, the reader is referred to the original sources. In summary, the concept of synthetic validity can be implemented in different ways to fit the practical exigencies of different situations. It offers a promising approach to the problem of complex and changing criteria; and it permits the assembling of test batteries to fit the requirements of specific jobs and the determination of test validity in many contexts where adequate criterion-related validation studies are impracticable.
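The sum-of-products rule described in the footnote lends itself to a brief computational sketch. The fragment below is only an illustration of that logic, not Primoff's actual procedure; the job elements, correlations, and weights are all invented.

    # J-coefficient logic as described in the footnote: for each job element,
    # its correlation with the job is multiplied by its weight in the test,
    # and the products are summed across elements. All values are hypothetical.
    job_elements = ["checking detail", "arithmetic computation", "filing"]

    # Correlation of each element with the job in question.
    r_element_job = {"checking detail": 0.60,
                     "arithmetic computation": 0.45,
                     "filing": 0.20}

    # Weight of each element in the test under evaluation.
    w_element_test = {"checking detail": 0.50,
                      "arithmetic computation": 0.35,
                      "filing": 0.10}

    def j_coefficient(elements, r_job, w_test):
        """Sum of (element-job correlation) x (element weight in test)."""
        return sum(r_job[e] * w_test[e] for e in elements)

    print(j_coefficient(job_elements, r_element_job, w_element_test))  # about 0.48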

CONSTRUCT VALIDITY

The construct validity of a test is the extent to which the test may be said to measure a theoretical construct or trait. Examples of such constructs are intelligence, mechanical comprehension, verbal fluency, speed of walking, neuroticism, and anxiety. Focusing on a broader, more enduring, and more abstract kind of behavioral description than the previously discussed types of validity, construct validation requires the gradual accumulation of information from a variety of sources. Any data throwing light on the nature of the trait under consideration and the conditions affecting its development and manifestations are grist for this validity mill. Illustrations of specific techniques suitable for construct validation will be considered below.


";: DEVELOPMENTAL CHANGES.A major criterion employed in the validation',',ofa number of intelligence tests is age d.ifferentiation. Su.ch tests a.~the,Stanford.Binet and most preschool tests arc checked agamst chronolog-':ical age to determine whether the scores show a pr~gressive i~crease

, .with advancing age. Since abilities are expected to mcre~se \~lth age, ,duringchildhood, it is argued that the test scores should likewise show

, such an increase, if the test is valid. The very concept of an age scale,:'0£ intelligence, as initiated by Binet, is based on the assumption that "in-~telligence"increases with age, at least until maturit,Y- . .

The criterion of age differentiation, of course, IS mapp1icable to any,functions that do not exhibit clear-cut and consistent age changes. In thearea of personality measurement, for example, it ~as found li~ited u~e.Moreover, it should be noted that, even when apphcable, age differentia-tion is a necessary but not a sufficient condition for validity. Thus, if the

. test scores fail t~ improve with age, such a finding probably indicates" that the test is not a valid measure of the abilities it was designed to."sample. On the other hand, to prove that a test measures something

that illcr,eases with age does not define the area covered by the test veryprecisely. A measure of height or weight would al~o show regul~r ag~inc1'ements,although it would obviously not be deSignated as an mtelli-

'\ gencetest. . .'A final point should be emphasized reg~rding the. mterpretahon .of ~e

age criterion. A psychological test validated a?amst such a cnteno~measures behavior characteristics that increase w1th age under the condl'tions existing in the type of environment in which the test was stand-ardized. Because different cultures may stimulate and foster the develop-ment of dissimilar behavior characteristics, it cannot be assumed thatthe criterion of age differentiation is a universal one .. Lik~ all ~th~r

, criteria, it is circumscribed by the particular cultural settmg m whlCh It

is derived.Developmental analyses are also basic to the construct validation of

the JPiagetian ordinal scales cited in Chapter 4. A fundamental assump,tion of such scales is 1thesequential patterning of development, such thatthe attainment of earlier stages in concept development is prerequisite tothe acquisition of later conceptual skills. T'here is thus. an ~ntrinsic h~er-archy in the content of these scales. The construct vahdahon of ~rdi~alscales should therefore include empirical data on the sequential 10-

variance of the successive steps. This involves checking the performanceof children at different levels in the development of any tested concept,such as conservation or object permanen,ce. Do children who demonstratemastery of the concept at a given level :also exhibit mastery at the ~owerlevels? Insofar as criterion-rt:ferenced tests are also frequently deSIgned
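A check of sequential invariance of the kind just described can be sketched as a search for Guttman-like response patterns. The pass-fail records below are invented; the check asks whether every child who passes at a given level also passes at all lower levels.

    # Invented pass (1) / fail (0) records for six children on a concept
    # tested at four ordered developmental levels (columns, easiest first).
    performance = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 1, 0],   # reversal: fails level 2 but passes level 3
        [1, 1, 0, 0],
        [1, 0, 0, 0],
    ]

    def is_sequentially_invariant(row):
        """True if passes form an unbroken run from the lowest level upward."""
        seen_failure = False
        for score in row:
            if score == 1 and seen_failure:
                return False          # mastery above a failed level
            if score == 0:
                seen_failure = True
        return True

    violations = [i for i, row in enumerate(performance)
                  if not is_sequentially_invariant(row)]
    print("children violating the hierarchy:", violations)  # [3]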

CORRELATIONS WITH OTHER TESTS. Correlations between a new test and similar earlier tests are sometimes cited as evidence that the new test measures approximately the same general area of behavior as other tests designated by the same name, such as "intelligence tests" or "mechanical aptitude tests." Unlike the correlations found in criterion-related validity, these correlations should be moderately high, but not too high. If the new test correlates too highly with an already available test, without such added advantages as brevity or ease of administration, then the new test represents needless duplication.

Correlations with other tests are employed in still another way: to demonstrate that the new test is relatively free from the influence of certain irrelevant factors. For example, a special aptitude test or a personality test should have a negligible correlation with tests of general intelligence or scholastic aptitude. Similarly, reading comprehension should not appreciably affect performance on such tests. Thus, correlations with tests of general intelligence, reading, or verbal comprehension are sometimes reported as indirect or negative evidence of validity. In these cases, high correlations would make the test suspect. Low correlations, however, would not in themselves insure validity. It will be noted that this use of correlations with other tests is similar to one of the supplementary techniques described under content validity.

FACTOR ANALYSIS. Of particular relevance to construct validity is factor analysis, a statistical procedure for the identification of psychological traits. Essentially, factor analysis is a refined technique for analyzing the interrelationships of behavior data. For example, if 20 tests have been given to 300 persons, the first step is to compute the correlations of each test with every other. An inspection of the resulting table of 190 correlations may itself reveal certain clusters among the tests, suggesting the location of common traits. Thus, if such tests as vocabulary, analogies, opposites, and sentence completion have high correlations with each other and low correlations with all other tests, we could tentatively infer the presence of a verbal comprehension factor. Because such an inspectional analysis of a correlation table is difficult and uncertain, however, more precise statistical techniques have been developed to locate the common factors required to account for the obtained correlations. These techniques of factor analysis will be examined further in Chapter 13, together with multiple aptitude tests developed by means of factor analysis.


In the process of factor analysis, the number of variables or categories in terms of which each individual's performance can be described is reduced from the number of original tests to a relatively small number of factors, or common traits. In the example cited above, five or six factors might suffice to account for the intercorrelations among the 20 tests. Each individual might thus be described in terms of his scores in the five or six factors, rather than in terms of the original 20 scores. A major purpose of factor analysis is to simplify the description of behavior by reducing the number of categories from an initial multiplicity of test variables to a few common factors, or traits.

After the factors have been identified, they can be utilized in describing the factorial composition of a test. Each test can thus be characterized in terms of the major factors determining its scores, together with the weight or loading of each factor and the correlation of the test with each factor. Such a correlation is known as the factorial validity of the test. Thus, if the verbal comprehension factor has a weight of .66 in a vocabulary test, the factorial validity of this vocabulary test as a measure of the trait of verbal comprehension is .66. It should be noted that factorial validity is essentially the correlation of the test with whatever is common to a group of tests or other indices of behavior. The set of variables analyzed can, of course, include both test and nontest data. Ratings and other criterion measures can thus be utilized, along with other tests, to explore the factorial validity of a particular test and to define the common traits it measures.
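As a minimal numerical illustration of factorial validity (not of a complete factor analysis, which Chapter 13 takes up), the sketch below extracts the first principal component from a small invented correlation matrix for four verbal tests; each test's loading on that component serves as a rough stand-in for its factorial validity on the common trait.

    import numpy as np

    # Invented intercorrelations among four verbal tests:
    # vocabulary, analogies, opposites, sentence completion.
    R = np.array([[1.00, 0.55, 0.50, 0.45],
                  [0.55, 1.00, 0.48, 0.42],
                  [0.50, 0.48, 1.00, 0.40],
                  [0.45, 0.42, 0.40, 1.00]])

    # First principal component of the correlation matrix, used here as a
    # crude approximation to the common factor.
    eigenvalues, eigenvectors = np.linalg.eigh(R)   # ascending eigenvalues
    leading = eigenvectors[:, -1]
    loadings = np.abs(leading * np.sqrt(eigenvalues[-1]))

    tests = ["vocabulary", "analogies", "opposites", "sentence completion"]
    for name, loading in zip(tests, loadings):
        print(f"{name}: loading (factorial validity) = {loading:.2f}")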


INTERNAL CONSISTENCY. In the published descriptions of certain tests, especially in the area of personality, the statement is made that the test has been validated by the method of internal consistency. The essential characteristic of this method is that the criterion is none other than the total score on the test itself. Sometimes an adaptation of the contrasted group method is used, extreme groups being selected on the basis of the total test score. The performance of the upper criterion group on each test item is then compared with that of the lower criterion group. Items that fail to show a significantly greater proportion of "passes" in the upper than in the lower criterion group are considered invalid, and are either eliminated or revised. Correlational procedures may also be employed for this purpose. For example, the biserial correlation between "pass-fail" on each item and total test score can be computed. Only those items yielding significant item-test correlations would be retained. A test whose items were selected by this method can be said to show internal consistency, since each item differentiates in the same direction as the entire test.
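The two item-selection procedures just described, contrasted extreme groups and item-total correlation, can be sketched as follows. The responses are randomly generated stand-ins for real data, and a point-biserial coefficient is used in place of the biserial coefficient mentioned in the text.

    import numpy as np

    # Hypothetical 0/1 item responses: rows are examinees, columns are items.
    rng = np.random.default_rng(0)
    responses = rng.integers(0, 2, size=(100, 10))
    total = responses.sum(axis=1)

    # Contrasted groups: top and bottom 27% of total scores.
    order = np.argsort(total)
    n_extreme = int(0.27 * len(total))
    lower, upper = order[:n_extreme], order[-n_extreme:]

    for item in range(responses.shape[1]):
        p_upper = responses[upper, item].mean()   # proportion passing, upper group
        p_lower = responses[lower, item].mean()   # proportion passing, lower group
        r_item_total = np.corrcoef(responses[:, item], total)[0, 1]
        print(f"item {item}: upper-lower = {p_upper - p_lower:+.2f}, "
              f"item-total r = {r_item_total:+.2f}")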

Another application of the criterion of internal consistency involves the correlation of subtest scores with total score. Many intelligence tests, for instance, consist of separately administered subtests (such as vocabulary, arithmetic, picture completion, etc.) whose scores are combined in finding the total test score. In the construction of such tests, the scores on each subtest are often correlated with total score, and any subtest whose correlation with total score is too low is eliminated. The correlations of the remaining subtests with total score are then reported as evidence of the internal consistency of the entire instrument.

It is apparent that internal consistency correlations, whether based on items or subtests, are essentially measures of homogeneity. Because it helps to characterize the behavior domain or trait sampled by the test, the degree of homogeneity of a test has some relevance to its construct validity. Nevertheless, the contribution of internal consistency data to test validation is very limited. In the absence of data external to the test itself, little can be learned about what a test measures.

EFFECT OF EXPERIMENTAL VARIABLES ON TEST SCORES. A further source of data for construct validation is provided by experiments on the effect of selected variables on test scores. In checking the validity of a criterion-referenced test for use in an individualized instructional program, for example, one approach is through a comparison of pretest and posttest scores. The rationale of such a test calls for low scores on the pretest, administered before the relevant instruction, and high scores on the posttest. This relationship can also be checked for individual items in the test (Popham, 1971). Ideally, the largest proportion of examinees should fail an item on the pretest and pass it on the posttest. Items that are commonly failed on both tests are too difficult, and those passed on both tests too easy, for the purposes of such a test. If a sizeable proportion of examinees pass an item on the pretest and fail it on the posttest, there is obviously something wrong with the item, or the instruction, or both.
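The pretest-posttest rationale can be checked item by item with a simple tabulation along the following lines; the scores are invented. A high fail-then-pass proportion is the desired pattern, while a high pass-then-fail proportion flags a defective item or defective instruction.

    import numpy as np

    # Hypothetical 0/1 scores for 50 examinees on 5 items, before and
    # after the relevant instruction.
    rng = np.random.default_rng(1)
    pre = rng.integers(0, 2, size=(50, 5))
    post = rng.integers(0, 2, size=(50, 5))

    for item in range(pre.shape[1]):
        fail_pass = np.mean((pre[:, item] == 0) & (post[:, item] == 1))  # desired
        pass_fail = np.mean((pre[:, item] == 1) & (post[:, item] == 0))  # suspect
        too_hard = np.mean((pre[:, item] == 0) & (post[:, item] == 0))
        too_easy = np.mean((pre[:, item] == 1) & (post[:, item] == 1))
        print(f"item {item}: fail->pass {fail_pass:.2f}, pass->fail {pass_fail:.2f}, "
              f"too hard {too_hard:.2f}, too easy {too_easy:.2f}")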

A test designed to measure anxiety-proneness can be administered to subjects who are subsequently put through a situation designed to arouse anxiety, such as taking an examination under distracting and stressful conditions. The initial anxiety test scores can then be correlated with physiological and other indices of anxiety expression during and after the examination. A different hypothesis regarding an anxiety test could be evaluated by administering the test before and after an anxiety-arousing experience and seeing whether test scores rise significantly on the retest. Positive findings from such an experiment would indicate that the test scores reflect current anxiety level. In a similar way, experiments can be designed to test any other hypothesis regarding the trait measured by a given test.


TABLE 12
A Hypothetical Multitrait-Multimethod Matrix
(From Campbell & Fiske, 1959, p. 82.)

[The correlation matrix itself is not legibly reproduced here. In it, letters A, B, C refer to traits and subscripts 1, 2, 3 to methods. Validity coefficients (monotrait-heteromethod) form three diagonal sets of boldface values; reliability coefficients (monotrait-monomethod) appear in parentheses along the principal diagonal; solid triangles enclose heterotrait-monomethod correlations; broken triangles enclose heterotrait-heteromethod correlations.]

CONVERGENT AND DISCRIMINANT VALIDATION. In a thoughtful analysis of construct validation, D. T. Campbell (1960) points out that in order to demonstrate construct validity we must show not only that a test correlates highly with other variables with which it should theoretically correlate, but also that it does not correlate significantly with variables from which it should differ. In an earlier article, Campbell and Fiske (1959) described the former process as convergent validation and the latter as discriminant validation. Correlation of a mechanical aptitude test with subsequent grades in a shop course would be an example of convergent validation. For the same test, discriminant validity would be illustrated by a low and insignificant correlation with scores on a reading comprehension test, since reading ability is an irrelevant variable in a test designed to measure mechanical aptitude.

It will be recalled that the requirement of low correlation with irrelevant variables was discussed in connection with supplementary and precautionary procedures followed in content validation. Discriminant validation is also especially relevant to the validation of personality tests, in which irrelevant variables may affect scores in a variety of ways.

Campbell and Fiske (1959) proposed a systematic experimental design for the dual approach of convergent and discriminant validation, which they called the multitrait-multimethod matrix. Essentially, this procedure requires the assessment of two or more traits by two or more methods. A hypothetical example provided by Campbell and Fiske will serve to illustrate the procedure. Table 12 shows all possible correlations among the scores obtained when three traits are each measured by three methods. The three traits could represent three personality characteristics, such as (A) dominance, (B) sociability, and (C) achievement motivation. The three methods could be (1) a self-report inventory, (2) a projective technique, and (3) associates' ratings. Thus, A1 would indicate dominance scores on the self-report inventory, A2 dominance scores on the projective test, C3 associates' ratings on achievement motivation, and so forth.

The hypothetical correlations given in Table 12 include reliability coefficients (in parentheses, along the principal diagonal) and validity coefficients (in boldface, along three shorter diagonals). In these validity coefficients, the scores obtained for the same trait by different methods are correlated; each measure is thus being checked against other, independent measures of the same trait, as in the familiar validation procedure. The table also includes correlations between different traits measured by the same method (in solid triangles) and correlations between different traits measured by different methods (in broken triangles). For satisfactory construct validity, the validity coefficients should obviously be higher than the correlations between different traits measured by different methods; they should also be higher than the correlations between different traits measured by the same method. For example, the correlation between dominance scores from a self-report inventory and dominance scores from a projective test should be higher than the correlation between dominance and sociability scores from a self-report inventory. If the latter correlation, representing common method variance, were high, it might indicate, for example, that a person's scores on this inventory are unduly affected by some irrelevant common factor, such as ability to understand the questions or desire to make oneself appear in a favorable light on all traits.
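The comparisons just described can be made mechanical. The sketch below checks, for a single trait-method combination, whether a hypothetical validity (monotrait-heteromethod) coefficient exceeds the corresponding heterotrait correlations; the values are invented and are not those of Table 12.

    # Fragment of a multitrait-multimethod matrix with invented values.
    # Keys are pairs of (trait, method) labels; e.g., ("A", 1) might be
    # dominance measured by a self-report inventory.
    r = {
        (("A", 1), ("A", 2)): 0.57,   # monotrait-heteromethod (validity)
        (("A", 1), ("B", 1)): 0.35,   # heterotrait-monomethod
        (("A", 1), ("B", 2)): 0.22,   # heterotrait-heteromethod
    }

    validity = r[(("A", 1), ("A", 2))]
    same_method = r[(("A", 1), ("B", 1))]
    different_method = r[(("A", 1), ("B", 2))]

    # For satisfactory construct validity, the validity coefficient should
    # exceed both kinds of heterotrait correlations.
    print("validity > heterotrait-heteromethod:", validity > different_method)
    print("validity > heterotrait-monomethod:  ", validity > same_method)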

Fiske (1973) has added still another set of correlations that should be checked, especially in the construct validation of personality tests. These correlations involve the same trait measured by the same method, but with a different test. For example, two investigators may each prepare a self-report inventory designed to assess endurance. Yet the endurance scores obtained with the two inventories may show quite different patterns of correlations with measures of other personality traits. Under these conditions, it cannot be concluded that both inventories measure the same personality construct of endurance.




It might be noted that within the framework of the multitrait-multimethod matrix, reliability represents agreement between two measures of the same trait obtained through maximally similar methods, such as parallel forms of the same test; validity represents agreement between measures of the same trait obtained by maximally different methods, such as test scores and supervisor's ratings. Since similarity and difference of methods are matters of degree, reliability and validity can theoretically be regarded as falling along a single continuum. Ordinarily, however, the techniques actually employed to measure reliability and validity correspond to easily identifiable regions of this continuum.

We have considered several ways of asking, "How valid is this test?" To point up the distinctive features of the different types of validity, let us apply each in turn to a test consisting of 50 assorted arithmetic problems. Four ways in which this test might be employed, together with the type of validation procedure appropriate to each, are illustrated in Table 13. This example highlights the fact that the choice of validation procedure depends on the use to be made of the test scores. The same test, when employed for different purposes, should be validated in different ways. If an achievement test is used to predict subsequent performance at a higher educational level, as when selecting high school students for college admission, it needs to be evaluated against the criterion of subsequent college performance rather than in terms of its content validity.

TABLE 13
Validation of a Single Arithmetic Test for Different Purposes

Testing Purpose | Illustrative Question | Type of Validity
Achievement test in elementary school arithmetic | How much has Dick learned in the past? | Content
Aptitude test to predict performance in high school mathematics | How well will Jim learn in the future? | Criterion-related: predictive
Technique for diagnosing learning disabilities | Does Bill's performance show specific disabilities? | Criterion-related: concurrent
Measure of logical reasoning | How can we describe Henry's psychological functioning? | Construct

The examples given in Table 13 focus on the differences among the various types of validation procedures. Further consideration of these procedures, however, shows that content, criterion-related, and construct validity do not correspond to distinct or logically coordinate categories. On the contrary, construct validity is a comprehensive concept, which includes the other types. All the specific techniques for establishing content and criterion-related validity, discussed in earlier sections of this chapter, could have been listed again under construct validity. Comparing the test performance of contrasted groups, such as neurotics and normals, is one way of checking the construct validity of a test designed to measure emotional adjustment, anxiety, or other postulated traits. Comparing the test scores of institutionalized mental retardates with those of normal schoolchildren is one way to investigate the construct validity of an intelligence test. The correlations of a mechanical aptitude test with performance in shop courses and in a wide variety of jobs contribute to our understanding of the construct measured by the test. Validity against various practical criteria is commonly reported in test manuals to aid the potential user in understanding what a test measures. Although he may not be directly concerned with the prediction of any of the specific criteria employed, by examining such criteria the test user is able to build up a concept of the behavior domain sampled by the test.

Content validity likewise enters into both the construction and the subsequent evaluation of all tests. In assembling items for any new test, the test constructor is guided by hypotheses regarding the relations between the type of content he chooses and the behavior he wishes to measure. All the techniques of criterion-related validation, as well as the other techniques discussed under construct validation, represent ways of testing such hypotheses. As for the test user, he too relies in part on content validity in evaluating any test. For example, he may check the vocabulary in an emotional adjustment inventory to determine whether some of the words are too difficult for the persons he plans to test; he may conclude that the scores on a particular test depend too much on speed for his purposes; or he may notice that an intelligence test developed twenty years ago contains many obsolescent items unsuitable for use today. All these observations about content are relevant to the construct validity of a test. In fact, there is no information provided by any validation procedure that is not relevant to construct validity.

The term construct validity was officially introduced into the psychometrist's lexicon in 1954 in the Technical Recommendations for Psychological Tests and Diagnostic Techniques, which constituted the first edition of the current APA test Standards (1974). Although the validation


procedures subsumed under construct validity were not new at the time, the discussions of construct validation that followed served to make the implications of these procedures more explicit and to provide a systematic rationale for their use. Construct validation has focused attention on the role of psychological theory in test construction and on the need to formulate hypotheses that can be proved or disproved in the validation process. It is particularly appropriate in the evaluation of tests for use in research.

In practical contexts, construct validation is suitable for investigating the validity of the criterion measures used in traditional criterion-related test validation (see, e.g., James, 1973). Through an analysis of the correlations of different criterion measures with each other and with other relevant variables, and through factorial analyses of such data, one can learn more about the meaning of a particular criterion. In some instances, the results of such a study may lead to modification or replacement of the criterion chosen to validate a test. Under any circumstances, the results will enrich the interpretation of the test validation study.

Another practical application of construct validation is in the evaluation of tests in situations that do not permit acceptable criterion-related validation studies, as in the local validation of some personnel tests for industrial use. The difficulties encountered in these situations were discussed earlier in this chapter, in connection with synthetic validity. Construct validation offers another alternative approach that could be followed in evaluating the appropriateness of published tests for a particular job. Like synthetic validation, this approach requires a systematic job analysis, followed by a description of worker qualifications expressed in terms of relevant behavioral constructs. If, now, the test has been subjected to sufficient research prior to publication, the data cited in the manual should permit a specification of the principal constructs measured by the test. This information could be used directly in assessing the relevance of the test to the required job functions, if the correspondence of constructs is clear enough; or it could serve as a basis for computing a J-coefficient or some other quantitative index of synthetic validity.

Construct validation has also stimulated the search for novel ways of I

gathering validity data. Although the principal techniques employed ininvestigating construct validity have long been familiar, the field ofoperation has been' expanded to admit a \\rider variety of procedures.This very multiplicity of data-gathering techniques, however, presentscertain hazards. It is possible for a test constructor to try a large numberof different validation procedures, a few of which will yield positive re-sults by chance. If these confirmatory results were then to be reportedwithout mention of all the validity probes that yielded negative results, avery misleading impression about the validity of a test could be created.

I Another possible danger in the application of construct validation is that


it may open the way for subjective, unverified assertions about test validity. Since construct validity is such a broad and loosely defined concept, it has been widely misunderstood. Some textbook writers and test constructors seem to perceive it as content validity expressed in terms of psychological trait names. Hence, they present as construct validity purely subjective accounts of what they believe (or hope) the test measures.

A further source of possible confusion arises from a statement that construct validation "is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not 'operationally defined'" (Cronbach & Meehl, 1955, p. 282). Appearing in the first detailed published analysis of the concept of construct validity, this statement was often incorrectly accepted as justifying a claim for construct validity in the absence of data. That the authors of the statement did not intend such an interpretation is illustrated by their own insistence, in the same article, that "unless the network makes contact with observations . . . construct validation cannot be claimed" (p. 291). In the same connection, they criticize tests for which "a finespun network of rationalizations has been offered as if it were validation" (p. 291). Actually, the theoretical construct, trait, or behavior domain measured by a particular test can be adequately defined only in the light of data gathered in the process of validating that test. Such a definition would take into account the variables with which the test correlated significantly, as well as the conditions found to affect its scores and the groups that differ significantly in such scores. These procedures are entirely in accord with the positive contributions made by the concept of construct validity. It is only through the empirical investigation of the relationships of test scores to other external data that we can discover what a test measures.


CHAPTER 7

Validity: Measurement and Interpretation

CHAPTER 6 was concerned with different concepts of validity and their appropriateness for various testing functions; this chapter deals with quantitative expressions of validity and their interpretation. The test user is concerned with validity at either or both of two stages. First, when considering the suitability of a test for his purposes, he examines available validity data reported in the test manual or other published sources. Through such information, he arrives at a tentative concept of what psychological functions the test actually measures, and he judges the relevance of such functions to his proposed use of the test. In effect, when a test user relies on published validation data, he is dealing with construct validity, regardless of the specific procedures used in gathering the data. As we have seen in Chapter 6, the criteria employed in published studies cannot be assumed to be identical with those the test user wants to predict. Jobs bearing the same title in two different companies are rarely identical. Two courses in freshman English taught in different colleges may be quite dissimilar.

Because of the specificity of each criterion, test users are usually advised to check the validity of any chosen test against local criteria whenever possible. Although published data may strongly suggest that a given test should have high validity in a particular situation, direct corroboration is always desirable. The determination of validity against specific local criteria represents the second stage in the test user's evaluation of validity. The techniques to be discussed in this chapter are especially relevant to the analysis of validity data obtained by the test user himself. Most of them are also useful, however, in understanding and interpreting the validity data reported in test manuals.

MEASUREMENT OF RELATIONSHIP. A validity coefficient is a correlation between test score and criterion measure. Because it provides a single numerical index of test validity, it is commonly used in test manuals to report the validity of a test against each criterion for which data are available. The data used in computing any validity coefficient can also be expressed in the form of an expectancy table or expectancy chart, illustrated in Chapter 4. In fact, such tables and charts provide a convenient way to show what a validity coefficient means for the person tested. It will be recalled that expectancy charts give the probability that an individual who obtains a certain score on the test will attain a specified level of criterion performance. For example, with Table 6 (Ch. 4, p. 101), if we know a student's score on the DAT Verbal Reasoning test, we can look up the chances that he will earn a particular grade in a high school course. The same data yield a validity coefficient of .66. When both test and criterion variables are continuous, as in this example, the familiar Pearson Product-Moment Correlation Coefficient is applicable. Other types of correlation coefficients can be computed when the data are expressed in different forms, as when a two-fold pass-fail criterion is employed (e.g., Fig. 7, Ch. 4). The specific procedures for computing these different kinds of correlations can be found in any standard statistics text.
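In present-day notation, the computation of such a validity coefficient is straightforward. The following minimal Python sketch computes a Pearson correlation between test scores and criterion measures; the paired values are invented purely for illustration.

import numpy as np

# Hypothetical paired observations for ten persons: a test score and a
# criterion measure (e.g., a grade-point average) for each person.
test = np.array([42, 55, 38, 61, 47, 50, 66, 33, 58, 45])
criterion = np.array([2.1, 3.0, 1.8, 3.4, 2.6, 2.5, 3.7, 1.5, 3.1, 2.3])

# The validity coefficient is the Pearson product-moment correlation
# between test score and criterion measure.
validity = np.corrcoef(test, criterion)[0, 1]
print(f"validity coefficient: {validity:.2f}")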


CONDITIONS AFFECTING VALIDITY COEFFICIENTS. As in the case of reliability, it is essential to specify the nature of the group on which a validity coefficient is found. The same test may measure different functions when given to individuals who differ in age, sex, educational level, occupation, or any other relevant characteristic. Persons with different experiential backgrounds, for example, may utilize different work methods to solve the same test problem. Consequently, a test could have high validity in predicting a particular criterion in one population, and little or no validity in another. Or it might be a valid measure of different functions in the two populations. Thus, unless the validation sample is representative of the population on which the test is to be used, validity should be redetermined on a more appropriate sample.

The question of sample heterogeneity is relevant to the measurement of validity, as it is to the measurement of reliability, since both characteristics are commonly reported in terms of correlation coefficients. It will be recalled that, other things being equal, the wider the range of scores, the higher will be the correlation. This fact should be kept in



mind when interpreting the validity coefficients given in test manuals. A special difficulty encountered in many validation samples arises from preselection. For example, a new test that is being validated for job selection may be administered to a group of newly hired employees on whom criterion measures of job performance will eventually be available. It is likely, however, that such employees represent a superior selection of all those who applied for the job. Hence, the range of such a group in both test scores and criterion measures will be curtailed at the lower end of the distribution. The effect of such preselection will therefore be to lower the validity coefficient. In the subsequent use of the test, when it is administered to all applicants for selection purposes, the validity can be expected to be somewhat higher.

Validity coefficients may also change over time because of changing selection standards. An example is provided by a comparison of validity coefficients computed over a 30-year interval with Yale students (Burnham, 1965). Correlations were found between a predictive index based on College Entrance Examination Board tests and high school records, on the one hand, and average freshman grades, on the other. This correlation dropped from .71 to .52 over the 30 years. An examination of the bivariate distributions clearly reveals the reason for this drop. Because of higher admission standards, the later class was more homogeneous than the earlier class in both predictor and criterion performance. Consequently, the correlation was lower in the later group, although the accuracy with which individuals' grades were predicted showed little change. In other words, the observed drop in correlation did not indicate that the predictors were less valid than they had been 30 years earlier. Had the differences in group homogeneity been ignored, it might have been wrongly concluded that this was the case.

For the proper interpretation of a validity coefficient, attention should

also be given to the form of the relationship between test and criterion. The computation of a Pearson correlation coefficient assumes that the relationship is linear and uniform throughout the range. There is evidence that in certain situations, however, these conditions may not be met (Fisher, 1959; Kahneman & Ghiselli, 1962). Thus, a particular job may require a minimum level of reading comprehension, to enable employees to read instruction manuals, labels, and the like. Once this minimum is exceeded, however, further increments in reading ability may be unrelated to degree of job success. This would be an example of a nonlinear relation between test and job performance. An examination of the bivariate distribution or scatter diagram obtained by plotting reading comprehension scores against criterion measures would show a rise in job performance up to the minimal required reading ability and a leveling off beyond that point. Hence, the entries would cluster around a curve rather than a straight line.

In other situations, the line of best fit may be a straight line, but the individual entries may deviate farther around this line at the upper than at the lower end of the scale. Suppose that performance on a scholastic aptitude test is a necessary but not a sufficient condition for successful achievement in a course. That is, the low-scoring students will perform poorly in the course; but among the high-scoring students, some will perform well in the course and others will perform poorly because of low motivation. In this situation, there will be wider variability of criterion performance among the high-scoring than among the low-scoring students. This condition in a bivariate distribution is known as heteroscedasticity. The Pearson correlation assumes homoscedasticity, or equal variability throughout the range of the bivariate distribution. In the present example, the bivariate distribution would be fan shaped: wide at the upper end and narrow at the lower end. An examination of the bivariate distribution itself will usually give a good indication of the nature of the relationship between test and criterion. Expectancy tables and expectancy charts also correctly reveal the relative effectiveness of the test at different levels.

MAGNITUDE OF A VALIDITY COEFFICIENT. How high should a validity coefficient be? No general answer to this question is possible, since the interpretation of a validity coefficient must take into account a number of concomitant circumstances. The obtained correlation, of course, should be high enough to be statistically significant at some acceptable level, such as the .01 or .05 level discussed in Chapter 5. In other words, before drawing any conclusions about the validity of a test, we should be reasonably certain that the obtained validity coefficient could not have arisen through chance fluctuations of sampling from a true correlation of zero.

Having established a significant correlation between test scores and criterion, however, we need to evaluate the size of the correlation in the light of the uses to be made of the test. If we wish to predict an individual's exact criterion score, such as the grade-point average a student will receive in college, the validity coefficient may be interpreted in terms of the standard error of estimate, which is analogous to the error of measurement discussed in connection with reliability. It will be recalled that the error of measurement indicates the margin of error to be expected in an individual's score as a result of the unreliability of the test. Similarly, the error of estimate shows the margin of error to be expected in the individual's predicted criterion score, as a result of the imperfect validity of the test.

The error of estimate is found by the following formula:



σ_est = σ_y √(1 − r²_xy)

in which r²_xy is the square of the validity coefficient and σ_y is the standard deviation of the criterion scores. It will be noted that if the validity were perfect (r_xy = 1.00), the error of estimate would be zero. On the other hand, with a test having zero validity, the error of estimate is as large as the standard deviation of the criterion distribution (σ_est = σ_y √(1 − 0) = σ_y). Under these conditions, the prediction is no better than a guess; and the range of prediction error is as wide as the entire distribution of criterion scores. Between these two extremes are to be found the errors of estimate corresponding to tests of varying validity.

Reference to the formula for σ_est will show that the term √(1 − r²_xy) serves to indicate the size of the error relative to the error that would result from a mere guess, i.e., with zero validity. In other words, if √(1 − r²_xy) is equal to 1.00, the error of estimate is as large as it would be if we were to guess the subject's score. The predictive improvement attributable to the use of the test would thus be nil. If the validity coefficient is .80, then √(1 − r²_xy) is equal to .60, and the error is 60 percent as large as it would be by chance. To put it differently, the use of such a test enables us to predict the individual's criterion performance with a margin of error that is 40 percent smaller than it would be if we were to guess.
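The relation just described is easy to verify numerically. A minimal Python sketch, with the criterion SD set to 1.00 so that the result reads directly as a proportion of the chance error:

from math import sqrt

def error_of_estimate(validity, sd_criterion=1.0):
    # Standard error of estimate: sigma_est = sigma_y * sqrt(1 - r^2).
    return sd_criterion * sqrt(1 - validity ** 2)

for r in (0.00, 0.50, 0.80, 1.00):
    print(f"r = {r:.2f}: error is {error_of_estimate(r):.2f} times the chance error")

With r = .80 the printed value is .60, the 60 percent figure used above.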

guess. . . . .It would thus appear that even with a validlty of .80,whl~h 1S unusu~lIy

high, the error of predicted scores is.conside~abl~..u th,e pnmary ~~ctl~nof psychological tests were to predlct each mdIvl~ual ~ exact l?OSlhO~inthe criterion distribution, the outlook would be qUite dlscouraglOg. \\ henexamined in the light of the error of estimate, mos~ t~sts do not appearvery efficient. In most testing situations, ho~ev~r,. lt IS not necessary topredict the specific criterion performance of mdlvl~ual .c~ses, but ratherto determine which individuals will exceed a certam mlmmum standardof performance, or cutoff point, in the cri:erion. What are the ch.an:esthat Mary Greene will graduate from medIcal school, tI:at Tom Hlggms",'in pass a course in calculus, or that Beverly ~ruce WIll succeed as anastronaut? Which applicants are likely to be satlsfactory clc::rks,salesmen,or machine operators? Such information is ~seful not only fo~ ~roup i

selection but also for individual career planmng. For example, lt 15 ad-vantageous for a student to know that he has a gOO? chanc~ of pas~ingall courses in law school, even if we are unable to estimate WIth certamtywhether his grade average will be 74 or ~I: . .,

A test may appreciably improve predIctive effiCIencyIf It sho~s a~1Jsignificant correlation with the criterion, however 10w..Un.der.certa~n Clt-cumstanees even validities as low as .20 or .30 may Justify lncluslon ofthe test in ~ selection program. For many testing purposes, evaluation .oftests in terms of the error of estimate is unrealistically stringent. Consld-eration must be given to other ways of evaluating the contribution of a

Validity: AI easuremellt and Interpretation 167

test, which take into account the types of decisions to be made from thescores. Some of these procedurcs will be illustrated in the following sec-tion.

BASIC APPROACH. Let us suppose that 100 applicants have been givenfln aptitude test and followed up until each could be evaluated for suc-cess on a certain job. Figure 17 shows the bh'ariate distribution of testscores and measures of job success for the 100 subjects. The correlationbetween these two variables is slightly below .70. The minimum accept-able job performance, or criterion cutoff point, is indicated in the diagramby a heavy horizontal line. The 40 cases falling below this line wouldrepresent job failures; the 60 above the line, job successes. If all 100 appli-~ants are hired, thereforc, 60 percent will succeed on the job. Similarly,if a smaller number were hired at random, without reference to testscores, the proportion of successes would probably be close to 60 percent.Suppose, however, that the test scores are used to select the 45 mostpromising applicants out of the 100 (selection ratio;::: .45). In such acase, the 45 individuals falling to the right of the heavy vertical linewould be chosen. Within this group of 45, it can be seen that there- arc 7job failures, or false acceptances, falling below the heavy horizontal line,and 38 job successes. Hence, the percentage of job successes is now.84rather than 60 (i.e., 38/45 = .84). This increase is attributable to the useof the test as a screening instrument. It will be noted that errors in pre-dic:te:d criterion score that do not affect the decision can be ignored.Opl)' those prediction errors that cross the cutoff line and hence placethe individual in the wrong category will reduce the selective effective-ness of the test .. For a complete evaluation of the effectiveness of the test as a screeningmstrument, another category of cases in Figure 17 must also be examined.This is the category of false re;ections, comprising the 22 persons whoscore below the cutoff point on the test but above the criterion cutoff.From these data we would estimate that 22 percent' of the total applicantsample are potential job successes who will be lost if the test is used as ascreening device with the present cutoff point. These false rejects in apersonnel selection situation correspond to the false positives in clinicalevaluations. The latter term has been adopted frO,J:lkmedical practice, inwhi~ .a t~st for a pathological condition is reported ~positive if thecondltion 1S present and negative if the patient is Dormal. A false positivethus refers to ~ case in ~hich the test erroneously 4l~~atf,(~-1:hepresence?f ~ ?athologJ~1 condition, as when brain damage~,-~ mdicated in anmdlVldual who lS actually normal. This terminology is likely to be COD-


fusing unless we remember that in clinical practice a positive result on a test denotes pathology and unfavorable diagnosis, whereas in personnel selection a positive result conventionally refers to a favorable prediction regarding job performance, academic achievement, and the like.
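The bookkeeping for this example can be sketched in a few lines; the four outcome counts are those read from Figure 17.

# Outcome counts for the 100 applicants of Figure 17.
valid_acceptances = 38   # selected by the test and successful on the job
false_acceptances = 7    # selected by the test but failing on the job
false_rejections = 22    # rejected by the test though potentially successful
valid_rejections = 33    # rejected by the test and failing

selected = valid_acceptances + false_acceptances            # 45 hired
base_rate = (valid_acceptances + false_rejections) / 100    # .60 succeed without the test
hit_rate = valid_acceptances / selected                     # .84 succeed among those hired
print(f"base rate: {base_rate:.2f}; success rate among selected: {hit_rate:.2f}")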

In setting a cutoff score on a test, attention should be given to the percentage of false rejects (or false positives) as well as to the percentages of successes and failures within the selected group. In certain situations, the cutoff point should be set sufficiently high to exclude all but a few possible failures. This would be the case when the job is of such a nature that a poorly qualified worker could cause serious loss or damage. An example would be a commercial airline pilot. Under other circumstances, it may be more important to admit as many qualified persons as possible, at the risk of including more failures. In the latter case, the number of false rejects can be reduced by the choice of a lower cutoff score. Other factors that normally determine the position of the cutoff score include the available personnel supply, the number of job openings, and the urgency or speed with which they must be filled.

In many personnel decisions, the selection ratio is determined by the practical demands of the situation. Because of supply and demand in filling job openings, for example, it may be necessary to hire the top 40 percent of applicants in one case and the top 75 percent in another. When the selection ratio is not externally imposed, the cutting score on a test can be set at that point giving the maximum differentiation between criterion groups. This can be done roughly by comparing the distribution of test scores in the two criterion groups. More precise mathematical procedures for setting optimal cutting scores have also been worked out (Darlington & Stauffer, 1966; Guttman & Raju, 1965; Rorer, Hoffman, La Forge, & Hsieh, 1966). These procedures make it possible to take into account other relevant parameters, such as the relative seriousness of false rejections and false acceptances.

In the terminology of decision theory, the example given in Figure 17 illustrates a simple strategy, or plan for deciding which applicants to accept and which to reject. In more general terms, a strategy is a technique for utilizing information in order to reach a decision about individuals. In this case, the strategy was to accept the 45 persons with the highest test scores. The increase in percentage of successful employees from 60 to 84 could be used as a basis for estimating the net benefit resulting from the use of the test.

Statistical decision theory was developed by Wald (1950) with special reference to the decisions required in the inspection and quality control of industrial products. Many of its implications for the construction and interpretation of psychological tests have been systematically worked out by Cronbach and Gleser (1965). Essentially, decision theory is an attempt to put the decision-making process into mathematical form, so that available information may be used to arrive at the most effective decision under specified circumstances. The mathematical procedures employed in decision theory are often quite complex, and few are in a form permitting their immediate application to practical testing problems. Some of the basic concepts of decision theory, however, are proving helpful in the reformulation and clarification of certain questions about tests. A few of these ideas were introduced into testing before the formal development of statistical decision theory and were later recognized as fitting into that framework.

[Figure 17 is a bivariate scatter diagram plotting test score (horizontal axis, low to high) against job performance (vertical axis), with a heavy horizontal line at the criterion cutoff separating job successes from job failures, and a heavy vertical line at the test cutoff score.]

FIG. 17. Increase in the Proportion of "Successes" Resulting from the Use of a Selection Test.

PREDICTION OF OUTCOMES. A precursor of decision theory in psychological testing is to be found in the Taylor-Russell tables (1939), which permit a determination of the net gain in selection accuracy attributable to the use of the test. The information required includes the validity coeffi-


·60 .60.60 .60.61 .60.61 .61.62 .61

Validity: Measurement and Interpretation 171

selected after the use of the test. Thus, the difference between .60 andanyone table entry shows the increase in proportion of successful selec-tions attributable to the test.

Obviously if the selection ratio w.ere 100 percent, that is, if all appli-cants had to be accepted, no test, howen'r valid, could improve theselection process. Reference to Table 14 sho\\'s that, when as many as 95percent of applicants must be admitted, even a test with perfect validity( r = 1.00) would raise the proportion of successful persons by only 3 per-cent (.60 to .63). On the other hand, when only 5 percent of applicantsneed to be chosen, a test with a validity coefficient of only .30 can raisethe percentage of successful applicants selected from 60 to 82. The risefrom 60 to 82 represents the incremental vaUdity of the test (Sechrest,1963), or the increase in predictive validity attributable to the test. Itindicates the contribution the test makes to the selection of individualswho will meet the minimum standards in criterion performance. In ap-plying the Taylor-Russell tables, of course, test validity should be com-puted on the same sort of group used to estimate percentage of priorsuccesses. In other words, the contribution of the test is not evaluatedagainst chance success unless 'applicants were preViously selected bychance-a most unlikely circumstance. If applicants had been sele<;:teqon the basis of previous job history, letters of recommendation, and inter-views, the contribution of the test should be evaluated ODe. the- ,basis atwhat the test adds to these previous selection procedures. ..

The incremental validity resul~~~ from the use of a test depends notonly on the selection ratio but l\~'()ll the base rate. In the previouslyillustrated job selection situation, the base rale refers to the proportion ofsuccessful employees prior to the introduction of the test for selectionpurposes. Table 14 shows the anticipated outcomes when the base rateis .60. For other base rates, we need to consult the other appropriatetables in the cited reference (Taylor & Russell, 1939). Let us consideran example in which test validity is .40 and the selection ratio is 70 per-cent. Under these conditions, what would be the contribution or incre-mental validity of the test if we begin with a base rate of 50 percent?And what would be the contribution if we begin with more extreme baserates of 10 and 90 percent? Reference to the appropriate Taylor-Russelltables for these base rates shows that the percentage of successful em-ployees would rise from 50 to 75 in the Hrst case; from 10 to 21 in thesecond; and from 9 to 99 in the third. Thus, the improvement in percent-age of successful employees attributable tQ .the use of the test is 25 whenthe base rate was 50, but only 11 and 9 when the b,ase rates were moreextreme. .

The implications of extreme base rates are of specia~,,interest in clinicalpsychology, where the base rate refe~ to' the frequency of the patho-lOgical condition to be diagnosed in the, p.qpulation tested (Buchwald,


cient of the test, the proportion of applicants who must be accepted (selection ratio), and the proportion of successful applicants selected without the use of the test (base rate). A change in any of these factors can alter the predictive efficiency of the test.

For purposes of illustration, one of the Taylor-Russell tables has been reproduced in Table 14. This table is designed for use when the base rate, or percentage of successful applicants selected prior to the use of the test, is .60. Other tables are provided by Taylor and Russell for other base rates. Across the top of the table are given different values of the selection ratio, and along the side are the test validities. The entries in the body of the table indicate the proportion of successful persons

TABLE 14
Proportion of "Successes" Expected through the Use of a Test of Given Validity and Given Selection Ratio, for Base Rate .60 (From Taylor and Russell, 1939, p. 576)

[Table 14 body: column headings are selection ratios from .05 through .95; row headings are validity coefficients from .00 to 1.00 in steps of .05. Each entry is the proportion of successes expected among those selected. The entries range from .60 (zero validity, any selection ratio) to 1.00 (high validities combined with low selection ratios); most of the printed values did not survive reproduction.]

selected after the use of the test. Thus, the difference between .60 and any one table entry shows the increase in proportion of successful selections attributable to the test.

Obviously, if the selection ratio were 100 percent, that is, if all applicants had to be accepted, no test, however valid, could improve the selection process. Reference to Table 14 shows that, when as many as 95 percent of applicants must be admitted, even a test with perfect validity (r = 1.00) would raise the proportion of successful persons by only 3 percent (.60 to .63). On the other hand, when only 5 percent of applicants need to be chosen, a test with a validity coefficient of only .30 can raise the percentage of successful applicants selected from 60 to 82. The rise from 60 to 82 represents the incremental validity of the test (Sechrest, 1963), or the increase in predictive validity attributable to the test. It indicates the contribution the test makes to the selection of individuals who will meet the minimum standards in criterion performance. In applying the Taylor-Russell tables, of course, test validity should be computed on the same sort of group used to estimate percentage of prior successes. In other words, the contribution of the test is not evaluated against chance success unless applicants were previously selected by chance, a most unlikely circumstance. If applicants had been selected on the basis of previous job history, letters of recommendation, and interviews, the contribution of the test should be evaluated on the basis of what the test adds to these previous selection procedures.

The incremental validity resulting from the use of a test depends not only on the selection ratio but also on the base rate. In the previously illustrated job selection situation, the base rate refers to the proportion of successful employees prior to the introduction of the test for selection purposes. Table 14 shows the anticipated outcomes when the base rate is .60. For other base rates, we need to consult the other appropriate tables in the cited reference (Taylor & Russell, 1939). Let us consider an example in which test validity is .40 and the selection ratio is 70 percent. Under these conditions, what would be the contribution or incremental validity of the test if we begin with a base rate of 50 percent? And what would be the contribution if we begin with more extreme base rates of 10 and 90 percent? Reference to the appropriate Taylor-Russell tables for these base rates shows that the percentage of successful employees would rise from 50 to 75 in the first case; from 10 to 21 in the second; and from 90 to 99 in the third. Thus, the improvement in percentage of successful employees attributable to the use of the test is 25 when the base rate was 50, but only 11 and 9 when the base rates were more extreme.
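Entries of this kind can be approximated by simulation. The sketch below assumes, as the Taylor-Russell tables do, that test and criterion scores are jointly normal; the particular values (validity .30, selection ratio .05, base rate .60) are those of the example above, and the printed result should fall near the quoted .82.

import numpy as np

def taylor_russell(validity, selection_ratio, base_rate, n=200_000, seed=0):
    # Simulate standardized test (x) and criterion (y) scores with the
    # given correlation, then apply both cutoffs.
    rng = np.random.default_rng(seed)
    cov = [[1.0, validity], [validity, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    test_cutoff = np.quantile(x, 1 - selection_ratio)     # top scorers are selected
    criterion_cutoff = np.quantile(y, 1 - base_rate)      # top base-rate proportion succeed
    selected = x >= test_cutoff
    return (y[selected] >= criterion_cutoff).mean()

print(round(taylor_russell(0.30, 0.05, 0.60), 2))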



The implications of extreme base rates are of special interest in clinical psychology, where the base rate refers to the frequency of the pathological condition to be diagnosed in the population tested (Buchwald, 1965; Cureton, 1957a; Meehl & Rosen, 1955; J. S. Wiggins, 1973). For example, if 5 percent of the intake population of a clinic has organic brain damage, then 5 percent is the base rate of brain damage in this population. Although the introduction of any valid test will improve predictive or diagnostic accuracy, the improvement is greatest when the base rates are closest to 50 percent. With the extreme base rates found with rare pathological conditions, however, the improvement may be negligible. Under these conditions, the use of a test may prove to be unjustified when the cost of its administration and scoring is taken into account. In a clinical situation, this cost would include the time of professional personnel that might otherwise be spent on the treatment of additional cases (Buchwald, 1965). The number of false positives, or normal individuals incorrectly classified as pathological, would of course increase this overall cost in a clinical situation.

When the seriousness of a rare condition makes its diagnosis urgent, tests of moderate validity may be employed in an early stage of sequential decisions. For example, all cases might first be screened with an easily administered test of moderate validity. If the cutoff score is set high enough (high scores being favorable), there will be few false negatives but many false positives, or normals diagnosed as pathological. The latter can then be detected through a more intensive individual examination given to all cases diagnosed as positive by the test. This solution would be appropriate, for instance, when available facilities make the intensive individual examination of all cases impracticable.
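Why extreme base rates are so damaging can be seen from a short Bayesian computation. The sensitivity and specificity values below are invented for the example; only the contrast in base rates comes from the discussion above.

def positive_predictive_value(base_rate, sensitivity=0.90, specificity=0.90):
    # Proportion of positive test results that are truly pathological.
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

for base_rate in (0.50, 0.05):
    ppv = positive_predictive_value(base_rate)
    print(f"base rate {base_rate:.2f}: {ppv:.2f} of positives are true positives")

With a base rate of .50, nine positives in ten are genuine; with a base rate of .05, about two positives in three are false.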

RELATION OF VALIDITY TO MEAN OUTPUT LEVEL. In many practical situations, what is wanted is an estimate of the effect of the selection test, not on percentage of persons exceeding the minimum performance, but on overall output of the selected persons. How does the actual level of job proficiency or criterion achievement of the workers hired on the basis of the test compare with that of the total applicant sample that would have been hired without the test? Following the work of Taylor and Russell, several investigators addressed themselves to this question (Brogden, 1946; Brown & Ghiselli, 1953; Jarrett, 1948; Richardson, 1944). Brogden (1946) first demonstrated that the expected increase in output is directly proportional to the validity of the test. Thus, the improvement resulting from the use of a test of validity .50 is 50 percent as great as the improvement expected from a test of perfect validity.

The relation between test validity and expected rise in criterion achievement can be readily seen in Table 15.¹ Expressing criterion scores

1 A table including more values for both selection ratios and validity coefficients was prepared by Naylor and Shine (1965).

[Table 15: mean standard criterion score of accepted cases, in relation to test validity and selection ratio. Rows give selection ratios; columns give validity coefficients from .00 to 1.00; criterion scores are expressed as standard scores with mean zero and SD 1.00. The printed entries did not survive reproduction; representative values are discussed in the text below.]


as standard scores with a mean of zero and an SD of 1.00, this table gives the expected mean criterion score of workers selected with a test of given validity and with a given selection ratio. In this context, the base output mean, corresponding to the performance of applicants selected without use of the test, is given in the column for zero validity. Using a test with zero validity is equivalent to using no test at all. To illustrate the use of the table, let us assume that the highest scoring 20 percent of the applicants are hired (selection ratio = .20) by means of a test whose validity coefficient is .50. Reference to Table 15 shows that the mean criterion performance of this group is .70 SD above the expected base mean of an untested sample. With the same 20 percent selection ratio and a perfect test (validity coefficient = 1.00), the mean criterion score of the accepted applicants would be 1.40, just twice what it would be with the test of validity .50. Similar direct linear relations will be found if other mean criterion performances are compared within any row of Table 15. For instance, with a selection ratio of 60 percent, a validity of .25 yields a mean criterion score of .16, while a validity of .50 yields a mean of .32. Again, doubling the validity doubles the output rise.
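Under the same bivariate-normal assumptions used for such tables, these values can be computed directly: the expected mean standard criterion score of the selected group is the validity coefficient multiplied by the mean standard test score of those selected. A minimal sketch:

from statistics import NormalDist

def mean_criterion_z(validity, selection_ratio):
    nd = NormalDist()
    z_cut = nd.inv_cdf(1 - selection_ratio)          # test cutoff in standard-score units
    mean_test_z = nd.pdf(z_cut) / selection_ratio    # mean test z of those above the cutoff
    return validity * mean_test_z

print(round(mean_criterion_z(0.50, 0.20), 2))   # .70, as in the text
print(round(mean_criterion_z(1.00, 0.20), 2))   # 1.40, twice as great

The doubling is immediate in this form, since validity enters only as a multiplier.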

The evaluation of test validity in terms of either mean predicted output or proportion of persons exceeding a minimum criterion cutoff is obviously much more favorable than an evaluation based on the previously discussed error of estimate. The reason for the difference is that prediction errors that do not affect decisions are irrelevant to the selection situation. For example, if Smith and Jones are both superior workers and are both hired on the basis of the test, it does not matter if the test shows Smith to be better than Jones while in job performance Jones excels Smith.

THE ROLE OF VALUES IN DECISION THEORY. It is characteristic of decision theory that tests are evaluated in terms of their effectiveness in a specific situation. Such evaluation takes into account not only the validity of the test in predicting a particular criterion but also a number of other parameters, including base rate and selection ratio. Another important parameter is the relative utility of expected outcomes, the judged favorableness or unfavorableness of each outcome. The lack of adequate systems for assigning values to outcomes in terms of a uniform utility scale is one of the chief obstacles to the application of decision theory. In industrial decisions, a dollar-and-cents value can frequently be assigned to different outcomes. Even in such cases, however, certain outcomes pertaining to good will, public relations, and employee morale are difficult to assess in monetary terms. Educational decisions must take into account institutional goals, social values, and other relatively intangible factors. Individual decisions, as in counseling, must consider the indi-


vidual's preferences and value system. It has been repeatedly pointed out, however, that decision theory did not introduce the problem of values into the decision process, but merely made it explicit. Value systems have always entered into decisions, but they were not heretofore clearly recognized or systematically handled.

In choosing a decision strategy, the goal is to maximize expected utilities across all outcomes. Reference to the schematic representation of a simple decision strategy in Figure 18 will help to clarify the procedure. This diagram shows the decision strategy illustrated in Figure 17, in which a single test is administered to a group of applicants and the decision to accept or reject an applicant is made on the basis of a cutoff score on the test. There are four possible outcomes, including valid and false acceptances and valid and false rejections. The probability of each outcome can be found from the number of persons in each of the four sections of Figure 17. Since there were 100 applicants in that example, these numbers divided by 100 give the probabilities of the four outcomes listed in Figure 18. The other data needed are the utilities of the different outcomes, expressed on a common scale. The expected overall utility of the strategy could then be found by multiplying the probability of each outcome by the utility of the outcome, adding these products for the four outcomes, and subtracting a value corresponding to the cost of testing.² This last term highlights the fact that a test of low validity might be retained if it is short, inexpensive, easily administered by relatively untrained personnel, and suitable for group administration. An individual test requiring a trained examiner or expensive equipment would need a higher validity to justify its use.

[Figure 18 is a flow diagram: the test is administered and a cutoff score applied, leading to accept or reject decisions with four possible outcomes.]

Decision   Outcome             Probability
Accept     Valid Acceptance    .38
Accept     False Acceptance    .07
Reject     Valid Rejection     .33
Reject     False Rejection     .22

FIG. 18. A Simple Decision Strategy.

2 For a fictitious example illustrating all steps in these computations, see J. S. Wiggins (1973), pp. 257-274.
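The computation just described reduces to a one-line sum once the probabilities and utilities are in hand. In the sketch below, the probabilities are those of Figure 18; the utility values and the cost of testing are entirely hypothetical, since the text assigns no numbers to them.

# (probability, utility) for each outcome; utilities are on an arbitrary common scale.
outcomes = {
    "valid acceptance": (0.38, 1.0),
    "false acceptance": (0.07, -1.5),
    "valid rejection":  (0.33, 0.5),
    "false rejection":  (0.22, -0.5),
}
cost_of_testing = 0.05  # hypothetical, on the same utility scale

expected_utility = sum(p * u for p, u in outcomes.values()) - cost_of_testing
print(round(expected_utility, 3))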



SEQUENTIAL STRATEGIES AND ADAPTIVE TREATMENTS. In some situations, the effectiveness of a test may be increased through the use of more complex decision strategies, which take still more parameters into account. Two examples will serve to illustrate these possibilities. First, tests may be used to make sequential rather than terminal decisions. With the simple decision strategy illustrated in Figures 17 and 18, all decisions to accept or reject are treated as terminal. Figure 19, on the other hand, shows a two-stage sequential decision. Test A could be a short and easily administered screening test. On the basis of performance on this test, individuals would be sorted into three categories, including those clearly accepted or rejected, as well as an intermediate uncertain group to be examined further with more intensive techniques, represented by Test B. On the basis of the second-stage testing, this group would be sorted into accepted and rejected categories.
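Such a two-stage strategy can be written as a simple rule. All cutoff values below are hypothetical; in practice they would be set from the kinds of considerations discussed earlier in this chapter.

def two_stage_decision(score_a, score_b=None, accept_a=70, reject_a=40, cutoff_b=55):
    # Stage 1: the short screening Test A settles the clear cases.
    if score_a >= accept_a:
        return "accept"
    if score_a < reject_a:
        return "reject"
    # Stage 2: the intermediate, uncertain group requires the intensive Test B.
    if score_b is None:
        return "refer to Test B"
    return "accept" if score_b >= cutoff_b else "reject"

print(two_stage_decision(82))       # accepted on Test A alone
print(two_stage_decision(55))       # uncertain: referred to Test B
print(two_stage_decision(55, 60))   # accepted after Test B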

Such sequential testing can also be employed within a single testing session, to maximize the effective use of testing time (DeWitt & Weiss, 1974; Linn, Rock, & Cleary, 1969; Weiss & Betz, 1973). Although applicable to paper-and-pencil printed group tests, sequential testing is particularly well suited for computer testing. Essentially, the sequence of items or item groups within the test is determined by the examinee's own performance. For example, everyone might begin with a set of items of intermediate difficulty. Those who score poorly are routed to easier items; those who score well, to more difficult items. Such branching may occur repeatedly at several stages. The principal effect is that each examinee attempts only those items suited to his ability level, rather than trying all items. Sequential testing models will be discussed further in Chapter 11, in connection with the utilization of computers in group testing.
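The branching logic itself is simple, as the following sketch shows; the difficulty levels and routing thresholds are invented for illustration.

levels = ["easy", "medium", "hard"]

def next_level(current, proportion_correct):
    # Route upward after good performance, downward after poor performance.
    i = levels.index(current)
    if proportion_correct >= 0.7 and i < len(levels) - 1:
        return levels[i + 1]
    if proportion_correct <= 0.3 and i > 0:
        return levels[i - 1]
    return current

print(next_level("medium", 0.8))   # routed to harder items
print(next_level("medium", 0.2))   # routed to easier items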

testing.

Another strategy, suitable for the diagnosis of psychological disorders, is to use only two categories, but to test further all cases classified as positives (i.e., possibly pathological) by the preliminary screening test. This is the strategy cited earlier in this section, in connection with the use of tests to diagnose pathological conditions with very low base rates.

It should also be noted that many personnel decisions are in effect sequential, although they may not be so perceived. Incompetent employees hired because of prediction errors can usually be discharged after a probationary period; failing students can be dropped from college at several stages. In such situations, it is only adverse selection decisions that are terminal. To be sure, incorrect selection decisions that are later rectified may be costly in terms of several value systems. But they are often less costly than terminal wrong decisions.

A second condition that may alter the effectiveness of a psychological test is the availability of alternative treatments and the possibility of adapting treatments to individual characteristics. An example would be the utilization of different training procedures for workers at different aptitude levels, or the introduction of compensatory educational programs for students with certain educational disabilities. Under these conditions, the decision strategy followed in individual cases should take into account available data on the interaction of initial test score and differential treatment. When adaptive treatments are utilized, the success rate is likely to be substantially improved. Because the assignment of individuals to alternative treatments is essentially a classification rather than a selection problem, more will be said about the required methodology in a later section on classification decisions.

The examples cited illustrate a few of the ways in which the concepts and rationale of decision theory can assist in the evaluation of psychological tests for specific testing purposes. Essentially, decision theory has served to focus attention on the complexity of factors that determine the contribution a given test can make in a particular situation. The validity coefficient alone cannot indicate whether or not a test should be used, since it is only one of the factors to be considered in evaluating the impact of the test on the efficacy of the total decision process.³

DIFFERENTIALLY PREDICTABLE SUBSETS OF PERSONS. The validity of a test for a given criterion may vary among subgroups differing in personal characteristics. The classic psychometric model assumes that prediction errors are characteristic of the test rather than of the person and that these errors are randomly distributed among persons. With the flexibility of approach ushered in by decision theory, there has been increasing exploration of prediction models involving interaction between persons and

3 For a fuller discussion of the implications of decision theory for test use, see J. S. Wiggins (1973), Ch. 6, and, at a more technical level, Cronbach and Gleser (1965).


tests. Such interaction implies that the same test may be a better predictor for certain classes or subsets of persons than it is for others. For example, a given test may be a better predictor of criterion performance for men than for women, or a better predictor for applicants from a lower than for applicants from a higher socioeconomic level. In these examples, sex and socioeconomic level are known as moderator variables, since they moderate the validity of the test (Saunders, 1956).

When computed in a total group, the validity coefficient of a test may be too low to be of much practical value in prediction. But when recomputed in subsets of individuals differing in some identifiable characteristic, validity may be high in one subset and negligible in another. The test could thus be used effectively in making decisions regarding persons in the first group but not in the second. Perhaps another test or some other assessment device could be found that is an effective predictor in the second group.

A moderator variable is some characteristic of persons that makes it possible to predict the predictability of different individuals with a given instrument. It may be a demographic variable, such as sex, age, educational level, or socioeconomic background; or it may be a score on another test. Interests and motivation often function as moderator variables. Thus, if an applicant has little interest in a job, he will probably perform poorly regardless of his scores on relevant aptitude tests. Among such persons, the correlation between aptitude test scores and job performance would be low. For individuals who are interested and highly motivated, on the other hand, the correlation between aptitude test score and job success may be quite high.
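The statistical signature of a moderator variable is easy to display with simulated data. In the sketch below, the criterion depends on aptitude only in the high-interest subgroup; all numbers are invented.

import numpy as np

rng = np.random.default_rng(0)
n = 400
high_interest = rng.random(n) > 0.5            # True for the high-interest subgroup
aptitude = rng.normal(size=n)
# Aptitude contributes to the criterion only when interest is high.
criterion = np.where(high_interest, 0.7 * aptitude, 0.0) + rng.normal(size=n)

r_total = np.corrcoef(aptitude, criterion)[0, 1]
r_high = np.corrcoef(aptitude[high_interest], criterion[high_interest])[0, 1]
r_low = np.corrcoef(aptitude[~high_interest], criterion[~high_interest])[0, 1]
print(f"total: {r_total:.2f}  high interest: {r_high:.2f}  low interest: {r_low:.2f}")

The total-group correlation is modest, while the subgroup correlations diverge sharply; this is the pattern reported in the studies cited below.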

EMPIRICAL EXAMPLES OF MODERATOR VARIABLES. Evidence for the operation of moderator variables comes from a variety of sources. In a survey of several hundred correlation coefficients between aptitude test scores and academic grades, H. G. Seashore (1962) found higher correlations for women than for men in the large majority of instances. The same trend was found in high school and college, although the trend was more pronounced at the college level. The data do not indicate the reason for this sex difference in the predictability of academic achievement, but it may be interesting to speculate about it in the light of other known sex differences. If women students in general tend to be more conforming and more inclined to accept the values and standards of the school situation, their class achievement will probably depend largely on their abilities. If, on the other hand, men students tend to concentrate their efforts on those activities (in or out of school) that arouse their individual interests, these interest differences would introduce additional variance in their course achievement and would make it more difficult to predict achievement from test scores. Whatever the reason for the difference, sex does appear to function as a moderator variable in the predictability of academic grades from aptitude test scores.

A number of investigations have been specially designed to assess the role of moderator variables in the prediction of academic achievement. Several studies (Frederiksen & Gilbert, 1960; Frederiksen & Melville, 1954; Stricker, 1966) tested the hypothesis that the more compulsive students, identified through two tests of compulsivity, would put a great deal of effort into their course work, regardless of their interest in the courses, but that the effort of the less compulsive students would depend on their interest. Since effort will be reflected in grades, the correlation between the appropriate interest test scores and grades should be higher among noncompulsive than among compulsive students. This hypothesis was confirmed in several groups of male engineering students, but not among liberal arts students of either sex. Moreover, lack of agreement among different indicators of compulsivity casts doubt on the generality of the construct that was being measured.

In another study (Grooms & Endler, 1960), the college grades of the more anxious students correlated higher (r = .63) with aptitude and achievement test scores than did the grades of the less anxious students (r = .19). A different approach is illustrated by Berdie (1961), who investigated the relation between intraindividual variability on a test and the predictive validity of the same test. It was hypothesized that a given test will be a better predictor for those individuals who perform more consistently in different parts of the test, and whose total scores are thus more reliable. Although the hypothesis was partially confirmed, the relation proved to be more complex than anticipated (Berdie, 1969).

In a different context, there is evidence that self-report personality inventories may have higher validity for some types of neurotics than for others (Fulkerson, 1959). The characteristic behavior of the two types tends to make one type careful and accurate in reporting symptoms, the other careless and evasive. The individual who is characteristically precise and careful about details, who tends to worry about his problems, and who uses intellectualization as a primary defense is likely to provide a more accurate picture of his emotional difficulties on a self-report inventory than is the impulsive, careless individual who tends to avoid expressing unpleasant thoughts and emotions and who uses denial as a primary defense.

Ghiselli (1956, 1960a, 1960b, 1963, 1968; Ghiselli & Sanders, 1967) has extensively explored the role of moderator variables in industrial situations. In a study of taxi drivers (Ghiselli, 1956), the correlation between an aptitude test and a job-performance criterion in the total applicant sample was only .220. The group was then sorted into thirds on the basis of scores on an occupational interest test. When the validity of the


aptitude test was recomputed within the third whose occupational interest level was most appropriate for the job, it rose to .664.

A technique employed by Ghiselli in much of his research consists in finding for each individual the absolute difference (D) between his actual and his predicted criterion scores. The smaller the value of D, the more predictable is the individual's criterion score. A predictability scale is then developed by comparing the item responses of two contrasted subgroups selected on the basis of their D scores. The predictability scale is subsequently applied to a new sample, to identify highly predictable and poorly predictable subgroups, and the validity of the original test is compared in these two subgroups. This approach has shown considerable promise as a means of identifying persons for whom a test will be a good or a poor predictor. An extension of the same procedure has been developed to determine in advance which of two tests will be a better predictor for each individual (Ghiselli, 1960a).

Other investigators (Dunnette, 1972; Hobert & Dunnette, 1967) have argued that Ghiselli's D index, based on the absolute amount of prediction error without regard to direction of error, may obscure important individual differences. Alternative procedures, involving separate analyses of overpredicted and underpredicted cases, have accordingly been proposed.

At this time, the identification and use of moderator variables are still in an exploratory phase. Considerable caution is required to avoid methodological pitfalls (see, e.g., Abrahams & Alf, 1972a, 1972b; Dunnette, 1972; Ghiselli, 1972; Velicer, 1972a, 1972b). The results are usually quite specific to the situations in which they were obtained. And it is important to check the extent to which the use of moderators actually improves the prediction that could be achieved through other more direct means (Pinder, 1973).

For the prediction of practical criteria, not one but several tests are generally required. Most criteria are complex, the criterion measure depending on a number of different traits. A single test designed to measure such a criterion would thus have to be highly heterogeneous. It has already been pointed out, however, that a relatively homogeneous test, measuring largely a single trait, is more satisfactory because it yields less ambiguous scores (Ch. 5). Hence, it is usually preferable to use a combination of several relatively homogeneous tests, each covering a different aspect of the criterion, rather than a single test consisting of a hodgepodge of many different kinds of items.

When a number of specially selected tests are employed together to predict a single criterion, they are known as a test battery. The chief problem arising in the use of such batteries concerns the way in which scores on the different tests are to be combined in arriving at a decision regarding each individual. The statistical procedures followed for this purpose are of two major types, namely, multiple regression equation and multiple cutoff scores.

When tests are administered in the intensive study of individual cases, as in clinical diagnosis, counseling, or the evaluation of high-level executives, it is a common practice for the examiner to utilize test scores without further statistical analysis. In preparing a case report and in making recommendations, the examiner relies on judgment, past experience, and theoretical rationale to interpret score patterns and integrate findings from different tests. Such clinical use of test scores will be discussed further in Chapter 16.

MULTIPLE REGRESSION EQUATION. The multiple regression equation yields a predicted criterion score for each individual on the basis of his scores on all the tests in the battery. The following regression equation illustrates the application of this technique to predicting a student's achievement in high school mathematics courses from his scores on verbal (V), numerical (N), and reasoning (R) tests:

Mathematics Achievement = .21V + .21N + .32R + 1.35

In this equation, the student's stanine score on each of the three tests is multiplied by the corresponding weight given in the equation. The sum of these products, plus a constant (1.35), gives the student's predicted stanine position in mathematics courses.

Suppose that Bill Jones receives the following stanine scores:

Verbal      6
Numerical   4
Reasoning   8

The estimated mathematics achievement of this student is found as follows:

Math. Achiev. = (.21)(6) + (.21)(4) + (.32)(8) + 1.35 = 6.01

Bill's predicted stanine is approximately 6. It will be recalled (Ch. 4) that a stanine of 5 represents average performance. Bill would thus be expected to do somewhat better than average in mathematics courses. His very superior performance in the reasoning test (R = 8) and his above-average score on the verbal test (V = 6) compensate for his poor score in speed and accuracy of computation (N = 4).
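The computation for Bill can be reproduced in a few lines; the weights and constant are those of the regression equation given above.

# Regression weights and constant from the mathematics-achievement equation.
weights = {"verbal": 0.21, "numerical": 0.21, "reasoning": 0.32}
constant = 1.35

def predicted_stanine(stanines):
    # Weighted sum of the stanine scores plus the regression constant.
    return sum(weights[test] * score for test, score in stanines.items()) + constant

bill = {"verbal": 6, "numerical": 4, "reasoning": 8}
print(round(predicted_stanine(bill), 2))   # 6.01, as in the text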

Specific techniques for the computation of regression equations can be