
School Psychology Quarterly

Invariance of Woodcock-Johnson III Scores for Students With Learning Disorders and Students Without Learning Disorders
Nicholas Benson and Gordon E. Taub
Online First Publication, June 17, 2013. doi: 10.1037/spq0000028

CITATION
Benson, N., & Taub, G. E. (2013, June 17). Invariance of Woodcock-Johnson III Scores for Students With Learning Disorders and Students Without Learning Disorders. School Psychology Quarterly. Advance online publication. doi: 10.1037/spq0000028


Invariance of Woodcock-Johnson III Scores for Students With Learning Disorders and Students Without Learning Disorders

Nicholas Benson
The University of South Dakota

Gordon E. Taub
University of Central Florida

The purpose of this study was to test the invariance of scores derived from the Woodcock-Johnson III Tests of Cognitive Ability (WJ III COG) and Woodcock-Johnson III Tests of Academic Achievement (WJ III ACH) across a group of students diagnosed with learning disorders (n = 994) and a matched sample of students without known clinical diagnoses (n = 994). This study focused on scores reflecting broad cognitive abilities and areas of academic achievement in which children may demonstrate learning disabilities. Results of this study support the conclusion that the WJ III COG and WJ III ACH measure similar constructs for students with learning disabilities and students without learning disabilities. However, large and pervasive between-groups differences were found with regard to intercepts. Intercepts can be defined as predicted group means for individual tests, in which predicted group means are based on the factor loadings of these tests on the latent variable they are intended to measure. As many intercepts are not equivalent, it is possible that observed scores may not accurately reflect differences in the construct of interest when testing children with learning disabilities. However, tests displaying the largest intercept differences also displayed the largest group differences in observed scores, providing some support for the conclusion that these differences reflect construct-relevant between-group differences. Implications of this research are discussed.

Keywords: Woodcock-Johnson III Tests of Cognitive Ability, Woodcock-Johnson III Tests of Academic Achievement, learning disabilities, Cattell-Horn-Carroll theory, confirmatory factor analysis

Nicholas Benson, Division of Counseling and Psychology in Education, The University of South Dakota; Gordon E. Taub, Department of Educational and Human Sciences, University of Central Florida.

We are grateful to the Woodcock-Muñoz Foundation for providing the data used in this research.

Correspondence concerning this article should be addressed to Nicholas Benson, Division of Counseling and Psychology in Education, The University of South Dakota, 414 East Clark Street, Delzell Education Center Room 205D, Vermillion, SD 57069. E-mail: [email protected]

The Woodcock-Johnson III (WJ III; Woodcock, McGrew, & Mather, 2001a) includes two co-normed test batteries: the WJ III Tests of Achievement (WJ III ACH; Woodcock, McGrew, & Mather, 2001b) and the WJ III Tests of Cognitive Abilities (WJ III COG; Woodcock, McGrew, & Mather, 2001c). The theoretical foundation of the WJ III is derived from the Cattell-Horn-Carroll (CHC) theory of cognitive abilities (McGrew, 2009), which integrates both the Cattell-Horn (Horn & Noll, 1997) and three-stratum (Carroll, 1993) theories of cognitive abilities. CHC theory is hierarchical and includes three strata. Stratum III is located at the apex and consists of psychometric g, which refers to the general factor presumed to cause the positive correlation of mental tasks. Stratum II consists of at least 10 broad abilities, including auditory processing (Ga), crystallized intelligence (Gc), fluid reasoning (Gf), long-term retrieval (Glr), quantitative knowledge (Gq), reading and writing ability (Grw), processing speed (Gs), short-term memory (STM; Gsm), decision speed/reaction time (RT; Gt), and visual processing (Gv). Stratum I consists of numerous narrow CHC abilities (McGrew, 2005).

CHC theory provides a comprehensive and empirically supported psychometric theory pertaining to the structure of human cognitive abilities (Alfonso, Flanagan, & Radwan, 2005). A recent review of factor analytic research indicates that most cognitive/intellectual ability tests published within the past 10 years are consistent with the structure of abilities explicated by CHC theory (Keith & Reynolds, 2010). Although there is ample evidence supporting the CHC structure of cognitive abilities (McGrew, 2009), additional research is needed to determine if this structure can be generalized to the learning disabled population. Although cognitive abilities appear to be relevant to the identification of learning disabilities (Flanagan, Fiorello, & Ortiz, 2010) and instructional planning (McGrew & Wendling, 2010), additional research is needed to determine the extent to which tests of cognitive and academic abilities measure similar constructs for individuals with and without learning disabilities.

Cognitive/intellectual abilities and academic achievement are constructs that must be inferred from test scores but presumably have meaning and relevance beyond the scores themselves (e.g., help explain important life outcomes such as learning and school performance). Constructs refer to theoretically based explanatory concepts for which it is impractical to directly observe an adequate criterion (Cronbach & Meehl, 1955). Thus, constructs must be measured using a carefully selected, theoretically based sample of representative indicators. According to the Standards for Educational and Psychological Testing (Standards; American Educational Research Association, American Psychological Association, & National Council of Measurement in Education, 1999), test users must select measures that accurately reflect the construct of interest and have validity for intended purposes. Evidence of bias challenges the validity of score interpretations used for practical applications, such as identifying individuals with disabilities and planning instruction. In the Standards, bias is described as any construct-irrelevant source of variance that produces systematic differences in test scores between identifiable groups of examinees. By definition, construct-irrelevant variance refers to any variance that is independent from the construct of interest and thus arises from extraneous sources. In order to conclude that test fairness exists, “examinees of equal standing with respect to the construct the test is intended to measure should on average earn the same test score, irrespective of group membership” (American Educational Research Association et al., 1999, p. 74).

Measurement Invariance

An essential preliminary step in establishing the utility of test scores for diagnosing learning disabilities includes determining the extent to which measurement invariance exists across individuals with and individuals without a diagnosed learning disability. In order to ensure accurate comparisons between two or more distinct groups, it is first necessary to establish that the numerical values of test scores are on the same measurement scale (Drasgow, 1987). If measurement invariance is present, the probability of obtaining a given observed score is independent of group membership, and individuals from different groups will obtain the same observed score if their true score on the construct of interest is identical (Mellenbergh, 1989; Meredith, 1993; Meredith & Millsap, 1992; Wu, Li, & Zumbo, 2007). The extent to which measurement invariance exists is determined by testing the measurement model (i.e., specified relations between factors and observed indicators). Measurement invariance is supported if the measurement model is invariant across groups (Wu et al., 2007). Equivalence determinations are made by testing increasingly stringent levels of invariance. Achievement of full measurement invariance is the exception rather than the rule, as most tests contain at least some nonequivalent items (Byrne, Shavelson, & Muthén, 1989).
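Stated formally, this definition can be written in conditional-distribution terms (a standard formulation following Mellenbergh, 1989, rendered here in LaTeX; X, T, and G are our own shorthand for observed score, true score, and group membership, not notation taken from the article):

```latex
% Measurement invariance: the distribution of the observed score X
% depends only on the true score T, not on group membership G.
f(X = x \mid T = t,\, G = g) = f(X = x \mid T = t)
\quad \text{for all } x, t, \text{ and } g.
```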

The WJ III technical manual (McGrew & Woodcock, 2001) includes empirical evidence supporting the structural fidelity of scores obtained from the instrument as well as the configural invariance of the battery as a function of sex and race. This research was extended by Taub and McGrew’s (2004) investigation of configural and metric invariance for WJ III COG scores across the instrument’s five broad age ranges, which supported the invariance of scores across age groups. Although evidence supports the structural fidelity of the WJ III and suggests that scores are invariant across age, sex, and race, additional research is needed to establish the invariance of scores obtained by children and adolescents with disabilities. Niileksela (2012) recently examined the measurement invariance of WJ III scores obtained by 6- to 19-year-olds diagnosed with learning disorders in his dissertation pertaining to cognitive–achievement relations. Niileksela found evidence to support configural and metric invariance across groups of participants diagnosed with reading disorder (n = 180), mathematics disorder (n = 231), and disorder of written expression (n = 149). Notably, in addition to the seven broad abilities examined in the present study, Niileksela included eight additional narrow abilities in the measurement model. With one exception, these narrow abilities are measured by only two WJ III tests per factor. A minimum of three indicators per factor is recommended, especially when sample size is small (Kline, 2010). Additionally, the ratio of participants to free parameters is problematic, given the complexity of the model, the relatively small sample size for each group, and the percentage of missing data within each group. Visual inspection of figures presented in Niileksela’s dissertation indicates that over 140 free parameters are included in some models, which approximates the sample sizes for the subtype groups utilized in these analyses and thus falls below the recommended minimum ratio of five cases per free parameter (Bentler & Chou, 1987). Thus, there is a need for additional research to examine the invariance of WJ III scores across groups of children and adolescents with and without learning disabilities.

Purpose of the Study

The purpose of this study was to test the invariance of scores derived from the WJ III across a group of students diagnosed with learning disorders and a matched sample of students without known clinical diagnoses. The importance of invariance violations must be evaluated by the extent to which these violations interfere with the intended purposes of the test (Millsap & Meredith, 2007). As violations of invariance increase, score comparability decreases (Reise, Widaman, & Pugh, 1993). The WJ III is commonly used to examine profiles of strengths and weaknesses across cognitive and academic tasks, and scores that fall below age- and grade-level expectations are typically viewed as evidence of deficits. If scores obtained by students with diagnosed learning disorders are not comparable to those obtained by students without known clinical diagnoses, then potentially artifactual group differences in patterns of relationships among cognitive and academic variables may emerge and complicate the process of identifying learning disabilities.

Method

Participants

This study utilized participants drawn from the Woodcock-Muñoz Foundation Clinical Database Project (CDB)1 as well as the WJ III United States standardization sample. The CDB is the result of ongoing collaborative efforts between psychological professionals, medical professionals, and the Woodcock-Muñoz Foundation. The CDB consists primarily of archived records provided by a variety of collaborating psychological and medical professionals, and secondarily of records obtained from clinical research studies. Participants aged 5 to 18 years with a principal Diagnostic and Statistical Manual of Mental Disorders (4th ed., text rev.; DSM–IV–TR; American Psychiatric Association, 2000) diagnosis from the Learning Disorders section were drawn from the CDB (N = 994). Diagnoses were made by the collaborating professionals who provided the records.

1 Information concerning the Woodcock-Muñoz Foundation Clinical Database Project is available at http://www.woodcock-munoz-foundation.org/research/clinicalDB.html

All records submitted by practitioners were reviewed by judges with expertise in neuropsychology and psychometrics to ensure the quality of data and accuracy of diagnoses. This process included examination of data from scores obtained from tests of cognitive abilities and academic achievement to verify the presence of academic limitations as well as a pattern of cognitive strengths and weaknesses. Throughout this document, we use the terms disability and disorder synonymously. Disability is an umbrella term for impairments that limit activity and participation; these limitations result from complex interactions between an individual and external factors in their environment (World Health Organization, 2001). As previously noted, the participants in this study were diagnosed using DSM–IV–TR criteria. In school settings, the regulations set forth in the Individuals with Disabilities Education Improvement Act (2004; IDEIA) are used to identify children with learning disabilities. The IDEIA regulations do not mandate the use of the intelligence-achievement discrepancy model and allow local education agencies to choose alternative identification methods, such as a response-to-intervention model, when making determinations regarding the presence or absence of a specific learning disability. As the identification process differs in clinical and school settings, the sample examined in this study may differ from the population of students identified in a school setting. However, as the use of test scores obtained from intellectual and academic testing remains a reasonable practice in school settings (Mather & Gregg, 2006), it is reasonable to assume that the sample examined in this study is approximately representative of the population of students identified in school settings.

Information about age, sex, race, ethnicity, and principal diagnosis distributions for the CDB sample is presented in Table 1. Notably, most participants had multiple diagnoses. Of the 538 participants with reading disorder as a principal diagnosis, 67% had an additional diagnosis of disorder of written expression, and 27% had an additional diagnosis of mathematics disorder. Of the 172 participants with disorder of written expression as a principal diagnosis, 21.5% had an additional diagnosis of reading disorder, and 9.3% had an additional diagnosis of mathematics disorder. Additionally, of the 119 participants with mathematics disorder as a principal diagnosis, 9.1% had an additional diagnosis of reading disorder and 9.3% had an additional diagnosis of disorder of written expression. As many participants had multiple diagnoses, it was not possible to differentiate the clinical sample into three pure and distinct subtypes (i.e., reading disorder, mathematics disorder, and disorder of written expression) of adequate size and power for this study.

To facilitate invariance testing, a matched sample (N = 994) was drawn from the WJ III U.S. standardization sample. The standardization sample was selected to be nationally representative using random selection of participants within a stratified sampling design that controlled for several demographic variables (i.e., census region, community size, sex, race, ethnicity, type of school attended, level of parental education, occupational status of parents, and type of parental occupation; McGrew & Woodcock, 2001). The matching process involved dividing the standardization sample into males and females, then further dividing each of these samples into age groups. Random selection was used to draw participants from each of these subgroups in proportions that approximated the composition of the CDB sample with regard to sex and age. Information about age, sex, race, and ethnicity distributions for the matched sample is presented in Table 1. Statistical comparisons of age, sex, race, and ethnicity revealed equal distributions across the CDB and matched samples.

Table 1
Descriptive Statistics for the Clinical Database (CDB) and Matched Samples

         Age          Sex (%)       Race and ethnicity (%)              Principal diagnosis (%)
Sample   M (SD)       M      F      W      B      AI    API   O     H     RD     MD     WE     NOS
CDB      11.0 (3.2)   60.1   39.9   82.1   9.3    2.5   1.4   1.1   19.8  54.1   12.0   17.3   16.6
Matched  10.9 (3.2)   60.1   39.9   84.3   10.9   2.6   2.2   NA    19.2  NA     NA     NA     NA

Note. AI = American Indian; API = Asian or Pacific Islander; B = Black; F = female; H = Hispanic; M = male; MD = mathematics disorder; NA = not available; NOS = learning disorders not otherwise specified; O = other; RD = reading disorder; W = White; WE = disorder of written expression.
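As a concrete illustration of this matching procedure, the sketch below draws a sex-by-age proportionate random sample with pandas. It is a minimal reconstruction under stated assumptions (hypothetical data frames norm_df and cdb_df with sex and age_group columns), not the authors’ code:

```python
# Hypothetical reconstruction of the matching draw described above.
import pandas as pd

def draw_matched_sample(norm_df: pd.DataFrame, cdb_df: pd.DataFrame,
                        n_match: int = 994, seed: int = 1) -> pd.DataFrame:
    """Sample standardization cases so that sex-by-age cell proportions
    approximate those of the clinical (CDB) sample."""
    # Proportion of CDB cases falling in each sex-by-age cell.
    cell_props = cdb_df.groupby(["sex", "age_group"]).size() / len(cdb_df)
    pieces = []
    for (sex, age), prop in cell_props.items():
        cell = norm_df[(norm_df["sex"] == sex) & (norm_df["age_group"] == age)]
        quota = min(len(cell), round(prop * n_match))  # cases to draw from this cell
        pieces.append(cell.sample(n=quota, random_state=seed))
    return pd.concat(pieces, ignore_index=True)
```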

Measures

Measures included 35 tests from the WJ III. Extensive validity evidence is available to support interpretations of scores derived from the WJ III (Floyd, Shaver, & McGrew, 2003; McGrew, Schrank, & Woodcock, 2007; McGrew & Woodcock, 2001; Taub & McGrew, 2004), and independent reviews suggest that the WJ III is among the best norm-referenced batteries available for measuring the cognitive and academic abilities of school-age children and youth (Cizek, 2003; Sandoval, 2003). Age-based standard scores (M = 100, SD = 15) were used as the metric of analysis. Means, standard deviations, sample size, and percent of missing data for all WJ III tests and both samples are shown in Table 2. Missing data for these samples reflect the impracticality of administering all tests to every participant. For example, with regard to the clinical sample, clinicians routinely administer subsets of test batteries, as their goal is to address referral concerns while working within financial and time constraints.

Table 2
Sample Sizes, Means, and Standard Deviations for WJ III Tests Disaggregated by Sample

                              CDB sample            Matched sample
WJ III test                   n     M      SD       n     M      SD
Academic Knowledge            447   94.0   17.7     881   100.6  16.2
Analysis–Synthesis            571   97.2   15.2     682   100.5  16.1
Applied Problems              885   91.1   16.6     896   100.6  15.1
Auditory Attention            544   96.6   15.4     680   98.5   16.4
Auditory Working Memory       117   86.3   16.4     507   99.3   14.4
Block Rotation                248   102.2  16.8     547   100.4  15.3
Calculation                   901   88.2   19.1     898   99.9   15.8
Concept Formation             754   97.3   16.5     872   99.6   15.6
Decision Speed                666   97.1   15.7     699   98.5   15.2
Editing                       424   93.5   18.1     693   99.5   15.5
General Information           662   96.9   14.9     786   99.9   15.5
Incomplete Words              473   99.6   16.2     776   96.7   17.3
Letter–Word Identification    918   84.4   16.8     901   100.0  15.7
Math Fluency                  884   86.3   14.9     823   98.8   14.9
Memory for Names              271   98.2   14.6     719   101.2  15.5
Memory for Words              633   95.3   15.2     853   100.2  15.8
Number Matrices               292   96.9   16.4     818   99.0   15.7
Number Series                 290   94.2   19.9     932   100.8  16.9
Numbers Reversed              750   89.8   16.3     776   98.6   15.2
Pair Cancellation             300   98.3   9.6      721   98.8   11.9
Passage Comprehension         917   83.2   18.1     899   100.1  15.6
Picture Recognition           603   99.8   13.5     656   99.5   15.2
Quantitative Concepts         413   89.8   19.1     757   98.8   16.7
Reading Fluency               854   84.3   15.4     808   98.0   15.3
Reading Vocabulary            392   92.2   17.4     690   100.7  15.8
Retrieval Fluency             653   95.1   14.9     729   98.9   15.1
Sound Blending                753   99.3   15.2     840   100.0  14.5
Spatial Relations             753   97.4   13.8     700   100.2  15.4
Spelling                      916   81.6   16.2     894   98.6   15.2
Verbal Comprehension          722   94.2   15.2     631   100.7  15.0
Visual–Auditory Learning      750   94.2   15.2     719   99.6   15.8
Visual Matching               752   87.5   16.0     848   98.6   14.6
Word Attack                   604   90.2   13.5     845   99.4   15.3
Writing Fluency               842   85.8   15.0     812   99.0   14.4
Writing Samples               908   91.2   15.8     888   99.6   15.6
Minimum n                     117                   507
Maximum n                     918                   932
% of missing data             37                    22

Note. CDB = clinical database; WJ III = Woodcock-Johnson III.

Data Analysis

The AMOS (Arbuckle, 2007) statistical program was used to conduct confirmatory factor analyses following the maximum-likelihood estimation method, which assumes a multivariate normal distribution. Covariance matrices were constructed from standard scores rather than raw data to account for developmental differences in item difficulty. Although missing data are not optimal, large samples with missing data were utilized for data analysis, as utilizing larger samples with missing data is known to increase power relative to utilizing smaller subsets of these samples with complete data (Baraldi & Enders, 2009). Data screening indicates that the variables in this study do not violate distributional assumptions and that missing data meet the missing at random assumption. Missing data were addressed using a direct approach known as full information maximum likelihood (FIML) estimation, which has been found to produce unbiased estimates for data that are missing at random (Arbuckle, 1996; Schafer, 1997; Schafer & Graham, 2002).
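For readers unfamiliar with FIML, the casewise log-likelihood it maximizes can be written as follows (a standard textbook formulation under multivariate normality, not reproduced from the article; the subscript (i) keeps only the rows and columns of the mean vector and covariance matrix that are observed for case i, and k_i is the number of observed variables for that case):

```latex
% FIML: each case contributes the likelihood of whatever variables it
% actually has, so no cases or variables are discarded.
\ell_i(\mu, \Sigma) = -\tfrac{1}{2}\Big[\, k_i \log(2\pi)
  + \log \lvert \Sigma_{(i)} \rvert
  + (x_i - \mu_{(i)})^{\top} \Sigma_{(i)}^{-1} (x_i - \mu_{(i)}) \Big],
\qquad
\hat{\theta} = \arg\max_{\mu, \Sigma} \sum_i \ell_i(\mu, \Sigma).
```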

Models used for measurement invariance testing were specified based on information regarding the proposed structure for the WJ III provided on page 63 of the technical manual (McGrew & Woodcock, 2001). The model shown in Figure 1 includes CHC Stratum III (i.e., g) and Stratum II (i.e., Ga, Gc, Gf, Glr, Gs, Gsm, and Gv) abilities for which sufficient observed data from indicators were available in the CDB sample. Stratum I abilities for the WJ III COG were not included in the analyses, as none of these abilities had more than two indicators and most had only one indicator.

Figure 1. Measurement model for the WJ III COG. (The figure depicts a higher-order model in which g predicts the seven broad abilities, each measured by three WJ III tests, or four in the case of Gf.)
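To make the factor-indicator assignments in Figure 1 concrete, the sketch below writes the model in lavaan-style syntax, here fed to the Python semopy package purely as an illustration; the original analyses were conducted in AMOS. The variable names are shorthand chosen for this sketch, and the cross-loadings mirror those reported in Table 5 (Auditory Attention on both Ga and Gs; Retrieval Fluency on both Glr and Gs):

```python
# Illustrative spec of the Figure 1 higher-order model; not the authors' code.
import semopy

WJ3_COG_MODEL = """
Gf  =~ ConceptFormation + AnalysisSynthesis + NumberSeries + NumberMatrices
Gv  =~ SpatialRelations + BlockRotation + PictureRecognition
Gs  =~ VisualMatching + DecisionSpeed + PairCancellation + AuditoryAttention + RetrievalFluency
Glr =~ VisualAuditoryLearning + MemoryForNames + RetrievalFluency
Ga  =~ SoundBlending + IncompleteWords + AuditoryAttention
Gsm =~ AuditoryWorkingMemory + NumbersReversed + MemoryForWords
Gc  =~ VerbalComprehension + GeneralInformation + AcademicKnowledge
g   =~ Gf + Gv + Gs + Glr + Ga + Gsm + Gc
"""

model = semopy.Model(WJ3_COG_MODEL)
# result = model.fit(df)  # df: one column of standard scores per test
```

In a multiple-group run, the invariance steps described below correspond to progressively constraining these loadings, and then the test intercepts, to equality across the clinical and matched groups.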

The measurement model for the WJ III ACH shown in Figure 2 is consistent with definitions of specific learning disabilities in basic reading skills, reading comprehension,2 math calculation, math problem solving, and written expression defined in the IDEIA (U.S. Department of Education, 2004). This model was utilized because the WJ III COG and WJ III ACH are routinely used to facilitate identification of, and instructional planning for, individuals with learning disabilities. Factors representing basic writing skills and an academic fluency factor3 were included to maintain consistency with the WJ III scoring model. Additionally, factors representing Grw and Gq provided on page 63 of the technical manual were included (McGrew & Woodcock, 2001). The model includes correlated Stratum II factors rather than a factor representing an academic g, as an academic g seemingly is of limited value when assessing children with learning disabilities: it would tend to be severely attenuated by academic deficits.

Figure 2. Measurement model for the WJ III ACH. (The figure depicts first-order factors for basic reading skills, reading comprehension, basic writing skills, written expression, basic math skills, and math reasoning, together with higher-order Grw, Gq, and Gs factors.)

2 Although the Individuals with Disabilities Education Improvement Act (2004) does specify reading fluency as a separate area of academic achievement, reading fluency was used as an indicator of reading comprehension in this study because it plays an important role in a reader’s ability to comprehend text. Moreover, there is only one indicator of reading fluency on the WJ III ACH, so a composite score cannot be obtained.

3 It also should be noted that models without an academic fluency factor provided very poor fit for the CDB sample and the matched sample. This factor could be replaced by loadings from Gs in a model including both cognitive and academic abilities.

Notably, the models shown in Figures 1 and 2 technically are not measurement models because these models are hierarchical and include structural paths between latent variables (i.e., paths from g to broad abilities). Nevertheless, because the WJ III is presumed to have a hierarchical structure, these relationships were included in the analyses. Measurement invariance was tested for both the WJ III COG and WJ III ACH using a process that involved testing models with increasingly restrictive cross-group equality constraints. The models tested for the WJ III COG and WJ III ACH are presented in Table 3 and Table 4, respectively.

The first step in this process was to test for configural invariance (i.e., measured variables define the same factors and have the same pattern of loadings across groups). An invariant structural configuration suggests that tests measure the same constructs in both groups, although the magnitude of factor loadings may vary. The second step in this process involved constraining measurement weights as equal across groups in order to test metric invariance (i.e., the magnitudes of observed variables’ factor loadings are invariant across groups). Invariant factor loadings suggest that the factor is calibrated in a similar way in both samples (Steinmetz, 2013). If factor loadings are not equivalent, then the factor does not have the same meaning for the clinical sample as it does for the matched sample of students without known clinical diagnoses.

In the third step of this process, measurement intercepts were constrained to be equal across groups in order to test scalar invariance. Scalar invariance refers to an absence of systematic group differences in additive biases as estimated by discrepancies between observed means and means that are expected based on factor loadings (Hayduk, 1989; Steinmetz, 2013). In this study, intercepts can be defined as predicted group means for individual WJ III tests, in which predicted group means are based on the factor loadings of these tests on the broad ability they are intended to measure. Invariant intercepts ensure that observed scores accurately reflect differences in the construct of interest (Millsap, 1998).
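In factor-model notation (a standard formulation, not taken verbatim from the article; τ is an intercept, λ a loading, η the broad ability, κ a latent mean, and ε a residual), the levels of invariance tested here constrain successive pieces of the model for each test j and group g:

```latex
% Model-implied score and mean for test j in group g:
x_{jg} = \tau_{jg} + \lambda_{jg}\,\eta_{g} + \varepsilon_{jg},
\qquad
E(x_{jg}) = \tau_{jg} + \lambda_{jg}\,\kappa_{g}.
% Configural: same pattern of nonzero loadings in both groups.
% Metric (weights):    \lambda_{j1} = \lambda_{j2}
% Scalar (intercepts): add \tau_{j1} = \tau_{j2}
% Strict (residuals):  add \operatorname{Var}(\varepsilon_{j1}) = \operatorname{Var}(\varepsilon_{j2})
```

Under scalar invariance, any group difference in an expected test mean is carried entirely by the latent-mean difference, which is what licenses comparing observed scores across groups.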

The next step in this process involved constraining structural weights (i.e., the direct effects of g on broad abilities as well as the direct effects of broad abilities on narrow abilities) as equal across groups. This was followed by testing a model with structural covariances (i.e., variances and covariances in the structural part of the model) constrained. The structural part of the model consists of factors representing the constructs of interest. Next, structural residuals (i.e., factor variance not explained by higher-order factors) were constrained. In the final step of this process, the residual variances (i.e., variance not explained by factors) for WJ III tests were constrained in order to test strict factorial invariance (Meredith, 1993).

In addition to invariance testing, average variance extracted (AVE) and construct reliability estimates were calculated to demonstrate the adequacy of select constructs. Although cognitive abilities may theoretically be good indicators of academic skills, any observed scores reflecting these constructs will be misleading unless they are measured adequately by the tests purported to measure them. AVE, which is calculated by dividing the sum of squared factor loadings by the sum of squared factor loadings plus error variance, reflects the amount of common variance shared by tests used to measure a construct, while construct reliability reflects internal consistency (Anderson & Gerbing, 1988; Fleishman & Benson, 1987). Higher values are better for both statistics, and values exceeding .5 for AVE and .7 for composite reliability are minimally acceptable.
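These two formulas are easy to express directly. The sketch below implements them for standardized loadings (so each test’s error variance is 1 − λ²); plugging in the clinical-sample Gc loadings from Table 5 approximately reproduces the Table 7 entries, with small discrepancies reflecting rounding of the published loadings:

```python
# AVE and composite reliability from standardized factor loadings,
# following the formulas described in the text.
def ave(loadings):
    """Average variance extracted: sum of squared loadings over
    sum of squared loadings plus error variance (1 - lambda^2)."""
    sq = [l * l for l in loadings]
    return sum(sq) / (sum(sq) + sum(1 - s for s in sq))

def composite_reliability(loadings):
    """Construct (composite) reliability from standardized loadings."""
    num = sum(loadings) ** 2
    return num / (num + sum(1 - l * l for l in loadings))

# Clinical-sample Gc loadings from Table 5:
# Verbal Comprehension, General Information, Academic Knowledge.
gc_clinical = [0.941, 0.820, 0.846]
print(round(ave(gc_clinical), 3))                    # 0.758 (Table 7: .760)
print(round(composite_reliability(gc_clinical), 3))  # 0.903 (Table 7: .904)
```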

Table 3
Summary of Results for WJ III COG Invariance Testing Across the CDB and Matched Samples

Step  Constraints                          χ²(df)           Δχ²       Δdf   p      CFI   TLI   RMSEA  AIC
1     Configural                           943.412 (394)    —         —     —      .950  .936  .026   1255.412
2     Measurement weights                  1000.127 (411)   56.714a   17    <.001  .947  .934  .027   1278.127
3     Measurement intercepts               1579.620 (433)   579.493a  22    <.001  .896  .878  .037   1813.620
4     Structural weights + Step 2          1016.939 (417)   16.812b   6     .010   .946  .934  .027   1282.939
5     Structural covariances + Step 4      1021.645 (418)   4.706a    1     .030   .945  .934  .027   1285.645
6     Structural residuals + Step 5        1043.455 (425)   21.81a    7     .003   .944  .933  .027   1293.455
7     Measurement residuals + Step 6       1181.996 (450)   138.541   25    <.001  .934  .925  .029   1381.996

Note. AIC = Akaike information criterion; CDB = clinical database; CFI = comparative fit index; RMSEA = root mean square error of approximation; TLI = Tucker-Lewis index; WJ III COG = Woodcock-Johnson III Tests of Cognitive Ability.
a Compared with previous model. b Compared with model in Step 2.


Use of Fit Indices

Simulation studies suggest that multiple fit indices should be used when evaluating model fit (Fan & Sivo, 2007). Although some fit indices have been shown to provide redundant information (Meade, Johnson, & Braddy, 2008), several types of fit indices have been developed, each reflecting different facets of model fit. Researchers must select an adequate set of indices for examining model fit in light of substantive research questions and probable sources of bias (Miles & Shevlin, 2007). The indices utilized in this study included the root mean square error of approximation (RMSEA), the comparative fit index (CFI), the Akaike information criterion (AIC), and the change in chi-square value (Δχ²).

The RMSEA was utilized because simulation studies suggest it performs reasonably well as a test of invariance and is fairly insensitive to sample size, interactions among variables, and the reliability of indicators (Meade et al., 2008). The RMSEA reflects how well the model would fit the population covariance matrix if optimal parameter values were chosen. The CFI was utilized because it is insensitive to sample size and has performed well as a test of invariance during simulation studies (Cheung & Rensvold, 2002; Meade et al., 2008). The CFI reflects the fit of the hypothesized model relative to an independence model (i.e., a model in which all correlations among variables in the model are zero). The AIC was used because simulation studies suggest this index is useful when conducting model comparisons, including comparisons involving nonnested models (Yuan, 2005). As the successive models used to test invariance are nested, a likelihood ratio test (Δχ²) was used to determine if adding theoretically substantive constraints resulted in a large and statistically significant drop in χ² values (Keith, 2005).
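As a worked instance of this likelihood-ratio test, the snippet below recomputes the Step 1 to Step 2 comparison for the WJ III COG from the χ² values later reported in Table 3; scipy’s chi-square survival function supplies the p value:

```python
# Chi-square difference (likelihood ratio) test for nested models,
# using the configural vs. metric values reported in Table 3.
from scipy.stats import chi2

chi2_configural, df_configural = 943.412, 394
chi2_metric, df_metric = 1000.127, 411

delta_chi2 = chi2_metric - chi2_configural  # 56.715 (Table 3: 56.714)
delta_df = df_metric - df_configural        # 17
p_value = chi2.sf(delta_chi2, delta_df)     # p < .001, matching Table 3
print(delta_chi2, delta_df, p_value)
```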

Results

Results suggest that both the WJ III COG and WJ III ACH measurement models are partially invariant across a group of students diagnosed with learning disorders and a matched sample without known clinical diagnoses. Results for the WJ III COG and WJ III ACH are presented in Tables 3 and 4, respectively. Results suggest that configural and metric invariance are tenable for both the WJ III ACH and WJ III COG, although support for metric invariance is marginal for the WJ III COG. According to the likelihood ratio test, model fit degraded for the WJ III COG and WJ III ACH when measurement weights were constrained to be equal. However, as shown in Table 5, the magnitude of between-groups differences in first-order factor loadings is reasonably small. With regard to the WJ III ACH, removing the equality constraint for the loading of passage comprehension on reading comprehension, while maintaining constraints on all other first-order loadings, resulted in a statistically nonsignificant change in chi-square, Δχ²(8) = 17.897, p = .022. Thus, with the exception of passage comprehension, measurement weights for the WJ III ACH appear to be invariant.

Table 4
Summary of Results for WJ III ACH Invariance Testing Across the CDB and Matched Samples

Step  Constraints                          χ²(df)           Δχ²       Δdf   p      CFI   TLI   RMSEA  AIC
1     Configural                           437.168 (108)    —         —     —      .974  .956  .039   637.168
2     Measurement weights                  468.442 (117)    31.274a   9     <.001  .972  .957  .039   650.442
3     Intercepts                           1126.679 (130)   658.237a  13    <.001  .922  .891  .062   1282.679
4     Structural weights + previous        511.922 (121)    43.48b    4     <.001  .969  .954  .040   685.922
5     Structural covariances + previous    618.117 (127)    125.413a  6     <.001  .961  .945  .044   780.117
6     Structural residuals + previous      637.335 (132)    19.218a   5     .002   .960  .945  .044   789.335
7     Measurement residuals + previous     699.744 (145)    62.409a   13    <.001  .956  .945  .044   825.744

Note. AIC = Akaike information criterion; CDB = clinical database; CFI = comparative fit index; RMSEA = root mean square error of approximation; TLI = Tucker-Lewis index; WJ III ACH = Woodcock-Johnson III Tests of Academic Achievement.
a Compared with previous model. b Compared with model in Step 2.

Table 5
Standardized Factor Loadings Sorted by Between-Group Differences in Factor Loadings

Test (factor)                      Clinical sample λ   Matched sample λ   Clinical λ − Matched λ
Spatial Relations (Gv)             0.692               0.533              0.159
Passage Comprehension (RC)         0.887               0.736              0.151
Analysis–Synthesis (Gf)            0.796               0.654              0.142
Retrieval Fluency (Glr)            0.421               0.279              0.142
Number Series (Gf)                 0.804               0.666              0.138
Decision Speed (Gs)                0.782               0.645              0.137
Number Matrices (Gf)               0.719               0.610              0.109
Numbers Reversed (Gsm)             0.777               0.672              0.105
Auditory Attention (Gs)            0.279               0.189              0.090
Concept Formation (Gf)             0.818               0.746              0.072
Applied Problems (MR)              0.871               0.800              0.071
Calculation (BMS)                  0.882               0.818              0.064
Reading Fluency (RC)               0.691               0.629              0.062
Writing Fluency (WE)               0.606               0.549              0.057
Math Fluency (BMS)                 0.572               0.516              0.056
Writing Samples (WE)               0.768               0.717              0.051
Pair Cancellation (Gs)             0.866               0.819              0.047
Letter–Word Identification (RD)    0.954               0.909              0.045
Memory for Words (Gsm)             0.629               0.593              0.036
Reading Vocabulary (RC)            0.813               0.780              0.033
Block Rotation (Gv)                0.499               0.468              0.031
Sound Blending (Ga)                0.724               0.700              0.024
Picture Recognition (Gv)           0.452               0.429              0.023
Quantitative Concepts (MR)         0.881               0.859              0.022
Academic Knowledge (Gc)            0.846               0.832              0.014
Word Attack (RD)                   0.763               0.763              0.000
Spelling (BWS)                     0.802               0.803              −0.001
Verbal Comprehension (Gc)          0.941               0.944              −0.003
Visual–Auditory Learning (Glr)     0.813               0.818              −0.005
Visual Matching (Gs)               0.840               0.848              −0.008
Incomplete Words (Ga)              0.513               0.521              −0.008
Auditory Attention (Ga)            0.373               0.384              −0.011
Editing (BWS)                      0.736               0.755              −0.019
General Information (Gc)           0.820               0.851              −0.031
Memory for Names (Glr)             0.594               0.627              −0.033
Retrieval Fluency (Gs)             0.237               0.319              −0.082
Reading Fluency (Gs)               0.384               0.469              −0.085
Math Fluency (Gs)                  0.441               0.533              −0.092

Note. λ is a factor loading. BMS = basic math skills; BWS = basic writing skills; Ga = auditory processing; Gc = crystallized intelligence; Gf = fluid reasoning; Glr = long-term retrieval; Gs = processing speed; Gsm = short-term memory; Gv = visual processing; MR = math reasoning; RC = reading comprehension; RD = reading decoding; WE = written expression.

In contrast to measurement loadings, strong evidence of noninvariance was found for intercepts. Intercept differences by test, along with the statistical significance of these differences, are presented in Table 6. Statistical significance tests reveal that intercepts for 30 of 35 tests differ between the clinical and matched samples.4 Due to the magnitude and pervasiveness of noninvariance, intercepts were not constrained to be equal in subsequent steps of invariance testing. According to the likelihood ratio test, model fit degraded for both the WJ III COG and WJ III ACH when constraints were placed on structural weights (i.e., the direct effects of g on broad abilities as well as the direct effects of broad abilities on narrow abilities). However, invariance for these second-order loadings is marginally tenable for the WJ III COG, as other fit indices displayed little change. Noninvariance of structural weights appears to be more problematic for the WJ III ACH, although visual inspection of estimates for second-order loadings indicates these loadings are large in both groups, as none of these loadings fall below .8 and most exceed .9.

4 The experiment-wise error rate of .297 suggests there is about a 30% chance that a Type I error was made in the course of these comparisons (with 35 comparisons at α = .01, 1 − .99^35 ≈ .297).

Adding constraints on structural covariances did not significantly degrade model fit for the WJ III COG, Δχ²(1) = 4.706, p = .030, suggesting that variances and covariances in the structural part of the model are invariant. Conversely, structural covariances for the WJ III ACH appear to be noninvariant. This may reflect the fact that an academic g is not included in the WJ III ACH model. According to the likelihood ratio test, model fit degraded for both the WJ III COG and WJ III ACH when constraints were placed on structural residuals (i.e., factor variance not explained by higher-order factors). However, invariance of structural residuals is tenable, as other fit indices displayed little change. Finally, adding constraints on measurement residuals for WJ III tests resulted in substantial degradation of model fit for the WJ III COG. Model fit also degraded for the WJ III ACH.

Table 6
Intercepts, Intercept Differences Sorted by Magnitude, and Effects of Disabilities on Observed Test Scores

Test                         Clinical intercept  Matched intercept  Difference  t       df    p      Disability effect sizea
Spelling                     81.587              98.587             −17.000     −23.02  3618  <.001  1.082
Passage Comprehension        83.192              100.088            −16.896     −21.30  3630  <.001  .999
Letter–Word Identification   84.441              100.030            −15.589     −21.08  3636  <.001  .959
Reading Fluency              83.543              97.607             −14.064     −18.83  3322  <.001  .892
Writing Fluency              85.059              99.026             −13.967     −19.42  3306  <.001  .897
Quantitative Concepts        85.327              99.215             −13.888     −15.03  2338  <.001  .512
Math Fluency                 86.062              98.807             −12.745     −17.85  3412  <.001  .839
Reading Vocabulary           87.427              100.151            −12.724     −14.04  2162  <.001  .518
Visual–Auditory Learning     87.483              99.858             −12.375     −14.76  2936  <.001  .348
Word Attack                  88.132              100.320            −12.188     −17.07  2896  <.001  .631
Calculation                  88.161              99.883             −11.722     −14.21  3596  <.001  .667
Visual Matching              87.359              98.830             −11.471     −14.99  3198  <.001  .727
Auditory Working Memory      87.678              98.971             −11.293     −7.27   1246  <.001  .879
Editing                      88.259              98.903             −10.644     −11.21  2232  <.001  .363
Number Series                89.996              100.542            −10.546     −9.70   2442  <.001  .374
Applied Problems             91.245              100.583            −9.338      −12.49  3560  <.001  .599
Numbers Reversed             89.525              98.634             −9.109      −11.42  3050  <.001  .559
Writing Samples              91.185              99.473             −8.288      −11.17  3590  <.001  .535
Memory for Names             94.297              101.461            −7.164      −7.31   1978  <.001  .197
Academic Knowledge           93.669              100.540            −6.871      −8.05   2654  <.001  .395
Verbal Comprehension         93.703              100.260            −6.557      −8.64   2704  <.001  .430
Memory for Words             94.490              100.449            −5.959      −7.39   2970  <.001  .315
Number Matrices              93.884              99.104             −5.220      −5.38   2218  <.001  .132
Retrieval Fluency            94.032              99.129             −5.097      −6.58   2762  <.001  .253
Analysis–Synthesis           96.225              100.273            −4.048      −4.76   2504  <.001  .210
General Information          96.712              100.235            −3.523      −4.59   2894  <.001  .197
Pair Cancellation            95.712              99.028             −3.316      −5.50   2040  <.001  .044
Decision Speed               95.995              99.138             −3.143      −3.79   2728  <.001  .091
Spatial Relations            97.178              100.169            −2.991      −3.93   2904  <.001  .192
Concept Formation            97.052              99.858             −2.806      −3.55   3250  <.001  .144
Auditory Attention           95.945              98.427             −2.482      −2.79   2446  .030   .119
Sound Blending               99.172              100.114            −0.942      −1.27   3184  .201   .047
Block Rotation               99.646              100.353            −0.707      −0.60   1588  .723   −.114
Picture Recognition          99.716              99.383             0.333       0.42    2516  .523   −.021
Incomplete Words             99.214              96.696             2.518       2.91    2496  .117   −.172

a Effect size represents mean differences of observed test scores between the clinical group and a nonclinical comparison group.

An examination of the AVE and construct reliability estimates in Table 7 reveals that values for most constructs reach the minimum threshold for acceptability. Notably, estimates for Ga and Gv suggest it may not be appropriate to interpret these constructs when deriving composites from the tests included in this study.

Table 7
AVE and Construct Reliability for Select WJ III Constructs

Construct        Clinical sample   Matched sample
Ga   AVE         .400              .395
     CR          .651              .649
Gc   AVE         .760              .769
     CR          .904              .909
Gf   AVE         .617              .451
     CR          .865              .765
Glr  AVE         .504              .485
     CR          .740              .716
Gs   AVE         .689              .602
     CR          .869              .817
Gsm  AVE         .627              .499
     CR          .856              .748
Gv   AVE         .310              .229
     CR          .564              .469
RC   AVE         .642              .515
     CR          .842              .760
RD   AVE         .746              .704
     CR          .853              .825
BMS  AVE         .553              .468
     CR          .703              .626
MR   AVE         .767              .689
     CR          .868              .816

Note. AVE = average variance extracted; BMS = basic math skills; CR = composite reliability; Ga = auditory processing; Gc = crystallized intelligence; Gf = fluid reasoning; Glr = long-term retrieval; Gs = processing speed; Gsm = short-term memory; Gv = visual processing; MR = math reasoning; RC = reading comprehension; RD = reading decoding; WJ III = Woodcock-Johnson III.

Discussion

The results of this study support the conclu-sion that the WJ III measures similar, but notidentical, constructs for children who have andhave not been diagnosed with a learning disor-der. Results suggest that WJ III tests define thesame factors and are calibrated in a similar wayacross both samples. Thus, scores obtainedfrom the WJ III have the same meaning for theclinical sample as they do for the matched sam-ple of students without known clinical diagno-ses. However, large and pervasive between-groups differences were found with regard tointercepts. Noninvariant intercepts suggest thatuniform bias (i.e., bias that is uniform acrossability levels; Mellenbergh, 1989) may have anegative impact on scores obtained by childrenwith learning disabilities. In other words, resultssuggest that bias may have a fairly consistenteffect on scores obtained by students with learn-ing disabilities, regardless of performance level.

A group’s mean score on a test is a functionof latent means, factor loadings, and intercepts(Steinmetz, 2013). Note that it is possible forintercepts to be invariant across groups whenmean scores differ, in which case it can beconcluded that between-groups mean differ-ences in test scores are a function of differencesin the construct of interest (Wicherts & Dolan,2010). As intercepts for multiple WJ III testswere found to be noninvariant, it is unclear ifthese intercept differences reflect clinicallymeaningful differences in constructs of interestor bias. In addition to measuring broad abilities,the tests used to measure cognitive abilities andacademic achievements were designed to mea-sure specific or narrow abilities. However, mostof these narrow abilities are measured by asingle test, and scores derived from perfor-mance with individual WJ III tests are lessreliable than cluster scores derived from perfor-mance with multiple WJ III tests (McGrew &Woodcock, 2001). Practitioners typically relyon cluster scores that reflect broad abilities fordiagnostic purposes, as scores reflecting thesebroad abilities are more reliable. Scores that fallbelow age- and grade-level expectations aretypically viewed as evidence of cognitive andacademic deficits. Although noninvariant inter-cepts may be a function of narrow abilitiestapped by specific tests (Wicherts & Dolan,2010), invariant intercepts are problematic

Table 7AVE and Construct Reliability for Select WJIII Constructs

Construct Clinical sample Matched sample

GaAVE .400 .395CR .651 .649

GcAVE .760 .769CR .904 .909

GfAVE .617 .451CR .865 .765

GlrAVE .504 .485CR .740 .716

GsAVE .689 .602CR .869 .817

GsmAVE .627 .499CR .856 .748

GvAVE .310 .229CR .564 .469

RCAVE .642 .515CR .842 .760

RDAVE .746 .704CR .853 .825

BMSAVE .553 .468CR .703 .626

MRAVE .767 .689CR .868 .816

Note. AVE � average variance extracted; BMS � basicmath skills; CR � composite reliability; Ga � auditoryprocessing; Gc � crystallized intelligence; Gf � fluid rea-soning; Glr � long-term retrieval; Gs � processing speed;Gsm � visual processing; Gv � visual processing; MR �math reasoning; RC � reading comprehension; RD � read-ing decoding; WJ III � Woodcock-Johnson III Tests.

12 BENSON AND TAUB

Thi

sdo

cum

ent

isco

pyri

ghte

dby

the

Am

eric

anPs

ycho

logi

cal

Ass

ocia

tion

oron

eof

itsal

lied

publ

ishe

rs.

Thi

sar

ticle

isin

tend

edso

lely

for

the

pers

onal

use

ofth

ein

divi

dual

user

and

isno

tto

bedi

ssem

inat

edbr

oadl

y.

Page 14: Invariance of Woodcock-Johnson III Scores for Students With Learning Disorders and Students Without Learning Disorders

when the measurement aim is broad abilities, asit is difficult to disentangle clinically meaning-ful differences in narrow abilities from system-atic group differences that result from bias. Inother words, the finding of noninvariant inter-cepts suggests that (a) the measurement targets(i.e., broad cognitive and academic abilities) donot sufficiently capture clinically meaningfuldifferences between students diagnosed withlearning disorders and students without knownclinical diagnoses, or (b) systemic bias producesartificially low scores for individuals with learn-ing disabilities.

The importance of narrow abilities is supported by Niileksela’s (2012) study of cognitive–achievement relations. Narrow abilities, especially knowledge and the ability to reason quantitatively, were found to have consistent direct effects on academic achievement for participants drawn from the CDB as well as for participants without identified learning disorders. Additionally, differences in cognitive–achievement relations were observed between groups based on the type of learning disorder (i.e., reading disorder, mathematics disorder, or disorder of written expression). For example, the ability to reason quantitatively did not have a direct effect on mathematics achievement in the sample of students with mathematics disorders. Although additional research utilizing larger samples and additional indicators of narrow abilities is needed to confirm the results of Niileksela’s study, these results are intriguing and suggest that scores reflecting narrow abilities may have greater diagnostic and treatment utility relative to scores reflecting the seven broad abilities measured by the WJ III COG.

In the present study, evidence of noninvariance was found with regard to intercepts and measurement residuals. Thus, although the WJ III appears to measure similar constructs across clinical and nonclinical groups, it does not measure these constructs identically in both groups. Noninvariance of intercepts and measurement residuals does not necessarily indicate bias and may instead result from model specification error (e.g., narrow abilities exist but are not specified because the WJ III has few pure measures of these narrow abilities).

If the intercept differences observed in this study are construct relevant, then the largest differences should be observed on academic skills impacted by learning disabilities and cognitive abilities viewed as hallmarks of learning disabilities. A meta-analysis of cognitive processing differences between students with learning disabilities and typically achieving peers indicated that phonological processing, verbal working memory, language ability, and Gs deficits characterized a specific learning disability in reading, whereas executive function deficits, including attention and working memory problems, characterized a specific learning disability in math (Johnson, Humphrey, Mellard, Woods, & Swanson, 2010). To explore this hypothesis, an effect size representing the mean differences of observed test scores between the clinical and nonclinical samples was calculated. As shown in Table 6, when sorted by the magnitude of this effect size, it is clear that many of the largest differences in observed scores, as well as the largest intercept differences, occur for tests measuring academic skills and cognitive abilities with which children with learning disabilities tend to have difficulty. Likewise, the smallest intercept differences tend to occur for tests that have not demonstrated notable empirical relationships with learning disabilities. These differences are consistent with a recent synthesis of a large body of research pertaining to academic deficits as well as the relations of these deficits with cognitive abilities (Johnson et al., 2010; Niileksela, 2012).
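One plausible reconstruction of this effect size is Cohen’s d with a pooled standard deviation; the article does not state the exact formula, so the sketch below is an assumption rather than the authors’ computation. Plugging in the Spelling values from Table 2 reproduces the 1.082 reported in Table 6:

```python
# Standardized mean difference (Cohen's d, pooled SD); assumed formula.
from math import sqrt

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Mean difference divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var)

# Matched minus clinical sample, Spelling (values from Table 2).
print(round(cohens_d(98.6, 15.2, 894, 81.6, 16.2, 916), 3))  # 1.082
```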

AVE and construct reliability estimates support the interpretation of most broad abilities, although estimates for Ga and Gv suggest that it may not be appropriate to interpret these constructs when deriving composites from the tests included in this study. AVE and construct reliability estimates also suggest that most constructs are measured similarly across groups.

Limitations

One limitation of this study is that few constructs representing narrow abilities were included. Although the WJ III COG measures numerous narrow abilities, some of these abilities are measured by a single WJ III test. Moreover, some of the WJ III tests were infrequently administered to participants in the CDB sample. Thus, it was not possible to test the invariance of scores reflecting these abilities across a group of students with identified learning disorders and a matched sample without an identified learning disorder. It would be useful to include more narrow abilities in future studies, as some of the strongest direct effects on academic skills appear to be at the narrow ability level (McGrew & Wendling, 2010). The fact that it is difficult to determine if the intercept differences found in this study represent bias or meaningful differences between clinical and nonclinical samples represents another limitation.

Another limitation of this study is that participants with learning disorders were lumped into a single group rather than separated into reading disorder, mathematics disorder, and disorder of written expression groups. Moreover, the majority of participants in this study had reading disorder as their principal diagnosis, and many who did not had a secondary or tertiary diagnosis of reading disorder. It is possible that the WJ III tests function differently based on type of learning disorder, and additional research is needed to explore this possibility. Notably, existing research suggests that WJ III tests display configural and metric invariance across reading disorder, mathematics disorder, and disorder of written expression groups (Niileksela, 2012). In addition to examining reading disorder, mathematics disorder, and disorder of written expression groups, research is also needed to identify whether differences exist for subtypes within each type of disorder (e.g., subtypes within mathematics disorder might include a group that primarily has difficulty with the procedural requirements of math and a group that primarily has difficulty with memorizing and retrieving math facts). Finally, the present study does not address the utility of the WJ III as a diagnostic tool. Receiver operating characteristic (ROC) procedures are recommended for evaluating the diagnostic accuracy of psychological assessments (Mandrekar, 2010; McFall & Treat, 1999; Swets, Dawes, & Monahan, 2000). Future research utilizing ROC procedures is needed to determine if profiles of scores derived from the WJ III have utility for diagnosing learning disabilities.

Summary

This study provides empirical evidence regarding the extent to which scores derived from the WJ III are invariant across a group of students with identified learning disorders and a matched sample of students without an identified learning disorder. This research has important implications for research and practice. Researchers must know how well the measurement model functions within groups of students with learning disabilities to ensure the accuracy of their findings. As intercepts were not equivalent across the clinical and matched samples, researchers must exercise caution when interpreting mean score differences between groups of students with learning disabilities and groups of students without learning disabilities. For example, mean differences on the spelling, passage comprehension, and letter–word identification tests result largely from either bias or narrow abilities that were not specified as factors in the models examined in this study.

Practitioners must evaluate the meaningfulness of invariance violations by the extent to which these violations interfere with the intended purposes of the test (Millsap & Meredith, 2007). The results of this study suggest that the WJ III tests measure similar constructs across children with and without learning disabilities. However, large differences in intercepts were found between samples. Tests of academic skills, and of cognitive abilities known to be related to these academic skills (McGrew & Wendling, 2010), displayed the largest intercept differences. Notably, the largest group differences in observed scores among WJ III tests also occurred for these same tests, suggesting that these differences may reflect differences in the constructs of interest rather than systematic bias. The fact that children diagnosed with learning disabilities typically display a variety of other learning problems (e.g., failing grades, low scores on high-stakes achievement tests) provides external validity evidence for tests of academic skills and related cognitive abilities, suggesting these tests are sensitive to learning problems.

The results of this study are consistent with McGrew and Wendling's (2010) call for more selective, referral-focused assessment that emphasizes assessment of narrow cognitive and narrow achievement abilities. However, future research is needed to confirm that differences in narrow cognitive abilities play a causal role in the expression of learning disabilities. Increased focus on assessment of narrow abilities will necessitate development of more tests to measure these abilities. Existing test batteries rarely include more than two indicators of a given narrow ability, and many narrow abilities are not assessed by most existing batteries. Adequate measurement of these narrow abilities will require at least two tests per ability, and these tests will need to share a substantial amount of common variance reflecting their respective abilities.
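One quick screen for whether two candidate tests share enough common variance to anchor a narrow-ability factor follows from the model itself: for standardized indicators with equal loadings, the correlation between the two tests equals the squared loading, so the observed correlation directly estimates the proportion of each test's variance attributable to the common ability. The Python sketch below simulates this; the .80 loadings, error variances, and sample size are assumptions for illustration only.

```python
import numpy as np

# Hypothetical standardized scores on two candidate tests of the same
# narrow ability; the .80 loadings and error SDs are assumed values.
rng = np.random.default_rng(seed=7)
factor = rng.normal(size=500)                         # latent narrow ability
test_a = 0.8 * factor + rng.normal(scale=0.6, size=500)
test_b = 0.8 * factor + rng.normal(scale=0.6, size=500)

# With equal standardized loadings, r(a, b) = loading**2, so the observed
# correlation itself estimates the share of each test's variance that
# reflects the common narrow ability (about .64 here).
r = np.corrcoef(test_a, test_b)[0, 1]
print(f"Estimated common variance per indicator: {r:.2f}")
```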

Finally, it should be noted that cluster scores derived from the WJ III need not be regarded as incomparable across students with learning disabilities and students without learning disabilities. The size of intercept differences may be estimated and taken into account (Wicherts & Dolan, 2010). Thus, if intercept differences are found to result from construct-irrelevant variables rather than clinically meaningful differences in narrow abilities, it would be possible to control for this bias when developing statistical formulas for identifying consistencies among, or discrepancies between, abilities and achievement.
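A minimal sketch of how such a correction might enter a discrepancy formula appears below, in Python. The 4-point offset, the score values, and the function itself are hypothetical: Wicherts and Dolan (2010) describe how intercept differences can be estimated, but the adjustment shown here is an assumed illustration, not a validated procedure.

```python
def adjusted_discrepancy(ability, achievement, intercept_offset=0.0):
    """Hypothetical ability-achievement discrepancy that removes an
    estimated construct-irrelevant intercept difference. A positive
    intercept_offset means the clinical group's observed achievement
    scores run low by that many points for reasons unrelated to the
    construct. All values are standard scores (M = 100, SD = 15)."""
    corrected_achievement = achievement + intercept_offset
    return ability - corrected_achievement

# A 12-point raw discrepancy shrinks to 8 points once an assumed 4-point
# intercept artifact is removed from the achievement score.
print(adjusted_discrepancy(ability=102, achievement=90, intercept_offset=4.0))
```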

References

Alfonso, V. C., Flanagan, D. P., & Radwan, S. (2005). The impact of Cattell-Horn-Carroll theory on test development and interpretation of cognitive and academic abilities. In D. P. Flanagan & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (2nd ed., pp. 185–202). New York, NY: Guilford Press.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

American Psychiatric Association. (2000). Diagnostic and statistical manual of mental disorders (4th ed., text rev.). Washington, DC: Author.

Anderson, J. C., & Gerbing, D. W. (1988). Structural equation modeling in practice: A review and recommended two-step approach. Psychological Bulletin, 103, 411–423. doi:10.1037/0033-2909.103.3.411

Arbuckle, J. L. (1996). Full information estimation in the presence of incomplete data. In G. A. Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling: Issues and techniques (pp. 243–277). Mahwah, NJ: Erlbaum.

Arbuckle, J. L. (2007). Amos 7.0 [Computer software]. Chicago, IL: Smallwaters.

Baraldi, A. N., & Enders, C. K. (2009). An introduction to modern missing data analysis. Journal of School Psychology, 48, 5–37. doi:10.1016/j.jsp.2009.10.001

Bentler, P. M., & Chou, C. P. (1987). Practical issues in structural modeling. Sociological Methods & Research, 16, 78–117. doi:10.1177/0049124187016001004

Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–466. doi:10.1037/0033-2909.105.3.456

Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York, NY: Cambridge University Press. doi:10.1017/CBO9780511571312

Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255. doi:10.1207/S15328007SEM0902_5

Cizek, G. J. (2003). [Review of the Woodcock-Johnson III]. In B. S. Plake, J. C. Impara, & R. A. Spies (Eds.), The fifteenth mental measurements yearbook (pp. 1020–1024). Lincoln, NE: Buros Institute of Mental Measurements.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. doi:10.1037/h0040957

Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests. Journal of Applied Psychology, 72, 19–29. doi:10.1037/0021-9010.72.1.19

Fan, X., & Sivo, S. (2007). Sensitivity of fit indices to model misspecification and model types. Multivariate Behavioral Research, 42, 509–529. doi:10.1080/00273170701382864

Flanagan, D. P., Fiorello, C. A., & Ortiz, S. O. (2010). Enhancing practice through application of Cattell-Horn-Carroll theory and research: A "third method" approach to specific learning disability identification. Psychology in the Schools, 47, 739–760.

Fleishman, J., & Benson, J. (1987). Using LISREL to evaluate measurement models and scale reliability. Educational and Psychological Measurement, 47, 925–939. doi:10.1177/0013164487474008

Floyd, R. G., Shaver, R. B., & McGrew, K. S. (2003). Interpretation of the Woodcock-Johnson Tests of Cognitive Abilities: Acting on evidence. In F. A. Schrank & D. P. Flanagan (Eds.), WJ III clinical use and interpretation (pp. 1–46). New York, NY: Academic Press. doi:10.1016/B978-012628982-4/50002-7

Hayduk, L. A. (1989). Structural equation modeling: Essentials and advances. Baltimore, MD: The Johns Hopkins University Press.

Horn, J. L., & Noll, J. (1997). Human cognitive capabilities: Gf-Gc theory. In D. P. Flanagan, J. L. Genshaft, & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (pp. 53–91). New York, NY: Guilford Press.

Individuals with Disabilities Education Improvement Act of 2004, Pub. L. No. 108-446, 118 Stat. 2647; 2004 Enacted H. R. 1350; 108 Enacted H. R. 1350. Final regulations implementing IDEA 2004 were published in the Federal Register, Monday, August 14, 2006, pp. 46540–46845.

Johnson, E. S., Humphrey, M., Mellard, D. F., Woods, K., & Swanson, H. (2010). Cognitive processing deficits and students with specific learning disabilities: A selective meta-analysis of the literature. Learning Disability Quarterly, 33, 3–18.

Keith, T. Z. (2005). Using confirmatory factor analysis to aid in understanding the constructs measured by intelligence tests. In D. P. Flanagan & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (2nd ed., pp. 581–614). New York, NY: Guilford Press.

Keith, T. Z., & Reynolds, M. R. (2010). Cattell-Horn-Carroll abilities and cognitive tests: What we've learned from 20 years of research. Psychology in the Schools, 47, 635–650.

Kline, R. B. (2010). Principles and practice of structural equation modeling (3rd ed.). New York, NY: The Guilford Press.

Mandrekar, J. N. (2010). Receiver operating characteristic curve in diagnostic test assessment. Journal of Thoracic Oncology, 5, 1315–1316. doi:10.1097/JTO.0b013e3181ec173d

Mather, N., & Gregg, N. (2006). Specific learning disabilities: Clarifying, not eliminating, a construct. Professional Psychology: Research and Practice, 37, 99–106. doi:10.1037/0735-7028.37.1.99

McFall, R. M., & Treat, T. A. (1999). Quantifying the information value of clinical assessments with signal detection theory. Annual Review of Psychology, 50, 215–241. doi:10.1146/annurev.psych.50.1.215

McGrew, K. S. (2005). The Cattell-Horn-Carroll theory of cognitive abilities: Past, present, and future. In D. P. Flanagan & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (2nd ed., pp. 136–181). New York, NY: Guilford Press.

McGrew, K. S. (2009). CHC theory and the human cognitive abilities project: Standing on the shoulders of the giants of psychometric intelligence research. Intelligence, 37, 1–10. doi:10.1016/j.intell.2008.08.004

McGrew, K. S., Schrank, F. A., & Woodcock, R. W. (2007). Technical manual. Woodcock-Johnson III Normative Update. Rolling Meadows, IL: Riverside Publishing.

McGrew, K., & Wendling, B. (2010). CHC cognitive–achievement relations: What we have learned from the past 20 years of research. Psychology in the Schools, 47, 651–675.

McGrew, K. S., & Woodcock, R. W. (2001). Technical manual. Woodcock-Johnson III. Rolling Meadows, IL: Riverside Publishing.

Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternative fit indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568–592. doi:10.1037/0021-9010.93.3.568

Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13, 127–143. doi:10.1016/0883-0355(89)90002-5

Meredith, W. (1993). Measurement invariance, factor analysis, and factorial invariance. Psychometrika, 58, 525–543. doi:10.1007/BF02294825

Meredith, W., & Millsap, R. E. (1992). On the misuse of manifest variables in the detection of measurement bias. Psychometrika, 57, 289–311. doi:10.1007/BF02294510

Miles, J., & Shevlin, M. (2007). A time and a place for incremental fit indices. Personality and Individual Differences, 42, 869–874. doi:10.1016/j.paid.2006.09.022

Millsap, R. E. (1998). Group differences in regression intercepts: Implications for factorial invariance. Multivariate Behavioral Research, 33, 403–424. doi:10.1207/s15327906mbr3303_5

Millsap, R. E., & Meredith, W. (2007). Factorial invariance: Historical perspectives. In R. Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and future directions (pp. 131–152). Mahwah, NJ: Erlbaum.

Niileksela, C. R. (2012). Moderation of cognitive-achievement relations for children with specific learning disabilities: A multi-group latent variable analysis using CHC theory (Unpublished doctoral dissertation). Retrieved from http://kuscholarworks.ku.edu/dspace/bitstream/1808/10014/1/Niileksela_ku_0099D_11998_DATA_1.pdf

Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552–566. doi:10.1037/0033-2909.114.3.552

Sandoval, J. (2003). [Review of the Woodcock-Johnson III]. In B. S. Plake, J. C. Impara, & R. A. Spies (Eds.), The fifteenth mental measurements yearbook (pp. 1024–1028). Lincoln, NE: Buros Institute of Mental Measurements.

Schafer, J. L. (1997). Analysis of incomplete multivariate data. London, UK: Chapman & Hall. doi:10.1201/9781439821862

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147–177. doi:10.1037/1082-989X.7.2.147


Steinmetz, H. (2013). Analyzing observed composite differences across groups: Is partial measurement invariance enough? Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 9, 1–12.

Swets, J. A., Dawes, R. M., & Monahan, J. (2000). Psychological science can improve diagnostic decisions. Psychological Science in the Public Interest, 1, 1–26. doi:10.1111/1529-1006.001

Taub, G. E., & McGrew, K. S. (2004). A confirmatory factor analysis of the Cattell-Horn-Carroll theory and the cross-age invariance of the Woodcock-Johnson Tests of Cognitive Abilities III. School Psychology Quarterly, 19, 72–87. doi:10.1521/scpq.19.1.72.29409

U.S. Department of Education. (2004). The Individuals with Disabilities Education Improvement Act of 2004 (IDEIA 2004). Washington, DC: Author.

Wicherts, J. M., & Dolan, C. V. (2010). Measurement invariance in confirmatory factor analysis: An illustration using IQ test performance of minorities. Educational Measurement: Issues and Practice, 29(3), 39–47. doi:10.1111/j.1745-3992.2010.00182.x

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001a). Woodcock-Johnson III. Itasca, IL: Riverside Publishing.

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001b). Woodcock-Johnson III Tests of Achievement. Itasca, IL: Riverside Publishing.

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001c). Woodcock-Johnson III Tests of Cognitive Abilities. Itasca, IL: Riverside Publishing.

World Health Organization. (2001). ICF: International classification of functioning, disability and health. Geneva, Switzerland: Author.

Wu, A. D., Li, Z., & Zumbo, B. D. (2007). Decoding the meaning of factorial invariance and updating the practice of multi-group confirmatory factor analysis: A demonstration with TIMSS data. Practical Assessment, Research and Evaluation, 12, 1–26.

Yuan, K. H. (2005). Fit indices versus test statistics. Multivariate Behavioral Research, 40, 115–148. doi:10.1207/s15327906mbr4001_5

Received October 12, 2012
Revision received April 8, 2013
Accepted April 14, 2013
