
Measurement and Evaluation in Counseling and Development

http://mec.sagepub.com/

Score Reliability: A Retrospective Look Back at 12 Years of Reliability Generalization Studies

Tammi Vacha-Haase and Bruce Thompson

Measurement and Evaluation in Counseling and Development 2011 44: 159

DOI: 10.1177/0748175611409845

The online version of this article can be found at: http://mec.sagepub.com/content/44/3/159

Published by:

http://www.sagepublications.com

On behalf of:

Institution of Mechanical Engineers

Additional services and information for Measurement and Evaluation in Counseling and Development can be found at:

Email Alerts: http://mec.sagepub.com/cgi/alerts

Subscriptions: http://mec.sagepub.com/subscriptions

Reprints: http://www.sagepub.com/journalsReprints.nav

Permissions: http://www.sagepub.com/journalsPermissions.nav

Citations: http://mec.sagepub.com/content/44/3/159.refs.html

Downloaded from mec.sagepub.com at University of Bucharest on March 6, 2014

Version of Record: Jun 2, 2011


Measurement and Evaluation in Counseling and Development

44(3) 159–168

© The Author(s) 2011

Reprints and permission: http://www.sagepub.com/journalsPermissions.nav

DOI: 10.1177/0748175611409845

http://mecd.sagepub.com

Research in Brief

All the statistical analyses (e.g., t tests,

ANOVA, ANCOVA, Pearson r, regression, as

well as T2, MANOVA, MANCOVA, descrip-

tive discriminant analysis, canonical correla-

tion analysis) within the general linear model

(GLM; see Cohen, 1968; Knapp, 1978) are

correlational in that the implicit building block

for these analyses is the computation of the

intervariable correlation or covariance matrix.

Indeed, secondary analyses of previously

published results are easily performed given

access to these matrices, even if the raw data

are unavailable (Zientek & Thompson, 2009).

However, poor score reliability will compro-

mise estimates of both statistical significance

(i.e., p_CALCULATED values) and effect size within

classical GLM analyses, because score reli-

abilities are not considered by the analyses.

Instead, classical GLM analyses assume per-

fect or at least very good score reliabilities.

Score reliability characterizes the degree to

which scores measure something as opposed

to nothing (e.g., are completely random).

Random variations in data, including the ran-

dom variations associated with measure-

ment error, attenuate the relationships among

measured variables. Such attenuation occurs

because correlation coefficients are sensitive

to systematic covariances among measured

variables replicated over study participants and

not random fluctuations.

The fact that poor score reliability compromises

the foundation of commonly applied

statistical analyses suggests the obvious con-

clusion that evaluation of the score reliabili-

ties for the scores in hand ought to be the


¹Colorado State University, Fort Collins, CO, USA
²Texas A&M University, College Station, TX, USA
³Baylor College of Medicine, Houston, TX, USA

Corresponding Author:
Bruce Thompson, Dept. of Educ. Psyc., 4225 TAMU, College Station, TX 77843, USA

Email: [email protected]

Score Reliability: A Retrospective Look Back at 12 Years of Reliability Generalization Studies

Tammi Vacha-Haase¹ and Bruce Thompson²,³

Abstract

The present study was conducted to characterize (a) the features of the thousands of primary reports synthesized in 47 reliability generalization (RG) measurement meta-analysis studies and (b) typical methodological practice within the RG literature to date. With respect to the treatment of score reliability in the literature, in an astounding 54.6% of the 12,994 primary reports authors did not even mention reliability! Furthermore, in 15.7% of the primary reports authors did mention score reliability, but merely inducted previously reported values as if they applied to their data. Clearly, the admonitions of Wilkinson and the APA Task Force (1999) have yet to have their desired impacts with respect to reporting reliability estimates for one's own data.

Keywords

reliability, measurement, psychometrics, reliability generalization, meta-analysis


obligatory first step in any quantitative study,

prior to conducting any substantive analyses.

In the words of the American Psychological

Association (APA) Task Force on Statistical

Inference,

It is important to remember that a test is

not reliable or unreliable. Reliability is

a property of the scores on a test for a

particular population of examinees. . . .

Thus, authors should provide reliability

coefficients of the scores for the data

being analyzed even when the focus

of their research is not psychometric.

(Wilkinson & APA Task Force, 1999,

p. 596)

Given the importance of score reliability in all

quantitative analyses, and the fluctuations in

reliabilities across test administrations, ways

to explore systematically the variabilities in

reliabilities should be of special interest to

researchers.
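As a minimal illustration of the two points made so far (reliability is a property of the scores in hand, and measurement error attenuates observed relationships), the following Python sketch computes coefficient alpha for one simulated data set and applies the classical test theory attenuation relation, observed r = true r × sqrt(r_xx × r_yy). The simulated data, the function names, and the example values (.50 and .70) are illustrative assumptions; they are not values or procedures taken from the studies reviewed here.

```python
import random
import statistics


def cronbach_alpha(item_scores):
    """Coefficient alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    # Variance of each item across respondents.
    item_vars = [statistics.variance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the total (summed) score across respondents.
    total_var = statistics.variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation under classical test theory."""
    return true_r * (rel_x * rel_y) ** 0.5


# Hypothetical data: 200 simulated respondents, 6 items, each item = trait + noise.
rng = random.Random(0)
data = []
for _ in range(200):
    trait = rng.gauss(0, 1)
    data.append([trait + rng.gauss(0, 1) for _ in range(6)])

alpha = cronbach_alpha(data)
print(f"alpha for these simulated scores: {alpha:.2f}")
print(f"expected observed r (true r = .50, both reliabilities = .70): "
      f"{attenuated_r(0.50, 0.70, 0.70):.2f}")
```

Rerunning the simulation with a different seed or sample yields a different alpha for the same six items, which is the sense in which the coefficient describes scores rather than the test.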

Reliability Generalization (RG) Meta-Analysis

Twelve years ago, in a seminal article, Vacha-

Haase (1998) proposed RG as an extension of

another measurement meta-analytic method,

Validity Generalization, which was developed

by Schmidt and Hunter (1977) and Hunter

and Schmidt (1990). Vacha-Haase (1998)

described RG as a method to characterize

empirically: (a) the typical reliability of

scores for a given test across studies, (b) the

amount of variability in reliability coefficients

for given measures, and (c) the sources of

variability in reliability coefficients across

studies (p. 6).
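The first two aims, (a) typical reliability and (b) variability, amount to descriptive statistics computed over the reliability coefficients harvested from primary reports. The sketch below assumes a small set of hypothetical alpha values; the numbers and the function name are illustrative only and do not come from any RG study cited here.

```python
import statistics

# Hypothetical coefficient alphas harvested from primary reports (illustrative values only).
alphas = [0.72, 0.81, 0.85, 0.78, 0.90, 0.66, 0.83]


def summarize_reliabilities(coefficients):
    """Descriptive summary of a set of reliability coefficients."""
    # Quartiles use the default (exclusive) method of statistics.quantiles.
    q1, median, q3 = statistics.quantiles(coefficients, n=4)
    return {
        "n": len(coefficients),
        "mean": statistics.mean(coefficients),
        "sd": statistics.stdev(coefficients),
        "min": min(coefficients),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(coefficients),
    }


summary = summarize_reliabilities(alphas)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}" if isinstance(value, float) else f"{name:>6}: {value}")
```

The five-number part of this summary (min, quartiles, max) is the information a box-and-whisker plot of the coefficients would display.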

Reliability generalization is built on the

recognition that it is incorrect to speak of "the

reliability of the test," or to say that "the test

is reliable" (Thompson, 1994). Reliability

inures as a property to scores, and not to tests

(Thompson & Vacha-Haase, 2000). Thus,

reliability coefficients fluctuate across test

administrations, and these fluctuations are

ripe for meta-analytic investigation.

The dozen years or so since Vacha-Haase's

(1998) conceptualization of RG have

seen both RG-related methodology developments

(e.g., Bonnett, 2010; Rodriguez &

Maeda, 2006) as well as an increasing number

of RG studies being published. Tutorials on

how to do RG studies have been presented

(Henson & Thompson, 2002). And recogni-

tion of RG has been international (Dandan &

Houcan, 2004).

To date, several dozen RG meta-analyses

have been reported across an impressive array

of measures. For example, RG studies have

been conducted on literatures for measures

involving state–trait anxiety (Barnes, Harp,

& Jung, 2002), locus of control (Beretvas,

Suizzo, Durham, & Yarnell, 2008), mathe-

matics anxiety (Capraro, Capraro, & Henson,

2001), psychopathology (Campbell, Pulos,

Hogan, & Murry, 2005), learning styles

(Henson & Hwang, 2002), substance abuse

propensities (Miller, Woodson, Howell, &

Shields, 2009), ways of coping (Rexrode,

Petersen, & O'Toole, 2008), and life satisfaction

(Wallace & Wheeler, 2002).

Purposes of the Present Article

The present article reports a secondary analy-

sis of the 47 RG studies presented in journal

articles during the past 12 years. We identified

these 47 RG studies by searching PsycInfo

and ERIC for any use of the term "reliability

generalization" in the title, abstract, or as a

keyword. The source RG reports are designated

with asterisks in our references. We

conducted our study for two broad purposes.

Quality of the social sciences literature. Our

first purpose was to characterize the quality of

the social sciences literature with respect to

score reliability considerations, as reflected in

the primary reports synthesized in the 47 RG

studies. Similar older analyses provide some

historical context for our more contemporary

report.

In an examination of the American Edu-

cational Research Journal (AERJ), Willson

(1980) reported that only 37% of AERJ articles

explicitly provided reliability coefficients for


the data analyzed in the studies, and he concluded

that "reliability . . . is unreported in . . .

[so much published research] is . . . inexcusable

at this late date" (p. 9). Almost 20 years

later, Vacha-Haase, Ness, Nilsson, and Reetz

(1999) reviewed three journals and found that

only 36% of the quantitative articles pro-

vided reliability coefficients for the data being

analyzed.

One reason for poor treatment of psychometric

issues within the social sciences literature

is that "[a]lthough most programs in

sociobehavioral sciences, especially doctoral

programs, require a modicum of exposure to

statistics and research design, few seem to

require the same where measurement is concerned"

(Pedhazur & Schmelkin, 1991, p. 2).

Unfortunately, doctoral curricula in recent

years have allocated less and less space for

psychometric training (Capraro & Thompson,

2008), so practices may not have improved

since the time of Willson's (1980) report.

Our mega meta-analysis (see Vacha-Haase,

Henson, & Caruso, 2002) of the 47 RG stud-

ies provides a contemporary assessment of

the degree to which authors of primary reports

are attending to score reliability issues. The

RG studies each synthesized an average of

342.0 (SD = 494.8) prior studies. Thus, the

present study characterizes a huge array of

original studies in diverse areas of the social

sciences.

This first research focus included consider-

ation of how often primary researchers ignored

reliability, inducted prior reliability for measures

rather than reporting reliability for their

own scores (see Vacha-Haase, Kogan, &

Thompson, 2000), or reported reliability for

the data actually being analyzed in their sub-

stantive studies. We also sought to character-

ize the typical score reliabilities reported in

primary reports summarized in RG studies and

the variability of these reliabilities.

Typical practice within the RG literature. In

addition to characterizing the quality of the

primary reports synthesized in the 47 RG

studies with respect to score reliability, second,

we also sought to characterize typical meth-

odological practice within the RG literature to

date. For example, we were interested in the

ways that RG researchers identified source

studies, the types of statistical and graphical

analyses reported, the types of predictor variables

used to predict variabilities in score reliabilities,

and which predictors were or were

not generally found to be useful in making

these predictions.

Results

Quality of the Literature With Respect to Score Reliability

Across the 47 studies, on average, literature

searches for instrument uses yielded 814.1

hits (SD = 1195.4). However, many of

these turned out to be theoretical or nonem-

pirical studies or studies in which the target

measure was mentioned but not administered.

On average, each RG study involved 342.0

(SD = 494.9) empirical studies in which the

target measure was administered.

In an astounding 54.6% of the 12,994 pri-

mary reports authors did not even mention

reliability! This is a discouraging finding with

respect to the integrity of such a broad array

of substantive studies, especially because most

of these reports used classical GLM methods.

Although structural equation modeling (SEM)

does estimate measurement error variance as

part of substantive analyses, as noted previ-

ously, classical GLM methods (e.g., ANOVA,

regression, descriptive discriminant analysis)

do not estimate measurement error variances

as part of their substantive analyses (see

Yetkiner & Thompson, in press). Clearly, the

admonitions of Wilkinson and the APA Task

Force (1999) have yet to have their desired

impacts with respect to reporting reliability

estimates for one's own data.

In 15.7% of the 12,994 primary reports,

authors did mention score reliability but

merely inducted previously reported values as

if they applied to their data. When this was

done, in 48.0% of the inductions only the test

manual was referenced as the source of the

induction, whereas in the remaining cases the

manual and/or prior articles were referenced.





RG studies were based on an average of

64.5 (SD = 52.4) primary reports in which

authors reported reliability coefficients for

their own data. In only eight of the RG studies

did the researchers contact primary authors in

an attempt to obtain information missing from

the primary reports. Because some RG studies

involved multiple related measures, multiple

subscale scores from a single measure,

reliability coefficients being reported for

subgroups, or multiple administrations of

measures, a given RG study often involved

multiple reliability coefficients. RG studies

on average involved 240.0 (SD = 755.6) reliability

estimates.

The average of the mean coefficient alpha

values reported across the RG studies was .80

(SD = .09) and ranged from .45 to .95. The

smallest mean alpha reported in the RG stud-

ies was .17, and the largest mean alpha was

.92. However, some of these values were for

subscales on measures rather than for total

scores. And it must be remembered that coef-

ficient alpha and other coefficients, such as

stability reliability coefficients, measure quite

different things and thus tend to vary even for

the same measure (McCrae, Kurtz, Yamagata,

& Terracciano, 2011).

Typical Practice Within the RG Literature

Diverse statistical and graphical methods

were used across the 47 RG meta-analyses.

RG researchers frequently used multiple analyses

to understand and characterize their RG

data. A majority (i.e., 54.2%) of the 47 RG

reports used multiple regression as an analy-

sis, whereas 27.1% of the reports used

ANOVA. Some 6.2% of the RG researchers

used hierarchical linear modeling to honor

the fact that in some studies subscales were

nested within measures or several reliability

coefficients were nested within single primary

reports. Box-and-whisker plots were

used in 35.4% of the 47 RG studies.
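To make the regression step concrete, the following sketch regresses reliability coefficients on a single study feature. It is a simplified, one-predictor stand-in for the multiple regression analyses described above, written with only the Python standard library. The data (score standard deviations and alphas from six imaginary primary reports) and the function name are hypothetical.

```python
import statistics


def simple_regression(x, y):
    """Ordinary least squares with one predictor: returns (slope, intercept, r_squared)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical coded features from six primary reports (illustrative values only).
score_sd = [4.1, 5.0, 5.8, 6.5, 7.2, 8.0]        # SD of total scores in each report
alpha = [0.68, 0.74, 0.79, 0.82, 0.86, 0.88]     # alpha reported for the same scores

slope, intercept, r_squared = simple_regression(score_sd, alpha)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r_squared:.3f}")
```

In an actual RG study the outcome would be regressed on several coded features at once, and nesting of coefficients within reports would argue for the hierarchical models mentioned above rather than this flat fit.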

RG researchers typically investigate which

features of the primary reports may predict

variabilities in score reliabilities. The RG

studies on average investigated 8.5 (SD = 4.0)

predictor variables in these analyses. The most

commonly used predictor variables included

gender (83.3% of the 47 RG studies), sample

size (68.8%), age in years (54.2%), and ethnicity

(52.1%).

The predictors that, when used, tended to

be noteworthy included the number of items

for measures that had forms of different

lengths (31.2%) and the score standard devia-

tion in the primary report (29.2%). Both

these results are psychometrically reasonable.

Scores from measures with more items tend to

be more reliable, especially when more items

result in more dispersed total test scores,

because total test score dispersion drives

score reliability so strongly (Reinhardt, 1996;

Thompson, 2003). Of course, scores from

longer tests do not inherently have higher reliability,

if the test is made longer by adding

items of poor quality or that do not increase

total score dispersion. For example, the Bem

Sex Role Inventory is a published test for

which the reliabilities of short-form scores on the

Femininity scale tend to be higher than those of

their long-form counterparts (Bem, 1981).

Two other predictors also tended to be

noteworthy in predicting variabilities in score

reliabilities. Participant age (22.9%) and par-

ticipant gender (22.9%) were among the better

predictors of variabilities in score reliabilities.
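The relation between test length, total score dispersion, and reliability noted above can be sketched with two textbook formulas: the Spearman-Brown projection, which assumes the added items are parallel to the existing ones, and coefficient alpha written in terms of item variances and total score variance. The function names and all numeric values below are hypothetical and chosen only to show the direction of the effects.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor using parallel items."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def alpha_from_variances(item_variances, total_variance):
    """Coefficient alpha from the item variances and the variance of the total score."""
    k = len(item_variances)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Doubling a test with parallel items raises the projected reliability.
doubled = spearman_brown(0.70, 2)

# Ten items (variance 1 each) whose total scores are widely dispersed.
short_form = alpha_from_variances([1.0] * 10, total_variance=40.0)

# Twenty items, where the ten added items contribute item variance
# but little additional total score dispersion (hypothetical values).
padded_form = alpha_from_variances([1.0] * 20, total_variance=50.0)

print(f"Spearman-Brown projection for a doubled test: {doubled:.3f}")
print(f"alpha, 10 items, total variance 40: {short_form:.3f}")
print(f"alpha, 20 items, total variance 50: {padded_form:.3f}")
```

Under these assumed values the padded 20-item form has a lower alpha than the 10-item form, because the added items do not increase total score dispersion enough to offset their own variance.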

Discussion

RG studies provide some insight about both

the score reliabilities produced by given measures

across samples and typical reliability

reporting practices within the literature. With

respect to the first outcome, authors of RG

studies must work to avoid certain pitfalls

(see Dimitrov, 2002). For example, RG inves-

tigators should take into account use of

(a) different types of reliability estimates

across studies and (b) different test forms,

especially when forms have different numbers

of items.

A particularly difficult challenge for

RG researchers involves the RG modeling

misspecifications that occur when relevant





characteristics of the study samples are not

coded as independent variables in RG analysis

(Dimitrov, 2002, p. 794). These model mis-

specifications may occur because original

reports often do not provide enough detail

about the measurement and sampling designs

being used.

A related problem is that substantive

researchers who report score reliability coef-

ficients for their own data most often report

only Cronbach's alpha, notwithstanding the

limitations of this estimate and the fact that

the measurement model underlying that esti-

mate may not fit many of the situations in

which the estimate is used (see Dimitrov,

2002). Hogan, Benjamin, and Brezinski's

(2000) empirical study of the literature found

that two thirds of the articles they examined

reported alpha, and they also noted that

despite their prominence in the psychometric

literature of the past 20 years, we encountered

no reference to generalizability coefficients . . .

or to the test information functions that arise

from item response theory (p. 528).

These problems limit the potential benefits

of RG studies. Of course, as Thompson and

Vacha-Haase (2000) reminded,

It is important to remember that RG

studies are a meta-analytic character-

ization of what is hoped is a population

of previous reports. We may not like

the ingredients that go into making this

sausage, but the RG chef can only work

with the ingredients provided by the

literature. (p. 184)

Score Reliability Within the

Social Sciences Literature

Our most important finding is that such an

astonishingly large proportion (i.e., a little

more than half) of primary substantive studies

do not even mention score reliability! This is

a discouraging finding with respect to the

integrity of such a broad array of primary

studies, especially because most of these

studies used classical GLM analyses. Clearly,

the admonitions of Wilkinson and the APA

Task Force (1999) have yet to have their

desired impacts.

We believe this disturbing reality is an

artifact of too many applied researchers still

believing that tests qua tests themselves have

the property of reliability. This misconception

may not be conscious, but is all the more per-

nicious when unconscious, because uncon-

scious misperceptions may be less likely to be

reconsidered and corrected. The problem of

sloppy speaking about reliability, in which

tests are described as being reliable,

is not just an issue of sloppy speaking – the

problem is that sometimes we

unconsciously come to think what we

say or what we hear, so that sloppy

speaking does sometimes lead to a

more pernicious outcome, sloppy think-

ing and sloppy practice. (Thompson,

1992, p. 436)

Some textbooks directly confront the mis-

conception that tests are reliable. For exam-

ple, Pedhazur and Schmelkin (1991) noted,

"Statements about the reliability of a measure

are . . . [inherently] inappropriate and potentially

misleading" (p. 82). Similarly, Gronlund

and Linn (1990) emphasized that

reliability refers to the results obtained

with an evaluation instrument and not

to the instrument itself. . . . Thus, it

is more appropriate to speak of the

reliability of the test scores or the

measurement than of the test or

the instrument. (p. 78)

More recently, Urbina (2004) emphasized,

"the fact is that the quality of reliability is one

that, if present, belongs not to test but to test

scores" (p. 119). She perceptively noted that

the distinction between scores versus tests

being reliable is subtle, but noted that

the distinction is fundamental to an

understanding of the implications of the

concept of reliability with regard to the

use of tests and the interpretation of test





scores. If a test is described as reliable,

the implication is that its reliability

has been established permanently, in all

respects for all uses, and with all users.

(p. 120)

Urbina utilizes a piano analogy to illustrate

the fallacy of describing tests as reliable, not-

ing that saying the test is reliable is similar

to stating a piano will always sound the same,

regardless of the type of music played, the

person who is playing it, the type of the piano,

or the surrounding acoustical environment.

Our second major finding is that score reli-

ability on average in the applied research lit-

erature appears to be reasonably sufficient to

support inquiry using classical GLM statis-

tics, given a mean coefficient alpha of .80

(SD = .09) and a range from .45 to .95.

These results suggest a glass-half-full-

half-empty conclusion about the quality of

our literature with respect to score reliability.

Clearly, some substantive studies are being

conducted with scores of questionable reli-

ability. Furthermore, we must wonder what

were the reliabilities of those scores in those

studies in which reliabilities were not reported

for data in hand, or reliability was not even

mentioned!

Typical Practice Within

the RG Literature

Each of the 47 RG studies we investigated

involved a gargantuan investment of researcher

time and effort, as does any meta-analysis,

whether the meta-analysis is substantive or

psychometric in focus. RG researchers are

employing a wide array of statistical analyses,

and one out of three used box-and-whisker

plots to communicate their results, which

is consistent with the recommendation of

Wilkinson and APA Task Force (1999) to use

graphics to communicate multiple features of

data (e.g., central tendency, dispersion, shape,

outliers) in pictures. We also found that RG

researchers are using a wide array of predictor

variables to help understand what design

features may cause reliabilities to fluctuate

across test administrations.

Over time, we expect the quality of RG

studies to improve further once more and

more primary reports include estimates of

score reliabilities as more researchers realize

that tests are not reliable. Indeed, the most

important impact of the creation of RG and

the reporting of RG findings is that these

reports in themselves directly confront chronic

misconceptions that tests are reliable. RG

studies in and of themselves communicate the

important understanding that score reliabili-

ties vary across administrations and are not

secreted into test booklets during the test

printing process.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of

interest with respect to the research, authorship,

and/or publication of this article.

Funding

The author(s) received no financial support for the

research, authorship, and/or publication of this

article.

References

Note: The 47 RG studies included in this study are

marked with asterisks.

*Bachner, Y. G., & O'Rourke, N. (2007). Reliability generalization of responses by care providers to the Zarit Burden Interview. Aging & Mental Health, 11, 678–685. doi:10.1080/13607860701529965

*Barnes, L. L. B., Harp, D., & Jung, W. S. (2002). Reliability generalization of scores on the Spielberger State–Trait Anxiety Inventory. Educational and Psychological Measurement, 62, 603–618. doi:10.1177/0013164402062004005

Bem, S. L. (1981). Bem Sex-Role Inventory: Professional manual. Palo Alto, CA: Consulting Psychologists Press.

*Beretvas, S. N., Meyers, J. L., & Leite, W. L. (2002). A reliability generalization study of the Marlowe–Crowne Social Desirability Scale. Educational and Psychological Measurement, 62, 570–589. doi:10.1177/0013164402062004003





*Beretvas, S. N., Suizzo, M.-A., Durham, J. A., & Yarnell, L. M. (2008). A reliability generalization study of scores on Rotter's and Nowicki–Strickland's locus of control scales. Educational and Psychological Measurement, 68, 97–119. doi:10.1177/0013164407301529

Bonnett, D. G. (2010). Varying coefficient alpha meta-analytic methods for alpha reliability. Psychological Methods, 15, 368–385. doi:10.1037/a0020142

*Campbell, J. S., Pulos, S., Hogan, M., & Murry, F. (2005). Reliability generalization of the Psychopathy Checklist applied in youthful samples. Educational and Psychological Measurement, 65, 639–656. doi:10.1177/0013164405275666

*Capraro, R. M., & Capraro, M. M. (2002). Myers–Briggs Type Indicator score reliability across studies: A meta-analytic reliability generalization study. Educational and Psychological Measurement, 62, 590–602. doi:10.1177/0013164402062004004

*Capraro, M. M., Capraro, R. M., & Henson, R. K. (2001). Measurement error of scores on the Mathematics Anxiety Rating Scale across studies. Educational and Psychological Measurement, 61, 373–386. doi:10.1177/00131640121971266

Capraro, R. M., & Thompson, B. (2008). The educational researcher defined: What will future researchers be trained to do? Journal of Educational Research, 101, 247–253. doi:10.3200/JOER.101.4.247-253

*Caruso, J. C. (2000). Reliability generalization of the NEO Personality Scales. Educational
doi:10.1177/00131640021970484

*Caruso, J. C., & Edwards, S. (2001). Reliability generalization of the Junior Eysenck Personality Questionnaire. Personality and Individual Differences, 31, 173–184. doi:10.1016/S0191-8869(00)00126-4

*Caruso, J. C., Witkiewitz, K., Belcourt-Dittloff, A., & Gottlieb, J. D. (2001). Reliability of scores from the Eysenck Personality Questionnaire: A reliability generalization study. Educational
doi:10.1177/00131640121971437

Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426–433. doi:10.1037/h0026714

Dandan, G., & Houcan, Z. (2004). A redefinition of reliability and the study of reliability generalization. Psychological Science (China), 27, 445–448.

*Deditius-Island, H. K., & Caruso, J. C. (2002). An examination of the reliability of scores from Zuckerman's Sensation Seeking Scales, Form V. Educational and Psychological Measurement, 62, 728–734. doi:10.1177/0013164402062004012

Dimitrov, D. M. (2002). Reliability: Arguments for multiple perspectives and potential problems with generalizability across studies. Edu-
783–801. doi:10.1177/001316402236878

*Dunn, T. W., Smith, T. B., & Montoya, J. A. (2006). Multicultural competency instrumentation: A review and analysis of reliability generalization. Journal of Counseling & Development, 84, 471–482.

*Graham, J. M., & Christiansen, K. (2009). The reliability of romantic love: A reliability generalization meta-analysis. Personal Relationships, 16, 49–66. doi:10.1111/j.1475-6811.2009.01209.x

*Graham, J. M., Liu, Y. J., & Jeziorski, J. L. (2006). The Dyadic Adjustment Scale: A reliability generalization meta-analysis. Journal of Marriage and Family, 68, 701–717. doi:10.1111/j.1741-3737.2006.00284.x

Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York, NY: Macmillan.

*Hanson, W. E., Curry, K. T., & Bandalos, D. L. (2002). Reliability generalization of Working Alliance Inventory scale scores. Educational
doi:10.1177/0013164402062004008

*Hellman, C. M., Fuqua, D. R., & Worley, J. (2006). A reliability generalization study on the Survey of Perceived Organizational Support: The effects of mean age and number of items on score reliability. Educational and Psychologi-
0013164406288158

*Hellman, C. M., Muilenburg-Trevino, E. M., & Worley, J. A. (2008). The belief in a just world: An examination of reliability estimates across three measures. Journal of Personality Assessment, 90, 399–401. doi:10.1080/00223890802108238





*Henson, R. K., & Hwang, D.-Y. (2002). Vari-

ability and prediction of measurement error in

a Kolbs Learning Style Inventory Scores: A

reliability generalization study. Educational

and Psychological Measurement, 62, 712727.doi:10.1177/ 0013164402062004011

*Henson, R. K., Kogan, L. R., & Vacha-Haase, T.

(2001). A reliability generalization study of the

Teacher Efficacy Scale and related instruments.


61, 404420. doi:10.1177/00131640121971284

Henson, R. K., & Thompson, B. (2002). Charac-

terizing measurement error in scores across

studies: Some recommendations for conduct-

ing reliability generalization (RG) studies.

Measurement and Evaluation in Counseling

and Development, 35, 113127.

Hogan, T. P., Benjamin, A., & Brezinski, K. L.

(2000). Reliability methods: A note on the

frequency of use of various types. Educa-


523531.

Hunter, J. E., & Schmidt, F. L. (1990). Methods

of meta-analysis: Correcting error and bias in

research findings. Newbury Park, CA: Sage.

*Huynh, Q.-L., Howell, R. T., & Benet-Martinez, V.(2009). Reliability of bidimensional accul-

turation scores: A meta-analysis. Journal of

Cross-Cultural Psychology, 40, 256274.

doi:10.1177/0022022108328919

*Kieffer, K. M., Cronin, C., & Fister, M. C. (2004). Exploring variability and sources of measurement error in Alcohol Expectancy Questionnaire reliability coefficients: A meta-analytic reliability generalization study. Journal of Studies on Alcohol, 65, 663–671.

*Kieffer, K. M., & Reese, R. J. (2002). A reliability generalization study of the Geriatric Depression Scale. Educational and Psychological Measurement, 62, 969–994. doi:10.1177/0013164402238085

Knapp, T. R. (1978). Canonical correlation analysis: A general parametric significance testing system. Psychological Bulletin, 85, 410–416. doi:10.1037//0033-2909.85.2.410

*Lane, G. G., White, A. E., & Henson, R. K. (2002). Expanding reliability generalization methods with KR-21 estimates: An RG study of the Coopersmith Self-Esteem Inventory. Educational and Psychological Measurement, 62, 685–711. doi:10.1177/0013164402062004010

*Leach, L. F., Henson, R. K., Odom, L. R., & Cagle, L. S. (2006). A reliability generalization study of the Self-Description Questionnaire. Educational and Psychological Measurement, 66, 285–304. doi:10.1177/0013164405284030

*Li, A., & Bagger, J. (2007). The Balanced Inventory of Desirable Responding (BIDR): A reliability generalization study. Educational and Psychological Measurement, 67, 525–544. doi:10.1177/001316440292087

*Lopez-Pina, J. A., Sanchez-Meca, J., & Rosa-Alcazar, A. I. (2009). The Hamilton Rating Scale for Depression: A meta-analytic reliability generalization study. International Journal of Clinical and Health Psychology, 9, 143–159.

McCrae, R. R., Kurtz, J. E., Yamagata, S., & Terracciano, A. (2011). Internal consistency, retest reliability, and their implications for personality scale validity. Personality and Social Psychology Review, 15, 28–50. doi:10.1177/1088868310366253

*Miller, B. K., & Byrne, Z. S. (2009). Perceptions of organizational politics: A demonstration of the reliability generalization technique. Journal of Managerial Issues, 21, 280–300.

*Miller, C. S., Shields, A. L., Campfield, D., Wallace, K. A., & Weiss, R. D. (2007). Substance use scales of the Minnesota Multiphasic Personality Inventory: An exploration of score reliability via meta-analysis. Educational and Psychological Measurement. doi:10.1177/0013164406299130

*Miller, C. S., Woodson, J., Howell, R. T., & Shields, A. L. (2009). SASSI: Assessing the reliability of scores produced by the Substance Abuse Subtle Screening Inventory. Substance Use & Misuse, 44, 1090–1100.

*Mji, A., & Alkhateeb, H. M. (2005). Combining reliability coefficients: Toward reliability generalization of the Conceptions of Mathematics Questionnaire. Psychological Reports, 96, 627–634. doi:10.2466/pr0.96.3.627-634

*Nilsson, J. E., Schmidt, C. K., & Meek, W. D. (2002). Reliability generalization: An examination of the Career Decision-Making Self-Efficacy Scale. Educational and Psychological Measurement, 62, 647–658. doi:10.1177/0013164402062004007

*O'Rourke, N. (2004). Reliability generalization of responses by care providers to the Center for Epidemiologic Studies-Depression Scale. Educational and Psychological Measurement, 64, 973–990. doi:10.1177/0013164404268668

Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Erlbaum.

*Reese, R. J., Kieffer, K. M., & Briggs, B. K. (2002). A reliability generalization study of select measures of adult attachment style. Educational and Psychological Measurement, 62, 619–646. doi:10.1177/0013164402062004006

Reinhardt, B. (1996). Factors affecting coefficient alpha: A mini Monte Carlo study. In B. Thompson (Ed.), Advances in social science methodology (Vol. 4, pp. 3–20). Greenwich, CT: JAI Press.

*Rexrode, K. R., Petersen, S., & O'Toole, S. (2008). The Ways of Coping Scale: A reliability generalization study. Educational and Psychological Measurement. doi:10.1177/0013164407310128

Rodriguez, M. C., & Maeda, Y. (2006). Meta-analysis of coefficient alpha. Psychological Methods, 11, 306–322. doi:10.1037/1082-989X.11.3.306

*Ross, M. E., Blackburn, M., & Forbes, S. (2005). Reliability generalization of the Patterns of Adaptive Learning Survey Goal Orientation Scales. Educational and Psychological Measurement, 65, 451–464. doi:10.1177/0013164404272496

*Rouse, S. V. (2007). Using reliability generalization methods to explore measurement error: An illustration using the MMPI-2 PSY-5 Scales. Journal of Personality Assessment, 88, 264–275.

*Ryngala, D. J., Shields, A. L., & Caruso, J. C. (2005). Reliability generalization of the Revised Children's Manifest Anxiety Scale. Educational and Psychological Measurement, 65, 259–271. doi:10.1177/0013164404272495

Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529–540. doi:10.1037//0021-9010.62.5.529

*Shields, A. L., & Caruso, J. C. (2003). Reliability generalization of the Alcohol Use Disorders Identification Test. Educational and Psychological Measurement. doi:10.1177/0013164403063003004

*Shields, A. L., & Caruso, J. C. (2004). A reliability induction and reliability generalization study of the CAGE Questionnaire. Educational and Psychological Measurement. doi:10.1177/0013164403261814

Thompson, B. (1992). Two and one-half decades of leadership in measurement and evaluation. Journal of Counseling and Development, 70, 434–438.

Thompson, B. (1994). Guidelines for authors. Educational and Psychological Measurement, 54, 837–847.

Thompson, B. (Ed.). (2003). Score reliability: Contemporary thinking on reliability issues. Thousand Oaks, CA: Sage.

*Thompson, B., & Cook, C. (2002). Stability of the reliability of LibQUAL+™ scores: A reliability generalization meta-analysis study. Educational and Psychological Measurement, 62, 735–743. doi:10.1177/0013164402062004013

Thompson, B., & Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable. Educational and Psychological Measurement, 60, 174–195. doi:10.1177/00131640021970448

Urbina, S. (2004). Essentials of psychological testing. Hoboken, NJ: John Wiley.

Vacha-Haase, T. (1998). Reliability generalization: Exploring variance in measurement error affecting score reliability across studies. Educational and Psychological Measurement, 58, 6–20. doi:10.1177/00131640121971059

Vacha-Haase, T., Henson, R. K., & Caruso, J. C. (2002). Reliability generalization: Moving toward improved understanding and use of score reliability. Educational and Psychological Measurement. doi:10.1177/0013164402062004002

*Vacha-Haase, T., Kogan, L. R., Tani, C. R., & Woodall, R. A. (2001). Reliability generalization: Exploring variation of reliability coefficients of MMPI clinical scales scores. Educational and Psychological Measurement, 61, 45–59. doi:10.1177/00131640121971059



Vacha-Haase, T., Kogan, L. R., & Thompson, B. (2000). Sample compositions and variabilities in published studies versus those in test manuals: Validity of score reliability inductions. Educational and Psychological Measurement, 60, 509–522. doi:10.1177/00131640021970682

Vacha-Haase, T., Ness, C. M., Nilsson, J., & Reetz, D. (1999). Practices regarding reporting of reliability coefficients: A review of three journals. Journal of Experimental Education, 67, 335–341. doi:10.1080/00220979909598487

*Vacha-Haase, T., Tani, C. R., Kogan, L. R., Woodall, R. A., & Thompson, B. (2001). Reliability generalization: Exploring reliability variations on MMPI/MMPI-2 validity scale scores. Assessment, 8, 391–401. doi:10.1177/107319110100800404

*Vassar, M., & Crosby, J. W. (2008). A reliability generalization study of coefficient alpha for the UCLA Loneliness Scale. Journal of Personality Assessment, 90, 601–607. doi:10.1080/00223890802388624

*Victorson, D., Barocas, J., Song, J., & Cella, D. (2008). Reliability across studies from the Functional Assessment of Cancer Therapy-General (FACT-G) and its subscales: A reliability generalization. Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care & Rehabilitation, 17, 1137–1146. doi:10.1007/s11136-008-9398-2

*Wallace, K. A., & Wheeler, A. J. (2002). Reliability generalization of the Life Satisfaction Index. Educational and Psychological Measurement, 62, 674–684. doi:10.1177/0013164402062004009

Wilkinson, L., & American Psychological Association (APA) Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604. doi:10.1037//0003-066X.54.8.594

Willson, V. L. (1980). Research techniques in AERJ articles: 1969 to 1978. Educational Researcher, 9(6), 5–10. doi:10.2307/1175221

Yetkiner, Z. E., & Thompson, B. (in press). Demonstration of how score reliability is integrated into SEM and how reliability affects all statistical analyses. Multiple Linear Regression Viewpoints, 36(2).

*Yin, P., & Fan, X. (2000). Assessing the reliability of Beck Depression Inventory scores: Reliability generalization across studies. Educational and Psychological Measurement, 60, 201–223. doi:10.1177/00131640021970466

*Youngstrom, E. A., & Green, K. W. (2003). Reliability generalization of self-report of emotions when using the Differential Emotions Scale. Educational and Psychological Measurement, 63, 279–295. doi:10.1177/0013164403253226

*Zangaro, G. A., & Soeken, K. L. (2005). Meta-analysis of the reliability and validity of Part B of the Index of Work Satisfaction across studies. Journal of Nursing Measurement, 13, 7–22. doi:10.1891/jnum.2005.13.1.7

Zientek, L. R., & Thompson, B. (2009). Matrix summaries improve research reports: Secondary analyses using published literature. Educational Researcher, 38, 343–352. doi:10.3102/0013189X09339056

Bios

Tammi Vacha-Haase is a professor of psychology at Colorado State University.

Bruce Thompson is a distinguished professor of educational psychology, and of library science, at Texas A&M University, and adjunct professor of allied health sciences, Baylor College of Medicine (Houston).
