

JOURNAL OF APPLIED MEASUREMENT, 4(3), 222–233

Copyright © 2003

Reliability and True-Score Measures of Binary Items as a Function of Their Rasch Difficulty Parameter

Dimiter M. Dimitrov
George Mason University

This article provides formulas for expected true-score measures and reliability of binary items as a function of their Rasch difficulty when the trait (ability) distribution is normal or logistic. The proposed formulas have theoretical value and can be useful in test development, score analysis, and simulation studies. Once the items are calibrated with the dichotomous Rasch model, one can estimate (without further data collection) the expected values for true-score measures (e.g., domain score, true score variance, and error variance for the number-right score) and reliability for both norm-referenced and criterion-referenced interpretations. Thus, given a bank of Rasch calibrated items, one can develop a test with desirable values of population true-score measures and reliability or compare such measures for subsets of items that are grouped by substantive characteristics (e.g., content areas or strands of learning outcomes). An illustrative example for using the proposed formulas is also provided.

Requests for reprints should be sent to Dimiter M. Dimitrov, Graduate School of Education, MNN4B3, 4400 University Dr., George Mason University, Fairfax, VA 22030, e-mail: [email protected].


Despite some disadvantages of the True-Score Model (TSM) in metric development and accuracy of measurement compared to item response theory models (e.g., Hambleton and Jones, 1993) and Rasch models (e.g., Linacre, 1997; Smith, 2000, 2001), the TSM has been and is still used in test development and test score analysis. Recent debates and editorial policies on issues of reliability (e.g., Dimitrov, 2002; Sawilowsky, 2000; Thompson and Vacha-Haase, 2000) indicate the necessity of adequate understanding and estimation of TSM reliability and standard error of measurement at sample and population level. In this article, population (expected) values of true-score measures and reliability for binary items are determined from Rasch measurement information. Specifically, the TSM expected domain score, error variance, true score variance, and reliability (for norm-referenced and criterion-referenced interpretations) are presented as a function of the Rasch item difficulty parameter. The proposed formulas have theoretical value and can be very useful in test development, score analysis, and simulation studies. For example, given a bank of binary items calibrated with the dichotomous Rasch model (Rasch, 1960), one can select items with known true-score measures and reliability prior to administering the test.

It is important to note that the information provided with the proposed formulas and the information obtained through Rasch analysis can complement (not replace or exclude) each other in measurement analysis. For example, the TSM reliability evaluated with the method developed in this article provides more information about the accuracy of measurement at population level relative to classical coefficients such as Cronbach's alpha (Cronbach, 1951), but it cannot replace the information provided by Rasch reliability measures for locating persons on the underlying trait (e.g., Linacre, 1996, 1997; Wright, 2001). Other comments on this issue that follow later in the text also support the argument that researchers and practitioners will benefit from combining Rasch measurement information with TSM information about population measures provided by formulas developed in this article.

Theoretical Framework

With the dichotomous Rasch model, the probability for a correct answer on item i with difficulty δ_i for a person with a trait score θ is

P_i(θ) = exp(θ - δ_i) / [1 + exp(θ - δ_i)].   (1)

As P_i(θ) is also the true score on item i for a person at θ, the expected item mean for the person population is

π_i = ∫_{-∞}^{∞} P_i(θ) ϕ(θ) dθ,   (2)

where ϕ(θ) is the probability density function (pdf) for the population trait distribution. The expected number-right score for a test of n binary items is then

µ = Σ_{i=1}^{n} π_i   (3)

and the expected domain score is π = µ/n (in terms of percentages: π = 100µ/n).

Also, P_i(θ)[1 - P_i(θ)] is the error variance for a binary item i at θ (Lord, 1980, p. 52). Therefore, the expected item error variance for the person population is

σ²(e_i) = ∫_{-∞}^{∞} P_i(θ)[1 - P_i(θ)] ϕ(θ) dθ.   (4)

The expected error variance for the number-right score on a test of n dichotomous items is then

σ_e² = Σ_{i=1}^{n} σ²(e_i).   (5)
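
For a concrete sense of Equations 2 through 5, both integrals can be evaluated by numerical quadrature for any trait density. A minimal Python sketch under that assumption (NumPy and SciPy are assumed to be available; the function names are illustrative, not part of the article):

import numpy as np
from scipy import integrate, stats
from scipy.special import expit

def item_mean(delta, pdf=stats.norm.pdf):
    # Equation 2: expected item mean; expit(x) = 1/(1 + exp(-x)) is the Rasch P_i(theta) of Equation 1
    f = lambda th: expit(th - delta) * pdf(th)
    return integrate.quad(f, -np.inf, np.inf)[0]

def item_error_variance(delta, pdf=stats.norm.pdf):
    # Equation 4: P_i(theta)[1 - P_i(theta)] averaged over the trait pdf
    f = lambda th: expit(th - delta) * (1.0 - expit(th - delta)) * pdf(th)
    return integrate.quad(f, -np.inf, np.inf)[0]

deltas = [-1.0, 0.0, 1.5]                      # hypothetical Rasch difficulties
pi_i = [item_mean(d) for d in deltas]
var_e = [item_error_variance(d) for d in deltas]
print(pi_i, sum(pi_i))                         # item means and the expected number-right score (Equation 3)
print(var_e, sum(var_e))                       # item error variances and sigma_e^2 (Equation 5)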

It is important to emphasize that σ_e² represents the accuracy of number-right scores and is not to be confused with the mean square measurement error (MSE_p) that represents the accuracy of trait scores on the logit scale with Rasch measurement models (e.g., Smith, 2001). Also, while the MSE_p is a sample statistic that requires information about the person's trait score, θ, σ_e² does not require such information because it is obtained through integration over the trait interval.

Closed form integral evaluations for π_i and σ²(e_i) in Equations 2 and 4, respectively, are provided in the next section. The population distribution for the underlying trait is assumed to be normal or logistic. The pdf of a logistic distribution (e.g., Evans, Hastings, and Peacock, 1993, p. 98) with the location at the origin of the scale is

ϕ(θ) = exp(θ/c) / {c[1 + exp(θ/c)]²},   (6)

where c is the scale parameter. This article deals with two specific logistic distributions (c = 1 or c = 1/2) that yield exact integral evaluations for Equations 2 and 4 and capture normal-like ability shapes that may occur in practice with Rasch measurement (see Figure 1).

Formulae Development

Expected Domain Score with Normal Ability Distribution

With P_i(θ) for the dichotomous Rasch model and ϕ(θ) with N(0,1), an exact closed form evaluation for the integral in Equation 2 does not exist. Therefore, an approximation formula was developed in two steps. First, using the computer program MATLAB (MathWorks, Inc., 1999), quadrature method evaluations were obtained for values of the Rasch item difficulty, δ_i, in the interval from -6 to 6 with an increment of 0.01 on the logit scale. Second, the results were tabulated and then approximated with the four-parameter sigmoid function using the regression wizard of the computer program SigmaPlot 5.0 (SPSS Inc., 1998). The resulting approximation formula (with an absolute error smaller than 0.02) for the expected item mean is

π_i = -0.0114 + 1.0228 / [1 + exp(δ_i / 1.226)].   (7)

Formula 7 can be used with any normal trait distribution, N(µ_θ, σ_θ), after transforming the item difficulty estimate: δ_i* = (δ_i - µ_θ)/σ_θ (e.g., Smith, 2000). For n binary items, the expected number-right score, µ, is obtained with Equation 3; (the expected domain score is π = µ/n).
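
As a quick check on Formula 7, the approximation can be compared against direct quadrature of Equation 2 under N(0,1). A minimal Python sketch (assuming SciPy; the function names are illustrative, not from the article):

import numpy as np
from scipy import integrate, stats

def pi_exact(delta):
    # Equation 2 with the standard normal pdf, by numerical quadrature
    f = lambda th: stats.norm.pdf(th) / (1.0 + np.exp(-(th - delta)))
    return integrate.quad(f, -np.inf, np.inf)[0]

def pi_approx(delta):
    # Formula 7: four-parameter sigmoid approximation (absolute error < 0.02)
    return -0.0114 + 1.0228 / (1.0 + np.exp(delta / 1.226))

for d in (-2.2, -0.25, 0.0, 1.33, 2.5):        # difficulties taken from Table 1
    print(d, round(pi_exact(d), 4), round(pi_approx(d), 4))

For difficulty estimates from a non-standard normal trait distribution, δ_i would first be transformed to δ_i* = (δ_i - µ_θ)/σ_θ, as noted above.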

Expected Domain Score with Logistic Ability Distribution

With c = 1, Equation 2 [with P_i(θ) from Equation 1 and ϕ(θ) from Equation 6] becomes

π_i = ∫_{-∞}^{∞} exp(θ - δ_i) exp(θ) / {[1 + exp(θ - δ_i)][1 + exp(θ)]²} dθ.   (8)

With the substitution t = exp(θ), the integral evaluation in Equation 8 becomes straightforward and (with simple algebra) leads to an exact formula for the expected mean on individual items:

π_i = [(δ_i - 1) exp(δ_i) + 1] / [exp(δ_i) - 1]².   (9)

With c = 1/2, Equation 2 becomes

π_i = ∫_{-∞}^{∞} 2 exp(θ - δ_i) exp(2θ) / {[1 + exp(θ - δ_i)][1 + exp(2θ)]²} dθ.   (10)

Again, using the substitution t = exp(θ), a straightforward integration leads to an exact formula:

π_i = {π exp(δ_i)[exp(2δ_i) - 1] - 2(2δ_i - 1) exp(2δ_i) + 2} / {2[1 + exp(2δ_i)]²},   (11)

where the constant π (≈3.1416) is not to be confused with the notation for the domain score.
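
The two exact results can be verified against numerical quadrature of Equation 2 with the corresponding logistic pdf. A small Python sketch under that assumption (SciPy is used for the quadrature and for the logistic pdf; the function names are illustrative):

import numpy as np
from scipy import integrate, stats
from scipy.special import expit

def pi_logistic_quad(delta, c):
    # Equation 2 with the logistic pdf of Equation 6 (scale parameter c), by quadrature
    f = lambda th: expit(th - delta) * stats.logistic.pdf(th, loc=0.0, scale=c)
    return integrate.quad(f, -np.inf, np.inf)[0]

def pi_c1(delta):
    # Formula 9 (c = 1); at delta = 0 the ratio is 0/0 and the limiting value 0.5 applies
    if abs(delta) < 1e-8:
        return 0.5
    E = np.exp(delta)
    return ((delta - 1.0) * E + 1.0) / (E - 1.0) ** 2

def pi_c05(delta):
    # Formula 11 (c = 1/2); np.pi is the mathematical constant, not the domain score
    E2 = np.exp(2.0 * delta)
    num = np.pi * np.exp(delta) * (E2 - 1.0) - 2.0 * (2.0 * delta - 1.0) * E2 + 2.0
    return num / (2.0 * (1.0 + E2) ** 2)

for d in (-1.5, 0.0, 0.85, 2.2):
    print(d, round(pi_logistic_quad(d, 1.0), 4), round(pi_c1(d), 4),
          round(pi_logistic_quad(d, 0.5), 4), round(pi_c05(d), 4))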

Expected Error Variance with Normal Trait Distribution

With ϕ(θ) for the standard normal pdf, Equation 4 can be written

σ²(e_i) = (1/√(2π)) ∫_{-∞}^{∞} exp(θ - δ_i) / [1 + exp(θ - δ_i)]² · exp(-0.5θ²) dθ.   (12)

Figure 1. Probability density functions (PDF) of the standard normal distribution and two logistic distributions with scale parameters c = 1 and c = 1/2.

As an exact closed form evaluation for the integral in Equation 12 does not exist, an approximation was developed using the technique described with the development of Formula 7. The resulting approximation formula for the error variance of individual items is

σ²(e_i) = A + B exp[-0.5(δ_i/C)²],   (13)

where A = 0.011, B = 0.195, and C = 1.797, if |δ_i| < 4, or A = 0.0023, B = 0.171, and C = 2.023, if |δ_i| ≥ 4. As Formula 13 shows, σ²(e_i) is an even function of the item difficulty, i.e., the value of σ²(e_i) is the same for δ_i and -δ_i. Depending on the value of δ_i, the absolute error of approximation with Formula 13 ranges from 0 to 0.0008, with a mean of 0.0002 and a standard deviation of 0.0002. Also, the errors vary in sign, thus canceling out to a large degree when the estimates of σ²(e_i) with Formula 13 are summed to obtain the error variance for the number-right score, σ_e² (Equation 5).
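
A direct transcription of Formula 13, mirroring the SPSS code in Appendix B, might look as follows in Python (the function name is illustrative):

import numpy as np

def var_e_normal(delta):
    # Formula 13: approximate expected item error variance when theta ~ N(0,1)
    if abs(delta) < 4.0:
        a, b, c = 0.011, 0.195, 1.797
    else:
        a, b, c = 0.0023, 0.171, 2.023
    return a + b * np.exp(-0.5 * (delta / c) ** 2)

print(var_e_normal(0.0))
print(var_e_normal(-2.2), var_e_normal(2.2))   # same value for delta and -delta (even function)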

Expected Error Variance with Logistic Ability Distribution

This section provides exact formulas for σ²(e_i) with the fixed logistic distributions used in this article (c = 1 and c = 1/2). The mathematical derivations (provided in Appendix A) lead to the following exact evaluations of the expected item error variance, where E_i = exp(δ_i):

1. With c = 1,

σ²(e_i) = E_i(δ_iE_i - 2E_i + δ_i + 2) / (E_i - 1)³.   (14)

For δ_i = 0, one should use σ²(e_i) = 0.1667 (the limit evaluation with δ_i → 0) to avoid "division by zero" with Formula 14 (see Appendix A).

2. With c = 1/2,

σ²(e_i) = [πE_i(E_i⁴ - 6E_i² + 1) + 8(1 - E_i²)δ_iE_i² + 8(1 + E_i²)E_i²] / [2(1 + E_i²)³].   (15)

The sum of σ²(e_i) for the test items is the expected error variance for the number-right score, σ_e².
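
Both exact results are easy to verify numerically. A minimal Python sketch (SciPy assumed; function names illustrative) evaluates Formulas 14 and 15 and compares them with quadrature of Equation 4 under the corresponding logistic pdf; at δ_i = 0 the c = 1 case reproduces the limiting value 0.1667 noted above.

import numpy as np
from scipy import integrate, stats
from scipy.special import expit

def var_e_quad(delta, c):
    # Equation 4 with the logistic pdf of Equation 6, by numerical quadrature
    f = lambda th: expit(th - delta) * (1.0 - expit(th - delta)) * stats.logistic.pdf(th, scale=c)
    return integrate.quad(f, -np.inf, np.inf)[0]

def var_e_c1(delta):
    # Formula 14 (c = 1); the limit value 0.1667 applies at delta = 0
    if abs(delta) < 1e-8:
        return 1.0 / 6.0
    E = np.exp(delta)
    return E * (delta * E - 2.0 * E + delta + 2.0) / (E - 1.0) ** 3

def var_e_c05(delta):
    # Formula 15 (c = 1/2), with E_i = exp(delta_i)
    E = np.exp(delta)
    num = (np.pi * E * (E**4 - 6.0 * E**2 + 1.0)
           + 8.0 * (1.0 - E**2) * delta * E**2
           + 8.0 * (1.0 + E**2) * E**2)
    return num / (2.0 * (1.0 + E**2) ** 3)

for d in (0.0, 1.0, -2.5):
    print(d, round(var_e_quad(d, 1.0), 4), round(var_e_c1(d), 4),
          round(var_e_quad(d, 0.5), 4), round(var_e_c05(d), 4))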

Expected True Score Variance

Let σ²(τ_i) be the variance of the true score on item i at θ, P_i(θ), as θ varies from -∞ to ∞. This item true variance relates to the item mean, π_i, and item error variance, σ²(e_i), as follows:

σ²(τ_i) = π_i(1 - π_i) - σ²(e_i).   (16)

Proof: Using the expectation rule VAR(X) = E(X²) - [E(X)]² with X = P_i(θ), we have

σ²(τ_i) = ∫ [P_i(θ)]² ϕ(θ) dθ - [∫ P_i(θ) ϕ(θ) dθ]²
        = ∫ {P_i(θ) - P_i(θ)[1 - P_i(θ)]} ϕ(θ) dθ - π_i²
        = ∫ P_i(θ) ϕ(θ) dθ - ∫ P_i(θ)[1 - P_i(θ)] ϕ(θ) dθ - π_i²
        = π_i - σ²(e_i) - π_i² = π_i(1 - π_i) - σ²(e_i),

where all integrals are taken from -∞ to ∞.

At test level, the true score variance for the number-right score, σ_τ², is

σ_τ² = Σ_{i=1}^{n} Σ_{j=1}^{n} √{[π_i(1 - π_i) - σ²(e_i)][π_j(1 - π_j) - σ²(e_j)]}.   (17)

Proof: With unidimensional tests, there is a perfect correlation between the congeneric true scores on two items, say τ_i and τ_j, because of the linear relationship τ_i = a_ij + b_ij τ_j, where b_ij ≠ 0 (e.g., Jöreskog, 1971). The covariance of τ_i and τ_j, then, is σ(τ_i, τ_j) = σ(τ_i)σ(τ_j). Therefore, for the variance of the true number-right score on an n-item test, τ (= Στ_i), we have

σ_τ² = Σ_{i=1}^{n} Σ_{j=1}^{n} σ(τ_i, τ_j) = Σ_{i=1}^{n} Σ_{j=1}^{n} σ(τ_i)σ(τ_j).   (18)

Equation 18 leads directly to Formula 17 by replacing σ(τ_i) and σ(τ_j) with their expressions from Equation 16. It should be noted also that Formulas 16 and 17 hold for any trait distribution as their derivations remain the same with any ϕ(θ).
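
Formulas 16 and 17 simply combine the quantities computed earlier. A compact Python sketch for the normal trait case (illustrative names; it reuses Formulas 7 and 13 from above):

import numpy as np

def pi_normal(delta):
    # Formula 7
    return -0.0114 + 1.0228 / (1.0 + np.exp(delta / 1.226))

def var_e_normal(delta):
    # Formula 13
    a, b, c = (0.011, 0.195, 1.797) if abs(delta) < 4.0 else (0.0023, 0.171, 2.023)
    return a + b * np.exp(-0.5 * (delta / c) ** 2)

def var_tau_item(delta):
    # Formula 16: expected item true-score variance
    p = pi_normal(delta)
    return p * (1.0 - p) - var_e_normal(delta)

def var_tau_test(deltas):
    # Formula 17: true score variance of the number-right score,
    # a double sum of sigma(tau_i) * sigma(tau_j) over all item pairs
    s = np.sqrt([max(var_tau_item(d), 0.0) for d in deltas])
    return float(np.sum(np.outer(s, s)))

deltas = [-1.25, 0.0, 0.45, 1.97]              # a few difficulties from Table 1
print([round(var_tau_item(d), 4) for d in deltas])
print(round(var_tau_test(deltas), 4))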

Reliability

Under TSM, the reliability of measurement is defined as the ratio of true score variance to observed score variance

ρ_xx = σ_τ²/σ_x² = σ_τ²/(σ_τ² + σ_e²).   (19)

For internal consistency evaluations, ρ_xx is typically estimated by Cronbach's coefficient alpha or by the KR-20 coefficient for dichotomously scored items (Kuder and Richardson, 1937). However, even at population level, Cronbach's alpha (or KR-20) is an accurate estimate of ρ_xx only if there is no correlation among errors and the test components are at least essentially tau-equivalent (Novick and Lewis, 1967). For Rasch calibrated items, one can determine ρ_xx from Equation 19 by replacing σ_τ² and σ_e² with their population estimates using formulas developed in the previous sections. This approach, unlike Cronbach's alpha, does not require essential tau-equivalency (the weaker assumption of congeneric measures is sufficient), thus eliminating factors that may negatively affect the population estimate of ρ_xx. As a reminder, essentially tau-equivalent items are assumed to have equal true-score variances, whereas congeneric measures may have different scale origins and may vary in precision (Jöreskog, 1971). Previous research addresses differences between some empirical estimates of ρ_xx and the Rasch person separation reliability, R_R (e.g., Clauser, 1999; Linacre, 1996, 1997). Both ρ_xx and R_R represent the ratio of "true variance to observed variance," but with ρ_xx the variances are for raw scores, whereas with R_R they are for trait scores (logits). Linacre (1996) reports that the true-score reliability (KR-20 or Cronbach's alpha) is generally higher than R_R, whereas the statistical Rasch validity exceeds its true-score counterpart. Also, the raw-score standard errors of extreme scores are close to zero, whereas extreme scores are usually excluded in Rasch analysis because their measure standard errors on the logit scale are infinite (e.g., Clauser, 1999).

Criterion-Referenced Dependability

Brennan and Kane (1977) introduced a dependability index, Φ(λ), for criterion-referenced interpretations in the framework of generalizability theory (GT; e.g., Brennan, 1983):

Φ(λ) = [σ²(p) + (π - λ)²] / [σ²(p) + (π - λ)² + σ²(∆)],   (20)

where σ²(p) is the universe-score variance for persons, σ²(∆) is the absolute error variance, π is the domain score, and λ is the cutting score; (all scores are in proportion of items correct). In the context of the GT design "person x items," σ²(∆) = σ²(pi,e)/n + σ²(i)/n, where n is the number of items (e.g., Shavelson and Webb, 1991, p. 86). When λ = π, the index Φ(λ) reaches its lower limit, referred to as index Φ in GT. Feldt and Brennan (1993) noted that "the index Φ(λ) characterizes the dependability of decisions based on the testing procedure, whereas the index Φ characterizes the contribution of the testing procedure to the dependability of such decisions" (p. 141).

Taking into account that σ_τ² is the true variance for the person's number-right score (see Formula 17), whereas σ²(p) in Formula 20 is the true variance of the person's proportion of items correct, we have σ²(p) = σ_τ²/n². On the other side, σ²(i) = σ²(π_i) because they both represent the variance of the expected item mean, π_i, across n items. Also, taking into account that σ_e² is the error variance for the number-right score, the absolute error variance can be represented as σ²(∆) = σ_e²/n² + σ²(π_i)/n. With this, Formula 20 translates into

Φ(λ) = [σ_τ² + n²(π - λ)²] / [σ_τ² + n²(π - λ)² + σ_e² + nσ²(π_i)].   (21)

When λ = π, Φ(λ) reaches its lowest limit (index Φ):

Φ = σ_τ² / [σ_τ² + σ_e² + nσ²(π_i)].   (22)

The comparison of Formulas 19 and 22 shows that Φ does not exceed ρ_xx. This is consistent with the argument of Feldt and Brennan (1993) that "criterion-referenced interpretations of 'absolute' scores are more stringent than norm-referenced interpretations of 'relative' scores" (p. 141). It is important to emphasize that the estimation of ρ_xx, Φ, and Φ(λ) in GT requires information about the raw scores for a sample of examinees, whereas the formulas developed in this article do not require such information as long as the Rasch item calibration is available.
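
Given σ_τ², σ_e², σ²(π_i), and π from the preceding formulas, the reliability and dependability indices are simple ratios. A short Python sketch (illustrative; the numeric inputs are the theoretical values obtained in the illustrative example that follows, see Table 2):

n = 20
var_tau, var_e = 10.4200, 3.1046      # sigma_tau^2 and sigma_e^2 for the number-right score
var_pi, pi_dom = 0.0710, 0.5051       # variance of expected item means and the domain score

def rho_xx(var_tau, var_e):
    # Formula 19: norm-referenced reliability
    return var_tau / (var_tau + var_e)

def phi_index(var_tau, var_e, var_pi, n):
    # Formula 22: dependability index (lower limit of Phi(lambda), reached at lambda = pi)
    return var_tau / (var_tau + var_e + n * var_pi)

def phi_lambda(lam, var_tau, var_e, var_pi, pi_dom, n):
    # Formula 21: dependability index for a cutting score lambda on the domain-score scale
    num = var_tau + n**2 * (pi_dom - lam) ** 2
    return num / (num + var_e + n * var_pi)

print(round(rho_xx(var_tau, var_e), 4))                # about .77 for this example
print(round(phi_index(var_tau, var_e, var_pi, n), 4))  # about .70 for this example
print(round(phi_lambda(0.7, var_tau, var_e, var_pi, pi_dom, n), 4))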

Item Reliability

Besides reliability coefficients at test level, indices of reliability at item level can also be useful in test development and analysis. Under TSM, the reliability of item i is usually estimated with the product s_i r_iX, where s_i is the item-score standard deviation and r_iX is the point-biserial correlation between the item score and the total test score (e.g., Allen and Yen, 1979, p. 124). This article uses the definition "true item variance to observed item variance" for reliability of individual items, ρ_ii. Therefore, the reliability for Rasch calibrated items is evaluated here with

ρ_ii = σ²(τ_i) / [σ²(τ_i) + σ²(e_i)],   (23)

where σ²(τ_i) is obtained with Formula 16 and σ²(e_i) with Formula 13, when θ ~ N(0,1), or Formulas 14 and 15 when θ is with the logistic distribution for c = 1 and c = 1/2, respectively. Information about item reliability can be particularly useful when the purpose is to select items that maximize the internal consistency reliability (e.g., Allen and Yen, 1979, p. 125).
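
Formula 23 is a one-line ratio per item. A self-contained Python illustration for the normal trait case (the function name is illustrative):

import numpy as np

def rho_item(delta):
    # Formula 23 under theta ~ N(0,1): item true variance over observed item variance
    p = -0.0114 + 1.0228 / (1.0 + np.exp(delta / 1.226))                 # Formula 7
    a, b, c = (0.011, 0.195, 1.797) if abs(delta) < 4.0 else (0.0023, 0.171, 2.023)
    ve = a + b * np.exp(-0.5 * (delta / c) ** 2)                         # Formula 13
    vt = max(p * (1.0 - p) - ve, 0.0)                                    # Formula 16
    return vt / (vt + ve)

print(round(rho_item(0.0), 4), round(rho_item(-2.2), 4))                 # compare with Table 1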

Example

This example illustrates the estimation of expected true-score measures and reliability (at item and test level) using the formulas developed in this article for Rasch calibrated binary items. The example is organized in two sections. The first section provides (in algorithmic order) the expected measures and formulas used for their estimation with the normal trait distribution. The execution of the formulas in this section is conducted through the use of the statistical package SPSS (SPSS Inc., 1997). The SPSS syntax developed for this purpose is provided in Appendix B. The second section of this example compares the expected true-score measures and reliability to their empirical counterparts obtained with simulated data.

Theoretical Evaluation of True-Score Measures with Formulas

This section illustrates how researchers and practitioners may use the Rasch calibration of binary items to evaluate expected true-score measures and reliability at both item and test level. The Rasch difficulty parameters, δ_i, for 20 hypothetical items are provided in Table 1; (δ_i sum to zero and cover uniformly the interval from -2.2 to 2.5 on the logit scale). The expected measures and the formulas used for their evaluation with θ ~ N(0,1) are listed below in algorithmic order.

1. Expected item mean, π_i — Formula 7.

2. Expected item error variance, σ²(e_i) — Formula 13.

3. Expected item true variance, σ²(τ_i) — Formula 16.

4. Expected item reliability, ρ_ii — Formula 23.

5. Expected number-right score, µ — Formula 3; (the domain score is π = µ/n).

6. Expected error variance for the number-right score, σ_e² — Formula 5.

7. Expected true score variance for the number-right score, σ_τ² — Formula 17.

8. Variance of the expected item mean, σ²(π_i) — the variance of π_1, π_2, ..., π_20 (see Step 1).

9. Reliability, ρ_xx — Formula 19.

10. Dependability index Φ — Formula 22.

11. Dependability index Φ(λ) — Formula 21.

The SPSS printout (with the syntax in Appendix B and item parameters, δ_i, in Table 1) provides the expected true variance for the number-right score (σ_τ² = 10.4200), the error variance for the number-right score (σ_e² = 3.1046), the expected number-right score (µ = 10.1027), and the variance of expected item means, σ²(π_i) = .071. Using these values, we obtain: π = µ/n = .5051, ρ_xx = .7704 (with Formula 19), and Φ = .6972 (with Formula 22). Also, using Formula 21, values of the dependability index Φ(λ) are calculated and graphed for values of the cutting score, λ, that vary from 0 to 1 on the domain scale with an increment of 0.005 (see Figure 2). The graphical representation of Φ(λ) shows, for example, that its lowest value (Φ = .6972) occurs when the cutting score equals the population domain score (λ = π = .5051). Also, Φ(λ) = .85 for λ = .7, and Φ(λ) exceeds .90 when the cutting score is above .8 (i.e., 80% in percentages). This type of information is very useful for criterion-based interpretations and decisions with mastery tests.
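
For readers who prefer a general-purpose language to the SPSS syntax in Appendix B, the whole algorithm for the normal trait case can be sketched in a few lines of Python (illustrative code; with the 20 difficulties from Table 1 it should reproduce, up to rounding, the theoretical values reported here and in Table 2):

import numpy as np

deltas = np.array([-2.20, -2.00, -1.82, -1.53, -1.40, -1.25, -1.05, -0.85, -0.61, -0.25,
                    0.00,  0.28,  0.45,  0.85,  1.21,  1.33,  1.97,  2.15,  2.22,  2.50])

p = -0.0114 + 1.0228 / (1.0 + np.exp(deltas / 1.226))                       # Formula 7
mask = np.abs(deltas) < 4.0
ve = np.where(mask,
              0.011 + 0.195 * np.exp(-0.5 * (deltas / 1.797) ** 2),
              0.0023 + 0.171 * np.exp(-0.5 * (deltas / 2.023) ** 2))        # Formula 13
vt = np.clip(p * (1.0 - p) - ve, 0.0, None)                                 # Formula 16
rho_i = vt / (vt + ve)                                                      # Formula 23

n = len(deltas)
mu = p.sum()                                                                # Formula 3
var_e = ve.sum()                                                            # Formula 5
s = np.sqrt(vt)
var_tau = np.sum(np.outer(s, s))                                            # Formula 17
var_pi = p.var(ddof=1)                                                      # variance of the item means
rho_xx = var_tau / (var_tau + var_e)                                        # Formula 19
phi = var_tau / (var_tau + var_e + n * var_pi)                              # Formula 22

print(np.round(rho_i, 4))                                                   # item reliabilities (compare with Table 1)
print(round(mu, 4), round(var_tau, 4), round(var_e, 4), round(rho_xx, 4), round(phi, 4))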

The SPSS syntax (see Appendix B) provides also the expected true-score measures and reliability for individual items. They appear as "new" variables in the SPSS data spreadsheet, with notations that should be interpreted as follows: var_e = σ²(e_i), p = π_i, var_tau = σ²(τ_i), and roi = ρ_ii; (the values of these output variables are provided in Table 1).

Empirical Evaluations of True-Score Measures with Simulated Data

The expected measures obtained in the previous section are compared now with their empirical counterparts obtained with simulated data. Specifically, binary scores were generated to fit the Rasch model [with the item parameters, δ_i, in Table 1 and θ ~ N(0,1)] using a computer program written in SAS (SAS Institute, 1985) for Monte Carlo simulations (Dimitrov, 1996). The (ANOVA-based) generalizability model "person x item" (p x i) incorporated in this program was run with 20 replications generating binary scores for 1,500 persons in each replication.

Table 1
Expected True-Score Measures and Reliability for Individual Items Evaluated as a Function of Their Rasch Difficulty, δ_i.

Item   δ_i       σ²(e_i)   π_i (p_i)ᵃ        σ²(τ_i)   ρ_ii
1      -2.2000   .1032     .8656 (.8620)     .0132     .1131
2      -2.0000   .1160     .8440 (.8446)     .0157     .1191
3      -1.8200   .1278     .8224 (.8133)     .0183     .1251
4      -1.5300   .1467     .7833 (.7920)     .0231     .1358
5      -1.4000   .1550     .7639 (.7560)     .0254     .1408
6      -1.2500   .1641     .7402 (.7287)     .0282     .1466
7      -1.0500   .1754     .7065 (.6787)     .0320     .1541
8       -.8500   .1854     .6705 (.6827)     .0356     .1610
9       -.6100   .1951     .6247 (.6054)     .0394     .1679
10      -.2500   .2041     .5520 (.5600)     .0432     .1746
11       .0000   .2060     .5000 (.4874)     .0440     .1760
12       .2800   .2036     .4419 (.4463)     .0430     .1742
13       .4500   .2000     .4072 (.3860)     .0414     .1715
14       .8500   .1854     .3295 (.3400)     .0356     .1610
15      1.2100   .1664     .2663 (.2514)     .0289     .1481
16      1.3300   .1593     .2470 (.2353)     .0267     .1435
17      1.9700   .1179     .1594 (.1633)     .0161     .1201
18      2.1500   .1063     .1395 (.1300)     .0138     .1145
19      2.2200   .1019     .1323 (.1180)     .0129     .1125
20      2.5000   .0851     .1064 (.1093)     .0100     .1049

Note: σ²(e_i) is the expected error variance, π_i the expected mean, (p_i the empirical mean), σ²(τ_i) the expected true variance, and ρ_ii the expected reliability for individual items.
ᵃ Obtained for the SAS simulated binary scores.

Table 2
Theoretical True-Score Measures and Reliability (Evaluated with Formulas) and Their Empirical Counterparts Evaluated with Simulated Data for the Rasch Item Difficulties (δ_i) in Table 1 and the Normal Trait Distribution.

Evaluation     π       σ_τ²      σ_e²     σ²(π_i)   ρ_xx    Φ
Theoretical    .5051   10.4200   3.1046   .0710     .7704   .6972
Empirical      .5067    9.9572   3.1647   .0708     .7548   .6802

Note: The empirical estimates are obtained through averaging their values over 20 replications of SAS simulations for binary scores that fit the Rasch model with 1,500 persons per replication.


The resulting empirical estimates of true-score measures and reliability (at test level) are summarized in Table 2. The comparison of these empirical estimates with their theoretical counterparts (also presented in Table 2) shows a close match. The same holds for the comparison of the expected item means, π_i, with their empirical counterparts (p_i) obtained for the SAS simulated binary scores (see Table 1). Thus, with Rasch calibrated items, the formulas developed in this article provide (without data) estimates of true-score measures and reliability that one can obtain (with "ideal" data simulated for large samples) using the "person x item" GT model. In addition, the formulas provide expected values of true-score measures and reliability for individual items [σ²(τ_i), σ²(e_i), and ρ_ii] that are not provided with the GT model.

The Rasch person separation reliability index, R_R, was also calculated for the generating measures and item difficulties with the SAS simulations. Linacre (1997) refers to R_R obtained with generated θ-measures as generator-based Rasch reliability and shows that it is an upper limit for data-based R_R. The generator-based reliability with the SAS simulations in this example was found to be R_R = .673. The fact that the theoretical ρ_xx (.770) is higher than R_R (.673) in this example is not a surprise given that even empirical estimates of ρ_xx (KR-20 or Cronbach's alpha) generally exceed R_R (Linacre, 1996).
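
The simulation design described above is straightforward to reproduce in any environment. A minimal Python sketch (not the SAS program used in the article; NumPy assumed) that generates one replication of binary scores fitting the Rasch model with θ ~ N(0,1):

import numpy as np

rng = np.random.default_rng(1)
deltas = np.array([-2.20, -2.00, -1.82, -1.53, -1.40, -1.25, -1.05, -0.85, -0.61, -0.25,
                    0.00,  0.28,  0.45,  0.85,  1.21,  1.33,  1.97,  2.15,  2.22,  2.50])
n_persons = 1500

theta = rng.standard_normal(n_persons)                               # trait scores, N(0,1)
prob = 1.0 / (1.0 + np.exp(-(theta[:, None] - deltas[None, :])))     # Rasch P_i(theta), Equation 1
x = (rng.random((n_persons, len(deltas))) < prob).astype(int)        # binary item scores

p_emp = x.mean(axis=0)            # empirical item means (compare with p_i in Table 1)
total = x.sum(axis=1)             # number-right scores
print(np.round(p_emp, 4))
print(round(total.mean(), 4), round(total.var(ddof=1), 4))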

Conclusion

This paper provides formulas for true-score measures and reliability of binary scores as a function of the Rasch item difficulty for fixed distributions (normal or logistic) of the underlying trait. The scale parameters c = 1 and c = 1/2 were selected for the two fixed logistic distributions because they yield exact integral evaluations and produce normal-like shapes of the underlying trait distribution that may occur with Rasch measurements; (this is not true with just any scale parameter of the logistic distribution). Formulas 7 and 13 for π_i and σ²(e_i), respectively, with the normal trait distribution are developed by the use of approximation procedures, whereas all other formulas result from exact integral evaluations. The example in the previous section illustrates an application of the formulas for Rasch calibrated items. The calculations are easy to perform using statistical programs such as SAS and SPSS (see Appendix B), spreadsheet-based programs, or even hand calculators. The formulas can also be efficiently incorporated into computer programs for test analysis and measurement simulations.

The formulas developed in this article have theoretical and practical value for Rasch test development, score analysis, and simulation studies. Their closed analytical forms may reveal relationships that are difficult or impossible to see with empirical tools (e.g., Formula 13 shows that the item error variance has the same value for opposite, δ_i and -δ_i, Rasch item difficulties). Also, given a bank of Rasch calibrated items, one can select items to develop a test with known true-score measures and reliability for a person population prior to administering the test. One can also compare (without using raw scores or trait measures) the expected domain scores and reliability for test strands in which items are grouped by substantive characteristics (e.g., content areas or learning outcomes). In another scenario, the formulas can be used to evaluate (prior to administration) test booklets that are developed for follow-up measurements (e.g., in longitudinal studies) given the Rasch calibration of items at the base year. In simulation studies, researchers may use the formulas to generate true-score characteristics and reliability for targeted values of Rasch item difficulty without the necessity of generating binary scores or θ-scores for persons.

Figure 2. The dependability index, Φ(λ), estimated with Formula 21 for the theoretical true-score measures in Table 2 with the illustrative example.

The examples of possible applications of the formulas developed in this article illustrate what researchers and practitioners can gain over and above what they would learn from the Rasch analysis. It is important to emphasize that the proposed formulas and the Rasch analysis provide different types of information that can efficiently complement (not replace or exclude) each other in test development and analysis. For example, while the Rasch analysis is effective at locating persons on the underlying trait (Linacre, 1996), the formulas developed in this article are effective at determining population true-score characteristics for Rasch calibrated items without using raw scores or trait measures for examinees. Also, while the Rasch measures of reliability (R_R) and "separation" provide information about measurably different levels of performance in a sample of examinees (e.g., Wright, 1996, 1998), the index Φ(λ) provides information about the dependability of criterion-referenced decisions. Which approach to use (Rasch analysis, true-score analysis with the proposed formulas, or both) depends on the goals of the study as well as on the data that is available (raw scores, trait scores, or only estimates of Rasch item difficulty).

One can also argue that estimates of true-score measures and reliability can be obtained within the framework of generalizability theory using, for example, computer programs such as GENOVA (Crick and Brennan, 1983). This approach, however, (a) requires the binary scores for a large sample of examinees and (b) does not provide true-score measures at item level such as σ²(e_i), σ²(τ_i), and ρ_ii. Therefore, for Rasch calibrated items, the formulas developed in this article provide (without data) richer, more accurate, and easily obtained information about true-score measures and reliability at population level relative to (ANOVA-based) generalizability methods. Skewed trait distributions also occur with Rasch measurement (e.g., in medical studies; Wright, 2001). Dimitrov (2001) provided formulas for the expected error variance with some skewed trait distributions. Formulas 16 and 17 for the true score variance can also be used with skewed distributions because their derivation holds with any ϕ(θ). In conclusion, using Rasch calibration of items to evaluate their expected true-score measures, reliability, and dependability extends the traditional boundaries in calculating, interpreting, and reporting measurement results.

References

Allen, M. J., and Yen, W. M. (1979). Introduction to measurement theory. Pacific Grove, CA: Brooks/Cole.

Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing Program.

Brennan, R. L., and Kane, M. T. (1977). An index of dependability for mastery tests. Journal of Educational Measurement, 14, 277-289.

Clauser, B. (1999). Relating Cronbach and Rasch reliabilities. Rasch Measurement Transactions, 13, 696.

Crick, J. E., and Brennan, R. L. (1983). Manual for GENOVA: A generalized analysis of variance system. Iowa City, IA: American College Testing Program.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of a test. Psychometrika, 16, 297-334.

Dimitrov, D. M. (2002). Reliability: Arguments for multiple perspectives and potential problems with generalization across studies. Educational and Psychological Measurement, 62, 783-801.

Dimitrov, D. M. (2001, October). Reliability of Rasch measurement with skewed ability distributions. Paper presented at the International Conference on Objective Measurement, Chicago, IL.

Dimitrov, D. M. (1996, April). Monte Carlo approach for reliability estimations in generalizability studies. Paper presented at the Annual Meeting of the American Educational Research Association, New York.

Evans, M., Hastings, N., and Peacock, B. (1993). Statistical distributions (2nd ed.). New York: John Wiley.

Feldt, L. S., and Brennan, R. L. (1993). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). Phoenix, AZ: American Council on Education and The Oryx Press.

Hambleton, R. K., and Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12, 38-47.

Jöreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109-133.

Kuder, G. F., and Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151-160.

Linacre, J. M. (1997). KR-20 or Rasch reliability: Which tells the "truth"? Rasch Measurement Transactions, 11, 580-581.

Linacre, J. M. (1996). True-score reliability or Rasch statistical validity? Rasch Measurement Transactions, 9, 455-456.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

MathWorks, Inc. (1999). Learning MATLAB (Version 5.3). Natick, MA: Author.

Novick, M. R., and Lewis, C. (1967). Coefficient alpha and the reliability of composite measurements. Psychometrika, 32, 1-13.

Rasch, G. (1960). Probabilistic models for intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut.

SAS Institute. (1985). SAS user's guide: Version 5 edition. Cary, NC: Author.

Sawilowsky, S. S. (2000). Psychometrics versus datametrics: Comments on Vacha-Haase's "reliability generalization" method and some EPM editorial policies. Educational and Psychological Measurement, 60, 157-173.

Shavelson, R. J., and Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.

Smith, E. V., Jr. (2000). Metric development and score reporting. Journal of Applied Measurement, 3, 303-326.

Smith, E. V., Jr. (2001). Evidence for the reliability of measures and validity of measure interpretation: A Rasch measurement perspective. Journal of Applied Measurement, 2, 281-311.

SPSS Inc. (2002). SPSS Base 11.0 user's guide. Chicago: Author.

SPSS Inc. (1998). SigmaPlot 5.0 user's guide. Chicago: Author.

Thompson, B., and Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable. Educational and Psychological Measurement, 60, 174-195.

Wright, B. D. (2001). Separation, reliability and skewed distributions. Rasch Measurement Transactions, 14, 786.

Wright, B. D. (1998). Interpreting reliability. Rasch Measurement Transactions, 11, 602.

Wright, B. D. (1996). Reliability and separation. Rasch Measurement Transactions, 9, 472.


Appendix A

Derivation of Formulas 14 and 15 for the Item Error Variance with a Logistic Trait Distribution

For P_i(θ) with the dichotomous Rasch model (Equation 1), we have

P_i(θ)[1 - P_i(θ)] = exp(θ - δ_i) / [1 + exp(θ - δ_i)]²,   (A1)

which (as one can easily see) is also the first derivative of P_i(θ). With this, Equation 4 becomes

σ²(e_i) = ∫_{-∞}^{∞} [∂P_i(θ)/∂θ] ϕ(θ) dθ = ∫_{-∞}^{∞} ϕ(θ) dP_i(θ).   (A2)

As one may also notice, the logistic ϕ(θ) in Equation 6 is the first derivative of the function

Φ(θ) = exp(θ/c) / [1 + exp(θ/c)].

Replacing ϕ(θ) in Equation A2 with the first derivative of Φ(θ), we have

σ²(e_i) = ∫_{-∞}^{∞} [∂P_i(θ)/∂θ][∂Φ(θ)/∂θ] dθ = ∫_{-∞}^{∞} [∂P_i(θ)/∂θ] dΦ(θ).   (A3)

With integration by parts for the integral in Equation A3, we have

σ²(e_i) = [∂P_i(θ)/∂θ] Φ(θ) |_{-∞}^{∞} - ∫_{-∞}^{∞} Φ(θ) [∂²P_i(θ)/∂θ²] dθ
        = 0 - ∫_{-∞}^{∞} Φ(θ) [∂²P_i(θ)/∂θ²] dθ
        = -∫_{-∞}^{∞} exp(θ/c) exp(θ - δ_i)[1 - exp(θ - δ_i)] / {[1 + exp(θ/c)][1 + exp(θ - δ_i)]³} dθ.

Let E_i = exp(δ_i). Using the substitution rule for integration with x = exp(θ), we obtain

σ²(e_i) = ∫_{0}^{∞} E_i x^{1/c} (x - E_i) / {(1 + x^{1/c})(E_i + x)³} dx.   (A4)

The evaluation of the integral in Equation A4 for c = 1 or c = 1/2 is straightforward and yields:

1. With c = 1,

σ²(e_i) = E_i(δ_iE_i - 2E_i + δ_i + 2) / (E_i - 1)³.   (A5)

When δ_i = 0, the denominator of the ratio in Formula A5 equals zero. For this particular case, estimating the limit of the ratio at δ_i → 0, we obtain σ²(e_i) = 0.1667.

2. With c = 1/2,

σ²(e_i) = [πE_i(E_i⁴ - 6E_i² + 1) + 8(1 - E_i²)δ_iE_i² + 8(1 + E_i²)E_i²] / [2(1 + E_i²)³],   (A6)

where π is a constant (π = 3.14159..., not to be confused with the domain score) and E_i denotes exp(δ_i) for simplicity of the analytical form. As one may notice, Formulas A5 and A6 are exactly Formulas 14 and 15, respectively, with which the derivation is completed.


Appendix B

SPSS Syntax for Evaluation of True-Score Measures of Rasch Calibrated Binary Items with the Normal Trait Distribution

(Input variable: b, the Rasch item difficulty)

DO IF (ABS(b) < 4).
COMPUTE ve = .011 + .195*exp(-.5*((b/1.797)**2)).
ELSE.
COMPUTE ve = .0023 + .171*exp(-.5*((b/2.023)**2)).
END IF.
COMPUTE p = -.0114 + 1.0228/(1 + exp(b/1.226)).
COMPUTE vt = p*(1 - p) - ve.
IF (vt < 0) vt = 0.
SET FORMAT = F8.4 ERRORS = NONE RESULTS OFF HEADER NO.
FLIP VARIABLES b ve p vt.
VECTOR V = VAR001 TO VAR020.
COMPUTE Y = 0.
LOOP #I = 1 TO 20.
LOOP #J = 1 TO 20.
COMPUTE Y = Y + SQRT(V(#I)*V(#J)).
END LOOP.
END LOOP.
FLIP VAR001 TO VAR020 Y.
COMPUTE roi = vt/(vt + ve).
SET RESULTS ON.
REPORT FORMAT = AUTOMATIC
 /VARIABLES = ve ' ' p ' ' vt ' '
 /BREAK = (TOTAL)
 /SUMMARY = MAX(vt) 'True score variance:'
 /SUMMARY = SUBTRACT(SUM(ve) MAX(ve)) (vt (COMMA) (4)) 'Error variance:'
 /SUMMARY = SUBTRACT(SUM(p) MAX(p)) (vt (COMMA) (4)) 'Expected mean:' .
SELECT IF (CASE_LBL ~= 'Y').
RENAME VARIABLES (CASE_LBL = ITEM) (ve = var_err) (vt = var_tau).
VARIABLE LABELS p 'Expected item mean'.
DESCRIPTIVES VARIABLES = p
 /STATISTICS = VAR.

Note: The number of items (in this example, 20) should be specified in the syntax by the user. With 50 items, for example, change 20 to 50 and VAR020 to VAR050 in the respective four syntax lines.