
Journal of Educational Measurement, Fall 2006, Vol. 43, No. 3, pp. 215–243

Formulation of the DETECT Population Parameter and Evaluation of DETECT Estimator Bias

Louis A. Roussos and Ozlem Ozbek
University of Illinois at Urbana-Champaign

The development of the DETECT procedure marked an important advancement in nonparametric dimensionality analysis. DETECT is the first nonparametric technique to estimate the number of dimensions in a data set, estimate an effect size for multidimensionality, and identify which dimension is predominantly measured by each item. The efficacy of DETECT critically depends on accurate, minimally biased estimation of the expected conditional covariances of all the item pairs. However, the amount of bias in the DETECT estimator has been studied only in a few simulated unidimensional data sets. This is because the value of the DETECT population parameter is known to be zero for this case and has been unknown for cases when multidimensionality is present. In this article, integral formulas for the DETECT population parameter are derived for the most commonly used parametric multidimensional item response theory model, the Reckase and McKinley model. These formulas are then used to evaluate the bias in DETECT by positing a multidimensional model, simulating data from the model using a very large sample size (to eliminate random error), calculating the large-sample DETECT statistic, and finally calculating the DETECT population parameter to compare with the large-sample statistic. A wide variety of two- and three-dimensional models, including both simple structure and approximate simple structure, were investigated. The results indicated that DETECT does exhibit statistical bias in the large-sample estimation of the item-pair conditional covariances; but, for the simulated tests that had 20 or more items, the bias was small enough to result in the large-sample DETECT almost always correctly partitioning the items and the DETECT effect size estimator exhibiting negligible bias.

Standardized educational assessment batteries, such as the SAT or the ACT, are composed of multiple tests, each of which measures a particular knowledge domain and reports a single scale score.1 Because such tests report a single scale score, they are sometimes idealized as having exactly one latent trait, i.e., a single underlying construct or a single "dimension." Thus, the term "unidimensionality" is used in describing this idealization. The assumption of unidimensionality has proven very useful in the development of statistical procedures for designing, constructing, scoring, and evaluating standardized tests. However, from the perspective of test development specialists and item writers, the substantive knowledge domain of a standardized test is invariably composed of a multitude of knowledge subdomains, skills, and context topics, which may result in the test being sensitive to distinct multiple dimensions, where this sensitivity is termed "multidimensionality." Moreover, the different types of test items (multiple choice, constructed response, testlets with a common stimulus, etc.) may also give rise to multidimensionality. If the amount of multidimensionality is large enough (size of multidimensionality is discussed below in more detail), then a variety of harmful consequences may result in regard to designing, constructing, scoring, and evaluating a test using the assumption of unidimensionality (e.g., see Wainer & Wang, 2001; or Yen, 1984).

Thus, the increasing awareness of the potential multidimensional nature of educational and psychological tests has led to the development of improved statistical tools to detect and estimate multidimensionality. In this regard, nonparametric techniques have become increasingly popular because they avoid strong parametric modeling assumptions while still adhering to the fundamental principles of item response theory (IRT). Also, nonparametric techniques are not as computationally intensive as parametric methods, thus enabling more efficient data analysis. In particular, the development of the DETECT procedure and accompanying software (Kim, 1994; Zhang & Stout, 1999b) marked an important advancement in nonparametric dimensionality analysis. DETECT is the first nonparametric technique to estimate the number of dimensions in a data set, estimate an effect size for the multidimensionality, and identify which dimension is predominantly measured by each item. In practice, of course, the most thorough dimensionality analysis would include both parametric and nonparametric methods. For example, the combined use of cluster analysis with factor analysis could also be used to identify the number of dimensions and which items load on which dimensions, thus providing convergent validity information with respect to the results from a DETECT analysis.

The efficacy of DETECT critically depends on accurate, minimally biased estimation of the expected conditional covariances of all the item pairs. Whereas simulation studies (Zhang & Stout, 1999b) and real data analyses (Stout et al., 1996) have provided much evidence for the effectiveness of DETECT, the amount of bias in the DETECT estimator has been studied only in a few simulated unidimensional data sets. This is because the value of the DETECT population parameter is known to be zero for the case of unidimensionality but has been unknown for cases when multidimensionality is present. In this article, integral formulas for the DETECT population parameter are derived for the most commonly used parametric multidimensional item response theory model, the Reckase and McKinley (1991) model. These formulas are then used to evaluate the bias in DETECT by positing a multidimensional model, simulating data from the model using a very large sample size (to eliminate random error, a necessary condition for estimating statistical bias), calculating the large-sample DETECT statistic, and finally calculating the DETECT population parameter to compare with the large-sample statistic.

The first section of the article reviews the DETECT procedure. In this section, we review the general DETECT theory and the DETECT statistic. The next section derives the integral formulas for the DETECT population parameter for the general two-dimensional and three-dimensional cases assuming the Reckase and McKinley model. The third section describes the simulation studies we conducted, including detailed descriptions of the variety of models we studied and detailed summaries of the results of the simulation studies. Finally, we summarize and discuss the results and comment on their implications for the effectiveness of DETECT and for future DETECT research.


Brief Review of the DETECT Procedure

To explain the DETECT procedure, the term conditional covariance must first be defined, as conditional covariances play a critical role in the procedure. Consider a test that is unidimensionally scored (e.g., number-right score) but measures multiple latent traits. Let Θ stand for the vector of these multiple traits, such that Θ = (Θ1, Θ2, ..., ΘD), where D is the number of dimensions of the test. We say that Θ is the D-dimensional latent trait vector underlying examinee performance on the test for a randomly selected examinee. For example, if the test content specifications for a mathematics test partition the items of a test into three types measuring algebra, geometry, and trigonometry, then such a test might conceivably be found to have D = 3.

An examinee's score on a multidimensional test may be considered as an estimate of a true score based on some composite of the Θd, d = 1, ..., D, latent traits. Let Θα be this composite, such that,

$$\Theta_\alpha = \boldsymbol{\alpha}^{t}\boldsymbol{\Theta} = \sum_{d=1}^{D} \alpha_d \Theta_d, \qquad (1)$$

where α = (α1, α2, ..., αD)^t is a constant vector with the magnitude of α equal to unity. In particular, the αd values are determined by the direction of best measurement of the test in the geometric D-dimensional space. There are a number of related ways to define this direction of best measurement, with Wang's (1988) reference composite being one of the more well-known methods. In this article we chose the method of Zhang and Stout (1999a) because the DETECT theory is based on it. We give the more specific equation for it further below. As an example, consider the D = 3 mathematics test hypothesized above, and imagine a three-dimensional coordinate system for which Θ1, Θ2, and Θ3 represent algebra, geometry, and trigonometry, respectively. The direction of best measurement for each item can be represented by a vector extending from the origin and pointing in a specific direction. For example, the vector for a pure algebra item would coincide with the Θ1 axis. The vector for an item primarily measuring geometry but also requiring to a lesser degree some algebra skill would lie in the Θ1, Θ2 plane and point in a direction that has a smaller angle with the Θ2 axis than with the Θ1 axis. The direction of measurement α of the test composite can be viewed intuitively as a sort of weighted average or centroid of all the item directions, the items having greater discrimination power being given greater weight.

Now we can define what we mean by conditional covariance. The conditional covariance that we are referring to here is the covariance between the scores on two different items, conditional on the test latent trait composite, Θα.

The DETECT procedure is based on the idea that items that measure the same dimension on a multidimensional test exhibit positive conditional covariances, and items that measure different dimensions exhibit negative conditional covariances. (See Roussos, Stout, & Marden, 1998, for a detailed substantive description of how multidimensionality causes negative and positive conditional covariances.) Note that a conditional covariance between two items behaves much differently than the ordinary covariance. Test items invariably have positive covariances with each other. Even two items that measure different ability constructs would be expected to have item responses that positively co-vary because the ability constructs would be expected to be positively correlated. A conditional covariance, on the other hand, represents the covariance that remains after conditioning on a unidimensional estimate of ability, which is essentially the covariance that remains after fitting a unidimensional model. Zhang and Stout (1999a, 1999b) have presented a more precise discussion of the theoretical underpinnings of DETECT than we can go into here, except to mention that they showed that this sign behavior of the conditional covariances is consistent across all values of the conditioning variable, Θα, for a large family of multidimensional models, including the commonly used Reckase and McKinley (1991) model. This is important because it allows us to focus on a single parameter for an item pair: the expected conditional covariance, i.e., the conditional covariance averaged over the distribution of Θα.

We should note that this relationship that we described above between the dimensions measured by two items and the sign of their conditional covariance is the basis not only for DETECT but also for many other dimensionality estimation procedures. For example, consider two early nonparametric dimensionality hypothesis testing statistics: a log-odds ratio statistic (the log-odds has the same sign behavior as the covariance) developed by Rosenbaum (1984) and the DIMTEST conditional-covariance-based statistic developed by Stout (1987). In the case of Rosenbaum, the hypothesis test was applied to item pairs under the null hypothesis that the items measure the same dimension. The hypothesis test rejects the null hypothesis when a statistically significant negative statistic occurs, indicating that the two items in the item pair measure different dimensions and, thus, the test is not unidimensional. In the case of Stout's DIMTEST statistic, the statistic is applied to a set of items under the assumption that all the items in the set measure the same dimension and, under the null hypothesis, that this dimension is the same as that measured by the remaining items on the test. The hypothesis test rejects when a statistically significant positive statistic occurs, indicating that the items in the set measure a dimension that is different from that measured by the remaining items on the test.2

Based on the above noted relationship between the dimensions measured by two items and the sign of their conditional covariance, DETECT estimates the expected conditional covariances for all the item pairs and attempts to find a partitioning of the items into clusters such that the conditional covariances between items from the same cluster are positive and the conditional covariances between items from different clusters are negative. For convenience, we will use the term within-ccov's to refer to the conditional covariances that correspond to items that come from the same cluster and between-ccov's to refer to the conditional covariances that correspond to items that come from different clusters. DETECT then calculates the mean of all the item-pair conditional covariances by summing the within-ccov's, summing the negative of the between-ccov's, and dividing by the total number of item pairs. Therefore, the theoretical DETECT parameter is given by the following equation:

$$\mathrm{DETECT}_\alpha(P) = \frac{2}{N(N-1)} \sum_{1 \le i_1 < i_2 \le N} \delta_{i_1,i_2}(P)\, E\!\left[\mathrm{Cov}(X_{i_1}, X_{i_2} \mid \Theta_\alpha)\right], \qquad (2)$$


where N is the number of items on the test, i1 and i2 indicate two item numbers, X_i1 and X_i2 are random variables for the item responses to items i1 and i2, P is any partition of the test, and δ_i1,i2(P) is equal to 1 when i1 and i2 are from the same cluster, and −1 otherwise.

If DETECT is successful in finding a cluster partition of the items such that all the within-ccov's are positive and all the between-ccov's are negative, then the DETECT index will be equal to its maximum possible value, given the conditional covariance matrix.
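To make the computation in Equation (2) concrete, the following minimal sketch (our own illustration, not the DETECT software; the function and variable names are hypothetical) computes the index for a given partition from a matrix of expected conditional covariances.

```python
import numpy as np

def detect_index(ccov, clusters):
    """DETECT index of Equation (2) for a given partition.

    ccov     : (N, N) symmetric matrix of expected conditional covariances,
               estimated or theoretical, for all item pairs
    clusters : length-N sequence of cluster labels; items sharing a label are
               treated as measuring the same dimension
    Returns the average signed conditional covariance over all item pairs
    (multiply by 100 to match the scale reported by the DETECT program).
    """
    ccov = np.asarray(ccov, dtype=float)
    clusters = np.asarray(clusters)
    n = ccov.shape[0]
    total = 0.0
    for i1 in range(n - 1):
        for i2 in range(i1 + 1, n):
            sign = 1.0 if clusters[i1] == clusters[i2] else -1.0  # delta_{i1,i2}(P)
            total += sign * ccov[i1, i2]
    return 2.0 * total / (n * (n - 1))
```

When every within-ccov is positive and every between-ccov is negative, no other labeling of the items can produce a larger value, which is the sense in which the index is maximized by the correct partition.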

The DETECT estimator is written in the same way as the DETECT population parameter, except that E[Cov(X_i1, X_i2 | Θα)], the expected conditional covariance parameter, is replaced by Ê[Cov(X_i1, X_i2 | Θα)], an estimator of the expected conditional covariance parameter.

Because DETECT critically depends on accurate estimation of the signs of the conditional covariances, the estimation of E[Cov(X_i1, X_i2 | Θα)] must be carried out with as small statistical bias as possible. As described in Zhang and Stout (1999a), the estimator used in DETECT is the equally weighted average of two estimates, one that is known to have a positive bias and one that is known to have a negative bias. The equation for the positively biased estimator of E[Cov(X_i1, X_i2 | Θα)] is given by:

$$\hat{E}_{+}[\mathrm{Cov}(X_{i_1}, X_{i_2} \mid \Theta_\alpha)] = \sum_{k=0}^{N-2} \frac{J_k}{J}\,\mathrm{Cov}(X_{i_1}, X_{i_2} \mid Y = k),$$

where

Y = the "rest score" on the test, that is, the score on the rest of the test, not including X_i1 and X_i2;
J = the total number of examinees;
J_k = the number of examinees who got a rest score of k; and
Cov(X_i1, X_i2 | Y = k) = the observed conditional covariance between the scores on items i1 and i2, where the conditioning variable, Y, is the sum score on the remaining items on the test and is set equal to some fixed value, k.

The positive bias for this estimator is well known and was first documented in Holland and Rosenbaum (1986). The degree of bias is related to the amount of unreliability that is present in the conditioning variable, Y. The longer the test, the more reliable Y will be and the smaller the positive bias will be.

The equation for the negatively biased estimate is similar and is given by:

$$\hat{E}_{-}[\mathrm{Cov}(X_{i_1}, X_{i_2} \mid \Theta_\alpha)] = \sum_{k=0}^{N} \frac{J_k}{J}\,\mathrm{Cov}(X_{i_1}, X_{i_2} \mid Y = k),$$

where

Y = the score on the entire test, including X_i1 and X_i2;
J = the total number of examinees;
J_k = the number of examinees who got a total test score of k; and
Cov(X_i1, X_i2 | Y = k) = the observed covariance between the scores on items i1 and i2 computed for examinees having a conditioning score of k on the entire test.

The negative bias for this estimator is also well known and comes from the fact that the conditioning score includes the scores on the two items that we are taking the covariance of. Thus, the longer the test, the smaller this bias becomes.

A simulation study conducted by Zhang and Stout (1999a) indicated that for a 40-item unidimensional test the two biases mostly canceled each other out. Thus, they formed the DETECT conditional covariance estimator as the equally weighted average of these two estimates,

$$\hat{E}[\mathrm{Cov}(X_{i_1}, X_{i_2} \mid \Theta_\alpha)] = \left\{\hat{E}_{+}[\mathrm{Cov}(X_{i_1}, X_{i_2} \mid \Theta_\alpha)] + \hat{E}_{-}[\mathrm{Cov}(X_{i_1}, X_{i_2} \mid \Theta_\alpha)]\right\} \div 2.$$
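As an illustration of how these two estimators and their average might be computed from a 0/1 item response matrix, here is a minimal sketch (our own, with hypothetical names); it conditions on the rest score for the positively biased estimate and on the total score for the negatively biased one, exactly as defined above.

```python
import numpy as np

def weighted_cond_cov(x1, x2, y):
    """Sum over k of (J_k / J) times the observed Cov(x1, x2 | y = k)."""
    J = len(y)
    total = 0.0
    for k in np.unique(y):
        idx = (y == k)
        Jk = idx.sum()
        if Jk > 1:  # a covariance cannot be computed from a single examinee
            total += (Jk / J) * np.cov(x1[idx], x2[idx], bias=True)[0, 1]
    return total

def detect_ccov_estimate(X, i1, i2):
    """Equally weighted average of the positively and negatively biased estimators
    of E[Cov(X_i1, X_i2 | Theta_alpha)].  X is a 0/1 matrix with rows = examinees
    and columns = items."""
    total_score = X.sum(axis=1)                     # score on the entire test
    rest_score = total_score - X[:, i1] - X[:, i2]  # rest score, excluding both items
    e_plus = weighted_cond_cov(X[:, i1], X[:, i2], rest_score)
    e_minus = weighted_cond_cov(X[:, i1], X[:, i2], total_score)
    return 0.5 * (e_plus + e_minus)
```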

Using these estimated conditional covariances, the DETECT computer program searches for the partitioning P that maximizes the DETECT estimator. Because the space of all possible partitions is huge, it is computationally prohibitive to search this entire space. Instead, DETECT uses a specially designed genetic algorithm to limit the search of the space to an intelligently chosen subspace. For details of this algorithm, the reader is referred to Zhang and Stout (1999b).3
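The genetic algorithm itself is beyond the scope of this review, but what it maximizes can be seen with a brute-force search over a small item set. The sketch below (our own illustration, not the algorithm implemented in the DETECT program) enumerates every labeling of the items into at most two clusters and keeps the best one, reusing the detect_index function sketched earlier; it is feasible only for roughly 20 items or fewer.

```python
from itertools import product

def best_two_cluster_partition(ccov):
    """Exhaustive search over all labelings of the items into at most two clusters."""
    n = ccov.shape[0]
    best_value, best_labels = float("-inf"), None
    # Item 0 is fixed in cluster 0 so that mirror-image labelings are not counted twice.
    for rest in product([0, 1], repeat=n - 1):
        labels = (0,) + rest
        value = detect_index(ccov, labels)  # from the sketch following Equation (2)
        if value > best_value:
            best_value, best_labels = value, labels
    return best_value, best_labels
```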

Under the assumption of unidimensionality, all the conditional covariances have an expected value of zero, a direct result of the fundamental IRT principle of local independence. Indeed, dimensionality estimation is, in essence, a search for violations of local independence, such violations being deemed local item dependence or LID. Because the DETECT index estimates the average item-pair conditional covariance for a prescribed dimensionality structure, it is, in effect, an estimate of the average size of the violation of local independence. Thus, the interpretation of the DETECT index as an effect size for the amount of multidimensionality can now be more rigorously characterized as an index of the average size of the LID. We consider this to be an especially appropriate effect size measure because the size of the violations of local independence is what determines the size of whatever effects are caused by the presence of any unintended multidimensionality.

The actual DETECT software multiplies the index described above by 100. Using this scale, Stout et al. (1996) tentatively suggested that DETECT values of 0.1 or less are indicative of approximate unidimensionality and values greater than 1.0 indicate the presence of strong multidimensionality. Based on our own experiences with real data analyses and based on many simulations, we have most recently been using the following similar, but more detailed, scale: 1.0 or more indicates strong multidimensionality, 0.4 to 1.0 indicates moderate to large multidimensionality, 0.2 to 0.4 indicates weak to moderate multidimensionality, and below 0.2 indicates very weak multidimensionality or approximate unidimensionality.

Derivation of DETECT Parameter Equations

To compute the theoretical DETECT parameter given in Equation (2), we need to specify an IRT model and calculate the expected conditional covariance based on that model.


In this article, we use the most commonly applied multidimensional IRT model, the compensatory model of Reckase and McKinley (1991), augmented to include a lower asymptote parameter. Furthermore, we chose to re-write the form of the item response function (IRF) and use slightly different notation. Our re-written IRF is equivalent to the one introduced by Reckase and McKinley and is given as follows:

$$P(X_{ij} = 1 \mid \boldsymbol{\Theta}_j = \boldsymbol{\theta}_j) = c_i + (1 - c_i)\left[1 + e^{-1.7\,\|\mathbf{a}_i\|\left(\frac{\mathbf{a}_i^{t}\boldsymbol{\theta}_j}{\|\mathbf{a}_i\|} - b_i\right)}\right]^{-1}, \qquad (3)$$

where

X_ij = the response of examinee j to item i;
a_i = the vector of discrimination parameters for item i, (a_i1, a_i2, ..., a_iD), with each parameter corresponding to one of the D dimensions of the test (in the case of simple structure, all the a_id parameters for an item are equal to zero except for the one corresponding to the one dimension the item is measuring; and in the case of approximate simple structure, one of the a_id parameters is much larger than the others);
‖a_i‖ = the magnitude of a_i, referred to as MDISC by Reckase and McKinley (1991);
θ_j = the vector of ability parameters for examinee j, (θ_j1, θ_j2, ..., θ_jD), with each parameter corresponding to one of the D dimensions of the test; and
b_i = the difficulty parameter for item i, referred to as MID by Reckase and McKinley (1991).

This re-write of the Reckase and McKinley equation is particularly advantageous because it reveals a strong parallelism between the Reckase and McKinley multidimensional IRF (MIRF) and the well-known Birnbaum (1968) unidimensional IRF (UIRF), which is characterized by a_i, b_i, and c_i item parameters and a single θ_j examinee ability parameter. The MIRF equation is seen to take on the exact same form of the UIRF equation with a_i of the UIRF replaced by the magnitude of the a_i vector and with θ_j of the UIRF replaced by a normalized linear composite of the components of θ_j, which represents the composite best measured by item i. The MIRF b_i parameter then retains the same meaning as the UIRF b_i parameter. This MIRF equation is referred to as a "compensatory" model because the linear composite allows a high value of θ_jd on one dimension to compensate for a low value of θ_jd on some other dimension. Such an interaction of ability dimensions can occur, for example, in solving test items on a reading comprehension test where a test-taker who is somewhat weak, in general, in terms of extrapolation or inferencing skills, can compensate by having an in-depth knowledge of the particular topic for a reading passage. Another example is math problems that involve a written component (Walker & Beretvas, 2003). Even when test items involve skills that seem more non-compensatory, such as a math item involving both geometry and algebra skills (where lack of knowledge of either skill would seem to be a formidable barrier to correctly solving the item), practitioners have found that the compensatory model still provides a good fit (Ackerman, Gierl, & Walker, 2003; Reckase, Ackerman, & Carlson, 1988; Ackerman, 1994), perhaps because a slight amount of compensation can always occur even for seemingly non-compensatory skills.
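A direct transcription of Equation (3) into code may help fix the notation. The sketch below is our own illustration under the assumptions stated in the text (the 1.7 scaling constant and a common lower asymptote); the function name is hypothetical.

```python
import numpy as np

def mirf(theta, a, b, c=0.17):
    """Reckase-McKinley compensatory MIRF of Equation (3).

    theta : (J, D) array of examinee ability vectors
    a     : (D,) vector of discrimination parameters for one item
    b     : scalar difficulty parameter (MID)
    c     : lower asymptote (0.17 for all items in the simulations reported here)
    Returns the length-J vector of probabilities of a correct response.
    """
    a = np.asarray(a, dtype=float)
    mdisc = np.linalg.norm(a)            # ||a_i||, i.e., MDISC
    composite = theta @ a / mdisc        # normalized linear composite of the abilities
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * mdisc * (composite - b)))
```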

In addition to specifying an IRF model, we also need to specify the multidimensional ability distribution. In this article we use the multivariate normal distribution to model the ability distribution. We now can specialize and expand Equation (2) using the above modeling assumptions to obtain integral expressions for the expected conditional covariance for the general two-dimensional and three-dimensional cases.

Two-Dimensional (2D) Case

In the two-dimensional (2D) case, we have:

$$\Theta_\alpha = \alpha_1 \Theta_1 + \alpha_2 \Theta_2.$$

The weights, α1 and α2, are determined by the "direction of best measurement" of the test, as defined in Zhang and Stout (1999a). In particular, Zhang and Stout showed that for any compensatory model,

$$\alpha_d \propto \sum_{i=1}^{N} w_i a_{id},$$

where, specializing Zhang and Stout's results to the Reckase and McKinley MIRF 2D model,

$$w_i = \int_{\Theta_2}\!\int_{\Theta_1} \frac{P_i(\theta_1,\theta_2) - c_i}{1 - c_i}\, \frac{1.7\, Q_i(\theta_1,\theta_2)}{\left[\sum_{l=1}^{N} P_l(\theta_1,\theta_2)\, Q_l(\theta_1,\theta_2)\right]^{1/2}}\, f(\theta_1 \mid \theta_2)\, d\theta_1\, f(\theta_2)\, d\theta_2.$$

In this equation, f(·) is the probability density function for the designated variable, P_i(θ1, θ2) is the MIRF equation, and Q_i(θ1, θ2) = 1 − P_i(θ1, θ2). In this article, the w_i values were calculated using standard numerical integration techniques. The formula for αd requires a constant of proportionality so that the magnitude of α can be made equal to 1.

To calculate covariances conditional on Θα, we reparametrized the MIRF equation so that θ_j = (Θα, Θ2), thus giving a MIRF equation in terms of Θα and Θ2 instead of in terms of Θ1 and Θ2. Leaving out the c_i parameter and the j subscript for convenience, the rewritten MIRF is given by:

$$P(X_i = 1 \mid \boldsymbol{\theta}) = \left[1 + e^{-1.7\,\|\mathbf{a}_i\|\left(\frac{\frac{a_{i1}}{\alpha_1}\theta_\alpha + \left(a_{i2} - \frac{\alpha_2}{\alpha_1}a_{i1}\right)\theta_2}{\|\mathbf{a}_i\|} - b_i\right)}\right]^{-1}.$$


Next, we needed the formulas for P(X_i = 1 | θα) and P(X_i = 1, X_k = 1 | θα). The first conditional probability is obtained by integrating the MIRF equation over the distribution of Θ2 conditional on Θα, which gives us:

$$P(X_i = 1 \mid \theta_\alpha) = \int_{\Theta_2} P[X_i = 1 \mid (\theta_\alpha, \theta_2)]\, f(\theta_2 \mid \theta_\alpha)\, d\theta_2.$$

To calculate the joint probability of X_i = 1 and X_k = 1 conditional on Θα, we integrate Θ2 out of the joint probability conditional on both Θα and Θ2. Because Θα and Θ2 comprise the entire ability space, this second joint probability is calculated from the principle of local independence by taking the product of the marginal probabilities conditional on Θα and Θ2, which, of course, are given by the MIRF equation. Thus, the joint probability of X_i = 1 and X_k = 1 conditional on Θα is given by,

$$P(X_i = 1, X_k = 1 \mid \Theta_\alpha) = \int_{\Theta_2} P[X_i = 1 \mid (\theta_\alpha, \theta_2)]\, P[X_k = 1 \mid (\theta_\alpha, \theta_2)]\, f(\theta_2 \mid \theta_\alpha)\, d\theta_2.$$

The theoretical expected conditional covariance is then given by:

$$E[\mathrm{Cov}(X_i, X_k \mid \Theta_\alpha)] = \int_{\Theta_\alpha} \bigl(P(X_i = 1, X_k = 1 \mid \theta_\alpha) - P(X_i = 1 \mid \theta_\alpha)\, P(X_k = 1 \mid \theta_\alpha)\bigr)\, f(\theta_\alpha)\, d\theta_\alpha. \qquad (4)$$

Because Θα is a linear combination of two normal random variables, Θ1 and Θ2, it too is a normal random variable, so that the density functions for Θα and for Θ2 conditional on Θα are easily solved for. For a specified IRT model, Equation (4) was solved using standard numerical integration techniques.
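To indicate how such a computation can be carried out, the sketch below (our own; the grid sizes and truncation points are arbitrary choices, not those used for the article's results) evaluates Equation (4) for one item pair in the 2D case with simple rectangular-grid integration, reusing the mirf function sketched after Equation (3).

```python
import numpy as np

def expected_ccov_2d(a_i, b_i, a_k, b_k, alpha, rho, c=0.17, n_grid=161, span=5.0):
    """Numerical evaluation of Equation (4) for one item pair in the 2D case.

    a_i, a_k : (2,) discrimination vectors; b_i, b_k : difficulties
    alpha    : (alpha_1, alpha_2), the unit-length composite direction
    rho      : correlation between Theta_1 and Theta_2
    """
    a1, a2 = alpha
    var_alpha = a1**2 + a2**2 + 2 * a1 * a2 * rho   # Var(Theta_alpha)
    cov_2a = a1 * rho + a2                          # Cov(Theta_2, Theta_alpha)

    t_alpha = np.linspace(-span, span, n_grid) * np.sqrt(var_alpha)
    f_alpha = np.exp(-t_alpha**2 / (2 * var_alpha)) / np.sqrt(2 * np.pi * var_alpha)
    d_alpha = t_alpha[1] - t_alpha[0]

    ecc = 0.0
    for ta, fa in zip(t_alpha, f_alpha):
        # conditional distribution of Theta_2 given Theta_alpha = ta
        mu2 = cov_2a / var_alpha * ta
        sd2 = np.sqrt(1 - cov_2a**2 / var_alpha)
        t2 = np.linspace(mu2 - span * sd2, mu2 + span * sd2, n_grid)
        f2 = np.exp(-(t2 - mu2)**2 / (2 * sd2**2)) / np.sqrt(2 * np.pi * sd2**2)
        d2 = t2[1] - t2[0]

        theta1 = (ta - a2 * t2) / a1                # recover Theta_1 from (Theta_alpha, Theta_2)
        theta = np.column_stack([theta1, t2])
        p_i = mirf(theta, a_i, b_i, c)              # mirf() from the earlier sketch
        p_k = mirf(theta, a_k, b_k, c)

        marg_i = np.sum(p_i * f2) * d2              # P(X_i = 1 | theta_alpha)
        marg_k = np.sum(p_k * f2) * d2
        joint = np.sum(p_i * p_k * f2) * d2         # P(X_i = 1, X_k = 1 | theta_alpha)
        ecc += (joint - marg_i * marg_k) * fa * d_alpha
    return ecc
```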

Three-Dimensional (3D) Case

The equations in the three-dimensional (3D) case are, for the most part, simply straightforward generalizations of the equations used in the 2D case. However, the compensatory nature of the MIRF equations lends itself to some notable simplifications, especially if someone attempts to derive computational equations for four-dimensional or higher cases.

First, consider the equation for Θα, which generalizes in the expected way: Θα = α1Θ1 + α2Θ2 + α3Θ3.

The equation for αd is the same as in the 2D case, and the equation for w_i generalizes in the obvious way.

To calculate covariances conditional on Θα, we again reparametrized the MIRF equation, this time using θ_j = (Θα, Θ2, Θ3). Leaving out the c_i parameter and the j subscript for convenience, this rewritten MIRF is given by:


$$P(X_i = 1 \mid \boldsymbol{\theta}) = \left[1 + e^{-1.7\,\|\mathbf{a}_i\|\left(\frac{\frac{a_{i1}}{\alpha_1}\theta_\alpha + \left(a_{i2} - \frac{\alpha_2}{\alpha_1}a_{i1}\right)\theta_2 + \left(a_{i3} - \frac{\alpha_3}{\alpha_1}a_{i1}\right)\theta_3}{\|\mathbf{a}_i\|} - b_i\right)}\right]^{-1}.$$

To obtain the formulas for P(X_i = 1 | θα) and P(X_i = 1, X_k = 1 | θα), the compensatory model results here in some notable simplification. If we simply generalize from the 2D case, the first conditional probability would be obtained by integrating the MIRF equation over the joint distribution of Θ2 and Θ3, conditional on Θα; but a simpler solution is obtained by recognizing that the linear combination of Θ2 and Θ3 in the MIRF equation can be treated as a single variable, say q_i, which is given by:

$$q_i = \left(a_{i2} - \frac{\alpha_2}{\alpha_1}a_{i1}\right)\theta_2 + \left(a_{i3} - \frac{\alpha_3}{\alpha_1}a_{i1}\right)\theta_3.$$

Now P(X_i = 1 | θα) can be obtained by integrating the MIRF equation over the distribution of q_i conditional on Θα. Because the equation is practically the same as for the 2D case, we omit it here to save space.

To calculate the joint probability of X_i = 1 and X_k = 1 conditional on just Θα, we could generalize from the 2D case and integrate both Θ2 and Θ3 out of the joint probability conditional on Θα, Θ2, and Θ3. As we did above, we can again introduce q_i and q_k for each item, so that each item depends on just two ability variables, Θα and q_i or q_k. In this case, we do not get any simplification because integrating out q_i and q_k is no simpler than integrating out Θ2 and Θ3. However, if one were to extend the calculations to higher numbers of dimensions, all these cases would be reduced to the 3D case, which, of course, would result in noteworthy computational savings.4

Simulation Study

The purpose of this simulation study was to estimate the amount of statistical bias present in the DETECT estimation procedure and in its pairwise conditional covariance components. Estimation error is naturally due to two sources, bias and random error. In order to estimate the amount of bias, the random error in the estimator must be made very small. In the present study the random error was made negligible by employing very large sample sizes. Typically in real data analyses, DETECT involves splitting a data set into a training sample and a cross-validation sample to avoid capitalization on chance error in the training sample. In our analyses, we used a sample size of 120,000 simulated examinees, essentially eliminating the random noise and thus obviating the need to use a cross-validation sample. In addition to estimating the bias in the DETECT statistic, we also estimated the bias in the individual conditional covariance estimates.

Dimensionality Structure of the Simulated Data

The simulation study was designed to test out the ability of the DETECT procedure to provide minimally biased estimation of a variety of the types of dimensionality structures that DETECT is particularly intended to be applied to. Specifically, DETECT is particularly amenable to estimating simple structure multidimensionality. By "simple structure" we mean that each item on the test is measuring a single simulated dimension, but different groups of items may measure different dimensions. Furthermore, DETECT is also intended to exhibit strong robustness to slight to moderate departures from exact simple structure. We will refer to such departures from exact simple structure as "approximate simple structure," and we give a more exact description of how we simulated such structures below.5

DETECT can, of course, be used effectively in many other multidimensionality estimation settings, for example, by suggesting candidate items for a DIMTEST (Stout, 1987; Nandakumar & Stout, 1993; Stout, Froelich, & Gao, 2001) analysis or by giving the user an indication of the degree to which the multidimensionality is compatible with simple structure. But, still, the DETECT procedure was specifically designed to be sensitive to, and to provide unbiased estimation of, simple structure and approximate simple structure, which is why we focused on such structures in our study.

To simulate multidimensional item responses, we employed the compensatory multidimensional logistic IRF model of Reckase and McKinley (1991), which was given in Equation (3). To facilitate visualization of dimensionality structures formed by using this IRF model, a geometric representation has been developed (see, e.g., Ackerman, 1996), where each modeled dimension is represented by an axis in an orthogonal coordinate system. Items are plotted as finite-length vectors in this multidimensional space. The characteristic of these vectors that is of interest here is the angle between the vector for item i and an axis d corresponding to a particular dimension. This angle is referred to as α_id, and it is related to a_id by,

$$\cos\alpha_{id} = \frac{a_{id}}{\|\mathbf{a}_i\|}. \qquad (5)$$

Consider the case of 3D simple structure. In this case, each item is unidimensional. Thus, for any one item i, α_id = 0° and a_id = ‖a_i‖ for the value of d that corresponds to the single dimension that the item measures; and α_id = 90° and a_id = 0 for the other two values of d (i.e., for the dimensions not measured by the item). Next, consider the case of 3D approximate simple structure. In this case every item measures all three dimensions (a_id > 0, for all three values of d), but a_id will be much larger for one of the dimensions than for the other two. Thus, α_id will be closer to 0° than to 90° for the dimension d having the largest a_id, while α_id will be closer to 90° than to 0° for the other two dimensions.
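Equation (5) also shows how to go the other way: given an item's MDISC value and its angles, the components a_id can be generated. The brief sketch below (our own, with hypothetical names) does this for the 2D case and for a 3D case in which the angle with the item's primary axis and an azimuth in the plane of the other two axes are specified.

```python
import numpy as np

def a_from_angle_2d(mdisc, angle_deg):
    """2D discrimination vector from MDISC and the angle with the Theta_1 axis.
    By Equation (5), a_i1 = ||a_i|| cos(alpha_i1); the second component follows from
    a_i2 = ||a_i|| sin(alpha_i1) because the direction cosines must square-sum to one."""
    rad = np.deg2rad(angle_deg)
    return mdisc * np.array([np.cos(rad), np.sin(rad)])

def a_from_angles_3d(mdisc, angle_deg, azimuth_deg):
    """3D discrimination vector: angle_deg is the angle with the primary axis (taken
    here as Theta_1), and azimuth_deg splits the remaining discrimination between
    the other two axes."""
    t, phi = np.deg2rad(angle_deg), np.deg2rad(azimuth_deg)
    return mdisc * np.array([np.cos(t), np.sin(t) * np.cos(phi), np.sin(t) * np.sin(phi)])
```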

The dimensionality structure of many standardized testing situations can probably be well modeled by approximate simple structure. Standardized tests are invariably designed using a list of test specifications that typically include a limited list of skill and knowledge domains, one of which is assigned to each item on the test. These domains correspond to different potential dimensions of the test, though the number of statistically detectable dimensions would probably be fewer, with some domains clustering together (e.g., due to high correlations) to form dimensions. When an item is assigned to a particular skill or knowledge domain, this is like assigning the primary dimension the item is measuring. Furthermore, it is very difficult to write items that are pure measures of a skill or knowledge domain. Hence, it seems reasonable to assume that, in addition to the skill or knowledge domain corresponding to the test specification, the vast majority of items also measure, though to a lesser degree, other skill or knowledge domains, thus resulting in approximate simple structure. An empirical example of this type of structure is discussed in Miller and Hirsch (1992). Another important situation in which approximate simple structure seems reasonable is in the analysis of an assessment instrument that is a battery of tests, each of which is intended to provide a distinct single score. For example, the Test of English as a Foreign Language (TOEFL; a test developed and administered by Educational Testing Service) is made up of three subtests, each of which is intended to measure a distinct construct and can be thought of as a dimension of the entire TOEFL. An empirical example of this particular type of structure can be found in Beguin and Glas (2001).

To simulate a variety of dimensionality structures within this framework, five different parameters were varied: number of dimensions, correlation between the dimensions, test length, number of items on each dimension, and type of item dimensionality structure.

Number of dimensions. Although the focus of our study was the evaluation of the amount of bias in the DETECT statistic and in its conditional covariance estimator in the case of multidimensionality, we also included unidimensional cases to provide a more complete evaluation, especially focussing on the effect of test length on the bias.

In the case of multidimensionality, both two-dimensional (2D) and three-dimensional (3D) structures were simulated. It was important to investigate both 2D and 3D structures because the DETECT theory indicates that DETECT works best in the 2D case; whereas, in the 3D case, two of the dimensions can sometimes collapse together (e.g., if they are too highly correlated). As indicated above in deriving the DETECT parameter equations, for higher numbers of dimensions (4D or more) the calculation of the ccov parameter for any one item pair reduces to a 3D calculation. Thus, because 3D was simpler to visualize than higher dimensions and because all the effects of higher-dimensional structures are apparently included in the 3D case, we limited the maximum number of dimensions in this initial study to be the 3D case.

The latent dimensionality was represented by an ability vector, Θ, specifically (Θ1, Θ2) in the 2D case and (Θ1, Θ2, Θ3) in the 3D case. These vectors were modeled as multivariate normal with means of 0 and standard deviations of 1 on all the dimensions. The correlations were varied as indicated below.

Correlation between dimensions. Correlations of 0.5 and 0.7 were employed. In the 3D structures, we simulated three cases: (1) all correlations equal 0.5, (2) all correlations equal 0.7, and (3) two of the correlations equal to 0.5 and one equal to 0.7. Past studies in the literature have considered correlations over 0.5 to be high; and past studies seldom, if ever, employed correlations greater than 0.7 (Hambleton & Rovinelli, 1986; Nandakumar, 1994; Roznowski, Tucker, & Humphreys, 1991; Yen, 1984). Because DETECT would obviously work better with lower correlations, we limited our study to a moderate and a high correlation.

It was important to us to include the less-studied case of one correlation being different than the other two because the DETECT theory (Zhang & Stout, 1999b) indicates that such dimensional structures are harder to detect. Indeed, for the simple structure 3D case having two of the correlations equal to 0.5 and all three dimensions being equally weighted in Θα (one can think of this as each dimension having approximately equal numbers of items associated with them), the upper bound on the third correlation is 0.8. Thus, our choice of 0.7 was fairly close to the upper bound. So, if there is statistical bias in the DETECT estimator, then this particular case being near the boundary of the limits of DETECT provides a particularly stressful test for the procedure, especially in the case of approximate simple structure where the effective correlations between clusters representing the dimensions would all be greater than the latent trait correlations.

Test length. In the case of unidimensionality we investigated test lengths of 5, 10, 20, and 40 items. In the case of multidimensionality we investigated test lengths of approximately 20 and 40 items (slight deviations were sometimes necessary to accommodate the desired numbers of items per dimension, as discussed below). Our choices for test length were finalized in a pilot study and as the study progressed. As will be seen below, the results for the case of unidimensionality clearly indicated that test lengths of 10 or less had unacceptably large bias, which naturally led us to restrict the test length for the multidimensional cases to 20 and 40 items. Also, because bias reduces as test length increases and because (as will be seen below) the bias in the DETECT estimator was already seen to be minimal in the 40-item cases, there was clearly no reason to extend the simulated dimensionality structures to longer tests for this article.

Number of items per dimension. For the test length of 20 items, we only investigated the case of equal numbers of items on each dimension. In the case of three dimensions, we needed to use 21 items so that we could have equal numbers while maintaining a test length of at least 20 items.

For the test length of 40 items, we investigated two cases: (1) equal numbers of items per dimension and (2) the one longest dimension having approximately 50% more items (in the 2D case) or twice as many items (in the 3D case) as the remaining dimension or dimensions. In the 3D case, to maintain three equal-sized dimensions, we used a test length of 42 items; and in the unequal-sized 3D case, we used 10, 10, and 22 items to keep the test length the same as the equal-sized 3D case while keeping the longer dimension at least twice as long as the other two.

It was important to us to include the less-studied case of one dimension being much longer in test length than the others because the DETECT theory (Zhang & Stout, 1999b) showed that in certain 3D cases DETECT will fail to find all three dimensions if the longest dimension becomes too long. For example, in the 3D simple-structure case having all three correlations equal to each other and two of the dimensions having equal numbers of items, the upper bound on the number of items for the third dimension would be about three times the length of the other dimensions. In our 3D unequal-sized cases we had the longest dimension being about twice as long as the other two dimensions, clearly a stressed situation but not as close to the upper limit as our correlation choice.

Type of item dimensionality structure. Two types of structures were investigated: simple structure (SS) and approximate simple structure (APSS). As implied in the earlier discussion, SS refers to a test that can be divided into clusters, each of which corresponds to a separate test dimension (that is, a component of Θ). The term approximate simple structure (APSS) is introduced here for the situation where all the items in a single cluster have their highest discrimination on the same single dimension (the primary dimension the cluster of items is measuring) but the items in the cluster are also allowed to have varying degrees of smaller amounts of discrimination on the remaining dimensions of the test. Thus, in APSS each item of the test is primarily measuring just one of the dimensions of the test, and all the items that measure the same primary dimension can form a cluster that primarily measures a single dimension. Therefore, in the APSS case the test can again be divided into clusters, each of which primarily corresponds to a separate test dimension.

As discussed above, this mimics the common real-life situation of a standardized test that is developed with test specifications which may be thought of as assigning a primary dimension to each item on the test. The use of approximate simple structure in addition to simple structure reflects the fact that items almost invariably also involve incidental or less emphasized measurement of other dimensions.

For the APSS 2D case, the items corresponding to a dimension d0 had their item vectors fall randomly into a fan-shaped region with one edge of the fan being the axis for the dimension. To study the effect of the width of the fan, we employed two fan-width angles, 15° and 30°.

Similarly, for the APSS 3D case, the item vectors for a dimension d0 fell into a cone-shaped region. One edge of the cone lay along the axis for dimension d0, while the cone pointed in a direction that was symmetric with respect to the other two axes. The simulation studied the effect of the angular width of the cone by employing two widths, 15° and 30°.

Note that using an angle of 45° for each dimension would result in a 2D representation with absolutely no structure (i.e., the item vectors would be uniformly randomly distributed in the 2D space), which we felt was unrealistic. Thus, the choice of 30° corresponded to what we felt was nearly the largest angle that still resulted in substantial structure.
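One possible way to realize these fan-shaped regions in a simulation is sketched below for the 2D case, using the angle-to-discrimination conversion given after Equation (5). Uniform sampling of the angle within the fan is our own assumption for illustration; the article specifies only that the item vectors fall randomly within the region. The 3D cone case is analogous, with an additional azimuth angle drawn so that the vectors scatter around the cone's center line, and simple structure corresponds to a width of 0°.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_apss_item_2d(mdisc, primary_axis, fan_width_deg):
    """Discrimination vector for a 2D APSS item whose vector falls at a random angle
    inside a fan of the given width, one edge of which lies on the primary axis
    (primary_axis = 0 for Theta_1, 1 for Theta_2)."""
    offset = rng.uniform(0.0, fan_width_deg)
    angle_from_theta1 = offset if primary_axis == 0 else 90.0 - offset
    return a_from_angle_2d(mdisc, angle_from_theta1)  # from the earlier sketch
```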

Table 1 presents a detailed summary of the dimensionality structures used for the simulated data. A total of 49 different models were used, with 45 of them being multidimensional.

Finally, we present the ‖a_i‖ and b_i item parameters used for simulating the data. We chose to use items having a wide variety of values of ‖a_i‖ and b_i. The ranges we used for ‖a_i‖ (0.5 to 2.0) and for b (−1.5 to 1.5) are typical of values that occur for the unidimensional a and b parameters estimated for items on standardized tests. To further increase the realism, we limited the low MDISC values (‖a_i‖ = 0.5) to occur on the relatively easier items and restricted the high MDISC values (‖a_i‖ = 1.5 or 2) to occur on the relatively harder items, a phenomenon that has been noted in the literature (Nandakumar & Roussos, 2004). By systematically varying MDISC and MID values similar to the a and b values commonly estimated on real tests, we could evaluate the statistical bias of the DETECT conditional covariance estimator with respect to a variety of combinations of these values.

TABLE 1
Structure of the Simulated Data

Dimensions   Number of    Dimensional Model              Number of      Number of Items per Cluster                  Fan or Cone
             Examinees                                   Items          (correlations between dimensions)            Angles
1D           120,000      Unidimensional                 5, 10, 20, 40  N.A.                                         N.A.
2D           120,000      Simple Structure               20, 42, or 40  10/10, 21/21, or 15/25 (0.5 or 0.7)          N.A.
2D           120,000      Approximate Simple Structure   20, 42, or 40  10/10, 21/21, or 15/25 (0.5 or 0.7)          15 or 30
3D           120,000      Simple Structure               21 or 42       7/7/7, 14/14/14, or 10/10/22                 N.A.
                                                                        (0.5/0.5/0.5, 0.7/0.7/0.7, or 0.5/0.5/0.7)
3D           120,000      Approximate Simple Structure   21 or 42       7/7/7, 14/14/14, or 10/10/22                 15 or 30
                                                                        (0.5/0.5/0.5, 0.7/0.7/0.7, or 0.5/0.5/0.7)

Although the set of item parameter values for any of the simulated tests probably do not perfectly correspond to any real data set, neither are they drastically different from those that would be found on real tests. We purposely chose moderation in selecting our MDISC and MID item parameters so that our simulated tests would not be unusually reliable or unreliable, or unusually difficult or easy, conditional on the number of items used. Although we have a strong interest in studying a greater variety of parameter settings, we felt it best to limit the current study to maintain a manageable scope. In this manner we also restricted the c_i parameter to be equal to 0.17 for all the items, as we were not interested in investigating the effect of varying c in the current study. Table 2 presents the parameters used for the 2D simulations, and Table 3 presents the parameters used for the 3D simulations. For the simulated tests that had the same numbers of items for each dimension, the same item parameters were used for each dimension.

TABLE 2
Item Parameters for 2D Models

            N = 20 (10-10)      N = 42 (21-21)      N = 40 (15-25)
            Dimensions 1 & 2    Dimensions 1 & 2    Dimension 1         Dimension 2
Item No.    MDISC     b         MDISC     b         MDISC     b         MDISC     b
 1          0.5      −1.5       0.5      −1.5       0.5      −1.5       0.5      −1.5
 2          0.5      −1         0.5      −1         0.5      −1         0.5      −1.5
 3          1        −1         1        −1         0.5      −0.5       0.5      −1
 4          1        −0.5       1        −0.5       1        −1         0.5      −1
 5          1         0         1         0         1        −0.5       0.5      −0.5
 6          1.5       0         1.5       0         1         0         1        −1
 7          1.5       0.5       1.5       0.5       1         0.5       1        −0.5
 8          1.5       1         1.5       1         1         1         1        −0.5
 9          2         1         2         1         1.5      −0.5       1         0
10          2         1.5       2         1.5       1.5       0         1         0
11                              0.5      −1.5       1.5       0.5       1         0.5
12                              0.5      −1         1.5       1         1         1
13                              1        −1         2         0.5       1         1
14                              1        −0.5       2         1         1.5      −0.5
15                              1         0         2         1.5       1.5       0
16                              1.5       0                             1.5       0
17                              1.5       0.5                           1.5       0.5
18                              1.5       1                             1.5       0.5
19                              2         1                             1.5       1
20                              2         1.5                           1.5       1
21                              1         1                             2         0.5
22                                                                      2         1
23                                                                      2         1
24                                                                      2         1.5
25                                                                      2         1.5

TABLE 3
Item Parameters for 3D Models

            N = 21 (7-7-7)         N = 42 (14-14-14)       N = 42 (10-10-22)
            Dimensions 1, 2 & 3    Dimensions 1, 2 & 3     Dimensions 1 & 2     Dimension 3
Item No.    MDISC     b            MDISC     b             MDISC     b          MDISC     b
 1          0.5      −1.5          0.5      −1.5           0.5      −1.5        0.5      −1.5
 2          1        −1            0.5      −1             0.5      −1          0.5      −1.5
 3          1        −0.5          0.5      −0.5           1        −1          0.5      −1
 4          1         0            1        −1             1        −0.5        0.5      −1
 5          1.5       0.5          1        −0.5           1         0          1        −0.5
 6          1.5       1            1         0             1.5       0          1        −0.5
 7          2         1.5          1         0.5           1.5       0.5        1         0
 8                                 1         1             1.5       1          1         0
 9                                 1.5      −0.5           2         1          1         0.5
10                                 1.5       0             2         1.5        1         0.5
11                                 1.5       0.5                                1         1
12                                 1.5       1                                  1         1
13                                 2         1                                  1.5       0
14                                 2         1.5                                1.5       0
15                                                                              1.5       0.5
16                                                                              1.5       0.5
17                                                                              1.5       1
18                                                                              1.5       1
19                                                                              2         1
20                                                                              2         1
21                                                                              2         1.5
22                                                                              2         1.5

All item responses were independently generated so that items having the same values of ‖a_i‖ and b_i would not necessarily have the same generated item responses. The a_id parameters, as indicated by Equation (5), depended not only on the ‖a_i‖ value of Table 2 but also on the α_id values. The random generation of the α_id values was described above in the description of the types of dimensionality structure.

Method

For each studied multidimensional structure, data for 120,000 examinees were generated. The data were then analyzed as follows:

1. The DETECT procedure was run using all 120,000 of the examinees without cross-validation to obtain the DETECT clusters, the corresponding estimate of the large-sample DETECT statistic, and the conditional covariance estimates for all the individual item pairs.

2. The item-pair conditional-covariance population parameters were then calculated by numerical integration for every item pair. The conditional covariance parameters were multiplied by 100 to make them comparable with the DETECT estimates (DETECT automatically multiplies the conditional covariance estimates by 100).

3. The DETECT population parameter was calculated using the item-pair conditional-covariance population parameters with the same clusters that were used in calculating the large-sample DETECT statistic.

Five criteria were used to evaluate the results of the simulation studies:

DETECT bias. This bias is calculated by subtracting the value of the DETECT population parameter from the value of the large-sample DETECT statistic.

Accuracy in clusters. This statistic is the percentage of item pairs correctly clustered according to the true multidimensional structure. A correct clustering of an item pair occurs when two items on the same dimension are placed in the same cluster or when two items on different dimensions are not placed in the same cluster. This statistic is not calculated for the unidimensional cases.

IDN index. This index tells the percentage of the item pairs in the clusters for which the large-sample estimated within-ccov's were positive and the estimated between-ccov's were negative. The IDN estimate is also compared to the true value of IDN calculated from the conditional covariance parameters using the clusters estimated by DETECT on the simulated data.

Average ccov bias. This statistic is obtained by first calculating the statistical bias for each item-pair conditional covariance (subtracting the conditional covariance population parameter from the large-sample estimated conditional covariance as produced by DETECT) and then averaging these differences over all the item pairs.

RMS ccov bias. This bias statistic is the root-mean-square of the differences used to calculate the average ccov bias.
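To fix ideas, the five criteria can be written directly in terms of the matrices of estimated and theoretical conditional covariances and the cluster labels; the sketch below is our own restatement of the definitions just given (reusing the detect_index function sketched earlier), with hypothetical names.

```python
import numpy as np

def evaluation_criteria(ccov_hat, ccov_true, est_clusters, true_dims):
    """The five evaluation criteria for one simulated test.

    ccov_hat, ccov_true : (N, N) estimated and theoretical conditional covariances,
                          both already multiplied by 100 (the DETECT scale)
    est_clusters        : cluster labels found by DETECT
    true_dims           : true (generating) dimension of each item
    """
    n = ccov_hat.shape[0]
    pairs = np.triu_indices(n, k=1)                       # all item pairs, i1 < i2
    same_est = np.equal.outer(est_clusters, est_clusters)[pairs]
    same_true = np.equal.outer(true_dims, true_dims)[pairs]
    diff = ccov_hat[pairs] - ccov_true[pairs]
    return {
        "DETECT bias": detect_index(ccov_hat, est_clusters)
                       - detect_index(ccov_true, est_clusters),
        "accuracy in clusters": np.mean(same_est == same_true),
        "IDN index": np.mean(np.where(same_est, ccov_hat[pairs] > 0, ccov_hat[pairs] < 0)),
        "theoretical IDN": np.mean(np.where(same_est, ccov_true[pairs] > 0, ccov_true[pairs] < 0)),
        "average ccov bias": diff.mean(),
        "RMS ccov bias": np.sqrt(np.mean(diff ** 2)),
    }
```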

Results and Discussion

Unidimensional Results

Table 4 presents the results for the unidimensional simulations.

TABLE 4
DETECT and CCOV Results: Unidimensional Models

Test      DETECT     DETECT      DETECT   Average     R.M.S.      Accuracy
Length    Estimate   Parameter   Bias     CCOV Bias   CCOV Bias   in Clusters
 5        0.63       0           0.63     −0.50       0.70        N.A.
10        0.33       0           0.33     −0.20       0.40        N.A.
20        0.18       0           0.18     −0.05       0.24        N.A.
40        0.09       0           0.09     −0.02       0.14        N.A.

In the case of unidimensionality, the DETECT parameter is zero because all the theoretical conditional covariances are zero. If the DETECT estimator is unbiased, then as the number of examinees increases, the statistic would become closer and closer to zero. Any value less than or greater than zero would be a measure of bias (because our sample size is so large that random error should be negligibly small). We wanted to see the bias be less than 0.1, and we would be concerned if the bias were greater than 0.2, the cutoff we have developed for determining approximate unidimensionality. The results indicate that the DETECT estimator does indeed have some statistical bias in the unidimensional case, and that the bias increases to non-negligible amounts as test length decreases to 10 items. The results suggest that, for data sets consisting of 20 or more items, one may interpret a DETECT value of 0.2 or less as indicating unidimensionality. Moreover, the results also suggest that DETECT should not be used with tests that have less than 20 items.

These results are of special significance to the authors of this article in that we have a special interest in using DETECT for analyzing small sets of items (less than 20 items). Specifically, the theory underlying DETECT indicates that such small-scale DETECT analyses have the potential to be able to determine the number of dimensions in a data set even when the multidimensionality is not approximate simple structure and the sign pattern between the dimensions does not hold.6 Thus, the bias of the DETECT estimator must be reduced before this idea can be more seriously investigated. We believe that the bias correction method developed for use in DIMTEST (see Stout et al., 2001) offers a promising proposal for a solution to the DETECT bias problem.

The results for the bias estimation for the individual item-pair conditional covariances indicate that although the average (across item pairs) amount of bias is quite small for the test lengths of 20 and 40 items (−.05 and −.02, respectively), the reason these averages are small is not that all the individual biases are similarly small, but rather is due to more substantial negative and positive biases (on average about 0.24 and 0.14 for the test lengths of 20 and 40, respectively, as indicated by RMS bias) tending to cancel each other out. If such bias persists under the multidimensional models (as is likely), it could cause misclassification of items if the bias is large enough relative to the true conditional covariances and is of the opposite sign. As expected from the decrease in the DETECT statistic bias with test length, the RMS bias in the estimated conditional covariances also decreases, with the decrease being about 40% for each doubling of test length.

Multidimensional Results

Tables 5 and 6 present the results for the 2D SS cases, Tables 7 and 8 present results for the 2D APSS cases, Tables 9 and 10 present the results for 3D SS, and Tables 11 and 12 present results for 3D APSS.

TABLE 5
DETECT Results: 2D Simple Structure Models

Test          ρ     DETECT     DETECT      DETECT   IDN     Theo.   Accuracy in
Length              Estimate   Parameter   Bias     Index   IDN     Clusters (%)
20 (10-10)    .5    1.00       0.95        0.05     0.97    1.00    100
              .7    0.62       0.58        0.04     0.96    1.00    100
42 (21-21)    .5    0.99       0.96        0.03     0.99    1.00    100
              .7    0.60       0.59        0.01     0.99    1.00    100
40 (15-25)    .5    0.91       0.89        0.02     0.99    1.00    100
              .7    0.56       0.55        0.01     0.99    1.00    100


TABLE 6
CCOV Bias Results: 2D Simple Structure Models

                    All Item Pairs          Within Clusters         Between Clusters
Test          ρ     Average     R.M.S.      Average     R.M.S.      Average     R.M.S.
Length              CCOV Bias   CCOV Bias   CCOV Bias   CCOV Bias   CCOV Bias   CCOV Bias
20 (10-10)    .5    −0.03       0.21        0.02        0.22        −0.08       0.20
              .7    −0.03       0.21        0.01        0.22        −0.07       0.21
42 (21-21)    .5    −0.01       0.12        0.02        0.13        −0.04       0.12
              .7    −0.01       0.12        0.01        0.13        −0.02       0.12
40 (15-25)    .5    −0.01       0.13        0.01        0.14        −0.03       0.12
              .7    −0.01       0.14        0.00        0.14        −0.02       0.13

DETECT parameter. First, consider the DETECT index parameter results from Tables 5, 7, 9, and 11. As expected, the DETECT index parameter is smaller when the correlation is greater (holding other factors constant). Indeed, the drop in the DETECT parameter is rather large in going from a correlation of 0.5 to 0.7, indicating that our choice of these two values resulted in substantially different results, as desired. The index also decreases rather strongly as the departure from simple structure increases (again holding other factors constant). Thus, the DETECT parameter is not simply a reflection of the correlations of the underlying latent traits, but is also affected by how the items measure those traits, in particular, the degree of violation of local independence caused by the underlying traits. Given that the amount of violation of local independence seems like the most important parameter to consider when determining whether a unidimensional model is appropriate for data, we believe this is an especially desirable behavior for the DETECT parameter.

TABLE 7
DETECT Results: 2D Approximate Simple Structure Models

Test Length   ρ    Fan      DETECT     DETECT      DETECT   IDN     Theo.   Accuracy in
                   Angles   Estimate   Parameter   Bias     Index   IDN     Clusters (%)
20 (10-10)    .5   15 15    0.71       0.68        0.03     0.96    1.00    100
                   30 30    0.47       0.44        0.03     0.88    1.00    100
              .7   15 15    0.44       0.41        0.03     0.91    1.00    100
                   30 30    0.28       0.23        0.05     0.86    0.91    91
42 (21-21)    .5   15 15    0.68       0.66        0.02     0.99    1.00    100
                   30 30    0.41       0.40        0.01     0.95    1.00    100
              .7   15 15    0.41       0.40        0.01     0.96    1.00    100
                   30 30    0.24       0.23        0.01     0.88    1.00    100
40 (15-25)    .5   15 15    0.62       0.61        0.01     0.98    1.00    100
                   30 30    0.37       0.37        0.00     0.91    1.00    100
              .7   15 15    0.37       0.36        0.01     0.94    1.00    100
                   30 30    0.22       0.21        0.01     0.86    0.94    94


TABLE 8
CCOV Bias Results: 2D Approximate Simple Structure Models

                              All Item Pairs          Within Clusters         Between Clusters
Test Length   ρ    Fan       Average    R.M.S.       Average    R.M.S.       Average    R.M.S.
                   Angles    CCOV Bias  CCOV Bias    CCOV Bias  CCOV Bias    CCOV Bias  CCOV Bias
20 (10-10)    .5   15 15     −0.03      0.21         0.01       0.21         −0.05      0.21
                   30 30     −0.03      0.22         0.01       0.22         −0.05      0.22
              .7   15 15     −0.03      0.22         0.01       0.22         −0.06      0.21
                   30 30     −0.03      0.23         −0.02      0.22         −0.03      0.23
42 (21-21)    .5   15 15     −0.01      0.12         0.01       0.13         −0.02      0.12
                   30 30     −0.01      0.13         0.00       0.13         −0.01      0.12
              .7   15 15     −0.01      0.13         −0.00      0.13         −0.01      0.12
                   30 30     −0.01      0.13         −0.04      0.13         −0.01      0.13
40 (15-25)    .5   15 15     −0.01      0.14         0.01       0.14         −0.02      0.13
                   30 30     −0.01      0.14         −0.01      0.14         −0.01      0.14
              .7   15 15     −0.01      0.14         −0.01      0.14         −0.01      0.14
                   30 30     −0.01      0.14         −0.01      0.14         −0.01      0.15

In this regard, it is interesting to note that a correlation of 0.7 in the case of 2D simple structure yields a moderate to strong DETECT parameter of 0.58, but the DETECT parameter drops all the way down to 0.25, indicating weak multidimensionality, for the 30° fan structure. Thus, the amount of weakened multidimensionality induced by our approximate simple structure was substantial, as we desired.

In constructing the 2D 10-10 and the 2D 21-21 cases, we almost perfectly doubled the test in terms of using the same item parameters. If we had perfectly doubled the test, we would expect to see no difference in the DETECT parameter, and our nearly identical results for the 10-10 and 21-21 cases bear this out. This highlights a nice feature of the DETECT index: the effect size parameter is not affected by test length per se, but only to the extent that lengthening the test adds items that measure the latent dimensions differently from the original test.

TABLE 9
DETECT Results: 3D Simple Structure Models

Test Length      ρ           DETECT     DETECT      DETECT   IDN     Theo.   Accuracy in
                             Estimate   Parameter   Bias     Index   IDN     Clusters (%)
21 (7-7-7)       .5 .5 .5    0.80       0.77        0.03     0.99    1.00    100
                 .7 .7 .7    0.48       0.47        0.01     0.98    1.00    100
                 .5 .5 .7    0.69       0.66        0.03     0.99    1.00    100
42 (14-14-14)    .5 .5 .5    0.87       0.86        0.01     1.00    1.00    100
                 .7 .7 .7    0.53       0.52        0.01     0.99    1.00    100
                 .5 .5 .7    0.76       0.74        0.02     0.99    1.00    100
42 (10-10-22)    .5 .5 .5    0.75       0.74        0.01     0.99    1.00    100
                 .7 .7 .7    0.46       0.46        0.00     0.96    1.00    100
                 .5 .5 .7    0.60       0.59        0.01     0.99    1.00    100


TABLE 10
CCOV Bias Results: 3D Simple Structure Models

                              All Item Pairs          Within Clusters         Between Clusters
Test Length      ρ           Average    R.M.S.       Average    R.M.S.       Average    R.M.S.
                             CCOV Bias  CCOV Bias    CCOV Bias  CCOV Bias    CCOV Bias  CCOV Bias
21 (7-7-7)       .5 .5 .5    −0.02      0.16         0.01       0.18         −0.03      0.16
                 .7 .7 .7    −0.02      0.17         −0.00      0.18         −0.03      0.18
                 .5 .5 .7    −0.02      0.17         0.01       0.18         −0.03      0.16
42 (14-14-14)    .5 .5 .5    0.00       0.12         0.03       0.14         −0.01      0.12
                 .7 .7 .7    −0.00      0.13         0.02       0.14         −0.01      0.12
                 .5 .5 .7    0.00       0.13         0.03       0.14         −0.01      0.12
42 (10-10-22)    .5 .5 .5    −0.01      0.12         −0.00      0.14         −0.01      0.11
                 .7 .7 .7    −0.01      0.12         −0.01      0.13         −0.01      0.11
                 .5 .5 .7    −0.01      0.12         0.00       0.14         −0.01      0.11

For example, in the 3D case, in constructing our 7-7-7 and 14-14-14 cases, we did not simply double the shorter test in terms of the item parameters. Thus, in comparing these two cases, the DETECT parameter changed more than in the comparison of the 10-10 and 21-21 2D cases. Another example of how test length and latent trait correlations do not tell the whole story can be seen in comparing the two different ways we constructed the 40- and 42-item tests.

TABLE 11
DETECT Results: 3D Approximate Simple Structure Models

Test Length      ρ           Cone        DETECT     DETECT      DETECT   IDN     Theo.   Accuracy in
                             Angles      Estimate   Parameter   Bias     Index   IDN     Clusters (%)
21 (7-7-7)       .5 .5 .5    15 15 15    0.60       0.58        0.02     0.99    1.00    100
                             30 30 30    0.41       0.40        0.01     0.94    1.00    100
                 .7 .7 .7    15 15 15    0.35       0.34        0.01     0.94    1.00    100
                             30 30 30    0.24       0.22        0.02     0.84    0.97    97
                 .5 .5 .7    15 15 15    0.52       0.50        0.02     0.91    1.00    100
                             30 30 30    0.36       0.35        0.01     0.87    0.96    100
42 (14-14-14)    .5 .5 .5    15 15 15    0.68       0.67        0.01     1.00    1.00    100
                             30 30 30    0.50       0.49        0.01     0.98    0.99    100
                 .7 .7 .7    15 15 15    0.41       0.40        0.01     0.99    1.00    100
                             30 30 30    0.29       0.29        0.00     0.94    0.99    100
                 .5 .5 .7    15 15 15    0.59       0.57        0.02     0.98    0.99    100
                             30 30 30    0.43       0.42        0.01     0.94    0.97    100
42 (10-10-22)    .5 .5 .5    15 15 15    0.58       0.59        −0.01    0.97    0.98    100
                             30 30 30    0.43       0.43        0.00     0.93    0.97    100
                 .7 .7 .7    15 15 15    0.35       0.35        0.00     0.93    0.98    100
                             30 30 30    0.25       0.26        −0.01    0.87    0.97    100
                 .5 .5 .7    15 15 15    0.47       0.47        0.00     0.97    1.00    100
                             30 30 30    0.34       0.35        −0.01    0.90    0.98    100


TABLE 12
CCOV Bias Results: 3D Approximate Simple Structure Models

                                           All Item Pairs          Within Clusters         Between Clusters
Test Length      ρ           Cone          Average    R.M.S.       Average    R.M.S.       Average    R.M.S.
                             Angles        CCOV Bias  CCOV Bias    CCOV Bias  CCOV Bias    CCOV Bias  CCOV Bias
21 (7-7-7)       .5 .5 .5    15 15 15      −0.01      0.18         0.03       0.19         −0.02      0.18
                             30 30 30      −0.01      0.20         0.02       0.20         −0.01      0.19
                 .7 .7 .7    15 15 15      −0.01      0.19         0.00       0.20         −0.02      0.19
                             30 30 30      −0.01      0.20         −0.01      0.20         −0.01      0.21
                 .5 .5 .7    15 15 15      −0.01      0.19         0.02       0.19         −0.02      0.18
                             30 30 30      −0.01      0.20         0.01       0.20         −0.01      0.19
42 (14-14-14)    .5 .5 .5    15 15 15      0.01       0.13         0.03       0.13         −0.00      0.13
                             30 30 30      0.01       0.12         0.02       0.13         −0.00      0.12
                 .7 .7 .7    15 15 15      0.00       0.13         0.01       0.12         −0.00      0.13
                             30 30 30      0.00       0.13         0.01       0.12         −0.00      0.13
                 .5 .5 .7    15 15 15      0.00       0.13         0.02       0.13         −0.00      0.13
                             30 30 30      0.00       0.13         0.01       0.12         −0.00      0.13
42 (10-10-22)    .5 .5 .5    15 15 15      −0.00      0.12         −0.01      0.13         −0.00      0.11
                             30 30 30      −0.00      0.12         −0.01      0.13         0.00       0.11
                 .7 .7 .7    15 15 15      −0.01      0.12         −0.01      0.13         −0.00      0.11
                             30 30 30      −0.01      0.12         −0.02      0.13         0.00       0.12
                 .5 .5 .7    15 15 15      −0.01      0.12         −0.01      0.13         0.00       0.11
                             30 30 30      −0.00      0.12         −0.01      0.13         0.00       0.12

When one dimension has about 50% more items than the other dimension in the 2D case, or 100% more items than the other two dimensions in the 3D case, the DETECT parameter is reduced, though the effect is generally only small (approximately 0.05) to moderate (approximately 0.10).

DETECT estimator. Now let us consider the DETECT estimator results (DETECT bias, IDN index, and classification accuracy, labeled "Accuracy in Clusters") from Tables 5, 7, 9, and 11. The large-sample DETECT index itself shows remarkably small bias for all the simulated conditions. The largest bias was a positive 0.05, which is not large enough to adversely affect the interpretation of the index. What little bias there was exhibited, as expected, a strong tendency to decrease as test length increased.

Next, we turn to the classification accuracy results (the last column in Tables 5, 7, 9, and 11). Only three cases out of 45 had an accuracy of less than a perfect 100%, and those three all had accuracies greater than 90%. In the 2D case that had the lowest accuracy rate of 92%, the amount of true multidimensionality was rather small: the item vectors of the two dimensions formed rather wide fans of 30° each, the latent dimensions were correlated 0.7 with each other, and the DETECT population parameter was only 0.21.

Consider next the large-sample DETECT IDN index results. The IDN index gives the proportion of item pairs whose estimated signs have the correct value, in the sense that two items in the same cluster should have positive conditional covariance with each other, while two items from different clusters should have negative conditional covariance with each other. The IDN indices show that DETECT's conditional covariance estimator correctly estimates the signs of the conditional covariances for over 90% of the item pairs in 38 out of 45 cases. In two of the remaining seven cases (the two with IDN indices of 0.87), the maximum possible IDN values (the theoretical values that occur with perfect clustering) were only 0.96 and 0.97 because the simulated structure departed so much from pure simple structure that some of the between-cluster ccov parameters were actually positive. Taking this into account, the estimated IDN values were actually 90% or more of the maximum theoretical values for these two cases. In the five remaining cases, the theoretical IDN would have been 1.0 if the clustering had been perfect. Still, four of these IDN estimates were over 0.85, while the lowest was 0.84. For these five cases the dimensionality structures were all approximate simple structure and the fan or cone angles were all 30°. For four of these cases the correlations were all 0.7 and the DETECT parameters were all 0.23 or less. Thus, four of these cases were ones where the violations of local independence were very small and difficult to detect. The one case that had a correlation of 0.5 and a DETECT parameter of 0.44 also had the highest IDN of the five, 0.88, and its clustering accuracy was a perfect 100%.
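As a concrete illustration of how an IDN-type index can be computed from estimated conditional covariances, the sketch below counts the proportion of item pairs whose estimated sign agrees with the cluster structure (positive within a cluster, negative between clusters). The function name, inputs, and example values are ours and purely illustrative; this is not the DETECT software's code.

```python
import numpy as np
from itertools import combinations

def idn_index(est_ccov: np.ndarray, cluster: list) -> float:
    """Proportion of item pairs whose estimated conditional covariance has the
    theoretically correct sign: positive for a pair of items in the same cluster,
    negative for a pair of items in different clusters."""
    n_correct, n_pairs = 0, 0
    for i, j in combinations(range(len(cluster)), 2):
        same_cluster = cluster[i] == cluster[j]
        correct_sign = est_ccov[i, j] > 0 if same_cluster else est_ccov[i, j] < 0
        n_correct += int(correct_sign)
        n_pairs += 1
    return n_correct / n_pairs

# Tiny hypothetical example: four items, two per dimension; the pair of items
# with indices (1, 2) has a positive estimate despite lying in different clusters.
cluster = [1, 1, 2, 2]
est_ccov = np.array([[ 0.00,  0.25, -0.10, -0.15],
                     [ 0.25,  0.00,  0.05, -0.20],
                     [-0.10,  0.05,  0.00,  0.30],
                     [-0.15, -0.20,  0.30,  0.00]])
print(idn_index(est_ccov, cluster))  # 5 of 6 pairs correct -> about 0.83
```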

These results do imply that there is some bias occurring with the conditional covariance estimator, but the bias is small in the sense that it becomes consequential mainly when the conditional covariances are fairly close to zero. Furthermore, the results show that this occurs infrequently enough to have only a minimal effect on large-sample correct classification. Next we turn to the conditional covariance results, where we directly investigate the bias in the ccov estimator.

Conditional covariance estimator. Next we consider the conditional covariance estimation bias results presented in Tables 6, 8, 10, and 12. First, note that the magnitude of the RMS bias is, as expected from the unidimensional results, generally small, at most around 0.23 for the tests with 20-21 items and 0.14 for the tests with 40-42 items. These values are to be interpreted on the DETECT scale, for which values less than 0.20 are considered to indicate very weak multidimensionality.

These results also indicate that the bias in the estimation of the conditional covariances for many item pairs is clearly greater than the bias in the estimation of the DETECT parameter. However, the fact that the IDN values are typically over 90% (over 85% in 44 of 45 cases) indicates that in the vast majority of cases the bias in the ccov estimator does not result in the ccov sign being incorrectly estimated and, as a consequence, has a nearly negligible effect on the large-sample correct classification rates, as indicated by the "Accuracy in Clusters" column. Note also that the size of the RMS bias is almost always smaller than the magnitude of the DETECT parameter, which can be interpreted as the average ccov in absolute value. It is only in cases of very weak multidimensionality in the 2D cases with the shorter tests that the ccov bias takes on values close to that of the DETECT parameter.
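To make the link between the item-pair conditional covariances and the DETECT effect size explicit, the following sketch computes a DETECT-style index for a fixed item partition: the average conditional covariance weighted by +1 for within-cluster pairs and −1 for between-cluster pairs, placed on the reporting scale that multiplies conditional covariances by 100. This is our paraphrase of the published formulation, offered only as an illustration (the DETECT statistic maximizes this quantity over candidate partitions); the function and argument names are ours, not the DETECT program's.

```python
import numpy as np
from itertools import combinations

def detect_index(ccov: np.ndarray, partition: list, scale: float = 100.0) -> float:
    """Average over all item pairs of the conditional covariance, weighted +1 when
    both items fall in the same cluster of the partition and -1 otherwise, then
    multiplied by `scale` (use scale=1.0 if ccov is already on the DETECT scale)."""
    total, n_pairs = 0.0, 0
    for i, j in combinations(range(len(partition)), 2):
        weight = 1.0 if partition[i] == partition[j] else -1.0
        total += weight * ccov[i, j]
        n_pairs += 1
    return scale * total / n_pairs
```

For a correct partition under simple structure, where within-cluster conditional covariances are positive and between-cluster conditional covariances are negative, every pair contributes positively, so the value reduces to the average conditional covariance in absolute value, which is the interpretation used above.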

Because the ccov estimator bias is often much larger than that in the DETECT estimator, the question naturally arises as to why the ccov estimator bias does not cause more of a bias in the DETECT estimator. In short, the main reason (beyond the signs being mostly unaffected) is that there tend to be both negative and positive biases in both the within- and between-ccov's, and these biases tend to cancel each other out. The largest mean conditional covariance bias was only −0.03. Furthermore, even when broken down into the within-ccov's and the between-ccov's, these average biases, though larger than the overall means, were still quite small: the largest for a between-ccov bias was −0.08 and the largest for a within-ccov was 0.05. And as test length increased, these maxima reduced to −0.03 for a between-ccov and 0.04 for a within-ccov. Also note that the RMS ccov biases for the between-ccov's and the within-ccov's are very similar to each other.

There is a trend for the average bias in the within-ccov's to be slightly positive and the average bias in the between-ccov's to be slightly negative (both of which, incidentally, favor more accurate classification); however, these two tendencies are very small, especially in comparison to the reported RMS ccov biases.

It is important to note that, as expected, the ccov estimator bias does decrease as test length increases, and the amount of RMS bias in the multidimensional cases was very similar to that in the unidimensional cases of similar test length. One small, but unexpected, effect was that the bias for the 3D 21-item tests tended to be slightly smaller than for the 2D 20-item tests, more so than would be expected based on the 3D tests being only one item longer. There was no notable difference between the 40- and 42-item 2D tests in comparison with the 42-item 3D tests. We are not sure why this slight difference occurred between the 20-item 2D and 21-item 3D tests. Also note that the amount of bias in the ccov estimator was unrelated to the size of the DETECT parameter, the amount of correlation, or the number of items associated with each dimension.

We also investigated whether certain pairs of items (in terms of their item parameters) tended to result in a particular type of bias. The largest negative biases, about 0.20 in magnitude, occurred when items with higher discrimination parameters were paired with items with lower discrimination values. The largest positive biases, about 0.20–0.25 in magnitude, occurred when items with medium discrimination and easier difficulty were paired with each other.

Finally, one should note that if the multidimensional simulation models had included test lengths as short as those included in the unidimensional models, the ccov estimator bias would have been expected to increase to the point where it would lead to more estimated sign errors, which would, in turn, lead to increased misclassification and, thus, increased error in the DETECT analysis. Indeed, because of the results for the unidimensional cases, we purposely did not include shorter tests in the multidimensional simulation models, and we recommend that DETECT not be used with tests having fewer than 20 items until a new bias correction procedure is developed for it.

Concluding Remarks

Using very large sample sizes, this study has evaluated the amount of statistical bias present in the DETECT index estimator in applications to unidimensional data and to the types of multidimensional structures that DETECT was especially intended to be applied to, that is, simple structure and approximate simple structure.

The results showed that while DETECT's conditional covariance estimator sometimes showed notable bias (greater than 0.20 in magnitude on the DETECT scale, which multiplies all conditional covariances by 100), the bias was generally small enough to result in accurate estimation of the vast majority of the signs of the conditional covariances for the simulated models that had 20 or more items. Moreover, the biases in the conditional covariance estimates fortunately tended to cancel each other out for both the within-cluster estimates and the between-cluster estimates. Because the conditional covariance signs were generally accurately estimated and the negative and positive biases in the conditional covariance large-sample estimates tended to cancel each other out, the DETECT estimator exhibited minimal bias and produced nearly perfect clustering in almost all the cases studied.

The bias in DETECT and its conditional covariance estimator was shown to increase substantially as the length of the test decreased to 10 or 5 items. A new bias correction must be incorporated in DETECT to address this bias before DETECT can be reliably applied to smaller item sets. This is an important issue to address because the application of DETECT to small item sets could enable DETECT to provide a reliable indication of test dimensionality for a wider variety of dimensionality structures than just simple structure and approximate simple structure. Thus, future studies are planned to investigate new bias correction methods for DETECT, in particular the incorporation of a bias correction similar to that used in DIMTEST.

Furthermore, the DETECT theory was developed for a generalized compensatory multidimensional model. It would be of interest to investigate how DETECT would behave with non-compensatory models. By taking the same approach as in this article, we plan to investigate the DETECT parameter in the case of non-compensatory models to see whether the conditional covariance parameters follow the same sign pattern as in the multidimensional compensatory case.

We should note that, for a number of reasons, DETECT analyses with real data will not always come out as cleanly as in the simulation studies presented here. First, because we were focused on evaluating statistical bias, we needed to use sample sizes as large as we possibly could, much larger than would normally be available with real data. Thus, in real data analyses, there would be more noise in the DETECT statistic, and this noise alone could be great enough to cause more misclassifications and lower IDN values. Indeed, a simulation study to evaluate the standard error of DETECT (which has no standard error estimator) with realistic sample sizes is another high-priority research need.

Another reason why some real data analyses may give less interpretable DETECT results is that the dimensionality structure may deviate rather strongly from those used in the current study. The simulated structures we used are most similar to tests that are relatively difficult, have a fairly broad range of item discrimination and difficulty, and have dimensionality structures with two or more dimensions that each have substantial numbers of items. For example, the application of DETECT by Stout et al. (1996) to the Law School Admission Test (LSAT) and its reading comprehension (RC) and analytical reasoning (AR) subtests all resulted in easily interpretable solutions, the results of which corresponded well with other analyses. However, the Stout et al. application of DETECT to the LSAT logical reasoning (LR) subtest was not as easily interpretable. The LR subtest seemed to be composed of many fine-grained, highly correlated skills with limited numbers of items per skill. Although the DETECT results were helpful in this case, a more complete picture of the dimensionality structure was provided by augmenting the DETECT analyses with other dimensionality analyses. A similar example is an analysis of TOEFL data by Jang and Roussos (2004). In analyzing the entire test, they found that DETECT effectively identified the dominant dimensions measured by the test as a whole, but when they analyzed just the RC section of the TOEFL, they found that they needed to use additional tools to obtain a more complete picture of the dimensionality structure.

In summary, the current study has provided an important initial evaluation of the statistical bias in the DETECT estimation procedure and has shown that, while a more effective bias correction is needed, the bias was of low enough magnitude for large-sample DETECT to work quite effectively for the simple structure and approximate simple structure simulation models to which the current article was limited, conditional on the overall test length being at least 20 items. This study should be replicated with smaller sample sizes to evaluate the standard error of DETECT in these situations. Also, because practitioners do apply DETECT to real tests with other structures, such as those having many dimensions, at least some of which are highly correlated and have only a few items associated with them, it is important that further simulations be conducted with such structures and with a greater variety of item parameter distributions, to evaluate both the DETECT bias and its standard error.

Notes

1. A scale score is a monotonic transformation of the raw score onto a fixed scale; placing raw scores from alternate test forms onto a common scale allows for meaningful comparison of scores across forms that have small but noticeable differences in difficulty level.

2. See Habing and Roussos (2003) for a more thorough theoretical discussion of how these and other dimensionality estimation procedures are strongly related in terms of conditional covariances.

3. The DETECT software is available from Assessment Systems Corporation, located on the web at www.assess.com.

4. Software for calculating the DETECT index population parameter and the conditional covariance population parameters is available from the first author.

5. Readers should refer to Zhang and Stout (1999b) for a more rigorous theoretical description of the types of approximate simple structure supported by the DETECT theory.

6. A detailed discussion of this topic is beyond the scope of this article.

References

Ackerman, T. A. (1994). Using multidimensional item response theory to understand what items and tests are measuring. Applied Measurement in Education, 7, 255–278.

Ackerman, T. A. (1996). Graphical representation of multidimensional item response theory analyses. Applied Psychological Measurement, 20, 311–330.


Ackerman, T. A., Gierl, M. J., & Walker, C. M. (2003). Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measurement: Issues and Practice, 22, 37–53.

Beguin, A. A., & Glas, C. A. W. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66, 541–562.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 392–479). Reading, MA: Addison-Wesley.

Habing, B., & Roussos, L. A. (2003). On the need for negative local item dependence. Psychometrika, 68, 435–451.

Hambleton, R. K., & Rovinelli, R. J. (1986). Assessing the dimensionality of a set of test items. Applied Psychological Measurement, 10, 287–302.

Holland, P. W., & Rosenbaum, P. R. (1986). Conditional association and unidimensionality in monotone latent variable models. The Annals of Statistics, 14, 1523–1543.

Jang, E. E., & Roussos, L. A. (2004). An investigation into the dimensionality of TOEFL using conditional covariance-based non-parametric approach. Submitted for publication.

Kim, H. R. (1994). New techniques for the dimensionality assessment of standardized test data. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.

Miller, T. R., & Hirsch, T. M. (1992). Cluster analysis of angular data in applications of multidimensional item-response theory. Applied Measurement in Education, 5, 193–211.

Nandakumar, R. (1994). Assessing dimensionality of a set of items—Comparison of different approaches. Journal of Educational Measurement, 31, 17–35.

Nandakumar, R., & Roussos, L. A. (2004). Evaluation of the CATSIB DIF procedure in a pretest setting. Journal of Educational and Behavioral Statistics, 29, 177–199.

Nandakumar, R., & Stout, W. F. (1993). Refinements of Stout's procedure for assessing latent trait unidimensionality. Journal of Educational Statistics, 18, 41–68.

Reckase, M. D., Ackerman, T. A., & Carlson, J. E. (1988). Building a unidimensional test using multidimensional items. Journal of Educational Measurement, 25, 193–203.

Reckase, M. D., & McKinley, R. L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 15, 361–373.

Rosenbaum, P. R. (1984). Testing the conditional independence and monotonicity assumptions of item response theory. Psychometrika, 49, 425–435.

Roussos, L. A., Stout, W. F., & Marden, J. I. (1998). Using new proximity measures with hierarchical cluster analysis to detect multidimensionality. Journal of Educational Measurement, 35, 1–30.

Roznowski, M., Tucker, L. R., & Humphreys, L. G. (1991). Three approaches to determining the dimensionality of binary items. Applied Psychological Measurement, 15, 109–127.

Stout, W. F. (1987). A nonparametric approach for assessing latent trait dimensionality. Psychometrika, 52, 589–617.

Stout, W. F., Froelich, A. G., & Gao, F. (2001). Using resampling methods to produce an improved DIMTEST procedure. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory (pp. 357–375). New York: Springer-Verlag.

Stout, W. F., Habing, B., Douglas, J., Kim, H. R., Roussos, L. A., & Zhang, J. (1996). Conditional covariance-based nonparametric multidimensionality assessment. Applied Psychological Measurement, 20, 331–354.

Wainer, H., & Wang, X. (2001). Using a new statistical model for testlets to score TOEFL (TOEFL Technical Report TR-16). Princeton, NJ: Educational Testing Service.

Walker, C. M., & Beretvas, S. N. (2003). Comparing multidimensional and unidimensional proficiency classifications: Multidimensional IRT as a diagnostic aid. Journal of Educational Measurement, 40, 255–275.


Wang, M. (1988). Measurement bias in the application of a unidimensional model to multidimensional item response data. Unpublished manuscript, Educational Testing Service, Princeton, NJ.

Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125–145.

Zhang, J., & Stout, W. F. (1999a). Conditional covariance structure of generalized compensatory multidimensional items. Psychometrika, 64, 129–152.

Zhang, J., & Stout, W. F. (1999b). The theoretical DETECT index of dimensionality and its application to approximate simple structure. Psychometrika, 64, 213–249.

Authors

LOUIS A. ROUSSOS is a Senior Psychometrician, Measured Progress, 100 Education Way, Dover, NH 03820; [email protected]. His primary research interests include dimensionality assessment, differential item functioning, skills diagnosis, and computer-based testing.

OZLEM YESIM OZBEK is an Assistant Professor, Education Department No: 315, Gaziosmanpasa University, Taslyciftlik, Tokat, 60100, Turkey; [email protected]. Her primary research interests include DIF and dimensionality.

