Coefficients for Interrater Agreement

Frits E. Zegers
University of Groningen

The degree of agreement between two raters who rate a number of objects on a certain characteristic can be expressed by means of an association coefficient (e.g., the product-moment correlation). A large number of association coefficients have been proposed, many of which belong to the class of Euclidean coefficients (ECs). A discussion of desirable properties of ECs demonstrates how the identity coefficient and its generalizations, which constitute a family of ECs, can be used to assess interrater agreement. This family of ECs contains coefficients for both nominal and non-nominal (ordinal and metric) data. In particular, it is pointed out which information contained in the data is accounted for by the various coefficients and which information is ignored. Index terms: association coefficients, correlation, Euclidean coefficients, generalized identity coefficients, interrater agreement.

It often occurs that a number of persons or objects are rated on a certain characteristic by two or more judges (e.g., ratings of social helplessness of mentally disabled people, the quality of songs, or the quality of scientific papers). Consequently, there are two or more judgments or scores for each judged entity. Moreover, judges generally will not agree completely, unless the judgment task is trivial. Of course, a certain degree of discrepancy between judges cannot be avoided, but when there is a large amount of discrepancy, the value of the judgment procedure will be doubtful. Thus, the question arises of how to assess the degree of agreement between judges. This agreement is often expressed by means of the product-moment correlation (PMC), also known as Pearson's r. Two properties of the PMC are related to the idea of agreement: If the two judges produce two identical sets of scores, the PMC will attain its maximum value (+1), and if the judgments are randomly made, the PMC will be approximately zero.

In many situations, however, the PMC is not the proper measure of agreement. If the scores are qualitative (nominal data), for example, the PMC cannot be computed. In addition, some properties of the PMC may be undesirable in a given situation. Consider two teachers who grade the papers of three students on a 10-point scale ranging from 1 (very poor) to 10 (excellent), where scores of 6 or higher are considered "sufficient" and scores of 5 or lower are considered "insufficient." If one teacher gave grades of 7, 8, and 9, and the other teacher gave grades of 2, 3, and 4 to the same papers, the PMC between these two sets of three scores would be +1. The teachers did not fully agree, however. They agreed about the relative positions of the three papers, but they did not agree in an absolute sense about the quality of the papers.

The PMC is one of the many association coefficients that can be used in such an instance to assess agreement between judges. This paper explores the question of how to select an appropriate association coefficient, given a specific judgment task. After a discussion of the class of E coefficients (ECs), which is an important class of association coefficients, a family of association coefficients belonging to this class of ECs is discussed. Coefficients for non-nominal data (i.e., data with at least ordinal information) are described, and the family of ECs is generalized for nominal (categorical) data.

Applied Psychological Measurement, Vol. 15, No. 4, December 1991, pp. 321-333
© Copyright 1991 Applied Psychological Measurement Inc.


E Coefficients

The association coefficients that will be discussed here belong to the class of ECs (Janson & Vegelius, 1978b), the E referring to Euclidean. A specific association coefficient is an EC if the association matrix has certain characteristics, which are discussed below. For a given number of variables, the association matrix contains the association coefficients of all pairs of variables. A well-known example is the correlation matrix (with PMCs). In order to be an EC, the association coefficient itself does not need to be a PMC. In principle, however, variables should exist that have the association matrix as their correlation matrix with PMCs. These hypothetical variables need not have a simple relation to the original variables.

If an association coefficient is an EC, it will have a number of properties:
1. The association coefficient is symmetric (i.e., the association between variables X and Y equals the association between Y and X);
2. The maximum value of the association coefficient is 1;
3. The association coefficient of a variable with itself is 1;
4. The value of the association coefficient is never less than -1; and
5. If the association between variables X and Y is perfect (+1), the association between X and an arbitrary third variable W equals the association between Y and W.

Property 5 implies the transitivity of perfect association: If the association between X and Y and between Y and W are both perfect, then the association between X and W is perfect.

ECs have a number of desirable properties. First, some of the properties of ECs mentioned above fit with the concept of agreement. The symmetry of an EC (Property 1) mirrors the idea that there generally is no reason to distinguish the agreement between Judges A and B from the agreement between Judges B and A. Properties 2 and 3 imply that no person can agree more with a judge than the judge him/herself. Second, in order to facilitate the interpretation of the value of an agreement measure, this measure should be normed on a bounded interval. ECs are normed on the interval [-1, +1] (see Properties 2 and 4). Third, from the definition of an EC, it follows that the association matrix of an EC can be used in components analysis (e.g., to look for structure in judgment data). Moreover, such a matrix may be converted into a matrix with distances between the variables in Euclidean space (Gower, 1966); this distance matrix can be used in metric multidimensional scaling.

Often it is difficult to determine whether or not a certain association coefficient is an EC. Sometimes it can be shown that a coefficient lacks one or more properties of an EC, which implies that the coefficient does not belong to the class of ECs. A practical way of proving that an association coefficient is an EC is to show that the coefficient belongs to a family of ECs. Association coefficients can be classified into (partially overlapping) families of coefficients. Coefficients within a family have a common structure: They can be written as variations on one basic formula. If it can be proven that such a basic formula yields an EC, then, by implication, all members of the family belong to the class of ECs.

A general family of ECs has been proposed by Zegers (1986b). This family comprises various other families of association coefficients (e.g., the family proposed by Daniels, 1944, which includes Kendall's tau and Spearman's rho; the family of Cohen, 1969, with coefficients for profile comparisons; the families of Janson and Vegelius, 1978a, 1978b, 1979, 1982a, 1982b, which include coefficients for variables of mixed measurement level; the family of Zegers and ten Berge, 1985; and the family of Zegers, 1986a, with coefficients for metric scales).

Coefficients for Non-Nominal Data

The scores of a judge may be numbers or labels. If these numbers or labels refer to various categories without any order relation among the categories, the data are called nominal. In all other cases, the data are non-nominal. The choice of a coefficient to assess agreement depends partly on whether the data are nominal or non-nominal; therefore, the coefficients for these two types of data are discussed separately.

As argued above, two judges will rarely produce identical sets of judgments. Yet it is useful to analyze the circumstances under which various association coefficients indicate perfect agreement. In general, association coefficients belonging to the class of ECs attain their maximum value of 1 if the sets of scores of two judges are identical. Conversely, an association coefficient of +1 does not imply that the two sets of scores are identical. For example, a PMC of +1 implies only that the two sets of scores are identical up to a positive linear (or affine) transformation (e.g., the example above of two teachers grading a number of papers). The identity coefficient, introduced by Zegers and ten Berge (1985) as an association coefficient for absolute scales, is +1 if and only if the two sets of scores are identical.

The identity coefficient is based on the sum of squared differences between corresponding scores of the two judges. If the two judges completely agree, this sum is 0, and the more the judges disagree, the larger this sum will be. By subtracting the sum of squared differences from +1, an index is obtained that is +1 in the case of identical judgments, and is smaller the more the judges differ. In order to obtain a coefficient that cannot have a value smaller than -1, the sum of squared differences is first divided by a factor d before it is subtracted from +1. This factor d is the sum of the two sums of squared scores, and the identity coefficient is obtained. If the scores of the two judges are given by X_i and Y_i, respectively, with i denoting the ith judged object, then the identity coefficient (e_XY) is given by

e_{XY} = 1 - \frac{\sum_{i=1}^{n} (X_i - Y_i)^2}{\sum_{i=1}^{n} X_i^2 + \sum_{i=1}^{n} Y_i^2},   (1)

where n is the number of judged objects. An equivalent but computationally simpler formula is

e_{XY} = \frac{2 \sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2 + \sum_{i=1}^{n} Y_i^2}.   (2)

The identity coefficient belongs to the class of ECs (Zegers & ten Berge, 1985). The problem of assessing the agreement between the two teachers mentioned above now seems solved: Contrary to the PMC, the identity coefficient is not +1, but is smaller (.66), which indicates imperfect agreement. Suppose, however, that the scores only serve to give first, second, and third prizes to the three students. In this case, the usual meaning of the school grade scale is irrelevant; only the rank order of the scores given by the judges is of importance. The nonperfect identity coefficient now wrongly suggests disagreement among the teachers with respect to the distribution of the prizes. Clearly, a rank correlation coefficient is a better method for assessing agreement in this situation.
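As a quick check of the formulas as reconstructed above, the following minimal Python sketch (not taken from the article or its accompanying program) computes the identity coefficient of Equation 2 and reproduces the value of .66 for the teacher example:

```python
# Minimal sketch of the identity coefficient (Equation 2), checked against the
# teacher example in the text: grades 7, 8, 9 versus 2, 3, 4.

def identity_coefficient(x, y):
    """e_XY = 2*sum(x_i*y_i) / (sum(x_i^2) + sum(y_i^2))."""
    num = 2 * sum(xi * yi for xi, yi in zip(x, y))
    den = sum(xi ** 2 for xi in x) + sum(yi ** 2 for yi in y)
    return num / den

print(round(identity_coefficient([7, 8, 9], [2, 3, 4]), 2))  # 0.66, as reported in the text
print(identity_coefficient([7, 8, 9], [7, 8, 9]))            # 1.0 for identical judgments
```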

Stine (1989) argues that only certain relationships among the scores of one judge are empirically meaningful, and that other relationships are irrelevant. An agreement coefficient should express the agreement between the judges in terms of the empirically meaningful relationships among the scores, and it should ignore the (dis)agreement with respect to the irrelevant relationships. The empirically meaningful relationships are termed here meaningful information, and the irrelevant relationships are termed irrelevant information. According to Stine (1989), the scale type (measurement level) of the data determines which relationships among the data are empirically meaningful and which are irrelevant. It is impossible, however, to prove in a formal way the scale type of judgment data. It will be impossible, therefore, to determine on this basis which information is meaningful and which is irrelevant. A pragmatic approach to the issue of meaningful information is proposed below.

Judgment data are collected with a certain purpose. This purpose determines which information is meaningful and which is irrelevant. In one of the examples above, school grades were collected in order to distribute prizes to the students; thus, only the ordinal information was meaningful. If the school grades were collected in order to make pass/fail decisions, the meaningful information would be whether a score was above or below the pass/fail criterion. The question of how reliably the judgment procedure yields meaningful information can be answered by computing an appropriate agreement coefficient (i.e., a coefficient that takes into account only the meaningful information and ignores the irrelevant information). In Stine's approach, the meaningful information results from the measurement level; in the pragmatic approach advocated here, the measurement level is implied by the identification of the meaningful information.

The concept of meaningful information can be used to specify the situations in which an agreement coefficient should attain its maximum value of +1: An agreement coefficient should attain the value +1 if and only if the sets of scores of the two judges are identical in terms of the meaningful information. One method of finding a coefficient that satisfies this demand is to transform the scores of each judge to a meaningful version and then compute the identity coefficient between these meaningful versions. With this method, the problem of selecting an appropriate agreement coefficient is replaced by the problem of determining the meaningful versions of the scores given by the judges.

Transforming scores to their meaningful version consists of removing irrelevant information while preserving meaningful information, in order to obtain some kind of standardized version of the scores. A simple example of such a transformation is the standardization of variables, which results in standard scores with mean 0 and variance 1. Standardization preserves the interval information, but it removes information about the mean and the scaling. Zegers and ten Berge (1985) provide another example of such a transformation, which includes standardization as a special case (i.e., the uniforming of variables).

The construction of the meaningful version is performed here in three steps: (1) rank ordering or not rank ordering, (2) subtracting or not subtracting a reference point, and (3) rescaling or not rescaling. These steps are described in detail below; steps 2 and 3 use the scores resulting from the previous step(s).
1. If the only meaningful information is rank order information, the scores are replaced by their rank orders.
2. The scores are expressed as differences from a reference point by subtracting the value of the reference point from the scores. The reference point may be absolute or relative. An absolute reference point is sample-independent, which means that it does not depend on the observed scores. An absolute reference point may be thought of as a type of natural zero point, or a neutral point of the judgment scale (e.g., the pass/fail point on a grade scale). A relative reference point depends on the observed scores (e.g., the mean or another sample-dependent measure of central tendency). The sample mean is the only relative reference point that is considered here.
3. If the scaling of the scores is arbitrary, the scores are rescaled to obtain a mean squared score equal to 1. This is done by dividing the scores by the square root of the mean squared score. Equation 2 shows that the value of the identity coefficient is not affected by multiplying or dividing both X and Y scores by the same constant. Therefore, rescaling as described here affects the resulting identity coefficient only if the X and Y scores have different mean squares after step 2. Such a difference in mean squares may be the result of different rating styles of the two judges: one judge may be inclined to use more extreme points of the rating scale than the other. Whether a difference in scaling is considered meaningful depends on the specific judgment task. If it is not, the scores of both judges should be rescaled as described here to remove the irrelevant scaling difference.

The three steps described above yield the meaningful versions of the scores. Inserting the meaningful versions in the formula of the identity coefficient (Equation 1 or 2) yields the desired agreement coefficient, which is in principle a specific agreement coefficient for each combination of the three steps. These coefficients are presented in Table 1; they were derived by examining the conditions in which agreement coefficients should be maximal (+1).
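The construction can be sketched in code as follows. This is an illustrative sketch only: the function and parameter names are not from the article, and tie handling in the ranking step is omitted.

```python
# Illustrative sketch of the three-step construction of meaningful versions
# (rank ordering, subtraction of a reference point, rescaling), followed by the
# identity coefficient of Equation 2 between the two meaningful versions.
from statistics import mean

def identity_coefficient(x, y):
    return 2 * sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + sum(b * b for b in y))

def meaningful_version(scores, rank_order=False, reference="mean", c=0.0, rescale=False):
    s = list(scores)
    if rank_order:                                   # step 1: keep only rank information
        s = [sorted(s).index(v) + 1 for v in s]      # simple ranks (no tie handling)
    ref = mean(s) if reference == "mean" else c      # step 2: relative or absolute reference point
    s = [v - ref for v in s]
    if rescale:                                      # step 3: mean squared score of 1
        rms = (sum(v * v for v in s) / len(s)) ** 0.5
        s = [v / rms for v in s]
    return s

x, y = [7, 8, 9], [2, 3, 4]
# Relative reference point (mean) plus rescaling gives the PMC (+1 for these scores).
print(identity_coefficient(meaningful_version(x, rescale=True), meaningful_version(y, rescale=True)))
# Absolute reference point 0, no rescaling, gives the identity coefficient (about .66).
print(identity_coefficient(meaningful_version(x, reference="absolute", c=0.0),
                           meaningful_version(y, reference="absolute", c=0.0)))
```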

The combination of an absolute reference point with rescaling and with reference point c yields Cohen's (1969) r_c coefficient. In the special case of c = 0, this coefficient is identical to Tucker's (1951) congruence coefficient. The combination of an absolute reference point with no rescaling and with c = 0 yields the identity coefficient (e) of Zegers and ten Berge (1985). By analogy with the r_c coefficient, the coefficient that results from the combination of an absolute reference point with no rescaling and with c ≠ 0 is denoted by e_c, or c-identity.

The combination of a relative reference point with rescaling yields ordinary standard scores (with mean 0 and variance 1) as meaningful versions, with the PMC as the resulting coefficient. The combination of a relative reference point with no rescaling yields the additivity coefficient (Zegers & ten Berge, 1985).

The scores will not be rescaled if a difference in original scaling should result in an agreement coefficient with a value less than +1. In the case of rank orders, a scaling difference indicates that the two sets of rank orders have different numbers of ties, or ties with different numbers of elements. In these cases, both rescaling and not rescaling will yield a coefficient with a value less than +1. Rescaling yields well-known coefficients, namely Spearman's rank correlation (ρ) and the r_oz coefficient of Vegelius (1976; see Zegers, 1986b, pp. 42-43), whereas not rescaling yields two new coefficients with no apparent advantages.

Table 1
Agreement Coefficients for Non-Nominal Data

Scores         Reference Point        Rescaling   Resulting Coefficient
Original       Absolute (c)           Yes         r_c (Cohen, 1969); congruence coefficient (Tucker, 1951) when c = 0
Original       Absolute (c)           No          Identity coefficient e when c = 0; e_c (c-identity) when c ≠ 0
Original       Relative (mean)        Yes         PMC
Original       Relative (mean)        No          Additivity coefficient
Rank orders    Relative or absolute   Yes         Spearman's ρ; r_oz (Vegelius, 1976)
Rank orders    Relative or absolute   No          Two further coefficients with no apparent advantages

Negative Agreement and Zero Agreement

All coefficients in Table 1 change sign without changing their absolute value if the scores of one judge are reflected with respect to the reference point. It can be verified easily that such a reflection changes the sign of the meaningful scores, which results in a sign change of the coefficient (see the numerator in Equation 2). With a relative reference point (the mean), a sign change of the scores is identical to reflection with respect to the reference point, because a sign change of the scores also changes the sign of the relative reference point. With an absolute reference point of 0 (c = 0), changing the sign of the scores is clearly identical to reflection. The sign change of an agreement coefficient as described here yields an interpretation of negative agreement, or anti-agreement.

In the case of coefficients based on scores with a relative reference point (the PMC and the additivity coefficient), a second interpretation of negative agreement results. The PMC (r_XY) can be expressed as

r_{XY} = \frac{s_{XY}}{s_X s_Y},   (3)

and the additivity coefficient (a_XY) can be expressed as

a_{XY} = \frac{2 s_{XY}}{s_X^2 + s_Y^2},   (4)

where s_XY denotes the covariance between the (original) scores, s_X and s_Y denote the standard deviations, and s_X^2 and s_Y^2 denote the variances of the (original) scores. It is well known that the covariance is 0 if the two variables are statistically independent. In the case of agreement scores, the covariance will have zero expectation if the judges produce scores in an unsystematic or random manner; this results in zero expectation of the PMC and the additivity coefficient. Negative agreement for these coefficients can therefore be interpreted as less agreement than expected under chance.
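As a small check of Equations 3 and 4 in their reconstructed form, the sketch below computes both coefficients from the covariance, standard deviations, and variances of the original scores; the function names are illustrative only.

```python
# Sketch of Equations 3 and 4: the PMC and the additivity coefficient written in terms
# of covariance, standard deviations, and variances (population formulas, division by n).
from statistics import mean

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def pmc(x, y):                     # Equation 3
    return cov(x, y) / (cov(x, x) ** 0.5 * cov(y, y) ** 0.5)

def additivity(x, y):              # Equation 4
    return 2 * cov(x, y) / (cov(x, x) + cov(y, y))

x, y = [7, 8, 9], [2, 3, 4]
# Both equal +1 (up to floating-point rounding): the scores differ only by an additive constant.
print(pmc(x, y), additivity(x, y))
```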

For the coefficients in Table 1 that are based on scores with an absolute reference point, this second interpretation of negative agreement is not valid, because these coefficients cannot be expressed by means of a formula with only a covariance term in the numerator. Consider, for example, the identity coefficient for a rating scale with only positive scale points. It is clear from Equation 2 that the identity coefficient for this scale is guaranteed to be positive even if the judges produce scores randomly, within the restrictions of the rating scale. This demonstrates that a positive value of an agreement coefficient does not necessarily imply that the agreement is more than expected under chance. Therefore, it is important to determine the value of the coefficient under chance for a valid interpretation of the value of an agreement coefficient.

Agreement Coefficients Under Chance

Let the scores of two judges who judged n objects be given by X_i and Y_i, i = 1, 2, ..., n, which results in n pairs of scores (X_i, Y_i). If the pairing of the scores has been achieved randomly, then any other pairing of the X and Y scores is as probable as the observed pairing. Thus, for a fixed order of the X scores, each permutation of the Y scores is equally probable. There are n! permutations, each resulting in a different pairing of the X and Y scores. For each of the n! pairings, the value of the agreement coefficient can be computed. The value of the agreement coefficient under chance is defined as the mean of these n! values. For the identity coefficient under chance (ê_XY) it is not necessary to actually compute these n! values, as a simple formula can be derived:

\hat{e}_{XY} = \frac{2 \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \left( \sum_{i=1}^{n} X_i^2 + \sum_{i=1}^{n} Y_i^2 \right)}   (5)

(see Zegers, 1986a). For its interpretation, the observed value of the identity coefficient (Equation 2) can be compared with the value under chance in Equation 5.
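Equation 5 as reconstructed here can be checked against its definition by brute force for small n; the sketch below (illustrative code, not the published program) averages the identity coefficient over all n! pairings and compares the result with the closed formula.

```python
# Check of the reconstructed Equation 5: the mean of the identity coefficient over all
# n! pairings of the X and Y scores (feasible only for small n).
from itertools import permutations

def identity_coefficient(x, y):
    return 2 * sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + sum(b * b for b in y))

def identity_under_chance(x, y):   # Equation 5 as reconstructed above
    n = len(x)
    return 2 * sum(x) * sum(y) / (n * (sum(a * a for a in x) + sum(b * b for b in y)))

x, y = [7, 8, 9], [2, 3, 4]
perms = list(permutations(y))
perm_mean = sum(identity_coefficient(x, p) for p in perms) / len(perms)
print(perm_mean, identity_under_chance(x, y))   # the two values agree (up to rounding)
```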

The various coefficients in Table 1 were derived by using the formula of the identity coefficient with the meaningful versions of the scores. In like manner, the coefficient under chance can be determined by using the meaningful versions of the scores in Equation 5. It can be verified readily that the coefficients based on scores with a relative reference point equal to the mean have zero expectation under chance; under these circumstances, the numerator of Equation 5 contains the sums of deviation scores, which are 0. The comparison of the value of a coefficient (generally denoted as g_XY) with the value under chance (ĝ_XY) is based on the difference (g_XY - ĝ_XY). This difference can be used to construct a chance-corrected version of the coefficient.

Chance-Corrected Agreement Coefficients

A well-known method of chance correction of an association coefficient (g_XY) is to relate the difference between the coefficient and the coefficient under chance (ĝ_XY) to the theoretical maximum of this difference, which is (1 - ĝ_XY) for ECs. This yields a chance-corrected coefficient (g'_XY), given by

g'_{XY} = \frac{g_{XY} - \hat{g}_{XY}}{1 - \hat{g}_{XY}}   (6)

(see Zegers, 1986a, 1986b). If g_XY is an EC, then chance correction by Equations 5 and 6 yields a chance-corrected coefficient g'_XY, which also belongs to the class of ECs. This result (i.e., that a chance-corrected EC is itself an EC) is valid for the proposed method of chance correction, but it is not necessarily valid for other correction methods. Chance correction by Equation 6 does not affect the coefficients based on scores with a relative reference point (the mean), because the value under chance (ĝ_XY) is 0 for these coefficients. Whether corrected or uncorrected coefficients should be used in the case of scores with an absolute reference point depends on whether the chance correction as described here is adequate in the given situation.

It is important to note that for scores with an absolute reference point, the interpretation of negative agreement differs for uncorrected and corrected coefficients. Negative agreement for uncorrected coefficients can be interpreted as anti-agreement, because the sign of the coefficient changes if the scores of one of the two judges are reflected with respect to the reference point. The corrected coefficients do not have this property, because negative agreement for them has to be interpreted as less agreement than expected under chance. For coefficients based on scores with the mean as the relative reference point, both interpretations are valid, as is argued above; this agrees with the fact that these coefficients coincide with their chance-corrected versions.

Consider, for example, a situation with a large difference between corrected and uncorrected coefficients, in which two teachers (X and Y) judge the quality of four papers using the 10-point school grade scale. The scores are given in Table 2.

Table 2
Scores of Two Teachers (X and Y) for Four Papers Before and After Subtraction of Reference Point (5.5)

Paper    X Before    Y Before    X After    Y After
1        8           8           2.5        2.5
2        8           9           2.5        3.5
3        9           8           3.5        2.5
4        9           9           3.5        3.5

The question of how well the two teachers agree can be answered in two ways. In one respect, the teachers show a high measure of agreement: Both judge the quality of the papers as "good" (8) or "very good" (9). This agreement is expressed by the identity coefficient, e_XY = .997. After subtracting the absolute reference point 5.5 from each score, as shown in Table 2, the value of the identity coefficient is still very high: e_XY = .973. Conversely, the teachers also show no agreement: A relatively low score of teacher X (8) is paired with both a relatively low score (8) and a relatively high score of teacher Y (9), and a relatively high score of teacher X goes with both a relatively low score and a relatively high score of teacher Y. This lack of agreement is properly expressed by the chance-corrected identity coefficient, e'_XY = 0. A closer investigation of the chance-correction method of Equations 5 and 6 may shed some light on these conflicting interpretations.
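The reported values can be verified with the Table 2 scores as reconstructed above (a small check, assuming those scores):

```python
# Check of the Table 2 example: e_XY = .997 before and .973 after subtracting 5.5,
# while the chance-corrected identity coefficient is 0.

def identity(x, y):
    return 2 * sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + sum(b * b for b in y))

def identity_chance(x, y):
    n = len(x)
    return 2 * sum(x) * sum(y) / (n * (sum(a * a for a in x) + sum(b * b for b in y)))

x, y = [8, 8, 9, 9], [8, 9, 8, 9]
x_c, y_c = [v - 5.5 for v in x], [v - 5.5 for v in y]
print(round(identity(x, y), 3), round(identity(x_c, y_c), 3))   # 0.997 0.973
e, e_hat = identity(x, y), identity_chance(x, y)
print(round((e - e_hat) / (1 - e_hat), 3))                      # 0.0
```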

Although it was not stated explicitly, a null model was used above to derive a coefficient under chance; this model states that the observed scores show no agreement, except for agreement obtained by random factors. In fact, the null model underlying Equation 5 is relative or sample dependent: Given the observed scores or the observed marginal distributions, each specific pairing of X and Y scores is equally probable. If the agreement between the teachers in Table 2 is considered high (see the discussion above), then an absolute or sample-independent null model should be used. Such an absolute null model can have various forms, depending on the supposed population distributions. In the case of an absolute null model, the scores of the judges are assumed to constitute a random sample from a specified distribution. If the samples of both judges are drawn independently, the coefficient under chance is the expectation of the coefficient.

Suppose, for example, that teachers grading papers with the 10-point scale generally produce scores with the following frequency distribution: 10% for Grade 4, 15% for Grade 5, 25% for Grade 6, 25% for Grade 7, 15% for Grade 8, and 10% for Grade 9. Using computer simulation, the expected value of the identity coefficient for four papers is .953, and the expected value of the identity coefficient using the absolute reference point 5.5 is .278. The identity coefficients obtained for the data of Table 2 can be compared with these expected values. Using Equation 6 yields chance-corrected coefficients with values .936 and .963 for the "before" and "after" ratings, respectively.

The values of the expected identity coefficients given above were computed with 200,000 samples of size 4. An alternative is to use results obtained by Fagot and Mazo (1989). Without making any distributional assumption, they derived an asymptotically correct formula for the expectation of the identity coefficient between two independent variables. This expectation is expressed in terms of the population means and variances of both variables; these parameter values are available in the case of an absolute null model.

Even with the small example and the assumed distribution used here, the computer-simulated values were close to the values computed by means of the Fagot and Mazo formula (.954 and .328, respectively). This result suggests that even with moderate sample sizes, the Fagot and Mazo formula may be used safely for computing the expected value of the identity coefficient in the case of an absolute null model. Using expected values based on an absolute null model in the formula of chance correction (Equation 6) has one serious drawback, however: The resulting corrected coefficient is not guaranteed to be an EC.
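The simulation described above can be sketched as follows. The grade distribution and the 200,000-sample figure are those stated in the text; the code itself is an illustrative reconstruction, not the program used for the article.

```python
# Sketch of the absolute-null-model simulation: two judges' grades are drawn independently
# from the stated distribution and the identity coefficient is averaged over many samples
# of four papers. Results should lie close to the reported .953 and .278.
import random

GRADES = [4, 5, 6, 7, 8, 9]
WEIGHTS = [.10, .15, .25, .25, .15, .10]

def identity(x, y):
    return 2 * sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + sum(b * b for b in y))

def expected_identity(n_papers=4, reference=0.0, samples=200_000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = [g - reference for g in rng.choices(GRADES, WEIGHTS, k=n_papers)]
        y = [g - reference for g in rng.choices(GRADES, WEIGHTS, k=n_papers)]
        total += identity(x, y)
    return total / samples

print(round(expected_identity(), 3))                 # close to .953
print(round(expected_identity(reference=5.5), 3))    # close to .278
```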

The procedure for assessing agreement between two judges with non-nominal scores, discussed above, can be summarized as follows: First, the scores of the judges are transformed to their meaningful versions, which may involve rank ordering, subtraction of an absolute or relative reference point, and rescaling. Next, the identity coefficient between these meaningful versions is computed. The obtained value may be compared with the expected value, using a relative (sample-dependent) or absolute (sample-independent) null model. A potential drawback of the identity coefficient is discussed below.

Gower’s Coefficient

Suppose two judges X and Y rate four objects on a 5-point scale, ranging from 1 to 5, with an absolute reference point of 3 (the scale center). Two different sets of meaningful scores are given in Table 3, with values of 2/3 and 1/2 for the identity coefficient. After chance correction by Equations 5 and 6, these values are 49/81 and 1/2. The agreement apparently differs for the two sets of data. It can be argued, however, that the agreement between the two sets of data in this table is equal, because the differences between the judges for each single object are the same. The numerator of the identity coefficient in Equation 1 contains the sum of the squared differences, and these sums are equal for the two datasets. The denominators differ, however, because the denominator is the total sum of squares. Therefore, the values of the identity coefficient are not equal.

Table 3
Meaningful Scores of Two Sets of Judges

A coefficient that does have equal values for the data in Table 3 has been developed by Gower (1971):

G_{XY} = 1 - \frac{\sum_{i=1}^{n} |X_i - Y_i|}{nR},   (7)

where X_i and Y_i are the scores of the two judges; n is the number of objects; and R is the range of the rating scale, that is, the maximum possible value of the absolute difference |X_i - Y_i| (R = 4 in Table 3). For both sets of data in Table 3, G_XY = .75.

Gower (1971) showed that G_XY is an EC. It is obvious from Equation 7 that G_XY uses the sum of absolute differences, not the sum of squared differences. In order to obtain a normed coefficient, this sum of absolute differences is divided by the theoretical maximum of that sum, nR; this maximum is obtained if the judges differ maximally for each object. Under these circumstances, the judges use the opposite extremes of the rating scale for each object, which results in G_XY = 0. Obviously, G_XY can have values on the interval [0, 1]. The main difference between Gower's coefficient and the identity coefficient is the way in which the coefficients are normed. The identity coefficient is normed in a relative or sample-dependent manner by means of the sums of squares of the meaningful versions. Gower's coefficient, however, is normed in an absolute or sample-independent manner using the range of the rating scale.
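As an illustration of the difference in norming, the sketch below computes both coefficients for two hypothetical sets of meaningful scores with equal per-object differences; these are not the actual Table 3 values, only an illustration of the point made above.

```python
# Sketch of Gower's coefficient (Equation 7) next to the identity coefficient, using
# hypothetical meaningful scores on a 1-5 scale with the reference point 3 subtracted.

def identity(x, y):
    return 2 * sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + sum(b * b for b in y))

def gower(x, y, scale_range):       # Equation 7: 1 - sum(|x_i - y_i|) / (n * R)
    n = len(x)
    return 1 - sum(abs(a - b) for a, b in zip(x, y)) / (n * scale_range)

set_1 = ([2, 1, -1, -2], [1, 2, -2, -1])     # hypothetical meaningful scores, R = 4
set_2 = ([0, 1, 0, -1], [1, 2, -1, -2])
for x, y in (set_1, set_2):
    print(round(gower(x, y, 4), 2), round(identity(x, y), 2))
# Gower's coefficient is .75 for both sets; the identity coefficient differs (.8 vs. .67).
```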

As a result of the absolute method of norming, Gower's coefficient can be interpreted as a measure of the average agreement between the judges per object. If

G_i = 1 - \frac{|X_i - Y_i|}{R}   (8)

denotes the agreement between X and Y in terms of object i, then the mean of this object agreement (averaged over the n objects) is equal to Gower's coefficient.

Agreement Coefficients for Nominal Data

Nominal data are obtained if a judge has to place a number (n) of persons or objects into a number (k) of mutually exclusive categories without ordinal relationships between the categories. If two judges place the same n objects into categories, they can use the same set of categories or different sets. This results in two classes of agreement coefficients for nominal data.

The Same Categories

If two judges place the same n objects into the same k categories, the data can be represented by means of a bivariate frequency table, with k rows and k columns. An arbitrary entry (g, h) of this table contains the number of objects placed into category g by judge X and into category h by judge Y. An arbitrary diagonal entry (g, g) contains the number of objects placed into category g by both judges. The categories contain no ordinal information; therefore, the order in which the categories are displayed is arbitrary, but it must be the same for both judges in order to ensure that category g of judge X is the same as category g of judge Y. Table 4 represents the judgments of two judges X and Y who judged 10 objects (n = 10) using three categories A, B, and C (k = 3).

An obvious agreement coefficient is the proportion agreement (p_o), defined as the sum of the diagonal entries of the bivariate frequency table (shown for these data in Table 5) divided by n. For the data in Tables 4 and 5, p_o = 5/10 = .50. Obviously, the proportion agreement is bounded on the interval [0, 1].

Table 4
Nominal Data: Categorizations of 10 Objects by Judges X and Y


Table 5
Bivariate Frequency Table for the Data in Table 4

The proportion agreement can be compared with the value under chance, which is determined as follows: A column with row sums is added to the bivariate frequency table. The gth element of this column is the number of objects placed into category g by judge X. This column will be called the marginal distribution of judge X. In like manner, the marginal distribution of judge Y can be added (as a row) to this table. Given these marginal distributions, the value under chance (the expected frequency) can be computed for each entry (g, h) in the table. This expected frequency is the product of the gth element of the marginal distribution of X and the hth element of the marginal distribution of Y, divided by n (cf. the computation of χ²). The proportion agreement under chance (p_e) is the sum of the expected frequencies on the diagonal, divided by n. For the data of Table 5, p_e = (.6 + 1.5 + 1.2)/10 = .33. Obviously, this computation is based on a relative or sample-dependent null model, given the observed marginal distributions.

Correcting the proportion agreement for chance yields Cohen's (1960) kappa coefficient, which is expressed as

\kappa = \frac{p_o - p_e}{1 - p_e},   (9)

with p_e as described above. For the data of Table 4, κ = (.50 - .33)/(1 - .33) = .17/.67 = .25. Both κ and various other agreement coefficients for nominal data with the same categories can be derived in a manner analogous to that used for the coefficients for non-nominal data described above (Gower's coefficient excluded).
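The basic computation can be sketched as follows. The cell counts below are hypothetical; the exact entries of Tables 4 and 5 are not reproduced in this copy, but the counts are chosen to be consistent with the reported figures (diagonal sum 5, p_e = .33).

```python
# Sketch of the proportion agreement, its value under chance, and Cohen's kappa (Equation 9)
# from a bivariate frequency table with hypothetical but consistent cell counts.

freq = [[1, 1, 0],     # rows: categories A, B, C for judge X
        [1, 2, 2],     # columns: categories A, B, C for judge Y
        [1, 0, 2]]

n = sum(sum(row) for row in freq)                             # 10 objects
p_o = sum(freq[g][g] for g in range(3)) / n                   # proportion agreement: .50
row = [sum(freq[g]) for g in range(3)]                        # marginal distribution of X
col = [sum(freq[g][h] for g in range(3)) for h in range(3)]   # marginal distribution of Y
p_e = sum(row[g] * col[g] / n for g in range(3)) / n          # expected diagonal over n: .33
kappa = (p_o - p_e) / (1 - p_e)                               # Equation 9: .25
print(p_o, round(p_e, 2), round(kappa, 2))
```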

A short outline of this procedure, which is based on indicator matrices, follows; for a more detailed description, see Zegers (1986b, pp. 45-53).

Nominal data of one judge can be represented by means of an indicator matrix with n (the number of objects) rows and k (the number of categories) columns. Entry (i, g) of this matrix is 1 if the judge placed object i into category g; otherwise, it is 0. Each row of the indicator matrix contains a single 1 and k - 1 0s. The indicator matrices of the two judges of Table 4 are given in Table 6.

The indicator matrix may be interpreted as a quantification of the nominal data of one judge, without loss of information. This quantification enables the use of (adapted) coefficients for non-nominal data. The identity coefficient can be generalized to assess the amount of identity of two matrices of scores instead of the identity of two columns of scores. A necessary and sufficient condition is that corresponding rows and columns of the two matrices have the same meaning. The indicator matrices of two judges who judge the same n objects using the same k categories satisfy this condition.

Table 6
Indicator Matrices for the Data of Table 4

The generalized identity coefficient for two matrices is obtained as follows: (1) the numerator of the last term of Equation 1 is replaced by the sum of squared differences of corresponding entries of the two matrices; and (2) the denominator is replaced by the sum of the sum of squared entries of the first matrix and the sum of squared entries of the second matrix. (The same result is obtained by placing the columns of each matrix under each other and computing the ordinary identity coefficient between these long columns.) This formulation demonstrates that the generalized identity coefficient, being the ordinary identity coefficient between long columns, is an EC. It can be verified that the generalized identity coefficient used with indicator matrices yields the proportion agreement (p_o). If X_i and Y_i (with i = 1, 2, ..., 30) denote the scores in the long columns derived from the indicator matrices in Table 6, then ΣX_iY_i = 5 and ΣX_i² = ΣY_i² = 10. The resulting value of the identity coefficient is 2 × 5/(10 + 10) = 1/2 = p_o (the proportion agreement).
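The long-column construction can be sketched as follows. The category assignments are hypothetical (Table 4's exact data are not reproduced in this copy), but they yield ΣX_iY_i = 5 and ΣX_i² = ΣY_i² = 10, matching the worked values in the text.

```python
# Sketch of the generalized identity coefficient between two indicator matrices,
# computed as the ordinary identity coefficient between the two long columns.

def indicator(assignments, categories):
    return [[1 if a == c else 0 for c in categories] for a in assignments]

def long_column(matrix):
    return [v for row in matrix for v in row]

judge_x = ["A", "A", "B", "B", "B", "B", "B", "C", "C", "C"]   # hypothetical assignments
judge_y = ["A", "B", "B", "B", "C", "C", "A", "C", "C", "A"]

X = long_column(indicator(judge_x, "ABC"))
Y = long_column(indicator(judge_y, "ABC"))
e = 2 * sum(a * b for a, b in zip(X, Y)) / (sum(a * a for a in X) + sum(b * b for b in Y))
print(e)   # 0.5 = p_o, the proportion agreement for these assignments
```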

The method of deriving a coefficient under chance, based on permutations of the scores of one judge, may also be generalized by permuting rows of one of the two matrices. A permutation of rows is a specific ordering of the rows. With n rows in the indicator matrix, there are n! different orderings or permutations of these rows. In the case of indicator matrices, this yields the proportion agreement under chance (p_e). This proves that κ in Equation 9 can be considered a chance-corrected generalized identity coefficient, which implies that κ is an EC (see also Zegers, 1986b, p. 47).

Non-nominal scores can be transformed into meaningful versions (as has been discussed above). Using the identity coefficient with these meaningful versions yields various specific coefficients. In an analogous way, indicator matrices may be transformed (e.g., by centering rows and/or columns). Using the generalized identity coefficient with specifically transformed indicator matrices yields specific ECs. Well-known coefficients which can be obtained in this way are the C and S coefficients of Janson and Vegelius (1979) and the simple matching coefficient (Sokal & Sneath, 1963; see Zegers, 1986b, pp. 59-60).

The concept of meaningful versions is not very useful with nominal data. In fact, all meaningful information is contained in the indicator matrix, and no meaningful information is removed or added by a transformation of the indicator matrix. In practice, the transformation will be selected that leads to a coefficient with which the researcher is familiar, or which fits in the tradition of a particular field of research. The main advantage of expressing an agreement coefficient as a (chance-corrected) generalized identity coefficient is that this implies that the coefficient in question is an EC.

Different Categories

Sometimes judges must classify a number of objects into a number of nonoverlapping classes, the meaning of which has to be determined independently by each judge. The number of classes may be specified in advance, or it may be left to the judge. The nominal categories are the names or labels of the distinct classes. The proportion agreement cannot be computed with this kind of data, because the categories of one judge do not correspond with the categories of the other (i.e., the gth category of judge X does not match the gth category of judge Y). For the same reason, it makes no sense to compute the identity between the two indicator matrices.

One possible solution is to base the agreement coefficient on the number of pairs of objects that are placed in the same category by both the first and the second judge. Dividing this number by the total number of pairs of objects yields the dot product (see Popping, 1983, p. 96). This dot product can also be expressed as a generalized identity coefficient, not between (transformations of) indicator matrices, but between two object matrices. The object matrix of a judge is a square matrix with n rows and columns. Entry (i, j) is 1 if the judge placed objects i and j in the same category, and is 0 otherwise. It can be shown that the generalized identity coefficient between two object matrices is identical to the dot product.
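The object-matrix representation can be sketched as follows (illustrative code with hypothetical labels; it only computes the generalized identity coefficient between the two object matrices, without reproducing Popping's formula):

```python
# Sketch of object matrices for judges who use their own category labels: entry (i, j) is 1
# when a judge puts objects i and j in the same class. The generalized identity coefficient
# is then the ordinary identity coefficient between the two long columns.

def object_matrix(assignments):
    return [[1 if a == b else 0 for b in assignments] for a in assignments]

def long_column(matrix):
    return [v for row in matrix for v in row]

judge_x = ["red", "red", "blue", "blue", "green"]   # judge X's own labels
judge_y = ["p", "p", "p", "q", "q"]                 # judge Y's own labels

X = long_column(object_matrix(judge_x))
Y = long_column(object_matrix(judge_y))
e = 2 * sum(a * b for a, b in zip(X, Y)) / (sum(a * a for a in X) + sum(b * b for b in Y))
print(round(e, 2))   # generalized identity coefficient between the two object matrices
```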

The generalized identity coefficient can also be computed between transformations of the object matrices (see Zegers, 1986b, pp. 50-53). Depending on the type of transformation, the following coefficients result: the T² coefficient (Tschuprow, 1939); the gamma coefficient (Hubert, 1977); the J coefficient (Janson & Vegelius, 1982b); and the I coefficient (Saporta, 1975), which is identical to the T coefficient (Janson & Vegelius, 1978b). A detailed discussion of the J and T coefficients and their relation with the C and S coefficients (for nominal data with the same categories) can be found in Zegers and ten Berge (1986). As in the case of nominal data with the same categories, the choice of a specific coefficient does not depend on considerations with respect to meaningful information, but rather on preferences or tradition.

Conclusions

A large number of agreement coefficients have been discussed. The choice of a coefficient in a specific context depends on a number of implicit or explicit assumptions with respect to the nature of the data and chance correction.

This overview, however, is not at all complete. In particular, association coefficients for variables of mixed measurement levels have not been discussed, because such coefficients do not play an important role in the context of assessing agreement between judges. However, ECs that can also be expressed as generalized identity coefficients have been developed for variables of mixed measurement levels (see Janson & Vegelius, 1982a; Zegers, 1986b, pp. 55-56; and Zegers & ten Berge, 1986). With such coefficients, it would be possible to reflect the association between variables such as income, educational level, gender, weight, and blood group in a single association matrix.

References

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cohen, J. (1969). r_c: A profile similarity coefficient invariant over variable reflection. Psychological Bulletin, 71, 281-284.
Daniels, H. E. (1944). The relation between measures of correlation in the universe of sample permutations. Biometrika, 36, 129-135.
Fagot, R. F., & Mazo, R. M. (1989). Association coefficients of identity and proportionality for metric scales. Psychometrika, 54, 93-104.
Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 315-328.
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 857-871.
Hubert, L. (1977). Nominal scale response agreement as a generalized correlation. British Journal of Mathematical and Statistical Psychology, 30, 98-103.
Janson, S., & Vegelius, J. (1978a). Correlation coefficients for more than one scale type and symmetrization as a method of obtaining them (Research Rep. No. 78-2). Uppsala, Sweden: University of Uppsala, Department of Statistics.
Janson, S., & Vegelius, J. (1978b). On construction of E correlation coefficients (Research Rep. No. 78-11). Uppsala, Sweden: University of Uppsala, Department of Statistics.
Janson, S., & Vegelius, J. (1979). On generalizations of the G index and the phi coefficient to nominal scales. Multivariate Behavioral Research, 14, 255-269.
Janson, S., & Vegelius, J. (1982a). Correlation coefficients for more than one scale type. Multivariate Behavioral Research, 17, 271-284.
Janson, S., & Vegelius, J. (1982b). The J index as a measure of nominal scale response agreement. Applied Psychological Measurement, 6, 111-121.
Popping, R. (1983). Overeenstemmingsmaten voor nominale data (Agreement coefficients for nominal data). Doctoral dissertation, University of Groningen, The Netherlands.
Saporta, G. (1975). Liaison entre plusieurs ensembles de variables et codage de données qualitatives (Relation between various sets of variables and the coding of qualitative data). Doctoral dissertation, University of Paris, France.
Sokal, R. R., & Sneath, P. H. A. (1963). Principles of numerical taxonomy. San Francisco: Freeman & Co.
Stine, W. W. (1989). Interobserver relational agreement. Psychological Bulletin, 106, 341-347.
Tschuprow, A. A. (1939). Principles of the mathematical theory of correlation. New York: William Hodge.
Tucker, L. R. (1951). A method for synthesis of factor analysis studies (Report No. 984). Washington, DC: Department of the Army, Personnel Research Section.
Vegelius, J. (1976). On generalizations of the G index. Educational and Psychological Measurement, 36, 595-600.
Zegers, F. E. (1986a). A family of chance-corrected association coefficients for metric scales. Psychometrika, 51, 559-569.
Zegers, F. E. (1986b). A general family of association coefficients. Groningen, The Netherlands: Boomker.
Zegers, F. E., & ten Berge, J. M. F. (1985). A family of association coefficients for metric scales. Psychometrika, 50, 17-24.
Zegers, F. E., & ten Berge, J. M. F. (1986). Correlation coefficients for more than one scale type: An alternative to the Janson and Vegelius approach. Psychometrika, 51, 549-557.

Acknowledgments

The author thanks J. M. F. ten Berge for his helpful comments on various versions of the manuscript and an anonymous reviewer for pointing out a result obtained by Fagot and Mazo (1989). This article is an adaptation of Het meten van overeenstemming, which appeared in the Nederlands Tijdschrift voor de Psychologie, 44 (1989), 145-156. A computer program for most of the coefficients discussed will be published by ProGamma, Groningen, The Netherlands.

Author’s Address

Send requests for reprints or further information to Frits E. Zegers, University of Groningen, Department of Psychology, Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands.
