
PSYCHOMETRIKA, VOL. 53, NO. 1, 5-25, MARCH 1988

A METHOD OF OPTIMAL SCALING FOR MULTIVARIATE ORDINAL DATA AND ITS EXTENSIONS

TAKAYUKI SAITO

DEPARTMENT OF BEHAVIORAL SCIENCE, HOKKAIDO UNIVERSITY

TATSUO OTSU

FUYO DATA PROCESSING & SYSTEMS DEVELOPMENT, LTD.

This paper develops a method of optimal scaling for multivariate ordinal data, in the framework of a generalized principal component analysis. This method yields a multidimensional configuration of items, a unidimensional scale of category weights for each item and, optionally, a multidimensional configuration of subjects. The computation is performed by alternately solving an eigenvalue problem and executing a quasi-Newton projection method. The algorithm is extended for analysis of data with mixed measurement levels or for analysis with a combined weighting of items. Numerical examples and simulations are provided. The algorithm is discussed and compared with some related methods.

Key words: categorical data, OSMOD, principal component analysis, quasi-Newton projection method.

1. Introduction

We are concerned with the optimal scaling of multivariate ordinal data in terms of ordered categories. Such data are often gathered in a sample survey where N subjects respond to J items of a questionnaire. Suppose that the j-th item is composed of $k_j$ ordered categories such that

$$c_{j1} \preceq c_{j2} \preceq \cdots \preceq c_{jk_j} \quad (j = 1, 2, \ldots, J). \tag{1}$$

For example, Item 1 might ask about the degree of support of a political statement: C11, strongly disagree; C12, disagree; C13, neutral; C14, agree; C15, strongly agree. Item 2 might be an evaluation of the quality of some services (bad, fair, good, excellent, outstanding). Item 3 might be the frequency of a certain activity (never, rarely, sometimes, often, always). When each subject is required to choose only one category per item, we denote the response of subject i to item j as

$$\delta_i(jk) = \begin{cases} 1 & \text{if subject } i \text{ chooses category } C_{jk}, \\ 0 & \text{otherwise.} \end{cases}$$

Then the exclusive response condition is denoted as

$$\sum_{k=1}^{k_j} \delta_i(jk) = 1, \quad (j = 1, 2, \ldots, J;\; i = 1, 2, \ldots, N). \tag{2}$$

Earlier results of this research appeared in Saito and Otsu (1983). The authors would like to acknowledge the helpful comments and encouragement of the editor.

Requests for reprints should be sent to Takayuki Saito, Department of Behavioral Science, Hokkaido University, Bungakubu, Kita 10 Nishi 7, Sapporo 060, JAPAN.

0033-3123/87/1200-6112$00.75/0 © 1987 The Psychometric Society


The total responses given by a group of N subjects are expressed as

$$D = [\delta_i(jk)],$$

of which the size is N × K, where $K = \sum_j k_j$ (see Table 1). For later description we define

$$n_{jk} = \sum_{i=1}^{N} \delta_i(jk),$$

the frequency of category $C_{jk}$; by (2), $\sum_{k=1}^{k_j} n_{jk} = N$ for every j.

The data expressed as D contain J ordinal variables (items), each of which is observed in terms of $k_j$ ordered categories. Such data may be called multivariate ordinal data.
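As a concrete illustration of this data structure (ours, not part of the original paper), the following sketch builds the indicator matrix D and the category frequencies $n_{jk}$ from a response matrix; the names and shapes are assumptions for the example.

```python
import numpy as np

def indicator_matrix(responses, n_categories):
    """Build the N x K indicator matrix D, with K = sum of the k_j.

    responses[i, j] is the category (1-based) chosen by subject i on item j,
    so each item contributes exactly one 1 per row, the condition (2).
    """
    N, J = responses.shape
    D = np.zeros((N, sum(n_categories)))
    offset = 0
    for j, kj in enumerate(n_categories):
        D[np.arange(N), offset + responses[:, j] - 1] = 1.0
        offset += kj
    return D

# The category frequencies n_jk are simply the column sums of D:
# n = D.sum(axis=0)
```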

If we neglected the ordinal property of the data, we could treat D as an ordinary indicator matrix in which J nominal variables have been answered by each subject. For polychotomous data in terms of the indicator matrix, weighting items (variables) and/or categories and scaling subjects have been considered within the framework of optimal scaling or multiple correspondence analysis by many investigators (e.g., Greenacre, 1984; Guttman, 1941; Hayashi, 1952; Tenenhaus & Young, 1985, among others). The criteria of optimality were defined in a variety of ways.

Given the data matrix D, we want to perform optimal scaling of D in order to assign numerical weights to items, categories, and, if desired, to subjects. In particular, we wish to assign weights to the ordered categories so that the weights follow the order. For example, the weight assigned to the category "excellent" should not be smaller than the weight assigned to the category "good". The present aim is stated more clearly as follows. For a prescribed number of dimensions m, we will obtain a J × m matrix of item scores $X = (x_{jt})$ and a vector of K category weights $w' = (w_1', w_2', \ldots, w_J')$, where $w_j = (w_{jk})$, whose elements should satisfy

$$w_{j1} \le w_{j2} \le \cdots \le w_{jk_j} \quad (j = 1, 2, \ldots, J). \tag{3}$$

Further we will optionally obtain an N × m matrix of subject scores $Y = (y_{it})$. For this scaling, we aim to obtain X, w and Y that are best on the basis of an optimality criterion, which is defined below as objective function (10). It is emphasized that we give one weight to each category of each item. In other words, we derive a unidimensional scale of $k_j$ categories for each item and not a multidimensional scale of those categories. In contrast, the item scores are represented in a space of m dimensions and the subject scores in a separate space of the same dimensionality.

A scaling problem closely related to ours has been studied for another form of ordinal data. Bradley, Katti and Coons (1962) dealt with a two-way contingency table where treatments were arranged in rows and ordered categories in columns, and the cells of the table contained response frequencies of category-by-treatment combinations. They suggested a method to scale the ordered categories under the restriction of complete order. For a table of the same type, extending the previous work (Nishisato & Inukai, 1972), Nishisato and Arri (1975) developed a refined method of scaling for categories that were partially ordered in some particular forms, by using separable programming in nonlinear optimization (see also Nishisato, 1980, chap. 8). Following this study, Tanaka (1979) presented a more generalized algorithm using the reduced gradient method for arbitrarily partially ordered categories. It was further examined by Tanaka and Kodake (1980).

[Table 1. Indicator matrix D for the illustrative data: responses of N = 30 subjects to J = 5 items, each with five ordered categories; the last row gives the category frequencies. The table body did not survive extraction.]


Our scaling method for D is based on the framework of a generalized principal component analysis. OSMOD stands for the method of optimal scaling for multivariate ordinal data in the sequel. Methods conceptually similar to ours have been proposed, such as NLFA (nonmetric linear factor analysis) by Kruskal and Shepard (1974), and PRINCIPALS by Young, Takane and de Leeuw (1978). Let us compare and contrast these methods with ours. In the nonmetric analysis, NLFA performs principal component analysis by using monotonic regression (Kruskal, 1964). On the other hand, PRINCIPALS deals with multivariate data measured at a variety of nominal, ordinal and interval scale levels, and performs principal component analysis based on the alternating least squares principle with monotonic regression.

In contrast, OSMOD utilizes a quasi-Newton projection method to deal with order constraints like (3). A slight extension of OSMOD can also treat data with mixed measurement levels (nominal and/or ordinal). Another extension of OSMOD can perform an analysis of D by combining items according to hypotheses. These extensions will be mentioned later.

There are some differences in outputs between OSMOD and the others. NLFA yields an item configuration X and a subject configuration Y (the direction vectors and the point configuration, respectively, in the original terminology) in the space of specified dimensions. PRINCIPALS provides an item configuration X and a subject configuration Y for a prescribed dimensionality m (the loading matrix and the principal component matrix for a prescribed number of components in the original context) and also finds a unidimensional scaling of category weights for each ordinal variable in the analysis of D.

Similarly, OSMOD yields a multidimensional configuration of items for a prescribed m and J unidimensional scales of category weights for ordinal variables. It should be recognized that estimation of Y is optional, since it is given after X and w have been determined. In this connection, we note that for both NLFA and PRINCIPALS, Y is required to be estimated together with X. This requirement may affect the computational efficiency in analyzing sample survey data for which the number of subjects is much larger than the number of items (N ≫ J). In summary, the framework of principal component analysis is common to all three methods. But they differ in the formulations and analyses, and in the outputs.

2. The Algorithm of Scaling

2.1 Formulation

As already stated, we would like to assign $\{w_{jk}\}$ to item j so as to satisfy (3) (j = 1, 2, ..., J). For this scaling, the origin and unit may be arbitrarily specified. Then, without loss of generality, we can impose on the mean that

$$\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{k_j} w_{jk}\,\delta_i(jk) = 0, \tag{4}$$

and on the variance that

$$\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{k_j} w_{jk}^{2}\,\delta_i(jk) = 1, \quad (j = 1, 2, \ldots, J). \tag{5}$$

By $s_{ij}$ we denote the value of a scaled variable $s_j$ in terms of $\{w_{jk}\}$. Then the matrix of size N × J consisting of scaled values is defined as

$$S = (s_{ij}) \quad\text{where}\quad s_{ij} = \sum_{k=1}^{k_j} w_{jk}\,\delta_i(jk). \tag{6}$$


Write the correlation matrix among $\{s_j\}$ as

$$R = \frac{1}{N} S'S. \tag{7}$$
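To make (6) and (7) concrete, here is a small sketch (ours, not the authors' code) computing S and R from the indicator matrix D of the earlier example; w is assumed to be a list of per-item weight vectors already normalized by (4) and (5).

```python
import numpy as np

def scaled_matrix(D, w, n_categories):
    """S = (s_ij) with s_ij = sum_k w_jk * delta_i(jk), as in (6)."""
    N = D.shape[0]
    S = np.zeros((N, len(n_categories)))
    offset = 0
    for j, kj in enumerate(n_categories):
        S[:, j] = D[:, offset:offset + kj] @ w[j]
        offset += kj
    return S

def correlation(S):
    """R = S'S / N, as in (7); R is a correlation matrix because each
    column of S has zero weighted mean and unit variance by (4)-(5)."""
    return S.T @ S / S.shape[0]
```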

Let $\lambda_t$'s be eigenvalues of R, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_J$, and $x_t$ be the eigenvector associated with $\lambda_t$, that is,

$$R x_t = \lambda_t x_t. \tag{8}$$

Let us suppose that S has been given in terms of w. We wish to obtain the item score matrix X for a prescribed dimensionality m. For this purpose we consider performing a principal component analysis in m dimensions. Then the score matrix is given by $X = (x_1, x_2, \ldots, x_m)$, the set of those eigenvectors associated with the largest m eigenvalues of R. The degree of fit for the assignment is measured by the following index,

$$\theta = \frac{\sum_{t=1}^{m} \lambda_t}{\sum_{t=1}^{J} \lambda_t} = \frac{1}{J} \sum_{t=1}^{m} \lambda_t, \tag{9}$$

the second equality holding because $\sum_{t=1}^{J} \lambda_t = \operatorname{tr} R = J$ for a correlation matrix.

Remember that w is initially unknown. Then $\lambda_t(w)$ and $\theta(w)$ as functions of w are not explicitly expressed in terms of w. Since the fit should be increased as much as possible, we may set an objective function to be maximized as

$$f_0(w) = \sum_{t=1}^{m} \lambda_t(w). \tag{10}$$

Note that a direct maximization of $f_0(w)$ with respect to w is very difficult. As is known (e.g., Bellman, 1970, p. 113), the eigenvalues are written as

$$\lambda_t(w) = \max_{x \in \Omega_t} \left( \frac{x'R(w)x}{x'x} \right), \tag{11}$$

where $\Omega_t$ is the region of x space determined by the orthogonality conditions such that

$$x'x_p = 0 \quad (p = 1, 2, \ldots, t-1), \qquad x \ne 0.$$

Therefore maximizing $f_0(w)$ is, in fact, equivalent to maximizing

$$\sum_{t=1}^{m} x_t' R(w) x_t \tag{12}$$

subject to

$$x_a' x_b = \Delta_{ab} \quad (a, b = 1, 2, \ldots, m), \tag{13}$$

where $\Delta_{ab}$ is the Kronecker delta. Hence, we redefine the objective function to be maximized as

$$f = f(X, w) = \sum_{t=1}^{m} x_t' R(w) x_t. \tag{14}$$

Those X and w that attain the maximum of f will be denoted by X* and w* respectively, and the maximum by $f^* = f(X^*, w^*)$ hereafter. Let $\lambda_t^*$ be the eigenvalues of $R(w^*)$, and let $\theta^*$ be the value of $\theta$ in terms of $\lambda^*$. When the item scores and the category weights are


determined through maximization, the subject scores are optionally given by

$$Y^* = S(w^*) X^*, \tag{15}$$

which is the principal component matrix of $S(w^*)$. We next develop an algorithm to maximize f(X, w) subject to (3), (4), (5) and (13). For practical reasons of computation, we adopt an approach of successive approximation towards the solution of the problem, alternating maximization of f with respect to X and with respect to w. When w is given, we treat f as f(X | w), a function of X. Referring to (14), we find that f(X | w) is maximized by taking the eigenvectors associated with the largest m eigenvalues of R(w). Since R is symmetric, (13) is obviously satisfied by those eigenvectors. When X is given, on the other hand, we will maximize f(w | X) subject to (3). This maximization can be accomplished by using nonlinear optimization. Once the optimal solution is obtained, we can normalize it so that (4) and (5) are satisfied. The entire algorithm is composed of three main phases and one optional phase (see Figure 1):

1) Initialization to give rough estimates of category weights.
2) Computation of item scores by treating an eigenvalue problem.
3) Estimation of category weights by using a nonlinear optimization method.
4) Optional computation of subject scores.

In the following sections we will describe each of these phases; a schematic sketch of the alternation is given below. At this point let us explain why OSMOD utilizes the quasi-Newton projection method in place of monotonic regression for the present optimization. As stated above, the estimation of X and w is mandatory, but Y is optionally estimated according to the algorithm. Note that Y* is computed by (15) after the iterative process to seek X* and w* terminates. Thus it is emphasized that we need not minimize $\| S - YX' \|$ with respect to S for fixed X and Y. Such minimization can effectively be treated by using monotonic regression, as in NLFA and PRINCIPALS, which differ from OSMOD in the framework of algorithms and outputs. The use of the quasi-Newton projection method makes it possible for us to extend our algorithm easily in section 4.
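The following is a minimal sketch of that alternation (Phases 1-3) under our own naming conventions; it reuses the scaled_matrix and correlation helpers sketched earlier, initial_weights is the Phase 1 routine sketched in the next subsection, and phase3 stands in for the quasi-Newton projection step of section 2.4.

```python
import numpy as np

def osmod_outer_loop(D, n_categories, m, phase3, max_iter=50, tol=1e-8):
    """Alternate Phase 2 (eigenproblem of R) and Phase 3 (weight update)."""
    w = initial_weights(D, n_categories)             # Phase 1
    f_old = -np.inf
    for _ in range(max_iter):
        S = scaled_matrix(D, w, n_categories)
        lam, vec = np.linalg.eigh(correlation(S))    # Phase 2: solve (8)
        order = np.argsort(lam)[::-1]
        X = vec[:, order[:m]]                        # largest-m eigenvectors
        f = lam[order[:m]].sum()                     # objective f of (14)
        if f - f_old < tol:
            break                                    # outer-loop convergence
        f_old = f
        w = phase3(D, X, w)                          # Phase 3: update weights
    return X, w, f
```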

2.2 Phase 1: Initialization

We give $\{w_{jk}\}$ initial values so that their differences are equal within each item; namely, they should satisfy, for each j (= 1, 2, ..., J),

$$w_{jk} - w_{j,k-1} = w_{j,k+1} - w_{jk} > 0 \quad (k = 2, 3, \ldots, k_j - 1). \tag{16}$$

Because of constraints (4) and (5), such {wjk } are uniquely determined.
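A sketch of this initialization (an assumption-laden rendering of ours, not the published code): equally spaced weights per item, rescaled so that (4) and (5) hold exactly.

```python
import numpy as np

def initial_weights(D, n_categories):
    """Phase 1: equally spaced weights (16), normalized by (4) and (5)."""
    N = D.shape[0]
    n = D.sum(axis=0)                     # category frequencies n_jk
    w, offset = [], 0
    for kj in n_categories:
        njk = n[offset:offset + kj]
        v = np.arange(kj, dtype=float)    # equal spacing within the item
        v -= njk @ v / N                  # weighted mean zero: condition (4)
        v /= np.sqrt(njk @ v**2 / N)      # weighted unit variance: condition (5)
        w.append(v)
        offset += kj
    return w
```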

2.3 Phase 2: Computation of Item Scores

Given $\{w_{jk}\}$, the correlation matrix R defined by (7) is easily computed. By solving (8), we obtain m eigenvectors $X = \{x_t\}$. As mentioned before, the entire process of our algorithm alternates Phase 2 and Phase 3 iteratively. Phase 2 is usually executed for more than one cycle, and the estimate of X derived from the previous execution is used as a good approximation to start the next execution. Thus the simultaneous iteration method by Clint and Jennings (1970) is efficiently applied to our eigenvalue problem.

2.4 Phase 3: Optimization to Estimate Category Weights

Given X, we search for w to maximize f(w | X) under constraints (3), (4) and (5). For the sake of computation, we shall introduce a set of new parameters $\{v_{jk}\}$. For w that satisfy (3), (4) and (5), we find an affine transformation such that

$$w_{jk} = \alpha_j + \beta_j v_{jk}, \quad (k = 1, 2, \ldots, k_j) \tag{17}$$


[Figure 1. Flow of the OSMOD algorithm: Phase 1, initialization of W = (w_jk); Phase 2, computation of item scores X = (x_jt); Phase 3, optimization to estimate the category weights W (Step 1: initialization of the inner loop, Steps 2-3: constrained optimization with convergence checks); Phase 4 (optional), computation of subject scores Y = (y_it).]


where $\{v_{jk}\}$ are constrained by

$$v_{j1} = 0 \quad\text{and}\quad v_{j,k_j} = 1, \tag{18}$$

and

$$v_{j,k-1} \le v_{jk} \quad (k = 2, 3, \ldots, k_j) \tag{19}$$

for each j. (In fact, setting $\alpha_j = w_{j1}$, $\beta_j = w_{jk_j} - w_{j1}$ and $v_{jk} = (w_{jk} - \alpha_j)/\beta_j$ satisfies (18) and (19). As a result, we have two conditions on $\{v_{jk}\}$ which correspond to (4) and (5).) Conversely, when we have $\{v_{jk}\}$ that satisfy (18) and (19), we can transform them to $\{w_{jk}\}$ by (17) so that they satisfy (3), (4) and (5). The $\alpha_j$ and $\beta_j$ should then be given by

$$\alpha_j = \frac{-P_j}{(NQ_j - P_j^2)^{1/2}} \quad\text{and}\quad \beta_j = \frac{N}{(NQ_j - P_j^2)^{1/2}}, \tag{20}$$

where

$$P_j = \sum_{k=1}^{k_j} n_{jk} v_{jk} \quad\text{and}\quad Q_j = \sum_{k=1}^{k_j} n_{jk} v_{jk}^2.$$
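A sketch of the back-transformation (17)-(20) for a single item, under our own naming; given interior parameters v (with the endpoints fixed by (18)) and the frequencies $n_{jk}$, it returns weights satisfying (4) and (5).

```python
import numpy as np

def v_to_w(v, njk):
    """Map v (v[0] = 0, v[-1] = 1, nondecreasing) to weights w via (17)-(20)."""
    N = njk.sum()
    P = njk @ v          # P_j of (20)
    Q = njk @ v**2       # Q_j of (20)
    denom = np.sqrt(N * Q - P**2)
    return -P / denom + (N / denom) * v   # w_jk = alpha_j + beta_j * v_jk

# Example with hypothetical frequencies for a 4-category item:
# v_to_w(np.array([0.0, 0.25, 0.5, 1.0]), np.array([10.0, 5.0, 8.0, 7.0]))
```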

In view of the transformations, we regard elements of R as functions of $v = (v_{jk})$ and treat f(v | X) in place of f(w | X). Combining (18) and (19) yields

$$0 = v_{j1} \le v_{j2} \le \cdots \le v_{jk_j} = 1. \tag{21}$$

Let us find v to maximize f(v | X) subject to (21). Once the solution is obtained, we transform it to w through (17), in which $\alpha_j$ and $\beta_j$ should be given by (20). We can utilize one of the quasi-Newton projection methods, which is very effective for the present constrained optimization. As is well known, quasi-Newton methods perform nonlinear optimization iteratively and approximate the Hessian by accumulating information from the preceding iterations, using only first derivatives and function values. Projection methods can effectively deal with linearly constrained optimization. They perform optimization in a projected space of lower dimensionality, namely on the boundary where some constraints hold with equality. In a quasi-Newton projection method, we make use of a quasi-Newton method to update the vector of parameters and also a projection method to deal with the linear constraints. Since explanations of the numerical methods can be found in the literature (e.g., Gill & Murray, 1974), our description is confined to some remarks on the implementation of the algorithm. Thus in the remainder of this section, we will take up computation of the gradient, estimation of the Hessian, and the constrained optimization.

The Gradient

We give the derivative of the objective function. Rewrite (14) as

$$f(X, w) = \frac{1}{N} \sum_{t=1}^{m} \sum_{i=1}^{N} \left( \sum_{j=1}^{J} s_{ij} x_{jt} \right)^{2}. \tag{22}$$

Manipulating (22) yields the derivative as

$$\frac{\partial f}{\partial v_{ab}} = \sum_{c=1}^{k_a} \frac{\partial f}{\partial w_{ac}} \frac{\partial w_{ac}}{\partial v_{ab}} = \frac{2}{N} \sum_{c=1}^{k_a} \sum_{t=1}^{m} \sum_{j=1}^{J} \sum_{k=1}^{k_j} x_{at} x_{jt} w_{jk}\, n_{(jk)(ac)} \left[ \frac{N n_{ab}}{(NQ_a - P_a^2)^{3/2}} \left\{ P_a(v_{ab} + v_{ac}) - Q_a - N v_{ab} v_{ac} \right\} + \beta_a \Delta_{bc} \right],$$

$$(a = 1, 2, \ldots, J;\; b = 2, 3, \ldots, k_a - 1), \tag{23}$$

where $n_{(jk)(ac)} = \sum_{i=1}^{N} \delta_i(jk)\,\delta_i(ac)$ denotes the joint frequency of categories $C_{jk}$ and $C_{ac}$. Then we have the (K − 2J)-dimensional gradient vector $g = (\partial f / \partial v_{ab})$. This gradient in terms of $v_{jk}$ is used to perform the quasi-Newton method.

Estimation of the Hessian

Among alternatives to estimate the Hessian, we adopted the so-called BFGS (Broyden-Fletcher-Goldfarb-Shanno) formula (Gill & Murray, 1974). For convenience' sake, define the negative Hessian as

$$H = -\left( \frac{\partial^2 f}{\partial v_{jk}\, \partial v_{ab}} \right).$$

Then the direction vector d is given by solving

$$H d = g(v). \tag{24}$$

Given d and a step size $\alpha$, H is updated according to the BFGS formula as

$$H^{+} = H + \frac{y y'}{\alpha\, y'd} - \frac{g g'}{g'd}, \tag{25}$$

where

$$y = g(v) - g(v + \alpha d).$$

In order to decide the step size, a weak line search is performed, as is usual with quasi-Newton methods.
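For illustration only, here is a minimal sketch of one step of (24)-(25); grad is assumed to return the analytic gradient (23), and alpha would come from the weak line search mentioned above.

```python
import numpy as np

def quasi_newton_step(H, v, grad, alpha):
    """One update of v and of the negative-Hessian approximation H."""
    g = grad(v)
    d = np.linalg.solve(H, g)            # direction: H d = g(v), as in (24)
    y = g - grad(v + alpha * d)
    H_new = (H + np.outer(y, y) / (alpha * (y @ d))
               - np.outer(g, g) / (g @ d))        # BFGS update (25)
    return v + alpha * d, H_new
```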

The Constrained Optimization

The active constraint strategy proposed by Gill and Murray (1974) is efficiently applied to our constrained optimization. In their terminology, a constraint for which the equality holds in (19) is called active. The strategy is briefly stated for the present computation as follows (see Phase 3 in Figure 1).

Step 1. Give the parameters $\{v_{jk}\}$ initial values.

Step 2. If we have a set of active constraints, we perform optimization of f(v) in the space constrained only by the active constraints. Otherwise we perform unconstrained optimization of f(v) by using the ordinary quasi-Newton method. During the optimization, additional constraints may become active according to the strategy. If this situation occurs, we optimize f subject to all the active constraints, including the additional ones.

Step 3. When the iterative computation of Step 2 has converged under prespecified criteria, we examine the Kuhn-Tucker condition.

Denote the gradient at the converged point $\bar{v}$ by $g(\bar{v})$. Suppose the active constraints are

$$a_l' v + b_l \ge 0 \quad (l = 1, 2, \ldots, L).$$

We examine whether the following expression holds,

$$g(\bar{v}) = -[a_1, a_2, \ldots, a_L]\, \mu, \tag{26}$$

where $\mu = (\mu_l)$ is a vector of Lagrangean multipliers such that

$$\mu_l \ge 0 \quad (l = 1, 2, \ldots, L).$$


If (26) is satisfied, $\bar{v}$ is considered to be a maximal point and then Phase 3 stops. Otherwise there may be at least one negative $\mu_l$. Excluding the constraint with the most negative $\mu_l$, we update the set of active constraints and go to Step 2.
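A sketch of this Kuhn-Tucker test under our own conventions: the columns of A are the active-constraint normals $a_l$; we estimate $\mu$ by least squares and either accept $\bar{v}$ or pick the constraint to release.

```python
import numpy as np

def kuhn_tucker_check(g_bar, A):
    """Return None if (26) holds with mu >= 0, else the index of the
    active constraint with the most negative multiplier (to be released)."""
    mu, *_ = np.linalg.lstsq(-A, g_bar, rcond=None)   # solve g = -A mu
    if np.all(mu >= 0):
        return None               # maximal point: Phase 3 stops
    return int(np.argmin(mu))
```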

2.5 Summary of the OSMOD Algorithm

Figure 1 shows the entire flow of the algorithm, which consists of Phase 1, the outer loop of Phases 2 and 3 as well as the inner loops in Phase 3, and Phase 4. To begin with, we specify the dimensionality (m), convergence criteria for the outer loop to terminate, and those for the inner loops. At Phase 1, the process starts with the initial estimate for w. At Phase 2, the item scores X are given by solving the eigenvalue problem.

Phase 3 performs the optimization process iteratively to estimate the category weights w. Step 1 initializes the inner loops by providing them with values that satisfy constraint (21), based on the result of Phase 1 or of Phase 3 in the preceding iteration. When Phase 3 is executed for the first iteration of the outer loop, H should be set to the identity matrix. Steps 2 and 3 execute the quasi-Newton projection method to optimize f(v) by means of the active constraint strategy, or optimize f(v) by using the quasi-Newton method. Step 3 checks the Kuhn-Tucker condition. Since active constraints may be updated at every iteration of the inner loops, the number of free $v_{jk}$ parameters varies. Denote the number by K̃. Note that K̃ ≤ K − 2J, because $v_{j1} = 0$ and $v_{jk_j} = 1$. Hence the computation should be carried out, in practice, with those K̃ parameters. When $k_j = 2$ for all j, it is clear that K̃ = 0. In this case, the optimization process of Phase 3 is not executed, and the process of the outer loop terminates immediately after Phase 2. Phase 4 is optionally executed; it computes the subject scores by (15).

3. Illustrative Example

We will use hypothetical data to give an example of analysis using OSMOD. Suppose that we are given the polychotomous data, illustrated in Table 1, which were gathered through a questionnaire. The data are the responses of thirty subjects to five items, each composed of five ordered categories (N = 30, $k_j$ = 5 and J = 5). We mentioned an example of possible items and categories for this table in the introductory section. The last row of the table shows the frequency of each category.

In an attempt to obtain a latent structure of items in a two-dimensional space and weights of the ordered categories, we applied OSMOD to these data, setting m equal to 2. The maximization of f worked successfully, yielding f* = 3.2863, that is, θ* = 0.6573, after 9 iterations of the outer loop. Table 2 shows the iterative process of Step 2 in the first execution of the nonlinear optimization. After the first execution of Phase 2, OSMOD yielded f = 3.0203 and no active constraint. At iteration 0, OSMOD then performed the quasi-Newton method, yielding f = 3.1519 and one active constraint. In the subsequent iterations, it performed the quasi-Newton projection method under more than one active constraint, increasing f gradually. Figure 2 represents a plot of X*, the derived space of items. The plot is normalized by multiplying $x_t^*$ by $(\lambda_t^*)^{1/2}$ so that the variances of the projections of items on the axes are in proportion to $\lambda_t^*$. In practical situations, we would interpret the axes of the space by inspecting plots of this type. Category weights were determined so as to satisfy the order constraints (3).

Figure 3 shows the weights $\{w_{jk}\}$ for each item, illustrating that they constitute a unidimensional scale. As is clear, all the weights for Items 1 and 2 were determined with strict inequalities. On the other hand, we observe some equal weights for Items 3, 4 and 5, which indicate monotonic curves in a weak sense.

TABLE 2
A Process of the Quasi-Newton Projection Method

iteration of Step 2   number of active   K̃: number of      function value
in Phase 3            constraints        free parameters   of f(v | X)
 0                     0                  15                3.0203
 1                     1                  14                3.1519
 2                     2                  13                3.2055
 3                     3                  12                3.2307
 4                     4                  11                3.2396
 5                     4                  11                3.2460
 6                     5                  10                3.2515
 7                     6                   9                3.2521
 8                     6                   9                3.2540
 9                     6                   9                3.2549
10                     6                   9                3.2553

4. Extensions of OSMOD

Data With Mixed Measurement Levels

So far we have offered OSMOD mainly for the analysis of multivariate ordinal data. However, the algorithm with a slight modification can treat multivariate data with mixed nominal and ordinal measurement levels. Here we let a variable (say j) at the nominal level correspond to a polychotomous item composed of $k_j$ categories for which no order is assumed. In the framework of OSMOD, treating variable j as nominal leads to removing the order constraint (3) for $\{w_{jk};\ k = 1, 2, \ldots, k_j\}$. In this case, OSMOD maximizes f subject to constraint (3) imposed on the category weights of the remaining (J − 1) items, using the quasi-Newton projection method. When all the variables can be treated as nominal, the computational work leads to unconstrained maximization of f. Then the maximization can be carried out by the quasi-Newton method.

Let us demonstrate the extended use of OSMOD with the data of Table 1. We investigated six combinations of different levels of measurement and then obtained the optimal solution in two dimensions for each case. The middle part of Table 3 summarizes the fit and the largest two eigenvalues of those optimal solutions: Case 1, which consists of five nominal variables, Cases 2 to 5 of the mixed measurement as indicated, and Case 6 of five ordinal variables. The analysis under Case 6 is identical to the standard execution of OSMOD. (This result has been mentioned in the preceding section.) As we move from Case 1 to Case 6, the number of order constraints becomes larger, and accordingly the space for optimization becomes more constrained. It is expected that θ* decreases with more


[Figure 2. The derived item configuration X* in two dimensions (dim. 1 by dim. 2), showing Items 1-5.]

constraints. Checking the θ* values from the top to the lower cases in the middle part of the table, we confirm this expectation.

Analysis for a Combined Weighting of Items

OSMOD can incorporate hypotheses about item scores. For the sake of exposition, we start with a description of a practical case. Let us consider a questionnaire designed to measure attitudes toward education, which may take a form such as in Table 1. Two of the items might ask about education for children as follows:

Q1. What degree of education do you want your son to receive? (Select one of the following categories.)

Q2. What degree of education do you want your daughter to receive? (Same as above).

If an investigator hypothesizes that there exists a latent variable (education for children) underlying the two questions, he may wish to perform a data analysis based on the hypothesis. If Q1 and Q2 are highly correlated, the investigator may carry out a similar analysis. Among all possible alternatives to meet the requirements, it is natural to impose


[Figure 3. The category weights $w_{jk}$ (vertical axis, roughly -2.0 to 1.0) plotted against category number (1-5) for each of the five items.]

a condition that the scores of the two items are equal, namely that

$$x_{1t} = x_{2t} \quad (t = 1, 2, \ldots, m).$$

However, since subjects might answer Q1 and Q2 with different subjective weightings of


TABLE 3
Results by the Extended OSMOD and PRINCIPALS

       measurement condition                         OSMOD                  PRINCIPALS
case   nominal variables   ordinal variables   λ1*    λ2*    θ*       λ1     λ2     θ*
1)     1, 2, 3, 4, 5       none                1.909  1.701  0.722    1.768  1.345  0.623
2)     2, 3, 4, 5          1                   2.026  1.438  0.693    2.000  1.151  0.630
3)     3, 4, 5             1, 2                1.947  1.435  0.676    1.869  1.349  0.644
4)     4, 5                1, 2, 3             1.972  1.408  0.676    1.896  1.387  0.657
5)     5                   1, 2, 3, 4          1.896  1.409  0.661    1.899  1.336  0.647
6)     none                1, 2, 3, 4, 5       1.881  1.405  0.657    1.893  1.389  0.656
7)     1, 2, 3, {4, 5}     none                2.004  1.377  0.676
8)     none                1, 2, 3, {4, 5}     1.971  1.148  0.624

categories, we should not impose on the category weights any further constraints besides (3).

In a similar way, if with a class of p items $\{j_1, j_2, \ldots, j_p\}$ the answers are made on the basis of a single latent variable, we propose a restriction that

$$x_{j_1 t} = x_{j_2 t} = \cdots = x_{j_p t} \quad (t = 1, 2, \ldots, m).$$

We now consider a general situation in which J′ (< J) hypotheses hold for the entire set of items. In that situation, all the items are divided into J′ mutually exclusive classes of items. Denote the h-th class, consisting of $p_h$ items, by $\mathcal{A}_h$, that is,

$$\bigcup_{h=1}^{J'} \mathcal{A}_h = \{1, 2, \ldots, J\} \quad\text{and}\quad \sum_{h=1}^{J'} p_h = J.$$

Write the t-th vector of item scores as $x_t^{(h)} = (x_{pt}^{(h)})$. Then we are led to impose the following restriction,

$$x_{1t}^{(h)} = x_{2t}^{(h)} = \cdots = x_{p_h t}^{(h)} \quad (h = 1, 2, \ldots, J';\; t = 1, 2, \ldots, m). \tag{27}$$

For the present purposes it is necessary to revise OSMOD so that item scores are determined under (27). Then we are faced with maximization of (12) w.r.t. X under (13) and (27). As will be mentioned below, this maximization is easily undertaken. As for the categories, we will derive a unidimensional scale of category weights for each item in $\mathcal{A}_h$. An order constraint like (3) may or may not be imposed on the weights; this depends on the measurement level of the item.

Let us define a new variable $u_h$, whose i-th value is given by

$$u_{ih} = \frac{1}{(p_h)^{1/2}} \sum_{j \in \mathcal{A}_h} s_{ij} \quad (h = 1, 2, \ldots, J').$$

Let $\Gamma$ represent the covariance matrix among those variables. Further define a J′-dimensional vector as $\xi_t = (\xi_{ht})$ and also $\Xi = (\xi_1, \xi_2, \ldots, \xi_m)$. Through some manipulations, it is found that maximizing (12) w.r.t. X subject to (13) and (27) is equivalent to maximizing

$$f(\Xi, w) = \sum_{t=1}^{m} \xi_t' \Gamma(w) \xi_t \tag{28}$$


w.r.t. $\Xi$ subject to

$$\xi_a' \xi_b = \Delta_{ab} \quad (a, b = 1, 2, \ldots, m). \tag{29}$$

Once the optimal solution $(\Xi^*, w^*)$ is given, we can determine

$$x_{pt}^{(h)*} = \frac{\xi_{ht}^*}{(p_h)^{1/2}}, \quad (p = 1, 2, \ldots, p_h;\; h = 1, 2, \ldots, J';\; t = 1, 2, \ldots, m). \tag{30}$$
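A sketch of this reduced problem (28)-(30) in the conventions of the earlier snippets; classes is a hypothetical list of item-index lists (the classes $\mathcal{A}_h$), and S is the N × J scaled matrix.

```python
import numpy as np

def combined_item_scores(S, classes, m):
    """Combined weighting: solve the eigenproblem of Gamma, expand by (30)."""
    N = S.shape[0]
    U = np.column_stack([S[:, idx].sum(axis=1) / np.sqrt(len(idx))
                         for idx in classes])       # the variables u_ih
    Gamma = U.T @ U / N                              # covariance of the u_h
    lam, vec = np.linalg.eigh(Gamma)
    Xi = vec[:, np.argsort(lam)[::-1][:m]]           # maximizes (28) under (29)
    X = np.zeros((S.shape[1], m))
    for h, idx in enumerate(classes):
        X[idx, :] = Xi[h, :] / np.sqrt(len(idx))     # item scores, as in (30)
    return X
```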

Let us show examples. The lowest two cases in Table 3 show the results of analyses in two dimensions with the combined weighting of Items 4 and 5, using the data of Table 1. Case 7 treats all the variables as measured at the nominal level, and Case 8 treats them at the ordinal level. Comparing Case 1 with Case 7, we note a decrease in the goodness of fit due to the combined weighting. Such a decrease is also observed between Case 6 and Case 8. It is clear that this decrease is caused by restriction (27). We notice that θ* is lower in Case 8 than in Case 7, even for the same combination of items, because different levels of measurement were assumed for the two cases.

5. Robustness and Accuracy Tests

In order to examine the effects of random error on the fit of OSMOD and its accuracy in recovering the original structure, we present a simulation study.

Generation of Test Data

We generated two series of 500 random numbers $\{(z_{i1}, z_{i2});\ i = 1, 2, \ldots, 500\}$ which accorded with a two-dimensional normal distribution with mean (0, 0) and identity covariance matrix. Denote the data by $Z = (z_{it})$. We defined five new variables $b_j$ in terms of nonlinear transformations of $z_1$ and $z_2$ as follows. First we let

$$f_j = z_1 \cos \psi_j + z_2 \sin \psi_j \quad\text{where}\quad \psi_j = 0.3(j-1)\pi, \quad (j = 1, 2, \ldots, 5).$$

Next we defined $b_j$ by logarithmic and power functions of $f_j$. Thus we constructed a matrix of numerical data $B = (b_{ij})$. For each j we classified all 500 values of $b_{ij}$ into five categories $(C_{j1}, C_{j2}, \ldots, C_{j5})$, in such a way that the smallest 100 values were put into $C_{j1}$, the next smallest 100 into $C_{j2}$, and so forth, until the largest 100 values were put into $C_{j5}$. Through the categorization, we provided a data matrix D.

Supposing fallible cases with error perturbation, we designed more sets of simulated data. Each $b_j$ was perturbed by adding an error component $e_j$ generated from a normal distribution $N[0, s_j^2]$. Three different values were employed for $s_j$ to control the error level: $0.025\sigma_j$, $0.05\sigma_j$, $0.10\sigma_j$, where $\sigma_j$ was the standard deviation of $b_j$ (j = 1, 2, ..., 5). After the incorporation of error was carried out for the five variables, we had three sets of error-perturbed data, $B_1$, $B_2$, $B_3$, in increasing order of error level. From $B_l$, three sets of data $D_l$ (l = 1, 2, 3) were generated according to the procedure mentioned above. For notational convenience, hereafter the data at the error-free level are denoted by $B_0$ and $D_0$.
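A sketch of this generation procedure; the particular logarithmic and power transforms below are assumptions, since the paper does not specify them, and the output is a response matrix in the format of the earlier indicator_matrix example.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_test_data(N=500, J=5, error_level=0.0):
    """Z -> f_j -> b_j (assumed transforms) -> equal-count 5-category data."""
    Z = rng.standard_normal((N, 2))
    psi = 0.3 * np.arange(J) * np.pi
    F = Z[:, [0]] * np.cos(psi) + Z[:, [1]] * np.sin(psi)
    # Hypothetical log/power transforms of the f_j (the paper's are unknown):
    B = np.column_stack([np.sign(F[:, 0]) * np.log1p(np.abs(F[:, 0])),
                         F[:, 1] ** 3,
                         np.sign(F[:, 2]) * np.abs(F[:, 2]) ** 1.5,
                         np.sign(F[:, 3]) * np.log1p(F[:, 3] ** 2),
                         F[:, 4] ** 3])
    B += error_level * B.std(axis=0) * rng.standard_normal(B.shape)
    ranks = B.argsort(axis=0).argsort(axis=0)   # 0..N-1 within each column
    return ranks // (N // 5) + 1                # categories 1..5, 100 each
```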

Robustness of the Fit

We applied OSMOD to the simulated multivariate ordinal data $\{D_l\}$, setting the dimensionality m to 1, 2, 3. The upper section of Table 4 gives the results in terms of the fit θ* defined by (9) and, for m = 2, the largest two eigenvalues of R as well. As may be seen, the fit worsens as the error increases, while it improves with increasing dimensionality. This finding is consistent with common experience in multivariate analysis. Write the goodness of fit for m dimensions as θ*(m). Inspection of the θ* values indicates that θ*(2) is generally closer to θ*(3) than to θ*(1), although the discrepancy between θ*(3) and θ*(2) becomes larger as the error increases. Thus, we expect that for the low level of error

TABLE 4
The Fit Attained by OSMOD and the Principal Component Analysis

             error                  m=1                 m=2             m=3
method       level (%)   data      θ*       λ1     λ2     θ*           θ*
OSMOD         0.0        D0        0.607    3.016  1.786  0.961        0.984
              2.5        D1        0.377    1.864  1.472  0.667        0.821
              5.0        D2        0.341    1.685  1.429  0.623        0.791
             10.0        D3        0.330    1.641  1.403  0.609        0.790
principal     0.0        B0        0.334    1.671  1.454  0.625        0.824
component     2.5        B1        0.332    1.658  1.443  0.620        0.818
analysis      5.0        B2        0.329    1.645  1.433  0.616        0.813
             10.0        B3        0.322    1.609  1.430  0.608        0.805

perturbation, the original two-dimensional structure Z will be recovered very well from $D_1$. This expectation is discussed later.

Accuracy of the Recovery

Now we are going to discuss the recovery of the original structure Z. In practical situations such an original structure would be unknown, and the subject scores Y derived from S by (15) may be regarded as a subject configuration. Although in general we might not compare Y with Z, since the true structure is usually unknown, we do so for the present discussion to study the characteristics of OSMOD. Two measures of the recovery are suggested for the comparison. First we can use the index z proposed by Lingoes and Schönemann (1974). It measures the degree of discrepancy between Y and Z and is invariant under translation, stretching and orthogonal rotation of the axes. Its range is bounded as 0 < z < 1. As the discrepancy between Y and Z widens, the value of z increases. Next we measure the congruence between Y and Z based on canonical correlation analysis. As is known, the canonical correlation coefficients $\rho_1$ and $\rho_2$ are invariant under affine transformations of the axes. The correlation indicates the congruence that does not depend on the transformation. Further, the trace correlation coefficient $\gamma$, defined by

$$\gamma = \left\{ \frac{\rho_1^2 + \rho_2^2}{2} \right\}^{1/2},$$

serves as a summary index of $\rho_1$ and $\rho_2$. For each of the two-dimensional solutions referred to in Table 4, we obtained Y and then computed z, $\rho_1$, $\rho_2$ and $\gamma$. Table 5 shows the values of these indices. Here it should be

TABLE 5
The Recovery of the Original Structure by OSMOD and the Principal Component Analysis

                            indices of recovery
method           data      z        ρ1       ρ2       γ
OSMOD            D0        0.085    0.965    0.961    0.963
                 D1        0.300    0.882    0.787    0.836
                 D2        0.376    0.851    0.724    0.790
                 D3        0.432    0.799    0.705    0.753
principal        B0        0.442    0.847    0.641    0.751
component        B1        0.454    0.836    0.637    0.743
analysis         B2        0.464    0.827    0.632    0.736
                 B3        0.456    0.832    0.639    0.742

remembered that $D_l$ involves three noise factors: the nonlinear transformations, the categorization, and the error incorporation. At the error-free level, $D_0$ is concerned with only the first two factors, and we find that the recovery is very good according to the indices. As may be seen, the recovery generally becomes worse as the error increases. This is explained as follows. The incorporated error influences the ranking of the $b_{ij}$, and accordingly the categorization. Thus the $D_l$ (l = 1, 2, 3) differ from $D_0$ in proportion to the amount of error. Inspecting the recovery indices in view of the process of the data generation, we can say that the original structure is recovered by OSMOD to a good extent.
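For reference, a small sketch (ours) of how the canonical correlations and the trace correlation between two configurations can be computed, via the singular values of the product of orthonormal bases:

```python
import numpy as np

def recovery_indices(Y, Z):
    """Canonical correlations rho between Y and Z (both N x 2), and gamma."""
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    Qz, _ = np.linalg.qr(Z - Z.mean(axis=0))
    rho = np.linalg.svd(Qy.T @ Qz, compute_uv=False)  # canonical correlations
    gamma = np.sqrt((rho ** 2).mean())                # trace correlation
    return rho, gamma
```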

6. Discussion

Comparison with the Principal Component Analysis

In comparison with the linear method of standard principal component analysis, OSMOD is regarded as a nonlinear one. From this standpoint, it may be of interest to compare the results given by both methods. We applied the linear method to the numerical data $B_l$ (l = 0, 1, 2, 3) to obtain solutions in 1, 2 and 3 dimensions. The lower section of Table 4 shows the result. Here θ* is given by (9) in terms of the eigenvalues of the correlation matrix of the $b_j$'s. For both methods, the fit becomes worse as the error increases. But we see that the fit attained by OSMOD is generally better than that attained by the linear method. Since $D_l$ was perturbed by the categorization process while $B_l$ was not, applying the linear method to $D_l$ would result in far lower fits. Thus, regarding goodness of fit, OSMOD is more robust than the linear method. Also we find that the differences of θ* values across the methods become smaller with increasing level of error. Hence the fit within the framework of principal component analysis seems to be degraded more by the error perturbation than by the difference of methods (linear, nonlinear) or by the categorization


process. This argument does not diminish the applicability of OSMOD, which has been developed mainly for the analysis of multivariate ordinal data.

Let us turn to the recovery of the original structure. Based on each of the two-dimensional solutions with the numerical data, we obtained a matrix of two principal components in the ordinary way. We then computed the indices of recovery, which are shown in the lower section of Table 5. It is obvious that OSMOD is more effective in recovering the original data structure than the linear method.

Comparison with PRINCIPALS

We analyzed the data of Table 1 by using PRINCIPALS (a version included in SAS, 1983), under the six cases of mixed measurement levels. The right side of Table 3 shows the obtained fit and eigenvalues. We see that OSMOD yielded a better fit than PRINCIPALS for Cases 1 to 5. The difference in model fitting between OSMOD and PRINCIPALS becomes larger as the number of nominal variables increases; the detailed reason remains to be investigated. For Case 6, with all ordinal variables, however, it is found that the two methods yielded almost the same degree of fit. We also observed this tendency when analyzing other sets of multivariate data with all ordinal variables. For example, applying PRINCIPALS to data $D_0$, $D_1$, $D_2$, $D_3$ (in Table 4) yielded two-dimensional solutions with θ* = 0.960, 0.667, 0.622, 0.609, respectively. Analysis by combined weighting of items was not performed with PRINCIPALS, since this method does not handle that case.

Remarks on the Algorithm

A question may be posed concerning the algorithm of OSMOD. As stated above, the iterative process alternates solving the eigenvalue problem and executing the nonlinear optimization. Supposing a general function h(x, y), we notice that repeating the maximizations of h(x | y) and h(y | x) does not always reach even a local maximum of h(x, y). This consideration leads to the question whether our process really converges to a local maximum. To state the optimality strictly, we would need to prove, at first, the continuity of f(X, w). Although we leave the strict proof for a future study, we may assume at present the continuity in the sense that a slight change of $\{X^{(k)}, w^{(k)}\}$ to $\{X^{(k+1)}, w^{(k+1)}\}$ should produce a slight change of f. (The superscript denotes the iteration of the outer loop of our algorithm.) Next we see from (4), (5) and (13) that the updates of X and w lie in a compact set. Third, it should be recalled from (10) that 0 < f < J. Thus, in a similar way as de Leeuw and van Rijckevorsel (1980) proved the convergence property of their alternating algorithm, we have the indication that the solution point given by the alternating algorithm of OSMOD is an optimal point.

Our algorithm seems to be supported by another consideration. Let us suppose numerical data generated from a multivariate normal distribution and apply the algorithm of OSMOD to them. Then, through a voluminous amount of manipulation, we can show that $\partial^2 f / \partial X \partial w$ becomes a zero matrix at the solution given by OSMOD. It implies that the alternating algorithm gives an optimal solution for these data. This argument may approximately hold in the case of data with ordered categories generated from the normal distribution. For the reasons mentioned above, it is plausible that the algorithm gives an optimal solution in practical situations.

Let us consider the implications of (18). In view of the transformation from v to w, we realize that (18) means no loss of generality in the unidimensional scaling of ordered categories for each item. The unidimensional scales are determined, as illustrated in Figure 3, in a different space from the multidimensional space of items which is represented by $X = \{x_t\}$. Recall that $\{x_t\}$ are orthogonal (see (13)), and in this sense uncorrelated. For a further extension of OSMOD, we point out that the use of the quasi-Newton


projection method makes it easy to impose a partial order on the category weights.

Healy and Goldstein (1976) studied, using endpoint constraints, the scaling of ordered categories with such data as D in the classical (Guttman-type) approach of optimal scaling. The difficulty arising from the constraints was pointed out by those authors themselves; see also Greenacre (1984) and Goldstein (1987). It should be noted that their approach is fundamentally different from ours in several respects. First, they did not impose any order constraint on the $w_{jk}$ (category weights). Second, they considered multidimensional scaling of categories (see their Equation (2.6)), in contrast to our unidimensional scaling of categories and multidimensional scaling of items. Third, their endpoint constraints (Equation (2.5) in the original paper) are expressed as

$$\sum_{j=1}^{J} w_{j1} = 0 \quad\text{and}\quad \sum_{j=1}^{J} w_{jk_j} = 1.$$

Obviously these linear constraints affect the $\{w_{jk}\}$, whereas our constraints of (18) do not.

Supplementary Comments for Applications

When one is concerned with the speed of computation, it would be a good idea to compare the three methods, OSMOD, NLFA and PRINCIPALS, on several sets of data, especially on computers of the same capability. However, one should remember their differences in analysis as well as in outputs in such comparisons. In order to evaluate the exact speed of execution, the three methods should be implemented with the same programming technique of data file handling, ideally on the same machine. At present we leave such comparisons for further investigation. For reference, we report that OSMOD attained fast convergence in a few iterations even for data with 12 items, 108 categories and 500 subjects (CPU time 9.5 seconds on a HITAC-M280H machine).

After data collection, we may sometimes find categories with very small $n_{jk}$'s, the frequencies of subject responses. Some $n_{jk}$ might even be zero. If a category has zero frequency, we should perform the analysis excluding it. Let us consider applying OSMOD to data that involve a category $C_{jk}$ with a very small $n_{jk}$. According to the active constraint strategy dealing with the order constraint (3), the weights tend to be determined in such a way that $w_{jk}^* = w_{j,k-1}^*$ or $w_{jk}^* = w_{j,k+1}^*$. It will rarely occur that $w_{j,k-1}^* < w_{jk}^* < w_{j,k+1}^*$. Further, constraints (4) and (5) also serve to keep the weights from running too far away. We experienced this tendency with many sets of data involving categories with very small frequencies.

When OSMOD analyzes a set of data by changing the dimensionality in a consecutive way, first in one dimension and next in two dimensions, it performs each analysis separately. In this regard, it may differ from other methods based on the framework of principal component analysis. For example, according to the classical (Guttman-type) method of scaling for categorical data, the first dimension and its eigenvalue of a unidimensional solution remain unaltered when the second dimension is obtained in a two-dimensional solution, and similarly when more dimensions are added.

References

Bellman, R. (1970). Introduction to matrix analysis (2nd ed.). New York: McGraw-Hill.
Bradley, R. A., Katti, S. K., & Coons, I. J. (1962). Optimal scaling for ordered categories. Psychometrika, 27, 355-374.
Clint, M., & Jennings, A. (1970). The evaluation of eigenvalues and eigenvectors of real symmetric matrices by simultaneous iteration. The Computer Journal, 13, 76-80.
de Leeuw, J., & van Rijckevorsel, J. (1980). HOMALS & PRINCALS: Some generalizations of principal component analysis. In E. Diday, L. Lebart, J. P. Pagés, & R. Tomassone (Eds.), Data analysis and informatics. Amsterdam: North-Holland.
Gill, P. E., & Murray, W. (Eds.). (1974). Numerical methods for constrained optimization. London: Academic Press.
Goldstein, H. (in press). The choice of constraints in correspondence analysis. Psychometrika.
Greenacre, M. J. (1984). Theory and applications of correspondence analysis. London: Academic Press.
Guttman, L. (1941). The quantification of a class of attributes: A theory and method of scale construction. In P. Horst (Ed.), The prediction of personal adjustment (pp. 319-348). New York: Social Science Research Council.
Hayashi, C. (1952). On the prediction of phenomena from qualitative data and the quantification of qualitative data from the mathematico-statistical point of view. Annals of the Institute of Statistical Mathematics, 3, 69-98.
Healy, M. J. R., & Goldstein, H. (1976). An approach to the scaling of categorized attributes. Biometrika, 63, 219-229.
Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1-27.
Kruskal, J. B., & Shepard, R. N. (1974). A nonmetric variety of linear factor analysis. Psychometrika, 39, 123-157.
Lingoes, J. C., & Schönemann, P. H. (1974). Alternative measures of fit for the Schönemann-Carroll matrix fitting algorithm. Psychometrika, 39, 423-427.
Nishisato, S. (1980). Analysis of categorical data: Dual scaling and its applications. Toronto: University of Toronto Press.
Nishisato, S., & Arri, P. S. (1975). Nonlinear programming approach to optimal scaling of partially ordered categories. Psychometrika, 40, 525-548.
Nishisato, S., & Inukai, Y. (1972). Partially optimal scaling of items with ordered categories. Japanese Psychological Research, 14, 109-119.
Saito, T., & Otsu, T. (1983). A method of optimal scaling for multivariate ordinal data. Hokkaido Behavioral Science Report, Series M, No. 5.
SAS Institute (1983). Technical Report P-131. Cary, NC: Author.
Tanaka, Y. (1979). Optimal scaling for arbitrarily ordered categories. Annals of the Institute of Statistical Mathematics, 31, 115-124.
Tanaka, Y., & Kodake, K. (1980). Computational aspects of optimal scaling for ordered categories. Behaviormetrika, 7, 35-46.
Tenenhaus, M., & Young, F. W. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50, 91-119.
Young, F. W., Takane, Y., & de Leeuw, J. (1978). The principal components of mixed measurement level multivariate data: An alternating least squares method with optimal scaling features. Psychometrika, 43, 279-281.

Manuscript received 6/12/84 Final version received 11/11/86