
PSY 445 — Spring, 2016 — Lecture Notes

Matrix Algebra

§ 1 Basic Definitions

We need to talk about the arithmetic and algebra of scalars, vectors, and matrices. Many of the operations are very similar, but each of these systems has its quirks.

We’ll start by talking about the relationship of scalars, vectors, and matrices.

[A] A matrix is a two-dimensional arrangement of elements.

• The elements of a matrix can be numbers, variables, or symbols.

• The elements are indexed by the row and then the column in which they occur.

• The size of a matrix is denoted by the number of rows and then the number of columns.

• For example, $A_{2\times 3} = \begin{bmatrix} a & b & c \\ d & e & f \end{bmatrix}$ and $B_{3\times 2} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}$

[B] A vector is a matrix that has only a single row or a matrix that has only a single column.

• A row-vector has only a single row, such as $R_{1\times 4} = \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}$

• A column-vector has only a single column, such as $C_{4\times 1} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}$

• If we don't specify row or column, we assume that a vector is a column vector.

[C] A scalar is a matrix (or row-vector or column-vector) that has only a single element.

• Scalars are non-dimensional numbers, variables, or symbols like you're used to seeing and have seen since grade school.

• You already know scalar arithmetic and algebra, but you may not have known that it was scalar arithmetic and algebra. A scalar algebra by any other name would smell as sweet. . .

• For example, $\begin{bmatrix} 1 \end{bmatrix}$ or $\begin{bmatrix} a \end{bmatrix}$ or $\begin{bmatrix} \sigma \end{bmatrix}$ are all scalars.


§ 2 Arithmetic

Let’s build “up” from scalar arithmetic:

§ 2.1 Scalar Arithmetic

You've known how to perform these operations for a long time. . .

• Addition (& subtraction): 1 + 2 = 3

• Multiplication: 2 ∗ 3 = 6

• Division: $3/4 = 3 \ast (1/4) = 3 \ast (4)^{-1}$

§ 2.2 Vector Arithmetic

For vector arithmetic, we have to make sure that the vectors are conformable for that particular operation.

[A] Addition (& Subtraction): To be conformable, the vectors must both be the "same size" (that is, they must both be row-vectors with the same number of columns or both be column-vectors with the same number of rows).

$\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} = \begin{bmatrix} 5 \\ 7 \\ 9 \end{bmatrix}$

[B] Multiplication: To be conformable, we can either multiply a row-vector times a column-vector or a column-vector times a row-vector, but we cannot multiply two row-vectors or two column-vectors. Unlike scalar multiplication, the order is important: RC ≠ CR

• Inner Product: row-vector ∗ column-vector = scalar. In addition, to be conformable for inner-product multiplication, the number of columns in the row-vector must equal the number of rows in the column-vector.

$\begin{bmatrix} 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} = 1 \ast 4 + 2 \ast 5 + 3 \ast 6 = 32$

More generally, if $R = \begin{bmatrix} r_1 & r_2 & r_3 \end{bmatrix}$ and $C = \begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix}$, then $RC = \sum_i r_i \ast c_i$

• Outer Product: column-vector ∗ row-vector = matrix

The vectors may be of any size, but their sizes will determine the size of the matrix yielded by their outer product.

$\begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} = \begin{bmatrix} 4 & 8 & 12 \\ 5 & 10 & 15 \\ 6 & 12 & 18 \end{bmatrix}$

More generally, if $R = \begin{bmatrix} r_1 & r_2 & r_3 \end{bmatrix}$ and $C = \begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix}$, then

$CR = \begin{bmatrix} c_1 \ast r_1 & c_1 \ast r_2 & c_1 \ast r_3 \\ c_2 \ast r_1 & c_2 \ast r_2 & c_2 \ast r_3 \\ c_3 \ast r_1 & c_3 \ast r_2 & c_3 \ast r_3 \end{bmatrix}$

We can generalize the formula by indicating the formula for each element of the product. If we call the product matrix M, $m_{ij}$ refers to the element in the ith row and jth column of M.

$m_{ij} = c_i \cdot r_j$

[C] Division: There is no convenient “division” operator for vectors.
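The inner and outer products above can be checked with a few lines of SPSS matrix syntax of the kind used in § 4 below. This is a minimal sketch, assuming the commands are run inside SPSS's MATRIX-END MATRIX facility; the names R, C, INNER, and OUTER are just illustrative.

MATRIX.
COMPUTE R = {1, 2, 3}.        /* a 1x3 row-vector */
COMPUTE C = {4; 5; 6}.        /* a 3x1 column-vector */
COMPUTE INNER = R*C.          /* 1x1 scalar: 1*4 + 2*5 + 3*6 = 32 */
COMPUTE OUTER = C*R.          /* 3x3 outer product */
PRINT INNER.
PRINT OUTER.
END MATRIX.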

§ 2.3 Matrix Arithmetic

For matrix arithmetic, we must also make sure that the matrices are conformable.

[A] Addition (& subtraction): To be conformable, the matrices must have both the same number of rows and the same number of columns. Like vector addition, matrix addition is an element-by-element operation (i.e., corresponding elements are simply added together, or subtracted as the case may be).

[B] Multiplication: To be conformable for multiplication, the number of columns of the first matrix must equal the number of rows of the second matrix. The resulting product will have as many rows as the first matrix and as many columns as the second matrix.

As with vector multiplication, the order of multiplication is important. In fact, we have some special nomenclature to indicate the order of multiplication:

• A is premultiplied by B = BA

• B is premultiplied by A = AB


• A is postmultiplied by B = AB

• B is postmultiplied by A = BA

The most elegant way to express the process of matrix multiplication is to say that the (i,j) element is given by the inner product of the ith row of the first matrix and the jth column of the second matrix.

For example, if $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$ and $B = \begin{bmatrix} 1 & 2 \\ 1 & 2 \\ 1 & 2 \end{bmatrix}$, then

$AB = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \cdot \begin{bmatrix} 1 & 2 \\ 1 & 2 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 1 \ast 1 + 2 \ast 1 + 3 \ast 1 & 1 \ast 2 + 2 \ast 2 + 3 \ast 2 \\ 4 \ast 1 + 5 \ast 1 + 6 \ast 1 & 4 \ast 2 + 5 \ast 2 + 6 \ast 2 \end{bmatrix} = \begin{bmatrix} 6 & 12 \\ 15 & 30 \end{bmatrix}$

$BA = \begin{bmatrix} 1 & 2 \\ 1 & 2 \\ 1 & 2 \end{bmatrix} \cdot \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} = \begin{bmatrix} 1 \ast 1 + 2 \ast 4 & 1 \ast 2 + 2 \ast 5 & 1 \ast 3 + 2 \ast 6 \\ 1 \ast 1 + 2 \ast 4 & 1 \ast 2 + 2 \ast 5 & 1 \ast 3 + 2 \ast 6 \\ 1 \ast 1 + 2 \ast 4 & 1 \ast 2 + 2 \ast 5 & 1 \ast 3 + 2 \ast 6 \end{bmatrix} = \begin{bmatrix} 9 & 12 & 15 \\ 9 & 12 & 15 \\ 9 & 12 & 15 \end{bmatrix}$

Clearly, AB ≠ BA, which is obvious because they are different sizes.
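As a check on the hand computations, the same two products can be formed with SPSS matrix syntax (a minimal sketch, assuming it is run inside the MATRIX-END MATRIX facility used in § 4; the names A, B, AB, and BA mirror the example above):

MATRIX.
COMPUTE A = {1, 2, 3; 4, 5, 6}.
COMPUTE B = {1, 2; 1, 2; 1, 2}.
COMPUTE AB = A*B.             /* 2x2: should reproduce {6, 12; 15, 30} */
COMPUTE BA = B*A.             /* 3x3: should reproduce rows of 9, 12, 15 */
PRINT AB.
PRINT BA.
END MATRIX.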

[C] Inverses: The matrix analogue of division is multiplying by an inverse (which is another way of thinking about division in scalar arithmetic).

• Finding the inverse of a matrix is (relatively) easy for square matrices and (relatively) complex for rectangular matrices. Fortunately for us, we will only need to find the inverse of square matrices.

• For scalars, the inverse of a number times that number is one. A matrix inverse times that matrix should be equal to "one" – but we need to find a matrix version of "one."

• The matrix equivalent of one is called the "Identity Matrix." The identity matrix has ones on the main diagonal (elements with row and column indices that are equal) and zeroes elsewhere.


• For instance, $I_{4\times 4} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$

• Thus, for matrix inverses, A · A−1 = I = A−1A

• We have a fairly simple formula for 2x2 matrices.

If $A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$, then $A^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$

• For example, if $A = \begin{bmatrix} 2 & 1 \\ 3 & 4 \end{bmatrix}$, then $A^{-1} = \frac{1}{2 \ast 4 - 1 \ast 3}\begin{bmatrix} 4 & -1 \\ -3 & 2 \end{bmatrix} = \begin{bmatrix} 4/5 & -1/5 \\ -3/5 & 2/5 \end{bmatrix}$

Let's check to make sure that $AA^{-1} = I$:

$\begin{bmatrix} 2 & 1 \\ 3 & 4 \end{bmatrix} \cdot \begin{bmatrix} 0.8 & -0.2 \\ -0.6 & 0.4 \end{bmatrix} = \begin{bmatrix} 1.6 - 0.6 & -0.4 + 0.4 \\ 2.4 - 2.4 & -0.6 + 1.6 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

• This formula applies only to 2x2 matrices, although there is a similar (but much more complicated) formula for 3x3 matrices. Instead of developing a series of formulas, it is desirable to find a more general approach, one that might work on a desert island.

– The method we will develop is called Gaussian elimination and uses a series of row-reduction operations.

– We start by concatenating the matrix to be inverted (on the left) with the same-sized identity matrix (on the right).

– We can replace any row in this rectangular matrix with any linear combination of rows from the previous step. Our goal in doing so is to simplify (i.e., reduce) the elements of the original matrix to equal the identity matrix on the left, and those same row operations have the (seemingly magical) effect of turning the elements on the right into the inverse of the original matrix.

– Consider this example:


$\begin{bmatrix} A & I \end{bmatrix} = \left[\begin{array}{cc|cc} 2 & 1 & 1 & 0 \\ 3 & 4 & 0 & 1 \end{array}\right] \xrightarrow{R_1 = 4R_1 - R_2} \left[\begin{array}{cc|cc} 5 & 0 & 4 & -1 \\ 3 & 4 & 0 & 1 \end{array}\right] \xrightarrow{R_2 = 3R_1 - 5R_2} \left[\begin{array}{cc|cc} 5 & 0 & 4 & -1 \\ 0 & -20 & 12 & -8 \end{array}\right]$

$\xrightarrow{R_1 = (1/5)R_1} \left[\begin{array}{cc|cc} 1 & 0 & 4/5 & -1/5 \\ 0 & -20 & 12 & -8 \end{array}\right] \xrightarrow{R_2 = (-1/20)R_2} \left[\begin{array}{cc|cc} 1 & 0 & 4/5 & -1/5 \\ 0 & 1 & -12/20 & 8/20 \end{array}\right] = \begin{bmatrix} I & A^{-1} \end{bmatrix}$

So, $A^{-1} = \begin{bmatrix} 0.8 & -0.2 \\ -0.6 & 0.4 \end{bmatrix}$

– We can generalize this approach to any square matrix for which the inverse exists.
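In practice, we let the computer do the row reduction. A minimal SPSS matrix sketch (assuming the MATRIX-END MATRIX facility; INV is the built-in inverse function), using the same A as in the hand-worked example:

MATRIX.
COMPUTE A = {2, 1; 3, 4}.
COMPUTE AINV = INV(A).        /* should reproduce {0.8, -0.2; -0.6, 0.4} */
COMPUTE CHECK = A*AINV.       /* should reproduce the 2x2 identity matrix */
PRINT AINV.
PRINT CHECK.
END MATRIX.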

§ 2.4 Relationships of Matrices, Vectors, & Scalars

If we look carefully at these matrix operations, we will see that all of the matrix operations still hold when applied to vectors (because vectors are a "special case" of matrices in that they only have a single row or a single column). The same is true of scalars.

§ 3 Additional Matrix Operations

In addition to generalizing scalar arithmetic, there are other operations specific to matrices.

§ 3.1 Transpose

To transpose a matrix (or vector), each row is made into a column, which means that each column will become a row in the transposed matrix.

• If $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$, then $A' = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$

• If A = A′, then we say that A is symmetric. Thus, only square matrices may be symmetric, since the sizes would not work out for rectangular matrices.


§ 3.2 Determinants

Determinants are only defined for square matrices, and they are found using a recursive algorithm.

[A] The determinant of a scalar is that scalar: $\left| a \right| = a$.

[B] The determinant of a 2x2 matrix is found by multiplying each element in a given row times the determinant of the reduced matrix formed by deleting the row and column of this element, then multiplying by $(-1)^{i+j}$. These products are then added together to yield the determinant.

$\left| A \right| = \begin{vmatrix} 2 & 1 \\ 3 & 4 \end{vmatrix} = 2 \ast 4 \ast (-1)^{1+1} + 1 \ast 3 \ast (-1)^{1+2} = 8 - 3 = 5$

[C] The determinant of a 3x3 matrix is found the same way, except that the determinant of the reduced matrix after eliminating the row & column is the determinant of a 2x2 matrix rather than the degenerate case previously, when it was the determinant of a scalar.

$\left| A \right| = \begin{vmatrix} 4 & 3 & 2 \\ 1 & 5 & 3 \\ 2 & 1 & 4 \end{vmatrix} = 4\begin{vmatrix} 5 & 3 \\ 1 & 4 \end{vmatrix}(-1)^{1+1} + 3\begin{vmatrix} 1 & 3 \\ 2 & 4 \end{vmatrix}(-1)^{1+2} + 2\begin{vmatrix} 1 & 5 \\ 2 & 1 \end{vmatrix}(-1)^{1+3} = 4(17)(1) + 3(-2)(-1) + 2(-9)(1) = 56$

[D] The determinants of larger matrices are found the same way, except that it takes more steps to find the determinant of the reduced matrix. Since we define determinants as linear combinations of the determinants of smaller sub-matrices, we call the algorithm a "recursive" algorithm.

§ 3.3 Trace

The trace of a matrix is only defined for square matrices and refers to the sum of the main diagonal elements.

$\mathrm{Trace}(A) = \mathrm{Trace}\begin{bmatrix} 2 & 1 \\ 3 & 4 \end{bmatrix} = 2 + 4 = 6.$
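Both of these scalar summaries are available as built-in functions in SPSS matrix syntax. A minimal sketch using the matrices from the determinant and trace examples above (assuming the MATRIX-END MATRIX facility; DET and TRACE are the built-in functions):

MATRIX.
COMPUTE A = {2, 1; 3, 4}.
COMPUTE B = {4, 3, 2; 1, 5, 3; 2, 1, 4}.
COMPUTE DETA = DET(A).        /* should be 5 */
COMPUTE DETB = DET(B).        /* should be 56 */
COMPUTE TRA = TRACE(A).       /* should be 6 */
PRINT DETA.
PRINT DETB.
PRINT TRA.
END MATRIX.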

§ 3.4 Eigenstructure: Eigenvalues and Eigenvectors

Finding the eigenvalues and eigenvectors of a matrix (solving for the eigenstructure) is a huge pain in the rump. Solving for the eigenstructure of a matrix is also called "spectral decomposition," and the things we can do with the spectral decomposition are nothing short of miraculous. We will only worry ourselves about solving for the eigenstructure of square matrices, and most often we will only worry about symmetric matrices (in fact, SPSS will only solve for the eigenstructure of symmetric matrices without writing LOTS of code).

[A] Eigenvalues are denoted by the Greek letter λ, and they are also called "characteristic roots."

• We find λ by finding all possible solutions to the equation $\left| A - \lambda I \right| = 0$

• Let's take a small example where $A = \begin{bmatrix} 5 & 1 \\ 4 & 2 \end{bmatrix}$:

$\left| \begin{bmatrix} 5 & 1 \\ 4 & 2 \end{bmatrix} - \lambda I \right| = \begin{vmatrix} 5-\lambda & 1-0 \\ 4-0 & 2-\lambda \end{vmatrix} = 0$, so

$(5-\lambda)(2-\lambda) - (1)(4) = \lambda^2 - 7\lambda + 6 = 0$

There are two solutions to this quadratic: $\lambda = 6$ (because $6^2 - 7(6) + 6 = 0$) and $\lambda = 1$ (because $1^2 - 7(1) + 6 = 0$).

• One needn't pick these solutions out of thin air, because the general solution to quadratic equations of the form $ax^2 + bx + c = 0$ gives the two solutions $\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$.

• We often collect all of the eigenvalues, in descending order, in the main diagonal of a zero matrix: $\Lambda = \begin{bmatrix} 6 & 0 \\ 0 & 1 \end{bmatrix}$

• Eigenvalues may not seem very exciting now, but we will see the cool things that they can do after we figure out eigenvectors.

• In general, when A is a PxP matrix, there are P eigenvalues, but sometimes some of the solutions are equal to each other. If there are P unique eigenvalues, we say that the rank of A is P.

• If all of the eigenvalues of A are greater than zero, then we say that A is positive-definite. If all of the eigenvalues of A are greater than or equal to zero, then we say that A is positive-semi-definite. If all of the eigenvalues are negative, then we say that A is negative-definite. If all of the eigenvalues are less than or equal to zero, then we say that A is negative-semi-definite. In general, we hope to be dealing with symmetric, positive-definite matrices.

[B] Eigenvectors are denoted by v.

• We find v by finding all possible solutions to the equation Av − λv = 0.

• In this example, there are two solutions:

$v_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ because $\begin{bmatrix} 5 & 1 \\ 4 & 2 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 6 \\ 6 \end{bmatrix} = 6\begin{bmatrix} 1 \\ 1 \end{bmatrix}$

and $v_2 = \begin{bmatrix} -1 \\ 4 \end{bmatrix}$ because $\begin{bmatrix} 5 & 1 \\ 4 & 2 \end{bmatrix}\begin{bmatrix} -1 \\ 4 \end{bmatrix} = \begin{bmatrix} -1 \\ 4 \end{bmatrix} = 1\begin{bmatrix} -1 \\ 4 \end{bmatrix}$.

• Typically, we concatenate these two column vectors into a matrix: $V = \begin{bmatrix} v_1 & v_2 \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ 1 & 4 \end{bmatrix}$

• One really cool thing about eigenvectors is that they can easily be orthonormalized so that V′V = I and VV′ = I. We won't do that by hand, but we'll let the computer do it for us.

[C] We can do some amazing things with eigenvalues and eigenvectors:

• $A = \lambda_1 v_1 v_1' + \lambda_2 v_2 v_2'$

• More generally, $A = \sum_i \lambda_i v_i v_i'$, and it is in this sense that we call the process a spectral decomposition. It should not be surprising to find that the sum gets closer to the original matrix as we keep adding terms. Since the eigenvalues are ordered from largest to smallest, the discrepancy between the sum and the matrix gets smaller and smaller as we keep adding terms.

• The spectral decomposition is like Transporter Beam technology from Star Trek. We compute the eigenvalues and eigenvectors and then multiply them out; it's like disintegrating people at the origin, recording the molecular pattern, then reassembling the molecular pattern at the destination.

• $A = V\Lambda V'$

• $A^{-1} = V\Lambda^{-1}V'$, and we should note that since the eigenvalue matrix is diagonal, finding its inverse is trivial (we just take the inverse of each main diagonal element and leave all the zeroes alone).

• More generally, $A^k = V\Lambda^k V'$, where k is any real number.

• $\mathrm{Trace}(A) = \sum \lambda_i$

• $\left| A \right| = \prod \lambda_i$
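The notes point out that SPSS handles the eigenstructure of symmetric matrices easily, so here is a minimal sketch for a small symmetric matrix (the values are made up for illustration, since the 2x2 example above is not symmetric). CALL EIGEN returns the eigenvectors and the eigenvalues in descending order, and the identities above can be checked numerically:

MATRIX.
COMPUTE A = {5, 2; 2, 2}.                      /* hypothetical symmetric matrix */
CALL EIGEN(A, V, LAMBDA).                      /* V = eigenvectors, LAMBDA = eigenvalues */
COMPUTE RECON = V*MDIAG(LAMBDA)*T(V).          /* spectral decomposition: should reproduce A */
COMPUTE CHK1 = TRACE(A) - CSUM(LAMBDA).        /* should be 0: trace = sum of eigenvalues */
COMPUTE CHK2 = DET(A) - LAMBDA(1)*LAMBDA(2).   /* should be 0: determinant = product of eigenvalues */
PRINT RECON.
PRINT CHK1.
PRINT CHK2.
END MATRIX.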

§ 4 Statistical Matrix Applications

It's all well and good to define these unary and binary matrix operators, but our initial purpose in learning matrix algebra was to apply these techniques to statistical problems. Let us turn to computing the means, variances, and covariances of a univariate data set and then of a multivariate data set.

We will consider this data set:


COMPUTE X = {1, 1, 7; 2, 3, 2; 3, 5, 5; 4, 7, 4; 5, 2, 3; 6, 4, 6; 7, 6, 1; 8, 8, 8}.
COMPUTE X1 = X(:,1).
COMPUTE X2 = X(:,2).
COMPUTE X3 = X(:,3).

§ 4.1 Means

[A] We all remember how to compute means for a single variable (e.g., X1):

$\bar{X}_1 = \frac{\sum X_1}{n} = \frac{36}{8} = 4.5$

[B] To generalize computing means to matrix algebra, we'll use a row-vector with as many columns as there are rows in the data (cases) as a pre-multiplier to add up the observations:

COMPUTE N = nrow(X1).

COMPUTE rowunit = make(1,N,1.00).

COMPUTE SUMX1 = rowunit*X1.

COMPUTE meanx1 = (1/N)*SUMX1.

[C] We can use the unit row-vector to sum as many variables as there are in a data matrix the same way:

COMPUTE SUMX = rowunit*X.

COMPUTE MEANX = (1/N)*SUMX.

§ 4.2 Variances and Covariances

[A] We all remember how to compute the variance for a single variable (e.g., X1):

$S^2 = \frac{\sum (X_1 - \bar{X}_1)^2}{n - 1} = \frac{42}{8 - 1} = 6.0$

[B] We can extend this process to use matrix algebra instead. We will need to find the deviation scores, which means subtracting the mean from every row. Then, we simply take the inner-product of the deviation score vector with itself.


COMPUTE meanvec = T(rowunit)*4.5.

COMPUTE dev = X1 - meanvec.

COMPUTE SS = T(dev)*dev.

COMPUTE VAR = SS/(N-1).

PRINT VAR.

[C] We can easily extend these computations to multiple variables, and as a bonus, we get the covariances as well as the variances.

COMPUTE xbarm = T(rowunit)*MEANX.

COMPUTE dev = X - xbarm.

COMPUTE SSCPM = T(dev)*dev.

COMPUTE S = SSCPM/(N-1).

PRINT S.

§ 4.3 Standard Deviations

[A] We all remember that the standard deviation is the square-root of the variance:

$S = \sqrt{S^2} = \sqrt{6.0} = 2.45$

[B] All we need to find the standard deviation using matrices is to learn about the square-root function:

COMPUTE SD = SQRT(VAR).

[C] For multivariate data, we need to peel the main diagonal out of the variance-covariance matrix, then put it on the main diagonal of a zero matrix, then we can take the square root:

compute varvec = diag(S).

compute vardiag = mdiag(varvec).

compute sd = sqrt(vardiag).

§ 4.4 The Correlation Matrix

Although the variance-covariance matrix describes the degree to which a set of variables vary and co-vary together, it is unstandardized. We compute the correlation coefficient as a "standardized" covariance, and we can compute a whole matrix of correlation coefficients.

[A] We all remember the correlation coefficient formula between X1 and X2:

$r = \frac{\mathrm{cov}(X_1, X_2)}{(S_{X_1})(S_{X_2})}$


[B] We can extend this formula by standardizing the variance-covariance matrix. Essentially, we will need to divide each element by the standard deviation of the row-variable and by that of the column-variable. So, we'll find the inverse of the diagonal standard-deviation matrix. Premultiplying by this matrix will divide each element by the row-variable standard deviation, and postmultiplying by this matrix will divide each element by the column-variable standard deviation.

COMPUTE sdinv = INV(SD).

COMPUTE R = sdinv*S*sdinv.
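Putting the pieces of § 4 together, the whole sequence can be run as one block. This simply assembles the commands already shown (a sketch, assuming they are wrapped in the MATRIX-END MATRIX facility they presume):

MATRIX.
COMPUTE X = {1, 1, 7; 2, 3, 2; 3, 5, 5; 4, 7, 4; 5, 2, 3; 6, 4, 6; 7, 6, 1; 8, 8, 8}.
COMPUTE N = NROW(X).
COMPUTE ROWUNIT = MAKE(1, N, 1).
COMPUTE SUMX = ROWUNIT*X.
COMPUTE MEANX = (1/N)*SUMX.            /* 1x3 row-vector of means */
COMPUTE XBARM = T(ROWUNIT)*MEANX.
COMPUTE DEV = X - XBARM.               /* deviation scores */
COMPUTE SSCPM = T(DEV)*DEV.
COMPUTE S = SSCPM/(N - 1).             /* variance-covariance matrix */
COMPUTE VARVEC = DIAG(S).
COMPUTE VARDIAG = MDIAG(VARVEC).
COMPUTE SD = SQRT(VARDIAG).            /* diagonal matrix of standard deviations */
COMPUTE SDINV = INV(SD).
COMPUTE R = SDINV*S*SDINV.             /* correlation matrix */
PRINT MEANX.
PRINT S.
PRINT R.
END MATRIX.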


Exploratory Factor Analysis

§ 1 Definitional Issues

§ 1.1 The Homogeneous Subsets Definition

[A] Factor analysis (and especially exploratory factor analysis, or EFA) is often described as a method for uncovering coherent subsets of "items" within a single data set.

[B] Oftentimes, however, this definition is interpreted to mean that EFA will find homogeneous subsets of items, and this result is virtually never obtained.

[C] If we really are interested in homogeneous subsets in which the members of a subset are highly similar to each other and in which members of different subsets are highly dissimilar from each other, then we should employ clustering techniques rather than factor analysis. The inappropriate use of EFA or FA as a clustering technique is one of my major "pet peeves" and one to which we should attend in reading the literature that employs EFA. CFA, confirmatory factor analysis, is far less often abused in this manner.

§ 1.2 The Data Reduction Definition

[A] A more appropriate definition of FA and EFA is to describe them as methods for uncovering latent variables, called factors, that explain the relationships among observed variables.

[B] In this sense, we use EFA in hopes of finding a small number of factors that might explain the observed relationships in a data set with a much larger number of observed variables, or "items."

[C] Consider an example.

(1) If we think of developing a psychological instrument to measure parenting frustrations, we might develop a pool of 100 items that reflect different frustrations in being a parent.

(2) We would hope, however, that people don't really vary according to 100 different "types" of parenting frustration.

(3) In fact, if we were then to examine the content of these items, we would probably be able to generate a relatively smaller number of content "themes."

(4) EFA, then, represents an attempt to uncover a set of latent variables that would perform the numerical equivalent.

(5) If we’re lucky, these numerically derived factors will correspond nicely with content“themes” and demonstrate “simple structure”.


§ 2 Fundamental Equations

§ 2.1 Extraction

[A] Let us denote the correlation matrix of observed variables as R.

[B] Let us define Λ to be a diagonal matrix with elements λi equal to the ith eigenvalue of R. For convenience and to avoid ambiguity, let us order the eigenvalues from largest to smallest.

[C] Let us define V to be a matrix with columns, vi, equal to the ith eigenvector of R (corresponding to the ith eigenvalue).

[D] Because eigenvectors are "orthonormal", V′V = I and VV′ = I.

[E] The definition of eigenvectors and eigenvalues generates the following relationships.

(1) Λ = V ′RV

(2) R = V ΛV ′

[F] Let us define $A = V\sqrt{\Lambda}$, where

$\sqrt{\Lambda} = \begin{bmatrix} \sqrt{\lambda_1} & 0 & 0 & \dots & 0 \\ 0 & \sqrt{\lambda_2} & 0 & \dots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & \dots & \dots & 0 & \sqrt{\lambda_p} \end{bmatrix}$, so that $A = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1k} \\ a_{21} & a_{22} & \dots & a_{2k} \\ \vdots & \vdots & & \vdots \\ a_{k1} & a_{k2} & \dots & a_{kk} \end{bmatrix}$

[G] This definition, then, allows the following.

$R = V\Lambda V' = V\sqrt{\Lambda}\sqrt{\Lambda}V'$

$R = \left(V\sqrt{\Lambda}\right)\left(\sqrt{\Lambda}V'\right) = AA'$

[H] We note that A′A = Λ, which is a diagonal matrix of eigenvalues; but AA′ = R, which is a full-rank correlation matrix!

[I] Each element of A, denoted as aij, refers to the correlation of the ith item (row) with the jth component (column).

[J] The diagonality of the inner product of each loading vector suggests that each component extracted is independent of the remaining components. Thus, each successive component represents an optimal linear combination of observed variables that is independent of the previously extracted components.

[K] If we replace the m smallest eigenvalues with zeroes, the corresponding eigenvectors will then be effectively eliminated. If we denote this reduced-rank eigenvalue diagonal matrix as Λ∗, then we get an approximation of the correlation matrix:


$\sqrt{\Lambda^*} = \begin{bmatrix} \sqrt{\lambda_1} & 0 & 0 & \dots & 0 \\ 0 & \sqrt{\lambda_2} & 0 & \dots & 0 \\ 0 & 0 & 0 & \dots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & \dots & \dots & \dots & 0 \end{bmatrix}$, so $A^* = V\sqrt{\Lambda^*} = \begin{bmatrix} a_{11} & a_{12} & 0 & \dots & 0 \\ a_{21} & a_{22} & 0 & \dots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ a_{k1} & a_{k2} & 0 & \dots & 0 \end{bmatrix}$

[L] Thus, R ≈ A∗A∗′
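A minimal SPSS matrix sketch of this extraction step, assuming a correlation matrix R is already available (the values below are hypothetical, purely for illustration). CALL EIGEN supplies V and the eigenvalues; the loadings are A = V√Λ, and zeroing all but the largest eigenvalue gives the reduced-rank loadings A* with R ≈ A*A*′:

MATRIX.
COMPUTE R = {1.0, 0.5, 0.4; 0.5, 1.0, 0.3; 0.4, 0.3, 1.0}.   /* hypothetical correlation matrix */
CALL EIGEN(R, V, LAMBDA).                                     /* eigenvectors and eigenvalues of R */
COMPUTE A = V*SQRT(MDIAG(LAMBDA)).                            /* full loading matrix: A*T(A) = R */
COMPUTE LSTAR = {LAMBDA(1); 0; 0}.                            /* keep only the largest eigenvalue */
COMPUTE ASTAR = V*SQRT(MDIAG(LSTAR)).                         /* reduced-rank loadings */
COMPUTE RHAT = ASTAR*T(ASTAR).                                /* approximation to R */
PRINT A.
PRINT RHAT.
END MATRIX.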

§ 2.2 Principal Components Analysis (PCA)

[A] If we are dealing with real data, then there should be p unique, positive eigenvalues (keeping in mind that there are p observed variables).

[B] If, however, we create a submatrix of Λ and of V in which we retain only the largest m eigenvalues and their associated eigenvectors, then we obtain an approximation of R.

[C] Consider using only the largest eigenvalue.

[D] Consider using only the two largest eigenvalues.

[E] Consider using all but the smallest eigenvalues.

[F] Run SYNTAX examples in SPSS.

[G] Run GUI examples in SPSS.

§ 2.3 Rotation

We may impose any arbitrary rotation on the factor/component loadings, and the relationship between the observed correlation matrix and the "model" implied correlation matrix will remain constant.

[A] Orthogonal Rotation

Orthogonal rotations maintain the independence (orthogonality) of each component. Thus, each linear combination extracted will not be correlated with the others.

[B] Oblique Rotation

Oblique rotations do not maintain the independence (orthogonality) of each component. Thus, each linear combination extracted may be correlated with the others.


§ 2.4 Communalities and percent VAF

[A] If we retain all components (one for each item analyzed), all the components will account for all the observed variance of each and every item.

[B] $h_i^2$ is the communality of the ith item, which represents the proportion of variance in the ith item explained by the retained/extracted components.

[C] If all components are retained/extracted, then all communalities will be equal to 1.

[D] If fewer components are retained, then the communalities will be less than or equal to 1, with the value representing the proportion of variance in the item that can be explained by the retained components; higher communalities indicate items with less unique variance that cannot be explained.

[E] $h_i^2 = \sum_{j=1}^{m} a_{ij}^2$, which is the sum of squared loadings (summed across the ith row).

[F] $\lambda_j = \sum_{i=1}^{k} a_{ij}^2$, which is the sum of the squared loadings (summed down the jth column).

[G] Because retaining all components will result in $h_i^2 = 1.0$ for all items, we say that principal components analysis involves an analysis of all item variance. Each item is understood to have both true score variance and error variance as part of the total item variance. Because we recover and explain all of this total item variance, we say that principal components analysis uses all item variance rather than the "true score" variance (or common variance). We will revisit this issue in the next section on "True" factor analysis.

§ 3 True Factor Analysis

True factor analysis methods do not analyze the full (or total) item variance. Instead of attempting to explain the total variance of each item, we attempt to explain the common variance of each item (i.e., the variance that item "shares" with other variables). Because the definition of shared item variance is ambiguous and could utilize a number of definitions, we have a multitude of true factor analysis methods.

We typically use iterative methods to estimate the factor loadings.

[A] We start with a correlation or covariance matrix and replace the main diagonal elements with an estimate of the proportion of shared variance. We call this matrix the initial reduced correlation (or covariance) matrix.

[B] We compute the spectral decomposition of the reduced correlation (or covariance) matrix to obtain the first-iteration factor loading estimates.

[C] We use the first-iteration factor loading estimates to estimate the communalities of each item, based on an a priori determined number of factors to extract.


[D] These first-iteration communality estimates are then placed on the main diagonal of the initial reduced correlation (or covariance) matrix, creating a first-iteration reduced correlation matrix.

[E] The process moves on to a new iteration, using the reduced correlation matrix obtained in the previous iteration based on the previous iteration's communality estimates (which were based on the previous iteration's factor loading estimates).

[F] Iteration continues until successive factor loadings show minuscule changes; if the loading estimates do not change, then the communality estimates will not change; if the communality estimates do not change, then the reduced correlation (or covariance) matrix will not change; finally, if the reduced correlation matrix does not change, then the "new" loadings will not change. Further iteration is pointless and will not result in additional refinement of the loadings.

[G] Most analyses will converge in under 25 iterations (often within a few iterations).

§ 3.1 Principal Axis Factoring or Principal Factors

Replaces unities on the main diagonal with estimates of communalities based on principal components analysis and computes loadings in an iterative fashion as described above.

§ 3.2 Image Factoring

Uses multiple regression residuals to estimate common variance on the main diagonal of R and computes loadings in an iterative fashion as described above.

§ 3.3 Maximum Likelihood

Uses ML estimation theory to derive communality estimates directly and computes loadings in an iterative fashion as described above in an attempt to find the set of factor loadings for which the observed correlation matrix is most likely to have occurred.

§ 3.4 Alpha

Replaces unities on the main diagonal with internal consistency estimates and computes loadings in an iterative fashion as described above in an attempt to find a set of loadings for which the internal consistency is highest.


§ 3.5 Unweighted Least Squares

Finds a set of loadings that minimize the squared residual correlations.

§ 3.6 Generalized Least Squares

Finds a set of loadings that minimize the weighted squared residual correlations.


Bollen, Chapter 1

§ 1 Introduction

[A] Our previous statistical work has demonstrated an emphasis on modeling individual observations. In both regression and ANOVA, the error variance is $(Y - \hat{Y})^2$, which is clearly based on individual errors in prediction.

[B] We will need to recast this emphasis and conceptualization of individual predictions for individual cases more in terms of the pattern of associations represented by an entire sample that represents a larger population.

[C] If we think back to a sample or population covariance matrix, we understand that we cannot compute a covariance for a single person. If we attempt such a computation, we get a null matrix because our one observation is equal to the mean of that one observation, thus giving deviations of each variable equal to zero.

[D] Rather, we will attempt to create models to represent the population covariance matrix (Σ) based on a set of model parameters (the set is called θ). The structure of this model implies a covariance matrix that can be represented as a function of the parameter estimates, called Σ(θ). If our model fits a population perfectly, then Σ = Σ(θ).

[E] We will hold off on the specifics of the model implied covariance matrix, special cases, and examples until a bit later.

§ 2 Historical Background

[A] SEM/CSM has a long, often disputed, history, with separate but often converging lineages in psychology, sociology, biology, and economics.

[B] Sewell Wright, a biometrician, is often credited with the first work on path analysis, but his initial work looks more like factor analysis.

[C] Wright was apparently unaware of Spearman's work in 1904 on exploratory factor analysis.

[D] Many other statisticians in these four fields worked to develop latent variable methodologies, but these efforts were largely ignored by applied researchers in these fields, possibly due to their complex nature.

[E] With the advent and availability of "high speed" computers in the 1970s (my Palm Pilot has more compute power than these original computers), though, the landscape for these complex models changed dramatically.

[F] Joreskog, Keesling, and Wiley were instrumental in developing very general structural models and the notation required to describe them.


[G] LISREL 1 was then born, a creation of Joreskog and Sorbom.

[H] Bentler and Weeks (1980) and McArdle and McDonald (1984) published methods for structural equation models that were far different from the "JKW" methods and notation. This method and notation, sometimes called the "RAM" formulation of SEM/CSM, is implemented by EQS, RAMONA, and CALIS, among other programs for estimating these models.

[I] McDonald (1991, personal communication) has shown that the JKW (or LISREL) formulation and the RAM notation (and models) can each be shown to be a special case of the other; thus it seems that both methods for describing these models are essentially equivalent rather than super-ordinate.


Bollen, Chapter 7

§ 1 Confirmatory Factor Analysis: The Model

[A] The CFA model is also called a "measurement model" or "measurement sub-model" because it is a special case of covariance structure modeling, and we'll see later why it is a special case of a larger model.

[B] We can write the CFA model with a single matrix equation:

X = Λxξ + δ

[C] Let's consider the Brady Data we collected. We have roughly 30 observations on 12 variables. Keep in mind that we are going to represent the data "sideways." Let's also keep in mind that there are probably 3 factors.

$X_{p\times n} = \Lambda_{p\times 3}\,\xi_{3\times n} + \delta_{p\times n}$

$\begin{bmatrix} x_1 & \dots \\ x_2 & \dots \\ x_3 & \dots \\ x_4 & \dots \\ x_5 & \dots \\ x_6 & \dots \\ x_7 & \dots \\ x_8 & \dots \\ x_9 & \dots \\ x_{10} & \dots \\ x_{11} & \dots \\ x_{12} & \dots \end{bmatrix} = \begin{bmatrix} \lambda_{11} & 0 & 0 \\ \lambda_{21} & 0 & 0 \\ \lambda_{31} & 0 & 0 \\ \lambda_{41} & 0 & 0 \\ 0 & \lambda_{52} & 0 \\ 0 & \lambda_{62} & 0 \\ 0 & \lambda_{72} & 0 \\ 0 & \lambda_{82} & 0 \\ 0 & 0 & \lambda_{93} \\ 0 & 0 & \lambda_{10,3} \\ 0 & 0 & \lambda_{11,3} \\ 0 & 0 & \lambda_{12,3} \end{bmatrix} \begin{bmatrix} \xi_1 & \dots \\ \xi_2 & \dots \\ \xi_3 & \dots \end{bmatrix} + \begin{bmatrix} \delta_1 & \dots \\ \delta_2 & \dots \\ \delta_3 & \dots \\ \delta_4 & \dots \\ \delta_5 & \dots \\ \delta_6 & \dots \\ \delta_7 & \dots \\ \delta_8 & \dots \\ \delta_9 & \dots \\ \delta_{10} & \dots \\ \delta_{11} & \dots \\ \delta_{12} & \dots \end{bmatrix}$

[D] We need to make some assumptions and then apply "expected value" operators to this model to see how it applies to covariance structure modeling; in Bollen these details are left to Chapter 2.

(1) Expected values are just a fancy way of referring to a population mean (i.e., the value we expect over the long run).

E(X) = µX

i. The expected value of a constant is that constant: E(c) = c.

ii. E(a+X) = a+ E(X)


iii. E(bX) = bE(X)

iv. E(X + Y ) = E(X) + E(Y )

v. E(XY ) = E(X)E(Y ) IFF X and Y are independent.

(2) Just as the population mean of X is E(X), the population variance of X can be found as the second moment of X.

$\sigma_X^2 = E[(X - \mu_X)^2] = E[X^2 - 2X\mu_X + \mu_X^2] = E[X^2] - 2\mu_X E(X) + \mu_X^2 = E(X^2) - \mu_X^2 = E(X^2) - [E(X)]^2$

(3) Following this same line of thought, the covariance of two variables, X and Y, is a second moment:

$\mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY - X\mu_Y - Y\mu_X + \mu_X\mu_Y] = E[XY] - \mu_Y E(X) - \mu_X E(Y) + \mu_X\mu_Y = E(XY) - E(X)E(Y)$

(4) We will assume that E(X) = 0 for all observed variables, which essentially means that we have to subtract the mean from each observation, but doing so would be trivial.

i. This assumption means that cov(X, Y) = E(XY) − E(X)E(Y) = E(XY) − 0 = E(XY)

ii. Thus, we have a very easy way to denote the population variance-covariance matrix of all X variables when the (arbitrary) scaling has created means that are all zero: Σ = E(XX′)

iii. We can also assume that E(ξ) = 0 and E(δ) = 0. These assumptions are also no problem because ξ and δ are latent variables with no explicit scaling, so we're free to make whatever scaling constraints or assumptions we want.

(5) Let us also recall that, if two variables are independent of each other, their covariance will be zero. Thus, if X & Y are independent, then

cov(X, Y ) = E(XY )− E(X)E(Y ) = E(X)E(Y )− E(X)E(Y ) = 0

regardless of whether or not the variables are rescaled to have means of zero.

[E] The above formulations allow us to look at our model and play with the expected values of the model equations themselves. We'll assume arbitrary rescaling of all observed and latent variables to have zero means.

• Σ = E(XX ′)

• Our model, though, stipulates that X = Λξ + δ

• Thus, E(XX ′) = E[(Λξ + δ)(Λξ + δ)′]

• E(XX ′) = E[(Λξ + δ)(δ′ + ξ′Λ′)] = E[Λξξ′Λ′ + Λξδ′ + δξ′Λ′ + δδ′]

• E(XX ′) = E[Λξξ′Λ′] + E[Λξδ′] + E[δξ′Λ′] + E[δδ′]

• But, we need to keep in mind that Λ is a matrix of constants, which means this matrix can "factor out" of the expected values since E(bX) = bE(X).

E(XX ′) = ΛE(ξξ′)Λ′ + ΛE(ξδ′) + E(δξ′)Λ′ + E(δδ′)


• Let us assume that ξ and δ are independent of each other. This assumption seems eminently reasonable, practical, and (actually) essential. What do we mean by measurement error if it is not that component of an observed variable that cannot be explained by the latent factor we are modeling? If we are partitioning the variance of X into common and unique components, it is essential that there be no overlap in the variance of X attributable to the latent factor(s) on which it loads and the unique (i.e., measurement error) factors.

Thus, E(δξ′) = 0 = E(ξδ′) and

E(XX′) = ΛE(ξξ′)Λ′ + 0 + 0 + E(δδ′)

• We actually have another name for E(ξξ′) because ξ has a mean of zero: E(ξξ′) = cov(ξ) = Φ. Similarly, we have another name for E(δδ′) because δ has a mean of zero: E(δδ′) = cov(δ) = Θδ.

• So, ultimately, E(XX′) = ΛΦΛ′ + Θδ, which we call the "model implied variance-covariance matrix of X", and this matrix is denoted as Σ(θ). The θ represents the entire collection of parameters in Λ, Φ, and Θδ.

[F] Let’s take a moment to reflect on how much we have impressed ourselves!
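The algebra in [E] is easy to check numerically. A minimal SPSS matrix sketch for a hypothetical 3-indicator, 1-factor model (all parameter values are made up for illustration): the model-implied covariance matrix is just ΛΦΛ′ + Θδ.

MATRIX.
COMPUTE LX = {1.0; 0.8; 0.6}.                /* hypothetical loadings (first loading fixed to 1) */
COMPUTE PHI = {2.0}.                          /* hypothetical factor variance */
COMPUTE TD = MDIAG({0.5; 0.4; 0.3}).          /* hypothetical measurement error variances */
COMPUTE SIGMA = LX*PHI*T(LX) + TD.            /* model-implied variance-covariance matrix */
PRINT SIGMA.
END MATRIX.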

§ 2 Estimating Model Parameters

Here's the real deal: We want to find a set of sample estimates ($\hat{\theta}$) for this collection of parameters (θ), such that the model-implied variance-covariance matrix and the observed variance-covariance matrix are as close as possible.

• The sample variance-covariance matrix is an approximation of the population variance-covariance matrix: S ≈ Σ

• If the model is "perfect" then the model-implied variance-covariance matrix should be exactly equal to the population variance-covariance matrix: Σ(θ) = Σ

• The parameter estimates may not be perfect, so $\Sigma(\hat{\theta}) \approx \Sigma(\theta)$

• Thus, $S \approx \Sigma = \Sigma(\theta) \approx \Sigma(\hat{\theta})$

[A] But, exactly how do we go about finding $\hat{\theta}$?

(1) First, if we set S = Σ(θ), then we have a system of equations in a set of unknowns. Specifically, we have $\frac{P(P+1)}{2}$ equations in t unknowns, assuming that t represents the number of free parameters to be estimated (a free parameter is one that has not been constrained, i.e., like factor loadings constrained to be equal to zero).


(2) One approach is to attempt to solve the system of equations, but we have a non-linear system, so this task is not always easy. Let us consider a simple three-indicator model for one latent factor.

i. $X = \Lambda\xi + \delta$

$\begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = \begin{bmatrix} \lambda_{11} \\ \lambda_{21} \\ \lambda_{31} \end{bmatrix} \begin{bmatrix} \xi_1 \end{bmatrix} + \begin{bmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \end{bmatrix}$

ii. We have to make at least one assumption here, and we will constrain $\lambda_{11} = 1.0000$ so that the latent variable will have a natural scale. We'll see later why this constraint is important.

$\begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = \begin{bmatrix} 1.0 \\ \lambda_{21} \\ \lambda_{31} \end{bmatrix} \begin{bmatrix} \xi_1 \end{bmatrix} + \begin{bmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \end{bmatrix}$

iii. Σ(θ) = E(XX ′) = E[(Λξ + δ)(Λξ + δ)′] = ΛΦΛ′ + Θδ

iv. $\Sigma(\theta) = \begin{bmatrix} \phi_{11} + \Theta_{\delta 11} & \lambda_{21}\phi_{11} & \lambda_{31}\phi_{11} \\ \lambda_{21}\phi_{11} & \lambda_{21}^2\phi_{11} + \Theta_{\delta 22} & \lambda_{21}\lambda_{31}\phi_{11} \\ \lambda_{31}\phi_{11} & \lambda_{21}\lambda_{31}\phi_{11} & \lambda_{31}^2\phi_{11} + \Theta_{\delta 33} \end{bmatrix}$

v. Thus, we have six equations in six unknowns:

$S_{11} = \phi_{11} + \Theta_{\delta 11}$
$S_{21} = \lambda_{21}\phi_{11}$
$S_{22} = \lambda_{21}^2\phi_{11} + \Theta_{\delta 22}$
$S_{31} = \lambda_{31}\phi_{11}$
$S_{32} = \lambda_{21}\lambda_{31}\phi_{11}$
$S_{33} = \lambda_{31}^2\phi_{11} + \Theta_{\delta 33}$

vi. We need to solve for each parameter as a function of observed variances or covariances. Or, if we can solve each subsequent parameter as a function of observed variances, observed covariances, or previously solved-for parameters, we can solve the whole system!

• $\lambda_{21} = \frac{S_{12}}{\phi_{11}}$ and $\lambda_{31} = \frac{S_{13}}{\phi_{11}}$ and $\phi_{11} = \frac{S_{23}}{\lambda_{21}\lambda_{31}}$

• None of these three manipulations gets us even one parameter as a function of only variances and covariances, but let's do some back-substitutions:

• $\phi_{11} = \frac{S_{23}}{\lambda_{21}\lambda_{31}} = \frac{S_{23}}{\frac{S_{12}}{\phi_{11}}\frac{S_{13}}{\phi_{11}}}$

This equation, though, only has one unknown, $\phi_{11}$, so we can solve this equation!!

$\phi_{11} = \frac{S_{23}\phi_{11}^2}{S_{12}S_{13}}$

Dividing both sides by $\phi_{11}^2$, we get $\frac{1}{\phi_{11}} = \frac{S_{23}}{S_{12}S_{13}}$, so $\phi_{11} = \frac{S_{12}S_{13}}{S_{23}}$

• Now that we have solved for $\phi_{11}$, we can back-substitute that in the previous equations to solve for the two λ parameters.

$\lambda_{21} = \frac{S_{12}}{\phi_{11}} = \frac{S_{12}}{\frac{S_{12}S_{13}}{S_{23}}} = \frac{S_{23}}{S_{13}}$

$\lambda_{31} = \frac{S_{13}}{\phi_{11}} = \frac{S_{13}}{\frac{S_{12}S_{13}}{S_{23}}} = \frac{S_{23}}{S_{12}}$

• The only task remaining is one final set of back-substitutions and subtractions to find the measurement error variance estimates:

$\Theta_{\delta 11} = S_{11} - \phi_{11} = S_{11} - \frac{S_{12}S_{13}}{S_{23}}$

$\Theta_{\delta 22} = S_{22} - \lambda_{21}^2\phi_{11} = S_{22} - \frac{S_{23}^2}{S_{13}^2}\left(\frac{S_{12}S_{13}}{S_{23}}\right)$

$\Theta_{\delta 33} = S_{33} - \lambda_{31}^2\phi_{11} = S_{33} - \frac{S_{23}^2}{S_{12}^2}\left(\frac{S_{12}S_{13}}{S_{23}}\right)$
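These closed-form solutions can be checked numerically. A sketch with a hypothetical observed covariance matrix S (the values are made up; with three indicators the model is exactly identified, so the estimates reproduce S exactly):

MATRIX.
COMPUTE S = {1.50, 0.48, 0.36; 0.48, 1.20, 0.24; 0.36, 0.24, 1.10}.   /* hypothetical S */
COMPUTE PHI11 = S(1,2)*S(1,3)/S(2,3).         /* = S12*S13/S23 */
COMPUTE L21 = S(2,3)/S(1,3).                  /* = S23/S13 */
COMPUTE L31 = S(2,3)/S(1,2).                  /* = S23/S12 */
COMPUTE TD11 = S(1,1) - PHI11.
COMPUTE TD22 = S(2,2) - L21*L21*PHI11.
COMPUTE TD33 = S(3,3) - L31*L31*PHI11.
COMPUTE EST = {PHI11; L21; L31; TD11; TD22; TD33}.
PRINT EST.
END MATRIX.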

[B] Let's consider some properties of the solution we worked out for this three-indicator, one-factor model with no correlated measurement error.

(1) There was exactly one solution that worked perfectly, in that we did not get two different equations for a single parameter that disagree.

(2) If we isolated the parameters and solved the system in a different order, we would get exactly the same solution.

(3) This situation is not entirely unexpected because we have just as many unknowns as equations. If we have a linear set of equations, we know that the only condition for guaranteeing a unique solution is that the numbers of unknowns and equations are equal. Because we actually have non-linearity (some of the parameters are squared), we are not guaranteed such a result simply because of the equal number of unknowns and equations.

(4) What this boils down to, though, is that we have df = 0, and we call the system of equations exactly determined.

(5) If the number of unknowns exceeds the number of equations, then we can find an infinite number of solutions that will fit perfectly, a situation we call an under-determined system of equations.

(6) If the number of equations exceeds the number of unknowns, then we call the system over-determined.

(7) When we translate the over-, exactly-, and under-determined status to models, because of their non-linearity, we cannot simply go by the number of equations and the number of unknowns. To emphasize this difference, we introduce the terminology "parameter identification" and "model identification."

i. Let us first consider parameter identification.

• An identified parameter is one for which our system of equations yields at least one solution for the parameter as a function of observed variances, observed covariances, or other identified parameters.

• We say the parameter is over-identified if there is more than one such unique solution.

• We say that the parameter is exactly-identified if there is exactly one unique solution for that parameter.

• An un-identified (or non-identified or under-identified) parameter is one for which there are no unique solutions (i.e., there is at least one other un-identified parameter in any and all "solutions").

ii. Let us now consider entire models.

• An identified model is one for which all parameters are identified.

• An over-identified model has at least one parameter that is over-identified.

• An exactly-identified model is one for which all parameters are exactly identified.

• An unidentified (or under-identified or non-identified) model is one for which at least one parameter is not identified.

(8) This issue of model identification is crucially important. It is essential that we only consider estimating models that are identified (and hopefully over-identified). Models that are exactly identified will always fit the data perfectly! Thus, to really test a model (and hence, to have falsifiability, an important principle in empirical science), we really need to have over-identified models.

It is truly amazing how many researchers use CSM/SEM without establishing (or even asserting without evidence) that their models are identified. We shall crucify anyone who publishes a non-identified model and publicly humiliate them to the full extent of the law!!!!!

(9) Identification Rules for CFA (Bollen, page 242)

i. For a model to be identified, all necessary rules must be met and (1) the model passes at least one sufficient rule or (2) matrix algebra reveals at least one unique solution for each parameter as a function of observed variances, covariances, or other identified parameters or (3) empirical/local identification holds (we'll address that topic later).

ii. Necessary Rules

All identified models should pass all necessary rules, but unidentified models may also pass these rules.

A. t-rule

$t \leq \frac{q(q+1)}{2}$, where t = number of "free" parameters (i.e., number of parameters freely estimated rather than constrained to a constant value, like zero or one).

B. Latent Variable Scaling Rule

All latent variables must have their "scale" set either by having a constant value for the variance parameter (i.e., setting $\phi_{ii}$ = constant, typically 1.000 for convenience) or by having the loading of one indicator set to a constant (i.e., setting $\lambda_{ij}$ = constant, typically 1.000 for convenience). Thus, for all $\xi_i$, either $\phi_{ii}$ or one $\lambda_{ij}$ will be fixed to a constant value. Either of these two approaches can be used exclusively, or they may be combined within the same model.


iii. Sufficient Rules

A. 2 Indicator Rule — All of the following conditions must be met to pass this rule.

• There are at least two indicators for each factor.

• There must be more than one factor.

• Θδ must be diagonal (thus, measurement errors must not be allowed to correlate).

• The factor complexity of each indicator must be one – that is, each indicator must load on one and only one factor (i.e., there is only one non-zero element per row of Λ).

• There is at least one non-zero, off-diagonal element in every row (and hence every column) of Φ. Thus, if all elements of Φ are free, rather than constrained to zero, this condition will be met (Bollen suggests this case as a separate rule, but it is a special case of this more general condition).

B. 3 Indicator Rule – All of the following conditions must be met to pass this rule.

• There are at least three indicators for each factor.

• Θδ must be diagonal (thus, measurement errors must not be allowed to correlate).

• The factor complexity of each indicator must be one – that is, each indicator must load on one and only one factor (i.e., there is only one non-zero element per row of Λ).

§ 3 Practical Model Estimation

In practice, performing non-linear matrix algebra to solve for parameters is not feasible for two reasons. First, the equations are non-linear, which makes them especially ugly for anything except very simple models. Second, when models are over-identified, several solutions for a given parameter may result in different parameter values, creating ambiguity about which value to use. Thus, we need a broader perspective on model estimation and a practical approach to estimating parameters.

[A] Fit Functions

Our general approach is to define an objective fit function to describe the discrepancy between the observed covariance matrix and the model implied covariance matrix:

$F = \frac{1}{2}\mathrm{tr}\left\{[S - \Sigma(\theta)]W^{-1}[S - \Sigma(\theta)]\right\}$

Depending on our choice for the weight matrix W, we can obtain different types of estimates, with different properties. Regardless of our choice of W, the goal is to find a set of estimates that minimize this fit function, making the function a measure of the "badness" of fit rather than "goodness" of fit, since larger numbers indicate a higher degree of discrepancy between the model and the data.


(1) W = I — Setting the weight matrix to the identity matrix essentially weights all variance discrepancies and all covariance discrepancies equally, which essentially means treating these differences between S and Σ(θ) as absolute rather than relative. This approach is called "Unweighted Least Squares" or ULS because minimizing the fit function yields a set of estimates for which the sum of squared discrepancies is the smallest over all possible sets of estimates. This approach is not used very often. Simplifying the general fit function yields

$F_{ULS} = \frac{1}{2}\mathrm{tr}\left\{[S - \Sigma(\theta)]^2\right\}$

(2) W = Σ(θ) — Setting the weight matrix to the model implied covariance matrix essentially weights the discrepancies relatively rather than absolutely. This approach is called "Maximum Likelihood" or ML estimation because the resulting parameter estimates maximize the likelihood of observing these particular data points. This approach is very frequently used. Simplifying the general fit function yields

$F_{ML} = \log\left|\Sigma(\theta)\right| + \mathrm{tr}\left[S\Sigma^{-1}(\theta)\right] - \log\left|S\right| - q$, where q = number of observed variables.

(3) W = S — Setting the weight matrix to the sample covariance matrix essentially weights the discrepancies relatively rather than absolutely, like the ML approach. This approach is called "Generalized Least Squares" or GLS estimation because we are still trying to minimize squared discrepancies, but we are weighting these squared residuals based on the observed covariance matrix. This approach, though, is also an "asymptotic" maximum likelihood approach because, as the sample size becomes higher and higher, the resulting parameter estimates result in higher and higher likelihoods for observing these particular data points. This approach is also frequently used, especially because these estimates are relatively robust for non-normally distributed data. Simplifying the general fit function yields

$F_{GLS} = \frac{1}{2}\mathrm{tr}\left\{[I - \Sigma(\theta)S^{-1}]^2\right\}$
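To make the fit functions concrete, here is a sketch that evaluates F_ULS and F_ML for a hypothetical observed S and a hypothetical model-implied Σ(θ) (both matrices are made up for illustration; q is the number of observed variables):

MATRIX.
COMPUTE S = {1.50, 0.48, 0.36; 0.48, 1.20, 0.24; 0.36, 0.24, 1.10}.       /* hypothetical observed S */
COMPUTE SIGMA = {1.45, 0.50, 0.35; 0.50, 1.25, 0.25; 0.35, 0.25, 1.05}.   /* hypothetical implied Sigma(theta) */
COMPUTE Q = NCOL(S).
COMPUTE FULS = 0.5*TRACE((S - SIGMA)*(S - SIGMA)).
COMPUTE FML = LN(DET(SIGMA)) + TRACE(S*INV(SIGMA)) - LN(DET(S)) - Q.
PRINT FULS.
PRINT FML.
END MATRIX.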

[B] Iterative Estimation

Regardless of which approach we select for our weight matrix in the general fit function, the only reasonable approach to minimizing the fit function is to use an iterative approach. When we talk about estimating the parameters, we typically lump them all into a single vector, $\hat{\theta}$, to refer to all parameter estimates conveniently.

• Select an initial "guess" for each parameter (called a starting value). These starting values are denoted as $\hat{\theta}_0$.

• Use calculus to determine how to adjust each estimate (higher, lower, or no change) to increase the fit (thus decreasing the fit function). The first set of updates is found using a weight matrix and a gradient:

$\hat{\theta}_1 = \hat{\theta}_0 - C_0 g_0$, where $C = \left[\frac{\partial^2 F}{\partial\hat{\theta}\,\partial\hat{\theta}'}\right]^{-1}$ and $g = \left[\frac{\partial F}{\partial\hat{\theta}}\right]$.

The gradient tells us the direction to adjust, and C tells us how much to adjust. It is important to note that any or all of the parameters may change in this process, called the first iteration.

• Continue to adjust the parameter estimates in further iterations, where the estimates in iteration i + 1 are based on the previous iteration (i):

$\hat{\theta}_{i+1} = \hat{\theta}_i - C_i g_i$


• Iterations stop when the parameter estimates converge, that is, when the total absolute parameter differences between the current and previous iteration are less than an arbitrary tolerance (typically 0.000001).

• Models that are identified typically will converge in a reasonable number of iterations given a sufficiently large sample size and starting values that are sufficiently close to the "true" values in the population. It is important to keep in mind that convergence does not imply identification. Simply because the algorithm converges and stops does not mean that there is at least one unique solution for each parameter. It is possible that a non-identified model will converge to one of an infinite number of possible solutions. Many, many, many researchers labor under this misconception and assume that convergence implies identification.

[C] Starting Values

Although the estimation algorithms have theorems to prove that identified and correctly specified models with very large samples will converge eventually, it is important to have “good” starting values so that we will converge relatively quickly. We could, though, just use random starting values, but doing so increases the number of iterations required and increases the likelihood of something strange happening during estimation (e.g., resulting in a negative variance estimate or a covariance that translates to a correlation greater than 1 in absolute value).

(1) One strategy would be to use previous research to make logical choices for the starting values, but we don’t always have previous research or clear theory.

(2) A more general strategy, though, is to use a non-iterative method to estimate each parameter. Obviously we don’t want to use an iterative method to start another iterative method, or we might never finish our estimation procedures.

The classic non-iterative method is called the “Instrumental-Variables” approach (or IV estimates).

i. In this approach, we partition the observed variables into three blocks:

• Block A = the reference indicator for a given factor (used to identify the latent variable scale)

• Block B = one non-reference indicator

• Block C = remaining indicators not included in A or B

ii. We can partition the variances and covariances in S.

• SAC = S′CA = the covariances between the variable in block A and the variables in block C.

• SCC = the variances and covariances of the variables in block C.

• SCB = the covariances between the variable in block B and the variables in block C.

iii. To find the loading for the variable in block B, we use this equation: λ̂B = (SAC SCC⁻¹ SCA)⁻¹ SAC SCC⁻¹ SCB

iv. To find the other loadings, we change the variable we place in blocks B and C, and then we use the above formula.


v. To find Θδ, we need to use elementwise matrix multiplication, denoted by ×, where we just multiply corresponding elements of each matrix (requiring that the matrices have exactly the same dimensions to be conformable for elementwise multiplication): diag(Θδ) = (I − D × D)⁻¹g, where D = Λ(Λ′Λ)⁻¹Λ′ and g = diag(S − DSD).

vi. To find Φ, we use the following formula (steps v and vi are sketched in code just below): Φ = (Λ′Λ)⁻¹Λ′(S − Θδ)Λ(Λ′Λ)⁻¹
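Here is a small numpy sketch of the closed-form expressions in steps v and vi, assuming the loading matrix Λ has already been filled in (reference loadings fixed to 1, the rest from the formula in step iii) and S is the sample covariance matrix; the function name and setup are illustrative only.

    import numpy as np

    def iv_start_values(Lam, S):
        # D = Lambda (Lambda'Lambda)^-1 Lambda'
        LtL_inv = np.linalg.inv(Lam.T @ Lam)
        D = Lam @ LtL_inv @ Lam.T
        # g = diag(S - D S D) and diag(Theta_delta) = (I - D x D)^-1 g,
        # where "x" denotes elementwise multiplication
        g = np.diag(S - D @ S @ D)
        theta_diag = np.linalg.solve(np.eye(S.shape[0]) - D * D, g)
        Theta_delta = np.diag(theta_diag)
        # Phi = (Lambda'Lambda)^-1 Lambda' (S - Theta_delta) Lambda (Lambda'Lambda)^-1
        Phi = LtL_inv @ Lam.T @ (S - Theta_delta) @ Lam @ LtL_inv
        return Theta_delta, Phi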

§ 4 LISREL and CFA

We will approach LISREL mostly through syntax because the GUI is awkward, clunky, and does not lead to conceptual clarity for issues like identification.

See LISREL examples.


Bollen, Chapter 4

In Chapter 4, we cover path analysis models. These models do not focus on latent variables, as did the CFA models discussed previously. In this chapter, we assume that each construct measured has only a single indicator. With only a single indicator or manifest variable per construct, it is not possible to estimate error variance directly. How can we partition the variance of a manifest (observed) variable into common variance and unique variance when there are no other indicators believed to reflect the same, common construct?

§ 1 Model Specification

We can specify path analytic models either with implicit or explicit representations of there being a single latent variable (construct) measured by each indicator (manifest variable) in the model. These trivial relationships (the ΛX parameter matrix is a constant matrix equal to the identity matrix) are not typically included explicitly in path diagrams as a matter of parsimony, clarity, and space.

[see class notes for diagrams]

What’s new about these models is the fact that we are postulating predictive relationships among the variables and classifying variables as exogenous (determined by factors/constructs/variables OUTSIDE the model system) or endogenous (determined by factors/constructs/variables INSIDE the model system).

The effects of exogenous variables on endogenous variables are denoted by Γ, and the effects of endogenous variables on other endogenous variables are denoted by β; there are no effects of exogenous variables on exogenous variables, because if there were a predictive, directed effect, then the variable so predicted would no longer be exogenous.

Φ, the variance-covariance matrix of the ξ variables, becomes the variance-covariance matrix of the X variables because each X variable is constrained to be exactly equal to its corresponding ξ variable by the fixed values of “1” on the main diagonal of ΛX.

For endogenous variables, we must introduce a new latent variable, ζ, which represents the error associated with predicting the endogenous variables. Any time we predict something (like in CFA with factors predicting their indicators), we have to have a variable representing the unexplained variance in addition to that which is predicted. Each endogenous variable, by virtue of being predicted by a variable within the model system, has a corresponding ζ error term, called a model specification error. This terminology reflects the idea that, in addition to random error, the endogenous variables’ error terms reflect predictive effects not being specified correctly (including effects that should be absent, excluding effects that should be present, or excluding whole variables from the model that are necessary to predict accurately).

Y = ΓX + βY + ζ


We need to get the Y on one side of the equation only:

Y = ΓX + βY + ζ

Y − βY = ΓX + ζ

(I − β)Y = ΓX + ζ

(I − β)−1(I − β)Y = (I − β)−1(ΓX + ζ)

Y = (I − β)−1(ΓX + ζ)

§ 2 Implied Covariance Matrix

We have a few assumptions to make before we can determine the model-implied variance-covariance matrix. First, we assume that E(Xζ′) = 0 (which implies that E(ζX′) is also zero). Second, we define E(ζζ′) = Ψ, the variance-covariance matrix of the model specification errors.

Next we consider that the observed variance-covariance matrix and the model-implied variance-covariance matrix can be partitioned into four sub-matrices. We will deal with each in turn, with less hassle than dealing with all of them at once.

    [ Syy′   Syx′ ]     [ Σ(θ)yy′   Σ(θ)yx′ ]
    [ Sxy′   Sxx′ ]  ≈  [ Σ(θ)xy′   Σ(θ)xx′ ]

Working first with the lower-right sub-matrix (because it is the easiest): Σ(θ)xx′ = E(xx′) = Φ

By definition, the expected value of xx′ is the population variance-covariance matrix of the exogenous variables, which is defined to be Φ.

Working next with the upper-right sub-matrix:

Σ(θ)yx′ = E(YX′)
        = E[(I − β)⁻¹(ΓX + ζ)X′]
        = (I − β)⁻¹ E[ΓXX′ + ζX′]
        = (I − β)⁻¹ [Γ E(XX′) + E(ζX′)]
        = (I − β)⁻¹ (ΓΦ + 0)
        = (I − β)⁻¹ ΓΦ

Working next with the lower-left sub-matrix:


Σ(θ)xy′ = E(XY′)
        = E[X{(I − β)⁻¹(ΓX + ζ)}′]
        = E[X(ΓX + ζ)′((I − β)⁻¹)′]
        = E[X(ζ′ + X′Γ′)] ((I − β)⁻¹)′
        = E[Xζ′ + XX′Γ′] ((I − β)⁻¹)′
        = [E(Xζ′) + E(XX′)Γ′] ((I − β)⁻¹)′
        = [0 + ΦΓ′] ((I − β)⁻¹)′
        = ΦΓ′((I − β)⁻¹)′

This is a relief, since the last two sub-matrices should be the transposes of each other (and they are).

Finally, we must deal with the upper-left sub-matrix:

Σ(θ)yy′ = E(YY′)
        = E[{(I − β)⁻¹(ΓX + ζ)}{(I − β)⁻¹(ΓX + ζ)}′]
        = E[(I − β)⁻¹(ΓX + ζ)(ΓX + ζ)′((I − β)⁻¹)′]
        = (I − β)⁻¹ E[(ΓX + ζ)(ζ′ + X′Γ′)] ((I − β)⁻¹)′
        = (I − β)⁻¹ [E(ΓXX′Γ′) + E(ΓXζ′) + E(ζX′Γ′) + E(ζζ′)] ((I − β)⁻¹)′
        = (I − β)⁻¹ [Γ E(XX′)Γ′ + E(ζζ′)] ((I − β)⁻¹)′
        = (I − β)⁻¹ [ΓΦΓ′ + Ψ] ((I − β)⁻¹)′

So, let’s put the whole thing together:

Σ(θ) =

    [ Σ(θ)yy′   Σ(θ)yx′ ]     [ (I − β)⁻¹[ΓΦΓ′ + Ψ]((I − β)⁻¹)′    (I − β)⁻¹ΓΦ ]
    [ Σ(θ)xy′   Σ(θ)xx′ ]  =  [ ΦΓ′((I − β)⁻¹)′                     Φ           ]  ≈  S
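As a quick numerical check of this structure, here is a short sketch for a hypothetical model with two exogenous and two endogenous variables (all parameter values are made up for illustration); the assembled Σ(θ) comes out symmetric, as it must.

    import numpy as np

    Gamma = np.array([[0.5, 0.3],
                      [0.0, 0.4]])       # effects of x on y
    beta  = np.array([[0.0, 0.0],
                      [0.6, 0.0]])       # y1 -> y2
    Phi   = np.array([[1.0, 0.3],
                      [0.3, 1.5]])       # cov(x)
    Psi   = np.diag([0.5, 0.4])          # cov(zeta)

    B_inv = np.linalg.inv(np.eye(2) - beta)

    Sigma_yy = B_inv @ (Gamma @ Phi @ Gamma.T + Psi) @ B_inv.T
    Sigma_yx = B_inv @ Gamma @ Phi
    Sigma_xy = Phi @ Gamma.T @ B_inv.T   # transpose of Sigma_yx
    Sigma_xx = Phi

    Sigma = np.block([[Sigma_yy, Sigma_yx],
                      [Sigma_xy, Sigma_xx]])
    print(Sigma)                         # symmetric, as expected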

§ 3 Identification

If we can find at least one unique solution for each parameter using the equations implied above, then all parameters are identified, and the model as a whole is identified.

The system of equations is very nonlinear at this point, and the (I − β)⁻¹ term doesn’t help much. Fortunately, there are some identification rules that help us out.


§ 3.1 Necessary Rules

[A] t-rule

This rule is the same as it was for CFA: t = the number of parameters, and t must be less than or equal to the number of unique elements in the observed variance-covariance matrix. That is,

t ≤ (p + q)(p + q + 1)/2

Another way of stating this rule is that the df must be non-negative, as the df is equal to the right-hand side above minus t.
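For instance, as an illustrative count (not an example from Bollen): with p = 2 endogenous and q = 2 exogenous observed variables, there are (2 + 2)(2 + 2 + 1)/2 = 10 unique elements in S. A model that frees all four elements of Γ, one element of β, the three unique elements of Φ, and a diagonal Ψ (two elements) has t = 4 + 1 + 3 + 2 = 10, so df = 10 − 10 = 0 and the t-rule is met, though just barely.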

[B] Order Condition

The order condition is necessary but not sufficient, and it applies only to models in which Ψ is free and unrestricted.

• Ψ must be unrestricted, with all elements freely estimated

• All Xi are uncorrelated with all ζj

• (I − B) is nonsingular (i.e., its determinant is not equal to zero).

• All equations must be identified for the model to be identified, and an equation is identified when the number of variables excluded from that equation is at least p − 1.

• See Bollen, page 100 for examples.

[C] Rank Condition

The Rank Condition is both necessary and sufficient for all models in which Ψ is unrestricted.

• Ψ must be unrestricted, with all elements freely estimated

• All Xi are uncorrelated with all ζj

• (I − B) is nonsingular (i.e., its determinant is not equal to zero).

• Create C = [(I − B) | −Γ]

• Create Ci for each equation (row) by deleting all columns of C that do not have zeros in the ith row of C.

• All equations must be identified for the model to be identified, and the ith equation is identified when the rank of Ci = p − 1.

• See Bollen, page 101 for examples; a small numeric sketch of this bookkeeping also follows below.
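Here is a small numeric sketch of that bookkeeping for the two-equation model y1 = γ11 x1 + ζ1 and y2 = β21 y1 + γ22 x2 + ζ2 (the model and parameter values are made up; the free parameters are simply given arbitrary nonzero placeholder values, which is how the rank is usually evaluated in practice).

    import numpy as np

    p = 2
    B     = np.array([[0.0, 0.0],
                      [0.7, 0.0]])      # beta21 free; placeholder value
    Gamma = np.array([[0.5, 0.0],
                      [0.0, 0.4]])      # gamma11 and gamma22 free; placeholder values

    C = np.hstack([np.eye(p) - B, -Gamma])   # C = [(I - B) | -Gamma]

    for i in range(p):
        keep = np.isclose(C[i], 0.0)         # keep only columns with zeros in row i
        C_i = C[:, keep]
        rank = np.linalg.matrix_rank(C_i)
        print(f"equation {i + 1}: rank(C_i) = {rank}; identified: {rank == p - 1}")

Both equations print rank 1 here, which equals p − 1, so each equation (and thus the model) passes the rank condition.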

§ 3.2 Sufficient Rules

[A] Null B rule

The requirement of this sufficient but not necessary rule is that B is a null matrix (i.e., all elements are equal to zero). It should be obvious that this rule is not terribly helpful. We might also call this rule the “multivariate, multiple linear regression” rule because a path analysis in which Γ is unconstrained but B is constrained to be null essentially describes multivariate multiple regression.

[B] Recursive Rule

The recursive rule is sufficient but not necessary and has the following conditions and requirements.

• Ψ must be diagonal

• B must be lower-triangular (or must be transformable to a lower-triangular form by re-ordering the subscripted order of the endogenous variables).

§ 4 Estimation

§ 5 Further Topics

§ 6 LISREL Examples


Bollen, Chapter 8

Combining Latent Variable and Measurement Models

In previous chapters, we looked just at measurement models (i.e., CFA) or just at structural models (i.e., Path Analysis). Now we will consider models for latent variables that include multiple indicators for each latent variable and directed effects of latent variables on other latent variables (i.e., combining Path Analysis and CFA).

§ 1 Structural Sub-Model

Thus, our first goal is to reformulate the central equations from path analysis to specify relationships between latent variables rather than between observed variables.

The path analysis matrix equation is given by

Y = βY + ΓX + ζ.

The latent exogenous variables are represented as ξ, and the latent endogenous variables are represented as η.

Thus, the directed relationships (i.e., those relationships consistent with a causal relationship) can be reformulated as

η = βη + Γξ + ζ.

We have a similar issue with this equation that we do for path analysis in that η appears on both sides of the equation. Thus, we have the same solution to find a reduced form:

η = βη + Γξ + ζ

η − βη = Γξ + ζ

(I − β)η = Γξ + ζ

(I − β)−1(I − β)η = (I − β)−1(Γξ + ζ)

η = (I − β)−1(Γξ + ζ)

§ 2 Measurement Sub-Models

Now, for the exogenous latent variables, we need to consider the equations that specify the relationships between these latent variables and the observed indicators of these constructs. This measurement model is exactly the same as our CFA equations (with only minor alteration):


X = ΛXξ + δ

(note that we subscript the Λ matrix to identify the loadings as loadings for the exogenous constructs).

We need a similar matrix equation to represent the relationships between the latent endogenous variables and the observed indicators for these constructs:

Y = ΛY η + ε

(note that there is a similar loading matrix and a similar measurement error matrix).

§ 3 Implied Covariance Matrix

We need to find an analytic representation of the model-implied covariance matrix as a function of the model parameters. We will consider four sub-matrices.

§ 3.1 Finding Σyy′(θ)

We will first consider finding model-implied variances and covariances of the observed indicators for endogenous variables.

We start with the measurement equation, and we must assume that E(ηε′) = 0 (i.e., that η is independent of ε).

Σyy′(θ) = E(YY′)
        = E[(Λyη + ε)(ε′ + η′Λ′y)]
        = E(Λyηη′Λ′y) + E(εε′) + E(Λyηε′) + E(εη′Λ′y)
        = ΛyE(ηη′)Λ′y + E(εε′) + 0 + 0
        = ΛyE(ηη′)Λ′y + Θε

We are not finished, however, because this equation includes more than parameters of the model. The equation includes E(ηη′), and we do not know what the latent variable values are for η. Thus, we need to bring in the structural equation (reduced form): η = (I − β)⁻¹(Γξ + ζ). In addition, we need to assume that E(ξζ′) = 0 (i.e., that ξ is independent of ζ).

Σyy′(θ) = ΛyE(ηη′)Λ′y + Θε
        = ΛyE[{(I − β)⁻¹(Γξ + ζ)}{(ζ′ + ξ′Γ′)((I − β)⁻¹)′}]Λ′y + Θε
        = ΛyE[(I − β)⁻¹{(Γξ + ζ)(ζ′ + ξ′Γ′)}((I − β)⁻¹)′]Λ′y + Θε
        = Λy(I − β)⁻¹E[Γξξ′Γ′ + ζζ′ + Γξζ′ + ζξ′Γ′]((I − β)⁻¹)′Λ′y + Θε
        = Λy(I − β)⁻¹[ΓE(ξξ′)Γ′ + E(ζζ′) + 0 + 0]((I − β)⁻¹)′Λ′y + Θε
        = Λy(I − β)⁻¹[ΓΦΓ′ + Ψ]((I − β)⁻¹)′Λ′y + Θε

This final equation consists only of parameter matrices that are constant for a given population; there are no variables (latent or observed).

§ 3.2 Finding Σyx′(θ)

We start with the measurement equations, and we must assume that E(ξε′) = 0, that E(ξζ′) = 0, and that E(εδ′) = 0. We will again use the reduced form of η for the structural equation and make that substitution in the second step as before.

Σyx′(θ) = E(YX′)
        = E[(Λyη + ε)(δ′ + ξ′Λ′x)]
        = E[{Λy(I − β)⁻¹(Γξ + ζ) + ε}{δ′ + ξ′Λ′x}]
        = E[Λy(I − β)⁻¹(Γξ + ζ)ξ′Λ′x + Λy(I − β)⁻¹(Γξ + ζ)δ′ + εξ′Λ′x + εδ′]
        = Λy(I − β)⁻¹E[(Γξ + ζ)ξ′]Λ′x + Λy(I − β)⁻¹E[(Γξ + ζ)δ′] + E(εξ′)Λ′x + E(εδ′)
        = Λy(I − β)⁻¹[ΓE(ξξ′) + E(ζξ′)]Λ′x + Λy(I − β)⁻¹[ΓE(ξδ′) + E(ζδ′)] + 0 + 0
        = Λy(I − β)⁻¹[ΓΦ + 0]Λ′x + Λy(I − β)⁻¹[Γ(0) + 0]
        = Λy(I − β)⁻¹ΓΦΛ′x + 0

And we’re done because this equation is a function of parameters only.

� 3.3 Finding Σxy′(θ)

We could start with the measurement equations, making the same assumptions as above: that E(ξε′) = 0, that E(ξζ′) = 0, and that E(εδ′) = 0. We would again use the reduced form of η for the structural equation and make that substitution in the second step as before. But we can also intuit that this sub-matrix must be the transpose of the one just found:

Σxy′(θ) = E(XY′) = ΛxΦΓ′((I − β)⁻¹)′Λ′y


§ 3.4 Finding Σxx′(θ)

We start with the measurement equations for X and make the same assumptions we did for CFA. In fact, these equations are identical to solving for the implied covariance matrix with a CFA-only model.

Thus, Σxx′(θ) = E(XX′) = ΛxΦΛ′x + Θδ

§ 3.5 Summary

Given that we can solve all four sub-matrices of the full Σ(θ) for the full latent variable model with multiple measures in terms of model parameters only, we can attempt to solve the equations created by setting the model-implied covariance matrix equal to the sample data covariance matrix:

Σ(θ) = S

We can use the three unique sub-matrices to generate unique sets of equations to be solved simultaneously:

Σyy′(θ) = Syy′

Σyx′(θ) = Syx′

Σxx′(θ) = Sxx′

There is no need to worry about Σxy′(θ) = Sxy′ because these equations will be identical to those involving Syx′ due to the symmetry of Σ(θ) and S.
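Here is a short sketch assembling the full Σ(θ) from the three unique sub-matrices just derived; the parameter matrices are assumed to be conformable numpy arrays, and the function name is illustrative only.

    import numpy as np

    def implied_sigma(Lam_y, Lam_x, beta, Gamma, Phi, Psi, Theta_eps, Theta_delta):
        B_inv = np.linalg.inv(np.eye(beta.shape[0]) - beta)
        # Sigma_yy' = Lambda_y (I-beta)^-1 [Gamma Phi Gamma' + Psi] ((I-beta)^-1)' Lambda_y' + Theta_eps
        Sigma_yy = Lam_y @ B_inv @ (Gamma @ Phi @ Gamma.T + Psi) @ B_inv.T @ Lam_y.T + Theta_eps
        # Sigma_yx' = Lambda_y (I-beta)^-1 Gamma Phi Lambda_x'
        Sigma_yx = Lam_y @ B_inv @ Gamma @ Phi @ Lam_x.T
        # Sigma_xx' = Lambda_x Phi Lambda_x' + Theta_delta
        Sigma_xx = Lam_x @ Phi @ Lam_x.T + Theta_delta
        return np.block([[Sigma_yy, Sigma_yx],
                         [Sigma_yx.T, Sigma_xx]])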

§ 4 Identification Rules

§ 4.1 t-rule — necessary

§ 4.2 Two-Step Rule — sufficient


LISREL Command/Card/Code Reference

DA = DAta Card — uses NI, NO, and MA commands

NI = Number of Input Variables in the data set being read, which may be more than the number of variables to be analyzed, e.g., NI=12

SE = SElect variables to be analyzed — used to reorder variables or to omit variables (e.g., SE ; 3 5 4 2 1 /). If you select fewer than NI variables, terminate the command with a forward slash. Also, for some reason, you must separate the SE command from the list of variable names or numbers with a semi-colon.

NO = Number of Observations (sample size), e.g., NO=345

MA = MAtrix to be analyzed — possible values are CM, KM, e.g., MA=CM to analyze the covariance matrix

CM = Covariance Matrix Card — set it equal to the file name of the covariance matrix, e.g., CM=filename.cov

KM = Korrelation Matrix Card — set it equal to the file name of the correlation matrix, e.g., KM=filename.cor

MO = MOdel Card — specifies default size and configuration of various matrices

NX = Number of X (manifest) variables in the model (e.g., NX=6)

NK = Number of KSI (latent) factors in the model, i.e., ξ (e.g., NK=2)

LX = Lambda X (ΛX) matrix shape [FU, ID] and default status [FI, FR] (e.g., LX=FU,FR)

PH = PHI (Φ) matrix shape [DI, SY, ST] and default status [FI, FR] (e.g., PH=SY,FR)

TD = Theta-Delta (Θδ) matrix shape [DI, SY, ST] and default status [FI, FR] (e.g., TD=DI,FR)

LA = LAbels for observed variables (separate LA from the list of labels with a semi-colon; note that there must be exactly NX labels separated by one or more spaces; note that the list may span as many lines as necessary or desired)

LK = Labels for KSI variables (separate LK from the list of labels with a semi-colon; note that there must be exactly NK labels separated by one or more spaces; note that the list may span as many lines as necessary or desired)


FI = FIxed card — used to override default stipulated on MO card for any parameter

FR = FRee card — used to override default stipulated on MO card for any parameter

PD = Path Diagram — produces a path diagram, but it is not as useful as you might think because the diagrams tend to suck

OU = OUtput Card — used to specify method of estimation, output produced, and several other behaviors of LISREL

ME = MEthod of estimation [UL, GL, ML] for unweighted least-squares, generalized least-squares, and maximum likelihood estimation, respectively

SE = print Standard Errors

TV = print T Values

RS = print ResidualS

SS = print Standardized Solution

SC = Standardize solution Completely

MI = print Modification Indices

ND = Number of Decimal places to use for output

IT = ITeration limit for program termination

Matrix Shape Declaration Options

FU = FUll matrix shape (e.g., rectangular matrix)

ID = IDentity matrix shape (e.g., ones on the main diagonal, zeroes elsewhere)

DI = DIagonal matrix shape (e.g., precludes off-diagonal elements)

FI = FIxed matrix elements by default (override with FR cards)

FR = FRee matrix elements by default
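As a rough illustration of how these cards fit together, here is a sketch of a CFA deck assembled only from the cards listed above (the data file, labels, and the particular two-factor, six-indicator model are made up, and the exact form may need adjustment for a given LISREL version):

    DA NI=6 NO=345 MA=CM
    CM=filename.cov
    LA ; x1 x2 x3 x4 x5 x6
    MO NX=6 NK=2 LX=FU,FR PH=ST TD=DI,FR
    LK ; verbal quant
    FI LX(1,2) LX(2,2) LX(3,2) LX(4,1) LX(5,1) LX(6,1)
    OU ME=ML SE TV RS SC MI ND=3

In this sketch, PH=ST standardizes the ξ factors so their variances are fixed and all six loadings can be freely estimated, while the FI card removes the cross-loadings so each indicator loads on only one factor.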