From PCA to Confirmatory FA
(from using Stata to using Mx and other SEM software)
References:
Chapter 8 of Hamilton
Chapter 10 of Lattin et al
Data sets: College.txt, Govern.sav, Adoption.txt
Class 1
• Principal Components
• Exploratory Factor Model
• Confirmatory Factor Model
Principal Components
Basic principles and the use of the method, with an example
Chapter 8 of Hamilton, pp. 249-267
data=read.table("G:/Albert/COURSES/RMMSS/Schools1.txt", header=T)
names(data)
[1] "School" "SchoolT" "SAT" "Accept" "CostSt" "Top10" "PhD"
[8] "Grad"
attach(data)
pairs(data[,3:8])
lCost=log(CostSt)
cdata=cbind(data[,3:4], lCost, data[,6:8])
pairs(cdata)
http://lib.stat.cmu.edu/DASL/Datafiles/Colleges.html
Principal Components Analysis (PCA)
Yj = aj1 PC1 + aj2 PC2 + Ej,  j = 1, 2, ..., p
where the Yj are manifest variables, the PCk are called the principal components, and the remainder collects the discarded components:
Ej = aj3 PC3 + ... + ajp PCp
Let Rj² be the R² of the (linear) regression of Yj on PC1 and PC2. In PCA, the a's are chosen to maximize Σj Rj².
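A quick numerical check of this maximization property in R (a minimal sketch, using the cdata object built above): with standardized variables, the average R² over the six regressions equals the cumulative proportion of variance of the first two components (0.716 in the Stata output below).
Z  = scale(cdata)                    # standardize the six variables
pc = princomp(cdata, cor=T)$scores[,1:2]
R2 = sapply(1:ncol(Z), function(j) summary(lm(Z[,j] ~ pc))$r.squared)
sum(R2)/ncol(Z)                      # about 0.716, the 2-component cumulative proportion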
use "G:\Albert\COURSES\RMMSS\school1.dta", clear
. summarize sat accept costst top10 phd grad
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         sat |        50     1263.96    62.32959       1109       1400
      accept |        50       37.84    13.36361         17         67
      costst |        50     30247.2    15266.17      17520     102262
       top10 |        50       74.44    13.51516         47         98
         phd |        50       90.56    8.258972         58        100
        grad |        50       83.48    7.557237         61         95
. gen lcost = log(costst)
. pca sat accept lcost top10 phd grad, factors(2)
(obs=50)
(principal components; 2 components retained)
Component Eigenvalue Difference Proportion Cumulative
------------------------------------------------------------------
1 3.01940 1.74300 0.5032 0.5032
2 1.27640 0.52532 0.2127 0.7160
3 0.75108 0.25948 0.1252 0.8411
4 0.49160 0.25118 0.0819 0.9231
5 0.24042 0.01930 0.0401 0.9631
6 0.22112 . 0.0369 1.0000
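Each proportion is simply the eigenvalue divided by p = 6; for instance, 3.01940/6 = 0.5032.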
Eigenvectors
Variable | 1 2
-------------+---------------------
sat | 0.48705 -0.20272
accept | -0.47435 0.20082
lcost | 0.38708 0.30674
top10 | 0.45710 0.28373
phd | 0.27982 0.55460
grad | 0.31732 -0.66060
. greigen
. score f1 f2
(based on unrotated principal components)
Scoring Coefficients
Variable | 1 2
-------------+---------------------
sat | 0.48705 -0.20272
accept | -0.47435 0.20082
lcost | 0.38708 0.30674
top10 | 0.45710 0.28373
phd | 0.27982 0.55460
grad | 0.31732 -0.66060
.
. summarize f1 f2
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
f1 | 50 2.76e-09 1.737641 -2.693964 3.290203
f2 | 50 -7.38e-09 1.129777 -2.067842 3.50152
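Note that the score variances equal the eigenvalues: 1.737641² = 3.0194 and 1.129777² = 1.2764, the first two eigenvalues above (the scores are centered but not rescaled to unit variance).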
[Scree plot from greigen ("Normalized pc"): eigenvalues (y-axis, 0 to 3) against component number (x-axis, 1 to 6)]
. graph f2 f1, s([_n])
[Scatter plot of the component scores: f2 (range -2.07 to 3.50) against f1 (range -2.69 to 3.29), with each of the 50 colleges plotted as its observation number]
. cor sat accept lcost top10 phd grad f1 f2
(obs=50)
| sat accept lcost top10 phd grad f1 f2
-------------+------------------------------------------------------------------------
sat | 1.0000
accept | -0.6068 1.0000
lcost | 0.5697 -0.2972 1.0000
top10 | 0.5093 -0.6163 0.5321 1.0000
phd | 0.2209 -0.3117 0.3155 0.4486 1.0000
grad | 0.5691 -0.5622 0.0999 0.1613 -0.0554 1.0000
f1 | 0.8463 -0.8243 0.6726 0.7943 0.4862 0.5514 1.0000
f2 | -0.2290 0.2269 0.3465 0.3206 0.6266 -0.7463 -0.0000 1.0000
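These correlations of the variables with f1 and f2 can be reproduced from the spectral decomposition: the correlation of Yj with PCk is the jth entry of eigenvector k times the square root of eigenvalue k. A minimal R sketch (using cdata from above; columns may come out with flipped signs):
e = eigen(cor(cdata))
round(sweep(e$vectors[,1:2], 2, sqrt(e$values[1:2]), "*"), 4)
# column 1 reproduces the f1 correlations (0.8463, -0.8243, 0.6726, ...)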
library(mva)   # in R >= 1.9.0 the mva package was merged into stats, which is loaded by default
help('factanal')
help('princomp')
pca=princomp(cdata,cor=T, scores=T)
biplot(pca)
> summary(pca)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 1.7376411 1.1297771 0.8666462 0.70114124 0.49032369
Proportion of Variance 0.5032328 0.2127327 0.1251793 0.08193317 0.04006955
Cumulative Proportion 0.5032328 0.7159655 0.8411447 0.92307790 0.96314745
round(cov(pca$scores[,1:2]),3)
Comp.1 Comp.2
Comp.1 3.081 0.000
Comp.2 0.000 1.302
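(princomp divides by n while cov divides by n - 1, so the score variances are the eigenvalues rescaled by n/(n - 1): 3.0194 × 50/49 = 3.081 and 1.2764 × 50/49 = 1.302.)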
> data[,1]
[1] Amherst Swarthmore Williams Bowdoin Wellesley
[6] Pomona Wesleyan Middlebury Smith Davidson
[11] Vassar Carleton ClarMcKenna Oberlin WashingtonLee
[16] Grinnell MountHolyoke Colby Hamilton Bates
[21] Haverford Colgate BrynMawr Occidental Barnard
[26] Harvard Stanford Yale Princeton CalTech
[31] MIT Duke Dartmouth Cornell Columbia
[36] UofChicago Brown UPenn Berkeley JohnsHopkins
[41] Rice UCLA UVa. Georgetown UNC
[46] UMichican CarnegieMellon Northwestern WashingtonU UofRochester
DD=dist(pca$scores[,1:2], method ="euclidean", diag=FALSE)
clust=hclust(DD, method="complete", members=NULL)
plot(clust, labels=data[,1], cex=.8, col="blue", main="clustering of education")
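To turn the dendrogram into group memberships one can cut the tree; a small sketch (the choice k = 3 is arbitrary, for illustration):
groups = cutree(clust, k=3)
table(groups, data$SchoolT)   # cross-tabulate the clusters against school type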
(Exploratory) Factor Analysis
Yj = aj1 F1 + aj2 F2 + Ej,  j = 1, 2, ..., p
where the Ej are unique factors, uncorrelated across j!
The a's are chosen by the principal factor method, ML, ...
There is no unique solution (the model is not identified); rotation methods (e.g., Varimax) are used to aid interpretation.
Chapter 8 of Hamilton, pp. 270-281
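The non-identification is easy to see numerically: any orthogonal rotation of the loadings implies exactly the same correlation matrix. A minimal R sketch (using cdata from above and factanal, which appears later in these notes; the angle pi/6 is arbitrary):
fa = factanal(cdata, factors=2)
L  = unclass(loadings(fa))
th = pi/6                                                  # arbitrary rotation angle
Q  = matrix(c(cos(th), sin(th), -sin(th), cos(th)), 2, 2)  # orthogonal rotation
Lr = L %*% Q
max(abs(L %*% t(L) - Lr %*% t(Lr)))   # ~0: LL' (hence LL' + Psi) is unchanged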
. factor sat accept lcost top10 phd grad, factors(3) ipf
(obs=50)
(iterated principal factors; 3 factors retained)
Factor Eigenvalue Difference Proportion Cumulative
------------------------------------------------------------------
1 2.75866 1.77477 0.6573 0.6573
2 0.98390 0.52915 0.2344 0.8917
3 0.45474 0.45357 0.1083 1.0000
4 0.00118 0.00100 0.0003 1.0003
5 0.00018 0.00160 0.0000 1.0003
6 -0.00142 . -0.0003 1.0000
Factor Loadings
Variable | 1 2 3 Uniqueness
-------------+-------------------------------------------
sat | 0.80984 -0.12555 0.22792 0.27645
accept | -0.81206 0.20282 0.33555 0.18682
lcost | 0.65212 0.44139 0.42542 0.19894
top10 | 0.74504 0.32592 -0.23040 0.28561
phd | 0.38481 0.34884 -0.20905 0.68653
grad | 0.56121 -0.71011 0.11153 0.16835
Exploratory Factor Analysis
> fac = factanal(cdata, factors=2, scores="regression")
> fac
Call:
factanal(x = cdata, factors = 2, scores = "regression")
Uniquenesses:
SAT Accept lCost Top10 PhD Grad
0.388 0.353 0.600 0.256 0.708 0.005
Loadings:
Factor1 Factor2
SAT 0.484 0.615
Accept -0.523 -0.612
lCost 0.613 0.155
Top10 0.830 0.235
PhD 0.540
Grad 0.994
Factor1 Factor2
SS loadings 1.871 1.819
Proportion Var 0.312 0.303
Cumulative Var 0.312 0.615
Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 11.47 on 4 degrees of freedom.
The p-value is 0.0217
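(The degrees of freedom come from df = [(p - m)² - (p + m)]/2 = [(6 - 2)² - (6 + 2)]/2 = 4 for p = 6 variables and m = 2 factors.)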
Exploratory Factor Analysis
> summary(fac)
Length Class Mode
converged 1 -none- logical
loadings 12 loadings numeric
uniquenesses 6 -none- numeric
correlation 36 -none- numeric
criteria 3 -none- numeric
factors 1 -none- numeric
dof 1 -none- numeric
method 1 -none- character
scores 100 -none- numeric
STATISTIC 1 -none- numeric
PVAL 1 -none- numeric
n.obs 1 -none- numeric
call 4 -none- call
(Confirmatory) Factor Analysis
Yj = aj1 F1 + aj2 F2 + Ej,  j = 1, 2, ..., p
where the Ej are unique factors, uncorrelated across j!
Some of the a's are free; others are restricted a priori (to 0, to 1, or by equality constraints among them). Estimation methods are ML, GLS, ... The solution is unique (the model is identified).
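For example, fixing the variance of each factor to 1 (or one loading per factor to 1) sets the scale of the latent variables; without some such restriction the model would not be identified.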
Lattin and Roberts' data on the adoption of new technologies
p. 366 of Lattin et al.
See the data file Adoption.txt in RMMSS.
Analysis of Adoption data
data=read.table("E:/Albert/COURSES/RMMSS/Mx/ADOPTION.txt", header=T)
names(data)
[1] "ADOPt1" "ADOPt2" "VALUE1" "VALUE2" "VALUE3" "USAGE1" "USAGE2" "USAGE3"
attach(data)
round(cov(data, use="complete.obs"),2)
ADOPt1 ADOPt2 VALUE1 VALUE2 VALUE3 USAGE1 USAGE2 USAGE3
ADOPt1 675.17 489.24 6.25 5.46 4.08 10.62 11.69 7.12
ADOPt2 489.24 994.31 4.16 4.46 3.42 16.35 17.92 12.17
VALUE1 6.25 4.16 0.95 0.37 0.45 0.16 0.19 0.12
VALUE2 5.46 4.46 0.37 0.83 0.31 0.11 0.12 0.07
VALUE3 4.08 3.42 0.45 0.31 0.86 0.13 0.18 0.05
USAGE1 10.62 16.35 0.16 0.11 0.13 0.76 0.64 0.45
USAGE2 11.69 17.92 0.19 0.12 0.18 0.64 0.92 0.55
USAGE3 7.12 12.17 0.12 0.07 0.05 0.45 0.55 0.64
dim(data)
[1] 188 8
Data NInput=8 NObservations=188
CMatrix
675.17
489.24 994.31
6.25 4.16 0.95
5.46 4.46 0.37 0.83
4.08 3.42 0.45 0.31 0.86
10.62 16.35 0.16 0.11 0.13 0.76
11.69 17.92 0.19 0.12 0.18 0.64 0.92
7.12 12.17 0.12 0.07 0.05 0.45 0.55 0.64
Labels ADOPt1 ADOPt2 VALUE1 VALUE2 VALUE3 USAGE1 USAGE2 USAGE3
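In Mx syntax, CMatrix supplies the lower triangle of the observed covariance matrix row by row, and Labels names the variables in the same order.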
Adoption.dat
> data=read.table("E:/Albert/COURSES/RMMSS/Mx/ADOPTION.txt", header=T)
> names(data)
[1] "ADOPt1" "ADOPt2" "VALUE1" "VALUE2" "VALUE3" "USAGE1" "USAGE2" "USAGE3"
attach(data)
factanal(cbind(VALUE1, VALUE2,VALUE3,USAGE1, USAGE2,USAGE3), factors=2, rotation="varimax")
Call:
factanal(x = cbind(VALUE1, VALUE2, VALUE3, USAGE1, USAGE2, USAGE3), factors = 2)
Uniquenesses:
VALUE1 VALUE2 VALUE3 USAGE1 USAGE2 USAGE3
0.493 0.648 0.484 0.291 0.165 0.292
Loadings:
Factor1 Factor2
VALUE1 0.127 0.700
VALUE2 0.586
VALUE3 0.714
USAGE1 0.823 0.179
USAGE2 0.896 0.179
USAGE3 0.836

Factor1 Factor2
SS loadings 2.209 1.418
Proportion Var 0.368 0.236
Cumulative Var 0.368 0.604
Exploratory Factor Analysis, ML method
Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 1.82 on 4 degrees of freedom.
The p-value is 0.768
One factor model for Value
Two factor model
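The slides above refer to path diagrams of the one- and two-factor models fitted in Mx. For comparison, a sketch of the same two-factor confirmatory model in modern R with the lavaan package (an addition to these notes; lavaan is not part of the original course, which used Mx and EQS):
library(lavaan)
model = '
  Value =~ VALUE1 + VALUE2 + VALUE3   # loading of VALUE1 fixed to 1 by default
  Usage =~ USAGE1 + USAGE2 + USAGE3
'
fit = cfa(model, data = data[, 3:8])  # the two factors correlate by default
summary(fit, fit.measures = TRUE, standardized = TRUE)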
Factor Analysis
Charles Spearman, 1904. According to the two-factor theory of intelligence, the performance of any intellectual act requires some combination of "g", which is available to the same individual to the same degree for all intellectual acts, and of "specific factors" or "s", which are specific to that act and which vary in strength from one act to another. If one knows how a person performs on one task that is highly saturated with "g", one can safely predict a similar level of performance for another highly "g"-saturated task. Predictions of performance on tasks with high "s" factors are less accurate. Nevertheless, since "g" pervades all tasks, prediction will be significantly better than chance. Thus, the most important information to have about a person's intellectual ability is an estimate of their "g". (Spearman, 1904)
Variables
CLASSIC = V1 FRENCH = V2 ENGLISH = V3 MATH = V4 DISCRIM = V5 MUSIC = V6
Correlation matrix (lower triangle):
1
.83  1
.78  .67  1
.70  .64  .64  1
.66  .65  .54  .45  1
.63  .57  .51  .51  .40  1
cases = 23;
Single-Factor Model
[Path diagram: a single factor F1 with free loadings (*) to V1-V6, and a free unique error (*) on each variable]
**
EQS code for a factor model /Title confirmatory factor analysis: 1 factor ! (Spearman, 1904 ) eqs/exer3.eqs/Specifications var = 6; cases = 23;/Label v1 = classic; v2 = french; v3 =english; v4 = math; V5 = discrim;V6=music;/equationsV1 = *f1 + e1;V2 = *f1+ e2;V3 = *f1 + e3;V4 = *f1 + e4;V5 = *f1 + e5;V6 = *f1 + e6;/variances f1 = 1; e1 to e6 = *;/matrix1.83 1.78 .67 1.70 .64 .64 1.66 .65 .54 .45 1.63 .57 .51 .51 .40 1/LMTEST/end
NT (normal theory) analysis
RESIDUAL COVARIANCE MATRIX (S-SIGMA):

               CLASSIC   FRENCH   ENGLISH    MATH    DISCRIM
               V 1       V 2      V 3        V 4     V 5
CLASSIC  V 1    0.000
FRENCH   V 2   -0.001    0.000
ENGLISH  V 3    0.005   -0.029    0.000
MATH     V 4   -0.006    0.003    0.046     0.000
DISCRIM  V 5   -0.001    0.054   -0.015    -0.056    0.000
MUSIC    V 6    0.003    0.005   -0.017     0.030   -0.049

               MUSIC
               V 6
MUSIC    V 6    0.000
CHI-SQUARE = 1.663 BASED ON 9 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.99575
THE NORMAL THEORY RLS CHI-SQUARE FOR THIS ML SOLUTION IS 1.648
Loadings’ estimates, s.e. and z-test statistics
CLASSIC =V1 =  .960*F1 + 1.000 E1
               .160
              6.019
FRENCH  =V2 =  .866*F1 + 1.000 E2
               .171
              5.049
ENGLISH =V3 =  .807*F1 + 1.000 E3
               .178
              4.529
MATH    =V4 =  .736*F1 + 1.000 E4
               .186
              3.964
DISCRIM =V5 =  .688*F1 + 1.000 E5
               .190
              3.621
MUSIC   =V6 =  .653*F1 + 1.000 E6
               .193
              3.382
Estimates of unique-factor variances
E1 - CLASSIC    .078*
                .064
               1.224
E2 - FRENCH     .251*
                .093
               2.695
E3 - ENGLISH    .349*
                .118
               2.958
E4 - MATH       .459*
                .148
               3.100
E5 - DISCRIM    .527*
                .167
               3.155
E6 - MUSIC      .574*
                .180
               3.184
STANDARDIZED SOLUTION:
CLASSIC =V1 =  .960*F1 + .279 E1
FRENCH  =V2 =  .866*F1 + .501 E2
ENGLISH =V3 =  .807*F1 + .591 E3
MATH    =V4 =  .736*F1 + .677 E4
DISCRIM =V5 =  .688*F1 + .726 E5
MUSIC   =V6 =  .653*F1 + .758 E6
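In the standardized solution the squared loading is the communality, and loading² + error² = 1 for each variable (e.g. .960² + .279² ≈ 1). Because a one-factor ML factor analysis involves no rotation, these EQS estimates can be cross-checked in R from the correlation matrix alone; a minimal sketch (reconstructing the matrix given earlier):
vars = c("classic","french","english","math","discrim","music")
R = diag(6); dimnames(R) = list(vars, vars)
R[lower.tri(R)] = c(.83,.78,.70,.66,.63,  # column 1
                    .67,.64,.65,.57,      # column 2
                    .64,.54,.51,          # column 3
                    .45,.51,              # column 4
                    .40)                  # column 5
R = R + t(R) - diag(6)                    # fill the upper triangle
factanal(covmat = R, factors = 1, n.obs = 23)  # should reproduce loadings .960, .866, ... and the 9-df test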
Data of Lawley and Maxwell
/TITLE
 Lawley and Maxwell data
/SPECIFICATIONS
 CAS=220; VAR=6; ME=ML;
/LABELS
 v1 = Gaelic; v2 = English; v3 = Histo; v4 = Aritm; v5 = Algebra; v6 = Geometry;
/EQUATIONS
 V1 = *F1 + E1;
 V2 = *F1 + E2;
 V3 = *F1 + E3;
 V4 = *F1 + E4;
 V5 = *F1 + E5;
 V6 = *F1 + E6;
/VARIANCES
 F1 = 1; E1 TO E6 = *;
/COVARIANCES
/MATRIX
 1    .439 .410 .288 .329 .248
 .439 1    .351 .354 .320 .329
 .410 .351 1    .164 .190 .181
 .288 .354 .164 1    .595 .470
 .329 .320 .190 .595 1    .464
 .248 .329 .181 .470 .464 1
/END
For the two-factor model, the /EQUATIONS, /VARIANCES and /COVARIANCES sections become:

/EQUATIONS
 V1 = *F1 + E1;
 V2 = *F1 + E2;
 V3 = *F1 + E3;
 V4 = *F2 + E4;
 V5 = *F2 + E5;
 V6 = *F2 + E6;
/VARIANCES
 F1 = 1; F2 = 1; E1 TO E6 = *;
/COVARIANCES
 F1, F2 = *;
GAELIC  =V1 =  .687*F1 + 1.000 E1
               .076
              9.079
ENGLISH =V2 =  .672*F1 + 1.000 E2
               .076
              8.896
HISTO   =V3 =  .533*F1 + 1.000 E3
               .076
              7.047
ARITM   =V4 =  .766*F2 + 1.000 E4
               .067
             11.379
ALGEBRA =V5 =  .768*F2 + 1.000 E5
               .067
             11.411
GEOMETRY=V6 =  .616*F2 + 1.000 E6
               .069
              8.942
COVARIANCES AMONG INDEPENDENT VARIABLES
---------------------------------------
F2 - F2
F1 - F1    .597*
           .072
          8.308
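Because both factor variances are fixed at 1, the estimate .597 (s.e. .072, z = 8.308) is the correlation between F1 and F2.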
M0, single-factor model: CHI-SQUARE = 52.841 on 9 df, p-value < 0.001
M1, two-factor model with correlated factors: CHI-SQUARE = 7.953 on 8 df, p-value = 0.43804
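Since M0 is nested in M1 (M0 is M1 with the factor correlation fixed at 1), the two can be compared by a chi-square difference test: 52.841 - 7.953 = 44.888 on 9 - 8 = 1 df, decisively in favor of the two-factor model.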