the r package lavaan - meetupfiles.meetup.com/2968362/rbelgium5_lavaan.pdf · 2013. 9. 16. ·...
TRANSCRIPT
-
Department of Data Analysis Ghent University
The R package lavaan
Yves Rosseel
Department of Data Analysis
Ghent University – Belgium
RBelgium meeting 5 @ UGent
13 September 2013
Yves Rosseel The R package lavaan 1 / 42
-
lavaan: latent variable analysis
• large family of statistical models, exploiting the concept of ‘latent variables’
• latent variables are:
– unobserved constructs (e.g. ‘level of depression’)
– random effects (as in mixed models)
– missing data
• most well-known subclass: ‘structural equation models (SEM)’
-
SEM examples: path analysis (no latent variables)
[Figure: path diagram relating the observed variables y1–y7]
-
SEM examples: confirmatory factor analysis (CFA)
[Figure: CFA diagram with observed indicators y1–y9 and latent variables η1–η3]
-
SEM examples: ‘full’ structural equation modeling
[Figure: full SEM diagram with observed indicators y1–y12 and latent variables η1–η4]
-
software for SEM: commercial – closed-source
• the big four:
– LISREL
– EQS
– AMOS
– Mplus
• SAS/Stat: proc CALIS, proc TCALIS
• SEPATH (Statistica), RAMONA (Systat), Stata 12
• Mx (free, closed-source)
-
software for SEM: non-commercial – open-source
• outside the R ecosystem: gllamm (Stata module), . . .
• R packages:
– sem
– OpenMx
– lavaan
– lava
• interfaces between R and commercial packages:
– REQS
– MplusAutomation
-
the lavaan project
1. lavaan subproject: the lavaan package/program
• lavaan is an R package for latent variable analysis
• the long-term goal of lavaan is to implement all the state-of-the-art capabilities that are currently available in commercial packages
2. lavaan subproject: Rosetta
• collection of tools for reading/parsing and writing legacy syntax (e.g. classic LISREL syntax)
• intermediate representation: the lavaan parameter table
• currently (0.5-14) only: Mplus/LISREL to lavaan, lavaan to Mplus
3. lavaan subproject: Chameleon
• mimic legacy software
• reproducibility
-
the lavaan package (1)
• lavaan is an R package for latent variable analysis
• the long-term goal of lavaan is to implement all the state-of-the-art capabilities that are currently available in commercial packages
-
the lavaan package (2)
• the lavaan source code is hosted on github:
https://github.com/yrosseel/lavaan
• more information about lavaan:
http://lavaan.org
• the lavaan paper:
Rosseel (2012). lavaan: an R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.
• lavaan discussion group (mailing list)
https://groups.google.com/d/forum/lavaan
-
how big is lavaan?
• > 25K lines of code (currently, 0.5-15, R code only)
• how many people are using lavaan?
– no idea
– the lavaan paper has been downloaded > 12,600 times (since April 2012)
– lavaan.org gets around 60–120 hits per day
-
where are the lavaan users?
-
why do we need lavaan?
1. lavaan is for statisticians working in the field of SEM
• it seems unfortunate that new developments in this field are hindered by the lack of open source software that researchers can use to implement their newest ideas
2. lavaan is for teachers
• teaching these techniques to students was often complicated by the forced choice for one of the commercial packages
3. lavaan is for applied researchers
• keep it simple, provide all the features they need
-
features of lavaan
• lavaan is well-tested
• user-friendly fitting functions (cfa, sem, growth)
• power-user fitting function (lavaan)
• support for non-normal continuous data:
– robust standard errors, Satorra-Bentler correction, ADF estimation, bootstrapping
• support for categorical (binary/ordinal) data
– lavaan has implemented the three-stage WLS approach as developed by Bengt Muthén (1984), including robust variants (aka WLSMV)
• full support for missing data, meanstructures, and multiple groups
• linear and non-linear equality and inequality constraints
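A few of the features listed above map onto arguments of the fitting functions. The sketch below is only an illustration, assuming the built-in HolzingerSwineford1939 data set and a three-factor model that the slide itself does not show; `estimator`, `missing`, `group`, and `meanstructure` are existing arguments of `cfa()`.

```r
library(lavaan)

# Three-factor CFA model (an assumption; not taken from this slide)
HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

# Non-normal continuous data: robust standard errors and a
# Satorra-Bentler scaled test statistic
fit.mlm <- cfa(HS.model, data = HolzingerSwineford1939, estimator = "MLM")

# Missing data: full-information ML (this data set is complete, so the
# call only illustrates the argument)
fit.fiml <- cfa(HS.model, data = HolzingerSwineford1939, missing = "ml")

# Multiple groups with a mean structure
fit.mg <- cfa(HS.model, data = HolzingerSwineford1939,
              group = "school", meanstructure = TRUE)

# Compare the standard and the scaled test statistic
fitMeasures(fit.mlm, c("chisq", "chisq.scaled", "df"))
```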
-
unique features
• default model specification: lavaan model syntax
– Mplus2lavaan (Michael Hallquist)
– lisrel2lavaan (Corbin Quick)
– graphical (via Onyx)
– . . .
• mimic the (numerical) results of commercial packages:
– mimic="Mplus"
– mimic="EQS"
• new technical features:
– informative hypothesis testing (Leonard Vanbrabant)
– pairwise ML for binary/ordinal data (Myrsini Katsikatsou)
– fraction of missing information (Mijke Rhemtulla)
– . . .
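The `mimic` argument mentioned above can be passed to any of the fitting functions. A minimal sketch, assuming the built-in HolzingerSwineford1939 data and a three-factor model chosen here for illustration:

```r
library(lavaan)

# Illustrative model (an assumption; not shown on this slide)
HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

# Same model, with defaults chosen to follow Mplus or EQS conventions
fit.mplus <- cfa(HS.model, data = HolzingerSwineford1939, mimic = "Mplus")
fit.eqs   <- cfa(HS.model, data = HolzingerSwineford1939, mimic = "EQS")

# The test statistics may differ slightly between the two conventions
fitMeasures(fit.mplus, c("chisq", "df"))
fitMeasures(fit.eqs,   c("chisq", "df"))
```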
-
features NOT in lavaan (yet)
• multilevel SEM
• mixture (latent class) SEM
• . . .
features we are working on
• Bayesian SEM (BUGS interface, stan interface, native)
• small-sample corrections
• causal inference
• standard errors for standardized parameters
• ML estimation for categorical data (IRT)
• . . .
• better (technical) documentation
-
the lavaan ecosystem
• lavaan.survey (Daniel Oberski)
survey weights, clustering, strata, and finite sampling corrections in SEM
• Onyx (Timo von Oertzen, Andreas M. Brandmaier, Siny Tsang)
interactive graphical interface for SEM (written in Java)
• semTools (Sunthud Pornprasertmanit and many others)
collection of useful functions for SEM
• simsem (Sunthud Pornprasertmanit and many others)
simulation of SEM models
• semPlot (Sacha Epskamp)
visualizations of SEM models
-
semPlot
[Figure: semPlot path diagram with nodes y1–y6, x1–x3, f1 and f2]
-
a simple regression analysis in R
[Figure: path diagram with predictors x1–x4 and outcome y]
# read in your data
myData
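The code on this slide is cut off in the transcript. As a stand-in, here is a minimal sketch of the same kind of analysis. The data-generating step is an assumption (the slide only says the data are artificial with N=100), so the estimates will not match the output on the next slide; the `lm()` call itself is the one shown there.

```r
# Artificial data, N = 100. The generating values below are assumptions
# made for this sketch, not the ones used for the slides.
set.seed(1234)
N <- 100
myData <- data.frame(x1 = rnorm(N), x2 = rnorm(N),
                     x3 = rnorm(N), x4 = rnorm(N))
myData$y <- 100 + 5 * myData$x1 - 1 * myData$x2 +
            1 * myData$x3 + rnorm(N, sd = 45)

# Ordinary least-squares regression of y on the four predictors
fit.lm <- lm(y ~ x1 + x2 + x3 + x4, data = myData)
summary(fit.lm)
```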
-
lm() output, artificial data (N=100)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = myData)

Residuals:
     Min       1Q   Median       3Q      Max
-102.372  -29.458   -3.658   27.275  148.404

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  97.7210     4.7200  20.704
-
the lavaan model syntax – a simple regression
[Figure: path diagram with predictors x1–x4 and outcome y]
library(lavaan)
myData
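The slide's code is truncated here. A minimal sketch of the same regression in lavaan model syntax follows; the simulated data are an assumption, so the numbers differ from the output on the next slide. The regression operator `~` works as in `lm()`, and the model is passed as a string.

```r
library(lavaan)

# Artificial data (generating values are assumptions for this sketch)
set.seed(1234)
N <- 100
myData <- data.frame(x1 = rnorm(N), x2 = rnorm(N),
                     x3 = rnorm(N), x4 = rnorm(N))
myData$y <- 100 + 5 * myData$x1 - 1 * myData$x2 +
            1 * myData$x3 + rnorm(N, sd = 45)

# The model is a string: one regression formula per line
myModel <- ' y ~ x1 + x2 + x3 + x4 '

# Fit the model; the regression is saturated, so it has 0 degrees of freedom
fit <- sem(model = myModel, data = myData)
summary(fit)
```

The slopes should agree with those from `lm()` on the same data; the default lavaan fit has no mean structure, so no intercept is reported.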
-
output (artificial data, N=100)

lavaan (0.5-13) converged normally after 1 iterations

  Number of observations                           100

  Estimator                                         ML
  Minimum Function Test Statistic                0.000
  Degrees of freedom                                 0
  P-value (Chi-square)                           1.000

Parameter estimates:

  Information                                 Expected
  Standard Errors                             Standard

                   Estimate  Std.err  Z-value  P(>|z|)
Regressions:
  y ~
    x1                5.773    0.511   11.309    0.000
    x2               -1.321    0.479   -2.757    0.006
    x3                1.135    0.446    2.545    0.011
    x4                0.271    0.466    0.581    0.561

Variances:
    y              2075.100  293.463
-
the lavaan model syntax – multivariate regression
[Figure: path diagram with predictors x1–x4 and outcomes y1 and y2]
myModel
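The model string is cut off in the transcript. One way to write a multivariate regression with the variables in the figure is sketched below. Both the model and the simulated data are assumptions; in particular, whether the original model let every predictor point to both outcomes is not recoverable from the transcript.

```r
library(lavaan)

# Artificial data with two outcomes (an assumption for this sketch)
set.seed(1234)
N <- 100
myData <- data.frame(x1 = rnorm(N), x2 = rnorm(N),
                     x3 = rnorm(N), x4 = rnorm(N))
myData$y1 <- 0.5 * myData$x1 + 0.3 * myData$x2 + rnorm(N)
myData$y2 <- 0.4 * myData$x3 + 0.2 * myData$x4 + rnorm(N)

# One regression formula per outcome
myModel <- ' y1 ~ x1 + x2 + x3 + x4
             y2 ~ x1 + x2 + x3 + x4 '

fit <- sem(model = myModel, data = myData)
summary(fit)
```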
-
the lavaan model syntax – path analysis
[Figure: path diagram relating the observed variables x1–x7]
myModel
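The arrows of the path diagram did not survive in the transcript, so the structure below is hypothetical. It only illustrates how a path model is written: one regression formula per endogenous variable, with a variable allowed to be an outcome in one formula and a predictor in another.

```r
library(lavaan)

# Artificial data following a hypothetical path structure
set.seed(1234)
N <- 200
x1 <- rnorm(N); x2 <- rnorm(N); x3 <- rnorm(N); x4 <- rnorm(N)
x5 <- 0.5 * x1 + 0.3 * x2 + 0.2 * x3 + rnorm(N)
x6 <- 0.4 * x4 + 0.6 * x5 + rnorm(N)
x7 <- 0.5 * x6 + rnorm(N)
myData <- data.frame(x1, x2, x3, x4, x5, x6, x7)

# Hypothetical path model (not the one from the slide)
myModel <- ' x5 ~ x1 + x2 + x3
             x6 ~ x4 + x5
             x7 ~ x6 '

fit <- sem(model = myModel, data = myData)
summary(fit)
```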
-
the lavaan model syntax – mediation analysis
[Figure: mediation diagram with paths X → M (a), M → Y (b) and X → Y (c)]
model
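The model string is truncated in the transcript. The sketch below is consistent with the labels (a, b, c) and the defined parameters (indirect, total) that appear in the output on the next slide, but the exact wording of the original model and the data are assumptions. Labels are attached with `*`, and `:=` defines a new parameter as a function of labelled ones.

```r
library(lavaan)

# Artificial data (an assumption; estimates will differ from the slide)
set.seed(1234)
X <- rnorm(100)
M <- 0.5 * X + rnorm(100)
Y <- 0.7 * M + rnorm(100)
Data <- data.frame(X = X, Y = Y, M = M)

model <- ' # direct effect
             Y ~ c*X
           # mediator
             M ~ a*X
             Y ~ b*M
           # indirect effect (a*b)
             indirect := a*b
           # total effect
             total := c + (a*b)
         '

# Bootstrap standard errors, with 1000 draws as on the output slide
fit <- sem(model, data = Data, se = "bootstrap", bootstrap = 1000)
summary(fit)
```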
-
output
...

Parameter estimates:

  Information                                 Observed
  Standard Errors                            Bootstrap
  Number of requested bootstrap draws             1000
  Number of successful bootstrap draws            1000

                   Estimate  Std.err  Z-value  P(>|z|)
Regressions:
  Y ~
    M         (b)     0.597    0.098    6.068    0.000
    X         (c)     2.594    1.210    2.145    0.032
  M ~
    X         (a)     2.739    0.999    2.741    0.006

Variances:
    Y               108.700   17.747
    M               105.408   16.556

Defined parameters:
    indirect          1.636    0.645    2.535    0.011
    total             4.230    1.383    3.059    0.002
-
the lavaan model syntax – using cfa() or sem()
[Figure: CFA diagram with indicators x1–x9 and latent variables visual, textual and speed]
HS.model
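The definition of `HS.model` is cut off in the transcript. A sketch of the three-factor Holzinger and Swineford model, as it appears in the lavaan tutorial, is given below; the assignment of x1–x9 to the three factors matches the loadings in the output slides that follow. The `=~` operator reads "is measured by".

```r
library(lavaan)

# Three latent variables, each measured by three indicators
HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

# cfa() adds residual variances, factor variances and covariances
# automatically and fixes the first loading of each factor to 1
fit <- cfa(HS.model, data = HolzingerSwineford1939)
summary(fit, fit.measures = TRUE, standardized = TRUE)
```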
-
the lavaan model syntax – using lavaan()
[Figure: CFA diagram with indicators x1–x9 and latent variables visual, textual and speed]
HS.model
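The model string is again truncated. With the power-user function `lavaan()`, nothing is added automatically, so every parameter has to be written out. The sketch below is one way to do that for the same three-factor model; the exact layout of the original slide is an assumption.

```r
library(lavaan)

HS.model <- ' # latent variables (first loading fixed to 1 by hand)
                visual  =~ 1*x1 + x2 + x3
                textual =~ 1*x4 + x5 + x6
                speed   =~ 1*x7 + x8 + x9
              # residual variances of the observed variables
                x1 ~~ x1
                x2 ~~ x2
                x3 ~~ x3
                x4 ~~ x4
                x5 ~~ x5
                x6 ~~ x6
                x7 ~~ x7
                x8 ~~ x8
                x9 ~~ x9
              # factor variances
                visual  ~~ visual
                textual ~~ textual
                speed   ~~ speed
              # factor covariances
                visual  ~~ textual + speed
                textual ~~ speed
            '

# lavaan() fits exactly what is specified, without automatic additions
fit <- lavaan(HS.model, data = HolzingerSwineford1939)
summary(fit, fit.measures = TRUE, standardized = TRUE)
```

This should give the same fit as the shorter `cfa()` specification, since it spells out what `cfa()` adds by default.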
-
output

lavaan (0.5-12) converged normally after 41 iterations

  Number of observations                           301

  Estimator                                         ML
  Minimum Function Chi-square                   85.306
  Degrees of freedom                                24
  P-value                                        0.000

Chi-square test baseline model:

  Minimum Function Chi-square                  918.852
  Degrees of freedom                                36
  P-value                                        0.000

Full model versus baseline model:

  Comparative Fit Index (CFI)                    0.931
  Tucker-Lewis Index (TLI)                       0.896

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -3737.745
  Loglikelihood unrestricted model (H1)      -3695.092

  Number of free parameters                         21
-
  Akaike (AIC)                                7517.490
  Bayesian (BIC)                              7595.339
  Sample-size adjusted Bayesian (BIC)         7528.739

Root Mean Square Error of Approximation:

  RMSEA                                          0.092
  90 Percent Confidence Interval          0.071  0.114
  P-value RMSEA [...]

                   Estimate  Std.err  Z-value  P(>|z|)   Std.lv  Std.all
Latent variables:
  visual =~
    x1                1.000                               0.900    0.772
    x2                0.553    0.100    5.554    0.000    0.498    0.424
    x3                0.729    0.109    6.685    0.000    0.656    0.581
  textual =~
    x4                1.000                               0.990    0.852
    x5                1.113    0.065   17.014    0.000    1.102    0.855
-
    x6                0.926    0.055   16.703    0.000    0.917    0.838
  speed =~
    x7                1.000                               0.619    0.570
    x8                1.180    0.165    7.152    0.000    0.731    0.723
    x9                1.082    0.151    7.155    0.000    0.670    0.665

Covariances:
  visual ~~
    textual           0.408    0.074    5.552    0.000    0.459    0.459
    speed             0.262    0.056    4.660    0.000    0.471    0.471
  textual ~~
    speed             0.173    0.049    3.518    0.000    0.283    0.283

Variances:
    x1                0.549    0.114                      0.549    0.404
    x2                1.134    0.102                      1.134    0.821
    x3                0.844    0.091                      0.844    0.662
    x4                0.371    0.048                      0.371    0.275
    x5                0.446    0.058                      0.446    0.269
    x6                0.356    0.043                      0.356    0.298
    x7                0.799    0.081                      0.799    0.676
    x8                0.488    0.074                      0.488    0.477
    x9                0.566    0.071                      0.566    0.558
    visual            0.809    0.145                      1.000    1.000
    textual           0.979    0.112                      1.000    1.000
    speed             0.384    0.086                      1.000    1.000
-
testing for measurement invariance
# model 1: configural invariance
fit1
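Only the first comment of this slide's code survives. A common sequence of invariance models is sketched below, assuming the Holzinger and Swineford CFA model with `school` as the grouping variable; the names `fit2` and `fit3` and the later steps are assumptions, since the transcript stops at `fit1`.

```r
library(lavaan)

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

# model 1: configural invariance (same structure in both groups)
fit1 <- cfa(HS.model, data = HolzingerSwineford1939, group = "school")

# model 2: equal loadings across groups
fit2 <- cfa(HS.model, data = HolzingerSwineford1939, group = "school",
            group.equal = "loadings")

# model 3: equal loadings and intercepts across groups
fit3 <- cfa(HS.model, data = HolzingerSwineford1939, group = "school",
            group.equal = c("loadings", "intercepts"))

# chi-square difference tests between the nested models
lavTestLRT(fit1, fit2, fit3)
```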
-
lavaan model syntax: full sem
[Figure: full SEM diagram with indicators y1–y8 and x1–x3 and latent variables dem60, dem65 and ind60]
myModel
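The model string is truncated. The variable names match the PoliticalDemocracy data set that ships with lavaan, so the sketch below follows the version of this model in the lavaan tutorial. The residual covariances are an assumption, since the diagram's arrows are not recoverable from the transcript.

```r
library(lavaan)

myModel <- '
  # latent variables
    ind60 =~ x1 + x2 + x3
    dem60 =~ y1 + y2 + y3 + y4
    dem65 =~ y5 + y6 + y7 + y8
  # regressions among the latent variables
    dem60 ~ ind60
    dem65 ~ ind60 + dem60
  # residual covariances (assumed, following the lavaan tutorial)
    y1 ~~ y5
    y2 ~~ y4 + y6
    y3 ~~ y7
    y4 ~~ y8
    y6 ~~ y8
'

fit <- sem(model = myModel, data = PoliticalDemocracy)
summary(fit, standardized = TRUE)
```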
-
a simple growth curve model with time-varying covariates
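[Figure: growth curve diagram with time-varying covariates c1–c4, repeated measures t1–t4, growth factors i and s, and covariates x1 and x2]
model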
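The model string is cut off. The variable names match the Demo.growth data set included in lavaan, so the sketch below follows the growth example from the lavaan tutorial. The time scores (0, 1, 2, 3) and the exact set of regressions are assumptions about what the slide showed.

```r
library(lavaan)

model <- '
  # intercept and slope with fixed loadings
    i =~ 1*t1 + 1*t2 + 1*t3 + 1*t4
    s =~ 0*t1 + 1*t2 + 2*t3 + 3*t4
  # time-invariant covariates predict the growth factors
    i ~ x1 + x2
    s ~ x1 + x2
  # time-varying covariates predict the repeated measures
    t1 ~ c1
    t2 ~ c2
    t3 ~ c3
    t4 ~ c4
'

# growth() adds a mean structure and fixes the observed intercepts to zero
fit <- growth(model, data = Demo.growth)
summary(fit)
```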
-
further syntax
• fixing parameters, and overriding auto-fixed parameters
HS.model.bis
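The definition of `HS.model.bis` is truncated, so the content below is an assumption that only illustrates the two mechanisms named in the bullet: premultiplying by a number fixes a parameter, and premultiplying by `NA` frees a parameter that would otherwise be fixed automatically.

```r
library(lavaan)

# Hypothetical content for HS.model.bis (the original is not recoverable)
HS.model.bis <- ' # NA* frees the first loading, which cfa() would fix to 1
                    visual  =~ NA*x1 + x2 + x3
                    textual =~ NA*x4 + x5 + x6
                    speed   =~ NA*x7 + x8 + x9
                  # 1* fixes the factor variances to 1 instead
                    visual  ~~ 1*visual
                    textual ~~ 1*textual
                    speed   ~~ 1*speed '

fit <- cfa(HS.model.bis, data = HolzingerSwineford1939)
summary(fit)
```

This is only a different way of setting the scale of the latent variables, so the degrees of freedom are the same as for the default specification.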
-
the parameter table (Holzinger & Swineford CFA example)
> parTable(fit)
   id     lhs op     rhs user group free ustart exo label eq.id unco
1   1  visual =~      x1    1     1    0      1   0           0    0
2   2  visual =~      x2    1     1    1     NA   0           0    1
3   3  visual =~      x3    1     1    2     NA   0           0    2
4   4 textual =~      x4    1     1    0      1   0           0    0
5   5 textual =~      x5    1     1    3     NA   0           0    3
6   6 textual =~      x6    1     1    4     NA   0           0    4
7   7   speed =~      x7    1     1    0      1   0           0    0
8   8   speed =~      x8    1     1    5     NA   0           0    5
9   9   speed =~      x9    1     1    6     NA   0           0    6
10 10      x1 ~~      x1    0     1    7     NA   0           0    7
11 11      x2 ~~      x2    0     1    8     NA   0           0    8
12 12      x3 ~~      x3    0     1    9     NA   0           0    9
13 13      x4 ~~      x4    0     1   10     NA   0           0   10
14 14      x5 ~~      x5    0     1   11     NA   0           0   11
15 15      x6 ~~      x6    0     1   12     NA   0           0   12
16 16      x7 ~~      x7    0     1   13     NA   0           0   13
17 17      x8 ~~      x8    0     1   14     NA   0           0   14
18 18      x9 ~~      x9    0     1   15     NA   0           0   15
19 19  visual ~~  visual    0     1   16     NA   0           0   16
20 20 textual ~~ textual    0     1   17     NA   0           0   17
21 21   speed ~~   speed    0     1   18     NA   0           0   18
22 22  visual ~~ textual    0     1   19     NA   0           0   19
23 23  visual ~~   speed    0     1   20     NA   0           0   20
24 24 textual ~~   speed    0     1   21     NA   0           0   21
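A short sketch of how the parameter table can be inspected, assuming `fit` is the three-factor CFA fitted with `cfa()`. The columns printed by recent lavaan versions may differ from the 0.5-series table above; `lhs`, `op`, `rhs`, and `free` are the ones used here.

```r
library(lavaan)

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '
fit <- cfa(HS.model, data = HolzingerSwineford1939)

# The parameter table is a plain data frame: one row per parameter
PT <- parTable(fit)

# Free parameters have a positive index in the 'free' column;
# fixed parameters (such as the first loadings) have free == 0
PT[PT$free > 0, c("lhs", "op", "rhs", "free")]
PT[PT$free == 0 & PT$op == "=~", c("lhs", "op", "rhs", "ustart")]

# Number of free parameters
max(PT$free)
```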
-
the parameter table (2)
> PT
> lavNames(fit, "ov")
[1] "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9"

> lavNames(fit, "lv")
[1] "visual"  "textual" "speed"

> lavNames(fit, "ov.x")
character(0)

> lavNames(fit, "lv.x")
[1] "visual"  "textual" "speed"

> lavaan:::getDF(PT)
[1] 24

> lMR
-
> lMR[,c("id","lhs","op","rhs","mat","row","col")]
   id     lhs op     rhs    mat row col
1   1  visual =~      x1 lambda   1   1
2   2  visual =~      x2 lambda   2   1
3   3  visual =~      x3 lambda   3   1
4   4 textual =~      x4 lambda   4   2
5   5 textual =~      x5 lambda   5   2
6   6 textual =~      x6 lambda   6   2
7   7   speed =~      x7 lambda   7   3
8   8   speed =~      x8 lambda   8   3
9   9   speed =~      x9 lambda   9   3
10 10      x1 ~~      x1  theta   1   1
11 11      x2 ~~      x2  theta   2   2
12 12      x3 ~~      x3  theta   3   3
13 13      x4 ~~      x4  theta   4   4
14 14      x5 ~~      x5  theta   5   5
15 15      x6 ~~      x6  theta   6   6
16 16      x7 ~~      x7  theta   7   7
17 17      x8 ~~      x8  theta   8   8
18 18      x9 ~~      x9  theta   9   9
19 19  visual ~~  visual    psi   1   1
20 20 textual ~~ textual    psi   2   2
21 21   speed ~~   speed    psi   3   3
22 22  visual ~~ textual    psi   1   2
23 23  visual ~~   speed    psi   1   3
24 24 textual ~~   speed    psi   2   3
-
future plans
• S4 classes: nice, but clumsy and ridiculously slow
• newer code relies on ‘Reference Classes’
• for large-scale simulation studies: lavaan (or R) is too slow
• eventually, I will rewrite everything in C++ (using the Eigen library)
– ideally, only a thin layer is written in R
– to Rcpp or not to Rcpp?
– the python/MATLAB/. . . communities also need a high-quality package for latent variable analysis
-
why you should not create an R package
• I get 5–20 emails per day (lavaan related)
• contributed code is usually of low quality
• . . .
• R core is not a democracy
• CRAN is not a democracy
• . . .
• dependency hell: packages are (sometimes) removed from CRAN
• I shall not break any packages that depend on lavaan
-
R is (not so) great
• R, as a language, is not perfect
– the copy-by-value semantics
– a lot of unnecessary internal copying; this affects speed
– big data, parallelization: can be done, but not easily
– no native support for many basic matrix operations (sparse, ginv, . . . )
– optimizers are of medium-quality (optim, nlminb, . . . )
– vectorized code is relatively fast, but not always possible
– computing p-values under a multivariate normal distribution (package mvtnorm) is NOT vectorized (and hence slow)
– . . .
• I do not expect any spectacular changes in the future
• future alternatives? http://julialang.org/
-
Thank you!
http://lavaan.org