Department of Data Analysis Ghent University The R package lavaan Yves Rosseel Department of Data Analysis Ghent University – Belgium RBelgium meeting 5 @ UGent 13 September 2013 Yves Rosseel The R package lavaan 1 / 42


  • Department of Data Analysis Ghent University

    The R package lavaan

    Yves Rosseel
    Department of Data Analysis
    Ghent University – Belgium

    RBelgium meeting 5 @ UGent
    13 September 2013

    Yves Rosseel The R package lavaan 1 / 42

  • Department of Data Analysis Ghent University

    lavaan: latent variable analysis

    • large family of statistical models, exploiting the concept of ‘latent variables’

    • latent variables are:

    – unobserved constructs (e.g. ‘level of depression’)
    – random effects (as in mixed models)
    – missing data

    • most well-known subclass: ‘structural equation models (SEM)’


  • Department of Data Analysis Ghent University

    SEM examples: path analysis (no latent variables)

    [Figure: path diagram with observed variables y1 to y7]

  • Department of Data Analysis Ghent University

    SEM examples: confirmatory factor analysis (CFA)

    [Figure: CFA diagram with observed variables y1 to y9 and latent variables η1, η2 and η3]

  • Department of Data Analysis Ghent University

    SEM examples: ‘full’ structural equation modeling

    [Figure: structural equation model diagram with observed variables y1 to y12 and latent variables η1 to η4]

  • Department of Data Analysis Ghent University

    software for SEM: commercial – closed-source

    • the big four:

    – LISREL
    – EQS
    – AMOS
    – Mplus

    • SAS/Stat: proc CALIS, proc TCALIS

    • SEPATH (Statistica), RAMONA (Systat), Stata 12

    • Mx (free, closed-source)


  • Department of Data Analysis Ghent University

    software for SEM: non-commercial – open-source

    • outside the R ecosystem: gllamm (Stata module), . . .

    • R packages:

    – sem
    – OpenMx
    – lavaan
    – lava

    • interfaces between R and commercial packages:

    – REQS
    – MplusAutomation

  • Department of Data Analysis Ghent University

    the lavaan project

    1. lavaan subproject: the lavaan package/program

    • lavaan is an R package for latent variable analysis
    • the long-term goal of lavaan is to implement all the state-of-the-art capabilities that are currently available in commercial packages

    2. lavaan subproject: Rosetta

    • collection of tools for reading/parsing and writing legacy syntax (e.g. classic LISREL syntax)
    • intermediate representation: the lavaan parameter table
    • currently (0.5-14) only: Mplus/LISREL to lavaan, lavaan to Mplus

    3. lavaan subproject: Chameleon

    • mimic legacy software
    • reproducibility

  • Department of Data Analysis Ghent University

    the lavaan package (1)

    • lavaan is an R package for latent variable analysis

    • the long-term goal of lavaan is to implement all the state-of-the-art capabilities that are currently available in commercial packages

  • Department of Data Analysis Ghent University

    the lavaan package (2)

    • the lavaan source code is hosted on github:

    https://github.com/yrosseel/lavaan

    • more information about lavaan:

    http://lavaan.org

    • the lavaan paper:

    Rosseel (2012). lavaan: an R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.

    • lavaan discussion group (mailing list)

    https://groups.google.com/d/forum/lavaan


  • Department of Data Analysis Ghent University

    how big is lavaan?

    • > 25K lines of code (currently, 0.5-15, R code only)

    • how many people are using lavaan?

    – no idea
    – the lavaan paper has been downloaded > 12,600 times (since April 2012)
    – lavaan.org gets around 60–120 hits per day

  • Department of Data Analysis Ghent University

    where are the lavaan users?


  • Department of Data Analysis Ghent University

    why do we need lavaan?

    1. lavaan is for statisticians working in the field of SEM

    • it seems unfortunate that new developments in this field are hindered by the lack of open source software that researchers can use to implement their newest ideas

    2. lavaan is for teachers

    • teaching these techniques to students was often complicated by the forced choice for one of the commercial packages

    3. lavaan is for applied researchers

    • keep it simple, provide all the features they need


  • Department of Data Analysis Ghent University

    features of lavaan

    • lavaan is well-tested

    • user-friendly fitting functions (cfa, sem, growth)

    • power-user fitting function (lavaan)

    • support for non-normal continuous data:

    – robust standard errors, Satorra-Bentler correction, ADF estimation, bootstrapping

    • support for categorical (binary/ordinal) data

    – lavaan has implemented the three-stage WLS approach as developed by Bengt Muthén (1984), including robust variants (aka WLSMV)

    • full support for missing data, meanstructures, and multiple groups

    • linear and non-linear equality and inequality constraints
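As a rough illustration of the robust options in this list, the sketch below fits the same CFA twice: once with default ML and once with `estimator = "MLM"`, which keeps the ML point estimates but reports robust standard errors and a Satorra-Bentler scaled test statistic. The three-factor model and the built-in `HolzingerSwineford1939` data are borrowed from the CFA example later in the talk; pairing them with this slide is an assumption.

```r
library(lavaan)

# three-factor CFA model (taken from the CFA example, not from this slide)
HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

# default maximum likelihood fit
fit.ml <- cfa(HS.model, data = HolzingerSwineford1939)

# same point estimates, robust standard errors and
# a Satorra-Bentler scaled chi-square
fit.mlm <- cfa(HS.model, data = HolzingerSwineford1939, estimator = "MLM")

# compare the standard and the scaled test statistic
fitMeasures(fit.mlm, c("chisq", "chisq.scaled", "df"))
```

Other options from the list, such as bootstrapped standard errors or missing-data handling, are selected through further arguments of the same fitting functions.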


  • Department of Data Analysis Ghent University

    unique features

    • default model specification: lavaan model syntax

    – Mplus2lavaan (Michael Hallquist)
    – lisrel2lavaan (Corbin Quick)
    – graphical (via Onyx)
    – . . .

    • mimic the (numerical) results of commercial packages:

    – mimic="Mplus"
    – mimic="EQS"

    • new technical features:

    – informative hypothesis testing (Leonard Vanbrabant)
    – pairwise ML for binary/ordinal data (Myrsini Katsikatsou)
    – fraction of missing information (Mijke Rhemtulla)
    – . . .
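The `mimic` argument named on this slide is passed to the fitting function. A minimal sketch, assuming the three-factor CFA model and the `HolzingerSwineford1939` data used later in the talk (the slide itself shows no call):

```r
library(lavaan)

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

# ask lavaan to follow the conventions of a commercial package
fit.mplus <- cfa(HS.model, data = HolzingerSwineford1939, mimic = "Mplus")
fit.eqs   <- cfa(HS.model, data = HolzingerSwineford1939, mimic = "EQS")

# the test statistics can be compared side by side
fitMeasures(fit.mplus, c("chisq", "df"))
fitMeasures(fit.eqs,   c("chisq", "df"))
```

The model is the same in both calls; only the defaults that lavaan applies behind the scenes change.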

  • Department of Data Analysis Ghent University

    features NOT in lavaan (yet)

    • multilevel SEM

    • mixture (latent class) SEM

    • . . .

    features we are working on

    • Bayesian SEM (BUGS interface, stan interface, native)

    • small-sample corrections

    • causal inference

    • standard errors for standardized parameters

    • ML estimation for categorical data (IRT)

    • . . .

    • better (technical) documentation


  • Department of Data Analysis Ghent University

    the lavaan ecosystem

    • lavaan.survey (Daniel Oberski)

    survey weights, clustering, strata, and finite sampling corrections in SEM

    • Onyx (Timo von Oertzen, Andreas M. Brandmaier, Siny Tsang)

    interactive graphical interface for SEM (written in Java)

    • semTools (Sunthud Pornprasertmanit and many others)

    collection of useful functions for SEM

    • simsem (Sunthud Pornprasertmanit and many others)

    simulation of SEM models

    • semPlot (Sacha Epskamp)

    visualizations of SEM models


  • Department of Data Analysis Ghent University

    semPlot

    [Figure: semPlot path diagram with observed variables y1 to y6 and x1 to x3, and latent variables f1 and f2]

  • Department of Data Analysis Ghent University

    a simple regression analysis in R

    [Figure: path diagram of a regression of y on x1, x2, x3 and x4]

    # read in your data
    myData

  • Department of Data Analysis Ghent University

    lm() output, artificial data (N=100)

    Call:
    lm(formula = y ~ x1 + x2 + x3 + x4, data = myData)

    Residuals:
         Min       1Q   Median       3Q      Max
    -102.372  -29.458   -3.658   27.275  148.404

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  97.7210     4.7200  20.704

  • Department of Data Analysis Ghent University

    the lavaan model syntax – a simple regression

    [Figure: path diagram of a regression of y on x1, x2, x3 and x4]

    library(lavaan)
    myData
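The code on this slide is cut off in the transcript. A minimal sketch of the idea, assuming simulated stand-in data (the data frame below is invented for illustration and is not the data used in the talk): the regression formula known from `lm()` is written as a lavaan model string and fitted with `sem()`.

```r
library(lavaan)

# simulated stand-in data (an assumption, not the talk's data)
set.seed(1)
n <- 100
myData <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                     x3 = rnorm(n), x4 = rnorm(n))
myData$y <- 5 * myData$x1 - myData$x2 + myData$x3 + rnorm(n)

# classical regression
fit.lm <- lm(y ~ x1 + x2 + x3 + x4, data = myData)

# the same regression in lavaan model syntax
myModel <- ' y ~ x1 + x2 + x3 + x4 '
fit <- sem(myModel, data = myData)
summary(fit)

# the slopes of both fits, side by side
cbind(lm     = coef(fit.lm)[-1],
      lavaan = coef(fit)[c("y~x1", "y~x2", "y~x3", "y~x4")])
```

Without a mean structure, lavaan does not report an intercept, which is why only the slopes are compared.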

  • Department of Data Analysis Ghent University

    output (artificial data, N=100)

    lavaan (0.5-13) converged normally after 1 iterations

      Number of observations                           100

      Estimator                                         ML
      Minimum Function Test Statistic                0.000
      Degrees of freedom                                 0
      P-value (Chi-square)                           1.000

    Parameter estimates:

      Information                                 Expected
      Standard Errors                             Standard

                       Estimate  Std.err  Z-value  P(>|z|)
    Regressions:
      y ~
        x1                5.773    0.511   11.309    0.000
        x2               -1.321    0.479   -2.757    0.006
        x3                1.135    0.446    2.545    0.011
        x4                0.271    0.466    0.581    0.561

    Variances:
        y              2075.100  293.463

  • Department of Data Analysis Ghent University

    the lavaan model syntax – multivariate regression

    [Figure: path diagram with predictors x1 to x4 and outcomes y1 and y2]

    myModel
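The model string on this slide is lost in the transcript. One plausible way to write a multivariate regression with two outcomes is one formula per outcome, as sketched below. Both the model string and the simulated data are assumptions made for illustration.

```r
library(lavaan)

# simulated stand-in data (an assumption)
set.seed(2)
n <- 200
myData <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                     x3 = rnorm(n), x4 = rnorm(n))
myData$y1 <- 0.5 * myData$x1 + 0.3 * myData$x2 + rnorm(n)
myData$y2 <- 0.4 * myData$x3 - 0.2 * myData$x4 + rnorm(n)

# one regression formula per outcome
myModel <- ' y1 ~ x1 + x2 + x3 + x4
             y2 ~ x1 + x2 + x3 + x4 '

fit <- sem(myModel, data = myData)
summary(fit)
```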

  • Department of Data Analysis Ghent University

    the lavaan model syntax – path analysis

    [Figure: path diagram with observed variables x1 to x7]

    myModel
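The path model itself is not recoverable from the transcript. The sketch below shows how a path analysis over x1 to x7 could be written, with one formula per dependent variable. The specific paths and the simulated data are hypothetical and chosen only to make the example run.

```r
library(lavaan)

# simulated stand-in data following a hypothetical path structure
set.seed(3)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                x3 = rnorm(n), x4 = rnorm(n))
d$x5 <- 0.5 * d$x1 + 0.4 * d$x2 + 0.3 * d$x3 + rnorm(n)
d$x6 <- 0.5 * d$x4 + 0.6 * d$x5 + rnorm(n)
d$x7 <- 0.7 * d$x6 + rnorm(n)

# hypothetical path model: a chain of regressions
myModel <- ' x5 ~ x1 + x2 + x3
             x6 ~ x4 + x5
             x7 ~ x6 '

fit <- sem(myModel, data = d)
summary(fit)
```

A variable such as x5 can appear on the left of one formula and on the right of another, which is what distinguishes path analysis from a set of separate regressions.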

  • Department of Data Analysis Ghent University

    the lavaan model syntax – mediation analysis

    [Figure: mediation diagram with paths X → M (a), M → Y (b) and X → Y (c)]

    model
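The model string is cut off in the transcript. The sketch below shows how such a mediation model can be written with labelled paths and the `:=` operator for defined parameters; the labels `a`, `b`, `c`, `indirect` and `total` follow the output on the next slide. The simulated data and the reduced number of bootstrap draws (200 instead of the 1000 in the output) are assumptions to keep the example quick.

```r
library(lavaan)

# simulated stand-in data (an assumption)
set.seed(1234)
X <- rnorm(100)
M <- 0.5 * X + rnorm(100)
Y <- 0.7 * M + rnorm(100)
Data <- data.frame(X = X, M = M, Y = Y)

model <- ' # direct effect
             Y ~ c*X
           # mediator
             M ~ a*X
             Y ~ b*M
           # defined parameters
             indirect := a*b
             total    := c + (a*b) '

# bootstrap standard errors
fit <- sem(model, data = Data, se = "bootstrap", bootstrap = 200)
summary(fit)
```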

  • Department of Data Analysis Ghent University

    output...

    Parameter estimates:

      Information                                 Observed
      Standard Errors                            Bootstrap
      Number of requested bootstrap draws             1000
      Number of successful bootstrap draws            1000

                       Estimate  Std.err  Z-value  P(>|z|)
    Regressions:
      Y ~
        M (b)             0.597    0.098    6.068    0.000
        X (c)             2.594    1.210    2.145    0.032
      M ~
        X (a)             2.739    0.999    2.741    0.006

    Variances:
        Y               108.700   17.747
        M               105.408   16.556

    Defined parameters:
        indirect          1.636    0.645    2.535    0.011
        total             4.230    1.383    3.059    0.002

  • Department of Data Analysis Ghent University

    the lavaan model syntax – using cfa() or sem()

    [Figure: CFA diagram with latent variables visual, textual and speed, and indicators x1 to x9]

    HS.model
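The model string is cut off in the transcript. A sketch consistent with the output shown on the following slides (three factors, 301 observations, 24 degrees of freedom), assuming the `HolzingerSwineford1939` data set that ships with lavaan:

```r
library(lavaan)

# three latent variables, each measured by three indicators
HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

# cfa() adds residual variances, factor variances and factor
# covariances, and fixes the first loading of each factor to 1
fit <- cfa(HS.model, data = HolzingerSwineford1939)
summary(fit, fit.measures = TRUE, standardized = TRUE)
```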

  • Department of Data Analysis Ghent University

    the lavaan model syntax – using lavaan()

    [Figure: CFA diagram with latent variables visual, textual and speed, and indicators x1 to x9]

    HS.model
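With the power-user function `lavaan()`, nothing is added to the model automatically unless requested. The sketch below switches on three of the `auto.*` arguments so that the result matches what `cfa()` would produce for the same model. Whether the slide used these flags or spelled out every parameter in the model string cannot be recovered from the transcript.

```r
library(lavaan)

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

fit <- lavaan(HS.model, data = HolzingerSwineford1939,
              auto.var       = TRUE,   # add (residual) variances
              auto.fix.first = TRUE,   # fix first loading of each factor to 1
              auto.cov.lv.x  = TRUE)   # add covariances among the factors
summary(fit, fit.measures = TRUE)
```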

  • Department of Data Analysis Ghent University

    output

    lavaan (0.5-12) converged normally after 41 iterations

      Number of observations                           301

      Estimator                                         ML
      Minimum Function Chi-square                   85.306
      Degrees of freedom                                24
      P-value                                        0.000

    Chi-square test baseline model:

      Minimum Function Chi-square                  918.852
      Degrees of freedom                                36
      P-value                                        0.000

    Full model versus baseline model:

      Comparative Fit Index (CFI)                    0.931
      Tucker-Lewis Index (TLI)                       0.896

    Loglikelihood and Information Criteria:

      Loglikelihood user model (H0)              -3737.745
      Loglikelihood unrestricted model (H1)      -3695.092

      Number of free parameters                         21

  • Department of Data Analysis Ghent University

      Akaike (AIC)                                7517.490
      Bayesian (BIC)                              7595.339
      Sample-size adjusted Bayesian (BIC)         7528.739

    Root Mean Square Error of Approximation:

      RMSEA                                          0.092
      90 Percent Confidence Interval          0.071  0.114
      P-value RMSEA

                       Estimate  Std.err  Z-value  P(>|z|)   Std.lv  Std.all
    Latent variables:
      visual =~
        x1                1.000                               0.900    0.772
        x2                0.553    0.100    5.554    0.000    0.498    0.424
        x3                0.729    0.109    6.685    0.000    0.656    0.581
      textual =~
        x4                1.000                               0.990    0.852
        x5                1.113    0.065   17.014    0.000    1.102    0.855

  • Department of Data Analysis Ghent University

        x6                0.926    0.055   16.703    0.000    0.917    0.838
      speed =~
        x7                1.000                               0.619    0.570
        x8                1.180    0.165    7.152    0.000    0.731    0.723
        x9                1.082    0.151    7.155    0.000    0.670    0.665

    Covariances:
      visual ~~
        textual           0.408    0.074    5.552    0.000    0.459    0.459
        speed             0.262    0.056    4.660    0.000    0.471    0.471
      textual ~~
        speed             0.173    0.049    3.518    0.000    0.283    0.283

    Variances:
        x1                0.549    0.114                      0.549    0.404
        x2                1.134    0.102                      1.134    0.821
        x3                0.844    0.091                      0.844    0.662
        x4                0.371    0.048                      0.371    0.275
        x5                0.446    0.058                      0.446    0.269
        x6                0.356    0.043                      0.356    0.298
        x7                0.799    0.081                      0.799    0.676
        x8                0.488    0.074                      0.488    0.477
        x9                0.566    0.071                      0.566    0.558
        visual            0.809    0.145                      1.000    1.000
        textual           0.979    0.112                      1.000    1.000
        speed             0.384    0.086                      1.000    1.000

  • Department of Data Analysis Ghent University

    testing for measurement invariance

    # model 1: configural invariance
    fit1
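Only the first comment of this slide survives in the transcript. A common way to set up a sequence of invariance models in lavaan is sketched below, assuming the three-factor CFA model and the grouping variable `school` from the `HolzingerSwineford1939` data. The second and third steps are assumptions about what the slide went on to show.

```r
library(lavaan)

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

# model 1: configural invariance (same structure in both groups)
fit1 <- cfa(HS.model, data = HolzingerSwineford1939, group = "school")

# model 2: equal factor loadings across groups
fit2 <- cfa(HS.model, data = HolzingerSwineford1939, group = "school",
            group.equal = "loadings")

# model 3: equal loadings and intercepts across groups
fit3 <- cfa(HS.model, data = HolzingerSwineford1939, group = "school",
            group.equal = c("loadings", "intercepts"))

# chi-square difference tests between the nested models
lavTestLRT(fit1, fit2, fit3)
```

Each step adds equality constraints, so a significant difference test suggests that the added constraints do not hold.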

  • Department of Data Analysis Ghent University

    lavaan model syntax: full sem

    [Figure: path diagram with latent variables ind60, dem60 and dem65, and indicators x1 to x3 and y1 to y8]

    myModel
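The model string is cut off in the transcript. The variable names in the diagram match the `PoliticalDemocracy` data set that ships with lavaan, so the sketch below uses the model from the lavaan documentation for that data set. That the slide showed exactly this model, including the residual covariances, is an assumption.

```r
library(lavaan)

myModel <- '
  # latent variables (measurement model)
    ind60 =~ x1 + x2 + x3
    dem60 =~ y1 + y2 + y3 + y4
    dem65 =~ y5 + y6 + y7 + y8

  # regressions among the latent variables
    dem60 ~ ind60
    dem65 ~ ind60 + dem60

  # residual covariances
    y1 ~~ y5
    y2 ~~ y4 + y6
    y3 ~~ y7
    y4 ~~ y8
    y6 ~~ y8
'

fit <- sem(myModel, data = PoliticalDemocracy)
summary(fit, standardized = TRUE)
```

The three operators do all the work here: `=~` defines a latent variable, `~` a regression, and `~~` a (residual) variance or covariance.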

  • Department of Data Analysis Ghent University

    a simple growth curve model with time-varying covariates

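    [Figure: growth curve diagram with growth factors i and s, repeated measures t1 to t4, time-varying covariates c1 to c4, and covariates x1 and x2]

    model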
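The model string is cut off in the transcript. The variable names match the `Demo.growth` data set that ships with lavaan, so the sketch below follows the growth curve example from the lavaan documentation. That the slide used this exact specification is an assumption.

```r
library(lavaan)

model <- '
  # intercept and slope with fixed loadings
    i =~ 1*t1 + 1*t2 + 1*t3 + 1*t4
    s =~ 0*t1 + 1*t2 + 2*t3 + 3*t4

  # growth factors regressed on time-invariant covariates
    i ~ x1 + x2
    s ~ x1 + x2

  # time-varying covariates
    t1 ~ c1
    t2 ~ c2
    t3 ~ c3
    t4 ~ c4
'

fit <- growth(model, data = Demo.growth)
summary(fit)
```

The `1*` and `0*` prefixes fix loadings to constants, which is how the intercept and the linear slope are defined.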

  • Department of Data Analysis Ghent University

    further syntax

    • fixing parameters, and overriding auto-fixed parameters

    HS.model.bis
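The definition of `HS.model.bis` is lost in the transcript. One way to illustrate both ideas in this bullet is sketched below: `NA*` frees a loading that `cfa()` would otherwise fix to 1, and `1*` fixes a factor variance to 1 instead. The contents of the model string are an assumption, not the slide's code.

```r
library(lavaan)

HS.model.bis <- ' # NA* overrides the automatic fixing of the first loading
                  visual  =~ NA*x1 + x2 + x3
                  textual =~ NA*x4 + x5 + x6
                  speed   =~ NA*x7 + x8 + x9

                  # 1* fixes the factor variances to 1
                  visual  ~~ 1*visual
                  textual ~~ 1*textual
                  speed   ~~ 1*speed '

fit.bis <- cfa(HS.model.bis, data = HolzingerSwineford1939)
summary(fit.bis)
```

This changes only how the scale of each latent variable is set; the fit of the model stays the same.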

  • Department of Data Analysis Ghent University

    the parameter table (Holzinger & Swineford CFA example)

    > parTable(fit)
       id     lhs op     rhs user group free ustart exo label eq.id unco
     1  1  visual =~      x1    1     1    0      1   0           0    0
     2  2  visual =~      x2    1     1    1     NA   0           0    1
     3  3  visual =~      x3    1     1    2     NA   0           0    2
     4  4 textual =~      x4    1     1    0      1   0           0    0
     5  5 textual =~      x5    1     1    3     NA   0           0    3
     6  6 textual =~      x6    1     1    4     NA   0           0    4
     7  7   speed =~      x7    1     1    0      1   0           0    0
     8  8   speed =~      x8    1     1    5     NA   0           0    5
     9  9   speed =~      x9    1     1    6     NA   0           0    6
    10 10      x1 ~~      x1    0     1    7     NA   0           0    7
    11 11      x2 ~~      x2    0     1    8     NA   0           0    8
    12 12      x3 ~~      x3    0     1    9     NA   0           0    9
    13 13      x4 ~~      x4    0     1   10     NA   0           0   10
    14 14      x5 ~~      x5    0     1   11     NA   0           0   11
    15 15      x6 ~~      x6    0     1   12     NA   0           0   12
    16 16      x7 ~~      x7    0     1   13     NA   0           0   13
    17 17      x8 ~~      x8    0     1   14     NA   0           0   14
    18 18      x9 ~~      x9    0     1   15     NA   0           0   15
    19 19  visual ~~  visual    0     1   16     NA   0           0   16
    20 20 textual ~~ textual    0     1   17     NA   0           0   17
    21 21   speed ~~   speed    0     1   18     NA   0           0   18
    22 22  visual ~~ textual    0     1   19     NA   0           0   19
    23 23  visual ~~   speed    0     1   20     NA   0           0   20
    24 24 textual ~~   speed    0     1   21     NA   0           0   21
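A short sketch of how the parameter table can be inspected for a fitted model, assuming the three-factor CFA on the `HolzingerSwineford1939` data. Column names differ between lavaan versions, so only columns that also appear in the table above are selected here.

```r
library(lavaan)

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '
fit <- cfa(HS.model, data = HolzingerSwineford1939)

# one row per model parameter; free == 0 marks a fixed parameter
PT <- parTable(fit)
PT[, c("id", "lhs", "op", "rhs", "free", "ustart")]

# number of free parameters
sum(PT$free > 0)

# names of observed and latent variables
lavNames(fit, "ov")
lavNames(fit, "lv")
```

The rows with `user == 0` in the table above are the parameters that `cfa()` added automatically; only the nine loadings come from the model string.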

  • Department of Data Analysis Ghent University

    the parameter table (2)

    > PT
    > lavNames(fit, "ov")
    [1] "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9"

    > lavNames(fit, "lv")
    [1] "visual"  "textual" "speed"

    > lavNames(fit, "ov.x")
    character(0)

    > lavNames(fit, "lv.x")
    [1] "visual"  "textual" "speed"

    > lavaan:::getDF(PT)
    [1] 24

    > lMR

  • Department of Data Analysis Ghent University

    > lMR[,c("id","lhs","op","rhs","mat","row","col")]
       id     lhs op     rhs    mat row col
     1  1  visual =~      x1 lambda   1   1
     2  2  visual =~      x2 lambda   2   1
     3  3  visual =~      x3 lambda   3   1
     4  4 textual =~      x4 lambda   4   2
     5  5 textual =~      x5 lambda   5   2
     6  6 textual =~      x6 lambda   6   2
     7  7   speed =~      x7 lambda   7   3
     8  8   speed =~      x8 lambda   8   3
     9  9   speed =~      x9 lambda   9   3
    10 10      x1 ~~      x1  theta   1   1
    11 11      x2 ~~      x2  theta   2   2
    12 12      x3 ~~      x3  theta   3   3
    13 13      x4 ~~      x4  theta   4   4
    14 14      x5 ~~      x5  theta   5   5
    15 15      x6 ~~      x6  theta   6   6
    16 16      x7 ~~      x7  theta   7   7
    17 17      x8 ~~      x8  theta   8   8
    18 18      x9 ~~      x9  theta   9   9
    19 19  visual ~~  visual    psi   1   1
    20 20 textual ~~ textual    psi   2   2
    21 21   speed ~~   speed    psi   3   3
    22 22  visual ~~ textual    psi   1   2
    23 23  visual ~~   speed    psi   1   3
    24 24 textual ~~   speed    psi   2   3

  • Department of Data Analysis Ghent University

    future plans

    • S4 classes: nice, but clumsy and ridiculously slow

    • newer code relies on ‘Reference Classes’

    • for large-scale simulation studies: lavaan (or R) is too slow

    • eventually, I will rewrite everything in C++ (using the Eigen library)

    – ideally, only a thin layer is written in R
    – to Rcpp or not to Rcpp?
    – the python/MATLAB/. . . communities also need a high-quality package for latent variable analysis

  • Department of Data Analysis Ghent University

    why you should not create an R package

    • I get 5–20 emails per day (lavaan related)

    • contributed code is usually of low quality

    • . . .

    • R core is not a democracy

    • CRAN is not a democracy

    • . . .

    • dependency hell: packages are (sometimes) removed from CRAN

    • I shall not break any packages that depend on lavaan


  • Department of Data Analysis Ghent University

    R is (not so) great

    • R, as a language, is not perfect

    – the copy-by-value semantics
    – a lot of unnecessary internal copying; this affects speed
    – big data, parallelization: can be done, but not easily
    – no native support for many basic matrix operations (sparse, ginv, . . . )
    – optimizers are of medium-quality (optim, nlminb, . . . )
    – vectorized code is relatively fast, but not always possible
    – computing p-values under a multivariate normal distribution (package mvtnorm) is NOT vectorized (and hence slow)
    – . . .

    • I do not expect any spectacular changes in the future

    • future alternatives? http://julialang.org/


  • Department of Data Analysis Ghent University

    Thank you!

    http://lavaan.org
