FURTHER UNDERSTANDING OF FIT INDICES IN EVALUATING MODEL FIT OF STRUCTURAL EQUATION MODELS

GORDON W. CHEUNG

The Chinese University of Hong Kong
Room 425C, Leung Kau Kui Building, CUHK, Shatin, Hong Kong

REBECCA S. LAU

The City University of Hong Kong

LINDA C. WANG

Michigan State University

ABSTRACT

A simulation of over one million samples drawn from 5,184 sets of population parameters was conducted to examine the performance of fit indices and rules of thumb in evaluating model fit. The results support a two-index strategy, combining the Standardized Root-Mean-Square Residual (SRMR) with C1 Gamma Hat, together with cutoff values for various combinations of model parameters.

INTRODUCTION

Although the use of fit indices and the recommendations of their cutoff values have advanced structural equation modeling (SEM) research, criticisms about the generalizability of these cutoff values remain. While these cutoff values were developed from simulations evaluating the effects of issues such as sample size, estimation method, non-normality, and model misspecification on the performance of fit indices, several other critical aspects, such as parsimony errors and measurement errors, were not considered. Hence, these cutoff values may not apply when evaluating models in which these errors exist. Another issue concerns the recent development of various formulas for calculating χ2. The conventional cutoff values were developed from fit indices derived from the minimum fit function χ2 (C1). Some SEM programs (e.g., LISREL), however, report fit indices derived from the normal theory weighted least squares χ2 (C2). The implications of this change in χ2 calculation for the performance of fit indices are twofold. First, the fit index values reported in different studies may become incomparable if they were based upon different χ2 calculations. Second, the applicability of C1-derived absolute cutoff values in assessing model fit becomes questionable if model fit was assessed with fit indices derived from C2.

The central purpose of this study is to take these critical yet neglected issues into consideration when assessing the behavior of a range of fit indices. By conducting a series of simulations, we aim to answer two questions: Which fit indices are most stable in assessing model fit? What are the most appropriate cutoff values for the recommended fit indices?

METHOD

LISREL 8.72 and PRELIS 2.72 were used to conduct the simulations. For those fit indices that are rooted in χ2, we included both the C1 and C2 χ2 calculations in our simulations.


In total, we evaluated the performance of 53 fit indices. Data were simulated based on eight true population full models. The hypothesized models developed for testing against these simulated data sets contained different degrees of parsimony error due to a missing small cross-loading, and/or misspecification error due to a missing medium or large cross-loading, or a missing small or large direct structural path. Also, three levels of number of factors (3, 4, and 5), four levels of number of items per factor (3, 4, 5, and 6), and three levels of sample size (200, 400, and 800) were examined. Further, reliability was manipulated by varying the magnitude of measurement errors, such that two levels of multiple correlations of items (Rho) were examined: 0.5 and 0.4. These values correspond to Cronbach's alpha values between 0.67 and 0.86. Finally, three forms of distribution of factor loadings and measurement errors were considered: normal distribution, χ2 distribution with three degrees of freedom, and uniform distribution.

These combinations of sample size, model specifications, and model parameters resulted in the generation of 3 x 4 x 3 x 2 x 3 x 3 x 8 (number of factors x number of items per factor x sample size x measurement errors x factor loading distribution x measurement error distribution x model) = 5,184 population models with different population parameters. Next, 200 samples were generated for each of the population models, leading to a total of 1,036,800 simulated samples. Sample data were then fitted to the corresponding hypothesized models. Model fit was evaluated by comparing the model-implied covariance matrix to the observed covariance matrix, using the maximum likelihood estimation method.
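As a sketch of the scale of the design, the following Python snippet (ours, not from the paper) enumerates the factorial cells using the level values reported above and recovers the 5,184 and 1,036,800 totals:

```python
from itertools import product

# Design levels as reported in the text.
n_factors = [3, 4, 5]                      # number of factors
items_per_factor = [3, 4, 5, 6]            # number of items per factor
sample_sizes = [200, 400, 800]             # sample size N
rho_levels = [0.5, 0.4]                    # multiple correlations of items (Rho)
loading_dists = ["normal", "chi2_df3", "uniform"]   # factor loading distribution
error_dists = ["normal", "chi2_df3", "uniform"]     # measurement error distribution
models = range(1, 9)                       # eight true population full models

cells = list(product(n_factors, items_per_factor, sample_sizes,
                     rho_levels, loading_dists, error_dists, models))
assert len(cells) == 5_184                 # 3 x 4 x 3 x 2 x 3 x 3 x 8

REPLICATIONS = 200                         # samples generated per population model
print(len(cells) * REPLICATIONS)           # 1,036,800 simulated samples
```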

RESULTS AND DISCUSSION

Sensitivity to Model Parameters

The model parameters that have received the most research attention are sample size and model complexity. Ideally, fit indices should be insensitive to, or independent of, them. Another model parameter that is of critical importance but less researched is the underlying assumption of multivariate normality of observed variables. Most SEM estimation methods, such as maximum likelihood and generalized least squares, rely on normal theory with an assumption of multivariate normality. When the assumption does not hold, the χ2 test may reject the hypothesized model more frequently than the nominal rejection rate (Barrett, 2007). The parameter estimates and standard errors tend to be biased as well. Moreover, the impact of skewness and kurtosis propagates to those fit indices defined through the χ2-based test statistic (Yuan, Bentler, & Zhang, 2005). In consideration of the importance of these model parameters, we examined the performance of fit indices under a systematic variation of number of factors, number of items per factor, sample size, and distributional properties.
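As background, the normal-theory maximum likelihood discrepancy function and the χ2 statistic it yields take the following standard form (our notation, not reproduced from the paper), where S is the sample covariance matrix, Σ(θ) the model-implied covariance matrix, p the number of observed variables, and N the sample size:

```latex
F_{ML}(\theta) = \ln\lvert\Sigma(\theta)\rvert - \ln\lvert S\rvert
               + \operatorname{tr}\!\bigl(S\,\Sigma(\theta)^{-1}\bigr) - p,
\qquad
\chi^2_{C1} = (N - 1)\,F_{ML}(\hat\theta)
```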

Our simulation findings have several critical implications concerning the impact of model parameters on the performance of fit indices. First, the number of items per factor and the number of factors have a considerable impact on almost all fit indices, whether the model is correctly or incorrectly specified. Second, violation of the multivariate normality assumption has a small impact on fit indices. Third, when the model fits perfectly, sample size has a trivial impact on fit-function-based indices. This is because when the model is properly specified, the minimum fit function value is equal to zero, giving rise to zero association between sample size and these fit indices. However, sample size gains its impact on this category of fit indices when the model suffers from specification errors. Fourth, sample size may affect the values of fit indices such as the Normed Fit Index (NFI) and the Relative Fit Index (RFI), because it is embedded in the calculation of these fit indices through χ2. Also, sample size affects the sampling distribution of these fit indices. All of these findings suggest that the association among sample size, modeling conditions, and fit index performance is complicated. To accommodate this complexity, different cutoff values for various sample sizes and levels of model complexity are necessary.
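The standard definitions of NFI and RFI make this embedding explicit (standard formulations, not reproduced from the paper, with m denoting the hypothesized model and b the baseline independence model); since χ2 = (N − 1)F, sample size enters both numerator and denominator:

```latex
\mathrm{NFI} = \frac{\chi^2_b - \chi^2_m}{\chi^2_b},
\qquad
\mathrm{RFI} = \frac{\chi^2_b/df_b - \chi^2_m/df_m}{\chi^2_b/df_b}
```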

Performance of Rules of Thumb

As the intention of most SEM users is to retain their hypothesized models, a valid rule of thumb should not only result in a minimal Type I error rate but also have sufficient power to reject a misspecified hypothesized model (Tomarken & Waller, 2003). Most recommended rules of thumb, however, have been criticized as lacking enough power to detect substantial misfit (Marsh, Hau, & Wen, 2004). Consequently, some cutoff values may be able to limit Type I error but are not sensitive enough to reject models suffering from large misspecification errors. Also, researchers have paid more attention to specification errors caused by missing factor loadings in measurement models than to those caused by missing structural paths in full models (e.g., Hu & Bentler, 1999). Some researchers have also examined model misfit caused by missing factor covariances (e.g., Hu & Bentler, 1999), which is quite uncommon in data analysis. To focus on what researchers commonly encounter in data analysis, we examined two types of misspecification in full models: missing factor loadings and missing structural paths. The use of full models in our simulations also allowed us to examine how well the existing rules of thumb, which were mainly derived from measurement models, work in detecting model misspecification in full models.
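The Type I error rate and power of a cutoff rule can be estimated directly from simulated samples as empirical rejection rates. The following Python sketch (ours; the variable names in the usage note are hypothetical) illustrates the computation:

```python
import numpy as np

def rejection_rate(index_values, cutoff, reject_below=True):
    """Share of simulated samples whose fit index value rejects the model.

    For goodness-of-fit indices (e.g., NFI, CFI) a model is rejected when the
    value falls below the cutoff; for badness-of-fit indices (e.g., SRMR,
    RMSEA) it is rejected when the value exceeds the cutoff.
    """
    values = np.asarray(index_values)
    rejected = values < cutoff if reject_below else values > cutoff
    return rejected.mean()

# Hypothetical usage: `nfi_true` holds NFI values from samples generated under
# a correctly specified model, `nfi_misspec` under a misspecified one.
# type1 = rejection_rate(nfi_true, cutoff=0.95)      # Type I error rate
# power = rejection_rate(nfi_misspec, cutoff=0.95)   # power
```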

Our findings show that some rules of thumb (e.g., NFI with a recommended cutoff value of 0.9 or 0.95) are too powerful: while they have high rejection rates for misspecified models, they also have high rejection rates for true and parsimonious models. On the other hand, some rules of thumb lack enough power to detect false models. The existing rules of thumb for the Root-Mean-Square Error of Approximation (RMSEA) and SRMR, although they retain true and parsimonious models, fail to reject even substantially misspecified models. Hence, hypothesized models that have been retained by using these rules of thumb may merely be the consequence of inadequate power to reject false hypothesized models. Compared with other rules of thumb, McDonald's Fit Index (Mc) performs well at a cutoff value of 0.9, especially for detecting missing cross-loadings. The C1 Tucker-Lewis Index (TLI) and C1 Comparative Fit Index (CFI) with cutoff values of 0.95 also have adequate power for detecting missing cross-loadings. Yet these rules of thumb lack sufficient power to detect missing structural paths. New rules of thumb thus become desirable.
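For reference, common formulations of Mc and TLI are shown below (standard definitions, not reproduced from the paper; some sources scale Mc by N rather than N − 1):

```latex
\mathrm{Mc} = \exp\!\left[-\tfrac{1}{2}\,\frac{\chi^2 - df}{N - 1}\right],
\qquad
\mathrm{TLI} = \frac{\chi^2_b/df_b - \chi^2_m/df_m}{\chi^2_b/df_b - 1}
```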

Sensitivity to Measurement Errors

Unlike regression-based approaches to estimating structural paths, SEM offers an analytical technique that partials out measurement errors from the estimates of structural relationships (James, Mulaik, & Brett, 1982). In the SEM context, measurement errors are defined as the portion of an observed variable that measures something other than what the latent variable is supposed to measure. Factor reliability, in turn, quantifies the amount of variance in the items that is explained by the factor. Although measurement errors are partialed out of the estimates of structural relationships in SEM, their magnitude can still influence the power of fit indices (Tomarken & Waller, 2003). Nevertheless, most simulation studies have simply assumed perfect reliability of measures. Although some past simulation studies have incorporated measurement errors (e.g., Kenny & McCoach, 2003), they did not manipulate these errors as a model parameter when evaluating the behavior of fit indices. To address this concern, we examined how fit indices respond to measurement errors by systematically varying the magnitude of these errors.
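In a standard CFA parameterization (our notation, not the paper's), an item x = λξ + δ with a standardized factor ξ has reliability equal to the share of its variance explained by the factor, which links the Rho levels manipulated here to the measurement error variance θ_δ:

```latex
\rho = \frac{\lambda^2}{\lambda^2 + \theta_\delta}
```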


Our findings show, first, that although measurement errors account for some of the variance of fit indices such as NFI and RFI, the amount of variance explained is not as large as that attributed to the number of items per factor or the number of factors in a model. Second, the impact of measurement errors on some fit indices, including SRMR and Mc, is trivial when the models are true but becomes more substantial when the models contain specification errors. Third, the size of measurement errors is related to model rejection rates, particularly for χ2 values and NFI. When measurement errors are smaller (Rho = 0.5), the model rejection rates by χ2 values are higher, but the power of NFI to detect model misfit is lower. However, the influence of Rho on the values of other fit indices, such as SRMR, C1 and C2 RMSEA, C1 and C2 Gamma Hat, and C1 and C2 CFI, is minimal. In sum, although measurement errors may not impact all fit indices to the same degree, their influence on some fit indices can be substantial. This highlights the importance of taking measurement errors into consideration when recommending cutoff values.

Influence of Parsimony Errors

Parsimony errors stem from omitting secondary relationships in SEM models (Cheung & Rensvold, 2001), including secondary factor loadings and correlations among error terms. These relationships usually have trivial values in empirical samples, lack theoretical bases, and carry no substantive meaning. Parsimony errors are commonly found in exploratory factor analysis (EFA) and confirmatory factor analysis (CFA), where secondary factor loadings are intentionally constrained to zero to achieve a simple factor structure for further analysis. Another possible source of parsimony errors is constraining small correlations among error terms to zero. By doing so, researchers can achieve parsimonious models, in line with a scientific principle of research. Excluding these secondary relationships thus generates a discrepancy between the true model in the population and the hypothesized model, influencing not only parameter estimation but also indicators of model fit (Hall, Snell, & Foust, 1999). Given their lack of theoretical significance, it is ideal for fit indices to be insensitive to the omission of secondary relationships, so as to allow model parsimony without rejecting the model, but to change substantially, indicating that a model should be rejected, when there are substantive misspecifications.

Our simulation findings reveal that some fit indices, such as fit-function-based indices, are more sensitive to parsimony errors than others. When secondary relationships are constrained to zero, the discrepancy thereby created between the population and the hypothesized models causes χ2 and fit-function-based indices to increase accordingly, leading to model rejection. NFI and Mc are two other fit indices with some sensitivity to parsimony errors, although their sensitivity is much smaller than that of fit-function-based indices. The remaining fit indices, as well as the two-index strategies, are rather insensitive to parsimony errors; they are, however, insensitive to other forms of model misspecification as well. In general, these findings suggest that without considering parsimony errors, researchers may reject a parsimonious model that contains no detrimental model misspecification, especially when fit-function-based indices are used to evaluate model fit. Given this potential drawback, it is necessary to take parsimony errors into account when recommending rules of thumb.

Influence of the Calculation of χ2 Values

Most fit indices are derived from χ2. Over time, more χ2 calculations have been developed in SEM. Currently, there are four calculations: minimum fit function χ2 (C1), normal theory weighted least squares χ2 (C2), Satorra-Bentler scaled χ2 (C3), and χ2 corrected for non-normality (C4; Jöreskog, 2004). Considering that the model fit output from SEM programs usually provides fit index values using C1 and C2, we focus our discussion on these two calculations. When the hypothesized model is the same as the population model under multivariate normality of observed variables, both χ2 values for the model should be asymptotically equivalent (Jöreskog, 2004). The values tend to differ when the hypothesized model does not hold. When Hu and Bentler (1999) recommended cutoff values for fit indices, the cutoff values were derived from fit indices using C1. It remains unclear whether the recommended C1-based rules of thumb have similar validity when fit indices are based upon a different χ2 calculation, and if not, what other fit indices and cutoff values should be used instead.
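For reference, C1 and C2 can be written as follows (a standard normal-theory formulation in our notation, not reproduced from Jöreskog, 2004). Here s and σ̂ stack the non-duplicated elements of the sample and model-implied covariance matrices, and Ŵ is the normal-theory estimate of the asymptotic covariance matrix of s:

```latex
% Minimum fit function chi-square (C1), the scaled minimum of F_{ML} given earlier:
\chi^2_{C1} = (N - 1)\,F_{ML}(\hat\theta)

% Normal theory weighted least squares chi-square (C2), a quadratic form in the
% residuals:
\chi^2_{C2} = (N - 1)\,(s - \hat\sigma)^{\top}\,\hat{W}^{-1}\,(s - \hat\sigma)
```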

Our simulation results show that the values of χ2 for the hypothesized model and RMSEA are not considerably affected by the different calculations of χ2. However, for incremental fit indices such as CFI, the difference between C1 and C2 values can be substantial, especially when the model is wrong. This is because incremental fit indices, including CFI, are calculated from the increment from the χ2 value of the hypothesized model to the χ2 value of the independence model. Since the independence model is very likely to be false in the population, the C1 and C2 independence χ2 values are likely to differ. Because there is a correspondingly larger increment in model fit under C2 than under C1, the CFI values calculated with C2 are substantially higher than those calculated with C1; C2 CFI values are thus inflated. Consequently, applying cutoff values derived from C1-based incremental fit indices to evaluate model fit assessed with C2-based fit index values is questionable. Recent SEM software programs report fit indices using different χ2 calculations; for instance, LISREL reports C2 whereas AMOS reports C1 fit indices. While researchers are free to choose the χ2 calculation they prefer, they are advised to report the χ2 calculation they used and to select consistent rules of thumb for model evaluation. If researchers employ C1-derived rules of thumb to evaluate model fit with C2-based fit index values, there may be inadequate power to reject misspecified models.
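The standard CFI formula makes this mechanism explicit (standard definition, our notation, not reproduced from the paper): because a false independence model tends to yield a larger χ2 for the baseline model under C2, the denominator grows faster than the numerator, pushing C2-based CFI upward:

```latex
\mathrm{CFI} = 1 - \frac{\max\!\left(\chi^2_m - df_m,\; 0\right)}
                       {\max\!\left(\chi^2_b - df_b,\; \chi^2_m - df_m,\; 0\right)}
```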

CONCLUSION

In this study, simulations were conducted to examine the impact of various model parameters, as well as several pressing yet neglected issues, including measurement errors, parsimony errors, and χ2 calculations, on fit index performance. The sensitivity of existing rules of thumb to different forms and degrees of model misspecification was also examined. Our findings show that Gamma Hat is particularly sensitive to missing cross-loadings, whereas SRMR is particularly sensitive to missing direct paths. We thus recommend that practitioners use a two-index strategy, combining SRMR and C1 Gamma Hat, for model evaluation. We also recommend different cutoff values for different sample sizes and modeling scenarios. This combined strategy outperforms previous rules of thumb in that it not only provides adequate power to detect model misspecification but also protects against Type I errors for models containing merely parsimony errors.

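To make the recommended two-index strategy concrete, the following Python sketch shows the decision rule under stated assumptions: the cutoffs are left as parameters because the recommended values vary with sample size and model complexity (the paper's tables, not reproduced here), and Gamma Hat is computed with the common noncentrality-based formulation, which we assume matches the paper's C1-based version:

```python
def gamma_hat(chi2, df, n, p):
    """C1 Gamma Hat from the C1 chi-square, its degrees of freedom, sample
    size n, and number of observed variables p (common formulation; some
    sources scale the noncentrality by n rather than n - 1)."""
    noncentrality = max(chi2 - df, 0.0) / (n - 1)
    return p / (p + 2.0 * noncentrality)

def two_index_decision(srmr, c1_chi2, df, n, p, srmr_cutoff, gamma_cutoff):
    """Retain the model only if SRMR is small enough (sensitive to missing
    direct paths) AND C1 Gamma Hat is large enough (sensitive to missing
    cross-loadings). Cutoffs depend on sample size and model complexity."""
    return srmr <= srmr_cutoff and gamma_hat(c1_chi2, df, n, p) >= gamma_cutoff

# Hypothetical usage with placeholder cutoffs (consult the paper's tables for
# the values appropriate to a given sample size and model complexity):
# retain = two_index_decision(srmr=0.05, c1_chi2=312.4, df=265, n=400, p=25,
#                             srmr_cutoff=0.08, gamma_cutoff=0.95)
```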

REFERENCES

Barrett, P. 2007. Structural equation modelling: Adjudging model fit. Personality and Individual Differences, 42: 815-824.

Cheung, G. W., & Rensvold, R. B. 2001. The effects of model parsimony and sampling error on the fit of structural equation models. Organizational Research Methods, 4: 236-264.

Hall, R. J., Snell, A. F., & Foust, M. S. 1999. Item parceling strategies in SEM: Investigating the subtle effects of unmodeled secondary constructs. Organizational Research Methods, 2: 233-256.

Hu, L., & Bentler, P. M. 1999. Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6: 1-55.

James, L. R., Mulaik, S. A., & Brett, J. M. 1982. Causal analysis: Models, assumptions, and data. Beverly Hills, CA: Sage.

Jöreskog, K. G. 2004. On chi-squares for the independence model and fit measures in LISREL. URL http://www.ssicentral.com/lisrel/advancedtopics.html. Accessed December 31, 2008.

Kenny, D. A., & McCoach, D. B. 2003. Effect of the number of variables on measures of fit in structural equation modeling. Structural Equation Modeling, 10: 333-351.

Marsh, H. W., Hau, K.-T., & Wen, Z. 2004. In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler's (1999) findings. Structural Equation Modeling, 11: 320-341.

Tomarken, A. J., & Waller, N. G. 2003. Potential problems with "well fitting" models. Journal of Abnormal Psychology, 112: 578-598.

Yuan, K.-H., Bentler, P. M., & Zhang, W. 2005. The effect of skewness and kurtosis on mean and covariance structure analysis: The univariate case and its multivariate implication. Sociological Methods and Research, 34: 240-258.
