
Model Flexibility Analysis

Vladislav D. Veksler, U.S. Army Research Laboratory, Aberdeen, Maryland

Christopher W. Myers and Kevin A. Gluck, U.S. Air Force Research Laboratory, Wright-Patterson AFB, Ohio

A good fit of model predictions to empirical data is often used as an argument for model validity. However, if the model is flexible enough to fit a large proportion of potential empirical outcomes, finding a good fit becomes less meaningful. We propose a method for estimating the proportion of potential empirical outcomes that the model can fit: Model Flexibility Analysis (MFA). MFA aids model evaluation by providing a metric for gauging the persuasiveness of a given fit. We demonstrate that MFA can be more informative than merely discounting the fit by the number of free parameters in the model, and show how the number of free parameters does not necessarily correlate with the flexibility of the model. Additionally, we contrast MFA with other flexibility assessment techniques, including Parameter Space Partitioning, Model Mimicry, Minimum Description Length, and Prior Predictive Evaluation. Finally, we provide examples of how MFA can help to inform modeling results and discuss a variety of issues relating to the use of MFA in model validation.

Keywords: model evaluation, model selection, goodness of fit, model flexibility, parametric complexity

In validating psychological models we can examine whether model predictions match the observed behavioral data. Formal modeling efforts, which involve some combination of mathematical and computational processes, are especially convincing relative to verbal theories in that models can be quantitatively evaluated in terms of how well they fit behavioral data (Campbell & Bolton, 2005; McClelland, 2009; Myers, Gluck, Gunzelmann, & Krusmark, 2011). That is, we can precisely measure differences between data values or patterns predicted by the model and those observed in empirical studies.1

More often than not, the models we are evaluating have free parameters. These are necessary either to account for individual or group differences, or simply because of gaps in scientific theory.

Consequently, a model does not produce a single set of predictions, but rather a range of predictions derived from a variety of parameter settings. The standard practice in the field is to vary model parameters and report the best fit of model predictions to observed data. Roberts and Pashler (2000) report that the research literature contains thousands of articles, some from as early as the 1930s, where the best fit of model predictions to observed data is presented as evidence for model validity. If one model's best fit to data is better than another's, it is concluded to be the better of the two models. The problem, as Roberts and Pashler point out, is that the best fitting model may merely be the most flexible model: one that can fit a greater range of hypothetical datasets given the full enumeration of its free parameters. "Without knowing how much a theory constrains possible outcomes, you cannot know how impressed to be when observation and theory are consistent" (p. 359).

Consider the case of the two models depicted in Figure 1. One of the models (left) makes a tight set of predictions, whereas the other (right) could fit almost any set of potentially observable data. As such, presenting a good fit to data as evidence for the validity of the model on the right may be misleading, and should be interpreted in the context of the model's flexibility.2

In this article we propose a method for estimating the proportion of hypothetical data space that a model can account for as a simple and direct metric of that model's flexibility. We refer to this method as Model Flexibility Analysis (MFA).

1 There are many scientific considerations besides goodness of fit that are important in model evaluation. There are also many simulations that do not focus on a single best-fit, or do not focus on model validation through best-fit. Additionally, there are valid concerns about the replicability of the data being fit. Although all of these concerns are important to model evaluation, they are not the focus of this article.

2 Model flexibility is also known as model complexity (e.g., Myung, Balasubramanian, & Pitt, 2000; Myung & Pitt, 1997; Pitt, Myung, & Zhang, 2002). We believe that flexibility is the better term, because the term complexity may have a lay interpretation implying that the model is difficult to understand, which is not the intended definition.

This article was published Online First August 31, 2015. Vladislav D. Veksler, Human Research & Engineering, DCS Corporation, U.S. Army Research Laboratory, Aberdeen Proving Ground, Aberdeen, Maryland; Christopher W. Myers and Kevin A. Gluck, 711th Human Performance Wing, U.S. Air Force Research Laboratory, Wright-Patterson AFB, Ohio.

This research was performed partly while the first author held a National Research Council Research Associateship Award with the Air Force Research Laboratory's Cognitive Models and Agents Branch. A majority of this work was supported by AFOSR Grant 13RH06COR. A part of this work was performed by the first author as a DCS Corp contractor under Army Research Laboratory contract W911NF-10-D-0002 and supporting work funded under Cooperative Agreement Number W911NF-09-2-0053. We thank Joseph W. Houpt and Matthew M. Walsh for their invaluable input. We extend many thanks to the reviewers of this article, especially to Peter Grünwald, who provided an extended review with a complete NML tutorial. Additionally, we thank MindModeling.org for providing easy access to large scale computing resources that supported this research.

Correspondence concerning this article should be addressed to Vladislav D. Veksler, DCS Corp, U.S. Army Research Laboratory, Building 417, Aberdeen Proving Ground, Aberdeen, MD 21001. E-mail: vdv718@gmail.com

Psychological Review, 2015, Vol. 122, No. 4, 755–769. In the public domain. http://dx.doi.org/10.1037/a0039657


Given a good fit of a model to observed data, MFA can determine the likelihood that this fit was found because of model flexibility. As such, the method is meant to address the fundamental question asked by Roberts and Pashler (2000): How Persuasive is a [particular] Good Fit?

MFA may be used with any class of models and any goodness of fit measure. It is more computationally expensive than using a raw count of free parameters, but it is also more informative. Additionally, MFA provides an easily interpretable quantitative metric of flexibility for each individual model, which enables absolute model evaluation, in addition to relative model comparison and selection.

We envision two use cases for MFA. First, MFA aids model evaluation and model selection in that it helps to interpret the goodness of fit. To be clear, MFA is a complement to (rather than a replacement for) goodness of fit metrics. For this reason we suggest the inclusion of MFA statistics alongside goodness of fit statistics whenever the goodness of fit between model predictions and empirical results is presented as evidence for model validity. Second, MFA can predict model flexibility for a given task environment a priori, before empirical results are collected. For this reason we suggest using MFA to down-select which empirical studies may be most useful for model validation.

The rest of this article outlines some limitations of current flexibility-estimation techniques, describes the details of MFA, and presents four simulations as sample use cases for MFA in model evaluation. Simulations 1 and 2 use MFA to interpret the fits of process-based cognitive models to empirical data, Simulation 3 uses MFA to interpret the fits of mathematical models, and Simulation 4 uses MFA a priori to improve experimental design to make it more useful in model evaluation.

Established Methods for Assessing Model Flexibility

Number of Free Parameters

There have been many attempts to address the model flexibility issue by means of counting the number of free parameters in the model. It is often assumed that the more free parameters a model has, the more flexible it is. AIC (Akaike Information Criterion; Akaike, 1973), BIC (Bayesian Information Criterion; Schwarz, 1978), adjusted r2, and RMSEA (root mean square error approximation; Myung & Pitt, 1997) are some of the model evaluation metrics that use the raw count of model parameters at their core. This is a very popular metric of flexibility, and there is much benefit in its simplicity and computational frugality.
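For reference (these standard definitions are supplied here and are not from the original article): with k free parameters, n observations, and maximized likelihood L̂,

AIC = 2k − 2 ln L̂
BIC = k ln(n) − 2 ln L̂

so both criteria discount fit by a term that depends only on the raw parameter count k.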

Unfortunately, for most formal psychological models, the raw number of free parameters provides, at best, a very coarse estimate of the potential space of model predictions. Discounting by the number of free parameters is based on the assumptions of linear regression models, where with enough free parameters, one can fit any dataset. However, if models include any nonlinear characteristics, then the number of model parameters becomes less meaningful (e.g., Bamber & van Santen, 1985; Navarro, Pitt, & Myung, 2004; Roberts & Pashler, 2000; Wagenmakers, Ratcliff, Gomez, & Iverson, 2004).

To be more specific, let us consider the case of hypothetical Models A and B making predictions for two behavioral measures, DV1 and DV2. Model A has three free parameters, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, 0 ≤ z ≤ 1, and is defined such that DV1 is predicted as .1x, and DV2 is predicted as .1y + .001z. Model B has two free parameters, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and is defined such that DV1 is predicted as .4x, and DV2 is predicted as .4y. More formally:

A(x, y, z) → (.1x, .1y + .001z)    (1)

B(x, y) → (.4x, .4y)    (2)

Assuming that the potential ranges of DV1 and DV2 are both between 0 and 1, the probability of Model A making the correct prediction is .0101, whereas that of Model B is .16 (see Figure 2). Assuming that both models provide a good fit to data, Model A is the more informative of the two, because it is less flexible. The model's constraints, rather than the parameter enumeration, are likely to be responsible for finding a good fit between model and data. Model A's good fit is more persuasive than that of Model B. However, if we were to focus solely on the numbers of free parameters in the two models, we would have concluded the opposite. According to methods using the number of free parameters (e.g., AIC, BIC), if the observed data were (DV1 = .05, DV2 = .05), and both models include this point in their predictions, Model B would be the preferred model because it has fewer free parameters. To drive this point home, there are models with a single free parameter that are flexible enough to produce a good fit to any observed data (e.g., C(x) → (x, .5 + .5 sin(10000x))), and whole classes of models (hierarchical models) where flexibility is inversely proportional to the number of model parameters.

Figure 1. DV1 and DV2 are both measures of behavior. For both measures, the axes cover the whole range of possible values. Gray area represents model predictions. The model on the right is more flexible than the model on the left.
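As a runnable illustration of the single-free-parameter case mentioned above (the hypothetical Model C), the following R sketch enumerates C's one parameter and counts how much of the DV1 × DV2 unit square its predictions touch; the 100 × 100 counting grid and the enumeration density are arbitrary choices for this illustration, not part of the MFA procedure defined later in this article.

# Hypothetical one-parameter Model C from the text: C(x) -> (x, .5 + .5 sin(10000x)).
# Despite having a single free parameter, its predictions sweep the whole data space.
x <- seq(0, 1, length.out = 500000)                # fine enumeration of the one parameter
predC <- cbind(DV1 = x, DV2 = .5 + .5 * sin(10000 * x))
cells <- 100                                       # fixed 100 x 100 grid (illustration only)
bins <- floor(predC * (cells - 1e-7))              # grid-cell index of each predicted point
nrow(unique(bins)) / cells^2                       # close to 1: nearly every cell is reachable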

Psychological models are constrained in much more interesting and intricate ways than Models A and B in the example above, often comprising complex relationships between psychological processes, the task, and the free parameters. Moreover, psychological simulations often place various explicitly reported and testable constraints on parameter bounds. Thus, such simulations produce behavioral predictions that are rarely linear derivations of the parameter values. It is almost never the case that a modeling simulation with n free parameters can predict all hypothetical datasets of length n. For these reasons, the number of free parameters is at best an imprecise (and at worst an inaccurate) estimate of model flexibility.

In summary, the number of free parameters in a model is not directly indicative of its flexibility. A better method is needed for understanding the extent to which a good fit is because of reported model constraints, and the extent to which it may be attributable to model flexibility.

Rather than using a raw count of model parameters, Roberts and Pashler (2000) recommend a full parameter-space enumeration, "varying each free parameter over its entire range, in all possible combinations" (p. 363), to determine model flexibility. Generating the full set of model predictions via an exhaustive parameter-space enumeration is computationally expensive, and is not the common practice in model evaluation. Modeling efforts often involve the manipulation of parameter values by hand until a "good enough" fit is found, or the employment of some optimization technique (e.g., genetic algorithms, hill-climbing) to speed up the best-fit search times. However, with the emerging availability of free high-performance grid and volunteer computing resources (e.g., mindmodeling.org; Gluck, 2010; Harris, 2008; Moore & Gunzelmann, 2014), it has become possible to gather the entire space of model predictions to help analyze model flexibility.

Parameter Space Partitioning

Pitt, Kim, Navarro, and Myung (2006) suggest a technique for evaluating model flexibility based on the entire space of model predictions: Parameter Space Partitioning (PSP). PSP is useful for analyzing the flexibility of a model to achieve various qualitative relationships between behavioral measures. In PSP the predicted data space is partitioned into qualitatively meaningful regions. The higher the number of unique partitions generated in the predicted space, the greater the flexibility of the model.

For example, let us assume that an empirical study reveals one behavioral measure, DV1, to be less than another behavioral measure, DV2. Figure 3 depicts two models that, given the correct parameter settings, can both predict this phenomenon. However, whereas Model C includes three qualitatively distinct regions in its predictions (DV1 < DV2, DV1 = DV2, and DV1 > DV2), Model D includes only two of these regions. In this case, according to PSP, Model C is considered the more flexible model, and thus, a worse account of the data than Model D.
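For concreteness, a brute-force sketch of this counting idea is shown below; it assumes model predictions for DV1 and DV2 have already been enumerated into a matrix, and it is meant only to convey the bookkeeping, not the published PSP algorithm.

# Rough sketch of the counting idea only (not the published PSP algorithm):
# label each enumerated prediction by its qualitative pattern and count patterns.
psp_count <- function(preds, tol = 1e-8) {
  # preds: matrix with two columns, DV1 and DV2
  pattern <- ifelse(abs(preds[, 1] - preds[, 2]) < tol, "DV1 = DV2",
                    ifelse(preds[, 1] < preds[, 2], "DV1 < DV2", "DV1 > DV2"))
  length(unique(pattern))   # number of qualitatively distinct regions predicted
}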

Figure 3. The space of potentially observable data is split up into three qualitatively distinct partitions: DV1 < DV2, DV1 = DV2, DV1 > DV2. Model C can potentially predict all three patterns, whereas Model D predicts only two patterns.

Figure 2. Predictions of Models A and B described in Equations 1 and 2, respectively. Model A has three free parameters, Model B has two free parameters. DV1 and DV2 are both measures of behavior. For both measures, the axes cover the whole range of possible values.


Being a strictly qualitative metric (Pitt et al., 2006), PSP offers only a rough analysis of the predicted space. For example, PSP may judge Models A and B described in Equations 1 and 2 above as similarly flexible in that both models include three qualitatively distinct partitions in their predictions (DV1 < DV2, DV1 = DV2, and DV1 > DV2). For simulations that are concerned with quantitative fit to data and more precise estimates of flexibility, PSP may not be the appropriate evaluation technique.

Model Mimicry

Model Mimicry (e.g., Kim, Navarro, Pitt, & Myung, 2004; Navarro et al., 2004; Wagenmakers et al., 2004; Van Zandt, Colonius, & Proctor, 2000; Van Zandt & Ratcliff, 1995) is an assessment of model flexibility that determines how well two models fit each other's predictions. If Model X can fit more Model Y predictions than vice versa, then Model X is considered the more flexible of the two models. Thus, Model Mimicry provides a rank-order of model flexibility (with the exception of the case where two models provide a decent fit to data, but not to each other, as depicted in Figure 4).

However, Model Mimicry does not answer the question asked by Roberts and Pashler (2000): how impressed should we be with a good fit of model predictions to observed data? For example, suppose that we run a simulation and find that two models both provide a good fit to data. We then determine via Model Mimicry that one model is more flexible than the other. Such rank ordering is informative, but it does not distinguish between the scenario depicted in Figure 1, where one model is extremely flexible and another is tightly constrained, the scenario depicted on the left of Figure 5, where both models are extremely flexible, and that depicted on the right of Figure 5, where both models make tight sets of predictions.

Much psychological research focuses on the question as to which model provides a better account of a given phenomenon. In such instances, model selection methods like Model Mimicry provide the appropriate tools for deriving a single "best" model based on both fit and flexibility. However, there is also much research that focuses on the evaluation of individual models. This may be the case because other models are not appropriate or available, or because the proposed model operates at a different level of analysis, or includes some other features of interest. In other words, whether a model, or a set of models, provides a good account of given empirical observations is often an important scientific question in itself, beyond the need to choose a single best model.

For example, Ratcliff (2002) evaluated a specific model without contrasting it with other models, providing good fits to the observed data as evidence in support for this model. In this instance Ratcliff went beyond similar modeling efforts and attempted to evaluate whether the good fits to the observed data were persuasive. To do this, Ratcliff "generated fake data by hand . . . and then attempted to fit the [proposed] model to them" (p. 286). Although a more systematic approach to generating and fitting "fake" data is warranted, Ratcliff's work is a step toward interpreting goodness of fit in the context of model flexibility.

Minimum Description Length

Minimum Description Length (MDL) is a family of methods for selecting among models based on the interaction between model fit and model complexity (COMP; i.e., flexibility), where models that provide greater data compression are considered preferable (e.g., Hansen & Yu, 2001). Although some earlier MDL methods used the number of free parameters to estimate flexibility (e.g., Pitt & Myung, 2002), in the modern formulation of MDL, Normalized Maximum Likelihood (NML; Grünwald, 2005, 2007; Myung, Navarro, & Pitt, 2006), COMP is based on all possible model predictions over the entire range of hypothetical datasets (rather than a few hand-picked datasets, a la Ratcliff, 2002).

Normalized Maximum Likelihood (NML; Grünwald, 2005, 2007; Myung et al., 2006) is an evolved "universal code" form of MDL methods for model selection. For a given model the NML term to be minimized is defined as:

−log p(x | θ̂(x)) + COMP

where x is the observed data and θ̂(x) is the most likely set of model parameters given x. The COMP term is defined as:

log Σ_{y ∈ Y} p(y | θ̂(y))

where Y is the set of all potentially observable hypothetical datasets and θ̂(y) is the most likely set of model parameters given some hypothetical dataset y. In cases where the COMP term cannot be solved directly, it may be estimated via a Monte Carlo simulation (Roos, 2008).
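As a toy illustration of the COMP sum (not tied to any model in this article), consider a binomial model of y successes in N trials, for which the maximum-likelihood parameter given a hypothetical dataset y is y/N; the sum over all N + 1 possible datasets can then be computed directly:

# Toy illustration: COMP for a binomial model with N observations, where
# theta-hat(y) = y/N is the maximum-likelihood success probability for dataset y.
comp_binomial <- function(N) {
  y <- 0:N
  log(sum(dbinom(y, size = N, prob = y / N)))
}
comp_binomial(100)   # roughly 2.5; the complexity term grows slowly with N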

One advantage of using NML as a model selection method is that the goodness of fit term, p(x | θ̂(x)), is normalized by the flexibility term, COMP. Additionally, NML may be used for gauging the persuasiveness of a given model's fit to data. Specifically, Grünwald (2007) shows how the difference between NML scores of the model being evaluated and a model that can predict everything could be converted into a p value. However, NML-derived p values differ from the standard p values in null-hypothesis testing, in that they are nonstrict (i.e., no upper bound at 1.0). Nonstrict p values are more difficult to interpret as error likelihoods, but provide concrete interpretation in the gambling domain.3

Figure 4. Results from a hypothetical simulation where the differences between two models are negligible, and predictions are nonoverlapping. DV1 and DV2 are both measures of behavior.

A potential limitation of NML is that this method is tied to log-likelihood as the goodness of fit measure, and is only appropriate where model predictions may be translated to probability distributions. In the case where translating model predictions to probability distributions is a labor- or computationally intensive task, NML evaluation becomes less attractive.

Bayesian Approaches

Bayesian Model Selection (BMS; e.g., Kass & Raftery, 1995; Myung & Pitt, 1997) is similar to NML, except that it focuses on mean likelihoods, rather than maximum likelihoods of models, and provides an arguable advantage of conditioning exclusively on the observed data. Like Model Mimicry, BMS is more appropriate as a method for model selection, rather than individual model evaluation, and has the same drawbacks as other relative ranking measures in that it does not inform us as to which of the model fits, if any, are persuasive.4

Bayesian Posterior Predictive Evaluation (Gelman et al., 2013) and Prior Predictive Evaluation (Vanpaemel, 2009) are methods for absolute model evaluation, rather than relative ranking. In Posterior Predictive Evaluation likelihoods are determined for all potential model parameter values based on the observed data, and then likely model predictions are contrasted with observed data. This method exclusively focuses on the observed, rather than the nonobserved data, and thus, it is more akin to a measure of fit, rather than a measure of flexibility. Prior Predictive Evaluation, on the other hand, does focus on model flexibility.

Prior Predictive Evaluation (Vanpaemel, 2009) is a technique for estimating model flexibility based on model priors. Model priors are the prespecified parameter value ranges/distributions for the model. To be clear, the idea that parameter value constraints are required for estimating model flexibility is not novel. Whenever an assessment calls for a full parameter enumeration (e.g., Grünwald, 2007; Pitt et al., 2006; Roberts & Pashler, 2000), including the method being proposed in this article, there is an implicit assumption that the enumeration is done within given model parameter constraints. However, Vanpaemel (2009) takes a stronger stance that no model specifications are complete without clear constraints on the priors.

Formally, Prior Predictive Evaluation estimates model flexibility for a given simulation as the average predictive flexibility across all behavioral measures in the simulation:

(1 / m) Σ_{i=1}^{m} (PIi / UIi)

where for each behavioral measure of interest, i, UIi is the universal interval (all hypothetical values of i) and PIi is the predicted interval (all potentially predicted values of i based on model specifications).

3 Commentary by Peter Grünwald: Grünwald (2007) shows how the difference between NML scores of the model being evaluated and a trivial model that can predict everything equally well can be converted into a p value. For such a trivial model, we can, for example, take a uniform distribution, and then the transformed NML score becomes interpretable as a p value, similarly to the interpretation of MFA in terms of a p value against a uniform null (described in section Interpretation of φ as a p value). However, whereas the MFA φ-values are standard p values, NML-derived p values are nonstrict p values, that is, they are upper bounds on actual p values (and can thus be larger than 1): for any nonstrict p value P1, there is a standard p value P2 such that, on all possible sequences of data, P2 ≤ P1. In general, nonstrict p values may be trivial (i.e., a function that outputs a number larger than 1 on every data sequence would be a nonstrict p value), but the NML-derived nonstrict p values are in fact "test martingales" in the sense of Shafer, Shen, Vereshchagin, and Vovk (2011). As the latter article points out, these always remain close to actual p values and have a concrete interpretation in terms of gambling.

4 BMS quantification is not on an ordinal scale, but is difficult to interpret in the absence of relative comparisons.

Figure 5. Results from two hypothetical simulations where one model is more flexible than another. DV1 and DV2 are both measures of behavior. Model Mimicry cannot distinguish between the results depicted on the left, where both models are overly flexible, and those on the right, where both models provide tight constraints.


There is a danger in estimating flexibility in this way because model predictions are not independent across the behavioral measures. For example, model predictions may vary across the entire range of each behavioral measure without necessarily covering a large proportion of the total hypothetical data space. This scenario is depicted in Figure 6, where Model L covers almost the entirety of hypothetical values for both DV1 and DV2, but still predicts a smaller proportion of the hypothetical data space than Model K, which only covers about half of the hypothetical values for DV1. In this example, Prior Predictive Evaluation would conclude that Model K is the less flexible model, whereas it is actually the more flexible of the two models, covering a larger proportion of the total hypothetical data space.5

Model Flexibility Analysis

We propose MFA as a technique for assessing model flexibility. MFA comprises estimating the proportion of potentially observable data that a model can account for. If, in a given simulation, model predictions can account for a large proportion of potentially observable data, a good fit to any particular dataset becomes less persuasive. MFA is meant to be a general method for estimating flexibility, independent of model type (e.g., closed-form equation or a production system model, regression or hierarchical, stochastic or deterministic) or measure of fit involved in model evaluation (Table 1).

MFA involves (a) systematic enumeration of model parameters to generate a representative space of model predictions, and (b) estimation of the proportion of the entire range of potentially observable data that the model can predict, φ. In practice, researchers often generate a set of model predictions, so as to determine the best fit of the model to observed data. This same set of predictions can, and should, then be used to determine the total proportion of potentially observable data that the model can account for, that is, model flexibility.

MFA is meant to estimate parametric flexibility, rather than flexibility because of stochastic properties of the model. Thus, for each unique set of parameter values, MFA requires a single predicted point in the data space. In the case where model predictions are stochastic, the predicted point should represent the central behavioral tendency for that parameter set.6 Thus, if the model has k free parameters, and predictions are generated for j unique values of each parameter in a given simulation, there will be a total of j^k predicted points in the n-dimensional data space for this simulation, where n is the number of behavioral measures of interest in this simulation.
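In R, this kind of systematic enumeration is naturally expressed with expand.grid; the sketch below uses a dummy stand-in model (run_model is hypothetical, not a model from this article) purely to show the j^k bookkeeping.

# Sketch of step (a): enumerate j values of each of the k free parameters,
# yielding j^k parameter combinations, and record one predicted point per combination.
# run_model() is a dummy two-measure stand-in, not an actual model from this article.
run_model <- function(p) c(dv1 = p[["learningRate"]] / 2,
                           dv2 = p[["learningRate"]] * (1 - p[["noise"]]))
paramGrid <- expand.grid(learningRate = seq(.05, .95, by = .05),   # j = 19 values
                         noise        = seq(.05, .95, by = .05))   # k = 2 parameters
modelPredictions <- t(apply(paramGrid, 1, run_model))              # j^k = 361 predicted points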

Given the universal interval UIi (all hypothetical values of i) for each behavioral measure of interest i, the total size of the potential data space is

∏_{i=1}^{n} UIi.

For example, if each measure of interest is a proportion (e.g., proportion of errors in a trial), each UIi is [0,1], and the total potential data space is 1.0.

The proportion, φ, of the n-dimensional potential data space that is covered by model predictions can be estimated from the predicted points by placing a grid on top of the data space, and reporting the proportion of grid cells that include model predictions. The granularity of the grid should be a function of the number of predicted points, j^k, and the number of dimensions of the predicted space, n. Specifically, we suggest breaking up each grid dimension into (j^k)^(1/n) cells (see Table 2 for R code).7

For example, recall Models A and B discussed in the previous section, described in Equations 1 and 2, respectively.8 Let us assume that a grid search (i.e., exhaustively checking model predictions for every parameter value combination) with a granularity of .01 was performed to find the best fit of each model to some observed dataset, and it was determined that both models provided good fits to the observed data (e.g., a low RMSE or a high r2).

5 The prior predictive method, in general, can easily be extended to nonindependent behavioral measures by simply applying it to the joint distribution (e.g., Lee, 2015). Thus, perhaps, the most important contribution of PPE is not the exact method for flexibility estimation, but rather the principle of specifying parameter value distributions for each model.

6 In simulations where dependent measure variance is as important as the mean, variance should be treated as an additional measure of interest. That is, if mean model predictions, as well as mean variances, were being fit to observed data across n behavioral measures, model predictions would be points in 2n-dimensional space.

8 Models A and B generate clear continuous prediction spaces and their flexibility can be calculated without the use of MFA. The use of these simple models is strictly for initial expository purposes. MFA becomes much more useful for real modeling efforts where the size of the model prediction-space cannot be clearly or easily calculated from model specifications, as is the case for the modeling simulations presented later in this article.

7 Other methods exist for estimating continuous areas from simulated data points. The exact method of estimation is not nearly as important as its precision, and the ability to report meaningful flexibility estimates for varying model types. If a different method is used for estimating a φ-value, it should be clearly reported.

Figure 6. Results from a hypothetical simulation where Model K is more flexible than Model L. DV1 and DV2 are both measures of behavior. Prior Predictive Evaluation would incorrectly determine here that Model L is the more flexible model, as it predicts almost the entire range of hypothetical values for each measure of behavior.


At this point, the grid search for Models A and B would have generated the predictions displayed in Table 3. Given these predictions, we can determine the proportion of the potential space that the model is flexible enough to account for.

We ran MFA for Models A and B to find the probabilities that these models could account for any hypothetical dataset (assuming that the potential ranges of behavioral measures DV1 and DV2 are both between 0 and 1), deriving the following:

Model A: φ = .01
Model B: φ = .16

Based on these MFA results we can conclude that a good fit is unlikely to have been found because of the flexibility of Model A. However, Model B is flexible enough to find a good fit to data for about one in six potentially observable empirical outcomes. The conclusion to be drawn here is that the good fit of Model B to data is not very persuasive.

We are hesitant to give a recommendation as to the exact φ-value that may be used as a threshold for declaring a fit to be "significant," because model flexibility may be interpreted differently based on the empirical domain, model generalizability, and other factors. Rather, we recommend that the order of magnitude of the φ-value is used to gauge the level of persuasiveness of the reported fit. As a guide, a reported φ-value of 1.0 means that the model can fit any hypothetical dataset in a given simulation, a φ-value of 0.1 means that one in 10 potential outcomes could have been fit by the model (we do not believe this to be very persuasive), a φ-value of 0.01 means that model predictions are constrained to about one percent of the data space, and a φ-value of 0.001 or lower means that the model predicts virtually no data outside of the empirical results.

Acceptable φ-value ranges will likely evolve separately in each model's respective domain (as is the case for acceptable goodness of fit statistics). Even if a φ-value above 0.10 is not desirable in general, in a certain scientific pursuit, a simulation resulting in a φ-value of 0.15 may be impressive, and may make for a valuable scientific contribution. What we recommend is simply that φ-values are reported alongside goodness of fit statistics, so as to provide context, and help the modeler and the reader to more meaningfully interpret the reported fit.

MFA Precision

To examine MFA precision we ran simulations with models where it was possible to specify the exact size of the prediction space, φ_T. Whereas the MFA-estimated φ-value is based on a finite number of simulation-generated predictions, φ_T is the proportion of the predicted space that would be covered if the model parameter space was enumerated with infinitely fine granularity.

Table 1
Metrics of Model Flexibility

Flexibility assessment | Appropriate for all model types | Appropriate for quantitative analyses | Interpretable flexibility metric for each model
Number of free parameters (e.g., AIC, BIC, RMSEA, adjusted r2) | No a | Yes | Yes
Parameter Space Partitioning (PSP; Pitt et al., 2006) | Yes | No b | Yes
Model Mimicry (Navarro et al., 2004; Wagenmakers et al., 2004) | Yes | Yes | No c
Normalized Maximum Likelihood (NML; Grünwald, 2005, 2007) | No d | Yes | Yes e
Bayesian Model Selection (BMS; e.g., Myung & Pitt, 1997) | No d | Yes | No c
Prior Predictive Evaluation (Vanpaemel, 2009) | No f | Yes | Yes
Model Flexibility Analysis | Yes | Yes | Yes

a Appropriate for linear mathematical models, where with n free quantitative parameters one can fit any n-dimensional dataset. b Appropriate for qualitative analyses of relationships between behavioral measures. c Provides rank-order of model flexibility. d Appropriate for probability models where goodness of fit is measured via a log-likelihood. e Provides nonstrict p-values, which may be difficult to interpret in some cases. f Appropriate where model predictions are independent across behavioral measures.

Table 2
R Code for Computing MFA

mfa = function (modelPredictions) {
  numberOfMeasures    = ncol(modelPredictions)
  numberOfPredictions = nrow(modelPredictions)
  totalCells  = numberOfPredictions
  granularity = totalCells ^ (1 / numberOfMeasures)
  cellPredictions = floor(modelPredictions[,] * (granularity - .0000001))
  return(nrow(unique(cellPredictions)) / totalCells)
}

Note. Potential range of each behavioral measure is assumed to be between 0 and 1. If the potential range is different, data should be scaled for use with this function.
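As a usage sketch for the function in Table 2, the hypothetical Models A and B from Equations 1 and 2 can be enumerated at the .01 granularity described earlier; the resulting estimates match the φ-values reported above to within rounding (the Model A grid has just over a million rows, so the first call takes a few seconds).

# Usage sketch: enumerate Models A and B (Equations 1 and 2) at granularity .01
# and estimate each model's flexibility with mfa() from Table 2.
p <- seq(0, 1, by = .01)

gridA <- expand.grid(x = p, y = p, z = p)
predA <- cbind(.1 * gridA$x, .1 * gridA$y + .001 * gridA$z)
mfa(predA)   # approximately .01

gridB <- expand.grid(x = p, y = p)
predB <- cbind(.4 * gridB$x, .4 * gridB$y)
mfa(predB)   # approximately .16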


Three model types were used. The first model type had a single continuous prediction space with uniform density (similar to Models A and B described in Equations 1 and 2). In the second model type the prediction space was split in two, with one part of the space being smaller and more dense than the other (a sample two-dimensional prediction space is portrayed in Figure 7, left-most plot of the second row). The third model type had a single continuous prediction space, which was more dense in the center and less dense toward the edges (a sample two-dimensional prediction space is portrayed in Figure 7, left-most plot of the third row).

In addition to model types, we manipulated the granularity of the parameter space enumeration, the size of the prediction space (φ_T), and the number of dimensions of the prediction space. Simulation results are displayed in Figure 7.

Note that for even-density prediction spaces MFA error is negligible, and falls on the conservative side. That is, MFA is more likely to report a slight overestimate of model flexibility than an underestimate. Erring on the conservative side is beneficial in efforts where we are looking for stronger evidence in support of a given model.

For the third model type, where some parts of the prediction space are more sparsely covered than others, MFA may underestimate φ_T, especially for higher φ-values. This is not a problem for two reasons. First, this phenomenon is nonexistent for the lower range of φ-values. Thus, a low φ-value may still be used to argue for the persuasiveness of a given fit, and a high φ-value to suggest a need for further model constraints. Second, as model predictions become more sparse within a given part of the data space, finding a good fit to observed data in that part of the space becomes less likely. Thus, in the absence of infinite data, predictive flexibility of models that contain such sparse prediction-regions is actually lower than φ_T (as is indicated via MFA).

One potential concern with volume estimation via a grid-cell count is that as the granularity increases (i.e., the space is split into more partitions), the size of each cell in the grid becomes smaller, and the reported volume proportion approaches zero. This concern is alleviated in the proposed method in that the number of cells in the grid is proportional to the number of predictions generated by the model. For example, Figure 8 displays the MFA error across exponentially increasing grid granularity for predictions in a two-dimensional space by the three model types from Figure 7 (φ_T = .04).
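For the uniform-density case, this behavior can be checked directly with a model whose predictions fill a region of known size; the sketch below (not the simulation reported in Figures 7 and 8) uses a prediction space of true size φ_T = .04 and the mfa() function from Table 2.

# Sketch: uniform-density predictions filling a known .2 x .2 region (true size .04).
p <- seq(0, 1, by = .01)
g <- expand.grid(x = p, y = p)
predU <- cbind(.2 * g$x, .2 * g$y)   # predictions uniformly cover [0, .2] x [0, .2]
mfa(predU)                           # about .043: a slight, conservative overestimate of .04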

Considerations Regarding Behavioral Measures

MFA is agnostic as to what behavioral measures are being observed and predicted. However, each behavioral measure must have a definitive range. For example, the range of an SAT section score as a behavioral measure would be between 200 and 800. If model predictions were between 300 and 600, these would take up 50% of the potential data space, and thus, the flexibility of the model would be .5. For experiments where the ranges of behavioral measures are ill-defined, and there is no defendable common-sense potential value range, MFA may not be an appropriate method for flexibility assessment.
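Because the mfa() function in Table 2 assumes each measure lies in [0,1], measures with other ranges must be rescaled first; a minimal sketch for the SAT example above, using hypothetical enumerated predictions between 300 and 600:

# Hypothetical predictions spanning 300-600 on a 200-800 SAT section scale.
satPred <- matrix(seq(300, 600, length.out = 1000), ncol = 1)
scaled <- (satPred - 200) / (800 - 200)   # map the 200-800 range onto [0, 1]
mfa(scaled)                               # approximately .5, as in the example above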

Note that as the range of any behavioral measure decreases, the same set of model predictions begins to take up a larger proportion of the potential dataspace, and vice versa. Thus, MFA φ-values are inversely proportional to the range sizes of the behavioral measures. As the range of a behavioral measure approaches zero, the probability of fitting any observable datapoint in the range approaches 100%. Ad absurdum, if there was only a single possible outcome in a given simulation (i.e., there is no way for a model to not produce this outcome), producing this prediction could not be meaningfully interpreted as evidence in support of the model. Correspondingly, a high φ-value means that a good fit cannot be taken as evidence in support of a model, and as data ranges grow smaller, the φ-value grows higher, indicating that a good fit is less meaningful (this is in contrast to measures like NML, where the flexibility term, COMP, becomes smaller as the data range decreases, thus placing more weight, rather than less, on the best-found fit).

For these reasons, a conservative MFA assessment will attempt to minimize the behavioral ranges, and thus, the geometric space from which hypothetical datasets are drawn, as much as possible. For example, in experiments where the variables of interest are response times (RTs) the data may range from negative values (false starts) up to hundreds of seconds (lapses in attention). However, depending on the focus of the study, if a closer look at the data was to reveal that the bulk of RTs fall within a tighter range between 0 and x seconds, the RT range in MFA should also be constrained between 0 and x, so as to derive a more conservative φ-value estimate.

Parameter Space Enumeration

Solution-biased searches are often used to arrive at best-fitting model parameters. Examples of such techniques include hill-climbing and genetic algorithms. MFA should not be computed using a set of model predictions that comes from a solution-biased search. The reason for this is that MFA requires unbiased parameter space enumeration, whether it is via random parameter value selection or a systematic stepwise search. The enumeration does not have to be linear (e.g., if a logarithmic exploration of the parameter space makes more sense for your model, that is fine), but it must not be biased toward the observed data values.

To this end, hand-tailoring parameter values should be viewed as a solution-biased search. It should not be enough to plug a few intuition-based parameter values into the model and report the best-found fit. Rather, the constraints on parameter bounds should be made explicit, the parameter-space should be enumerated within these bounds to produce an unbiased set of predictions, and then the best fit and flexibility may be derived from this set of predictions. If researchers are hesitant to enumerate the full parameter space because of limited computational resources, they should consider using freely available parameter-enumeration resources like mindmodeling.org (Gluck, 2010; Harris, 2008; Moore & Gunzelmann, 2014).

Table 3
Predictions Made by Models A and B Given the Full Parameter-Space Enumeration With a Granularity of .01

       Model A              Model B
DV1      DV2          DV1      DV2
0        0            0        0
0        .00001       0        .004
...      ...          ...      ...
.1       .10099       .4       .396
.1       .101         .4       .4



Hand-tailoring parameter values is different from extrapolating these values from the literature or using model defaults. When exact parameter values are picked and never varied, they are no longer considered "free" parameters, but rather falsifiable model constraints.

On a related note, validity of MFA results does not require parameter enumeration constraints to be set a priori. That is, it is entirely possible to run a full enumeration of model parameters, find a good fit to data but a high MFA value, and then to realize that one of the parameters can be constrained, thus reducing the MFA value. This may be interpreted as a form of statistical cheating, in that the model was adjusted on the basis of the data, if the fact that the adjustment was post hoc was not honestly reported. However, model design is an iterative debugging process with the goal of improving model validity, and so post hoc adjustments are assumed in all modeling work, unless otherwise stated. For all modeling results that report AIC/BIC values, readers may assume that models were tweaked where possible to take out free parameters, so as to reduce the perception of flexibility. In a similar manner, as modelers begin to care about φ-values, it should also be assumed that the design process includes the goal of decreasing model flexibility. If all model constraints are clearly reported, even if these constraints were deduced postanalysis, then these become falsifiable claims about the evolving model.

Interpretation of φ as a p Value

MFA's φ has an interpretation as a p value of the data under a uniform null hypothesis. That is, if one performs MFA for a particular simulation and gets a φ-value of .05, that means that about one in 20 similar modeling efforts would have yielded a good fit to data by chance, having nothing to do with the model constraints being evaluated.

Figure 7. Model Flexibility Analysis (MFA) predictions for various types, sizes, and densities of predicted model spaces. Each model had n free parameters to generate n-dimensional predictions. For each model, each of the n free parameters was varied between 0 and 1, inclusively, with granularities of .05, .02, and .01 to generate 21^n, 51^n, and 101^n total predictions used for φ-value estimates, respectively. See the online article for the color version of this figure.

Figure 8. Example of Model Flexibility Analysis (MFA) precision invariance with increasing grid granularity (i.e., number of partitions for each dimension in the predicted space). Results shown for single-area, two-area, and varying-density model types, making predictions in a two-dimensional space, where the actual size of the predicted space, φ_T, is .04.

More specifically, if we were evaluating model X, and this model provided a good fit to observed data, the null hypothesis would be that model X accounts for the observed data by chance. A φ-value of .05 would indicate that if we were to reject the null hypothesis, and assume that model X constraints are indeed responsible for a good fit to data, there is a 5% chance that we would be committing a Type 1 error. If we were to accept the null, and assume that the model is not a good account for the data, but rather that the observed data happened to be close to model predictions by chance, there is a 95% chance that we would be committing a Type 2 error.

This raises the question as to why the null should be uniform, rather than problem-dependent. That is, we can alter the procedure such that φ-values represent predicted proportions of nonlinear geometric spaces. Indeed, there may be behavioral simulations that justify corrections for nonlinear measures of interest (e.g., using a log-scale).

We would like to reiterate that it would be ill-advised to have a universal "significance" threshold for φ-values, as is the common practice for p values (e.g., .01, .05). It may well be the case, for example, that a certain class of models or a certain class of behavioral phenomena requires a more conservative or a more liberal flexibility criterion than others. It may be the case that a model can account for data across a wide range of behavioral phenomena at the cost of higher φ-values for any one simulation. Thus, we merely suggest that φ-values are reported alongside each goodness of fit statistic, helping reviewers and readers to gauge the persuasiveness of each fit on a continuous scale.

Use Cases of MFA in Model Evaluation

In this section we present four examples of the use of MFA in modeling. In Simulations 1–3 MFA is used to interpret goodness of fit statistics in the evaluation of cognitive models. In Simulation 4 MFA is used a priori (before running the experiment), so as to determine whether a given experimental design could be used for evaluation of two competing models. We do not describe the models used in these simulations in great detail, as these are tangential to the focus of the article; please refer to the original cited articles for detailed model descriptions.

Simulation 1: Paired Associates (Bower, 1961)

As an example of how MFA may be useful in model evaluation, let us take the relatively simple case of a two-parameter Reinforcement Learning (RL) model attempting to fit two data-points from a paired associates task. More specifically, we use the RL mechanism from the ACT-R cognitive architecture (Anderson, 2007) to simulate the human cognitive processing in the Bower (1961) paired-associates task. This experiment required participants to learn how 10 distinct stimuli mapped onto two potential responses. The two data-points that we are attempting to fit are (a) the average participant error rate across the first five learning blocks, and (b) the average participant error rate across the second five learning blocks. The two parameters in the model are learning rate and exploratory noise. Varying each model parameter between .01 and .30, with a granularity of .01, we get the set of predictions depicted as gray points in Figure 9, and a best-fit of RMSE = .0033.

For the sake of comparison, let us assume a competing model: Model B represented in Equation 2. Model B also has two free parameters, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and these were also varied with a granularity of .01. The best fit found for Model B was RMSE = .0016.

According to metrics based on the number of free parameters (e.g., AIC, BIC), Model B would be considered the more likely of the two models, as it has a better fit, and the same number of free parameters. However, running MFA for both of these simulations we get the following:

RL: φ = 0.04
Model B: φ = 0.16

As a simplification as to how MFA produces these numbers, we can imagine placing a grid over Figure 9 and counting the numbers of squares containing the predictions of each model, finding RL to have covered 4% and Model B to have covered 16% of the squares. Adding a bit more nuance to this methodology, the number of grid squares for each model would be equal to the number of predictions generated by that model.

What might we conclude from these statistics? Obviously the best Model B fit is better than that of RL. However, according to MFA, there is a 16% probability that the good fit found for Model B could have been found no matter what the behavioral data turned out to be, whereas the probability of finding the good fit produced by RL is only about 4%. In other words, Model B is highly flexible, and finding a good fit between empirical results and Model B predictions in this simulation is not persuasive evidence for the validity of the model. Contrariwise, the RL model makes a relatively constrained set of predictions in this simulation, and its good fit to observed data may be given more weight.

Figure 9. Predictions by Reinforcement Learning (RL) for the Bower (1961) paired associates task.

Note that unlike RL, Model B is not a cognitive model: it is a straw-man (or a straw-model), presented here for comparison and discussion purposes only. It is common to present straw-models (e.g., random or optimal performance) in research articles, but it is less than meaningful to say that some proposed model is best among these. A beneficial feature of MFA is that it provides an interpretable metric of flexibility for each individual model, independent of other models' results. Thus, in the case that we were evaluating RL in absence of a comparison to Model B, we would conclude that it provides a good fit to data, RMSE = .0033, and that this is fairly persuasive evidence for the model, because the parameter enumeration for this model would have accounted for only about 4% of the potential datasets.

In summary, this simulation exemplifies how (a) model flexibility provides important context for interpreting the reported goodness of fit statistics, (b) MFA returns a more precise and accurate estimate of model flexibility than the raw count of model parameters, and (c) MFA may be used to provide context for a simulation whether the research involves model comparison or independent model evaluation.

Simulation 2: Reward and Transition Factors in Choice Repetition (Daw et al., 2011)

Veksler, Myers, and Gluck (2014) presented an integrated model of Associative Learning (AL) and RL. The integrated model (SAwSu) was contrasted with standalone AL and RL models in accounting for behavioral data from Daw, Gershman, Seymour, Dayan, and Dolan (2011). The human data in the Daw et al. (2011) experiment consisted of four data points, and the SAwSu model includes four free parameters (decay rate, associative and reinforcement learning rates, and noise). The AL and RL standalone models have three parameters each (decay rate, learning rate, and noise). By means of MFA we can answer whether fitting four data points with three or four free parameters is at all impressive.

In the Daw et al. (2011) experiment human participants were required to complete a two-stage Markov decision task (see Figure 10). At the first stage of each trial (top of Figure 10, white) participants chose between two options (labeled by semantically irrelevant Tibetan characters). Choosing one of the options would lead a participant to screen 1 (bottom-left of Figure 10, gray) 70% of the time (a common transition) and screen 2 (bottom-right of Figure 10, black) 30% of the time (a rare transition). Choosing the other option would lead a participant to screen 2 70% of the time (a common transition) and screen 1 30% of the time (a rare transition). Screens 1 and 2 contained two options each (also labeled by semantically irrelevant Tibetan characters). These four second-stage options had different payoff probabilities. On each new trial, the chances of payoff associated with each second-stage option were changed by adding small Gaussian noise (mean 0, SD .025).

The four dependent values derived from this experiment were probabilities of repeating prior first-stage choices, given that the prior first-stage choice either resulted in a common or a rare transition, and either did or did not result in a reward. A grid search was performed to find the parameter values for the three models, executing each model on the Daw et al. (2011) simulation. Each free parameter was varied between .05 and .95 with a granularity of .05. The lowest RMSE values found for SAwSu, RL, and AL models were .010, .072, and .072, respectively. Best model fits are displayed in Figure 11.

Figure 11. Human stay probabilities from the Daw et al. (2011) experiment, and best fit results from SAwSu, Reinforcement Learning (RL), and Associative Learning (AL) models. Error bars on human data represent SE; model error is negligible.
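The sketch below illustrates the kind of grid search described here, under the assumption that a function simulate(*params) runs one of the standalone models on the Daw et al. (2011) task and returns its four predicted stay probabilities; that function, and the observed values passed in, are placeholders rather than material from the original article (the SAwSu model would use repeat=4).

import itertools
import numpy as np

def rmse(predicted, observed):
    predicted, observed = np.asarray(predicted, float), np.asarray(observed, float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

def grid_search(simulate, observed, lo=0.05, hi=0.95, step=0.05, n_params=3):
    grid = np.arange(lo, hi + step / 2, step)          # .05, .10, ..., .95
    best_fit, best_params, predictions = np.inf, None, []
    for params in itertools.product(grid, repeat=n_params):
        pred = simulate(*params)                        # model's predicted stay probabilities
        predictions.append(pred)                        # retained for the flexibility estimate
        fit = rmse(pred, observed)
        if fit < best_fit:
            best_fit, best_params = fit, params
    return best_fit, best_params, np.array(predictions)

The retained prediction set can then be passed to a grid-coverage estimate like the one sketched in Simulation 1 to obtain the flexibility values reported below.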

If we were to subscribe to methods that discount model fit by the number of free parameters, we might reject all three models outright, on the basis that there is nothing noteworthy about fitting four data points with 3–4 free parameters. The high parameter-to-data-point ratio should result in high flexibility and low RMSE values. With MFA, one can determine if this assumption is supported.

We ran MFA on the predicted result sets from each model to determine the degree to which the best-fit RMSE values can be attributed to the flexibility of each model. The behavioral measures in this study are proportions, and thus, the potential value ranges are between 0 and 1 for each measure. MFA of the three model fits reveals the following:

SAwSu: φ = 0.0016,

RL: φ = 0.0067,

AL: φ = 0.0066.

This means that, for all three models, the probability of finding the best fit via parameter-space exploration was relatively low. Thus, all three of these fits are persuasive and can be meaningfully compared, and we can report SAwSu to have produced the best fit.

Note that SAwSu has more free parameters than the AL and RL models, and yet it is estimated to have lower predictive flexibility. To be clear, a model with a higher MFA φ-value is not the less likely model, or somehow worse than one with a lower value. MFA is an evaluation of fit persuasiveness based on model flexibility. What one might conclude from the low MFA values for the three models in this simulation is that the best-found goodness of fit statistics are likely to be because of the constraints of those models, rather than parameter exploration, and can be meaningfully interpreted.

A high φ-value, on the other hand, would have indicated that the free parameters made the simulation flexible enough that the best-found fit was likely to have been found regardless of the observed data, and thus that the goodness of fit statistic could not be used as persuasive evidence in support of the model. If any of the models produced a high φ-value, that model's fit would be discounted as unpersuasive in the context of this simulation. If, for example, SAwSu produced the best fit, but a high φ-value, we would recommend comparing only the RL and AL models, removing SAwSu from the analysis.

Figure 10. Daw et al. (2011) experiment setup.

In summary, reporting MFA values for the Daw et al. (2011) simulation allows us to gauge the persuasiveness of goodness of fit statistics for the three models, to meaningfully interpret and compare the models, and to claim that SAwSu produced a better fit than the AL and RL models, regardless of the number of parameters in each of the models.

Simulation 3: Performance Predictions Based on Training Schedules (Jastrzembski, Gluck, & Rodgers, 2009)

MFA is useful for mathematical models as well as computational ones. There may be a perception that the number of free parameters indicates little about a computational model because of the functional complexity involved, whereas a fit of a mathematical model must be discounted by the number of its free parameters. The common wisdom is that fitting three data points with three free parameters in a math equation does not sound like an impressive feat. To this end, of course, we can point to the difference between Models A and B described in Equations 1 and 2 as an example of mathematical models where the number of free parameters says little. However, a real example using formal psychological models may be more convincing.

Jastrzembski, Gluck, and Rodgers (2009) presented a Predictive Performance Equation (PPE), which is a formula with three free parameters that predicts performance at a future point in time as a function of prior training schedule and performance level. For current purposes we will focus only on one part of their work, where Jastrzembski et al. (2009) used PPE to fit six data points (please see the original cited article for a full account of how PPE is used in predicting performance). The behavioral measures reflected performance evaluations of an F-16 pilot team over the course of a training week, calculated as the ratio of the number of times that the Minimum Abort Range was violated by any of the four pilots on the team to the number of threats in the scenario.

We performed a full enumeration of the three model parameters, varying the learning rate, decay rate, and scaling factor parameters between 0 and 1 with a granularity of .01. Given this enumeration, the best fit of model predictions to observed data was found to be RMSE = .02. We computed MFA for this fit to assess whether this mathematical model with three free parameters is flexible enough to fit a large proportion of potentially observable data. MFA revealed that the probability of finding a good fit because of model flexibility was low. In other words, PPE makes a precise set of predictions, and the reported good fit is persuasive.

In summary, the good fit found for the PPE model could be meaningfully interpreted as persuasive evidence for the model. The most important point to draw from this simulation is that MFA may be just as useful for estimating mathematical model flexibility as it is for process-based models. Simply discounting a mathematical model fit by the number of free parameters is not necessarily appropriate, because most mathematical models are not orthogonal (where with n free parameters one can fit any dataset of length n).

Simulation 4: Preference Reversal (Ainslie & Herrnstein, 1981)

As we mentioned in the Introduction, one can perform MFA on a potential data space without having any observed data. Although experimental results may be necessary for model validation, model predictions may be gathered without running the actual experiment. Here we provide an example of how such an analysis may be performed and when it may be useful.

Ainslie and Herrnstein (1981) describe a behavioral phenomenon of preference reversal. In their study subjects display a clear preference for a smaller immediate reward over a larger reward delayed by four seconds (0 s vs. 4 s condition). In another condition, where the smaller reward is delayed by eight seconds and the larger reward is delayed by 12 s (8 s vs. 12 s condition), subjects display a preference reversal, opting to wait the extra four seconds for the larger reward. Ainslie and Herrnstein (1981) present evidence of the preference reversal effect (though the authors do not report quantitative mean tendencies) and argue for models with hyperbolic decay, rather than exponential decay, on the basis of this phenomenon.

Let us suppose that we wanted to replicate this study and find the mean behavioral tendencies for the 0 s versus 4 s and the 8 s versus 12 s conditions, so as to evaluate RL models with hyperbolic and exponential decay functions based on their goodness of fit to data. MFA may be used to predict whether the proposed experimental design would produce meaningful goodness of fit statistics for either of the two models.

We ran preference reversal simulations for the 0 s versus 4 s and the 8 s versus 12 s conditions using two RL models, one with exponential decay and one with hyperbolic decay, varying three parameters (decay rate, learning rate, and exploratory noise) for each model. The prediction spaces for the two models are displayed in Figure 12. MFA reveals φ-values of .17 and .11 for the exponential and hyperbolic models, respectively. These values can be calculated without any observed data because MFA simply estimates the proportion of the hypothetical data space that is covered by model predictions (e.g., what proportion of Figure 12 is covered by prediction points for each model). In this way a prospective MFA would suggest that a different experimental design may be necessary if one wanted to evaluate these two particular models via goodness of fit to empirical data.

Figure 12. Preference for a delayed large reward over a more immediate small reward. Model predictions shown for Reinforcement Learning models with exponential and hyperbolic decays.

It is generally the case that finding a good fit becomes less probable as the number of dependent measures increases. Thus, changing the experimental design for a given study so as to decrease the probability of finding a good fit via parameter enumeration may be as simple as adding another experimental condition.
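A toy demonstration of this point is sketched below. The prediction curves and numbers are invented for illustration and are not drawn from the decay models above; the point is only that the same family of predictions covers a smaller share of the hypothetical data space once an additional dependent measure is added.

import numpy as np

def coverage(preds, step=0.05):
    # Proportion of grid cells in the unit hypercube touched by the predictions.
    cells = {tuple(c) for c in np.floor(np.clip(preds, 0, 1 - 1e-12) / step).astype(int)}
    return len(cells) / int(round(1 / step)) ** preds.shape[1]

p = np.linspace(0.05, 0.95, 500)
print(coverage(np.column_stack([p, p ** 2])))          # two measures: larger proportion covered
print(coverage(np.column_stack([p, p ** 2, p ** 3])))  # three measures: smaller proportion covered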

We added a 4 s versus 8 s condition to our simulation and gathered results from the exponential and hyperbolic models for this condition. Rerunning prospective MFA with this added condition revealed much less model flexibility, φ = .12 and φ = .07 for the exponential and hyperbolic models, respectively. Adding yet another experimental condition, 6 s versus 10 s, resulted in even lower flexibility for each model, φ = .080 and φ = .046 for the exponential and hyperbolic models, respectively. Thus, if we ran this experiment with 0 s versus 4 s, 4 s versus 8 s, 6 s versus 10 s, and 8 s versus 12 s conditions and we found a good fit for the hyperbolic model, this may be persuasive evidence (or, to be conservative, more behavioral measures may be added).

In summary, MFA can be used to evaluate whether an experimental design will produce sufficiently persuasive goodness of fit statistics for a given model before the expensive process of gathering the empirical results. One may run simulations for multiple experimental designs, until MFA indicates that finding a good fit for a specific experimental design is sufficiently improbable, and the results from this experiment may be useful as persuasive evidence in support of a given model.

Summary and Discussion

A good fit of model predictions to empirical data is often presented as evidence for model validity. However, how well a model fits the observed dataset cannot be meaningfully interpreted without a consideration of how well the model can fit nonobserved data: the model's flexibility. In this article we present MFA as a method for determining the probability that a good fit of model predictions to observed data was achieved not because of model constraints, but rather because of the model's flexibility.

The proposed technique estimates the proportion of potentially observable data space that can be accounted for by the model. MFA aids model evaluation in that it provides a metric for interpreting the goodness of fit, helping to answer the question asked by Roberts and Pashler (2000), "How persuasive is a good fit?" As such, we recommend that MFA values be reported alongside the goodness of fit metrics for all quantitative modeling efforts where goodness of fit to empirical results is presented as evidence for model validity. Additionally, MFA may be used a priori, before the execution of an empirical study, so as to establish whether the experiment will yield a persuasive evaluation for a given model.

MFA shares many characteristics with other methods of flexibility assessment. For example, MFA, PSP, NML, and Prior Predictive Evaluation all require definitive ranges for all dependent measures of interest, as well as definitive ranges for all model parameters. Additionally, MFA is similar to PSP (Pitt et al., 2006) in that both methods rely on the full enumeration of the parameter space for each model of interest. PSP uses parameter space enumeration to establish what proportion of all possible qualitative parameter relationships the model could predict, whereas MFA uses parameter space enumeration to establish what proportion of hypothetical datasets the model could account for.

In this respect MFA is also similar to NML (Grünwald, 2005, 2007; Myung et al., 2006). NML discounts the best fit to observed data by the best fits to all hypothetical datasets, and MFA interprets the best fit to observed data in the context of how likely the model was to find a good fit to any data. Both MFA's φ-value and NML's flexibility term, COMP, increase as the space of model predictions gets larger, irrespective of the number of free parameters in the model.

MFA may be viewed as a more general instantiation of current flexibility assessment techniques, in that MFA is applicable for individual model evaluation, as well as for relative model comparison, and can help to provide context for evaluation of any model type.9 In contrast, relative fit/flexibility metrics (e.g., Model Mimicry, BMS) aid in selecting which model accounts best for the observed data, but do not aid in individual fit assessment, leaving the possibility that the model may be overly flexible despite being judged best among some candidate set of models. Counting the number of free parameters is not an accurate predictor of model flexibility for models with nonorthogonal parameters, and Prior Predictive Evaluation is not an accurate measure of flexibility where behavioral measure predictions are not independent. NML is most appropriate for models where model predictions may be easily translated into probability distributions, and Parameter Space Partitioning is most appropriate where evaluation focuses on qualitative trends in model predictions.

There exist a number of model evaluation methods beyond those described in this article (e.g., Gluck, Bello, & Busemeyer, 2008; Myung & Pitt, 1997; Pitt & Myung, 2002; Shiffrin, Lee, Kim, & Wagenmakers, 2008; Taatgen & van Rijn, 2010; Weaver, 2008). These evaluation techniques differ in what they do and in what they measure. We do not claim that MFA is universally better than other metrics, but that there are cases where it is informative and appropriate.

Finally, it is important to note that a doctrinaire prescription of minimum accepted MFA values may not be good for scientific progress. It may be too easy to overstate the importance of a specific statistical threshold for claiming failure or success in a modeling effort (Gigerenzer, 2004). The importance of a model is its usefulness, and it may be entirely plausible to find good use in models with higher MFA values. MFA results provide additional context for interpreting modeling simulations, and should be interpreted within the broader context of the scientific agenda.

9 Methods for model evaluation may always be used in model comparison, whereas methods for model comparison do not always inform model evaluation. Thus, MFA may be useful for both comparison and evaluation, whereas metrics like AIC/BIC are only useful for comparison.

References

Ainslie, G., & Herrnstein, R. (1981). Preference reversal and delayed reinforcement. Animal Learning & Behavior, 9, 476–482.

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. Petrox & F. Caski (Eds.), Second international symposium on information theory (pp. 267–281). Budapest: Akademiai Kiado.

Anderson, J. R. (2007). How can the human mind occur in the physical universe? Oxford: Oxford University Press.

Bamber, D., & van Santen, J. (1985). How many parameters can a model have and still be testable? Journal of Mathematical Psychology, 29, 443–473.

Bower, G. (1961). Application of a model to paired-associate learning. Psychometrika, 26, 255–280.

Campbell, G. E., & Bolton, A. E. (2005). HBR validation: Integrating lessons learned from multiple academic disciplines, applied communities and the AMBR project. In K. A. Gluck & R. W. Pew (Eds.), Modeling human behavior with integrated cognitive architectures: Comparison, evaluation, and validation (pp. 365–395). Mahwah, NJ: Lawrence Erlbaum Associates.

Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P., & Dolan, R. J. (2011). Model-based influences on humans' choices and striatal prediction errors. Neuron, 69, 1204–1215.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. Boca Raton, FL: CRC Press.

Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33, 587–606.

Gluck, K. A. (2010). Cognitive architectures for human factors in aviation. In E. Salas & D. Maurino (Eds.), Human factors in aviation (2nd ed., pp. 375–400). New York, NY: Elsevier.

Gluck, K. A., Bello, P., & Busemeyer, J. (2008). Introduction to the special issue. Cognitive Science, 32, 1245–1247.

Grünwald, P. (2005). A tutorial introduction to the minimum description length principle. In P. Grünwald, J. I. Myung, & M. A. Pitt (Eds.), Advances in minimum description length: Theory and applications. Cambridge, MA: MIT Press.

Grünwald, P. (2007). The minimum description length principle. Cambridge, MA: MIT Press.

Hansen, M., & Yu, B. (2001). Model selection and the principle of minimum description length. Journal of the American Statistical Association, 96, 746–774.

Harris, J. (2008). MindModeling@Home: A large-scale computational cognitive modeling infrastructure. In The sixth annual conference on systems engineering research (pp. 246–252). Los Angeles, CA.

Jastrzembski, T., Gluck, K., & Rodgers, S. (2009). Improving military readiness: A state-of-the-art cognitive tool to predict performance and optimize training effectiveness. In The Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC), Arlington, VA.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795. http://dx.doi.org/10.2307/2291091

Kim, W., Navarro, D. J., Pitt, M. A., & Myung, J. I. (2004). An MCMC-based method of comparing connectionist models in cognitive science. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems (Vol. 16, pp. 937–944). Cambridge, MA: MIT Press.

Lee, M. D. (2015). Bayesian outcome-based strategy classification. Behavior Research Methods, 1–13.

McClelland, J. (2009). The place of modeling in cognitive science. Topics in Cognitive Science, 1, 11–38.

Moore, L. R., & Gunzelmann, G. (2014). An interpolation approach for fitting computationally intensive models. Cognitive Systems Research, 29–30, 53–65.

Myers, C. W., Gluck, K. A., Gunzelmann, G., & Krusmark, M. (2011). Validating computational cognitive process models across multiple timescales. Journal of Artificial General Intelligence, 2, 108–127. http://dx.doi.org/10.2478/v10229-011-0012-6

Myung, J. I., Balasubramanian, V., & Pitt, M. A. (2000). Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences of the United States of America, 97, 11170–11175. http://dx.doi.org/10.1073/pnas.170283897

Myung, J. I., Navarro, D. J., & Pitt, M. A. (2006). Model selection by normalized maximum likelihood. Journal of Mathematical Psychology, 50, 167–179. http://dx.doi.org/10.1016/j.jmp.2005.06.008

Myung, J. I., & Pitt, M. A. (1997). Applying Occam's razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4, 79–95. http://dx.doi.org/10.3758/BF03210778


Navarro, D. J., Pitt, M. A., & Myung, J. I. (2004). Assessing the distinguishability of models and the informativeness of data. Cognitive Psychology, 49, 47–84. http://dx.doi.org/10.1016/j.cogpsych.2003.11.001

Pitt, M. A., Kim, W., Navarro, D. J., & Myung, J. I. (2006). Global model analysis by parameter space partitioning. Psychological Review, 113, 57–83. http://dx.doi.org/10.1037/0033-295X.113.1.57

Pitt, M. A., & Myung, J. I. (2002). When a good fit can be bad. Trends in Cognitive Sciences, 6, 421–425.

Pitt, M. A., Myung, J. I., & Zhang, S. (2002). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472–491.

Ratcliff, R. (2002). A diffusion model account of response time and accuracy in a brightness discrimination task: Fitting real data and failing to fit fake but plausible data. Psychonomic Bulletin & Review, 9, 278–291. http://dx.doi.org/10.3758/BF03196283

Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107, 358–367.

Roos, T. (2008). Monte Carlo estimation of minimax regret with an application to MDL model selection. In 2008 IEEE Information Theory Workshop (ITW-2008). New York, NY: IEEE Press. http://dx.doi.org/10.1109/ITW.2008.4578670

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics. Retrieved from http://projecteuclid.org/euclid.aos/1176344136

Shafer, G., Shen, A., Vereshchagin, N., & Vovk, V. (2011). Test martingales, Bayes factors and p-values. Statistical Science, 26, 84–101.

Shiffrin, R., Lee, M., Kim, W., & Wagenmakers, E.-J. (2008). A survey of model evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science, 32, 1248–1284.

Taatgen, N., & van Rijn, H. (2010). Nice graphs, good R², but still a poor fit? How to be more sure your model explains your data. In Proceedings of the 2010 International Conference on Cognitive Modeling, Philadelphia, PA.

Vanpaemel, W. (2009). Measuring model complexity with the prior predictive. Advances in Neural Information Processing Systems, 22, 1919–1927.

Van Zandt, T., Colonius, H., & Proctor, R. W. (2000). A comparison of two response time models applied to perceptual matching. Psychonomic Bulletin & Review, 7, 208–256.

Van Zandt, T., & Ratcliff, R. (1995). Statistical mimicking of reaction time data: Single-process models, parameter variability, and mixtures. Psychonomic Bulletin & Review, 2, 20–54. http://dx.doi.org/10.3758/BF03214411

Veksler, V. D., Myers, C. W., & Gluck, K. A. (2014). SAwSu: An integrated model of associative and reinforcement learning. Cognitive Science, 38, 580–598.

Wagenmakers, E. J., Ratcliff, R., Gomez, P., & Iverson, G. J. (2004). Assessing model mimicry using the parametric bootstrap. Journal of Mathematical Psychology, 48, 28–50.

Weaver, R. (2008). Parameters, predictions, and evidence in computational modeling: A statistical view informed by ACT-R. Cognitive Science, 32, 1349–1375.

Received July 16, 2014
Revision received June 30, 2015

Accepted July 8, 2015
