
RESEARCH ARTICLE Open Access

A novel method for expediting the development of patient-reported outcome measures and an evaluation of its performance via simulation

Lili Garrard 1, Larry R. Price 2, Marjorie J. Bott 3 and Byron J. Gajewski 1,3*

Abstract

Background: Developing valid and reliable patient-reported outcome measures (PROMs) is a critical step in promoting patient-centered health care, a national priority in the U.S. Small populations or rare diseases often pose difficulties in developing PROMs using traditional methods due to small samples.

Methods: To overcome the small sample size challenge while maintaining psychometric soundness, we propose an innovative Ordinal Bayesian Instrument Development (OBID) method that seamlessly integrates expert and participant data in a Bayesian item response theory (IRT) with a probit link model framework. Prior distributions obtained from expert data are imposed on the IRT model parameters and are updated with participants’ data. The efficiency of OBID is evaluated by comparing its performance to classical instrument development performance using actual and simulation data.

Results and Discussion: The overall performance of OBID (i.e., more reliable parameter estimates, smaller mean squared errors (MSEs) and higher predictive validity) is superior to that of classical approaches when the sample size is small (e.g., less than 100 subjects). Although OBID may exhibit larger bias, it reduces the MSEs by decreasing variances. Results also closely align with recommendations in the current literature that six subject experts will be sufficient for establishing content validity evidence. However, in the presence of highly biased experts, three experts will be adequate.

Conclusions: This study successfully demonstrated that the OBID approach is more efficient than the classical approach when the sample size is small. OBID promises an efficient and reliable method for researchers and clinicians in future PROMs development for small populations or rare diseases.

Keywords: OBID, Bayesian psychometrics, Ordinal data analysis, Bayesian IRT, Patient-reported outcome measures, PROMs

Background
The Institute of Medicine (IOM) [1] released a landmark report, Crossing the Quality Chasm, which highlighted patient-centered care as one of the six specific aims (the others being safety, effectiveness, timeliness, efficiency, and equity) that defined quality health care. To promote patient-centered care, national entities such as the National Institutes of Health (NIH) [2], the U.S. Department of Health and Human Services (DHHS) Food and Drug Administration (FDA) [3], the National Quality Forum (NQF) [4], and the Patient-Centered Outcomes Research Institute (PCORI) [5] have published specific guidelines on the development of patient-reported outcome measures (PROMs). The guidelines unanimously emphasize the critical requirement of rigorous psychometric testing for any new or adapted PROMs, which often are designed as survey instruments.

* Correspondence: [email protected]
1 Department of Biostatistics, University of Kansas Medical Center, Mail Stop 1026, 3901 Rainbow Blvd., Kansas City, KS 66160, USA
3 University of Kansas School of Nursing, Mail Stop 4043, 3901 Rainbow Blvd., Kansas City, KS 66160, USA
Full list of author information is available at the end of the article

© 2015 Garrard et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Garrard et al. BMC Medical Research Methodology (2015) 15:77 DOI 10.1186/s12874-015-0071-5


PROMs serve a critical role in translational research, as data collected using PROMs are commonly used as primary or surrogate endpoints for clinical trials and studies in humans, which are essential for promoting both clinical application and public awareness. However, the lengthy process of developing valid and reliable psychometric instruments (e.g., PROMs) is recognized as one of the greater barriers to disseminating and transitioning research findings into clinical practice in a timely manner.

For decades classical instrument development methodologies (e.g., the frequentist approach to factor analysis that ignores prior information regarding item reliability) dominated the psychometric literature [6]. Bayesian methods were severely limited until modern computation techniques provided researchers the capacity to employ Bayesian inference in actual applications [7]. As Bayesian inference becomes more popular, limitations arise with the use of classical (i.e., frequentist) methods when developing instruments or PROMs for small populations (e.g., in cases of rare diseases). Since it is not the intent of the authors to provide a comprehensive review of both classical and Bayesian statistical approaches, we focus our discussion on two co-existing issues with the classical approach to confirmatory factor analysis (CFA) in establishing evidence of construct validity: (a) the requirement of large samples, and (b) modeling ordinal data as continuous.

Two essential components of establishing evidence that scores acquired by an instrument exhibit score validity include content and construct-related evidence [8, 9]. Subject experts’ opinions are typically consulted in evaluating the content of items, such as how well the items match the empirical indicators of the construct(s) of interest, and the relevancy and clarity of the items. The items evolve through rigorous revision (e.g., iteratively through pilot-testing with a small representative sample of respondents) until the instrument is deemed ready for establishing construct validity evidence through a statistical technique such as factor analysis. It is common practice to conduct expert evaluation for content analysis; however, under the classical setting, data collected from the experts are not utilized in establishing construct validity, as content validity focuses on the instruments rather than the measurements [10]. The expert and participant data are analyzed separately, which results in a potential loss of information and leads to the increasing demand for a large participant sample.

There is no consensus among health care researchers regarding the number of subjects required for CFA. Knapp and Brown [11] list several competing rules regarding the number of subjects required and argue that original studies on factor analysis (e.g., [12]) only assumed very large samples relative to the number of items, and made no recommendations on a minimum sample size. Pett et al. [6] make the recommendation of at least 10 to 15 subjects per item, a commonly suggested ratio in the psychometric literature. However, Brown [13] urges researchers not to rely on these general rules of thumb and proposes more reliable model-based (e.g., Satorra-Saris’s method) and Monte Carlo methods to determine the most appropriate sample size for obtaining sufficient statistical power and precision of parameter estimates. A recent systematic review [14] of the sample sizes used to validate newly-developed PROMs reports that 90 % of the reviewed articles had a sample size ≥ 100, whereas 7 % had a sample size ≥ 1000. In addition, Weenink, Braspenning and Wensing [15] explore the potential development of PROMs in primary care using seven generic instruments. The authors report challenges of low response rates to questionnaires (i.e., small samples), and that a replication in larger studies would require a sample size of at least 400 patients.

Apart from the large sample issue, the other issue concerns how data are analyzed using traditional approaches. The most common form of data acquired from measurement instruments in the social, behavioral, and health sciences is ordinal; however, such data often are analyzed without regard for their ordinal nature [7]. The practice of treating ordinal data as continuous is considered controversial and has generated debates in the psychometric literature [16]. With solid theoretical developments in ordinal data modeling, it is considered best practice to use modeling techniques that treat ordinal data as ordinal. Structural equation modeling (SEM) with categorical variables was first introduced by Muthén [17] in a landmark study that revolutionized psychometric work. Although techniques for handling ordinal data in latent variable analysis have been incorporated into several commercial statistical software packages (e.g., Mplus) since the 1980’s, it was only in 2012 that the free R package lavaan incorporated the weighted least squares means- and variance-adjusted (WLSMV) estimator for performing ordinal CFA, in its version 0.5-9 release [18, 19]. Ordinal CFA offers new insight for modeling ordinal data under the classical setting; yet it is still challenged by small samples, as we will show in this study. A more complete solution is needed to resolve both limitations and still provide reliable model estimates.

New methods proposed by Gajewski, Price, Coffland, Boyle and Bott [20] and Jiang, Boyle, Bott, Wick, Yu and Gajewski [21] use Bayesian approaches to resolve the sample size limitation of traditional CFA. The Integrated Analysis of Content and Construct Validity (IACCV) approach establishes a unified model that seamlessly integrates the content and construct validity analyses [20]. Prior distributions derived from content subject experts’ data are updated with participants’ data to obtain a posterior distribution. Under the IACCV approach, some of the response burden on the participants can be alleviated by using experts; thus fewer participants are needed to achieve the desired validity evidence in developing instruments.


Using both simulation data and real data, Bayesian Instrument Development (BID) [21] advances the theoretical work of IACCV by demonstrating the superior performance of BID over classical CFA when the sample size is small. BID also advances the practical application of IACCV by incorporating the methodology into user-friendly GUI software that is shown to be reliable and efficient in a clinical study for developing an instrument to assess symptoms in heart failure patients. Although BID has shown great potential, the method is limited by the assumption of continuous participant response data. As previously mentioned, data from many clinical questionnaires are collected as ordinal or binary (a special type of ordinal data). Given this fact, there is an urgent need to adapt the BID approach for ordinal responses.

In this article, we propose an Ordinal Bayesian Instrument Development (OBID) approach within a Bayesian item response theory (IRT) framework to further advance the BID methodology for ordinal data. At first glance, the current study appears to be a straightforward extension of previous studies; however, it differs from previous studies and contributes to the literature from several perspectives. First, as previously mentioned, ordinal or binary data are the most common form of data collected using clinical instruments. The underlying distribution assumption required by continuous data modeling is often violated due to skewed responses. Our study effectively promotes the proper usage of ordinal data modeling methods and brings awareness to a broader audience regarding the psychometric integrity of the measurement, which is essential for the development of PROMs and clinical trial outcomes. Although several simulation studies on Bayesian IRT models have been discussed in the literature, these studies arbitrarily select non-informative or weakly informative priors for model parameters without a clear elicitation process (e.g., [22, 23]). Alternatively, our approach is distinct because we leverage experts in the elicitation of the priors for the IRT parameters. Second, the consideration of the predictive validity of the instrument [9], which is often neglected in the literature, is addressed here. These important steps are implemented in the simulation study as a contribution to the methodological literature.

Results from our approach also have several practical implications for the development of PROMs, as OBID overcomes the small sample size (e.g., patients from small populations) challenge while maintaining psychometric integrity. Special considerations for reducing the resource and cost burden incurred by researchers and clinicians are provided through the usage of fast and reliable free R packages to implement the OBID methodology. In our approach, a Markov chain Monte Carlo (MCMC) procedure is implemented to estimate the model parameters; we provide general guidelines for selecting the tuning parameters required in the MCMC procedure for achieving appropriate acceptance/rejection rates. Our proposed method demonstrates that the overall performance of OBID (i.e., more reliable parameter estimates, smaller mean squared errors (MSEs) and higher predictive validity) is superior to that of ordinal CFA when the sample size is small. Most importantly, OBID promises an efficient and reliable method for researchers and clinicians in future PROM development.

Methods
OBID further advances the work of Jiang et al. [21], which expands the IACCV of Gajewski et al. [20], by adapting the BID methodology for ordinal scale data. Here we demonstrate the OBID approach using a unidimensional (i.e., single factor) psychometric model and refer interested readers to Gajewski et al. and Jiang et al. for a detailed description of the general model and the BID approach. In addition, we use a similar model and incorporate the mathematical notation presented in Jiang et al. to maintain consistency between the studies.

Bayesian IRT model
Prior to introducing the OBID model, it is important to clarify that both OBID and BID are CFA-based approaches. IRT is a psychometric technique that provides a probabilistic framework for estimating how examinees will perform on a set of items based on their ability and the characteristics of the items [24]. IRT is a model-based theory of statistical estimation that conveniently places persons and items on the same metric based on the probability of response outcomes. Traditional factor analysis is based on a deterministic model and does not rest on a probabilistic framework. Here we provide a probabilistic connection between our approach and IRT by using Bayesian CFA, which includes an inherently probabilistic framework. From a modeling perspective, IRT is the ordinal version of traditional factor analysis. When all manifest variables are ordinal, the traditional factor analysis model is equivalent to a two-parameter IRT model with a probit link function [7, 25]. The two-parameter IRT model with the probit link can be written as

$$y_{ij} = c \;\; \text{if} \;\; y^*_{ij} \in (T_{j(c-1)}, T_{jc}], \qquad i = 1, \dots, N; \; j = 1, \dots, P; \; c = 1, \dots, C_j, \tag{1}$$

$$y^*_{ij} = \alpha_j + \lambda_j f_i + \varepsilon_{ij}, \quad f_i \sim N(0, 1), \quad \varepsilon_{ij} \sim N(0, 1), \qquad i = 1, \dots, N; \; j = 1, \dots, P, \tag{2}$$

where $y_{ij}$ is the $i$th participant's ordinal response to the $j$th item, and $C_j$ is the total number of response categories for item $j$ (e.g., a five-point Likert scale).


The ordinal response $y_{ij}$ is linked to $y^*_{ij}$, an underlying continuous latent variable that follows a normal distribution, through a set of $C_j - 1$ ordered cut-points, $T_{jc}$, on $y^*_{ij}$. The probability of a subject selecting a particular response category is indicated by the probability that $y^*_{ij}$ falls within an interval defined by the cut-points $T_{jc}$. In IRT, the continuous latent variable $y^*_{ij}$ is characterized by two item-specific parameters: $\alpha_j$, the negative difficulty parameter for the $j$th item, and $\lambda_j$, the discrimination parameter for item $j$. In addition, the underlying latent ability $f_i$ of the subjects is constrained to follow a standard normal distribution, and $\varepsilon_{ij}$ is the measurement error [7].

To see the equivalence between the IRT model and the traditional factor analysis model, note that a classical unidimensional factor analysis model can be expressed as

$$z^*_{ij} = \rho_j f_i + e_{ij}, \qquad i = 1, \dots, N; \; j = 1, \dots, P, \tag{3}$$

where $z^*_{ij}$ represents the standardized $y^*_{ij}$ from equations 1 and 2; $f_i$ is the $i$th participant's factor score for the domain; $\rho_j$ is the factor loading or item-to-domain correlation for the $j$th item; and $e_{ij}$ represents the measurement errors, sometimes referred to as latent unique factors or residuals. $f_i$ is assumed to follow a standard normal distribution, which implies that $e_{ij} \sim N(0, 1 - \rho_j^2)$, where $\rho_j^2$ is the reliability of the $j$th item. The standardization of $y^*_{ij}$ is expressed by

$$\frac{y^*_{ij} - \alpha_j}{\sqrt{1 + \lambda_j^2}} = \frac{\lambda_j}{\sqrt{1 + \lambda_j^2}} f_i + \frac{\varepsilon_{ij}}{\sqrt{1 + \lambda_j^2}}, \qquad i = 1, \dots, N; \; j = 1, \dots, P, \tag{4}$$

such that the IRT model parameter $\lambda_j$ can be interpreted interchangeably through the item-to-domain correlations $\rho_j$ using the following expressions:

$$\lambda_j = \frac{\rho_j}{\sqrt{1 - \rho_j^2}} \tag{5}$$

$$\rho_j = \frac{\lambda_j}{\sqrt{1 + \lambda_j^2}}. \tag{6}$$

Equations 5 and 6 can be interpreted such that an item that discriminates well among individuals with different abilities also will have a high item-to-domain correlation. The true Bayesian application comes from specifying appropriate prior distributions on the IRT parameters, which leads us into the essence of the OBID method.
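For concreteness, equations 5 and 6 translate directly into code; the following is a minimal R sketch (with helper names of our own choosing, not code from the published OBID implementation):

# Convert between the IRT discrimination parameter and the
# item-to-domain correlation, per equations 5 and 6.
lambda_from_rho <- function(rho) rho / sqrt(1 - rho^2)           # equation 5
rho_from_lambda <- function(lambda) lambda / sqrt(1 + lambda^2)  # equation 6

lambda_from_rho(0.50)                   # 0.577: a moderate correlation
rho_from_lambda(lambda_from_rho(0.50))  # 0.50: the mapping round-trips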

OBID – expert data and model
Eliciting subject experts’ perceptions regarding the relevancy of each item to the domain (construct) of interest is a common practice to aid in verifying content validity evidence. For example, during instrument development, a logical structure is developed and applied in a way that maps the items on the test to a content domain [8]. In this way, the relevance of each item and the adequacy with which the set of items represents the content domain are established. To illustrate, a panel of subject experts is asked to review a set of potential items and instructed to provide responses to questions such as “please rate the relevancy of each item to the overall topic of [domain].” The response options are generally designed on a four-point Likert scale that ranges from “not relevant” to “highly relevant.” Gajewski, Coffland, Boyle, Bott, Price, Leopold and Dunton [26] laid important groundwork from an empirical perspective by demonstrating the approximate equivalency of measuring content validity using relevance scales versus using correlation scales. In other words, content validity oriented evidence can be statistically interpreted as a representation of the experts’ perceptions regarding the item-to-domain latent correlation [21].

Continuing the notation from Jiang et al., suppose the expert data are collected from a panel of $k = 1, \dots, K$ experts who respond to $j = 1, \dots, P$ items. Let $X$ denote the $K \times P$ matrix of observed ordinal responses, where the $x_{jk}$th entry represents the $k$th expert's opinion regarding the relevancy of the $j$th item to its assigned domain. Similarly, the $k$th expert's latent correlation between the $j$th item and its respective domain is denoted by $\rho_{jk}$ and is related to $x_{jk}$ using the following function, with correlation cut-points suggested by Cohen [27]:

$$x_{jk} = \begin{cases} 1 \;(\text{“not relevant”}) & \text{if } 0.00 \le \rho_{jk} < 0.10 \\ 2 \;(\text{“somewhat relevant”}) & \text{if } 0.10 \le \rho_{jk} < 0.30 \\ 3 \;(\text{“quite relevant”}) & \text{if } 0.30 \le \rho_{jk} < 0.50 \\ 4 \;(\text{“highly relevant”}) & \text{if } 0.50 \le \rho_{jk} \le 1.00 \end{cases} \tag{7}$$

A sensitivity analysis conducted by Gajewski et al. [26] demonstrated the approximate equivalency of using a correlation scale and a relevancy scale to measure content validity, under both equally-spaced (i.e., $0.00 \le \rho_{jk} < 0.25$, $0.25 \le \rho_{jk} < 0.50$, $0.50 \le \rho_{jk} < 0.75$, and $0.75 \le \rho_{jk} \le 1.00$) and unequally-spaced (i.e., equation 7) cut-point assumptions. One of the reviewers pointed out that under certain circumstances the equally-spaced transformation might be more appropriate (e.g., for a panel with a moderate level of expertise in the area of interest) [26]. However, those results were based on unexpected secondary findings, which require further confirmation in a more thorough study [26].
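To illustrate the scale in equation 7, one simple convention, assumed here purely for illustration (the paper instead models the latent correlations hierarchically, as described next), maps each observed rating to the midpoint of its correlation interval:

# Midpoints of the Cohen-style intervals in equation 7 (illustrative only).
rating_to_rho <- function(x) c(0.05, 0.20, 0.40, 0.75)[x]

rating_to_rho(c(4, 3, 4, 2))  # a hypothetical expert's ratings -> 0.75 0.40 0.75 0.20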

Garrard et al. BMC Medical Research Methodology (2015) 15:77 Page 4 of 14

Page 5: A novel method for expediting the development of patient ... · patient-centered health care, ... gate endpoints for clinical trials and studies in humans, which are essential for

For the purpose of the current study, we primarily focus on showcasing a proper method of establishing evidence for construct validity using carefully selected “true” subject experts. For developing PROMs, the level of expertise of the selected subject experts has a direct impact on the validity of the measurement instrument.

In our assumed single-factor model, the item-to-domain correlation based on pooled information from all experts can be denoted by $\rho_j = \mathrm{corr}(f, z_j)$, where $f$ represents the domain factor score and is typically assumed to follow a standard normal distribution, and $z_j$ represents the standardized response to item $j$. To ensure the proper range of correlations, Fisher's transformation is used to transform $\rho_j$, and we denote $\mu_j$ as

$$\mu_j = g(\rho_j) = \frac{1}{2}\log\left(\frac{1 + \rho_j}{1 - \rho_j}\right). \tag{8}$$
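Equation 8 is the usual Fisher z-transformation; in R it and its inverse are one-liners, shown only to fix notation:

fisher_z <- function(rho) 0.5 * log((1 + rho) / (1 - rho))  # equation 8; identical to atanh(rho)

fisher_z(0.50)        # 0.549
tanh(fisher_z(0.50))  # 0.50; tanh() inverts the transformation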

A hierarchical model that combines all experts and includes all items is defined by

$$g(\rho_{jk}) = g(\rho_j) + e_{jk}, \tag{9}$$

where $e_{jk} \sim N(0, \sigma^2)$. Following the BID model, the prior distribution of the experts after Fisher's transformation is approximately normal and can be expressed by

$$\mu_j = g(\rho_j) \sim N\left(g(\rho_{0j}), \frac{1}{n_{0j}}\right), \tag{10}$$

where $g(\rho_{0j})$ is the transformed prior mean item-to-domain correlation, and $n_{0j} = 5 \times K$ is the prior sample size, such that each expert is equivalent to approximately five participants [21]. This approximation is based on a weighted average from previous study findings by Gajewski et al. [20, 26] and Jiang et al. [21]. The prior sample size $n_{0j}$ can be approximated by computing the ratio of the variance of the subject experts' transformed $\rho_j$ and the variance of the participants' transformed $\rho_j$ (i.e., using a flat prior). The “five participants” assumption will be further evaluated as more data become available. Moreover, the current approximation is solely needed to help execute the simulation study and is not used within any real data application.
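As a rough illustration of how equation 10 turns expert ratings into a prior, the sketch below combines the hypothetical midpoint convention above with the $n_{0j} = 5K$ rule; the paper itself estimates the expert posterior by MCMC in WinBUGS, as described in the estimation section:

# X: hypothetical K x P matrix of expert ratings on the 1-4 relevancy scale.
X <- rbind(c(4, 3, 4, 2),
           c(3, 3, 4, 2),
           c(4, 4, 3, 1))
K <- nrow(X)
rho_jk <- apply(X, c(1, 2), rating_to_rho)  # implied latent correlations
mu0 <- colMeans(fisher_z(rho_jk))           # prior means g(rho_0j), per item
n0  <- rep(5 * K, ncol(X))                  # prior sample sizes, n_0j = 5K
prior_var_mu <- 1 / n0                      # prior variance of mu_j (equation 10)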

ate content information is available. When items aresubstantially revised without further review from subjectexperts, flat priors should be used. Although elicitingprior distribution from subject experts is highlighted, weare not restricted solely to this approach. When reliableand relevant external data are available (i.e., not neces-sarily experts), a different data driven approach can beutilized. For instance, developing PROMs for pediatricpopulations can be challenging due to low disease in-cidence in children, thus resulting in small samples.Reliable evidence from the adult populations can betreated as a “general prior” for establishing constructvalidity in the pediatric populations.

OBID – participant data and model
Establishing evidence of score validity involves integrating various strategies or techniques, culminating in a comprehensive account of the degree to which existing evidence and theory support the intended interpretation of scores acquired from the instrument [24]. From a purely psychometric or statistical perspective, establishing content validity evidence has traditionally been carried out separately from establishing evidence of construct validity. Importantly, the OBID approach more closely aligns with current practice forwarded by the American Educational Research Association (AERA), the American Psychological Association (APA) and the National Council on Measurement in Education (NCME) [8] regarding an integrated approach to establishing evidence for score validity in relation to practical use. OBID seamlessly integrates the content and construct validity analyses into a single process, which alleviates the need for a large participant sample. The previously introduced IRT with a probit link model, expressed by equations 1 and 2, is used to model the ordinal participant responses. The likelihood for $y^*_{ij}$ is

$$L(y^* \mid \alpha, \lambda, f) = \prod_{i=1}^{N} \prod_{j=1}^{P} N\left(y^*_{ij} \mid \alpha_j + \lambda_j f_i, 1\right). \tag{11}$$

By equations 5, 8 and 10 and the delta method, we specify the prior distribution of the item discrimination parameter $\lambda_j$ through a normal approximation, where

$$\lambda_j \sim N\left(\frac{\exp(2\mu_j) - 1}{2\exp(\mu_j)}, \; \frac{\left\{\exp(2\mu_j) + 1\right\}^2}{4 n_{0j} \exp(2\mu_j)}\right). \tag{12}$$
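In code, the normal approximation in equation 12 is a direct transcription (a sketch continuing the illustrative objects above):

# Prior mean and variance of lambda_j given mu_j = g(rho_j) and n_0j (equation 12).
lambda_prior <- function(mu, n0) {
  list(mean = (exp(2 * mu) - 1) / (2 * exp(mu)),
       var  = (exp(2 * mu) + 1)^2 / (4 * n0 * exp(2 * mu)))
}
prior <- lambda_prior(mu0, n0)  # per-item prior means and variances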

Since the item-to-domain correlation $\rho_j$ does not depend on the negative item difficulty parameter $\alpha_j$, we assign the prior $\alpha_j \sim N(0, 1)$ according to recommendations made by Johnson and Albert [7]. The full posterior distribution is

$$\begin{aligned} \pi(\alpha, \lambda \mid y^*, f) = {} & \prod_{i=1}^{N} \prod_{j=1}^{P} N\left(y^*_{ij} \mid \alpha_j + \lambda_j f_i, 1\right) \times \prod_{i=1}^{N} N(f_i \mid 0, 1) \times \prod_{j=1}^{P} N(\alpha_j \mid 0, 1) \\ & \times \prod_{j=1}^{P} N\left(\lambda_j \,\Big|\, \frac{\exp(2\mu_j) - 1}{2\exp(\mu_j)}, \; \frac{\left\{\exp(2\mu_j) + 1\right\}^2}{4 n_{0j} \exp(2\mu_j)}\right) \times \prod_{j=1}^{P} N\left(\mu_j \,\Big|\, \mu_{0j}, \frac{1}{n_{0j}}\right). \end{aligned} \tag{13}$$

OBID model estimation
The integration of the content and construct validity analyses requires us to calculate the posterior distribution of the expert data and use the posterior inferences as priors for the participant model parameters, as expressed in equation 13.


Prior to eliciting expert opinions, it is natural to assume that no information exists regarding the items. Thus, flat or non-informative priors can be specified in equations 9 and 10 such that $\sigma^2 \sim IG(0.00001, 0.00001)$ and $\mu_j = g(\rho_j) \sim N(0, 3)$. The MCMC procedure is implemented in the free software WinBUGS [28] to estimate the posterior distribution of $\lambda_j$ based on $\mu_j$ from the experts' data. Three chains are used with a burn-in sample of 2000 draws. The next 10,000 iterations are used to calculate the posterior inferences that form the priors of $\lambda_j$ in the participant IRT model.

The estimation of the $\lambda_j$'s in the participant model can be obtained by using the MCMCordfactanal function included in the free R package MCMCpack [29]. To be specific, the R function utilizes a Metropolis-Hastings within Gibbs sampling algorithm proposed by Cowles [30]. Similarly, the posterior estimation of the $\lambda_j$'s is based on 10,000 iterations after 2000 burn-in draws. The item-to-domain correlations $\rho_j$ can subsequently be calculated from the estimated $\lambda_j$'s via equation 6. An important consideration in any MCMC procedure is the choice of a tuning parameter that influences the acceptance or rejection rate for each model parameter. According to Gelman, Carlin, Stern and Rubin [31] and Quinn [25], the proportion of accepted candidate values should fall between 20 and 50 %. There is no standard “formula” for selecting the most appropriate tuning parameter. As Quinn suggested, users typically adjust the value of the tuning parameter through trial and error. In the upcoming discussion of the simulation study, we have found that the tuning parameter values 1.00, 0.70, 0.50, and 0.30 appear to work well for sample sizes 50, 100, 200, and 500, respectively.
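As an illustration, a call along the following lines fits the participant model with the expert-derived priors. This is a sketch under our assumed objects: Y is a hypothetical response matrix, and the placement of the $\alpha_j$ column within MCMCpack's l0/L0 prior arguments is our assumption about the loading-matrix layout, not a detail stated in the paper.

library(MCMCpack)
# Y: N x P matrix of ordinal participant responses;
# prior$mean, prior$var: expert-derived prior moments from equation 12.
fit <- MCMCordfactanal(Y, factors = 1,
                       burnin = 2000, mcmc = 10000, thin = 1,
                       tune = 1.0,                    # e.g., N = 50, per the guideline above
                       l0 = cbind(0, prior$mean),     # prior means: alpha_j column, lambda_j column
                       L0 = cbind(1, 1 / prior$var),  # prior precisions (1 / variance)
                       store.scores = TRUE)           # keep factor score draws
summary(fit)  # posterior lambda_j's convert to rho_j via equation 6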

Predictive validity
An essential yet often neglected instrument evaluation step is the assessment of predictive validity. Predictive validity is sometimes referred to as criterion-related validity, where the criterion is external to the current predictor instrument. From a statistical standpoint, assuming the availability of an appropriate criterion, the predictive validity is directly indicated by the size of the correlation between predictor scores and criterion scores. However, demonstrating construct validity of an instrument may not always support the establishment of predictive validity, due to factors such as range restriction, where the relevant differences on the predictor or criterion are eliminated or minimized. Thus, the performance of predictive validity depends entirely on the extent to which predictor scores correlate with the criterion scores intended to be predicted [9, 24].

In this article we compare the OBID predictive validity with that of the traditional approach. Using the test scores, or the underlying latent ability parameter $f_i$ of the subjects, the validity coefficient is defined as

$$\gamma = \mathrm{corr}\left(E(f), f^T\right), \tag{14}$$

where $E(f)$ is the posterior mean of the test scores and $f^T$ represents the set of true test scores. In our simulation study, the criterion is assumed to be perfectly measured; thus the correlation of the test score $f_i$ (i.e., the ability parameter) and the criterion score is the same as the validity coefficient corrected for attenuation in the criterion only.
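Given posterior draws of the factor scores, equation 14 reduces to a single correlation (a sketch; f.draws and f.true are hypothetical objects of ours):

# f.draws: iterations x N matrix of sampled factor scores (store.scores = TRUE);
# f.true:  length-N vector of true scores in the simulation.
gamma <- cor(colMeans(f.draws), f.true)  # validity coefficient, equation 14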

Results
Simulation study
In this section, we use simulated data to test the OBID approach by comparing its overall performance to classical instrument development, specifically through the comparison of parameter estimates, MSEs, and predictive validity. Two important assumptions are made by Jiang et al. [21] for BID that also apply to the OBID simulation setting. First, all experts are assumed to agree in regards to interpreting the concept of correlation in their opinions about the items' relevancy; second, the experts' data are assumed to be correlated with the participants' data, with the indication of having either the same opinions or very similar opinions. In addition, the BID study makes the assumption that the true item-to-domain correlation is $\rho^T = 0.50$ for all items. Upon careful consideration, we have decided against this assumption for the current study, as in reality it is rare for all items to have the same moderate item-to-domain correlation. Thus, we employ a mixture of low, moderate, and high (i.e., 0.30, 0.50, and 0.70) true item-to-domain correlations in this simulation study. The simulation is conducted in R software version 3.1.2 [19], including additional inferences and simulation plots. OBID parameter estimation is obtained using the previously introduced MCMCordfactanal function in the R package MCMCpack [29]. In addition, for comparison purposes, ordinal CFA is performed using the cfa function in the R package lavaan version 0.5-17 [18].

Working with the assumed unidimensional model, a five-way factorial design is used to simulate the data. The simulation factors include the number of items on the instrument (4, 6, 9) and the number of response categories per item (2, 5, 7). For simplicity and demonstration purposes, we assume that all items have the same number of response categories in the current simulation. However, it is possible for items to have different numbers of response categories on a questionnaire.


In addition, we examine the effect of expert bias using different numbers of participants (50, 100, 200, 500), numbers of subject experts (2, 3, 6, 16), and types of expert bias (unbiased, moderately biased, highly biased). We define unbiased experts as $\rho_0 = \rho^T$, moderately biased experts as $\rho_0 = \rho^T + 0.1$, and highly biased experts as $\rho_0 = \rho^T + \frac{1 - \rho^T}{2}$. This design results in 432 different combinations of factors. The detailed simulation strategy is as follows:

1. Simulate standardized participant responses $z^*_{ij}$ and convert to $y^*_{ij}$ based on the classical factor model (equation 3); an R sketch of steps 1-2 follows this list. The true item-to-domain correlation $\rho^T$ is specified as $\rho^T = (0.50, 0.30, 0.70, 0.50)$ for all four-item scenarios, $\rho^T = (0.30, 0.50, 0.70, 0.70, 0.30, 0.50)$ for all six-item scenarios, and $\rho^T = (0.30, 0.50, 0.70, 0.70, 0.30, 0.50, 0.70, 0.50, 0.30)$ for all nine-item scenarios.

2. Convert $y^*_{ij}$ to ordinal responses $y_{ij}$ using equation 1 and percentile-based cut points. When the response is binary, or $C = 2$, the single cut point is the 50th percentile of the standard normal. When the response is polytomous, or $C > 2$, the cut points are defined as the $\left(\frac{1}{C}, \dots, \frac{C-1}{C}\right)$th percentiles of the standard normal.

3. Define the prior for the participant IRT model (equation 2) item discrimination parameter $\lambda_j$ using equations 8, 10 and 12. Recall that we previously specified the prior for the negative item difficulty parameter $\alpha_j$ as $\alpha_j \sim N(0, 1)$.

4. Select appropriate tuning parameters to ensure a 20-50 % acceptance rate. As previously mentioned, we have found through trial and error that the tuning parameter values 1.00, 0.70, 0.50, and 0.30 appear to work well for sample sizes N = 50, 100, 200, and 500, respectively.

5. Fit the IRT model on the simulated datasets created in steps 1-2 via MCMCpack and obtain estimates for $\lambda_j$ and $\rho_j$ using equations 5 and 6.

6. Fit the ordinal CFA model on the same simulated datasets created in steps 1-2 via lavaan and estimate $\rho_j$.

7. Perform 100 simulations for each of the scenarios defined by the simulation factors.
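A minimal R sketch of steps 1 and 2 for the six-item scenario follows; the variable names are ours and, for brevity, the percentile cut points are applied to the standardized responses directly (which is equivalent in distribution, since $z^*_{ij}$ has unit variance):

set.seed(1)
N    <- 50                                     # participants
rhoT <- c(0.30, 0.50, 0.70, 0.70, 0.30, 0.50)  # true item-to-domain correlations
C    <- 5                                      # response categories per item
f    <- rnorm(N)                               # factor scores, f_i ~ N(0, 1)
e    <- sapply(rhoT, function(r) rnorm(N, 0, sqrt(1 - r^2)))   # per-item errors
zstar <- outer(f, rhoT) + e                    # equation 3: z*_ij = rho_j f_i + e_ij
cuts  <- qnorm(seq_len(C - 1) / C)             # percentile-based cut points
Y     <- apply(zstar, 2, cut, breaks = c(-Inf, cuts, Inf), labels = FALSE)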

The simulation process for one type of expert bias takes about two days to run on an Intel Core i7 3.40 GHz computer with 32 GB of RAM. In order to compare the overall performances of OBID and CFA, we calculate the average MSE of the item-to-domain correlation estimates and the MSE of the validity coefficient estimates across 100 simulations with 5000 MCMC iterations and 2000 burn-in draws. We denote $\hat{\rho}_j(s)$ as the OBID posterior mean or CFA parameter estimate from the $s$th iteration and $\bar{\rho}_j = \frac{1}{100}\sum_{s=1}^{100} \hat{\rho}_j(s)$. Then

$$MSE(\hat{\rho}_j) = \frac{1}{100}\sum_{s=1}^{100} \left\{\hat{\rho}_j(s) - \rho_j^T\right\}^2, \qquad \overline{MSE} = \frac{1}{P}\sum_{j=1}^{P} MSE(\hat{\rho}_j),$$

$$\left\{Bias\left(\hat{\rho}_j, \rho_j^T\right)\right\}^2 = \left(\bar{\rho}_j - \rho_j^T\right)^2, \qquad \overline{Bias}^2 = \frac{1}{P}\sum_{j=1}^{P} \left\{Bias\left(\hat{\rho}_j, \rho_j^T\right)\right\}^2.$$

For evaluating the predictive validity, we denote $\hat{\gamma}(s) = \mathrm{corr}\left[E\{\hat{f}_i(s)\}, f_i^T(s)\right]$ as the correlation between the posterior mean of the estimated factor scores and the true factor scores for the $s$th iteration. As previously mentioned, we assume that the true criterion is perfectly measured, such that $\gamma^T = 1$. Then $MSE(\gamma) = \frac{1}{100}\sum_{s=1}^{100} \left\{\hat{\gamma}(s) - \gamma^T\right\}^2$. In addition, due to concerns about the performance of CFA with small samples, we record the frequency with which ordinal CFA fails to converge and/or produces “bad” estimates such that $\rho_j \notin [-1, 1]$.
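These summaries are straightforward to script; a sketch, assuming rho.hat is a 100 x P matrix whose sth row holds the estimates from iteration s:

err       <- sweep(rho.hat, 2, rhoT)        # rho_hat_j(s) - rho_j^T
mse.j     <- colMeans(err^2)                # MSE(rho_hat_j), per item
mse.bar   <- mean(mse.j)                    # average MSE across items
bias2.j   <- (colMeans(rho.hat) - rhoT)^2   # squared bias, per item
bias2.bar <- mean(bias2.j)                  # average squared bias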

Figure 1 shows the average MSE of the item-to-domain correlation ρ for unbiased experts when the number of items (P) is six. The participant sample sizes are N = 50, 100, 200, and 500. The numbers of response categories are C = 2, 5, and 7, and the numbers of experts are K = 2, 3, 6, and 16. The MSE for CFA does not change with the number of experts (dashed line), as the expert content validity information is not utilized under the traditional approach. Thus the prior information has no effect on the CFA estimates across different choices for the number of experts. The OBID MSE (solid line) is consistently smaller than the CFA MSE, regardless of sample size and number of response categories, demonstrating the superior performance of the OBID approach. OBID is most promising for smaller samples (e.g., N = 50 or 100). In addition, the OBID MSE decreases as the number of experts increases, with the largest reduction occurring approximately between 3 and 6 experts. When the number of response categories is binary (C = 2), we observe the largest vertical distance between the OBID MSE and the CFA MSE. This vertical distance shrinks as the number of response categories increases, due to an increase in scale information. Similarly, the MSEs for both OBID and CFA decrease as the number of response categories increases; however, the MSE graphs for the five- and seven-point scales become very similar to each other across all sample sizes. It is also expected that the MSEs for both approaches decrease as sample size increases, as a result of decreasing measurement errors. The asymptotic behavior of OBID is evaluated with sample size 500. As we expect, the two approaches produce almost identical MSEs, with OBID being slightly smaller.


When experts are moderately biased (Fig. 2), a similar overall trend is observed as in the unbiased case. OBID continues to outperform CFA in all scenarios; however, the differences in MSEs between OBID and CFA become smaller in the moderately biased case, indicating the effect of biased priors. Additionally, the efficiency gain of the OBID approach increases steadily from 2 to 6 experts and gradually levels off from 6 to 16 experts. This indicates that with moderately biased priors, having more than six experts does not contribute any additional gain in the efficiency of OBID. When priors are highly biased (Fig. 3), our results support similar findings for BID [21], where the relative efficiency of OBID compared with CFA is a function of the number of experts. In the case of a binary response option and sample size 50, OBID produces smaller MSEs than CFA, despite the receding efficiency as the number of experts increases. OBID is most efficient when samples are smaller (e.g., N ≤ 100) and the number of experts is two or three. As the number of experts increases, the impact of highly biased priors is substantial with smaller samples.

Fig. 1 Average MSE of item-to-domain correlation ρ for six items and unbiased experts. Average mean squared error (MSE) for the item-to-domain correlation ρ using OBID (solid blue line) and ordinal CFA (dashed red line) when P = 6 (number of items) and experts are unbiased ($\rho_0 = (0.30, 0.50, 0.70, 0.70, 0.30, 0.50)$). The participant sample sizes are N = 50, 100, 200, and 500. The numbers of response categories are C = 2, 5, and 7, and the numbers of experts are K = 2, 3, 6, and 16. Note. OBID = Ordinal Bayesian Instrument Development; CFA = Confirmatory Factor Analysis


The differences in MSEs between the OBID and CFA approaches exhibit similar patterns when the number of items is four or nine. MSE plots for additional simulation scenarios are included in Additional file 1: Figures S1–S6.

From simply observing the graphs, one may think that although OBID is more efficient, the performance of ordinal CFA is comparable and not a bad choice. However, a close examination of the frequency with which ordinal CFA failed to converge and/or produced “bad” estimates (i.e., $\rho_j \notin [-1, 1]$) reveals the limitations of the classical method with small samples. In the six-item simulation example, when N = 50 and C = 2, ordinal CFA fails to converge in 2 % of simulation iterations and produces out-of-bound correlation estimates in 21 % of simulation iterations. When both the sample size and the number of response categories increase, although all simulation iterations converge, CFA continues to produce 1-3 % out-of-bound correlation estimates.
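For reference, the convergence and admissibility checks can be scripted in lavaan along these lines (a sketch; the one-factor model string and object names are our own):

library(lavaan)
dat <- as.data.frame(Y)
names(dat) <- paste0("item", seq_len(ncol(dat)))
model <- paste("domain =~", paste(names(dat), collapse = " + "))
fit <- try(cfa(model, data = dat, ordered = names(dat), std.lv = TRUE), silent = TRUE)
if (!inherits(fit, "try-error") && lavInspect(fit, "converged")) {
  std <- standardizedSolution(fit)
  loadings <- std$est.std[std$op == "=~"]        # item-to-domain correlations
  out_of_bounds <- any(loadings < -1 | loadings > 1)
}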

Fig. 2 Average MSE of item-to-domain correlation ρ for six items and moderately biased experts. Average mean squared error (MSE) for the item-to-domain correlation ρ using OBID (solid blue line) and ordinal CFA (dashed red line) when P = 6 (number of items) and experts are moderately biased ($\rho_0 = (0.40, 0.60, 0.80, 0.80, 0.40, 0.60)$). The participant sample sizes are N = 50, 100, 200, and 500. The numbers of response categories are C = 2, 5, and 7, and the numbers of experts are K = 2, 3, 6, and 16. Note. OBID = Ordinal Bayesian Instrument Development; CFA = Confirmatory Factor Analysis


The four-item scenarios face more challenges with convergence and reliable estimates with smaller samples. When the number of items is nine, the performance of CFA becomes more stable, with only 6 % out-of-bound estimates in the sample size 50 and binary response option case. The complete table that summarizes CFA performance can be found in Additional file 1: Table S1. In contrast, the OBID approach consistently produces appropriate and reliable correlation estimates without any challenges across all sample sizes and response options.

Lastly, we assessed the predictive validity of the two approaches under simulation settings. Under the previously mentioned assumption, the criterion is perfectly measured (i.e., the ideal target); thus the correlation of the test scores $f_i$ (i.e., the ability parameter) and the criterion scores is the same as the validity coefficient corrected for attenuation in the criterion only. Figure 4 displays the MSEs of the validity coefficient γ computed using both the OBID and CFA approaches when experts are highly biased and the number of items is six. Based on findings from Gajewski et al. [20], subject experts tend to overestimate the relevancy of items, resulting in highly biased item-to-domain correlations.

Fig. 3 Average MSE of item-to-domain correlation ρ for six items and highly biased experts. Average mean squared error (MSE) for the item-to-domain correlation ρ using OBID (solid blue line) and ordinal CFA (dashed red line) when P = 6 (number of items) and experts are highly biased ($\rho_0 = (0.65, 0.75, 0.85, 0.85, 0.65, 0.75)$). The participant sample sizes are N = 50, 100, 200, and 500. The numbers of response categories are C = 2, 5, and 7, and the numbers of experts are K = 2, 3, 6, and 16. Note. OBID = Ordinal Bayesian Instrument Development; CFA = Confirmatory Factor Analysis


The predictive validity of OBID is examined in the extreme case of highly biased priors with a small sample size. For 50 participants, we can clearly observe that the MSE of OBID is smallest with a binary response option (C = 2), compared with the CFA MSE. As the number of response categories increases, OBID continues to have a smaller MSE than CFA, although the differences become much smaller and almost negligible. When we increase the sample size, the two approaches become almost identical in terms of MSEs. A similar trend is observed in the four-item and nine-item scenarios, with the corresponding plots included in Additional file 1: Figures S7–S14. Prior to the simulation, we hypothesized that $MSE(\gamma_{OBID}) < MSE(\gamma_{CFA})$, i.e., that $f_{OBID}$ is more highly correlated with $f^T$ than $f_{CFA}$. The simulation results support this original hypothesis. Thus, we conclude that OBID produces higher predictive validity than the traditional approach, especially for small samples.

Application to PAMS-Short Form satisfaction survey data

Fig. 4 Average MSE of validity coefficient γ for six items and highly biased experts. Mean squared error (MSE) for the validity coefficient γ using OBID (solid blue line) and ordinal CFA (dashed red line) when P = 6 (number of items) and experts are highly biased ($\rho_0 = (0.65, 0.75, 0.85, 0.85, 0.65, 0.75)$). The participant sample sizes are N = 50, 100, 200, and 500. The numbers of response categories are C = 2, 5, and 7, and the numbers of experts are K = 2, 3, 6, and 16. Note. OBID = Ordinal Bayesian Instrument Development; CFA = Confirmatory Factor Analysis


Due to scarcely available mammography-specific satisfaction assessments, researchers at a Midwestern academic medical center developed the Patient Assessment of Mammography Services (PAMS) satisfaction survey (four factors with 20 items) and the PAMS-Short Form (a single factor with seven items) [32]. In this section, we apply the OBID approach to complete data collected from the PAMS-Short Form instrument that was administered to 2865 women: Hispanic (36, 1.26 %), Non-Hispanic white (2768, 96.61 %), African American (34, 1.19 %), and other (27, 0.94 %). Participants rated their satisfaction with each of the seven items using a five-point Likert-type scale, ranging from “poor” to “excellent.” In addition, six subject experts were consulted and instructed to evaluate each of the seven items on a four-point relevancy scale. The University of Kansas Medical Center's Institutional Review Board (IRB) determined that our study does not require oversight by the Human Subjects Committee (HSC), as the data were collected for prior studies and were provided to us in a de-identified fashion.

Based on the sample size for each racial/ethnic group, establishing construct validity evidence for scores for Non-Hispanic white participants is clearly adequate, and traditional CFA will suffice based on the large sample. Yet researchers are interested in establishing score-based construct validity evidence for groups such as Hispanics and African Americans, which are typically small. Classical CFA is ill-suited for such small samples; thus we apply the OBID approach for the analyses of the Hispanic and African American populations. For comparison purposes, we perform OBID with experts' opinions (informative) and OBID without experts' opinions (non-informative), due to estimation challenges with traditional CFA. Flat priors are assigned for the IRT model parameters in the non-informative OBID cases, in which $\alpha_j \sim N(0, 1)$ and $\lambda_j \sim N(0, 4)$. In addition, based on trial and error, we set the tuning parameter value required for MCMCpack to 2.00 for both small populations. The estimated item-to-domain correlations $\rho_j$ and their corresponding standard errors are reported in Additional file 1: Table S2.

compared with the experts’ estimated correlations (.381-.673), for both Hispanic (.570-.920) and African American(.774-.942) populations. By integrating the experts’ opin-ions with participants’ data, informative OBID producesmore reliable results (Hispanic: .466-.717; African Ameri-can: .495-.725) by appropriately lowering the estimated ρj .

Although not reported, the factor score or latent variablescore for each participant (i.e., individual mammographysatisfaction) also is estimated. Since the factor scores areadjusted or corrected for measurement error, patients canbe more accurately classified into diagnostic groups basedon factor scores, and then treated as covariates in subse-quent analyses. The non-informative OBID estimates tend

to have slightly smaller standard errors, which can beviewed as a trade-off between the overestimated reliabilityρ2j and the variance. Overall, as we expect, OBID success-

fully produces reliable item-to-domain correlation esti-mates and overcomes the small sample size challenge thatoften causes classical CFA to fail.

Discussion
As health care moves rapidly toward a patient-centered care model, the development of reliable and valid PROMs is recognized as an essential step in promoting quality care. Despite increasing public awareness, the development of PROMs using traditional psychometric methodologies often is lengthy and constrained by the large sample size requirement, resulting in substantially increased costs and resources. In this study, an innovative OBID approach within a Bayesian IRT framework is proposed to overcome both the small sample size (e.g., patients from small populations or rare diseases) and ordinal data modeling limitations. OBID seamlessly and efficiently utilizes subject experts' opinions (content validity) to form the prior distributions for the IRT parameters in the construct validity analysis, as opposed to using arbitrarily selected priors as in the other Bayesian IRT simulation studies mentioned in the introduction.

A thorough comparison between OBID and traditional CFA is provided through assessing item-to-domain correlation estimates, MSEs, and predictive validity under a simulation setting with three different types of expert bias. Simulation results across all three types of expert bias clearly demonstrate that the overall performance of OBID is superior to that of traditional CFA when the sample size is small (i.e., ≤ 100 participants) and the instrument response option is binary. When subject experts are biased, the gain in efficiency gradually recedes for OBID as the number of experts increases, and traditional CFA eventually becomes more efficient. Although not discussed in the article, the average squared bias for the item-to-domain correlation estimate also is examined across the different expert biases. The corresponding plots are included in Additional file 1: Figures S15–S23. A trade-off is observed, as OBID may exhibit larger bias; yet it reduces the MSEs by decreasing variances. In addition, OBID produces higher predictive validity than the traditional method when the sample size is small. The simulation results are supported by the PAMS-Short Form example, where OBID is successfully applied to small Hispanic and African American populations. The PAMS-Short Form data are available to researchers in a de-identified fashion upon request through e-mail to the corresponding author of this paper.


samples, OBID proves to be an efficient and reliableapproach.One limitation of this study is associated with the

source of experts’ information used in the PAMS-ShortForm example. Opinions from the six content expertswere originally consulted with the purpose of validatingthe PAMS instrument for the American Indian womenpopulation. Although the same set of survey items wasadministered to all American Indian, Hispanic, andAfrican American populations, potential bias could beintroduced due to the original focus of content experts.Nonetheless, as previously mentioned, reliable informa-tion collected from the six experts can still be utilizedto form a “general prior” in establishing construct valid-ity for Hispanic and African American populations. An-other limitation of the study comes from the elicitationof content validity using relevance scales. AlthoughGajewski et al. [26] has demonstrated the appropriate-ness of measuring content validity using relevancescales, the equivalency with measuring content validityusing correlation scales is approximate, which may havean effect on the parameter estimation. A third limita-tion of the study comes from the approximate normaldistribution assumption that we made regarding theprior distribution of the experts after Fisher’s trans-formation. As pointed out by one of the reviewers, po-tential disagreements among selected subject expertsmay occur, which can cause the expert opinion to fol-low a bimodal (i.e., two groups of experts with oppositeviews) or even trimodal distribution. We acknowledgethis limitation as this scenario was not examined in thecurrent simulation study.Two useful practical recommendations can be ex-

Two useful practical recommendations can be extracted from the current study. First, as previously mentioned, no standard method exists for determining appropriate tuning parameter values that ensure the 20–50 % acceptance rate needed for the MCMC procedure. Although trial and error is also used in this study, our findings provide a general guideline: tuning parameter values of 1.00, 0.70, 0.50, and 0.30 appear to work well for sample sizes of 50, 100, 200, and 500, respectively (see the sketch below). Second, our results are consistent with the findings of Polit and Beck [33] regarding the number of subject experts needed to establish content validity. Across the three types of expert bias, the results show that having more than six experts contributes no additional gain in the efficiency of OBID; with highly biased experts, three experts appear to be sufficient for establishing content validity.
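As a minimal illustration of the first recommendation, the following random-walk Metropolis sketch (ours, not the authors’ implementation; the toy target density is a stand-in for an OBID conditional posterior, and its 1/sqrt(n) scaling constant is chosen arbitrarily) shows how the tuning parameter scales the proposal standard deviation and how the realized acceptance rate is monitored against the 20–50 % band:

import numpy as np

def random_walk_metropolis(log_post, init, tuning, n_iter=5000, seed=0):
    """Scalar random-walk Metropolis; `tuning` scales the proposal SD."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_iter)
    current, cur_lp = init, log_post(init)
    accepted = 0
    for t in range(n_iter):
        proposal = current + tuning * rng.standard_normal()
        prop_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < prop_lp - cur_lp:
            current, cur_lp = proposal, prop_lp
            accepted += 1
        draws[t] = current
    return draws, accepted / n_iter

# Toy stand-in target: a normal log-density whose scale shrinks like
# 1/sqrt(n) as the sample size n grows (the constant 2.4 is arbitrary).
# This shrinkage is why smaller tuning values suit larger samples.
for n, tuning in [(50, 1.00), (100, 0.70), (200, 0.50), (500, 0.30)]:
    scale = 2.4 / np.sqrt(n)
    log_post = lambda x, s=scale: -0.5 * (x / s) ** 2
    _, rate = random_walk_metropolis(log_post, init=0.0, tuning=tuning)
    print(f"n={n:4d}, tuning={tuning:.2f} -> acceptance {rate:.2f}")

With this pairing of tuning value and sample size, the printed acceptance rates stay inside the target band; in practice the rates would be checked on the actual OBID posterior rather than this toy density.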

An implication from this study is that a hierarchical model can be considered in the future to incorporate the individual effect of each content expert, because the scores an expert assigns are likely to be correlated from item to item.
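One hedged formulation of such a hierarchical model (ours, for illustration only) adds an expert-level random effect to the transformed ratings,

\[ z_{ie} = \mu_i + b_e + \varepsilon_{ie}, \qquad b_e \sim N(0, \tau^2), \quad \varepsilon_{ie} \sim N(0, \sigma^2), \]

where \(z_{ie}\) is expert e’s Fisher-transformed rating of item i, \(\mu_i\) feeds the prior for item i, and \(b_e\) absorbs each expert’s overall leniency or severity, inducing the within-expert correlation across items noted above.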

In addition, the development of the user-friendly BID software can guide the development of OBID software in which multi-factor models can be evaluated, as it is common for “long-form” questionnaires to encompass several constructs of interest. It is our ultimate goal to extend the application capability of OBID and present it as an efficient and reliable method for researchers and clinicians in future PROMs development.

Conclusions
In this study, the efficiency of OBID is evaluated by comparing its performance with that of classical instrument development using actual and simulation data. The study successfully demonstrated that the OBID approach is more efficient than the classical approach when the sample size is small. OBID promises an efficient and reliable method for researchers and clinicians in future PROMs development for small populations or rare diseases.

Additional file

Additional file 1: Additional Simulation and Application Results. Additional simulation and application results referenced in Sections 3, 4 and 5. (PDF 901 kb)

Abbreviations
AERA: American Educational Research Association; APA: American Psychological Association; BID: Bayesian Instrument Development; CFA: confirmatory factor analysis; DHHS: U.S. Department of Health and Human Services; FDA: Food and Drug Administration; HSC: Human Subjects Committee; IACCV: Integrated Analysis of Content and Construct Validity; IOM: Institute of Medicine; IRB: Institutional Review Board; IRT: item response theory; MCMC: Markov chain Monte Carlo; MSE: mean squared error; NCME: National Council on Measurement in Education; NIH: National Institutes of Health; NQF: National Quality Forum; OBID: Ordinal Bayesian Instrument Development; PAMS: Patient Assessment of Mammography Services; PCORI: Patient-Centered Outcomes Research Institute; PROMs: patient-reported outcome measures; SEM: structural equation modeling; WLSMV: weighted least squares means- and variance-adjusted.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
LG conducted the literature review, participated in the study design, simulated data, performed the statistical analysis, and drafted the manuscript. LP participated in the study design, provided feedback on simulation results and psychometric analyses, and provided critical manuscript revision. MB participated in the study design, provided feedback on simulation results, and provided critical manuscript revision. BG conceived of the study, conducted the literature review, initiated the study design and implementation, reviewed all statistical analyses and simulation results, and provided critical manuscript revision. All authors contributed to and approved the final manuscript.

Authors' information
Not applicable.

Availability of data and materials
Not applicable.


Acknowledgments

Funding
Research reported in this publication was supported by the National Institute of Nursing Research of the National Institutes of Health under Award Number R03NR013236. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Data from PAMS come from P20MD004805.

Author details
1 Department of Biostatistics, University of Kansas Medical Center, Mail Stop 1026, 3901 Rainbow Blvd., Kansas City, KS 66160, USA. 2 College of Education, Texas State University, San Marcos, TX 78666, USA. 3 University of Kansas School of Nursing, Mail Stop 4043, 3901 Rainbow Blvd., Kansas City, KS 66160, USA.

Received: 5 May 2015 Accepted: 21 September 2015

References
1. Institute of Medicine. Crossing the Quality Chasm: A New Health System for the 21st Century. Washington, DC: National Academy Press; 2001.
2. Cella D, Riley W, Stone A, Rothrock N, Reeve B, Yount S, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005–2008. J Clin Epidemiol. 2010;63(11):1179–94.
3. US Department of Health and Human Services Food and Drug Administration. Guidance for Industry Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims. 2009.
4. National Quality Forum. Patient Reported Outcomes (PROs) in Performance Measurement. 2013.
5. Patient-Centered Outcomes Research Institute. The Design and Selection of Patient-Reported Outcomes Measures (PROMs) for Use in Patient Centered Outcomes Research. 2012.
6. Pett MA, Lackey NR, Sullivan JJ. Making Sense of Factor Analysis: The Use of Factor Analysis for Instrument Development in Health Care Research. Thousand Oaks: Sage; 2003.
7. Johnson VE, Albert JH. Ordinal Data Modeling. New York: Springer Science & Business Media; 1999.
8. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: AERA; 2014.
9. Nunnally JC, Bernstein IH. Psychometric Theory. New York: McGraw-Hill; 1994.
10. Messick S. Validity. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education; 1989. p. 13–103.
11. Knapp TR, Brown JK. Ten measurement commandments that often should be broken. Res Nurs Health. 1995;18:465–9.
12. Thurstone LL. Multiple Factor Analysis. Chicago: University of Chicago Press; 1947.
13. Brown TA. Confirmatory Factor Analysis for Applied Research. 2nd ed. New York: Guilford Publications; 2014.
14. Anthoine E, Moret L, Regnault A, Sébille V, Hardouin JB. Sample size used to validate a scale: a review of publications on newly-developed patient reported outcomes measures. Health Qual Life Outcomes. 2014;12:176.
15. Weenink JW, Braspenning J, Wensing M. Patient reported outcome measures (PROMs) in primary care: an observational pilot study of seven generic instruments. BMC Fam Pract. 2014;15:88.
16. Knapp TR. Treating ordinal scales as interval scales: an attempt to resolve the controversy. Nurs Res. 1990;39:121–3.
17. Muthén B. A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika. 1984;49:115–32.
18. Rosseel Y. lavaan: an R package for structural equation modeling. J Stat Softw. 2012;48:1–36.
19. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2014.
20. Gajewski BJ, Price LR, Coffland V, Boyle DK, Bott MJ. Integrated analysis of content and construct validity of psychometric instruments. Qual Quant. 2013;47:57–78.
21. Jiang Y, Boyle DK, Bott MJ, Wick JA, Yu Q, Gajewski BJ. Expediting clinical and translational research via Bayesian instrument development. Appl Psychol Meas. 2014;38:296–310.
22. Arima S. Item selection via Bayesian IRT models. Stat Med. 2015;34(3):487–503.
23. Fox JP, Glas CAW. Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika. 2001;66(2):271–88.
24. Price LR. Psychometric Methods: Theory into Practice. New York, NY: Guilford Publications; in press.
25. Quinn KM. Bayesian factor analysis for mixed ordinal and continuous responses. Polit Anal. 2004;12(4):338–53.
26. Gajewski BJ, Coffland V, Boyle DK, Bott M, Price LR, Leopold J, et al. Assessing content validity through correlation and relevance tools: a Bayesian randomized equivalence experiment. Methodology-Eur. 2012;8(3):81–96.
27. Cohen J. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Erlbaum; 1988.
28. Lunn DJ, Thomas A, Best N, Spiegelhalter D. WinBUGS - a Bayesian modelling framework: concepts, structure, and extensibility. Stat Comput. 2000;10(4):325–37.
29. Martin AD, Quinn KM, Park JH. MCMCpack: Markov chain Monte Carlo in R. J Stat Softw. 2011;42(9):1–21.
30. Cowles MK. Accelerating Monte Carlo Markov chain convergence for cumulative-link generalized linear models. Stat Comput. 1996;6(2):101–11.
31. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. Texts in Statistical Science Series. 2004.
32. Engelman KK, Daley CM, Gajewski BJ, Ndikum-Moffor F, Faseru B, Braiuca S, et al. An assessment of American Indian women's mammography experiences. BMC Womens Health. 2010;10(1):34.
33. Polit DF, Beck CT. The content validity index: are you sure you know what's being reported? Critique and recommendations. Res Nurs Health. 2006;29:489–97.

