
Efficiently Encoding and Modeling Subjective Probability Distributions for Quantitative Variables

Thomas S. Wallsten, Yaron Shlomi, Colette Nataf, and Tracy Tomlinson
University of Maryland

Expert forecasts of quantitative variables in the form of continuous subjective probability distributions are more useful to decision makers than are point estimates or confidence intervals. We present 2 experiments using participants recruited via the Internet aimed at (a) developing methods for estimating and modeling continuous subjective distributions from small numbers of judgments, and (b) assessing the effects of procedural variables on forecasting accuracy and difficulty. Experiment 1 assessed the feasibility of the proposed methods by having participants provide specified quantiles for ratios of area of geometric figures. Gamma and Weibull distributions fit the judgments very well and yielded mean and variance estimates that matched those obtained via established nonparametric methods. In Experiment 2, participants forecasted 3 future values: the date of an Apple product release announcement, the proportion of 2012 Summer Olympics medals that the United States and China would win, and the high temperature in their locality exactly 2 weeks hence. Between-participants variables were number of cut points (3 or 5) and response format (quantiles, cumulative probabilities, or interval probabilities). Overall, probability estimates were better than quantile estimates in terms of accuracy and ease of responding. Five cut points took longer than 3, but did not systematically improve accuracy. Gamma distributions fit the date forecasts well, normal distributions fit the temperature forecasts well, and beta distributions fit the proportion forecasts well. The results are very encouraging for rapid and efficient encoding and modeling of probabilistic forecasts of quantitative variables.

Keywords: subjective probability, forecasting, subjective continuous probability distributions, models of judgment

Supplemental materials: http://dx.doi.org/10.1037/dec0000047.supp

Many forecasting problems concern the value of a continuous quantity at some future point in time, for example, change in gross national product in the next quarter, the expected number of deaths due to an epidemic before it is contained, the increase in sea level due to climate change within the next 50 years, or, very importantly, the date by which an event will occur.1 We report the development and evaluation of a quick and efficient method for eliciting and modeling subjective continuous probability forecasts. The work is part of a broader research program, the Aggregative Contingent Estimation (ACE) Program (Tetlock, Mellers, Rohrbaugh, & Chen, 2014; Warnaar et al., 2012), the goal of which is to develop computer-based systems for eliciting and aggregating experts' probabilistic forecasts regarding real-world sociopolitical events.

1 Although numbers of outcomes or dates technically are discrete, not continuous, it is reasonable to invoke continuous approximations.

This article was published Online First December 14, 2015.
Thomas S. Wallsten, Yaron Shlomi, Colette Nataf, and Tracy Tomlinson, Department of Psychology, University of Maryland. Yaron Shlomi is now at the Media Innovation Lab, Interdisciplinary Center (IDC), Herzliya, Israel. Colette Nataf is now at Mobile Data Labs, Inc., San Francisco, CA.

This research was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract numbers D11PC20059 and D11PC20061. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

Correspondence concerning this article should be addressed to Thomas S. Wallsten, Department of Psychology, University of Maryland, 4094 Campus Drive, College Park, MD 20742. E-mail: [email protected]


Decision, 2016, Vol. 3, No. 3, 169–189. © 2015 American Psychological Association. 2325-9965/16/$12.00 http://dx.doi.org/10.1037/dec0000047


Among the many advantages of continuous probability forecasts over point estimates or subjective confidence intervals (CIs) is that they reduce the possibility of close-call counterfactuals (Tetlock, 2005; Tetlock & Belkin, 1996). The term close-call counterfactual refers to outcomes that come close to occurring, but do not occur. Thus, a forecast giving a high probability that an event will occur after July 7 becomes a close-call counterfactual when the event actually occurs on July 6. How wrong are close calls such as these, or are they so close that they should not be considered wrong at all? Arguments over such matters disappear with continuous probability forecasts, because the entire concept of a close call does not apply.

Beyond reducing the tendentious issue of close-call counterfactuals, continuous probability forecasts of dates and of other quantities provide decision makers with as complete a description of the uncertainty as possible. Thus, instead of being restricted to probabilities associated with possibly artificial or arbitrary cut points, decision makers can obtain probability estimates associated with any values or intervals of interest on the continuum.

Reports in which full probability distributions have been elicited from experts are relatively uncommon, probably because of the difficulty and expense incurred in conducting the process well. Carefully encoding subjective continuous probability distributions can be a time-consuming task that involves much prior preparation as well as back and forth between the facilitator and the expert. For example, Whitfield and Wallsten's (1989) encoding of health experts' judgments regarding dose-response relationships of selected pollutants took 4 to 6 hr per expert.

As another example, both the European Central Bank (ECB; e.g., Bowles et al., 2007; Garcia, 2003) and the Federal Reserve Bank of Philadelphia (Croushore, 1993) regularly survey experts to obtain their probabilistic forecasts of various economic indicator variables. Both banks conduct their surveys by partitioning the variables of interest into as many as 10 to 12 intervals and asking experts to assign probabilities to those bins.2 These surveys do not entail face-to-face interaction between facilitator and expert, but judging 10 to 12 intervals requires a good deal of time and effort from the forecasters.

Morgan (2014) provides an excellent discussion of the issues and difficulties involved in eliciting full probability distributions from experts. Both Garthwaite, Kadane, and O'Hagan (2005) and O'Hagan et al. (2006) provide additional discussion on the same topics. It often is the case, however, as in real-time intelligence analysis, that neither sufficient resources nor time are available to engage in efforts of the sort described by Morgan and illustrated by the ECB and Federal Reserve surveys of experts. In such circumstances, efficient online methods that do not require the aid of a facilitator become crucial.

Putting aside, for the moment, the difficulties and expense of properly eliciting subjective continuous probability forecasts from experts, the empirical evidence is strong that explicit probability judgments lead to more accurate forecasts than do judgments of quantiles. Seaver, von Winterfeldt, and Edwards (1978) showed this to be true in the context of establishing continuous distributions. More recently, Haran, Moore, and Morewedge (2010) showed that having forecasters estimate bin probabilities greatly reduces overprecision compared with having them estimate either 90% CIs or 5% and 95% quantiles, in the sense that the inferred 90% CIs in the first case contained close to 90% of the outcomes, whereas in the latter two cases, they contained only 74% of the outcomes. Other studies, as well, have shown that subjective CIs defined by quantile estimates tend to be too narrow (e.g., Klayman, Soll, González-Vallejo, & Barlas, 1999; Moore, Tenney, & Haran, 2016; Teigen & Jørgensen, 2005). Although methods exist for increasing the width of the CI (Jain, Mukherjee, Bearden, & Gaba, 2013; Soll & Klayman, 2004; Winman, Hansson, & Juslin, 2004), the fact remains that it is an insufficient statistic if the entire subjective distribution is desired.

In addition to eliciting bin probabilities from experts, it is likely that fitting continuous models to the discrete judgments will improve accuracy. And it is precisely these models that will provide decision makers with the flexibility to obtain probabilistic forecasts for any cut points or intervals on the continuum that are of interest to them.

2 Examples can be found in ECB quarterly reports of their surveys available at http://www.ecb.europa.eu/stats/prices/indic/forecast/html/index.en.html, or in Federal Reserve reports at http://www.philadelphiafed.org/research-and-data/real-time-center/survey-of-professional-forecasters/form-examples/SpfForm-14Q1.pdf.



It is well established that multiple regression models of human judgments perform better than the humans themselves (Dawes, Faust, & Meehl, 1989; Goldberg, 1970; Hoffman, 1960). In all these studies, the probability judgments were based on multidimensional cues (e.g., Minnesota Multiphasic Personality Inventory [MMPI] profiles) with respect to binary outcomes (e.g., psychotic or neurotic). In this article, we extend that framework to estimating continuous probability distributions as models of probabilistic forecasts about continuous variables.

We are aware of only one study that has fit continuous distributions to subjective cumulative probability judgments. Abbas, Budescu, Yu, and Haggerty (2008) used a pair-comparison method to elicit participants' probability forecasts of either the closing value of the Dow-Jones Average (DJA) or the high temperature in Palo Alto 1 week in the future, and then fit beta distributions to the resulting judgments. Participants first estimated lower and upper bounds for the variable in question in order to set the range of interest and then responded to a series of paired comparisons. For each pair comparison, participants saw a putative value of the variable (high temperature or high DJA) on one side of the display and a probability wheel, radially divided into a gray and a white sector, on the other. They had to choose whether to bet on the wheel or on the value the variable would attain in exactly 1 week. Specifically, they were asked whether they would rather base a hypothetical $20 lottery on the spin of the wheel landing on gray or on the variable outcome 1 week hence being less than the displayed value. Then, depending on the condition, the computer adjusted the value of the displayed variable or the value of the displayed wheel probability up or down according to the participant's previous response. The procedure was continued according to an algorithm that systematically reduced the step size and terminated when the size was sufficiently small to estimate the indifference point to the desired level of precision. In the estimate-quantile condition, the probability wheel was set at 5%, 25%, 50%, 75%, or 95% gray, and the value of the displayed variable was adjusted from trial to trial in a mixed order. In the estimate-probability condition, the displayed variable was fixed at five values set within each individual's upper and lower bounds and the probability wheel was adjusted from trial to trial in a mixed order. Monotonicity of indifference points was not forced, but was quite good, and beta distributions provided quite respectable fits to the judgments. In a variety of ways, the estimate-probability method tended to be preferable to the estimate-quantile. Response times were somewhat quicker, the distributions were slightly more accurate, and the participants preferred it.

In many ways, the Abbas et al. (2008) study is a model of the online probability encoding system we are striving for. However, for all its advantages, the pair-comparison method is very time consuming. It would not do in situations, such as the ACE context, in which experts (e.g., intelligence analysts) can be expected to devote very little time to the probability encoding process. It is necessary, therefore, to explore methods that can proceed more rapidly and still lead to well considered, reliable, and useful subjective probability distributions.

We report two experiments. For reasons that will become clear, Experiment 1 had participants estimate quantiles in a perceptual task. The goals were to assess (a) the feasibility of encoding subjective distributions under ACE conditions, (b) the effects of procedural variables on coherence and accuracy, and (c) methods for estimating full probability distributions from small numbers of judgments. Experiment 2 extended the method and the goals to probability as well as quantile estimation in real-world forecasting contexts.

An important question is how to decide which formal probability distribution provides the best model for an individual's set of discrete judgments. The relevant considerations are both empirical and epistemological. We present some of them in the course of modeling the data, and consider the larger issues at the end in the General Discussion section.

Experiment 1

For this experiment, we sought a variable with a real answer that cannot be looked up on the Internet and for which everyone would have roughly equal expertise. We achieved this end by presenting respondents with either a large square and a small circle, or a large circle and a small square, and eliciting their judgments about the ratio of the area of the large to the small shape.



We used only the quantile-estimation method, despite its drawbacks mentioned above, because there exists a validated distribution-free algorithm to estimate means and variances from judged quantile values (Lau, Lau, & Ho, 1998). We considered that step important in order to have model-free estimates of those two parameters, against which we could compare the means and standard deviations estimated from fitted distributions. Additional aims were to compare the means with the correct values, note the behavior of the estimated means and variances in response to certain experimental manipulations, and to assess issues associated with fitting formal distributions to the judgments.

Method

Participants. We recruited 99 participants from a variety of websites designed to offer online experiments, such as Psychological Research on the Net through Hanover College. Participation was voluntary and without compensation. We stopped collecting data when the response rate per day dropped to zero. The experiment was available online for a total of 131 days.

Design. The design was mixed 3 × 2 × 2 × 2, Ratio × Shape × Direction × Number of Quantiles. There were no other conditions in the experiment. Ratio was a within-subject variable and the other three factors were between-subjects. These independent variables are described next. Figure 1 provides a screen shot of one of the 24 cells and should be referenced as the variables are introduced.

We defined three levels of ratio by using a single large shape and varying the area of the small shape over trials. Every participant judged the three area ratios of 15, 30, and 45, with the sequence randomized across participants. Figure 1 shows an example of the Ratio = 45 condition. The experiment was programmed in Qualtrics and participants were randomized into the remaining 2 × 2 × 2, Shape × Direction × Quantile, cells. For shape, participants judged either small squares relative to one large circle (circle-standard condition, shown in Figure 1) or the reverse (square-standard condition). For direction, they provided judgments beginning either with the subjective median and then working toward the tails of the distribution (in–out, as in Figure 1), or beginning at the tails and then working inward (out–in). For quantiles, participants judged either three quantiles (3Q) or five quantiles (5Q, as in Figure 1).

Participants in the 3Q conditions judged the 5th, 50th, and 95th percentiles. Those in the 5Q condition also judged the 25th and 75th percentiles.

Procedure. Participants were randomly assigned to one of the eight conditions, as described. Upon providing informed consent, they began with a training screen consisting of a large rectangle and a small triangle. The ratio of areas was 15:1. The remaining factors, 3Q or 5Q and in–out or out–in, were matched to the condition into which the participant had been randomized.

Upon completion of the training trial, participants moved immediately to the three experimental sequences, with one screen each for the ratios 15, 30, and 45 in randomized order. The questions are illustrated in Figure 1. Participants received no feedback during the training or test trials.

We collected no other dependent variables.

Results

Monotonicity of judgments. Monotonic judgments are necessary in order both to apply the Lau et al. (1998) equations and to estimate continuous probability distributions. Recognizing this fact, we nevertheless did not force strict monotonicity in order to assess the extent to which it occurred naturally. We address the latter question first and then return to the primary matter of comparing model-free and parametric estimates of judgment means and standard deviations.

Strict monotonicity requires that xi > xj iff pi > pj, where xi is the estimated value of the ratio corresponding to percentile pi and iff denotes "if and only if." Our measure of monotonicity is Wilson's e (Gonzalez & Nelson, 1996), defined as


e = (a − d) / n,

where a is the number of (xi, xj) pairs correctly ordered, d is the number incorrectly ordered, and n is the total number of pairs (n = 6 in 3Q conditions and n = 10 in 5Q conditions). Note that it is possible that a + d < n due to possible ties, that is, xi = xj.

Wilson's e varies from −1 to +1, with −1 indicating strict inverse monotonicity and +1 strict positive monotonicity. We calculated three values of e per participant, one for each stimulus ratio. The overall mean value was 0.36 and there were no effects of the independent variables (all F ratios < 1).
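As a concrete illustration, here is a minimal sketch (ours, not the authors' code) of the index for one set of judged quantiles; treating every unordered pair of judgments as one comparison is an assumption about the pair-counting convention.

```python
from itertools import combinations

def wilsons_e(percentiles, judged_values):
    """Wilson's e = (a - d) / n for one set of judged quantiles.

    percentiles:   the target percentiles p_i, e.g., [5, 50, 95]
    judged_values: the corresponding judged values x_i

    a counts pairs ordered consistently with their percentiles, d counts
    pairs ordered inconsistently, and n is the total number of pairs
    (ties count toward n but toward neither a nor d).
    """
    pairs = list(combinations(zip(percentiles, judged_values), 2))
    a = sum((xj - xi) * (pj - pi) > 0 for (pi, xi), (pj, xj) in pairs)
    d = sum((xj - xi) * (pj - pi) < 0 for (pi, xi), (pj, xj) in pairs)
    return (a - d) / len(pairs)

# A strictly monotonic 3Q judgment yields e = 1; a reversal lowers it.
print(wilsons_e([5, 50, 95], [20.0, 30.0, 55.0]))  # 1.0
print(wilsons_e([5, 50, 95], [30.0, 20.0, 55.0]))  # 0.33...
```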

From another perspective, 42 of the respondents showed perfect monotonicity, that is, had e = 1 for all three sets of estimates. Another 21 participants had positive mean e values less than 1, and 36 participants had mean values of e that were either negative or zero. We eliminated one of the 42 participants with perfectly monotonic judgments because that person's estimates were extreme outliers relative to everyone else's, exceeding the true ratios by more than 100 for two of the three stimuli. Our primary analyses are limited to the remaining 41 participants with strictly monotonic judgments. There were 26 such participants in the 3Q condition and 15 in the 5Q condition.

Nonparametric estimates of subjective means. We used two methods for estimating subjective means and variances of the target distributions. The first relied on empirical approximations developed by Lau et al. (1998), and the second on fitting the gamma and Weibull distributions to the judgments.

Figure 1. Screen shot of the Ratio = 45, circle-standard, 5Q, in–out condition, Experiment 1.



Lau et al. (1998) used Monte Carlo techniques to develop multiple regression equations for estimating means and variances that would be robust over distributions that varied widely in skewness and kurtosis.3 For each sample, they regressed various combinations of the quantiles against the distribution's population mean and standard deviation. The resulting equations were surprisingly accurate, and increasingly so as the number of quantiles in the equation increased. We used their three-quantile equations, relying on the estimated x.05, x.50, and x.95 quantiles, corresponding to the 5th, 50th, and 95th percentiles, to estimate the subjective means and variances. Note that this method relied on all the estimates in the 3Q condition but only on three of the five estimates in the 5Q condition. We applied the following equation from Lau et al. separately to each participant's estimates for each ratio:

μ = .63x.50 + .185(x.05 + x.95).

The top part of Table 1 shows the mean (and standard error) of the estimated means for the Lau et al. (1998) method as a function of the ratio and quantile conditions, collapsed over shape and direction. It is evident that the ordering of the means is correct but that, in all cases, the 3Q ratios are overestimated. In contrast, the 5Q ratios are overestimated only for the true ratio of 15; the other two ratios are underestimated. It is also clear that the 5Q means are consistently closer to the true ratios than are the 3Q means.

Simple ANOVAs on the Lau et al. (1998) estimates showed a significant effect of shape on accuracy, F(1, 34) = 26.34, p < .01, with systematic overestimation of the ratios in the square-standard condition and underestimation in the circle-standard condition. This is an interesting result, but not of direct concern for present purposes and therefore we do not pursue it further. Accuracy was not significantly affected by the direction of estimation, beginning with the median or with the tails.

Nonparametric estimates of subjective variances. Lau et al. (1998) also developed robust equations to estimate population standard deviations in the same manner that they developed the equations for the means. We used their three-quantile equation to estimate the variance of each participant's judgments for each area ratio:

σ = √{.63(x.50 − μ)² + .185[(x.05 − μ)² + (x.95 − μ)²]}.

The top panel of Table 2 shows the means (and standard errors) of the standard deviation estimates. The standard deviations increase as a function of the ratio and are greater in the 3Q than the 5Q condition, but none of the differences are significant.
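To make the nonparametric estimates concrete, the sketch below (ours, with made-up quantile values) applies the two three-quantile equations; reading the .63 weight in the σ equation as attaching to the median term, in parallel with the μ equation, is our assumption.

```python
import math

def lau_mean(x05, x50, x95):
    """Three-quantile approximation to the subjective mean (Lau, Lau, & Ho, 1998)."""
    return 0.63 * x50 + 0.185 * (x05 + x95)

def lau_sd(x05, x50, x95):
    """Companion three-quantile approximation to the subjective standard deviation,
    applying the same .63/.185/.185 weights to squared deviations from the mean."""
    mu = lau_mean(x05, x50, x95)
    return math.sqrt(0.63 * (x50 - mu) ** 2
                     + 0.185 * ((x05 - mu) ** 2 + (x95 - mu) ** 2))

# Hypothetical judged 5th, 50th, and 95th percentiles for one area ratio.
x05, x50, x95 = 18.0, 28.0, 50.0
print(round(lau_mean(x05, x50, x95), 2))  # 30.22
print(round(lau_sd(x05, x50, x95), 2))
```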

Modeling the judgments. We used both the gamma and the Weibull distributions to model the data because both are bounded at zero, as are the ratios, and because they are easy to work with. We take up further considerations about distribution choice in the General Discussion. For both distributions, we sought model parameters that minimized the squared deviations between the observed (participant estimated) quantile values and theoretical values under the distribution, as shown in the top row of Table 3. Specifically, pi in Table 3 refers to the fixed probability values presented to the participants, qi refers to the participant's estimated quantile values for each pi, θ is the vector of parameters for the distribution being modeled, and FX^(-1)(pi, θ) is the modeled quantile value for the given pi and θ.

3 Rather than select specific distributions, Lau et al. (1998) randomly sampled among distributions of standardized variables (μ = σ = 1) differing in skewness and kurtosis within a space defined by functions of those moments. See Lau et al. for details.

Table 1
Means (Standard Errors) of μ Estimated From the Lau, Lau, and Ho (1998) Equation and via the Gamma and Weibull Distributions in Experiment 1

                               Condition
Ratio            3Q (n = 26)        5Q (n = 15)

Lau, Lau, and Ho (1998)
15               27.42 (2.49)       18.33 (3.28)
30               44.04 (4.36)       28.93 (5.74)
45               51.65 (5.96)       43.07 (7.84)

Gamma distribution
15               27.42 (2.49)       18.20 (3.28)
30               44.00 (4.40)       28.93 (5.79)
45               51.62 (6.01)       42.53 (7.91)

Weibull distribution
15               27.77 (2.56)       18.40 (3.37)
30               44.62 (4.44)       29.20 (5.85)
45               52.23 (6.07)       43.27 (7.99)

Note. 3Q and 5Q refer to the number of quantiles judged, 3 and 5, respectively.



Details of the estimation procedure are provided in the Appendix. One point must be emphasized here: Although we could have used all five judged quantiles to estimate the model parameters in the 5Q conditions, we chose not to do so in order to maintain comparability with the Lau et al. (1998) methods. Thus, we used only the estimates of the quantiles associated with the 5th, 50th, and 95th percentiles. An advantage of omitting the quantile estimates corresponding to the 25th and the 75th percentiles is that they are available for comparing with the predicted values under each of the two distributions as another index of which distribution provides the better model. We report that comparison in the next section.
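The general shape of such a fit is easy to sketch. The snippet below is our illustration of the Table 3 quantile objective, not the authors' routine (their estimation details are in their Appendix): it finds gamma and Weibull parameters that minimize the squared deviations for one hypothetical set of judged quantiles and then reads off the implied mean and standard deviation.

```python
import numpy as np
from scipy import stats, optimize

def fit_by_quantiles(dist, p, q, x0):
    """Minimize sum_i [q_i - F^{-1}(p_i; shape, scale)]^2 (Table 3, quantile row)
    for a two-parameter scipy distribution such as stats.gamma or stats.weibull_min."""
    def loss(theta):
        shape, scale = theta
        if shape <= 0 or scale <= 0:          # keep the search in the valid region
            return 1e12
        return float(np.sum((q - dist.ppf(p, shape, scale=scale)) ** 2))
    return optimize.minimize(loss, x0, method="Nelder-Mead").x

# Hypothetical judged 5th, 50th, and 95th percentiles for one participant/ratio.
p = np.array([0.05, 0.50, 0.95])
q = np.array([18.0, 28.0, 50.0])

g_shape, g_scale = fit_by_quantiles(stats.gamma, p, q, x0=[3.0, 10.0])
w_shape, w_scale = fit_by_quantiles(stats.weibull_min, p, q, x0=[2.0, 30.0])

print("gamma   mean, sd:", stats.gamma.mean(g_shape, scale=g_scale),
      stats.gamma.std(g_shape, scale=g_scale))
print("weibull mean, sd:", stats.weibull_min.mean(w_shape, scale=w_scale),
      stats.weibull_min.std(w_shape, scale=w_scale))
```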

The distribution-based estimates of the means and standard deviations are shown in the bottom two panels of Table 1 and Table 2, respectively.4 The results are very similar to those obtained with the Lau et al. (1998) equations and to each other. Although repeated-measures ANOVAs do show significant differences among the models, they are so small as to not be of interest.5 To a very good approximation, the three estimation methods yield identical results.

Quality of gamma and Weibull fits. Although it is reassuring that estimates of the subjective means and variances obtained via both distributions were virtually identical to those obtained via the Lau et al. (1998) equations, it is important to assess whether (a) the modeled distributions provide reasonable descriptions of the judged cumulative distributions, and (b) one of the two distributions is systematically closer to the data than is the other. Recall that we used only x.05, x.50, and x.95 to estimate the distribution parameters. To assess the descriptive quality of the distributions, for each participant and for each ratio, we calculated the mean-squared deviation (MSD) between the judged quantiles used in the estimation and the predicted values, and converted that to

S = 1 − MSD / Var,     (1)

where Var is the variance of the data points.6
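A minimal sketch of this fit index (ours; whether Var uses n or n − 1 in its denominator is an assumption here):

```python
import numpy as np

def fit_index_S(observed, predicted):
    """S = 1 - MSD / Var (Equation 1): mean squared deviation between observed
    and predicted points, scaled by the variance of the observed points.
    S is near 1 for a good fit and can go negative for a badly misfitting model."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    msd = np.mean((observed - predicted) ** 2)
    return 1.0 - msd / np.var(observed)

# Judged vs. model-implied quantiles for one participant (hypothetical numbers).
print(round(fit_index_S([18.0, 28.0, 50.0], [18.4, 27.3, 50.6]), 3))
```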

4 Virtually the same results obtain when the model estimates are based on all five judgments.

5 Separately within each shape condition (circle, judging how many small squares could fit in a large circle; and square, judging how many small circles could fit in a large square), we ran a repeated-measures generalized linear model ANOVA on the estimated means, with model type and ratio as repeated factors and order and quantile as between-subjects factors. For the circle condition, there was a main effect of model, F(1.29, 27.04) = 7.4, p < .01, and ratio, F(1.94, 40.74) = 22.91, p < .01; all degrees of freedom adjusted according to the Greenhouse-Geisser correction. For the square condition, there was a main effect of model, F(1.80, 23.34) = 6.26, p < .01, and a three-way interaction with model, ratio, and quantile, F(2.05, 26.68) = 5.91, p < .01.

Table 2
Means (Standard Errors) of σ Estimated From the Lau, Lau, and Ho (1998) Equation and via the Gamma and Weibull Distributions in Experiment 1

                          Condition
Ratio            3Q                5Q

Lau, Lau, and Ho (1998)
15               4.62 (.79)        3.07 (1.03)
30               7.04 (1.49)       4.47 (1.96)
45               7.85 (1.57)       6.93 (2.07)

Gamma
15               4.54 (.78)        3.00 (1.03)
30               6.96 (1.49)       4.40 (1.96)
45               7.65 (1.56)       6.67 (2.06)

Weibull
15               4.54 (.78)        3.07 (1.03)
30               7.04 (1.51)       4.47 (1.99)
45               7.81 (1.59)       6.80 (2.09)

Note. 3Q and 5Q refer to the number of quantiles judged, 3 and 5, respectively.

Table 3
Objective Function for Each Response Format

Elicitation format          Objective function

Quantile                    Σ_{i=1}^{n} [qi − FX^(-1)(pi, θ)]²
Cumulative probability      Σ_{i=1}^{n} [pi − FX(qi, θ)]²
Interval probability        Σ_{i=1}^{n} [pi − (FX(qi, θ) − FX(qi−1, θ))]²

Note. Experiment 1 used only the quantile format. Experiment 2 used all three formats. FX is the modeled cumulative probability distribution, θ is the estimated parameter vector for the modeled distribution, and pi and qi are the elicited or experimenter-defined probabilities and quantiles, respectively, corresponding to fixed values i.



The mean values of S (and standard errors), collapsed over the between-participants conditions, are shown in the top of Table 4, labeled "S in-sample," for each ratio magnitude. It is evident that the fits are excellent and appear not to be systematically different for the two distributions. An additional comparison comes from predicting the out-of-sample data, x.25 and x.75, in the 5Q condition. Now, Var in the equation above is the variance of those two points. The results are shown in the bottom of Table 4, "S out-sample." As to be expected, the values are lower than for the in-sample data, but still very good.

As a final test of whether one of the two distributions tended to fit the judgments better than the other, we calculated six difference scores for each participant: one for each ratio for the out-of-sample data and one for the in-sample data. The scores consisted of the signed difference between the S values for the gamma and the Weibull distributions. Then, separately for each ratio and separately for the out-of-sample and the in-sample data, we computed the t statistic over the 41 participants against the null hypothesis of zero difference. Of the six t tests, only the one for the in-sample Ratio = 30 data was significant, t(40) = 3.02, two-tailed p < .01. We also did a binomial test for the number of positive versus negative differences. The corresponding sign test also was significant at p < .01. As can be seen in Table 4, the gamma fit slightly better than the Weibull in that case, but the difference is small.

It is safe to conclude that neither distribution is systematically better than the other at representing these judgments at the aggregate level. However, looking at the data of individuals, it is apparent that in some instances, the gamma is clearly better than the Weibull, even when the latter is pretty good; in some instances, the reverse is true, and in yet others, both fit approximately to the same degree. Figure 2 shows examples of the three situations.

Discussion

Focusing first on the feasibility of estimating continuous subjective probability distributions from a small number of discrete judgments, a prime desideratum is that the judgments be strictly monotonic. In most applied risk assessments, the analyst collecting the judgments from the expert uses any number of methods to enforce monotonicity. We deliberately did not do so in this online study in order to gauge the degree to which it would occur naturally when participants provide a small number of unaided quantile estimates. In fact, only 42 of the 99 participants (42.4%) consistently provided strictly monotonic estimates. And too often, the failures of monotonicity were substantial. This result contrasts with the substantial, although not perfect, satisfaction of strict monotonicity in the Abbas et al. (2008) study. The primary differences between our and their experiments are (a) theirs used the pair-comparison procedure and ours used numerical estimation, and (b) theirs was done in a laboratory context and ours was done anonymously via the Internet. Therefore, any online system for obtaining subjective distributions over numerical variables via estimation must enforce strict monotonicity.

6 The statistic S is very similar to R², percent variance accounted for, except that it can go negative when the fitted function deviates substantially from the data points, indicating a completely inappropriate model.

Table 4
Means (Standard Errors) of S for Gamma and Weibull Distributions Fit to the Judged Quantiles in Experiment 1

                       Distribution
Ratio            Gamma             Weibull

S in-sample
15               .98 (.005)        .98 (.007)
30               .99 (.003)        .98 (.004)
45               .98 (.004)        .98 (.004)

S out-sample
15               .93 (.01)         .91 (.01)
30               .91 (.01)         .89 (.02)
45               .87 (.03)         .87 (.03)

Note. "S in-sample" refers to how well the modeled values fit the judged x.05, x.50, and x.95 used to estimate the model. "S out-sample" refers to how well the modeled values predict the judged x.25 and x.75, which were not used in estimating the model.



However, when judgments were strictly monotonic, it was straightforward to estimate subjective means and variances, and to fit distributions. The subjective means were uniformly properly ordered, and, as expected, subjective variances were smaller when the distributions were closer to the lower bound of zero. It did not matter whether participants began at the tails and worked toward the median or in the other order. Also noteworthy are the results that the means of the subjective distributions were closer to the correct values and the estimated standard deviations were smaller in the 5Q than in the 3Q conditions, perhaps reflecting the greater cognitive effort required by making five rather than three judgments.

In this regard, the estimated subjective means and variances derived from the Lau et al. (1998) equations and from the gamma and Weibull distributions agreed very well with each other, providing strong internal validation to our procedures. Moreover, the good S values, both in-sample and out (in the 5Q condition), further attest to the reasonableness of the probability models.

Experiment 2

Having shown the feasibility of fitting accurate continuous distributions to a small number of probability judgments, Experiment 2 extends the approach to probabilistic forecasting of three kinds of variables, each requiring a different distribution family. The variables are event date, with forecasts to be modeled by distributions over the nonnegative real numbers (as we did in Experiment 1); future daily high temperature, with forecasts to be modeled by unbounded distributions; and a proportion, with forecasts to be modeled by distributions bounded at 0 and 1. In all cases, the system enforced forecast monotonicity.

We contrast three forecasting modes: estimation of quantiles, of interval probabilities, and of cumulative probabilities. Based on the literature reviewed above, we expect the probability judgments to be superior to the quantile judgments in terms of performance and user acceptability, but it is not clear whether there will be a difference between interval and cumulative probability judgments.

Crossed with forecasting mode is the number of judgments elicited. Will the Experiment 1 result that five judgments led to better performance than three replicate?

Figure 2. Three examples of fits of the gamma and the Weibull distributions to the in-sample and out-of-sample judgments in Experiment 1 showing (A) roughly equivalent fits for both, (B) a better fit of the Weibull, and (C) a better fit of the gamma.


Method

Participants. A total of 488 participants (248 female, 238 male, 2 unspecified; 481 from across the United States, 3 from elsewhere, and 4 who failed to indicate their country)7 responded within a 24-hr period via the Qualtrics site. Therefore, we ceased data collection after 1 day. The respondents represented a broad spectrum of ages and educational levels. Their reported mean age was 42.6, with a standard deviation of 14.0, based on categorical age responses. The youngest age category was 18 to 25 and the oldest was greater than 75. The sample consisted of 49 high school graduates, 196 with some college or an associate's degree, 166 with a bachelor's degree, and 77 with postbachelor education.

Design. The mixed design was 3 × 3 × 2, Question (date, temperature, or proportion) × Response Format (estimate quantiles, interval probabilities, or cumulative probabilities) × Number of Cut Points (three or five). Question was manipulated within participants and the other two factors were manipulated between participants. Table 5 shows the sample size per condition.

We ran the experiment in early July 2012, shortly before Apple was expected to announce the release date of the iPhone 5 and before the Summer Olympics were held in London, United Kingdom. The three forecasting questions were

• When will Apple officially announce the release date of the iPhone 5?

• What proportion of medals (gold, silver, bronze) will the United States and China win in the 2012 Summer Olympics?

• What will the daily high temperature be in your location 2 weeks from today?

For the temperature forecast, we first asked participants whether they preferred to work in Fahrenheit or Celsius, and then asked them to enter their zip code so that we could later determine the correct temperature for them.

The first two major rows of Table 6 show the cut points used for the questions in the interval and cumulative probability estimation conditions. The five-cut-point conditions used all five values shown in the table; the three-cut-point conditions omitted the bracketed values. (Due to an experimental error, participants in the cumulative-probability, five-cut-point condition had a sixth cut point set at U + (U − L)/6.) We fixed the cut points for the date question, as it was well known that the release date was imminent. The Interval Probability conditions for the date question also included a never category to which participants could assign positive probability, should they want to. The cumulative probability conditions did not have a never category, as that was implicitly included in the open interval beyond the last cut point.

Because there was considerable uncertainty regarding the medal proportions and the range of daily high temperatures (the latter because respondents could be from anywhere in the United States), we required respondents to provide their judgments of lower and upper bounds (L and U) on each variable. The table shows how we used those values to set cut points uniquely for each respondent. Participants provided their probability estimates by moving sliders on continuous horizontal scales, labeled with the cut point at the left end and marked off at 20% intervals from 0% to 100%. The scales associated with the various cut points were arrayed one below the other from lowest cut point on the top to highest at the bottom of the screen. The computer enforced weak monotonicity for the cumulative probability judgments and required the interval probability judgments to sum to 100%.

The bottom row of Table 6 shows cut points for the quantile estimation condition. Values omitted for the three-judgment condition are shown in brackets. Note that the five quantiles here differ from the five used in Experiment 1.

7 Two of the three other countries are Uzbekistan and Vanuatu, which immediately follow the United States in the drop-down menu, suggesting that they may have been response-entry errors. The remaining other country is Botswana.

Table 5
Sample Size Within Conditions for Experiment 2

                              Response format
No. of cut points    Cumulative probability    Interval probability    Quantiles^a    Total

3                            86                        81                  82         249
5                            82                        83                  74         239
Total                       168                       164                 156         488

^a The effective sample sizes in the quantile conditions for the forecasts of the iPhone release date are 26 and 33 for the three and five cut point conditions, respectively. See the text for details.


Upon realizing that probability judgments associated with the 25th and 75th quantiles had no effect on the model estimates, we decided to drop those quantiles and to include the 1st and 99th instead. Participants entered their estimates by typing numbers in response to the prompts shown below; bracketed terms below were replaced by words appropriate to the variable. The prompts were arranged one below the other, with the 1% one at the top of the screen and the 99% one at the bottom of the screen. The computer enforced strict monotonicity. The prompts were

• ". . . a 1% (i.e., 1 in 100) chance that the [actual value] will be [LESS] than it."

• ". . . a 5% (i.e., 1 in 20) chance that the [actual value] will be [LESS] than it."

• ". . . a 50% chance that the [actual value] will be [less] than it and a 50% chance that the [actual value] will be [GREATER] than it. I.e., you would be equally surprised to learn that your estimate was too high as you would be to learn it was too low."

• ". . . a 5% (i.e., 1 in 20) chance that the actual [value] will be [GREATER] than it."

• ". . . a 1% (i.e., 1 in 100) chance that the actual [value] will be [GREATER] than it."

All quantile estimates were in the form of numbers. For the iPhone release date, we provided a template within which respondents entered the numerical "yymmdd" to express the year, month, and day. For example, to respond "August 15, 2012" they entered "120815." For the Olympics medals question, they entered proportions from 0 to 100, and for the temperature question, they entered the temperature in their selected scale, Fahrenheit or Celsius.
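For illustration, the template is straightforward to decode (a trivial sketch, not part of the experimental software):

```python
from datetime import datetime

# The quantile-format date prompts used a numeric "yymmdd" template;
# e.g., "120815" stands for August 15, 2012.
print(datetime.strptime("120815", "%y%m%d").date())  # 2012-08-15
```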

At the bottom of each screen across all conditions was a rating scale that asked participants to rate how easy or difficult they found the question. Equidistant labels on the scale, from left to right, were very easy, easy, somewhat easy, neutral, somewhat difficult, difficult, and very difficult.

Procedure. Participants were randomly assigned to the six conditions of the design. Following the consent form and optional demographic questions, they responded to the three forecasting questions. Half of them saw the temperature question first, then the proportion question; the other half saw those questions in the reverse order. All participants received the date question last, because its response template in the quantile estimation condition, yymmdd, was different from the other response formats and we thought that would minimize confusion.

At the end of the session, we administered a brief numeracy test (five items taken from the 10-item scale in Table 2 of Lipkus, Samsa, & Rimer, 2001, consisting of the three general items plus Numbers 4 and 7 of the expanded scale). There were no numeracy effects associated with the independent variables, perhaps due to our having truncated the scale, and therefore we do not consider these data further.

We collected no other dependent variables.

Results

Difficulty ratings. Table 7 shows mean difficulty ratings (and standard errors) as a function of question, response format, and number of cut points. Temperature forecasts were rated easier than the other two, but of greater interest, quantile estimation was rated as more difficult than were either of the two probability formats, which did not differ from each other. There was no rated difficulty difference between three and five cut point conditions.

Table 6
Summary of the Cut Points Used in the Various Conditions of Experiment 2

Forecast problem             Estimation cut points

Estimate interval or cumulative probabilities
Date                         August 1, 2012   [August 7, 2012]   August 15, 2012   [August 22, 2012]   September 1, 2012
Temperature or proportion    [L]   L + (U − L)/6   (L + U)/2   U − (U − L)/6   [U]

Estimate quantiles
All                          1%   [5%]   50%   [95%]   99%

Note. L and U are lower and upper bounds set by each participant. Bracketed cut points were used in the five-response conditions only.
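A small sketch of how the respondent-specific cut points in Table 6 could be computed (ours; reading the middle cut point as the midpoint of L and U is an assumption):

```python
def probability_format_cut_points(L, U, five=False):
    """Cut points for the temperature and proportion questions in the
    probability-format conditions, following Table 6. L and U are the
    respondent's own lower and upper bounds; the end points L and U
    themselves are used only in the five-cut-point cells."""
    step = (U - L) / 6.0
    inner = [L + step, (L + U) / 2.0, U - step]
    return [L] + inner + [U] if five else inner

# A respondent who bounds the local high temperature between 60 and 96 F:
print(probability_format_cut_points(60.0, 96.0, five=True))
# [60.0, 66.0, 78.0, 90.0, 96.0]
```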


Specifically, a mixed, three-way Question Domain × Response Format × Number of Cut Points ANOVA yielded a significant effect of domain, F(2, 968) = 39.56, p < .001, and a three-way interaction, F(4, 968) = 2.63, p < .05. In light of the latter, we performed simple, two-way Response Format × Number of Cut Points ANOVAs within each question domain. In all three cases, there was a significant effect of response format (F(2, 484) = 38.10, 9.89, and 12.96 for the temperature, Olympic medal, and iPhone announcement date domains, respectively, all ps < .001). Post hoc tests showed that difficulty ratings for the quantile estimation condition significantly exceeded those for interval probability estimation (all ps < .001), whereas the interval and cumulative estimation conditions did not differ (all ps > .35). There were no other significant effects in the two-way analyses.

Overview of forecast modeling and accuracy. We analyzed forecast performance and accuracy separately for each of the three questions. In each case, we fit one or more continuous probability models to each individual's set of forecasts. Specifically, for each participant and each problem, we sought distribution parameters that minimized the relevant objective function as shown in Table 3.

The objective function for participants in the quantile elicitation condition is the same as that used in Experiment 1. For the cumulative probability elicitation condition, however, the quantiles, qi, were fixed by the experiment, and the participants provided the cumulative probabilities, denoted pi in the middle row of Table 3. Therefore, the function to be minimized, as shown in the middle row, is the sum of squared differences between the judged pi and the modeled values, denoted by FX(qi, θ). The objective function for participants in the interval probability estimation condition is shown in the last row of Table 3. Here, the estimated probability, pi, was of an interval, and therefore the modeled probability is the difference between two cumulative values, as shown in the objective function. (For this purpose we added probability in the never category, which was used by 29 Interval-condition respondents, to the probability mass in the final, open, interval.)
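For concreteness, here is a sketch (ours, using a normal distribution and made-up judgments) of least-squares fits under the cumulative and interval objectives; function names, starting values, and the handling of the open-ended intervals are our assumptions rather than the authors' code.

```python
import numpy as np
from scipy import stats, optimize

def fit_normal_to_cumulative(cut_points, judged_cum_p, x0):
    """Fit a normal CDF to judged cumulative probabilities at fixed cut points:
    minimize sum_i [p_i - F(q_i; mu, sigma)]^2 (Table 3, middle row)."""
    q = np.asarray(cut_points, float)
    p = np.asarray(judged_cum_p, float)
    def loss(theta):
        mu, sigma = theta
        if sigma <= 0:
            return 1e12
        return float(np.sum((p - stats.norm.cdf(q, mu, sigma)) ** 2))
    return optimize.minimize(loss, x0, method="Nelder-Mead").x

def fit_normal_to_intervals(cut_points, judged_interval_p, x0):
    """Same idea for interval probabilities: each judged p_i is compared with
    F(q_i) - F(q_{i-1}), with open-ended intervals at the two extremes
    (Table 3, bottom row)."""
    q = np.asarray(cut_points, float)
    p = np.asarray(judged_interval_p, float)
    def loss(theta):
        mu, sigma = theta
        if sigma <= 0:
            return 1e12
        edges = np.concatenate(([-np.inf], q, [np.inf]))
        model_p = np.diff(stats.norm.cdf(edges, mu, sigma))
        return float(np.sum((p - model_p) ** 2))
    return optimize.minimize(loss, x0, method="Nelder-Mead").x

# Hypothetical temperature judgments (degrees F) at three cut points.
cuts = [66.0, 78.0, 90.0]
print(fit_normal_to_cumulative(cuts, [0.10, 0.55, 0.95], x0=[78.0, 8.0]))
print(fit_normal_to_intervals(cuts, [0.10, 0.45, 0.40, 0.05], x0=[78.0, 8.0]))
```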

For the questions in which we fit more than one model per forecaster, we first compared the quality of the fits using the S index in Equation 1. On that basis, we selected one distribution family to model each person's forecast and evaluated the effects of response format and number of judgments on estimated model parameters as well as on forecast accuracy. Of the various possible accuracy indices, we used the simplest and most easily interpreted one, which is the deviation, d = t − μ, between the actual outcome t and the estimated mean μ of the fitted distribution.

Temperature forecasts. Considering this scale to be unbounded in a practical sense, we fit a normal distribution to each participant's judgments. Following Abbas et al. (2008), we also fit a beta distribution, using each participant's judged lower and upper possible values, L and U, respectively, as the outer bounds for the distribution. Calculating the fit index, S, in Equation 1 for each distribution for each of the 488 respondents yielded four instances (0.8%) of negative values for the normal distribution and 34 instances (7.0%) for the beta distribution. Figure 3 shows the cumulative distributions of S for the two distributions, truncated on the left by zero in order to render the scale readable. It is clear that the beta distribution overall provides the poorer fit. On this basis, we dropped the beta and continued with the normal distribution.

The top section of Table 8 shows the mean (and standard error) of S for the normal distributions fitted to the individual-level judgments within each of the 2 × 3 experimental conditions, excluding the four instances for which the normal is an inappropriate function. It is evident that the normal distribution provides an excellent description of the judgments for the remaining 484 respondents, and this is true over all the conditions. The only significant effect in a 2 × 3 ANOVA was due to number of cut points, F(1, 478) = 13.60, p < .001, with slightly better fit statistics given three rather than five cut points, which is to be expected on statistical grounds alone.

Table 7
Mean (Standard Errors) Difficulty Ratings on a (0, 100) Scale for Experiment 2

No. of cut points    Cumulative probabilities    Interval probabilities    Quantiles

Temperature cut points
3                    37.5 (3.0)                  40.1 (2.0)                56.9 (2.7)
5                    39.9 (2.8)                  37.3 (2.1)                60.8 (3.4)

Olympic medal cut points
3                    53.6 (2.5)                  46.5 (3.0)                59.1 (2.6)
5                    49.7 (3.2)                  51.4 (2.8)                63.6 (3.3)

iPhone release date cut points
3                    41.6 (2.7)                  42.9 (2.8)                54.7 (3.3)
5                    46.8 (3.2)                  44.8 (2.9)                60.1 (3.4)



In order to assess the accuracy of the fitted probability distributions, we first used each individual's reported zip code to look up the actual high temperature in that person's local area 2 weeks after the forecasts were made. All calculations are in the Fahrenheit scale. We eliminated 16 of the 484 well-modeled respondents from the accuracy analysis because the actual high temperatures in their locations differed from the means of their fitted distributions to such an extreme degree. Two cases were in the interval probability condition, in one instance with a mean close to 154° and a true value of 66°, and in the other instance with a mean close to 0° and a true value of 84°. The remaining 14 instances were all in the quantile condition. In all these cases, the fitted distributions dramatically underestimated the true values, with the means all being in the single digits, whereas the true values ranged from 71° to 109°.

The top portion of Table 9 shows, for the remaining 468 respondents, the mean difference, t − μ (and its standard error), in degrees Fahrenheit between the actual value t for each person and the mean μ of that individual's estimated normal distribution, as a function of experimental condition. The cumulative and interval probability estimation conditions yielded surprisingly accurate forecasts, with accuracy greater in the three than the five cut-point conditions. In contrast, the quantile estimation conditions led to exceedingly poor forecasts, underestimating the temperature by 27.45° F on average across both numbers of cut points.8

These conclusions are backed up by a two-way ANOVA, which showed no significant interaction, F(2, 462) < 1, but significant effects due both to response format, F(2, 462) = 101.17, p < .001, and to number of cut points, F(1, 462) = 4.48, p < .05. Post hoc tests show the quantile condition to be significantly worse than both the cumulative and the interval probability conditions, and no difference between the cumulative and interval probability conditions.

8 This difference is greater when the 16 respondents eliminated from this analysis are included. Specifically, the two in the interval probability condition had deviation scores, respectively, of 88° and −84°, so including them would have left the mean difference virtually unchanged, while substantially increasing the variance in that condition. In contrast, all 14 respondents in the quantile condition had extreme difference scores, ranging from 70° to 107° and averaging 79.3°. Thus, including them would have served to increase the overall mean difference (across both numbers of judgments) to 32.10° F.

Table 8
Mean (Standard Errors) Fit Indices (S) for Experiment 2

No. of cut points    Cumulative probabilities    Interval probabilities    Quantiles

Normal distributions fitted to temperature judgments
3                    .97 (.006)                  .98 (.007)                .96 (.005)
5                    .95 (.005)                  .96 (.010)                .95 (.007)

Beta distributions fitted to Olympic medals judgments
3                    .97 (.004)                  .99 (.005)                .96 (.006)
5                    .95 (.006)                  .89 (.024)                .95 (.006)

Gamma distribution fitted to iPhone 5 release date announcement judgments
3                    .97 (.007)                  .98 (.007)                .95 (.012)
5                    .94 (.007)                  .97 (.007)                .91 (.011)

Figure 3. Cumulative distribution of the fit statistic, S, for the normal and beta distributions fit to the temperature judgments in Experiment 2, truncated on the left at zero in order to facilitate scaling.

Olympic medal forecasts. This scale is bounded between 0% and 100%, so we only used the beta distribution to model the probability forecasts. The distribution was inappropriate in 11 of the 488 cases (2.3%), as evidenced by negative S indices, and fit well in the remaining 477.

The middle section of Table 8 shows the mean (and standard error) of S for the beta distributions fitted to the individual-level judgments within each of the 2 × 3 experimental conditions, excluding the 11 instances for which the beta is nondescriptive. Overall, the beta distribution provides a very good fit of the judgments, although better with three cut points than with five. The ANOVA shows a significant interaction, F(2, 471) = 9.07, p < .001, and a significant effect of number of cut points, F(1, 471) = 24.05, p < .001, but no significant effect due to response format, F(2, 471) = 2.35, p < .1. Post hoc comparisons of the response formats at each number of cut points show that the interval format leads to a better fit than the quantile format with three cut points, but to a worse fit compared with the quantile and cumulative formats with five cut points.

With regard to forecast accuracy, the United States and China, in total, won 20% of the medals in the 2012 Summer Olympics. Difference scores, therefore, are with respect to that value. The middle portion of Table 9 shows the mean difference in percent, 20 − μ, between the actual value and the estimated mean of each individual's beta distribution.

The accuracy pattern is different from that of the temperature forecast. Here, the difference score shows a two-way Number of Responses × Response Format interaction, F(2, 471) = 3.75, p < .05, no overall effect of number of cut points, and a main effect of response format, with the quantile estimation format yielding the most accurate judgments, F(2, 471) = 23.24, p < .001. The interaction reflects the 2 × 2 crossover in the nonquantile response format cells. Nevertheless, post hoc comparisons of the response formats at each number of cut points show the quantile format to be significantly more accurate than either of the probability conditions, and those two to be nonsignificantly different from each other.

Because the beta distribution is so flexible with regard to shape, and therefore the mean might be far from the maximum density of the distribution, we repeated the accuracy analyses on the distribution medians and modes.9 Results were essentially unchanged.
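As a minimal sketch, not the authors' code, of how those summary statistics can be read off a fitted beta distribution, the lines below compute the mean, median, and mode for hypothetical parameter values (the C and D of McLaughlin, 2014, i.e., α and β) and the corresponding difference scores from the 20% outcome; the interior mode exists only when both parameters exceed 1.

# Minimal sketch (not the authors' code): mean, median, and mode of a fitted
# beta distribution and the corresponding difference scores. The parameter
# values are hypothetical.
from scipy import stats

alpha, beta_param = 2.4, 7.1          # hypothetical fitted C (alpha) and D (beta)
dist = stats.beta(alpha, beta_param)

mean = dist.mean()
median = dist.median()
mode = (alpha - 1) / (alpha + beta_param - 2)   # valid because both parameters > 1

diff_scores = [20.0 - 100.0 * m for m in (mean, median, mode)]  # vs. the 20% outcome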

Date of iPhone announcement forecasts. Unfortunately, many participants in the quantile estimation condition failed to respond to this question, probably because they had difficulty with the required yymmdd response format. Consequently, the effective sample size for this condition is reduced, as shown in the Table 5 note.

9 We thank a reviewer for suggesting these additional analyses. For analyses of the modes, we further excluded respondents for whom both distribution parameters, α and β (C and D in McLaughlin's [2014] notation), were less than 1, thereby assuring that all remaining distributions had valid modes.

Table 9
Mean (and Standard Errors) Accuracy Scores for Each Question in Experiment 2

No. of cut points    Cumulative probabilities    Interval probabilities    Quantiles

Temperature question difference scores in degrees Fahrenheit
3                    −1.04 (1.68)                −.67 (1.55)               30.91 (3.51)
5                    −4.21 (1.58)                −2.07 (1.14)              23.46 (3.81)

Olympic medals question difference in percent of medals
3                    −25.1 (1.9) [−24.0]         −30.9 (2.2) [−30.8]       −15.3 (2.1) [−16.2]
5                    −29.2 (1.9) [−28.1]         −23.0 (2.5) [−19.7]       −13.1 (2.4) [−11.9]

iPhone release date question difference in days
3                    −16.55 (6.92) [−.86]        9.85 (7.42) [15.82]       −49.57 (12.52) [−30.74]
5                    −11.45 (7.09) [9.06]        15.02 (7.32) [15.82]      −69.61 (11.28) [−30.75]

Note. Bracketed values in the bottom two panels are the accuracy scores for the consensus forecasts.

Taking the day on which the forecast was made as Day 0, this scale is bounded from below at zero and unbounded from above. On that basis, numerous distributions are candidate models of these forecasts. We tried three: the gamma, Weibull, and Poisson distributions. The Poisson performed terribly, whereas the gamma and Weibull provided excellent and equivalent fits to the forecasts. For convenience, we report analyses only with the gamma distribution. The goodness-of-fit index S is summarized in the bottom panel of Table 8. Again, the fit is better with three cut points than with five, although the difference is small, F(1, 368) = 14.375, p < .001. Unlike with the other problems, there also is a significant effect due to response format, with interval probability estimates being slightly better fit than the cumulative probability estimates, and quantile estimates being slightly more poorly fit, F(2, 368) = 12.745, p < .001.
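A minimal sketch, not the authors' code, of how the gamma and Weibull candidates might be compared for one forecaster follows; it assumes the judgments are cumulative probabilities at day cut points, uses hypothetical values, and substitutes a plain sum of squared CDF residuals for the paper's S index (Equation 1), which is not reproduced here.

# Minimal sketch (not the authors' code): fit gamma and Weibull CDFs to one
# forecaster's cumulative-probability judgments over days and compare the
# residual error. Cut points and judged probabilities are hypothetical.
import numpy as np
from scipy import stats, optimize

days = np.array([20.0, 45.0, 90.0])      # cut points in days since the forecast
judged = np.array([0.25, 0.60, 0.90])    # judged P(announcement by that day)

def fit(cdf, p0):
    params, _ = optimize.curve_fit(cdf, days, judged, p0=p0, bounds=(1e-6, np.inf))
    return params, np.sum((cdf(days, *params) - judged) ** 2)

gamma_cdf = lambda x, shape, scale: stats.gamma.cdf(x, shape, scale=scale)
weibull_cdf = lambda x, shape, scale: stats.weibull_min.cdf(x, shape, scale=scale)

gamma_params, gamma_sse = fit(gamma_cdf, [2.0, 25.0])
weibull_params, weibull_sse = fit(weibull_cdf, [1.5, 50.0])

# The fitted gamma mean (shape * scale, with location fixed at 0) is the value
# compared against the actual outcome of 40 days.
gamma_mean = gamma_params[0] * gamma_params[1]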

In terms of forecast accuracy, all forecasts were made on the same day, and the iPhone release date announcement was made 40 days later. Taking the forecast day as Day 0, the actual outcome for all participants was t = 40. The difference scores between the true value and the means of the fitted distributions in number of days, 40 − μ, are shown in the bottom panel of Table 9. The quantile estimation format led to announcement date predictions too far in the future, whereas the two probability estimation formats resulted in more accurate forecasts, with cumulative probability estimation forecasts predicting dates somewhat too distant and interval probability estimation forecasts predicting them somewhat too soon. Statistically, the only significant effect is that due to response format, F(2, 368) = 26.899, p < .001. Post hoc tests show each format to be significantly different from the others.

Discussion

The primary goals of Experiment 2 were (a) to extend the continuous probability distribution modeling approach from a single perceptual problem to a variety of forecasting questions that a priori seemed suited to different families of models, and (b) to compare the effects of different elicitation formats and numbers of cut points on model and forecasting accuracy. Secondarily, we also looked at rated difficulty across the forecasting conditions, an important consideration in designing systems for computer-elicited human forecasts of sociopolitical (or other kinds of) events.

In each of the three domains tested, the selected model family worked very well, as indicated by goodness-of-fit indices (S) close to their upper bound of 1. In each case, we fit the model by estimating two parameters of the selected distribution; thus, it is not surprising that the fits were slightly better given three than five cut points and the associated greater number of judgments. Notable, however, is that there was no systematic effect of response format on model fit.

Although these results are very encouraging, they should not be overgeneralized. First, although the selected model performed very well for the vast majority of participants in each case, there were always a few for whom it performed very poorly. We have no way to determine whether their forecasts were just noise, perhaps due to unfamiliarity with the domain, or whether they represented a considered opinion not well captured by the model. For example, an individual may have thought that the United States and China would win either very few or very many medals, but not an intermediate number. The beta distribution is not well suited for representing such a bimodal distribution. Similarly, none of the distributions we worked with would capture a judgment that Apple's iPhone announcement was imminent, but if it did not occur soon, it was unlikely to do so for many months. We continue discussion of model selection and future research on the topic in the General Discussion section that follows.

Focusing on those cases that our models did capture well, we failed to replicate the Experiment 1 result that five judgments led to more accurate forecasts than did three. Just the opposite result occurred with the temperature forecasts, and there was no effect of number of cut points, i.e., number of judgments, on accuracy for the other two forecasts. With regard to effects of response format on accuracy, quantile estimation led to poorer forecasts than did the probability estimation methods in two cases and to better forecasts in one. Specifically, quantile estimation yielded worse temperature and iPhone announcement date forecasts and better Olympic medal proportion forecasts than did either of the two probability estimation methods. Overall, the cumulative and interval probability formats yielded equivalent accuracy levels across the three cases.

In the one case in which quantile estimation performed better than both of the probability estimation methods, its advantage was small relative to the advantage of the probability methods in the other two cases. Table 10 shows the absolute ratios of the mean accuracy differences of the quantile method relative to each of the probability methods. In all cases, the better method (smaller difference score) is in the denominator and the poorer method (larger difference score) is in the numerator of the ratio. In other words, the quantile difference is the numerator for the temperature and iPhone date forecasts, and is the denominator for the Olympic medal forecasts. Thus, each ratio shows how much better the one method is than the other. For example, for the temperature question with three cut points, the quantile difference score of 30.91° divided by the cumulative probability difference score of 1.04° yields the tabled ratio of 29.72. Note that for the Olympic medal forecast, the quantile method beats the two probability methods by ratios ranging from 1.64 to 2.23. In contrast, for the temperature and iPhone announcement date forecasts, the ratios of probability estimation to quantile estimation advantage range from 3.00 to 46.13.

Based on the literature reviewed in the introduction, we had predicted that quantile estimates would yield less accurate forecasts in all cases. We have no explanation for why quantile estimation performed better in the double-bounded forecast context than in the others, but note that when it did perform better, it was not by much compared with the reverse advantage for the other questions. The result needs to be replicated before being taken too seriously. Participants rated the quantile method as more difficult for all three forecasting questions. On the basis of that result, and the fact that quantile estimation performed substantially more poorly in two cases and only moderately better in the one, the results overall favor probability over quantile estimation for making forecasts of this kind.

General Discussion

The overarching aim of this research was to develop methods for eliciting and modeling forecasters' continuous subjective probability distributions over future quantitative values, including dates by which sociopolitical (or other) events may occur.10 Crucial to the enterprise was that the elicitation should not take much of forecasters' time working unaided on a computer platform. These conditions are a requirement if the method is to be successfully deployed in ongoing contexts, such as intelligence analysis for national security, that often require rapid and frequently updated analyses on both new and ongoing questions. It may also be helpful in other forecasting domains, such as economics, that now use other, more time-consuming methods. For example, the Federal Reserve Bank of Philadelphia generally uses 10 probability intervals when surveying professional forecasters (see, e.g., Croushore, 1993, p. 6, or the Internet sites given in Footnote 2). They could use fewer intervals, perhaps only four or five, fit continuous functions to the judgments, and very likely obtain equally accurate forecasts. Shorter surveys might increase the response rate.

The advantage of modeling human analysts' forecasts is that it not only provides a means for increasing the signal in human forecasts that otherwise have random components in them (Dawes et al., 1989; Goldberg, 1970) but also minimizes close-call counterfactuals (Tetlock, 2005) and provides decision makers with a tool for extracting probability estimates associated with any values or dates of interest, not just the ones for which analysts had provided judgments.

10 Material in this section has benefitted from many useful conversations with Joe W. Tidwell III.

Table 10
Absolute Ratio of Better to Poorer Difference Score for Each Forecast Question Illustrating the Advantage or Disadvantage of the Quantile Estimation Method Relative to the Cumulative and to the Interval Estimation Method in Experiment 2

                                            Quantile relative to
Domain                 No. of cut points    Cumulative    Interval

Temperature            3                    29.72         46.13
                       5                    5.57          11.33
Olympic medals         3                    1.64          2.02
                       5                    2.23          1.76
iPhone announcement    3                    3.00          5.03
                       5                    6.08          4.63

Note. Quantile difference is the numerator for the temperature and iPhone date forecasts, and the denominator for the Olympic medal forecasts.


The two experiments reported here establish the viability of such elicitation and modeling. We showed that probabilistic forecasts consisting of relatively small numbers of judgments can be successfully modeled, that they are reasonably accurate relative to the actual outcomes, and that probability judgments tend to outperform quantile judgments. But much more research is required. Among the important issues not addressed in this report and requiring further research are how to select the best probability model for a set of judgments, how to determine the cut points or bins for which probability judgments will be elicited, and how to aggregate models of multiple individual forecasters into a single consensus forecast.

Aggregating Individual Probability Distributions

Tidwell, Wallsten, Yang, and Moore (2015) have demonstrated that two aggregation methods are far superior to many others in the literature. One method, proposed by Hora, Fransen, Hawkins, and Susel (2013), is to establish the consensus distribution by taking the median of the individual cumulative probabilities at each value of the variable (or, equivalently, to take the median variable value at each cumulative probability level). The other, developed by Tidwell et al., applies when one type of probability distribution models all the individual forecasts, as was the case here. Under this condition, the consensus forecast can be estimated as a distribution of the same type with parameters equal to the median of the individually estimated parameters.
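As a minimal sketch, not the authors' code, of the first method, the lines below form a consensus CDF by taking the pointwise median of individually fitted normal CDFs over a grid of values; the individual parameter values and the grid are hypothetical.

# Minimal sketch (not the authors' code): median aggregation of cumulative
# distribution functions in the spirit of Hora et al. (2013). The individual
# (mean, SD) pairs and the evaluation grid are hypothetical.
import numpy as np
from scipy import stats

individual_params = [(68.0, 5.0), (72.0, 4.0), (75.0, 8.0)]   # fitted normals
grid = np.linspace(40.0, 100.0, 601)                          # variable values

cdfs = np.array([stats.norm.cdf(grid, mu, sd) for mu, sd in individual_params])
consensus_cdf = np.median(cdfs, axis=0)   # pointwise median across forecasters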

For illustration, we used the latter method to estimate consensus distributions in each of the six conditions for the Olympic medal and the iPhone announcement date in Experiment 2. (It cannot be applied to the temperature judgments, as every forecast had a different correct answer.) Using the notation of McLaughlin (2014), for the beta distribution, holding A = 0 and B = 1, we found the median C and D parameters (often denoted α and β in the broader literature). For the gamma distribution (see Appendix), fixing A = 0, we found the median B and C parameters. We then took the means of the resulting consensus distributions in each condition and calculated the accuracy scores shown in brackets in Table 9.
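A minimal sketch, not the authors' code, of this median-parameter consensus for the beta-modeled Olympic medal forecasts follows; the individual parameter values are hypothetical, and the accuracy score is the difference between the 20% outcome and the consensus mean.

# Minimal sketch (not the authors' code): median-parameter consensus for
# beta-distributed forecasts on [0, 1]. Parameter values are hypothetical.
import numpy as np
from scipy import stats

c_params = np.array([2.0, 3.5, 1.8, 2.6])   # individually fitted C (alpha) values
d_params = np.array([4.0, 5.0, 3.2, 6.1])   # individually fitted D (beta) values

consensus = stats.beta(np.median(c_params), np.median(d_params))
accuracy_score = 20.0 - 100.0 * consensus.mean()   # difference from the 20% outcome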

For the Olympic medal question, the consensus distribution is more accurate (smaller accuracy score) than the average forecaster in five of the six conditions, although none of the differences between the consensus and the mean accuracy scores are very large. In contrast, for the iPhone release date question, the consensus distribution is more accurate than the average forecaster for only four of the six conditions, but in the two quantile conditions, the improvements are substantial. From another perspective (not shown in the table), the respective consensus distributions are more accurate than their constituent individuals in 53.6% of the cases for the Olympic medals question and 75.4% of the cases for the iPhone release date question.

Thus, the consensus distributions are improvements over most individuals, but not to the same degree as was the case for Tidwell et al. (2015). Any number of factors may be at play: Tidwell et al. used a different accuracy metric, the continuous rank probability score (Matheson & Winkler, 1976), than we did. That metric scores the entire distribution, not just the mean, but lacks the intuitive interpretation that the difference between the mean and the correct value has. In addition, the two items about which we elicited forecasts were much in the news at the time of the study, unlike the items in the Tidwell et al. study, and therefore it is likely that individuals had correlated information about them. It is well established that central tendencies improve upon conditionally correlated judgments to a lesser degree than they do upon conditionally uncorrelated judgments (Johnson, Budescu, & Wallsten, 2001). Another difference is that the reasonable forecast range (in terms of days into the future or numerical outcome) for many of the items in the Tidwell et al. study was substantially greater than for our items, allowing the possibility of much greater individual differences in the forecasts, and therefore for greater disparity between individual and consensus distributions. Clearly, more research is required to identify the conditions that affect the degree of improvement wrought by consensus distributions.

Defining the Bins

The question of how to set the bounds and number of bins for which probability forecasts are to be elicited is still very much open. Experiment 2 failed to replicate the result of Experiment 1 that distributions estimated from five judgments were more accurate than those estimated from three. There may be a fundamental difference between estimating physical properties (ratios of areas, in this case) and forecasting future events, or it may be that the difference in results between the two experiments was due to chance. It seems likely that greater granularity in judgments would lead to better models, and therefore more accurate forecasts. On the other hand, more judgments also require more work from the forecaster and may lead to fatigue, and hence to less thoughtful judgments. More research on the question is required.

Similarly, we have not investigated the best way to set bounds on the intervals to be judged. We preset bounds for the iPhone question. In contrast, we individualized them for the Olympic medals and temperature questions by asking individuals for conceivable upper and lower bounds for each variable. That approach seems mandatory for questions in which each forecaster is faced with a different reality, such as occurred when our participants across the United States made probability judgments about their local daily high temperatures. But it may be unduly burdensome to the forecasters in other cases. The advantage of individualizing forecast intervals is that it allows each person to focus attention on what they consider to be the high-density region of the variable. The disadvantages are that it takes additional effort on the part of each forecaster and that interval boundaries set a priori might provide useful orienting information. When boundaries are individualized, there are still open issues of (a) how best to ask for the extreme values from which interval bounds can be set, and (b) the best algorithm to use in setting the bounds. Our algorithm in Table 6 yielded reasonable results, but there is no reason to think it cannot be improved upon.

Choosing the Right Probability Distribution

We turn now to the question of how best to select the probability distribution, or mixture of distributions, to model a set of subjective probability forecasts, an issue that is far from trivial and still very much open. In this report, we relied on the boundedness, or lack thereof, of the variable being judged and on convenience. The former is necessary; for example, it makes no sense to model judgments about variables that cannot go below zero with a distribution over the unbounded real numbers. The latter is always a factor, but the question remains, what other principled considerations can apply?

One possibility frequently suggested to us is to use distributions that have been developed to model the process being judged, for example, a Poisson distribution to model judgments about the frequency of recurring independent events over time. Differences between the subjective and objective probabilities could be represented by differences in the estimated objective and subjective parameters. This approach is reasonable when the events being judged are aleatory, that is, are repeatable with well-defined reference classes, but not when they are unique, which is the domain of interest here, and the uncertainty is epistemic. Some events are unique from one perspective, but can still be aggregated within a relative frequency framework, so that judgments can be related back to real distributions, as Griffiths and Tenenbaum (2006) have done. Examples they used include movie box-office grosses, poem lengths, and cake baking times. The approach may work when applying the present method to economic forecasting, which accrues quarterly and annual outcomes. But it does not apply to events such as those used here and more generally in the ACE Program (Tetlock et al., 2014; Warnaar et al., 2012), which is focused on experts' probabilistic forecasts regarding real-world sociopolitical events.

It is important to bear in mind that the models are of forecasters' judgments, not of the event occurrences themselves. When the event uncertainty is aleatory, it may be sensible to assume that the subjective representation of it is of the same distributional family, albeit with different parameters. The situation is more complex, however, when the uncertainty is epistemic. Future research in such cases might look for qualitative methods for assessing characteristics of an individual's belief about an event and then select distribution families accordingly. For example, it may be useful to assess whether a forecaster believes that an event's likelihood increases or decreases monotonically with time or is nonmonotonic, or whether the likelihood function for a quantitative variable such as a future currency exchange rate is monotonic in one direction or the other or nonmonotonic. In modeling individual forecasters, we must be open to the possibility that different forecasters will have different perspectives on the event in question and therefore may not be described by the same distributional family.

References

Abbas, A. E., Budescu, D. V., Yu, H.-T., & Haggerty, R. (2008). A comparison of two probability encoding methods: Fixed probability vs. fixed variable values. Decision Analysis, 5, 190–202. http://dx.doi.org/10.1287/deca.1080.0126

Bowles, C., Friz, R., Genre, V., Kenny, G., Meyler, A., & Rautanen, T. (2007). The ECB Survey of Professional Forecasters (SPF): A review after eight years' experience. Occasional Paper Series, No. 59. Frankfurt am Main, Germany: European Central Bank. Retrieved from https://www.ecb.europa.eu/pub/pdf/scpops/ecbocp59.pdf?af1864eb25c8cc584119b3f63450bf79

Croushore, D. (1993). Introducing: The Survey of Professional Forecasters. Business Review - Federal Reserve Bank of Philadelphia, November/December, 3–15. Available at https://www.philadelphiafed.org/research-and-data/real-time-center/survey-of-professional-forecasters

Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243, 1668–1674.

Garcia, J. A. (2003). An introduction to the ECB's Survey of Professional Forecasters. Occasional Paper Series, No. 8. Frankfurt am Main, Germany: European Central Bank. Retrieved from https://www.ecb.europa.eu/pub/pdf/scpops/ecbocp8.pdf?b632908ccefcd886a379f074ab6ad12d

Garthwaite, P. H., Kadane, J. B., & O'Hagan, A. (2005). Statistical methods for eliciting probability distributions. Journal of the American Statistical Association, 100, 680–701. http://dx.doi.org/10.1198/016214505000000105

Goldberg, L. R. (1970). Man versus model of man: A rationale, plus some evidence, for a method of improving on clinical inferences. Psychological Bulletin, 73, 422–432. http://dx.doi.org/10.1037/h0029230

Gonzalez, R., & Nelson, T. O. (1996). Measuring ordinal association in situations that contain tied scores. Psychological Bulletin, 119, 159–165.

Griffiths, T. L., & Tenenbaum, J. B. (2006). Optimal predictions in everyday cognition. Psychological Science, 17, 767–773.

Haran, U., Moore, D. A., & Morewadge, C. K. (2010). A simple remedy for overprecision in judgment. Judgment and Decision Making, 5, 467–476.

Hoffman, P. J. (1960). The paramorphic representation of clinical judgment. Psychological Bulletin, 57, 116–131.

Hora, S. C., Fransen, B. R., Hawkins, N., & Susel, I. (2013). Median aggregation of distribution functions. Decision Analysis, 10, 279–291. http://dx.doi.org/10.1287/deca.2013.0282

Jain, K., Mukherjee, K., Bearden, J. N., & Gaba, A. (2013). Unpacking the future: A nudge toward wider subjective confidence intervals. Management Science, 59, 1970–1987. http://dx.doi.org/10.1287/mnsc.1120.1696

Johnson, T. R., Budescu, D. V., & Wallsten, T. S. (2001). Averaging probability judgments: Monte Carlo analyses of asymptotic diagnostic value. Journal of Behavioral Decision Making, 14, 123–140. http://dx.doi.org/10.1002/bdm.369

Klayman, J., Soll, J. B., González-Vallejo, C., & Barlas, S. (1999). Overconfidence: It depends on how, what, and whom you ask. Organizational Behavior and Human Decision Processes, 79, 216–247. http://dx.doi.org/10.1006/obhd.1999.2847

Lau, H.-S., Lau, A. H.-L., & Ho, C.-J. (1998). Improved moment-estimation formulas using more than three subjective fractiles. Management Science, 44, 346–351. http://dx.doi.org/10.1287/mnsc.44.3.346

Lipkus, I. M., Samsa, G., & Rimer, B. K. (2001). General performance on a numeracy scale among highly educated samples. Medical Decision Making, 21, 37–44. http://dx.doi.org/10.1177/0272989X0102100105

Matheson, J. E., & Winkler, R. L. (1976). Scoring rules for continuous probability distributions. Management Science, 22, 1087–1096. http://dx.doi.org/10.1287/mnsc.22.10.1087

McLaughlin, M. (2014). Compendium of common probability distributions (2nd ed., Vol. 27). Retrieved from http://www.causascientia.org/math_stat/Dists/Compendium.pdf

Moore, D. A., Tenney, E. R., & Haran, U. (2016). Overprecision in judgment. In G. Wu & G. Keren (Eds.), Handbook of judgment and decision making (pp. 182–212). New York, NY: Wiley.

Morgan, M. G. (2014). Use (and abuse) of expert elicitation in support of decision making for public policy. Proceedings of the National Academy of Sciences of the United States of America, 111, 7176–7184. http://dx.doi.org/10.1073/pnas.1319946111

O'Hagan, A., Buck, C. E., Daneshkhah, A., Eiser, J. R., Garthwaite, P. H., Jenkinson, D., . . . Rakow, T. (2006). Uncertain judgements: Eliciting experts' probabilities. Chichester, UK: Wiley. http://dx.doi.org/10.1002/0470033312

Seaver, D., von Winterfeldt, D. A., & Edwards, W. (1978). Eliciting subjective probability distributions on continuous variables. Organizational Behavior & Human Performance, 21, 379–391. http://dx.doi.org/10.1016/0030-5073(78)90061-2

Soll, J. B., & Klayman, J. (2004). Overconfidence in interval estimates. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 299–314. http://dx.doi.org/10.1037/0278-7393.30.2.299

Teigen, K. H., & Jørgensen, M. (2005). When 90% confidence intervals are 50% certain: On the credibility of credible intervals. Applied Cognitive Psychology, 19, 455–475. http://dx.doi.org/10.1002/acp.1085

Tetlock, P. E. (2005). Expert political judgment: How good is it? How can we know? Princeton, NJ: Princeton University Press.

Tetlock, P. E., & Belkin, A. (Eds.). (1996). Counterfactual thought experiments in world politics: Logical, methodological, and psychological perspectives. Princeton, NJ: Princeton University Press.

Tetlock, P. E., Mellers, B. A., Rohrbaugh, N., & Chen, E. (2014). Forecasting tournaments: Tools for increasing transparency and improving the quality of debate. Current Directions in Psychological Science, 23, 290–295. http://dx.doi.org/10.1177/0963721414534257

Tidwell, J. W., Wallsten, T. S., Yang, H., & Moore, D. A. (2015). Eliciting, modeling and aggregating probability forecasts of continuous quantities. Manuscript in preparation.

Warnaar, D. B., Merkle, E. C., Steyvers, M., Wallsten, T. S., Stone, E. R., Budescu, D. V., . . . Carter, J. N. (2012). The aggregative contingent estimation system: Selecting, rewarding, and training experts in a wisdom of crowds approach to forecasting. Proceedings of the 2012 Association for the Advancement of Artificial Intelligence Spring Symposium Series (AAAI Tech. Rep. No. SS-12-06), 75–76.

Whitfield, R. G., & Wallsten, T. S. (1989). A risk assessment for selected lead-induced health effects: An example of a general methodology. Risk Analysis, 9, 197–207. http://dx.doi.org/10.1111/j.1539-6924.1989.tb01240.x

Winman, A., Hansson, P., & Juslin, P. (2004). Subjective probability intervals: How to reduce overconfidence by interval evaluation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 1167–1175. http://dx.doi.org/10.1037/0278-7393.30.6.1167

Appendix

Estimating the Gamma and Weibull Distributions

On the assumption that the criterion function (top row of Table 4) is not single peaked, we developed a grid of starting values within the mean by variance space to use in estimating both distributions. For each point in the grid, we converted the mean–variance pair to the corresponding distribution parameters, as described below, and then searched the parameter space to estimate the global minimum of the criterion function. This approach allowed us to equate in some sense the starting grids for the two distributions.

Letting u and v correspond to a pair of starting values for the mean and variance, respectively, we formed the grid from

u_i = x.05 + (i/M)(x.95 − x.05),

where i = 0, 1, 2, . . . , M and M = 8, and

v_j = (j/N)(x.95 − x.05),

where j = 1, 2, . . . , N − 1 and N = 5.

Next is a description of how we converted each mean–variance pair into a pair of starting parameter values for purposes of optimization.

Gamma Distribution

We follow the parameterization of the gamma distribution in McLaughlin (2014). Accordingly, the relationships between the mean and variance of the gamma distribution and the location, scale, and shape parameters, A, B, and C, are

μ = A + BC    (A1)

and


σ² = B²C.    (A2)

For our purposes, we assumed A = 0, and therefore can express the shape and scale as functions of the mean and variance,

C = μ²/σ²    (A3)

and

B = σ²/μ.    (A4)
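The following is a minimal sketch, not the authors' code, of this conversion and of building the starting grid; the judged 5th and 95th percentiles shown are hypothetical, and the index ranges follow the text above.

# Minimal sketch (not the authors' code): build the (mean, variance) starting
# grid and convert each pair to gamma scale B and shape C with A fixed at 0.
# The judged 5th and 95th percentiles are hypothetical.
def gamma_start_values(u, v):
    c = u ** 2 / v   # Equation A3: C = mu^2 / sigma^2
    b = v / u        # Equation A4: B = sigma^2 / mu
    return b, c

x05, x95 = 10.0, 40.0                      # hypothetical judged percentiles
M, N = 8, 5
grid = [(x05 + (i / M) * (x95 - x05),      # candidate means u_i
         (j / N) * (x95 - x05))            # candidate variances v_j
        for i in range(M + 1) for j in range(1, N)]
starts = [gamma_start_values(u, v) for u, v in grid]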

Weibull Distribution

Again, using the parameterization in McLaughlin (2014), the mean and variance of the Weibull distribution are expressed as

μ = A + B Γ((C + 1)/C)    (A5)

and

σ² = B²[Γ((C + 2)/C) − Γ²((C + 1)/C)],    (A6)

where A, B, and C correspond to location, scale, and shape.

Assuming again that A = 0, we reexpress the variance in terms of the mean (i.e., we plug A5 into A6):

σ² = B²Γ(1 + 2/C) − μ².    (A7)

Rearranging A7 to solve for B yields

B = [(σ² + μ²)/Γ(1 + 2/C)]^0.5.    (A8)

Note in Equation A8 that the value of B varies as a function of the value of C. The grid specifies the values of the mean and variance, but not the value of C. Thus, the solution for B is underdetermined. We solved this problem by relying on an estimate of C obtained from the following procedure. We estimated the slope and intercept of the regression,

p = a + cx,

where the p are the percentiles (e.g., .05, .50, and .95) and the x are the corresponding subjective estimates.11

We used c as an estimate of C in Equation A8.
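A minimal sketch, not the authors' code, of this starting-value step follows, using the simplified in-text regression (the authors' Footnote 11 points to a fuller estimation procedure); the judged quantiles and the (mean, variance) grid point in the example call are hypothetical.

# Minimal sketch (not the authors' code): estimate C from the slope of the
# in-text regression of percentiles on judged quantiles, then solve
# Equation A8 for B with A = 0. All values in the example call are hypothetical.
import math
import numpy as np

def weibull_start_values(judged_quantiles, percentiles, u, v):
    c_hat, _ = np.polyfit(judged_quantiles, percentiles, 1)    # slope c as estimate of C
    b_hat = ((v + u ** 2) / math.gamma(1 + 2 / c_hat)) ** 0.5  # Equation A8
    return b_hat, c_hat

b0, c0 = weibull_start_values(
    judged_quantiles=np.array([12.0, 20.0, 35.0]),   # hypothetical quantile judgments
    percentiles=np.array([0.05, 0.50, 0.95]),
    u=22.0, v=40.0)                                  # one (mean, variance) grid point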

11 The estimation procedure is described in http://reliawiki.org/index.php/The_Weibull_Distribution#Estimation_of_the_Weibull_Parameters.

Received May 5, 2015
Revision received September 10, 2015

Accepted October 5, 2015
