
Psychon Bull Rev
DOI 10.3758/s13423-015-0808-5

THEORETICAL REVIEW

A rational model of function learning

Christopher G. Lucas · Thomas L. Griffiths · Joseph J. Williams · Michael L. Kalish

Received: 14 July 2014 / Revised: 17 January 2015 / Accepted: 26 January 2015
© Psychonomic Society, Inc. 2015

Abstract Theories of how people learn relationships between continuous variables have tended to focus on two possibilities: one, that people are estimating explicit functions, or, two, that they are performing associative learning supported by similarity. We provide a rational analysis of function learning, drawing on work on regression in machine learning and statistics. Using the equivalence of Bayesian linear regression and Gaussian processes, which provide a probabilistic basis for similarity-based function learning, we show that learning explicit rules and using similarity can be seen as two views of one solution to this problem. We use this insight to define a rational model of human function learning that combines the strengths of both approaches and accounts for a wide variety of experimental results.

Keywords Function learning · Bayesian modeling

C. G. Lucas (✉)
School of Informatics, University of Edinburgh, 10 Crichton St., Edinburgh, EH8 9AB, UK
e-mail: [email protected]

T. L. Griffiths
Department of Psychology, University of California, Berkeley, USA

J. J. Williams
HarvardX, Harvard University, Cambridge, USA

M. L. Kalish
Department of Psychology, Syracuse University, Syracuse, USA

A rational model of function learning

Every time we get into a rental car, we have to learn how hard to press the gas pedal for a given amount of acceleration. Solving this problem—which is an important part of driving safely—requires learning a relationship between two continuous variables. Over the past 50 years, several studies of function learning have shed light on how people come to understand continuous relationships (Carroll 1963; Brehmer 1971, 1974; Koh and Meyer 1991; Busemeyer et al. 1997; DeLosh et al. 1997; Kalish et al. 2004; McDaniel and Busemeyer 2005). It has become clear that people can learn and recall a wide variety of relationships, but demonstrate certain systematic biases that tell us about the mental representations and implicit assumptions that humans employ when solving function learning problems. For example, people tend to expect that relationships will be linear when extrapolating to novel examples (DeLosh et al. 1997), and find it more difficult to learn relationships that change direction than those that do not (Brehmer 1974; Byun 1995).

Several models have been developed to understand the cognitive mechanisms behind function learning. These models tend to fall into two different theoretical camps. The first includes rule-based theories (e.g., Carroll, 1963; Brehmer, 1974; Koh and Meyer, 1991), which suggest that people learn an explicit function from a given family, such as polynomials (Carroll 1963; McDaniel and Busemeyer 2005) or power-law functions (Koh and Meyer 1991). This approach attributes rich representations to human learners, but has traditionally given limited treatment to how such representations could be acquired. A second approach includes similarity-based theories (e.g., DeLosh et al., 1997; Busemeyer et al., 1997), which focus on the idea that


people learn by forming associations: if x is used to predict y, observations with similar x values should also have similar y values. This approach can be straightforwardly implemented in a connectionist architecture and thus gives an account of the underlying learning mechanisms, but faces challenges in explaining how people generalize so broadly beyond their experience. Most recently, hybrids of these two approaches have been proposed (e.g., Kalish et al., 2004; McDaniel and Busemeyer, 2005), with an associative learning process that acts on explicitly represented functions.

Almost all past research on computational models of function learning has been oriented towards understanding the psychological processes that underlie human performance, or the steps by which people update and deploy their mental representations of continuous relationships. In this paper, we take a different approach, presenting a rational analysis of function learning in the spirit of Anderson (1990), Marr (1982), and Shepard (1987). Specifically, we start with an abstract representation of the problem to be solved and a handful of additional assumptions about the nature of continuous relationships, and then explore optimal solutions to the problem in light of these assumptions with the goal of shedding light on human behavior. This rational analysis provides a way to understand the relationship between the rule- and similarity-based approaches that have dominated previous work and suggests how they might be combined. Whereas hybrid models apply similarity-based learning to explicit rules, we offer a single foundation that supports both approaches, using a common set of commitments about learning and representation.

To understand the abstract problem that a function learner faces, we can turn to machine learning and statistics, where prediction in continuous domains—a problem familiarly known as regression—has been studied extensively. There are a variety of solutions to regression problems, but we focus on methods related to Bayesian linear regression (e.g., Bernardo and Smith, 1994), which allow us to make and test explicit claims about learners' expectations, using probability distributions. Bayesian linear regression is also directly related to a nonparametric approach known as Gaussian process prediction (e.g., Williams, 1998), in which predictions about the values of an output variable are based on the similarity between values of an input variable. We use this relationship to connect the two traditional approaches to modeling function learning, as it shows that learning rules that describe functions and specifying the similarity between stimuli for use in associative learning are not mutually exclusive alternatives, but rather two views of the same solution. We exploit this fact to define a rational model of

human function learning that incorporates the strengths of both approaches.

The plan of this paper is as follows. First, we review several sets of empirical phenomena in function learning, both to provide background and to establish criteria by which different theories of function learning can be judged. We then review past models of function learning, dividing them into rule-based, similarity-based, and hybrid approaches. Next, we introduce a new perspective on function learning in which rules and similarity can be expressed in a common framework, and describe a model that follows from this perspective. Finally, we evaluate different variations on our model against one another and previous models.

Phenomena in function learning

Past studies have taken diverse approaches to understanding how people learn relationships between continuous variables, but we will focus on four kinds of empirical phenomena: those that have been used in previous tests of function learning models (e.g., McDaniel and Busemeyer, 2005), that explicitly measure what kinds of relationships people implicitly believe to be more or less likely (Kalish et al. 2007), or that challenge many models of function learning (Kalish et al. 2004). Our decision to focus on the following phenomena is also motivated by their being relatively comparable, coming from similar experimental designs involving randomly ordered, sequentially presented training stimuli, in the absence of informative cover stories or contextual information. In this section, we review these four kinds of phenomena, which we will later use to evaluate our own approach to explaining and understanding function learning.

Interpolation and learning difficulty

Some kinds of relationships are easier to learn than others. For example, increasing linear relationships tend to be easier to learn than decreasing linear relationships (Brehmer 1971, 1976). Similarly, linear relationships are typically easier to learn than non-linear ones (Brehmer 1974; Brehmer et al. 1985; Byun 1995; see Koh and Meyer (1991) for a possible counterexample). Among non-linear relationships, people have more difficulty learning those that change direction (Brehmer 1974; Brehmer et al. 1985; Byun 1995). Cyclic relationships are especially difficult—but not impossible—to learn (Bott and Heit 2004; Byun 1995; Kalish 2013). These systematic differences suggest that some relationships are subjectively simpler, more


common, or more straightforwardly represented than others, and the patterns given above dovetail with explicit human judgments about the probabilities of different kinds of relationships (Brehmer 1974).

If the difficulty of learning a relationship reflects its mental representation, one can evaluate a model of function learning by comparing its average error rates to those of humans across several kinds of relationships. More precisely, if one orders several relationships by the average magnitude of errors that humans make when predicting y for x values that fall between past examples, i.e., interpolating, a good model should show the same ordering in its prediction error. For humans, these errors are influenced by many factors, such as the match or mismatch of cover stories to the available data, the number of training points, and presentation order (Byun 1995), but we will focus on properties of the relationships themselves, which provide a simple basis for evaluating different theories of function learning. For instance, relationships in which y increases as a function of x tend to be easier to learn than functions in which y

decreases as a function of x, which are in turn easier to learn than non-monotonic functions. For a summary of some qualitative properties of functions that contribute to differential learning difficulty for humans, see Busemeyer et al. (1997). In our own evaluation, we will use data from several studies that were gathered by McDaniel and Busemeyer (2005) and are summarized in Table 1.

Extrapolation

Studies that measure interpolation errors allow relationships to be ranked by how easy they are to learn, with implications for those relationships' subjective probability and consistency with humans' mental representations. Unfortunately, quite different models can show similar patterns of errors (given a limited set of relationship types), which constrains the amount one can learn from this approach. This and other limitations of interpolation-error studies have led some researchers to focus on how people extrapolate, or make judgments about points that are distant from those seen before. This approach gives a greater share of influence to learners' prior beliefs, and makes it possible to uncover patterns that are not reflected in interpolation error rates. To date, extrapolation-based studies of function learning are comparatively sparse, but have revealed several biases in human learners. For example, people's extrapolation judgments follow linear patterns (DeLosh et al. 1997; but see Kalish et al. (2004)), and more specifically tend toward functions with a positive slope and an intercept of zero (Kwantes and Neal 2006). In one instance of this bias,

when people are trained using data from a quadratic function, their average predictions fall between the true function and straight lines fitted to the closest training points.

Learning multiple relationships

The term “function learning” suggests that relationships between continuous variables—or at least the representations that people form of them—are functions, in that for a given value of the predictor x, there is a single valid prediction, or at least a range of predictions with a single most-likely value or mode. In reality, this is not always the case. For example, dose responses for drugs might have two or more patterns, depending on unobserved genetic factors or patient histories, and some hybrid cars have different relationships between pressure on the accelerator and the car's real acceleration, depending on whether or not the combustion engine is active. The world abounds with hidden mediators that can change the relationship between observable variables, and one might expect humans to be able to make judgments that reflect the presence of multiple underlying relationships. Consistent with this intuition, Lewandowsky, Kalish, and Ngang (2002) found that fire fighters learn two distinct relationships between wind speed, ground slope, and the rate at which a fire spreads, depending on whether the fire is labeled as a standard forest fire, or a “back burn” fire set to mitigate damage from future fires. Lewandowsky et al. refer to this phenomenon as “knowledge partitioning”, based on the idea that participants' knowledge of the relationship at hand is partitioned into distinct subsets based on context.

More recently, Kalish, Lewandowsky, and Kruschke (2004) conducted three experiments showing that people make judgments that demonstrate an implicit belief in the presence of multiple overlapping linear relationships, even when no contextual information was present, and in circumstances where the training data could be explained using a single non-linear relationship (see Fig. 1 for examples).

Iterated learning is an experimental method that was first developed for studying language evolution (Kirby 2001), but it has more recently been applied to other phenomena, including function learning. In an iterated learning experiment, there are chains of learners where the first learner in each chain receives data, makes some inference on the basis of those data, and uses that inference to provide new data to the next learner in the chain. The data produced by each learner is the product of the data he or she receives and his or her inductive biases or expectations about the underlying relationship, item, or event. As the chain of learners grows longer, the influence of the learners' shared


Table 1  Difficulty of learning results based on experiments reviewed in McDaniel and Busemeyer (2005)

                                     Hybrid models             Model 1                                                   Gaussian process expert models
Function           Human  ALM   Poly  Fourier  Logistic   Linear  Quad  RBF   LR     LQ      RQ    LRQ    GPE (Model 2)  Local GPE (Model 3)

Byun (1995, Expt 1B)
Linear             .20    .04   .04   .05      .16        .00038  .02   .032  .0044  .00063  .03   .0021  .011           .015
Square root        .35    .05   .06   .06      .19        .044    .039  .043  .047   .043    .05   .045   .038           .044

Byun (1995, Expt 1A)
Linear             .15    .10   .33   .33      .17        .0011   .021  .033  .0037  .00092  .03   .0028  .0079          .013
Power, pos. acc.   .20    .12   .37   .37      .24        .066    .021  .053  .067   .055    .033  .065   .053           .058
Power, neg. acc.   .23    .12   .36   .36      .19        .044    .039  .037  .046   .042    .048  .045   .049           .045
Logarithmic        .30    .14   .41   .41      .19        .07     .051  .046  .073   .067    .063  .073   .07            .066
Logistic           .39    .18   .51   .52      .33        .11     .11   .079  .11    .11     .11   .11    .11            .11

Byun (1995, Expt 2)
Linear             .18    .01   .18   .19      .12        .00015  .017  .020  .0023  .00038  .022  .0017  .011           .0081
Quadratic          .28    .03   .31   .31      .24        .26     .041  .044  .14    .084    .082  .055   .083           .072
Cyclic             .68    .32   .41   .40      .68        .30     .29   .18   .29    .29     .28   .17    .31            .31

DeLosh, Busemeyer, & McDaniel (1997)
Linear             .10    .04   .11   .11      .04        .00021  .019  .026  .0042  .00049  .026  .0025  .0052          .0089
Exponential        .15    .05   .17   .17      .02        .05     .039  .033  .05    .047    .05   .05    .048           .049
Quadratic          .24    .07   .27   .27      .11        .26     .053  .056  .21    .0100   .10   .068   .11            .073

Overall correlation between human and model performance
Linear             1.0    .83   .45   .45      .93        .66     .93   .93   .78    .91     .92   .89    .92            .93
Rank-order         1.0    .55   .51   .51      .77        .72     .82   .80   .71    .69     .80   .74    .81            .79

Mean within-experiment correlation between human and model performance
Linear             1.0    .99   .97   .97      .90        .91     .98   .96   .96    .74     .61   .96    .99            .97

Note: Rows correspond to functions learned in experiments reviewed in McDaniel and Busemeyer (2005). Columns give the mean absolute deviation (MAD) from the true functions for human learners and different models (Gaussian process models with multiple kernels are denoted by the initials of their kernels, e.g., LRQ = Linear, Quadratic, and Radial Basis Function). Human MAD values represent sample means (for a single subject over trials, then over subjects), and reflect both estimation and production errors, being higher than model MAD values, which are computed using deterministic model predictions and thus reflect only estimation error. The last three rows give the correlations of the human and model MAD values, including linear and rank-order correlations across all experiments and mean linear correlations on an experiment-by-experiment basis, providing an indication of how well the model matches the difficulty people have in learning different functions.


expectations eventually washes out the information carried by the data provided to the first learner. After enough iterations, the data carried forward in the chain reflect human expectations about what relationships are likely, rather than the data the first learner in the chain sees, providing useful information about how people represent and reason about the phenomena at hand (Kalish et al. 2007).

Iterated learning

Figure 2 shows the results of a set of iterated function learning experiments conducted by Kalish et al. (2007). There were four conditions that differed in what data were given to the first participants in the chains. The positive linear (A) chains started with a linear relationship with a slope of one and an intercept of zero, the negative linear (B) chains started with a linear relationship with a slope of negative one and an intercept of zero, the U-shaped (C) chains started with data from a U-shaped relationship, and the random (D) chains started with a disorganized collection of points without any apparent underlying regularity. Kalish et al. (2007) found that the judgments of later participants tended to converge to a positive linear relationship with a slope of one and an intercept of zero regardless of the initial data. While these convergence results dovetail with past findings indicating that positive linear relationships are easier to learn, the intermediate states of the chains provide a more detailed view of function learning. For example, learners tended to preserve negative linear relationships, consistent with the idea that people think these relationships are likely or plausible. Further, many learners were quick to infer the presence of multiple overlapping relationships, as when some participants interpreted noisy data as evidence for a negative linear relationship superimposed on a positive one.

Models of human function learning

The phenomena described in the previous section have inspired several theories and models of function learning, which can be organized into three classes: those based on rules or explicit functions, those based on associative or similarity-based learning, and hybrids that use explicit representations and associative learning. In this section, we review each class in turn, before discussing the extent to which each is consistent with the empirical results described above.

Representing functions with rules

Some of the earliest research into function learning postulates that people learn continuous relationships using explicitly represented functions (Carroll 1963). Carroll proposed

that people assume a particular class of functions (such as polynomials of degree k) and use the available observations to estimate the parameters of those functions. The resulting representation allows people to generalize beyond the observed values of the variables involved. Consistent with the version of this hypothesis that Carroll advanced, people learned linear and quadratic functions better than random pairings of values for two variables, and extrapolated appropriately. Similar assumptions have guided subsequent work, which has explored the ease with which people learn different kinds of functions (e.g., Brehmer, 1974), and examined how well human responses are described by different forms of nonlinear regression (e.g., Koh and Meyer, 1991).

The advent of rule-based models precedes most of the empirical results we consider, so it may be unsurprising that these models face some difficulty in explaining those results. Rule-based models do not show the flexibility in interpolation that human learners exhibit, and tend not to predict the order-of-difficulty found in interpolation studies (McDaniel and Busemeyer 2005). Similarly, there is evidence that rule-based models (such as Koh and Meyer (1991)) make extrapolation predictions that diverge from human judgments (DeLosh et al. 1997). Purely rule-based models make no provision for multiple overlapping relationships, and thus cannot account for knowledge partitioning effects (Kalish et al. 2004). By extension, their ability to explain Kalish, Griffiths, and Lewandowsky's (2007) iterated learning results is limited: while rule-based models might be able to explain long-run convergence to positive linear relationships, they do not anticipate participants' multimodal judgments.

Similarity and associative learning

Associative learning models propose that people do not learn relationships between continuous variables by explicitly learning rules, but instead forge associations between observed events and generalize based on the similarity of new variable values to old. The first model to implement this approach was the Associative Learning Model (ALM; DeLosh et al., 1997; Busemeyer et al., 1997), in which input and output arrays are used to represent a range of values for the variables between which the functional relationship holds. Presentation of an input activates input nodes close to that value, with activation falling off as a Gaussian function of distance, implementing a theory of similarity in the input space.

Learned weights determine the activation of the output nodes, which is a linear function of the activation of the input nodes. Weights are learned by gradient descent, where the local relationship between weights and errors is used to find new weights that reduce the squared error of the model's predictions. This process is repeated until the


Fig. 1 Training data and four participants' judgments for Experiments 1–3 in Kalish et al. (2004). Predictor variable values are plotted on the x-axes, with predicted variable values plotted on the y-axes

error can no longer be reduced. In practice, this approach performs well when interpolating between observed values, but poorly when extrapolating beyond those values, as it does not capture humans' ability to extrapolate in systematic, structured ways. As a consequence, DeLosh et al. introduced the EXAM model, which constructs a linear approximation to the output of the ALM when selecting responses.
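To make the associative scheme concrete, the following is a minimal sketch of an ALM-style learner, assuming Gaussian input activation and a gradient-descent weight update; the node spacing, activation width (gamma), and learning rate (lr) are illustrative choices, not values from the original models.

```python
import numpy as np

class ALMSketch:
    """Minimal sketch of an ALM-style associative learner (illustrative parameters)."""

    def __init__(self, lo=0.0, hi=100.0, n_nodes=50, gamma=20.0, lr=0.1):
        self.input_nodes = np.linspace(lo, hi, n_nodes)   # input array over the stimulus range
        self.output_nodes = np.linspace(lo, hi, n_nodes)  # output array over the response range
        self.W = np.zeros((n_nodes, n_nodes))              # learned associative weights
        self.gamma, self.lr = gamma, lr                     # assumed activation width and step size

    def _activate(self, nodes, value):
        # Activation falls off as a Gaussian function of distance from the presented value.
        return np.exp(-((nodes - value) ** 2) / self.gamma)

    def predict(self, x):
        # Output activation is a linear function of input activation; the response is the
        # activation-weighted mean of the output node values.
        out = np.clip(self.W @ self._activate(self.input_nodes, x), 1e-9, None)
        return float(np.sum(self.output_nodes * out) / np.sum(out))

    def train(self, x, y):
        # Gradient descent on the squared error between the output activations and a
        # Gaussian target pattern centered on the observed y value.
        a_in = self._activate(self.input_nodes, x)
        error = self._activate(self.output_nodes, y) - self.W @ a_in
        self.W += self.lr * np.outer(error, a_in)
```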

Similarity-based models have seen mixed success in explaining the range of empirical phenomena we describe above. In studies of interpolation and learning difficulty, similarity-based models show similar patterns of interpolation errors to those of humans (McDaniel and Busemeyer 2005). In the context of extrapolation, ALM does not address extrapolation at all, while EXAM, which was developed with those results in mind, effectively captures the human bias toward linearity and predicts human extrapolations over a variety of relationships (McDaniel and Busemeyer 2005), but without accounting for the human capacity for non-linear extrapolation (Bott and Heit 2004). Like rule-based models, similarity-based models make unimodal predictions for any given x, and thus fail to account for knowledge partitioning results. This limitation also prevents EXAM from capturing some of the intermediate patterns that people produce in the iterated learning experiment.

Hybrid approaches

Several studies have explored methods for combining rule-like representations of functions with associative learning.

One example of such an approach is the set of models explored in McDaniel and Busemeyer (2005). These models used the same kind of input representation as ALM and EXAM, with activation of a set of nodes similar to the input value. However, the models also feature a set of hidden units, where each hidden unit corresponds to a different parameterization of a rule from a given class, including polynomial, Fourier, and logistic functions. The values of the hidden units—corresponding to the values of the rules they instantiate—are combined linearly to obtain output predictions, with the weight of each hidden node being learned through gradient descent.

Another instance of a hybrid approach is the POLE model (Kalish et al. 2004), in which hidden units represent different linear functions and the weights from inputs to hidden nodes indicate which linear function should be used to make predictions for particular input values. Using this representation, the model can learn non-linear functions by identifying a series of local linear approximations, and can even model situations in which people seem to learn different functions in different parts of the input space. As a result, it is unique among the models we have discussed in its ability to match the bimodal response distributions discovered by Kalish et al. (2004).

Hybrid rule- and similarity-based models form a more heterogeneous group than similarity- and rule-based models, with representatives including POLE (Kalish et al. 2004) and McDaniel and Busemeyer's (2005) connectionist implementations of rule-based models. POLE is set apart from the other models we have discussed by its ability to



Fig. 2 Plots of results from Kalish et al. (2007). A Positive linear initial data; B Negative linear initial data; C U-shaped initial data; D Random initial data

capture knowledge partitioning effects, and it demonstrates a similar ordering of error rates to those of human learners (McDaniel et al. 2009). In its extrapolation predictions, however, there is evidence that it deviates from human performance (McDaniel et al. 2009). In an iterated learning design, POLE showed both convergence to positive linear relationships and some of the qualitative patterns that human learners demonstrate (depicted in Fig. 3II), including transitional states with overlapping positive and negative linear relationships. McDaniel and Busemeyer's hybrid polynomial model—which performed better than the alternative hybrid models they considered—demonstrates an ordering of interpolation errors on different functions that aligns only roughly with human judgments (see Table 1), but its extrapolation predictions are consistent with human judgments from McDaniel and Busemeyer's studies (McDaniel and Busemeyer 2005). Like rule-based models, this model offers unimodal predictions, and thus cannot account for

knowledge partitioning phenomena, and has not been evaluated against iterated learning results.

Summary

We have reviewed a diverse set of models that accurately predict a variety of empirical phenomena in function learning. Despite their different commitments about how humans learn continuous relationships, a common theme of these models is an emphasis on the process by which function learning occurs. In the next section, we will take a fundamentally different view, focusing on the abstract problem of function learning and the forms that good solutions to that problem should take, rather than the process. This view complements past models rather than supplanting them, and we will demonstrate that it provides a common framework with which to understand and unify rule- and similarity-based approaches.


Fig. 3 Model predictions for iterated learning data. A–D denote positive linear, negative linear, U-shaped, and random initial data, respectively. (I) Predictions from EXAM; (II) Predictions from POLE; (III) Mean function estimates from our Model 3, removing noise

Rational solutions to regression problems

The models outlined in the previous section all aim to describe the psychological processes involved in human function learning. In this section, we consider the abstract computational problem underlying this task, using optimal solutions to this problem to shed light on both previous models and human learning. Viewed abstractly, the computational problem behind function learning is to use a set of real-valued observations $\mathbf{x}_n = (x_1, \ldots, x_n)$ and $\mathbf{t}_n = (t_1, \ldots, t_n)$ to predict what $y_{n+1}$ goes with a new $x_{n+1}$. Here, the y-values correspond to the underlying relationship, and the t-values are observations of y that have been obscured by additive noise, so $y_{n+1} = E[t_{n+1}]$. Following much of the literature on human function learning, we consider

only one-dimensional relationships, but this approach generalizes naturally to the multi-dimensional case. In machine learning and statistics, this is referred to as a regression problem. In this section, we discuss how regression problems can be solved using Bayesian statistics, and how the result of this approach is related to Gaussian processes, a formalism with close ties to associative learning. Our presentation follows that in Williams (1998). See Appendix A for a more thorough treatment of the mathematical details.

Bayesian linear regression

Ideally, we would seek to solve our regression problem by using not just the observations of x and t, but some prior beliefs about the probability of encountering different kinds


of functions $f(\cdot)$ in the world. We can do this by applying Bayes' rule, with

$$p(f \mid \mathbf{x}_n, \mathbf{t}_n) = \frac{p(\mathbf{t}_n \mid f, \mathbf{x}_n)\,p(f)}{\int_{\mathcal{F}} p(\mathbf{t}_n \mid f, \mathbf{x}_n)\,p(f)\,df}. \tag{1}$$

Knowledge of which functions in the space of possibilities $\mathcal{F}$ are more likely to be the true function is captured by $p(f)$, the prior distribution. The probability of observing the values of $\mathbf{t}_n$ if f were the true function is given by the likelihood function $p(\mathbf{t}_n \mid f, \mathbf{x}_n)$, and the probability that f

is the true function given the observations $\mathbf{x}_n$ and $\mathbf{t}_n$ is the posterior distribution $p(f \mid \mathbf{x}_n, \mathbf{t}_n)$. In most regression models, the likelihood is defined by assuming that any deviation from the true function is due to many independent sources of noise—more specifically, that $t_i$ is Gaussian with mean $y_i = f(x_i)$ and variance $\sigma_t^2$. Predictions about the value of the function f for a new input $x_{n+1}$ can be made by integrating over all functions in the posterior distribution,

$$p(y_{n+1} \mid x_{n+1}, \mathbf{t}_n, \mathbf{x}_n) = \int_{f} p(y_{n+1} \mid f, x_{n+1})\,p(f \mid \mathbf{x}_n, \mathbf{t}_n)\,df \tag{2}$$

where $p(y_{n+1} \mid f, x_{n+1})$ is a delta function placing all of its mass on $y_{n+1} = f(x_{n+1})$. Performing the integration outlined above can be challenging, but it becomes straightforward if we limit the hypothesis space to certain specific classes of functions. If we take $\mathcal{F}$ to be all linear functions of the form $y = b_0 + x b_1$, then our problem takes the familiar form of linear regression. To perform Bayesian linear regression, we need to define a prior $p(f)$ over all linear functions. Since these functions are identified by the parameters $b_0$ and $b_1$, it is sufficient to define a prior over $\mathbf{b} = (b_0, b_1)$, which we can do by assuming that $\mathbf{b}$ follows a multivariate Gaussian distribution; this results in a posterior distribution over $\mathbf{b}$ that is also a multivariate Gaussian (see Bernardo and Smith (1994)). Linear transformations of Gaussian distributions are also Gaussian, so the predictive density in Eq. 2 is also Gaussian, and the noise introduced between true values y and observations t simply adds to the variance of this distribution.

While considering only linear functions might seem overly restrictive, linear regression actually gives us the basic tools we need to solve this problem for more general classes of functions. Many classes of functions can be described as linear combinations of a small set of basis functions. For example, all kth degree polynomials are linear combinations of functions of the form 1 (the constant function), $x$, $x^2$, ..., $x^k$. Letting $\phi^{(1)}, \ldots, \phi^{(k)}$ denote a set of functions, we can define a prior on the class of functions that are linear combinations of this basis by expressing such functions in the form $f(x) = b_0 + \phi^{(1)}(x) b_1 + \ldots + \phi^{(k)}(x) b_k$ and defining a prior on the vector of weights $\mathbf{b}$. As long as the prior over weights is Gaussian, the same results apply as in the simple linear case.
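As a rough illustration of this idea, the following sketch performs conjugate Bayesian linear regression with a polynomial basis; the basis, prior scale (sigma_b), and noise level (sigma_t) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def design_matrix(x, degree=2):
    # Polynomial basis: columns are 1, x, x^2, ..., x^degree.
    return np.vander(x, degree + 1, increasing=True)

def posterior_weights(x, t, sigma_t=0.1, sigma_b=1.0, degree=2):
    # Conjugate update for a Gaussian prior over weights b and Gaussian noise:
    # prior b ~ N(0, sigma_b^2 I), likelihood t_i ~ N(f(x_i), sigma_t^2).
    Phi = design_matrix(x, degree)
    prec = Phi.T @ Phi / sigma_t**2 + np.eye(Phi.shape[1]) / sigma_b**2
    cov = np.linalg.inv(prec)                    # posterior covariance of b
    mean = cov @ Phi.T @ t / sigma_t**2          # posterior mean of b
    return mean, cov

def predictive_mean(x_new, x, t, degree=2, **kw):
    # Posterior predictive mean for new inputs: E[y_new] = Phi(x_new) @ E[b].
    mean, _ = posterior_weights(x, t, degree=degree, **kw)
    return design_matrix(x_new, degree) @ mean

# Example: noisy observations from a quadratic relationship, then extrapolation.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = 2 * x**2 - x + 0.05 * rng.standard_normal(20)
print(predictive_mean(np.array([1.5]), x, t))
```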

Gaussian processes

Another approach to regression problems is to forgo any explicit representation of the underlying function and focus on making predictions. If our goal is merely to predict $y_{n+1}$ using $x_{n+1}$, $\mathbf{t}_n$, and $\mathbf{x}_n$, we might simply define a joint distribution on $\mathbf{t}_{n+1}$ given $\mathbf{x}_{n+1}$ and find its expected value, which is equal to $y_{n+1}$, after conditioning on $\mathbf{t}_n$:

$$p(t_{n+1} \mid x_{n+1}, \mathbf{x}_n, \mathbf{t}_n) = \frac{p(\mathbf{t}_{n+1} \mid x_{n+1}, \mathbf{x}_n)}{p(\mathbf{t}_n \mid x_{n+1}, \mathbf{x}_n)}. \tag{3}$$

This equation expresses the problem of regression in very general terms, and may, at first glance, seem daunting to compute: it involves defining a joint distribution over all of the points observed so far, as well as the joint distribution including the new, unknown point. Further, if we want to predict $y_{n+1}$, we must be able to take the expectation of this quotient. However, in some circumstances, the probability of $t_{n+1}$ given $x_{n+1}$, $\mathbf{x}_n$, and $\mathbf{t}_n$ has a straightforward analytical solution, and an easily computed expectation. One such case, which will be our focus here, is when all $\mathbf{t}_{n+1}$ values are jointly Gaussian. In other words, $\mathbf{t}_{n+1}$

is distributed according to a single multivariate Gaussian, with dimensionality corresponding to the number of points under consideration. This is determined by its mean and covariance matrix, and once these are specified, we have a solution for Eq. 3: the quotient has a closed form for multivariate Gaussians (see Rasmussen and Williams, 2006, for details). As we will see, assuming a jointly Gaussian distribution is not a strong constraint, and we can express a very broad set of relationships through our choice of means and covariances.

Both the mean vector and the covariance matrix are determined by the values of x. Broadly speaking, the mean vector captures expectations about how the function looks in the absence of data, and the covariance matrix—or the kernel function that generates it—captures expectations about how points relate to one another. The covariance matrix entry for any pair of t-values $(t_i, t_j)$ is given by a function $K(x_i, x_j)$, plus a diagonal matrix capturing the noisy relationship between the underlying values $y_i$ and the observations $t_i$. Using this covariance matrix, we can obtain the distribution of $t_{n+1}$ conditional on $\mathbf{t}_n$. The function $K(\cdot, \cdot)$, called the kernel function, can be chosen arbitrarily as long as the covariance matrix it produces is valid.
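For reference, the resulting conditional distribution takes the standard Gaussian-conditioning form (a sketch assuming a zero mean function, with $K_n$ the kernel matrix over the observed inputs and $\mathbf{k}_*$ the vector of kernel values between $x_{n+1}$ and each observed input; see Rasmussen and Williams, 2006):

$$E[t_{n+1} \mid \mathbf{t}_n] = \mathbf{k}_*^{\top} (K_n + \sigma_t^2 I)^{-1} \mathbf{t}_n, \qquad \mathrm{var}[t_{n+1} \mid \mathbf{t}_n] = K(x_{n+1}, x_{n+1}) + \sigma_t^2 - \mathbf{k}_*^{\top} (K_n + \sigma_t^2 I)^{-1} \mathbf{k}_*.$$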

One common kind of kernel is a radial basis function, e.g.,

$$K(x_i, x_j) = \theta_1^2 \exp\!\left(-\frac{1}{\theta_2^2}(x_i - x_j)^2\right) \tag{4}$$


which leads to t values that are more strongly correlated when their corresponding x values are more similar, with the parameters $\theta_1$ and $\theta_2$ determining how quickly the correlation falls off as differences in x values increase. Other kernels are possible, including periodic functions such as

$$K(x_i, x_j) = \theta_3^2 \exp\!\left(\theta_4^2 \cos\!\left(\frac{2\pi}{\theta_5}[x_i - x_j]\right)\right) \tag{5}$$

indicating that values of y for which values of x are close relative to the period $\theta_5$ are likely to be highly correlated.

This approach to prediction, in which a kernel function applied to x defines a normal distribution on t-values, is called a Gaussian process. A wide variety of kernel functions are possible, corresponding to varied commitments about which x values are likely to lead to similar t-values, making Gaussian processes a flexible way to solve regression problems.
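The following is a minimal sketch of Gaussian process prediction with the radial basis kernel of Eq. 4, under the conditioning formula given above; the kernel parameters, noise level, and example data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, theta1=1.0, theta2=0.3):
    # Radial basis kernel (Eq. 4): nearby x values yield strongly correlated t values.
    return theta1**2 * np.exp(-((a[:, None] - b[None, :]) ** 2) / theta2**2)

def gp_predict(x_new, x, t, sigma_t=0.1, kernel=rbf_kernel):
    # Condition the jointly Gaussian t values on the observations (zero prior mean)
    # to obtain the posterior mean and variance of y at the new inputs.
    K = kernel(x, x) + sigma_t**2 * np.eye(len(x))   # covariance of observed t values
    k_star = kernel(x, x_new)                        # covariances between old and new inputs
    alpha = np.linalg.solve(K, t)
    mean = k_star.T @ alpha
    cov = kernel(x_new, x_new) - k_star.T @ np.linalg.solve(K, k_star)
    return mean, np.diag(cov)

# Example: interpolate and extrapolate from a handful of observations.
x = np.array([0.1, 0.3, 0.5, 0.7])
t = np.sin(2 * np.pi * x)
mean, var = gp_predict(np.linspace(0, 1, 5), x, t)
print(mean)
```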

Two views of regression

Bayesian linear regression and Gaussian processes appear to be quite different approaches. In Bayesian linear regression, a hypothesis space of functions is identified, a prior on that space is defined, and predictions are formed by averaging over the posterior distribution of y, while Gaussian processes simply use the similarity between different values of x, as expressed through a kernel, to predict correlations in values of y. It might thus come as a surprise that these approaches are equivalent.

Showing that Bayesian linear regression corresponds to Gaussian process prediction is straightforward. The assumption of linearity means that the vector $\mathbf{y}_{n+1}$ is equal to $X_{n+1}\mathbf{b}$. Given normally distributed weights, it follows that $p(\mathbf{y}_{n+1} \mid \mathbf{x}_{n+1})$ is a multivariate Gaussian distribution with mean zero and covariance matrix $X_{n+1} \Sigma_b X_{n+1}^{\top}$. Bayesian linear regression thus corresponds to prediction using Gaussian processes, with this covariance matrix playing the role of $K_{n+1}$ above (i.e., using the kernel function $K(x_i, x_j) = [1\ x_i]\,\Sigma_b\,[1\ x_j]^{\top}$). Using a richer set of basis functions corresponds to taking $K_{n+1} = \Phi_{n+1} \Sigma_b \Phi_{n+1}^{\top}$, i.e.,

$$K(x_i, x_j) = \left[1\ \phi^{(1)}(x_i)\ \ldots\ \phi^{(k)}(x_i)\right] \Sigma_b \left[1\ \phi^{(1)}(x_j)\ \ldots\ \phi^{(k)}(x_j)\right]^{\top}, \tag{6}$$

where $\phi^{(1 \ldots k)}$ are k arbitrary functions of x (Williams 1998). It is also possible to show that Gaussian process prediction can always be interpreted as Bayesian linear regression, albeit with a potentially infinite number of basis functions. Just as we can express a covariance matrix in terms of its eigenvectors and eigenvalues, we can express a given kernel $K(x_i, x_j)$ in terms of its eigenfunctions $\phi$ and eigenvalues $\lambda$, with

$$K(x_i, x_j) = \sum_{k=1}^{\infty} \lambda_k\,\phi^{(k)}(x_i)\,\phi^{(k)}(x_j) \tag{7}$$

for any $x_i$ and $x_j$ (Minh et al. 2006). Thus, any kernel can be viewed as the result of performing Bayesian linear regression with a set of basis functions corresponding to its eigenfunctions, and a prior with covariance matrix $\Sigma_b = \mathrm{diag}(\lambda)$.

These results establish an important duality between Bayesian linear regression and Gaussian processes: for every prior on functions, there exists a corresponding kernel, and for every kernel, there exists a corresponding prior on functions. Bayesian linear regression and prediction with Gaussian processes are thus just two views of the same solution to regression problems.
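The duality can be checked numerically: under the linear kernel $K(x_i, x_j) = [1\ x_i]\,\Sigma_b\,[1\ x_j]^{\top}$, Gaussian process prediction gives the same predictive mean as Bayesian linear regression with prior covariance $\Sigma_b$. The sketch below uses illustrative data, prior, and noise settings.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 10)
t = 1.5 * x + 0.3 + 0.05 * rng.standard_normal(10)   # noisy linear relationship
x_new = np.array([2.0])                               # an extrapolation point
sigma_t, Sigma_b = 0.05, np.eye(2)                    # assumed noise and prior covariance

Phi = np.column_stack([np.ones_like(x), x])           # basis [1, x] for observed inputs
Phi_new = np.column_stack([np.ones_like(x_new), x_new])

# View 1: Bayesian linear regression, posterior mean over weights b.
prec = Phi.T @ Phi / sigma_t**2 + np.linalg.inv(Sigma_b)
b_mean = np.linalg.solve(prec, Phi.T @ t / sigma_t**2)
pred_blr = Phi_new @ b_mean

# View 2: Gaussian process with the corresponding linear kernel K = Phi Sigma_b Phi^T.
K = Phi @ Sigma_b @ Phi.T + sigma_t**2 * np.eye(len(x))
k_star = Phi @ Sigma_b @ Phi_new.T
pred_gp = k_star.T @ np.linalg.solve(K, t)

print(pred_blr, pred_gp)   # the two predictive means agree (up to numerical error)
```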

Combining rules and similarity through Gaussian processes

The results outlined in the previous section suggest that, in the context of regression, learning using rules (as expressed in a Bayesian linear regression model) and generalizing based on similarity (as expressed in a Gaussian process's kernel function) are mutually compatible points of view. In this section, we briefly describe how previous accounts of function learning connect to these statistical models, and then use this insight to define a model of human function learning that combines the strengths of both approaches.

Reinterpreting previous accounts of human function learning

The idea of human function learning as a kind of statistical regression connects directly to Bayesian linear regression. Many rule-based models (e.g., Koh and Meyer, 1991; Carroll, 1963) can be framed in terms of Bayesian linear regression while retaining all of their basic commitments and predictions. Similarly, the basic ideas behind Gaussian process regression (with a standard radial-basis kernel function) lie at the heart of similarity-based models such as ALM. In particular, ALM and the associative-learning component of EXAM implement cubic spline approximation (McDaniel and Busemeyer 2005), which can be represented using Gaussian processes (Rasmussen and Williams 2006). Similarly, neural network approaches to similarity-based generalization are directly related to Gaussian processes, with some networks having a perfect mapping to a corresponding Gaussian process (Neal 1994). Gaussian processes with radial-basis kernels can thus be viewed as


implementing a simple kind of similarity-based generalization, predicting similar y values for stimuli with similar x values. The hybrid approach to rule learning taken by McDaniel and Busemeyer (2005) is also closely related to Bayesian linear regression. The rules represented by the hidden units serve as a basis set that specifies a class of functions, and applying penalized gradient descent on the weights assigned to those basis elements serves as an online algorithm for finding the function with highest posterior probability (MacKay 1995).

Mixing functions in a Gaussian process model

The relationship between Gaussian processes and Bayesian linear regression suggests that we can define a single model that exploits both similarity and rules in forming predictions. We can do this by choosing a hypothesis space that covers a broad class of functions, including both those consistent with a radial basis kernel and those taking simple parametric forms. This is equivalent to modeling y as being produced by a Gaussian process with a kernel corresponding to one of a small number of types. Specifically, we assume that observations are generated by a function that is linear with positive slope, linear with negative slope, quadratic, or nonlinear but generally smooth. Figure 4 depicts samples from these individual kernels. This combination is one way to express the total prior over functions in Eqs. 1 and 2, with $p(f) = \sum_k p(f \mid k)\,P(k)$, where k represents a particular kernel in the set of four we have mentioned. For examples of functions that are likely under each of the different kernels, see Fig. 4.

We do not claim that these specific kernels compose an exhaustive account of the relationships that people can learn and extrapolate from. Rather, we believe that people find these relationships especially easy to learn, and especially plausible or likely as explanations of data in the face of uncertainty, based on the results of Brehmer (1971), DeLosh et al. (1997), and Kalish et al. (2007).

A more complete account would include kernels that permit a wide variety of extrapolation patterns (e.g., Bott and Heit, 2004), but for the data we will consider such an expansion would add to the complexity of our models without substantially changing our predictions (see Lucas et al. (2012) for a demonstration of how Gaussian process models can be used to predict a variety of non-linear extrapolations). The probabilities of the different relationship types are defined by the vector $\pi$. The relevant kernels are introduced in the previous sections (where "Nonlinear" corresponds to the radial basis kernel), with the positive and negative linear kernels having different means in their distributions over weights $\mathbf{b}$, taking mean intercepts and slopes of $[0\ 1]$ and $[1\ {-1}]$, respectively. Using this Gaussian process model allows a learner to simultaneously make inferences

about the overall type and specific form of the function from which their observations are drawn.
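One way to sketch such an inference is to weight each kernel type by the Gaussian marginal likelihood of the observations under that kernel, combined with the prior probabilities $\pi$; the sketch below uses a reduced set of kernels and illustrative prior values, mean functions, and parameters, and is not the paper's own procedure (which is described in Appendix C).

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf_cov(x, theta1=1.0, theta2=0.3):
    # Nonlinear (radial basis) kernel.
    return theta1**2 * np.exp(-((x[:, None] - x[None, :]) ** 2) / theta2**2)

def linear_cov(x, sb=0.5):
    # Covariance induced by a Gaussian prior over intercept and slope.
    Phi = np.column_stack([np.ones_like(x), x])
    return Phi @ (sb**2 * np.eye(2)) @ Phi.T

def marginal_loglik(x, t, mean_fn, cov_fn, sigma_t=0.1):
    # log p(t | x, k): under kernel k the observations are jointly Gaussian.
    K = cov_fn(x) + sigma_t**2 * np.eye(len(x))
    return multivariate_normal.logpdf(t, mean=mean_fn(x), cov=K)

# Kernel types with mean functions (positive linear: intercept 0, slope 1;
# negative linear: intercept 1, slope -1) and illustrative prior probabilities pi.
kernels = {
    "positive linear": (lambda x: 0.0 + 1.0 * x, linear_cov),
    "negative linear": (lambda x: 1.0 - 1.0 * x, linear_cov),
    "nonlinear (RBF)": (lambda x: np.zeros_like(x), rbf_cov),
}
pi = {"positive linear": 0.8, "negative linear": 0.1, "nonlinear (RBF)": 0.1}

x = np.linspace(0, 1, 15)
t = 0.9 * x + 0.05 * np.random.default_rng(2).standard_normal(15)

log_post = {k: np.log(pi[k]) + marginal_loglik(x, t, m, c) for k, (m, c) in kernels.items()}
z = np.logaddexp.reduce(list(log_post.values()))
print({k: float(np.exp(v - z)) for k, v in log_post.items()})  # posterior over kernel types
```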

In developing this kind of model and selecting this particular set of priors—reflected in our choice of kernel functions—we are making explicit commitments about the inductive biases that shape human function learning. These include what types of relationships are more subjectively probable than others, and the more specific forms that relationships of a given type are likely to take. Our model does not, however, commit to any specific process by which those biases shape people's inferences, which might resemble, for example, the associative mechanisms present in POLE or EXAM, or an elaboration of the hypothesis-testing framework offered by Brehmer.¹

Basic tests of the Gaussian process model

In the remainder of the paper, we will evaluate our Gaussian process approach to function learning using each of the empirical phenomena we discussed earlier. First, following the approach taken in McDaniel and Busemeyer's (2005) review of computational models of function learning, we look at two quantitative tests of Gaussian processes as an account of human function learning: reproducing the order of difficulty of learning functions of different types, and extrapolation performance. As indicated earlier, there is a large literature consisting of both models and data concerning human function learning, and these simulations are intended to demonstrate the potential of the Gaussian process model rather than to provide an exhaustive test of its performance. See Appendix B for a summary of the parameters in our model, and Appendix C for a description of the procedures used to generate model predictions.

Difficulty of learning

As discussed above, one important measure of a theory of human function learning is its ability to account for the relative difficulty people have in learning different kinds of relationships. Table 1 is an augmented version of results presented in McDaniel and Busemeyer (2005), which compared several models' prediction errors to humans' errors when learning a range of functions. Each entry in the table is the mean absolute deviation (MAD) of human or model responses from the actual value of the function, evaluated

¹ In obtaining predictions from our model, we use sampling methods that are described in Appendix C. There has been recent work supporting the idea that sampling may explain the inferences that humans make in many domains (Griffiths et al. 2012), but our predictions are not coupled to any inference procedure and we do not have data distinguishing between different mechanistic or implementation-level accounts.


Fig. 4 Samples from the four kernels that are combined in our models, reflecting the kind of relationship that each kernel favors

over the stimuli presented in training. The MAD provides a measure of how difficult it is for people or a given model to learn a function. The data reported for each set of studies are ordered by increasing MAD (corresponding to increasing difficulty). In addition to reproducing the MAD for the models in McDaniel and Busemeyer (2005), the table has been expanded to contain the MADs exhibited by seven Gaussian process (GP) models trained on the target functions.
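In symbols, writing $\hat{y}_i$ for the human or model response at training stimulus $x_i$ and $N$ for the number of training stimuli, this deviation measure is simply

$$\mathrm{MAD} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - f(x_i)\right|.$$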

The seven GP models incorporated different collections of kernel functions by adjusting their prior probabilities. The most comprehensive model includes the {Positive Linear, Negative Linear, Quadratic, Nonlinear} set of kernel functions, assigning them prior probabilities proportional to 8, 1, 0.1, and 0.01, respectively.² Six other GP models were examined by assigning certain kernel functions zero prior probability and re-normalizing the remainder so that the prior probabilities summed to one. The seven distinct GP models presented in Table 1 are labeled by the kernel functions to which they assign non-zero probability, under the header "Model 1". Models 2

² The selection of these values was guided by results indicating the order of difficulty of learning functions of these different types for human learners, but we did not optimize $\pi$ with respect to the criteria reported here.

and 3, which are extensions that account for knowledge partitioning phenomena, are discussed below. The kernels include Linear (including both positive and negative linear functions), Quadratic (second-order polynomial functions), RBF (nonlinear relationships, fit by a radial basis function kernel), LQ (linear and quadratic), LR (linear and RBF), RQ (quadratic and RBF), and LRQ (linear, quadratic, and RBF). The MAD for each function from McDaniel and Busemeyer (2005) is reported for each model in Table 1, along with human MADs. The last three rows of Table 1 give the correlations between human and model performance across functions, expressing quantitatively how well each model captured the pattern of human function learning behavior. All of the GP models perform well, with every model (except for the Linear and LQ models) providing a closer match to the human data than any of the models considered by McDaniel and Busemeyer (2005).

Extrapolation performance

Predicting and explaining people's capacity for extrapolation to novel stimuli is another key criterion for judging models of function learning. In Table 2, we compare mean human predictions for linear, exponential, and quadratic functions (from DeLosh et al. (1997)) to those of several


Table 2 Linear correlations between human and model predictions for extrapolation regions

Model Linear Exponential Quadratic

EXAM .999 .997 .961

Model 1, Linear .996 .972 .277

Model 1, Quadratic .986 .973 .921

Model 1, RBF .996 .989 .921

Model 1, LQ .996 .972 .942

Model 1, LR .996 .972 .363

Model 1, RQ .985 .967 .941

Model 1, LRQ .996 .971 .513

Model 2 .996 .981 .955

Model 3 .996 .982 .957

Note: Gaussian process models with multiple kernels are denoted as in Table 1.

models described in McDaniel and Busemeyer (2005), as well as each of the Gaussian process models we describe above and two model extensions that we will describe below. While none of the GP models produce quite as high a correlation as EXAM on all three functions, all but the Linear and LR models make predictions that correspond closely with human judgments. It is notable that this performance is achieved with the same parameters that were used for the difficulty of learning data (see Appendix B for details), while the predictions of EXAM were the result of optimizing two parameters for each of the three functions.

Figure 5 displays mean human judgments for each of the three functions, along with the predictions of an extended Gaussian process model we discuss below, which incorporates Linear, Quadratic, and Nonlinear kernel functions. The regions to the left and right of the solid black lines represent extrapolation regions, containing input values for which neither people nor the model were trained. Both people and the model extrapolate nearly optimally on the linear function, and reasonably accurately for the exponential and quadratic functions. However, there is a bias towards a linear slope in the extrapolation of the exponential and quadratic functions, with extreme values of the quadratic and exponential functions being overestimated. Quantitative measures of extrapolation performance are shown in Table 2, which gives the correlation between human and model predictions for EXAM (DeLosh et al. 1997; Busemeyer et al. 1997) and the seven GP models.

Summary

We have shown that our model accounts well for the relative difficulty with which people learn different kinds of relationships, and how they extrapolate from limited training data. More complex phenomena, such as knowledge partitioning and the multiple overlapping relationships it entails, require more complex models. The next section addresses these phenomena, and describes a straightforward extension of our Gaussian process model to accommodate the possibility of multiple relationships while still explaining human interpolation and extrapolation behavior.

Extending the Gaussian process model beyond single relationships

In most models of function learning, including the Gaussian process-based models described above, it is assumed that people learn a single relationship between a variable and its predictors. There might be a complex, non-linear relationship between x and f(x), but for a single value of x, f(x) is always unimodal and relationships are never compositions of other relationships. We have mentioned that this assumption fails to describe many real relationships, and, as knowledge partitioning results show, it also fails to explain human behavior.

Of the models we have described, only the POLE model (Kalish et al. 2004) makes predictions that are consistent with knowledge partitioning phenomena, doing so by appealing to the mental representations and processes people use when learning functions. We will show that a rational analysis of function learning leads to a similar set of predictions. In many real-world situations, two variables x and y will relate to one another in different ways, depending on context. If y depends on w in addition to x, i.e., the true function is y = f(x, w), and w is not observable, the apparent relationship between x and y may have discontinuities, and it may not be a function at all, having multiple values of y for a given x. We previously discussed examples of such relationships, including acceleration in hybrid cars and dose-response curves in a patient population. Other examples of hidden mediators include the relationship between brake pressure and acceleration, mediated by surface slipperiness, and the relationship between the temperature of a material and its malleability, mediated by its unobserved crystal structure, as with the temper of a piece of metal. With these intuitions in hand, we will now describe how our model may be extended to reflect them.

Mixtures of Gaussian process experts

Fig. 5 Extrapolation performance, with mean predictions on linear, exponential, and quadratic functions for human participants from DeLosh, Busemeyer, and McDaniel (1997) and a mixture of Gaussian process experts (Model 3; see text). Training data were presented in the region spanned by a solid black line, and extrapolation performance was evaluated outside this region, with the true function represented by dashed lines.

We extended our Gaussian process model (Model 1) to capture the assumption that each point belongs to one of an unknown number of underlying relationships. Clearly, there is no fixed bound on the number of relationships that might obtain between x and y, but one would expect that fewer relationships should be more plausible than more, as a matter of simplicity or parsimony (Chater and Vitanyi 2003). There are multiple ways to express this intuition formally, but one obvious choice is to allow points to be divided into arbitrary partitions, assigning each partition a probability using a Chinese Restaurant Process prior (Aldous 1985), which has previously been used in rational analyses of categorization (Anderson 1991; Sanborn et al. 2010).

Under this prior, the likelihood that a new (x, y) pair will be assigned to an existing relationship is proportional to the number of other points that participate in that relationship, and the likelihood that it will be assigned to a new relationship is proportional to a parameter α. More precisely, the probability that the ith point's relationship r_i will be k is

Pr(r_i = k) =
\begin{cases}
\dfrac{n_k}{i + \alpha} & \text{if } n_k > 0, \\[6pt]
\dfrac{\alpha}{i + \alpha} & \text{if } n_k = 0
\end{cases}
\qquad (8)

where n_k is the number of points already participating in relationship k. The likelihood of the data under a given partition is determined by how likely the ensemble of y values is, given the nature of the relationships they participate in and their corresponding x values. This conceptually straightforward extension from Gaussian processes to a mixture of Gaussian processes will be called Model 2. We might also wish to capture the intuition that (x, y) pairs that have similar x values are more likely to participate in the same relationships; in other words, relationships tend to be locally smooth and unimodal. This expectation can be built into the model by assuming that the likelihood that a point belongs to a partition is determined in part by its closeness to current members, represented using the x-value's likelihood under a Gaussian distribution based on existing members. This last model, Model 3, is an example of a mixture of experts (Jacobs et al. 1991; Erickson and Kruschke 1998; Kalish et al. 2004), an approach that has been applied to Gaussian processes in the past (Rasmussen and Ghahramani 2002; Meeds and Osindero 2006). As with Model 1, Models 2 and 3 can be interpreted in terms of Bayesian linear regression or Gaussian processes, where every Gaussian process kernel for every expert can be represented as a linear regression model, albeit, as before, with a potentially infinite number of features. See Fig. 6 for samples of the kinds of relationships that the mixture of Gaussian process experts (henceforth Model 3) favors.
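To make the partitioning step concrete, the following sketch (ours, not the authors' code; the function name and the example counts are illustrative) computes the prior assignment probabilities implied by the Chinese Restaurant Process in Eq. 8, normalized over the existing relationships and a potential new one.

```python
import numpy as np

def crp_assignment_probs(counts, alpha):
    """Prior probability of assigning the next (x, y) pair to each existing
    relationship (proportional to its size) or to a brand-new relationship
    (proportional to alpha), normalized as in a Chinese Restaurant Process."""
    weights = np.append(np.asarray(counts, dtype=float), alpha)
    return weights / weights.sum()

# Example: three existing relationships with 5, 2, and 1 points, and alpha = 1;
# the final entry is the probability of starting a new relationship.
print(crp_assignment_probs([5, 2, 1], alpha=1.0))  # ~ [0.556, 0.222, 0.111, 0.111]
```

Model 3 would additionally weight each existing relationship by the likelihood of the new x value under a Gaussian distribution fit to that relationship's current members.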

Knowledge partitioning

Before applying Models 2 and 3 to knowledge partitioning phenomena, we evaluated them against the same difficulty-of-learning and extrapolation results with which we assessed our original Gaussian process models. As with the earlier models, we used the same parameters for all of the experiments, and obtained close fits to human judgments, summarized in Tables 1 and 2 (see Appendix B for details of parameters and fits). We also plotted predictions for Model 3 against mean human judgments in the extrapolation experiments in Fig. 5. In general, Models 2 and 3 performed as well as any other model, and better than the majority of the alternatives.

To gauge the extent to which the models' predictions are consistent with knowledge partitioning phenomena, we obtained individual predictions from twelve participants in Kalish et al.'s (2004) studies, four per experiment.3 Each experiment included training points and interpolation regions that were designed to elicit multiple modes in y for a given x. For example, in Experiment 1, there was a gap between two partial linear functions with the same slope and different intercepts. Many participants made judgments in the gap that matched both functions, leaving a bimodal response distribution. Like Kalish et al., we focus on showing that our model captures the bimodal responses of the participants, and gives a posterior distribution that matches the distribution of actual judgments.

3 The full data set was not available, but the combined distribution of judgments for the 12 participants was consistent with the overall distribution reported in Kalish et al. (2004).

Fig. 6 Samples from Model 3. The left plots show samples drawn from an infinite mixture of experts with α = .1, favoring a small number of distinct relationships. The right plots show samples drawn from a mixture with α = 10, favoring a large number of distinct relationships. Because of the diffuse prior over locations on the x-axis in Model 3, randomly drawn samples tend not to concentrate in x and thus look similar to samples drawn from Model 2.

The results are summarized in Fig. 7, comparing the predicted probabilities of different y values under Models 1, 2, and 3 to those given by participants. Model 1 predicts the aggregate trend in Kalish et al.'s Experiment 1, but cannot explain the discontinuities exhibited by two of the participants shown in Fig. 1, or the multiple modes evident in participants' judgments for Experiments 2 and 3. In contrast, Models 2 and 3 predict that multiple relationships will be inferred. Model 3, being sensitive to the proximity of points, is more likely than Model 2 to group points into local relationships, as is apparent in its predictions for Experiment 1. We used a single prior distribution across the different experiments and participants, but the individual differences in Fig. 1 are readily explained in terms of different participants having different inductive biases. Future work, with more extensive within-subjects data, would permit us to test our model as a framework for understanding how inductive biases vary between individuals.

Iterated learning

As a final measure of Gaussian process models of function learning, we compared their predictions to human judgments in the iterated learning experiments of Kalish et al. (2007). As mentioned earlier, iterated learning designs involve a chain of learners in which each individual observes data, makes inferences from those data, and uses those inferences to provide data to the next learner in the chain. For function learning specifically, each observation is an (x, y) pair, and the data that a learner passes forward is a subset of his or her y-predictions for new x-values. Ideally, these judgments would reflect samples from the inferred underlying function, with variance attributable only to uncertainty about that function and, potentially, inferred noise around that function. In practice, however, participants' judgments are subject to errors in perception and in recording their judgments, as well as varying degrees of motivation and attention. Rather than attempting to model these factors, which are underdetermined, we chose to apply our mixture-of-experts model as-is to the same tasks that humans faced, looking for the same qualitative patterns that human learners demonstrated. As in Kalish et al.'s experiments, we ran chains in which the first iteration's observations, or the initial data, were drawn from four functions, including positive linear, negative linear, U-shaped, and random functions. For each subsequent round, the model used 50 predictions generated from the previous round, like the human learners.

The human learners' judgments revealed several broad patterns, shown in Fig. 2, which we used as the basis for our evaluation, including: (1) given positive linear initial data, judgments were consistently positive linear over successive rounds; (2) a shift toward positive linear functions for the negative linear, U-shaped, and random initial data, with transitional states reflecting uncertainty or inferences to high noise or multiple overlapping relationships (in almost all chains, there are intermediate states that deviate from any simple, well-formed function); and (3) greater stability and slower transitions in the negative linear case than in the U-shaped and random cases.

Figure 3III shows that Model 3 demonstrates each of these features. Like many human learners and the POLE and EXAM models (Fig. 3I and II), it preserved positive linear relationships, with small deviations from a 0-intercept, 1-slope relationship that are due to our treatment of out-of-range samples: when the model samples y values that are greater than 1 or less than 0, those values are resampled, leading to a slight flattening of the slope. A policy of converting out-of-range samples by replacing them with the most extreme value would reduce this effect. Like human chains and the POLE model's predictions, but not EXAM's, iterations following U-shaped initial data included cases of overlapping positive and negative relationships. As with several human learners, random initial data led the GP model to offer overlapping, weakly sloped linear relationships before shifting towards a single positive linear relationship. Finally, like human learners, the GP model tended to preserve the negative linear relationships more than U-shaped and disordered relationships. The most salient difference between GP Model 3 and human learners is its slower convergence to positive linear relationships.

Fig. 7 Plots comparing human judgments in Experiments 1–3 of Kalish et al. (2004) to the predictions of Models 1, 2, and 3. The points represent individual human judgments, aggregated over the four individuals for whom data were available, while the colors represent log probability densities, with hotter colors representing higher probabilities.

There are several ways in which we might account for this difference in convergence rates. First, our priors over types of relationships were not fitted to human behavior, and a prior more strongly favoring positive linear relationships, or a lower variance in the distribution of slopes, would naturally lead to faster convergence. Second, a more nuanced view of noise would be consistent with the differences in convergence rates. For example, our model assigns a very low probability to "random" relationships, in which points have very high variance, whereas participants might expect that some points are anomalies, analogous to equipment failures. Third, the rapid convergence of human chains might be explained in part by differences between individual human learners. For example, specific individuals might have stronger expectations that relationships are positive and linear, and believe more strongly that their observations are only noisy reflections of the underlying relationship. As with individual differences in knowledge partitioning, all of these possibilities could be explored using within-subjects data.


General discussion

Function learning is one of the core inductive problems that we encounter every day, arising whenever we need to learn the relationship between two continuous variables. Models of function learning have explained the human ability to solve this problem in terms of different cognitive mechanisms, such as inducing rules or generalizing on the basis of similarity. We have shown that these different cognitive mechanisms correspond to different strategies for solving the abstract computational problem of regression, and that both can be expressed as special cases of a Bayesian solution to this problem based on Gaussian processes. This perspective helps to reveal the commonalities between these different mechanisms, and to define models that combine their strengths. The resulting models provide a good fit to human data, performing similarly to the best mechanistic accounts, and provide a way to transparently identify the inductive biases that guide human learners in function learning tasks.

In our introduction, we stated that our model is intended to complement, rather than replace, existing accounts of function learning: we focus on the inductive biases that shape function learning, rather than the processes by which it occurs. In the remainder of this paper, we will discuss the relationship between these levels of analysis, the project of identifying human inductive biases, and some of the ways in which our work could be extended.

The roles of models at different levels of analysis

Our focus in this paper has been on understanding human function learning by identifying the underlying computational problem and the assumptions that seem to yield parallels between optimal solutions to this problem and human behavior. This approach is in the spirit of the rational analysis laid out by Anderson (1990), yielding an explanation of behavior that lies at what Marr (1982) termed the "computational level". The results of this investigation are quite different from those yielded by a more traditional modeling approach operating at what Marr (1982) termed the "algorithmic level" and focusing on identifying the cognitive mechanisms underlying human behavior. The previous models of function learning we have discussed in this paper are defined at this level, making claims about the aspects of human memory and reasoning that contribute to their performance on function learning tasks.

The focus on the computational level establishes a clear set of goals for our model. First, we are not trying to define the single best model of human performance on function learning tasks, because our computational-level model is not in competition with algorithmic-level models. It is entirely possible for our computational-level analysis to be correct, and for it to be executed at the algorithmic level by cognitive mechanisms that resemble existing psychological process models. In this case, we would expect both kinds of models to fit well (and possibly the process models to fit better, since they will capture idiosyncrasies of behavior due to the way in which the computational-level solution is carried out). Our goal is to show that the computational-level solution we have proposed does a good job of capturing human behavior, and existing algorithmic-level models provide a good yardstick against which to measure this performance.

Second, a key part of our contribution is theoretical. We have shown that algorithmic-level mechanisms that seem quite different can in fact be captured in a single theoretical framework at the computational level, and that this leads to new ways of thinking about combining the strengths of these approaches. This kind of contribution has a precedent in other work examining aspects of cognition at different levels of analysis: Ashby and Alfonso-Reese (1995) showed that exemplar and prototype models of categorization could both be viewed as strategies for solving the problem of density estimation that arises when categorization is viewed from the perspective of Bayesian inference. This demonstration of a common underlying computational-level problem (and connections to ideas in statistics) provides the foundation for recent work on rational models of categorization that can interpolate between exemplar and prototype representations (Sanborn et al. 2010). We view our analysis as making a similar contribution for the case of function learning, providing an explicit link between existing cognitive models and ideas from statistics that leads to new ways of understanding human behavior. A probabilistic approach also provides a basis for understanding a broader range of phenomena, beyond patterns of interpolation and extrapolation judgments. For example, one can explain the influence of linguistic and contextual information on function learning (Byun 1995) in terms of priors, and understand how people search for new information (Borji and Itti 2013) or benefit from different kinds of instruction (Lindsey et al. 2013).

Capturing human inductive biases

In inductive problems, such as function learning, the right answer is underdetermined by the available data. This means that doing a good job of solving the problem requires having good inductive biases: those factors other than the data that lead a learner to favor one hypothesis over another (Mitchell 1997). When viewed from the abstract computational level, the key challenge in explaining human inductive inference is characterizing our inductive biases. Bayesian models of cognition make this task particularly clear, as the inductive biases of these models are expressed through the choice of hypothesis space and the prior on hypotheses.

In function learning, the characterization of the inductive biases of a learner is particularly clear: it corresponds to a prior distribution on functions. As we have discussed, defining a prior distribution on functions is challenging, since there are uncountably many possible functions, dependent on an unbounded number of latent variables. The Gaussian process models we have explored provide a succinct way of expressing priors on functions that is nonetheless extremely flexible in the range of distributions that it allows, and thus provide a powerful tool for exploring human inductive biases for function learning. We can express the assumptions behind this prior in terms of a kernel function, which captures the similarity between stimuli, in terms of a set of basis functions, which express a representation of these stimuli, or through samples from the resulting distribution over functions, providing three different ways to indicate the inductive biases that a learner has.

Being able to characterize human inductive biases in terms of a probability distribution over functions also makes it straightforward to build automated learning systems that are guided by the same inductive biases. We can easily take the prior assumed by our Gaussian process models and use it as a component of Gaussian process models used in machine learning or statistics. This provides a natural bridge between human and machine learning, and an opportunity to explore whether using human inductive biases improves the operation of automated systems, as well as to develop automated systems that make inferences that are more comprehensible to human users.

Limitations and future directions

The models we have explored cover a wide range of results from the literature on human function learning, but there are still phenomena that they cannot capture and aspects of human performance that lie outside the considerations that normally inform a computational-level analysis. Addressing these limitations creates some interesting directions for future work.

A basic omission in the formulation of our model is that it is unable to learn cyclic functions. Since these functions are learnable by people, although with significant difficulty (Bott and Heit, 2004; Byun, 1995), this is a weakness that should be addressed. It is straightforward to incorporate a capacity to learn cyclic functions by including a periodic kernel in the mixture of kernels. Incorporation of this additional kernel, with an appropriately low mixture weight, would not change the predictions of the model for non-cyclic functions appreciably. We judged the corresponding increase in the complexity of the model to outweigh the value of capturing these additional phenomena.

The fact that people can learn cyclic functions raises another interesting question: can we build an exhaustive summary of the kinds of relationships that people can learn? In the context of our models, this becomes a question of what kinds of functions have support in people's prior distributions, or what set of kernels should be included in the mixture. Existing results support the inclusion of a relatively small set of kernels: essentially, those that we consider plus a periodic kernel for cyclic functions.

Another issue, also related to our prior over kernels, is that we chose a distribution strongly favoring linear relationships. Is this prior consistent with the idea that a rational analysis should use diffuse priors that capture the statistical structure of the environment (Anderson, 1990)? It is a shortcoming of the current work that we cannot be certain, but we believe that a linearity-biased prior is better than the alternatives. In function learning, it is not realistic to directly measure the statistical structure of the environment, i.e., what functions are truly more or less common: doing so would depend on knowing what combinations of variables are salient to human observers over long periods of time, including, perhaps, our evolutionary history. Further, any census of functions would reflect the cognitive and attentional biases of the people who conducted it. In the absence of ground truth about the frequencies of functions, we believe that the best approach is to look at what relationships people think are more common, using both direct and indirect measures. Previous studies, including many that we have not evaluated here (see Busemeyer et al., 1997, for a summary, and Little and Shiffrin, 2009, for evidence that people infer linear relationships given very noisy data), support the idea that linear relationships are thought to be more common. Among these are results showing that people say that linear relationships occur much more frequently than non-linear ones, and showing that people tend to offer linear relationships when prompted in the absence of data or informative context (Brehmer 1974). Even setting these results aside, a case can be made that linear functions are indeed very common in situations that matter to humans. Under usual (e.g., non-relativistic) conditions, relationships between mass, force, acceleration, velocity, distance, and time can be expressed as collections of linear relationships, and many physical objects have broadly similar shapes at different scales, implying that an object's height is a roughly linear function of its width, for example.

Our focus on the abstract computational-level problem underlying function learning and the nature of ideal solutions to that problem means that there are aspects of human performance that our models cannot capture. For example, our models assume that people have perfect memory for the stimuli and exact recall of the values of the variables presented on each trial. These assumptions are clearly false, and a more realistic treatment of memory and perception might make it possible to tease apart the assumptions in our model that are due to these factors (e.g., high noise parameters) from those that capture human inductive inference (e.g., the set of kernels appearing in the mixture). There are going to be aspects of human performance that cannot be captured by the kind of computational-level models we have considered, such as sensitivity to the order in which stimuli are presented, that may be candidates for identifying algorithmic-level implementations of these ideal solutions (similar to the role of order effects in categorization; Anderson, 1991; Sanborn et al., 2010). As a starting place, it may be worth drawing inspiration from efforts in the machine learning literature to overcome the difficulty of scaling Gaussian processes to large data sets in the face of limited memory (e.g., Hensman et al., 2013).

If future work is to provide a deeper understanding of function learning, including the roles played by priors (and free parameters more generally), learners' limited cognitive resources, and individual differences, it will be necessary to go beyond the evaluation methods that have become standard in function learning, in at least two respects. First, it is now increasingly feasible and important to examine not just overall error rates, or aggregate correlations between model predictions and human judgments, but the accuracy with which a model can predict human judgments for individual points given the values and order of previous training and test points. In addition to making it possible to assess how well a model can account for the process and dynamics of function learning (which include order effects as described above, as well as other phenomena, like the tacit belief that a relationship might be changing over time or trials; Speekenbrink and Shanks 2010), such an approach is more robust to aggregation artifacts (Navarro et al. 2006). Second, we have taken a common approach to fitting and testing cognitive models, finding global parameters or priors that give low error or a high likelihood of the experimental data, but this approach has drawbacks beyond the simple risk of overfitting. Perhaps the most serious of these, in cases where we are interested in the priors that people tacitly use, is that this approach licenses only coarse-grained conclusions about what priors are likely or plausible given the experimental data. In the future, we hope that cheaper computational resources and increasingly efficient algorithms will make it feasible to conduct a Bayesian analysis of our model and others, which would provide a clearer picture of the priors that are consistent with group-level tendencies as well as individual differences (Hemmer et al. 2014).

Finally, a key question for any Bayesian model of cognition is the origins of the inductive biases that are expressed in the prior distribution. Having established a picture of adult inductive biases at the start of an experiment, we can begin to explore questions related to the development of these inductive biases. Within the Bayesian framework, it is possible to make inferences at the level of prior distributions by using hierarchical Bayesian models (Tenenbaum et al. 2006). In the case of our Gaussian process model, people could learn the set of kernels or parameter distributions for flexible kernel types (for work related to these ideas, see Wilson and Adams, 2013; Duvenaud et al., 2013), the probabilities assigned to those kernels, and other parameters of the model. The predictions of this account of the origins of human inductive biases for function learning can be evaluated by comparing the performance of children and adults in function learning tasks and by conducting transfer learning experiments examining how people's inductive biases change through experience; this is an exciting direction for future research.

Conclusions

We have presented a rational account of human function learning, drawing on ideas from machine learning and statistics to show that the two approaches that have dominated previous work, rules and similarity, can be interpreted as two views of the same kind of optimal solution to this problem. Our Gaussian process models combine the strengths of both approaches, using a mixture of kernels to allow systematic extrapolation as well as sensitive non-linear interpolation. Tests of the performance of this model on benchmark datasets show that it can capture some of the basic phenomena of human function learning, and is competitive with existing process models. The result is a clear characterization of human inductive biases for function learning, and a new set of links between human learning and ideas in statistics and machine learning.

Author Note This work was supported by AFOSR grant numbers FA9550-07-1-0351 and FA9550-13-1-0170, NSERC and SSHRC Canada, and the McDonnell Causal Learning Collaborative. Preliminary results from the first two simulations were presented at the Neural Information Processing Systems conference (Griffiths et al. 2009).

Appendix A: Equivalence of Bayesian linear regression and Gaussian processes

This appendix contains a more detailed description of how Bayesian linear regression can be expressed using Gaussian processes, showing first that Bayesian linear regression with normal priors on coefficients can be represented as a mean function and a covariance function relating observed and predicted y values. Next we show that such a representation, which describes a Gaussian process, can be used for prediction directly, setting aside the linear regression interpretation.

Bayesian linear regression

In Bayesian linear regression, the goal is to use n observed x-values, x_n = (x_1, ..., x_n), and their corresponding y-values with added noise, t_n = (t_1, ..., t_n), to predict y_{n+1} from x_{n+1}. Let the hypothesis space include linear functions of the form y = b_0 + b_1 x, where the prior probability of a given function is a multivariate Gaussian distribution on b = (b_0, b_1) with mean zero and covariance \Sigma_b. Applying Eq. 1 then results in a multivariate Gaussian posterior distribution on b (see Bernardo & Smith, 1994) with

E[b \mid x_n, t_n] = \left(\sigma_t^2 \Sigma_b^{-1} + X_n^T X_n\right)^{-1} X_n^T t_n \qquad (9)

\mathrm{cov}[b \mid x_n] = \left(\Sigma_b^{-1} + \frac{1}{\sigma_t^2} X_n^T X_n\right)^{-1} \qquad (10)

where X_n = [1_n \; x_n] (i.e., a matrix with a vector of ones horizontally concatenated with x_n). Since y_{n+1} is simply a linear function of b, applying Eq. 2 yields a Gaussian predictive distribution, with y_{n+1} having mean [1 \; x_{n+1}] E[b \mid x_n, t_n] and variance [1 \; x_{n+1}] \mathrm{cov}[b \mid x_n] [1 \; x_{n+1}]^T. The predictive distribution for t_{n+1} is similar, but with the addition of \sigma_t^2 to the variance.

In the more general case where y is a function of an arbitrary number of basis functions \phi^{(1)}, ..., \phi^{(k)} of x, the same result holds, substituting \Phi = [1_n \; \phi^{(1)}(x_n) \; \cdots \; \phi^{(k)}(x_n)] for X and [1 \; \phi^{(1)}(x_{n+1}) \; \cdots \; \phi^{(k)}(x_{n+1})] for [1 \; x_{n+1}], where \phi(x_n) = [\phi(x_1) \; \cdots \; \phi(x_n)]^T.
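As a minimal illustration of Eqs. 9 and 10 (a sketch we provide for concreteness; the function name, example data, and default prior covariance are our own choices, not the authors'), the posterior over b and the predictive distribution for a new input can be computed as follows.

```python
import numpy as np

def blr_predict(x_obs, t_obs, x_new, sigma_b=np.eye(2), sigma_t_sq=0.01):
    """Bayesian linear regression for y = b0 + b1 * x with a zero-mean Gaussian
    prior on b: posterior mean/covariance (Eqs. 9-10) and the Gaussian
    predictive distribution for y at a new input x_new."""
    X = np.column_stack([np.ones_like(x_obs), x_obs])             # X_n = [1_n  x_n]
    mean_b = np.linalg.solve(sigma_t_sq * np.linalg.inv(sigma_b) + X.T @ X,
                             X.T @ t_obs)                          # Eq. 9
    cov_b = np.linalg.inv(np.linalg.inv(sigma_b) + X.T @ X / sigma_t_sq)  # Eq. 10
    x_star = np.array([1.0, x_new])
    return x_star @ mean_b, x_star @ cov_b @ x_star   # add sigma_t_sq for t_{n+1}

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
t = 0.2 + 0.9 * x                        # noiseless example observations
print(blr_predict(x, t, x_new=1.5))      # predictive mean and variance at x = 1.5
```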

Gaussian processes

The Gaussian process approach amounts to predicting y using x by defining a joint Gaussian distribution on y_{n+1} given x_{n+1} and conditioning on y_n, with covariance matrix

K_{n+1} = \begin{pmatrix} K_n & k_{n,n+1} \\ k_{n,n+1}^T & k_{n+1} \end{pmatrix} \qquad (11)

where K_n depends on the values of x_n, k_{n,n+1} depends on x_n and x_{n+1}, and k_{n+1} depends only on x_{n+1}. If we condition on y_n, the distribution of y_{n+1} is Gaussian with mean k_{n,n+1}^T K_n^{-1} y_n and variance k_{n+1} - k_{n,n+1}^T K_n^{-1} k_{n,n+1}. This approach to prediction uses a Gaussian process, a stochastic process that induces a Gaussian distribution on y based on the values of x. We can extend this approach to predict y_{n+1} from x_{n+1}, t_n, and x_n by adding \sigma_t^2 I_n to K_n, where I_n is the n × n identity matrix, to take into account the additional variance associated with the observations t_n.

The covariance matrix K_{n+1} is specified using a two-place function in x known as a kernel, with K_{ij} = K(x_i, x_j). Any kernel that results in an appropriate (symmetric, positive-definite) covariance matrix for all x can be used. Common kinds of kernels include radial basis functions, e.g.,

K(x_i, x_j) = \theta_1^2 \exp\left(-\frac{1}{\theta_2^2}(x_i - x_j)^2\right) \qquad (12)

with values of y for which values of x are close being correlated, and periodic functions, e.g.,

K(x_i, x_j) = \theta_3^2 \exp\left(\theta_4^2 \cos\left(\frac{2\pi}{\theta_5}[x_i - x_j]\right)\right) \qquad (13)

indicating that values of y for which values of x are close relative to the period \theta_5 are likely to be highly correlated. Gaussian processes thus provide a flexible approach to prediction, with the kernel defining which values of x are likely to have similar values of y.
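The following sketch (again ours, assuming the radial basis kernel of Eq. 12; the helper names and example data are illustrative) shows how the conditional mean and variance above yield predictions from noisy observations.

```python
import numpy as np

def rbf_kernel(a, b, theta1=1.0, theta2=0.3):
    """Radial basis kernel of Eq. 12, evaluated for all pairs of inputs."""
    return theta1 ** 2 * np.exp(-((a[:, None] - b[None, :]) ** 2) / theta2 ** 2)

def gp_predict(x_obs, t_obs, x_new, sigma_t_sq=0.01):
    """GP predictive mean and variance at a single new input, conditioning the
    joint Gaussian of Eq. 11 on noisy observations t (noise added to K_n)."""
    K = rbf_kernel(x_obs, x_obs) + sigma_t_sq * np.eye(len(x_obs))
    k_star = rbf_kernel(x_obs, np.array([x_new])).ravel()
    k_ss = rbf_kernel(np.array([x_new]), np.array([x_new])).item()
    mean = k_star @ np.linalg.solve(K, t_obs)
    var = k_ss - k_star @ np.linalg.solve(K, k_star)
    return mean, var

x = np.linspace(0.0, 1.0, 6)
t = np.sin(2 * np.pi * x)
print(gp_predict(x, t, x_new=0.45))
```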

Appendix B: Priors and parameters

The Gaussian process model makes use of two kinds of prior distributions: priors over different types of kernels, and priors over the parameters of the individual kernel functions. The prior over kernels reflects past results showing that people act in a manner consistent with the assumption that positive linear relationships are more likely than negative linear relationships, which are more likely than quadratic relationships, which are in turn more likely than arbitrary non-linear relationships (Brehmer 1974).

In previous experiments designed to reveal prior beliefs about the prevalence of different kinds of relationships (Kalish et al. 2007), positive linear relationships are approximately 8 times as likely as negative linear relationships, but fewer specifics are available for the rates at which people generate other relationships, beyond the qualitative ordering described by Busemeyer et al. (1997). As a result, we chose prior probabilities proportional to 8, 1, 0.1, and 0.01 for the positive linear, negative linear, quadratic, and radial basis kernels.
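For concreteness, normalizing these weights gives the following prior over kernel types (our arithmetic, rounded to three decimal places):

\pi \propto (8,\ 1,\ 0.1,\ 0.01) \;\Rightarrow\; \pi \approx (0.878,\ 0.110,\ 0.011,\ 0.001)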

The parameters for the kernels were given gamma-distributed priors; these included the variances of the weights and intercept for the linear and quadratic kernels, and the height and distance parameters for the radial basis kernel. In all of these cases, the gamma distribution had a shape parameter of 1.001, which had the effect of discounting values very close to zero. All of the scale parameters were set to one, except for the radial basis function's width, or smoothness.

As discussed in the body text, Models 2 and 3, which are mixtures of Gaussian processes and mixtures of Gaussian process experts, respectively, also included a parameter α determining how dispersed points were expected to be over distinct experts. For Model 3, the prior over x for each expert was specified by assuming two virtual points at the extremes of the x range, and had no free parameters.

Table 3 Model parameters for interpolation and extrapolation phenomena

Model                    σ_t²    θ_l    α
Model 1, Linear          0.01    NA     NA
Model 1, Quadratic       0.01    NA     NA
Model 1, Radial-basis    0.10    1      NA
Model 1, LQ              0.01    NA     NA
Model 1, LR              0.10    1      NA
Model 1, RQ              0.10    1      NA
Model 1, LRQ             0.05    10     NA
Model 2                  0.01    10     1.00
Model 3                  0.01    10     1.00

Note: Model 1 kernels included combinations of linear (L), quadratic (Q), and radial basis (R) functions.

Model fitting

For the fitted parameters, we considered combinations of σ_t² ∈ {0.001, 0.01, 0.05, 0.1, 0.2}, α ∈ {0.01, 0.1, 1, 10}, and θ_l ∈ {1, 10}, where σ_t² is the variance of points around their true function, α is the dispersion parameter for the Chinese Restaurant Process prior on partitions, and θ_l controls the smoothness of functions under the radial basis kernel. In all cases, we used parameters that maximized the mean correlation between model predictions and mean human judgments across the difficulty-of-learning and extrapolation data. The remaining parameters, for the variances of non-noise terms and the radial basis function's height scale, were all fixed at 1. Separate fits were obtained for the POLE data. See Table 3 for the values that were applied to the interpolation and extrapolation experiments, and Table 4 for the values that were applied to the knowledge partitioning and iterated learning experiments.

Appendix C: Inference

This appendix describes the procedures by which we obtained predictions for each of our models.

Gaussian process model (Model 1)

Table 4 Model parameters for iterated learning and knowledge partitioning phenomena

Model      σ_t²     θ_l    α
Model 1    0.2      10     NA
Model 2    0.001    10     10
Model 3    0.001    10     10

Note: Only models incorporating all kernels were considered.

To obtain predictions, we performed probabilistic inference using a Markov chain Monte Carlo (MCMC) algorithm (for an introduction, see Gilks et al., 1996). This algorithm defines a Markov chain for which the stationary distribution is the distribution from which we wish to sample. In our case, this is the posterior distribution over types and the hyperparameters for the kernels θ given the observations x and t. The hyperparameters include all kernel parameters discussed above and the noise in the observations σ_t². Our MCMC algorithm repeats two steps. The first step is sampling the type of function conditioned on x, t, and the current value of θ, with the probability of each type being proportional to the product of p(t_n | x_n) for the corresponding Gaussian process and the prior probability of that type as given by π. The second step is sampling the value of θ given x_n, t_n, and the current type, which is done using a Metropolis-Hastings procedure, proposing a value for θ from a Gaussian distribution centered on the current value and deciding whether to accept that value based on the product of the probability it assigns to t_n given x_n and the prior p(θ). In all cases, this inference procedure was iterated 8,000 times.
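A schematic rendering of one sweep of this sampler (our sketch, not the authors' implementation; log_lik, log_prior_type, and log_prior_theta are assumed callables or arrays supplied by the user) is:

```python
import numpy as np

def mcmc_sweep(theta, n_types, log_lik, log_prior_type, log_prior_theta, prop_sd=0.1):
    """One iteration of the two-step sampler described above. log_lik(k, theta)
    should return log p(t_n | x_n) under kernel type k with hyperparameters theta."""
    # Step 1: sample the kernel type, proportional to marginal likelihood * prior pi.
    log_post = np.array([log_lik(k, theta) + log_prior_type[k] for k in range(n_types)])
    probs = np.exp(log_post - log_post.max())
    k = np.random.choice(n_types, p=probs / probs.sum())

    # Step 2: Metropolis-Hastings update of theta, proposing from a Gaussian
    # centered on the current value.
    proposal = theta + prop_sd * np.random.randn(*np.shape(theta))
    log_ratio = (log_lik(k, proposal) + log_prior_theta(proposal)
                 - log_lik(k, theta) - log_prior_theta(theta))
    if np.log(np.random.rand()) < log_ratio:
        theta = proposal
    return k, theta
```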

Mixtures of Gaussian process experts (Models 2 and 3)

The infinite mixture of Gaussian process model extends the basic model by assigning observations to different experts, or Gaussian processes. The prior probability that an observation will be assigned to a particular expert is determined by a Chinese restaurant process (CRP) prior, where the probability that a new point will be assigned to an expert is proportional to the number of points already assigned to it, and the probability that a point will be assigned to a new expert is determined by α: if expert k has n_k points of a total of N assigned points, a new point will be assigned to it with probability n_k/(N + α), and to a new expert with probability α/(N + α).

For Model 3, the locations of points in x also influence the experts to which they are assigned: each expert is assumed to have a Gaussian distribution over x values with a minimally informative prior, defined by combining improper constant priors for the mean and variance with two virtual points at 0 and 1, the extremes of the range of x. This prior leads to a t-distributed density for new points, conditional on those already assigned to the expert (for details, see Gelman et al., 2004).

The first steps of the inference procedure are identical to those used for Model 1. These are followed by Gibbs sampling for the assignments of points to experts, resampling each point and assigning it to a new or existing expert according to its conditional probability under that expert given all other points and all parameters (Neal 1998). Simulated annealing was used to speed mixing of the sampling chains, which included 8,000 iterations of each step.
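The assignment step can be sketched as follows (our illustration; log_pred is a hypothetical callable returning the log predictive probability of point i under a GP expert fit to the listed member points, and is assumed to close over the data):

```python
import numpy as np

def gibbs_resample_assignments(z, alpha, log_pred, rng=np.random):
    """One Gibbs sweep over expert assignments z (one integer label per point),
    combining CRP weights with each expert's GP predictive probability."""
    n = len(z)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        experts = sorted(set(z[j] for j in others))
        options, log_w = [], []
        for k in experts:                       # existing experts
            members = [j for j in others if z[j] == k]
            options.append(k)
            log_w.append(np.log(len(members)) + log_pred(i, members))
        options.append(max(experts, default=-1) + 1)    # a brand-new expert
        log_w.append(np.log(alpha) + log_pred(i, []))
        w = np.exp(np.array(log_w) - max(log_w))
        z[i] = options[rng.choice(len(options), p=w / w.sum())]
    return z
```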

References

Aldous, D. (1985). Exchangeability and related topics. In Ecole d'ete de probabilites de Saint-Flour, XIII–1983 (pp. 1–198). Berlin: Springer.

Anderson, J.R. (1990). The adaptive character of thought. Hillsdale, NJ: Erlbaum.

Anderson, J.R. (1991). The adaptive nature of human categorization.Psychological Review, 98(3), 409–429.

Ashby, F.G., & Alfonso-Reese, L.A. (1995). Categorization as proba-bility density estimation. Journal of Mathematical Psychology, 39,216–233.

Bernardo, J.M., & Smith, A.F.M. (1994). Bayesian theory. New York:Wiley.

Borji, A., & Itti, L. (2013). Bayesian optimization explains humanactive search. In Burges, C., Bottou, L., Welling, M., Ghahra-mani, Z., & Weinberger, K. (Eds.) Advances in Neural InformationProcessing Systems 26 (pp. 55–63): Curran Associates, Inc.

Bott, L., & Heit, E. (2004). Nonmonotonic extrapolation in functionlearning. Journal of Experimental Psychology: Learning Memory,and Cognition, 30(1).

Brehmer, B. (1971). Subjects’ ability to use functional rules. Psycho-nomic Science, 24, 259–260.

Brehmer, B. (1974). Hypotheses about relations between scaled variables in the learning of probabilistic inference tasks. Organizational Behavior and Human Decision Processes, 11, 1–27.

Brehmer, B. (1976). Subjects’ ability to find the parameters offunctional rules in probabilistic inference tasks. OrganizationalBehavior and Human Performance, 17(2), 388–397.

Brehmer, B., Alm, H., & Warg, L. (1985). Learning and hypothesistesting in probabilistic inference tasks. Scandinavian Journal ofPsychology, 26(1), 305–313.

Busemeyer, J.R., Byun, E., DeLosh, E.L., & McDaniel, M.A. (1997).Learning functional relations based on experience with input-output pairs by humans and artificial neural networks. In Lam-berts, K., & Shanks, D. (Eds.) Concepts and Categories (pp. 405–437). Cambridge: MIT Press.

Byun, E. (1995). Interaction between prior knowledge and type ofnonlinear relationship on function learning. Purdue University:Unpublished doctoral dissertation.

Carroll, J. (1963). Functional learning: The learning of continuousfunctional mappings relating stimulus and response continua. NJ:Educational Testing Service Princeton.

Chater, N., & Vitanyi, P. (2003). Simplicity: a unifying principle incognitive science. Trends in Cognitive Science, 7, 19–22.

DeLosh, E.L., Busemeyer, J.R., & McDaniel, M.A. (1997). Extrapolation: The sine qua non of abstraction in function learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23, 968–986.

Duvenaud, D., Lloyd, J.R., Grosse, R., Tenenbaum, J.B., &Ghahramani, Z. (2013). Structure discovery in nonparametricregression through compositional kernel search. arXiv preprint.arXiv:1302.4922.

Erickson, M., & Kruschke, J. (1998). Rules and exemplars in categorylearning. Journal of Experimental Psychology: General, 127(2),107.

Gelman, A., Carlin, J., Stern, H., & Rubin, D. (2004). Bayesian dataanalysis. CRC Press.

Gilks, W., Richardson, S., & Spiegelhalter, D.J. (Eds.) (1996). MarkovChain Monte Carlo in Practice. UK: Chapman and Hall Suffolk.

Griffiths, T.L., Lucas, C.G., Williams, J.J., & Kalish, M.L. (2009).Modeling human function learning with Gaussian processes.Advances in Neural Information Processing Systems, 21.

Griffiths, T.L., Vul, E., & Sanborn, A. (2012). Bridging levels of anal-ysis for probabilistic models of cognition. Current Directions inPsychological Science, 21(4), 263–268.

Hemmer, P., Tauber, S., & Steyvers, M. (2014). Moving beyond qual-itative evaluations of Bayesian models of cognition. PsychonomicBulletin & Review, 1–15.

Hensman, J., Fusi, N., & Lawrence, N.D. (2013). Gaussian processesfor big data. arXiv preprint. arXiv:1309.6835.

Jacobs, R., Jordan, M., Nowlan, S., & Hinton, G. (1991). Adaptivemixtures of local experts. Neural Computation, 3(1), 79–87.

Kalish, M.L. (2013). Learning and extrapolating a periodic function.Memory & Cognition, 41(6), 886–896.

Kalish, M.L., Griffiths, T.L., & Lewandowsky, S. (2007). Iterated learning: Intergenerational knowledge transmission reveals inductive biases. Psychonomic Bulletin and Review.

Kalish, M.L., Lewandowsky, S., & Kruschke, J. (2004). Population of linear experts: Knowledge partitioning and function learning. Psychological Review, 111, 1072–1099.

Kirby, S. (2001). Spontaneous evolution of linguistic structure: Aniterated learning model of the emergence of regularity and irreg-ularity. IEEE Journal of Evolutionary Computation, 5, 102–110.

Koh, K., & Meyer, D. (1991). Function learning: Induction ofcontinuous stimulus-response relations. Journal of Experimen-tal Psychology: Learning Memory, and Cognition, 17(5), 811–836.

Kwantes, P., & Neal, A. (2006). Why people underestimate y whenextrapolating in linear functions. Journal of Experimental Psy-chology: Learning Memory, and Cognition, 32(5), 1019.

Lewandowsky, S.L., Kalish, M.L., & Ngang, S.K. (2002). Simplifiedlearning in complex situations: Knowledge partitioning in humanlearning. Journal of Experimental Psychology: General, 131, 163–193.

Lindsey, R.V., Mozer, M.C., Huggins, W.J., & Pashler, H. (2013).Optimizing instructional policies. In Burges, C., Bottou, L.,Welling, M., Ghahramani, Z., & Weinberger, K. (Eds.) Advancesin Neural Information Processing Systems 26 (pp. 2778–2786):Curran Associates, Inc.

Little, D.R., & Shiffrin, R.M. (2009). Simplicity bias in the estima-tion of causal functions. In Proceedings of the Thirty-First AnnualConference of the Cognitive Science Society (pp. 1157–1162).

Lucas, C.G., Sterling, D.J., & Kemp, C. (2012). Superspace extrapo-lation reveals inductive biases in function learning. In Miyake, N.,Peebles, D., & Cooper, R.P. (Eds.) Proceedings of the 34th AnnualConference of the Cognitive Science Society. Cognitive ScienceSociety.

MacKay, D. (1995). Probable networks and plausible predictions—a review of practical Bayesian methods for supervised neuralnetworks. Network: Computation in Neural Systems, 6, 469–505.

Marr, D. (1982). Vision. San Francisco, CA: W.H. Freeman.

McDaniel, M., & Busemeyer, J. (2005). The conceptual basis of function learning and extrapolation: Comparison of rule-based and associative-based models. Psychonomic Bulletin & Review, 12(1), 24.

McDaniel, M., Dimperio, E., Griego, J., & Busemeyer, J. (2009). Pre-dicting transfer performance: A comparison of competing functionlearning models. Journal of Experimental Psychology: LearningMemory, and Cognition, 35(1), 173.


Meeds, E., & Osindero, S. (2006). An alternative infinite mixtureof Gaussian process experts. In Weiss, Y., Scholkopf, B., &Platt, J. (Eds.) Advances in Neural Information Processing Sys-tems 18 (pp. 883–890). Cambridge, MA: MIT Press.

Minh, H., Niyogi, P., & Yao, Y. (2006). Mercer's theorem, feature maps, and smoothing. In Learning theory.

Mitchell, T.M. (1997). Machine learning. New York: McGraw Hill.

Navarro, D.J., Griffiths, T.L., Steyvers, M., & Lee, M.D. (2006). Modeling individual differences using Dirichlet processes. Journal of Mathematical Psychology, 50, 101–122.

Neal, R.M. (1994). Priors for infinite networks (Technical Report CRG-TR-94-1). Department of Computer Science, University of Toronto.

Neal, R.M. (1998). Markov chain sampling methods for Dirichlet process mixture models (Technical Report 9815). Department of Statistics, University of Toronto.

Rasmussen, C.E., & Ghahramani, Z. (2002). Infinite mixtures of Gaus-sian process experts. In Dietterich, T., Becker, S., & Ghahramani,Z. (Eds.) Advances in Neural Information Processing Systems 14,pages 881–888. MIT Press.

Rasmussen, C.E., & Williams, C.K.I. (2006). Gaussian processes formachine learning. MIT Press.

Sanborn, A.N., Griffiths, T.L., & Navarro, D.J. (2010). Rationalapproximations to rational models: alternative algorithms for cat-egory learning. Psychological Review, 117(4), 1144.

Shepard, R.N. (1987). Towards a universal law of generalization forpsychological science. Science, 237, 1317–1323.

Speekenbrink, M., & Shanks, D.R. (2010). Learning in a chang-ing environment. Journal of Experimental Psychology: General,139(2), 266.

Tenenbaum, J.B., Griffiths, T.L., & Kemp, C. (2006). Theory-basedBayesian models of inductive learning and reasoning. Trends inCognitive Science, 10, 309–318.

Williams, C.K.I. (1998). Prediction with Gaussian processes: Fromlinear regression to linear prediction and beyond. In Jordan,M.I. (Ed.) Learning in Graphical Models (pp. 599–621). Cam-bridge MA.: MIT Press.

Wilson, A.G., & Adams, R.P. (2013). Gaussian process covariancekernels for pattern discovery and extrapolation arXiv preprint.arXiv:1302.4245.