CHAPTER 9

Structure and Flexibility in Bayesian Models of Cognition

Joseph L. Austerweil, Samuel J. Gershman, Joshua B. Tenenbaum, and Thomas L. Griffiths

In J. R. Busemeyer, J. T. Townsend, Z. Wang, & A. Eidels (Eds.), Oxford Handbook of Computational and Mathematical Psychology. Oxford University Press, 2015.

Abstract

Probability theory forms a natural framework for explaining the impressive success of people at solving many difficult inductive problems, such as learning words and categories, inferring the relevant features of objects, and identifying functional relationships. Probabilistic models of cognition use Bayes's rule to identify probable structures or representations that could have generated a set of observations, whether the observations are sensory input or the output of other psychological processes. In this chapter we address an important question that arises within this framework: How do people infer representations that are complex enough to faithfully encode the world but not so complex that they "overfit" noise in the data? We discuss nonparametric Bayesian models as a potential answer to this question. To do so, first we present the mathematical background necessary to understand nonparametric Bayesian models. We then delve into nonparametric Bayesian models for three types of hidden structure: clusters, features, and functions. Finally, we conclude with a summary and discussion of open questions for future research.

Key Words: inductive inference, Bayesian modeling, nonparametrics, bias-variance tradeoff, categorization, feature representations, function learning, clustering

Introduction

Probabilistic models of cognition explore the mathematical principles behind human learning and reasoning. Many of the most impressive tasks that people perform—learning words and categories, identifying causal relationships, and inferring the relevant features of objects—can be framed as problems of inductive inference. Probability theory provides a natural mathematical framework for inductive inference, generalizing logic to incorporate uncertainty in a way that can be derived from various assumptions about rational behavior (e.g., Jaynes, 2003).

Recent work has used probabilistic models to explain many aspects of human cognition, from memory to language acquisition (for a representative sample, see Chater and Oaksford, 2008). There are existing tutorials on some of the key mathematical ideas behind this approach (Griffiths and Yuille, 2006; Griffiths, Kemp, & Tenenbaum, 2008a) and its central theoretical commitments (Tenenbaum, Griffiths, & Kemp, 2006; Tenenbaum, Kemp, Griffiths, & Goodman, 2010a; Griffiths, Chater, Kemp, Perfors, & Tenenbaum, 2010a). In this chapter, we focus on a recent development in the probabilistic approach that has received less attention—the capacity to support both structure and flexibility.

One of the striking properties of human cognition is the ability to form structured representations in a flexible way: we organize our environment into meaningful clusters of objects, identify discrete features that those objects possess, and learn relationships between those features, without apparent hard constraints on the complexity of these representations. This kind of learning poses a challenge for models of cognition: how can we define models that exhibit the same capacity for structured, flexible learning? And how do we identify the right level of flexibility, so that we only postulate the appropriate level of complexity? A number of recent models have explored answers to these questions based on ideas from nonparametric Bayesian statistics (Sanborn, Griffiths, & Navarro, 2010; Austerweil & Griffiths, 2013; see Gershman and Blei, 2012, for a review), and we review the key ideas behind this approach in detail.

Probabilistic models of cognition tend to focus on a level of analysis that is more abstract than that of many of the other modeling approaches discussed in the chapters of this handbook. Rather than trying to identify the cognitive or neural mechanisms that underlie behavior, the goal is to identify the abstract principles that characterize how people solve inductive problems. This kind of approach has its roots in what Marr (1982) termed the computational level, focusing on the goals of an information processing system and the logic by which those goals are best achieved, and was implemented in Shepard's (1987) search for universal laws of cognition and Anderson's (1990) method of rational analysis. But it is just the starting point for gaining a more complete understanding of human cognition—one that tells us not just why people do the things they do, but how they do them. Although we will not discuss it in this chapter, research on probabilistic models of cognition is beginning to consider how we can take this step, bridging different levels of analysis. We refer the interested reader to one of the more prominent strategies for building such a bridge, namely rational process models (Sanborn et al., 2010).

The plan of this chapter is as follows. First, we introduce the core mathematical ideas that are used in probabilistic models of cognition—the basics of Bayesian inference—and a formal framework for characterizing the challenges posed by flexibility. We then turn to a detailed presentation of the ideas behind nonparametric Bayesian inference, looking at how this approach can be used for learning three different kinds of representations—clusters, features, and functions.

Mathematical Background

In this section, we present the necessary mathematical background for understanding nonparametric Bayesian models of cognition. First, we describe the basic logic behind using Bayes's rule for inductive inference. Then, we explore two of the main types of hypothesis spaces for possible structures used in statistical models: parametric and nonparametric models.1 Finally, we discuss what it means for a nonparametric model to be "Bayesian" and propose nonparametric Bayesian models as methods combining the benefits of both parametric and nonparametric models. This sets up the remainder of the chapter, where we compare the solution given by nonparametric Bayesian methods to how people (implicitly) solve this dilemma when learning associations, categories, features, and functions.

Basic Bayes

After observing some evidence from the environment, how should an agent update her beliefs in the various structures that could have produced the evidence? Given a set of candidate structures and the ability to describe the degree of belief in each structure, Bayes's rule prescribes how an agent should update her beliefs, and it does so in a way that satisfies many normative standards (Oaksford and Chater, 2007; Robert, 1994). Bayes's rule simply states that an agent's belief in a structure or hypothesis h after observing data d from the environment, the posterior P(h|d), should be proportional to the product of two terms: her prior belief in the structure, the prior P(h), and how likely the observed data d would be had it been produced by the candidate structure, called the likelihood P(d|h). This is given by

$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h' \in H} P(d \mid h')\,P(h')},$$

where H is the space of possible hypotheses or latent structures. Note that the summation in the denominator is the normalization constant, which ensures that the posterior probability is still a valid probability distribution (sums to one). In addition to specifying how to calculate the posterior probability of each hypothesis, a Bayesian model prescribes how an agent should update her belief in observing new data d_new from the environment given the previous observations d:

$$P(d_{\text{new}} \mid d) = \sum_{h} P(d_{\text{new}} \mid h)\,P(h \mid d) = \sum_{h} P(d_{\text{new}} \mid h)\,\frac{P(d \mid h)\,P(h)}{\sum_{h' \in H} P(d \mid h')\,P(h')}.$$
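As a concrete illustration of these two equations, the following sketch computes the posterior and the posterior predictive for a toy discrete hypothesis space; the hypotheses, prior, and data are invented for illustration and are not taken from the chapter.

```python
import numpy as np

# Toy hypothesis space: each hypothesis is a possible bias of a coin.
# The hypotheses, prior, and data below are illustrative only.
hypotheses = np.array([0.3, 0.5, 0.9])      # possible biases h
prior = np.array([0.3, 0.4, 0.3])           # P(h)
data = [1, 1, 0, 1]                         # observed flips (1 = heads)

# Likelihood P(d | h) of the whole sequence under each hypothesis.
likelihood = np.array([np.prod([h if x == 1 else 1 - h for x in data])
                       for h in hypotheses])

# Bayes's rule: the posterior is proportional to likelihood times prior.
posterior = likelihood * prior
posterior /= posterior.sum()                # normalization constant

# Posterior predictive: P(d_new = heads | d) = sum_h P(d_new | h) P(h | d).
p_next_heads = np.sum(hypotheses * posterior)
print(posterior, p_next_heads)
```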

The fundamental assumptions of Bayesian models (i.e., what makes them "Bayesian") are (a) agents express their expectations over structures as probabilities and (b) they update their expectations according to the laws of probability. These are not uncontroversial assumptions in psychology (e.g., Bowers and Davis, 2012; Jones and Love, 2011; Kahneman et al., 1982; McClelland et al., 2010; but see the replies by Chater et al., 2011; Griffiths, Chater, Norris, & Pouget, 2012; Griffiths, Chater, Kemp, Perfors, & Tenenbaum, 2010b). However, they are extremely useful because they provide methodological tools for exploring the consequences of adopting different assumptions about the kind of structure that appears in the environment.

Parametric and Nonparametric

One of the first steps in formulating a computational model is a specification of the possible structures that could have generated the observations from the environment, the hypothesis space H. As discussed in the Basic Bayes subsection, each hypothesis h in a hypothesis space H is defined as a probability distribution over the possible observations. So, specifying the hypothesis space amounts to defining the set of possible distributions over events that the agent could observe. To make the model Bayesian, a prior distribution over those hypotheses also needs to be specified.

From a statistical point of view, a Bayesian model with a particular hypothesis space (and prior over those hypotheses) is a solution to the problem of density estimation, which is the problem of estimating the probability distribution over possible observations from the environment.2 In fact, it is the optimal solution given that the hypothesis space faithfully represents how the environment produces observations, and the environment randomly selects a hypothesis to produce observations with probability proportional to the prior distribution.

In general, a probability distribution is a function over the space of observations, which can be continuous, and thus is specified by an infinite number of parameters. So, density estimation involves identifying a function specified by an infinite number of parameters, as theoretically it must specify the probability of each point in a continuous space. From this perspective, a function is analogous to a hypothesis, and the space of possible functions constructed by varying the values of the parameters defines a hypothesis space. Different types of statistical models make different assumptions about the possible functions that define a density, and statistical inference amounts to estimating the parameters that define each function.

The statistical literature offers a useful classification of different types of probability density functions, based on the distinction between parametric and nonparametric models (Bickel and Doksum, 2007). Parametric models take the set of possible densities to be those that can be identified with a fixed number of parameters. An example of a parametric model is one that assumes the density follows a Gaussian distribution with a known variance, but unknown mean. This model estimates the mean of the Gaussian distribution based on observations, and its estimate of the probability of new observations is their probability under a Gaussian distribution with the estimated mean. One property of parametric models is that they assume there exists a fixed set of possible structures (i.e., parameterizations) that does not change regardless of the amount of data observed. For the earlier example, no matter how much data the model is given that is inconsistent with a Gaussian distribution (e.g., a bimodal distribution), its density estimate would still be a Gaussian distribution because it is the only function available to the model.

In contrast, nonparametric models make much weaker assumptions about the family of possible structures. For this to be possible, the number of parameters of a nonparametric model increases with the number of data points observed. An example of a nonparametric statistical model is a Gaussian kernel model, which places a Gaussian distribution at each observation; its density estimate is the average over the Gaussian distributions associated with each observation. In essence, the parameters of this statistical model are the observations, and so the parameters of the model grow with the number of data points. Although the name suggests that nonparametric models do not have any parameters, this is not the case. Rather, the number of parameters in a nonparametric model is not fixed with respect to the amount of data.
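To make the contrast concrete, the sketch below estimates a density from the same (illustrative, randomly generated) data in both ways: a parametric estimator that fits the mean of a single Gaussian with fixed variance, and a nonparametric Gaussian kernel estimator that averages one Gaussian per observation. The data, variance, and bandwidth are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Bimodal data (illustrative): a single Gaussian cannot represent it faithfully.
data = np.concatenate([rng.normal(-2, 0.5, 50), rng.normal(2, 0.5, 50)])

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def parametric_density(x, data, sigma=1.0):
    # Parametric: estimate the mean, keep the Gaussian form no matter what.
    return gaussian_pdf(x, data.mean(), sigma)

def kernel_density(x, data, bandwidth=0.5):
    # Nonparametric: one Gaussian per observation, averaged together.
    return np.mean([gaussian_pdf(x, d, bandwidth) for d in data], axis=0)

grid = np.linspace(-5, 5, 201)
print(parametric_density(grid, data).round(3))
print(kernel_density(grid, data).round(3))
```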

One domain within cognitive science where the distinction between parametric and nonparametric models has been useful is category learning (Ashby and Alfonso-Reese, 1995). The computational problem underlying category learning is identifying a probability distribution associated with each category label. Prototype models (Posner and Keele, 1968; Reed, 1972) approach this problem parametrically, by estimating the mean of a Gaussian distribution for each category. Alternatively, exemplar models (Medin and Schaffer, 1978; Nosofsky, 1986) are nonparametric, using each observation as a parameter; each category's density estimate for a new observation is a function of the sum of the distances between the new observation and previous observations. See Figure 9.1(a) and 9.1(b) for examples of parametric and nonparametric density estimation, respectively.

Fig. 9.1 Density estimators from the same observations (displayed in blue) for three types of statistical models: (a) parametric, (b) nonparametric, and (c) nonparametric Bayesian models. The parametric model estimates the mean of a Gaussian distribution from the observations, which results in the Gaussian density (displayed in black). The nonparametric model averages over the Gaussian distributions centered at each data point (in red) to yield its density estimate (displayed in black). The nonparametric Bayesian model puts the observations into three groups, each of which gets its own Gaussian distribution with mean centered at the center of the observations in that group (displayed in red). The density estimate is formed by averaging over the Gaussian distributions associated with each group (displayed in black).

One limitation of parametric models is that the structure inferred by these models will not be the true structure producing the observations if the true structure is not in the model's hypothesis space. On the other hand, many nonparametric models are guaranteed to infer the true structure given enough (i.e., infinite) observations, a property known as consistency in the statistics literature (Bickel and Doksum, 2007). Thus, as the number of observations increases, nonparametric models have lower error than parametric models when the true structure is not in the hypothesis space of the parametric model. However, nonparametric models typically need more observations to arrive at the true hypothesis when the true hypothesis is in the parametric model's hypothesis space. Box 1 describes the bias-variance trade-off, which expounds this intuition and provides a formal framework for understanding the benefits and problems with each approach.

Putting Them Together: Nonparametric Bayesian Models

Although people are clearly biased toward certain structures by their prior expectations (like parametric models), the bias is a soft constraint, meaning that people seem able to infer arbitrarily complex models given enough evidence (like nonparametric models). In the remainder of the chapter, we propose nonparametric Bayesian models, which are nonparametric models with prior biases toward certain types of structures, as a computational explanation for how people infer structure but maintain flexibility.

Box 1 The bias-variance trade-off

Given that nonparametric models are guaranteed to infer the true structure, why would anyone use a parametric model? Although it is true that nonparametric models will converge to the true structure, they are only guaranteed to do so in the limiting case of an infinite number of observations. However, people do not get an infinite number of observations. Thus, the more appropriate question for cognitive scientists is how do people infer structures from a small number of observations, and which type of model is more appropriate for understanding human performance? There are many structures consistent with the limited and noisy evidence typically observed by people. Furthermore, when observations are noisy, it becomes difficult for an agent to distinguish between noise and systematic variation due to the underlying structure, a problem known as "overfitting."

When there are many structures available to the agent, as is the case for nonparametric models, this becomes a serious issue. So, although nonparametric models have the upside of guaranteed convergence to the appropriate structure, they have the downside of being prone to overfitting. In this box, we discuss the bias-variance trade-off, which provides a useful framework for understanding the trade-off between ultimate convergence to the true structure and overfitting.

The bias-variance trade-off is a mathematical result, which demonstrates that the amount of error an agent is expected to make when learning from observations can be decomposed into the sum of two components (Geman, Bienenstock, & Doursat, 1992; Griffiths et al., 2010): bias, which measures how close the expected estimated structure is to the true structure, and variance, which measures how sensitive the estimated structure is to noise (how much it is expected to vary across different possible observations). Intuitively, increasing the number of possible structures by using a larger parametric or a fully nonparametric model reduces the bias of the model, because a larger hypothesis space increases the likelihood that the true structure is available to the model. However, it also increases the variance of the model, because it will be harder to choose among the structures given noisy observations from the environment. On the other hand, decreasing the number of possible structures available by using a small parametric model increases the bias of the model, because unless the true structure is one of the structures available to the parametric model, it will not be able to infer the true structure. Furthermore, using a small parametric model reduces the variance, because there are fewer structures available to the model that are likely to be consistent with the noisy observations. The trade-off is thus: reducing the bias of a model by using a nonparametric model with fewer prior constraints comes at the cost of less efficient inference and increased susceptibility to overfitting, resulting in larger variance.
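For reference, the decomposition can be written for squared-error loss as follows; this is the standard form of the result cited above (Geman, Bienenstock, & Doursat, 1992), stated here for concreteness rather than reproduced from this chapter. Here f is the true structure, f̂ is the structure estimated from an observed data set, and expectations are taken over possible data sets:

$$\mathbb{E}\big[(\hat{f}(x) - f(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{(squared) bias}} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}},$$

with an additional irreducible noise term when the observations themselves are noisy; note that the bias enters through its square.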

How do people resolve the bias-variance trade-off? In some domains, they are clearly biased because some structures are much easier to learn than others (e.g., linear functions in function learning; Brehmer, 1971, 1974). So in some respects people act like parametric models, in that they use strong constraints to infer structures. However, given enough training, experience, and the right kind of information, people can infer extremely complex structures (e.g., McKinley and Nosofsky, 1995). Thus, in other respects, people act like nonparametric models. How to reconcile these two views remains an outstanding question for theories of human learning. Hierarchical Bayesian models offer one possible answer, where agents maintain multiple hypothesis spaces and infer the appropriate hypothesis space to perform Bayesian inference over, using the distribution of stimuli in the domain (Kemp, Perfors, & Tenenbaum, 2007) and the concepts agents learn over the stimuli (Austerweil and Griffiths, 2010a). In principle, a hierarchical Bayesian model could be formulated that includes both parametric and nonparametric hypothesis spaces, thereby inferring which is appropriate for a given domain. Formulating such a model is an interesting challenge for future research.

Nonparametric Bayesian models are Bayesian because they put prior probabilities over the set of possible structures, which typically include arbitrarily complex structures. They posit probability distributions over structures that can, in principle, be infinitely complex, but they are biased towards "simpler" structures (those representable using a smaller number of units), which reduces the variance that plagues classical nonparametric models. The probability of data under a structure, which can be very large for complex structures that encode each observation explicitly (e.g., each observation in its own category), is traded off against a prior bias toward simpler structures, which allow observations to share parameters. This bias toward simpler structures is a soft constraint, allowing models to adopt more complex structures as new data arrive (this is what makes these models "nonparametric"). Thus, nonparametric Bayesian models combine the benefits of parametric and nonparametric models: a small variance (by using Bayesian priors) and a small bias (by adapting their structure nonparametrically). See Figure 9.1(c) for an example of nonparametric Bayesian density estimation.

Nonparametric Bayesian models can be classified according to the type of hidden structure they posit. For the previously discussed category learning example, the hidden structure is a probability distribution over the space of observations. Thus, the prior is a probability distribution over probability distributions. A common choice for this prior is the Dirichlet process (Ferguson, 1973), which induces a set of discrete clusters, where each observation belongs to a single cluster and each cluster is assigned a randomly drawn value. Combining the Dirichlet process with a model of how observed features are generated by clusters, we obtain a Dirichlet process mixture model (Antoniak, 1974). As we discuss in the following section, elaborations of the Dirichlet process mixture model have been applied to psychological domains as varied as category learning (Anderson, 1991; Sanborn, Canini, & Navarro, 2008b), word segmentation (Goldwater, Griffiths, & Johnson, 2009), and associative learning (Gershman, Blei, & Niv, 2010; Gershman and Niv, 2012).

Although many of the applications of nonparametric Bayesian models in cognitive science have focused on the Dirichlet process mixture model, other nonparametric Bayesian models, such as the Beta process (Hjort, 1990; Thibaux and Jordan, 2007) and the Gaussian process (Rasmussen and Williams, 2006), are more appropriate when people infer probability distributions over observations that are encoded using multiple discrete units or continuous units. For example, feature learning is best described by a hidden structure with multiple discrete units. The Beta process (Griffiths and Ghahramani, 2005, 2011; Hjort, 1990; Thibaux and Jordan, 2007) is one appropriate nonparametric Bayesian model for this example, and as we discuss in the section Inferring Features: What Is a Perceptual Unit?, elaborations of the Beta process have been applied to model feature learning (Austerweil and Griffiths, 2011, 2013), multimodal learning (Yildirim and Jacobs, 2012), and choice preferences (Görür, Jäkel; Miller, Griffiths). Finally, Gaussian processes are appropriate when each observation is encoded using one or more continuous units; we discuss their application to function learning in the section Learning Functions: How Are Continuous Quantities Related? (Griffiths, Lucas, Williams, & Kalish, 2009).

Inferring Clusters: How Are Observations Organized into Groups?

One of the basic inductive problems faced by people is organizing observations into groups, sometimes referred to as clustering. This problem arises in many domains, including category learning (Clapper and Bower, 1994; Kaplan and Murphy, 1999; Pothos and Chater, 2002), motion perception (Braddick, 1993), causal inference (Kemp, Tenenbaum, Niyogi, & Griffiths, 2010), word segmentation (Werker and Yeung, 2005), semantic representation (Griffiths, Steyvers, & Tenenbaum, 2007), and associative learning (Gershman et al., 2010). Clustering is challenging because in real-world situations the number of clusters is often unknown. For example, a child learning language does not know a priori how many words there are in the language. How should a learner discover new clusters?

In this section, we show how clustering can be formalized as Bayesian inference, focusing in particular on how the nonparametric concepts introduced in the previous section can be brought to bear on the problem of discovering new clusters. We then describe an application of the same ideas to associative learning.

A Rational Model of Categorization

Categorization can be formalized as an inductive problem: given the features of a stimulus (denoted by x), infer the category label c. Using Bayes's rule, the posterior over category labels is given by:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{\sum_{c'} P(x \mid c')\,P(c')} = \frac{P(x, c)}{\sum_{c'} P(x, c')}.$$

From this point of view, category learning is fundamentally a problem of density estimation (Ashby and Alfonso-Reese, 1995) because people are estimating a probability distribution over the possible observations from each category. Probabilistic models differ in the assumptions they make about the joint distribution P(x, c). Anderson (1991) proposed that people model this joint distribution as a mixture model:

$$P(x, c) = \sum_{z} P(x, c \mid z)\,P(z),$$

where z ∈ {1, ..., K} denotes the cluster assigned to x (z is the traditional notation for a cluster, and is analogous to the hypothesis h in the previous section). From a generative perspective, observations are generated by a mixture model from the environment according to the following process: to sample observation n, first sample its cluster z_n from P(z), and then sample the observation x_n and its category c_n from the joint distribution specified by the cluster, P(x, c | z_n). Each distribution specified by a cluster might be simple (e.g., a Gaussian), but their mixture can approximate arbitrarily complicated distributions. Because each observation belongs to only one cluster, the assignments z_n = {z_1, ..., z_n} encode a partition of the items into K distinct clusters, where a partition is a grouping of items into mutually exclusive clusters. When the value of K is specified, this generative process defines a simple probabilistic model of categorization, but what should be the value of K?
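To illustrate the inference this mixture formulation licenses, the sketch below computes P(c | x) by summing over cluster assignments, with K and all cluster parameters simply fixed by hand (every number is invented for illustration; following the standard mixture assumption, the stimulus and the label are treated as conditionally independent given the cluster).

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative mixture with K = 3 clusters over a one-dimensional stimulus x.
cluster_prior = np.array([0.5, 0.3, 0.2])          # P(z)
cluster_means = np.array([-2.0, 0.0, 3.0])         # parameters of P(x | z)
cluster_sds   = np.array([1.0, 0.5, 1.0])
# P(c | z): each cluster has its own distribution over the two category labels.
category_given_cluster = np.array([[0.9, 0.1],
                                   [0.5, 0.5],
                                   [0.1, 0.9]])

def category_posterior(x):
    # P(x, c) = sum_z P(x | z) P(c | z) P(z), then normalize over c.
    px_given_z = normal_pdf(x, cluster_means, cluster_sds)
    joint = (cluster_prior * px_given_z) @ category_given_cluster
    return joint / joint.sum()

print(category_posterior(2.5))
```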

To address the question of how to select the value of K, Anderson assumed that K is not known a priori, but rather learned from experience, such that K can be increased as new data are observed. As the number of clusters grows with observations and each cluster has associated parameters defining its probability distribution over observations, this rational model of categorization is nonparametric. Anderson proposed a prior on partitions that sequentially assigns observations to clusters according to:

$$P(z_n = k \mid \mathbf{z}_{n-1}) = \begin{cases} \dfrac{m_k}{n-1+\alpha} & \text{if } m_k > 0 \ \text{(i.e., } k \text{ is old)} \\[8pt] \dfrac{\alpha}{n-1+\alpha} & \text{if } m_k = 0 \ \text{(i.e., } k \text{ is new)} \end{cases}$$

where m_k is the number of items in z_{n−1} assigned to cluster k, n is the total number of items observed so far, and α ≥ 0 is a parameter that governs the total number of clusters.3 As pointed out by Neal (2000), the process proposed by Anderson is equivalent to a distribution on partitions known as the Chinese restaurant process (CRP; Aldous, 1985; Blackwell and MacQueen, 1973). Its name comes from the following metaphor (illustrated in Figure 9.2): Imagine a Chinese restaurant with an unbounded number of tables (clusters), where each table can seat an unbounded number of customers (observations). The first customer enters and sits at the first table. Subsequent customers sit at an occupied table with probability proportional to how many people are already sitting there (m_k), and at a new table with probability proportional to α. Once all the customers are seated, the assignment of customers to tables defines a partition of observations into clusters.
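A minimal simulation of this sequential seating process makes the prior concrete (a sketch; the value of α and the number of customers are arbitrary, and the function name is ours).

```python
import numpy as np

def sample_crp(n_customers, alpha, rng=None):
    """Sample a partition from the Chinese restaurant process.

    Returns a list of table (cluster) indices, one per customer, generated by
    the sequential seating rule: join table k with probability m_k / (n-1+alpha),
    start a new table with probability alpha / (n-1+alpha).
    """
    rng = rng or np.random.default_rng()
    assignments = []
    counts = []                              # m_k for each existing table
    for n in range(1, n_customers + 1):
        probs = np.array(counts + [alpha], dtype=float) / (n - 1 + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):             # a new table was chosen
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

print(sample_crp(10, alpha=1.0, rng=np.random.default_rng(0)))
```

Running this repeatedly shows the rich-get-richer behavior described above: a few large clusters and a slowly growing number of small ones.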

The CRP arises in a natural way from a nonparametric mixture modeling framework (see Gershman and Blei, 2012, for more details). To see this, consider a finite mixture model where the cluster assignments are drawn from:

$$\theta \sim \text{Dirichlet}(\alpha/K, \ldots, \alpha/K)$$
$$z_i \sim \text{Multinomial}(\theta), \quad \text{for } i = 1, \ldots, n,$$

where Dirichlet(·) denotes the K-dimensional Dirichlet distribution and θ roughly corresponds to the relative weight of each block in the partition. As the number of clusters increases to infinity (K → ∞), the distribution this model induces on z_n is equivalent to the CRP. Another view of this model is given by a seemingly unrelated process, the Dirichlet process (Ferguson, 1973), which is a probability distribution over discrete probability distributions. It directly generates the partition and the parameters associated with each block in the partition. Marginalizing over all the possible ways of getting the same partition from a Dirichlet process defines a related distribution, the Pólya urn (Blackwell and MacQueen, 1973), which is equivalent to the CRP when the parameter associated with each block is ignored. For this reason, a mixture model with a CRP prior on partitions is known as a Dirichlet process mixture model (Antoniak, 1974).
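The limiting relationship can be checked directly. Integrating out θ in the finite model gives the sequential rule P(z_n = k | z_1, ..., z_{n−1}) = (m_k + α/K)/(n − 1 + α), a standard consequence of Dirichlet-multinomial conjugacy stated here for concreteness rather than quoted from the chapter. The total probability of starting a previously unused cluster is then (number of unused clusters) × (α/K)/(n − 1 + α), which approaches the CRP's α/(n − 1 + α) as K → ∞. The sketch below (with arbitrary illustrative values of α, n, and the current cluster counts) shows this convergence numerically.

```python
# Probability that the next observation starts a new cluster, given the partition
# so far, under the collapsed finite model and under the CRP.
alpha, n_prev = 2.0, 9          # 9 observations already assigned (illustrative)
counts = [5, 3, 1]              # m_k for the currently occupied clusters

def p_new_finite(K):
    unused = K - len(counts)
    return unused * (alpha / K) / (n_prev + alpha)

p_new_crp = alpha / (n_prev + alpha)
for K in [5, 50, 500, 5000]:
    print(K, round(p_new_finite(K), 4), "vs CRP:", round(p_new_crp, 4))
```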

One of the original motivations for developing the rational model of categorization was to reconcile two important observations about human category learning. First, in some cases, the confidence with which people assign a new stimulus to a category is inversely proportional to its distance from the average of the previous stimuli in that category (Reed, 1972). This, in conjunction with other data on central tendency effects (e.g., Posner and Keele, 1968), has been interpreted as people abstracting a "prototype" from the observed stimuli. On the other hand, in some cases, people are sensitive to specific stimuli (Medin and Schaffer, 1978), a finding that has inspired exemplar models that memorize the entire stimulus set (e.g., Nosofsky, 1986). Anderson (1991) pointed out that his rational model of categorization can capture both of these findings, depending on the inferred partition structure: when all items are assigned to the same cluster, the model is equivalent to forming a single prototype, whereas when all items are assigned to unique clusters, the model is equivalent to an exemplar model. However, in most circumstances, the rational model of categorization will partition the items in a manner somewhere between these two extremes. This is a desirable property that other recent models of categorization have adopted (Love, Medin, & Gureckis, 2004). Finally, the CRP serves as a useful component in many other cognitive models (see Figure 7 and Box 2).

Fig. 9.2 (a) The culinary representation of the Chinese restaurant process and (b) the cluster assignments implied by it.

Associative Learning

As its name suggests, associative learning has traditionally been viewed as the process by which associations are learned between two or more stimuli. The paradigmatic example is Pavlovian conditioning, in which a cue and an outcome (e.g., a tone and a shock) are paired together repeatedly. When done with rats, this procedure leads to the rat freezing in anticipation of the shock whenever it hears the tone.

Associative learning can be interpreted in terms of a probabilistic causal model, which calculates the probability that one variable, called the cue (e.g., a tone), causes the other variable, called the outcome (e.g., a shock). Here y encodes the presence or absence of the outcome, and x similarly encodes the presence or absence of the cue. The model assumes that y is a noisy linear function of x: y ∼ N(wx, σ²). The parameter w encodes the associative strength between x and y, and σ² parameterizes the variability of their relationship. This model can be generalized to the case in which multiple cues are paired with an outcome by assuming that the associations are additive: y ∼ N(Σ_i w_i x_i, σ²), where i ranges over the cues. The linear-Gaussian model also has an interesting connection to classical learning theories such as the Rescorla-Wagner model (Rescorla and Wagner, 1972), which can be interpreted as assuming a Gaussian prior on w and carrying out Bayesian inference on w (Dayan, Kakade, & Montague, 2000; Kruschke, 2008).
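To illustrate this connection, the sketch below performs exact Bayesian updating of the weights under the linear-Gaussian model with a Gaussian prior; the posterior mean update has the error-driven form of the Rescorla-Wagner rule, with the Kalman gain acting as a learning rate. The prior, noise variance, and training trials are arbitrary illustrative choices, and the function name is ours.

```python
import numpy as np

def bayes_linear_update(mean, cov, x, y, noise_var=0.5):
    """One trial of Bayesian updating for y ~ N(w . x, noise_var),
    starting from a Gaussian posterior N(mean, cov) over the weights w."""
    x = np.asarray(x, dtype=float)
    gain = cov @ x / (x @ cov @ x + noise_var)     # Kalman gain
    prediction_error = y - mean @ x
    mean = mean + gain * prediction_error          # Rescorla-Wagner-like update
    cov = cov - np.outer(gain, x) @ cov
    return mean, cov

# Two cues; start from the prior N(0, I).
mean, cov = np.zeros(2), np.eye(2)
trials = [([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0), ([0.0, 1.0], 0.0)]  # (cues, outcome)
for x, y in trials:
    mean, cov = bayes_linear_update(mean, cov, x, y)
print(mean)
```

When the noise variance is large relative to the prior variance, the gain shrinks and learning is slow; when it is small, the weights move quickly toward the observed outcomes.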

Despite the successes of the Rescorla-Wagner model and its probabilistic variants, they incorrectly predict that there should only be learning when the prediction error is nonzero (i.e., when y − Σ_i w_i x_i ≠ 0), but people and animals can still learn in some cases where the prediction error is zero. For example, in sensory preconditioning (Brogden, 1939), two cues (A and B) are presented together without an outcome; when A is subsequently paired with an outcome, cue B acquires associative strength despite never being paired with the outcome. Because A and B presumably start out with zero associative strength and there are no prediction errors during the preconditioning phase, the Rescorla-Wagner model predicts that there should be no learning. The model fails to explain how preconditioning enables B to subsequently acquire associative strength from A-outcome training.

Box 2 Composing Richer Nonparametric Models

Although this chapter has focused on several of the most basic nonparametric Bayesian models and their applications to basic psychological processes, these components can also be composed to build richer accounts of more sophisticated forms of human learning and reasoning (bottom half of Figure 7). These composites greatly extend the scope of phenomena that can be captured in probabilistic models of cognition. In this box, we discuss a number of these composite models.

In many domains, categories are not simply a flat partition of entities into mutually exclusive classes. Often they have a natural hierarchical organization, as in a taxonomy that divides life forms into animals and plants; animals into mammals, birds, fish, and other forms; mammals into canidae, felines, primates, rodents, . . . ; canidae into dogs, wolves, foxes, coyotes, . . . ; and dogs into many different breeds. Category hierarchies can be learned via nonparametric models based on nested versions of the CRP, in which each category at one level gives rise to a CRP of subcategories at the level below (Griffiths et al., 2008b; Blei, Griffiths, & Jordan, 2010).

Category learning and feature learning are typically studied as distinct problems (as discussed in the sections Inferring Clusters: How Are Observations Organized into Groups? and Inferring Features: What Is a Perceptual Unit?), but in many real-world situations people can jointly discover how best to organize objects into classes and which features best support these categories. Nonparametric models that combine the CRP and IBP can capture these joint inferences (Austerweil and Griffiths, 2013). The model of Salakhutdinov et al. (2012) extends this idea to hierarchies, using the nested CRP combined with a hierarchical Dirichlet process topic model to jointly learn hierarchically structured object categories and hierarchies of part-like object features that support these categories.

The CrossCat model (Shafto et al., 2011) allows us to move beyond another limitation of typical models of categorization (parametric or nonparametric): the assumption that there is only a single best way to categorize a set of entities. Many natural domains can be represented in multiple ways: animals may be thought of in terms of their taxonomic groupings or their ecological niches; foods may be thought of in terms of their nutritional content or social role; products may be thought of in terms of function or brand. CrossCat discovers multiple systems of categories of entities, each of which accounts for a distinct subset of the entities' observed attributes, by nesting CRPs over entities inside CRPs over attributes.

Traditional approaches to categorization treat each entity individually, but richer semantic structure can be found by learning categories in terms of how groups of entities relate to each other. The Infinite Relational Model (IRM; Kemp et al., 2006, 2010) is a nonparametric model that discovers categories of objects that not only share similar attributes, but also participate in similar relations. For instance, a data set for consumer choice could be characterized in terms of which consumers bought which products, which features are present in which products, which demographic properties characterize which users, and so on. IRM could then discover how to categorize products and consumers (and perhaps also features and demographic properties), and simultaneously uncover regularities in how these categories relate (for example, that consumers in class X tend to buy products in class Y).

Nonparametric models defined over graph structures, such as the graph-based GP models of Kemp and Tenenbaum (2008, 2009), can capture how people reason about a wider range of dependencies between the properties of entities and the relations between entities, allowing that objects in different domains can be related in qualitatively different ways. For instance, the properties of cities might be best explained by their relative positions in a two-dimensional map, the voting patterns of politicians by their orientation along a one-dimensional liberal-conservative axis, and the properties of animals by their relation in a taxonomic tree. We could also distinguish animals' anatomical and physiological properties, which are best explained by the taxonomy, from behavioral and ecological properties that might be better explained by their relation in a directed graph such as a food web. Perhaps most intriguingly, nonparametric Bayesian models can be combined with symbolic grammars to account for how learners could explore the broad landscape of different model structures that might describe a given domain and arrive at the best model (Kemp and Tenenbaum, 2008; Grosse et al., 2012). A grammar is used to generate a space of qualitatively different model families, ranging from simple to complex, each of which defines a predictive model for the observed data based on a GP, CRP, IBP, or other nonparametric process. These frameworks have been used to build more human-like machine learning and discovery systems, but they remain to be tested as psychological accounts of how humans learn domain structures.

Sensory preconditioning and other related findings have prompted consideration of alternative probabilistic models for associative learning. Courville, Daw, and Touretzky (2006) proposed that people and animals posit latent causes to explain their observations in Pavlovian conditioning experiments. According to this idea, a single latent cause generates both the cues and outcomes. Latent cause models are powerful because they can explain why learning occurs in the absence of prediction errors. For example, during sensory preconditioning, the latent cause captures the covariation between the two cues; subsequent conditioning of A increases the probability that B will also be accompanied by the outcome.

An analogous question about how to pick the number of clusters in a mixture model arises in associative learning: How many latent causes should there be in a model of associative learning? To address this question, Gershman and colleagues (Gershman et al., 2010; Gershman and Niv, 2012) used the CRP as a prior on latent causes. This allows the model to infer new latent causes when the sensory statistics change, but otherwise it prefers a small number of latent causes. Unlike previous models that define the number of latent causes a priori, Gershman et al. (2010) showed that this model could explain why extinction does not tend to erase the original association: Extinction training provides evidence that a new latent cause is active. For example, when conditioning and extinction occur in different contexts, the model infers a different latent cause for each context; upon returning to the conditioning context, the model predicts a renewal of the conditioned response, consistent with empirical findings (see Bouton, 2004). By addressing the question of how agents infer the number of latent causes, the model offered new insight into a perplexing phenomenon.

Inferring Features: What Is a Perceptual Unit?

The types of latent structures that people use to represent a set of stimuli can be far richer than clustering the stimuli into groups. For example, consider the following set of animals: domestic cats, dogs, goldfish, sharks, lions, and wolves. Although they can be represented as clusters (e.g., PETS and WILD ANIMALS or FELINES, CANINES, and SEA ANIMALS), another way to represent the animals is using features, or multiple discrete units per animal (e.g., a cat might be represented with the following features: HAS TAIL, HAS WHISKERS, HAS FUR, IS CUTE, etc.). Feature representations can be used to solve problems arising in many domains, including choice behavior (Tversky, 1972), similarity (Nosofsky, 1986; Tversky, 1977), and object recognition (Palmer, 1999; Selfridge and Neisser, 1960). Analogous to clustering, the appropriate feature representation or even the number of features for a domain is not known a priori. In fact, a common criticism of feature-based similarity is that there is an infinite number of potential features that can be used to represent any stimulus and that human judgments are mostly determined by the features selected to be used in a given context (Goldmeier, 1972; Goodman, 1972; Murphy and Medin, 1985). How do people infer the appropriate features to represent a stimulus in a given context?

In this section, we describe how the problem of inferring feature representations can be cast as a problem of Bayesian inference, where the hypothesis space is the space of possible feature representations. Because there is an infinite number of feature representations, the model used to solve this problem will be a nonparametric Bayesian model. Then, we illustrate two psychological applications.

A Rational Model of Feature Inference

Analogous to other Bayesian models of cognition, defining a rational model of feature inference amounts to applying Bayes's rule to a specification of the hypothesis space, how hypotheses produce observations (the likelihood), and the prior probability of each hypothesis (the prior). Following previous work by Austerweil and Griffiths (2011), we first define these three components for a Bayesian model with a finite feature repository and then define a nonparametric Bayesian model by allowing an infinite repository of features. This allows the model to infer a feature representation without presupposing the number of features ahead of time.

The computational problem of feature representation inference is as follows: Given the D-dimensional raw primitives for a set of N stimuli X (each object is a D-dimensional row vector x_n), infer a feature representation that encodes the stimuli and adheres to some prior expectations. We decompose a feature representation into two components: an N × K feature ownership matrix Z, which is a binary matrix encoding which of the K features each stimulus has (i.e., z_nk = 1 if stimulus n has feature k, and z_nk = 0 otherwise), and a K × D feature image matrix Y, which encodes the consequence of a stimulus having each feature. In this case, the hypothesis space is the Cartesian product of possible feature ownership matrices and possible feature image matrices. As we discuss later in further detail, the precise format of the feature image matrix Y depends on the format of the observed raw primitives. For example, if the stimuli are images of objects and the primitives are D binary pixels that encode whether light was detected in each part of the retina, then a stimulus and a feature image, x and y respectively, are both D-dimensional binary vectors. So if x is the image of a mug, y could be the image of its handle.

Applying Bayes’s rule and assuming that the fea-ture ownership and image matrices are independenta priori, inferring a feature representation amountsto optimizing the product of three terms

P(Z,Y|X) ∝ P(X|Y,Z)P(Y)P(Z),

where P(X|Y,Z), the likelihood, encodes howwell each object xn is reconstructed by combiningtogether the feature images Y of the features theobject has, which is given by zn, P(Y) encodes priorexpectations about feature images (e.g., Gestaltlaws), and P(Z) encodes prior expectations aboutfeature ownership matrices.4 As the likelihood andfeature image prior are more straightforward to

196 h i g h e r l e v e l c o g n i t i o n

Page 11: CHAPTER Structure and Flexibility in Bayesian Models of ...gershmanlab.webfactional.com/pubs/Austerweil15.pdf · CHAPTER9 Structure and Flexibility in Bayesian Models of Cognition

define and specific to the format of the observedprimitives, we first derive a sensible prior distri-bution over all possible feature ownership matricesbefore returning our attention to them.

Before delving into the case of an infinite number of potential features, we derive a prior distribution on feature ownership matrices that has a finite and known number of features K. As the elements of a feature ownership matrix are binary, we can define a probability distribution over the matrix by flipping a weighted coin with bias π_k for each element z_nk. We do not observe π_k, and so we assume a Beta distribution as its prior. This corresponds to the following generative process

$$\pi_k \stackrel{\text{iid}}{\sim} \text{Beta}(\alpha/K, 1), \quad \text{for } k = 1, \ldots, K$$
$$z_{nk} \mid \pi_k \stackrel{\text{iid}}{\sim} \text{Bernoulli}(\pi_k), \quad \text{for } n = 1, \ldots, N.$$

Due to the conjugacy of Bernoulli likelihoods and Beta priors, it is relatively simple to integrate out π_1, ..., π_K to arrive at a closed-form probability distribution, P(Z|α), over finite feature ownership matrices. See Bernardo and Smith (1994) and Griffiths and Ghahramani (2011) for details.
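A direct simulation of this finite generative process (a sketch; the values of N, K, and α are arbitrary, and the function name is ours) shows its key property: each object owns approximately α features on average when K is large.

```python
import numpy as np

def sample_finite_feature_matrix(N, K, alpha, rng=None):
    """Sample an N x K binary feature ownership matrix Z from the finite
    Beta-Bernoulli model: pi_k ~ Beta(alpha/K, 1), z_nk ~ Bernoulli(pi_k)."""
    rng = rng or np.random.default_rng()
    pi = rng.beta(alpha / K, 1.0, size=K)          # per-feature probabilities
    return (rng.random((N, K)) < pi).astype(int)   # broadcast across objects

Z = sample_finite_feature_matrix(N=8, K=50, alpha=3.0, rng=np.random.default_rng(0))
print(Z.sum(axis=1))    # features owned by each object (roughly alpha on average)
```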

Analogous to the method discussed earlier for constructing the CRP as the infinite limit of a finite model, taking the limit of P(Z|α) as K → ∞ yields a valid probability distribution over feature ownership matrices with an infinite number of potential features.5 Note that as K → ∞, the prior on feature weights gets concentrated at zero (because α/K → 0). This results in an infinite number of columns that simply contain zeroes, and thus these features will have no consequence for the set of stimuli we observed (as they are not assigned to any stimuli). Because both the number of columns K → ∞ and the probability of an object taking a feature (the probability that z_nk = 1) π_k → 0 at corresponding rates, there is a finite, but random, number of columns that have at least one nonzero element (the features that have been taken by at least one stimulus). This limiting distribution is called the Indian buffet process (IBP; Griffiths and Ghahramani, 2005, 2011), and it is given by the following equation

$$P(Z \mid \alpha) = \frac{\alpha^{K_+}}{\prod_{h=1}^{2^N - 1} K_h!} \exp\left(-\alpha \sum_{n=1}^{N} \frac{1}{n}\right) \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!},$$

where K_+ is the number of columns with at least one nonzero entry (the number of features taken by at least one object), and K_h is the number of features with history h, where a history can be thought of as the column of the feature interpreted as a binary number. The term containing the histories penalizes features that have equivalent patterns of ownership; it is a method for indexing features with equivalent ownership patterns.

Analogous to the CRP, the probability distribution given by this limiting process is equivalent to the probability distribution on binary matrices implied by a simple sequential culinary metaphor. In this culinary metaphor, "customers," corresponding to the stimuli or rows of the matrix, enter an Indian buffet and take dishes, corresponding to the features or columns of the matrix, according to a series of probabilistic decisions based on how the previous customers took dishes. When the first customer enters the restaurant, she takes a number of new dishes sampled from a Poisson distribution with parameter α. As customers sequentially enter the restaurant, each customer n takes a previously sampled dish k with probability m_k/n and then takes a number of new dishes sampled from a Poisson distribution with parameter α/n.

Figure 9.3(a) illustrates an example of a potential state of the IBP after three customers have entered the restaurant. The first customer entered the restaurant and sampled two new dishes from the Poisson probability distribution with parameter α. Next, the second customer entered the restaurant and took each of the old dishes with probability 1/2 and sampled one new dish from the Poisson probability distribution with parameter α/2. Then, the third customer entered the restaurant and took the first dish with probability 2/3, did not take the second dish with probability 1/3, and did not take the third dish with probability 2/3. The equivalent feature ownership matrix represented by this culinary metaphor is shown in Figure 9.3(b).
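The culinary metaphor translates directly into a sampler over feature ownership matrices; the sketch below simulates it (the value of α and the number of objects are arbitrary, and the function name is ours).

```python
import numpy as np

def sample_ibp(n_objects, alpha, rng=None):
    """Sample a binary feature ownership matrix from the Indian buffet process
    by simulating the culinary metaphor: customer n takes old dish k with
    probability m_k / n and then Poisson(alpha / n) new dishes."""
    rng = rng or np.random.default_rng()
    dish_counts = []                         # m_k for each dish sampled so far
    rows = []
    for n in range(1, n_objects + 1):
        old = [int(rng.random() < m / n) for m in dish_counts]
        n_new = rng.poisson(alpha / n)
        for k, took in enumerate(old):
            dish_counts[k] += took
        dish_counts.extend([1] * n_new)
        rows.append(old + [1] * n_new)
    K_plus = len(dish_counts)
    Z = np.zeros((n_objects, K_plus), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

print(sample_ibp(5, alpha=2.0, rng=np.random.default_rng(0)))
```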

As previously encountered features are sampled with probability proportional to the number of times they were previously taken, and the probability of new features decays as more customers enter the restaurant (it is Poisson distributed with parameter given by α/N, where N is the number of customers), the IBP favors feature representations that are sparse and have a small number of features. Thus, it encodes a natural prior expectation toward feature representations with a few features, and it can be interpreted as a simplicity bias.

Fig. 9.3 (a) The culinary representation of the Indian buffet process and (b) the feature ownership matrix implied by it.

Now that we have derived the feature ownership prior P(Z), we turn to defining the feature image prior P(Y) and the likelihood P(X|Y, Z), which will complete the specification of the nonparametric Bayesian model. Remember that the feature image matrix contains the consequences of a stimulus having a feature in terms of the raw primitives. Thus, in the nonparametric case, the feature image matrix can be thought of as containing the D-dimensional consequence of each of the K_+ used features. The infinitely many unused features can be ignored because a feature only affects the representation of a stimulus if it is used by that stimulus. In most applications to date, the feature image prior is mostly "knowledge-less." For example, the standard prior used for stimuli that are binary images is an independent Bernoulli prior, where each pixel is on with probability φ independent of the other pixels in the image. One exception is one of the simulations by Austerweil and Griffiths (2011), who demonstrated that using a simple proximity bias (an Ising model that favors adjacent pixels sharing values; Geman and Geman, 1984) as the feature image prior results in more psychologically plausible features: The feature images inferred without the proximity bias were speckled and not contiguous, whereas the feature images inferred with the proximity bias were contiguous. For grayscale and color images (Austerweil and Griffiths, 2011; Hu, Zhai, Williamson, & Boyd-Graber, 2012), the standard "knowledge-less" prior generates each pixel from a Gaussian distribution independent of the other pixels in the image.

Analogous to the feature image prior, the choice of the likelihood depends on the format of the raw dimensional primitives. In typical applications, the likelihood assumes that the reconstructed stimuli are given by the product of the feature ownership and image matrices, ZY, and it penalizes the deviation between the reconstructed and observed stimuli (Austerweil and Griffiths, 2011). For binary images, the noisy-OR likelihood (Pearl, 1988; Wood, Griffiths, & Ghahramani, 2006) is used, which amounts to thinking of each feature as a "hidden cause" and has support as a psychological explanation for how people reason about observed effects being produced by multiple hidden causes (Cheng, 1997; Griffiths and Ghahramani, 2005). For grayscale images, the linear-Gaussian likelihood is typically used (Griffiths and Ghahramani, 2005; Austerweil and Griffiths, 2011), which is optimal under the assumption that the reconstructed stimuli are given by ZY and that the metric of success is the sum of squared errors between the reconstructed and observed stimuli. Recent work in machine learning has started to explore more complex likelihoods, such as ones that explicitly account for depth and occlusion (Hu, Zhai, Williamson, & Boyd-Graber, 2012). Formalizing more psychologically valid feature image priors and likelihoods is a mostly unexplored area of research that demands attention.
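As an illustration of these two observation models, the sketch below computes the noisy-OR log-likelihood for binary images and the linear-Gaussian log-likelihood for real-valued images. The function names and the particular parameterization (a per-feature efficacy `lam`, a leak probability `eps`, and a noise standard deviation `sigma`) are assumptions made for the example rather than the exact settings used in the cited papers.

```python
import numpy as np

def noisy_or_loglik(X, Z, Y, lam=0.9, eps=0.01):
    """Log-likelihood of binary images X under a noisy-OR observation model.

    X: N x D binary stimuli; Z: N x K feature ownership; Y: K x D binary feature images.
    A pixel is on with probability 1 - (1 - lam)**(number of active features
    that include it) * (1 - eps), so each feature acts as an independent
    "hidden cause" and eps is a background (leak) probability.
    """
    counts = Z @ Y                      # how many of a stimulus's features turn on each pixel
    p_on = 1 - (1 - lam) ** counts * (1 - eps)
    return np.sum(X * np.log(p_on) + (1 - X) * np.log(1 - p_on))

def linear_gaussian_loglik(X, Z, Y, sigma=0.5):
    """Log-likelihood of real-valued images under the linear-Gaussian model:
    each stimulus is the sum of its feature images, ZY, plus Gaussian noise."""
    resid = X - Z @ Y
    return (-0.5 * np.sum(resid ** 2) / sigma ** 2
            - X.size * np.log(sigma * np.sqrt(2 * np.pi)))
```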

After specifying the feature ownership prior, the feature image prior, and the likelihood, a feature representation can be inferred for a given set of observations using standard machine learning inference techniques, such as Gibbs sampling (Geman and Geman, 1984) or particle filtering (Gordon, Salmond, & Smith, 1993). We refer the reader to Austerweil and Griffiths (2013), who discuss Gibbs sampling and particle filtering for feature inference models and analyze their psychological plausibility.

What features should people use to represent the image in Figure 9.4(a)? When the image is in the context of the images in Figure 9.4(b), Austerweil and Griffiths (2011) found that people and the IBP model infer a single feature to represent it, namely the object itself, which is shown in Figure 9.4(d). Alternatively, when the image is in the context of the images in Figure 9.4(c), people and the IBP model infer a set of features to represent it, namely three of the six parts used to create the images, which are shown in Figure 9.4(e).6 Importantly, Austerweil and Griffiths (2011) demonstrated that two of the most popular machine-learning techniques for inferring features from a set of images, principal component analysis (Hyvarinen, Karhunen, & Oja, 2001) and independent component analysis (Hyvarinen et al., 2001), did not explain their experimental results. This suggests that the simplicity bias given by the IBP is similar to the prior expectations people use to infer feature representations, although future work is necessary to more precisely test and formally characterize the biases people use to infer feature representations. Furthermore, this contextual effect was replicated with grayscale 3-D rendered images and in a conceptual domain, suggesting that it may reflect a domain-general capability.

Fig. 9.4 Effects of context on the features inferred by people to represent an image (from Experiment 1 of Austerweil & Griffiths, 2011). What features should be used to represent the image shown in (a)? When (a) is in the context of the images shown in (b), participants and the nonparametric Bayesian model represent it with a single feature, the image itself, which is shown in (d). Conversely, when (a) is in the context of the images shown in (c), it is represented as a conjunction of three features, shown in (e). Participants and the nonparametric Bayesian model generalized differently due to using different representations.

Unlike the IBP and most other computational models of feature learning, people tend to use features that are invariant over transformations (e.g., translations or dilations), because the properties of features observed in the environment do not recur identically across appearances (e.g., the retinal image changes after eye or head movements). By modifying the culinary metaphor such that each time a customer takes a dish she draws a random spice from a spice rack, the transformed IBP (Austerweil and Griffiths, 2010b) can infer features that are invariant over a set of transformations. Austerweil and Griffiths (2013) explain a number of previous experimental results and outline novel contextual effects predicted by the extended model, which they subsequently confirm through behavioral experiments. Other extensions to this same framework have been used to explain multimodal representation learning (Yildirim and Jacobs, 2012), and in one extension, the IBP and CRP are used together to infer features diagnostic for categorization (Austerweil and Griffiths, 2013).

Choice Behavior

According to feature-based choice models, people choose between different options depending on their preference for the features (called aspects) that each option has. One influential feature-based choice model is the elimination by aspects model (Tversky, 1972). Under this model, the preference for a feature is assumed to be independent of the other features and is defined by a single positive number, called a weight. Choices are made by repeatedly selecting a feature at random, with probability proportional to its weight, and removing all options that do not contain that feature, until only one option remains. For example, when a person is choosing which television to purchase, she is confronted with a large array of possibilities that vary on many features, such as IS LCD, IS PLASMA, IS ENERGY EFFICIENT, IS HIGH DEFINITION, and so on, each with its own associated weight. Because the same person may make different choices even when confronted with the same set of options, the probability of choosing option i over option j, $p_{ij}$, is proportional to the weights of the features that option i has but option j does not have, relative to the features that option j has but option i does not have. Formally, this is given by

\[
p_{ij} = \frac{\sum_k w_k z_{ik} (1 - z_{jk})}{\sum_k w_k z_{ik} (1 - z_{jk}) + \sum_k w_k (1 - z_{ik}) z_{jk}},
\]

where $z_{ik} = 1$ denotes that option i has feature k and $w_k$ is the preference weight given to feature k.
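A minimal sketch of this binary-choice probability, with hypothetical feature weights for the television example, is given below; the function and variable names are our own.

```python
import numpy as np

def eba_choice_prob(z_i, z_j, w):
    """Probability of choosing option i over option j under elimination by aspects
    (binary-choice case): only features that distinguish the two options matter."""
    z_i, z_j, w = np.asarray(z_i), np.asarray(z_j), np.asarray(w)
    i_only = np.sum(w * z_i * (1 - z_j))   # weight of features i has but j lacks
    j_only = np.sum(w * (1 - z_i) * z_j)   # weight of features j has but i lacks
    return i_only / (i_only + j_only)      # undefined if the options share all features

# Two televisions described by binary features (hypothetical weights).
# features: [IS_LCD, IS_PLASMA, IS_ENERGY_EFFICIENT, IS_HIGH_DEFINITION]
w = [0.5, 0.2, 1.0, 0.8]
tv_a = [1, 0, 1, 1]
tv_b = [0, 1, 0, 1]
print(eba_choice_prob(tv_a, tv_b, w))  # the shared HD feature does not affect the choice
```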



Although modeling human choice behavior using the elimination by aspects model is straightforward when the features of each option and their preference weights are known, it is not straightforward to infer what features a person uses to represent a set of options, and their weights, given only the person's choice behavior. To address this issue, Görür et al. (2006) proposed using the IBP as a prior distribution over possible feature representations and a Gamma distribution, a flexible distribution that only assigns probability to positive numbers, as a prior over the feature weights. They demonstrated that human choices about which celebrities from the early 1970s participants would prefer to chat with (Rumelhart and Greeno, 1971) are described just as well by the elimination by aspects model when the features are inferred with the IBP as the prior over possible feature representations as when the set of features is defined by a modeler. Because participants were familiar with the celebrities, and the celebrities are related to each other according to a hierarchy (i.e., politicians, actors, and athletes), Miller et al. (2008) extended the IBP such that it infers features in a manner that respects a given hierarchy (i.e., options are more likely to have the same features to the degree that they are close to each other in the hierarchy). They demonstrated that the extended IBP explains human choice judgments better and uses fewer features to represent the options. Although the IBP-based extensions help the elimination by aspects model overcome some of its issues, the extended models are unable to account for the full complexity of human choice behavior (e.g., attraction effects; Huber, Payne, & Puto, 1982). Regardless, exploring choice models that include a feature-inference process is a promising direction for future research, because such models can potentially be combined with more psychologically valid choice models (e.g., sequential sampling models of preferential choice; Roe, Busemeyer, & Townsend, 2001).

Learning Functions: How Are Continuous Quantities Related?

So far, we have focused on cases in which the latent structure to be inferred is discrete: either a category or a set of features. However, latent structures can also be continuous. One of the most prominent examples is function learning, in which a relationship is learned between two (or more) continuous variables. This is a problem that people often solve without even thinking about it, as when learning how hard to press the pedal to yield a certain amount of acceleration when driving a rental car. Nonparametric Bayesian methods also provide a solution to this problem that can learn complex functions in a manner that favors simple solutions.

Viewed abstractly, the computational problem behind function learning is to learn a function $y = f(x)$ from a set of real-valued observations $\mathbf{x}_N = (x_1, \ldots, x_N)$ and $\mathbf{t}_N = (t_1, \ldots, t_N)$, where $t_n$ is assumed to be the true value $y_n = f(x_n)$ obscured by some kind of additive noise (i.e., $t_n = y_n + \epsilon_n$, where $\epsilon_n$ is a noise term). In machine learning and statistics, this is referred to as a regression problem. In this section, we discuss how this problem can be solved using Bayesian statistics, and how the result of this approach is related to a class of nonparametric Bayesian models known as Gaussian processes. Our presentation follows that in Williams (1998).

Bayesian linear regression

Ideally, we would seek to solve our regression problem by combining some prior beliefs about the probability of encountering different kinds of functions in the world with the information provided by $\mathbf{x}_N$ and $\mathbf{t}_N$. We can do this by applying Bayes's rule, with

\[
p(f \mid \mathbf{x}_N, \mathbf{t}_N) = \frac{p(\mathbf{t}_N \mid f, \mathbf{x}_N)\, p(f)}{\int_{\mathcal{F}} p(\mathbf{t}_N \mid f, \mathbf{x}_N)\, p(f)\, df}, \qquad (1)
\]

where $p(f)$ is the prior distribution over functions in the hypothesis space $\mathcal{F}$, $p(\mathbf{t}_N \mid f, \mathbf{x}_N)$ is the likelihood of observing the values of $\mathbf{t}_N$ if $f$ were the true function, and $p(f \mid \mathbf{x}_N, \mathbf{t}_N)$ is the posterior distribution over functions given the observations $\mathbf{x}_N$ and $\mathbf{t}_N$. In many cases, the likelihood is defined by assuming that the values of $t_n$ are independent given $f$ and $x_n$, with each following a Gaussian distribution with mean $y_n = f(x_n)$ and variance $\sigma_t^2$. Predictions about the value of the function $f$ for a new input $x_{N+1}$ can be made by integrating over this posterior distribution.

Performing the calculations outlined in the previous paragraph for a general hypothesis space $\mathcal{F}$ is challenging, but it becomes straightforward if we limit the hypothesis space to certain specific classes of functions. If we take $\mathcal{F}$ to be all linear functions of the form $y = b_0 + x b_1$, then we need to define a prior $p(f)$ over all linear functions. As these functions can be expressed in terms of the parameters $b_0$ and $b_1$, it is sufficient to define a prior over the vector $\mathbf{b} = (b_0, b_1)$, which we can do by assuming that $\mathbf{b}$ follows a multivariate Gaussian distribution with mean zero and covariance matrix $\Sigma_b$. Applying Eq. 1 then results in a multivariate Gaussian posterior distribution on $\mathbf{b}$ (see Rasmussen and Williams, 2006, for details) with

\[
\mathbb{E}[\mathbf{b} \mid \mathbf{x}_N, \mathbf{t}_N] = \left( \sigma_t^2 \Sigma_b^{-1} + X_N^T X_N \right)^{-1} X_N^T \mathbf{t}_N
\]
\[
\mathrm{cov}[\mathbf{b} \mid \mathbf{x}_N, \mathbf{t}_N] = \left( \Sigma_b^{-1} + \frac{1}{\sigma_t^2} X_N^T X_N \right)^{-1}
\]

where $X_N = [\mathbf{1}_N \ \mathbf{x}_N]$ (i.e., a matrix with a vector of ones horizontally concatenated with $\mathbf{x}_N$). Because $y_{N+1}$ is simply a linear function of $\mathbf{b}$, the predictive distribution is Gaussian, with $y_{N+1}$ having mean $[1 \ x_{N+1}]\, \mathbb{E}[\mathbf{b} \mid \mathbf{x}_N, \mathbf{t}_N]$ and variance $[1 \ x_{N+1}]\, \mathrm{cov}[\mathbf{b} \mid \mathbf{x}_N, \mathbf{t}_N]\, [1 \ x_{N+1}]^T$. The predictive distribution for $t_{N+1}$ is similar, but with the addition of $\sigma_t^2$ to the variance.
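These posterior and predictive formulas translate directly into a few lines of code. The sketch below is a minimal NumPy version on synthetic data; the variable names (`Sigma_b`, `sigma_t`) mirror the notation above, and the specific prior and noise values are arbitrary choices for the example.

```python
import numpy as np

def blr_posterior(x, t, Sigma_b, sigma_t):
    """Posterior over b = (b0, b1) for Bayesian linear regression y = b0 + b1 * x,
    with prior b ~ N(0, Sigma_b) and observation noise t_n ~ N(y_n, sigma_t**2)."""
    X = np.column_stack([np.ones_like(x), x])               # design matrix [1_N  x_N]
    cov = np.linalg.inv(np.linalg.inv(Sigma_b) + X.T @ X / sigma_t**2)
    mean = cov @ X.T @ t / sigma_t**2
    return mean, cov

def blr_predict(x_new, mean, cov, sigma_t):
    """Gaussian predictive distribution for t at a new input x_new."""
    phi = np.array([1.0, x_new])
    pred_mean = phi @ mean
    pred_var = phi @ cov @ phi + sigma_t**2                  # add noise variance for t
    return pred_mean, pred_var

# Noisy observations of an underlying linear function (synthetic example).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
t = 0.5 + 2.0 * x + rng.normal(scale=0.3, size=x.shape)
mean, cov = blr_posterior(x, t, Sigma_b=np.eye(2), sigma_t=0.3)
print(mean)                                  # close to the true coefficients (0.5, 2.0)
print(blr_predict(1.5, mean, cov, sigma_t=0.3))
```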

Basis Functions and Similarity Kernels

Although considering only linear functions might seem overly restrictive, linear regression actually gives us the basic tools we need to solve this problem for more general classes of functions. Many classes of functions can be described as linear combinations of a small set of basis functions. For example, all kth degree polynomials are linear combinations of functions of the form 1 (the constant function), $x$, $x^2, \ldots, x^k$. Letting $\phi^{(1)}, \ldots, \phi^{(k)}$ denote a set of basis functions, we can define a prior on the class of functions that are linear combinations of this basis by expressing such functions in the form $f(x) = b_0 + \phi^{(1)}(x) b_1 + \cdots + \phi^{(k)}(x) b_k$ and defining a prior on the vector of weights $\mathbf{b}$. If we take the prior to be Gaussian, we reach the same solution as outlined in the previous paragraph, substituting $\Phi = [\mathbf{1}_N \ \phi^{(1)}(\mathbf{x}_N) \ldots \phi^{(k)}(\mathbf{x}_N)]$ for $X_N$ and $[1 \ \phi^{(1)}(x_{N+1}) \ldots \phi^{(k)}(x_{N+1})]$ for $[1 \ x_{N+1}]$, where $\phi(\mathbf{x}_N) = [\phi(x_1) \ldots \phi(x_N)]^T$.
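The same machinery handles nonlinear functions once the inputs are passed through basis functions. Here is a small, self-contained sketch that builds the design matrix Φ for a polynomial basis and applies the same posterior formulas as in the linear case; the cubic basis, identity prior covariance, and synthetic data are assumptions made only for illustration.

```python
import numpy as np

def poly_design(x, degree):
    """Design matrix Phi = [1, x, x**2, ..., x**degree] for a polynomial basis."""
    return np.column_stack([np.asarray(x) ** d for d in range(degree + 1)])

def posterior_weights(Phi, t, Sigma_b, sigma_t):
    """Posterior mean and covariance of the basis-function weights b,
    using the linear-regression formulas with Phi in place of X."""
    cov = np.linalg.inv(np.linalg.inv(Sigma_b) + Phi.T @ Phi / sigma_t**2)
    return cov @ Phi.T @ t / sigma_t**2, cov

# Fit a cubic to noisy samples of a nonlinear function (synthetic example).
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
t = np.sin(2 * x) + rng.normal(scale=0.1, size=x.shape)
Phi = poly_design(x, degree=3)
b_mean, b_cov = posterior_weights(Phi, t, Sigma_b=np.eye(4), sigma_t=0.1)
print(Phi[:1] @ b_mean)  # posterior-mean prediction at the first training input
```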

If our goal were merely to predict $y_{N+1}$ from $x_{N+1}$, $\mathbf{y}_N$, and $\mathbf{x}_N$, we might consider a different approach: simply defining a joint distribution on $\mathbf{y}_{N+1}$ given $\mathbf{x}_{N+1}$ and conditioning on $\mathbf{y}_N$. For example, we might take $\mathbf{y}_{N+1}$ to be jointly Gaussian, with covariance matrix

\[
K_{N+1} = \begin{bmatrix} K_N & \mathbf{k}_{N,N+1} \\ \mathbf{k}_{N,N+1}^T & k_{N+1} \end{bmatrix}
\]

where $K_N$ depends on the values of $\mathbf{x}_N$, $\mathbf{k}_{N,N+1}$ depends on $\mathbf{x}_N$ and $x_{N+1}$, and $k_{N+1}$ depends only on $x_{N+1}$. If we condition on $\mathbf{y}_N$, the distribution of $y_{N+1}$ is Gaussian with mean $\mathbf{k}_{N,N+1}^T K_N^{-1} \mathbf{y}_N$ and variance $k_{N+1} - \mathbf{k}_{N,N+1}^T K_N^{-1} \mathbf{k}_{N,N+1}$. This approach to prediction is often referred to as using a Gaussian process, since it assumes a stochastic process that induces a Gaussian distribution on $y$ based on the values of $x$. This approach can also be extended to allow us to predict $y_{N+1}$ from $x_{N+1}$, $\mathbf{t}_N$, and $\mathbf{x}_N$ by adding $\sigma_t^2 I_N$ to $K_N$, where $I_N$ is the $N \times N$ identity matrix, to take into account the additional variance associated with the observations $\mathbf{t}_N$.

The covariance matrix $K_{N+1}$ is specified using a function of pairs of stimuli known as a kernel, with $K_{ij} = K(x_i, x_j)$. Any kernel that results in an appropriate (symmetric, positive-definite) covariance matrix for all $\mathbf{x}$ can be used. One common kernel is the radial basis function, with

\[
K(x_i, x_j) = \theta_1^2 \exp\left( -\frac{1}{\theta_2^2} (x_i - x_j)^2 \right), \qquad (2)
\]

indicating that values of $y$ for which the corresponding values of $x$ are close are likely to be highly correlated. See Schölkopf and Smola (2002) for further details regarding kernels. Gaussian processes thus provide an extremely flexible approach to regression, with the kernel being used to define which values of $x$ are likely to have similar values of $y$. Some examples are shown in Figure 9.5.
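To connect the formulas above, the following sketch implements Gaussian process prediction with the radial basis kernel of Eq. 2, adding $\sigma_t^2 I_N$ to the training covariance as described in the text. The default kernel parameters echo those in Figure 9.5(A); the synthetic data and function names are our own illustrative choices.

```python
import numpy as np

def rbf_kernel(x1, x2, theta1=1.0, theta2=0.25):
    """Radial basis kernel of Eq. 2: K(x, x') = theta1**2 * exp(-(x - x')**2 / theta2**2)."""
    diff = np.subtract.outer(x1, x2)
    return theta1**2 * np.exp(-diff**2 / theta2**2)

def gp_predict(x_train, t_train, x_new, sigma_t=0.1, **kernel_args):
    """Gaussian process posterior mean and variance of y at a single new input,
    conditioning the joint Gaussian on the noisy training observations t."""
    K = rbf_kernel(x_train, x_train, **kernel_args) + sigma_t**2 * np.eye(len(x_train))
    k_star = rbf_kernel(x_train, [x_new], **kernel_args).ravel()
    k_new = rbf_kernel([x_new], [x_new], **kernel_args).item()
    alpha = np.linalg.solve(K, t_train)                # K^{-1} t
    mean = k_star @ alpha
    var = k_new - k_star @ np.linalg.solve(K, k_star)
    return mean, var

# Interpolate noisy observations of a smooth function (synthetic example).
rng = np.random.default_rng(2)
x = np.linspace(-1.5, 1.5, 15)
t = np.sin(3 * x) + rng.normal(scale=0.1, size=x.shape)
print(gp_predict(x, t, x_new=0.3))
```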

Just as we can express a covariance matrix in terms of its eigenvectors and eigenvalues, we can express a given kernel $K(x_i, x_j)$ in terms of its eigenfunctions $\phi$ and eigenvalues $\lambda$, with

\[
K(x_i, x_j) = \sum_{k=1}^{\infty} \lambda_k \phi^{(k)}(x_i)\, \phi^{(k)}(x_j)
\]

for any $x_i$ and $x_j$. Using the results from the previous paragraph, any kernel can be viewed as the result of performing Bayesian linear regression with a set of basis functions corresponding to its eigenfunctions, and a prior with covariance matrix $\Sigma_b = \mathrm{diag}(\boldsymbol{\lambda})$.

These equivalence results establish an important duality between Bayesian linear regression and Gaussian processes: for every prior on functions, there exists a kernel that defines the similarity between values of $x$, and for every kernel, there exists a corresponding prior on functions that yields the same predictions. This result is a consequence of Mercer's theorem (Mercer, 1909). Thus, Bayesian linear regression and prediction with Gaussian processes are just two views of the same class of solutions to regression problems.




Fig. 9.5 Modeling functions with Gaussian processes. The data points (crosses) are the same in both panels. (A) Inferred posterior using a radial basis covariance function (Eq. 2) with θ2 = 1/4. (B) Same as panel A, but with θ1 = 1 and θ2 = 1/8. Notice that as θ2 gets smaller, the posterior mean more closely fits the observed data points, and the posterior variance is larger in regions far from the data.

Modeling Human Function Learning

The duality between Bayesian linear regression and Gaussian processes provides a novel perspective on human function learning. Previously, theories of function learning had focused on the roles of different psychological mechanisms. One class of theories (e.g., Carroll, 1963; Brehmer, 1974; Koh and Meyer, 1991) suggests that people are learning an explicit function from a given class, such as the polynomials of degree D. This approach attributes rich representations to human learners, but has traditionally given limited treatment to the question of how such representations could be acquired. A second approach (e.g., DeLosh, Busemeyer, & McDaniel, 1997) emphasizes the possibility that people could simply be forming associations between similar values of variables. This approach has a clear account of the underlying learning mechanisms, but it faces challenges in explaining how people generalize beyond their experience. More recently, hybrids of these two approaches have been proposed (e.g., Kalish, Lewandowsky, & Kruschke, 2004; McDaniel and Busemeyer, 2005). For example, the population of linear experts (POLE; Kalish et al., 2004) uses associative learning to learn a set of linear functions and their expertise over regions of dimensional space.

Bayesian linear regression resembles explicit rule learning, estimating the parameters of a function, whereas the idea of making predictions based on the similarity between predictors (as defined by a kernel) that underlies Gaussian processes is more in line with associative accounts. The fact that, at the computational level, these two ways of viewing regression are equivalent suggests that these competing mechanistic accounts may not be as far apart as they once seemed. Just as viewing category learning as density estimation helps us understand that prototype and exemplar models correspond to different types of solutions to the same statistical problem, viewing function learning as regression reveals the shared assumptions behind rule learning and associative learning.

Gaussian process models also provide a good account of human performance in function learning tasks. Griffiths et al. (2009) compared a Gaussian process model with a mixture of kernels (linear, quadratic, and radial basis) to human performance. Figure 9.6 shows mean human predictions when trained on a linear, exponential, and quadratic function (from DeLosh et al., 1997), together with the predictions of the Gaussian process model. The regions to the left and right of the vertical lines represent extrapolation regions: input values on which neither people nor the model were trained. Both people and the model extrapolate near optimally on the linear function, and reasonably accurate extrapolation also occurs for the exponential and quadratic functions. However, there is a bias toward a linear slope in the extrapolation of the exponential and quadratic functions, with extreme values of the quadratic and exponential functions being overestimated.

Fig. 9.6 Extrapolation performance in function learning. Mean predictions on linear, exponential, and quadratic functions for (a) human participants (from DeLosh et al., 1997) and (b) a Gaussian process model with linear, quadratic, and nonlinear kernels. Training data were presented in the region between the vertical lines, and extrapolation performance was evaluated outside this region. Figure reproduced from Griffiths et al. (2009).

Conclusions

Probabilistic models form a promising framework for explaining the impressive success that people have in solving different inductive problems. As part of performing these feats, the mind constructs structured representations that flexibly adapt to the current set of stimuli and context. In this chapter, we reviewed how these problems can be described from a statistical viewpoint. To define models that infer representations that are both flexible and structured, we described three main classes of nonparametric Bayesian models and how the format of the observed stimuli determines which of the three classes of models should be used.

Our presentation of these three classes of models is only the beginning of an ever-growing literature using nonparametric Bayesian processes in cognitive science. Each class of model can be thought of as providing primitive units, which can be composed in various ways to form richer models. For example, the IBP can be interpreted as providing primitive units that can be composed together using logical operators to define a categorization model, which learns its own features along with a propositional rule to define a category. Figure 9.7 presents some of these applications and Box 2 provides a detailed discussion of them.

Representational primitive | Key Psychological Application | Nonparametric Bayesian Process | Key References
Partitions | Category assignment | Chinese restaurant process (CRP) | Aldous (1985)
Probability distributions over competing discrete units | Category learning/density estimation | Dirichlet process (DP) | Ferguson (1973)
Probability distributions over independent discrete units | Feature assignment | Indian buffet process (IBP)/Beta process (BP) | Griffiths and Ghahramani (2005); Thibaux and Jordan (2007)
Distributions over continuous units | Function learning | Gaussian process (GP) | Rasmussen and Williams (2006)
Composites | Hierarchical category learning | Nested CRP | Griffiths et al. (2008b); Blei et al. (2010)
Composites | Jointly learning categories and features | CRP + IBP; nested CRP + hierarchical DP topic model | Austerweil and Griffiths (2013); Salakhutdinov et al. (2012)
Composites | Cross-cutting category learning | CRP over entities embedded inside CRP over attributes | Shafto et al. (2011)
Composites | Relational category learning | Product of CRPs over multiple entity types | Kemp et al. (2006, 2010)
Composites | Property induction | GP over latent graph structure | Kemp and Tenenbaum (2009)
Composites | Domain structure learning | GP/CRP/IBP + grammar over model forms | Kemp and Tenenbaum (2008); Grosse et al. (2012)

Fig. 9.7 Different assumptions about the type of structure generating the observations from the environment result in different types of nonparametric Bayesian models. The most basic nonparametric models define distributions over core representational primitives, while more advanced models can be constructed by composing these primitives with each other and with other probabilistic modeling and knowledge representation elements (see Box 2). Typically, researchers in cognitive science do not distinguish between the CRP and DP, or the IBP and BP. However, they are all distinct mathematical objects: the CRP and IBP are distributions over the assignment of stimuli to units, whereas the DP and BP are distributions over the assignment of stimuli to units and the parameters associated with those units. The probability distribution given by considering only the number of stimuli assigned to each unit by a DP or BP yields a distribution over assignments equivalent to the CRP or IBP, respectively.

Concluding remarks

1. Although it is possible to define probabilistic models that can infer any desired structure, due to the bias-variance trade-off, prior expectations over a rich class of structures are needed to capture the structures inferred by people when given a limited number of observations.

2. Nonparametric Bayesian models are probabilistic models over arbitrarily complex structures that are biased toward simpler structures. Thus, they form a middle ground between the two extremes of models that infer overly simple structures (parametric models) and models that infer overly complex structures (nonparametric models).

3. Using different nonparametric models results in different assumptions about the format of the hidden structure. When each stimulus is assigned to a single latent unit, multiple latent units, or continuous units, the Dirichlet process, Beta process, and Gaussian process are appropriate, respectively. These processes are compositional in that they can be combined with each other and with other models to infer complex latent structures, such as relations and hierarchies.

Some Future Questions

1. How similar are the inductive biases defined by nonparametric Bayesian models to those people use when inferring structured representations?

2. What are the limits on the complexity of representations that people can learn? Are nonparametric Bayesian models too powerful?

3. How do nonparametric Bayesian models compare to other computational frameworks that adapt their structure with experience, such as neural networks?

Notes

1. Note that we define parametric, nonparametric, and other statistical terms from the Bayesian perspective. We refer the reader interested in the definition of these terms from the frequentist perspective, and in a comparison of frequentist and Bayesian perspectives, to Young and Smith (2010).

2. Our definition of "density estimation" includes estimating any probability distribution over a discrete or continuous space, which is slightly broader than its standard use in statistics (estimating a probability distribution over a continuous space).

3. Our formulation departs from Anderson's by adopting the notation typically used in the statistics literature. However, the two formulations are equivalent.

4. We have written the posterior probability as proportional to the product of three terms because the normalizing constant (the denominator) for this example is intractable to compute when there is an infinite repository of features.

5. Technically, ensuring that the infinite limit of P(Z|α) is valid requires defining all feature ownership matrices that differ only in the order of their columns to be equivalent. This is due to identifiability issues and is analogous to the arbitrariness of the cluster (or table) labels in the CRP.

6. Austerweil and Griffiths (2011) tested whether people represent the objects with the parts as features by seeing if they were willing to generalize a property of the set of objects (being found in a cave on Mars) to a novel combination of three of the six parts used to create the images. See Austerweil and Griffiths (2011) and Austerweil and Griffiths (2013) for a discussion of this methodology and the theoretical implications of these results.

7. Although we collapsed the distinction between the IBP and BP, they are distinct nonparametric Bayesian processes. See the caption of Figure 9.7 and the glossary for more details.



Glossary

Beta process: a stochastic process that assigns a real number between 0 and 1 to a countable set of units, which makes it a natural prior for latent feature models (interpreting the number as a probability)

bias: the error between the true structure and the average structure inferred based on observations from the environment

bias-variance trade-off: to reduce the error in generalizing a structure to new observations, an agent has to reduce both its bias and variance

Chinese restaurant process: a culinary metaphor that defines a probability distribution over partitions, which yields an equivalent distribution on partitions as the one implied by a Dirichlet process when only the number of stimuli assigned to each block is considered

computational level: interpreting the behavior of a system as the solution to an abstract computational problem posed by the environment

consistency: given enough observations, the statistical model infers the true structure producing the observations

Dirichlet process: a stochastic process that assigns a set of non-negative real numbers that sum to 1 to a countable set of units, which makes it a natural prior for latent class models (interpreting the assigned number as the probability of that unit)

exchangeability: a sequence of random variables is exchangeable if and only if their joint probability is invariant to reordering (does not change)

Gaussian process: a stochastic process that defines a joint distribution on a set of variables that is Gaussian, which makes it a natural prior for function learning models

importance sampling: approximating a distribution by sampling from a surrogate distribution and then reweighting the samples to compensate for the fact that they came from the surrogate rather than the desired distribution

Indian buffet process: a culinary metaphor for describing the probability distribution over the possible assignments of observations to multiple discrete units, which yields an equivalent distribution on discrete units as the one implied by a Beta process when only the number of stimuli assigned to each unit is considered

inductive inference: a method for solving a problem that has more than one logically possible solution

likelihood: the probability of some observed data given that a particular structure or hypothesis is true

Markov chain Monte Carlo: approximating a distribution by setting up a Markov chain whose limiting distribution is that distribution

Monte Carlo: using random number sampling to solve numerical problems

nonparametric: a model whose possible densities belong to a family that includes arbitrary distributions

parametric: a model that assumes possible densities belong to a family that is parameterized by a fixed number of variables

particle filtering: a sequentially adapting importance sampler where the surrogate distribution is based on the approximated posterior at the previous time step

partition: division of a set into nonoverlapping subsets

posterior probability: an agent's belief in a structure or hypothesis after some observations

prior probability: an agent's belief in a structure or hypothesis before any observations

rational analysis: interpreting the behavior of a system as the ideal solution to an abstract computational problem posed by the environment, usually with respect to some assumptions about the system's environment

rational process models: a process model that is a statistical approximation to the ideal solution given by probability theory

variance: the degree to which the inferred structure changes across different possible observations from the environment

References

Aldous, D. (1985). Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII (pp. 1–198). Berlin: Springer.

Anderson, J. R. (1990). The adaptive character of thought. Hillsdale, NJ: Erlbaum.

Anderson, J. R. (1991). The adaptive nature of human categorization. Psychological Review, 98(3), 409–429.

Antoniak, C. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2, 1152–1174.

Ashby, F. G. & Alfonso-Reese, L. A. (1995). Categorization as probability density estimation. Journal of Mathematical Psychology, 39, 216–233.

Austerweil, J. L. & Griffiths, T. L. (2010a). Learning hypothesis spaces and dimensions through concept learning. In S. Ohlsson & R. Catrambone (Eds.), Proceedings of the 32nd Annual Conference of the Cognitive Science Society (pp. 73–78). Austin, TX: Cognitive Science Society.

Austerweil, J. L. & Griffiths, T. L. (2010b). Learning invariant features using the transformed Indian buffet process. In R. Zemel & J. Shawe-Taylor (Eds.), Advances in neural information processing systems (Vol. 23, pp. 82–90). Cambridge, MA: MIT Press.

Austerweil, J. L. & Griffiths, T. L. (2011). A rational model of the effects of distributional information on feature learning. Cognitive Psychology, 63, 173–209.

Austerweil, J. L. & Griffiths, T. L. (2013). A nonparametric Bayesian framework for constructing flexible feature representations. Psychological Review, 120, 817–851.

Bernardo, J. M. & Smith, A. F. M. (1994). Bayesian theory. New York, NY: Wiley.

Bickel, P. J. & Doksum, K. A. (2007). Mathematical statistics: Basic ideas and selected topics. Upper Saddle River, NJ: Pearson.

Blackwell, D. & MacQueen, J. (1973). Ferguson distributions via Polya urn schemes. The Annals of Statistics, 1, 353–355.

Blei, D. M., Griffiths, T. L., & Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57, 1–30.

Bouton, M. (2004). Context and behavioral processes in extinction. Learning & Memory, 11(5), 485–494.

Bowers, J. S. & Davis, C. J. (2012). Bayesian just-so stories in psychology and neuroscience. Psychological Bulletin, 138(3), 389–414.

Braddick, O. (1993). Segmentation versus integration in visual motion processing. Trends in Neurosciences, 16(7), 263–268.

Brehmer, B. (1971). Subjects' ability to use functional rules. Psychonomic Science, 24, 259–260.

Brehmer, B. (1974). Hypotheses about relations between scaled variables in the learning of probabilistic inference tasks. Organizational Behavior and Human Decision Processes, 11, 1–27.

Brogden, W. (1939). Sensory pre-conditioning. Journal of Experimental Psychology, 25(4), 323–332.

Carroll, J. D. (1963). Functional learning: The learning of continuous functional mappings relating stimulus and response continua. Princeton, NJ: Educational Testing Service.

Chater, N., Goodman, N., Griffiths, T. L., Kemp, C., Oaksford, M., & Tenenbaum, J. B. (2011). The imaginary fundamentalists: The unshocking truth about Bayesian cognitive science. Behavioral and Brain Sciences, 34(4), 194–196.

Chater, N. & Oaksford, M. (2008). The probabilistic mind: Prospects for Bayesian cognitive science. New York, NY: Oxford University Press.

Cheng, P. (1997). From covariation to causation: A causal power theory. Psychological Review, 104, 367–405.

Clapper, J. & Bower, G. (1994). Category invention in unsupervised learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20(2), 443–460.

Courville, A., Daw, N., & Touretzky, D. (2006). Bayesian theories of conditioning in a changing world. Trends in Cognitive Sciences, 10(7), 294–300.

Dayan, P., Kakade, S., & Montague, P. R. (2000). Learning and selective attention. Nature Neuroscience, 3, 1218–1223.

DeLosh, E. L., Busemeyer, J. R., & McDaniel, M. A. (1997). Extrapolation: The sine qua non for abstraction in function learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(4), 968–986.

Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1, 209–230.

Freedman, D. & Diaconis, P. (1983). On inconsistent Bayes estimates in the discrete case. Annals of Statistics, 11(4), 1109–1118.

Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias-variance dilemma. Neural Computation, 4, 1–58.

Geman, S. & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.

Gershman, S., Blei, D., & Niv, Y. (2010). Context, learning, and extinction. Psychological Review, 117(1), 197–209.

Gershman, S. & Niv, Y. (2012). Exploring a latent cause theory of classical conditioning. Learning & Behavior, 40(3), 255–268.

Gershman, S. J. & Blei, D. M. (2012). A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56(1), 1–12.

Ghosal, S. (2010). The Dirichlet process, related priors, and posterior asymptotics. In N. L. Hjort, C. Holmes, P. Müller, & S. G. Walker (Eds.), Bayesian nonparametrics (pp. 35–79). Cambridge, UK: Cambridge University Press.

Goldmeier, E. (1972). Similarity in visually perceived forms. Psychological Issues, 8, 1–136. (Original work written in German and published in 1936.)

Goldwater, S., Griffiths, T. L., & Johnson, M. (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112, 21–54.

Goodman, N. (1972). Seven strictures on similarity. In N. Goodman (Ed.), Problems and projects. New York, NY: Bobbs-Merrill.

Gordon, N., Salmond, J., & Smith, A. (1993). A novel approach to non-linear/non-Gaussian Bayesian state estimation. IEEE Proceedings on Radar and Signal Processing, 140, 107–113.

Görür, D., Jäkel, F., & Rasmussen, C. E. (2006). A choice model with infinitely many latent features. In Proceedings of the 23rd International Conference on Machine Learning (ICML 2006) (pp. 361–368). New York, NY: ACM Press.

Griffiths, T., Chater, N., Kemp, C., Perfors, A., & Tenenbaum, J. (2010a). Probabilistic models of cognition: Exploring representations and inductive biases. Trends in Cognitive Sciences, 14, 357–364.

Griffiths, T., Steyvers, M., & Tenenbaum, J. (2007). Topics in semantic representation. Psychological Review, 114(2), 211–244.

Griffiths, T. L. (2010). Bayesian models as tools for exploring inductive biases. In M. Banich & D. Caccamise (Eds.), Generalization of knowledge: Multidisciplinary perspectives. New York, NY: Psychology Press.

Griffiths, T. L., Chater, N., Kemp, C., Perfors, A., & Tenenbaum, J. B. (2010b). Probabilistic models of cognition: Exploring representations and inductive biases. Trends in Cognitive Sciences, 14(8), 357–364.

Griffiths, T. L., Chater, N., Norris, D., & Pouget, A. (2012). How the Bayesians got their beliefs (and what those beliefs actually are): Comment on Bowers and Davis (2012). Psychological Bulletin, 138(3), 415–422.

Griffiths, T. L. & Ghahramani, Z. (2005). Infinite latent feature models and the Indian buffet process (Technical Report 2005-001). Gatsby Computational Neuroscience Unit.

Griffiths, T. L. & Ghahramani, Z. (2011). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12, 1185–1224.

Griffiths, T. L., Kemp, C., & Tenenbaum, J. B. (2008a). Bayesian models of cognition. In R. Sun (Ed.), Cambridge handbook of computational cognitive modeling. Cambridge, England: Cambridge University Press.

Griffiths, T. L., Lucas, C., Williams, J. J., & Kalish, M. L. (2009). Modeling human function learning with Gaussian processes. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in Neural Information Processing Systems 21 (pp. 556–563). Red Hook, NY: Curran.

Griffiths, T. L., Sanborn, A. N., Canini, K. R., & Navarro, D. J. (2008b). Categorization as nonparametric Bayesian density estimation. In N. Chater & M. Oaksford (Eds.), The probabilistic mind. Oxford, England: Oxford University Press.

Griffiths, T. L. & Yuille, A. (2006). A primer on probabilistic inference. Trends in Cognitive Sciences, 10(7), 1–11.

Grosse, R. B., Salakhutdinov, R., Freeman, W. T., & Tenenbaum, J. B. (2012). Exploiting compositionality to explore a large space of model structures. In N. de Freitas & K. Murphy (Eds.), Conference on Uncertainty in Artificial Intelligence (pp. 306–315). Corvallis, OR: AUAI Press.

Hjort, N. L. (1990). Nonparametric Bayes estimators based on Beta processes in models for life history data. Annals of Statistics, 18, 1259–1294.

Hu, Y., Zhai, K., Williamson, S., & Boyd-Graber, J. (2012). Modeling images using transformed Indian buffet processes. International Conference on Machine Learning, Edinburgh, UK.

Huber, J., Payne, J. W., & Puto, C. (1982). Adding asymmetrically dominated alternatives: Violations of regularity and the similarity hypothesis. Journal of Consumer Research, 9(1), 90–98.

Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York, NY: Wiley.

Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge, England: Cambridge University Press.

Jolliffe, I. T. (1986). Principal component analysis. New York, NY: Springer.

Jones, M. & Love, B. C. (2011). Bayesian fundamentalism or enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34, 169–231.

Kahneman, D., Slovic, P., & Tversky, A. (Eds.) (1982). Judgment under uncertainty: Heuristics and biases. Cambridge, England: Cambridge University Press.

Kalish, M. L., Lewandowsky, S., & Kruschke, J. K. (2004). Population of linear experts: Knowledge partitioning and function learning. Psychological Review, 111(4), 1072–1099.

Kaplan, A. & Murphy, G. (1999). The acquisition of category structure in unsupervised learning. Memory & Cognition, 27(4), 699–712.

Kemp, C., Perfors, A., & Tenenbaum, J. B. (2007). Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10(3), 307–321.

Kemp, C., Tenenbaum, J., Niyogi, S., & Griffiths, T. (2010). A probabilistic model of theory formation. Cognition, 114(2), 165–196.

Kemp, C. & Tenenbaum, J. B. (2008). The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31), 10687–10692.

Kemp, C. & Tenenbaum, J. B. (2009). Structured statistical models of inductive reasoning. Psychological Review, 116(1), 20–58.

Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T., & Ueda, N. (2006). Learning systems of concepts with an infinite relational model. In Y. Gil & R. J. Mooney (Eds.), Proceedings of the 21st National Conference on Artificial Intelligence (pp. 381–388). Menlo Park, CA: AAAI Press.

Koh, K. & Meyer, D. E. (1991). Function learning: Induction of continuous stimulus–response relations. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17(5), 811–836.

Kruschke, J. (2008). Bayesian approaches to associative learning: From passive to active learning. Learning & Behavior, 36(3), 210–226.

Love, B. C., Medin, D. L., & Gureckis, T. M. (2004). SUSTAIN: A network model of category learning. Psychological Review, 111, 309–332.

Marr, D. (1982). Vision. San Francisco, CA: WH Freeman.

McClelland, J. L., Botvinick, M. M., Noelle, D. C., Plaut, D. C., Rogers, T. T., Seidenberg, M. S., & Smith, L. B. (2010). Letting structure emerge: Connectionist and dynamical systems approaches to cognition. Trends in Cognitive Sciences, 14(8), 348–356.

McDaniel, M. A. & Busemeyer, J. R. (2005). The conceptual basis of function learning and extrapolation: Comparison of rule-based and associative-based models. Psychonomic Bulletin & Review, 12(1), 24–42.

McKinley, S. C. & Nosofsky, R. M. (1995). Investigations of exemplar and decision bound models in large, ill-defined category structures. Journal of Experimental Psychology: Human Perception and Performance, 21(1), 128–148.

Medin, D. L. & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207–238.

Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A, 209, 415–446.

Miller, K. T., Griffiths, T. L., & Jordan, M. I. (2008). The phylogenetic Indian buffet process: A non-exchangeable nonparametric prior for latent features. In D. McAllester & P. Myllymaki (Eds.), Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (pp. 403–410). Corvallis, OR: AUAI Press.

Murphy, G. L. & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289–316.

Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2), 249–265.

Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115(1), 39–57.

Oaksford, M. & Chater, N. (2007). Bayesian rationality: The probabilistic approach to human reasoning. New York, NY: Oxford University Press.

Palmer, S. E. (1999). Vision science. Cambridge, MA: MIT Press.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Francisco, CA: Morgan Kaufmann.

Posner, M. I. & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77(3, Pt. 1), 353.

Pothos, E. & Chater, N. (2002). A simplicity principle in unsupervised human categorization. Cognitive Science, 26(3), 303–343.

Rasmussen, C. E. & Williams, C. K. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.

Reed, S. K. (1972). Pattern recognition and categorization. Cognitive Psychology, 3, 393–407.

Rescorla, R. A. & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. Black & W. Prokasy (Eds.), Classical conditioning II: Current research and theory (pp. 64–99). New York, NY: Appleton-Century-Crofts.

Robert, C. P. (1994). The Bayesian choice: A decision-theoretic motivation. New York, NY: Springer.

Roe, R. M., Busemeyer, J. R., & Townsend, J. T. (2001). Multialternative decision field theory: A dynamic connectionist model of decision making. Psychological Review, 108(2), 370–392.

Rumelhart, D. & Greeno, J. (1971). Similarity between stimuli: An experimental test of the Luce and Restle choice models. Journal of Mathematical Psychology, 8, 370–381.

Salakhutdinov, R., Tenenbaum, J. B., & Torralba, A. (2012). Learning to learn with compound hierarchical-deep models. In Advances in Neural Information Processing Systems.

Sanborn, A. N., Griffiths, T. L., & Navarro, D. J. (2010). Rational approximations to rational models: Alternative algorithms for category learning. Psychological Review, 117(4), 1144–1167.

Schölkopf, B. & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.

Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317–1323.

Selfridge, O. G. & Neisser, U. (1960). Pattern recognition by machine. Scientific American, 203, 60–68.

Shafto, P., Kemp, C., Mansinghka, V., & Tenenbaum, J. B. (2011). A probabilistic model of cross-categorization. Cognition, 120(1), 1–25.

Tenenbaum, J. B., Griffiths, T. L., & Kemp, C. (2006). Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10, 309–318.

Tenenbaum, J. B., Kemp, C., Griffiths, T. L., & Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331, 1279–1285.

Thibaux, R. & Jordan, M. I. (2007). Hierarchical Beta processes and the Indian buffet process. In Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007) (pp. 564–571).

Tversky, A. (1972). Elimination by aspects: A theory of choice. Psychological Review, 79, 281–299.

Tversky, A. (1977). Features of similarity. Psychological Review, 84, 327–352.

Werker, J. & Yeung, H. (2005). Infant speech perception bootstraps word learning. Trends in Cognitive Sciences, 9(11), 519–527.

Williams, C. K. (1998). Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. Jordan (Ed.), Learning in graphical models (pp. 599–621). Cambridge, MA: MIT Press.

Wood, F., Griffiths, T. L., & Ghahramani, Z. (2006). A nonparametric Bayesian method for inferring hidden causes. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI '06) (pp. 536–543).

Yildirim, I. & Jacobs, R. A. (2012). A rational analysis of the acquisition of multisensory representations. Cognitive Science, 36, 305–332.

Young, G. A. & Smith, R. L. (2010). Essentials of statistical inference. Cambridge, UK: Cambridge University Press.
