Running head: ITERATED LEARNING
Iterated learning: Intergenerational knowledge transmission reveals inductive biases
Michael L. Kalish
Institute of Cognitive Science
University of Louisiana at Lafayette
Thomas L. Griffiths
Department of Cognitive and Linguistic Sciences
Brown University
Stephan Lewandowsky
Department of Psychology
University of Western Australia
Word count: 3580

Address for correspondence:
Mike Kalish
Institute of Cognitive Science
University of Louisiana at Lafayette
Lafayette, LA 70504
Phone: (337) 482 1135
E-mail: [email protected]
Abstract
Cultural transmission of information plays a central role in shaping human knowledge.
Some of the most complex knowledge that people acquire, such as languages or cultural
norms, can only be learned from other people, who themselves learned from previous
generations. The prevalence of this process of “iterated learning” as a mode of cultural
transmission raises the question of how it affects the information being transmitted.
Analyses of iterated learning under the assumption that the learners are Bayesian agents
predict that this process should converge to an equilibrium that reflects the inductive
biases of the learners. An experiment in iterated function learning with human participants
confirms this prediction, providing insight into the consequences of intergenerational
knowledge transmission and a method for discovering the inductive biases that guide
human inferences.
Iterated learning: Intergenerational knowledge transmission
reveals inductive biases
Knowledge changes as it is passed from one person to the next, and from one
generation to the next. Sometimes the change is dramatic: the deaf children of Nicaragua
have transformed a fragmentary protolanguage into a real language in the brief time
required for one generation of signers to mature within the new language’s community
(e.g., Senghas & Coppola, 2001). Language is only one example, although it is perhaps
the most striking, of the inter-generational transmission of cultural knowledge. In many
cases of cultural transmission, one learner serves as the next learner’s teacher. Languages,
legends, superstitions and social norms are all transmitted by such a process of “iterated
learning” (see Figure 1a), with each generation learning from data produced by that which
preceded it (Boyd & Richerson, 1985; Briscoe, 2002; Cavalli-Sforza & Feldman, 1981;
Kirby, 1999, 2001). However, iterated learning does not result in perfect transfer of
knowledge across generations. Its outcome depends not just on the data being passed from
learner to learner, but on the properties of the learners themselves.
The prevalence of iterated learning as a mode of cultural transmission raises an
important question: what are the consequences of iterated learning for the information
being transmitted? In particular, does this information converge to a predictable
equilibrium, and are the dynamics of this process understandable? This question has been
explored in a variety of disciplines, including anthropology and linguistics. In
anthropology, several researchers have argued that processes of cultural transmission like
iterated learning provide the opportunity for the biases of learners to manifest in the
concepts used by a society (Atran, 2001, 2002; Boyer, 1994, 1998; Sperber, 1996). In
linguistics, iterated learning provides a potential explanation for the structure of human
languages (e.g., Kirby, 2001; Briscoe, 2002). This approach is an alternative to traditional
claims that the structure of language is the result of constraints imposed by an innate,
special-purpose, language faculty (e.g., Chomsky, 1965; Hauser, Chomsky, & Fitch,
2002). Simulations of iterated learning with general-purpose learning algorithms have
shown that languages with considerable degrees of structure can emerge when agents are
allowed to learn from one another (Kirby, 2001; Smith, Kirby, & Brighton, 2003;
Brighton, 2002).
Despite this interest in cultural transmission, there has been very little laboratory
work on the consequences of iterated learning. Bartlett’s (1932) experiments in “serial
reproduction” were the first psychological investigations of this topic, using a procedure in
which participants reconstructed a stimulus from memory, with their reconstructions
serving as stimuli for later participants. Bartlett concluded that reproductions seem to
become more consistent with the biases of the participants as the number of reproductions
increases. However, these claims are impossible to validate, since Bartlett’s experiments
used stimuli, such as pictures and stories, that are not particularly amenable to rigorous
analysis. In addition, there was no unambiguous pre-experimental hypothesis about what
people’s biases might be for these complex stimuli. There have been only a few
subsequent studies in serial reproduction, with the most prominent being Bangerter (2000)
and Barrett and Nyhof (2001), and thus we presently have little understanding of the likely
outcome of iterated learning in controlled conditions. The possibilities are numerous:
iteration might produce divergence from structure into noise, or into random or
unpredictable alternation from one solution to another, or people might blend their biases
with the data to form consistent “compromise” solutions. In this paper, we attempt to
determine the outcome of intergenerational knowledge transmission by testing the
predictions made by a formal analysis of iterated learning in an experiment using a
controlled set of stimuli.
We can gain some insight into the consequences of iterated learning by considering
the case where the learners are Bayesian agents. Bayesian agents use a principle of
probability theory, called Bayes’ rule, to infer the process that was responsible for
generating some observed data. Assume that a learner has a set of hypotheses, H, about
the process that could have produced data, d, and a “prior” probability distribution, p(h),
that encodes that learner’s biases by specifying the probability a learner assigns to the
truth of each hypothesis h ∈ H before seeing d. In the case of learning a language, the
hypotheses, h, are different languages, and the data, d, are a set of utterances. Bayes’ rule
states that the probability that an agent should assign to each hypothesis after seeing d –
known as the “posterior” probability, p(h|d) – is
$$p(h|d) = \frac{p(d|h)\,p(h)}{\sum_{h' \in H} p(d|h')\,p(h')}, \qquad (1)$$
where p(d|h) – the “likelihood” – indicates how likely d is under hypothesis h, and p(d) is
the probability of d averaged over all hypotheses, $p(d) = \sum_{h \in H} p(d|h)\,p(h)$, sometimes
called the prior predictive distribution. The assumption that learners are Bayesian agents
is not unreasonable: adherence to Bayes’ rule is a fundamental principle of rational action
in statistics and economics (Savage, 1954; Jaynes, 2003; Robert, 1994) and its use
underlies many learning algorithms (Mitchell, 1997; MacKay, 2003).
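To make Equation 1 concrete, the following Python sketch computes a posterior over a small discrete hypothesis space; the prior and likelihood values are invented for illustration and are not taken from the experiment.

import numpy as np

def posterior(prior, likelihood):
    """Equation 1: p(h|d) from a prior p(h) and likelihoods p(d|h)."""
    unnormalized = likelihood * prior          # p(d|h) p(h) for each h
    return unnormalized / unnormalized.sum()   # normalize by p(d)

prior = np.array([0.7, 0.2, 0.1])        # p(h): the learner's biases (toy values)
likelihood = np.array([0.1, 0.5, 0.5])   # p(d|h) for one observed datum d
print(posterior(prior, likelihood))      # -> [0.318 0.455 0.227]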
In iterated learning with Bayesian agents, each learner uses Bayes’ rule to infer the
language spoken by the previous learner, and generates the data provided to the next
learner using the results of this inference (see Figure 1b). Having formalized iterated
learning in this way, we can examine how it affects the hypotheses chosen by the learners.
The probability that the nth learner chooses hypothesis i given that the previous learner
chose hypothesis j is
$$p(h_n = i \mid h_{n-1} = j) = \sum_d p(h_n = i \mid d)\,p(d \mid h_{n-1} = j), \qquad (2)$$
where $p(h_n = i \mid d)$ is the posterior probability obtained from Equation 1. This specifies
the transition matrix of a Markov chain, with the hypothesis chosen by each learner
depending only on that chosen by the previous learner. Griffiths and Kalish (2005) showed
that the stationary distribution of this Markov chain is p(h), the prior assumed by the
learners. The Markov chain will converge to this distribution under fairly general
conditions (e.g., Norris, 1997). This means that the probability that the last in a long line
of learners chooses a particular hypothesis is simply the prior probability of that
hypothesis, regardless of the data provided to the first learner. In other words, the stimuli
provided for learning are completely irrelevant in the long run, and only the biases of the
learners affect the outcome of iterated learning.1
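This convergence can be checked numerically. The sketch below, using the same kind of toy discrete setup as before (the numbers are illustrative, not the authors' model), constructs the transition matrix of Equation 2 and iterates it: the distribution over hypotheses approaches the prior regardless of the first learner's hypothesis.

import numpy as np

prior = np.array([0.6, 0.3, 0.1])              # p(h), the learners' shared biases
likelihood = np.array([[0.8, 0.1, 0.1],        # likelihood[h, d] = p(d|h);
                       [0.1, 0.8, 0.1],        # toy values for three hypotheses
                       [0.2, 0.2, 0.6]])       # and three possible data sets

joint = likelihood * prior[:, None]            # p(d|h) p(h)
post = joint / joint.sum(axis=0)               # Equation 1: post[h, d] = p(h|d)

# Equation 2: T[i, j] = sum_d p(h_n = i | d) p(d | h_{n-1} = j)
T = post @ likelihood.T

dist = np.array([0.0, 0.0, 1.0])               # first learner holds hypothesis 3
for _ in range(50):
    dist = T @ dist                            # one generation of iterated learning
print(dist)                                    # approaches the prior [0.6 0.3 0.1]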
A similar convergence result can be obtained if we consider how the data generated
by the learners (instead of the hypotheses they hold) change over time: after many
generations, the probability that a learner generates data d will be $p(d) = \sum_{h \in H} p(d|h)\,p(h)$,
the probability of d under the prior predictive distribution (Griffiths & Kalish, 2006). This
process of convergence is illustrated in Figure 2 for the case where the hypotheses are
linear functions and the prior favors functions with unit slope and zero intercept (the
details of this Bayesian model appear in the Appendix). This is a simple example of
iterated learning, but nonetheless illustrative of convergence to the prior predictive
distribution. As each generation of learners combines the evidence provided by the data
with their (common) prior, their posterior distributions move closer to the prior, and the
data they produce become more consistent with hypotheses that have high prior
probability.
The preceding analysis of iterated learning with Bayesian agents provides a simple
answer to the question of how iterated learning affects the information being transmitted:
the information will be transformed to reflect the inductive biases of the learners. Whether
a similar transformation will be observed with human learners is an open empirical
question. To test this prediction, we reproduced iterated learning in the laboratory using a
set of controlled stimuli for which people’s biases are well understood. We chose to use a
function learning task, because of the prominent role that inductive bias seems to play in
this domain.2 In function learning, each learner sees data consisting of (x, y) pairs and
attempts to infer the underlying function relating y to x. Experiments typically present the
values of x graphically, and participants respond by producing a y magnitude, also graphically. Tests
of interpolation and extrapolation with novel x values reveal that people infer continuous
functions from these discrete trials. Previous experiments in function learning suggest that
people have an inductive bias favoring linear functions with a positive slope: initial
responses are consistent with such functions (Busemeyer, Byun, DeLosh, & McDaniel,
1997), and they require the least training to learn (Brehmer, 1971, 1974; Busemeyer et al.,
1997). Kalish, Lewandowsky, and Kruschke (2004) showed that a model that included
such a bias could account for a variety of phenomena in human function learning. If
iterated learning converges to an equilibrium reflecting the inductive biases of the learners,
we should expect to see linear functions with positive slope emerge after a few
generations of learners. We tested this hypothesis by examining the outcome of iterated
function learning, varying the functions used to train the first learner in each sequence.
Method
Participants
A total of 288 undergraduate psychology students from the University of Louisiana at
Lafayette participated for partial course credit. The experiment had four conditions,
corresponding to different initial training functions. There were 72 participants in each
condition, forming eight “families” of nine generations each; the responses that each
generation produced during a post-training transfer test were presented to the next
generation of learners as the to-be-learned target stimuli.
Apparatus and Stimuli
Participants completed the experiment in individual sound-attenuated booths. A
computer displayed all trials and collected all responses. On each trial a filled blue bar
1cm high and 0.3cm (x = 1) to 30cm (x = 100) wide was presented as the stimulus. The
stimulus was always presented in the upper portion of the screen, with its upper left corner
approximately 4cm from the top and 4cm from the left of the edge of the screen. The
participant entered a response magnitude by adjusting a vertically-oriented unmarked
slider (located 4cm from the bottom and 6cm from the right of the screen) with the mouse;
the slider’s position determined the height of a filled red bar 1cm wide, which could
extend up to 25cm. During the training phase, feedback was provided in the form of a
filled yellow bar 1cm wide, placed 1cm to the right of the response bar, whose height
varied from 0.25cm (y = 1) to 25cm (y = 100) to indicate the correct response.
Procedure
The experiment had both training and transfer phases. For the learners who formed
the first generation of any family, the values of the training stimuli were 50 randomly
selected stimulus values (x) ranging from 1 to 100 paired with feedback (y) given by the
function of the condition the participant was in. The four functions used during training of
the first generation of participants in the four conditions were: y = x (positive linear),
y = 101 − x (negative linear), $y = 50.5 + 49.5\sin(\pi/2 + \pi x/50)$ (non-linear,
U-shaped), and a random 1-to-1 pairing of x and y with both x, y ∈ {1, ..., 100}. All values of x were
integers and all values of y were rounded to the nearest integer prior to display.
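As an illustration of this design, the following Python sketch generates the four first-generation training sets described above; the variable names and random seed are our own, and the U-shaped formula follows the expression given in the previous paragraph.

import numpy as np

rng = np.random.default_rng(0)                               # arbitrary seed
x = rng.choice(np.arange(1, 101), size=50, replace=False)    # 50 of the 100 stimulus values

training_y = {
    "positive linear": x,
    "negative linear": 101 - x,
    "U-shaped": np.rint(50.5 + 49.5 * np.sin(np.pi / 2 + np.pi * x / 50)).astype(int),
    "random": rng.permutation(np.arange(1, 101))[:50],       # random 1-to-1 pairing with x
}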
The test items consisted of 25 of the training items along with 25 of the 50 unused
stimulus values. Inter-generational transfer took place by making the test stimuli and
responses of generation n of each family serve as the training items of generation n + 1 in
that family. Inter-generational transfer was conducted entirely without personal contact
and participants were not made aware that their test responses would serve as training for
later participants; the use of one generation’s test items in training the next generation was
the only contact between generations.
Each trial was initiated by the presentation of a stimulus, selected without
replacement from the 50 items in either the training or test set. Following each stimulus
presentation, while the stimulus remained on the screen, the participant used the mouse to
adjust the slider to indicate their predicted response magnitude and clicked a button to
record their response when they had adjusted the slider to the desired magnitude. The
response could be manipulated ad lib until the participant chose to record the response.
During training each response was followed by the presentation of a feedback bar. If
the response was correct (defined as within 1.5 cm, or 5 units, of the target value y), there
was a study interval of 1s duration during which the stimulus, response, and feedback
were all presented. If the response was incorrect, a tone sounded and the participant was
shown the feedback bar. The participant was then required to set the slider so that the
response bar was equal to the feedback bar. A study interval of 2s duration followed this
correction. Thus, participants who responded accurately spent less time studying the
feedback; this was the reward for accurate responses. After each study interval there was a
blank interval of 2s duration before the next trial. Each participant completed a single
block of training in which each of their 50 training values was presented once in random
order. Test trials were identical to training trials, except that no feedback was made
available after the response was entered. Participants were informed prior to the beginning
of the test phase about this change.
Results
Figure 3 shows a single family of nine participants for each condition, chosen to be
representative of the overall results. Each set of axes shows the test-phase responses of a
single learner who was trained using the data shown in the graph to its left. For example,
the responses of the first generation in each condition (in column 2) were based on the
data provided by the actual function (to the left, in column 1). The responses of the second
generation (in column 3) were based on the data produced by the first generation (in
column 2), and so forth.
Regardless of the data seen by the first learner, iterated learning converged to a
linear function with positive slope in only a few generations for 28 of the 32 families of
learners. The other four families, three in the negative linear condition and one in the
random condition, were still producing a negative linear function when the experiment
concluded. Figure 3(a) indicates that a linear function with positive slope is stable under
iterated learning; none of the other initial conditions had this level of stability. Figure 3(b)
is reminiscent of the analysis shown in Figure 2: despite starting with a linear function
with negative slope, learners converged to a linear function with positive slope. Figure
3(c) and (d) show that linear functions with positive slope also emerge from iterated
learning when the initial function is non-monotonic or completely random. Figure 3(e)
shows the family that most strongly maintained the negative linear function.
The results shown in Figure 3 illustrate a tendency for iterated learning to converge
to a positive linear function. To provide a more quantitative analysis, we computed the
correlation between the responses of each participant and the positive linear function
y = x. The outliers produced by families converging to the negative linear function made
the mean correlation less informative than the median; Figure 3(f) shows the median
correlations at each generation for each of the four conditions. Other than the positive
linear condition, where the correlation was at ceiling from the first generation, the
correlations systematically increased across generations. As the data clearly show, iterated
learning produced an increase in the consistency of people’s responses with a positive
linear function. This result is consistent with a prior that is dominated by linear functions,
with a strong bias for the positive over the negative.
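As a sketch of this analysis (assuming, for illustration, that each family's responses are stored as one (x, y) pair of arrays per generation, a layout of our own devising), the per-generation medians could be computed as follows. Note that the correlation between the responses and the function y = x is simply the Pearson correlation between the x and y values.

import numpy as np

def median_correlations(families):
    """families[f][g] = (x, y) arrays for family f at generation g.
    Returns, per generation, the median across families of the
    correlation between responses and the function y = x."""
    n_generations = len(families[0])
    return [np.median([np.corrcoef(*fam[g])[0, 1] for fam in families])
            for g in range(n_generations)]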
Discussion
Our analysis of iterated learning with Bayesian agents indicates that when Bayesian
learners learn from one another, they converge to a distribution over hypotheses
determined by their inductive biases (Griffiths & Kalish, 2005, 2006). The purpose of this
experiment was to test whether iterated learning with human learners likewise converges
to an outcome consistent with their inductive biases. Previous research in function
learning suggests that people favor linear functions with positive slopes (Brehmer, 1971,
1974; Busemeyer et al., 1997; Kalish et al., 2004). In our experiment, iterated learning
converged to a positive linear function in the majority of cases, regardless of the function
seen by the first learner. Mapping out the stationary distribution of iterated learning in
greater detail would require collecting significantly more data, making it possible to
confirm that the underlying Markov chains have converged, and providing more samples
from the stationary distribution. Nonetheless, the dominance of the positive linear
function in our results suggests that, for human learners as for Bayesian agents, iterated
learning produces convergence to the prior.
These empirical results are consistent with our theoretical analysis of iterated
learning, and have two significant implications. First, they support the idea that
information transmitted via iterated learning will ultimately come to mirror the structure
of the human mind, a conclusion consistent with claims that processes of cultural
transmission can allow the biases of learners to manifest in cultures (Atran, 2001, 2002;
Boyer, 1994, 1998; Sperber, 1996). This suggests that languages, legends, religious
concepts, and social norms are all tailored to match our biases, providing a formal
justification for studying these phenomena as a means of understanding human cognition.
These results also validate the interpretation of existing, less controlled, experiments using
serial reproduction (e.g., Bartlett, 1932; Bangerter, 2000; Barrett & Nyhof, 2001) as
revealing people’s biases.
Second, and perhaps more importantly, our results suggest that iterated learning can
be used as a method for exploring the biases that guide human learning. Many of the
problems that are central to cognitive science, from learning and using language to
inferring the structure of categories from a few examples, are problems of induction. A
variety of arguments, from both philosophy (e.g., Goodman, 1955) and learning theory
(e.g., Kearns & Vazirani, 1994; Vapnik, 1995), stress the importance of inductive biases in
solving these problems. In order to understand how people make inductive inferences, we
need to understand the biases that constrain those inferences. The present experiment
involved a case in which the general shape of learners’ biases was known prior to the
study. Based on the results of the experiment, it appears possible to use the procedure to
investigate biases in situations in which they are unknown and people are unable (or
unwilling) to reveal what those biases are. By reproducing iterated learning in the
laboratory, we may be able to map out the implicit inductive biases that make human
learning possible.
References
Atran, S. (2001). The trouble with memes: Inferences versus imitation in cultural
creation. Human Nature, 12, 351-381.
Atran, S. (2002). In gods we trust: The evolutionary landscape of religion. Oxford:
Oxford University Press.
Bangerter, A. (2000). Transformation between scientific and social representations of
conception: The method of serial reproduction. British Journal of Social
Psychology, 39, 521-535.
Barrett, J., & Nyhof, M. (2001). Spreading nonnatural concepts. Journal of Cognition
and Culture, 1, 69-100.
Bartlett, F. C. (1932). Remembering: a study in experimental and social psychology.
Cambridge: Cambridge University Press.
Box, G. E. P., & Tiao, G. C. (1992). Bayesian inference in statistical analysis. New York:
Wiley.
Boyd, R., & Richerson, P. J. (1985). Culture and the evolutionary process. Chicago:
University of Chicago Press.
Boyer, P. (1994). The naturalness of religious ideas: A cognitive theory of religion.
Berkeley, CA: University of California Press.
Boyer, P. (1998). Cognitive tracks of cultural inheritance: how evolved intuitive ontology
governs cultural transmission. American Anthropologist, 100, 876-889.
Brehmer, B. (1971). Subjects’ ability to use functional rules. Psychonomic Science, 24,
259-260.
Brehmer, B. (1974). Hypotheses about relations between scaled variables in the learning
of probabilistic inference tasks. Organizational Behavior and Human Decision
Processes, 11, 1-27.
Brighton, H. (2002). Compositional syntax from cultural transmission. Artificial Life, 8,
25-54.
Briscoe, E. (Ed.). (2002). Linguistic evolution through language acquisition: Formal and
computational models. Cambridge, UK: Cambridge University Press.
Busemeyer, J. R., Byun, E., DeLosh, E. L., & McDaniel, M. A. (1997). Learning
functional relations based on experience with input-output pairs by humans and
artificial neural networks. In K. Lamberts & D. Shanks (Eds.), Concepts and
categories (p. 405-437). Cambridge: MIT Press.
Cavalli-Sforza, L. L., & Feldman, M. W. (1981). Cultural transmission and evolution.
Princeton, NJ: Princeton University Press.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis.
New York: Chapman & Hall.
Goodman, N. (1955). Fact, fiction, and forecast. Cambridge: Harvard University Press.
Griffiths, T. L., & Kalish, M. L. (2005). A Bayesian view of language evolution by
iterated learning. In B. G. Bara, L. Barsalou, & M. Bucciarelli (Eds.), Proceedings
of the Twenty-Seventh Annual Conference of the Cognitive Science Society (p.
827-832). Mahwah, NJ: Erlbaum.
Griffiths, T. L., & Kalish, M. L. (2006). Language evolution by iterated learning with
Bayesian agents. (Submitted for publication)
Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The faculty of language: What is it,
who has it, and how did it evolve? Science, 298, 1569-1579.
Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge: Cambridge
University Press.
Kalish, M., Lewandowsky, S., & Kruschke, J. (2004). Population of linear experts:
Knowledge partitioning and function learning. Psychological Review, 111,
1072-1099.
Kearns, M., & Vazirani, U. (1994). An introduction to computational learning theory.
Cambridge, MA: MIT Press.
Kirby, S. (1999). Function, selection and innateness: The emergence of language
universals. Oxford: Oxford University Press.
Kirby, S. (2001). Spontaneous evolution of linguistic structure: An iterated learning
model of the emergence of regularity and irregularity. IEEE Transactions on
Evolutionary Computation, 5, 102-110.
MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms.
Cambridge: Cambridge University Press.
Mitchell, T. M. (1997). Machine learning. New York: McGraw Hill.
Norris, J. R. (1997). Markov chains. Cambridge, UK: Cambridge University Press.
Robert, C. P. (1994). The Bayesian choice: A decision-theoretic motivation. New York:
Springer.
Savage, L. J. (1954). Foundations of statistics. New York: John Wiley & Sons.
Senghas, A., & Coppola, M. (2001). Children creating language: How Nicaraguan sign
language acquired a spatial grammar. Psychological Science, 12, 323-328.
Smith, K., Kirby, S., & Brighton, H. (2003). Iterated learning: A framework for the
emergence of language. Artificial Life, 9, 371-386.
Sperber, D. (1996). Explaining culture: A naturalistic approach. Oxford: Blackwell.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.
Appendix
Bayesian linear regression
Linear regression is a standard problem that is dealt with in detail in several books
on Bayesian statistics, including Box and Tiao (1992) and Gelman, Carlin, Stern, and
Rubin (1995). For this reason, our treatment of this analysis is extremely brief. Assume
that the data d are a set of n pairs (xi, yi), and that the hypothesis space H consists of
linear functions of the form $y = \beta_1 x + \beta_0 + \epsilon$, where $\epsilon$ is Gaussian noise with variance
$\sigma_Y^2$. Since a hypothesis h is identified entirely by the parameters $\beta_1$ and $\beta_0$, we can
summarize both data and hypotheses using column vectors, $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$,
$\mathbf{y} = [y_1, y_2, \ldots, y_n]^T$, and $\beta = [\beta_1\ \beta_0]^T$.
The likelihood, p(d|h), is simply the probability of $\mathbf{x}$ and $\mathbf{y}$ given $\beta$, $p(\mathbf{y}, \mathbf{x} \mid \beta)$.
Assuming that x follows a distribution q(x) which is constant over all hypotheses, we
have $p(\mathbf{y}, \mathbf{x} \mid \beta) = p(\mathbf{y} \mid \mathbf{x}, \beta)\,q(\mathbf{x})$. From the assumption that $y = \beta_1 x + \beta_0 + \epsilon$, it follows
that $p(\mathbf{y} \mid \mathbf{x}, \beta)$ is Gaussian with mean $X\beta$ and covariance matrix $\sigma_Y^2 I_n$, where
$X = [\mathbf{x}\ \mathbf{1}_n]$, and $\mathbf{1}_n$ and $I_n$ are an $n \times 1$ vector of 1s and the $n \times n$ identity matrix
respectively. The prior p(h) is a distribution over the parameters $\beta$, $p(\beta)$. We take $p(\beta)$ to
be Gaussian with mean $\mu_\beta$ and covariance matrix $\sigma_\beta^2 I_2$.
The posterior distribution p(h|d) is a distribution over $\beta$ given $\mathbf{x}$ and $\mathbf{y}$. Using our
choice of prior and likelihood, this is Gaussian with covariance matrix
$$\Sigma_{\text{post}} = \left(\frac{1}{\sigma_Y^2} X^T X + \frac{1}{\sigma_\beta^2} I_2\right)^{-1}, \qquad (3)$$
and mean
$$\mu_{\text{post}} = \Sigma_{\text{post}} \left(\frac{1}{\sigma_Y^2} X^T \mathbf{y} + \frac{1}{\sigma_\beta^2} \mu_\beta\right). \qquad (4)$$
Figure 2 was generated by simulating iterated learning with this model. The first learner
saw 20 datapoints generated by sampling x uniformly at random from the range [0, 1], and
taking y = 1 − x. A value of β was sampled from the resulting posterior distribution, and
used to generate values of y for 20 new randomly drawn values of x, which were supplied
as data to the next learner. This process was continued for a total of nine learners,
producing the results shown in the figure. The likelihood and prior assumed by the
learners had $\sigma_Y^2 = 0.0025$, $\sigma_\beta^2 = 0.005$, and $\mu_\beta = [1\ 0]^T$, corresponding to a strong prior
favoring functions with a slope of 1 and an intercept of 0.
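The following Python sketch reproduces this simulation using Equations 3 and 4; the parameter values are those given above, while implementation details such as the random seed are our own.

import numpy as np

rng = np.random.default_rng(1)
sigma2_y, sigma2_b = 0.0025, 0.005        # likelihood and prior variances
mu_b = np.array([1.0, 0.0])               # prior mean: slope 1, intercept 0
n = 20

x = rng.uniform(0, 1, n)
y = 1 - x                                 # data seen by the first learner
for generation in range(1, 10):
    X = np.column_stack([x, np.ones(n)])  # design matrix [x 1_n]
    cov = np.linalg.inv(X.T @ X / sigma2_y + np.eye(2) / sigma2_b)   # Equation 3
    mean = cov @ (X.T @ y / sigma2_y + mu_b / sigma2_b)              # Equation 4
    beta = rng.multivariate_normal(mean, cov)    # sample a hypothesis beta
    x = rng.uniform(0, 1, n)                     # new inputs for the next learner
    y = beta[0] * x + beta[1] + rng.normal(0, np.sqrt(sigma2_y), n)  # new data
    print(generation, round(np.corrcoef(x, y)[0, 1], 3))             # drifts toward +1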
Author Note
This research was partially supported by a Research Award from the Louisiana
Board of Regents to the first author and by a Discovery Grant from the Australian
Research Council to the third author. We thank Charles Barouse, Laurie Robinette and
Margery Doyle for assistance in data collection.
Footnotes
1. In the case of language, the demonstration that iterated learning converges to the prior
distribution over hypotheses maintained by the learners should not be taken as implying
that linguistic universals are necessarily the consequence of innate constraints specific to
language learning. The biases encoded by the prior need not be innate, as they could
result from experience with data other than those under consideration, nor
language-specific, as they could include general-purpose constraints such as limitations on
information processing (see Griffiths & Kalish, 2006, for a more detailed discussion).
2. Our use of function learning was also inspired by simulations of iterated learning of
languages, in which a language is often conceived of as a function mapping meanings to
utterances (e.g., Smith et al., 2003).
Figure Captions
Figure 1. (a) Iterated learning. Each learner sees data produced by a learner in a previous
generation, forms a hypothesis about the process by which those data were produced, and
uses this hypothesis to produce the data that will be supplied to a learner in the next
generation. (b) Iterated learning with Bayesian agents. The first learner sees data d0,
computes a posterior probability distribution over hypotheses according to Equation 1,
samples a hypothesis h1 from this distribution, and generates new data d1 by sampling
from the likelihood associated with that hypothesis. These data are provided to the second
learner, and the process continues, with the nth learner seeing data dn−1, inferring a
hypothesis hn, and generating new data dn.
Figure 2. Iterated learning with Bayesian agents converges to the prior. (a) The leftmost
panel shows the initial data provided to a Bayesian learner, a sample of 20 points from a
function. The learner inferred a hypothesis (in this case a linear function) from these data,
and then generated the predicted values of y shown in the next panel for a new set of
inputs x. These predictions were supplied as data to another Bayesian learner, and the
remaining panels show the predictions produced by learners at each generation as this
process continued. All learners had a prior distribution over hypotheses favoring linear
functions with positive slope (see the Appendix for details). As iterated learning proceeds,
the predictions converge to a positive linear function. (b) The correlation between
predictions and the function y = x provides a quantitative measure of correspondence to
the prior. The solid line shows the median correlation with y = x for functions produced
by 1000 sequences of iterated learning like that shown in (a). The dotted lines show the
95% confidence interval. Using this quantitative measure, it is easy to see that iterated
learning quickly produces a strong correspondence to the prior.
Figure 3. Iterated learning with human learners. The leftmost panel in each row shows the
data seen by the first learner, a sample of 50 points from one of four functions. The other
columns show the data produced by each generation of learners, trained on the data from
the column to their left. Each row shows a single sequence of nine learners, drawn at
random from the eight “families” of learners run with the same initial data. The rows
differ in the functions used to generate the data shown to the first subject: (a) is a linear
function with positive slope, (b) is a linear function with negative slope, (c) is a non-linear
function, and (d) is a random set of points. In each case, iterated learning quickly
converges to a linear function with positive slope, consistent with findings indicating that
human learners are biased towards this kind of function. (e) In a minority of cases (4 out
of 32) families were producing negative linear functions at the end of the experiment,
suggesting that such functions receive some weight under the prior. (f) The median
correlation with y = x across all families was assessed for the four conditions illustrated
in (a)-(e). Regardless of the initial data, this correlation increased over generations as
predictions became more consistent with a positive linear function.
[Figure 1 graphic. (a) Chain of hypothesis → data → hypothesis → data. (b) Chain d0 → h1 → d1 → h2 → d2, with inference steps labeled p(h|d) and production steps labeled p(d|h).]
[Figure 2 graphic. (a) Panels for n = 1 through n = 9. (b) Correlation (−1 to 1) plotted against iteration n.]
[Figure 3 graphic. (a)–(e) Panels for n = 1 through n = 9 in each row. (f) Median correlation plotted against iteration n, with separate lines for the positive linear (a), negative linear (b), non-linear (c), and random (d, e) conditions.]