
“Ideal” learning of language and categories. Nick Chater, Department of Psychology, University of Warwick; Paul Vitányi, Centrum voor Wiskunde en Informatica, Amsterdam


Page 1:

“Ideal” learning of language and categories

Nick Chater, Department of Psychology, University of Warwick

Paul Vitányi, Centrum voor Wiskunde en Informatica, Amsterdam

Page 2:

OVERVIEW

I. Learning from experience: The problem

II. Learning to predict
III. Learning to identify
IV. A methodology for assessing learnability
V. Where next?

Page 3:

I. Learning from experience: The problem

Page 4:

Learning: How few assumptions will work?

Model fitting: assume a model M(x), optimize x. Easy, but needs prior knowledge

No assumptions: learning is impossible---"no free lunch"

Can a more minimal model of learning still work?


Page 5:

Learning from +/- vs. + data

[Figure: target language/category vs. guess, and their overlap; + data reveals under-generalization, - data reveals over-generalization]

Page 6:

But how about learning from + data only?
Categorization
Language acquisition


Page 7:

Learning from positive data seems to raise in-principle problems

In categorization, it rules out: almost all learning experiments in psychology; exemplar models; prototype models; NNs, SVMs…

In language acquisition: it is assumed that children only need access to positive evidence

Sometimes viewed as ruling out learning models entirely

The “Logical” problem of language acquisition(e.g., Hornstein & Lightfoot, 1981; Pinker, 1979)

Page 8:

Must be solvable: A parallel with science

Science only has access to positive data

Yet science seems to be possible

So overgeneral theories must be eliminated, somehow (e.g., "anything goes" seems a bad theory)

Theories must capture regularities, not just fit data

Page 9:

Absence as implicit negative evidence?

Thus overgeneral grammars may predict lots of missing sentences

And their absence is a systematic clue that the theory is probably wrong

This idea only seems convincing if it can be proved that convergence works well, statistically... So what do we need to assume?

Page 10:

Modest assumption: Computability constraint

Assume that the data is generated by:
Random factors
Computable factors
i.e., nothing uncomputable

“Monkeys typing into a programming language”

A modest assumption!

[Figure: chance (coin flips …HHTTTHTTHTTHT…) feeding a computable process (a toy grammar with S, NP, V nodes), producing a corpus: …The cat sat on the mat. The dog…]

Page 11:

Learning by simplicity
Find an explanation of the "input" that is as simple as possible
An 'explanation' reconstructs the input; simplicity is measured in code length

Long history in perception: Mach, Koffka, Hochberg, Attneave, Leeuwenberg, van der Helm

Mimicry theorem with Bayesian analysis, e.g., Li & Vitányi (2000); Chater (1996); Chater & Vitányi (ms.)

Relation to Bayesian inference; widely used in statistics and machine learning
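The relation to Bayesian inference can be made explicit. The following is the standard generic correspondence between posterior probability and two-part code length (not a formulation specific to these authors):

```latex
% Generic MDL <-> Bayes correspondence: with prior P(h) and likelihood P(d|h),
% the most probable hypothesis is the one with the shortest two-part code.
\[
  -\log_2 P(h \mid d)
  \;=\; \underbrace{-\log_2 P(h)}_{\text{bits to describe } h}
  \;+\; \underbrace{-\log_2 P(d \mid h)}_{\text{bits to describe the data given } h}
  \;+\; \log_2 P(d),
\]
% so maximizing the posterior over h is the same as minimizing the total code
% length, since log2 P(d) does not depend on h.
```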

Page 12:

Consider "ideal" learning
Given the data, what is the shortest code?
How well does the shortest code work? Prediction; identification

Ignore the question of search

Makes general results feasible

But search won’t go away…!

Fundamental question: when is learning data-limited or search-limited?

Page 13:

Three kinds of induction

Prediction: converge on correct predictions

Identification: identify generating category/distribution in the limit

Learning causal mechanisms?? Inferring counterfactuals---effects of intervention (cf Pearl: from probability to causes)

Page 14:

II. Learning to predict

Page 15:

Prediction by simplicity
Find the shortest 'program/explanation' for the current data
Predict using that program

Strictly, use ‘weighted sum’ of explanations, weighted by brevity…

Equivalent to Bayes with (roughly) a 2^(-K(x)) prior, where K(x) is the length of the shortest program generating x
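As a toy illustration of predicting with a brevity-weighted sum of explanations, here is a minimal Python sketch; the three hypotheses and their description lengths are invented for illustration, and real Solomonoff prediction mixes over all programs (and is uncomputable):

```python
# Toy "prediction by simplicity": weight each hypothesis by 2^-(description length)
# times its likelihood on the data seen so far, then mix their predictions.

HYPOTHESES = [
    (2, lambda h: 0.5),                               # "fair coin"
    (4, lambda h: 0.9),                               # "mostly ones"
    (6, lambda h: 1.0 if len(h) % 2 == 0 else 0.0),   # "alternate 1,0,1,0,..."
]

def likelihood(rule, history):
    """Probability of the observed 0/1 history under one hypothesis."""
    p = 1.0
    for i, bit in enumerate(history):
        q = rule(history[:i])              # the hypothesis's P(next bit = 1)
        p *= q if bit == 1 else 1.0 - q
    return p

def predict_next(history):
    """Brevity-weighted mixture prediction of P(next bit = 1)."""
    weights = [2.0 ** -length * likelihood(rule, history)
               for length, rule in HYPOTHESES]
    total = sum(weights)
    return sum(w * rule(history) for w, (_, rule) in zip(weights, HYPOTHESES)) / total

print(predict_next([1, 0, 1, 0, 1, 0]))    # ~0.9: the short "alternate" hypothesis dominates
```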

Page 16:

Summed error has finite bound (Solomonoff, 1978)

Σ_{j=1..∞} s_j ≤ (log_e 2 / 2) · K(μ)

where s_j is the expected (squared) prediction error on the j-th item and K(μ) is the Kolmogorov complexity of the generating distribution μ

So prediction converges [faster than 1/(n log n), for corpus size n]

Inductive inference is possible!

No independence or stationarity assumptions; just computability of generating mechanism
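One step the slide leaves implicit: a finite bound on the summed errors forces the average error to vanish. With s_j and K(μ) as above:

```latex
\[
  \sum_{j=1}^{\infty} s_j \;\le\; \frac{\ln 2}{2}\,K(\mu)
  \quad\Longrightarrow\quad
  \frac{1}{n}\sum_{j=1}^{n} s_j \;\le\; \frac{\ln 2}{2}\cdot\frac{K(\mu)}{n}
  \;\longrightarrow\; 0 \quad (n \to \infty).
\]
```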

Page 17:

Applications

Language

A. Grammaticality judgements

B. Language production

C. Form-meaning mappings

Categorization

Learning from positive examples

Page 18:

A: Grammaticality judgments

We want a grammar that doesn't over- or under-generalize (much) w.r.t. the 'true' grammar, on sentences that are statistically likely to occur

NB. No guarantees for…
Colorless green ideas sleep furiously (Chomsky)
Bulldogs bulldogs bulldogs fight fight fight (Fodor)

Page 19:

Converging on a grammar
Fixing undergeneralization is easy (such grammars get 'falsified')
Overgeneralization is the hard problem

Need to use absence as evidence
But the language is infinite and any corpus is finite, so almost all grammatical sentences are also absent

The logical problem of language acquisition; Baker's paradox; impossibility of 'mere' learning from positive evidence

Page 20:

Overgeneralization theorem

Suppose the learner has probability e_j of erroneously guessing an ungrammatical j-th word. Then the summed errors have a finite bound:

Σ_{j=1..∞} e_j ≤ K(μ) · log_e 2

Intuitive explanation: an overgeneral grammar assigns smaller probabilities than needed to the grammatical sentences, and hence excessive code lengths

Page 21:

B: Language production

Simplicity allows ‘mimicry’ of any computable statistical method of generating a corpus

Arbitrary computable probability μ; simplicity-based probability M (Li & Vitányi, 1997):

M(y | x) / μ(y | x) → 1

Page 22:

C: Learning form-meaning mappings

So far we have ignored semantics
Suppose the language input consists of form-meaning pairs (cf Pinker)
Assume only that the form → meaning and meaning → form mappings are computable (they don't have to be deterministic)…

Page 23:

A theorem
It follows that:

Total errors in mapping forms to (sets of) meanings (with probabilities), and
Total errors in mapping meanings to (sets of) forms (with probabilities)

…have a finite bound (and hence average errors per sentence tend to 0)

Page 24:

Categorization
Sample n items from category C (assume all items are equally likely)
Guess, by choosing the D that provides the shortest code for the data

General proof method:
1. Overgeneralization: D must be the basis for a shorter code than C (or you wouldn't prefer it)
2. Undergeneralization: typical data from category C will have no code shorter than n log|C|

Page 25:

1. Fighting overgeneralization
D can't be much bigger than C, or it'll have a longer code length:

K(D) + n log|D| ≤ K(C) + n log|C|

as n → ∞, the constraint is that |D|/|C| ≤ 1 + O(1/n)
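Spelling out the step from the inequality to the constraint (a routine rearrangement of the inequality on this slide):

```latex
% From K(D) + n log2|D| <= K(C) + n log2|C|:
\[
  \log_2\frac{|D|}{|C|} \;\le\; \frac{K(C)-K(D)}{n} \;\le\; \frac{K(C)}{n}
  \quad\Longrightarrow\quad
  \frac{|D|}{|C|} \;\le\; 2^{K(C)/n} \;=\; 1 + O\!\left(\tfrac{1}{n}\right),
\]
% since K(C) is a fixed constant while n grows.
```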

Page 26:

2. Fighting undergeneralization
But the guess must cover most of the correct category---or it'd provide a "suspiciously" short code for the data

Typicality: K(D|C) + n log|C∩D| ≥ n log|C|

as n → ∞, the constraint is that |C∩D|/|C| ≥ 1 - O(1/n)

[Figure: overlapping sets, the true category C and the guess D]
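A minimal Python sketch of this two-part-code argument, using toy categories (integer intervals) with invented description lengths; it is meant only to mirror the |D|/|C| behaviour above, not to reproduce the formal proof:

```python
# Guess a category from positive examples only, by minimizing K(D) + n*log2|D|.
from math import log2
import random

random.seed(0)

CANDIDATES = [          # (lo, hi, description length in bits) -- all invented
    (0, 1023, 2),       # "everything": cheap to describe, but spreads the data thin
    (100, 299, 20),     # an overgeneral guess
    (100, 199, 20),     # the true category C
    (140, 159, 20),     # an undergeneral guess
]

def data_code_length(sample, lo, hi):
    """n * log2|D| if D covers every example, else D is falsified."""
    if all(lo <= x <= hi for x in sample):
        return len(sample) * log2(hi - lo + 1)
    return float("inf")

def mdl_guess(sample):
    """Pick the candidate with the shortest total two-part code."""
    return min(CANDIDATES, key=lambda c: c[2] + data_code_length(sample, c[0], c[1]))

for n in (3, 10, 50):
    sample = [random.randint(100, 199) for _ in range(n)]   # positive data from C
    print(n, mdl_guess(sample)[:2])
# Typically the broad, cheap-to-describe category wins for tiny n, but as n grows
# the tightest category covering the data (here the true one) takes over.
```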

Page 27:

Implication: |D| converges to near |C|

Accuracy bounded by O(1/n), with n samples (i.i.d. assumptions)

The actual convergence rate depends crucially on the structure of the category

Language: need lots of examples (but how many?)

Some categories may only need a few (one?) example (Tenenbaum, Feldman)

Page 28:

III. Learning to identify

Page 29:

Hypothesis identification

Induction of ‘true’ hypothesis, or category, or language

In philosophy of science, typically viewed as hard problem…

Needs stronger assumptions than prediction

Page 30:

Identification in the limit: The problem
Assume endless data
Goal: specify an algorithm that, at each point, picks a hypothesis
And eventually locks in on the correct hypothesis, though it can never announce it---as there may always be an additional low-frequency item that's yet to be encountered

Gold, Osherson et al. have studied this extensively

Sometimes viewed as showing identification is not possible (but really a mix of positive and negative results)

But i.i.d. and computability assumptions allow a general positive result

Page 31:

Algorithm

Candidate generating algorithms have two parts: a program which specifies the distribution Pr, and sampling from Pr, using average code length H(Pr) per data point

Pick a specific set of data (which needs to be 'long enough')
We won't necessarily know what is long enough---an extra assumption

Specify an enumeration of programs for Pr, e.g., in order of length
Run them, dovetailing
Initialize with any Pr
Flip to the Pr that corresponds to the shortest program so far that has generated the data

Page 32:

Dovetailing

Schedule (which global step each program gets):
prog1: 1, 2, 4, 7, …
prog2: 3, 5, 8, …
prog3: 6, 9, …
prog4: 10, …

Run the programs in order, dovetailing, where each program gets a share of steps proportional to 2^(-length)
This process runs for ever (some programs loop)
The shortest program so far that has generated the data is "pocketed"…
This will always end up on the "true" program
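A Python sketch of this dovetailed search. The candidate "programs" are hand-written toy generators with invented description lengths (a real implementation would enumerate all programs), and the step budgets only roughly follow the 2^(-length) schedule:

```python
# Dovetailing: interleave candidate programs, give shorter programs more steps,
# and "pocket" the shortest one whose output matches the observed data.

def repeat_ab():        # yields "a","b","a","b",...   (pretend description length 3)
    while True:
        yield "a"
        yield "b"

def all_a():            # yields "a","a","a",...       (pretend description length 5)
    while True:
        yield "a"

def busy_loop():        # "computes" forever, never outputs (pretend description length 2)
    while True:
        yield None

PROGRAMS = [(2, busy_loop), (3, repeat_ab), (5, all_a)]

def dovetail(programs, data, rounds=200):
    runs = [{"len": length, "gen": make(), "out": []} for length, make in programs]
    best = None
    for _ in range(rounds):
        for run in runs:
            budget = max(1, int(2 ** (6 - run["len"])))    # crude 2^(-length) share
            for _ in range(budget):
                symbol = next(run["gen"])                  # one computation step
                if symbol is not None:
                    run["out"].append(symbol)
            if run["out"][:len(data)] == list(data):       # has it generated the data?
                if best is None or run["len"] < best["len"]:
                    best = run                             # pocket the shortest so far
    return best

observed = ["a", "b", "a", "b", "a"]
print(dovetail(PROGRAMS, observed)["len"])    # 3: shortest program consistent with the data
```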

Page 33:

Overwhelmingly likely to work... (as n → ∞, Prob(correct identification) → 1)

For large enough stream of n typical data, no alternative model does better

Coding data generated by Pr with a code for Pr', rather than Pr, wastes an expected n·D(Pr || Pr') bits

D(Pr || Pr') > 0, so this swamps the initial code length, for large enough n

[Figure: initial code lengths K(Pr) and K(Pr'); in the illustration, by around n = 8 data points Pr wins]
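A small numerical illustration of this argument; the two distributions and the two "description lengths" below are invented:

```python
# Coding n samples from Pr with a code built for a rival Pr' wastes roughly
# n * D(Pr || Pr') bits, which eventually swamps the rival's shorter description.
from math import log2

def kl(p, q):
    """D(p || q) in bits, for two distributions over the same finite set."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

Pr     = [0.7, 0.2, 0.1]      # hypothetical true distribution
Pr_riv = [0.4, 0.4, 0.2]      # hypothetical rival
K_true, K_riv = 120, 60       # invented description lengths (bits)

d = kl(Pr, Pr_riv)
for n in (50, 200, 1000):
    excess_true  = K_true             # extra bits beyond n*H(Pr): description only
    excess_rival = K_riv + n * d      # shorter description, but per-item waste
    print(n, round(excess_rival, 1), "vs", excess_true)
# The rival stays ahead only until n * D(Pr || Pr') exceeds K_true - K_riv
# (here roughly n = 230); after that the true Pr gives the shorter total code.
```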

Page 34:

IV. A methodology for assessing learnability

Page 35:

Assessing learnability in cognition?

Constraint c is learnable if a code which

1. "invests" l(c) bits to encode c can…

2. recoup its investment: save more than l(c) bits in encoding the data

Nativism? c is acquired, but there is not enough data to recoup the investment (e.g., little/no relevant data)

Viability of empiricism? An ample supply of data to recoup l(c)

Cf Tenenbaum, Feldman…

Page 36:

Language acquisition: Poverty of the stimulus, quantified

Consider the cost of a linguistic constraint

(e.g., noun-verb agreement; subjacency; phonological constraints)

Cost assessed by length of formulation

(length of linguistic rules)

Saving: reduction in cost of coding data (perceptual, linguistic)

Page 37:

Easy example: learning singular-plural agreement

With the constraint:
John loves tennis (x bits)
They love_ tennis (y bits)

Without the constraint (each sentence must also rule out the other verb form):
John loves tennis / *John love_ tennis (x+1 bits)
They love_ tennis / *They loves tennis (y+1 bits)

If constraint applies to proportion p of n sentences, constraint saves pn bits.
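To make the break-even point concrete (the numbers below are invented for illustration):

```latex
% A constraint c that saves 1 bit on a proportion p of sentences, at a one-off
% description cost of l(c) bits, pays for itself once
\[
  \underbrace{p\,n}_{\text{bits saved}} \;>\; \underbrace{l(c)}_{\text{bits invested}}
  \quad\Longleftrightarrow\quad
  n \;>\; \frac{l(c)}{p}.
\]
% e.g. with a hypothetical l(c) = 50 bits and p = 0.2, the constraint is
% recouped after n > 250 sentences.
```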

Page 38:

Visual structure―ample data?

Depth from stereo:
Invest: an algorithm for correspondence
Recoup: almost a whole image (that's a lot!)

Perhaps stereo could be inferred from a single stereo image?

Object/texture models (Yuille)
Investment in building the model
But recouped in compression, over a "raw" image description

Presumably few images needed?

Page 39:

A harder linguistic case: Baker's paradox (with Luca Onnis and Matthew Roberts)

Quasi-regular structures are ubiquitous in language: e.g., alternations

It is likely that John will come

It is possible that John will come

John is likely to come
*John is possible to come

(Baker, 1979; see also Culicover)

Strong winds / High winds
Strong currents / *High currents

I love going to Italy! / I enjoy going to Italy!
I love to go to Italy! / *I enjoy to go to Italy!

Page 40:

Baker’s paradox (Baker, 1979)

Selectional restrictions: “holes” in the space of possible sentences allowed by a given grammar…

How does the learner avoid falling into the holes??

i.e., how does the learner distinguish genuine ‘holes’ from the infinite number of unheard grammatical constructions?

Page 41:

Our abstract theory tells us something

The theorem on grammaticality judgments shows that the paradox is solvable, in the asymptote, and with no computational restrictions

But can this be scaled down: can specific 'alternation' patterns be learned from the corpus the child actually hears?

Page 42:

Argument by information investment

To encode an exception, which would otherwise appear with probability x, requires log2(1/x) bits

But eliminating this probability mass x makes all other sentences 1/(1-x) times more likely, saving n·log2(1/(1-x)) bits over n sentences

Does the saving outweigh the investment?
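A back-of-envelope version of this calculation in Python; the exception rate x and the corpus sizes n below are invented, not estimates from any corpus:

```python
# Net bits gained by explicitly coding "construction X is an exception".
from math import log2

def net_saving(x, n):
    """Invest log2(1/x) bits to state an exception of probability x; every one of
    the n attested sentences then gets log2(1/(1-x)) bits cheaper after renormalising."""
    investment = log2(1 / x)
    saving = n * log2(1 / (1 - x))
    return saving - investment

# e.g. an exception that 'should' occur once in 10,000 sentences, judged
# against corpora of different sizes:
for n in (1_000, 10_000, 100_000):
    print(n, round(net_saving(x=1e-4, n=n), 1))
# The saving grows linearly in n (about n*x/ln 2 bits), so for any fixed
# exception rate a large enough corpus recoups the investment.
```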

Page 43:

An example: recovery from overgeneralisations

The rabbit hid / You hid the rabbit!
The rabbit disappeared / *You disappeared the rabbit!

Return on ‘investment’ over 5M words from the CHILDES database is easily sufficient

But this methodology can be applied much more widely (and aimed at fitting the time-course of U-shaped generalization, and at when overgeneralizations do or do not arise).

Page 44:

V. Where next?

Page 45:

Can we learn causal structure from observation?

What happens if we move the left-hand stick?

Page 46:

Liftability; breakability; edibility; what is attached to what; what is resting on what

Without this, perception is fairly useless as an input for action

The output of perception provides a description in terms of causality

Page 47:

Inferring causality from observation: The hard problem of induction

Formal question: suppose a modular computer program generates a stream of data of indefinite length… Under what conditions can the modularity be recovered? How might "interventions"/experiments help?

(Key technical idea: Kolmogorov sufficient statistic)

[Figure: generative process → sensory input]

Page 48:

Fairly uncharted territory

If the data is generated by independent processes

Then one model of the data will involve a recapitulation of those processes

But will there be other alternative modular programs?

Which might be shorter? Hopefully not!

Completely open field…