
“Ideal” learning of language and categories

Nick Chater, Department of Psychology, University of Warwick
Paul Vitányi, Centrum voor Wiskunde en Informatica, Amsterdam

OVERVIEW

I. Learning from experience: The problem

II. Learning to predict
III. Learning to identify
IV. A methodology for assessing learnability
V. Where next?

I. Learning from experience: The problem

Learning: How few assumptions will work?

Model fitting: assume M(x), optimize x. Easy, but needs prior knowledge.
No assumptions: learning is impossible ("no free lunch").

Can a more minimal model of learning still work?


Learning from +/- vs. + data

[Diagram: target language/category vs. the learner's guess, with their overlap; under-general and over-general guesses; + data and − data]

But how about learning from + data only? This is the situation in categorization and in language acquisition.


Learning from positive data seems to raise in-principle problems. In categorization, it rules out:
Almost all learning experiments in psychology
Exemplar models
Prototype models
NNs, SVMs…


Language acquisition: it is assumed that children only need access to positive evidence.
This is sometimes viewed as ruling out learning models entirely.

The “Logical” problem of language acquisition (e.g., Hornstein & Lightfoot, 1981; Pinker, 1979)

Must be solvable: a parallel with science.
Science only has access to positive data, yet science seems to be possible.
So overgeneral theories must be eliminated somehow: e.g., "anything goes" seems a bad theory.
Theories must capture regularities, not just fit data.

Absence as implicit negative evidence?
Overgeneral grammars predict lots of sentences that are in fact missing, and their absence is a systematic clue that the theory is probably wrong.
This idea only seems convincing if it can be proved that convergence works well, statistically... So what do we need to assume?

Modest assumption: the computability constraint.
Assume that the data is generated by random factors plus computable factors, i.e., nothing uncomputable: "monkeys typing into a programming language".
A modest assumption!

[Diagram: chance (…HHTTTHTTHTTHT…) feeding a computable process (a small syntactic tree with S, NP, V nodes), producing output: …The cat sat on the mat. The dog…]

Learning by simplicity: find an explanation of the "input" that is as simple as possible. An 'explanation' reconstructs the input; simplicity is measured in code length.
Long history in perception: Mach, Koffka, Hochberg, Attneave, Leeuwenberg, van der Helm.
Mimicry theorem with Bayesian analysis: e.g., Li & Vitányi (2000); Chater (1996); Chater & Vitányi (ms.).
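To make "simplicity measured in code length" concrete, here is a minimal sketch (not from the talk) of a two-part code for binary data under a toy Bernoulli hypothesis class: the cost of an explanation is the bits needed to state the hypothesis plus the bits needed to encode the data given it. The hypothesis costs (1 bit for "fair", 8 bits for a specific bias) are purely illustrative assumptions.

```python
import math

def two_part_code_length(data, p, hypothesis_bits):
    """Two-part code: bits to state the hypothesis, plus bits to encode
    the binary data under a Bernoulli(p) model (its log-loss)."""
    data_bits = sum(-math.log2(p if x == 1 else 1 - p) for x in data)
    return hypothesis_bits + data_bits

# Example: 100 coin flips, 70 heads.
data = [1] * 70 + [0] * 30
fair = two_part_code_length(data, 0.5, hypothesis_bits=1)    # "fair coin" is cheap to state
biased = two_part_code_length(data, 0.7, hypothesis_bits=8)  # stating p = 0.7 costs more bits
print(round(fair, 1), round(biased, 1))  # 101.0 vs ~96.1: the better fit repays the extra description cost
```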

Relation to Bayesian inference: such methods are widely used in statistics and machine learning.
Consider "ideal" learning: given the data, what is the shortest code, and how well does that shortest code work, for prediction and for identification?

Ignore the question of search

Makes general results feasible

But search won’t go away…!

Fundamental question: when is learning data-limited or search-limited?

Three kinds of induction

Prediction: converge on correct predictions

Identification: identify generating category/distribution in the limit

Learning causal mechanisms?? Inferring counterfactuals: the effects of intervention (cf. Pearl: from probability to causes)

II. Learning to predict

Prediction by simplicity: find the shortest 'program/explanation' for the current data, and predict using that program.
Strictly, use a 'weighted sum' of explanations, weighted by brevity…
This is equivalent to Bayes with (roughly) a 2^(−K(x)) prior, where K(x) is the length of the shortest program generating x.
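A minimal sketch of brevity-weighted prediction, assuming a tiny hand-picked hypothesis space standing in for the (uncomputable) space of all programs; the description lengths attached to each hypothesis are illustrative, and the weighting mirrors the 2^(−K) prior mentioned above.

```python
import math

# A tiny stand-in for "all programs": a few Bernoulli hypotheses, each with an
# assumed description length in bits (illustrative numbers, not real K values).
hypotheses = [
    {"p": 0.5, "bits": 1},
    {"p": 0.9, "bits": 6},
    {"p": 0.1, "bits": 6},
]

def predict_next(data):
    """P(next symbol = 1) as a mixture over hypotheses, each weighted by
    2^(-description length) times its likelihood of the data seen so far."""
    weights = []
    for h in hypotheses:
        likelihood = math.prod(h["p"] if x == 1 else 1 - h["p"] for x in data)
        weights.append(2 ** -h["bits"] * likelihood)
    total = sum(weights)
    return sum(w * h["p"] for w, h in zip(weights, hypotheses)) / total

print(predict_next([]))                        # 0.5: brevity favours the simplest hypothesis
print(predict_next([1, 1, 1, 1, 1, 1, 1, 1]))  # ~0.81: shifts towards the p = 0.9 hypothesis
```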

Summed error has finite bound (Solomonoff, 1978)

Σ_{j=1..∞} s_j ≤ (K(μ)/2) log_e 2
where s_j is the expected squared error in predicting the j-th item, and K(μ) is the length of the shortest program for the generating distribution μ.

So prediction converges [faster than 1/(n log n), for corpus size n]

Inductive inference is possible!

No independence or stationarity assumptions; just computability of generating mechanism

Applications

Language

A. Grammaticality judgements

B. Language production

C. Form-meaning mappings

Categorization

Learning from positive examples

A: Grammaticality judgments

We want a grammar that doesn't over- or under-generalize (much) w.r.t. the 'true' grammar, on sentences that are statistically likely to occur

NB. No guarantees for…
"Colorless green ideas sleep furiously" (Chomsky)
"Bulldogs bulldogs bulldogs fight fight fight" (Fodor)

Converging on a grammar: fixing undergeneralization is easy (such grammars get 'falsified'); overgeneralization is the hard problem.
We need to use absence as evidence. But the language is infinite and any corpus is finite, so almost all grammatical sentences are also absent.
Hence the logical problem of language acquisition, and Baker's paradox: the apparent impossibility of 'mere' learning from positive evidence.

Overgeneralization Theorem 

Suppose the learner has probability ε_j of erroneously guessing an ungrammatical j-th word.
Intuitive explanation: overgeneralization implies assigning smaller probabilities than needed to grammatical sentences, and hence excessive code lengths.
The total expected error is bounded:
Σ_{j=1..∞} ε_j ≤ K(μ) log_e 2
where K(μ) is the complexity of the true generating source μ.

B: Language production

Simplicity allows ‘mimicry’ of any computable statistical method of generating a corpus

Arbitrary computable probability measure μ; simplicity-based probability measure M (Li & Vitányi, 1997):
M(y | x) / μ(y | x) → 1
i.e., the simplicity-based conditional probability of a continuation y converges to the true probability as the corpus x grows.

C: Learning form-meaning mappings

So far we have ignored semantics. Suppose the language input consists of form-meaning pairs (cf. Pinker).
Assume only that the form → meaning and meaning → form mappings are computable (they don't have to be deterministic)…

A theorem. It follows that:
the total errors in mapping forms to (sets of) meanings (with probabilities), and
the total errors in mapping meanings to (sets of) forms (with probabilities),
…have a finite bound (and hence average errors per sentence tend to 0).

Categorization: sample n items from category C (assume all items are equally likely).
Guess, by choosing the D that provides the shortest code for the data.

General proof method:
1. Overgeneralization: D must be the basis for a shorter code than C (or you wouldn't prefer it).
2. Undergeneralization: typical data from category C will have no code shorter than n log|C|.

1. Fighting overgeneralization: D can't be much bigger than C, or it'll have a longer code length:
K(D) + n log|D| ≤ K(C) + n log|C|
As n → ∞, the constraint is that |D|/|C| ≤ 1 + O(1/n)

2. Fighting undergeneralization: but the guess must cover most of the correct category, or it would provide a "suspiciously" short code for the data.
Typicality: K(D|C) + n log|C∩D| ≥ n log|C|
As n → ∞, the constraint is that |C∩D|/|C| ≥ 1 − O(1/n)
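A toy sketch of this selection rule, under simplifying assumptions (not the authors' construction): intervals of integers stand in for categories, a fixed 16-bit interval description stands in for K(D), and the learner picks the D minimizing K(D) + n log|D| among candidates that cover the positive examples.

```python
import math, random

# Toy setup: the true category C is a block of integers; candidate guesses D are intervals.
C = list(range(20, 40))
sample = [random.choice(C) for _ in range(50)]   # n positive examples, sampled uniformly from C

def code_length(D, sample):
    """Crude two-part code standing in for K(D) + n*log2|D|: ~16 bits to describe an
    interval by its endpoints, plus log2|D| bits to point at each sample item inside D."""
    if not all(x in D for x in sample):
        return float("inf")                      # D must cover the observed data
    return 16 + len(sample) * math.log2(len(D))

candidates = [set(range(lo, hi)) for lo in range(0, 45, 5) for hi in range(lo + 5, 65, 5)]
best = min(candidates, key=lambda D: code_length(D, sample))
print(min(best), max(best), len(best))   # typically 20 39 20: neither much bigger nor much smaller than C
```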


Implication: |D| converges to near |C|.
Accuracy is bounded by O(1/n), with n samples and i.i.d. assumptions.
The actual rate depends crucially on the structure of the category.
Language: needs lots of examples (but how many?).
Some categories may need only a few (one?) examples (Tenenbaum, Feldman).

III. Learning to identify

Hypothesis identification

Induction of ‘true’ hypothesis, or category, or language

In philosophy of science, this is typically viewed as a hard problem…

Needs stronger assumptions than prediction

Identification in the limit: the problem. Assume endless data. Goal: specify an algorithm that, at each point, picks a hypothesis, and eventually locks in on the correct hypothesis.
(Though it can never announce it, as there may always be an additional low-frequency item that has yet to be encountered.)
Gold, Osherson et al. have studied this extensively.
Sometimes viewed as showing that identification is not possible (but really a mix of positive and negative results).
But i.i.d. and computability assumptions allow a general positive result.

Algorithm. The generating process has two parts: a program which specifies the distribution Pr, and sampling from Pr, which takes an average code length of H(Pr) per data point.
Pick a specific set of data (which needs to be 'long enough'); we won't necessarily know what is long enough, which is an extra assumption.
Specify an enumeration of programs for Pr, e.g., in order of length.
Run them, dovetailing.
Initialize with any Pr; flip to the Pr corresponding to the shortest program so far that has generated the data.

Dovetailing (the numbers show which global step is given to which program):
prog1: 1 2 4 7 …
prog2: 3 5 8 …
prog3: 6 9 …
prog4: 10 …
Run the programs in order, dovetailing, where each program gets a 2^(−length) share of the steps. This process runs for ever (some programs loop).
The shortest program so far that has generated the data is "pocketed"…
This will always finish on the "true" program.
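A minimal sketch of the dovetailed search, under simplifying assumptions: the two toy generators and their "lengths" are illustrative stand-ins for an enumeration of all programs in order of length, each round gives shorter programs more steps (a crude stand-in for the 2^(−length) share), and the shortest program whose output matches the observed data is pocketed.

```python
# Dovetailed search: run every candidate "program" a little at a time, give shorter
# programs a larger share of steps, and pocket the shortest one whose output matches
# the observed data.

observed = "abab"

def repeat_ab():          # pretend description length: 3 bits
    s = ""
    while True:
        s += "ab"
        yield s

def repeat_a():           # pretend description length: 2 bits
    s = ""
    while True:
        s += "a"
        yield s

candidates = [(2, repeat_a()), (3, repeat_ab())]   # (length in bits, running generator)
pocketed = None

for _round in range(10):                           # dovetail: every round, each program advances
    for length, gen in candidates:
        steps = max(1, 2 ** (4 - length))          # shorter programs get more steps per round
        for _ in range(steps):
            out = next(gen)
            if out == observed and (pocketed is None or length < pocketed[0]):
                pocketed = (length, out)           # shortest data-generating program found so far

print(pocketed)   # (3, 'abab'): only the longer program ever produces the observed corpus
```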

Overwhelmingly likely to work... (as n → ∞, Prob(correct identification) → 1).
For a large enough stream of n typical data points, no alternative model does better.
Coding data generated by Pr using Pr' rather than Pr wastes an expected
n · D(Pr || Pr') bits
D(Pr || Pr') > 0 whenever Pr' ≠ Pr, so this swamps the initial code length, for large enough n.

[Chart: initial code lengths K(Pr) and K(Pr'); by n = 8, Pr wins]
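A quick numerical check of the swamping argument (not from the talk), with two Bernoulli models standing in for Pr and Pr' and made-up initial code lengths:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D(p || q), in bits, between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

K_true, K_rival = 20, 5            # illustrative program lengths: the rival is shorter up front
p_true, p_rival = 0.7, 0.5
waste_per_item = kl_bernoulli(p_true, p_rival)   # extra bits per data point when coding with the rival

# Corpus size at which the rival's head start in code length is used up.
n = math.ceil((K_true - K_rival) / waste_per_item)
print(round(waste_per_item, 3), n)   # ~0.119 bits wasted per item; the true model wins after ~127 points
```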

IV. A methodology for assessing learnability

Assessing learnability in cognition?
A constraint c is learnable if a code which:
1. "invests" l(c) bits to encode c, can…
2. recoup its investment: save more than l(c) bits in encoding the data.

Nativism? c is acquired, but there is not enough data to recoup the investment (e.g., little or no relevant data).
Viability of empiricism? There is an ample supply of data to recoup l(c).
Cf. Tenenbaum, Feldman…

Language acquisition: Poverty of the stimulus, quantified

Consider the cost of a linguistic constraint (e.g., noun-verb agreement; subjacency; phonological constraints).
Cost is assessed by the length of its formulation (the length of the linguistic rules).
Saving: the reduction in the cost of coding data (perceptual, linguistic).

Easy example: learning singular-plural

With the agreement constraint:
John loves tennis (x bits)
They love_ tennis (y bits)
Without the constraint, each sentence must also signal which agreement form is used:
John loves tennis / *John love_ tennis (x+1 bits)
They love_ tennis / *They loves tennis (y+1 bits)

If constraint applies to proportion p of n sentences, constraint saves pn bits.
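Under that accounting (the constraint saves roughly one bit on each of the pn sentences it applies to, at a one-off cost of l(c) bits to state), the break-even corpus size is just l(c)/p. A small illustration with made-up numbers:

```python
import math

def break_even_n(constraint_bits, p):
    """Smallest corpus size n at which a constraint costing `constraint_bits` to state
    pays for itself, if it saves about one bit on the proportion p of sentences it
    covers (the accounting used in the singular/plural example above)."""
    return math.ceil(constraint_bits / p)

# Made-up numbers: a 200-bit agreement rule that applies to 30% of sentences.
print(break_even_n(200, 0.3))   # 667: a few hundred sentences recoup the investment
```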

Visual structure: ample data? Depth from stereo.
Invest: an algorithm for correspondence.
Recoup: almost a whole image (that's a lot!).
Perhaps stereo could be inferred from a single stereo image?

Object/texture models (Yuille): there is an investment in building the model, but it is recouped in compression, relative to the "raw" image description.
Presumably few images are needed?

A harder linguistic case: Baker's paradox (with Luca Onnis and Matthew Roberts)

Quasi-regular structures are ubiquitous in language: e.g., alternations

It is likely that John will come / It is possible that John will come
John is likely to come / *John is possible to come
(Baker, 1979; see also Culicover)

Strong winds / High winds
Strong currents / *High currents

I love going to Italy! / I enjoy going to Italy!
I love to go to Italy! / *I enjoy to go to Italy!

Baker’s paradox (Baker, 1979)

Selectional restrictions: “holes” in the space of possible sentences allowed by a given grammar…

How does the learner avoid falling into the holes??

i.e., how does the learner distinguish genuine ‘holes’ from the infinite number of unheard grammatical constructions?

Our abstract theory tells us something: the theorem on grammaticality judgments shows that the paradox is solvable, in the asymptote and with no computational restrictions.
But can this be scaled down: can specific 'alternation' patterns be learned from the corpus the child actually hears?

Argument by information investment: to encode an exception, which appears to have probability x, requires
log2(1/x) bits
But eliminating this probability mass x makes all other sentences 1/(1−x) times more likely, saving
n log2(1/(1−x)) bits over n sentences.
Does the saving outweigh the investment?
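Plugging illustrative numbers into that investment/return calculation (the exception's probability x and the corpus size below are assumptions for illustration, not CHILDES estimates):

```python
import math

def exception_pays_off(x, n):
    """Investment vs. return for encoding an exception of probability mass x, over n sentences.
    Investment: log2(1/x) bits to state the exception.
    Return: removing mass x makes every other sentence 1/(1-x) times more likely,
    saving log2(1/(1-x)) bits per sentence."""
    investment = math.log2(1 / x)
    saving = n * math.log2(1 / (1 - x))
    return investment, saving, saving > investment

# Made-up numbers: an overgeneralization that, if grammatical, would occur about once
# in 10,000 sentences, assessed against a corpus treated as 500,000 sentences.
print(exception_pays_off(1e-4, 500_000))   # investment ~13.3 bits, saving ~72.1 bits: worth learning
```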

An example: recovery from overgeneralizations.
The rabbit hid. / You hid the rabbit!
The rabbit disappeared. / *You disappeared the rabbit!

Return on ‘investment’ over 5M words from the CHILDES database is easily sufficient

But this methodology can be applied much more widely (and aimed at fitting the time course of U-shaped generalization, and at when overgeneralizations do or do not arise).

V. Where next?

Can we learn causal structure from observation?

What happens if we move the left-hand stick?
Liftability; breakability; edibility; what is attached to what; what is resting on what.

Without this, perception is fairly useless as an input for action

The output of perception provides a description in terms of causality

Inferring causality from observation: The hard problem of induction

Formal question: suppose a modular computer program generates a stream of data of indefinite length… Under what conditions can the modularity be recovered? How might "interventions"/experiments help?
(Key technical idea: the Kolmogorov sufficient statistic)
[Diagram: generative process producing sensory input]

Fairly uncharted territory: if the data is generated by independent processes, then one model of the data will involve recapitulating those processes.
But will there be other, alternative modular programs, which might be shorter? Hopefully not!
A completely open field…
