TRANSCRIPT
Learning Syntax with Minimum Description Length
Mike Dowman
9 March 2006
Syntactic Theory
Chomsky (1957):
• Discovery procedure
• Decision procedure
• Evaluation procedure
Chomsky (1965):
Language acquisition
Poverty of the Stimulus
• Evidence available to children is utterances produced by other speakers
• No direct cues to sentence structure
• Or to word categories
So children need prior knowledge of possible structures
UG
Negative Evidence
• Some constructions seem impossible to learn without negative evidence
John gave a painting to the museum
John gave the museum a painting
John donated a painting to the museum
* John donated the museum a painting
Implicit Negative Evidence
If constructions don’t appear can we just assume they’re not grammatical?
No – we only see a tiny proportion of possible, grammatical sentences
People generalize from examples they have seen to form new utterances
‘[U]nder exactly what circumstances does a child conclude that a nonwitnessed sentence is ungrammatical?’ (Pinker, 1989)
Minimum Description Length (MDL)
MDL may be able to solve the poverty of the stimulus problem
Prefers the grammar that results in the simplest overall description of data
• So prefers simple grammars
• And grammars that result in simple descriptions of the data
Simplest means specifiable using the least amount of information
[Diagram: the space of possible sentences, showing the observed sentences and three inferred grammars: one that is a good fit to the data, one that is simple but non-constraining, and one that is complex but constraining.]
MDL and Bayes’ Rule
P(h|d) ∝ P(h) P(d|h)
• h is a hypothesis (= grammar)
• d is some data (= sentences)
The probability of a grammar given some data is proportional to its a priori probability times how likely the observed sentences would be if that grammar were correct
Probabilities and Complexities
• Information theory relates probability P and amount of information I
I = -log2 P
It takes less information to encode likely events compared to unlikely events
P(h|d) ∝ 2^-(I(h) + I(d|h))
The best grammar is the one that allows both itself and the data to be encoded using the least amount of information
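As an illustrative sketch (not the model's actual code), the link between probability and code length, and the resulting MDL objective, can be written directly in Python; the bit counts in the comparison are made-up numbers:

import math

def code_length(p):
    """Bits needed to encode an event of probability p (I = -log2 P)."""
    return -math.log2(p)

print(code_length(0.5))   # 1.0 bit: likely events are cheap to encode
print(code_length(0.01))  # ~6.64 bits: unlikely events are expensive

def mdl_score(grammar_bits, data_bits_given_grammar):
    """MDL evaluation: bits to encode the grammar plus bits to encode the
    data in terms of that grammar. Minimising this sum corresponds to
    maximising P(h|d), since P(h|d) is proportional to 2^-(I(h) + I(d|h))."""
    return grammar_bits + data_bits_given_grammar

# Hypothetical comparison: a simple but unrestrictive grammar versus a
# more complex grammar that fits the data more tightly.
print(mdl_score(50, 400))   # 450 bits
print(mdl_score(120, 300))  # 420 bits -- preferred under MDL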
Complexity and Probability
• More complex grammar → lower probability
• More restrictive grammar → fewer choices for data, so each possibility has a higher probability
MDL finds a middle ground between always generalizing and never generalizing
Encoding Grammars and Data
[Diagram: a bit string such as 1010100111010100101101010001100111100011010110 encodes the grammar followed by the data coded in terms of that grammar; a decoder reconstructs both from the bits.]
A → B C
B → D E
E → {kangaroo, aeroplane, comedian}
D → {the, a, some}
C → {died, laughed, burped}
The comedian died
A kangaroo burped
The aeroplane laughed
Some comedian burped
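As an illustrative sketch (my own, not part of the original slides), this toy grammar can be written down and used to generate sentences like those above:

import random

# The toy grammar above: branching rules plus sets of terminal words.
branching = {"A": ("B", "C"), "B": ("D", "E")}
terminals = {"E": ["kangaroo", "aeroplane", "comedian"],
             "D": ["the", "a", "some"],
             "C": ["died", "laughed", "burped"]}

def expand(symbol):
    """Recursively expand a symbol into a list of words."""
    if symbol in branching:
        left, right = branching[symbol]
        return expand(left) + expand(right)
    return [random.choice(terminals[symbol])]

# Prints sentences such as "the comedian died" or "a kangaroo burped".
print(" ".join(expand("A")))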
MDL and Prior (Innate?) Bias
• MDL solves the difficult problem of deciding prior probability for each grammar
• But MDL is still subjective – the prior bias is just hidden in the formalism chosen to represent grammars, and in the encoding scheme
MDL in Linguistics
• Morphology (John Goldsmith, 2001; Michael Brent, 1993)
• Phonology (Mark Ellison, 1992)
• Syntax (Andreas Stolcke, 1994; Langley and Stromsten, 2000; Grünwald 1994; Onnis, Roberts and Chater, 2002)
• Iterated Learning (Teal and Taylor, 2000; Brighton and Kirby, 2001; Roberts, Onnis and Chater, 2005)
Learning Phrase Structure Grammars: Dowman (2000)
• Binary or non-branching rules:
S → B C
B → E
C → tomato
• All derivations start from special symbol S
Encoding Grammars
Grammars can be coded as lists of symbols, three per rule
• A null symbol in the 3rd position indicates a non-branching rule
• The first symbol is the rule's left-hand side, the second and third its right-hand side
S, B, C, B, E, null, C, tomato, null
Statistical Encoding of Grammars
• First we encode the frequency of each symbol
• Then encode each symbol using the frequency information
S, B, C, B, E, null, C, tomato, null
I(S) = -log2(1/9)   I(null) = -log2(2/9)
Uncommon symbols have a higher coding length than common ones
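A sketch of this grammar encoding (an illustration only; it ignores the cost of transmitting the frequency table itself, which the real scheme must also pay for):

import math
from collections import Counter

# Rules as (LHS, RHS1, RHS2); None marks the third slot of a non-branching rule.
rules = [("S", "B", "C"), ("B", "E", None), ("C", "tomato", None)]

# Flatten to the symbol list: S, B, C, B, E, null, C, tomato, null
symbols = [s for rule in rules for s in rule]

counts = Counter(symbols)
total = len(symbols)

# Each occurrence costs -log2(frequency / total) bits, so uncommon symbols
# (e.g. 'tomato', 1/9) cost more than common ones (e.g. null, 2/9).
grammar_bits = sum(-math.log2(counts[s] / total) for s in symbols)
print(round(grammar_bits, 2))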
1 S → NP VP
2 NP → john
3 NP → mary
4 VP → screamed
5 VP → died
Data encoding: 1, 2, 4, 1, 2, 5, 1, 3, 4
There is a restricted range of choices at each stage of the derivation
Fewer choices = higher probability
Encoding Data
Data:
John screamed
John died
Mary screamed
If we record the frequency of each rule, this information can help us make a more efficient encoding
• 1 S → NP VP (3)
• 2 NP → john (2)
• 3 NP → mary (1)
• 4 VP → screamed (2)
• 5 VP → died (1)
Data: 1, 2, 4, 1, 2, 5, 1, 3, 4
Probabilities: 1 3/3, 2 2/3, 4 2/3, 1 3/3, 2 2/3…
Statistical Encoding of Data
Total frequency for NP = 3
Total frequency for VP = 3
Total frequency for S = 3
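A sketch of this data encoding (an illustration under the simplifying assumption that each rule's probability is just its frequency divided by the total frequency of rules sharing its left-hand side; the model's actual scheme may differ in detail):

import math
from collections import defaultdict

# Numbered rules with their left-hand sides and observed frequencies.
rules = {1: ("S", "NP VP"), 2: ("NP", "john"), 3: ("NP", "mary"),
         4: ("VP", "screamed"), 5: ("VP", "died")}
rule_freq = {1: 3, 2: 2, 3: 1, 4: 2, 5: 1}

# Total frequency of rules sharing each left-hand side (S = 3, NP = 3, VP = 3).
lhs_total = defaultdict(int)
for rule_id, (lhs, _) in rules.items():
    lhs_total[lhs] += rule_freq[rule_id]

def data_bits(rule_sequence):
    """Bits to encode a sequence of rule choices: at each derivation step
    a rule's probability is its frequency over the total for its LHS."""
    bits = 0.0
    for rule_id in rule_sequence:
        lhs = rules[rule_id][0]
        p = rule_freq[rule_id] / lhs_total[lhs]  # e.g. rule 1: 3/3, rule 2: 2/3
        bits += -math.log2(p)
    return bits

# "John screamed", "John died", "Mary screamed" as the rule sequence 1,2,4,1,2,5,1,3,4.
print(round(data_bits([1, 2, 4, 1, 2, 5, 1, 3, 4]), 2))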
Encoding in My Model
[Diagram: the encoded bit string is read by the decoder to recover, in turn, the symbol frequencies, the rule frequencies, the grammar and the data.]
Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)
Rule frequencies: rule 1 (3), rule 2 (2), rule 3 (1), rule 4 (2), rule 5 (1)
Grammar: 1 S → NP VP, 2 NP → john, 3 NP → mary, 4 VP → screamed, 5 VP → died
Data: John screamed, John died, Mary screamed
Number of bits decoded = evaluation
Creating Candidate Grammars
• Start with simple grammar that allows all sentences
• Make simple change and see if it improves the evaluation (add a rule, delete a rule, change a symbol in a rule, etc.)
• Annealing search
• First stage: just look at data coding length
• Second stage: look at overall evaluation
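A schematic sketch of this kind of search (my own illustration; anneal, evaluate and propose_change are placeholder names, and the real model's two-stage procedure is not reproduced here):

import math
import random

def anneal(initial_grammar, data, evaluate, propose_change,
           steps=10000, start_temp=5.0):
    """Simulated-annealing search over candidate grammars.
    evaluate(grammar, data) returns the description length in bits (lower is better);
    propose_change(grammar) returns a copy with one small edit
    (add a rule, delete a rule, change a symbol in a rule, ...)."""
    current = initial_grammar
    current_score = evaluate(current, data)
    for step in range(steps):
        temperature = start_temp * (1 - step / steps) + 1e-9
        candidate = propose_change(current)
        candidate_score = evaluate(candidate, data)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse grammars with a probability
        # that shrinks as the temperature falls.
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
    return current, current_score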
Example: English
Training data:
John hit Mary
Mary hit Ethel
Ethel ran
John ran
Mary ran
Ethel hit John
Noam hit John
Ethel screamed
Mary kicked Ethel
John hopes Ethel thinks Mary hit Ethel
Ethel thinks John ran
John thinks Ethel ran
Mary ran
Ethel hit Mary
Mary thinks John hit Ethel
John screamed
Noam hopes John screamed
Mary hopes Ethel hit John
Noam kicked Mary
Learned grammar:
S → NP VP
VP → ran
VP → screamed
VP → Vt NP
VP → Vs S
Vt → hit
Vt → kicked
Vs → thinks
Vs → hopes
NP → John
NP → Ethel
NP → Mary
NP → Noam
Evaluations
[Bar chart: evaluation in bits (scale 0–450) for the initial grammar and the learned grammar, broken down into overall evaluation, grammar and data components.]
Dative Alternation
• Two sub-categories of verb
• Productive use of new verbs
• U-shaped learning
John gave a painting to the museum
John gave the museum a painting
John donated a painting to the museum
* John donated the museum a painting
Training Data
• Three alternating verbs: gave, passed, lent
• One non-alternating verb: donated
• One verb seen only once: sent
The museum lent Sam a painting
John gave a painting to Sam
Sam donated John to the museum
The museum sent a painting to Sam
Dative Evaluations
[Bar chart: evaluation in bits (scale 0–3500) for the initial grammar and the learned grammar on the dative data, broken down into overall evaluation, grammar and data components.]
Learned Structures
John gave a painting to Sam
[Parse tree for the sentence above, with word categories NP VA DET N P NP and higher constituents NP, X, Y and Z dominated by S.]
But all sentences generated by the grammar are grammatical
Two Verb Classes Learned
• Learned grammar distinguishes alternating and non-alternating verbs
A grammar with one verb class would be simpler
So why two classes?
A Grammar with one Verb Class
donated doesn’t alternate
donated alternates
Overall Evaluation (bits)
1703.4 1710.4
Grammar (bits)
321.0 298.2
Data (bits) 1382.3 1412.2
donated was placed in the same class as the other verbs and the redundant rule was deleted
The new grammar is simpler
But predicts many ungrammatical sentences
U-shaped Learning
• With less data, a grammar with only one class of verbs was learned (so donated could appear in both constructions)
• In this case the benefit derived from a better description of the data was not enough to justify a more complex grammar
So a simpler grammar was preferred in this case
• The model places sent in the alternating class. Why?
Y → VA NP
Y → VA Z
Y → VP Z
VA → passed
VA → gave
VA → lent
VP → donated
VA / VP → sent
Regular and Irregular Rules
                           sent doesn't alternate   sent alternates
Overall Evaluation (bits)                  1703.6            1703.4
Grammar (bits)                              322.2             321.0
Data (bits)                                1381.4            1382.3
Regular constructions are preferred because the grammar is coded statistically
Why use Statistical Grammars?
• Statistics are a valuable source of information
• They help to infer when absences are due to chance
• The learned grammar predicted that sent should appear in the double object construction
• But in 150 sentences it was only seen in the prepositional dative construction
• With a non-statistical grammar we need an explanation as to why this is
• A statistical grammar knows that sent is rare, which explains the absence of double object occurrences
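A rough back-of-the-envelope illustration of that point (my own numbers; the 0.5 alternation rate is an assumption, not a figure from the model):

# If sent occurs in only 1 of 150 sentences, then even if it alternated freely
# (say, half of its uses were double-object), never seeing it in the
# double-object construction would still be quite likely by chance:
p_sent = 1 / 150           # per-sentence probability of "sent", from its frequency
p_double_object = 0.5      # assumed share of double-object uses for an alternating verb
p_absent = (1 - p_sent * p_double_object) ** 150
print(round(p_absent, 3))  # ~0.61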
Non-statistical Coding of Grammars and Non-statistical Grammars
Status of rules    Encoding of grammar   Component of evaluation   sent doesn't alternate   sent alternates
statistical        statistical           Total                                     1703.6            1703.4
                                         Grammar                                    322.2             321.0
                                         Data                                      1381.4            1382.3
statistical        non-statistical       Total                                     1734.2            1735.2
                                         Grammar                                    352.8             352.8
                                         Data                                      1381.4            1382.3
non-statistical    statistical           Total                                     1653.9            1659.0
                                         Grammar                                    227.2             226.0
                                         Data                                      1426.7            1433.0
non-statistical    non-statistical       Total                                     1684.6            1690.8
                                         Grammar                                    257.8             257.8
                                         Data                                      1426.7            1433.0
sent only alternates when the grammar is both statistical and encoded statistically
Is this really how alternating verbs are learned?
• Change-of-possession verbs (send, give) alternate
• Unless they are Latinate (donate)
• Children are aware of such cues (Gropen et al., 1989)
• So a complete theory of the dative alternation must take account of semantic and morphophonological cues
• In MDL, rules can be conditioned on semantic or phonological information
Does Pinker’s / Mazurkewich and White’s Solution Still need MDL?
• Correspondence between morphophonological/semantic cues and subcategorizations is language-specific
Must be learned from distributions
• Which semantic/phonological cues are relevant?
• Which are due to chance similarity?
The same kind of issues resurface as with a purely distributional account
Is the Learnability Problem Solved?
• MDL can learn very simple grammars for small data-sets
• No-one has succeeded in scaling it up
• A second learnability problem arises from the impossibility of considering all possible grammars
• Do we need innate constraints for the search for a correct grammar to be successful?
Alternative Grammar Formalisms
Phrase structure grammars are too simple to capture some phenomena
• Agreement
• Movement
• etcetera
But MDL is compatible with other grammar formalisms
Neurologically and Psychologically Plausible?
?
Take Home Messages
• Example sentences are a rich source of evidence about linguistic structure
• Statistical information is very valuable – so don’t ignore it
• Maybe syntax is more learned than innate
• MDL tells us which generalizations are appropriate (justified by the data) and which are not
• Lack of negative evidence is not a particular problem when using MDL
References
• Brent, M. (1993). Minimal Generative Explanations: A Middle Ground between Neurons and Triggers. Proceedings of the 15th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum Associates.
• Brighton, H. & Kirby, S. (2001). The Survival of the Smallest: Stability Conditions for the Cultural Evolution of Compositional Language. In J. Kelemen & P. Sosík (Eds.) Advances in Artificial Life. Berlin: Springer.
• Ellison, T. M. (1992). The Machine Learning of Phonological Structure. Doctor of Philosophy thesis, University of Western Australia.
• Chomsky, N. (1957). Syntactic Structures. The Hague: Mouton & Co.
• Chomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
• Goldsmith, J. (2001). Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics 27(2): 153-198.
• Gropen, J., Pinker, S., Hollander, M., Goldberg, R. & Wilson, R. (1989). The Learnability and Acquisition of the Dative Alternation in English. Language, 65, 203-257.
• Grünwald, P. (1994). A minimum description length approach to grammar inference. In G. Scheler, S. Wernter, and E. Riloff, (Eds.), Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language. Berlin: Springer Verlag.
• Langley, P., & Stromsten, S. (2000). Learning context-free grammars with a simplicity bias. In R. L. de Mantaras, and E. Plaza, (Eds.) Proceedings of the Eleventh European Conference on Machine Learning. Barcelona: Springer-Verlag.
• Onnis, L., Roberts, M. and Chater, N. (2002). Simplicity: A cure for overgeneralizations in language acquisition? In W. Gray and C. Schunn (Eds.) Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum Associates.
• Pinker, S. (1989), Learnability and Cognition: the Acquisition of Argument Structure. Cambridge, MA: MIT Press.
• Roberts, M., Onnis, L., & Chater, N. (2005). Acquisition and evolution of quasi-regular languages: two puzzles for the price of one. In Tallerman, M. (Ed.) Language Origins: Perspectives on Evolution. Oxford: Oxford University Press.
• Stolcke, A. (1994). Bayesian Learning of Probabilistic Language Models. Doctoral dissertation, Department of Electrical Engineering and Computer Science, University of California at Berkeley.
• Teal, T. K. & Taylor, C. E. (2000). Effects of Compression on Language Evolution. Artificial Life, 6: 129-143.