TRANSCRIPT
Learning Syntax with Minimum Description Length
Mike Dowman
9 March 2006
Syntactic Theory
Chomsky (1957):
• Discovery procedure
• Decision procedure
• Evaluation procedure
Chomsky (1965):
Language acquisition
Poverty of the Stimulus
• Evidence available to children is utterances produced by other speakers
• No direct cues to sentence structure
• Or to word categories
So children need prior knowledge of possible structures
UG
Negative Evidence
• Some constructions seem impossible to learn without negative evidence
John gave a painting to the museum
John gave the museum a painting
John donated a painting to the museum
* John donated the museum a painting
Implicit Negative Evidence
If constructions don’t appear can we just assume they’re not grammatical?
No – we only see a tiny proportion of possible, grammatical sentences
People generalize from examples they have seen to form new utterances
‘[U]nder exactly what circumstances does a child conclude that a nonwitnessed sentence is ungrammatical?’ (Pinker, 1989)
Minimum Description Length (MDL)
MDL may be able to solve the poverty of the stimulus problem
Prefers the grammar that results in the simplest overall description of data
• So prefers simple grammars
• And grammars that result in simple descriptions of the data
Simplest means specifiable using the least amount of information
[Diagram: the space of possible sentences, showing the observed sentences and three inferred grammars: one that is a good fit to the data, one that is simple but non-constraining, and one that is complex but constraining.]
MDL and Bayes’ Rule
P(h|d) ∝ P(h) P(d|h)
• h is a hypothesis (= grammar)
• d is some data (= sentences)
The probability of a grammar given some data is proportional to its a priori probability times how likely the observed sentences would be if that grammar were correct
Probabilities and Complexities
• Information theory relates probability P and amount of information I
I = -log2 P
It takes less information to encode likely events compared to unlikely events
P(h|d) ∝ 2^-(I(h) + I(d|h))
The best grammar is the one that allows both itself and the data to be encoded using the least amount of information
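As an illustrative sketch (not the model's actual code), the link between probability and code length, and the resulting MDL objective, can be written directly in Python; the bit counts in the comparison are made-up numbers:

import math

def code_length(p):
    """Bits needed to encode an event of probability p (I = -log2 P)."""
    return -math.log2(p)

print(code_length(0.5))   # 1.0 bit: likely events are cheap to encode
print(code_length(0.01))  # ~6.64 bits: unlikely events are expensive

def mdl_score(grammar_bits, data_bits_given_grammar):
    """MDL evaluation: bits to encode the grammar plus bits to encode the
    data in terms of that grammar. Minimising this sum corresponds to
    maximising P(h|d), since P(h|d) is proportional to 2^-(I(h) + I(d|h))."""
    return grammar_bits + data_bits_given_grammar

# Hypothetical comparison: a simple but unrestrictive grammar versus a
# more complex grammar that fits the data more tightly.
print(mdl_score(50, 400))   # 450 bits
print(mdl_score(120, 300))  # 420 bits -- preferred under MDL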
Complexity and Probability
• More complex grammar → lower probability
• More restrictive grammar → fewer choices for data, so each possibility has a higher probability
MDL finds a middle ground between always generalizing and never generalizing
Encoding Grammars and Data
[Diagram: a bit string such as 1010100111010100101101010001100111100011010110 encodes the grammar followed by the data coded in terms of that grammar; a decoder reconstructs both from the bits.]
A → B C
B → D E
E → {kangaroo, aeroplane, comedian}
D → {the, a, some}
C → {died, laughed, burped}
The comedian died
A kangaroo burped
The aeroplane laughed
Some comedian burped
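As an illustrative sketch (my own, not part of the original slides), this toy grammar can be written down and used to generate sentences like those above:

import random

# The toy grammar above: branching rules plus sets of terminal words.
branching = {"A": ("B", "C"), "B": ("D", "E")}
terminals = {"E": ["kangaroo", "aeroplane", "comedian"],
             "D": ["the", "a", "some"],
             "C": ["died", "laughed", "burped"]}

def expand(symbol):
    """Recursively expand a symbol into a list of words."""
    if symbol in branching:
        left, right = branching[symbol]
        return expand(left) + expand(right)
    return [random.choice(terminals[symbol])]

# Prints sentences such as "the comedian died" or "a kangaroo burped".
print(" ".join(expand("A")))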
MDL and Prior (Innate?) Bias
• MDL solves the difficult problem of deciding prior probability for each grammar
• But MDL is still subjective – the prior bias is just hidden in the formalism chosen to represent grammars, and in the encoding scheme
MDL in Linguistics
• Morphology (John Goldsmith, 2001; Michael Brent, 1993)
• Phonology (Mark Ellison, 1992)
• Syntax (Andreas Stolcke, 1994; Langley and Stromsten, 2000; Grünwald 1994; Onnis, Roberts and Chater, 2002)
• Iterated Learning (Teal and Taylor, 2000; Brighton and Kirby, 2001; Roberts, Onnis and Chater, 2005)
Learning Phrase Structure Grammars: Dowman (2000)
• Binary or non-branching rules:
S → B C
B → E
C → tomato
• All derivations start from special symbol S
Encoding Grammars
Grammars can be coded as lists of symbols, three per rule
• A null symbol in the 3rd position indicates a non-branching rule
• The first symbol is the rule's left-hand side, the second and third its right-hand side
S, B, C, B, E, null, C, tomato, null
Statistical Encoding of Grammars
• First we encode the frequency of each symbol
• Then encode each symbol using the frequency information
S, B, C, B, E, null, C, tomato, null
I(S) = -log2(1/9)   I(null) = -log2(2/9)
Uncommon symbols have a higher coding length than common ones
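A sketch of this grammar encoding (an illustration only; it ignores the cost of transmitting the frequency table itself, which the real scheme must also pay for):

import math
from collections import Counter

# Rules as (LHS, RHS1, RHS2); None marks the third slot of a non-branching rule.
rules = [("S", "B", "C"), ("B", "E", None), ("C", "tomato", None)]

# Flatten to the symbol list: S, B, C, B, E, null, C, tomato, null
symbols = [s for rule in rules for s in rule]

counts = Counter(symbols)
total = len(symbols)

# Each occurrence costs -log2(frequency / total) bits, so uncommon symbols
# (e.g. 'tomato', 1/9) cost more than common ones (e.g. null, 2/9).
grammar_bits = sum(-math.log2(counts[s] / total) for s in symbols)
print(round(grammar_bits, 2))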
1 S → NP VP
2 NP → john
3 NP → mary
4 VP → screamed
5 VP → died
Data encoding: 1, 2, 4, 1, 2, 5, 1, 3, 4
There is a restricted range of choices at each stage of the derivation
Fewer choices = higher probability
Encoding Data
Data:
John screamed
John died
Mary screamed
If we record the frequency of each rule, this information can help us make a more efficient encoding
• 1 S → NP VP (3)
• 2 NP → john (2)
• 3 NP → mary (1)
• 4 VP → screamed (2)
• 5 VP → died (1)
Data: 1, 2, 4, 1, 2, 5, 1, 3, 4
Probabilities: 1 3/3, 2 2/3, 4 2/3, 1 3/3, 2 2/3…
Statistical Encoding of Data
Total frequency for NP = 3
Total frequency for VP = 3
Total frequency for S = 3
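A sketch of this data encoding (an illustration under the simplifying assumption that each rule's probability is just its frequency divided by the total frequency of rules sharing its left-hand side; the model's actual scheme may differ in detail):

import math
from collections import defaultdict

# Numbered rules with their left-hand sides and observed frequencies.
rules = {1: ("S", "NP VP"), 2: ("NP", "john"), 3: ("NP", "mary"),
         4: ("VP", "screamed"), 5: ("VP", "died")}
rule_freq = {1: 3, 2: 2, 3: 1, 4: 2, 5: 1}

# Total frequency of rules sharing each left-hand side (S = 3, NP = 3, VP = 3).
lhs_total = defaultdict(int)
for rule_id, (lhs, _) in rules.items():
    lhs_total[lhs] += rule_freq[rule_id]

def data_bits(rule_sequence):
    """Bits to encode a sequence of rule choices: at each derivation step
    a rule's probability is its frequency over the total for its LHS."""
    bits = 0.0
    for rule_id in rule_sequence:
        lhs = rules[rule_id][0]
        p = rule_freq[rule_id] / lhs_total[lhs]  # e.g. rule 1: 3/3, rule 2: 2/3
        bits += -math.log2(p)
    return bits

# "John screamed", "John died", "Mary screamed" as the rule sequence 1,2,4,1,2,5,1,3,4.
print(round(data_bits([1, 2, 4, 1, 2, 5, 1, 3, 4]), 2))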
Encoding in My Model
[Diagram: the encoded bit string is read by the decoder to recover, in turn, the symbol frequencies, the rule frequencies, the grammar and the data.]
Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)
Rule frequencies: rule 1 (3), rule 2 (2), rule 3 (1), rule 4 (2), rule 5 (1)
Grammar: 1 S → NP VP, 2 NP → john, 3 NP → mary, 4 VP → screamed, 5 VP → died
Data: John screamed, John died, Mary screamed
Number of bits decoded = evaluation
Creating Candidate Grammars
• Start with simple grammar that allows all sentences
• Make simple change and see if it improves the evaluation (add a rule, delete a rule, change a symbol in a rule, etc.)
• Annealing search
• First stage: just look at data coding length
• Second stage: look at overall evaluation
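A schematic sketch of this kind of search (my own illustration; anneal, evaluate and propose_change are placeholder names, and the real model's two-stage procedure is not reproduced here):

import math
import random

def anneal(initial_grammar, data, evaluate, propose_change,
           steps=10000, start_temp=5.0):
    """Simulated-annealing search over candidate grammars.
    evaluate(grammar, data) returns the description length in bits (lower is better);
    propose_change(grammar) returns a copy with one small edit
    (add a rule, delete a rule, change a symbol in a rule, ...)."""
    current = initial_grammar
    current_score = evaluate(current, data)
    for step in range(steps):
        temperature = start_temp * (1 - step / steps) + 1e-9
        candidate = propose_change(current)
        candidate_score = evaluate(candidate, data)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse grammars with a probability
        # that shrinks as the temperature falls.
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
    return current, current_score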
Example: English
Training data:
John hit Mary
Mary hit Ethel
Ethel ran
John ran
Mary ran
Ethel hit John
Noam hit John
Ethel screamed
Mary kicked Ethel
John hopes Ethel thinks Mary hit Ethel
Ethel thinks John ran
John thinks Ethel ran
Mary ran
Ethel hit Mary
Mary thinks John hit Ethel
John screamed
Noam hopes John screamed
Mary hopes Ethel hit John
Noam kicked Mary
Learned grammar:
S → NP VP
VP → ran
VP → screamed
VP → Vt NP
VP → Vs S
Vt → hit
Vt → kicked
Vs → thinks
Vs → hopes
NP → John
NP → Ethel
NP → Mary
NP → Noam
Evaluations
[Bar chart: evaluation in bits (scale 0–450) for the initial grammar and the learned grammar, broken down into overall evaluation, grammar and data components.]
Dative Alternation
• Two sub-categories of verb
• Productive use of new verbs
• U-shaped learning
John gave a painting to the museum
John gave the museum a painting
John donated a painting to the museum
* John donated the museum a painting
Training Data
• Three alternating verbs: gave, passed, lent
• One non-alternating verb: donated
• One verb seen only once: sent
The museum lent Sam a painting
John gave a painting to Sam
Sam donated John to the museum
The museum sent a painting to Sam
Dative Evaluations
[Bar chart: evaluation in bits (scale 0–3500) for the initial grammar and the learned grammar on the dative data, broken down into overall evaluation, grammar and data components.]
Learned Structures
John gave a painting to Sam
[Parse tree for the sentence above, with word categories NP VA DET N P NP and higher constituents NP, X, Y and Z dominated by S.]
But all sentences generated by the grammar are grammatical
Two Verb Classes Learned
• Learned grammar distinguishes alternating and non-alternating verbs
A grammar with one verb class would be simpler
So why two classes?
A Grammar with one Verb Class
donated doesn’t alternate
donated alternates
Overall Evaluation (bits)
1703.4 1710.4
Grammar (bits)
321.0 298.2
Data (bits) 1382.3 1412.2
donated was placed in the same class as the other verbs and the redundant rule was deleted
The new grammar is simpler
But predicts many ungrammatical sentences
U-shaped Learning
• With less data, a grammar with only one class of verbs was learned (so donated could appear in both constructions)
• In this case the benefit derived from a better description of the data was not enough to justify a more complex grammar
So a simpler grammar was preferred in this case
• The model places sent in the alternating class. Why?
Y → VA NP
Y → VA Z
Y → VP Z
VA → passed
VA → gave
VA → lent
VP → donated
VA / VP → sent
Regular and Irregular Rules
                           sent doesn't alternate   sent alternates
Overall Evaluation (bits)                  1703.6            1703.4
Grammar (bits)                              322.2             321.0
Data (bits)                                1381.4            1382.3
Regular constructions are preferred because the grammar is coded statistically
Why use Statistical Grammars?
• Statistics are a valuable source of information
• They help to infer when absences are due to chance
• The learned grammar predicted that sent should appear in the double object construction
• But in 150 sentences it was only seen in the prepositional dative construction
• With a non-statistical grammar we need an explanation as to why this is
• A statistical grammar knows that sent is rare, which explains the absence of double object occurrences
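A rough back-of-the-envelope illustration of that point (my own numbers; the 0.5 alternation rate is an assumption, not a figure from the model):

# If sent occurs in only 1 of 150 sentences, then even if it alternated freely
# (say, half of its uses were double-object), never seeing it in the
# double-object construction would still be quite likely by chance:
p_sent = 1 / 150           # per-sentence probability of "sent", from its frequency
p_double_object = 0.5      # assumed share of double-object uses for an alternating verb
p_absent = (1 - p_sent * p_double_object) ** 150
print(round(p_absent, 3))  # ~0.61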
Non-statistical Coding of Grammars and Non-statistical Grammars
Status of rules    Encoding of grammar   Component of evaluation   sent doesn't alternate   sent alternates
statistical        statistical           Total                                     1703.6            1703.4
                                         Grammar                                    322.2             321.0
                                         Data                                      1381.4            1382.3
statistical        non-statistical       Total                                     1734.2            1735.2
                                         Grammar                                    352.8             352.8
                                         Data                                      1381.4            1382.3
non-statistical    statistical           Total                                     1653.9            1659.0
                                         Grammar                                    227.2             226.0
                                         Data                                      1426.7            1433.0
non-statistical    non-statistical       Total                                     1684.6            1690.8
                                         Grammar                                    257.8             257.8
                                         Data                                      1426.7            1433.0
sent only alternates when the grammar is both statistical and encoded statistically
Is this really how alternating verbs are learned?
• Change-of-possession verbs (send, give) alternate
• Unless they are Latinate (donate)
• Children are aware of such cues (Gropen et al., 1989)
• So a complete theory of the dative alternation must take account of semantic and morphophonological cues
• In MDL, rules can be conditioned on semantic or phonological information
Does Pinker’s / Mazurkewich and White’s Solution Still need MDL?
• Correspondence between morphophonological/semantic cues and subcategorizations is language-specific
Must be learned from distributions
• Which semantic/phonological cues are relevant?
• Which are due to chance similarity?
The same kind of issues resurface as with a purely distributional account
Is the Learnability Problem Solved?
• MDL can learn very simple grammars for small data-sets
• No-one has succeeded in scaling it up
• A second learnability problem arises from the impossibility of considering all possible grammars
• Do we need innate constraints for the search for a correct grammar to be successful?
Alternative Grammar Formalisms
Phrase structure grammars are too simple to capture some phenomena
• Agreement
• Movement
• etcetera
But MDL is compatible with other grammar formalisms
Neurologically and Psychologically Plausible?
?
Take Home Messages
• Example sentences are a rich source of evidence about linguistic structure
• Statistical information is very valuable – so don’t ignore it
• Maybe syntax is more learned than innate
• MDL tells us which generalizations are appropriate (justified by the data) and which are not
• Lack of negative evidence is not a particular problem when using MDL
References
• Brent, M. (1993). Minimal Generative Explanations: A Middle Ground between Neurons and Triggers. Proceedings of the 15th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum Associates.
• Brighton, H. & Kirby, S. (2001). The Survival of the Smallest: Stability Conditions for the Cultural Evolution of Compositional Language. In J. Kelemen & P. Sosík (Eds.) Advances in Artificial Life. Berlin: Springer.
• Ellison, T. M. (1992). The Machine Learning of Phonological Structure. Doctor of Philosophy thesis, University of Western Australia.
• Chomsky, N. (1957). Syntactic Structures. The Hague: Mouton & Co.
• Chomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
• Goldsmith, J. (2001). Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics 27(2): 153-198.
• Gropen, J., Pinker, S., Hollander, M., Goldberg, R. & Wilson, R. (1989). The Learnability and Acquisition of the Dative Alternation in English. Language, 65, 203-257.
• Grünwald, P. (1994). A minimum description length approach to grammar inference. In G. Scheler, S. Wernter, and E. Riloff, (Eds.), Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language. Berlin: Springer Verlag.
• Langley, P., & Stromsten, S. (2000). Learning context-free grammars with a simplicity bias. In R. L. de Mantaras, and E. Plaza, (Eds.) Proceedings of the Eleventh European Conference on Machine Learning. Barcelona: Springer-Verlag.
• Onnis, L., Roberts, M. and Chater, N. (2002). Simplicity: A cure for overgeneralizations in language acquisition? In W. Gray and C. Schunn (Eds.) Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum Associates.
• Pinker, S. (1989), Learnability and Cognition: the Acquisition of Argument Structure. Cambridge, MA: MIT Press.
• Roberts, M., Onnis, L., & Chater, N. (2005). Acquisition and evolution of quasi-regular languages: two puzzles for the price of one. In Tallerman, M. (Ed.) Language Origins: Perspectives on Evolution. Oxford: Oxford University Press.
• Stolcke, A. (1994). Bayesian Learning of Probabilistic Language Models. Doctoral dissertation, Department of Electrical Engineering and Computer Science, University of California at Berkeley.
• Teal, T. K. & Taylor, C. E. (2000). Effects of Compression on Language Evolution. Artificial Life, 6: 129-143.