TRANSCRIPT
Learning Language from Distributional Evidence
Christopher Manning, Depts. of CS and Linguistics
Stanford University
Workshop on Where Does Syntax Come From?, MIT, Oct 2007
There’s a lot to agree on! [C. D. Yang. 2004. UG, statistics or both? Trends in CogSci 8(10)]
Both endowment (priors/biases) and learning from experience contribute to language acquisition
“To be effective, a learning algorithm … must have an appropriate representation of the relevant … data.”
Languages, and hence models of language, have intricate structure, which must be modeled.
More points of agreement
Learning language structure requires priors or biases – to work at all, but especially to work at speed
Yang is “in favor of probabilistic learning mechanisms that may well be domain-general”
I am too. Probabilistic methods (and especially Bayesian prior + likelihood methods) are perfect for this!
Probabilistic models can achieve and explain:
gradual learning and robustness in acquisition
non-homogeneous grammars of individuals
gradual language change over time
[and also other stuff, like online processing]
The disagreements are important, but two levels down
In his discussion of Saffran, Aslin, and Newport (1998), Yang contrasts “statistical learning” (of syllable bigram transition probabilities) with the use of UG principles such as one primary stress per word.
I would place both parts in the same probabilistic model.
Stress is a cue for word learning, just as syllable transition probabilities are a cue (and it’s very subtle in speech!).
Probability theory is an effective means of combining multiple, often noisy, sources of information with prior beliefs.
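To make the cue-combination point concrete, here is a minimal sketch (my own illustration, not from the talk) of how a learner might weigh two noisy cues to a word boundary, a syllable transition-probability cue and a stress cue, against a prior belief, using Bayes’ rule. All probabilities are invented.

```python
# Toy Bayesian combination of two noisy cues to a word boundary.
# All probabilities below are invented for illustration only.

def posterior_boundary(prior, likelihoods_boundary, likelihoods_no_boundary):
    """P(boundary | cues) via Bayes' rule, treating the cues as
    conditionally independent given boundary vs. no boundary."""
    p_b, p_nb = prior, 1.0 - prior
    for l_b, l_nb in zip(likelihoods_boundary, likelihoods_no_boundary):
        p_b *= l_b
        p_nb *= l_nb
    return p_b / (p_b + p_nb)

# Cue 1: a low syllable transition probability favours a boundary here.
# Cue 2: the next syllable carries primary stress, favouring a word onset.
p = posterior_boundary(
    prior=0.3,                            # prior belief in a boundary at this point
    likelihoods_boundary=(0.8, 0.7),      # P(observed cue value | boundary)
    likelihoods_no_boundary=(0.2, 0.4),   # P(observed cue value | no boundary)
)
print(f"P(boundary | both cues) = {p:.2f}")   # 0.75 with these toy numbers
```

Neither cue alone is decisive, but combined with the prior they push the posterior well above chance; that is the sense in which probability theory integrates multiple noisy sources of evidence.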
Yang keeps probabilities outside of the grammar, by suggesting that the child maintains a probability distribution over a collection of competing grammars
I would place the probabilities inside the grammar. It is more economical, explanatory, and effective.
The central questions
1. What representations are appropriate for human languages?
2. What biases are required to learn languages successfully? Linguistically informed biases – but perhaps fairly general ones are enough
3. How much of language structure can be acquired from the linguistic input? Whatever we can show to be learnable gives a lower bound on how much can be acquired, and hence bounds how much must be innate.
1. A mistaken meme: language as a homogeneous, discrete system
Joos (1950: 701–702): “Ordinary mathematical techniques fall mostly into two classes, the continuous (e.g., the infinitesimal calculus) and the discrete or discontinuous (e.g., finite group theory). Now it will turn out that the mathematics called ‘linguistics’ belongs to the second class. It does not even make any compromise with continuity as statistics does, or infinite-group theory. Linguistics is a quantum mechanics in the most extreme sense. All continuities, all possibilities of infinitesimal gradation, are shoved outside of linguistics in one direction or the other.”
[cf. Chambers 1995]
The quest for homogeneity
Bloch (1948: 7): “The totality of the possible utterances of one speaker at one time in using a language to interact with one other speaker is an idiolect. … The phrase ‘with one other speaker’ is intended to exclude the possibility that an idiolect might embrace more than one style of speaking.”
Sapir (1921: 147): “Everyone knows that language is variable.”
Variation is everywhere
The definition of an idiolect fails, as variation occurs even internal to the usage of a speaker in one style.
As least as: “Black voters also turned out at least as well as they did in 1996, if not better in some regions, including the South, according to exit polls. Gore was doing as least as well among black voters as President Clinton did that year.” (Associated Press, 2000)
Linguistic Facts vs. Linguistic Theories
Weinreich, Labov and Herzog (1968) see 20th century linguistics as having gone astray by mistakenly searching for homogeneity in language, on the misguided assumption that only homogeneous systems can be structured
Probability theory provides a method for describing structure in variable systems!
The need for probability models inside syntax
The motivation comes from two sides:
Categorical linguistic theories claim too much: they place a hard categorical boundary of grammaticality where really there is a fuzzy edge, determined by many conflicting constraints and issues of conventionality vs. human creativity.
Categorical linguistic theories explain too little: they say nothing at all about the soft constraints which explain how people choose to say things – something that language educators, computational NLP people, and historical linguists and sociolinguists dealing with real language usually want to know about.
Clausal argument subcategorization frames
Problem: in context, language is used more flexibly than categorical constraints suggest:
E.g., most subcategorization frame ‘facts’ are wrong
Pollard and Sag (1994) inter alia [regard vs. consider]:
*We regard Kim to be an acceptable candidate
We regard Kim as an acceptable candidate
The New York Times:
“As 70 to 80 percent of the cost of blood tests, like prescriptions, is paid for by the state, neither physicians nor patients regard expense to be a consideration.”
“Conservatives argue that the Bible regards homosexuality to be a sin.”
And the same pattern repeats for many verbs and frames….
[Figure: probability mass function over the subcategorization frames of regard – __ NP as NP, __ NP NP, __ NP as AdjP, __ NP as PP, __ NP as VP[ing], __ NP VP[inf] (probabilities ranging from 0% to 80%).]
Bresnan and Nikitina (2003) on the Dative Alternation
Pinker (1981), Krifka (2001): verbs of instantaneous force allow dative alternation but not “verbs of continuous imparting of force” like push
“As Player A pushed him the chips, all hell broke loose at the table.”
www.cardplayer.com/?sec=afeature&art id=165
Pinker (1981), Levin (1993), Krifka (2001): verbs of instrument of communication allow dative shift but not verbs of manner of speaking
“Hi baby.” Wade says as he stretches. You just mumble him an answer. You were comfy on that soft leather couch. Besides …
www.nsyncbitches.com/thunder/fic/break.htm
In context such usages are unremarkable! It’s just the productivity and context-dependence of language.
Examples are rare → because these are gradient constraints.
Here data is gathered from a really huge corpus: the web.
The disappearing hard constraints of categorical grammars
We see the same thing over and over. Another example is constraints on Heavy NP Shift [cf. Wasow 2002].
You start with a strong categorical theory, which is mostly right.
People point out exceptions and counterexamples.
You either weaken it in the face of counterexamples, or you exclude the examples from consideration.
Either way you end up without an interesting theory.
There’s little point in aiming for explanatory adequacy when the descriptive adequacy of the used representations [as opposed to particular descriptions] just isn’t there.
There is insight in the probability distributions!
Explaining more: What do people say?
What people say has two parts:
Contingent facts about the world: people have been talking a lot about Iraq lately.
The way speakers choose to express ideas using the resources of the language: people don’t often put that-clauses pre-verbally:
It appears almost certain that we will have to take a loss
That we will have to take a loss appears almost certain
The latter is properly part of people’s Knowledge of Language. Part of syntax.
Variation is part of competence [Labov 1972: 125]
“The variable rules themselves require at so many points the recognition of grammatical categories, of distinctions between grammatical boundaries, and are so closely interwoven with basic categorical rules, that it is hard to see what would be gained by extracting a grain of performance from this complex system. It is evident that [both the categorical and the variable rules proposed] are a part of the speaker’s knowledge of language.”
What do people say?
Simply delimiting a set of grammatical sentences provides only a very weak description of a language, and of the ways people choose to express ideas in it
Probability densities over sentences and sentence structures can give a much richer view of language structure and use
In particular, we find that (apparently) categorical constraints in one language often reappear as the same soft generalizations and tendencies in other languages
[Givón 1979; Bresnan, Dingare, and Manning 2001]
Linguistic theory should be able to uniformly capture these constraints, rather than only recognizing them when they are categorical.
Explaining more: what determines ditransitive vs. NP PP for dative verb
[Bresnan, Cueni, Nikitina, and Baayen 2005]
Build a mixed effects [logistic regression] model over a corpus of examples (a toy illustration follows the variable list below).
The model is able to pull apart the correlations between various predictive variables.
Explanatory variables:
Discourse accessibility, definiteness, pronominality, animacy (Thompson 1990, Collins 1995)
Differential length in words of recipient and theme (Arnold et al. 2000, Wasow 2002, Szmrecsanyi 2004b)
Structural parallelism in dialogue (Weiner and Labov 1983, Bock 1986, Szmrecsanyi 2004a)
Number, person (Aissen 1999, 2003; Haspelmath 2004; Bresnan and Nikitina 2003)
Concreteness of theme
Broad semantic class of verb (transfer, prevent, communicate, …)
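As a concrete, much simplified illustration of the kind of model being described, here is a sketch of a plain logistic regression predicting the double-object vs. prepositional-dative choice from a few of the predictors above. The data, the feature encoding, and the use of scikit-learn rather than a mixed effects package are my own assumptions for the example, not the setup of Bresnan et al.

```python
# Toy logistic regression for the dative alternation (double object vs. NP PP).
# Feature values and outcomes are fabricated for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per example: [recipient is pronoun, recipient animate, recipient given,
#                        length(theme) - length(recipient) in words]
X = np.array([
    [1, 1, 1,  2],   # "gave her the book"
    [1, 1, 1,  4],   # "gave him a long report"
    [0, 0, 0, -3],   # "gave the idea to several committees"
    [0, 1, 0, -1],   # "sent a letter to a friend"
    [1, 1, 1,  1],
    [0, 0, 0, -2],
])
y = np.array([1, 1, 0, 0, 1, 0])   # 1 = double object, 0 = prepositional dative

model = LogisticRegression().fit(X, y)
print("coefficients:", model.coef_)              # direction of each predictor's effect
print("P(double object):", model.predict_proba(X)[:, 1].round(2))
```

A genuinely mixed effects model would additionally include random intercepts (for instance per verb), which this plain-regression sketch omits.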
Explaining more: what determines ditransitive vs. NP PP for dative verb
[Bresnan, Cueni, Nikitina, and Baayen 2005] What does one learn?
Almost all the predictive variables have independent significant effects
Only a couple fail to: e.g., number of the recipient.
This shows that reductionist theories of the phenomenon that try to reduce things to one or two factors are wrong.
The first object NP is preferred to be: given, animate, definite, a pronoun, shorter.
The model can predict whether to use a double object or NP PP construction correctly 94% of the time. It captures much of what is going on in this choice.
These factors exceed in importance the differences due to individual variation.
Statistical parsing models also give insight into processing [Levy 2004]
A realistic model of human sentence processing must explain:
robustness to arbitrary input, and accurate disambiguation
inference on the basis of incomplete input [Tanenhaus et al. 1995, Altmann and Kamide 1999, Kaiser and Trueswell 2004]
why sentence processing difficulty is differential and localized
On the traditional view, resource limitations, especially memory, drive processing difficulty.
Locality-driven processing [Gibson 1998, 2000]: multiple and/or more distant dependencies are harder to process.
Processing is easy for “the reporter who attacked the senator” but hard for “the reporter who the senator attacked”.
Expectation-driven processing [Hale 2001, Levy 2005]
Alternative paradigm: expectation-based models of syntactic processing.
Expectations are weighted averages over probabilities.
Structures we expect are easy to process.
Modern computational linguistics techniques of statistical parsing yield a precise psycholinguistic model.
Model matches empirical results of many recent experiments better than traditional memory-limitation models
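The quantitative link between expectation and difficulty in these models is surprisal: the difficulty of a word is proportional to how unexpected it is given what has been read so far. In the usual formulation (the notation here is mine):
surprisal(w_i) = −log P(w_i | w_1 … w_{i−1})
Highly expected words carry low surprisal and are predicted to be read quickly.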
Example: Verb-final domains
Locality predictions and empirical results
[Konieczny 2000] looked at reading times at German final verbs
Locality-based models (Gibson 1998) predict difficulty for longer clauses
But Konieczny found that final verbs were read faster in longer clauses
Locality prediction vs. Konieczny’s result at the final verb:
Er hat die Gruppe geführt (“He led the group”) – prediction: easy; result: slow
Er hat die Gruppe auf den Berg geführt (“He led the group to the mountain”) – prediction: hard; result: fast
Er hat die Gruppe auf den sehr schönen Berg geführt (“He led the group to the very beautiful mountain”) – prediction: hard; result: fastest
Deriving Konieczny’s results
Once we’ve seen a PP goal we’re unlikely to see another, so the expectation of seeing anything else (including the final verb) goes up.
Seeing more = having more information; more information = more accurate expectations.
Rigorously tested: word probabilities p_i(w) were computed using a PCFG derived empirically from a syntactically annotated corpus of German (the NEGRA treebank).
[Figure: incremental parse of “Er hat die Gruppe auf den Berg geführt” (S over NP, Vfin, and a VP containing NP, PP, V), with the expectations after the prefix – PP-goal? PP-loc? Verb? ADVP? – narrowing once the goal PP has been seen.]
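A minimal sketch of this logic with invented numbers (the real model derives its probabilities from the NEGRA-estimated PCFG): once the goal PP has been seen, the conditional probability that the next constituent is the final verb rises, so the verb’s surprisal falls.

```python
# Toy illustration of expectation sharpening at a German clause-final verb.
# The distributions are invented; the actual work derives them from a
# treebank-estimated PCFG.
import math

# P(next constituent | what has been seen so far in "Er hat die Gruppe ...")
after_object_only = {"PP-goal": 0.35, "PP-loc": 0.20, "ADVP": 0.15, "Verb": 0.30}
after_goal_pp     = {"PP-loc": 0.15, "ADVP": 0.10, "Verb": 0.75}

def surprisal(dist, outcome):
    """Surprisal (in bits) of an outcome under a distribution over next constituents."""
    return -math.log2(dist[outcome])

print("surprisal of the verb, no PP seen yet: ", round(surprisal(after_object_only, "Verb"), 2))
print("surprisal of the verb, after a goal PP:", round(surprisal(after_goal_pp, "Verb"), 2))
# Lower surprisal after the PP means the model predicts faster reading at the verb,
# matching Konieczny's finding that final verbs are read faster in longer clauses.
```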
[Figure: predictions from the Hale 2001/Levy 2004 model for “Er hat die Gruppe (auf den (sehr schönen) Berg) geführt”. Reading time at the final verb (ms, axis roughly 450–520) falls from the No PP to the Short PP to the Long PP condition, in step with the verb’s negative log probability (axis roughly 14.8–16.2). Locality-based difficulty (ordinal 1, 2, 3; e.g., Gibson 1998, 2000) runs in the opposite direction and would violate this monotonic relationship.]
2. & 3. Learning sentence structure from distributional evidence
Start with raw language, learn syntactic structure
Some have argued that learning syntax from positive data alone is impossible:
Gold, 1967: non-identifiability in the limit
Chomsky, 1980: the poverty of the stimulus
Many others have felt it should be possible: Lari and Young, 1990; Carroll and Charniak, 1992; Alex Clark, 2001; Mark Paskin, 2001
… but it is a hard problem
Language learning idea 1: Lexical affinity models
Words select other words on syntactic grounds
Link up pairs with high mutual information. [Yuret, 1998]: greedy linkage. [Paskin, 2001]: iterative re-estimation with EM.
Evaluation: compare linked pairs (undirected) to a gold standard
congress narrowly passed the amended bill
Method          Accuracy
Random          41.7
Paskin, 2001    39.7
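To make the lexical-affinity idea concrete, here is a small sketch (mine, not from the talk) of the two ingredients: estimating pointwise mutual information between word pairs from a toy corpus, and greedily linking the highest-PMI pairs in a sentence, roughly in the spirit of Yuret (1998).

```python
# Toy lexical-affinity linker: estimate PMI from co-occurrence counts and
# greedily link the highest-PMI word pairs in a sentence (crossing links are
# ignored here for simplicity). The corpus is invented for illustration.
import math
from collections import Counter
from itertools import combinations

corpus = [
    "congress passed the bill".split(),
    "congress narrowly passed the amended bill".split(),
    "the senate passed the amended bill".split(),
]

word_counts = Counter(w for sent in corpus for w in sent)
pair_counts = Counter(frozenset(p) for sent in corpus
                      for p in combinations(sent, 2) if p[0] != p[1])
total_words = sum(word_counts.values())
total_pairs = sum(pair_counts.values())

def pmi(w1, w2):
    """Pointwise mutual information of a word pair under the toy counts."""
    p_pair = pair_counts[frozenset((w1, w2))] / total_pairs
    p1, p2 = word_counts[w1] / total_words, word_counts[w2] / total_words
    return math.log2(p_pair / (p1 * p2)) if p_pair > 0 else float("-inf")

def greedy_link(sentence):
    """Repeatedly add the remaining highest-PMI pair until every word
    (except one root) is linked to something."""
    links, linked = [], set()
    candidates = sorted(combinations(range(len(sentence)), 2),
                        key=lambda ij: pmi(sentence[ij[0]], sentence[ij[1]]),
                        reverse=True)
    for i, j in candidates:
        if i not in linked or j not in linked:   # extend the linked structure
            links.append((sentence[i], sentence[j]))
            linked.update((i, j))
        if len(linked) == len(sentence):
            break
    return links

print(greedy_link("congress narrowly passed the amended bill".split()))
```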
Idea: Word Classes
Mutual information between words does not necessarily indicate syntactic selection.
Individual words like brushbacks are entwined with semantic facts about the world.
Syntactic classes, like NOUN and ADVERB, are bleached of word-specific semantics.
We could build dependency models over word classes. [cf. Carroll and Charniak, 1992]
NOUN ADVERB VERB DET PARTICIPLE NOUN
congress narrowly passed the amended bill
expect brushbacks but no beanballs
Problems: Word Class Models
Issues:
Too simple a model – it doesn’t work much better even when trained supervised
No representation of the valence/distance of arguments
NOUN NOUN VERB
stock prices fell
[Figure: two alternative dependency structures drawn over this tagged sequence.]
Method                     Accuracy
Random                     41.7
Carroll and Charniak, 92   44.7
Adjacent Words             53.2
congress narrowly passed the amended bill
Bias: Using better dependency representations [Klein and Manning 2004]
Model                   Classes?  Distance?  Local factor
Paskin 01               no        no         P(a | h)
Carroll & Charniak 92   yes       no         P(c(a) | c(h))
Klein/Manning (DMV)     yes       yes        P(c(a) | c(h), d)
[Figure: a head–argument dependency spanning a distance d.]
Method                Accuracy
Adjacent Words        55.9
Klein/Manning (DMV)   63.6
Idea: Can we learn phrase structure constituency as distributional clustering?
Words occur in characteristic contexts:
president: the __ of, the __ said
governor: the __ of, the __ appointed
said: sources __, president __ that
reported: sources __
Clustering words by their contexts yields classes such as {president, governor}, {said, reported}, {the, a}.
the president said that the downturn was over
[Finch and Chater 92, Schütze 93, many others]
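Here is a minimal sketch of that classic recipe (a toy corpus and a simple KMeans clustering of my own choosing; Finch & Chater and Schütze used much larger corpora and different clustering and dimensionality-reduction methods): represent each word by counts of its left and right neighbours, then cluster the vectors.

```python
# Toy distributional clustering: represent each word by its left/right
# neighbour counts and cluster the resulting vectors. The corpus is invented.
from collections import defaultdict
import numpy as np
from sklearn.cluster import KMeans

corpus = [
    "the president said that the downturn was over".split(),
    "the governor said that the vote was close".split(),
    "a president reported that a deal was near".split(),
    "a governor reported that the vote was over".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}
vectors = defaultdict(lambda: np.zeros(2 * len(vocab)))

for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    for i, w in enumerate(padded[1:-1], start=1):
        left, right = padded[i - 1], padded[i + 1]
        if left in index:
            vectors[w][index[left]] += 1                 # left-context count
        if right in index:
            vectors[w][len(vocab) + index[right]] += 1   # right-context count

words = sorted(vectors)
X = np.stack([vectors[w] for w in words])
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

clusters = defaultdict(list)
for w, c in zip(words, labels):
    clusters[c].append(w)
print(dict(clusters))   # e.g. groups like {president, governor}, {said, reported}, {the, a}
```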
Distributional learning
There is much debate in the child language acquisition literature about distributional learning and the possibility of kids successfully using it:
[Maratsos & Chalkley 1980] suggest it
[Pinker 1984] suggests it’s impossible (too many correlations, too abstract properties needed)
[Redington et al. 1998] say it succeeds because there are dominant cues relevant to language
[Mintz et al. 2002] look at the distributional structure of the input
Speaking in terms of engineering, let me just tell you that it works really well! It’s one of the most successful techniques that we have – my group uses it everywhere for NLP.
Idea: Distributional Syntax? [Klein and Manning NIPS 2003]
factory payrolls fell in september
[Figure: parse tree for the sentence – S over NP (factory payrolls) and VP (fell (PP in september)) – with one span highlighted.]
Span: payrolls fell in    Context: factory __ sept
Can we use distributional clustering for learning syntax?
Problem: Identifying Constituents
Distributional classes are easy to find…
… but figuring out which are constituents is hard.
[Figure: two scatter plots of sequence types by Principal Component 1 vs. Principal Component 2, labeled by category (NP, VP, PP) and by constituent status (+/–); example sequences include the final vote / two decades / most people, decided to / took most of / go with, of the / with a / without many, in the end / on time / for now, and the final / the initial / two of the.]
A Nested Distributional Model: Constituent-Context Model (CCM)
P(S|T) = the product, over all spans of the sentence, of P(yield | constituent or not) × P(context | constituent or not)
For “factory payrolls fell in september”, with a tree whose constituents are the whole sentence, “factory payrolls”, “fell in september”, and “in september”, the constituent (+) factors are:
P(factory payrolls fell in september | +) P(__ | +)
× P(factory payrolls | +) P(__ fell | +)
× P(fell in september | +) P(payrolls __ | +)
× P(in september | +) P(fell __ | +)
with corresponding factors, conditioned on –, for every non-constituent span.
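A toy numerical version of that product (invented probabilities, and only the constituent factors; the full CCM also multiplies in the corresponding factors for every non-constituent span and estimates all of these distributions with EM):

```python
# Toy CCM-style score for one bracketing of "factory payrolls fell in september".
# Probabilities are invented; a real CCM estimates them with EM over all spans.
P_span = {          # P(yield | constituent)
    "factory payrolls fell in september": 0.001,
    "factory payrolls": 0.01,
    "fell in september": 0.008,
    "in september": 0.02,
}
P_context = {       # P(left word __ right word | constituent)
    ("<s>", "</s>"): 0.05,
    ("<s>", "fell"): 0.03,
    ("payrolls", "</s>"): 0.04,
    ("fell", "</s>"): 0.06,
}
constituents = [
    ("factory payrolls fell in september", ("<s>", "</s>")),
    ("factory payrolls", ("<s>", "fell")),
    ("fell in september", ("payrolls", "</s>")),
    ("in september", ("fell", "</s>")),
]

score = 1.0
for span, context in constituents:
    score *= P_span[span] * P_context[context]
print(f"constituent-factor product for this tree: {score:.3e}")
```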
Initialization: A little UG?
Tree Uniform
Split Uniform
Results: Constituency
van Zaanen, 00     35.6
Right-Branch       70.0
Our Model (CCM)    81.6
[Figure: a treebank parse and the corresponding CCM parse of an example sentence.]
Combining the two models [Klein and Manning ACL 2004]
Qualitative improvements: subject-verb groups gone, modifier placement improved.
(For reference, supervised PCFG constituency recall is 92.8.)
Constituency evaluation:
Random      39.4
CCM         81.0
CCM + DMV   88.0
Dependency evaluation:
Random      45.6
DMV         62.7
CCM + DMV   64.7
Beyond surface syntax… [Levy and Manning, ACL 2004]
Features for recovering deeper structure from surface trees:
Syntactic category, parent, grandparent (subj vs. obj extraction; VP finiteness)
Syntactic path (Gildea & Jurafsky 2002): <SBAR,S,VP,S,VP>
Presence of daughters (NP under S)
Head words (wanted vs. to vs. eat)
Plus: feature conjunctions, specialized features for expletive subject dislocations, passivizations, passing featural information properly through coordinations, etc., etc.
cf. Campbell (2004 ACL) – a lot of linguistic knowledge
Evaluation on dependency metric: gold-standard input trees
[Figure: accuracy (F1, 70–100) overall and by category (NP, S, VP, ADJP, SBAR, ADVP), for “Levy&Manning, gold input trees” vs. “CF gold tree deps”.]
Might we also learn a linking to argument structure distributionally?
Why it might be possible:
Instances of open:
She opened the door with a key.
Fortunately, the corkscrew opened the bottle.
The bottle opened easily.
The door opened as they approached.
This key opens the door of the cottage.
She opened the bottle very carefully.
He opened the bottle with a corkscrew.
He opened the door.
Word distributions:
agent: she, he
patient: door, bottle
instrument: key, corkscrew
Allowed linkings:
{ agent=subject, patient=object }
{ patient=subject }
{ agent=subject, patient=object, instrument=obl_with }
{ instrument=subject, patient=object }
Probabilistic Model Learning [Grenager and Manning 2006 EMNLP]
[Figure: graphical model for a verb instance – the verb v, its linking l and ordering o, and for each argument a semantic role r_i, a syntactic position s_i, and a head word w_i; illustrated with an instance of give under the linking { 0=subj, 1=obj2, 2=obj1 }.]
Given a set of observed verb instances, what are the most likely model parameters?
Use unsupervised learning in a structured probabilistic graphical model.
A good application for EM (a schematic sketch follows below):
E-step: we compute conditional distributions over possible role vectors for each instance
M-step: trivial computation
And we repeat.
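Here is a schematic sketch of that EM loop for a stripped-down version of the model, with only verb-specific linking probabilities and role-specific word distributions; the data, candidate linkings, and smoothing are invented, and the actual model has considerably more structure.

```python
# Toy EM for verb-linking induction: the hidden variable is which linking
# generated each instance; the parameters are P(linking | verb) and P(word | role).
# Everything below (data, linkings, smoothing) is a simplified illustration.
from collections import defaultdict

# Each instance of "give": observed (syntactic position -> head word).
instances = [
    {"subj": "he", "obj1": "book", "obj2": "her"},
    {"subj": "they", "obj1": "money", "obj2": "charity"},
    {"subj": "she", "obj1": "answer"},
]
# Candidate linkings: role -> syntactic position.
linkings = [
    {"agent": "subj", "theme": "obj1", "recipient": "obj2"},
    {"agent": "subj", "theme": "obj1"},
]

p_linking = [1.0 / len(linkings)] * len(linkings)        # P(linking | give)
p_word = defaultdict(lambda: defaultdict(lambda: 0.1))   # P(word | role), smoothed

def instance_likelihood(inst, linking):
    """P(instance | linking): product of word-given-role probabilities;
    zero if the linking and the instance use different positions."""
    if set(linking.values()) != set(inst):
        return 0.0
    prob = 1.0
    for role, pos in linking.items():
        prob *= p_word[role][inst[pos]]
    return prob

for _ in range(10):                                      # EM iterations
    link_counts = [1e-6] * len(linkings)                 # tiny pseudocounts
    word_counts = defaultdict(lambda: defaultdict(lambda: 1e-6))
    for inst in instances:
        # E-step: posterior over linkings for this instance.
        joint = [p_linking[k] * instance_likelihood(inst, l)
                 for k, l in enumerate(linkings)]
        z = sum(joint) or 1.0
        for k, l in enumerate(linkings):
            gamma = joint[k] / z
            link_counts[k] += gamma
            for role, pos in l.items():
                if pos in inst:
                    word_counts[role][inst[pos]] += gamma
    # M-step: re-estimate the multinomials from expected counts.
    total = sum(link_counts)
    p_linking = [c / total for c in link_counts]
    for role, counts in word_counts.items():
        norm = sum(counts.values())
        p_word[role] = defaultdict(lambda: 0.01,
                                   {w: c / norm for w, c in counts.items()})

print("P(linking | give):", [round(p, 2) for p in p_linking])
```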
Verb: give
Linkings: {0=subj, 1=obj2, 2=obj1} 0.46; {0=subj, 1=obj1, 2=to}; {0=subj, 1=obj1} 0.19; …
Roles: 0 – it, he, bill, they, that, …; 1 – power, right, stake, …; 2 – them, it, him, dept., …
Verb: pay
Linkings: {0=subj, 1=obj1} 0.32; {0=subj, 1=obj1, 2=for}; {0=subj} 0.21; {0=subj, 1=obj1, 2=to}; {0=subj, 1=obj2, 2=obj1}; …
Roles: 0 – it, they, company, he, …; 1 – $, bill, price, tax, …; 2 – stake, gov., share, amt., …
Semantic role induction results
The model achieves some traction, but it’s hard.
Learning becomes harder with greater abstraction.
This is the right research frontier to explore!
Conclusions
Probabilistic models give precise descriptions of a variable, uncertain world
There are many phenomena in syntax that cry out for non-categorical or probabilistic representations of language
Probabilistic models can – and should – be used over rich linguistic representations
They support effective learning and processing.
Language learning does require biases or priors.
But a lot more can be learned from modest amounts of input than people have thought.
There’s not much evidence of a poverty of the stimulus preventing such models being used in acquisition.