1
Modeling the Effect of Cross-Language Ambiguity on Human Syntax Acquisition
Fourth Conference on Computational Natural Language Learning
13 Sept 2000 Lisbon, Portugal
William Gregory Sakas
2
Why computationally model human language acquisition?
Pinker, 1979:
"...it may be necessary to find out how language learning could work in order for the developmental data to tell us how it does work." [emphasis mine]
3
Primary point of this talk:
Not enough to build a series of computer simulations of a cognitive model of human language acquisition and claim that it mirrors the process by which a child acquires language.
The (perhaps obvious) fact is that learners are acutely sensitive to cross-language ambiguity.
Whether or not a learning model is ultimately successful as a cognitive model is an empirical issue; depends on the ‘fit’ of the simulations with the facts about the distribution of ambiguity in human languages.
4
What’s coming:
1) Background on some linguistic theories of acquisition
2) A case study analysis of one parameter-setting model: the Structural Triggers Learner (STL), chosen for three reasons:
i) the algorithm takes to heart current generative linguistic theory
ii) it is not dependent on a particular grammar formalism
iii) the mathematics of the Markov analysis is straightforward.
3) Conjectures and a proposed research agenda
5
Learnability - Under what conditions is learning possible?
Feasibility - Is acquisition possible within a reasonable amount of time and/or with a reasonable amount of work?
6
Principles and Parameters framework
All languages share universal principles (UG), e.g., all languages have subjects of some sort.
Languages differ wrt the settings of a finite number of parameters, e.g., overt subjects are optional in a sentence (yes/no).
Null Subject Parameter (optional overt subjects): on (e.g., Spanish - yes), off (e.g., English - no).
7
A three-parameter domain (Gibson and Wexler, 1994)
'Sentences' are strings of the symbols: S, V, O1, O2, AUX, ADV
SV / VS - subject precedes verb / verb precedes subject
+V2 / -V2 - verb or aux must be in the second position in the sentence
VO / OV - verb precedes object / object precedes verb
Mari will feed the bird → S AUX V O
8
Two example languages (finite, degree-0)

SV OV +V2 (German-like):
S V, S V O, O V S, S V O2 O1, O1 V S O2, O2 V S O1,
S AUX V, S AUX O V, O AUX S V, S AUX O2 O1 V, O1 AUX S O2 V, O2 AUX S O1 V,
ADV V S, ADV V S O, ADV V S O2 O1, ADV AUX S V, ADV AUX S O V, ADV AUX S O2 O1 V

SV VO -V2 (English-like):
S V, S V O, S V O1 O2, S AUX V, S AUX V O, S AUX V O1 O2,
ADV S V, ADV S V O, ADV S V O1 O2, ADV S AUX V, ADV S AUX V O, ADV S AUX V O1 O2
9
Surprisingly, G&W's simple 3-parameter domain presents nontrivial obstacles to several types of learning strategies, but the space is ultimately learnable (G&W 1994; Berwick & Niyogi 1996; Frank and Kapur 1996; Turkel 1996; Bertolo, in press).
Big question:
How will the learning process scale up in terms of feasibility as the number of parameters increases?
Two problems for most acquisition strategies:
1) Ambiguity
2) Size of the domain
10
Ambiguity across two example languages (finite, degree-0)

SV OV +V2 (German-like):
S V, S V O, O V S, S V O2 O1, O1 V S O2, O2 V S O1,
S AUX V, S AUX O V, O AUX S V, S AUX O2 O1 V, O1 AUX S O2 V, O2 AUX S O1 V,
ADV V S, ADV V S O, ADV V S O2 O1, ADV AUX S V, ADV AUX S O V, ADV AUX S O2 O1 V

SV VO -V2 (English-like):
S V, S V O, S V O1 O2, S AUX V, S AUX V O, S AUX V O1 O2,
ADV S V, ADV S V O, ADV S V O1 O2, ADV S AUX V, ADV S AUX V O, ADV S AUX V O1 O2

(The slide highlights a few ambiguous strings, e.g., S V and S AUX V, which occur in both languages.)
11
# Parameters = 30
# Grammars = 2^30 = 1,073,741,824
And the search space is exponential.
Ambiguity robs the learner of certainty that a parameter value is correct for the target language.
Search heuristics need to be employed.
12
So, how to answer questions of feasibility as the number of grammars (exponentially) scales up?
Creating an input space for a linguistically plausible (large) domain is not practical for simulations.
Answer: introduce some formal notions in order to abstract away from the specific linguistic content of the input data.
13
A hybrid approach (formal/empirical)
1) formalize the learning process and input space
2) use the formalization in a Markov structure to empirically test the learner across a wide range of learning scenarios
The framework gives general data on the expected performance of acquisition algorithms. Can answer the question:
Given learner L, if the input space exhibits characteristics x, y and z, is feasible learning possible?
14
A case study:
The Structural Triggers Learner
(Fodor 1998)
15
Some background assumptions
No negative evidence -
The input sample or text is a randomly drawn collection of positive (grammatical) examples of sentences from L(Gtarg).
One hypothesis at a time -
The learner evaluates one grammar at a time. The current hypothesis, Gcurr, denotes the grammar being entertained by the learner at a particular point in time.
Successful acquisition -
The learner converges on the target grammar when Gcurr = Gtarg and Gcurr never subsequently changes.
16
The Parametric Principle (Fodor 1995, 1998; Sakas and Fodor, in press)
Set individual parameters. Do not evaluate whole grammars.
This halves the size of the grammar pool with each successful learning event.
e.g., when 5 of 30 parameters are set, only 3% of the grammar pool remains.
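As a quick check of the halving arithmetic (my expansion of the slide's 3% figure): with n binary parameters there are 2^n candidate grammars, so setting k of them with certainty leaves

  \[ \frac{2^{\,n-k}}{2^{\,n}} = 2^{-k}, \qquad \text{e.g.} \quad \frac{2^{25}}{2^{30}} = \frac{1}{32} \approx 3\%. \]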
17
Structural Triggers Learner (STL), Fodor (1995, 1998)
Problem: the Parametric Principle requires certainty. But how to know when a sentence may be parametrically ambiguous?
Solution: for the STL, a parameter value = structural trigger = "treelet".
e.g., the VO / OV parameter:
on: [VP V O] - V before O (e.g., English)
off: [VP O V] - O before V (e.g., German)
18
STL Algorithm
— Receive a sentence.
— Parse it with the current grammar Gcurr.
— Success: keep Gcurr.
— Failure: parse with Gcurr + all parametric treelets; adopt the treelets that contributed.
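A toy, runnable Python sketch of this loop (my names throughout; each input is pre-annotated here with its parametric signature, the set of treelets its parse requires, whereas the real STL computes that signature by parsing):

def stl_step(required_treelets, G_curr):
    # Success: the current grammar already licenses the sentence.
    if required_treelets <= G_curr:
        return G_curr
    # Failure: the parse with G_curr plus all parametric treelets
    # succeeds; adopt the treelets that contributed.
    return G_curr | required_treelets

# Toy run over three inputs and their (unambiguous) signatures.
inputs = [frozenset({"VO"}), frozenset({"VO", "+V2"}), frozenset({"SV"})]
G = frozenset()
for signature in inputs:
    G = stl_step(signature, G)
print(sorted(G))   # ['+V2', 'SV', 'VO']

The waiting-STL variant (next slide) would additionally discard any input whose signature is ambiguous, i.e., whose parse involves a choice point.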
19
So, the STL
— uses the parser to decode the parametric signatures of sentences;
— can detect parametric ambiguity (the waiting-STL variant doesn't learn from sentences that contain a choice point);
— and thus can abide by the Parametric Principle.
20
Computationally modeling STL performance:
A state space for the STL performing in a 3-parameter domain, with states 0, 1, 2, 3.
– the nodes represent the current number of parameters that have been set - not grammars
– arcs represent a possible change in the number of parameters that have been set
Here, each input may express 0, 1 or 2 new parameters.
21
Transition probabilities for the waiting-STL depend on:
Learner's state:
- the number of parameters that have been set (t)
Formalization of the input space:
- the number of relevant parameters (r)
- the expression rate (e)
- the ambiguity rate (a)
- the "effective" expression rate (e')
22
Transition probabilities for the waiting-STL

Let H(w | t, r, e) be the hypergeometric probability that exactly w of the e parameters expressed by an input are among the r - t parameters not yet set:

  \[ H(w \mid t, r, e) = \frac{\binom{r-t}{w}\binom{t}{e-w}}{\binom{r}{e}} \]

Then, with e' the probability that an expressed parameter is expressed unambiguously (e' = 1 - a),

  \[ P(w \mid t, r, e, e') =
     \begin{cases}
       H(w \mid t, r, e)\,(e')^{w}, & 1 \le w \le \min(e,\, r-t) \\
       1 - \sum_{i=1}^{\min(e,\,r-t)} H(i \mid t, r, e)\,(e')^{i}, & w = 0
     \end{cases} \]
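A minimal Python sketch of the resulting Markov computation, under the reconstruction above (function names are mine, and details of the paper's own implementation may differ):

from math import comb

def H(w, t, r, e):
    # Hypergeometric: probability that exactly w of the e parameters
    # expressed by an input are among the r - t parameters not yet set.
    if w < 0 or w > min(e, r - t):
        return 0.0
    return comb(r - t, w) * comb(t, e - w) / comb(r, e)

def P(w, t, r, e, ep):
    # Waiting-STL transition probability from state t to state t + w:
    # the input expresses w new parameters, each of which happens to be
    # expressed unambiguously (independent probability ep = e').
    if w == 0:
        return 1.0 - sum(H(i, t, r, e) * ep**i
                         for i in range(1, min(e, r - t) + 1))
    return H(w, t, r, e) * ep**w

def expected_inputs(r, e, ep):
    # Expected number of sentences consumed until all r parameters are
    # set, by backward recursion over the absorbing Markov chain.
    E = {r: 0.0}                      # absorbing state: converged
    for t in range(r - 1, -1, -1):
        stay = P(0, t, r, e, ep)
        rest = sum(P(w, t, r, e, ep) * E[t + w]
                   for w in range(1, min(e, r - t) + 1))
        E[t] = (1.0 + rest) / (1.0 - stay)
    return E[0]

print(round(expected_inputs(r=15, e=10, ep=0.2)))   # cf. 9,765,731 in the table below

Backward recursion works because the chain can only move forward (w >= 0), so each E[t] depends only on later states.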
23
Results after Markov Analysis
Expected number of sentences consumed: exponential in % ambiguity, seemingly linear in # of parameters.

  e   a' (%)        r=15          r=20          r=25          r=30
  1     20            62            90           119           150
        40            83           120           159           200
        60           124           180           238           300
        80           249           360           477           599
  5     20            15            22            29            36
        40            34            46            59            73
        60           144           176           210           245
        80         3,300         3,466         3,666         3,891
 10     20            14            18            23            28
        40           174           187           203           221
        60         9,560         9,621         9,727         9,878
        80     9,765,731     9,766,375     9,768,376     9,772,740
 15     20            28            32            37            41
        40         2,127         2,136         2,153         2,180
        60       931,323       931,352       931,479       931,822
        80          ...... over 30 billion ......
 20     20             -            87            91            95
        40             -        27,351        27,361        27,383
        60             -    90,949,470    90,949,504    90,949,728
        80             -      ...... over 95 trillion ......
24
[Figure: Striking effect of ambiguity (r fixed). x axis: proportion of parameters expressed ambiguously (0.2 to 0.8); y axis: # sentences consumed, logarithmic scale (1 to 10,000,000). 20 parameters to be set; 10 parameters expressed per input.]
25
[Figure: Subtle effect of ambiguity on efficiency wrt r. Four panels (20%, 40%, 60% and 80% ambiguity) on a linear scale; x axis: # of parameters in the domain (10 to 35); y axis: # sentences consumed; 10 parameters expressed per input. As ambiguity increases, the cost of the Parametric Principle skyrockets as the domain scales up (r increases): at r = 30 the cost is 28, 221, 9,878 and 9,772,740 sentences at the four ambiguity rates respectively.]
26
The effect of ambiguity (interacting with e and r)
How / where is the cost incurred?
By far the greatest amount of damage inflicted by ambiguity occurs at the very earliest stages of learning:
the wait for the first fully unambiguous trigger, plus a little wait for sentences that express the last few parameters unambiguously.
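A quick arithmetic check of the first of these costs (mine, under the model reconstructed on slide 22): at t = 0 every expressed parameter is new, so the expected wait for the first fully unambiguous trigger is

  \[ (1/e')^{\,e} = (1/0.2)^{10} = 5^{10} = 9{,}765{,}625, \]

which accounts for nearly all of the 9,765,731 sentences reported for e = 10, a' = 80%, r = 15.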
27
[Figure: The logarithm of the expected number of sentences consumed by the waiting-STL in each state after learning has started. e = 10, r = 30, and e' = 0.2 (a' = 0.8). x axis: state representing # of parameters set (10 to 29), moving closer and closer to convergence; y axis: # sentences consumed, logarithmic scale (1 to 10,000).]
28
STL — Bad News
Ambiguity is damaging even to a parametrically-principled learner.
Abiding by the Parametric Principle does not, in and of itself, guarantee a merely linear increase in the complexity of the learning task as the number of parameters increases.
29
STL — Good News Part 1
The learning task might be manageable if there are at least some sentences with low expression to get learning off the ground.
Fodor (1998); Sakas and Fodor (1998)
[Figure: # sentences consumed (0 to 2,500,000) vs. parameters expressed per sentence (0 to 20), with regions marked "Can learn" and "Can't learn".]
30
Add a distribution factor to the transition probabilities:

  \[ P'(w \mid t, r, e_{\max}, e') = \sum_{i=w}^{e_{\max}} D_I(i)\, P(w \mid t, r, i, e') \]

where D_I(i) = the probability that i parameters are expressed by a sentence, given distribution D on input text I.
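Continuing the sketch from slide 22 (again my names; I assume D_I is uniform over expression rates 0..emax, matching the next slide's setup):

def P_mixed(w, t, r, e_max, ep):
    # P'(w | t, r, e_max, e'): average the fixed-e transition
    # probabilities over a uniform distribution of expression rates.
    return sum(P(w, t, r, i, ep) for i in range(e_max + 1)) / (e_max + 1)

def expected_inputs_mixed(r, e_max, ep):
    # Same backward recursion as before, with the mixed probabilities.
    E = {r: 0.0}
    for t in range(r - 1, -1, -1):
        stay = P_mixed(0, t, r, e_max, ep)
        rest = sum(P_mixed(w, t, r, e_max, ep) * E[t + w]
                   for w in range(1, min(e_max, r - t) + 1))
        E[t] = (1.0 + rest) / (1.0 - stay)
    return E[0]

print(round(expected_inputs_mixed(r=20, e_max=10, ep=0.2)))   # cf. 430 in the table below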
31
Average number of inputs consumed by the waiting-STL. Expression rate is not fixed per sentence: e varies uniformly from 0 to emax.

 emax  a' (%)   r=15    r=20    r=25    r=30
   1     20      124     180     238     300
         40      166     240     318     399
         60      249     360     477     599
         80      498     720     954   1,198
   5     20       28      40      53      67
         40       46      65      86     107
         60       89     124     161     199
         80      235     324     417     511
  10     20       17      24      32      40
         40       40      55      70      86
         60      102     137     173     209
         80      323     430     538     648
  15     20       15      21      27      33
         40       46      62      77      93
         60      134     176     219     262
         80      447     586     726     868
  20     20        -      20      26      32
         40        -      74      91     109
         60        -     223     275     327
         80        -     755     931   1,108

Still exponential in % ambiguity, but manageable.
For comparison (a' = 80%, r = 20): e varying from 0 to 10 requires 430 sentences, where e fixed at 5 requires 3,466.
32
[Figure: Effect of ambiguity is still exponential, but not as bad as for fixed e. r = 20; e is uniformly distributed from 0 to 10. x axis: proportion of parameters expressed ambiguously (0.2 to 0.8); y axis: avg # sentences consumed, logarithmic scale (1 to 1,000).]
33
Effect of high ambiguity rates
Varying rate of expression, uniformly distributed (emax = 20); a larger domain than in the previous tables. Still exponential in a, but manageable.

  a' (%)           r=30          r=40          r=50          r=60
  90              2,767         3,656         4,549         5,446
  99.9           33,157        43,825        54,532        65,269
  99.99         337,348       445,886       554,832       664,064
  99.999      3,379,281     4,466,537     5,557,877     6,652,071
  99.9999    33,798,616    44,673,047    55,588,325    66,532,147
  99.99999  337,991,967   446,738,153   555,892,814   665,332,907
34
STL — Good News Part 2
With a uniformly distributed expression rate, the cost of the Parametric Principle is linear (in r) and doesn't skyrocket.
[Figure: # sentences (0 to 700) vs. # parameters (10 to 35) on a linear scale, one curve each for 20%, 40%, 60% and 80% ambiguity.]
35
In summary:
With a uniformly distributed expression rate, the number of sentences required by the STL falls in a manageable range (though still exponential in % ambiguity)
The number of sentences increases only linearly as the number of parameters increases (i.e., even as the number of grammars increases exponentially).
36
Conjecture (roughly in the spirit of Schaffer, 1994):
Algorithms may be extremely efficient in specific domains (a 'sweet spot') but not in others.
Recommendation:
We have to know the specific facts about the distribution of ambiguity in natural language.
37
Research agenda:
Three-fold approach to building a cognitive computational model of human language acquisition:
1) formulate a framework to determine what distributions of ambiguity make for feasible learning
2) conduct a psycholinguistic study to determine if the facts of human (child-directed) language are in line with the conducive distributions
3) conduct a computer simulation to check for performance nuances and potential obstacles (e.g., local maxima based on defaults, or Subset Principle violations)