1
Modeling the Effect of Cross-Language Ambiguity on Human Syntax Acquisition
Fourth Conference on Computational Natural Language Learning
13 Sept 2000 Lisbon, Portugal
William Gregory Sakas
2
Why computationally model human language acquisition?
Pinker, 1979:
"...it may be necessary to find out how language learning could work in order for the developmental data to tell us how it does work." [emphasis mine]
3
Primary point of this talk:
Not enough to build a series of computer simulations of a cognitive model of human language acquisition and claim that it mirrors the process by which a child acquires language.
The (perhaps obvious) fact is that learners are acutely sensitive to cross-language ambiguity.
Whether or not a learning model is ultimately successful as a cognitive model is an empirical issue; depends on the ‘fit’ of the simulations with the facts about the distribution of ambiguity in human languages.
4
What’s coming:
1) Background on some linguistic theories of acquisition
2) A case study analysis of one parameter-setting model: the Structural Triggers Learner (STL), chosen for three reasons:
i) the algorithm takes to heart current generative linguistic theory
ii) it is not dependent on a particular grammar formalism
iii) the mathematics of the Markov analysis is straightforward.
3) Conjectures and a proposed research agenda
5
Learnability - Under what conditions is learning possible?
Feasibility - Is acquisition possible within a reasonable amount of time and/or with a reasonable amount of work?
6
Principles and Parameters framework
All languages share universal principles (UG), e.g., all languages have subjects of some sort.
Languages differ wrt the settings of a finite number of parameters, e.g., overt subjects are optional in a sentence (yes/no).
Null Subject Parameter (optional overt subjects): on (e.g., Spanish - yes), off (e.g., English - no).
7
A three-parameter domain (Gibson and Wexler, 1994)
'Sentences' are strings of the symbols: S, V, O1, O2, AUX, ADV
SV / VS - subject precedes verb / verb precedes subject
+V2 / -V2 - verb or aux must be in the second position in the sentence
VO / OV - verb precedes object / object precedes verb
Mari will feed the bird → S AUX V O
8
Two example languages (finite, degree-0)

SV OV +V2 (German-like):
S V, S V O, O V S, S V O2 O1, O1 V S O2, O2 V S O1,
S AUX V, S AUX O V, O AUX S V, S AUX O2 O1 V, O1 AUX S O2 V, O2 AUX S O1 V,
ADV V S, ADV V S O, ADV V S O2 O1, ADV AUX S V, ADV AUX S O V, ADV AUX S O2 O1 V

SV VO -V2 (English-like):
S V, S V O, S V O1 O2, S AUX V, S AUX V O, S AUX V O1 O2,
ADV S V, ADV S V O, ADV S V O1 O2, ADV S AUX V, ADV S AUX V O, ADV S AUX V O1 O2
9
Surprisingly, G&W's simple 3-parameter domain presents nontrivial obstacles to several types of learning strategies, but the space is ultimately learnable (G&W 1994; Berwick & Niyogi 1996; Frank and Kapur 1996; Turkel 1996; Bertolo, in press).
Big question:
How will the learning process scale up in terms of feasibility as the number of parameters increases?
Two problems for most acquisition strategies:
1) Ambiguity
2) Size of the domain
10
Ambiguity across two example languages (finite, degree-0)

SV OV +V2 (German-like):
S V, S V O, O V S, S V O2 O1, O1 V S O2, O2 V S O1,
S AUX V, S AUX O V, O AUX S V, S AUX O2 O1 V, O1 AUX S O2 V, O2 AUX S O1 V,
ADV V S, ADV V S O, ADV V S O2 O1, ADV AUX S V, ADV AUX S O V, ADV AUX S O2 O1 V

SV VO -V2 (English-like):
S V, S V O, S V O1 O2, S AUX V, S AUX V O, S AUX V O1 O2,
ADV S V, ADV S V O, ADV S V O1 O2, ADV S AUX V, ADV S AUX V O, ADV S AUX V O1 O2

(The slide highlights a few ambiguous strings, e.g., S V and S AUX V, which occur in both languages.)
11
# Parameters = 30
# Grammars = 2^30 = 1,073,741,824
And the search space is exponential.
Ambiguity robs the learner of certainty that a parameter value is correct for the target language.
Search heuristics need to be employed.
12
So, how to answer questions of feasibility as the number of grammars (exponentially) scales up?
Creating an input space for a linguistically plausible (large) domain is not practical for simulations.
Answer: introduce some formal notions in order to abstract away from the specific linguistic content of the input data.
13
A hybrid approach (formal/empirical)
1) formalize the learning process and input space
2) use the formalization in a Markov structure to empirically test the learner across a wide range of learning scenarios
The framework gives general data on the expected performance of acquisition algorithms. Can answer the question:
Given learner L, if the input space exhibits characteristics x, y and z, is feasible learning possible?
14
A case study:
The Structural Triggers Learner
(Fodor 1998)
15
Some background assumptions
No negative evidence -
The input sample or text is a randomly drawn collection of positive (grammatical) examples of sentences from L(Gtarg).
One hypothesis at a time -
The learner evaluates one grammar at a time. The current hypothesis, Gcurr, denotes the grammar being entertained by the learner at a particular point in time.
Successful acquisition -
The learner converges on the target grammar when Gcurr = Gtarg and Gcurr never subsequently changes.
16
The Parametric Principle (Fodor 1995, 1998; Sakas and Fodor, in press)
Set individual parameters. Do not evaluate whole grammars.
This halves the size of the grammar pool with each successful learning event.
e.g., when 5 of 30 parameters are set, only 3% of the grammar pool remains.
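As a quick check of the halving arithmetic (my expansion of the slide's 3% figure): with n binary parameters there are 2^n candidate grammars, so setting k of them with certainty leaves

  \[ \frac{2^{\,n-k}}{2^{\,n}} = 2^{-k}, \qquad \text{e.g.} \quad \frac{2^{25}}{2^{30}} = \frac{1}{32} \approx 3\%. \]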
17
Structural Triggers Learner (STL), Fodor (1995, 1998)
Problem: the Parametric Principle requires certainty. But how to know when a sentence may be parametrically ambiguous?
Solution: for the STL, a parameter value = structural trigger = "treelet".
e.g., the VO / OV parameter:
on: [VP V O] - V before O (e.g., English)
off: [VP O V] - O before V (e.g., German)
18
STL Algorithm
— Receive a sentence.
— Parse it with the current grammar Gcurr.
— Success: keep Gcurr.
— Failure: parse with Gcurr + all parametric treelets; adopt the treelets that contributed.
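A toy, runnable Python sketch of this loop (my names throughout; each input is pre-annotated here with its parametric signature, the set of treelets its parse requires, whereas the real STL computes that signature by parsing):

def stl_step(required_treelets, G_curr):
    # Success: the current grammar already licenses the sentence.
    if required_treelets <= G_curr:
        return G_curr
    # Failure: the parse with G_curr plus all parametric treelets
    # succeeds; adopt the treelets that contributed.
    return G_curr | required_treelets

# Toy run over three inputs and their (unambiguous) signatures.
inputs = [frozenset({"VO"}), frozenset({"VO", "+V2"}), frozenset({"SV"})]
G = frozenset()
for signature in inputs:
    G = stl_step(signature, G)
print(sorted(G))   # ['+V2', 'SV', 'VO']

The waiting-STL variant (next slide) would additionally discard any input whose signature is ambiguous, i.e., whose parse involves a choice point.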
19
So, the STL
— uses the parser to decode the parametric signatures of sentences;
— can detect parametric ambiguity (the waiting-STL variant doesn't learn from sentences that contain a choice point);
— and thus can abide by the Parametric Principle.
20
Computationally modeling STL performance:
A state space for the STL performing in a 3-parameter domain, with states 0, 1, 2, 3.
– the nodes represent the current number of parameters that have been set - not grammars
– arcs represent a possible change in the number of parameters that have been set
Here, each input may express 0, 1 or 2 new parameters.
21
Transition probabilities for the waiting-STL depend on:
Learner's state:
- the number of parameters that have been set (t)
Formalization of the input space:
- the number of relevant parameters (r)
- the expression rate (e)
- the ambiguity rate (a)
- the "effective" expression rate (e')
22
Transition probabilities for the waiting-STL

Let H(w | t, r, e) be the hypergeometric probability that exactly w of the e parameters expressed by an input are among the r - t parameters not yet set:

  \[ H(w \mid t, r, e) = \frac{\binom{r-t}{w}\binom{t}{e-w}}{\binom{r}{e}} \]

Then, with e' the probability that an expressed parameter is expressed unambiguously (e' = 1 - a),

  \[ P(w \mid t, r, e, e') =
     \begin{cases}
       H(w \mid t, r, e)\,(e')^{w}, & 1 \le w \le \min(e,\, r-t) \\
       1 - \sum_{i=1}^{\min(e,\,r-t)} H(i \mid t, r, e)\,(e')^{i}, & w = 0
     \end{cases} \]
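A minimal Python sketch of the resulting Markov computation, under the reconstruction above (function names are mine, and details of the paper's own implementation may differ):

from math import comb

def H(w, t, r, e):
    # Hypergeometric: probability that exactly w of the e parameters
    # expressed by an input are among the r - t parameters not yet set.
    if w < 0 or w > min(e, r - t):
        return 0.0
    return comb(r - t, w) * comb(t, e - w) / comb(r, e)

def P(w, t, r, e, ep):
    # Waiting-STL transition probability from state t to state t + w:
    # the input expresses w new parameters, each of which happens to be
    # expressed unambiguously (independent probability ep = e').
    if w == 0:
        return 1.0 - sum(H(i, t, r, e) * ep**i
                         for i in range(1, min(e, r - t) + 1))
    return H(w, t, r, e) * ep**w

def expected_inputs(r, e, ep):
    # Expected number of sentences consumed until all r parameters are
    # set, by backward recursion over the absorbing Markov chain.
    E = {r: 0.0}                      # absorbing state: converged
    for t in range(r - 1, -1, -1):
        stay = P(0, t, r, e, ep)
        rest = sum(P(w, t, r, e, ep) * E[t + w]
                   for w in range(1, min(e, r - t) + 1))
        E[t] = (1.0 + rest) / (1.0 - stay)
    return E[0]

print(round(expected_inputs(r=15, e=10, ep=0.2)))   # cf. 9,765,731 in the table below

Backward recursion works because the chain can only move forward (w >= 0), so each E[t] depends only on later states.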
23
Results after Markov Analysis
Expected number of sentences consumed: exponential in % ambiguity, seemingly linear in # of parameters.

  e   a' (%)        r=15          r=20          r=25          r=30
  1     20            62            90           119           150
        40            83           120           159           200
        60           124           180           238           300
        80           249           360           477           599
  5     20            15            22            29            36
        40            34            46            59            73
        60           144           176           210           245
        80         3,300         3,466         3,666         3,891
 10     20            14            18            23            28
        40           174           187           203           221
        60         9,560         9,621         9,727         9,878
        80     9,765,731     9,766,375     9,768,376     9,772,740
 15     20            28            32            37            41
        40         2,127         2,136         2,153         2,180
        60       931,323       931,352       931,479       931,822
        80          ...... over 30 billion ......
 20     20             -            87            91            95
        40             -        27,351        27,361        27,383
        60             -    90,949,470    90,949,504    90,949,728
        80             -      ...... over 95 trillion ......
24
[Figure: Striking effect of ambiguity (r fixed). x axis: proportion of parameters expressed ambiguously (0.2 to 0.8); y axis: # sentences consumed, logarithmic scale (1 to 10,000,000). 20 parameters to be set; 10 parameters expressed per input.]
25
[Figure: Subtle effect of ambiguity on efficiency wrt r. Four panels (20%, 40%, 60% and 80% ambiguity) on a linear scale; x axis: # of parameters in the domain (10 to 35); y axis: # sentences consumed; 10 parameters expressed per input. As ambiguity increases, the cost of the Parametric Principle skyrockets as the domain scales up (r increases): at r = 30 the cost is 28, 221, 9,878 and 9,772,740 sentences at the four ambiguity rates respectively.]
26
The effect of ambiguity (interacting with e and r)
How / where is the cost incurred?
By far the greatest amount of damage inflicted by ambiguity occurs at the very earliest stages of learning:
the wait for the first fully unambiguous trigger, plus a little wait for sentences that express the last few parameters unambiguously.
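A quick arithmetic check of the first of these costs (mine, under the model reconstructed on slide 22): at t = 0 every expressed parameter is new, so the expected wait for the first fully unambiguous trigger is

  \[ (1/e')^{\,e} = (1/0.2)^{10} = 5^{10} = 9{,}765{,}625, \]

which accounts for nearly all of the 9,765,731 sentences reported for e = 10, a' = 80%, r = 15.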
27
[Figure: The logarithm of the expected number of sentences consumed by the waiting-STL in each state after learning has started. e = 10, r = 30, and e' = 0.2 (a' = 0.8). x axis: state representing # of parameters set (10 to 29), moving closer and closer to convergence; y axis: # sentences consumed, logarithmic scale (1 to 10,000).]
28
STL — Bad News
Ambiguity is damaging even to a parametrically-principled learner.
Abiding by the Parametric Principle does not, in and of itself, guarantee a merely linear increase in the complexity of the learning task as the number of parameters increases.
29
STL — Good News Part 1
The learning task might be manageable if there are at least some sentences with low expression to get learning off the ground.
Fodor (1998); Sakas and Fodor (1998)
[Figure: # sentences consumed (0 to 2,500,000) vs. parameters expressed per sentence (0 to 20), with regions marked "Can learn" and "Can't learn".]
30
Add a distribution factor to the transition probabilities:

  \[ P'(w \mid t, r, e_{\max}, e') = \sum_{i=w}^{e_{\max}} D_I(i)\, P(w \mid t, r, i, e') \]

where D_I(i) = the probability that i parameters are expressed by a sentence, given distribution D on input text I.
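Continuing the sketch from slide 22 (again my names; I assume D_I is uniform over expression rates 0..emax, matching the next slide's setup):

def P_mixed(w, t, r, e_max, ep):
    # P'(w | t, r, e_max, e'): average the fixed-e transition
    # probabilities over a uniform distribution of expression rates.
    return sum(P(w, t, r, i, ep) for i in range(e_max + 1)) / (e_max + 1)

def expected_inputs_mixed(r, e_max, ep):
    # Same backward recursion as before, with the mixed probabilities.
    E = {r: 0.0}
    for t in range(r - 1, -1, -1):
        stay = P_mixed(0, t, r, e_max, ep)
        rest = sum(P_mixed(w, t, r, e_max, ep) * E[t + w]
                   for w in range(1, min(e_max, r - t) + 1))
        E[t] = (1.0 + rest) / (1.0 - stay)
    return E[0]

print(round(expected_inputs_mixed(r=20, e_max=10, ep=0.2)))   # cf. 430 in the table below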
31
Average number of inputs consumed by the waiting-STL. Expression rate is not fixed per sentence: e varies uniformly from 0 to emax.

 emax  a' (%)   r=15    r=20    r=25    r=30
   1     20      124     180     238     300
         40      166     240     318     399
         60      249     360     477     599
         80      498     720     954   1,198
   5     20       28      40      53      67
         40       46      65      86     107
         60       89     124     161     199
         80      235     324     417     511
  10     20       17      24      32      40
         40       40      55      70      86
         60      102     137     173     209
         80      323     430     538     648
  15     20       15      21      27      33
         40       46      62      77      93
         60      134     176     219     262
         80      447     586     726     868
  20     20        -      20      26      32
         40        -      74      91     109
         60        -     223     275     327
         80        -     755     931   1,108

Still exponential in % ambiguity, but manageable.
For comparison (a' = 80%, r = 20): e varying from 0 to 10 requires 430 sentences, where e fixed at 5 requires 3,466.
32
[Figure: Effect of ambiguity is still exponential, but not as bad as for fixed e. r = 20; e is uniformly distributed from 0 to 10. x axis: proportion of parameters expressed ambiguously (0.2 to 0.8); y axis: avg # sentences consumed, logarithmic scale (1 to 1,000).]
33
Effect of high ambiguity rates
Varying rate of expression, uniformly distributed (emax = 20); a larger domain than in the previous tables. Still exponential in a, but manageable.

  a' (%)           r=30          r=40          r=50          r=60
  90              2,767         3,656         4,549         5,446
  99.9           33,157        43,825        54,532        65,269
  99.99         337,348       445,886       554,832       664,064
  99.999      3,379,281     4,466,537     5,557,877     6,652,071
  99.9999    33,798,616    44,673,047    55,588,325    66,532,147
  99.99999  337,991,967   446,738,153   555,892,814   665,332,907
34
STL — Good News Part 2
With a uniformly distributed expression rate, the cost of the Parametric Principle is linear (in r) and doesn't skyrocket.
[Figure: # sentences (0 to 700) vs. # parameters (10 to 35) on a linear scale, one curve each for 20%, 40%, 60% and 80% ambiguity.]
35
In summary:
With a uniformly distributed expression rate, the number of sentences required by the STL falls in a manageable range (though still exponential in % ambiguity)
The number of sentences increases only linearly as the number of parameters increases (i.e., even as the number of grammars increases exponentially).
36
Conjecture (roughly in the spirit of Schaffer, 1994):
Algorithms may be extremely efficient in specific domains (a 'sweet spot') but not in others.
Recommendation:
We have to know the specific facts about the distribution of ambiguity in natural language.
37
Research agenda:
Three-fold approach to building a cognitive computational model of human language acquisition:
1) formulate a framework to determine what distributions of ambiguity make for feasible learning
2) conduct a psycholinguistic study to determine if the facts of human (child-directed) language are in line with the conducive distributions
3) conduct a computer simulation to check for performance nuances and potential obstacles (e.g., local maxima based on defaults, or Subset Principle violations)