
Page 1:

Modeling Parameter Setting Performance in Domains with a Large Number of Parameters: A Hybrid Approach

CUNY / SUNY / NYU Linguistics Mini-conference

March 10, 2001

William Gregory Sakas & Dina Demner-Fushman

Page 2:

Primary point (to make [quickly, painlessly] on a Saturday just before lunch):

It is not enough to build a series of computer simulations of a cognitive model of human language acquisition and then claim that it mirrors the process by which a child acquires language.

The (perhaps obvious) fact is that learners are acutely sensitive to cross-language ambiguity.

Whether or not a learning model is ultimately successful as a cognitive model is an empirical issue; it depends on the ‘fit’ of the simulations with the facts about the distribution of ambiguity in human languages.

Page 3:

What’s coming:

1) Some early learnability results

2) A feasibility case study of one parameter-setting model: the Triggering Learning Algorithm (TLA), Gibson and Wexler (1994)

3) Conjectures and a proposed research agenda

Page 4:

Why computationally model natural language acquisition?

Pinker (1979):

"...it may be necessary to find out how language learning could work in order for the developmental data to tell us how it does work." [emphasis mine]

Page 5:

Learnability: Is the learner guaranteed to converge on the target grammar for every language in a given domain?

Gold (1967), Wexler and Culicover (1980), Gibson & Wexler (1994), Kanazawa (1994)

An early learnability result (Gold, 1967)

Exposed to input strings of an arbitrary target language generated by grammar Gtarg, it is impossible to guarantee that any learner can converge on Gtarg if Gtarg is drawn from any class in the Chomsky hierarchy (e.g. context-free grammars).

Page 6:

Gold’s result is sometimes taken to be strong evidence for a nativist Universal Grammar.

1) Psycholinguistic research indicates that children learn grammar based on positive exemplar sentences.

2) Gold proves that Greg ⊂ Gcfg ⊂ Gcs ⊂ Gre can’t be learned this way.

Conclude: some grammatical competence must be in place before learning commences.

Gold’s result is often misapplied, but it has generated much discussion.

Page 7:

Another Learnability result:

All classes of grammars possible within the principles and parameters framework are learnable because they are finite.

In fact a simple Blind Guess Learner is guaranteed to succeed in the long run for any finite class of grammars.

Blind Guess Learner:

1. randomly hypothesize a current grammar

2. consume and attempt to parse a sentence from the linguistic environment

3. If the sentence is parsable by the current grammar, go to 2. Otherwise go to 1.
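For concreteness, here is a minimal Python sketch of the Blind Guess Learner over a finite class of grammars. The representation (grammar labels, languages as explicit sets of strings) and the function name are illustrative assumptions; the loop is stopped from the outside once the target happens to be hypothesized, and it presupposes that no non-target language contains every sentence of the target.

import random

def blind_guess_learner(grammars, languages, target, seed=0):
    """Count the input sentences consumed before the target grammar is hypothesized.

    grammars  : list of grammar labels
    languages : dict mapping each grammar label to the set of sentences it parses
    target    : label of the target grammar; inputs are drawn from languages[target]
    """
    rng = random.Random(seed)
    target_sents = sorted(languages[target])
    current = rng.choice(grammars)        # 1. randomly hypothesize a current grammar
    consumed = 0
    while current != target:              # external stopping check, for simulation only
        s = rng.choice(target_sents)      # 2. consume a sentence from the environment
        consumed += 1
        if s not in languages[current]:   # 3. parsable: keep the grammar; otherwise re-guess blindly
            current = rng.choice(grammars)
    return consumed

# Tiny demo with three toy grammars; sentences are just strings.
langs = {"G0": {"a", "a b"}, "G1": {"a", "b a"}, "G2": {"a", "a b", "a b c"}}
print(blind_guess_learner(list(langs), langs, target="G1"))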

Page 8:

Feasibility: Is acquisition possible within a reasonable amount of time and/or with a reasonable amount of work?

Clark (1994, in press), Niyogi and Berwick (1996), Lightfoot (1989) (degree-0), Sakas (2000), Tesar and Smolensky (1996), and many PAC results concerning induction of FSAs

Feasibility measure (Sakas and Fodor, in press)

Near linear increase of the expected number of sentences consumed before a learner converges on the target grammar.

Page 9:

Feasibility result: The Blind Guess Learner succeeds only after consuming a number of sentences exponentially correlated with the number of parameters.

If # parameters = 30, then # grammars = 2^30 = 1,073,741,824.

The search space is huge!

Page 10:

A Feasibility Case Study:

A three-parameter domain (Gibson and Wexler, 1994)

Sentences are strings of the symbols: S, V, O1, O2, aux, adv

SV / VS - subject precedes verb / verb precedes subject

+V2 / -V2 - verb or aux must / need not be in the second position in the sentence

VO / OV - verb precedes object / object precedes verb

Allie will eat the birds → S aux V O

Page 11:

SV OV +V2 (German-like):

S V, S V O, O V S, S V O2 O1, O1 V S O2, O2 V S O1, S AUX V, S AUX O V, O AUX S V, S AUX O2 O1 V, O1 AUX S O2 V, O2 AUX S O1 V, ADV V S, ADV V S O, ADV V S O2 O1, ADV AUX S V, ADV AUX S O V, ADV AUX S O2 O1 V

SV VO -V2 (English-like):

S V, S V O, S V O1 O2, S AUX V, S AUX V O, S AUX V O1 O2, ADV S V, ADV S V O, ADV S V O1 O2, ADV S AUX V, ADV S AUX V O, ADV S AUX V O1 O2

Two example languages (finite, degree-0)

Page 12:

Surprisingly, G&W’s simple 3-parameter domain presents nontrivial obstacles to several types of learning strategies, but the space is ultimately learnable.

Big question:

How will the learning process scale up in terms of feasibility as the number of parameters increases?

Two problems for most acquisition strategies:

1) Ambiguity

2) Size of the domain

Page 13:

SV OV +V2 (German-like):

S V, S V O, O V S, S V O2 O1, O1 V S O2, O2 V S O1, S AUX V, S AUX O V, O AUX S V, S AUX O2 O1 V, O1 AUX S O2 V, O2 AUX S O1 V, ADV V S, ADV V S O, ADV V S O2 O1, ADV AUX S V, ADV AUX S O V, ADV AUX S O2 O1 V

SV VO -V2 (English-like):

S V, S V O, S V O1 O2, S AUX V, S AUX V O, S AUX V O1 O2, ADV S V, ADV S V O, ADV S V O1 O2, ADV S AUX V, ADV S AUX V O, ADV S AUX V O1 O2

Cross-language ambiguity: a few strings (highlighted on the slide) are ambiguous between the two languages.
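To make the cross-language ambiguity concrete, here is a small Python sketch (the sentence sets are copied from the lists above) that computes which strings belong to both example languages:

german_like = {  # SV OV +V2
    "S V", "S V O", "O V S", "S V O2 O1", "O1 V S O2", "O2 V S O1",
    "S AUX V", "S AUX O V", "O AUX S V", "S AUX O2 O1 V",
    "O1 AUX S O2 V", "O2 AUX S O1 V",
    "ADV V S", "ADV V S O", "ADV V S O2 O1",
    "ADV AUX S V", "ADV AUX S O V", "ADV AUX S O2 O1 V",
}
english_like = {  # SV VO -V2
    "S V", "S V O", "S V O1 O2", "S AUX V", "S AUX V O", "S AUX V O1 O2",
    "ADV S V", "ADV S V O", "ADV S V O1 O2",
    "ADV S AUX V", "ADV S AUX V O", "ADV S AUX V O1 O2",
}
print(sorted(german_like & english_like))
# ['S AUX V', 'S V', 'S V O'] -- the strings ambiguous between the two grammars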

Page 14:

P&P acquisition:

How to obtain informative feasibility results studying linguistically interesting domains with cognitively plausible learning algorithms?

Page 15:

So, how to answer questions of feasibility as the number of grammars (exponentially) scales up?

-- simulations? (Briscoe (2000), Clark (1992), Elman (1990, 1991, 1996), Yang (2000)) But simulations won't work for large domains.

Answer: introduce some formal notions in order to abstract away from the specific linguistic content of the input data. Create an input space for a linguistically plausible domain.

Page 16:

A hybrid approach (formal/empirical)

1) formalize the learning process and input space

2) use the formalization in a Markov structure to empirically test the learner across a wide range of learning scenarios

The framework gives general data on the expected performance of acquisition algorithms. It can answer the question:

Given learner L, if the input space exhibits characteristics x, y and z, is feasible learning possible?

Page 17:

Syntax acquisition can be viewed as a state space search

— nodes represent grammars including a start state and a target state.

— arcs represent a possible change from one hypothesized grammar to another.

[Figure: a possible state space for a parameter space with 3 parameters. Nodes are the eight 3-bit grammars 000, 001, 010, 011, 100, 101, 110 and 111, one of which is marked as the target grammar Gtarg.]

Page 18:

The Triggering Learning Algorithm -TLA (Gibson and Wexler, 1994)

Searches the (huge) grammar space using local heuristics

repeat until convergence:

receive a string s from L(Gtarg)

if it can be parsed by Gcurr, do nothing

otherwise, pick a grammar that differs by one parameter value from the current grammar

if this grammar parses the sentence, make it the current grammar; otherwise do nothing

(These steps embody the TLA's three constraints: Error-driven learning, the Single Value Constraint (SVC), and Greediness.)
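A minimal Python sketch of the TLA loop under a toy representation (grammars as tuples of parameter bits, languages as explicit sets of strings; these names and the max_inputs guard against local maxima are illustrative assumptions, not part of Gibson and Wexler's formulation):

import random

def tla_step(current, sentence, languages, rng):
    """One TLA update; languages maps every grammar (bit tuple) to the set of sentences it parses."""
    if sentence in languages[current]:                        # Error-driven: no failure, no change
        return current
    i = rng.randrange(len(current))                           # SVC: flip exactly one parameter
    attempt = current[:i] + (1 - current[i],) + current[i + 1:]
    if sentence in languages[attempt]:                        # Greediness: adopt only if it parses s
        return attempt
    return current

def run_tla(languages, target, start, seed=0, max_inputs=100_000):
    """Feed sentences from L(Gtarg) until the target is hypothesized or the input budget runs out."""
    rng = random.Random(seed)
    target_sents = sorted(languages[target])
    current, consumed = start, 0
    while current != target and consumed < max_inputs:
        consumed += 1
        current = tla_step(current, rng.choice(target_sents), languages, rng)
    return current, consumed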

Page 19:

Local state space for the TLA

If Gcurr = 010, then Gattempt = a random G ∈ {000, 110, 011} (SVC).

[Figure: node 010 with arcs to its one-parameter neighbours 000, 110 and 011. Each arc is labelled with the conditions s ∉ L(Gcurr) (Error-driven) and s ∈ L(Gattempt) (Greediness), e.g. s ∉ L(Gcurr), Gattempt = 110, s ∈ L(G110).]

Page 20:

Probabilistic Formulation of TLA performance

a_i denotes the ambiguity factor: a_i = Pr(s ∈ L(Gtarg) ∩ L(Gi))

o_i,j denotes the overlap factor: o_i,j = Pr(s ∈ L(Gi) ∩ L(Gj))

w_i denotes the probability of picking, or "looking ahead" at, a new hypothesis grammar Gi

Page 21:

The probability that the learner moves from state Gcurr to state Gnew

= (1 - a_curr) · (w_new) · Pr(Gnew can parse s | Gcurr can't parse s)

(The factor (1 - a_curr) reflects Error-driven learning, w_new the SVC look-ahead choice, and the conditional probability Greediness.)

After some algebra:

Pr(Gcurr → Gnew) = (w_new) · (a_new - o_curr,new)
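A small numerical check of that algebra, using a (ambiguity), o (overlap) and w (look-ahead probability) as above, with a made-up three-sentence input distribution:

def tla_transition_prob(a_new, o_curr_new, w_new):
    """Pr(Gcurr -> Gnew) on one input sentence: w_new * (a_new - o_curr_new)."""
    return w_new * (a_new - o_curr_new)

# Toy distribution over sentences of L(Gtarg): probability, in L(Gcurr)?, in L(Gnew)?
sents = {"s1": (0.5, True, True), "s2": (0.3, False, True), "s3": (0.2, False, False)}
w_new = 1 / 3                                                  # one of three SVC neighbours
a_curr = sum(p for p, c, n in sents.values() if c)             # 0.5
a_new = sum(p for p, c, n in sents.values() if n)              # 0.8
o = sum(p for p, c, n in sents.values() if c and n)            # 0.5
lhs = (1 - a_curr) * w_new * ((a_new - o) / (1 - a_curr))      # the pre-algebra form
print(lhs, tla_transition_prob(a_new, o, w_new))               # both 0.1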

Page 22:

[Figure: the sixteen 4-bit grammars 0000 through 1111 arranged in concentric rings around the target grammar Gtarg = 1111, with G-Ring G2 and G-Ring G4 labelled.]

Parameter space H4 with Gtarg = 1111. Each ring or G-Ring contains exactly those grammars a certain Hamming distance from the target. For example, ring G2 contains 0011, 0101, 1100, 1010, 1001 and 0110, all of which differ from the target grammar 1111 by 2 bits.
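A quick sketch (illustrative helper names, grammars as bit strings) that reproduces the ring membership described in the caption:

from itertools import product

def g_rings(n_params, target):
    """Group all n-bit grammars by Hamming distance from the target grammar."""
    rings = {}
    for bits in product("01", repeat=n_params):
        g = "".join(bits)
        d = sum(a != b for a, b in zip(g, target))
        rings.setdefault(d, []).append(g)
    return rings

print(sorted(g_rings(4, "1111")[2]))
# ['0011', '0101', '0110', '1001', '1010', '1100'] -- the six grammars of G-Ring G2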

Page 23:

Smoothness - there exists a correlation between the similarity of grammars and the similarity of the languages that they generate.

Weak Smoothness Requirement - all the members of a G-Ring can parse s with equal probability.

Strong Smoothness Requirement - the parameter space is weakly smooth and the probability that s can be parsed by a member of a G-Ring increases monotonically as distance from the target decreases.

Page 24:

Experimental Setup

1) Adapt the formulas for the transition probabilities to work with G-rings

2) Build a generic transition matrix into which varying values of the ambiguity and overlap factors can be plugged

3) Use standard Markov techniques to calculate the expected number of inputs consumed by the system (construct the fundamental matrix; see the sketch below)

Goal - find the ‘sweet spot’ for TLA performance
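Step 3 is standard absorbing-Markov-chain machinery. A minimal NumPy sketch, with a made-up 3-state transition matrix standing in for the real G-Ring matrix:

import numpy as np

def expected_inputs(P, start, target):
    """Expected number of inputs consumed before absorption in the target state.

    P      : (k x k) one-step transition matrix over grammar states (rows sum to 1),
             with the target state absorbing (P[target, target] == 1)
    start  : index of the learner's initial state
    target : index of the absorbing target state
    Uses the fundamental matrix N = (I - Q)^-1 of the absorbing Markov chain.
    """
    transient = [i for i in range(P.shape[0]) if i != target]
    Q = P[np.ix_(transient, transient)]              # transitions among transient states
    N = np.linalg.inv(np.eye(len(transient)) - Q)    # fundamental matrix
    t = N.sum(axis=1)                                # expected steps to absorption from each state
    return t[transient.index(start)]

# Toy example: states 0 and 1 are transient, state 2 is the target.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
print(expected_inputs(P, start=0, target=2))         # about 5.71 inputs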

Page 25:

Three experiments

1) G-Rings equally likely to parse an input sentence (uniform domain)

2) G-Rings are strongly smooth (smooth domain)

3) Anything goes domain

Problem:

How to find combinations of the ambiguity and overlap factors that are optimal?

Solution:

Use an optimization algorithm: GRG2 (Lasdon and Waren, 1978).

Page 26:

Result 1: The TLA performs worse than blind guessing in a uniform domain - exponential increase in # sentences

[Figure: logarithmic-scale plot of the number of sentences consumed (1 to 10,000,000,000) against the number of parameters (0 to 35), with one curve for the Blind Guess Learner and one for the TLA.]

Results obtained employing optimal values of the ambiguity and overlap factors

Page 27:

Result 2: The TLA performs extremely well in a smooth domain - but still nonlinear increase

[Figure: linear-scale plot of the number of sentences consumed (0 to 4,000) against the number of parameters (0 to 40).]

Results obtained employing optimal values of the ambiguity and overlap factors

Page 28:

Result 3: The TLA performs a bit better in the Anything Goes scenario - optimizer chooses ‘accelerating’ strong smoothness

[Figure: linear-scale plot of the number of sentences consumed (0 to 4,000) against the number of parameters (0 to 35), with curves labelled 'Accelerating' and 'Linear'.]

Results obtained employing optimal values of the ambiguity and overlap factors

Page 29:

In summary

TLA is an infeasible learner:

With cross-language ambiguity uniformly distributed across the domain of languages, the number of sentences consumed by the TLA is exponential in the number of parameters.

TLA is a feasible learner:

In strongly smooth domains, the number of sentences increases at a rate much closer to linear as the number of parameters increases (i.e. the number of grammars increases exponentially).

Page 30:

No Best Strategy Conjecture (roughly in the spirit of Schaffer, 1994):

Algorithms may be extremely efficient in specific domains, but not in others; there is generally no best learning strategy.

Recommendation:

We have to know the specific facts about the distribution or shape of ambiguity in natural language.

Page 31:

Research agenda:

Three-fold approach to building a cognitive computational model of human language acquisition:

1) formulate a framework to determine what distributions of ambiguity make for feasible learning

2) conduct a psycholinguistic study to determine if the facts of human (child-directed) language are in line with the conducive distributions

3) conduct a computer simulation to check for performance nuances and potential obstacles (e.g. local maxima based on defaults, or Subset Principle violations)

Page 32:

Tag Slides (if time):

Why is Gold's theorem so often misapplied?

Page 33:

Gold’s result is sometimes taken to be strong evidence for a nativist Universal Grammar.

1) Psycholinguistic research indicates that children learn grammar based on positive exemplar sentences.

2) Gold says that Greg ⊂ Gcfg ⊂ Gcs ⊂ Gre can’t be learned this way.

Conclude: some grammatical competence must be in place before learning commences.

Gold’s result is often misapplied, but it has generated much discussion.

Page 34:

Gold misapplied.

Tacit assumptions:

Human language is a computable process.

By Church/Turing, no computational model is more powerful than Lre.

Hence, Lhuman is a subset of Lre.

The Chomsky hierarchy is an appropriate framework in which to examine Lhuman

Many, many interesting formal results about CFGs, automata, etc. But where does Lhuman lie?

Page 35:

Given the language-as-computation assumption and Gold’s result, it may be that the class of human languages intersects the classes of the Chomsky hierarchy.

[Figure: the nested language classes Lreg ⊂ Lcfg ⊂ Lcs ⊂ Lre of the Chomsky hierarchy, with Lhuman cutting across them.]

Angluin’s Theorem (1980) provides necessary and sufficient conditions for such a class (i.e. for a class of languages to be identifiable from positive data).