Learning PCFGs: Estimating Parameters, Learning Grammar Rules
Many slides are taken or adapted from slides by Dan Klein


Page 1:

Learning PCFGs: Estimating Parameters, Learning Grammar Rules

Many slides are taken or adapted from slides by Dan Klein

Page 2:

Treebanks

An example tree from the Penn Treebank

Page 3:

The Penn Treebank

• 1 million tokens
• In 50,000 sentences, each labeled with
  – A POS tag for each token
  – Labeled constituents
  – "Extra" information:
    • Phrase annotations like "TMP"
    • "Empty" constituents for wh-movement traces, empty subjects for raising constructions

Page 4:

Supervised PCFG Learning

1. Preprocess the treebank
   1. Remove all "extra" information (empties, extra annotations)
   2. Convert to Chomsky Normal Form
   3. Possibly prune some punctuation, lower-case all words, compute word "shapes", and other processing to combat sparsity
2. Count the occurrences of each nonterminal, c(N), and of each observed production rule, c(N → NL NR) and c(N → w)
3. Set the probability for each rule to the MLE:
   P(N → NL NR) = c(N → NL NR) / c(N)
   P(N → w) = c(N → w) / c(N)

Easy, peasy, lemon-squeezy.
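Concretely, steps 2-3 amount to counting and dividing. Below is a minimal Python sketch, assuming each preprocessed CNF tree is represented as a nested tuple — (label, left, right) for binary nodes and (label, word) for lexical ones; this representation is an illustrative assumption, not from the slides:

```python
from collections import defaultdict

def estimate_pcfg(trees):
    """MLE rule probabilities from a treebank of CNF trees (steps 2-3).
    Tree format (assumed): (label, left_subtree, right_subtree) for a
    binary rule N -> NL NR, and (label, word) for a lexical rule N -> w."""
    rule_counts = defaultdict(int)   # c(N -> NL NR) and c(N -> w)
    nt_counts = defaultdict(int)     # c(N)

    def count(node):
        label = node[0]
        nt_counts[label] += 1
        if len(node) == 3:                       # binary: N -> NL NR
            rule_counts[(label, node[1][0], node[2][0])] += 1
            count(node[1])
            count(node[2])
        else:                                    # lexical: N -> w
            rule_counts[(label, node[1])] += 1

    for tree in trees:
        count(tree)

    # P(N -> alpha) = c(N -> alpha) / c(N)
    return {rule: c / nt_counts[rule[0]] for rule, c in rule_counts.items()}
```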

Page 5:

Complications

• Smoothing
  – Especially for lexicalized grammars, many test productions will never be observed during training
  – We don't necessarily want to assign these productions zero probability
  – Instead, define backoff distributions, e.g.:

$$P_{\text{final}}(\mathrm{VP}_{\text{transmogrified}} \to \mathrm{V}_{\text{transmogrified}}\ \mathrm{PP}_{\text{into}}) = \alpha\, P(\mathrm{VP}_{\text{transmogrified}} \to \mathrm{V}_{\text{transmogrified}}\ \mathrm{PP}_{\text{into}}) + (1-\alpha)\, P(\mathrm{VP} \to \mathrm{V}\ \mathrm{PP}_{\text{into}})$$
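As a sketch of this interpolation, reusing the dict-based rule tables from the estimation sketch above (the function name and the value α = 0.8 are illustrative, not from the slides; in practice α would be tuned on held-out data):

```python
def backoff_prob(lex_rule, unlex_rule, p_lex, p_unlex, alpha=0.8):
    """Interpolate a (possibly unseen) lexicalized rule's probability
    with its unlexicalized backoff:
        P_final = alpha * P_lex + (1 - alpha) * P_unlex
    p_lex / p_unlex map rules to MLE probabilities; alpha = 0.8 is an
    arbitrary illustrative weight."""
    return (alpha * p_lex.get(lex_rule, 0.0)
            + (1 - alpha) * p_unlex.get(unlex_rule, 0.0))
```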

Page 6:

Problems with Supervised PCFG Learning

• Coming up with labeled data is hard!
  – Time-consuming
  – Expensive
  – Hard to adapt to new domains, tasks, languages
  – Corpus availability drives research (instead of tasks driving the research)
• The Penn Treebank took many person-years to annotate manually.

Page 7:

Unsupervised Learning of PCFGs: Feasible?

Page 8:

Unsupervised Learning

• Systems take raw data and automatically detect structure
• Why?
  – More data is available
  – Kids learn (some aspects of) language with no supervision
  – Insights into machine learning and clustering

Page 9:

Grammar Induction and Learnability

• Some have argued that learning syntax from positive data alone is impossible:
  – Gold, 1967: non-identifiability in the limit
  – Chomsky, 1980: poverty of the stimulus
• Surprising result: it's possible to get entirely unsupervised parsing to work (reasonably) well.

Page 10:

Learnability

• Learnability: formal conditions under which a class of languages can be learned
• Setup:
  – A class of languages Λ
  – An algorithm H (the learner)
  – H sees a sequence X of strings x1 … xn
  – H maps sequences X to languages L in Λ
• The question: for what classes Λ do learners H exist?

Page 11:

Learnability [Gold, 1967]

• Criterion: identification in the limit
  – A presentation of L is an infinite sequence of x's from L in which each x occurs at least once
  – A learner H identifies L in the limit if, for any presentation of L, from some point n onwards, H always outputs L
  – A class Λ is identifiable in the limit if there is some single H which correctly identifies in the limit every L in Λ
• Example: Λ = {{a}, {a,b}} is identifiable in the limit.
• Theorem (Gold, 1967): any Λ which contains all finite languages and at least one infinite language (i.e., is superfinite) is unlearnable in this sense.
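To make the example concrete, here is a toy learner (a hypothetical illustration, not from the slides) for the class Λ = {{a}, {a,b}}: since every presentation of {a,b} must eventually contain "b", the learner's guesses converge to the correct language on any presentation.

```python
def learner(presentation_prefix):
    """Toy identification-in-the-limit learner for Lambda = {{a}, {a,b}}:
    guess {a} until a 'b' is observed, then guess {a, b}.  On any
    presentation of {a,b}, 'b' appears at some finite point (each string
    occurs at least once), after which the guess is correct forever; on
    a presentation of {a}, the guess is correct from the start."""
    if "b" in presentation_prefix:
        return {"a", "b"}
    return {"a"}

# learner(["a", "a"])       -> {'a'}
# learner(["a", "b", "a"])  -> {'a', 'b'}
```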

Page 12:

Learnability [Gold, 1967]

• Proof sketch
  – Assume Λ is superfinite and that H identifies Λ in the limit
  – There exists a chain L1 ⊂ L2 ⊂ … ⊂ L∞
  – Construct the following misleading sequence:
    • Present strings from L1 until H outputs L1
    • Present strings from L2 until H outputs L2
    • …
  – This is a presentation of L∞, but H never outputs L∞

Page 13:

Learnability [Horning, 1969]

• Problem: identification in the limit requires that H succeed on all presentations, even the weird ones
• Another criterion: measure-one identification
  – Assume a distribution P_L(x) for each L
  – Assume P_L(x) puts non-zero probability on all and only the x in L
  – Assume an infinite presentation of x drawn i.i.d. from P_L(x)
  – H measure-one identifies L if the probability of drawing a sequence X from which H can identify L is 1
• Theorem (Horning, 1969): PCFGs can be identified in this sense.
  – Note: there can be misleading sequences, but they have to be (infinitely) unlikely

Page 14:

Learnability [Horning, 1969]

• Proof sketch
  – Assume Λ is a recursively enumerable set of recursive languages (e.g., the set of all PCFGs)
  – Assume an ordering on all strings: x1 < x2 < …
  – Define: two sequences A and B agree through n iff for all x < xn, x is in A ⇔ x is in B
  – Define the error set E(L, n, m):
    • All sequences such that the first m elements do not agree with L through n
    • These are the sequences which contain early strings outside of L (can't happen), or which fail to contain all of the early strings in L (happens less as m increases)
  – Claim: P(E(L, n, m)) goes to 0 as m goes to ∞
  – Let d_L(n) be the smallest m such that P(E(L, n, m)) < 2^(−n)
  – Let d(n) be the largest d_L(n) among the first n languages
  – Learner: after d(n) examples, pick the first L that agrees with the evidence through n
  – This can only fail for sequences X if X keeps showing up in E(L, n, d(n)), which happens infinitely often with probability zero

Page 15:

Learnability

• Gold's results say little about real learners (the requirements are too strong)
• Horning's algorithm is completely impractical
  – It needs astronomical amounts of data
• Even measure-one identification doesn't say anything about tree structures
  – It only talks about learning grammatical sets
  – Strong generative vs. weak generative capacity

Page 16:

Unsupervised POS Tagging

• Some (discouraging) experiments [Merialdo 94]
• Setup:
  – You know the set of allowable tags for each word (but not the frequency of each tag)
  – Learn a supervised model on k training sentences
    • Learn P(w|t) and P(ti | ti−1, ti−2) on these sentences
  – On the remaining n > k sentences, re-estimate with EM
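A minimal sketch of the supervised half of this setup, assuming sentences arrive as lists of (word, tag) pairs; smoothing, the tag-dictionary constraint, and the EM re-estimation on the unlabeled sentences are all omitted:

```python
from collections import defaultdict

def train_supervised_tagger(tagged_sentences):
    """Supervised estimates for a Merialdo-style trigram HMM tagger:
    emission P(w | t) and transition P(t_i | t_{i-2}, t_{i-1}).
    tagged_sentences: list of [(word, tag), ...] lists (assumed format)."""
    emit = defaultdict(lambda: defaultdict(int))
    trans = defaultdict(lambda: defaultdict(int))
    for sent in tagged_sentences:
        tags = ["<s>", "<s>"] + [t for _, t in sent]   # padded history
        for w, t in sent:
            emit[t][w] += 1                            # count(t emits w)
        for i in range(2, len(tags)):
            trans[(tags[i - 2], tags[i - 1])][tags[i]] += 1
    # Normalize counts into conditional probabilities (the MLE)
    p_emit = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
              for t, ws in emit.items()}
    p_trans = {h: {t: c / sum(ts.values()) for t, c in ts.items()}
               for h, ts in trans.items()}
    return p_emit, p_trans
```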

Page 17:

Merialdo: Results

Page 18:

Grammar Induction

Unsupervised Learning of Grammars and Parameters

Page 19:

Right-branching Baseline

• In English (but not necessarily in other languages), trees tend to be right-branching.
• A simple, English-specific baseline is to choose the right-chain structure for each sentence.
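The baseline is trivial to implement; a sketch (the nested-tuple representation is an illustrative choice):

```python
def right_branching(words):
    """Right-branching baseline: nest each word over the remainder of
    the sentence, e.g. [w1 [w2 [w3 w4]]].  Returns a nested tuple
    (a bare word for a single-word span)."""
    if len(words) == 1:
        return words[0]
    return (words[0], right_branching(words[1:]))

# right_branching("the cat sat down".split())
# -> ('the', ('cat', ('sat', 'down')))
```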

Page 20:

Distributional Clustering

Page 21:

Nearest Neighbors

Page 22:

Learn PCFGs with EM [Lari and Young, 1990]

• Setup:
  – Full binary grammar with n nonterminals {X1, …, Xn} (that is, at the beginning, the grammar has all possible rules)
  – Parse uniformly/randomly at first
  – Re-estimate rule expectations from the parses
  – Repeat
• Their conclusion: it doesn't really work

Page 23:

EM for PCFGs: Details

1. Start with a "full" grammar, with all possible binary rules for our nonterminals N1 … Nk. Designate one of them as the start symbol, say N1.
2. Assign some starting distribution to the rules, such as:
   1. Random
   2. Uniform
   3. Some "smart" initialization techniques (see the assigned reading)
3. E-step: take an unannotated sentence S and compute, for all nonterminals N, NL, NR and all terminals w:
   E(N | S), E(N → NL NR, N is used | S), E(N → w, N is used | S)
4. M-step: reset rule probabilities to the MLE:
   P(N → NL NR) = E(N → NL NR | S) / E(N | S)
   P(N → w) = E(N → w | S) / E(N | S)
5. Repeat steps 3 and 4 until the rule probabilities stabilize, or "converge".
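A skeleton of this procedure in Python; expected_counts is a hypothetical helper standing in for the E-step, which the next slides derive via inside and outside probabilities (a per-sentence version is sketched after the E-step slide):

```python
import random
from itertools import product

def init_full_grammar(nonterminals, terminals, random_init=True):
    """Steps 1-2: a 'full' binary grammar with every possible rule
    N -> NL NR and N -> w, initialized randomly (or uniformly) and
    normalized so each parent's rules sum to 1."""
    probs = {}
    for n in nonterminals:
        rules = [(n, l, r) for l, r in product(nonterminals, repeat=2)]
        rules += [(n, w) for w in terminals]
        weights = [random.random() if random_init else 1.0 for _ in rules]
        z = sum(weights)
        probs.update({rule: wt / z for rule, wt in zip(rules, weights)})
    return probs

def em(probs, sentences, iterations=20):
    """Steps 3-5: alternate E and M steps.  expected_counts is assumed
    here (hypothetical helper); it should return E(rule used | S) and
    E(N | S), summed over the sentences."""
    for _ in range(iterations):
        e_rule, e_nt = expected_counts(probs, sentences)   # E-step
        probs = {rule: e / e_nt[rule[0]]                   # M-step (MLE)
                 for rule, e in e_rule.items() if e_nt[rule[0]] > 0}
    return probs
```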

Page 24:

E-Step

• Let $\pi = P(N^1 \Rightarrow^{*} w_{1m} \mid G)$, the total probability of the sentence.
• We can define the expectations we want in terms of the $\pi$, $\alpha$ (outside), and $\beta$ (inside) quantities:

$$E(N^j \text{ is used in the derivation} \mid N^1 \Rightarrow^{*} w_{1m}) = \frac{1}{\pi} \sum_{p=1}^{m} \sum_{q=p}^{m} \alpha_j(p,q)\,\beta_j(p,q)$$

$$E(N^j \to N^l N^r,\ N^j \text{ is used} \mid N^1 \Rightarrow^{*} w_{1m}) = \frac{1}{\pi} \sum_{p=1}^{m-1} \sum_{q=p+1}^{m} \sum_{d=p}^{q-1} \alpha_j(p,q)\, P(N^j \to N^l N^r)\, \beta_l(p,d)\, \beta_r(d+1,q)$$

$$E(N^j \to w,\ N^j \text{ is used} \mid N^1 \Rightarrow^{*} w_{1m}) = \frac{1}{\pi} \sum_{\substack{p=1 \\ w_p = w}}^{m} \alpha_j(p,p)\,\beta_j(p,p)$$
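In code, these three expectations for a single sentence look as follows, given precomputed outside (alpha) and inside (beta) charts; this sketch uses 0-based inclusive span indices and assumes nonterminals[0] is the start symbol N^1. Summing its outputs over sentences yields the expected_counts helper assumed in the EM skeleton above:

```python
def rule_expectations(probs, words, nonterminals, alpha, beta):
    """The slide's E-step quantities from inside (beta) and outside
    (alpha) charts, sketched on the next two slides.  Returns
    (E(rule used | S), E(N | S)) for one sentence."""
    m = len(words)
    pi = beta[nonterminals[0]][(0, m - 1)]     # P(N^1 =>* w_1..w_m); assumed > 0
    e_rule, e_nt = {}, {}
    for rule, rule_p in probs.items():
        if len(rule) == 3:                     # binary rule N^j -> N^l N^r
            j, l, r = rule
            e = sum(alpha[j][(p, q)] * rule_p
                    * beta[l][(p, d)] * beta[r][(d + 1, q)]
                    for p in range(m) for q in range(p + 1, m)
                    for d in range(p, q))
        else:                                  # lexical rule N^j -> w
            j, w = rule
            e = sum(alpha[j][(p, p)] * beta[j][(p, p)]
                    for p in range(m) if words[p] == w)
        e_rule[rule] = e / pi
    for j in nonterminals:
        # E(N^j used | S) = (1/pi) * sum_{p<=q} alpha_j(p,q) beta_j(p,q)
        e_nt[j] = sum(alpha[j][(p, q)] * beta[j][(p, q)]
                      for p in range(m) for q in range(p, m)) / pi
    return e_rule, e_nt
```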

Page 25:

Inside Probabilities

Base case:

$$\beta_j(k,k) = P(w_k \mid N^j, G) = P(N^j \to w_k \mid G)$$

Induction:

$$\beta_j(p,q) = P(w_{pq} \mid N^j, G) = \sum_{l,r} \sum_{d=p}^{q-1} P(N^j \to N^l N^r)\, \beta_l(p,d)\, \beta_r(d+1,q)$$

[Diagram: $N^j$ spans $w_p \ldots w_q$ and splits at position $d$ into $N^l$ over $w_p \ldots w_d$ and $N^r$ over $w_{d+1} \ldots w_q$]
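A direct transcription of this recursion, filling the chart bottom-up by span length (CKY-style); the rule table is the dict used in the earlier sketches:

```python
def inside(probs, words, nonterminals):
    """Inside probabilities beta[j][(p, q)] = P(N^j =>* w_p..w_q | G),
    with 0-based indices and q inclusive."""
    m = len(words)
    beta = {j: {} for j in nonterminals}
    # Base case: beta_j(k, k) = P(N^j -> w_k)
    for k, w in enumerate(words):
        for j in nonterminals:
            beta[j][(k, k)] = probs.get((j, w), 0.0)
    # Induction: sum over child pairs (l, r) and split points d
    for span in range(2, m + 1):
        for p in range(0, m - span + 1):
            q = p + span - 1
            for j in nonterminals:
                total = 0.0
                for l in nonterminals:
                    for r in nonterminals:
                        rule_p = probs.get((j, l, r), 0.0)
                        if rule_p == 0.0:
                            continue
                        for d in range(p, q):
                            total += rule_p * beta[l][(p, d)] * beta[r][(d + 1, q)]
                beta[j][(p, q)] = total
    return beta
```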

Page 26:

Outside Probabilities

Base case:

$$\alpha_1(1,m) = 1, \qquad \alpha_j(1,m) = 0 \ \text{ for } j \neq 1$$

Induction:

$$\alpha_j(p,q) = \sum_{l,r} \sum_{e=q+1}^{m} \alpha_l(p,e)\, P(N^l \to N^j N^r)\, \beta_r(q+1,e) \;+\; \sum_{l,r} \sum_{e=1}^{p-1} \alpha_l(e,q)\, P(N^l \to N^r N^j)\, \beta_r(e,p-1)$$

[Diagram: $N^j$ over $w_p \ldots w_q$ as a child of a larger constituent $N^l$, combined with a sibling $N^r$ on its right or left]
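The same idea in code, organized top-down: this sketch consumes the inside chart from the previous snippet and pushes each parent's outside mass down to its two children, which is equivalent to the recursion above viewed parent-to-child:

```python
def outside(probs, words, nonterminals, beta, start):
    """Outside probabilities alpha[j][(p, q)], from the base case
    alpha_start(0, m-1) = 1 (0-based inclusive spans)."""
    m = len(words)
    alpha = {j: {(p, q): 0.0 for p in range(m) for q in range(p, m)}
             for j in nonterminals}
    alpha[start][(0, m - 1)] = 1.0
    # Process spans widest-first, so parents are done before children
    for span in range(m, 0, -1):
        for p in range(0, m - span + 1):
            q = p + span - 1
            for l in nonterminals:
                a = alpha[l][(p, q)]
                if a == 0.0:
                    continue
                # Distribute N^l's outside mass via each rule N^l -> N^j N^r
                for j in nonterminals:
                    for r in nonterminals:
                        rule_p = probs.get((l, j, r), 0.0)
                        if rule_p == 0.0:
                            continue
                        for d in range(p, q):
                            alpha[j][(p, d)] += a * rule_p * beta[r][(d + 1, q)]
                            alpha[r][(d + 1, q)] += a * rule_p * beta[j][(p, d)]
    return alpha
```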

Page 27:

Problem: Model Symmetries

Page 28:

Distributional Syntax?

Page 29:

Problem: Identifying Constituents

Page 30:

A nested distributional model

• We'd like a model that:
  – Ties spans to linear contexts (like distributional clustering)
  – Considers only proper tree structures (like PCFGs)
  – Has no symmetries to break (like a dependency model)

Page 31:

Constituent Context Model (CCM)

Page 32:

Results: Constituency

Page 33:

Results: Dependencies

Page 34:

Results: Combined Models

Page 35:

Multilingual Results