From Baby Steps to Leapfrog:
How “Less is More”in Unsupervised Dependency Parsing
Valentin I. Spitkovsky
with Hiyan Alshawi (Google Inc.) and Daniel Jurafsky (Stanford University)
Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 1 / 30
Overview
Idea: (At Least) Two Axes Worth Scaffolding

Model (or Algorithmic) Complexity [classic NLP]
— word alignment (unsupervised), e.g., IBM models 1–5 (Brown et al., 1993)
— parsing (supervised), e.g., “coarse-to-fine” grammars (Charniak and Johnson, 2005; Petrov, 2009)

Data (or Problem / Task) Complexity [rare in NLP]
— reinforcement learning, e.g., robot navigation (Singh, 1992; Sanger, 1994)
— closest in NLP: cautious named entity classification (Collins and Singer, 1999; Yarowsky, 1995)
Overview
Outline: Three Data-Complexity-Aware Techniques

Baby Steps: scaffolding on data complexity — iterative, requires no initialization

Less is More: filtering by data complexity — batch, capable of using a good initializer

Leapfrog: a combination (best of both worlds) — intended as an efficiency hack (but performs best)
The Problem
Problem: Unsupervised Learning of Parsing

Input: Raw Text (Sentences, Tokens and POS-tags)
... By most measures, the nation’s industrial sector is now
growing very slowly — if at all. Factory payrolls fell in
September. So did the Federal Reserve ...

Output: Syntactic Structures (and a Probabilistic Grammar)

Example dependency analysis (♦ marks the root):
NN      NNS      VBD  IN  NN        ♦
|       |        |    |   |         |
Factory payrolls fell in  September .
The Problem
Motivation: Unsupervised (Dependency) Parsing

Insert your favorite reason(s) why you’d like to parse anything in the first place...

... adjust for any data without reference treebanks:
— i.e., exotic languages and/or genres (e.g., legal).

Potential applications:
◮ machine translation — word alignment, phrase extraction, reordering;
◮ web search — retrieval, query refinement;
◮ question answering, speech recognition, etc.
State-of-the-Art
State-of-the-Art: Directed Dependency Accuracy

42.2% on Section 23 (all sentences) of WSJ (Cohen and Smith, 2009)

31.7% for the (right-branching) baseline (Klein and Manning, 2004)

Scoring example:
NN      NNS      VBD  IN  NN        ♦
|       |        |    |   |         |
Factory payrolls fell in  September .

Directed Score: 3/5 = 60% (baseline: 2/5 = 40%);
Undirected Score: 4/5 = 80% (baseline: 4/5 = 80%).
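The directed and undirected scores above are straightforward to compute. A minimal sketch, assuming a head-array encoding of parses (token i+1's head is `heads[i]`, with 0 standing for the root ♦) that is not part of the original deck:

```python
def dependency_accuracy(guess, gold):
    """Return (directed, undirected) head accuracy for 1-indexed head arrays.

    guess[i] and gold[i] give the head of token i+1; 0 marks the root.
    """
    n = len(gold)
    directed = sum(g == r for g, r in zip(guess, gold)) / n
    # Undirected scoring also credits an arc whose direction is flipped:
    # guessing head g for token i+1 counts if gold makes i+1 the head of g.
    undirected = sum(
        g == r or (0 < g <= n and gold[g - 1] == i + 1)
        for i, (g, r) in enumerate(zip(guess, gold))
    ) / n
    return directed, undirected
```

For instance, on a five-word sentence, a parse with three correct heads plus one arc pointing the wrong way scores 3/5 directed and 4/5 undirected, as in the example.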
State-of-the-Art
State-of-the-Art: A Brief History

1992 — word classes (Carroll and Charniak)
1998 — greedy linkage via mutual information (Yuret)
2001 — iterative re-estimation with EM (Paskin)
2004 — right-branching baseline; valence (DMV) (Klein and Manning)
2004 — annealing techniques (Smith and Eisner)
2005 — contrastive estimation (Smith and Eisner)
2006 — structural biasing (Smith and Eisner)
2007 — common cover link representation (Seginer)
2008 — logistic normal priors (Cohen et al.)
2009 — lexicalization and smoothing (Headden et al.)
2009 — soft parameter tying (Cohen and Smith)
State-of-the-Art
State-of-the-Art: Dependency Model with Valence

a head-outward model, with word classes and valence/adjacency (Klein and Manning, 2004):
each head h generates arguments a1, a2, ... in each direction, then STOP.

P(t_h) = ∏_{dir ∈ {L, R}} P_STOP(c_h, dir, 1[n = 0]) ·
         ∏_{i=1}^{n} P(t_{a_i}) · P_ATTACH(c_h, dir, c_{a_i}) · (1 − P_STOP(c_h, dir, 1[i = 1])),

where n = |args(h, dir)| and the indicator 1[·] is the adjacency (adj) flag.
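Read as a generative story, the formula says: in each direction, the head repeatedly decides whether to stop; each non-stop decision attaches one argument, which recursively generates its own subtree. A sketch of that product, under assumed encodings (nested tuples `(head_class, left_args, right_args)` for trees, and dictionary probability tables) that are illustrative stand-ins, not the paper's code:

```python
def dmv_tree_prob(tree, p_stop, p_attach):
    """Probability of a dependency tree under a DMV-style head-outward model."""
    head, left, right = tree
    prob = 1.0
    for direction, args in (("L", left), ("R", right)):
        # Final stop decision; it is "adjacent" iff no dependents were generated.
        prob *= p_stop[(head, direction, len(args) == 0)]
        for i, arg in enumerate(args):
            # Continue (don't stop; adjacent only for the first argument),
            # attach the argument's class, then recurse for P(t_{a_i}).
            prob *= (1.0 - p_stop[(head, direction, i == 0)])
            prob *= p_attach[(head, direction, arg[0])]
            prob *= dmv_tree_prob(arg, p_stop, p_attach)
    return prob
```

With all table entries set to 0.5, a verb heading a single left noun dependent multiplies six 0.5 factors (two stop decisions for the noun, plus the verb's continue, attach, and two stop decisions).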
State-of-the-Art
State-of-the-Art: Unsupervised Learning Engine

EM, via inside-outside re-estimation (Baker, 1979)

[Figure: a chart over w1 ... wp−1 [wp ... wq] wq+1 ... wm, with inside score β for
nonterminal N^j over the span and outside score α for the rest, rooted at N^1
(Manning and Schutze, 1999) — treated here as a BLACK BOX.]
State-of-the-Art
State-of-the-Art: The Standard Corpus

Training: WSJ10 (Klein, 2005)
◮ the Wall Street Journal section of the Penn Treebank Project (Marcus et al., 1993)
◮ ... stripped of punctuation, etc.
◮ ... filtered down to sentences left with no more than 10 POS tags;
◮ ... and converted to reference dependencies using “head percolation rules” (Collins, 1999).

Evaluation: Section 23 of WSJ∞ (all sentences).
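The WSJ10 preprocessing above is easy to reproduce in miniature. A hedged sketch, where the punctuation-tag set is a simplified stand-in for the Penn Treebank's full inventory:

```python
# Simplified punctuation-style tags to strip before counting length.
PUNCT_TAGS = {".", ",", ":", "``", "''", "-LRB-", "-RRB-", "#", "$"}

def make_wsj_k(tagged_sentences, k=10):
    """Keep sentences left with no more than k tokens after stripping punctuation."""
    kept = []
    for sent in tagged_sentences:
        stripped = [(word, tag) for word, tag in sent if tag not in PUNCT_TAGS]
        if 0 < len(stripped) <= k:
            kept.append(stripped)
    return kept
```

The same function with varying k yields the graduated WSJ1, WSJ2, ... corpora used later in the talk.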
State-of-the-Art
State-of-the-Art: The Standard Corpus

[Figure: corpus size of WSJk as the length cutoff k grows from 5 to 45; left axis:
Sentences (1,000s), 5–45; right axis: Tokens (1,000s), 100–900.]
(At Least) Two Issues
Issue I: Why so little data?

extra unlabeled data helps semi-supervised parsing (Suzuki et al., 2009)

yet state-of-the-art unsupervised methods use even less than what’s available for supervised training...

we will explore (three) judicious uses of data and simple, scalable machine learning techniques
(At Least) Two Issues
Issue II: Non-convex objective...

maximizing the probability of data (sentences):

θ_UNS = argmax_θ Σ_s log P_θ(s), where P_θ(s) = Σ_{t ∈ T(s)} P_θ(t)

supervised objective would be convex (counting):

θ_SUP = argmax_θ Σ_s log P_θ(t*(s)).

in general, θ_SUP ≠ θ_UNS ... (see CoNLL)

initialization matters!
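The gap between the two objectives can be seen numerically. A toy sketch with made-up tree probabilities, purely for illustration:

```python
import math

def unsupervised_ll(tree_probs_per_sentence):
    # sum_s log sum_{t in T(s)} P(t): score the marginal of each sentence.
    return sum(math.log(sum(probs.values())) for probs in tree_probs_per_sentence)

def supervised_ll(tree_probs_per_sentence, gold_trees):
    # sum_s log P(t*(s)): score only the gold tree of each sentence.
    return sum(math.log(probs[g])
               for probs, g in zip(tree_probs_per_sentence, gold_trees))
```

With one sentence whose candidate trees have probabilities {t1: 0.3, t2: 0.1} and gold tree t2, the unsupervised objective sees log 0.4 while the supervised one sees log 0.1: raising the marginal can reward the wrong tree's probability mass.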
(At Least) Two Issues
Issues: The Lay of the Land

[Figure: Directed Dependency Accuracy (%) on WSJk, for cutoffs k from 5 to 40
(y-axis 20–90%), comparing Uninformed and Oracle initialization with
K&M’s Ad-Hoc Harmonic Init.]
Baby Steps
Idea I: Baby Steps ... as Non-convex Optimization

global non-convex optimization is hard ...

meta-heuristic: take guesswork out of local search
— start with an easy (convex) case
— slowly extend it to the fully complex target task
— take tiny (cautious) steps in the problem space
— ... try not to stray far from relevant neighborhoods in the solution space

base case: sentences of length one (trivial — no init)

incremental step: smooth WSJk; re-init WSJ(k + 1)

... this really is grammar induction!
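The schedule above can be sketched as a short training loop. `train_em`, `smooth`, and `uniform_model` are hypothetical stand-ins for the actual estimation machinery, not the paper's code:

```python
def baby_steps(corpus_by_length, max_len, train_em, smooth, uniform_model):
    """Train on WSJ1, then repeatedly warm-start on WSJ(k + 1) from WSJk."""
    model = uniform_model  # base case: length-1 sentences need no clever init
    data = []
    for k in range(1, max_len + 1):
        data.extend(corpus_by_length.get(k, []))    # WSJk holds lengths 1..k
        model = train_em(data, init=smooth(model))  # re-init from previous step
    return model
```

Each step trains on only a slightly harder corpus than the last, so the model never strays far from a solution that already explains the simpler data.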
Baby Steps
Idea I: Baby Steps ... as Graduated Learning

WSJ1 — Atone (verbs!)
WSJ2 — Darkness fell. / It is. / Judge Not (nouns!)
WSJ3 — Become a Lobbyist / But many have. / They didn’t. (determiners!)
Baby Steps

Idea I: Baby Steps ... and Related Notions

shaping (Skinner, 1938)
less is more (Kail, 1984; Newport, 1988; 1990)
starting small (Elman, 1993)
◮ scaffold on model complexity [restrict memory]
◮ scaffold on data complexity [restrict input]
controversy! (Rohde and Plaut, 1999)
stepping stones (Brown et al., 1993)
coarse-to-fine (Charniak and Johnson, 2005)
curriculum learning (Bengio et al., 2009)
continuation methods (Allgower and Georg, 1990)
successive approximations!

Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 17 / 30
Baby Steps

Idea I: Baby Steps ... Results!

[Plot: Directed Dependency Accuracy (%) on WSJk, for sentence-length cutoffs k = 5..40; curves shown: Uninformed, Oracle, K&M (Ad-Hoc Harmonic Init), Baby Steps]

Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 18 / 30
Baby Steps
Idea I: Baby Steps ... Concerns?
ignores a good initializer
unnecessarily meticulous
excruciatingly slow!
about a year behind state-of-the-art (on long sentences)
Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 19 / 30
Less is More
Idea II: Less is More
short sentences are not representative (and few)
long sentences are overwhelmingly difficult ...
is there a “sweet spot” data gradation?
perhaps train where Baby Steps flatlines!
Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 20 / 30
Less is More

Idea II: Less is More ... the Learning Curve

[Plot: Cross-entropy h (in bits per token) on WSJ45, as a function of training gradation WSJk, k = 5..45; a knee at low k is followed by a tight, flat, asymptotic bound over [7, 15]]

— automatically detect the knee: [7, 15]
— train at the “sweet spot” gradation: WSJ15

Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 21 / 30
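One simple way to detect such a knee automatically (a sketch only; the talk does not specify this exact method) is to pick the point on the curve farthest from the chord joining its endpoints — where the fast drop gives way to the flat asymptote:

```python
import math

def find_knee(xs, ys):
    """Detect the 'knee' of a decreasing learning curve: the point
    farthest from the straight line joining the two endpoints."""
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    a, b = y1 - y0, x0 - x1          # chord as a*x + b*y + c = 0
    c = x1 * y0 - x0 * y1
    norm = math.hypot(a, b)
    dists = [abs(a * x + b * y + c) / norm for x, y in zip(xs, ys)]
    return xs[max(range(len(xs)), key=dists.__getitem__)]

# Synthetic curve shaped like the slide: a fast drop, then a flat tail.
xs = list(range(1, 11))
ys = [5.0, 4.0, 3.4, 3.1, 3.05, 3.02, 3.01, 3.0, 3.0, 3.0]
knee = find_knee(xs, ys)  # -> 4
```

Training is then restricted to the gradation at (or just past) the detected knee, which is where the cross-entropy bound becomes tight and flat.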
Less is More

Idea II: Less is More ... Results!

[Plot: Directed Dependency Accuracy (%) on WSJ40, as a function of training gradation WSJk, k = 5..40; curves shown: Oracle, Baby Steps, K&M*, with the bracketed “Less is More” region at the sweet spot]

Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 22 / 30
Less is More
Idea II: Less is More ... Concerns?
discards most of the data
beats state-of-the-art (on long sentences, off WSJ15)
ignores a decent complementary initialization strategy
Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 23 / 30
Leapfrog
Idea III: Leapfrog ... a Hack
use both good systems!
thorough training up to WSJ15, where it’s cheap
use both good initializers (mix their best parse trees)
execute just a few steps of EM where it’s expensive
hop on from WSJ15 to WSJ45, via WSJ30...
Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 24 / 30
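The recipe above can be sketched as follows. All function names here are hypothetical placeholders, not the authors' implementation: train both systems thoroughly where it is cheap (up to WSJ15), seed a fresh model from a mixture of their best parse trees, then take only a few EM steps at each longer gradation:

```python
def leapfrog(corpus, train_full, best_parses, init_from_parses, em_few_steps):
    """Thorough training up to WSJ15 (cheap), a mixed-initializer
    restart, then just a few EM steps on longer gradations (expensive)."""
    short = [s for s in corpus if len(s) <= 15]
    model_a = train_full(short, strategy="less_is_more")
    model_b = train_full(short, strategy="baby_steps")
    # Mix the two systems' best (Viterbi) parse trees to seed a new model.
    mixed = best_parses(model_a, short) + best_parses(model_b, short)
    model = init_from_parses(mixed)
    for k in (30, 45):  # hop on from WSJ15 to WSJ45, via WSJ30
        model = em_few_steps(model, [s for s in corpus if len(s) <= k])
    return model

# Toy stand-ins to illustrate the control flow only:
train = lambda data, strategy: {"strategy": strategy}
parses = lambda m, data: [(m["strategy"], tuple(s)) for s in data]
seed = lambda trees: {"n_trees": len(trees)}
few_em = lambda m, data: dict(m, last_gradation=len(data))
corpus = [["w"] * n for n in (3, 12, 20, 40)]
model = leapfrog(corpus, train, parses, seed, few_em)
```

The point of the hack is cost asymmetry: full EM convergence is only ever paid for on short sentences, while long sentences see just a handful of iterations.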
Leapfrog

Idea III: Leapfrog ... Results!

[Plot: Directed Dependency Accuracy (%) on WSJk, k = 5..40; curves shown: Oracle, Uninformed, Baby Steps, K&M*, Leapfrog]

Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 25 / 30
Results

Results: ... on Section 23 of WSJ

Right-Branching (Klein and Manning, 2004)       31.7%
DMV @10                                         34.2%
Baby Steps @15                                  39.2%
Baby Steps @45                                  39.4%
Soft Parameter Tying (Cohen and Smith, 2009)    42.2%
Less is More @15                                44.1%
Leapfrog @45                                    45.0%

Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 26 / 30
Conclusion
Summary
explored scaffolding on data complexity
awareness of data complexity does help!
beats state-of-the-art with older techniques
Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 27 / 30
Conclusion
Conclusion
(need a less adversarial learning algorithm)
paradox: improved performance with less data
despite discarding samples from the true (test) distribution
focusing on simple examples guides unsupervised learning
mirrors supervised boosting (Freund and Schapire, 1997)
Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 28 / 30
Conclusion

Teaser

we push the state-of-the-art further, to 50.4% (up another 5%) using even faster and simpler methods!
... hear us at CoNLL and ACL (Spitkovsky et al., 2010)
similar approaches may apply in other settings (e.g., word alignment)
... more to come!

Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 29 / 30