Outline
• Learning
– Overview
– Details
– Example
– Lexicon learning
– Supervision signals
Supervision in syntactic parsing
Input: a sentence with its full parse tree
(S (NP (NP ESSLLI 2016) (NP the known summer school))
   (VP (V is) (VP (V located) (PP in Bolzano))))
Output: a parse tree for a new sentence
They play football
(S (NP They) (VP (V play) (NP football)))
Supervision in semantic parsing
Input:
Heavy supervision (logical forms):
How tall is Lebron James? → HeightOf.LebronJames
What is Steph Curry’s daughter called? → ChildrenOf.StephCurry ⊓ Gender.Female
Youngest player of the Cavaliers → argmin(PlayerOf.Cavaliers, BirthDateOf)
...
Light supervision (denotations):
How tall is Lebron James? → 203cm
What is Steph Curry’s daughter called? → Riley Curry
Youngest player of the Cavaliers → Kyrie Irving
...
Output: a derivation for a new utterance
Clay Thompson’s weight
WeightOf.ClayThompson ⇒ 205 lbs
├─ ClayThompson ← "Clay Thompson"
└─ WeightOf ← "’s weight"
[Zelle & Mooney, 1996; Zettlemoyer & Collins, 2005; Clarke et al., 2010; Liang et al., 2011]
Learning in a nutshell
utterance → Parsing → candidate derivations → Label → Update model
0. Define model for derivations
1. Generate candidate derivations (later)
2. Label as correct and incorrect
3. Update model to favor correct trees
Training intuition
Where did Mozart tupress?
PlaceOfBirth.WolfgangMozart ⇒ Salzburg
PlaceOfDeath.WolfgangMozart ⇒ Vienna
PlaceOfMarriage.WolfgangMozart ⇒ Vienna
Answer: Vienna
Where did Hogarth tupress?
PlaceOfBirth.WilliamHogarth ⇒ London
PlaceOfDeath.WilliamHogarth ⇒ London
PlaceOfMarriage.WilliamHogarth ⇒ Paddington
Answer: London
Only PlaceOfDeath is consistent with both answers, so training shifts weight toward mapping the unknown verb "tupress" to PlaceOfDeath.
Outline
• Learning
– Overview
– Details
– Example
– Lexicon learning
– Supervision signals
Constructing derivations
Type.Person ⊓ PlaceLived.Chicago     [intersect]
├─ Type.Person ← "people"            [lexicon]
└─ PlaceLived.Chicago                [join]
   ├─ PlaceLived ← "lived in"        [lexicon]
   └─ Chicago ← "Chicago"            [lexicon]
(the word "who" is skipped)
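To make this concrete, here is a minimal sketch (illustrative only, not SEMPRE's actual classes) of representing such a derivation tree, with lexicon entries as leaves and join/intersect as internal nodes:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Derivation:
    rule: str                # "lexicon", "join", or "intersect"
    logical_form: str        # lambda-DCS formula built so far
    span: str = ""           # utterance span covered (lexicon leaves)
    children: List["Derivation"] = field(default_factory=list)

people = Derivation("lexicon", "Type.Person", span="people")
lived = Derivation("lexicon", "PlaceLived", span="lived in")
chicago = Derivation("lexicon", "Chicago", span="Chicago")
root = Derivation("intersect", "Type.Person ⊓ PlaceLived.Chicago",
                  children=[people,
                            Derivation("join", "PlaceLived.Chicago",
                                       children=[lived, chicago])])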
Many possible derivations!
x = people who have lived in Chicago
Set of candidate derivations D(x):

Type.Person ⊓ PlaceLived.Chicago       [intersect]
├─ Type.Person ← "people"              [lexicon]
└─ PlaceLived.Chicago                  [join]
   ├─ PlaceLived ← "lived in"          [lexicon]
   └─ Chicago ← "Chicago"              [lexicon]

Type.Org ⊓ PresentIn.ChicagoMusical    [intersect]
├─ Type.Org ← "people"                 [lexicon]
└─ PresentIn.ChicagoMusical            [join]
   ├─ PresentIn ← "lived in"           [lexicon]
   └─ ChicagoMusical ← "Chicago"       [lexicon]
x: utterance
d: derivation (e.g., the tree for "people who have lived in Chicago" above)
Feature vector and parameters in R^F:

feature                      φ(x, d)   θ (learned)
apply join                   1         1.2
apply intersect              1         0.6
apply lexicon                3         2.1
lived maps to PlacesLived    1         3.1
lived maps to PlaceOfBirth   0         -0.4
born maps to PlaceOfBirth    0         2.7
...                          ...       ...

Scoreθ(x, d) = φ(x, d)ᵀθ = 1.2·1 + 0.6·1 + 2.1·3 + 3.1·1 + (−0.4)·0 + 2.7·0 + ...
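A minimal sketch of this dot product, using the feature counts and (made-up) weights from the table:

phi = {"apply join": 1, "apply intersect": 1, "apply lexicon": 3,
       "lived maps to PlacesLived": 1}
theta = {"apply join": 1.2, "apply intersect": 0.6, "apply lexicon": 2.1,
         "lived maps to PlacesLived": 3.1, "lived maps to PlaceOfBirth": -0.4,
         "born maps to PlaceOfBirth": 2.7}

# Score(x, d) = phi(x, d) . theta; features absent from phi contribute 0
score = sum(count * theta.get(f, 0.0) for f, count in phi.items())
print(score)  # 1.2*1 + 0.6*1 + 2.1*3 + 3.1*1 = 11.2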
Deep learning alert!
The feature vector φ(x, d) is constructed by hand
Constructing good features is hard
Algorithms are likely to do it better
Perhaps we can train φ(x, d):
φ(x, d) = Fψ(x, d), where ψ are the parameters
Log-linear model
Candidate derivations: D(x)
Model: distribution over derivations d given utterance x

pθ(d | x) = exp(Scoreθ(x, d)) / ∑_{d′ ∈ D(x)} exp(Scoreθ(x, d′))

For example, with four candidates:
Scoreθ(x, d) = [1, 2, 3, 4]
pθ(d | x) = [e¹, e², e³, e⁴] / (e¹ + e² + e³ + e⁴)

Parsing: find the top-K derivation trees Dθ(x)
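A minimal sketch of this distribution: a softmax over derivation scores, reproducing the [1, 2, 3, 4] example above:

import math

def derivation_distribution(scores):
    # p(d | x) = exp(score) normalized over all candidates in D(x)
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

print(derivation_distribution([1, 2, 3, 4]))
# ≈ [0.032, 0.087, 0.237, 0.644]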
Features
Dense features:
• intersection=0.67
• ent-popularity:HIGH
• denotation-size:1
Sparse features:
• bridge-binary:STUDY
• born:PlaceOfBirth
• city:Type.Location
Syntactic features:
• ent-pos:NNP NNP
• join-pos:V NN
• skip-pos:IN
Grammar features:
• Binary->Verb
Learning θ: maximum-likelihood
Training data: utterances paired with denotations or with logical forms:
What’s Bulgaria’s capital? → Sofia
What movies has Tom Cruise been in? → TopGun, VanillaSky, ...
...
What’s Bulgaria’s capital? → CapitalOf.Bulgaria
What movies has Tom Cruise been in? → Type.Movie ⊓ HasPlayed.TomCruise
...

argmaxθ ∑_{i=1}^n log pθ(y⁽ⁱ⁾ | x⁽ⁱ⁾) = argmaxθ ∑_{i=1}^n log ∑_{d⁽ⁱ⁾} pθ(d⁽ⁱ⁾ | x⁽ⁱ⁾) R(d⁽ⁱ⁾)

R(d) = 1 if d.z = z⁽ⁱ⁾, 0 otherwise          (logical-form supervision)
R(d) = 1 if [d.z]K = y⁽ⁱ⁾, 0 otherwise       (denotation supervision)
R(d) = F1([d.z]K, y⁽ⁱ⁾)                      (partial credit on denotations)
Optimization: stochastic gradient descent
For every example:
O(θ) = log ∑_d pθ(d | x) R(d)
∇O(θ) = E_{qθ(d|x)}[φ(x, d)] − E_{pθ(d|x)}[φ(x, d)]
where
pθ(d | x) ∝ exp(φ(x, d)ᵀθ)
qθ(d | x) ∝ exp(φ(x, d)ᵀθ) · R(d)

For example:
pθ(D(x)) = [0.2, 0.1, 0.1, 0.6]
R(D(x)) = [1, 0, 0, 1]
qθ(D(x)) = [0.25, 0, 0, 0.75]    (qθ = pθ ⊙ R / pθᵀR)

Gradient:
0.05·φ(x, d1) − 0.1·φ(x, d2) − 0.1·φ(x, d3) + 0.15·φ(x, d4)
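A minimal numeric sketch of this step: q is p restricted to the correct derivations (R = 1) and renormalized, and each derivation's features are weighted by (q − p):

p = [0.2, 0.1, 0.1, 0.6]   # p over D(x)
R = [1, 0, 0, 1]           # reward per derivation

z = sum(pi * ri for pi, ri in zip(p, R))      # p.R = 0.8
q = [pi * ri / z for pi, ri in zip(p, R)]     # [0.25, 0.0, 0.0, 0.75]
coeffs = [qi - pi for qi, pi in zip(q, p)]    # [0.05, -0.1, -0.1, 0.15]
print(q, coeffs)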
Training
Input: {xi, yi}, i = 1..n
Output: θ

θ ← 0
for iteration τ and example i:
  D(xi) ← argmaxK(pθ(d | xi))
  θ ← θ + ητ,i (E_{qθ(d|xi)}[φ(xi, d)] − E_{pθ(d|xi)}[φ(xi, d)])

ητ,i: learning rate
Regularization often added (L2, L1, ...)
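A minimal sketch of this loop, assuming hypothetical helpers: parse_top_k(x, theta) returns candidate derivations, features(x, d) returns a feature dict, and reward(d, y) implements R(d) for the chosen supervision:

import math
from collections import defaultdict

def sgd_train(examples, parse_top_k, features, reward, epochs=5, eta=0.1):
    theta = defaultdict(float)
    for _ in range(epochs):
        for x, y in examples:
            derivs = parse_top_k(x, theta)
            scores = [sum(theta[f] * v for f, v in features(x, d).items())
                      for d in derivs]
            z = sum(math.exp(s) for s in scores)
            p = [math.exp(s) / z for s in scores]
            zq = sum(pi * reward(d, y) for pi, d in zip(p, derivs))
            if zq == 0:
                continue  # no correct derivation in the beam; skip example
            q = [pi * reward(d, y) / zq for pi, d in zip(p, derivs)]
            # theta += eta * (E_q[phi] - E_p[phi])
            for pi, qi, d in zip(p, q, derivs):
                for f, v in features(x, d).items():
                    theta[f] += eta * (qi - pi) * v
    return theta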
Training (structured perceptron)
Input: {xi, yi}, i = 1..n
Output: θ

θ ← 0
for iteration τ and example i:
  d̂ ← argmax(pθ(d | xi))
  d* ← argmax(qθ(d | xi))
  if [d*]K ≠ [d̂]K:
    θ ← θ + φ(xi, d*) − φ(xi, d̂)

Regularization often added with weight averaging
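A minimal sketch of the perceptron update, with features and denotation as hypothetical stand-ins for φ and [·]K:

def perceptron_update(theta, x, d_hat, d_star, features, denotation):
    # Update only when the model's best derivation d_hat and the best
    # correct derivation d_star disagree on their denotations.
    if denotation(d_star) != denotation(d_hat):
        for f, v in features(x, d_star).items():
            theta[f] += v
        for f, v in features(x, d_hat).items():
            theta[f] -= v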
Training
Other simple variants exist:
• E.g., cost-sensitive max-margin training
• That is, find pairs of good and bad derivations that look different but have similar scores and update on those
Outline
• Learning
– Overview
– Details
– Example
– Lexicon learning
– Supervision signals
Example
./run @mode=simple-lambdadcs \
  -Grammar.inPaths esslli_2016/class3_demo.grammar \
  -SimpleLexicon.inPaths esslli_2016/class3_demo.lexicon
(loadgraph geo880/geo880.kg)
size of california
size capital california
size of capital of california
california size
Exercise
Find a pair of natural language utterances that cannot be distinguished using the current feature representation
• The utterances need not be fully grammatical English
• You can ignore the denotation feature if that helps
• Verify this in sempre (ask me how to disable features)
• Design a feature that will solve this problem
Outline
• Learning
– Overview
– Details
– Example
– Lexicon learning
– Supervision signals
The lexicon problem
How is the lexicon generated?
• Annotation
• Exhaustive search
• String matching
• Supervised alignment
• Unsupervised alignment
• Learning
Training
Input: {xi, yi}, i = 1..n
Output: θ

θ ← 0
for iteration τ and example i:
  Add lexicon entries
  D(xi) ← argmaxK(pθ(d | xi))
  θ ← θ + ητ,i (E_{qθ(d|xi)}[φ(xi, d)] − E_{pθ(d|xi)}[φ(xi, d)])

ητ,i: learning rate
Regularization often added (L2, L1, ...)
Adding lexicon entries
Input: training example (xi, yi), current lexicon Λ, model θ
Λtemp ← Λ ∪ GENLEX(xi, yi)                      Create expanded temporary lexicon
D(xi) ← argmaxK(pθ,Λtemp(d | xi))               Parse with temporary lexicon
Λ ← Λ ∪ {l | l ∈ d̂, d̂ ∈ D(xi), R(d̂) = 1}       Add entries from correct trees
Overgenerate lexical entries and add promising ones
[Adapted from semantic parsing tutorial, Artzi et al.]
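A minimal sketch of this expansion step, assuming hypothetical helpers genlex(x, y), parse_top_k(x, theta, lexicon), and reward(d, y), with lexicons as sets:

def expand_lexicon(lexicon, x, y, theta, genlex, parse_top_k, reward):
    temp = lexicon | genlex(x, y)           # overgenerate candidate entries
    derivs = parse_top_k(x, theta, temp)    # parse with the temporary lexicon
    for d in derivs:
        if reward(d, y) == 1:               # keep entries from correct trees
            lexicon |= set(d.lexical_entries)
    return lexicon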
Lexicon generation
Logical form supervision:
Largest state bordering California
argmax(Type(State) ⊓ Border(California), Area)

Enumerate spans:
  Largest
  state
  bordering
  California
  Largest state
  ...
Use rules to extract sub-formulas:
  California
  Border(California)
  λf.f(California)
  Area
  λx.argmax(x, Area)
  ...
Add cross-product to lexicon
[Zettlemoyer and Collins, 2005]
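A minimal sketch of the cross-product construction (the actual GENLEX restricts the pairs with category and typing rules; max_len is an arbitrary choice here):

from itertools import product

def enumerate_spans(tokens, max_len=3):
    return [" ".join(tokens[i:j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1))]

tokens = "Largest state bordering California".split()
subformulas = ["California", "Border(California)", "λf.f(California)",
               "Area", "λx.argmax(x, Area)"]
candidate_entries = set(product(enumerate_spans(tokens), subformulas))
# |spans| x |sub-formulas| candidate lexical entries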
Lexicon generation
Denotation supervision:
Largest state bordering California → Arizona

Enumerate spans:
  Largest
  state
  bordering
  California
  Largest state
  ...
Generate sub-formulas from KB:
  California
  Border(California)
  Traverse
  Type.Mountain
  λx.argmax(x, Elevation)
  ...
Restrict candidates with alignment, string matching, ...
Fancier methods exist (coarse-to-fine)
Unification
Logical form supervision:
Initialize lexicon with (xi, zi):
  States bordering California
  Type(State) ⊓ Border(California)
Split lexical entry in all possible ways:
Enumerate spans:
  (states, bordering california)
  (states bordering, california)
Generate sub-formulas:
  (Type(State), λx.x ⊓ Border(California))
  (λx.Type(State) ⊓ x, Border(California))
  (λf.Type(State) ⊓ f(California), California)
  ...
[Kwiatkowski et al., 2010]
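The span half of the splitting is easy to write down; a minimal sketch (the logical-form half, pairing each span split with type-consistent formula splits, is the hard part and is omitted):

def split_spans(tokens):
    # all ways to cut the utterance into two contiguous parts
    return [(" ".join(tokens[:i]), " ".join(tokens[i:]))
            for i in range(1, len(tokens))]

print(split_spans("states bordering california".split()))
# [('states', 'bordering california'), ('states bordering', 'california')]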
Unification
For example (xi, zi):
Find the highest-scoring correct parse d*
Split all lexical entries in d* in all possible ways
Add to the lexicon the lexical entry that most improves the parse score
[Kwiatkowski et al., 2010]
Do we need a lexicon?
Utterance: California neighbors
Type(State) ⊓ Border(California)
├─ Type(State) ← Type, State          (floating: not anchored to any words)
└─ Border(California) ← Border, California
Floating parse tree: generalization of bridging
Perhaps with better learning and search a lexicon is not necessary?
Outline
• Learning
– Overview
– Details
– Example
– Lexicon learning
– Supervision signals
Supervision signals
We discussed training from logical forms and denotations
Other forms of supervision have been proposed:
• Demonstrations
• Distant supervision
• Conversations
• Unsupervised
• Paraphrasing
Training from demonstrations
Input: (xi, si, ti)
xi: utterance
si: start state
ti: end state
move forward until you reach the intersection
λa.move(a) ∧ dir(a.forward) ...
An instance of learning from denotations
[Artzi and Zettlemoyer, 2013]
Distant supervision
Data generation:
• Decompose declarative text into questions and answers
James Cameron is the director of Titanic
Q: X is the director of Titanic   A: James Cameron
Declarative text is cheap!
[Reddy et al., 2014]
Distant supervision
Training:
• Use existing non-executable semantic parsers
X is the director of Titanic
λx.director(x) ∧ director.of.arg1(e, x) ∧ director.of.arg2(e, Titanic)
Candidate KB groundings and their denotations:
λx.Director(x) ∧ FilmDirectedBy(e, x) ∧ FilmDirected(e, Titanic) ⇒ James Cameron (true)
λx.Producer(x) ∧ FilmProducedBy(e, x) ∧ FilmProduced(e, Titanic) ⇒ James Cameron, Jon Landau (false)
[Reddy et al., 2014]
Training from conversations
z1: From(Atlanta) ⊓ To(London)
z2: From(Atlanta) ⊓ From(London)
z3: To(Atlanta) ⊓ To(London)
z4: To(Atlanta) ⊓ From(London)
Define loss:
• Does z align with the conversation?
• Does z obey domain constraints?
Unsupervised learning
Intuition: assume repeating patterns are correct
Input: {xi}, i = 1..n
Output: θ

θ initialized manually, S = ∅
Until stopping criterion met:
  for example xi:
    S ← S ∪ {(xi, argmax pθ(d | xi))}
  Compute statistics of S
  Sconf ← find confident subset
  Train on Sconf

Substantially lower performance
[Goldwasser et al., 2011]
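A minimal sketch of this self-training loop, with parse_best, confidence, and train as hypothetical stand-ins (the 0.9 threshold is likewise made up):

def unsupervised_train(xs, theta, parse_best, confidence, train, rounds=10):
    for _ in range(rounds):
        # self-label every utterance with the current best derivation
        S = [(x, parse_best(x, theta)) for x in xs]
        # keep only predictions that corpus-level statistics make confident
        S_conf = [pair for pair in S if confidence(pair, S) > 0.9]
        theta = train(theta, S_conf)
    return theta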
Paraphrasing
Input utterance: What languages do people in Brazil use?
Generate candidate canonical utterances, each paired with a logical form:
  What language is the language of Brazil? → Type.HumanLanguage ⊓ LanguagesSpoken.Brazil
  What city is the capital of Brazil? → CapitalOf.Brazil
  ...
Score each candidate against the input with a paraphrase model; executing the best logical form yields: Portuguese, ...

Model: pθ(c, z | x) = p(z | x) × pθ(c | x)
Idea: train a large paraphrase model pθ(c | x)
More later
[B. and Liang, 2014]
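A minimal sketch of ranking by paraphrase score; toy_score is a crude word-overlap stand-in for the learned model pθ(c | x):

def best_logical_form(x, candidates, score):
    # candidates: (canonical utterance c, logical form z) pairs
    c, z = max(candidates, key=lambda cz: score(x, cz[0]))
    return z

def toy_score(x, c):
    norm = lambda s: {w.strip("?s") for w in s.lower().split()}
    return len(norm(x) & norm(c))

candidates = [
    ("What language is the language of Brazil?",
     "Type.HumanLanguage ⊓ LanguagesSpoken.Brazil"),
    ("What city is the capital of Brazil?", "CapitalOf.Brazil"),
]
print(best_logical_form("What languages do people in Brazil use?",
                        candidates, toy_score))
# Type.HumanLanguage ⊓ LanguagesSpoken.Brazil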
Summary
We saw how to train from denotations and logical forms
We saw methods for inducing lexicons during training
We reviewed work on using even weaker forms of supervision
Still an open problem