
Page 1: Structured Prediction with Perceptron: Theory and Algorithms

Kai Zhao

Dept. Computer Science

Graduate Center, the City University of New York

Structured Prediction with Perceptron: Theory and Algorithms

[Title-slide figure: three running examples — binary classification (y = -1 vs. y = +1), POS tagging ("the man bit the dog" → DT NN VBD DT NN), and translation ("the man bit the dog" → 那人咬了狗)]

Page 2: What is Structured Prediction?

• binary classification: output is binary

• multiclass classification: output is one of a (small) number of classes

• structured classification: output is a structure (sequence, tree, graph)

  • part-of-speech tagging, parsing, summarization, translation

  • exponentially many classes: search (inference) efficiency is crucial!

[Figure: four instances of x → y — binary classification (y = -1 vs. +1); POS tagging "the man bit the dog" → DT NN VBD DT NN; constituency parsing "the man bit the dog" → (S (NP (DT the) (NN man)) (VP (VB bit) (NP (DT the) (NN dog)))); translation "the man bit the dog" → 那人咬了狗]

NLP is all about structured prediction!
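The exponential blow-up mentioned above can be made concrete: for sequence labeling with T tags over a length-n sentence there are T^n candidate outputs. A tiny illustrative enumeration (toy tag set, not from the slides):

```python
from itertools import product

tags = ["DT", "NN", "VBD"]                 # toy tag set: T = 3
sentence = "the man bit the dog".split()   # n = 5 words

# binary classification: 2 outputs; multiclass: a small constant;
# structured: one tag per word -> T**n candidate tag sequences
candidates = list(product(tags, repeat=len(sentence)))
# 3**5 = 243 full tag sequences for a 5-word sentence —
# far too many to enumerate for realistic T and n, hence search
```

This is why inference must be organized as search (e.g. dynamic programming) rather than enumeration.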

Page 3: Learning: Unstructured vs. Structured

                    binary/multiclass                 structured learning
  generative        naive Bayes (count & divide)      HMMs (count & divide)
  conditional       logistic regression (maxent)      CRFs (expectations)
  online + Viterbi  perceptron (argmax)               structured perceptron (argmax)
  max-margin        SVM                               structured SVM (loss-augmented argmax)

Page 4: Why Perceptron (Online Learning)?

• because we want scalability on big data!

  • learning time has to be linear in the number of examples

  • can make only a constant number of passes over the training data

  • only online learning (perceptron/MIRA) can guarantee this!

    • SVM training scales between O(n^2) and O(n^3); CRFs offer no guarantee

  • and inference on each example must be super fast

• another advantage of perceptron: it only needs an argmax

Page 5: Perceptron: from binary to structured

All three variants share the same loop — run exact inference under the current weights w, and update the weights if the prediction z differs from the correct label y:

• binary perceptron (Rosenblatt, 1959): 2 classes; exact inference is trivial

• multiclass perceptron (Freund/Schapire, 1999): constant number of classes; exact inference is easy

• structured perceptron (Collins, 2002): exponential number of classes; exact inference is hard
  (e.g. "the man bit the dog" → 那人咬了狗)

Page 6: Scalability Challenges

• inference (on one example) is too slow (even with DP)

  • can we sacrifice search exactness for faster learning?

  • would inexact search interfere with learning?

  • if so, how should we modify learning?

Page 7: Outline

• Overview of Structured Learning

• Challenges in Scalability

• Structured Perceptron

  • convergence proof

• Structured Perceptron with Inexact Search

• Latent-Variable Structured Perceptron

Page 8: Generic Perceptron

• perceptron is the simplest machine learning algorithm

• online learning: one example at a time — "learning by doing"

  • find the best output under the current weights

  • update weights at mistakes

[Figure: the online learning loop — x_i → inference (with weights w) → z_i; compare z_i with the gold y_i → update weights → w]
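The loop in this slide can be sketched generically; `features` and `argmax_infer` are hypothetical placeholders for the task-specific feature map and inference routine:

```python
def perceptron_train(data, features, argmax_infer, epochs=5):
    """Generic perceptron: one example at a time,
    update the sparse weight vector only on mistakes."""
    w = {}  # feature -> weight
    for _ in range(epochs):
        for x, y in data:
            z = argmax_infer(x, w)  # best output under current weights
            if z != y:              # mistake: reward y, penalize z
                for f, v in features(x, y).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in features(x, z).items():
                    w[f] = w.get(f, 0.0) - v
    return w
```

For multiclass, `argmax_infer` enumerates the labels; for structured outputs it is a search procedure such as Viterbi.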

Page 9: Structured Perceptron

[Figure: the same loop instantiated for tagging — x_i = "the man bit the dog"; inference outputs z_i = DT NN NN DT NN; compare with gold y_i = DT NN VBD DT NN; update weights]

Page 10: Example: POS Tagging

• gold-standard y:  DT NN VBD DT NN   ("the man bit the dog")

• current output z: DT NN NN DT NN    ("the man bit the dog")

• assume only two feature classes

  • tag bigrams t_{i-1} t_i

  • word/tag pairs w_i / t_i

• weights ++: (NN, VBD)  (VBD, DT)  (VBD → bit)

• weights --: (NN, NN)  (NN, DT)  (NN → bit)

[Figure: feature vectors Φ(x, y) and Φ(x, z) for the two taggings above]
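The ++/-- feature lists above fall directly out of the update w ← w + Φ(x, y) − Φ(x, z); a minimal sketch using the slide's two feature classes (tag bigrams and word/tag pairs):

```python
from collections import Counter

def phi(words, tags):
    """Tag-bigram and word/tag features for one tagged sentence."""
    feats = Counter()
    for i, (w, t) in enumerate(zip(words, tags)):
        feats[("bigram", tags[i - 1] if i else "<s>", t)] += 1
        feats[("word", w, t)] += 1
    return feats

x = "the man bit the dog".split()
y = ["DT", "NN", "VBD", "DT", "NN"]   # gold standard
z = ["DT", "NN", "NN", "DT", "NN"]    # current (incorrect) output

update = Counter(phi(x, y))
update.subtract(phi(x, z))            # signed difference Φ(x, y) − Φ(x, z)
plus = {f for f, v in update.items() if v > 0}    # weights ++
minus = {f for f, v in update.items() if v < 0}   # weights --
```

Only the features where y and z disagree survive the subtraction — exactly the ++/-- lists on the slide.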

Page 11: Inference: Dynamic Programming

[Figure: the perceptron loop with exact inference — update weights w if y ≠ z]

exact DP inference: tagging O(nT^3); CKY parsing O(n^3)
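A minimal Viterbi sketch for the tagging case (bigram tag features only, so O(nT^2) here; the slide's O(nT^3) corresponds to trigram tag features). The `score` function is a hypothetical placeholder for w · Φ on one local decision:

```python
def viterbi(words, tags, score):
    """score(prev_tag, tag, word) -> local score.
    Returns the highest-scoring tag sequence by dynamic programming."""
    # best[t] = (score, sequence) of the best path ending in tag t
    best = {t: (score("<s>", t, words[0]), [t]) for t in tags}
    for w in words[1:]:
        best = {
            t: max(
                ((s + score(p, t, w), seq + [t]) for p, (s, seq) in best.items()),
                key=lambda c: c[0],
            )
            for t in tags
        }
    return max(best.values(), key=lambda c: c[0])[1]
```

Each position considers T predecessor tags for each of T current tags, giving the O(nT^2) cost of the bigram case.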

Page 12: Perceptron vs. CRFs

• perceptron is the online and Viterbi approximation of the CRF

• simpler to code; faster to converge; ~same accuracy

[Figure: CRFs (Lafferty et al 2001) —online→ stochastic gradient descent (SGD); —Viterbi→ hard/Viterbi CRFs; applying both gives the structured perceptron (Collins, 2002)]

perceptron (Viterbi): for each (x, y) ∈ D, update against
  ẑ = argmax_{z ∈ GEN(x)} w · Φ(x, z)

CRF (expectations): maximize  Σ_{(x,y) ∈ D} log [ exp(w · Φ(x, y)) / Z(x) ],
  where Z(x) = Σ_{z ∈ GEN(x)} exp(w · Φ(x, z))
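The "Viterbi approximation" can be seen numerically: the CRF gradient for one example is Φ(x, y) − E_z[Φ(x, z)], and the perceptron replaces the expectation with the features of the single argmax. A toy three-class check (one-hot features, illustrative numbers only):

```python
import math

def phi(z):
    """One-hot feature vector over 3 classes."""
    f = [0.0, 0.0, 0.0]
    f[z] = 1.0
    return f

w = [0.2, 0.5, 0.1]   # toy weights
y = 0                 # gold class

score = lambda z: sum(wi * fi for wi, fi in zip(w, phi(z)))
Z = sum(math.exp(score(z)) for z in range(3))           # partition function
expect = [sum(math.exp(score(z)) / Z * phi(z)[i] for z in range(3))
          for i in range(3)]                            # E_z[Φ(x, z)]

crf_grad = [phi(y)[i] - expect[i] for i in range(3)]    # CRF: expectations
zhat = max(range(3), key=score)                         # Viterbi 1-best
perc_update = [phi(y)[i] - phi(zhat)[i] for i in range(3)]  # perceptron: argmax
```

Both updates push toward the gold class and away from strong competitors; the perceptron simply puts all the negative mass on the single highest-scoring class.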

Page 13: Perceptron Convergence Proof

• binary classification: converges iff the data is separable

• structured prediction: converges iff the data is separable

  • there is an oracle vector that correctly labels all examples

  • one vs. the rest (correct label scores better than all incorrect labels)

• theorem: if separable with margin δ, then # of updates ≤ R^2/δ^2  (R: diameter)

[Figure: separable binary data with margin δ and diameter R; the structured analogue separates y_100 from every z ≠ y_100 for example x_100]

Novikoff (1962) ⇒ Freund & Schapire (1999) ⇒ Collins (2002)

Page 14: Geometry of Convergence Proof (part 1: lower bound)

[Figure: perceptron update w^(k+1) = w^(k) + ΔΦ(x, y, z), where ΔΦ(x, y, z) = Φ(x, y) − Φ(x, z) and z is the exact 1-best under the current model]

• by separation, the unit oracle vector u has margin δ: u · ΔΦ(x, y, z) ≥ δ for every update

• so each update grows the projection of w onto u by at least δ

• by induction: ‖w^(k+1)‖ ≥ u · w^(k+1) ≥ kδ

Page 15: Geometry of Convergence Proof (part 2: upper bound)

[Figure: the update ΔΦ(x, y, z) makes an angle < 90˚ with −w^(k), because z is a violation: the incorrect label scored higher than the correct label under w^(k)]

• perceptron update: w^(k+1) = w^(k) + ΔΦ(x, y, z), with z the exact 1-best

• violation: w^(k) · ΔΦ(x, y, z) ≤ 0; diameter: ‖ΔΦ(x, y, z)‖^2 ≤ R^2 (R: max diameter)

• by induction: ‖w^(k+1)‖^2 ≤ kR^2, i.e. ‖w^(k+1)‖ ≤ √k R

• combine with part 1 (‖w^(k+1)‖ ≥ kδ): kδ ≤ √k R, so the # of updates k ≤ R^2/δ^2

summary: the proof uses 3 facts:
1. separation (margin)
2. diameter (always finite)
3. violation (guaranteed by exact search)
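The bound k ≤ R^2/δ^2 can be checked empirically on separable binary data. A hypothetical toy experiment (synthetic data, oracle u = (1, 0), margin carved out by construction):

```python
import random

random.seed(0)
# separable 2-D toy data: the oracle u = (1, 0) labels by sign(x0),
# and points within 0.2 of the boundary are excluded (margin >= 0.2)
data = []
while len(data) < 200:
    x0, x1 = random.uniform(-1, 1), random.uniform(-1, 1)
    if abs(x0) >= 0.2:
        data.append(((x0, x1), 1 if x0 > 0 else -1))

w, updates = [0.0, 0.0], 0
changed = True
while changed:                        # run until one full error-free pass
    changed = False
    for (x0, x1), y in data:
        if y * (w[0] * x0 + w[1] * x1) <= 0:   # mistake (or tie)
            w = [w[0] + y * x0, w[1] + y * x1]
            updates += 1
            changed = True

R2 = max(x0 * x0 + x1 * x1 for (x0, x1), _ in data)  # squared diameter R^2
delta = min(y * x0 for (x0, x1), y in data)          # margin w.r.t. u = (1, 0)
# Novikoff: the total number of updates never exceeds R^2 / delta^2
```

The loop is guaranteed to terminate precisely because the data is separable, which is the content of the theorem.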

Page 16: Outline

• Overview of Structured Learning

• Challenges in Scalability

• Structured Perceptron

  • convergence proof

• Structured Perceptron with Inexact Search

• Latent-Variable Perceptron

Page 17: Scalability Challenge 1: Inference

• challenge: search efficiency (exponentially many classes)

  • often use dynamic programming (DP)

  • but DP is still too slow for repeated use, e.g. parsing O(n^3)

• Q: can we sacrifice search exactness for faster learning?

[Figure: binary classification (constant # of classes; trivial exact inference) vs. structured classification (exponential # of classes; hard exact inference), both running the loop "update weights if y ≠ z"]

Page 18: Perceptron w/ Inexact Inference

• inexact inference (e.g. beam search, greedy search) is in routine use in NLP

• how does structured perceptron work with inexact search?

  • so far, most structured learning theory assumes exact search

  • would search errors break these learning properties?

  • if so, how do we modify learning to accommodate inexact search?

[Figure: the perceptron loop with inexact inference (beam search / greedy search); Q: does perceptron still work???]

Page 19: Bad News and Good News

• bad news: no more guarantee of convergence

  • in practice perceptron degrades a lot due to search errors

• good news: new update methods guarantee convergence

  • new perceptron variants that "live with" search errors

  • in practice they work really well with inexact search

[Figure: the same loop with beam/greedy search; A: it no longer works as is, but we can make it work by some magic.]

Page 20: Convergence with Exact Search

[Figure: training example "time flies" → N V; output space {N, V} × {N, V}; with exact search, the update moves the current model w^(k) to w^(k+1), toward the correct label — structured perceptron converges with exact search]

Page 21: No Convergence w/ Greedy Search

[Figure: the same "time flies" example; under greedy search the chosen update ΔΦ(x, y, z) can move w^(k) to a w^(k+1) that is no better — structured perceptron does not converge with greedy search]

Which part of the convergence proof no longer holds? The proof only uses 3 facts:
1. separation (margin)
2. diameter (always finite)
3. violation (guaranteed by exact search)

Page 22: Geometry of Convergence Proof (part 2, revisited)

[Figure: with an inexact 1-best z, the update ΔΦ(x, y, z) need not make an angle < 90˚ with −w^(k), because the "violation" condition (incorrect label scored higher) may not hold — inexact search doesn't guarantee violation!]

Page 23: Observation: Violation is all we need!

• exact search is not really required by the proof

• rather, it is only used to ensure violation!

[Figure: any z in the violation region — incorrect label scored higher than the correct one — yields a valid update ΔΦ(x, y, z); the exact 1-best is just one such z; R: max diameter]

the proof only uses 3 facts:
1. separation (margin)
2. diameter (always finite)
3. violation (but no need for exact search)

Page 24: Violation-Fixing Perceptron

• if we guarantee violation, we don't care about exactness!

• violation is good because we can at least fix a mistake

• same mistake bound as before!

[Figure: standard perceptron (exact inference; update if y ≠ z) vs. violation-fixing perceptron (find any violation y′, z; update if y′ ≠ z); the violations form a subset of all possible updates]
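The defining check, in sketch form: an update on a pair (y, z) is violation-fixing only if the incorrect z scores at least as high as the correct y under the current model (`phi` is a hypothetical feature map returning a sparse dict):

```python
def is_violation(w, phi, x, y, z):
    """True iff the incorrect output z scores >= the correct y
    under weights w — i.e. updating on (y, z) fixes a real mistake."""
    dot = lambda feats: sum(w.get(f, 0.0) * v for f, v in feats.items())
    return z != y and dot(phi(x, z)) >= dot(phi(x, y))
```

The convergence proof only requires that every update pass this check; how z was found (exact, beam, greedy) is irrelevant.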

Page 25: What if we can't guarantee violation?

• this is why perceptron doesn't work well with inexact search

  • because not every update is guaranteed to be a violation

  • thus the proof breaks; no convergence guarantee

• example: beam or greedy search

  • the model might prefer the correct label (under exact search)

  • but the search prunes it away

• such a non-violation update is "bad" because it doesn't fix any mistake

  • the new model still misguides the search

• Q: how can we always guarantee violation?

[Figure: a "bad" (non-violation) update under the current model, with the correct label pruned from the beam]

Page 26: Solution 1: Early Update (Collins/Roark 2004)

[Figure: the "time flies" example (output space {N, V} × {N, V}) again; the standard perceptron does not converge with greedy search, but the early update — stop and update at the first mistake — moves w^(k) to a valid new model w^(k+1)]

Page 27: Early Update: Guarantees Violation

[Figure: the standard update doesn't converge because it doesn't guarantee violation; the early update stops at the first mistake, where the incorrect prefix scores higher than the correct prefix: a violation!]

Page 28: Early Update: from Greedy to Beam

• beam search is a generalization of greedy search (which is beam search with b = 1)

  • at each stage we keep the top b hypotheses

  • widely used: tagging, parsing, translation...

• early update: update when the correct label first falls off the beam

  • up to this point the incorrect prefix scores higher — violation guaranteed

• standard update (full update): no guarantee!

[Figure: model score along the search; the correct prefix falls off the beam (pruned) and the early update fires there, while the standard update at the end of the search has no violation guarantee]
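A beam-search tagger with early update can be sketched as follows; `score(prefix, tag, word)` is a hypothetical placeholder for the model score of extending a prefix:

```python
def beam_search_early_update(words, gold, tags, score, b=2):
    """Beam-search tagging with early update: return the
    (gold prefix, best incorrect prefix) pair as soon as the
    gold prefix falls off the beam; None means no update needed."""
    beam = [((), 0.0)]  # (tag sequence, cumulative score)
    for i, word in enumerate(words):
        cands = [(seq + (t,), s + score(seq, t, word))
                 for seq, s in beam for t in tags]
        beam = sorted(cands, key=lambda c: -c[1])[:b]   # keep top b
        gold_prefix = tuple(gold[:i + 1])
        if all(seq != gold_prefix for seq, _ in beam):
            return gold_prefix, beam[0][0]              # early update pair
    best = beam[0][0]
    return None if best == tuple(gold) else (tuple(gold), best)
```

When the early update fires, the returned incorrect prefix is in the beam while the gold prefix is not, so the incorrect prefix scores at least as high: the violation is guaranteed by construction.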

Page 29: Solution 2: Max-Violation (Huang et al 2012)

• we have now established a theory for early update (Collins/Roark)

  • but it learns too slowly due to partial updates

• max-violation: update on the prefix where the violation is maximum

  • the "worst mistake" in the search space

• all of these update methods are violation-fixing perceptrons

[Figure: model score along the beam search, with the candidate update points marked: early, max-violation, and standard (bad!)]
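Choosing the max-violation point reduces to a scan over prefix lengths, given the gold-prefix score and the best-in-beam score at each step (a minimal sketch with hypothetical score lists):

```python
def max_violation_index(gold_scores, beam_best_scores):
    """gold_scores[i]      = model score of the gold prefix of length i+1
    beam_best_scores[i] = score of the best prefix in the beam at step i+1.
    Returns (index, size) of the largest violation."""
    violations = [b - g for g, b in zip(gold_scores, beam_best_scores)]
    i = max(range(len(violations)), key=violations.__getitem__)
    return i, violations[i]
```

The update then uses the gold prefix and the beam's best prefix at that index, instead of the first violation (early update) or the final one (standard update).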

Page 30: Four Experiments

• part-of-speech tagging: "the man bit the dog" → DT NN VBD DT NN

• incremental parsing: "the man bit the dog" → dependency tree

• bottom-up parsing w/ cube pruning: "the man bit the dog" → dependency tree

• machine translation: "the man bit the dog" → 那人咬了狗

Page 31: Max-Violation > Early >> Standard

• exp 1 on part-of-speech tagging w/ beam search (on CTB5)

• early and max-violation >> standard update at the smallest beams

• this advantage shrinks as beam size increases

• max-violation converges faster than early (and slightly better)

[Plots: tagging accuracy on held-out (92-94) vs. training time in hours at beam=1 (greedy), and best accuracy vs. beam size (1-7); curves for max-violation, early, and standard]

Page 32: Max-Violation > Early >> Standard (cont'd)

• exp 1 on part-of-speech tagging w/ beam search (on CTB5)

• early and max-violation >> standard update at the smallest beams

• this advantage shrinks as beam size increases

• max-violation converges faster than early (and slightly better)

[Plots: tagging accuracy on held-out (92-94) vs. training time in hours at beam=2, and best accuracy vs. beam size (1-7); curves for max-violation, early, and standard]

Page 33: Max-Violation > Early >> Standard

• exp 2 on an incremental dependency parser (Huang & Sagae 10)

• standard update is horrible due to search errors (79.0, omitted from the zoomed plot)

• early update: 38 iterations, 15.4 hours (92.24)

• max-violation: 10 iterations, 4.6 hours (92.25) — 3.3x faster than early update

[Plots: parsing accuracy on held-out vs. training time in hours at beam=8; full range 78-92 and zoomed range 91-92.25; curves for max-violation, early, and standard]

Page 34: Why is the standard update so bad for parsing?

• standard update works horribly under severe search errors

• due to a large number of invalid updates (non-violations)

[Plots: % of invalid updates in the standard update vs. beam size (2-16), with parsing far higher than tagging; held-out accuracy vs. training time for parsing (b=8, 78-92) and tagging (b=1, 91-94)]

search complexity with beams: tagging O(nT^3) ⇒ O(nb); parsing O(n^11) ⇒ O(nb)

Page 35: Exp 3: Bottom-up Parsing

• CKY parsing with cube pruning for higher-order features

• we extended our framework from graphs to hypergraphs (Zhang et al 2013)

[Plot: UAS on Penn-YM dev (92-94) vs. epochs (1-8); curves for s-max, p-max, skip, and standard]

Page 36: Exp 4: Machine Translation

• standard perceptron works poorly for machine translation

  • because the ratio of invalid updates is very high (search quality is low)

• max-violation converges faster than early update

• first truly successful effort in large-scale training for translation (Yu et al 2013)

[Plots: ratio of invalid updates (50%-90%) vs. beam size (2-30) for the standard perceptron with non-local features; BLEU (15-26) vs. number of iterations (2-20) for max-violation, early, standard, and local]

Page 37: Comparison of Four Exps

• the harder your search, the more advantageous violation-fixing updates are

[Plots: tagging (b=1) accuracy vs. training time; incremental parsing (b=8) accuracy vs. training time; bottom-up parsing UAS on Penn-YM dev vs. epochs; machine translation BLEU vs. number of iterations — in each, max-violation and early beat standard, with the gap largest for MT]

Page 38: Outline

• Overview of Structured Learning

• Challenges in Scalability

• Structured Perceptron

  • convergence proof

• Structured Perceptron with Inexact Search

• Latent-Variable Perceptron

Page 39: Learning with Latent Variables

• aka "weakly-supervised" or "partially-observed" learning

• learning from "natural annotations"; more scalable

• examples: translation, transliteration, semantic parsing...

[Figure: three settings with latent derivations —
 parallel text: "Bush talked with Sharon" ↔ 布什与沙龙会谈 (Liang et al 2006; Yu et al 2013; Xiao and Xiong 2013);
 QA pairs: "What is the largest state?" → argmax(state, size) → Alaska (Clark et al 2010; Liang et al 2013; Kwiatkowski et al 2013);
 transliteration: コンピューター ↔ "ko n pyu: ta:" ↔ computer (Knight & Graehl, 1998; Kondrak et al 2007, etc.)]

Page 40: Learning Latent Structures

                 binary/multiclass              structured learning     latent structures
  generative     naive Bayes                    HMMs                    EM (forward-backward)
  conditional    logistic regression (maxent)   CRFs                    latent CRFs
  perceptron     perceptron                     structured perceptron   latent perceptron
  max-margin     SVM                            structured SVM          latent structured SVM

Page 41: Latent Structured Perceptron

• no explicit positive signal

• hallucinate the "correct" derivation by the current weights (Liang et al 2006; Yu et al 2013)

during online learning, for training example x = 那人咬了狗, y = "the man bit the dog":

• d̂: highest-scoring derivation in the full search space — penalize it when its output is wrong (e.g. z = "the dog bit the man" ≠ y)

• d*: highest-scoring gold derivation, restricted to the forced-decoding space — reward it as the correct one

update: w ← w + Φ(x, d*) − Φ(x, d̂)

[Figure: the forced-decoding space nested inside the full search space, with a beam over the full space]
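The update on this slide can be sketched directly; `phi`, `derivations`, and `produces_ref` are hypothetical placeholders for the feature map, the candidate derivations found by search, and the reference check:

```python
def latent_perceptron_update(w, phi, x, derivations, produces_ref):
    """One latent-variable perceptron step: reward the highest-scoring
    reference-producing ("gold") derivation d*, penalize the
    highest-scoring derivation d_hat overall — if d_hat is wrong."""
    dot = lambda d: sum(w.get(f, 0.0) * v for f, v in phi(x, d).items())
    gold = [d for d in derivations if produces_ref(d)]
    d_star = max(gold, key=dot)        # best gold derivation (forced decoding)
    d_hat = max(derivations, key=dot)  # best derivation overall
    if not produces_ref(d_hat):        # mistake: w <- w + Φ(x,d*) − Φ(x,d_hat)
        for f, v in phi(x, d_star).items():
            w[f] = w.get(f, 0.0) + v
        for f, v in phi(x, d_hat).items():
            w[f] = w.get(f, 0.0) - v
    return w
```

Both argmaxes depend on the current weights, which is exactly the "hallucinate the correct derivation by current weights" step.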

Page 42: Unconstrained Search

• example: beam-search phrase-based decoding of "Bushi yu Shalong juxing le huitan"

[Figure: decoder states as coverage bitvectors, from ●_ _ _ _ _ "Bush" through partial hypotheses like ●_ _●●● "... talks" / "... talk" / "... meeting" to completed states ●●●●●● "... Sharon" / "... Shalong"; which full translation is right? ???]

Page 43: Constrained Search

• forced decoding: the decoder must produce the exact reference translation

• "Bushi yu Shalong juxing le huitan" → "Bush held talks with Sharon"

[Figure: the gold derivation lattice — coverage bitvectors from _ _ _ _ _ _ (0 words covered) to ●●●●●● (6 words covered), emitting "Bush", "held", "talks with Sharon" (or "held talks with Sharon" in one step); one gold derivation highlighted]

Page 44: Search Errors in Decoding

• no explicit positive signal

• hallucinate the "correct" derivation by the current weights (Liang et al 2006; Yu et al 2013)

• problem: search errors — the beam over the full search space may prune every derivation in the forced-decoding space before the reward-d*/penalize-d̂ update can fire

[Figure: the forced-decoding space nested inside the full search space, with a beam that can miss it entirely]

Page 45: Search Error: Gold Derivations Pruned

[Figure: real decoding beam search over coverage bitvectors next to the gold derivation lattice; the beam can prune all lattice states mid-sentence — we should address search errors here!]

Page 46: Fixing Search Error 1: Early Update

• early update (Collins/Roark '04): update when the correct sequence falls off the beam

  • up to this point the incorrect prefix scores higher

  • that's a "violation" we want to fix; proof in (Huang et al 2012)

• standard perceptron does not guarantee violation

  • the correct sequence (pruned) might score higher at the end!

  • an "invalid" update, because it reinforces the model error

[Figure: model score along the search; the correct sequence falls off the beam (pruned); violation guaranteed up to that point; the standard update at the end has no guarantee]

Page 47: Early Update w/ Latent Variable

• the gold-standard derivations are not annotated

• we treat any reference-producing derivation as correct

• early update: stop decoding when all correct derivations fall off the beam

[Figure: beam search against the gold derivation lattice; violation guaranteed: the incorrect prefix scores higher up to the point where the last correct derivation is pruned]

Page 48: Fixing Search Error 2: Max-Violation

• early update works but learns slowly due to partial updates

• max-violation: update on the prefix where the violation is maximum

  • the "worst mistake" in the search space

• now extended to handle latent variables

[Figure: model score of the best and worst hypotheses in the beam vs. the correct sequences; once the correct sequences all fall off the beam, the standard (and local) updates become invalid; early and max-violation update points marked]

Page 49: Latent-Variable Perceptron

[Figure: the same schematic, traced for a single correct sequence — it falls off the beam; the biggest violation marks the max-violation update; "latest" is the last valid update point; the full (standard) update at the end is an invalid update!]

Page 50: Roadmap of Techniques

• structured perceptron (Collins, 2002)

• + latent variables: latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009)

• + inexact search: perceptron w/ inexact search (Collins & Roark, 2004; Huang et al 2012)

• + both: latent-variable perceptron w/ inexact search (Yu et al 2013; Zhao et al 2014)

  • applications: MT, syntactic parsing, semantic parsing, transliteration

Page 51: Experiments: Discriminative Training for MT

• standard update (Liang et al's "bold") works poorly

  • because the ratio of invalid updates is very high (search quality is low)

  • this explains why Liang et al '06 failed (std ~ "bold"; local ~ "local")

• max-violation converges faster than early update

[Plots: ratio of invalid updates (50%-90%) vs. beam size (2-30) for the standard latent-variable perceptron with non-local features; BLEU (17-26) vs. number of iterations (2-20) for MaxForce, MERT, early, local, and standard]

Page 52: Final Conclusions

• online structured learning is simple and powerful

• search efficiency is the key challenge

• search errors do interfere with learning

• but we can use violation-fixing perceptrons w/ inexact search

• we can extend perceptron to learn latent structures