
Kai Zhao

Dept. Computer Science

Graduate Center, the City University of New York

Structured Prediction with Perceptron: Theory and Algorithms

[Title-slide figures: binary classification (x → y = ±1); part-of-speech tagging ("the man bit the dog" → DT NN VBD DT NN); machine translation ("the man bit the dog" → 那人咬了狗)]

What is Structured Prediction?

• binary classification: output is binary
• multiclass classification: output is a (small) number
• structured classification: output is a structure (sequence, tree, graph)
  • part-of-speech tagging, parsing, summarization, translation
  • exponentially many classes: search (inference) efficiency is crucial!

[Figures: binary classification (x → y = ±1); POS tagging ("the man bit the dog" → DT NN VBD DT NN); constituency parsing ("the man bit the dog" → (S (NP (DT the) (NN man)) (VP (VB bit) (NP (DT the) (NN dog))))); translation ("the man bit the dog" → 那人咬了狗)]

NLP is all about structured prediction!

Learning: Unstructured vs. Structured

                                      binary/multiclass              structured learning
generative (count & divide):          naive Bayes                    HMMs
conditional (expectations):           logistic regression (maxent)   CRFs
online + Viterbi (argmax):            perceptron                     structured perceptron
max-margin (loss-augmented argmax):   SVM                            structured SVM

Why Perceptron (Online Learning)?

• because we want scalability on big data!
  • learning time has to be linear in the number of examples
  • can make only a constant number of passes over the training data
  • only online learning (perceptron/MIRA) can guarantee this!
  • SVM scales between O(n²) and O(n³); CRF has no such guarantee
• and inference on each example must be super fast
• another advantage of the perceptron: it just needs an argmax

Perceptron: from binary to structured

• binary perceptron (Rosenblatt, 1959): 2 classes; inference is trivial
• multiclass perceptron (Freund & Schapire, 1999): constant # of classes; inference is easy
• structured perceptron (Collins, 2002): exponential # of classes; exact inference is hard
• all three share the same loop: exact inference finds z for input x; update weights w if y ≠ z

Scalability Challenges

• inference (on one example) is too slow (even w/ DP)
• can we sacrifice search exactness for faster learning?
• would inexact search interfere with learning?
• if so, how should we modify learning?

Outline

• Overview of Structured Learning

• Challenges in Scalability

• Structured Perceptron

• convergence proof

• Structured Perceptron with Inexact Search

• Latent-Variable Structured Perceptron


Generic Perceptron

• the perceptron is the simplest machine learning algorithm
• online learning: one example at a time
  • learning by doing
  • find the best output under the current weights
  • update weights at mistakes (see the sketch below)

[Diagram: x_i → inference → z_i; compare z_i with y_i; update weights w]
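As a concrete picture of this loop, here is a minimal sketch in Python; the `inference` and `features` arguments are hypothetical placeholders for any argmax routine and feature map, not part of the original slides.

```python
def perceptron(train, inference, features, epochs=5):
    """Generic online perceptron: one example at a time, update at mistakes."""
    w = {}                                   # sparse weight vector
    for _ in range(epochs):
        for x, y in train:
            z = inference(x, w)              # best output under current weights
            if z != y:                       # mistake: reward y, penalize z
                for f, v in features(x, y).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in features(x, z).items():
                    w[f] = w.get(f, 0.0) - v
    return w
```

The same skeleton serves binary, multiclass, and structured prediction; only the `inference` routine changes.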

Structured Perceptron

[Diagram: the same loop over structures: x_i = "the man bit the dog" → inference → z_i = DT NN NN DT NN; compare with y_i = DT NN VBD DT NN; update weights w]

Example: POS Tagging

• gold-standard y: DT NN VBD DT NN  ("the man bit the dog")
• current output z: DT NN NN DT NN
• assume only two feature classes (a sketch follows below)
  • tag bigrams (t_{i-1}, t_i)
  • word/tag pairs (w_i, t_i)
• weights ++: (NN, VBD), (VBD, DT), (VBD → bit)
• weights --: (NN, NN), (NN, DT), (NN → bit)

the perceptron update is w ← w + Φ(x, y) − Φ(x, z)
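To make the two feature classes concrete, here is a sketch (feature encodings like `("bigram", ...)` are illustrative choices, not from the slides) that reproduces the ++/-- weights above:

```python
from collections import Counter

def phi(words, tags):
    """Phi(x, y): counts of tag bigrams (t_{i-1}, t_i) and word/tag pairs (w_i, t_i)."""
    feats = Counter()
    prev = "<s>"
    for word, tag in zip(words, tags):
        feats[("bigram", prev, tag)] += 1
        feats[("word/tag", word, tag)] += 1
        prev = tag
    return feats

x = "the man bit the dog".split()
y = "DT NN VBD DT NN".split()        # gold standard
z = "DT NN NN DT NN".split()         # current output

delta = Counter(phi(x, y))
delta.subtract(phi(x, z))            # signed update direction Phi(x, y) - Phi(x, z)
# nonzero entries: +1 on (NN, VBD), (VBD, DT), (bit, VBD);
#                  -1 on (NN, NN), (NN, DT), (bit, NN) -- the slide's ++/-- weights
```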

Inference: Dynamic Programming

[Diagram: the same perceptron loop, with the exact inference step implemented by dynamic programming]

tagging: O(nT³); CKY parsing: O(n³)  (a Viterbi sketch for tagging follows below)
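For tagging, the exact argmax is Viterbi; below is a sketch for a first-order (bigram) model over the features above, which runs in O(nT²) per sentence (the slide's O(nT³) is the trigram case). The `w` argument is the same sparse weight dictionary as before.

```python
def viterbi(words, tagset, w):
    """Exact inference: argmax over tag sequences under a bigram model."""
    def sc(prev, tag, word):                  # score of one transition + emission
        return w.get(("bigram", prev, tag), 0.0) + w.get(("word/tag", word, tag), 0.0)

    best = {"<s>": (0.0, [])}                 # last tag -> (score, best sequence)
    for word in words:
        best = {tag: max(((s + sc(p, tag, word), seq + [tag])
                          for p, (s, seq) in best.items()), key=lambda t: t[0])
                for tag in tagset}
    return max(best.values(), key=lambda t: t[0])[1]
```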

Perceptron vs. CRFs

• the perceptron is an online, Viterbi approximation of the CRF (Lafferty et al., 2001)
• simpler to code; faster to converge; ~same accuracy

[Diagram: starting from CRFs, going online gives stochastic gradient descent (SGD); going hard/Viterbi gives "Viterbi CRFs"; doing both gives the structured perceptron (Collins, 2002)]

perceptron decoding: for each (x, y) ∈ D,  z = argmax_{z ∈ GEN(x)} w · Φ(x, z)

CRF: maximize the conditional likelihood ∏_{(x,y) ∈ D} exp(w · Φ(x, y)) / Z(x),  where Z(x) = ∑_{z ∈ GEN(x)} exp(w · Φ(x, z))

Perceptron Convergence Proof

• binary classification: converges iff the data is separable
• structured prediction: converges iff the data is separable
  • there is an oracle vector that correctly labels all examples
  • one vs. the rest (the correct label scores better than all incorrect labels)
• theorem: if separable with margin δ, then the # of updates ≤ R²/δ²  (R: diameter)

[Figures: separable binary data (y = ±1) with margin δ and diameter R; the structured case, where y_100 is separated from every z ≠ y_100 for example x_100]

Novikoff (1962) ⇒ Freund & Schapire (1999) ⇒ Collins (2002)

Geometry of Convergence Proof, part 1: lower bound

perceptron update: w^(k+1) = w^(k) + ΔΦ(x, y, z), where ΔΦ(x, y, z) = Φ(x, y) − Φ(x, z) and z is the exact 1-best under the current model

separation: the unit oracle vector u makes margin δ with every update direction (the angle between u and ΔΦ(x, y, z) is < 90˚): u · ΔΦ(x, y, z) ≥ δ

so u · w^(k+1) = u · w^(k) + u · ΔΦ(x, y, z) ≥ u · w^(k) + δ, and by induction:

(part 1: lower bound)  ‖w^(k+1)‖ ≥ u · w^(k+1) ≥ kδ

Geometry of Convergence Proof, part 2: upper bound

violation: the incorrect label scored higher, i.e. w^(k) · ΔΦ(x, y, z) ≤ 0 (guaranteed by exact search, since z is the exact 1-best)

diameter: ‖ΔΦ(x, y, z)‖² ≤ R²  (R: max diameter, always finite)

so ‖w^(k+1)‖² = ‖w^(k)‖² + 2 w^(k) · ΔΦ(x, y, z) + ‖ΔΦ(x, y, z)‖² ≤ ‖w^(k)‖² + R², and by induction:

(part 2: upper bound)  ‖w^(k+1)‖² ≤ kR²

combine with part 1 (lower bound ‖w^(k+1)‖ ≥ kδ):  kδ ≤ ‖w^(k+1)‖ ≤ √k · R

bound on # of updates:  k ≤ R²/δ²

summary: the proof uses 3 facts:
1. separation (margin)
2. diameter (always finite)
3. violation (guaranteed by exact search)

Outline

• Overview of Structured Learning

• Challenges in Scalability

• Structured Perceptron

• convergence proof

• Structured Perceptron with Inexact Search

• Latent-Variable Perceptron


Scalability Challenge 1: Inference

• challenge: search efficiency (exponentially many classes)
  • often use dynamic programming (DP)
  • but DP is still too slow for repeated use, e.g. parsing is O(n³)
• Q: can we sacrifice search exactness for faster learning?

[Figures: binary classification (constant # of classes: inference is trivial) vs. structured classification (exponential # of classes: exact inference is hard)]

Perceptron w/ Inexact Inference

• routine use of inexact inference in NLP (e.g. beam search)
• how does the structured perceptron work with inexact search?
  • so far, most structured learning theory assumes exact search
  • would search errors break these learning properties?
  • if so, how do we modify learning to accommodate inexact search?

[Diagram: the perceptron loop with inexact inference (beam search / greedy search) replacing exact inference]

Q: does the perceptron still work???

Bad News and Good News

• bad news: no more guarantee of convergence
  • in practice the perceptron degrades a lot due to search errors
• good news: new update methods guarantee convergence
  • new perceptron variants that "live with" search errors
  • in practice they work really well w/ inexact search

A: it no longer works as is, but we can make it work by some magic.

Convergence with Exact Search

[Figure: training example "time flies" with gold tags N V; output space {N, V} × {N, V}; with exact search, the update moves the current model w^(k) to w^(k+1), toward the correct label]

structured perceptron converges with exact search

No Convergence w/ Greedy Search

[Figure: the same example; greedy search commits to V V, and the update ΔΦ(x, y, z) moves w^(k) to a new model w^(k+1) that still makes the mistake]

structured perceptron does not converge with greedy search

Which part of the convergence proof no longer holds? The proof only uses 3 facts:
1. separation (margin)
2. diameter (always finite)
3. violation (guaranteed by exact search)

Geometry of Convergence Proof, part 2, revisited

the upper bound needs the violation w^(k) · ΔΦ(x, y, z) ≤ 0, i.e. the incorrect label scored higher

exact search guarantees this (z is the exact 1-best), but an inexact 1-best z may score lower than the correct y:

inexact search doesn't guarantee violation!

Observation: Violation is all we need!

• exact search is not really required by the proof
• rather, it is only used to ensure violation!

[Figure: the set of all violations around the current model; any z with w^(k) · ΔΦ(x, y, z) ≤ 0 yields a valid update, the exact 1-best being just one of them]

the proof only uses 3 facts:
1. separation (margin)
2. diameter (always finite)
3. violation (but no need for exact search)

Violation-Fixing Perceptron

• if we guarantee violation, we don't care about exactness!
• violation is good b/c we can at least fix a mistake

standard perceptron: exact inference finds z; update weights if y ≠ z
violation-fixing perceptron: find any violation (y′, z); update weights if y′ ≠ z (a sketch follows below)

same mistake bound as before!

[Figure: all violations form a set of possible updates; any of them keeps the proof intact]
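A sketch of the violation-fixing wrapper; `find_violation` is a placeholder for any strategy that returns a violating pair (y′, z), such as the early-update and max-violation strategies introduced later.

```python
def dot(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def violation_fixing_update(w, x, y, phi, find_violation):
    """Update on any violation: a pair (yp, z) where z scores >= the correct yp."""
    yp, z = find_violation(x, y, w)          # e.g. gold prefix vs. beam prefix
    if yp == z:                              # no mistake found: no update
        return w
    assert dot(w, phi(x, z)) >= dot(w, phi(x, yp)), "update must fix a violation"
    for f, v in phi(x, yp).items():
        w[f] = w.get(f, 0.0) + v             # reward the correct (prefix)
    for f, v in phi(x, z).items():
        w[f] = w.get(f, 0.0) - v             # penalize the violating output
    return w
```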

What if can’t guarantee violation• this is why perceptron doesn’t work well w/ inexact search

• because not every update is guaranteed to be a violation

• thus the proof breaks; no convergence guarantee

• example: beam or greedy search

• the model might prefer the correct label (if exact search)

• but the search prunes it away

• such a non-violation update is “bad”because it doesn’t fix any mistake

• the new model still misguides the search

• Q: how can we always guarantee violation?25

beam

bad

update

current model

Solution 1: Early Update (Collins & Roark, 2004)

• stop and update at the first mistake

[Figure: the "time flies" example; the standard update under greedy search does not converge, while the early update at the first mistaken tag (V vs. N) moves w^(k) to a new model w^(k+1)]

Early Update: Guarantees Violation

• early update: at the first mistake, the incorrect prefix scores higher: a violation!
• the standard update doesn't converge b/c it doesn't guarantee violation

[Figure: the same example, contrasting the early-update direction with the (invalid) standard update]

Early Update: from Greedy to Beam

• beam search is a generalization of greedy search (greedy = beam size b = 1)
  • at each stage we keep the top-b hypotheses
  • widely used: tagging, parsing, translation...
• early update: update when the correct label first falls off the beam (see the sketch below)
  • up to this point the incorrect prefix scores higher: violation guaranteed
• standard update (full update): no guarantee!

[Figure: model score along the search; the correct sequence falls off the beam (pruned), triggering the early update; the standard update at the end has no guarantee]
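Here is the promised sketch of early update inside beam search for tagging; `score_prefix` is a placeholder for the prefix score w · Φ(x, prefix).

```python
def early_update_pair(words, gold, tagset, w, b, score_prefix):
    """Beam search; return (gold_prefix, violating_prefix) at the first fall-off."""
    beam = [()]                                   # tag-sequence prefixes
    for i in range(len(words)):
        cands = [p + (t,) for p in beam for t in tagset]
        cands.sort(key=lambda p: score_prefix(words, p, w), reverse=True)
        beam = cands[:b]                          # keep the top-b hypotheses
        gold_prefix = tuple(gold[:i + 1])
        if gold_prefix not in beam:               # gold falls off the beam:
            return gold_prefix, beam[0]           # beam[0] outscores it, a violation
    return tuple(gold), beam[0]                   # full sequences (standard update)
```

Because the gold prefix was pruned, every surviving hypothesis (in particular `beam[0]`) scores at least as high, so the returned pair is always a violation; with b = 1 this reduces to the greedy case.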

Solution 2: Max-Violation (Huang et al., 2012)

• we now have a theory for early update (Collins & Roark, 2004)
  • but it learns too slowly due to partial updates
• max-violation: update at the prefix where the violation is maximum (see the sketch below)
  • the "worst mistake" in the search space
• all these update methods are violation-fixing perceptrons

[Figure: along the beam search, the candidate update positions: early, max-violation, and standard (bad!)]
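Max-violation can reuse the same beam pass, deferring the update to the position with the largest violation (again a sketch; `score_prefix` as above):

```python
def max_violation_pair(words, gold, tagset, w, b, score_prefix):
    """Run the full beam pass; pick the position where the violation is largest."""
    beam, worst, pair = [()], float("-inf"), None
    for i in range(len(words)):
        cands = [p + (t,) for p in beam for t in tagset]
        cands.sort(key=lambda p: score_prefix(words, p, w), reverse=True)
        beam = cands[:b]
        gold_prefix = tuple(gold[:i + 1])
        gap = score_prefix(words, beam[0], w) - score_prefix(words, gold_prefix, w)
        if gap > worst:                      # the "worst mistake" so far
            worst, pair = gap, (gold_prefix, beam[0])
    return pair                              # a true violation whenever worst >= 0
```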

Four Experiments

• part-of-speech tagging: "the man bit the dog" → DT NN VBD DT NN
• incremental parsing
• bottom-up parsing w/ cube pruning
• machine translation: "the man bit the dog" → 那人咬了狗

Max-Violation > Early >> Standard

• exp 1: part-of-speech tagging w/ beam search (on CTB5)
• early and max-violation >> standard update at the smallest beam sizes
• this advantage shrinks as the beam size increases
• max-violation converges faster than early update (and is slightly better)

[Plots (beam = 1, greedy): tagging accuracy on held-out (92-94%) vs. training time (hours), and best accuracy vs. beam size, for max-violation, early, and standard updates]

Max-Violation > Early >> Standard (cont'd)

[The same plots at beam = 2: tagging accuracy on held-out vs. training time and vs. beam size; the ordering max-violation > early >> standard persists]

Max-Violation > Early >> Standard

• exp 2: incremental dependency parsing (Huang & Sagae, 2010)
• the standard update is horrible due to search errors
• early update: 38 iterations, 15.4 hours (92.24 accuracy)
• max-violation: 10 iterations, 4.6 hours (92.25 accuracy)
  • max-violation is 3.3x faster than early update

[Plots (beam = 8): parsing accuracy on held-out vs. training time (hours) for max-violation, early, and standard updates; the standard update plateaus around 79.0 and is omitted from the zoomed view]

Why is the standard update so bad for parsing?

• the standard update works horribly with severe search error
• due to the large number of invalid updates (non-violations)

[Plots: % of invalid updates in the standard update vs. beam size, for parsing and tagging; parsing suffers far more, since its beam approximates an O(n^11) exact search with O(nb), vs. O(nT³) → O(nb) for tagging; reference accuracy curves for parsing (b = 8) and tagging (b = 1)]

Exp 3: Bottom-up Parsing (Zhang et al., 2013)

• CKY parsing with cube pruning for higher-order features
• we extended our framework from graphs to hypergraphs

[Plot: UAS on the Penn-YM dev set (92-94) vs. training epochs for the s-max, p-max, skip, and standard updates]

Exp 4: Machine Translation (Yu et al., 2013)

• the standard perceptron works poorly for machine translation
  • b/c the invalid-update ratio is very high (search quality is low)
• max-violation converges faster than early update
• the first truly successful effort in large-scale training for translation

[Plots: ratio of invalid updates (50-90%) vs. beam size for the standard perceptron, with and without non-local features; BLEU (15-26) vs. number of iterations for max-violation, early, standard, and local updates]

Comparison of the Four Experiments

• the harder your search problem, the more advantageous violation-fixing updates are

[Plots side by side: bottom-up parsing (UAS on Penn-YM dev vs. epochs: s-max, p-max, skip, standard), incremental parsing (b = 8; accuracy vs. training time), tagging (b = 1; accuracy vs. training time), machine translation (BLEU vs. iterations: max-violation, early, standard, local)]

Outline

• Overview of Structured Learning

• Challenges in Scalability

• Structured Perceptron

• convergence proof

• Structured Perceptron with Inexact Search

• Latent-Variable Perceptron


Learning with Latent Variables

• aka "weakly-supervised" or "partially-observed" learning
• learning from "natural annotations"; more scalable
• examples: translation, transliteration, semantic parsing...
  • parallel text: "Bush talked with Sharon" ↔ 布什与沙龙会谈, via a latent derivation (Liang et al., 2006; Yu et al., 2013; Xiao and Xiong, 2013)
  • QA pairs: "What is the largest state?" → argmax(state, size) → Alaska (Clark et al., 2010; Liang et al., 2013; Kwiatkowski et al., 2013)
  • transliteration: コンピューター ("ko n py u : ta :") ↔ "computer", via a latent derivation (Knight & Graehl, 1998; Kondrak et al., 2007, etc.)

Learning Latent Structures

                                      binary/multiclass              structured learning     latent structures
generative (count & divide):          naive Bayes                    HMMs                    EM (forward-backward)
conditional (expectations):           logistic regression (maxent)   CRFs                    latent CRFs
online + Viterbi (argmax):            perceptron                     structured perceptron   latent perceptron
max-margin:                           SVM                            structured SVM          latent structured SVM

Latent Structured Perceptron (Liang et al., 2006; Yu et al., 2013)

• no explicit positive signal
• hallucinate the "correct" derivation by current weights (a sketch follows below)

[Figure: training example x = 那人咬了狗, y = "the man bit the dog"; during online learning the decoder outputs z = "the dog bit the man" ≠ y]

update: w ← w + Φ(x, d*) − Φ(x, d̂)

where d* is the highest-scoring gold derivation (reward correct) and d̂ is the highest-scoring derivation overall (penalize wrong)
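A sketch of this update; the two decoders and the `yield_of` helper are placeholders (`forced_decode` searches the constrained, reference-producing space; `decode` searches the full space).

```python
def latent_perceptron_update(w, x, y, phi, decode, forced_decode, yield_of):
    """w <- w + Phi(x, d_star) - Phi(x, d_hat), derivations found by search."""
    d_star = forced_decode(x, y, w)        # highest-scoring gold derivation
    d_hat = decode(x, w)                   # highest-scoring derivation overall
    if yield_of(d_hat) != y:               # wrong output: reward d*, penalize d-hat
        for f, v in phi(x, d_star).items():
            w[f] = w.get(f, 0.0) + v
        for f, v in phi(x, d_hat).items():
            w[f] = w.get(f, 0.0) - v
    return w
```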

Unconstrained Search

• example: beam search phrase-based decoding of "Bushi yu Shalong juxing le huitan"

[Figure: decoding hypotheses as coverage vectors, expanding from ●_ _ _ _ _ ("Bush") toward full coverage ●●●●●●, with competing continuations such as "... talks", "... talk", "... meeting", "... Sharon", "... Shalong"]

Constrained Search

• forced decoding: must produce the exact reference translation
  • "Bushi yu Shalong juxing le huitan" → "Bush held talks with Sharon"

[Figure: the gold derivation lattice over coverage vectors 0-6 (e.g. Bush → held → talks with Sharon, or Bush → held talks → with Sharon); the forced-decoding space is a small subset of the full search space explored by the beam]

Search Errors in Decoding

[Same setup as the latent structured perceptron slide above: hallucinate the "correct" derivation by current weights, reward d*, penalize d̂; but now there is a problem: search errors]

Search Error: Gold Derivations Pruned

[Figure: real-decoding beam search over coverage vectors vs. the gold derivation lattice; the gold derivations can be pruned off the beam, and we should address search errors here!]

Fixing Search Error 1: Early Update

• early update (Collins & Roark, 2004): update when the correct derivation falls off the beam
  • up to this point the incorrect prefix scores higher
  • that's a "violation" we want to fix; proof in (Huang et al., 2012)
• the standard perceptron does not guarantee violation
  • the correct sequence (pruned) might score higher at the end!
  • such an "invalid" update reinforces the model error

[Figure: model score along the search; the correct sequence falls off the beam (pruned), triggering the early update; the standard update at the end has no guarantee]

Early Update w/ Latent Variable

• the gold-standard derivations are not annotated
• we treat any reference-producing derivation as correct
• early update: when all correct derivations have fallen off the beam, stop decoding and update
  • violation guaranteed: some incorrect prefix scores higher up to this point

[Figure: the gold derivation lattice; decoding stops at the step where the last reference-producing prefix is pruned]

Fixing Search Error 2: Max-Violation

• early update works, but learns slowly due to partial updates
• max-violation: update at the prefix where the violation is maximum
  • the "worst mistake" in the search space
• now extended to handle latent variables

[Figure: along the beam, the best/worst reference-producing prefixes vs. the best in the beam; candidate update positions: early (where the correct sequences all fall off the beam), max-violation, local, and standard (invalid)]

Latent-Variable Perceptron

[Figure: update positions along the search: early (correct sequences all fall off the beam), max-violation (biggest violation), latest (last valid update), and full/standard (an invalid update, since the pruned correct sequence may score higher at the end)]

Roadmap of Techniques

structured perceptron (Collins, 2002)
  + inexact search → perceptron w/ inexact search (Collins & Roark, 2004; Huang et al., 2012)
  + latent variables → latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009)
  + both → latent-variable perceptron w/ inexact search (Yu et al., 2013; Zhao et al., 2014)
      applications: MT, syntactic parsing, semantic parsing, transliteration

Experiments: Discriminative Training for MT

• the standard update (Liang et al.'s "bold") works poorly
  • b/c the invalid-update ratio is very high (search quality is low)
  • this explains why Liang et al. (2006) failed (std ≈ their "bold" update; local ≈ their "local" update)
• max-violation converges faster than early update

[Plots: ratio of invalid updates (50-90%) vs. beam size for the standard latent-variable perceptron, with and without non-local features; BLEU (17-26) vs. number of iterations for MaxForce, MERT, early, local, and standard updates]

Final Conclusions

• online structured learning is simple and powerful
• search efficiency is the key challenge
• search errors do interfere with learning
• but we can use violation-fixing perceptrons w/ inexact search
• we can extend the perceptron to learn latent structures

[Recap figure: update positions along the beam search: early, max-violation, latest (last valid), full/standard (invalid)]
