Structured Prediction with Perceptron: Theory and Algorithms
TRANSCRIPT
Kai Zhao
Dept. of Computer Science
Graduate Center, the City University of New York
[title-slide figure: binary classification (y = -1 vs. y = +1); POS tagging ("the man bit the dog" → DT NN VBD DT NN); translation ("the man bit the dog" → 那 人 咬 了 狗)]
What is Structured Prediction?
• binary classification: output is binary
• multiclass classification: output is a (small) number
• structured classification: output is a structure (seq., tree, graph)
• part-of-speech tagging, parsing, summarization, translation
• exponentially many classes: search (inference) efficiency is crucial!
[figure: example outputs — binary label (y = -1 vs. y = +1); tag sequence ("the man bit the dog" → DT NN VBD DT NN); parse tree (S → NP VP; NP → (DT the) (NN man); VP → (VB bit) (NP (DT the) (NN dog))); translation ("the man bit the dog" → 那 人 咬 了 狗)]
NLP is all about structured prediction!
Learning: Unstructured vs. Structured
                   binary/multiclass                 structured learning
generative         naive Bayes (count & divide)      HMMs
conditional        logistic regression / maxent      CRFs
                   (expectations)                    (expectations)
online + Viterbi   perceptron (argmax)               structured perceptron (argmax)
max-margin         SVM                               structured SVM (loss-augmented argmax)
Why Perceptron (Online Learning)?
• because we want scalability on big data!
• learning time has to be linear in the number of examples
• can make only constant number of passes over training data
• only online learning (perceptron/MIRA) can guarantee this!
• SVM scales between O(n^2) and O(n^3); CRFs: no guarantee
• and inference on each example must be super fast
• another advantage of perceptron: just need argmax
Perceptron: from binary to structured
[figure: three variants of the same loop — inference maps x to z, then "update weights if y ≠ z"]
• binary perceptron (Rosenblatt, 1959): 2 classes; exact inference is trivial
• multiclass perceptron (Freund & Schapire, 1999): constant # of classes; exact inference is easy
• structured perceptron (Collins, 2002): exponential # of classes; exact inference is hard
Scalability Challenges
• inference (on one example) is too slow (even w/ DP)
• can we sacrifice search exactness for faster learning?
• would inexact search interfere with learning?
• if so, how should we modify learning?
Outline
• Overview of Structured Learning
• Challenges in Scalability
• Structured Perceptron
• convergence proof
• Structured Perceptron with Inexact Search
• Latent-Variable Structured Perceptron
Generic Perceptron
• perceptron is the simplest machine learning algorithm
• online-learning: one example at a time
• learning by doing
• find the best output under the current weights
• update weights at mistakes
[figure: the perceptron loop — example x_i goes through inference to output z_i; weights w are updated if z_i ≠ y_i]
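To make the loop concrete, here is a minimal Python sketch of the generic perceptron; `phi` (the feature map) and `inference` are application-specific stubs assumed for illustration, not part of the slides.

```python
# Minimal sketch of the generic online perceptron loop (illustrative;
# `phi` and `inference` are assumed application-specific stubs).
def perceptron(data, phi, inference, epochs=5):
    w = {}                                    # sparse weight vector
    for _ in range(epochs):
        for x, y in data:                     # online: one example at a time
            z = inference(x, w)               # best output under current weights
            if z != y:                        # update weights at mistakes
                for f, v in phi(x, y).items():
                    w[f] = w.get(f, 0.0) + v  # reward the correct output
                for f, v in phi(x, z).items():
                    w[f] = w.get(f, 0.0) - v  # penalize the predicted output
    return w
```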
Structured Perceptron
[figure: the same loop for structured outputs — input x = "the man bit the dog", gold y = DT NN VBD DT NN, current output z = DT NN NN DT NN]
Example: POS Tagging
• gold-standard: DT NN VBD DT NN
• the man bit the dog
• current output: DT NN NN DT NN
• the man bit the dog
• assume only two feature classes
• tag bigrams ti-1 ti
• word/tag pairs wi
• weights ++: (NN, VBD) (VBD, DT) (VBD→bit)
• weights --: (NN, NN) (NN, DT) (NN→bit)
[figure: the update adds the feature difference — w ← w + Φ(x, y) − Φ(x, z)]
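As a sanity check, a hypothetical feature map with exactly these two feature classes reproduces the ++/-- lists above; the representation is an illustrative assumption.

```python
# Hypothetical feature map with the slide's two feature classes:
# tag bigrams (t_{i-1}, t_i) and word/tag pairs (t_i -> w_i).
def phi(words, tags):
    feats = {}
    prev = "<s>"                              # assumed start symbol
    for word, tag in zip(words, tags):
        for f in [("bigram", prev, tag), ("wordtag", tag, word)]:
            feats[f] = feats.get(f, 0) + 1
        prev = tag
    return feats

x = "the man bit the dog".split()
y = "DT NN VBD DT NN".split()                 # gold standard
z = "DT NN NN DT NN".split()                  # current output
# Phi(x, y) - Phi(x, z) is exactly the slide's update:
# ++ ("bigram", NN, VBD), ("bigram", VBD, DT), ("wordtag", VBD, "bit")
# -- ("bigram", NN, NN),  ("bigram", NN, DT),  ("wordtag", NN, "bit")
```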
Inference: Dynamic Programming
[figure: exact inference z = argmax by dynamic programming, then update if y ≠ z]
tagging: O(nT^3); CKY parsing: O(n^3)
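For bigram tagging, exact inference is the classic Viterbi DP; below is a minimal sketch (with an assumed `score` stub over adjacent tags) that runs in O(nT^2) — the slide's O(nT^3) presumably refers to the trigram case.

```python
# Sketch of exact Viterbi inference for bigram tagging, O(n T^2).
# `score(prev_tag, tag, words, i)` is an assumed stub: w . (local features).
def viterbi(words, tags, score):
    best = {"<s>": (0.0, [])}       # tag -> (best path score, best tag prefix)
    for i in range(len(words)):
        new = {}
        for t in tags:
            p = max(best, key=lambda p: best[p][0] + score(p, t, words, i))
            new[t] = (best[p][0] + score(p, t, words, i), best[p][1] + [t])
        best = new
    return max(best.values(), key=lambda v: v[0])[1]
```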
Perceptron vs. CRFs
• perceptron is an online, Viterbi approximation of the CRF
• simpler to code; faster to converge; ~same accuracy
[figure: CRFs (Lafferty et al 2001) become the structured perceptron (Collins 2002) by going online (stochastic gradient descent) and Viterbi ("hard" CRFs)]
• perceptron: for each (x, y) ∈ D, decode z = argmax_{z ∈ GEN(x)} w · Φ(x, z)
• CRF: maximize Σ_{(x,y) ∈ D} [w · Φ(x, y) − log Z(x)], where Z(x) = Σ_{z ∈ GEN(x)} exp(w · Φ(x, z))
Perceptron Convergence Proof
• binary classification: converges iff the data is separable
• structured prediction: converges iff the data is separable
• there is an oracle vector that correctly labels all examples
• one vs. the rest (correct label better than all incorrect labels)
• theorem: if separable, then # of updates ≤ R^2/δ^2 (R: diameter; δ: margin)
[figure: separable data with margin δ and diameter R, in both the binary case and the structured case (gold y better than all z ≠ y)]
Novikoff (1962) ⇒ Freund & Schapire (1999) ⇒ Collins (2002)
Geometry of Convergence Proof pt 1
[figure: the update vector ΔΦ(x, y, z) moves the current model w^(k) to the new model w^(k+1); it makes an angle < 90° with the unit oracle vector u]
• perceptron update: w^(k+1) = w^(k) + ΔΦ(x, y, z), where ΔΦ(x, y, z) = Φ(x, y) − Φ(x, z) and z is the exact 1-best
• separation: the unit oracle vector u has margin δ, i.e., u · ΔΦ(x, y, z) ≥ δ on every example
• part 1 (lower bound, by induction): ‖w^(k+1)‖ ≥ u · w^(k+1) ≥ kδ
Geometry of Convergence Proof pt 2
[figure: the same update; a violation means the incorrect label scored higher, and R is the max diameter ‖ΔΦ‖]
• perceptron update: w^(k+1) = w^(k) + ΔΦ(x, y, z), with z the exact 1-best
• violation: the incorrect label scored higher, so w^(k) · ΔΦ(x, y, z) ≤ 0
• diameter: ‖ΔΦ(x, y, z)‖^2 ≤ R^2
• part 2 (upper bound, by induction): ‖w^(k+1)‖^2 ≤ kR^2
• combine with part 1 (lower bound): kδ ≤ ‖w^(k+1)‖ ≤ √k · R, so the # of updates k ≤ R^2/δ^2
summary: the proof uses 3 facts: 1. separation (margin); 2. diameter (always finite); 3. violation (guaranteed by exact search)
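Spelled out, the two inductions combine into the mistake bound as follows (a reconstruction of the standard Novikoff-style argument):

```latex
% reconstruction of the standard Novikoff-style argument
\begin{align*}
&\text{part 1 (separation, by induction):}\\
&\quad \|\mathbf{w}^{(k+1)}\| \ge \mathbf{u}\cdot\mathbf{w}^{(k+1)}
   = \mathbf{u}\cdot\mathbf{w}^{(k)} + \mathbf{u}\cdot\Delta\Phi(x,y,z) \ge k\delta\\
&\text{part 2 (violation + diameter, by induction):}\\
&\quad \|\mathbf{w}^{(k+1)}\|^2
   = \|\mathbf{w}^{(k)}\|^2 + 2\,\mathbf{w}^{(k)}\cdot\Delta\Phi(x,y,z)
     + \|\Delta\Phi(x,y,z)\|^2
   \le \|\mathbf{w}^{(k)}\|^2 + R^2 \le kR^2\\
&\text{combining:}\quad
   k^2\delta^2 \le \|\mathbf{w}^{(k+1)}\|^2 \le kR^2
   \;\Longrightarrow\; k \le R^2/\delta^2
\end{align*}
```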
Outline
• Overview of Structured Learning
• Challenges in Scalability
• Structured Perceptron
• convergence proof
• Structured Perceptron with Inexact Search
• Latent-Variable Perceptron
Scalability Challenge 1: Inference
• challenge: search efficiency (exponentially many classes)
• often use dynamic programming (DP)
• but DP is still too slow for repeated use, e.g. parsing O(n^3)
• Q: can we sacrifice search exactness for faster learning?
[figure: binary classification (constant # of classes; exact inference trivial) vs. structured classification (exponential # of classes; exact inference hard), each with the inference-and-update loop]
Perceptron w/ Inexact Inference
• routine use of inexact inference in NLP (e.g. beam search)
• how does structured perceptron work with inexact search?
• so far, most structured learning theory assumes exact search
• would search errors break these learning properties?
• if so, how should we modify learning to accommodate inexact search?
[figure: the perceptron loop with inexact inference (beam search, greedy search) in place of exact inference]
Q: does perceptron still work???
Bad News and Good News
• bad news: no more guarantee of convergence
• in practice perceptron degrades a lot due to search errors
• good news: new update methods guarantee convergence
• new perceptron variants that “live with” search errors
• in practice they work really well w/ inexact search
[figure: the same loop with beam/greedy search]
A: it no longer works as-is, but we can make it work by some magic.
Convergence with Exact Search
[figure: toy example — training example "time flies" with gold tags N V; output space {N,V} × {N,V}; each update moves the model toward the correct label]
structured perceptron converges with exact search
No Convergence w/ Greedy Search
[figure: the same toy example under greedy search — successive updates cycle between models and never settle on the correct label]
structured perceptron does not converge with greedy search
Which part of the convergence proof no longer holds? The proof only uses 3 facts: 1. separation (margin); 2. diameter (always finite); 3. violation (guaranteed by exact search)
Geometry of Convergence Proof pt 2
[figure: geometry of part 2 revisited — the exact 1-best is always a violation, but an inexact 1-best need not be]
inexact search doesn't guarantee violation!
Observation: Violation is all we need!
[figure: the set of all violations — any violating z, not just the exact 1-best, yields an update at < 90° to the oracle]
• exact search is not really required by the proof
• rather, it is only used to ensure violation!
the proof only uses 3 facts: 1. separation (margin); 2. diameter (always finite); 3. violation (but no need for exact search)
Violation-Fixing Perceptron
• if we guarantee violation, we don't care about exactness!
• violation is good b/c we can at least fix a mistake
[figure: standard perceptron (exact inference; update if y ≠ z) vs. violation-fixing perceptron (find any violation y′ vs. z; update if y′ ≠ z) — any of the possible violating updates will do]
same mistake bound as before!
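A minimal sketch of the violation-fixing variant, assuming some `find_violation` routine (e.g., early update or max-violation, sketched later) that only ever returns genuinely violating pairs; all names here are illustrative.

```python
# Sketch of a violation-fixing perceptron: the search may be inexact, as long
# as every returned pair (y_prime, z) is a true violation under w, i.e.
# w . Phi(x, z) >= w . Phi(x, y_prime). Stub names are assumptions.
def violation_fixing_perceptron(data, phi, find_violation, epochs=5):
    w = {}
    for _ in range(epochs):
        for x, y in data:
            pair = find_violation(x, y, w)  # None if the output was correct
            if pair is not None:
                y_prime, z = pair           # gold (prefix) vs. violating output
                add(w, phi(x, y_prime), +1.0)
                add(w, phi(x, z), -1.0)
    return w

def add(w, feats, scale):
    """Accumulate scaled sparse features into the weight vector."""
    for f, v in feats.items():
        w[f] = w.get(f, 0.0) + scale * v
```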
What if can’t guarantee violation• this is why perceptron doesn’t work well w/ inexact search
• because not every update is guaranteed to be a violation
• thus the proof breaks; no convergence guarantee
• example: beam or greedy search
• the model might prefer the correct label (if exact search)
• but the search prunes it away
• such a non-violation update is “bad” because it doesn’t fix any mistake
• the new model still misguides the search
• Q: how can we always guarantee violation?
[figure: the correct label is pruned from the beam even though the current model might prefer it — a "bad" update]
Solution 1: Early update (Collins/Roark 2004)
[figure: the toy example again — standard perceptron does not converge with greedy search, but stopping and updating at the first mistake does]
stop and update at the first mistake
Early Update: Guarantees Violation
[figure: at the first mistake the incorrect prefix scores higher than the correct prefix]
early update: the incorrect prefix scores higher — a violation!
the standard update doesn't converge b/c it doesn't guarantee violation
Early Update: from Greedy to Beam
• beam search generalizes greedy search (greedy is beam size b = 1)
• at each stage we keep the top-b hypotheses
• widely used: tagging, parsing, translation...
• early update -- update when the correct label first falls off the beam
• up to this point the incorrect prefix scores higher, so violation is guaranteed (see the sketch below)
• standard update (full update) -- no guarantee!
[figure: beam search over prefixes; the correct label falls off the beam (pruned) partway through; the incorrect prefix scores higher up to that point, while the standard update at the end has no violation guarantee]
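A sketch of early update for beam-search tagging, with illustrative stubs (`expand(x, i)` yields candidate tags for position i; `score_prefix(x, p, w)` computes w · Φ on a prefix): decode left to right and return an update pair as soon as the gold prefix is pruned.

```python
# Sketch of early update for beam search (stub names are assumptions).
def early_update_pair(x, y, w, b, expand, score_prefix):
    beam = [()]                              # start from the empty prefix
    for i in range(len(y)):
        cands = [p + (t,) for p in beam for t in expand(x, i)]
        beam = sorted(cands, key=lambda p: score_prefix(x, p, w),
                      reverse=True)[:b]      # keep the top-b hypotheses
        gold = tuple(y[:i + 1])
        if gold not in beam:                 # gold prefix falls off the beam:
            return gold, beam[0]             # violation guaranteed -> update here
    if beam[0] != tuple(y):                  # full-length mistake is also a
        return tuple(y), beam[0]             # violation: z outscored surviving gold
    return None                              # correct output won; no update
```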
Solution 2: Max-Violation (Huang et al 2012)
[figure: beam over prefix length, with candidate update positions marked — early, max-violation, and standard (bad!)]
• we have now established a theory for early update (Collins/Roark)
• but it learns too slowly due to partial updates
• max-violation: use the prefix where violation is maximum
• “worst-mistake” in the search space
• all these update methods are violation-fixing perceptrons
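Under the same assumed stubs as the early-update sketch, max-violation runs the beam all the way to the end and updates at the prefix length where the incorrect best prefix outscores the gold prefix the most.

```python
# Sketch of max-violation update (same illustrative stubs as early update).
def max_violation_pair(x, y, w, b, expand, score_prefix):
    beam, best_at = [()], []
    for i in range(len(y)):
        cands = [p + (t,) for p in beam for t in expand(x, i)]
        beam = sorted(cands, key=lambda p: score_prefix(x, p, w),
                      reverse=True)[:b]
        best_at.append(beam[0])              # best (possibly incorrect) prefix
    def violation(i):                        # how much the 1-best beats gold
        return (score_prefix(x, best_at[i], w)
                - score_prefix(x, tuple(y[:i + 1]), w))
    i = max(range(len(y)), key=violation)    # the "worst mistake"
    return tuple(y[:i + 1]), best_at[i]
```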
Four Experiments
• part-of-speech tagging: "the man bit the dog" → DT NN VBD DT NN
• incremental parsing
• bottom-up parsing w/ cube pruning
• machine translation: "the man bit the dog" → 那 人 咬 了 狗
Max-Violation > Early >> Standard
• exp 1 on part-of-speech tagging w/ beam search (on CTB5)
• early and max-violation >> standard update at smallest beams
• this advantage shrinks as beam size increases
• max-violation converges faster than early (and slightly better)
[plots: tagging accuracy on held-out data (92-94) vs. training time in hours at beam=1 (greedy), and best accuracy vs. beam size 1-7; max-violation above early, both well above standard]
Max-Violation > Early >> Standard (cont'd)
[plots: the same exp 1 comparison at beam=2 — tagging accuracy vs. training time, and best accuracy vs. beam size; bullets as on the previous slide]
Max-Violation > Early >> Standard
• exp 2 on incremental dependency parser (Huang & Sagae 2010)
• standard update is horrible due to search errors
• early update: 38 iterations, 15.4 hours (92.24)
• max-violation: 10 iterations, 4.6 hours (92.25)
[plots: parsing accuracy on held-out data vs. training time in hours, beam=8; in the zoomed-in view (91-92.25) the standard update (79.0) is omitted]
max-violation is 3.3x faster than early update
Why Is the Standard Update So Bad for Parsing?
• the standard update works horribly under severe search errors
• due to the large number of invalid (non-violation) updates
[plots: % of invalid updates in the standard update vs. beam size 2-16 — far higher for parsing (b=8) than for tagging (b=1); accuracy-vs-training-time curves repeated from earlier slides]
• search is far more inexact for parsing: tagging O(nT^3) ⇒ O(nb); parsing O(n^11) ⇒ O(nb)
Exp 3: Bottom-up Parsing
• CKY parsing with cube pruning for higher-order features
• we extended our framework from graphs to hypergraphs
[plot: UAS on Penn-YM dev (92-94) vs. training epochs 1-8; s-max and p-max above skip, all above standard (Zhang et al 2013)]
Exp 4: Machine Translation
• standard perceptron works poorly for machine translation
• b/c invalid update ratio is very high (search quality is low)
• max-violation converges faster than early update
• first truly successful effort in large-scale training for translation
[plots: ratio of invalid updates (standard perceptron, with non-local features) vs. beam size, roughly 50-90%; BLEU (15-26) vs. number of iterations — max-violation above early, both above standard and local (Yu et al 2013)]
Comparison of Four Exps
• the harder the search, the bigger the advantage of violation-fixing updates
[plots: the four experiments side by side — tagging (b=1), incremental parsing (b=8), bottom-up parsing (UAS on Penn-YM dev), and machine translation (BLEU)]
Outline
• Overview of Structured Learning
• Challenges in Scalability
• Structured Perceptron
• convergence proof
• Structured Perceptron with Inexact Search
• Latent-Variable Perceptron
Learning with Latent Variables
• aka “weakly-supervised” or “partially-observed” learning
• learning from “natural annotations”; more scalable
• examples: translation, transliteration, semantic parsing...
[figure: three sources of natural annotations]
• parallel text w/ latent derivation: "Bush talked with Sharon" ↔ 布什 与 沙龙 会谈 (Liang et al 2006; Yu et al 2013; Xiao and Xiong 2013)
• QA pairs: "What is the largest state?" → argmax(state, size) → Alaska (Clark et al 2010; Liang et al 2013; Kwiatkowski et al 2013)
• transliteration w/ latent derivation: コンピューター ~ "ko n py u : ta :" ~ "computer" (Knight & Graehl 1998; Kondrak et al 2007, etc.)
Learning Latent Structures
                   binary/multiclass                 structured learning         latent structures
generative         naive Bayes (count & divide)      HMMs                        EM (forward-backward)
conditional        logistic regression / maxent      CRFs (expectations)         latent CRFs
online + Viterbi   perceptron (argmax)               structured perceptron       latent perceptron
max-margin         SVM                               structured SVM              latent structured SVM
Latent Structured Perceptron
• no explicit positive signal
• hallucinate the “correct” derivation by current weights
[figure: training example x = 那 人 咬 了 狗, y = "the man bit the dog"; during online learning the decoder outputs z = "the dog bit the man" ≠ y]
• d̂ is the highest-scoring derivation overall; d* is the highest-scoring gold (reference-producing) derivation
• update: w ← w + Φ(x, d*) − Φ(x, d̂) — reward correct, penalize wrong (Liang et al 2006; Yu et al 2013)
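A sketch of this update; `argmax_derivation` (unconstrained decoding), `argmax_gold_derivation` (forced decoding, restricted to derivations yielding the reference y), `translation_of`, and the reuse of `add` from the earlier sketch are all illustrative assumptions.

```python
# Sketch of the latent-variable perceptron update. Assumed stubs:
# argmax_derivation(x, w)          -- unconstrained decoding (full space)
# argmax_gold_derivation(x, y, w)  -- forced decoding (gold derivations only)
# translation_of(d)                -- the output string derivation d yields
def latent_perceptron_update(x, y, w, phi, argmax_derivation,
                             argmax_gold_derivation, translation_of):
    d_hat = argmax_derivation(x, w)              # highest-scoring derivation
    if translation_of(d_hat) != y:               # mistake: output differs from y
        d_star = argmax_gold_derivation(x, y, w) # best "hallucinated" gold
        add(w, phi(x, d_star), +1.0)             # reward correct
        add(w, phi(x, d_hat), -1.0)              # penalize wrong
```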
Unconstrained Search
• example: beam search phrase-based decoding
[figure: beam search decoding of "Bushi yu Shalong juxing le huitan" — hypotheses are coverage-vector states with output words, e.g. ●_____ "Bush"; the beam mixes good and bad continuations ("... Sharon" vs. "... Shalong", "talks" vs. "talk" vs. "meeting")]
Constrained Search
• forced decoding: must produce the exact reference translation
[figure: "Bushi yu Shalong juxing le huitan" → "Bush held talks with Sharon"; the gold derivation lattice over coverage vectors (from ______ to ●●●●●●) contains every derivation yielding the reference — a tiny forced-decoding space inside the full search space explored by the beam]
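A sketch of the constraint itself, assuming hypotheses are (coverage, output-prefix) pairs and an illustrative `phrase_options` stub: an expansion survives only if its output continues the reference.

```python
# Sketch of forced decoding for phrase-based MT (stub names are assumptions):
# phrase_options(x, coverage) yields (new_coverage, translated_words) pairs.
def forced_expand(x, ref, hyp, phrase_options):
    coverage, out = hyp                   # hypothesis = (coverage, output prefix)
    for new_cov, words in phrase_options(x, coverage):
        words = list(words)
        if ref[len(out):len(out) + len(words)] == words:
            yield new_cov, out + words    # stays inside the gold lattice
```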
Search Errors in Decoding
• as before, no explicit positive signal: hallucinate the "correct" derivation by current weights and update w ← w + Φ(x, d*) − Φ(x, d̂)
• problem: search errors
[figure: the same reward-correct / penalize-wrong update as the previous slide (Liang et al 2006; Yu et al 2013)]
Search Error: Gold Derivations Pruned
[figure: the gold derivation lattice vs. real decoding with beam search — the beam prunes gold-lattice states early; we should address search errors here!]
Fixing Search Error 1: Early Update
• early update (Collins/Roark '04): update when the correct derivation falls off the beam
• up to this point the incorrect prefix should score higher
• that’s a “violation” we want to fix; proof in (Huang et al 2012)
• standard perceptron does not guarantee violation
• the correct sequence (pruned) might score higher at the end!
• “invalid” update b/c it reinforces the model error
[figure: the correct sequence falls off the beam (pruned); up to that point the incorrect prefix scores higher — violation guaranteed; the standard update at the end has no guarantee]
Early Update w/ Latent Variable
47
earl
y
upda
te
incorrect
violation guaranteed: incorrect prefix scores higher up to this point
21
Model
• the gold-standard derivations are not annotated
• we treat any reference-producing derivation as good
correct
all correct derivations fall off
stop decoding
_ _ _ _ _ _ ● _ _ _ _ _ ● _ _ ● ● _ ● _ _ ● ● ● ● ● _ ● ● ● ● ● ● ● ● ●
0 1 2 3 4 5 6
Bushheld
talks with Sharon
held talks with Sharongold derivation lattice
Fixing Search Error 2: Max-Violation
• early update works but learns slowly due to partial updates
• max-violation: use the prefix where violation is maximum
• “worst-mistake” in the search space
• now extended to handle latent-variable
[figure: update positions over the beam — standard/full (invalid once the correct sequences all fall off the beam), local, early (at the fall-off point), and max-violation (the prefix with the biggest violation, the "worst mistake")]
Latent-Variable Perceptron
[figure: the beam over time — the correct sequences all fall off at some step; candidate updates: early (at that step), max-violation (at the biggest violation), latest (the last valid update), and full/standard (at the end — an invalid update, since it may not be a violation)]
Roadmap of Techniques
structured perceptron (Collins, 2002)
  + latent variables → latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009)
  + inexact search → perceptron w/ inexact search (Collins & Roark, 2004; Huang et al 2012)
  + both → latent-variable perceptron w/ inexact search (Yu et al 2013; Zhao et al 2014)
applications: MT, syntactic parsing, semantic parsing, transliteration
[figure: the update positions (std, local, early, max-violation) repeated from the previous slide]
Experiments: Discriminative Training for MT
• standard update (Liang et al’s “bold”) works poorly
• b/c invalid update ratio is very high (search quality is low)
• max-violation converges faster than early update
[plots: ratio of invalid updates vs. beam size for the standard latent-variable perceptron with non-local features, roughly 50-90% — this explains why Liang et al '06 failed (their "bold" ≈ std, their "local" ≈ local); BLEU (17-26) vs. number of iterations — MaxForce above MERT, early, local, and standard]
Final Conclusions
• online structured learning is simple and powerful
• search efficiency is the key challenge
• search errors do interfere with learning
• but we can use violation-fixing perceptron w/ inexact search
• we can extend perceptron to learn latent structures