CPSC 540: Machine Learning
Structure Learning, Structured SVMs
Mark Schmidt
University of British Columbia
Winter 2017
Admin
Assignment 4:
Due today, 1 late day for Wednesday, 2 for the following Monday.
No office hours tomorrow.
Project proposals: “no news is good news”.
Assignment 5 and project/final descriptions are coming soon.
Last Time: Structured Prediction
We discussed structured prediction:
Supervised learning where output y is a general object.
For example, automatic brain tumour segmentation:
We want to label all pixels and model dependencies between pixels.
We could formulate this as a density estimation problem of modeling p(x, y).
Here y is the labeling of the entire image.
But features x may be complicated.
CRFs generalize logistic regression and directly model p(y|x).
Outline
Ising Models
The Ising model for binary x_i is defined by
p(x_1, x_2, \dots, x_d) = \frac{1}{Z} \exp\left( \sum_{i=1}^d x_i w_i + \sum_{(i,j) \in E} x_i x_j w_{ij} \right).
Consider using x_i ∈ {−1, 1}:
If w_i > 0 it encourages x_i = 1.
If w_{ij} > 0 it encourages neighbours i and j to have the same value.
E.g., neighbouring pixels in the image receive the same label (“attractive” model).
This model is a special case of a pairwise UGM with
\phi_i(x_i) = \exp(x_i w_i), \qquad \phi_{ij}(x_i, x_j) = \exp(x_i x_j w_{ij}).
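To make the factorization concrete, here is a minimal numpy sketch (not from the lecture) that evaluates an Ising model's probability on a tiny chain; the parameter values are made up for illustration and Z is computed by brute-force enumeration, which is only feasible for very small d.

```python
import itertools
import numpy as np

# Toy Ising model: 4 binary variables x_i in {-1, +1} on a chain.
# The parameter values below are hypothetical, chosen only for illustration.
w = np.array([0.1, -0.2, 0.3, 0.0])                 # node weights w_i
edges = [(0, 1), (1, 2), (2, 3)]                    # edge set E
w_edge = {(0, 1): 1.0, (1, 2): 0.5, (2, 3): 1.0}    # edge weights w_ij

def unnormalized(x):
    """exp( sum_i x_i*w_i + sum_{(i,j) in E} x_i*x_j*w_ij )."""
    score = np.dot(w, x) + sum(x[i] * x[j] * w_edge[(i, j)] for (i, j) in edges)
    return np.exp(score)

# Brute-force normalizing constant Z (only feasible for tiny d).
Z = sum(unnormalized(np.array(x)) for x in itertools.product([-1, 1], repeat=4))

x = np.array([1, 1, 1, -1])
print("p(x) =", unnormalized(x) / Z)
```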
General Pairwise UGM
For general discrete x_i a generalization is
p(x_1, x_2, \dots, x_d) = \frac{1}{Z} \exp\left( \sum_{i=1}^d w_{i,x_i} + \sum_{(i,j) \in E} w_{i,j,x_i,x_j} \right),
which can represent any “positive” pairwise UGM (meaning p(x) > 0 for all x).
Interpretation of weights for this UGM:
If w_{i,1} > w_{i,2} then we prefer x_i = 1 to x_i = 2.
If w_{i,j,1,1} > w_{i,j,2,2} then we prefer (x_i = 1, x_j = 1) to (x_i = 2, x_j = 2).
As before, we can use parameter tying:
We could use the same w_{i,x_i} for all positions i.
The Ising model corresponds to tying of the w_{i,j,x_i,x_j}.
Log-Linear Models
These models are special cases of log-linear models which have the form
p(x \mid w) = \frac{1}{Z} \exp\left( w^T F(x) \right),
for some parameters w and features F(x).
The log-linear NLL is convex and has the form
-\log p(x \mid w) = -w^T F(x) + \log Z,
and the gradient can be written as
-\nabla \log p(x \mid w) = -F(x) + \mathbb{E}[F(x)].
So if the gradient is zero, the empirical features match the expected features.
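As a sanity check on this identity (not part of the slides), a tiny numpy sketch that computes E[F(x)] by brute-force enumeration and forms the gradient -F(x) + E[F(x)] for a toy model with F(x) = x:

```python
import itertools
import numpy as np

# Toy log-linear model over 3 binary variables with features F(x) = x,
# so p(x | w) = exp(w^T x) / Z.  Everything here is brute force.
np.random.seed(0)
w = np.random.randn(3)
configs = np.array(list(itertools.product([-1, 1], repeat=3)), dtype=float)

logits = configs @ w                                      # w^T F(x) for every configuration x
probs = np.exp(logits - np.log(np.exp(logits).sum()))     # p(x | w)

x_obs = np.array([1.0, -1.0, 1.0])                        # one "observed" example
expected_F = probs @ configs                              # E[F(x)] under the model
grad_nll = -x_obs + expected_F                            # -grad log p(x_obs | w)
print(grad_nll)
```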
Training Log-Linear Models
The term E[F (x)] in the gradient may be hard to compute.
In a pairwise UGM, it depends on univariate and pairwise marginals.
It’s common to use variational or Monte Carlo estimates of these marginals.
In RBMs, we alternate between block Gibbs sampling and stochastic gradient.
Or a crude approximation is pseudo-likelihood,
p(x_1, x_2, \dots, x_d) \approx \prod_{j=1}^d p(x_j \mid x_{-j}),
which turns learning into d single-variable problems (similar to DAGs).
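A minimal sketch of the pseudo-likelihood idea for the Ising case (my own illustration, assuming x_i ∈ {−1, +1} and an edge-weight dictionary keyed by sorted vertex pairs); each conditional p(x_j | x_−j) is logistic in the local field:

```python
import numpy as np

def ising_neg_log_pseudolikelihood(w_node, w_edge, X, edges):
    """Negative log-pseudo-likelihood for an Ising model with x_i in {-1, +1}.

    X: (n, d) array of samples.  w_edge: dict mapping each edge (i, j), i < j,
    to its weight w_ij.  Each conditional is a logistic model:
    p(x_j | x_-j) = sigmoid(2 * x_j * (w_j + sum_k w_jk x_k)).
    """
    n, d = X.shape
    neighbours = {j: [] for j in range(d)}
    for (i, j) in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    nll = 0.0
    for x in X:
        for j in range(d):
            field = w_node[j] + sum(
                w_edge[(min(j, k), max(j, k))] * x[k] for k in neighbours[j])
            nll -= np.log(1.0 / (1.0 + np.exp(-2.0 * x[j] * field)))
    return nll   # minimize this (e.g., with gradient methods) instead of the exact NLL
```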
Pairwise UGM on MNIST Digits
Samples from a lattice-structured UGM:
Training: 100k stochastic gradient w/ Gibbs sampling steps with α_t = 0.01.
Samples are iteration 100k of Gibbs sampling with fixed w.
Structure Learning in UGMs
The problem of choosing the graph is called structure learning.
Generalizes feature selection: we want to find all relationships between variables.
Finding the optimal tree is a minimum spanning tree problem.
“Chow-Liu algorithm”: based on pairwise mutual information (see the sketch after this list).
NP-hard for non-tree DAGs and UGMs.
For DAGs, we usually do a greedy search through space of acyclic graphs.
For Ising UGMs, we can use L1-regularization of the w_{ij} values.
If w_{ij} = 0, then we remove the dependency.
For discrete UGMs, we can use group L1-regularization of the w_{i,j,x_i,x_j} values.
If w_{i,j,x_i,x_j} = 0 for all x_i and x_j, we remove the dependency.
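As a rough illustration of the tree case (the sketch mentioned above), Chow-Liu reduces to a maximum-weight spanning tree on pairwise mutual information; this toy numpy version assumes binary 0/1 data and uses a simple Prim's algorithm, so it is only meant for modest d:

```python
import numpy as np

def chow_liu_tree(X):
    """Chow-Liu tree for binary data X of shape (n, d).

    Computes pairwise mutual information, then a maximum-weight spanning
    tree with a simple Prim's algorithm (fine for modest d).
    """
    n, d = X.shape
    mi = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            for a in (0, 1):
                for b in (0, 1):
                    p_ab = np.mean((X[:, i] == a) & (X[:, j] == b)) + 1e-12
                    p_a = np.mean(X[:, i] == a) + 1e-12
                    p_b = np.mean(X[:, j] == b) + 1e-12
                    mi[i, j] += p_ab * np.log(p_ab / (p_a * p_b))
            mi[j, i] = mi[i, j]

    # Prim's algorithm for the maximum-weight spanning tree.
    in_tree = {0}
    edges = []
    while len(in_tree) < d:
        best = None
        for i in in_tree:
            for j in range(d):
                if j not in in_tree and (best is None or mi[i, j] > mi[best[0], best[1]]):
                    best = (i, j)
        edges.append(best)
        in_tree.add(best[1])
    return edges
```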
Structure Learning on Rain Data
Large λ (and optimal tree): (graph over the 28 rain variables; figure omitted)
Small λ: (graph over the 28 rain variables; figure omitted)
Structure Learning on USPS Digits
Structure learning of pairwise UGM with group-L1 on USPS digits:
(graph over the 16×16 pixel grid; figure omitted)
Structure Learning on USPS Digits
Optimal tree on USPS digits:
(tree over the 16×16 pixel grid; figure omitted)
20 Newsgroups Data
Data containing presence of 100 words from newsgroups posts:
car drive files hockey mac league pc win
0 0 1 0 1 0 1 0
0 0 0 1 0 1 0 1
1 1 0 0 0 0 0 0
0 1 1 0 1 0 0 0
0 0 1 0 0 0 1 1
Structure learning should give relationship between words.
Structure Learning on News Words
Optimal tree on news words:
(tree over the 100 newsgroup words; figure omitted)
Structure Learning on News Words
Group-L1 on news words:
(graph over the 100 newsgroup words; figure omitted)
Outline
Rain Data without Month Information
Consider an Ising model for the rain data with tied parameters,
p(y_1, y_2, \dots, y_d) = \frac{1}{Z} \exp\left( \sum_{i=1}^d y_i w + \sum_{j=2}^d y_j y_{j-1} v \right).
First term reflects that “not rain” is more likely.
Second term reflects that consecutive days are more likely to be the same.
But how can we model that “some months are less rainy”?
Rain Data with Month Information: Boltzmann Machine
We could add 12 binary latent variables z_j,
p(y_1, y_2, \dots, y_d, z) = \frac{1}{Z} \exp\left( \sum_{i=1}^d y_i w + \sum_{i=2}^d y_i y_{i-1} v + \sum_{i=1}^d \sum_{j=1}^{12} y_i z_j v_2 + \sum_{j=1}^{12} z_j w_2 \right),
which is a variation on a Boltzmann machine.
Modifies the probability of “rain” for each of the 12 values.
Inference is more expensive due to the extra variables.
Rain Data with Month Information: MRF
If we know the months we could add an explicit month feature x_j,
p(y_1, y_2, \dots, y_d, x) = \frac{1}{Z} \exp\left( \sum_{i=1}^d y_i w + \sum_{i=2}^d y_i y_{i-1} v + \sum_{i=1}^d \sum_{j=1}^{12} y_i x_j v_2 + \sum_{j=1}^{12} x_j w_2 \right),
Learning might be easier: we’re given known clusters.
But we still have to model the distribution of x.
It’s easy in this case because months are uniform.
But in other cases we may want to use a complicated x.
And inference is more expensive than in chain-structured models.
Rain Data with Month Information: CRF
In conditional random fields we fit distribution conditioned on x,
p(y_1, y_2, \dots, y_d \mid x) = \frac{1}{Z} \exp\left( \sum_{i=1}^d y_i w + \sum_{i=2}^d y_i y_{i-1} v + \sum_{i=1}^d \sum_{j=1}^{12} y_i x_j v_2 \right).
Now we don’t need to model x.
Just need to figure out how x affects y.
The conditional UGM given x has a chain structure,
\phi_i(y_i) = \exp\left( y_i w + \sum_{j=1}^{12} y_i x_j v_2 \right), \qquad \phi_{ij}(y_i, y_j) = \exp(y_i y_j v),
so inference can be done using forward-backward.
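For concreteness, a small numpy sketch of forward-backward on a chain UGM (my own illustration, with unnormalized messages, so it is only numerically safe for short chains); the node and edge potential arrays are assumed inputs:

```python
import numpy as np

def chain_logZ_and_marginals(node_pot, edge_pot):
    """Forward-backward for a chain UGM.

    node_pot: (d, k) array, node_pot[i, s] = phi_i(s).
    edge_pot: (k, k) array, shared phi_ij(s, t) for consecutive nodes.
    Returns log Z and the (d, k) univariate marginals.
    """
    d, k = node_pot.shape
    alpha = np.zeros((d, k))   # forward messages (unnormalized)
    beta = np.zeros((d, k))    # backward messages
    alpha[0] = node_pot[0]
    for i in range(1, d):
        alpha[i] = node_pot[i] * (alpha[i - 1] @ edge_pot)
    beta[-1] = 1.0
    for i in range(d - 2, -1, -1):
        beta[i] = edge_pot @ (node_pot[i + 1] * beta[i + 1])
    Z = alpha[-1].sum()
    marginals = alpha * beta / Z
    return np.log(Z), marginals
```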
Rain Data with Month Information
Samples from CRF conditioned on x for December and July:
Code available as part of UGM package.
Brain Tumour Segmentation with Label Dependencies
We could label all voxels i as “tumour” or not using logistic regression,
p(y_i \mid x_i) = \frac{\exp(y_i w^T x_i)}{\exp(w^T x_i) + \exp(-w^T x_i)}.
But this misses dependence in labels yi:
We prefer neighbouring voxels to have the same value.
Brain Tumour Segmentation with Label Dependencies
With independent logistic, joint distribution over all voxels is
p(y_1, y_2, \dots, y_d \mid x_1, x_2, \dots, x_d) = \prod_{i=1}^d \frac{\exp(y_i w^T x_i)}{\exp(w^T x_i) + \exp(-w^T x_i)} \propto \exp\left( \sum_{i=1}^d y_i w^T x_i \right),
which is a UGM with no edges,
\phi_i(y_i) = \exp(y_i w^T x_i),
so given the xi there is no dependence between the yi.
Brain Tumour Segmentation with Label Dependencies
Adding an Ising-like term to model dependencies between yi gives
p(y_1, y_2, \dots, y_d \mid x_1, x_2, \dots, x_d) = \frac{1}{Z} \exp\left( \sum_{i=1}^d y_i w^T x_i + \sum_{(i,j) \in E} y_i y_j v \right),
Now we have the same “good” logistic regression model, but v controls how strongly we want neighbours to be the same.
Note that we’re going to jointly learn w and v.
Brain Tumour Segmentation with Label Dependencies
We got a bit more fancy and used edge features x_{ij},
p(y_1, y_2, \dots, y_d \mid x_1, x_2, \dots, x_d) = \frac{1}{Z} \exp\left( \sum_{i=1}^d y_i w^T x_i + \sum_{(i,j) \in E} y_i y_j v^T x_{ij} \right).
For example, we could use x_{ij} = 1/(1 + |x_i − x_j|).
Encourages y_i and y_j to be more similar if x_i and x_j are more similar.
This is a pairwise UGM with
\phi_i(y_i) = \exp(y_i w^T x_i), \qquad \phi_{ij}(y_i, y_j) = \exp(y_i y_j v^T x_{ij}).
Conditional Log-Linear Models
All these CRFs can be written as conditional log-linear models,
p(y \mid x, w) = \frac{1}{Z(x)} \exp(w^T F(x, y)),
for some parameters w and features F (x, y).
The NLL is convex and has the form
-\log p(y \mid x, w) = -w^T F(x, y) + \log Z(x),
and the gradient can be written as
-\nabla \log p(y \mid x, w) = -F(x, y) + \mathbb{E}_{y \mid x}[F(x, y)].
Unlike before, we now have a Z(x) and marginals for each x.
Trained using gradient methods like quasi-Newton, SG, or SAG.
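A hedged sketch of a single stochastic-gradient step on the regularized conditional NLL; the helpers `features(x, y)` and `expected_features(x, w)` (the inference routine that returns E_{y|x}[F(x, y)], e.g. via forward-backward for a chain) are assumed, not defined here:

```python
import numpy as np

def sgd_step(w, x, y, features, expected_features, step_size=0.01, lam=1e-3):
    """One stochastic gradient step on the L2-regularized conditional NLL.

    grad = -F(x, y) + E_{y|x}[F(x, y)] + lam * w
    `features(x, y)` and `expected_features(x, w)` are assumed to return
    vectors of the same length as w.
    """
    grad = -features(x, y) + expected_features(x, w) + lam * w
    return w - step_size * grad
```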
Modeling OCR Dependencies
What dependencies should we model for this problem?
φ(y_i, x_i): potential of individual letter given image.
φ(y_{i−1}, y_i): dependency between adjacent letters (‘q-u’).
φ(y_{i−1}, y_i, x_{i−1}, x_i): adjacent letters and image dependency.
φ_i(y_{i−1}, y_i): inhomogeneous dependency (French: ‘e-r’ ending).
φ_i(y_{i−2}, y_{i−1}, y_i): third-order and inhomogeneous (English: ‘i-n-g’ end).
φ(y ∈ D): is y in dictionary D?
Tractability of Discriminative Models
If the yi graph is a tree, we can easily fit CRFs.
But there are other cases where we can fit conditional log-linear models.
“Dictionary” feature is non-Markov, but exact computation still easy.
We can use pseudo-likelihood or approximate inference.
Some other cases where exact computation is possible:
Semi-Markov chains (allow dependence on time you spend in a state).
Context-free grammars (allows potentials on recursively-nested parts of sequence).
Sum-product networks (restrict potentials to allow exact computation).
Outline
Learning for Structured Prediction
3 types of classifiers discussed in CPSC 340/540:
Model                             “Classic ML”           Structured Prediction
Generative model p(y, x)          Naive Bayes, GDA       UGM (or MRF)
Discriminative model p(y|x)       Logistic regression    CRF
Discriminant function y = f(x)    SVM                    Structured SVM
Discriminative models don’t need to model x.
Discriminant functions don’t worry about probabilities.
Based on decoding, which is different than inference in structured case.
SVMs and Likelihood Ratios
Logistic regression optimizes a likelihood of the form
p(y_i \mid x_i, w) \propto \exp(y_i w^T x_i).
But if we only want correct decisions it’s sufficient that
\frac{p(y_i \mid x_i, w)}{p(-y_i \mid x_i, w)} \geq \kappa,
for any κ > 1.
Taking logarithms and plugging in the probabilities gives
(y_i w^T x_i - \log Z) - (-y_i w^T x_i - \log Z) \geq \log \kappa, \quad \text{i.e.,} \quad 2 y_i w^T x_i \geq \log \kappa.
Since κ is arbitrary, let’s use log(κ) = 2,
y_i w^T x_i \geq 1.
SVMs and Likelihood Ratios
So to classify all i correctly it’s sufficient that
y_i w^T x_i \geq 1,
but this linear program may have no solutions.
To guarantee a solution, allow a non-negative “slack” r_i and penalize the size of r_i,
\operatorname*{argmin}_{w,r} \; \sum_{i=1}^n r_i \quad \text{with } y_i w^T x_i \geq 1 - r_i \text{ and } r_i \geq 0.
If we apply our Day 2 linear programming trick in reverse this minimizes
f(w) = \sum_{i=1}^n [1 - y_i w^T x_i]_+
and adding an L2-regularizer gives the standard SVM objective.
The notation [α]_+ means max{0, α}.
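A short numpy sketch (illustration only) of this hinge-loss objective with the L2-regularizer, returning the value and one valid subgradient:

```python
import numpy as np

def svm_objective_and_subgrad(w, X, y, lam):
    """Hinge-loss SVM: sum_i [1 - y_i w^T x_i]_+ + (lam/2) ||w||^2.

    X is (n, p), y is (n,) with entries in {-1, +1}.
    """
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins)
    f = hinge.sum() + 0.5 * lam * (w @ w)
    active = hinge > 0                                  # examples violating the margin
    subgrad = -(X[active] * y[active, None]).sum(axis=0) + lam * w
    return f, subgrad
```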
Multi-Class SVMs: nk-Slack Formulation
With multi-class logistic regression we use
p(y_i = c \mid x_i, w) \propto \exp(w_c^T x_i).
If we want correct decisions it’s sufficient for all y′ ≠ y_i that
\frac{p(y_i \mid x_i, w)}{p(y' \mid x_i, w)} \geq \kappa.
Following the same steps as before, this corresponds to
w_{y_i}^T x_i - w_{y'}^T x_i \geq 1.
Adding slack variables our linear programming trick gives
f(W) = \sum_{i=1}^n \sum_{y' \neq y_i} [1 - w_{y_i}^T x_i + w_{y'}^T x_i]_+,
which with L2-regularization we’ll call the nk-slack multi-class SVM.
Multi-Class SVMs: n-Slack Formulation
If we want correct decisions it’s also sufficient that
\frac{p(y_i \mid x_i, w)}{\max_{y' \neq y_i} p(y' \mid x_i, w)} \geq \kappa.
This leads to the constraints
w_{y_i}^T x_i - \max_{y' \neq y_i} w_{y'}^T x_i \geq 1.
Following the same steps gives an alternate objective
f(W) = \sum_{i=1}^n \max_{y' \neq y_i} [1 - w_{y_i}^T x_i + w_{y'}^T x_i]_+,
which with L2-regularization we’ll call the n-slack multi-class SVM.
Multi-Class SVMs: nk-Slack vs. n-Slack
Our two formulations of multi-class SVMs:
f(W) = \sum_{i=1}^n \sum_{y' \neq y_i} [1 - w_{y_i}^T x_i + w_{y'}^T x_i]_+ + \frac{\lambda}{2}\|W\|_F^2,

f(W) = \sum_{i=1}^n \max_{y' \neq y_i} [1 - w_{y_i}^T x_i + w_{y'}^T x_i]_+ + \frac{\lambda}{2}\|W\|_F^2.
The nk-slack loss penalizes based on all y′ that could be confused with yi.
The n-slack loss only penalizes based on the “most confusing” alternate example.
While nk-slack often works better, n-slack can be used for structured prediction...
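To make the comparison concrete, a small numpy sketch (illustration only) that evaluates both the nk-slack and n-slack objectives on the same data:

```python
import numpy as np

def multiclass_svm_losses(W, X, y, lam):
    """nk-slack and n-slack multi-class SVM objectives.

    W is (k, p), X is (n, p), y is (n,) with labels in {0, ..., k-1}.
    """
    scores = X @ W.T                               # (n, k) values w_c^T x_i
    n = X.shape[0]
    correct = scores[np.arange(n), y]              # w_{y_i}^T x_i
    margins = 1.0 - correct[:, None] + scores      # 1 - w_{y_i}^T x_i + w_{y'}^T x_i
    margins[np.arange(n), y] = 0.0                 # exclude y' = y_i
    hinges = np.maximum(0.0, margins)
    reg = 0.5 * lam * np.sum(W ** 2)
    nk_slack = hinges.sum() + reg                  # sum over all y' != y_i
    n_slack = hinges.max(axis=1).sum() + reg       # only the "most confusing" y'
    return nk_slack, n_slack
```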
Hidden Markov Support Vector Machines
For decoding in conditional random fields to get the entire labeling correct we need
\frac{p(y_i \mid x_i, w)}{p(y' \mid x_i, w)} \geq \gamma,
for all alternative configurations y′.
Following the same steps as before we obtain
f(w) = \sum_{i=1}^n \max_{y' \neq y_i} [1 - \log p(y_i \mid x_i, w) + \log p(y' \mid x_i, w)]_+ + \frac{\lambda}{2}\|w\|^2,
the hidden Markov support vector machine (HMSVM).
Tries to make log-probability of true yi greater than for other y′ by more than 1.
Hidden Markov Support Vector Machines
Two problems with the HMSVM:
1. It requires finding the second-best decoding, which is harder than decoding.
2. It views any alternative labeling y′ as equally bad.
Suppose that y_i = [1 1 1 1], and predictions of two models are
y′ = [1 1 0 1],    y′ = [0 0 0 0],
should both models receive the same loss on this example?
Adding a Loss Function
We can fix both HMSVM issues by replacing the “correct decision” constraint,
\log p(y_i \mid x_i, w) - \log p(y' \mid x_i, w) \geq 1,
with a constraint containing a loss function g,
\log p(y_i \mid x_i, w) - \log p(y' \mid x_i, w) \geq g(y_i, y').
Usually we take g(yi, y′) to be the difference between yi and y′.
If g(y_i, y_i) = 0, you can maximize over all y′ instead of y′ ≠ y_i.
Further, if g is written as a sum of functions depending on the graph edges, finding the “most violated” constraint is equivalent to decoding.
Structured SVMs
These constraints lead to the max-margin Markov network objective,
f(w) = \sum_{i=1}^n \max_{y'} [g(y_i, y') - \log p(y_i \mid x_i, w) + \log p(y' \mid x_i, w)]_+ + \frac{\lambda}{2}\|w\|^2,
which is also known as a structured SVM.
Beyond the learning principle, key differences between CRFs and SSVMs:
SSVMs require decoding, not inference, for learning:
Exact SSVMs in cases like graph cuts, matchings, rankings, etc.
SSVMs have a loss function for complicated accuracy measures:
But the loss needs to decompose over parts for tractability.
Could also formulate ‘loss-augmented’ CRFs.
We can also train with approximate decoding methods.
State of the art training: block-coordinate Frank-Wolfe (bonus slides).
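As a sketch of the decoding requirement, here is a hypothetical loss-augmented Viterbi routine for a chain with Hamming loss g (the node and edge score arrays are assumed inputs); it finds the "most violated" labeling needed inside the max:

```python
import numpy as np

def loss_augmented_viterbi(node_score, edge_score, y_true):
    """argmax over y' of the chain score plus the Hamming loss to y_true.

    node_score: (d, k) array of per-position label scores.
    edge_score: (k, k) array shared across consecutive positions.
    Because the Hamming loss decomposes over nodes, it is just added to
    the node scores before running standard Viterbi.
    """
    d, k = node_score.shape
    aug = node_score + (np.arange(k)[None, :] != np.array(y_true)[:, None])  # add Hamming loss
    dp = np.zeros((d, k))
    back = np.zeros((d, k), dtype=int)
    dp[0] = aug[0]
    for i in range(1, d):
        cand = dp[i - 1][:, None] + edge_score          # (k_prev, k_cur)
        back[i] = cand.argmax(axis=0)
        dp[i] = cand.max(axis=0) + aug[i]
    y = np.zeros(d, dtype=int)
    y[-1] = dp[-1].argmax()
    for i in range(d - 1, 0, -1):
        y[i - 1] = back[i, y[i]]
    return y
```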
Summary
Log-linear models are the most common UGM when learning parameters.
Structure learning is the problem of learning the graph structure.
Hard in general, but L1-regularization gives a fast heuristic.
Conditional log-linear models are the most common CRF models.
But you can fit some non-Markov models too.
Structured SVMs are a generalization of SVMs to structured prediction.
Only require decoding instead of inference.
Next time: convolutional neural networks.
Bonus Slide: SVMs for Ranking with Pairwise Preference
Suppose we want to rank examples.
A common setting is with features x_i and pairwise preferences:
List of objects (i, j) where we want y_i > y_j.
Assuming a log-linear model,
p(y_i \mid x_i, w) \propto \exp(w^T x_i),
we can derive a loss function based on the pairwise preference decisions,
\frac{p(y_i \mid x_i, w)}{p(y_j \mid x_j, w)} \geq \gamma,
which gives a loss function of the form
f(w) = \sum_{(i,j) \in R} [1 - w^T x_i + w^T x_j]_+.
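A minimal sketch (illustration only) of this pairwise-preference loss, where `prefs` holds the pairs (i, j) in R:

```python
import numpy as np

def pairwise_ranking_loss(w, X, prefs):
    """Hinge loss over preferred pairs (i, j): sum of [1 - w^T x_i + w^T x_j]_+.

    prefs is a list of index pairs (i, j) meaning example i should rank above j.
    """
    scores = X @ w
    return sum(max(0.0, 1.0 - scores[i] + scores[j]) for (i, j) in prefs)
```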
Bonus Slide: Fitting Structured SVMs
Overview of progress on training SSVMs:
Cutting plane and bundle methods (e.g., svmStruct software):
Require O(1/ε) iterations.
Each iteration requires decoding on every training example.
Stochastic sub-gradient methods:
Each iteration requires decoding on a single training example.
Still requires O(1/ε) iterations.
Need to choose step size.
Dual Online exponentiated gradient (OEG):
Allows line-search for step size and has O(1/ε) rate.
Each iteration requires inference on a single training example.
Dual block-coordinate Frank-Wolfe (BCFW):
Each iteration requires decoding on a single training example.
Requires O(1/ε) iterations.
Closed-form optimal step size.
Theory allows approximate decoding.
Bonus Slide: Block-Coordinate Frank-Wolfe
Key ideas behind BCFW for SSVMs:
The dual problem has the form
\min_{\alpha_i \in \mathcal{M}_i} F(\alpha) = f(A\alpha) - \sum_i f_i(\alpha_i),
where f is smooth.
Problem structure where we can use block coordinate descent:
Normal coordinate updates are intractable because α_i has |Y| elements.
But the Frank-Wolfe block-coordinate update is equivalent to decoding:
s = \operatorname*{argmin}_{s' \in \mathcal{M}_i} \; F(\alpha) + \langle \nabla_i F(\alpha), s' - \alpha_i \rangle,
\alpha_i = \alpha_i + \gamma(s - \alpha_i).
Can implement algorithm in terms of primal variables.
Connections between Frank-Wolfe and other algorithms:
Frank-Wolfe on dual problem is subgradient step on primal.
‘Fully corrective’ Frank-Wolfe is equivalent to cutting plane.